Sample records for association dataset methodological

  1. Secondary analysis of national survey datasets.

    PubMed

    Boo, Sunjoo; Froelicher, Erika Sivarajan

    2013-06-01

    This paper describes the methodological issues associated with secondary analysis of large national survey datasets. Issues about survey sampling, data collection, and non-response and missing data in terms of methodological validity and reliability are discussed. Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis. Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses. © 2012 The Authors. Japan Journal of Nursing Science © 2012 Japan Academy of Nursing Science.

  2. Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach

    NASA Astrophysics Data System (ADS)

    Landgrebe, T. C. W.; Merdith, A.; Dutkiewicz, A.; Müller, R. D.

    2013-07-01

    Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.

  3. An Automated Method to Identify Mesoscale Convective Complexes in the Regional Climate Model Evaluation System

    NASA Astrophysics Data System (ADS)

    Whitehall, K. D.; Jenkins, G. S.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Goodale, C. E.; Hart, A. F.; Ramirez, P.; Whittell, J.; Zimdars, P. A.

    2012-12-01

    Mesoscale convective complexes (MCCs) are large (2-3 x 10^5 km^2) nocturnal, convectively driven weather systems that are generally associated with high-precipitation events of short duration (less than 12 hrs) in various locations throughout the tropics and midlatitudes (Maddox 1980). These systems are particularly important for climate in the West Sahel region, where the precipitation associated with them is a principal component of the rainfall season (Laing and Fritsch 1993). These systems occur on weather timescales and have historically been identified from weather data analysis via manual and, more recently, automated processes (Miller and Fritsch 1991, Nesbett 2006, Balmey and Reason 2012). The Regional Climate Model Evaluation System (RCMES) is an open-source tool designed for easy evaluation of climate and Earth system data through access to standardized datasets and intrinsic tools that perform common analysis and visualization tasks (Hart et al. 2011). The RCMES toolkit also provides the flexibility of user-defined subroutines for further metrics, visualization and even dataset manipulation. The purpose of this study is to present a methodology for identifying MCCs in observation datasets using the RCMES framework. TRMM 3-hourly datasets are used to demonstrate the methodology for the 2005 boreal summer. This method promotes the use of open-source software for scientific data systems to address a concern shared by multiple stakeholders in the earth sciences. A historical MCC dataset provides a platform for further studies of the variability of MCC frequency on various timescales, which is important to many users, including climate scientists, meteorologists, water resource managers, and agriculturalists. The methodology of using RCMES for searching and clipping datasets will open a new realm of studies, as users of the system will no longer be restricted to using the datasets as they reside in their own local systems; instead they will be afforded rapid, effective, and transparent access, processing and visualization of the wealth of remote sensing datasets and climate model outputs available.
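
    A minimal sketch of the kind of automated screening step such a methodology might apply to a gridded 3-hourly precipitation field, assuming Python with NumPy/SciPy; the rain-rate threshold, minimum region size, and toy field below are illustrative placeholders, not the RCMES implementation or the formal Maddox MCC criteria.

      import numpy as np
      from scipy import ndimage

      def candidate_convective_regions(precip, rain_thresh=5.0, min_cells=200):
          """Label contiguous grid cells whose rain rate exceeds a threshold and
          keep regions larger than an (illustrative) minimum number of cells."""
          mask = precip >= rain_thresh                  # exceedance mask
          labels, n_regions = ndimage.label(mask)       # connected-component labelling
          kept = [(rid, int((labels == rid).sum()))
                  for rid in range(1, n_regions + 1)
                  if (labels == rid).sum() >= min_cells]
          return labels, kept

      # toy 3-hourly rain-rate field (mm/hr) on a lat-lon grid
      field = np.random.gamma(shape=0.3, scale=4.0, size=(400, 400))
      labels, regions = candidate_convective_regions(field)
      print(f"{len(regions)} candidate convective regions")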

  4. Industrial Ecology Approach to MSW Methodology Data Set

    EPA Pesticide Factsheets

    U.S. municipal solid waste data for the year 2012. This dataset is associated with the following publication: Smith, R., D. Sengupta, S. Takkellapati, and C. Lee. An industrial ecology approach to municipal solid waste management: I. Methodology. Resources, Conservation and Recycling. Elsevier Science BV, Amsterdam, NETHERLANDS, 104: 311-316, (2015).

  5. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach.

    PubMed

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
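
    A minimal sketch of the per-gene Cox step and coefficient ranking described above, using the lifelines package; the toy data, column names, and the univariate-only model are assumptions (the published analysis also adjusted for cell-cycle gene expression and treated centre and batch as random effects).

      import numpy as np
      import pandas as pd
      from lifelines import CoxPHFitter

      rng = np.random.default_rng(0)
      genes = ["GENE_A", "GENE_B", "GENE_C"]                     # toy expression matrix
      df = pd.DataFrame(rng.normal(size=(200, len(genes))), columns=genes)
      df["dmfs_months"] = rng.exponential(60, 200)               # metastasis-free survival time
      df["event"] = rng.integers(0, 2, 200)                      # 1 = distant metastasis observed

      coefs = {}
      for g in genes:                                            # one Cox model per gene
          cph = CoxPHFitter()
          cph.fit(df[[g, "dmfs_months", "event"]],
                  duration_col="dmfs_months", event_col="event")
          coefs[g] = cph.params_[g]

      ranked = pd.Series(coefs).sort_values(ascending=False)     # ranked list fed to GSEA
      print(ranked)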

  6. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach

    PubMed Central

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Introduction Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. Aim The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Methods Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Results Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. Conclusion To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research. PMID:26080057

  7. Data-driven probability concentration and sampling on manifold

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu

    2016-09-15

    A new methodology is proposed for generating realizations of a random vector with values in a finite-dimensional Euclidean space that are statistically consistent with a dataset of observations of this vector. The probability distribution of this random vector, while a priori not known, is presumed to be concentrated on an unknown subset of the Euclidean space. A random matrix is introduced whose columns are independent copies of the random vector and for which the number of columns is the number of data points in the dataset. The approach is based on the use of (i) the multidimensional kernel-density estimation method for estimating the probability distribution of the random matrix, (ii) a MCMC method for generating realizations for the random matrix, (iii) the diffusion-maps approach for discovering and characterizing the geometry and the structure of the dataset, and (iv) a reduced-order representation of the random matrix, which is constructed using the diffusion-maps vectors associated with the first eigenvalues of the transition matrix relative to the given dataset. The convergence aspects of the proposed methodology are analyzed and a numerical validation is explored through three applications of increasing complexity. The proposed method is found to be robust to noise levels and data complexity as well as to the intrinsic dimension of data and the size of experimental datasets. Both the methodology and the underlying mathematical framework presented in this paper contribute new capabilities and perspectives at the interface of uncertainty quantification, statistical data analysis, stochastic modeling and associated statistical inverse problems.
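
    A minimal sketch of the diffusion-maps ingredient (item iii above): build a Gaussian affinity over the dataset, normalise it to a row-stochastic transition matrix, and use the eigenvectors associated with its leading eigenvalues as reduced coordinates. The kernel bandwidth and number of retained coordinates are illustrative choices, not the authors' settings.

      import numpy as np

      def diffusion_map(X, epsilon=1.0, n_coords=3):
          """Diffusion-map coordinates of the N rows (data points) of X."""
          d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
          K = np.exp(-d2 / epsilon)                              # Gaussian affinity
          P = K / K.sum(axis=1, keepdims=True)                   # transition matrix of the diffusion
          vals, vecs = np.linalg.eig(P)
          order = np.argsort(-vals.real)                         # decreasing eigenvalues
          # skip the trivial constant eigenvector (eigenvalue 1)
          return vecs.real[:, order[1:n_coords + 1]], vals.real[order]

      X = np.random.default_rng(1).normal(size=(300, 10))        # toy dataset of observations
      coords, eigenvalues = diffusion_map(X)
      print(coords.shape, eigenvalues[:5].round(3))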

  8. A New Combinatorial Optimization Approach for Integrated Feature Selection Using Different Datasets: A Prostate Cancer Transcriptomic Study

    PubMed Central

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2015-01-01

    Background The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease. PMID:26106884

  9. Al-Qaeda in Iraq (AQI): An Al-Qaeda Affiliate Case Study

    DTIC Science & Technology

    2017-10-01

    a comparative methodology that included eight case studies on groups affiliated or associated with Al-Qaeda. These case studies were then used as a dataset for cross... Case study by Zack Gold, with contributions from Pamela G. Faber, October 2017. This work was performed under Federal Government...

  10. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.

    PubMed

    Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent

    2014-05-01

    Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
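
    A minimal sketch of the style of posterior calculation behind such a colocalisation test, assuming per-SNP Bayes factors for each trait are already available (the published method derives them from single-SNP summary statistics, which is not reproduced here) and using illustrative prior probabilities for the hypotheses.

      import numpy as np

      def coloc_posteriors(bf1, bf2, p1=1e-4, p2=1e-4, p12=1e-5):
          """Posterior support for the five hypotheses H0..H4 about a region,
          from per-SNP Bayes factors for trait 1 (bf1) and trait 2 (bf2)."""
          bf1, bf2 = np.asarray(bf1, float), np.asarray(bf2, float)
          s1, s2, s12 = bf1.sum(), bf2.sum(), (bf1 * bf2).sum()
          weights = np.array([
              1.0,                        # H0: no association with either trait
              p1 * s1,                    # H1: association with trait 1 only
              p2 * s2,                    # H2: association with trait 2 only
              p1 * p2 * (s1 * s2 - s12),  # H3: two distinct causal variants
              p12 * s12,                  # H4: one shared causal variant
          ])
          return weights / weights.sum()

      rng = np.random.default_rng(0)
      pp = coloc_posteriors(rng.lognormal(0, 1, 500), rng.lognormal(0, 1, 500))
      print(dict(zip(["PP0", "PP1", "PP2", "PP3", "PP4"], pp.round(3))))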

  11. Combining users' activity survey and simulators to evaluate human activity recognition systems.

    PubMed

    Azkune, Gorka; Almeida, Aitor; López-de-Ipiña, Diego; Chen, Liming

    2015-04-08

    Evaluating human activity recognition systems usually implies following expensive and time-consuming methodologies, where experiments with humans are run with the consequent ethical and legal issues. We propose a novel evaluation methodology to overcome the enumerated problems, which is based on surveys for users and a synthetic dataset generator tool. Surveys allow capturing how different users perform activities of daily living, while the synthetic dataset generator is used to create properly labelled activity datasets modelled with the information extracted from surveys. Important aspects, such as sensor noise, varying time lapses and user erratic behaviour, can also be simulated using the tool. The proposed methodology is shown to have very important advantages that allow researchers to carry out their work more efficiently. To evaluate the approach, a synthetic dataset generated following the proposed methodology is compared to a real dataset computing the similarity between sensor occurrence frequencies. It is concluded that the similarity between both datasets is more than significant.
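
    A minimal sketch of comparing sensor occurrence frequencies between a real and a synthetic activity dataset; cosine similarity and the toy event lists are illustrative assumptions, since the abstract does not fix a particular similarity measure.

      import numpy as np
      from collections import Counter

      real_events      = ["kettle", "fridge", "tap", "kettle", "toaster", "fridge"]
      synthetic_events = ["kettle", "fridge", "tap", "tap", "kettle", "fridge"]
      sensors = sorted(set(real_events) | set(synthetic_events))

      def freq_vector(events):
          counts = Counter(events)
          v = np.array([counts[s] for s in sensors], float)
          return v / v.sum()                     # relative occurrence frequency per sensor

      a, b = freq_vector(real_events), freq_vector(synthetic_events)
      cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
      print(f"sensor-frequency similarity: {cosine:.3f}")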

  12. Method applied to the background analysis of energy data to be considered for the European Reference Life Cycle Database (ELCD).

    PubMed

    Fazio, Simone; Garraín, Daniel; Mathieux, Fabrice; De la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda

    2015-01-01

    Under the framework of the European Platform on Life Cycle Assessment, the European Reference Life-Cycle Database (ELCD - developed by the Joint Research Centre of the European Commission), provides core Life Cycle Inventory (LCI) data from front-running EU-level business associations and other sources. The ELCD contains energy-related data on power and fuels. This study describes the methods to be used for the quality analysis of energy data for European markets (available in third-party LC databases and from authoritative sources) that are, or could be, used in the context of the ELCD. The methodology was developed and tested on the energy datasets most relevant for the EU context, derived from GaBi (the reference database used to derive datasets for the ELCD), Ecoinvent, E3 and Gemis. The criteria for the database selection were based on the availability of EU-related data, the inclusion of comprehensive datasets on energy products and services, and the general approval of the LCA community. The proposed approach was based on the quality indicators developed within the International Reference Life Cycle Data System (ILCD) Handbook, further refined to facilitate their use in the analysis of energy systems. The overall Data Quality Rating (DQR) of the energy datasets can be calculated by summing up the quality rating (ranging from 1 to 5, where 1 represents very good, and 5 very poor quality) of each of the quality criteria indicators, divided by the total number of indicators considered. The quality of each dataset can be estimated for each indicator, and then compared with the different databases/sources. The results can be used to highlight the weaknesses of each dataset and can be used to guide further improvements to enhance the data quality with regard to the established criteria. This paper describes the application of the methodology to two exemplary datasets, in order to show the potential of the methodological approach. The analysis helps LCA practitioners to evaluate the usefulness of the ELCD datasets for their purposes, and dataset developers and reviewers to derive information that will help improve the overall DQR of databases.
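
    A minimal sketch of the overall Data Quality Rating arithmetic described above (sum of the per-indicator ratings divided by the number of indicators); the indicator names and example ratings are illustrative placeholders.

      # quality rating per indicator: 1 = very good ... 5 = very poor
      ratings = {
          "technological_representativeness": 2,
          "geographical_representativeness": 1,
          "time_representativeness": 3,
          "completeness": 2,
          "precision": 2,
          "methodological_appropriateness": 1,
      }
      dqr = sum(ratings.values()) / len(ratings)
      print(f"overall DQR = {dqr:.2f}")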

  13. Temperature, Geochemistry, and Gravity Data of the Tularosa Basin

    DOE Data Explorer

    Nash, Greg

    2017-06-16

    This submission contains multiple Excel spreadsheets and associated written reports. The datasets are representative of shallow temperature, geochemistry, and other well-logging observations made across WSMR (White Sands Missile Range), located to the west of the Tularosa Basin but still within the study area. Written reports accompany some of the datasets and provide ample description of the methodology and results obtained from these studies. Gravity data are also included, as point data in a shapefile, along with a written report describing that particular study.

  14. Deciphering the complex: methodological overview of statistical models to derive OMICS-based biomarkers.

    PubMed

    Chadeau-Hyam, Marc; Campanella, Gianluca; Jombart, Thibaut; Bottolo, Leonardo; Portengen, Lutzen; Vineis, Paolo; Liquet, Benoit; Vermeulen, Roel C H

    2013-08-01

    Recent technological advances in molecular biology have given rise to numerous large-scale datasets whose analysis imposes serious methodological challenges mainly relating to the size and complex structure of the data. Considerable experience in analyzing such data has been gained over the past decade, mainly in genetics, from the Genome-Wide Association Study era, and more recently in transcriptomics and metabolomics. Building upon the corresponding literature, we provide here a nontechnical overview of well-established methods used to analyze OMICS data within three main types of regression-based approaches: univariate models including multiple testing correction strategies, dimension reduction techniques, and variable selection models. Our methodological description focuses on methods for which ready-to-use implementations are available. We describe the main underlying assumptions, the main features, and advantages and limitations of each of the models. This descriptive summary constitutes a useful tool for driving methodological choices while analyzing OMICS data, especially in environmental epidemiology, where the emergence of the exposome concept clearly calls for unified methods to analyze marginally and jointly complex exposure and OMICS datasets. Copyright © 2013 Wiley Periodicals, Inc.

  15. Supervised Extraction of Diagnosis Codes from EMRs: Role of Feature Selection, Data Selection, and Probabilistic Thresholding.

    PubMed

    Rios, Anthony; Kavuluru, Ramakanth

    2013-09-01

    Extracting diagnosis codes from medical records is a complex task carried out by trained coders by reading all the documents associated with a patient's visit. With the popularity of electronic medical records (EMRs), computational approaches to code extraction have been proposed in recent years. Machine learning approaches to multi-label text classification provide an important methodology in this task, given that each EMR can be associated with multiple codes. In this paper, we study the role of feature selection, training data selection, and probabilistic threshold optimization in improving different multi-label classification approaches. We conduct experiments based on two different datasets: a recent gold standard dataset used for this task and a second larger and more complex EMR dataset we curated from the University of Kentucky Medical Center. While conventional approaches achieve results comparable to the state-of-the-art on the gold standard dataset, on our complex in-house dataset, we show that feature selection, training data selection, and probabilistic thresholding provide significant gains in performance.
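
    A minimal sketch of the kind of pipeline the abstract describes: one-vs-rest classification of diagnosis codes with chi-squared feature selection and a per-label probability threshold tuned on held-out data. The toy data, feature count, and threshold grid are assumptions, not the paper's configuration.

      import numpy as np
      from sklearn.feature_selection import SelectKBest, chi2
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import f1_score
      from sklearn.multiclass import OneVsRestClassifier
      from sklearn.pipeline import make_pipeline

      rng = np.random.default_rng(0)
      X = rng.random((400, 1000))                   # toy bag-of-words EMR features
      Y = (rng.random((400, 5)) < 0.2).astype(int)  # 5 diagnosis codes, multi-label targets
      X_tr, X_va, Y_tr, Y_va = X[:300], X[300:], Y[:300], Y[300:]

      clf = make_pipeline(SelectKBest(chi2, k=200),
                          OneVsRestClassifier(LogisticRegression(max_iter=1000)))
      clf.fit(X_tr, Y_tr)
      probs = clf.predict_proba(X_va)               # per-label probabilities

      # probabilistic thresholding: pick each label's threshold maximising F1 on validation data
      grid = np.linspace(0.05, 0.95, 19)
      thresholds = [grid[int(np.argmax([f1_score(Y_va[:, j], probs[:, j] >= t, zero_division=0)
                                        for t in grid]))]
                    for j in range(Y.shape[1])]
      print("per-label thresholds:", np.round(thresholds, 2))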

  16. An Effective Methodology for Processing and Analyzing Large, Complex Spacecraft Data Streams

    ERIC Educational Resources Information Center

    Teymourlouei, Haydar

    2013-01-01

    The emerging large datasets have made efficient data processing a much more difficult task for the traditional methodologies. Invariably, datasets continue to increase rapidly in size with time. The purpose of this research is to give an overview of some of the tools and techniques that can be utilized to manage and analyze large datasets. We…

  17. Exploring Relationships in Big Data

    NASA Astrophysics Data System (ADS)

    Mahabal, A.; Djorgovski, S. G.; Crichton, D. J.; Cinquini, L.; Kelly, S.; Colbert, M. A.; Kincaid, H.

    2015-12-01

    Big Data are characterized by several different 'V's: Volume, Veracity, Volatility, Value and so on. For many datasets, Volume inflated by redundant features often makes the data noisier and more difficult to extract Value from. This is especially true if one is comparing or combining different datasets and the metadata are diverse. We have been exploring ways to exploit such datasets through a variety of statistical machinery and visualization. We show how we have applied this to time series from large astronomical sky surveys. This was done in the Virtual Observatory framework. More recently we have been doing similar work in a completely different domain, viz. biology/cancer. The methodology reuse involves application to diverse datasets gathered through the various centers associated with the Early Detection Research Network (EDRN) for cancer, an initiative of the National Cancer Institute (NCI). Application to Geo datasets is a natural extension.

  18. A hybrid approach to select features and classify diseases based on medical data

    NASA Astrophysics Data System (ADS)

    AbdelLatif, Hisham; Luo, Jiawei

    2018-03-01

    Feature selection is a popular problem in the classification of diseases in clinical medicine. Here, we develop a hybrid methodology to classify diseases, based on three medical datasets: the Arrhythmia, Breast Cancer, and Hepatitis datasets. This methodology, called k-means ANOVA Support Vector Machine (K-ANOVA-SVM), uses k-means clustering with ANOVA statistics to preprocess the data and select the significant features, and Support Vector Machines in the classification process. To compare and evaluate its performance, we chose three classification algorithms (decision tree, Naïve Bayes, and Support Vector Machines) and applied the medical datasets directly to these algorithms. Our methodology gave much better classification accuracy, 98% on the Arrhythmia dataset, 92% on the Breast Cancer dataset and 88% on the Hepatitis dataset, compared to using the medical data directly with decision tree, Naïve Bayes, and Support Vector Machines. The ROC curve and precision with K-ANOVA-SVM also achieved better results than the other algorithms.
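
    A minimal sketch of the ANOVA feature selection plus SVM classification step described above, using the breast-cancer dataset bundled with scikit-learn as a stand-in for the datasets named in the abstract; the number of selected features is an illustrative choice, and the k-means preprocessing stage is not reproduced.

      from sklearn.datasets import load_breast_cancer
      from sklearn.feature_selection import SelectKBest, f_classif
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      X, y = load_breast_cancer(return_X_y=True)
      model = make_pipeline(StandardScaler(),
                            SelectKBest(f_classif, k=10),   # ANOVA F-test feature selection
                            SVC(kernel="rbf"))              # SVM classifier
      accuracy = cross_val_score(model, X, y, cv=5).mean()
      print(f"5-fold accuracy with ANOVA-selected features: {accuracy:.3f}")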

  19. Extracting Cross-Ontology Weighted Association Rules from Gene Ontology Annotations.

    PubMed

    Agapito, Giuseppe; Milano, Marianna; Guzzi, Pietro Hiram; Cannataro, Mario

    2016-01-01

    Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products through a process referred to as annotation. The analysis of annotated data is an important opportunity for bioinformatics. There are different approaches to this analysis; among them, association rules (AR) provide useful knowledge by discovering previously unknown, biologically relevant associations between GO terms. In a previous work, we introduced GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules from ontology-based annotated datasets. Here we adapt the GO-WAR algorithm to mine cross-ontology association rules, i.e., rules that involve GO terms present in the three sub-ontologies of GO. We conduct a thorough performance evaluation of GO-WAR by mining publicly available GO-annotated datasets, showing how GO-WAR outperforms current state-of-the-art approaches.

  20. Meta-Analysis in Genome-Wide Association Datasets: Strategies and Application in Parkinson Disease

    PubMed Central

    Evangelou, Evangelos; Maraganore, Demetrius M.; Ioannidis, John P.A.

    2007-01-01

    Background Genome-wide association studies hold substantial promise for identifying common genetic variants that regulate susceptibility to complex diseases. However, for the detection of small genetic effects, single studies may be underpowered. Power may be improved by combining genome-wide datasets with meta-analytic techniques. Methodology/Principal Findings Both single and two-stage genome-wide data may be combined and there are several possible strategies. In the two-stage framework, we considered the options of (1) enhancement of replication data and (2) enhancement of first-stage data, and then, we also considered (3) joint meta-analyses including all first-stage and second-stage data. These strategies were examined empirically using data from two genome-wide association studies (three datasets) on Parkinson disease. In the three strategies, we derived 12, 5, and 49 single nucleotide polymorphisms that show significant associations at conventional levels of statistical significance. None of these remained significant after conservative adjustment for the number of performed analyses in each strategy. However, some may warrant further consideration: 6 SNPs were identified with at least 2 of the 3 strategies and 3 SNPs [rs1000291 on chromosome 3, rs2241743 on chromosome 4 and rs3018626 on chromosome 11] were identified with all 3 strategies and had no or minimal between-dataset heterogeneity (I2 = 0, 0 and 15%, respectively). Analyses were primarily limited by the suboptimal overlap of tested polymorphisms across different datasets (e.g., only 31,192 shared polymorphisms between the two tier 1 datasets). Conclusions/Significance Meta-analysis may be used to improve the power and examine the between-dataset heterogeneity of genome-wide association studies. Prospective designs may be most efficient, if they try to maximize the overlap of genotyping platforms and anticipate the combination of data across many genome-wide association studies. PMID:17332845
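
    A minimal sketch of the fixed-effect (inverse-variance) meta-analysis and the I-squared heterogeneity statistic referred to above, for a single SNP measured in several datasets; the per-dataset effect sizes and standard errors are toy numbers.

      import numpy as np

      beta = np.array([0.12, 0.08, 0.15])       # per-dataset log odds ratios for one SNP
      se   = np.array([0.05, 0.06, 0.07])       # their standard errors

      w = 1.0 / se**2                           # inverse-variance weights
      beta_fe = (w * beta).sum() / w.sum()      # fixed-effect pooled estimate
      se_fe = np.sqrt(1.0 / w.sum())

      Q = (w * (beta - beta_fe) ** 2).sum()     # Cochran's Q
      dof = len(beta) - 1
      i2 = max(0.0, (Q - dof) / Q) * 100 if Q > 0 else 0.0

      print(f"pooled beta = {beta_fe:.3f} (SE {se_fe:.3f}), I2 = {i2:.0f}%")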

  1. Exploring the Impact of Chronic Tic Disorders on Youth: Results from the Tourette Syndrome Impact Survey

    ERIC Educational Resources Information Center

    Conelea, Christine A.; Woods, Douglas W.; Zinner, Samuel H.; Budman, Cathy; Murphy, Tanya; Scahill, Lawrence D.; Compton, Scott N.; Walkup, John

    2011-01-01

    Prior research has demonstrated that chronic tic disorders (CTD) are associated with functional impairment across several domains. However, methodological limitations, such as data acquired by parental report, datasets aggregated across child and adult samples, and small treatment-seeking samples, curtail interpretation. The current study explored…

  2. An Integrated Science-based methodology

    EPA Pesticide Factsheets

    The data are secondary in nature, meaning that no data were generated as part of this review effort; rather, data available in the peer-reviewed literature were used. This dataset is associated with the following publication: Tolaymat, T., A. El Badawy, R. Sequeira, and A. Genaidy. An integrated science-based methodology to assess potential risks and implications of engineered nanomaterials. Journal of Hazardous Materials. Elsevier Science Ltd, New York, NY, USA, 298: 270-281, (2015).

  3. A Novel Performance Evaluation Methodology for Single-Target Trackers.

    PubMed

    Kristan, Matej; Matas, Jiri; Leonardis, Ales; Vojir, Tomas; Pflugfelder, Roman; Fernandez, Gustavo; Nebehay, Georg; Porikli, Fatih; Cehovin, Luka

    2016-11-01

    This paper addresses the problem of single-target tracker performance evaluation. We consider the performance measures, the dataset and the evaluation system to be the most important components of tracker evaluation and propose requirements for each of them. The requirements are the basis of a new evaluation methodology that aims at a simple and easily interpretable tracker comparison. The ranking-based methodology addresses tracker equivalence in terms of statistical significance and practical differences. A fully-annotated dataset with per-frame annotations with several visual attributes is introduced. The diversity of its visual properties is maximized in a novel way by clustering a large number of videos according to their visual attributes. This makes it the most sophistically constructed and annotated dataset to date. A multi-platform evaluation system allowing easy integration of third-party trackers is presented as well. The proposed evaluation methodology was tested on the VOT2014 challenge on the new dataset and 38 trackers, making it the largest benchmark to date. Most of the tested trackers are indeed state-of-the-art since they outperform the standard baselines, resulting in a highly-challenging benchmark. An exhaustive analysis of the dataset from the perspective of tracking difficulty is carried out. To facilitate tracker comparison a new performance visualization technique is proposed.

  4. A replication and methodological critique of the study "Evaluating drug trafficking on the Tor Network".

    PubMed

    Munksgaard, Rasmus; Demant, Jakob; Branwen, Gwern

    2016-09-01

    The development of cryptomarkets has gained increasing attention from academics, including growing scientific literature on the distribution of illegal goods using cryptomarkets. Dolliver's 2015 article "Evaluating drug trafficking on the Tor Network: Silk Road 2, the Sequel" addresses this theme by evaluating drug trafficking on one of the most well-known cryptomarkets, Silk Road 2.0. The research on cryptomarkets in general (particularly in Dolliver's article) poses a number of new questions for methodologies. This commentary is structured around a replication of Dolliver's original study. The replication study is not based on Dolliver's original dataset, but on a second dataset collected applying the same methodology. We have found that the results produced by Dolliver differ greatly from our replicated study. While a margin of error is to be expected, the inconsistencies we found are too great to attribute to anything other than methodological issues. The analysis and conclusions drawn from studies using these methods are promising and insightful. However, based on the replication of Dolliver's study, we suggest that researchers using these methodologies take care that datasets be made available to other researchers, and that methodology and dataset metrics (e.g. number of downloaded pages, error logs) be described thoroughly in the context of web-o-metrics and web crawling. Copyright © 2016 Elsevier B.V. All rights reserved.

  5. Background qualitative analysis of the European Reference Life Cycle Database (ELCD) energy datasets - part I: fuel datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this study is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) fuel datasets. The revision is based on the data quality indicators described by the ILCD Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD fuel datasets have a very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of the Life-Cycle Inventories have been derived. Moreover, these results assure any LCA practitioner of the quality of the fuel-related datasets, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD fuel datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall DQR of databases.

  6. Quantile-based bias correction and uncertainty quantification of extreme event attribution statements

    DOE PAGES

    Jeon, Soyoung; Paciorek, Christopher J.; Wehner, Michael F.

    2016-02-16

    Extreme event attribution characterizes how anthropogenic climate change may have influenced the probability and magnitude of selected individual extreme weather and climate events. Attribution statements often involve quantification of the fraction of attributable risk (FAR) or the risk ratio (RR) and associated confidence intervals. Many such analyses use climate model output to characterize extreme event behavior with and without anthropogenic influence. However, such climate models may have biases in their representation of extreme events. To account for discrepancies in the probabilities of extreme events between observational datasets and model datasets, we demonstrate an appropriate rescaling of the model output based on the quantiles of the datasets to estimate an adjusted risk ratio. Our methodology accounts for various components of uncertainty in estimation of the risk ratio. In particular, we present an approach to construct a one-sided confidence interval on the lower bound of the risk ratio when the estimated risk ratio is infinity. We demonstrate the methodology using the summer 2011 central US heatwave and output from the Community Earth System Model. In this example, we find that the lower bound of the risk ratio is relatively insensitive to the magnitude and probability of the actual event.
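
    A minimal sketch of the basic risk-ratio and fraction-of-attributable-risk arithmetic behind such attribution statements, with a simple bootstrap lower bound; the exceedance threshold, toy ensembles, and bootstrap scheme are illustrative assumptions and do not reproduce the paper's quantile-based bias correction.

      import numpy as np

      rng = np.random.default_rng(0)
      factual        = rng.normal(1.0, 1.0, 500)   # e.g. all-forcings model ensemble
      counterfactual = rng.normal(0.0, 1.0, 500)   # natural-forcings-only ensemble
      threshold = 2.5                              # magnitude of the observed extreme event

      def risk_ratio(f, c, thr):
          p1, p0 = (f >= thr).mean(), (c >= thr).mean()
          return np.inf if p0 == 0 else p1 / p0

      rr = risk_ratio(factual, counterfactual, threshold)
      far = 1.0 - 1.0 / rr                         # fraction of attributable risk
      boot = [risk_ratio(rng.choice(factual, factual.size),
                         rng.choice(counterfactual, counterfactual.size), threshold)
              for _ in range(2000)]
      lower = np.percentile(boot, 5)               # one-sided lower bound on the risk ratio
      print(f"RR = {rr:.2f}, FAR = {far:.2f}, 5th-percentile RR = {lower:.2f}")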

  7. 77 FR 15052 - Dataset Workshop-U.S. Billion Dollar Disasters Dataset (1980-2011): Assessing Dataset Strengths...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-03-14

    ... and related methodology. Emphasis will be placed on dataset accuracy and time-dependent biases. Pathways to overcome accuracy and bias issues will be an important focus. Participants will consider... guidance for improving these methods, and recommendations for rectifying any known time-dependent biases...

  8. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets have a very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of the Life-Cycle Inventories have been derived. Moreover, these results assure any LCA practitioner of the quality of the electricity-related datasets, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.

  9. Novel linkage disequilibrium clustering algorithm identifies new lupus genes on meta-analysis of GWAS datasets.

    PubMed

    Saeed, Mohammad

    2017-05-01

    Systemic lupus erythematosus (SLE) is a complex disorder. Genetic association studies of complex disorders suffer from the following three major issues: phenotypic heterogeneity, false positive (type I error), and false negative (type II error) results. Hence, genes with low to moderate effects are missed in standard analyses, especially after statistical corrections. OASIS is a novel linkage disequilibrium clustering algorithm that can potentially address false positives and negatives in genome-wide association studies (GWAS) of complex disorders such as SLE. OASIS was applied to two SLE dbGAP GWAS datasets (6077 subjects; ∼0.75 million single-nucleotide polymorphisms). OASIS identified three known SLE genes viz. IFIH1, TNIP1, and CD44, not previously reported using these GWAS datasets. In addition, 22 novel loci for SLE were identified and the 5 SLE genes previously reported using these datasets were verified. OASIS methodology was validated using single-variant replication and gene-based analysis with GATES. This led to the verification of 60% of OASIS loci. New SLE genes that OASIS identified and were further verified include TNFAIP6, DNAJB3, TTF1, GRIN2B, MON2, LATS2, SNX6, RBFOX1, NCOA3, and CHAF1B. This study presents the OASIS algorithm, software, and the meta-analyses of two publicly available SLE GWAS datasets along with the novel SLE genes. Hence, OASIS is a novel linkage disequilibrium clustering method that can be universally applied to existing GWAS datasets for the identification of new genes.

  10. Dose calculation for photon-emitting brachytherapy sources with average energy higher than 50 keV: report of the AAPM and ESTRO.

    PubMed

    Perez-Calatayud, Jose; Ballester, Facundo; Das, Rupak K; Dewerd, Larry A; Ibbott, Geoffrey S; Meigooni, Ali S; Ouhib, Zoubir; Rivard, Mark J; Sloboda, Ron S; Williamson, Jeffrey F

    2012-05-01

    Recommendations of the American Association of Physicists in Medicine (AAPM) and the European Society for Radiotherapy and Oncology (ESTRO) on dose calculations for high-energy (average energy higher than 50 keV) photon-emitting brachytherapy sources are presented, including the physical characteristics of specific (192)Ir, (137)Cs, and (60)Co source models. This report has been prepared by the High Energy Brachytherapy Source Dosimetry (HEBD) Working Group. This report includes considerations in the application of the TG-43U1 formalism to high-energy photon-emitting sources with particular attention to phantom size effects, interpolation accuracy dependence on dose calculation grid size, and dosimetry parameter dependence on source active length. Consensus datasets for commercially available high-energy photon sources are provided, along with recommended methods for evaluating these datasets. Recommendations on dosimetry characterization methods, mainly using experimental procedures and Monte Carlo, are established and discussed. Also included are methodological recommendations on detector choice, detector energy response characterization and phantom materials, and measurement specification methodology. Uncertainty analyses are discussed and recommendations for high-energy sources without consensus datasets are given. Recommended consensus datasets for high-energy sources have been derived for sources that were commercially available as of January 2010. Data are presented according to the AAPM TG-43U1 formalism, with modified interpolation and extrapolation techniques of the AAPM TG-43U1S1 report for the 2D anisotropy function and radial dose function.

  11. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets.

    PubMed

    Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S

    2014-01-01

    A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA - a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA. Copyright © 2014 Elsevier Inc. All rights reserved.

  12. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    PubMed

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide list of Phonocardiogram (PCG) features in time and frequency domain along with morphological and statistical features to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large and open access database, made available in Physionet 2016 challenge was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smart phone based digital stethoscope from an Indian hospital was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior art approaches, when applied on the same dataset.
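
    A minimal sketch of the style of time- and frequency-domain feature extraction described above for a PCG segment; the synthetic signal, sampling rate, and feature list are illustrative assumptions and do not reproduce the paper's full morphological and statistical feature set.

      import numpy as np
      from scipy.signal import welch
      from scipy.stats import kurtosis, skew

      fs = 2000                                               # sampling rate (Hz)
      t = np.arange(0, 5, 1 / fs)
      pcg = np.sin(2 * np.pi * 40 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

      def pcg_features(x, fs):
          f, pxx = welch(x, fs=fs, nperseg=1024)              # power spectral density
          return {
              "rms": float(np.sqrt(np.mean(x ** 2))),
              "zero_crossing_rate": float(np.mean(np.signbit(x[1:]) != np.signbit(x[:-1]))),
              "skewness": float(skew(x)),
              "kurtosis": float(kurtosis(x)),
              "spectral_centroid_hz": float((f * pxx).sum() / pxx.sum()),
              "dominant_freq_hz": float(f[int(np.argmax(pxx))]),
          }

      print(pcg_features(pcg, fs))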

  13. Updated population metadata for United States historical climatology network stations

    USGS Publications Warehouse

    Owen, T.W.; Gallo, K.P.

    2000-01-01

    The United States Historical Climatology Network (HCN) serial temperature dataset comprises 1221 high-quality, long-term climate observing stations. The HCN dataset is available in several versions, one of which includes population-based temperature modifications to adjust urban temperatures for the "heat-island" effect. Unfortunately, the decennial population metadata file is not complete, as missing values are present for 17.6% of the 12 210 population values associated with the 1221 individual stations during the 1900-90 interval. Retrospective grid-based populations, within a fixed distance of an HCN station, were estimated through the use of a gridded population density dataset and historically available U.S. Census county data. The grid-based populations for the HCN stations provide values derived from a consistent methodology, compared to the current HCN populations, which can vary as definitions of the area associated with a city change over time. The use of grid-based populations may minimally be appropriate to augment populations for HCN climate stations that lack any population data, and is recommended when consistent and complete population data are required. The recommended urban temperature adjustments based on the HCN and grid-based methods of estimating station population can be significantly different for individual stations within the HCN dataset.
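
    A minimal sketch of estimating a station's population from a gridded population-density dataset by summing the cells within a fixed distance of the station; the grid resolution, search radius, and toy density field are illustrative assumptions.

      import numpy as np

      cell_km = 5.0                                            # grid cell size (km)
      density = np.random.default_rng(2).gamma(2.0, 20.0, size=(200, 200))  # persons / km^2
      station_rc = (100, 100)                                  # station location (row, col)
      radius_km = 25.0

      rows, cols = np.indices(density.shape)
      dist_km = cell_km * np.hypot(rows - station_rc[0], cols - station_rc[1])
      within = dist_km <= radius_km
      population = float((density[within] * cell_km ** 2).sum())  # density x cell area
      print(f"grid-based population within {radius_km:.0f} km: {population:,.0f}")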

  14. Associating uncertainty with datasets using Linked Data and allowing propagation via provenance chains

    NASA Astrophysics Data System (ADS)

    Car, Nicholas; Cox, Simon; Fitch, Peter

    2015-04-01

    With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering the resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model, and supports the integration of UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follow the UncertML model they can be automatically interpreted and may also support automatic uncertainty propagation. Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty (PROV-O Entity and Activity classes) have UncertML elements recorded. This methodology is intentionally flexible, to allow uncertainty metadata in many forms, not limited to UncertML. While a more formal representation of uncertainty metadata is desirable (using UncertProv elements to implement the UncertML conceptual model), this will not always be possible, and any uncertainty data stored will be better than none. Since the UncertProv ontology contains a superset of UncertML elements to facilitate the representation of non-UncertML uncertainty data, it could easily be extended to include other formal uncertainty conceptual models, thus allowing non-UncertML propagation calculations.

  15. Using GO-WAR for mining cross-ontology weighted association rules.

    PubMed

    Agapito, Giuseppe; Cannataro, Mario; Guzzi, Pietro Hiram; Milano, Marianna

    2015-07-01

    The Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products. The process of association is referred to as annotation. The relevance and the specificity of both GO terms and annotations are evaluated by a measure defined as information content (IC). The analysis of annotated data is thus an important challenge for bioinformatics. There exist different approaches to this analysis. Among them, association rules (AR) may provide useful knowledge, and they have been used in some applications, e.g. improving the quality of annotations. Nevertheless, classical association rule algorithms take into account neither the source of an annotation nor its importance, yielding candidate rules with low IC. This paper presents GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules. GO-WAR can extract association rules with a high level of IC, without loss of support and confidence, from a dataset of annotated data. A case study applying GO-WAR to publicly available GO annotation datasets demonstrates that our method outperforms current state-of-the-art approaches. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
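
    A minimal sketch of the information-content idea mentioned above: IC is taken here as the negative log of a term's relative annotation frequency, and a candidate rule is weighted by the mean IC of its terms. The toy annotation corpus and the rule-weighting choice are illustrative assumptions, not the GO-WAR algorithm itself.

      import math
      from collections import Counter

      # toy corpus: each gene product is annotated with a set of GO terms
      annotations = [
          {"GO:0008150", "GO:0003674"},
          {"GO:0008150", "GO:0006412"},
          {"GO:0006412", "GO:0003735"},
          {"GO:0008150", "GO:0003674", "GO:0006412"},
      ]
      counts = Counter(t for ann in annotations for t in ann)
      total = len(annotations)
      ic = {t: -math.log2(c / total) for t, c in counts.items()}   # rarer terms are more informative

      def rule_weight(antecedent, consequent):
          terms = set(antecedent) | set(consequent)
          return sum(ic[t] for t in terms) / len(terms)            # mean IC of the rule's terms

      print(round(rule_weight({"GO:0006412"}, {"GO:0003735"}), 3))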

  16. Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis.

    PubMed

    Zhou, Yan; Wang, Pei; Wang, Xianlong; Zhu, Ji; Song, Peter X-K

    2017-01-01

    The multivariate regression model is a useful tool to explore complex associations between two kinds of molecular markers, which enables the understanding of the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations of response variables are assumed to follow a factor analysis model with latent factors. This proposed method not only allows us to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response-predictor associations. The proposed smFARM is implemented by the EM algorithm and the blockwise coordinate descent algorithm. The proposed methodology is evaluated and compared to the existing methods through extensive simulation studies. Our results show that accounting for latent factors through the proposed smFARM can improve sensitivity of signal detection and accuracy of sparse association map estimation. We illustrate smFARM by two integrative genomics analysis examples, a breast cancer dataset and an ovarian cancer dataset, to assess the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to the disease. We identify two trans-hub regions: one in cytoband 17q12 whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer. © 2016 WILEY PERIODICALS, INC.

  17. Ancestral haplotype-based association mapping with generalized linear mixed models accounting for stratification.

    PubMed

    Zhang, Z; Guillaume, F; Sartelet, A; Charlier, C; Georges, M; Farnir, F; Druet, T

    2012-10-01

    In many situations, genome-wide association studies are performed in populations presenting stratification. Mixed models including a kinship matrix accounting for genetic relatedness among individuals have been shown to correct for population and/or family structure. Here we extend this methodology to generalized linear mixed models, which properly model data under various distributions. In addition, we perform association with ancestral haplotypes inferred using a hidden Markov model. The method was shown to properly account for stratification under various simulated scenarios presenting population and/or family structure. Use of ancestral haplotypes resulted in higher power than SNPs on simulated datasets. Application to real data demonstrates the usefulness of the developed model. Full analysis of a dataset with 4600 individuals and 500 000 SNPs was performed in 2 h 36 min and required 2.28 Gb of RAM. The software GLASCOW can be freely downloaded from www.giga.ulg.ac.be/jcms/prod_381171/software. francois.guillaume@jouy.inra.fr Supplementary data are available at Bioinformatics online.

  18. U.S. Heat Demand by Sector for Potential Application of Direct Use Geothermal

    DOE Data Explorer

    Katherine Young

    2016-06-23

    This dataset includes heat demand for potential application of direct use geothermal broken down into 4 sectors: agricultural, commercial, manufacturing and residential. The data for each sector are organized by county, were disaggregated specifically to assess the market demand for geothermal direct use, and were derived using methodologies customized for each sector based on the availability of data and other sector-specific factors. This dataset also includes a paper containing a full explanation of the methodologies used.

  19. ECOALIM: A Dataset of Environmental Impacts of Feed Ingredients Used in French Animal Production.

    PubMed

    Wilfart, Aurélie; Espagnol, Sandrine; Dauguet, Sylvie; Tailleur, Aurélie; Gac, Armelle; Garcia-Launay, Florence

    2016-01-01

    Feeds contribute highly to environmental impacts of livestock products. Therefore, formulating low-impact feeds requires data on environmental impacts of feed ingredients with consistent perimeters and methodology for life cycle assessment (LCA). We created the ECOALIM dataset of life cycle inventories (LCIs) and associated impacts of feed ingredients used in animal production in France. It provides several perimeters for LCIs (field gate, storage agency gate, plant gate and harbour gate) with homogeneously collected data from French R&D institutes covering the 2005-2012 period. The dataset of environmental impacts is available as a Microsoft® Excel spreadsheet on the ECOALIM website and provides climate change, acidification, eutrophication, non-renewable and total cumulative energy demand, phosphorus demand, and land occupation. LCIs in the ECOALIM dataset are available in the AGRIBALYSE® database in SimaPro® software. The typology performed on the dataset classified the 149 average feed ingredients into categories of low impact (co-products of plant origin and minerals), high impact (feed-use amino acids, fats and vitamins) and intermediate impact (cereals, oilseeds, oil meals and protein crops). Therefore, the ECOALIM dataset can be used by feed manufacturers and LCA practitioners to investigate formulation of low-impact feeds. It also provides data for environmental evaluation of feeds and animal production systems. Included in AGRIBALYSE® database and SimaPro®, the ECOALIM dataset will benefit from their procedures for maintenance and regular updating. Future use can also include environmental labelling of commercial products from livestock production.

  20. ECOALIM: A Dataset of Environmental Impacts of Feed Ingredients Used in French Animal Production

    PubMed Central

    Espagnol, Sandrine; Dauguet, Sylvie; Tailleur, Aurélie; Gac, Armelle; Garcia-Launay, Florence

    2016-01-01

    Feeds contribute highly to environmental impacts of livestock products. Therefore, formulating low-impact feeds requires data on environmental impacts of feed ingredients with consistent perimeters and methodology for life cycle assessment (LCA). We created the ECOALIM dataset of life cycle inventories (LCIs) and associated impacts of feed ingredients used in animal production in France. It provides several perimeters for LCIs (field gate, storage agency gate, plant gate and harbour gate) with homogeneously collected data from French R&D institutes covering the 2005–2012 period. The dataset of environmental impacts is available as a Microsoft® Excel spreadsheet on the ECOALIM website and provides climate change, acidification, eutrophication, non-renewable and total cumulative energy demand, phosphorus demand, and land occupation. LCIs in the ECOALIM dataset are available in the AGRIBALYSE® database in SimaPro® software. The typology performed on the dataset classified the 149 average feed ingredients into categories of low impact (co-products of plant origin and minerals), high impact (feed-use amino acids, fats and vitamins) and intermediate impact (cereals, oilseeds, oil meals and protein crops). Therefore, the ECOALIM dataset can be used by feed manufacturers and LCA practitioners to investigate formulation of low-impact feeds. It also provides data for environmental evaluation of feeds and animal production systems. Included in AGRIBALYSE® database and SimaPro®, the ECOALIM dataset will benefit from their procedures for maintenance and regular updating. Future use can also include environmental labelling of commercial products from livestock production. PMID:27930682

  1. A Systems Biology Methodology Combining Transcriptome and Interactome Datasets to Assess the Implications of Cytokinin Signaling for Plant Immune Networks.

    PubMed

    Kunz, Meik; Dandekar, Thomas; Naseem, Muhammad

    2017-01-01

    Cytokinins (CKs) play an important role in plant growth and development. Several studies also highlight the modulatory implications of CKs for plant-pathogen interaction. However, the underlying mechanisms by which CKs mediate immune networks in plants are still not fully understood. A detailed analysis of high-throughput transcriptome (RNA-Seq and microarray) datasets under modulated plant CK conditions, and its integration with the cellular interactome (large-scale protein-protein interaction data), has the potential to unlock the contribution of CKs to plant defense. Here, we specifically describe a detailed systems biology methodology for the acquisition and analysis of various omics datasets that delineate the role of plant CKs in impacting immune pathways in Arabidopsis.

  2. A Novel Methodology for Improving Plant Pest Surveillance in Vineyards and Crops Using UAV-Based Hyperspectral and Spatial Data.

    PubMed

    Vanegas, Fernando; Bratanov, Dmitry; Powell, Kevin; Weiss, John; Gonzalez, Felipe

    2018-01-17

    Recent advances in remote sensed imagery and geospatial image processing using unmanned aerial vehicles (UAVs) have enabled the rapid and ongoing development of monitoring tools for crop management and the detection/surveillance of insect pests. This paper describes a UAV remote sensing-based methodology to increase the efficiency of existing surveillance practices (human inspectors and insect traps) for detecting pest infestations (e.g., grape phylloxera in vineyards). The methodology uses a UAV integrated with advanced digital hyperspectral, multispectral, and RGB sensors. We implemented the methodology for the development of a predictive model for phylloxera detection. In this method, we explore the combination of airborne RGB, multispectral, and hyperspectral imagery with ground-based data at two separate time periods and under different levels of phylloxera infestation. We describe the technology used (the sensors, the UAV, and the flight operations), the processing workflow of the datasets from each imagery type, and the methods for combining multiple airborne with ground-based datasets. Finally, we present relevant results of correlation between the different processed datasets. The objective of this research is to develop a novel methodology for collecting, processing, analysing and integrating multispectral, hyperspectral, ground and spatial data to remotely sense different variables in different applications, such as, in this case, plant pest surveillance. The development of such a methodology would provide researchers, agronomists, and UAV practitioners with reliable data collection protocols and methods to achieve faster processing techniques and integrate multiple sources of data in diverse remote sensing applications.

  3. Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: Evolutionary enhanced Markov clustering.

    PubMed

    Theofilatos, Konstantinos; Pavlopoulou, Niki; Papasavvas, Christoforos; Likothanassis, Spiros; Dimitrakopoulos, Christos; Georgopoulos, Efstratios; Moschopoulos, Charalampos; Mavroudi, Seferina

    2015-03-01

    Proteins are considered to be the most important individual components of biological systems and they combine to form physical protein complexes which are responsible for certain molecular functions. Despite the large availability of protein-protein interaction (PPI) information, not much information is available about protein complexes. Experimental methods are limited in terms of time, efficiency, cost and performance constraints. Existing computational methods have provided encouraging preliminary results, but they face certain disadvantages as they require parameter tuning, some of them cannot handle weighted PPI data and others do not allow a protein to participate in more than one protein complex. In the present paper, we propose a new fully unsupervised methodology for predicting protein complexes from weighted PPI graphs. The proposed methodology is called evolutionary enhanced Markov clustering (EE-MC) and it is a hybrid combination of an adaptive evolutionary algorithm and a state-of-the-art clustering algorithm named enhanced Markov clustering. EE-MC was compared with state-of-the-art methodologies when applied to datasets from the human and the yeast Saccharomyces cerevisiae organisms. Using publicly available datasets, EE-MC outperformed existing methodologies (in some datasets the separation metric was increased by 10-20%). Moreover, when applied to new human datasets its performance was encouraging in the prediction of protein complexes which consist of proteins with high functional similarity. Specifically, 5737 protein complexes were predicted and 72.58% of them are enriched for at least one gene ontology (GO) function term. EE-MC is by design able to overcome intrinsic limitations of existing methodologies such as their inability to handle weighted PPI networks, their constraint of assigning every protein to exactly one cluster and the difficulties they face concerning parameter tuning. This was experimentally validated and, moreover, new potentially true human protein complexes were suggested as candidates for further validation using experimental techniques. Copyright © 2015 Elsevier B.V. All rights reserved.
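
    For readers unfamiliar with the Markov clustering step that EE-MC builds on, the following is a minimal sketch of plain Markov clustering (MCL) on a small weighted adjacency matrix. The toy graph and the expansion/inflation parameters are assumptions; EE-MC additionally tunes such parameters with an adaptive evolutionary algorithm, which is not reproduced here.

        import numpy as np

        def markov_clustering(adj, expansion=2, inflation=2.0, iters=50):
            """Plain Markov clustering on a weighted adjacency matrix (toy sketch)."""
            M = adj.astype(float) + np.eye(len(adj))      # add self-loops
            M = M / M.sum(axis=0)                         # column-normalise to a stochastic matrix
            for _ in range(iters):
                M = np.linalg.matrix_power(M, expansion)  # expansion: spread random-walk flow
                M = M ** inflation                        # inflation: strengthen strong flows
                M = M / M.sum(axis=0)
            # Rows with non-zero mass (attractors) define the clusters.
            clusters = []
            for i in range(len(M)):
                members = set(np.nonzero(M[i] > 1e-6)[0])
                if members and members not in clusters:
                    clusters.append(members)
            return clusters

        # Toy weighted PPI graph: two triangles joined by one weak edge.
        adj = np.array([
            [0, 1, 1, 0,   0, 0],
            [1, 0, 1, 0,   0, 0],
            [1, 1, 0, 0.1, 0, 0],
            [0, 0, 0.1, 0, 1, 1],
            [0, 0, 0,   1, 0, 1],
            [0, 0, 0,   1, 1, 0],
        ], dtype=float)

        print(markov_clustering(adj))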

  4. Factors affecting reproducibility between genome-scale siRNA-based screens

    PubMed Central

    Barrows, Nicholas J.; Le Sommer, Caroline; Garcia-Blanco, Mariano A.; Pearson, James L.

    2011-01-01

    RNA interference-based screening is a powerful new genomic technology which addresses gene function en masse. To evaluate factors influencing hit list composition and reproducibility, we performed two identically designed small interfering RNA (siRNA)-based, whole genome screens for host factors supporting yellow fever virus infection. These screens represent two separate experiments completed five months apart and allow the direct assessment of the reproducibility of a given siRNA technology when performed in the same environment. Candidate hit lists generated by sum rank, median absolute deviation, z-score, and strictly standardized mean difference were compared within and between whole genome screens. Application of these analysis methodologies within a single screening dataset using a fixed threshold equivalent to a p-value ≤ 0.001 resulted in hit lists ranging from 82 to 1,140 members and highlighted the tremendous impact analysis methodology has on hit list composition. Intra- and inter-screen reproducibility was significantly influenced by the analysis methodology and ranged from 32% to 99%. This study also highlighted the power of testing at least two independent siRNAs for each gene product in primary screens. To facilitate validation we conclude by suggesting methods to reduce false discovery at the primary screening stage. In this study we present the first comprehensive comparison of multiple analysis strategies, and demonstrate the impact of the analysis methodology on the composition of the “hit list”. Therefore, we propose that the entire dataset derived from functional genome-scale screens, especially if publicly funded, should be made available as is done with data derived from gene expression and genome-wide association studies. PMID:20625183
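
    To make the comparison of hit-calling statistics concrete, the sketch below applies two of the measures named in this record, robust z-scores and the strictly standardized mean difference (SSMD), to simulated per-gene screen scores. The data, control wells and thresholds are hypothetical and are not drawn from the screens described above.

        import numpy as np

        rng = np.random.default_rng(0)

        # Hypothetical per-gene infection scores: 2000 genes x 3 replicate wells,
        # plus a set of negative-control wells.
        scores = rng.normal(loc=0.0, scale=1.0, size=(2000, 3))
        neg_ctrl = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

        gene_means = scores.mean(axis=1)

        # Robust z-score against the negative-control distribution.
        med = np.median(neg_ctrl)
        mad = np.median(np.abs(neg_ctrl - med))
        robust_z = (gene_means - med) / (1.4826 * mad)

        # SSMD per gene versus negative controls (assuming independence).
        ssmd = (gene_means - neg_ctrl.mean()) / np.sqrt(
            scores.var(axis=1, ddof=1) + neg_ctrl.var(ddof=1)
        )

        z_hits = set(np.where(np.abs(robust_z) > 3)[0])
        ssmd_hits = set(np.where(np.abs(ssmd) > 1.0)[0])

        print("z-score hits:", len(z_hits), "SSMD hits:", len(ssmd_hits),
              "overlap:", len(z_hits & ssmd_hits))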

  5. A Metadata-Based Approach for Analyzing UAV Datasets for Photogrammetric Applications

    NASA Astrophysics Data System (ADS)

    Dhanda, A.; Remondino, F.; Santana Quintero, M.

    2018-05-01

    This paper proposes a methodology for pre-processing and analysing Unmanned Aerial Vehicle (UAV) datasets before photogrammetric processing. In cases where images are gathered without a detailed flight plan and at regular acquisition intervals, the datasets can be quite large and time-consuming to process. This paper proposes a method to calculate the image overlap and filter out images to reduce large block sizes and speed up photogrammetric processing. The Python-based algorithm that implements this methodology leverages the metadata in each image to determine the end and side overlap of grid-based UAV flights. Utilizing user input, the algorithm filters out images that are unneeded for photogrammetric processing. The result is an algorithm that can speed up photogrammetric processing and provide valuable information to the user about the flight path.
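
    A minimal sketch of the overlap calculation described here is given below, assuming a nadir-pointing camera and a simple pinhole model. The ImageMeta fields, the example numbers and the filtering rule are hypothetical; the actual algorithm reads this information from image metadata (e.g. EXIF tags) rather than from hand-entered values.

        from dataclasses import dataclass

        @dataclass
        class ImageMeta:
            """Minimal subset of image metadata needed for overlap estimation (hypothetical)."""
            altitude_m: float        # flying height above ground
            focal_mm: float          # focal length
            sensor_w_mm: float       # sensor width
            sensor_h_mm: float       # sensor height
            spacing_along_m: float   # distance to the next image along the flight line
            spacing_across_m: float  # distance between adjacent flight lines

        def footprint(meta):
            # Ground footprint of a nadir image from the pinhole model.
            w = meta.altitude_m * meta.sensor_w_mm / meta.focal_mm
            h = meta.altitude_m * meta.sensor_h_mm / meta.focal_mm
            return w, h

        def overlaps(meta):
            w, h = footprint(meta)
            end_overlap = max(0.0, 1.0 - meta.spacing_along_m / h)
            side_overlap = max(0.0, 1.0 - meta.spacing_across_m / w)
            return end_overlap, side_overlap

        meta = ImageMeta(altitude_m=60, focal_mm=8.8, sensor_w_mm=13.2, sensor_h_mm=8.8,
                         spacing_along_m=12, spacing_across_m=30)
        end, side = overlaps(meta)
        print(f"end overlap {end:.0%}, side overlap {side:.0%}")
        # Images whose overlap far exceeds the target could then be filtered out.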

  6. DMET-Miner: Efficient discovery of association rules from pharmacogenomic data.

    PubMed

    Agapito, Giuseppe; Guzzi, Pietro H; Cannataro, Mario

    2015-08-01

    Microarray platforms enable the investigation of allelic variants that may be correlated to phenotypes. Among those, the Affymetrix DMET (Drug Metabolism Enzymes and Transporters) platform enables the simultaneous investigation of all the genes that are related to drug absorption, distribution, metabolism and excretion (ADME). Although recent studies demonstrated the effectiveness of the use of DMET data for studying drug response or toxicity in clinical studies, there is a lack of tools for the automatic analysis of DMET data. In a previous work we developed DMET-Analyzer, a methodology and a supporting platform able to automate the statistical study of allelic variants, which has been validated in several clinical studies. Although DMET-Analyzer is able to correlate a single variant for each probe (related to a portion of a gene) through the use of the Fisher test, it is unable to discover multiple associations among allelic variants, because its underlying statistical analysis strategy focuses on a single variant at a time. To overcome those limitations, here we propose a new analysis methodology for DMET data based on association rule mining, and an efficient implementation of this methodology, named DMET-Miner. DMET-Miner extends the DMET-Analyzer tool with data mining capabilities and correlates the presence of a set of allelic variants with the conditions of patients' samples by exploiting association rules. To face the high number of frequent itemsets generated when considering large clinical studies based on DMET data, DMET-Miner uses an efficient data structure and implements an optimized search strategy that reduces the search space and the execution time. Preliminary experiments on synthetic DMET datasets show that DMET-Miner outperforms off-the-shelf data mining suites such as the FP-Growth implementations available in Weka and RapidMiner. To demonstrate the biological relevance of the extracted association rules and the effectiveness of the proposed approach from a medical point of view, some preliminary studies on a real clinical dataset are currently under medical investigation. Copyright © 2015 Elsevier Inc. All rights reserved.
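
    The following sketch illustrates the general idea of mining association rules that link genotype calls to a response class, in the spirit of the approach described here. It is not the DMET-Miner implementation: the probe names, genotype encoding, thresholds and the brute-force enumeration of small itemsets are all assumptions for illustration, whereas the real tool uses an optimized search strategy.

        from collections import Counter
        from itertools import combinations

        # Hypothetical DMET-like table: one transaction per patient, items are
        # "probe=genotype" calls plus the patient's response class.
        transactions = [
            {"CYP2D6_1=A/A", "ABCB1_2=C/T", "response=toxic"},
            {"CYP2D6_1=A/A", "ABCB1_2=C/C", "response=toxic"},
            {"CYP2D6_1=A/G", "ABCB1_2=C/T", "response=normal"},
            {"CYP2D6_1=A/A", "ABCB1_2=C/T", "response=toxic"},
            {"CYP2D6_1=G/G", "ABCB1_2=T/T", "response=normal"},
        ]

        min_support, min_confidence = 0.4, 0.8
        n = len(transactions)

        # Count itemsets of size 1 and 2 (a real miner prunes far larger search spaces).
        counts = Counter()
        for t in transactions:
            for k in (1, 2):
                for itemset in combinations(sorted(t), k):
                    counts[itemset] += 1

        # Derive rules "genotype itemset -> response class".
        for itemset, c in counts.items():
            if len(itemset) != 2 or c / n < min_support:
                continue
            for consequent in itemset:
                if not consequent.startswith("response="):
                    continue
                antecedent = tuple(i for i in itemset if i != consequent)
                confidence = c / counts[antecedent]
                if confidence >= min_confidence:
                    print(antecedent, "->", consequent,
                          f"support={c / n:.2f} confidence={confidence:.2f}")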

  7. Bayesian Integration of Isotope Ratio for Geographic Sourcing of Castor Beans

    DOE PAGES

    Webb-Robertson, Bobbie-Jo; Kreuzer, Helen; Hart, Garret; ...

    2012-01-01

    Recent years have seen an increase in the forensic interest associated with the poison ricin, which is extracted from the seeds of the Ricinus communis plant. Both light element (C, N, O, and H) and strontium (Sr) isotope ratios have previously been used to associate organic material with geographic regions of origin. We present a Bayesian integration methodology that can more accurately predict the region of origin for a castor bean than individual models developed independently for light element stable isotopes or Sr isotope ratios. Our results demonstrate a clear improvement in the ability to correctly classify regions based on the integrated model with a class accuracy of 60.9 ± 2.1% versus 55.9 ± 2.1% and 40.2 ± 1.8% for the light element and strontium (Sr) isotope ratios, respectively. In addition, we show graphically the strengths and weaknesses of each dataset in respect to class prediction and how the integration of these datasets strengthens the overall model.

  8. Bayesian Integration of Isotope Ratios for Geographic Sourcing of Castor Beans

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Webb-Robertson, Bobbie-Jo M.; Kreuzer, Helen W.; Hart, Garret L.

    Recent years have seen an increase in the forensic interest associated with the poison ricin, which is extracted from the seeds of the Ricinus communis plant. Both light element (C, N, O, and H) and strontium (Sr) isotope ratios have previously been used to associate organic material with geographic regions of origin. We present a Bayesian integration methodology that can more accurately predict the region of origin for a castor bean than individual models developed independently for light element stable isotopes or Sr isotope ratios. Our results demonstrate a clear improvement in the ability to correctly classify regions based on the integrated model with a class accuracy of 60.9 ± 2.1% versus 55.9 ± 2.1% and 40.2 ± 1.8% for the light element and strontium (Sr) isotope ratios, respectively. In addition, we show graphically the strengths and weaknesses of each dataset in respect to class prediction and how the integration of these datasets strengthens the overall model.

  9. Bayesian Integration of Isotope Ratio for Geographic Sourcing of Castor Beans

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Webb-Robertson, Bobbie-Jo; Kreuzer, Helen; Hart, Garret

    Recent years have seen an increase in the forensic interest associated with the poison ricin, which is extracted from the seeds of the Ricinus communis plant. Both light element (C, N, O, and H) and strontium (Sr) isotope ratios have previously been used to associate organic material with geographic regions of origin. We present a Bayesian integration methodology that can more accurately predict the region of origin for a castor bean than individual models developed independently for light element stable isotopes or Sr isotope ratios. Our results demonstrate a clear improvement in the ability to correctly classify regions based on the integrated model with a class accuracy of 60.9 ± 2.1% versus 55.9 ± 2.1% and 40.2 ± 1.8% for the light element and strontium (Sr) isotope ratios, respectively. In addition, we show graphically the strengths and weaknesses of each dataset in respect to class prediction and how the integration of these datasets strengthens the overall model.

  10. Bayesian Integration of Isotope Ratio for Geographic Sourcing of Castor Beans

    PubMed Central

    Webb-Robertson, Bobbie-Jo; Kreuzer, Helen; Hart, Garret; Ehleringer, James; West, Jason; Gill, Gary; Duckworth, Douglas

    2012-01-01

    Recent years have seen an increase in the forensic interest associated with the poison ricin, which is extracted from the seeds of the Ricinus communis plant. Both light element (C, N, O, and H) and strontium (Sr) isotope ratios have previously been used to associate organic material with geographic regions of origin. We present a Bayesian integration methodology that can more accurately predict the region of origin for a castor bean than individual models developed independently for light element stable isotopes or Sr isotope ratios. Our results demonstrate a clear improvement in the ability to correctly classify regions based on the integrated model with a class accuracy of 60.9 ± 2.1% versus 55.9 ± 2.1% and 40.2 ± 1.8% for the light element and strontium (Sr) isotope ratios, respectively. In addition, we show graphically the strengths and weaknesses of each dataset in respect to class prediction and how the integration of these datasets strengthens the overall model. PMID:22919270

  11. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  12. A Novel Methodology for Improving Plant Pest Surveillance in Vineyards and Crops Using UAV-Based Hyperspectral and Spatial Data

    PubMed Central

    Vanegas, Fernando; Weiss, John; Gonzalez, Felipe

    2018-01-01

    Recent advances in remote sensed imagery and geospatial image processing using unmanned aerial vehicles (UAVs) have enabled the rapid and ongoing development of monitoring tools for crop management and the detection/surveillance of insect pests. This paper describes a (UAV) remote sensing-based methodology to increase the efficiency of existing surveillance practices (human inspectors and insect traps) for detecting pest infestations (e.g., grape phylloxera in vineyards). The methodology uses a UAV integrated with advanced digital hyperspectral, multispectral, and RGB sensors. We implemented the methodology for the development of a predictive model for phylloxera detection. In this method, we explore the combination of airborne RGB, multispectral, and hyperspectral imagery with ground-based data at two separate time periods and under different levels of phylloxera infestation. We describe the technology used—the sensors, the UAV, and the flight operations—the processing workflow of the datasets from each imagery type, and the methods for combining multiple airborne with ground-based datasets. Finally, we present relevant results of correlation between the different processed datasets. The objective of this research is to develop a novel methodology for collecting, processing, analysing and integrating multispectral, hyperspectral, ground and spatial data to remote sense different variables in different applications, such as, in this case, plant pest surveillance. The development of such methodology would provide researchers, agronomists, and UAV practitioners reliable data collection protocols and methods to achieve faster processing techniques and integrate multiple sources of data in diverse remote sensing applications. PMID:29342101

  13. A new time-series methodology for estimating relationships between elderly frailty, remaining life expectancy, and ambient air quality.

    PubMed

    Murray, Christian J; Lipfert, Frederick W

    2012-01-01

    Many publications estimate short-term air pollution-mortality risks, but few estimate the associated changes in life expectancies. We present a new methodology for analyzing time series of health effects, in which prior frailty is assumed to precede short-term elderly nontraumatic mortality. The model is based on a subpopulation of frail individuals whose entries and exits (deaths) are functions of daily and lagged environmental conditions: ambient temperature/season, airborne particles, and ozone. This frail susceptible population is unknown; its fluctuations cannot be observed but are estimated using maximum-likelihood methods with the Kalman filter. We used an existing 14-y set of daily data to illustrate the model and then tested the assumption of prior frailty with a new generalized model that estimates the portion of the daily death count allocated to nonfrail individuals. In this demonstration dataset, new entries into the high-risk pool are associated with lower ambient temperatures and higher concentrations of particulate matter and ozone. Accounting for these effects on antecedent frailty reduces this at-risk population, yielding frail life expectancies of 5-7 days. Associations between environmental factors and entries to the at-risk pool are about twice as strong as for mortality. Nonfrail elderly deaths are seen to make only small contributions. This new model predicts a small short-lived frail population-at-risk that is stable over a wide range of environmental conditions. The predicted effects of pollution on new entries and deaths are robust and consistent with conventional morbidity/mortality time-series studies. We recommend model verification using other suitable datasets.

  14. Mining Context-Aware Association Rules Using Grammar-Based Genetic Programming.

    PubMed

    Luna, Jose Maria; Pechenizkiy, Mykola; Del Jesus, Maria Jose; Ventura, Sebastian

    2017-09-25

    Real-world data usually comprise features whose interpretation depends on some contextual information. Such context-sensitive features and patterns are of high interest to be discovered and analyzed in order to obtain the right meaning. This paper formulates the problem of mining context-aware association rules, which refers to the search for associations between itemsets such that the strength of their implication depends on a contextual feature. For the discovery of this type of association, a model that restricts the search space and includes syntax constraints by means of a grammar-based genetic programming methodology is proposed. Grammars can be considered a useful way of introducing subjective knowledge to the pattern mining process, as they are highly related to the background knowledge of the user. The performance and usefulness of the proposed approach are examined by considering synthetically generated datasets. A posteriori analysis on different domains is also carried out to demonstrate the utility of this kind of association. For example, in educational domains, it is essential to identify and understand contextual and context-sensitive factors that affect overall and individual student behavior and performance. The results of the experiments suggest that the approach is feasible and that it automatically identifies interesting context-aware associations from real-world datasets.

  15. Rear-end vision-based collision detection system for motorcyclists

    NASA Astrophysics Data System (ADS)

    Muzammel, Muhammad; Yusoff, Mohd Zuki; Meriaudeau, Fabrice

    2017-05-01

    In many countries, the motorcyclist fatality rate is much higher than that of other vehicle drivers. Among many other factors, motorcycle rear-end collisions also contribute to these biker fatalities. To increase the safety of motorcyclists and minimize their road fatalities, this paper introduces a vision-based rear-end collision detection system. The binary road detection scheme contributes significantly to reducing false detections and helps to achieve reliable results even when shadows and different lane markers are present on the road. The methodology is based on Harris corner detection and the Hough transform. To validate this methodology, two types of datasets are used: (1) self-recorded datasets (obtained by placing a camera at the rear end of a motorcycle) and (2) online datasets (recorded by placing a camera at the front of a car). The method achieved 95.1% accuracy on the self-recorded dataset and gives reliable rear-end vehicle detection results under different road scenarios. The technique also performs well on the online car datasets. The proposed technique's high detection accuracy using a monocular vision camera, coupled with its low computational complexity, makes it a suitable candidate for a motorbike rear-end collision detection system.
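
    The two building blocks named in this record, Harris corner detection and the Hough transform, are available in OpenCV; a minimal sketch is shown below. The frame path and all thresholds are placeholders, and the full system additionally performs binary road detection and the collision logic, which are not reproduced here.

        import cv2
        import numpy as np

        # Load one rear-view frame (placeholder path).
        frame = cv2.imread("rear_view_frame.jpg")
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Harris corner response highlights corner-rich, vehicle-like regions.
        corners = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
        corner_mask = corners > 0.01 * corners.max()

        # Hough transform on Canny edges recovers lane/road line structure.
        edges = cv2.Canny(gray, 50, 150)
        lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                                minLineLength=40, maxLineGap=10)

        vis = frame.copy()
        vis[corner_mask] = (0, 0, 255)          # mark corner responses in red
        if lines is not None:
            for x1, y1, x2, y2 in lines[:, 0]:
                cv2.line(vis, (x1, y1), (x2, y2), (0, 255, 0), 2)

        cv2.imwrite("detections.jpg", vis)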

  16. GAPIT: genome association and prediction integrated tool.

    PubMed

    Lipka, Alexander E; Tian, Feng; Wang, Qishan; Peiffer, Jason; Li, Meng; Bradbury, Peter J; Gore, Michael A; Buckler, Edward S; Zhang, Zhiwu

    2012-09-15

    Software programs that conduct genome-wide association studies and genomic prediction and selection need to use methodologies that maximize statistical power, provide high prediction accuracy and run in a computationally efficient manner. We developed an R package called Genome Association and Prediction Integrated Tool (GAPIT) that implements advanced statistical methods including the compressed mixed linear model (CMLM) and CMLM-based genomic prediction and selection. The GAPIT package can handle large datasets in excess of 10 000 individuals and 1 million single-nucleotide polymorphisms with minimal computational time, while providing user-friendly access and concise tables and graphs to interpret results. http://www.maizegenetics.net/GAPIT. zhiwu.zhang@cornell.edu Supplementary data are available at Bioinformatics online.

  17. Multi-decadal Hydrological Retrospective: Case study of Amazon floods and droughts

    NASA Astrophysics Data System (ADS)

    Wongchuig Correa, Sly; Paiva, Rodrigo Cauduro Dias de; Espinoza, Jhan Carlo; Collischonn, Walter

    2017-06-01

    Recently developed methodologies such as climate reanalysis make it possible to create a historical record of climate systems. This paper proposes a methodology called Hydrological Retrospective (HR), which essentially runs large rainfall datasets through hydrological models to develop a record of past hydrology, making it possible to analyze past floods and droughts. We developed the methodology for the Amazon basin, where studies have shown an increase in the intensity and frequency of hydrological extreme events in recent decades. We used eight large precipitation datasets (more than 30 years) as input for a large-scale hydrological and hydrodynamic model (MGB-IPH). HR products were then validated against several in situ discharge gauges controlling the main Amazon sub-basins, focusing on maximum and minimum events. For the most accurate HR product, based on performance metrics, we assessed the forecast skill of HR in detecting floods and droughts, comparing the results with in situ observations. A statistical trend analysis of the time series of seasonal flood and drought intensity was performed for the entire Amazon basin. Results indicate that HR could represent most past extreme events well, compared with in situ observed data, and was consistent with many events reported in the literature. Because of their flow duration, some minor regional events were not reported in the literature but were captured by HR. To represent past regional hydrology and seasonal hydrological extreme events, we believe it is feasible to use large precipitation datasets such as i) climate reanalyses, which are mainly based on a land surface component, and ii) datasets based on merged products. A significant upward trend in intensity was seen in maximum annual discharge (related to floods) in western and northwestern regions and in minimum annual discharge (related to droughts) in south and central-south regions of the Amazon basin. Because of the global coverage of rainfall datasets, this methodology can be transferred to other regions for better estimation of future hydrological behavior and its impact on society.

  18. How does spatial extent of fMRI datasets affect independent component analysis decomposition?

    PubMed

    Aragri, Adriana; Scarabino, Tommaso; Seifritz, Erich; Comani, Silvia; Cirillo, Sossio; Tedeschi, Gioacchino; Esposito, Fabrizio; Di Salle, Francesco

    2006-09-01

    Spatial independent component analysis (sICA) of functional magnetic resonance imaging (fMRI) time series can generate meaningful activation maps and associated descriptive signals, which are useful to evaluate datasets of the entire brain or selected portions of it. Besides computational implications, variations in the input dataset combined with the multivariate nature of ICA may lead to different spatial or temporal readouts of brain activation phenomena. By reducing and increasing a volume of interest (VOI), we applied sICA to different datasets from real activation experiments with multislice acquisition and single or multiple sensory-motor task-induced blood oxygenation level-dependent (BOLD) signal sources with different spatial and temporal structure. Using receiver operating characteristics (ROC) methodology for accuracy evaluation and multiple regression analysis as benchmark, we compared sICA decompositions of reduced and increased VOI fMRI time-series containing auditory, motor and hemifield visual activation occurring separately or simultaneously in time. Both approaches yielded valid results; however, the results of the increased VOI approach were spatially more accurate compared to the results of the decreased VOI approach. This is consistent with the capability of sICA to take advantage of extended samples of statistical observations and suggests that sICA is more powerful with extended rather than reduced VOI datasets to delineate brain activity. (c) 2006 Wiley-Liss, Inc.

  19. NASA Astrophysics Data System (ADS)

    2018-01-01

    The test dataset was also useful to compare visual range estimates carried out by the Koschmieder equation and visibility measured at the Milano-Linate airport. It is worth noting that in this work the test dataset was used primarily for checking the proposed methodology and was not meant to give an assessment of bext and VR in Milan for a wintertime period, as done by Vecchi et al. [in press], who applied the tailored equation to a larger aerosol dataset.

  20. TESTING TREE-CLASSIFIER VARIANTS AND ALTERNATE MODELING METHODOLOGIES IN THE EAST GREAT BASIN MAPPING UNIT OF THE SOUTHWEST REGIONAL GAP ANALYSIS PROJECT (SW REGAP)

    EPA Science Inventory

    We tested two methods for dataset generation and model construction, and three tree-classifier variants to identify the most parsimonious and thematically accurate mapping methodology for the SW ReGAP project. Competing methodologies were tested in the East Great Basin mapping un...

  1. Choosing the Most Effective Pattern Classification Model under Learning-Time Constraint.

    PubMed

    Saito, Priscila T M; Nakamura, Rodrigo Y M; Amorim, Willian P; Papa, João P; de Rezende, Pedro J; Falcão, Alexandre X

    2015-01-01

    Nowadays, large datasets are common and demand faster and more effective pattern analysis techniques. However, methodologies to compare classifiers usually do not take into account the learning-time constraints required by applications. This work presents a methodology to compare classifiers with respect to their ability to learn from classification errors on a large learning set, within a given time limit. Faster techniques may acquire more training samples, but only when they are more effective will they achieve higher performance on unseen testing sets. We demonstrate this result using several techniques, multiple datasets, and typical learning-time limits required by applications.
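
    A minimal sketch of comparing classifiers under a fixed learning-time budget is given below; it is an illustration of the general idea rather than the authors' protocol. The budget, batch size, classifiers and synthetic data are assumptions, and each batch simply retrains from scratch instead of learning from classification errors as the paper describes.

        import time
        from sklearn.datasets import make_classification
        from sklearn.linear_model import SGDClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
        X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

        def accuracy_within_budget(make_clf, budget_s=2.0, batch=2000):
            """Grow the training set in batches until the time budget is exhausted."""
            used, start, clf = 0, time.perf_counter(), None
            while used + batch <= len(X_learn) and time.perf_counter() - start < budget_s:
                used += batch
                clf = make_clf().fit(X_learn[:used], y_learn[:used])
            return used, clf.score(X_test, y_test)

        for name, make_clf in [("linear SGD", lambda: SGDClassifier(random_state=0)),
                               ("5-NN", lambda: KNeighborsClassifier(n_neighbors=5))]:
            used, acc = accuracy_within_budget(make_clf)
            print(f"{name}: trained on {used} samples, test accuracy {acc:.3f}")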

  2. A methodology for least-squares local quasi-geoid modelling using a noisy satellite-only gravity field model

    NASA Astrophysics Data System (ADS)

    Klees, R.; Slobbe, D. C.; Farahani, H. H.

    2018-04-01

    The paper is about a methodology to combine a noisy satellite-only global gravity field model (GGM) with other noisy datasets to estimate a local quasi-geoid model using weighted least-squares techniques. In this way, we attempt to improve the quality of the estimated quasi-geoid model and to complement it with a full noise covariance matrix for quality control and further data processing. The methodology goes beyond the classical remove-compute-restore approach, which does not account for the noise in the satellite-only GGM. We suggest and analyse three different approaches of data combination. Two of them are based on a local single-scale spherical radial basis function (SRBF) model of the disturbing potential, and one is based on a two-scale SRBF model. Using numerical experiments, we show that a single-scale SRBF model does not fully exploit the information in the satellite-only GGM. We explain this by a lack of flexibility of a single-scale SRBF model to deal with datasets of significantly different bandwidths. The two-scale SRBF model performs well in this respect, provided that the model coefficients representing the two scales are estimated separately. The corresponding methodology is developed in this paper. Using the statistics of the least-squares residuals and the statistics of the errors in the estimated two-scale quasi-geoid model, we demonstrate that the developed methodology provides a two-scale quasi-geoid model, which exploits the information in all datasets.

  3. A methodological investigation of hominoid craniodental morphology and phylogenetics.

    PubMed

    Bjarnason, Alexander; Chamberlain, Andrew T; Lockwood, Charles A

    2011-01-01

    The evolutionary relationships of extant great apes and humans have been largely resolved by molecular studies, yet morphology-based phylogenetic analyses continue to provide conflicting results. In order to further investigate this discrepancy we present bootstrap clade support of morphological data based on two quantitative datasets, one dataset consisting of linear measurements of the whole skull from 5 hominoid genera and the second dataset consisting of 3D landmark data from the temporal bone of 5 hominoid genera, including 11 sub-species. Using similar protocols for both datasets, we were able to 1) compare distance-based phylogenetic methods to cladistic parsimony of quantitative data converted into discrete character states, 2) vary outgroup choice to observe its effect on phylogenetic inference, and 3) analyse male and female data separately to observe the effect of sexual dimorphism on phylogenies. Phylogenetic analysis was sensitive to methodological decisions, particularly outgroup selection, where designation of Pongo as an outgroup and removal of Hylobates resulted in greater congruence with the proposed molecular phylogeny. The performance of distance-based methods also justifies their use in phylogenetic analysis of morphological data. It is clear from our analyses that hominoid phylogenetics ought not to be used as an example of conflict between the morphological and molecular, but as an example of how outgroup and methodological choices can affect the outcome of phylogenetic analysis. Copyright © 2010 Elsevier Ltd. All rights reserved.

  4. Climatic Analysis of Oceanic Water Vapor Transports Based on Satellite E-P Datasets

    NASA Technical Reports Server (NTRS)

    Smith, Eric A.; Sohn, Byung-Ju; Mehta, Vikram

    2004-01-01

    Understanding the climatically varying properties of water vapor transports from a robust observational perspective is an essential step in calibrating climate models. This is tantamount to measuring year-to-year changes of monthly- or seasonally-averaged, divergent water vapor transport distributions. This cannot be done effectively with conventional radiosonde data over ocean regions where sounding data are generally sparse. This talk describes how a methodology designed to derive atmospheric water vapor transports over the world oceans from satellite-retrieved precipitation (P) and evaporation (E) datasets circumvents the problem of inadequate sampling. Ultimately, the method is intended to take advantage of the relatively complete and consistent coverage, as well as continuity in sampling, associated with E and P datasets obtained from satellite measurements. Independent P and E retrievals from Special Sensor Microwave Imager (SSM/I) measurements, along with P retrievals from Tropical Rainfall Measuring Mission (TRMM) measurements, are used to obtain transports by solving a potential function for the divergence of water vapor transport as balanced by large scale E - P conditions.
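
    As a rough sketch of the balance relation underlying this methodology, the code below solves a Poisson equation for a transport potential chi such that the Laplacian of chi equals E - P on a doubly periodic toy grid, and takes its gradient as the divergent water vapor transport. The toy field, grid spacing and boundary treatment are assumptions; the actual study works on the sphere with satellite-retrieved E and P fields.

        import numpy as np

        # Hypothetical time-mean E - P field on a periodic grid (arbitrary units).
        ny, nx = 90, 180
        y, x = np.meshgrid(np.linspace(-1, 1, ny),
                           np.linspace(0, 2 * np.pi, nx, endpoint=False), indexing="ij")
        e_minus_p = np.exp(-((y / 0.3) ** 2)) * np.cos(2 * x)   # toy pattern

        # For long-term means, div(Q) ~ E - P, so solve laplacian(chi) = E - P
        # spectrally on the doubly periodic toy domain.
        dx = dy = 1.0
        kx = np.fft.fftfreq(nx, d=dx) * 2 * np.pi
        ky = np.fft.fftfreq(ny, d=dy) * 2 * np.pi
        KX, KY = np.meshgrid(kx, ky, indexing="xy")
        k2 = KX ** 2 + KY ** 2
        k2[0, 0] = 1.0                     # avoid dividing by zero for the mean mode

        rhs_hat = np.fft.fft2(e_minus_p)
        chi_hat = -rhs_hat / k2
        chi_hat[0, 0] = 0.0
        chi = np.real(np.fft.ifft2(chi_hat))

        # Divergent transport components are the gradient of the potential.
        qy, qx = np.gradient(chi, dy, dx)
        print("max |Q_div| (arbitrary units):", float(np.hypot(qx, qy).max()))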

  5. Realistic computer network simulation for network intrusion detection dataset generation

    NASA Astrophysics Data System (ADS)

    Payer, Garrett

    2015-05-01

    The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.

  6. A Novel Clustering Methodology Based on Modularity Optimisation for Detecting Authorship Affinities in Shakespearean Era Plays

    PubMed Central

    Craig, Hugh; Berretta, Regina; Moscato, Pablo

    2016-01-01

    In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graphs by using the Jensen-Shannon distance, a dissimilarity measure originating in information theory. Moreover, we use graph-theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 plays written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays. PMID:27571416
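
    A minimal sketch of the two ingredients named here, Jensen-Shannon distances between word-frequency profiles and modularity-based community detection on a proximity graph, is given below. It substitutes an off-the-shelf greedy modularity algorithm from networkx for the memetic iMA-Net algorithm, and the profiles, neighbourhood size and edge weighting are hypothetical.

        import numpy as np
        from scipy.spatial.distance import jensenshannon
        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        rng = np.random.default_rng(1)

        # Hypothetical word-frequency profiles for a handful of documents (rows sum to 1).
        profiles = rng.dirichlet(alpha=np.ones(200), size=12)
        labels = [f"play_{i}" for i in range(len(profiles))]

        # Build a proximity graph: connect each document to its k closest neighbours
        # under the Jensen-Shannon distance.
        k = 3
        D = np.array([[jensenshannon(p, q) for q in profiles] for p in profiles])
        G = nx.Graph()
        G.add_nodes_from(labels)
        for i in range(len(profiles)):
            for j in np.argsort(D[i])[1:k + 1]:            # skip self (distance 0)
                G.add_edge(labels[i], labels[int(j)], weight=1.0 - D[i, j])

        # Community detection by modularity maximisation stands in for iMA-Net here.
        for community in greedy_modularity_communities(G, weight="weight"):
            print(sorted(community))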

  7. Statistical testing and power analysis for brain-wide association study.

    PubMed

    Gong, Weikang; Wan, Lin; Lu, Wenlian; Ma, Liang; Cheng, Fan; Cheng, Wei; Grünewald, Stefan; Feng, Jianfeng

    2018-04-05

    The identification of connexel-wise associations, which involves examining functional connectivities between pairwise voxels across the whole brain, is both statistically and computationally challenging. Although such a connexel-wise methodology has recently been adopted by brain-wide association studies (BWAS) to identify connectivity changes in several mental disorders, such as schizophrenia, autism and depression, the multiple correction and power analysis methods designed specifically for connexel-wise analysis are still lacking. Therefore, we herein report the development of a rigorous statistical framework for connexel-wise significance testing based on the Gaussian random field theory. It includes controlling the family-wise error rate (FWER) of multiple hypothesis testings using topological inference methods, and calculating power and sample size for a connexel-wise study. Our theoretical framework can control the false-positive rate accurately, as validated empirically using two resting-state fMRI datasets. Compared with Bonferroni correction and false discovery rate (FDR), it can reduce false-positive rate and increase statistical power by appropriately utilizing the spatial information of fMRI data. Importantly, our method bypasses the need of non-parametric permutation to correct for multiple comparison, thus, it can efficiently tackle large datasets with high resolution fMRI images. The utility of our method is shown in a case-control study. Our approach can identify altered functional connectivities in a major depression disorder dataset, whereas existing methods fail. A software package is available at https://github.com/weikanggong/BWAS. Copyright © 2018 Elsevier B.V. All rights reserved.

  8. Studying Child Care Subsidies with Secondary Data Sources. Methodological Brief OPRE 2012-54

    ERIC Educational Resources Information Center

    Ha, Yoonsook; Johnson, Anna D.

    2012-01-01

    This brief describes four national surveys with data relevant to subsidy-related research and provides a useful set of considerations for subsidy researchers considering use of secondary data. Specifically, this brief describes each of the four datasets reviewed, highlighting unique features of each dataset and providing information on the survey…

  9. Post-MBA Industry Shifts: An Investigation of Career, Educational and Demographic Factors

    ERIC Educational Resources Information Center

    Hwang, Alvin; Bento, Regina; Arbaugh, J. B.

    2011-01-01

    Purpose: The purpose of this study is to examine factors that predict industry-level career change among MBA graduates. Design/methodology/approach: The study analyzed longitudinal data from the Management Education Research Institute (MERI)'s Global MBA Graduate Survey Dataset and MBA Alumni Perspectives Survey Datasets, using principal component…

  10. Measurement properties of comorbidity indices in maternal health research: a systematic review.

    PubMed

    Aoyama, Kazuyoshi; D'Souza, Rohan; Inada, Eiichi; Lapinsky, Stephen E; Fowler, Robert A

    2017-11-13

    Maternal critical illness occurs in 1.2 to 4.7 of every 1000 live births in the United States and approximately 1 in 100 women who become critically ill will die. Patient characteristics and comorbid conditions are commonly summarized as an index or score for the purpose of predicting the likelihood of dying; however, most such indices have arisen from non-pregnant patient populations. We sought to systematically review comorbidity indices used in health administrative datasets of pregnant women, in order to critically appraise their measurement properties and recommend optimal tools for clinicians and maternal health researchers. We conducted a systematic search of MEDLINE and EMBASE to identify studies published from 1946 and 1947, respectively, to May 2017 that describe predictive validity of comorbidity indices using health administrative datasets in the field of maternal health research. We applied a methodological PubMed search filter to identify all studies of measurement properties for each index. Our initial search retrieved 8944 citations. The full text of 61 articles were identified and assessed for final eligibility. Finally, two eligible articles, describing three comorbidity indices appropriate for health administrative data remained: The Maternal comorbidity index, the Charlson comorbidity index and the Elixhauser Comorbidity Index. These studies of identified indices had a low risk of bias. The lack of an established consensus-building methodology in generating each index resulted in marginal sensibility for all indices. Only the Maternal Comorbidity Index was derived and validated specifically from a cohort of pregnant and postpartum women, using an administrative dataset, and had an associated c-statistic of 0.675 (95% Confidence Interval 0.647-0.666) in predicting mortality. Only the Maternal Comorbidity Index directly evaluated measurement properties relevant to pregnant women in health administrative datasets; however, it has only modest predictive ability for mortality among development and validation studies. Further research to investigate the feasibility of applying this index in clinical research, and its reliability across a variety of health administrative datasets would be incrementally helpful. Evolution of this and other tools for risk prediction and risk adjustment in pregnant and post-partum patients is an important area for ongoing study.

  11. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high frequency (~1 Hz) and long period (10-20 s) body-wave measurements into the REM-3D reference dataset.

  12. Consolidating drug data on a global scale using Linked Data.

    PubMed

    Jovanovik, Milos; Trajanov, Dimitar

    2017-01-21

    Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world. We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain. The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the supporting tools in transforming drug product data from various, independent and distributed sources, into a comprehensive Linked Drug Data dataset. The presented user-centric and analytical usage scenarios over the dataset show the advantages of having a de-siloed, consolidated and comprehensive dataspace of drug data available via the existing infrastructure of the Web.
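
    A minimal sketch of publishing one drug product as Linked Data with rdflib is shown below. The namespaces, identifiers and the interlink target are hypothetical placeholders; the actual LinkedDrugs dataset follows its own schema and linking methodology.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import RDF, RDFS

        # Hypothetical namespaces/identifiers; the real dataset defines its own schema.
        SCHEMA = Namespace("http://schema.org/")
        EX = Namespace("http://example.org/drugs/")

        g = Graph()
        g.bind("schema", SCHEMA)

        drug = EX["mk-0000123"]
        g.add((drug, RDF.type, SCHEMA.Drug))
        g.add((drug, RDFS.label, Literal("Paracetamol 500 mg tablets")))
        g.add((drug, SCHEMA.activeIngredient, Literal("Paracetamol")))
        g.add((drug, SCHEMA.manufacturer, Literal("Example Pharma")))
        # Interlink with another dataset in the LOD Cloud (illustrative target URI).
        g.add((drug, RDFS.seeAlso, URIRef("http://bio2rdf.org/drugbank:DB00316")))

        print(g.serialize(format="turtle"))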

  13. Design and analysis of multiple diseases genome-wide association studies without controls.

    PubMed

    Chen, Zhongxue; Huang, Hanwen; Ng, Hon Keung Tony

    2012-11-15

    In genome-wide association studies (GWAS), multiple diseases with shared controls is one of the case-control study designs. If data obtained from these studies are appropriately analyzed, this design can have several advantages such as improving statistical power in detecting associations and reducing the time and cost in the data collection process. In this paper, we propose a study design for GWAS which involves multiple diseases but without controls. We also propose corresponding statistical data analysis strategy for GWAS with multiple diseases but no controls. Through a simulation study, we show that the statistical association test with the proposed study design is more powerful than the test with single disease sharing common controls, and it has comparable power to the overall test based on the whole dataset including the controls. We also apply the proposed method to a real GWAS dataset to illustrate the methodologies and the advantages of the proposed design. Some possible limitations of this study design and testing method and their solutions are also discussed. Our findings indicate that the proposed study design and statistical analysis strategy could be more efficient than the usual case-control GWAS as well as those with shared controls. Copyright © 2012 Elsevier B.V. All rights reserved.

  14. Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression

    PubMed Central

    Dipnall, Joanna F.

    2016-01-01

    Background: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). Conclusion: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571

  15. Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression.

    PubMed

    Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny

    2016-01-01

    Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

  16. School Climate Reports from Norwegian Teachers: A Methodological and Substantive Study.

    ERIC Educational Resources Information Center

    Kallestad, Jan Helge; Olweus, Dan; Alsaker, Francoise

    1998-01-01

    Explores methodological and substantive issues relating to school climate, using a dataset derived from 42 Norwegian schools at two points of time and a standard definition of organizational climate. Identifies and analyzes four school-climate dimensions. Three dimensions (collegial communication, orientation to change, and teacher influence over…

  17. Hydrological Retrospective of floods and droughts: Case study in the Amazon

    NASA Astrophysics Data System (ADS)

    Wongchuig Correa, Sly; Cauduro Dias de Paiva, Rodrigo; Carlo Espinoza Villar, Jhan; Collischonn, Walter

    2017-04-01

    Recent studies have reported an increase in the intensity and frequency of extreme hydrological events in many regions of the Amazon basin over recent decades; these events, such as seasonal floods and droughts, have had significant impacts on human and natural systems. Methodologies such as climate reanalysis are being developed to create coherent records of the climate system. Following this idea, this research develops a methodology called Hydrological Retrospective (HR), which essentially runs large rainfall datasets through hydrological models to build a record of past hydrology, enabling the analysis of past floods and droughts. We developed the methodology for the Amazon basin, using eight large precipitation datasets (more than 30 years) as input to a large-scale hydrological and hydrodynamic model (MGB-IPH). The HR products were then validated against several in situ discharge gauges distributed throughout the Amazon basin, with a focus on maximum and minimum events. For the HR products that performed best according to the metrics, we assessed the forecast skill of HR in detecting floods and droughts relative to in situ observations. In addition, statistical trend tests were applied to the intensity of seasonal floods and droughts across the whole Amazon basin. Results indicate that the best HR products represented well most past extreme events registered by in situ observations and were consistent with many events reported in the literature; we therefore consider it viable to use large precipitation datasets, such as climate reanalyses based mainly on the land surface component and merged products, to represent past regional hydrology and seasonal hydrological extremes. Furthermore, an increasing trend in intensity was found for maximum annual discharges (related to floods) in north-western regions and for minimum annual discharges (related to droughts) in central-southern regions of the Amazon basin, features previously detected by other studies. For the basin as a whole, we estimated an upward trend in maximum annual discharges of the Amazon River. To better anticipate future hydrological behaviour and its impacts on society, HR could be used as a methodology for understanding the occurrence of past extreme events in many regions, given the global coverage of rainfall datasets.
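
    The abstract reports trends in annual maximum and minimum discharges but does not name the trend test used; purely as an illustration, a simple least-squares trend fitted to synthetic annual maximum discharges might look as follows (scipy assumed; numbers invented).

      # Illustrative only: a least-squares trend fitted to synthetic annual maximum
      # discharges; the authors do not specify their trend test in this abstract.
      import numpy as np
      from scipy.stats import linregress

      years = np.arange(1980, 2011)
      rng = np.random.default_rng(0)
      annual_max_q = 45000 + 120 * (years - years[0]) + rng.normal(0, 3000, years.size)  # m3/s

      fit = linregress(years, annual_max_q)
      print(f"trend = {fit.slope:.0f} m3/s per year, p = {fit.pvalue:.3f}")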

  18. Advancing the Potential of Citizen Science for Urban Water Quality Monitoring: Exploring Research Design and Methodology in New York City

    NASA Astrophysics Data System (ADS)

    Hsueh, D.; Farnham, D. J.; Gibson, R.; McGillis, W. R.; Culligan, P. J.; Cooper, C.; Larson, L.; Mailloux, B. J.; Buchanan, R.; Borus, N.; Zain, N.; Eddowes, D.; Butkiewicz, L.; Loiselle, S. A.

    2015-12-01

    Citizen Science is a fast-growing ecological research tool with proven potential to rapidly produce large datasets. While the fields of astronomy and ornithology demonstrate particularly successful histories of enlisting the public in conducting scientific work, citizen science applications to the field of hydrology have been relatively underutilized. We demonstrate the potential of citizen science for monitoring water quality, particularly in the impervious, urban environment of New York City (NYC) where pollution via stormwater runoff is a leading source of waterway contamination. Through partnerships with HSBC, Earthwatch, and the NYC Water Trail Association, we have trained two citizen science communities to monitor the quality of NYC waterways, testing for a suite of water quality parameters including pH, turbidity, phosphate, nitrate, and Enterococci (an indicator bacteria for the presence of harmful pathogens associated with fecal pollution). We continue to enhance these citizen science programs with two additions to our methodology. First, we designed and produced at-home incubation ovens for Enterococci analysis, and second, we are developing automated photo-imaging for nitrate and phosphate concentrations. These improvements make our work more publicly accessible while maintaining scientific accuracy. We also initiated a volunteer survey assessing the motivations for participation among our citizen scientists. These three endeavors will inform future applications of citizen science for urban hydrological research. Ultimately, the spatiotemporally-rich dataset of waterway quality produced from our citizen science efforts will help advise NYC policy makers about the impacts of green infrastructure and other types of government-led efforts to clean up NYC waterways.

  19. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    PubMed

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  20. A fuzzy hill-climbing algorithm for the development of a compact associative classifier

    NASA Astrophysics Data System (ADS)

    Mitra, Soumyaroop; Lam, Sarah S.

    2012-02-01

    Classification, a data mining technique, has widespread applications including medical diagnosis and targeted marketing. Knowledge discovery from databases in the form of association rules is one of the important data mining tasks. An integrated approach, classification based on association rules, has drawn the attention of the data mining community over the last decade. While attention has mainly focused on increasing classifier accuracy, comparatively little effort has been devoted to building interpretable and less complex models. This paper discusses the development of a compact associative classification model using a hill-climbing approach and fuzzy sets. The proposed methodology builds the rule base by selecting rules that contribute to increasing training accuracy, thus balancing classification accuracy against the number of classification association rules. The results indicate that the proposed associative classification model achieves competitive accuracy on benchmark datasets with continuous attributes and offers better interpretability than other rule-based systems.
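
    A greatly simplified, hypothetical sketch of the greedy hill-climbing rule selection described above (fuzzy membership functions omitted): candidate association rules are added to the rule base only when they improve training accuracy. The rule format and the helper names are illustrative, not the authors' implementation.

      # Simplified greedy hill-climbing over candidate rules (fuzzy membership functions
      # omitted); `rule_matches` and the rule format are hypothetical illustrations.
      def rule_matches(rule, record):
          """A rule is a (conditions, label) pair; conditions maps attribute -> required value."""
          conditions, _ = rule
          return all(record.get(attr) == val for attr, val in conditions.items())

      def classify(rules, record, default_label):
          for rule in rules:
              if rule_matches(rule, record):
                  return rule[1]
          return default_label

      def accuracy(rules, data, default_label):
          return sum(classify(rules, x, default_label) == y for x, y in data) / len(data)

      def hill_climb(candidate_rules, data, default_label):
          selected, best = [], accuracy([], data, default_label)
          for rule in candidate_rules:              # e.g. candidates ordered by confidence
              score = accuracy(selected + [rule], data, default_label)
              if score > best:                      # keep the rule only if training accuracy improves
                  selected.append(rule)
                  best = score
          return selected, best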

  1. Historic AVHRR Processing in the Eumetsat Climate Monitoring Satellite Application Facility (cmsaf) (Invited)

    NASA Astrophysics Data System (ADS)

    Karlsson, K.

    2010-12-01

    The EUMETSAT CMSAF project (www.cmsaf.eu) compiles climatological datasets from various satellite sources with emphasis on the use of EUMETSAT-operated satellites. However, since climate monitoring is primarily global in scope, datasets that merge data from several satellites and satellite operators are also prepared. One such dataset is the CMSAF historic GAC (Global Area Coverage) dataset, which is based on AVHRR data from the full historic series of NOAA satellites and the European METOP satellite in the mid-morning orbit launched in October 2006. The CMSAF GAC dataset consists of three groups of products: macroscopic cloud products (cloud amount, cloud type and cloud top), cloud physical products (cloud phase, cloud optical thickness and cloud liquid water path) and surface radiation products (including surface albedo). Results will be presented and discussed for all product groups, including some preliminary inter-comparisons with other datasets (e.g., PATMOS-X, MODIS and CloudSat/CALIPSO datasets). A background will also be given describing the basic methodology behind the derivation of all products, including a short historical review of AVHRR cloud processing and the resulting AVHRR applications at SMHI. Historic GAC processing is one of five pilot projects selected by the SCOPE-CM (Sustained Co-Ordinated Processing of Environmental Satellite data for Climate Monitoring) project organised by the WMO Space Programme. The pilot project is carried out jointly by CMSAF and NOAA with the purpose of finding an optimal GAC processing approach. The initial activity is to inter-compare results of the CMSAF GAC dataset and the NOAA PATMOS-X dataset for the case in which both datasets have been derived from the same inter-calibrated AVHRR radiance dataset. The aim is to gain further knowledge of, for example, the most useful multispectral methods and the impact of ancillary datasets (such as meteorological reanalysis datasets from NCEP and ECMWF). The CMSAF project is currently defining plans for another five years (2012-2017) of operations and development. New GAC reprocessing efforts are planned and new methodologies will be tested. Central questions here are how to increase the quantitative use of the products by improving error and uncertainty estimates, and how to compile the information in a way that allows meaningful and efficient use of the data, for example for validation of climate model output.

  2. The sponge microbiome project.

    PubMed

    Moitinho-Silva, Lucas; Nielsen, Shaun; Amir, Amnon; Gonzalez, Antonio; Ackermann, Gail L; Cerrano, Carlo; Astudillo-Garcia, Carmen; Easson, Cole; Sipkema, Detmer; Liu, Fang; Steinert, Georg; Kotoulas, Giorgos; McCormack, Grace P; Feng, Guofang; Bell, James J; Vicente, Jan; Björk, Johannes R; Montoya, Jose M; Olson, Julie B; Reveillaud, Julie; Steindler, Laura; Pineda, Mari-Carmen; Marra, Maria V; Ilan, Micha; Taylor, Michael W; Polymenakou, Paraskevi; Erwin, Patrick M; Schupp, Peter J; Simister, Rachel L; Knight, Rob; Thacker, Robert W; Costa, Rodrigo; Hill, Russell T; Lopez-Legentil, Susanna; Dailianis, Thanos; Ravasi, Timothy; Hentschel, Ute; Li, Zhiyong; Webster, Nicole S; Thomas, Torsten

    2017-10-01

    Marine sponges (phylum Porifera) are a diverse, phylogenetically deep-branching clade known for forming intimate partnerships with complex communities of microorganisms. To date, 16S rRNA gene sequencing studies have largely utilised different extraction and amplification methodologies to target the microbial communities of a limited number of sponge species, severely limiting comparative analyses of sponge microbial diversity and structure. Here, we provide an extensive and standardised dataset that will facilitate sponge microbiome comparisons across large spatial, temporal, and environmental scales. Samples from marine sponges (n = 3569 specimens), seawater (n = 370), marine sediments (n = 65) and other environments (n = 29) were collected from different locations across the globe. This dataset incorporates at least 268 different sponge species, including several yet unidentified taxa. The V4 region of the 16S rRNA gene was amplified and sequenced from extracted DNA using standardised procedures. Raw sequences (total of 1.1 billion sequences) were processed and clustered with (i) a standard protocol using QIIME closed-reference picking resulting in 39 543 operational taxonomic units (OTU) at 97% sequence identity, (ii) a de novo clustering using Mothur resulting in 518 246 OTUs, and (iii) a new high-resolution Deblur protocol resulting in 83 908 unique bacterial sequences. Abundance tables, representative sequences, taxonomic classifications, and metadata are provided. This dataset represents a comprehensive resource of sponge-associated microbial communities based on 16S rRNA gene sequences that can be used to address overarching hypotheses regarding host-associated prokaryotes, including host specificity, convergent evolution, environmental drivers of microbiome structure, and the sponge-associated rare biosphere. © The Authors 2017. Published by Oxford University Press.

  3. Hierarchical Naive Bayes for genetic association studies.

    PubMed

    Malovini, Alberto; Barbarini, Nicola; Bellazzi, Riccardo; de Michelis, Francesca

    2012-01-01

    Genome-wide association studies represent powerful approaches that aim at disentangling the genetic and molecular mechanisms underlying complex traits. The usual "one-SNP-at-a-time" testing strategy cannot capture the multi-factorial nature of this kind of disorder. We propose a Hierarchical Naïve Bayes classification model that takes into account associations in SNP data characterized by Linkage Disequilibrium. Validation shows that our model reaches classification performance superior to that of the standard Naïve Bayes classifier on simulated and real datasets. In the Hierarchical Naïve Bayes implemented here, the SNPs mapping to the same region of Linkage Disequilibrium are considered as "details" or "replicates" of the locus, each contributing to the overall effect of the region on the phenotype. A latent variable for each block, which models the "population" of correlated SNPs, can then be used to summarize the available information. Classification is thus performed relying on the latent variables' conditional probability distributions and on the available SNP data. The developed methodology has been tested on simulated datasets, each composed of 300 cases, 300 controls and a variable number of SNPs. Our approach has also been applied to two real datasets on the genetic bases of Type 1 Diabetes and Type 2 Diabetes generated by the Wellcome Trust Case Control Consortium. The approach proposed in this paper, called Hierarchical Naïve Bayes, allows classification of examples for which genetic information on structurally correlated SNPs is available. It improves on Naïve Bayes performance by properly handling within-locus variability.
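
    The latent-variable model itself is not reproduced here; as a simplified proxy for the block-level idea, the sketch below collapses the SNPs of each Linkage Disequilibrium block into one summary feature before applying a standard Naive Bayes classifier (scikit-learn assumed; genotypes, labels and block assignments are synthetic).

      # A simplified proxy for the idea above: SNPs in the same LD block are collapsed into
      # a single per-block summary before a standard Naive Bayes classifier is applied.
      # (The paper's model instead introduces a latent variable per block.)
      import numpy as np
      from sklearn.naive_bayes import GaussianNB

      rng = np.random.default_rng(0)
      n_samples, n_snps = 600, 30
      genotypes = rng.integers(0, 3, size=(n_samples, n_snps))        # 0/1/2 minor-allele counts
      labels = rng.integers(0, 2, size=n_samples)                     # case/control (synthetic)
      blocks = np.repeat(np.arange(10), 3)                            # 10 LD blocks of 3 SNPs each

      # Collapse: mean minor-allele count per block.
      block_features = np.column_stack(
          [genotypes[:, blocks == b].mean(axis=1) for b in np.unique(blocks)]
      )

      model = GaussianNB().fit(block_features, labels)
      print("training accuracy:", model.score(block_features, labels))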

  4. Developing a 1 km resolution daily air temperature dataset for urban and surrounding areas in the conterminous United States

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Xiaoma; Zhou, Yuyu; Asrar, Ghassem R.

    High spatiotemporal resolution air temperature (Ta) datasets are increasingly needed for assessing the impact of temperature change on people, ecosystems, and energy systems, especially in urban domains. However, such datasets are not widely available because of the large spatiotemporal heterogeneity of Ta caused by complex biophysical and socioeconomic factors such as built infrastructure and human activities. In this study, we developed a 1-km gridded dataset of daily minimum Ta (Tmin) and maximum Ta (Tmax), and the associated uncertainties, in urban and surrounding areas in the conterminous U.S. for the 2003–2016 period. Daily geographically weighted regression (GWR) models were developed and used to interpolate Ta using 1 km daily land surface temperature and elevation as explanatory variables. The leave-one-out cross-validation approach indicates that our method performs reasonably well, with root mean square errors of 2.1 °C and 1.9 °C, mean absolute errors of 1.5 °C and 1.3 °C, and R² of 0.95 and 0.97, for Tmin and Tmax, respectively. The resulting dataset captures the spatial heterogeneity of Ta in urban areas reasonably well, and also captures the urban heat island (UHI) phenomenon whereby Ta rises with increasing urban development (i.e., impervious surface area). The new dataset is valuable for studying environmental impacts of urbanization such as UHI and other related effects (e.g., on building energy consumption and human health). The proposed methodology also shows potential for building a long-term record of Ta worldwide, to fill the data gap that currently exists for studies of urban systems.
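
    A minimal sketch of a GWR-style local prediction (not the authors' implementation): Gaussian distance weights combined with weighted least squares on land surface temperature and elevation. Station coordinates, observations and the kernel bandwidth are synthetic or arbitrary.

      # Minimal GWR-style prediction at one target pixel: Gaussian distance weighting plus
      # weighted least squares on land-surface temperature (LST) and elevation.
      # Station coordinates, values and the bandwidth are synthetic/arbitrary.
      import numpy as np

      rng = np.random.default_rng(1)
      n = 200
      xy = rng.uniform(0, 100, size=(n, 2))                   # station coordinates (km)
      lst = rng.uniform(10, 35, n)                            # satellite LST (deg C)
      elev = rng.uniform(0, 1500, n)                          # elevation (m)
      t_air = 0.8 * lst - 0.005 * elev + rng.normal(0, 1, n)  # observed air temperature

      target, bandwidth = np.array([50.0, 50.0]), 20.0
      d = np.linalg.norm(xy - target, axis=1)
      w = np.exp(-0.5 * (d / bandwidth) ** 2)                 # Gaussian kernel weights

      X = np.column_stack([np.ones(n), lst, elev])
      W = np.diag(w)
      beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ t_air)    # local weighted least squares

      x_target = np.array([1.0, 24.0, 300.0])                 # LST and elevation at the target pixel
      print("predicted Tair at target:", float(x_target @ beta))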

  5. EPA’s AP-42 development methodology: Converting or rerating current AP-42 datasets

    USDA-ARS?s Scientific Manuscript database

    In August 2013, the U.S. Environmental Protection Agency’s (EPA) published their new methodology for updating the Compilation of Air Pollution Emission Factors (AP-42). The “Recommended Procedures for Development of Emissions Factors and Use of the WebFIRE Database” instructs that the ratings of the...

  6. Evaluating EPA’s AP-42 development methodology using a cotton gin total PM dataset

    USDA-ARS?s Scientific Manuscript database

    In August 2013, the U.S. Environmental Protection Agency’s (EPA) published their new methodology for updating the Compilation of Air Pollution Emission Factors (AP-42). The “Recommended Procedures for Development of Emissions Factors and Use of the WebFIRE Database” has yet to be widely used. These ...

  7. Theory of impossible worlds: Toward a physics of information.

    PubMed

    Buscema, Paolo Massimo; Sacco, Pier Luigi; Della Torre, Francesca; Massini, Giulia; Breda, Marco; Ferilli, Guido

    2018-05-01

    In this paper, we introduce an innovative approach to the fusion between datasets in terms of attributes and observations, even when they are not related at all. With our technique, starting from datasets representing independent worlds, it is possible to analyze a single global dataset, and transferring each dataset onto the others is always possible. This procedure allows a deeper perspective in the study of a problem, by offering the chance of looking into it from other, independent points of view. Even unrelated datasets create a metaphoric representation of the problem, useful in terms of speed of convergence and predictive results, preserving the fundamental relationships in the data. In order to extract such knowledge, we propose a new learning rule named double backpropagation, by which an auto-encoder concurrently codifies all the different worlds. We test our methodology on different datasets and different issues, to underline the power and flexibility of the Theory of Impossible Worlds.

  8. Theory of impossible worlds: Toward a physics of information

    NASA Astrophysics Data System (ADS)

    Buscema, Paolo Massimo; Sacco, Pier Luigi; Della Torre, Francesca; Massini, Giulia; Breda, Marco; Ferilli, Guido

    2018-05-01

    In this paper, we introduce an innovative approach to the fusion between datasets in terms of attributes and observations, even when they are not related at all. With our technique, starting from datasets representing independent worlds, it is possible to analyze a single global dataset, and transferring each dataset onto the others is always possible. This procedure allows a deeper perspective in the study of a problem, by offering the chance of looking into it from other, independent points of view. Even unrelated datasets create a metaphoric representation of the problem, useful in terms of speed of convergence and predictive results, preserving the fundamental relationships in the data. In order to extract such knowledge, we propose a new learning rule named double backpropagation, by which an auto-encoder concurrently codifies all the different worlds. We test our methodology on different datasets and different issues, to underline the power and flexibility of the Theory of Impossible Worlds.

  9. Usefulness of DARPA dataset for intrusion detection system evaluation

    NASA Astrophysics Data System (ADS)

    Thomas, Ciza; Sharma, Vishwas; Balakrishnan, N.

    2008-03-01

    The MIT Lincoln Laboratory IDS evaluation methodology is a practical approach to evaluating the performance of Intrusion Detection Systems and has contributed tremendously to research progress in that field. The DARPA IDS evaluation dataset has been criticized and considered by many to be outdated, unable to accommodate the latest trends in attacks. The question naturally arises whether detection systems have improved beyond detecting these older classes of attacks; if not, is it justified to regard this dataset as obsolete? The paper presented here provides supporting evidence for the continued use of the DARPA IDS evaluation dataset. Two commonly used signature-based IDSs, Snort and Cisco IDS, and two anomaly detectors, PHAD and ALAD, are used for this evaluation, and the results support the usefulness of the DARPA dataset for IDS evaluation.

  10. A methodology for estimating risks associated with landslides of contaminated soil into rivers.

    PubMed

    Göransson, Gunnel; Norrman, Jenny; Larson, Magnus; Alén, Claes; Rosén, Lars

    2014-02-15

    Urban areas adjacent to surface water are exposed to soil movements such as erosion and slope failures (landslides). A landslide is a potential mechanism for mobilisation and spreading of pollutants. This mechanism is in general not included in environmental risk assessments for contaminated sites, and the consequences associated with contamination in the soil are typically not considered in landslide risk assessments. This study suggests a methodology to estimate the environmental risks associated with landslides in contaminated sites adjacent to rivers. The methodology is probabilistic and allows for datasets with large uncertainties and the use of expert judgements, providing quantitative estimates of probabilities for defined failures. The approach is illustrated by a case study along the river Göta Älv, Sweden, where failures are defined and probabilities for those failures are estimated. Failures are defined from a pollution perspective and in terms of exceeding environmental quality standards (EQSs) and acceptable contaminant loads. Models are then suggested to estimate probabilities of these failures. A landslide analysis is carried out to assess landslide probabilities based on data from a recent landslide risk classification study along the river Göta Älv. The suggested methodology is meant to be a supplement to either landslide risk assessment (LRA) or environmental risk assessment (ERA), providing quantitative estimates of the risks associated with landslide in contaminated sites. The proposed methodology can also act as a basis for communication and discussion, thereby contributing to intersectoral management solutions. From the case study it was found that the defined failures are governed primarily by the probability of a landslide occurring. The overall probabilities for failure are low; however, if a landslide occurs the probabilities of exceeding EQS are high and the probability of having at least a 10% increase in the contamination load within one year is also high. Copyright © 2013 Elsevier B.V. All rights reserved.
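
    As a toy illustration of how such failure probabilities can be combined (all numbers invented, not taken from the study), the annual probability of exceeding an environmental quality standard can be simulated as the product of the landslide probability and the conditional probability of exceedance given a slide:

      # Toy illustration of combining event probabilities (all numbers invented): the annual
      # probability of exceeding an environmental quality standard (EQS) equals the landslide
      # probability times the conditional probability of exceedance given a slide.
      import numpy as np

      rng = np.random.default_rng(0)
      n_sim = 100_000
      p_landslide = 0.002                      # annual probability of a landslide (hypothetical)
      p_exceed_given_slide = 0.8               # P(exceed EQS | landslide), hypothetical

      slide = rng.random(n_sim) < p_landslide
      exceed = slide & (rng.random(n_sim) < p_exceed_given_slide)

      print("simulated annual P(exceed EQS):", exceed.mean())
      print("analytical value:", p_landslide * p_exceed_given_slide)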

  11. Critical care medicine beds, use, occupancy and costs in the United States: a methodological review

    PubMed Central

    Halpern, Neil A; Pastores, Stephen M.

    2017-01-01

    This article is a methodological review to help the intensivist gain insights into the classic and sometimes arcane maze of national databases and methodologies used to determine and analyze the intensive care unit (ICU) bed supply, occupancy rates, and costs in the United States (US). Data for total ICU beds, use and occupancy can be derived from two large national healthcare databases: the Healthcare Cost Report Information System (HCRIS) maintained by the federal Centers for Medicare and Medicaid Services (CMS) and the proprietary Hospital Statistics of the American Hospital Association (AHA). Two costing methodologies can be used to calculate ICU costs: the Russell equation and national projections. Both methods are based on cost and use data from the national hospital datasets or from defined groups of hospitals or patients. At the national level, an understanding of US ICU beds, use and cost helps provide clarity to the width and scope of the critical care medicine (CCM) enterprise within the US healthcare system. This review will also help the intensivist better understand published studies on administrative topics related to CCM and be better prepared to participate in their own local hospital organizations or regional CCM programs. PMID:26308432

  12. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation

    PubMed Central

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B.

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation, both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and using digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations. With this dataset we hope to (1) promote meaningful comparison between algorithms in the field of neural computation, (2) allow comparison with conventional image recognition methods, (3) provide an assessment of the state of the art in spike-based visual recognition, and (4) help researchers identify future directions and advance the field. PMID:27853419
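
    One of the encoding techniques named above, rate-based Poisson spike generation from pixel intensities, can be sketched as follows; the maximum firing rate, stimulus duration and time step are assumptions rather than the dataset's actual parameters.

      # Rate-based Poisson spike generation from pixel intensities (one of the encodings
      # named above); the maximum firing rate, duration and time step are assumptions.
      import numpy as np

      def poisson_spike_trains(image, max_rate_hz=100.0, duration_s=0.35, dt_s=0.001, seed=0):
          """Return a boolean array of shape (n_pixels, n_steps); True marks a spike."""
          rng = np.random.default_rng(seed)
          rates = image.ravel().astype(float) / image.max() * max_rate_hz   # pixel -> firing rate
          n_steps = int(duration_s / dt_s)
          p_spike = rates[:, None] * dt_s                                   # per-step spike probability
          return rng.random((rates.size, n_steps)) < p_spike

      digit = np.random.randint(0, 256, size=(28, 28))        # stand-in for an MNIST digit
      spikes = poisson_spike_trains(digit)
      print("total spikes:", int(spikes.sum()))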

  13. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation.

    PubMed

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation, both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and using digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations. With this dataset we hope to (1) promote meaningful comparison between algorithms in the field of neural computation, (2) allow comparison with conventional image recognition methods, (3) provide an assessment of the state of the art in spike-based visual recognition, and (4) help researchers identify future directions and advance the field.

  14. Conducting high-value secondary dataset analysis: an introductory guide and resources.

    PubMed

    Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A

    2011-08-01

    Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium ( www.sgim.org/go/datasets ). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

  15. The 3D Reference Earth Model: Status and Preliminary Results

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    In the 20th century, seismologists constructed models of how average physical properties (e.g. density, rigidity, compressibility, anisotropy) vary with depth in the Earth's interior. These one-dimensional (1D) reference Earth models (e.g. PREM) have proven indispensable in earthquake location, imaging of interior structure, understanding material properties under extreme conditions, and as a reference in other fields, such as particle physics and astronomy. Over the past three decades, new datasets motivated more sophisticated efforts that yielded models of how properties vary both laterally and with depth in the Earth's interior. Though these three-dimensional (3D) models exhibit compelling similarities at large scales, differences in the methodology, representation of structure, and dataset upon which they are based have prevented the creation of 3D community reference models. As part of the REM-3D project, we are compiling and reconciling reference seismic datasets of body wave travel-time measurements, fundamental mode and overtone surface wave dispersion measurements, and normal mode frequencies and splitting functions. These reference datasets are being inverted for a long-wavelength, 3D reference Earth model that describes the robust long-wavelength features of mantle heterogeneity. As a community reference model with fully quantified uncertainties and tradeoffs and an associated publicly available dataset, REM-3D will facilitate Earth imaging studies, earthquake characterization, inferences on temperature and composition in the deep interior, and be of improved utility to emerging scientific endeavors, such as neutrino geoscience. Here, we summarize progress made in the construction of the reference long-period dataset and present a preliminary version of REM-3D in the upper mantle. In order to determine the level of detail warranted for inclusion in REM-3D, we analyze the spectrum of discrepancies between models inverted with different subsets of the reference dataset. This procedure allows us to evaluate the extent of consistency in imaging heterogeneity at various depths and between spatial scales.

  16. Broad-scale phylogenomics provides insights into retrovirus–host evolution

    PubMed Central

    Hayward, Alexander; Grabherr, Manfred; Jern, Patric

    2013-01-01

    Genomic data provide an excellent resource to improve understanding of retrovirus evolution and the complex relationships among viruses and their hosts. In conjunction with broad-scale in silico screening of vertebrate genomes, this resource offers an opportunity to complement data on the evolution and frequency of past retroviral spread and so evaluate future risks and limitations for horizontal transmission between different host species. Here, we develop a methodology for extracting phylogenetic signal from large endogenous retrovirus (ERV) datasets by collapsing information to facilitate broad-scale phylogenomics across a wide sample of hosts. Starting with nearly 90,000 ERVs from 60 vertebrate host genomes, we construct phylogenetic hypotheses and draw inferences regarding the designation, host distribution, origin, and transmission of the Gammaretrovirus genus and associated class I ERVs. Our results uncover remarkable depths in retroviral sequence diversity, supported within a phylogenetic context. This finding suggests that current infectious exogenous retrovirus diversity may be underestimated, adding credence to the possibility that many additional exogenous retroviruses may remain to be discovered in vertebrate taxa. We demonstrate a history of frequent horizontal interorder transmissions from a rodent reservoir and suggest that rats may have acted as important overlooked facilitators of gammaretrovirus spread across diverse mammalian hosts. Together, these results demonstrate the promise of the methodology used here to analyze large ERV datasets and improve understanding of retroviral evolution and diversity for utilization in wider applications. PMID:24277832

  17. Broad-scale phylogenomics provides insights into retrovirus-host evolution.

    PubMed

    Hayward, Alexander; Grabherr, Manfred; Jern, Patric

    2013-12-10

    Genomic data provide an excellent resource to improve understanding of retrovirus evolution and the complex relationships among viruses and their hosts. In conjunction with broad-scale in silico screening of vertebrate genomes, this resource offers an opportunity to complement data on the evolution and frequency of past retroviral spread and so evaluate future risks and limitations for horizontal transmission between different host species. Here, we develop a methodology for extracting phylogenetic signal from large endogenous retrovirus (ERV) datasets by collapsing information to facilitate broad-scale phylogenomics across a wide sample of hosts. Starting with nearly 90,000 ERVs from 60 vertebrate host genomes, we construct phylogenetic hypotheses and draw inferences regarding the designation, host distribution, origin, and transmission of the Gammaretrovirus genus and associated class I ERVs. Our results uncover remarkable depths in retroviral sequence diversity, supported within a phylogenetic context. This finding suggests that current infectious exogenous retrovirus diversity may be underestimated, adding credence to the possibility that many additional exogenous retroviruses may remain to be discovered in vertebrate taxa. We demonstrate a history of frequent horizontal interorder transmissions from a rodent reservoir and suggest that rats may have acted as important overlooked facilitators of gammaretrovirus spread across diverse mammalian hosts. Together, these results demonstrate the promise of the methodology used here to analyze large ERV datasets and improve understanding of retroviral evolution and diversity for utilization in wider applications.

  18. Reliability in content analysis: The case of semantic feature norms classification.

    PubMed

    Bolognesi, Marianna; Pilgram, Roosmaryn; van den Heerik, Romy

    2017-12-01

    Semantic feature norms (e.g., STIMULUS: car → RESPONSE: ) are commonly used in cognitive psychology to look into salient aspects of given concepts. Semantic features are typically collected in experimental settings and then manually annotated by the researchers into feature types (e.g., perceptual features, taxonomic features, etc.) by means of content analyses-that is, by using taxonomies of feature types and having independent coders perform the annotation task. However, the ways in which such content analyses are typically performed and reported are not consistent across the literature. This constitutes a serious methodological problem that might undermine the theoretical claims based on such annotations. In this study, we first offer a review of some of the released datasets of annotated semantic feature norms and the related taxonomies used for content analysis. We then provide theoretical and methodological insights in relation to the content analysis methodology. Finally, we apply content analysis to a new dataset of semantic features and show how the method should be applied in order to deliver reliable annotations and replicable coding schemes. We tackle the following issues: (1) taxonomy structure, (2) the description of categories, (3) coder training, and (4) sustainability of the coding scheme-that is, comparison of the annotations provided by trained versus novice coders. The outcomes of the project are threefold: We provide methodological guidelines for semantic feature classification; we provide a revised and adapted taxonomy that can (arguably) be applied to both concrete and abstract concepts; and we provide a dataset of annotated semantic feature norms.
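
    A common way to quantify the inter-coder reliability discussed above is Cohen's kappa for two annotators; the sketch below uses scikit-learn with invented feature-type labels, and the paper's exact reliability statistic is not specified in this abstract.

      # Inter-coder agreement for feature-type annotations, illustrated with Cohen's kappa
      # (scikit-learn); the labels are invented and the paper's exact reliability statistic
      # is not specified in this abstract.
      from sklearn.metrics import cohen_kappa_score

      coder_a = ["perceptual", "taxonomic", "perceptual", "functional", "taxonomic", "perceptual"]
      coder_b = ["perceptual", "taxonomic", "functional", "functional", "taxonomic", "perceptual"]

      print("Cohen's kappa:", round(cohen_kappa_score(coder_a, coder_b), 3))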

  19. Evaluation of Gene-Based Family-Based Methods to Detect Novel Genes Associated With Familial Late Onset Alzheimer Disease

    PubMed Central

    Fernández, Maria V.; Budde, John; Del-Aguila, Jorge L.; Ibañez, Laura; Deming, Yuetiva; Harari, Oscar; Norton, Joanne; Morris, John C.; Goate, Alison M.; Cruchaga, Carlos

    2018-01-01

    Gene-based tests to study the combined effect of rare variants on a particular phenotype have been widely developed for case-control studies, but their evolution and adaptation for family-based studies, especially studies of complex incomplete families, has been slower. In this study, we have performed a practical examination of all the latest gene-based methods available for family-based study designs using both simulated and real datasets. We examined the performance of several collapsing, variance-component, and transmission disequilibrium tests across eight different software packages and 22 models utilizing a cohort of 285 families (N = 1,235) with late-onset Alzheimer disease (LOAD). After a thorough examination of each of these tests, we propose a methodological approach to identify, with high confidence, genes associated with the tested phenotype and we provide recommendations to select the best software and model for family-based gene-based analyses. Additionally, in our dataset, we identified PTK2B, a GWAS candidate gene for sporadic AD, along with six novel genes (CHRD, CLCN2, HDLBP, CPAMD8, NLRP9, and MAS1L) as candidate genes for familial LOAD. PMID:29670507

  20. Evaluation of Gene-Based Family-Based Methods to Detect Novel Genes Associated With Familial Late Onset Alzheimer Disease.

    PubMed

    Fernández, Maria V; Budde, John; Del-Aguila, Jorge L; Ibañez, Laura; Deming, Yuetiva; Harari, Oscar; Norton, Joanne; Morris, John C; Goate, Alison M; Cruchaga, Carlos

    2018-01-01

    Gene-based tests to study the combined effect of rare variants on a particular phenotype have been widely developed for case-control studies, but their evolution and adaptation for family-based studies, especially studies of complex incomplete families, has been slower. In this study, we have performed a practical examination of all the latest gene-based methods available for family-based study designs using both simulated and real datasets. We examined the performance of several collapsing, variance-component, and transmission disequilibrium tests across eight different software packages and 22 models utilizing a cohort of 285 families ( N = 1,235) with late-onset Alzheimer disease (LOAD). After a thorough examination of each of these tests, we propose a methodological approach to identify, with high confidence, genes associated with the tested phenotype and we provide recommendations to select the best software and model for family-based gene-based analyses. Additionally, in our dataset, we identified PTK2B , a GWAS candidate gene for sporadic AD, along with six novel genes ( CHRD, CLCN2, HDLBP, CPAMD8, NLRP9 , and MAS1L ) as candidate genes for familial LOAD.

  1. Identifying and acting on inappropriate metadata: a critique of the Grattan Institute Report on questionable care in Australian hospitals.

    PubMed

    Cooper, P David; Smart, David R

    2017-03-01

    In an era of ever-increasing medical costs, the identification and prohibition of ineffective medical therapies is of considerable economic interest to healthcare funding bodies. Likewise, the avoidance of interventions with an unduly elevated clinical risk/benefit ratio would be similarly advantageous for patients. Regrettably, the identification of such therapies has proven problematic. A recent paper from the Grattan Institute in Australia (identifying five hospital procedures as having the potential for disinvestment on these grounds) serves as a timely illustration of the difficulties inherent in non-clinicians attempting to accurately recognize such interventions using non-clinical, indirect or poorly validated datasets. To evaluate the Grattan Institute report and associated publications, and determine the validity of their assertions regarding hyperbaric oxygen treatment (HBOT) utilisation in Australia. Critical analysis of the HBOT metadata included in the Grattan Institute study was undertaken and compared against other publicly available Australian Government and independent data sources. The consistency, accuracy and reproducibility of data definitions and terminology across the various publications were appraised and the authors' methodology was reviewed. Reference sources were examined for relevance and temporal eligibility. Review of the Grattan publications demonstrated multiple problems, including (but not limited to): confusing patient-treatments with total patient numbers; incorrect identification of 'appropriate' vs. 'inappropriate' indications for HBOT; reliance upon a compromised primary dataset; lack of appropriate clinical input, muddled methodology and use of inapplicable references. These errors resulted in a more than seventy-fold over-estimation of the number of patients potentially treated inappropriately with HBOT in Australia that year. Numerous methodological flaws and factual errors have been identified in this Grattan Institute study. Its conclusions are not valid and a formal retraction is required.

  2. Influence of spatial and temporal scales in identifying temperature extremes

    NASA Astrophysics Data System (ADS)

    van Eck, Christel M.; Friedlingstein, Pierre; Mulder, Vera L.; Regnier, Pierre A. G.

    2016-04-01

    Extreme heat events are becoming more frequent. Notable examples are severe heatwaves such as the European heatwave of 2003, the Russian heatwave of 2010 and the Australian heatwave of 2013. Surface temperature is attaining new maxima not only during the summer but also during the winter, and 2015 is reported to have been a record-breaking year for both seasons. These extreme temperatures are taking their human and environmental toll, emphasizing the need for an accurate method of defining a heat extreme in order to fully understand its spatial and temporal spread and its impact. This research explores how the use of different spatial and temporal scales influences the identification of a heat extreme. For this purpose, two near-surface temperature datasets of different temporal and spatial scale are used: first, the daily ERA-Interim dataset at 0.25 degrees spanning 32 years (1979-2010); second, the daily Princeton Meteorological Forcing Dataset at 0.5 degrees spanning 63 years (1948-2010). A temperature is considered anomalously extreme when it surpasses the 90th, 95th, or 99th percentile threshold derived from the pre-processed datasets. The analysis is conducted on a global scale, dividing the world into IPCC's so-called SREX regions developed for the analysis of extreme climate events. Pre-processing consists of detrending and/or subtracting the monthly climatology, based on 32 years of data for both datasets and on 63 years of data for the Princeton Meteorological Forcing Dataset only. This results in 6 datasets of temperature anomalies from which the locations in time and space of anomalously warm days are identified. Comparison of these 6 datasets in terms of absolute threshold temperatures for extremes and the temporal and spatial spread of anomalously warm days shows a dependence of the results on the datasets and methodology used. This stresses the need for a careful selection of data and methodology when identifying heat extremes.
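
    The percentile-thresholding step described above can be sketched as follows on a synthetic, already detrended and deseasonalised anomaly series; the 90th, 95th and 99th percentiles define the extreme-day thresholds.

      # Sketch of the thresholding step: flag daily temperature anomalies above the 90th,
      # 95th and 99th percentiles of a (synthetic, already deseasonalised) series.
      import numpy as np

      rng = np.random.default_rng(0)
      anomalies = rng.normal(0.0, 3.0, size=32 * 365)        # 32 years of daily anomalies (deg C)

      for q in (90, 95, 99):
          threshold = np.percentile(anomalies, q)
          n_extreme = int((anomalies > threshold).sum())
          print(f"{q}th percentile threshold = {threshold:.2f} C, extreme days = {n_extreme}")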

  3. Cloud Computing: A model Construct of Real-Time Monitoring for Big Dataset Analytics Using Apache Spark

    NASA Astrophysics Data System (ADS)

    Alkasem, Ameen; Liu, Hongwei; Zuo, Decheng; Algarash, Basheer

    2018-01-01

    The volume of data being collected, analyzed, and stored has exploded in recent years, particularly in relation to activity on cloud computing platforms. The major challenge today is how to monitor and control these massive amounts of data and perform analysis in real time and at scale; traditional methods and model systems are unable to cope with such quantities of data in real time. Here we present a new methodology for constructing a model that optimizes the performance of real-time monitoring of big datasets, combining machine learning algorithms with Apache Spark Streaming to accomplish fine-grained fault diagnosis and repair. As a case study, we use the failure of Virtual Machines (VMs) to start up. The proposed methodology ensures that the most sensible action is carried out during fine-grained monitoring and yields effective, cost-saving fault repair through three control steps: (I) data collection; (II) an analysis engine; and (III) a decision engine. We found that running this methodology can save a considerable amount of time compared with the Hadoop model, without sacrificing classification accuracy or performance. The accuracy of the proposed method (92.13%) is an improvement on traditional approaches.
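
    The authors' pipeline is not reproduced here; as a minimal, hypothetical skeleton of real-time monitoring with Spark Structured Streaming, the sketch below reads comma-separated VM metrics from a socket and flags readings above a fixed threshold, a simple stand-in for the machine-learning decision engine. Host, port, input format and threshold are all assumptions.

      # Minimal PySpark Structured Streaming skeleton, not the authors' pipeline: metric
      # lines arrive on a socket as "vm_id,cpu_load"; readings above a fixed threshold are
      # flagged (a simple stand-in for the ML-based decision engine). Host, port and the
      # threshold are assumptions.
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("vm-startup-monitor").getOrCreate()

      lines = (spark.readStream.format("socket")
               .option("host", "localhost").option("port", 9999).load())

      parsed = lines.select(
          F.split(F.col("value"), ",").getItem(0).alias("vm_id"),
          F.split(F.col("value"), ",").getItem(1).cast("double").alias("cpu_load"),
      )
      alerts = parsed.filter(F.col("cpu_load") > 0.95)        # flag suspicious start-up load

      query = alerts.writeStream.outputMode("append").format("console").start()
      query.awaitTermination()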

  4. Feature extraction through parallel Probabilistic Principal Component Analysis for heart disease diagnosis

    NASA Astrophysics Data System (ADS)

    Shah, Syed Muhammad Saqlain; Batool, Safeera; Khan, Imran; Ashraf, Muhammad Usman; Abbas, Syed Hussnain; Hussain, Syed Adnan

    2017-09-01

    Automatic diagnosis of human diseases is mostly achieved through decision support systems. The performance of these systems depends mainly on the selection of the most relevant features, which becomes harder when the dataset contains missing values for some features. Probabilistic Principal Component Analysis (PPCA) is well suited to handling missing attribute values. This research presents a methodology that takes the results of medical tests as input, extracts a reduced-dimensional feature subset and provides a diagnosis of heart disease. The proposed methodology extracts high-impact features in a new projection using PPCA: PPCA extracts the projection vectors that capture the highest covariance, and these vectors are used to reduce the feature dimension. The projection vectors are selected through Parallel Analysis (PA). The reduced-dimension feature subset is then provided to a radial basis function (RBF) kernel-based Support Vector Machine (SVM), which classifies subjects into two categories, Heart Patient (HP) and Normal Subject (NS). The proposed methodology is evaluated in terms of accuracy, specificity and sensitivity on three UCI datasets: Cleveland, Switzerland and Hungarian. The statistical results achieved with the proposed technique are compared with existing research to show its impact. The proposed technique achieved an accuracy of 82.18%, 85.82% and 91.30% for the Cleveland, Hungarian and Switzerland datasets, respectively.
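
    As a stand-in sketch of the projection-then-classification pipeline (ordinary PCA in place of PPCA, and a fixed number of components rather than Parallel Analysis), the following uses scikit-learn on synthetic data:

      # Stand-in sketch: ordinary PCA (instead of PPCA) feeding an RBF-kernel SVM, with the
      # number of components fixed rather than chosen by Parallel Analysis; data synthetic.
      from sklearn.datasets import make_classification
      from sklearn.decomposition import PCA
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

      model = make_pipeline(StandardScaler(), PCA(n_components=6), SVC(kernel="rbf"))
      scores = cross_val_score(model, X, y, cv=5)
      print("cross-validated accuracy:", scores.mean().round(3))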

  5. On the Multi-Modal Object Tracking and Image Fusion Using Unsupervised Deep Learning Methodologies

    NASA Astrophysics Data System (ADS)

    LaHaye, N.; Ott, J.; Garay, M. J.; El-Askary, H. M.; Linstead, E.

    2017-12-01

    The number of different remote-sensing modalities has been on the rise, resulting in large datasets with varying levels of complexity. Such complex datasets can provide valuable information separately, yet there is greater value in a comprehensive view of them combined, since hidden information can be deduced by applying data mining techniques to the fused data. The curse of dimensionality of such fused data, due to the potentially vast dimension space, hinders deep understanding of them, because each dataset requires instrument-specific and dataset-specific knowledge for optimal and meaningful use. Once a user decides to use multiple datasets together, a deeper understanding of how to translate and combine these datasets correctly and effectively is needed. Although data-centric techniques exist, generic automated methodologies that could solve this problem completely do not. Here we are developing a system that aims to gain a detailed understanding of different data modalities and to provide an analysis environment that gives the user useful feedback and can aid in research tasks. In our current work, we show the initial outputs of our system implementation, which leverages unsupervised deep learning techniques so as not to burden the user with labeling input data, while still allowing a detailed machine understanding of the data. Our goal is to be able to track objects, such as cloud systems or aerosols, across different image-like data modalities. The proposed system is intended to be flexible, scalable and robust in capturing complex likenesses within multi-modal data in a similar spatio-temporal range, and to co-register and fuse these images when needed.

  6. Iterative dataset optimization in automated planning: Implementation for breast and rectal cancer radiotherapy.

    PubMed

    Fan, Jiawei; Wang, Jiazhou; Zhang, Zhen; Hu, Weigang

    2017-06-01

    To develop a new automated treatment planning solution for breast and rectal cancer radiotherapy. The automated treatment planning solution developed in this study includes selection of the iterative optimized training dataset, dose volume histogram (DVH) prediction for the organs at risk (OARs), and automatic generation of clinically acceptable treatment plans. The iterative optimized training dataset is selected by an iterative optimization from 40 treatment plans for left-breast and rectal cancer patients who received radiation therapy. A two-dimensional kernel density estimation algorithm (noted as two parameters KDE) which incorporated two predictive features was implemented to produce the predicted DVHs. Finally, 10 additional new left-breast treatment plans are re-planned using the Pinnacle 3 Auto-Planning (AP) module (version 9.10, Philips Medical Systems) with the objective functions derived from the predicted DVH curves. Automatically generated re-optimized treatment plans are compared with the original manually optimized plans. By combining the iterative optimized training dataset methodology and two parameters KDE prediction algorithm, our proposed automated planning strategy improves the accuracy of the DVH prediction. The automatically generated treatment plans using the dose derived from the predicted DVHs can achieve better dose sparing for some OARs without compromising other metrics of plan quality. The proposed new automated treatment planning solution can be used to efficiently evaluate and improve the quality and consistency of the treatment plans for intensity-modulated breast and rectal cancer radiation therapy. © 2017 American Association of Physicists in Medicine.
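
    The two-parameter KDE prediction model is not reproduced here; the sketch below merely illustrates the quantity being predicted, a cumulative dose-volume histogram (DVH) computed from an organ-at-risk's voxel doses (synthetic values).

      # Sketch of the quantity being predicted above: a cumulative dose-volume histogram
      # (DVH) from an organ-at-risk's voxel doses (synthetic values); the paper's
      # two-parameter KDE prediction model itself is not reproduced here.
      import numpy as np

      rng = np.random.default_rng(0)
      voxel_dose_gy = rng.gamma(shape=2.0, scale=5.0, size=50_000)      # synthetic OAR doses

      dose_bins = np.arange(0, 61, 10)                                  # Gy
      dvh = [(voxel_dose_gy >= d).mean() for d in dose_bins]            # cumulative DVH: V(d)
      for d, v in zip(dose_bins, dvh):
          print(f"V{d}Gy = {v:.2%}")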

  7. Crowd-Sourced Amputee Gait Data: A Feasibility Study Using YouTube Videos of Unilateral Trans-Femoral Gait.

    PubMed

    Gardiner, James; Gunarathne, Nuwan; Howard, David; Kenney, Laurence

    2016-01-01

    Collecting large datasets of amputee gait data is notoriously difficult. Additionally, collecting data on less prevalent amputations or on gait activities other than level walking and running on hard surfaces is rarely attempted. However, with the wealth of user-generated content on the Internet, the scope for collecting amputee gait data from alternative sources other than traditional gait labs is intriguing. Here we investigate the potential of YouTube videos to provide gait data on amputee walking. We use an example dataset of trans-femoral amputees level walking at self-selected speeds to collect temporal gait parameters and calculate gait asymmetry. We compare our YouTube data with typical literature values, and show that our methodology produces results that are highly comparable to data collected in a traditional manner. The similarity between the results of our novel methodology and literature values lends confidence to our technique. Nevertheless, clear challenges with the collection and interpretation of crowd-sourced gait data remain, including long term access to datasets, and a lack of validity and reliability studies in this area.
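
    As a small illustration of the kind of temporal parameters involved, the sketch below derives step times from heel-strike events annotated on video frames and computes a common symmetry index; the frame numbers, frame rate, and index formula are illustrative assumptions, not values from the paper.

```python
# Temporal gait parameters and a symmetry index from annotated heel strikes.
import numpy as np

FPS = 30.0  # assumed video frame rate

# Heel-strike frame numbers annotated from a video (alternating sides); placeholders.
prosthetic_strikes = np.array([12, 48, 85, 121, 158])
intact_strikes     = np.array([30, 66, 103, 139, 176])

def step_times(own_strikes, other_strikes, fps=FPS):
    """Step time = time from the other limb's heel strike to this limb's."""
    times = []
    for s in own_strikes:
        prior = other_strikes[other_strikes < s]
        if prior.size:
            times.append((s - prior[-1]) / fps)
    return np.array(times)

t_pros = step_times(prosthetic_strikes, intact_strikes).mean()
t_int  = step_times(intact_strikes, prosthetic_strikes).mean()

# Common symmetry index: 0 means perfect symmetry.
symmetry_index = abs(t_pros - t_int) / (0.5 * (t_pros + t_int)) * 100
print(f"step time (prosthetic) = {t_pros:.3f} s, (intact) = {t_int:.3f} s, "
      f"SI = {symmetry_index:.1f}%")
```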

  8. Crowd-Sourced Amputee Gait Data: A Feasibility Study Using YouTube Videos of Unilateral Trans-Femoral Gait

    PubMed Central

    Gardiner, James; Gunarathne, Nuwan; Howard, David; Kenney, Laurence

    2016-01-01

    Collecting large datasets of amputee gait data is notoriously difficult. Additionally, collecting data on less prevalent amputations or on gait activities other than level walking and running on hard surfaces is rarely attempted. However, with the wealth of user-generated content on the Internet, the scope for collecting amputee gait data from alternative sources other than traditional gait labs is intriguing. Here we investigate the potential of YouTube videos to provide gait data on amputee walking. We use an example dataset of trans-femoral amputees level walking at self-selected speeds to collect temporal gait parameters and calculate gait asymmetry. We compare our YouTube data with typical literature values, and show that our methodology produces results that are highly comparable to data collected in a traditional manner. The similarity between the results of our novel methodology and literature values lends confidence to our technique. Nevertheless, clear challenges with the collection and interpretation of crowd-sourced gait data remain, including long term access to datasets, and a lack of validity and reliability studies in this area. PMID:27764226

  9. A multi-source dataset of urban life in the city of Milan and the Province of Trentino.

    PubMed

    Barlacchi, Gianni; De Nadai, Marco; Larcher, Roberto; Casella, Antonio; Chitic, Cristiana; Torrisi, Giovanni; Antonelli, Fabrizio; Vespignani, Alessandro; Pentland, Alex; Lepri, Bruno

    2015-01-01

    The study of socio-technical systems has been revolutionized by the unprecedented amount of digital records that are constantly being produced by human activities such as accessing Internet services, using mobile devices, and consuming energy and knowledge. In this paper, we describe the richest open multi-source dataset ever released on two geographical areas. The dataset is composed of telecommunications, weather, news, social networks and electricity data from the city of Milan and the Province of Trentino. The unique multi-source composition of the dataset makes it an ideal testbed for methodologies and approaches aimed at tackling a wide range of problems including energy consumption, mobility planning, tourist and migrant flows, urban structures and interactions, event detection, urban well-being and many others.

  10. A multi-source dataset of urban life in the city of Milan and the Province of Trentino

    NASA Astrophysics Data System (ADS)

    Barlacchi, Gianni; de Nadai, Marco; Larcher, Roberto; Casella, Antonio; Chitic, Cristiana; Torrisi, Giovanni; Antonelli, Fabrizio; Vespignani, Alessandro; Pentland, Alex; Lepri, Bruno

    2015-10-01

    The study of socio-technical systems has been revolutionized by the unprecedented amount of digital records that are constantly being produced by human activities such as accessing Internet services, using mobile devices, and consuming energy and knowledge. In this paper, we describe the richest open multi-source dataset ever released on two geographical areas. The dataset is composed of telecommunications, weather, news, social networks and electricity data from the city of Milan and the Province of Trentino. The unique multi-source composition of the dataset makes it an ideal testbed for methodologies and approaches aimed at tackling a wide range of problems including energy consumption, mobility planning, tourist and migrant flows, urban structures and interactions, event detection, urban well-being and many others.

  11. A multi-source dataset of urban life in the city of Milan and the Province of Trentino

    PubMed Central

    Barlacchi, Gianni; De Nadai, Marco; Larcher, Roberto; Casella, Antonio; Chitic, Cristiana; Torrisi, Giovanni; Antonelli, Fabrizio; Vespignani, Alessandro; Pentland, Alex; Lepri, Bruno

    2015-01-01

    The study of socio-technical systems has been revolutionized by the unprecedented amount of digital records that are constantly being produced by human activities such as accessing Internet services, using mobile devices, and consuming energy and knowledge. In this paper, we describe the richest open multi-source dataset ever released on two geographical areas. The dataset is composed of telecommunications, weather, news, social networks and electricity data from the city of Milan and the Province of Trentino. The unique multi-source composition of the dataset makes it an ideal testbed for methodologies and approaches aimed at tackling a wide range of problems including energy consumption, mobility planning, tourist and migrant flows, urban structures and interactions, event detection, urban well-being and many others. PMID:26528394

  12. Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies.

    PubMed

    Boulesteix, Anne-Laure; Wilson, Rory; Hapfelmeier, Alexander

    2017-09-09

    The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. We suggest that benchmark studies (a method of assessment of statistical methods using real-world datasets) might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.

  13. How do the methodological choices of your climate change study affect your results? A hydrologic case study across the Pacific Northwest

    NASA Astrophysics Data System (ADS)

    Chegwidden, O.; Nijssen, B.; Rupp, D. E.; Kao, S. C.; Clark, M. P.

    2017-12-01

    We describe results from a large hydrologic climate change dataset developed across the Pacific Northwestern United States and discuss how the analysis of those results can be seen as a framework for other large hydrologic ensemble investigations. This investigation will better inform future modeling efforts and large ensemble analyses across domains within and beyond the Pacific Northwest. Using outputs from the Coupled Model Intercomparison Project Phase 5 (CMIP5), we provide projections of hydrologic change for the domain through the end of the 21st century. The dataset is based upon permutations of four methodological choices: (1) ten global climate models, (2) two representative concentration pathways, (3) three meteorological downscaling methods, and (4) four unique hydrologic model set-ups (three of which entail the same hydrologic model using independently calibrated parameter sets). All simulations were conducted across the Columbia River Basin and Pacific coastal drainages at a 1/16th-degree (about 6 km) resolution and at a daily timestep. In total, the 172 distinct simulations offer an updated, comprehensive view of climate change projections through the end of the 21st century. The results consist of routed streamflow at 400 sites throughout the domain as well as distributed spatial fields of relevant hydrologic variables like snow water equivalent and soil moisture. In this presentation, we discuss the level of agreement with previous hydrologic projections for the study area and how these projections differ with specific methodological choices. By controlling for some methodological choices we can show how each choice affects key climatic change metrics. We discuss how the spread in results varies across hydroclimatic regimes. We will use this large dataset as a case study for distilling a wide range of hydroclimatological projections into useful climate change assessments.
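
    For illustration, the ensemble design described above can be enumerated as a factorial combination of the four choices; the member names below are placeholders, and, as the abstract notes, only 172 of the possible combinations were actually simulated.

```python
# Sketch of enumerating the ensemble design; names are illustrative placeholders.
from itertools import product

gcms        = [f"GCM_{i:02d}" for i in range(1, 11)]            # 10 global climate models
rcps        = ["RCP4.5", "RCP8.5"]                               # 2 concentration pathways
downscaling = ["downscale_A", "downscale_B", "downscale_C"]      # 3 downscaling methods
hydro       = ["hydro_cal1", "hydro_cal2", "hydro_cal3", "hydro_alt"]  # 4 hydrologic set-ups

design = list(product(gcms, rcps, downscaling, hydro))
print(len(design))  # 240 possible members; a subset (172) was actually simulated
```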

  14. A hybrid Land Cover Dataset for Russia: a new methodology for merging statistics, remote sensing and in-situ information

    NASA Astrophysics Data System (ADS)

    Schepaschenko, D.; McCallum, I.; Shvidenko, A.; Kraxner, F.; Fritz, S.

    2009-04-01

    There is a critical need for accurate land cover information for resource assessment, biophysical modeling, greenhouse gas studies, and for estimating possible terrestrial responses and feedbacks to climate change. However, practically all existing land cover datasets have quite a high level of uncertainty and suffer from a lack of important details that does not allow for relevant parameterization, e.g., data derived from different forest inventories. The objective of this study is to develop a methodology for creating a hybrid land cover dataset at a level that satisfies the requirements of a verified full greenhouse gas account of the terrestrial biota (Shvidenko et al., 2008) for large regions, i.e., Russia. Such requirements necessitate a detailed quantification of land classes (e.g., for forests - dominant species, age, growing stock, net primary production, etc.) with additional information on uncertainties of the major biometric and ecological parameters in the range of 10-20% and a confidence interval of around 0.9. The approach taken here allows the integration of different datasets to explore synergies and in particular the merging and harmonization of land and forest inventories, ecological monitoring, remote sensing data and in-situ information. The following datasets have been integrated: Remote sensing: Global Land Cover 2000 (Fritz et al., 2003), Vegetation Continuous Fields (Hansen et al., 2002), Vegetation Fire (Sukhinin, 2007), Regional land cover (Schmullius et al., 2005); GIS: Soil 1:2.5 Mio (Dokuchaev Soil Science Institute, 1996), Administrative Regions 1:2.5 Mio, Vegetation 1:4 Mio, Bioclimatic Zones 1:4 Mio (Stolbovoi & McCallum, 2002), Forest Enterprises 1:2.5 Mio, Rivers/Lakes and Roads/Railways 1:1 Mio (IIASA's data base); Inventories and statistics: State Land Account (FARSC RF, 2006), State Forest Account - SFA (FFS RF, 2003), Disturbances in forests (FFS RF, 2006). The resulting hybrid land cover dataset at 1-km resolution comprises the following classes: Forest (each grid cell links to the SFA database, which contains 86,613 records); Agriculture (5 classes, parameterized by 89 administrative units); Wetlands (8 classes, parameterized by 83 zone/region units); Open Woodland; Burnt area; Shrub/grassland (50 classes, parameterized by 300 zone/region units); Water; Unproductive area. This study has demonstrated the ability to produce a highly detailed (both spatially and thematically) land cover dataset over Russia. Future efforts include further validation of the hybrid land cover dataset for Russia, and its use for assessment of the terrestrial biota full greenhouse gas budget across Russia. The methodology proposed in this study could be applied at the global level. Results of such an undertaking would, however, be highly dependent upon the quality of the available ground data. The implementation of the hybrid land cover dataset was undertaken in such a way that it can be regularly updated based on new ground data and remote sensing products (e.g., MODIS).

  15. An effective approach for gap-filling continental scale remotely sensed time-series

    PubMed Central

    Weiss, Daniel J.; Atkinson, Peter M.; Bhatt, Samir; Mappin, Bonnie; Hay, Simon I.; Gething, Peter W.

    2014-01-01

    The archives of imagery and modeled data products derived from remote sensing programs with high temporal resolution provide powerful resources for characterizing inter- and intra-annual environmental dynamics. The impressive depth of available time-series from such missions (e.g., MODIS and AVHRR) affords new opportunities for improving data usability by leveraging spatial and temporal information inherent to longitudinal geospatial datasets. In this research we develop an approach for filling gaps in imagery time-series that result primarily from cloud cover, which is particularly problematic in forested equatorial regions. Our approach consists of two, complementary gap-filling algorithms and a variety of run-time options that allow users to balance competing demands of model accuracy and processing time. We applied the gap-filling methodology to MODIS Enhanced Vegetation Index (EVI) and daytime and nighttime Land Surface Temperature (LST) datasets for the African continent for 2000–2012, with a 1 km spatial resolution, and an 8-day temporal resolution. We validated the method by introducing and filling artificial gaps, and then comparing the original data with model predictions. Our approach achieved R2 values above 0.87 even for pixels within 500 km wide introduced gaps. Furthermore, the structure of our approach allows estimation of the error associated with each gap-filled pixel based on the distance to the non-gap pixels used to model its fill value, thus providing a mechanism for including uncertainty associated with the gap-filling process in downstream applications of the resulting datasets. PMID:25642100
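
    A toy illustration of the validation idea described above (introducing artificial gaps into a complete series and scoring the fill with R²) is sketched below; it uses simple temporal interpolation on synthetic data rather than the paper's two complementary gap-filling algorithms.

```python
# Introduce an artificial gap, fill it, and score the fill with R^2.
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(0, 46 * 8, 8)                      # 8-day composites over one year
evi = 0.4 + 0.2 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.02, t.size)

gap = slice(15, 25)                              # artificial gap
observed = evi.copy()
observed[gap] = np.nan

valid = ~np.isnan(observed)
filled = observed.copy()
filled[~valid] = np.interp(t[~valid], t[valid], observed[valid])

ss_res = np.sum((evi[gap] - filled[gap]) ** 2)
ss_tot = np.sum((evi[gap] - evi[gap].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 over the introduced gap: {r2:.3f}")
```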

  16. Polymorphisms in O-methyltransferase genes are associated with stover cell wall digestibility in European maize (Zea mays L.).

    PubMed

    Brenner, Everton A; Zein, Imad; Chen, Yongsheng; Andersen, Jeppe R; Wenzel, Gerhard; Ouzunova, Milena; Eder, Joachim; Darnhofer, Birte; Frei, Uschi; Barrière, Yves; Lübberstedt, Thomas

    2010-02-12

    OMT (O-methyltransferase) genes are involved in lignin biosynthesis, which relates to stover cell wall digestibility. Reduced lignin content is an important determinant of both forage quality and ethanol conversion efficiency of maize stover. Variation in genomic sequences coding for COMT, CCoAOMT1, and CCoAOMT2 was analyzed in relation to stover cell wall digestibility for a panel of 40 European forage maize inbred lines, and re-analyzed for a panel of 34 lines from a published French study. Different methodologies for association analysis were performed and compared. Across association methodologies, a total of 25, 12, 1, and 6 COMT polymorphic sites were significantly associated with DNDF, OMD, NDF, and WSC, respectively. Association analysis for CCoAOMT1 and CCoAOMT2 identified substantially fewer polymorphic sites (3 and 2, respectively) associated with the investigated traits. Our re-analysis of the 34 lines from a published French dataset identified 14 polymorphic sites significantly associated with cell wall digestibility, two of which were consistent with our study. Promising polymorphisms putatively causally associated with variability of cell wall digestibility were inferred from the total number of significantly associated SNPs/Indels. Several polymorphic sites for three O-methyltransferase loci were associated with stover cell wall digestibility. All three tested genes seem to be involved in controlling DNDF, in particular COMT. Thus, considerable variation among Bm3 wildtype alleles can be exploited for improving cell-wall digestibility. Target sites for functional markers were identified, enabling the development of efficient marker-based selection strategies.

  17. Polymorphisms in O-methyltransferase genes are associated with stover cell wall digestibility in European maize (Zea mays L.)

    PubMed Central

    2010-01-01

    Background OMT (O-methyltransferase) genes are involved in lignin biosynthesis, which relates to stover cell wall digestibility. Reduced lignin content is an important determinant of both forage quality and ethanol conversion efficiency of maize stover. Results Variation in genomic sequences coding for COMT, CCoAOMT1, and CCoAOMT2 was analyzed in relation to stover cell wall digestibility for a panel of 40 European forage maize inbred lines, and re-analyzed for a panel of 34 lines from a published French study. Different methodologies for association analysis were performed and compared. Across association methodologies, a total of 25, 12, 1, and 6 COMT polymorphic sites were significantly associated with DNDF, OMD, NDF, and WSC, respectively. Association analysis for CCoAOMT1 and CCoAOMT2 identified substantially fewer polymorphic sites (3 and 2, respectively) associated with the investigated traits. Our re-analysis of the 34 lines from a published French dataset identified 14 polymorphic sites significantly associated with cell wall digestibility, two of which were consistent with our study. Promising polymorphisms putatively causally associated with variability of cell wall digestibility were inferred from the total number of significantly associated SNPs/Indels. Conclusions Several polymorphic sites for three O-methyltransferase loci were associated with stover cell wall digestibility. All three tested genes seem to be involved in controlling DNDF, in particular COMT. Thus, considerable variation among Bm3 wildtype alleles can be exploited for improving cell-wall digestibility. Target sites for functional markers were identified, enabling the development of efficient marker-based selection strategies. PMID:20152036
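
    As a rough illustration of a single-marker association scan of the kind compared in these studies, the sketch below regresses a digestibility trait on allele dosage at each polymorphic site; the data are synthetic, and the structure-corrected association methodologies used in the papers are not reproduced.

```python
# Illustrative single-marker association scan on synthetic inbred-line data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_lines, n_sites = 40, 25
genotypes = rng.integers(0, 2, size=(n_lines, n_sites))        # inbred lines: 0/1 alleles
dndf = 55 + 2.0 * genotypes[:, 3] + rng.normal(0, 2, n_lines)  # trait with one causal site

p_values = np.array([
    stats.linregress(genotypes[:, j], dndf).pvalue for j in range(n_sites)
])
significant = np.where(p_values < 0.05 / n_sites)[0]           # Bonferroni threshold
print("sites passing the threshold:", significant)
```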

  18. Quantifying Interannual Variability for Photovoltaic Systems in PVWatts

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ryberg, David Severin; Freeman, Janine; Blair, Nate

    2015-10-01

    The National Renewable Energy Laboratory's (NREL's) PVWatts is a relatively simple tool used by industry and individuals alike to easily estimate the amount of energy a photovoltaic (PV) system will produce throughout the course of a typical year. PVWatts Version 5 has previously been shown to reasonably represent an operating system's output when provided with concurrent weather data; however, this type of data is not available when estimating system output during future time frames. For this purpose PVWatts uses weather data from typical meteorological year (TMY) datasets which are available on the NREL website. The TMY files represent a statistically 'typical' year which by definition excludes anomalous weather patterns and as a result may not provide sufficient quantification of project risk to the financial community. It was therefore desired to quantify the interannual variability associated with TMY files in order to improve the understanding of risk associated with these projects. To begin to understand the interannual variability of a PV project, we simulated two archetypal PV system designs, which are common in the PV industry, in PVWatts using the NSRDB's 1961-1990 historical dataset. This dataset contains measured hourly weather data and spans the thirty years from 1961 to 1990 for 239 locations in the United States. Notably, this historical dataset was used to compose the TMY2 dataset. Using the results of these simulations we computed several statistical metrics which may be of interest to the financial community and normalized the results with respect to the TMY energy prediction at each location, so that these results could be easily translated to similar systems. This report briefly describes the simulation process used and the statistical methodology employed for this project, but otherwise focuses mainly on a sample of our results. A short discussion of these results is also provided. It is our hope that this quantification of the interannual variability of PV systems will provide a starting point for variability considerations in future PV system designs and investigations.

  19. WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data

    PubMed Central

    Yi, Ming; Horton, Jay D; Cohen, Jonathan C; Hobbs, Helen H; Stephens, Robert M

    2006-01-01

    Background Analysis of High Throughput (HTP) Data such as microarray and proteomics data has provided a powerful methodology to study patterns of gene regulation at genome scale. A major unresolved problem in the post-genomic era is to assemble the large amounts of data generated into a meaningful biological context. We have developed a comprehensive software tool, WholePathwayScope (WPS), for deriving biological insights from analysis of HTP data. Result WPS extracts gene lists with shared biological themes through color cue templates. WPS statistically evaluates global functional category enrichment of gene lists and pathway-level pattern enrichment of data. WPS incorporates well-known biological pathways from KEGG (Kyoto Encyclopedia of Genes and Genomes) and Biocarta, GO (Gene Ontology) terms as well as user-defined pathways or relevant gene clusters or groups, and explores gene-term relationships within the derived gene-term association networks (GTANs). WPS simultaneously compares multiple datasets within biological contexts either as pathways or as association networks. WPS also integrates Genetic Association Database and Partial MedGene Database for disease-association information. We have used this program to analyze and compare microarray and proteomics datasets derived from a variety of biological systems. Application examples demonstrated the capacity of WPS to significantly facilitate the analysis of HTP data for integrative discovery. Conclusion This tool represents a pathway-based platform for discovery integration to maximize analysis power. The tool is freely available at . PMID:16423281

  20. Reconstructing missing information on precipitation datasets: impact of tails on adopted statistical distributions.

    NASA Astrophysics Data System (ADS)

    Pedretti, Daniele; Beckie, Roger Daniel

    2014-05-01

    Missing data in hydrological time-series databases are ubiquitous in practical applications, yet it is of fundamental importance to make educated decisions in problems that require exhaustive knowledge of the time series. This includes precipitation datasets, since recording or human failures can produce gaps in these time series. For applications that directly involve the ratio between precipitation and some other quantity, lack of complete information can result in poor understanding of basic physical and chemical dynamics involving precipitated water. For instance, the ratio between precipitation (recharge) and outflow rates at a discharge point of an aquifer (e.g. rivers, pumping wells, lysimeters) can be used to obtain aquifer parameters and thus to constrain model-based predictions. We tested a suite of methodologies to reconstruct missing information in rainfall datasets. The goal was to obtain a suitable and versatile method to reduce the errors caused by the lack of data in specific time windows. Our analyses included both a classical chronological pairing approach between rainfall stations and a probability-based approach, which accounts for the probability of exceedance of rain depths measured at two or multiple stations. Our analyses showed that it is not clear a priori which method performs best; rather, the selection should be based on the specific statistical properties of the rainfall dataset. In this presentation, our emphasis is on discussing the effects of a few typical parametric distributions used to model the behavior of rainfall. Specifically, we analyzed the role of distributional "tails", which have an important control on the occurrence of extreme rainfall events. The latter strongly affect several hydrological applications, including recharge-discharge relationships. The heavy-tailed distributions we considered were the parametric Log-Normal, Generalized Pareto, Generalized Extreme Value, and Gamma distributions. The methods were first tested on synthetic examples, to have complete control over the impact of several variables, such as the minimum amount of data required to obtain reliable statistical distributions from the selected parametric functions. Then, we applied the methodology to precipitation datasets collected in the Vancouver area and at a mining site in Peru.
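
    A small sketch of fitting the candidate distributions named above to wet-day rainfall depths and comparing their upper tails is shown below; the data are synthetic, and the station-to-station reconstruction step itself is not reproduced.

```python
# Fit candidate distributions to synthetic wet-day depths and compare tails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
wet_days = rng.gamma(shape=0.8, scale=12.0, size=2000)   # synthetic wet-day depths (mm)

candidates = {
    "gamma":      stats.gamma,
    "genpareto":  stats.genpareto,
    "lognorm":    stats.lognorm,
    "genextreme": stats.genextreme,
}

for name, dist in candidates.items():
    params = dist.fit(wet_days)
    # Compare the empirical and fitted 99th percentile to judge tail behaviour.
    q99_fit = dist.ppf(0.99, *params)
    q99_emp = np.quantile(wet_days, 0.99)
    print(f"{name:>10}: fitted q99 = {q99_fit:6.1f} mm, empirical q99 = {q99_emp:6.1f} mm")
```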

  1. Multivariate Formation Pressure Prediction with Seismic-derived Petrophysical Properties from Prestack AVO inversion and Poststack Seismic Motion Inversion

    NASA Astrophysics Data System (ADS)

    Yu, H.; Gu, H.

    2017-12-01

    A novel multivariate seismic formation pressure prediction methodology is presented, which incorporates high-resolution seismic velocity data from prestack AVO inversion, and petrophysical data (porosity and shale volume) derived from poststack seismic motion inversion. In contrast to traditional seismic formation pressure prediction methods, the proposed methodology is based on a multivariate pressure prediction model and utilizes a trace-by-trace multivariate regression analysis on seismic-derived petrophysical properties to calibrate model parameters in order to make accurate predictions with higher resolution in both vertical and lateral directions. With prestack time migration velocity as the initial velocity model, an AVO inversion was first applied to the prestack dataset to obtain high-resolution, higher-frequency seismic velocity to be used as the velocity input for seismic pressure prediction, and the density dataset to calculate accurate Overburden Pressure (OBP). Seismic Motion Inversion (SMI) is an inversion technique based on Markov Chain Monte Carlo simulation. Both structural variability and similarity of seismic waveform are used to incorporate well log data to characterize the variability of the property to be obtained. In this research, porosity and shale volume are first interpreted on well logs, and then combined with poststack seismic data using SMI to build porosity and shale volume datasets for seismic pressure prediction. A multivariate effective stress model is used to convert the velocity, porosity and shale volume datasets to effective stress. After a thorough study of the regional stratigraphic and sedimentary characteristics, a regional normally compacted interval model is built, and the coefficients in the multivariate prediction model are determined by a trace-by-trace multivariate regression analysis on the petrophysical data. The coefficients are used to convert the velocity, porosity and shale volume datasets to effective stress and then to calculate formation pressure with OBP. Application of the proposed methodology to a research area in the East China Sea has shown that the method can bridge the gap between seismic and well log pressure prediction and give predicted pressure values close to pressure measurements from well testing.

  2. Season of birth and primary central nervous system tumors: a systematic review of the literature with critical appraisal of underlying mechanisms.

    PubMed

    Georgakis, Marios K; Ntinopoulou, Erato; Chatzopoulou, Despoina; Petridou, Eleni Th

    2017-09-01

    Season of birth has been considered a proxy for seasonally varying exposures around the perinatal period, potentially implicated in the etiology of several health outcomes, including malignancies. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we have systematically reviewed the published literature on the association of birth seasonality with the risk of central nervous system tumors in children and adults. Seventeen eligible studies using various methodologies were identified, encompassing 20,523 cases. Eight of 10 studies in children versus four of eight in adults showed some statistically significant associations between birth seasonality and central nervous system tumor or tumor subtype occurrence, pointing to a clustering of births mostly in fall and winter months, although no consistent pattern was identified by histologic subtype. A plethora of perinatal factors might underlie or confound the associations, such as variations in birth weight, maternal diet during pregnancy, perinatal vitamin D levels, pesticides, infectious agents, immune system maturity, and epigenetic modifications. Inherent methodological weaknesses of the individual investigations published to date, mainly sample sizes underpowered for exploring the hypothesis by histological subtype, call for concerted actions using primary data from large datasets that also take into account the interplay between the potential underlying etiologic factors. Copyright © 2017 Elsevier Inc. All rights reserved.

  3. Prediction of brain tissue temperature using near-infrared spectroscopy.

    PubMed

    Holper, Lisa; Mitra, Subhabrata; Bale, Gemma; Robertson, Nicola; Tachtsidis, Ilias

    2017-04-01

    Broadband near-infrared spectroscopy (NIRS) can provide an endogenous indicator of tissue temperature based on the temperature dependence of the water absorption spectrum. We describe a first evaluation of the calibration and prediction of brain tissue temperature obtained during hypothermia in newborn piglets (animal dataset) and rewarming in newborn infants (human dataset) based on measured body (rectal) temperature. The calibration using partial least squares regression proved to be a reliable method to predict brain tissue temperature with respect to core body temperature in the wavelength interval of 720 to 880 nm with a strong mean predictive power of [Formula: see text] (animal dataset) and [Formula: see text] (human dataset). In addition, we applied regression receiver operating characteristic curves for the first time to evaluate the temperature prediction, which provided an overall mean error bias between NIRS predicted brain temperature and body temperature of [Formula: see text] (animal dataset) and [Formula: see text] (human dataset). We discuss main methodological aspects, particularly the well-known aspect of over- versus underestimation between brain and body temperature, which is relevant for potential clinical applications.
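
    A minimal sketch of a partial least squares calibration of temperature against spectra over the 720-880 nm window is given below; the data are synthetic, and the preprocessing, cross-validation, and error-bias analysis used in the paper are not reproduced.

```python
# PLS calibration of temperature against synthetic NIRS spectra (720-880 nm).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
wavelengths = np.arange(720, 881)                 # nm
n_spectra = 200
temps = rng.uniform(33.0, 38.5, n_spectra)        # core (rectal) temperature, degC

# Synthetic attenuation spectra whose water feature varies weakly with temperature.
base = np.exp(-((wavelengths - 840) / 30.0) ** 2)
spectra = (temps[:, None] - 35.0) * 0.01 * base \
          + rng.normal(0, 0.005, (n_spectra, wavelengths.size))

X_train, X_test, y_train, y_test = train_test_split(spectra, temps, random_state=0)
pls = PLSRegression(n_components=5).fit(X_train, y_train)
print("predictive R^2:", round(r2_score(y_test, pls.predict(X_test).ravel()), 3))
```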

  4. Evaluation of a Change Detection Methodology by Means of Binary Thresholding Algorithms and Informational Fusion Processes

    PubMed Central

    Molina, Iñigo; Martinez, Estibaliz; Arquero, Agueda; Pajares, Gonzalo; Sanchez, Javier

    2012-01-01

    Landcover is subject to continuous changes on a wide variety of temporal and spatial scales. Those changes produce significant effects in human and natural activities. Maintaining an updated spatial database with the changes that have occurred allows better monitoring of the Earth’s resources and management of the environment. Images from different sensors, such as satellite imagery and aerial photographs, have proven to be suitable and reliable data sources from which updated information can be extracted efficiently using change detection (CD) techniques, so that changes can also be inventoried and monitored. In this paper, a multisource CD methodology for multiresolution datasets is applied. First, different change indices are processed; then, different thresholding algorithms for change/no_change are applied to these indices in order to better estimate the statistical parameters of these categories; finally, the indices are integrated into a change detection multisource fusion process, which allows a single CD result to be generated from several combinations of indices. This methodology has been applied to datasets with different spectral and spatial resolution properties. Then, the obtained results are evaluated by means of a quality control analysis, as well as with complementary graphical representations. The suggested methodology has also proved efficient at identifying the change detection index with the highest contribution. PMID:22737023

  5. Evaluation of a change detection methodology by means of binary thresholding algorithms and informational fusion processes.

    PubMed

    Molina, Iñigo; Martinez, Estibaliz; Arquero, Agueda; Pajares, Gonzalo; Sanchez, Javier

    2012-01-01

    Landcover is subject to continuous changes on a wide variety of temporal and spatial scales. Those changes produce significant effects in human and natural activities. Maintaining an updated spatial database with the changes that have occurred allows better monitoring of the Earth's resources and management of the environment. Images from different sensors, such as satellite imagery and aerial photographs, have proven to be suitable and reliable data sources from which updated information can be extracted efficiently using change detection (CD) techniques, so that changes can also be inventoried and monitored. In this paper, a multisource CD methodology for multiresolution datasets is applied. First, different change indices are processed; then, different thresholding algorithms for change/no_change are applied to these indices in order to better estimate the statistical parameters of these categories; finally, the indices are integrated into a change detection multisource fusion process, which allows a single CD result to be generated from several combinations of indices. This methodology has been applied to datasets with different spectral and spatial resolution properties. Then, the obtained results are evaluated by means of a quality control analysis, as well as with complementary graphical representations. The suggested methodology has also proved efficient at identifying the change detection index with the highest contribution.
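
    As a sketch of one building block of such a pipeline, the snippet below forms an image-difference change index from two co-registered images and splits it into change/no_change with Otsu's threshold; the data are synthetic, and the multi-index fusion step is not shown.

```python
# Image-difference change index plus Otsu thresholding on synthetic images.
import numpy as np

rng = np.random.default_rng(5)
before = rng.normal(100, 10, (256, 256))
after = before.copy()
after[100:150, 80:160] += 40            # simulated landcover change
after += rng.normal(0, 5, after.shape)

index = np.abs(after - before)          # image-difference change index

def otsu_threshold(img, bins=256):
    """Classic Otsu: maximise the between-class variance over the histogram."""
    hist, edges = np.histogram(img.ravel(), bins=bins)
    hist = hist.astype(float)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w_b = np.cumsum(hist)                # background weight up to each bin
    w_f = w_b[-1] - w_b                  # foreground weight
    m = np.cumsum(hist * centers)        # cumulative first moment
    mu_b = np.divide(m, w_b, out=np.zeros_like(m), where=w_b > 0)
    mu_f = np.divide(m[-1] - m, w_f, out=np.zeros_like(m), where=w_f > 0)
    between = w_b * w_f * (mu_b - mu_f) ** 2
    return centers[np.argmax(between)]

change_mask = index > otsu_threshold(index)
print("changed pixels:", int(change_mask.sum()))
```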

  6. The P/N (Positive-to-Negative Links) Ratio in Complex Networks-A Promising In Silico Biomarker for Detecting Changes Occurring in the Human Microbiome.

    PubMed

    Ma, Zhanshan Sam

    2018-05-01

    Relatively little progress has been made in the methodology for differentiating between healthy and diseased microbiomes beyond comparing microbial community diversities with traditional species richness or the Shannon index. Network analysis has increasingly been called upon for the task, but most currently available microbiome datasets only allow for the construction of simple species correlation networks (SCNs). The main results from SCN analysis are a series of network properties such as network degree and modularity, but the metrics for these network properties often produce inconsistent evidence. We propose a simple new network property, the P/N ratio, defined as the ratio of the number of positive links to the number of negative links in the microbial SCN. We postulate that the P/N ratio should reflect the balance between facilitative and inhibitive interactions among microbial species, possibly one of the most important changes occurring in a diseased microbiome. We tested our hypothesis with five datasets representing five major human microbiome sites and discovered that the P/N ratio exhibits contrasting differences between healthy and diseased microbiomes and may be harnessed as an in silico biomarker for detecting disease-associated changes in the human microbiome, and may play an important role in the personalized diagnosis of human microbiome-associated diseases.
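
    A toy sketch of computing the P/N ratio from a species correlation network built on an abundance table is given below; the abundances are synthetic and the significance thresholds for calling a link are illustrative, not the paper's settings.

```python
# P/N ratio of a species correlation network from a synthetic abundance table.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(11)
n_samples, n_species = 60, 30
abund = rng.lognormal(mean=1.0, sigma=0.8, size=(n_samples, n_species))

pos, neg = 0, 0
for i, j in combinations(range(n_species), 2):
    r, p = stats.spearmanr(abund[:, i], abund[:, j])
    if p < 0.01 and abs(r) > 0.3:        # keep a link only if the correlation is strong
        if r > 0:
            pos += 1
        else:
            neg += 1

pn_ratio = pos / neg if neg else float("inf")
print(f"positive links = {pos}, negative links = {neg}, P/N = {pn_ratio:.2f}")
```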

  7. Knowledge mining from clinical datasets using rough sets and backpropagation neural network.

    PubMed

    Nahato, Kindie Biredagn; Harichandran, Khanna Nehemiah; Arputharaj, Kannan

    2015-01-01

    The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work, a rough set indiscernibility relation method combined with a backpropagation neural network (RS-BPNN) is used. This work has two stages. The first stage is the handling of missing values to obtain a smooth dataset and the selection of appropriate attributes from the clinical dataset by the indiscernibility relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.

  8. ProDaMa: an open source Python library to generate protein structure datasets.

    PubMed

    Armano, Giuliano; Manconi, Andrea

    2009-10-02

    The huge difference between the number of known sequences and known tertiary structures has justified the use of automated methods for protein analysis. Although a general methodology to solve these problems has not yet been devised, researchers are engaged in developing more accurate techniques and algorithms whose training plays a relevant role in determining their performance. From this perspective, particular importance is given to the training data used in experiments, and researchers are often engaged in the generation of specialized datasets that meet their requirements. To facilitate the task of generating specialized datasets we devised and implemented ProDaMa, an open source Python library that provides classes for retrieving, organizing, updating, analyzing, and filtering protein data. ProDaMa has been used to generate specialized datasets useful for secondary structure prediction and to develop a collaborative web application aimed at generating and sharing protein structure datasets. The library, the related database, and the documentation are freely available at the URL http://iasc.diee.unica.it/prodama.

  9. Knowledge management in secondary pharmaceutical manufacturing by mining of data historians-A proof-of-concept study.

    PubMed

    Meneghetti, Natascia; Facco, Pierantonio; Bezzo, Fabrizio; Himawan, Chrismono; Zomer, Simeone; Barolo, Massimiliano

    2016-05-30

    In this proof-of-concept study, a methodology is proposed to systematically analyze large data historians of secondary pharmaceutical manufacturing systems using data mining techniques. The objective is to develop an approach that automatically retrieves operation-relevant information to assist management in the periodic review of a manufacturing system. The proposed methodology allows one to automatically perform three tasks: the identification of single batches within the entire data sequence of the historical dataset, the identification of distinct operating phases within each batch, and the characterization of each batch with respect to an assigned multivariate set of operating characteristics. The approach is tested on a six-month dataset of a commercial-scale granulation/drying system, in which several million data entries are recorded. The quality of the results and the generality of the approach indicate a strong potential for extending the method to even larger historical datasets and to different operations, thus making it an advanced PAT tool that can assist the implementation of continual improvement paradigms within a quality-by-design framework. Copyright © 2016 Elsevier B.V. All rights reserved.

  10. Analyzing legacy U.S. Geological Survey geochemical databases using GIS: applications for a national mineral resource assessment

    USGS Publications Warehouse

    Yager, Douglas B.; Hofstra, Albert H.; Granitto, Matthew

    2012-01-01

    This report emphasizes geographic information system analysis and the display of data stored in the legacy U.S. Geological Survey National Geochemical Database for use in mineral resource investigations. Geochemical analyses of soils, stream sediments, and rocks that are archived in the National Geochemical Database provide an extensive data source for investigating geochemical anomalies. A study area in the Egan Range of east-central Nevada was used to develop a geographic information system analysis methodology for two different geochemical datasets involving detailed (Bureau of Land Management Wilderness) and reconnaissance-scale (National Uranium Resource Evaluation) investigations. ArcGIS was used to analyze and thematically map geochemical information at point locations. Watershed-boundary datasets served as a geographic reference to relate potentially anomalous sample sites with hydrologic unit codes at varying scales. The National Hydrography Dataset was analyzed with Hydrography Event Management and ArcGIS Utility Network Analyst tools to delineate potential sediment-sample provenance along a stream network. These tools can be used to track potential upstream-sediment-contributing areas to a sample site. This methodology identifies geochemically anomalous sample sites, watersheds, and streams that could help focus mineral resource investigations in the field.

  11. Basis function models for animal movement

    USGS Publications Warehouse

    Hooten, Mevin B.; Johnson, Devin S.

    2017-01-01

    Advances in satellite-based data collection techniques have served as a catalyst for new statistical methodology to analyze these data. In wildlife ecological studies, satellite-based data and methodology have provided a wealth of information about animal space use and the investigation of individual-based animal–environment relationships. With the technology for data collection improving dramatically over time, we are left with massive archives of historical animal telemetry data of varying quality. While many contemporary statistical approaches for inferring movement behavior are specified in discrete time, we develop a flexible continuous-time stochastic integral equation framework that is amenable to reduced-rank second-order covariance parameterizations. We demonstrate how the associated first-order basis functions can be constructed to mimic behavioral characteristics in realistic trajectory processes using telemetry data from mule deer and mountain lion individuals in western North America. Our approach is parallelizable and provides inference for heterogeneous trajectories using nonstationary spatial modeling techniques that are feasible for large telemetry datasets. Supplementary materials for this article are available online.

  12. Semantic similarity measures in the biomedical domain by leveraging a web search engine.

    PubMed

    Hsieh, Sheau-Ling; Chang, Wen-Yung; Chen, Chi-Huang; Weng, Yung-Ching

    2013-07-01

    Various studies of web-based semantic similarity measures have been carried out. However, measuring semantic similarity between two terms remains a challenging task. Traditional ontology-based methodologies have the limitation that both concepts must reside in the same ontology tree(s). Unfortunately, in practice, this assumption is not always applicable. On the other hand, if the corpus is sufficiently adequate, corpus-based methodologies can overcome the limitation, and the web is an enormous, continuously growing corpus. Therefore, a method of estimating semantic similarity is proposed that exploits the page counts of two biomedical concepts returned by the Google AJAX web search engine. The features are extracted as the co-occurrence patterns of two given terms P and Q, by querying P, Q, as well as P AND Q, and the web search hit counts of the defined lexico-syntactic patterns. These similarity scores of different patterns are evaluated, by adapting support vector machines for classification, to leverage the robustness of semantic similarity measures. Experimental results validated against two datasets (dataset 1, provided by A. Hliaoutakis; dataset 2, provided by T. Pedersen) are presented and discussed. In dataset 1, the proposed approach achieves the best correlation coefficient (0.802) under SNOMED-CT. In dataset 2, the proposed method obtains the best correlation coefficient (SNOMED-CT: 0.705; MeSH: 0.723) with physician scores compared with measures of other methods. However, the correlation coefficients with coder scores (SNOMED-CT: 0.496; MeSH: 0.539) showed the opposite outcome. In conclusion, the semantic similarity findings of the proposed method are close to those of physicians' ratings. Furthermore, the study provides a cornerstone investigation for extracting fully relevant information from digitized, free-text medical records in the National Taiwan University Hospital database.
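
    For illustration, two common page-count-based co-occurrence scores (WebJaccard and a normalized web-distance style measure) can be computed as below; the hit counts and corpus size are hard-coded placeholders, since this sketch calls no search engine API, and the pattern-based SVM features used in the paper are not reproduced.

```python
# Page-count-based co-occurrence scores with placeholder hit counts.
import math

N = 1e10          # assumed size of the indexed corpus (rough placeholder)

def web_jaccard(c_p, c_q, c_pq):
    return 0.0 if c_pq == 0 else c_pq / (c_p + c_q - c_pq)

def normalized_web_distance(c_p, c_q, c_pq, n=N):
    if min(c_p, c_q, c_pq) == 0:
        return float("inf")
    return (max(math.log(c_p), math.log(c_q)) - math.log(c_pq)) / \
           (math.log(n) - min(math.log(c_p), math.log(c_q)))

# Placeholder counts for terms P="myocardial infarction", Q="heart attack".
c_p, c_q, c_pq = 4_200_000, 9_800_000, 1_300_000
print("WebJaccard:", round(web_jaccard(c_p, c_q, c_pq), 3))
print("NWD:", round(normalized_web_distance(c_p, c_q, c_pq), 3))
```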

  13. The use of hierarchical clustering for the design of optimized monitoring networks

    NASA Astrophysics Data System (ADS)

    Soares, Joana; Makar, Paul Andrew; Aklilu, Yayne; Akingunola, Ayodeji

    2018-05-01

    Associativity analysis is a powerful tool to deal with large-scale datasets by clustering the data on the basis of (dis)similarity and can be used to assess the efficacy and design of air quality monitoring networks. We describe here our use of Kolmogorov-Zurbenko filtering and hierarchical clustering of NO2 and SO2 passive and continuous monitoring data to analyse and optimize air quality networks for these species in the province of Alberta, Canada. The methodology applied in this study assesses dissimilarity between monitoring station time series based on two metrics: 1 - R, R being the Pearson correlation coefficient, and the Euclidean distance; we find that both should be used in evaluating monitoring site similarity. We have combined the analytic power of hierarchical clustering with the spatial information provided by deterministic air quality model results, using the gridded time series of model output as potential station locations, as a proxy for assessing monitoring network design and for network optimization. We demonstrate that clustering results depend on the air contaminant analysed, reflecting the difference in the respective emission sources of SO2 and NO2 in the region under study. Our work shows that much of the signal identifying the sources of NO2 and SO2 emissions resides in shorter timescales (hourly to daily) due to short-term variation of concentrations and that longer-term averages in data collection may lose the information needed to identify local sources. However, the methodology identifies stations mainly influenced by seasonality, if larger timescales (weekly to monthly) are considered. We have performed the first dissimilarity analysis based on gridded air quality model output and have shown that the methodology is capable of generating maps of subregions within which a single station will represent the entire subregion, to a given level of dissimilarity. We have also shown that our approach is capable of identifying different sampling methodologies as well as outliers (stations' time series which are markedly different from all others in a given dataset).
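
    A compact sketch of hierarchical clustering of station time series under the 1 - R dissimilarity described above is shown below; the series are synthetic, and the Kolmogorov-Zurbenko filtering and Euclidean-distance variant are omitted.

```python
# Hierarchical clustering of synthetic station time series with 1 - R dissimilarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
hours = np.arange(24 * 90)                       # three months of hourly data
source_a = np.sin(2 * np.pi * hours / 24)        # diurnal signal
source_b = np.sin(2 * np.pi * hours / (24 * 7))  # weekly signal

stations = np.array(
    [source_a + rng.normal(0, 0.3, hours.size) for _ in range(5)] +
    [source_b + rng.normal(0, 0.3, hours.size) for _ in range(5)]
)

corr = np.corrcoef(stations)                     # station-by-station Pearson R
dissim = 1.0 - corr
np.fill_diagonal(dissim, 0.0)

Z = linkage(squareform(dissim, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster assignment per station:", labels)
```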

  14. Methodological challenges to bridge the gap between regional climate and hydrology models

    NASA Astrophysics Data System (ADS)

    Bozhinova, Denica; José Gómez-Navarro, Juan; Raible, Christoph; Felder, Guido

    2017-04-01

    The frequency and severity of floods worldwide, together with their impacts, are expected to increase under climate change scenarios. It is therefore very important to gain insight into the physical mechanisms responsible for such events in order to constrain the associated uncertainties. Model simulations of climate and hydrological processes are important tools that can provide insight into the underlying physical processes and thus enable an accurate assessment of the risks. Coupled together, they can provide a physically consistent picture that allows the phenomenon to be assessed in a comprehensive way. However, climate and hydrological models work at different temporal and spatial scales, so there are a number of methodological challenges that need to be carefully addressed. An important issue pertains to the presence of biases in the simulation of precipitation. Climate models in general, and Regional Climate Models (RCMs) in particular, are affected by a number of systematic biases that limit their reliability. In many studies, most prominently assessments of changes due to climate change, such biases are minimised by applying the so-called delta approach, which focuses on changes and disregards absolute values that are more affected by biases. However, this approach is not suitable in this scenario, as the absolute value of precipitation, rather than the change, is fed into the hydrological model. Therefore, the bias has to be removed beforehand, a complex matter for which various methodologies have been proposed. In this study, we apply and discuss the advantages and caveats of two different methodologies that correct the simulated precipitation to minimise differences with respect to an observational dataset: a linear fit (FIT) of the accumulated distributions and Quantile Mapping (QM). The target region is Switzerland, and the observational dataset is therefore provided by MeteoSwiss. The RCM is the Weather Research and Forecasting model (WRF), driven at the boundaries by the Community Earth System Model (CESM). The raw simulation driven by CESM exhibits prominent biases that stand out in the evolution of the annual cycle and demonstrate that the correction of biases is mandatory in this type of study, rather than a minor correction that might be neglected. The simulation spans the period 1976-2005, although the correction is applied on a daily basis. Both methods lead to a corrected precipitation field that respects the temporal evolution of the simulated precipitation while mimicking the distribution of precipitation in the observations. Due to the nature of the two methodologies, there are important differences between the products of the two corrections, which lead to datasets with different properties. FIT is generally more accurate in reproducing the tails of the distribution, i.e. extreme events, whereas the nature of QM renders it a general-purpose correction whose skill is equally distributed across the full distribution of precipitation, including central values.
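
    A minimal empirical quantile-mapping sketch on synthetic daily precipitation is given below to illustrate the QM idea contrasted with the linear FIT correction; a real application would fit the mapping per season or month and treat dry days explicitly.

```python
# Empirical quantile mapping of biased daily precipitation onto observations.
import numpy as np

rng = np.random.default_rng(9)
obs = rng.gamma(0.7, 9.0, 10950)       # ~30 years of observed daily precip (mm)
sim = rng.gamma(0.9, 6.0, 10950)       # biased model output for the same period

quantiles = np.linspace(0.01, 0.99, 99)
sim_q = np.quantile(sim, quantiles)
obs_q = np.quantile(obs, quantiles)

def quantile_map(x):
    """Map a simulated value onto the observed distribution."""
    return np.interp(x, sim_q, obs_q)

corrected = quantile_map(sim)
print("mean precip  obs / raw / corrected:",
      round(obs.mean(), 2), round(sim.mean(), 2), round(corrected.mean(), 2))
```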

  15. A methodology for cloud masking uncalibrated lidar signals

    NASA Astrophysics Data System (ADS)

    Binietoglou, Ioannis; D'Amico, Giuseppe; Baars, Holger; Belegante, Livio; Marinou, Eleni

    2018-04-01

    Most lidar processing algorithms, such as those included in EARLINET's Single Calculus Chain, can be applied only to cloud-free atmospheric scenes. In this paper, we present a methodology for masking clouds in uncalibrated lidar signals. First, we construct a reference dataset based on manual inspection and then train a classifier to separate clouds from cloud-free regions. Here we present details of this approach together with example cloud masks from an EARLINET station.

  16. Development of an Integrated Team Training Design and Assessment Architecture to Support Adaptability in Healthcare Teams

    DTIC Science & Technology

    2017-10-01

    to patient safety by addressing key methodological and conceptual gaps in healthcare simulation-based team training. The investigators are developing...primary outcome of Aim 1a is a conceptually and methodologically sound training design architecture that supports the development and integration of team...should be delivered. This subtask was delayed by approximately 1 month and is now completed. Completed Evaluation of existing experimental dataset to

  17. High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials

    DOE PAGES

    Petousis, Ioannis; Mrdjenovich, David; Ballouz, Eric; ...

    2017-01-31

    Dielectrics are an important class of materials that are ubiquitous in modern electronic applications. Even though their properties are important for the performance of devices, the number of compounds with known dielectric constant is on the order of a few hundred. Here, we use Density Functional Perturbation Theory as a way to screen for the dielectric constant and refractive index of materials in a fast and computationally efficient way. Our results constitute the largest dielectric tensors database to date, containing 1,056 compounds. Details regarding the computational methodology and technical validation are presented along with the format of our publicly available data. In addition, we integrate our dataset with the Materials Project allowing users easy access to material properties. Finally, we explain how our dataset and calculation methodology can be used in the search for novel dielectric compounds.

  18. High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials

    PubMed Central

    Petousis, Ioannis; Mrdjenovich, David; Ballouz, Eric; Liu, Miao; Winston, Donald; Chen, Wei; Graf, Tanja; Schladt, Thomas D.; Persson, Kristin A.; Prinz, Fritz B.

    2017-01-01

    Dielectrics are an important class of materials that are ubiquitous in modern electronic applications. Even though their properties are important for the performance of devices, the number of compounds with known dielectric constant is on the order of a few hundred. Here, we use Density Functional Perturbation Theory as a way to screen for the dielectric constant and refractive index of materials in a fast and computationally efficient way. Our results constitute the largest dielectric tensors database to date, containing 1,056 compounds. Details regarding the computational methodology and technical validation are presented along with the format of our publicly available data. In addition, we integrate our dataset with the Materials Project allowing users easy access to material properties. Finally, we explain how our dataset and calculation methodology can be used in the search for novel dielectric compounds. PMID:28140408

  19. Detection of tuberculosis using hybrid features from chest radiographs

    NASA Astrophysics Data System (ADS)

    Fatima, Ayesha; Akram, M. Usman; Akhtar, Mahmood; Shafique, Irrum

    2017-02-01

    Tuberculosis is an infectious disease that has become a major threat worldwide, yet its diagnosis remains a challenging task. In the literature, chest radiographs are considered the most commonly used medical images for the diagnosis of TB in developing countries. Different methods have been proposed, but they are not helpful for radiologists due to cost and accuracy issues. Our paper presents a methodology in which different combinations of features are extracted based on the intensities, shape and texture of the chest radiograph and given to a classifier for the detection of TB. The performance of our methodology is evaluated using the publicly available standard Montgomery County (MC) dataset, which contains 138 CXRs, among which 80 CXRs are normal and 58 CXRs are abnormal, including effusion and miliary patterns. An accuracy of 81.16% was achieved, and the results show that the proposed method outperforms existing state-of-the-art methods on the MC dataset.

  20. High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials.

    PubMed

    Petousis, Ioannis; Mrdjenovich, David; Ballouz, Eric; Liu, Miao; Winston, Donald; Chen, Wei; Graf, Tanja; Schladt, Thomas D; Persson, Kristin A; Prinz, Fritz B

    2017-01-31

    Dielectrics are an important class of materials that are ubiquitous in modern electronic applications. Even though their properties are important for the performance of devices, the number of compounds with known dielectric constant is on the order of a few hundred. Here, we use Density Functional Perturbation Theory as a way to screen for the dielectric constant and refractive index of materials in a fast and computationally efficient way. Our results constitute the largest dielectric tensors database to date, containing 1,056 compounds. Details regarding the computational methodology and technical validation are presented along with the format of our publicly available data. In addition, we integrate our dataset with the Materials Project allowing users easy access to material properties. Finally, we explain how our dataset and calculation methodology can be used in the search for novel dielectric compounds.

  1. In-flight photogrammetric camera calibration and validation via complementary lidar

    NASA Astrophysics Data System (ADS)

    Gneeniss, A. S.; Mills, J. P.; Miller, P. E.

    2015-02-01

    This research assumes lidar as a reference dataset against which in-flight camera system calibration and validation can be performed. The methodology utilises a robust least squares surface matching algorithm to align a dense network of photogrammetric points to the lidar reference surface, allowing for the automatic extraction of so-called lidar control points (LCPs). Adjustment of the photogrammetric data is then repeated using the extracted LCPs in a self-calibrating bundle adjustment with additional parameters. This methodology was tested using two different photogrammetric datasets, a Microsoft UltraCamX large format camera and an Applanix DSS322 medium format camera. Systematic sensitivity testing explored the influence of the number and weighting of LCPs. For both camera blocks it was found that when the number of control points increase, the accuracy improves regardless of point weighting. The calibration results were compared with those obtained using ground control points, with good agreement found between the two.

  2. Decision tree methods: applications for classification and prediction.

    PubMed

    Song, Yan-Yan; Lu, Ying

    2015-04-25

Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model and the validation dataset to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms for developing decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
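A minimal sketch of the train/validate workflow described above, using scikit-learn's CART-style tree on a bundled dataset (CHAID and QUEST are not available in scikit-learn; the depth grid is only an illustrative proxy for tree-size selection):

```python
# A small sketch of the train/validation workflow: grow trees of increasing
# depth on the training set and keep the depth that performs best on the
# validation set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

best_depth, best_score = None, -np.inf
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)   # accuracy on the validation dataset
    if score > best_score:
        best_depth, best_score = depth, score

print(f"selected depth: {best_depth}, validation accuracy: {best_score:.3f}")
```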

  3. A Neuroelectrical Brain Imaging Study on the Perception of Figurative Paintings against Only their Color or Shape Contents.

    PubMed

    Maglione, Anton G; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio

    2017-01-01

In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Also in the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the other two (Color and Style) datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s observation period: (1) the emotion induced by the Color and Style datasets increased across time, while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is coherent with the notion that active perception of the images with sustained cognitive attention in parietal and central areas led to the generation of the judgment about their aesthetic appreciation in frontal areas.

  4. A Neuroelectrical Brain Imaging Study on the Perception of Figurative Paintings against Only their Color or Shape Contents

    PubMed Central

    Maglione, Anton G.; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio

    2017-01-01

In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Also in the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the other two (Color and Style) datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s observation period: (1) the emotion induced by the Color and Style datasets increased across time, while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is coherent with the notion that active perception of the images with sustained cognitive attention in parietal and central areas led to the generation of the judgment about their aesthetic appreciation in frontal areas. PMID:28790907

  5. Soil Bulk Density by Soil Type, Land Use and Data Source: Putting the Error in SOC Estimates

    NASA Astrophysics Data System (ADS)

    Wills, S. A.; Rossi, A.; Loecke, T.; Ramcharan, A. M.; Roecker, S.; Mishra, U.; Waltman, S.; Nave, L. E.; Williams, C. O.; Beaudette, D.; Libohova, Z.; Vasilas, L.

    2017-12-01

An important part of SOC stock and pool assessment is the estimation and application of bulk density. The concept of bulk density is relatively simple (the mass of soil in a given volume), but bulk density can be difficult to measure in soils due to logistical and methodological constraints. While many estimates of SOC pools use legacy data, few concerted efforts have been made to assess the process used to convert laboratory carbon concentration measurements and bulk density collection into volumetrically based SOC estimates. The methodologies used are particularly sensitive in wetlands and organic soils with high amounts of carbon and very low bulk densities. We will present an analysis across four databases: NCSS - the National Cooperative Soil Survey Characterization dataset, RaCA - the Rapid Carbon Assessment sample dataset, NWCA - the National Wetland Condition Assessment, and ISCN - the International Soil Carbon Network. The relationship between bulk density and soil organic carbon will be evaluated by dataset and land use/land cover information. Prediction methods (both regression and machine learning) will be compared and contrasted across datasets and available input information. The assessment and application of bulk density, including modeling, aggregation and error propagation, will be evaluated. Finally, recommendations will be made about both the use of new data in soil survey products (such as SSURGO) and the use of that information as legacy data in SOC pool estimates.
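Two steps implied above can be illustrated with a small, hedged sketch: converting carbon concentration and bulk density to a volumetric SOC stock, and a simple pedotransfer-style regression of bulk density on organic carbon (all values are invented, not drawn from the NCSS, RaCA, NWCA or ISCN datasets):

```python
# Hedged sketch: (1) volumetric SOC stock from concentration, bulk density and
# depth; (2) a simple regression of bulk density on organic carbon, since BD
# tends to fall as organic carbon rises. Values below are invented examples.
import numpy as np
from sklearn.linear_model import LinearRegression

def soc_stock_mg_ha(soc_g_kg, bulk_density_g_cm3, depth_cm):
    """SOC stock (Mg C/ha) = concentration (g/kg) * BD (g/cm3) * depth (cm) * 0.1"""
    return soc_g_kg * bulk_density_g_cm3 * depth_cm * 0.1

print(soc_stock_mg_ha(soc_g_kg=25.0, bulk_density_g_cm3=1.3, depth_cm=30))  # ~97.5 Mg/ha

# Pedotransfer-style fit on made-up mineral-to-organic soil samples
soc = np.array([5, 10, 20, 50, 120, 300, 450]).reshape(-1, 1)   # g C / kg
bd  = np.array([1.55, 1.45, 1.35, 1.10, 0.80, 0.35, 0.20])      # g / cm3
model = LinearRegression().fit(np.log(soc), bd)
print("predicted BD at 80 g/kg:", model.predict(np.log([[80.0]])).round(2))
```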

  6. Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines.

    PubMed

    Raja, Kalpana; Natarajan, Jeyakumar

    2018-07-01

Extraction of protein phosphorylation information from biomedical literature has gained much attention because of its importance in numerous biological processes. In this study, we propose a text mining methodology consisting of two phases, NLP parsing and SVM classification, to extract phosphorylation information from the literature. First, using NLP parsing we divide the data into three base-forms depending on the biomedical entities related to phosphorylation and further classify them into ten sub-forms based on their distribution with the phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply SVM to classify the extracted singles/pairs/triplets using a set of features applicable to each sub-form. The performance of our methodology was evaluated on three corpora, namely the PLC, iProLink and hPP corpora. We obtained promising results of >85% F-score on the ten sub-forms of the training datasets in cross-validation tests. Our system achieved an overall F-score of 93.0% on the iProLink and 96.3% on the hPP corpus test datasets. Furthermore, our proposed system achieved the best performance in cross-corpus evaluation and outperformed the existing system with a recall of 90.1%. The performance analysis of our system on the three corpora reveals that it extracts protein phosphorylation information efficiently in both non-organism-specific general datasets, such as PLC and iProLink, and a human-specific dataset, such as the hPP corpus. Copyright © 2018 Elsevier B.V. All rights reserved.

  7. Downscaling global precipitation for local applications - a case for the Rhine basin

    NASA Astrophysics Data System (ADS)

    Sperna Weiland, Frederiek; van Verseveld, Willem; Schellekens, Jaap

    2017-04-01

Within the EU FP7 project eartH2Observe a global Water Resources Re-analysis (WRR) is being developed. This re-analysis consists of meteorological and hydrological water balance variables with global coverage, spanning the period 1979-2014 at 0.25 degrees resolution (Schellekens et al., 2016). The dataset can be of special interest in regions with limited in-situ data availability, yet for local scale analysis, particularly in mountainous regions, a resolution of 0.25 degrees may be too coarse and downscaling the data to a higher resolution may be required. A downscaling toolbox has been made that includes spatial downscaling of precipitation based on the global WorldClim dataset that is available at 1 km resolution as a monthly climatology (Hijmans et al., 2005). The inputs of the down-scaling tool are either the global eartH2Observe WRR1 and WRR2 datasets based on the WFDEI correction methodology (Weedon et al., 2014) or the global Multi-Source Weighted-Ensemble Precipitation (MSWEP) dataset (Beck et al., 2016). Here we present a validation of the datasets over the Rhine catchment by means of a distributed hydrological model (wflow, Schellekens et al., 2014) using a number of precipitation scenarios. (1) We start by running the model using the local reference dataset derived by spatial interpolation of gauge observations. Furthermore we use (2) the MSWEP dataset at the native 0.25-degree resolution, followed by (3) MSWEP downscaled with the WorldClim dataset and finally (4) MSWEP downscaled with the local reference dataset. The validation will be based on comparison of the modeled river discharges as well as rainfall statistics. We expect that down-scaling the MSWEP dataset with the WorldClim data to higher resolution will increase its performance. To test the performance of the down-scaling routine we have added a run with MSWEP data down-scaled with the local dataset and compare this with the run based on the local dataset itself. - Beck, H. E. et al., 2016. MSWEP: 3-hourly 0.25° global gridded precipitation (1979-2015) by merging gauge, satellite, and reanalysis data, Hydrol. Earth Syst. Sci. Discuss., doi:10.5194/hess-2016-236, accepted for final publication. - Hijmans, R.J. et al., 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978. - Schellekens, J. et al., 2016. A global water resources ensemble of hydrological models: the eartH2Observe Tier-1 dataset, Earth Syst. Sci. Data Discuss., doi:10.5194/essd-2016-55, under review. - Schellekens, J. et al., 2014. Rapid setup of hydrological and hydraulic models using OpenStreetMap and the SRTM derived digital elevation model. Environmental Modelling & Software. - Weedon, G.P. et al., 2014. The WFDEI meteorological forcing data set: WATCH Forcing Data methodology applied to ERA-Interim reanalysis data. Water Resources Research, 50, doi:10.1002/2014WR015638.
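A minimal sketch of ratio-based climatology downscaling of the kind described (an assumed formulation, not the eartH2Observe toolbox itself; arrays are synthetic):

```python
# Assumed form of ratio downscaling: coarse precipitation is redistributed with
# the ratio of a fine-resolution climatology to the same climatology aggregated
# back to the coarse grid, so block means are conserved. Arrays are synthetic.
import numpy as np

def downscale(precip_coarse, clim_fine, factor):
    ny, nx = precip_coarse.shape
    # Aggregate the fine climatology to the coarse grid (block mean)
    clim_coarse = clim_fine.reshape(ny, factor, nx, factor).mean(axis=(1, 3))
    # Broadcast both coarse fields back to the fine grid
    precip_up = np.kron(precip_coarse, np.ones((factor, factor)))
    clim_up   = np.kron(clim_coarse,  np.ones((factor, factor)))
    ratio = np.divide(clim_fine, clim_up, out=np.ones_like(clim_fine), where=clim_up > 0)
    return precip_up * ratio

rng = np.random.default_rng(0)
precip_025deg = rng.gamma(2.0, 2.0, size=(4, 4))      # coarse monthly precipitation
clim_1km      = rng.gamma(2.0, 2.0, size=(16, 16))    # fine climatology (factor 4 here)
print(downscale(precip_025deg, clim_1km, factor=4).shape)  # (16, 16)
```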

  8. A high-resolution optical rangefinder using tunable focus optics and spatial photonic signal processing

    NASA Astrophysics Data System (ADS)

    Khwaja, Tariq S.; Mazhar, Mohsin Ali; Niazi, Haris Khan; Reza, Syed Azer

    2017-06-01

In this paper, we present the design of a proposed optical rangefinder to determine the distance of a semi-reflective target from the sensor module. The sensor module deploys a simple Tunable Focus Lens (TFL), a Laser Source (LS) with a Gaussian Beam profile and a digital beam profiler/imager to achieve its desired operation. We show that, owing to the nature of existing measurement methodologies, previous attempts in prior art to use a simple TFL to estimate target distance mostly deliver "one-shot" distance estimates instead of obtaining and using a larger dataset, which can significantly reduce the effect of a few largely incorrect individual data points on the final distance estimate. Using a measurement dataset and calculating averages also helps smooth out measurement errors, effectively low-pass filtering unexpectedly odd offsets in individual data points. In this paper, we show that a simple setup deploying an LS, a TFL and a beam profiler or imager is capable of delivering an entire measurement dataset, thus mitigating the effects on measurement accuracy associated with "one-shot" measurement techniques. In the proposed technique, a Gaussian Beam from the LS passes through the TFL. Tuning the focal length of the TFL alters the spot size of the beam at the beam imager plane. Recording these different spot radii at the plane of the beam profiler for each setting of the TFL provides a measurement dataset from which a significantly improved estimate of the target distance can be obtained, as opposed to relying on a single measurement. We show that an iterative least-squares curve-fit on the recorded data allows us to estimate distances of remote objects very precisely. Using basic ray-optics-based approximations, we also obtain an initial seed value for the distance estimate and subsequently refine it through an iterative residual reduction in the least-squares sense. In our experiments, we use a MEMS-based Digital Micro-mirror Device (DMD) as a beam imager/profiler as it delivers an accurate estimate of a Gaussian Beam profile. The proposed method, its working and the distance estimation methodology are discussed in detail. As a proof of concept, we back our claims with initial experimental results.
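A hedged illustration of the least-squares estimation idea, using a simplified ray-optics spot-size model rather than the authors' exact formulation (all parameters below are invented):

```python
# Simplified ray-optics assumption: a collimated beam of radius w_lens focused
# by a lens of focal length f has a geometric spot radius of roughly
# w_lens * |1 - d/f| at target distance d. Sweeping f and fitting the recorded
# radii by least squares yields an estimate of d. Values are made up.
import numpy as np
from scipy.optimize import curve_fit

def spot_radius(f, d, w_lens):
    return w_lens * np.abs(1.0 - d / f)

true_d, true_w = 1.50, 2.0e-3                      # 1.5 m target, 2 mm beam at the lens
f_settings = np.linspace(0.8, 2.5, 25)             # tunable-lens focal lengths (m)
rng = np.random.default_rng(0)
measured = spot_radius(f_settings, true_d, true_w) + rng.normal(0, 2e-5, f_settings.size)

popt, _ = curve_fit(spot_radius, f_settings, measured, p0=[1.0, 1.0e-3])
print(f"estimated distance: {popt[0]:.3f} m")
```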

  9. Spatial distribution, sampling precision and survey design optimisation with non-normal variables: The case of anchovy (Engraulis encrasicolus) recruitment in Spanish Mediterranean waters

    NASA Astrophysics Data System (ADS)

    Tugores, M. Pilar; Iglesias, Magdalena; Oñate, Dolores; Miquel, Joan

    2016-02-01

In the Mediterranean Sea, the European anchovy (Engraulis encrasicolus) plays a key role in both ecological and economic terms. Ensuring stock sustainability requires the provision of crucial information, such as the species' spatial distribution or unbiased abundance and precision estimates, so that management strategies can be defined (e.g. fishing quotas, temporal closure areas or marine protected areas, MPAs). Furthermore, the estimation of the precision of global abundance at different sampling intensities can be used for survey design optimisation. Geostatistics provide a priori unbiased estimations of the spatial structure, global abundance and precision for autocorrelated data. However, their application to non-Gaussian data introduces difficulties in the analysis and can reduce robustness and unbiasedness. The present study applied intrinsic geostatistics in two dimensions in order to (i) analyse the spatial distribution of anchovy in Spanish Western Mediterranean waters during the species' recruitment season, (ii) produce distribution maps, (iii) estimate global abundance and its precision, (iv) analyse the effect of changing the sampling intensity on the precision of global abundance estimates and (v) evaluate the effects of several methodological options on the robustness of all the analysed parameters. The results suggested that while the spatial structure was usually non-robust to the tested methodological options when working with the original dataset, it became more robust for the transformed datasets (especially for the log-backtransformed dataset). The global abundance was always highly robust, and the global precision was highly or moderately robust to most of the methodological options, except for data transformation.
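As an illustration of the first geostatistical step mentioned above, a small sketch of an empirical isotropic semivariogram on simulated sample positions and abundances (a real analysis would then fit a variogram model and krige):

```python
# Rough sketch only: empirical (isotropic) semivariogram from irregularly
# spaced samples. Positions and abundances below are simulated stand-ins.
import numpy as np

def empirical_variogram(coords, values, bin_edges):
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    g = 0.5 * (values[:, None] - values[None, :]) ** 2      # semivariance of each pair
    iu = np.triu_indices(len(values), k=1)                  # unique pairs only
    d, g = d[iu], g[iu]
    gamma = [g[(d >= lo) & (d < hi)].mean() if np.any((d >= lo) & (d < hi)) else np.nan
             for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    return np.array(gamma)

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(200, 2))                 # sample positions (km)
values = np.log1p(rng.gamma(1.5, 10, size=200))             # log-transformed abundance
print(empirical_variogram(coords, values, bin_edges=np.arange(0, 60, 10)))
```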

  10. A Web Resource for Standardized Benchmark Datasets, Metrics, and Rosetta Protocols for Macromolecular Modeling and Design.

    PubMed

    Ó Conchúir, Shane; Barlow, Kyle A; Pache, Roland A; Ollikainen, Noah; Kundert, Kale; O'Meara, Matthew J; Smith, Colin A; Kortemme, Tanja

    2015-01-01

    The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.

  11. Dataset on predictive compressive strength model for self-compacting concrete.

    PubMed

    Ofuyatan, O M; Edeki, S O

    2018-04-01

The determination of compressive strength is affected by many variables such as the water-cement (WC) ratio, the superplasticizer (SP), the aggregate combination, and the binder combination. In this dataset article, 7-, 28-, and 90-day compressive strength models are derived using statistical analysis. The response surface methodology is used to investigate the effect of the parameters (varying percentages of ash, cement, WC, and SP) on the hardened property of compressive strength at 7, 28 and 90 days. The levels of the independent parameters are determined based on preliminary experiments. The experimental values for compressive strength at 7, 28 and 90 days and modulus of elasticity under different treatment conditions are also discussed and presented. These datasets can effectively be used for modelling and prediction in concrete production settings.
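A sketch of a response-surface-style quadratic fit for 28-day compressive strength; the mix proportions and strengths are illustrative placeholders, not the published dataset:

```python
# Response-surface-style sketch: quadratic polynomial with interactions fitted
# to strength measurements. Mix proportions and strengths are invented.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# columns: ash fraction, water/cement ratio, superplasticizer dosage
X = np.array([
    [0.10, 0.40, 0.8], [0.10, 0.45, 1.0], [0.20, 0.40, 1.2],
    [0.20, 0.45, 0.8], [0.30, 0.40, 1.0], [0.30, 0.45, 1.2],
    [0.15, 0.42, 1.1], [0.25, 0.43, 0.9],
])
y = np.array([42.0, 38.5, 40.2, 36.8, 37.5, 34.9, 39.6, 37.1])  # MPa at 28 days

rsm = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
rsm.fit(X, y)
print("predicted strength:", rsm.predict([[0.18, 0.41, 1.0]]).round(1), "MPa")
```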

  12. Integration of heterogeneous molecular networks to unravel gene-regulation in Mycobacterium tuberculosis.

    PubMed

    van Dam, Jesse C J; Schaap, Peter J; Martins dos Santos, Vitor A P; Suárez-Diez, María

    2014-09-26

Different methods have been developed to infer regulatory networks from heterogeneous omics datasets and to construct co-expression networks. Each algorithm produces different networks and efforts have been devoted to automatically integrate them into consensus sets. However, each separate set has an intrinsic value that is diluted and partly lost when building a consensus network. Here we present a methodology to generate co-expression networks and, instead of a consensus network, we propose an integration framework where the different networks are kept and analysed with additional tools to efficiently combine the information extracted from each network. We developed a workflow to efficiently analyse information generated by different inference and prediction methods. Our methodology relies on providing the user with the means to simultaneously visualise and analyse the coexisting networks generated by different algorithms, heterogeneous datasets, and a suite of analysis tools. As a showcase, we have analysed the gene co-expression networks of Mycobacterium tuberculosis generated using over 600 expression experiments. Regarding DNA damage repair, we identified SigC as a key control element, 12 new targets for LexA, an updated LexA binding motif, and a potential mismatch repair system. We expanded the DevR regulon with 27 genes while identifying 9 targets wrongly assigned to this regulon. We discovered 10 new genes linked to zinc uptake and a new regulatory mechanism for ZuR. The use of co-expression networks to perform system level analysis allows the development of custom-made methodologies. As showcases, we implemented a pipeline to integrate ChIP-seq data and another method to uncover multiple regulatory layers. Our workflow is based on representing the multiple types of information as network representations and presenting these networks in a synchronous framework that allows their simultaneous visualization while keeping specific associations from the different networks. By simultaneously exploring these networks and metadata, we gained insights into regulatory mechanisms in M. tuberculosis that could not be obtained through the separate analysis of each data type.

  13. Prediction of Solvent Physical Properties using the Hierarchical Clustering Method

    EPA Science Inventory

Recently a QSAR (Quantitative Structure Activity Relationship) method, the hierarchical clustering method, was developed to estimate acute toxicity values for large, diverse datasets. This methodology has now been applied to estimate solvent physical properties including sur...

  14. Integrating disparate lidar datasets for a regional storm tide inundation analysis of Hurricane Katrina

    USGS Publications Warehouse

    Stoker, Jason M.; Tyler, Dean J.; Turnipseed, D. Phil; Van Wilson, K.; Oimoen, Michael J.

    2009-01-01

Hurricane Katrina was one of the largest natural disasters in U.S. history. Due to the sheer size of the affected areas, an unprecedented regional analysis at very high resolution and accuracy was needed to properly quantify and understand the effects of the hurricane and the storm tide. Many disparate sources of lidar data were acquired and processed for varying environmental reasons by pre- and post-Katrina projects. The datasets were in several formats and projections and were processed to varying phases of completion, and as a result the task of producing a seamless digital elevation dataset required a high level of coordination, research, and revision. To create a seamless digital elevation dataset, many technical issues had to be resolved before producing the desired 1/9-arc-second (3-meter) grid needed as the map base for projecting the Katrina peak storm tide throughout the affected coastal region. This report presents the methodology that was developed to construct seamless digital elevation datasets from multipurpose, multi-use, and disparate lidar datasets, and describes an easily accessible Web application for viewing the maximum storm tide caused by Hurricane Katrina in southeastern Louisiana, Mississippi, and Alabama.

  15. Prediction of brain tissue temperature using near-infrared spectroscopy

    PubMed Central

    Holper, Lisa; Mitra, Subhabrata; Bale, Gemma; Robertson, Nicola; Tachtsidis, Ilias

    2017-01-01

Broadband near-infrared spectroscopy (NIRS) can provide an endogenous indicator of tissue temperature based on the temperature dependence of the water absorption spectrum. We describe a first evaluation of the calibration and prediction of brain tissue temperature obtained during hypothermia in newborn piglets (animal dataset) and rewarming in newborn infants (human dataset) based on measured body (rectal) temperature. The calibration using partial least squares regression proved to be a reliable method to predict brain tissue temperature with respect to core body temperature in the wavelength interval of 720 to 880 nm, with a strong mean predictive power of R2=0.713±0.157 (animal dataset) and R2=0.798±0.087 (human dataset). In addition, we applied regression receiver operating characteristic curves for the first time to evaluate the temperature prediction, which provided an overall mean error bias between the NIRS-predicted brain temperature and body temperature of 0.436±0.283°C (animal dataset) and 0.162±0.149°C (human dataset). We discuss the main methodological aspects, particularly the well-known aspect of over- versus underestimation between brain and body temperature, which is relevant for potential clinical applications. PMID:28630878
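A very small sketch of the calibration idea, not the authors' broadband NIRS pipeline: partial least squares regression from simulated 720-880 nm spectra to a reference temperature:

```python
# Hedged sketch: PLS regression from toy spectra (a water-like feature whose
# centre drifts with temperature, plus noise) to a reference temperature.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
wavelengths = np.linspace(720, 880, 80)
temperature = rng.uniform(33.0, 39.0, size=60)               # reference temperatures (C)
spectra = np.array([np.exp(-((wavelengths - (740 + 0.5 * t)) / 20) ** 2)
                    for t in temperature])
spectra += 0.01 * rng.standard_normal(spectra.shape)

pls = PLSRegression(n_components=4)
r2 = cross_val_score(pls, spectra, temperature, cv=5, scoring="r2")
print("mean predictive R^2:", r2.mean().round(2))
```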

  16. Breast Shape Analysis With Curvature Estimates and Principal Component Analysis for Cosmetic and Reconstructive Breast Surgery.

    PubMed

    Catanuto, Giuseppe; Taher, Wafa; Rocco, Nicola; Catalano, Francesca; Allegra, Dario; Milotta, Filippo Luigi Maria; Stanco, Filippo; Gallo, Giovanni; Nava, Maurizio Bruno

    2018-03-20

Breast shape is currently defined using mainly qualitative assessments (full, flat, ptotic) or estimates, such as volume or distances between reference points, that cannot describe it reliably. We quantitatively describe breast shape with two parameters derived from a statistical methodology termed principal component analysis (PCA). We created a heterogeneous dataset of breast shapes acquired with a commercial infrared 3-dimensional scanner on which PCA was performed. We plotted on a Cartesian plane the two highest values of PCA for each breast (principal components 1 and 2). The methodology was tested on a preoperative and postoperative surgical case, and test-retest reliability was assessed with two operators. The first two principal components derived from PCA are able to characterize the shape of the breasts included in the dataset. The test-retest demonstrated that different operators obtain very similar values of PCA. The system is also able to identify major changes between the preoperative and postoperative stages of a two-stage reconstruction. Even minor changes were correctly detected by the system. This methodology can reliably describe the shape of a breast. An expert operator and a newly trained operator can reach similar results in a test/retest validation. Once developed and after further validation, this methodology could be employed as a good tool for outcome evaluation, auditing, and benchmarking.
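A conceptual sketch of the PCA step: vectorised 3-D surface samples reduced to two shape parameters (the "scans" below are random stand-ins for registered breast surfaces, not the authors' data):

```python
# Conceptual sketch only: PCA on vectorised 3-D surface points, keeping the
# first two components as the two shape parameters described above.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_shapes, n_points = 40, 500
scans = rng.normal(size=(n_shapes, n_points * 3))    # each row: x,y,z of 500 landmarks

pca = PCA(n_components=2)
shape_params = pca.fit_transform(scans)              # (40, 2): PC1/PC2 per breast
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("first breast plotted at:", shape_params[0].round(2))
```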

  17. Antibiotic Resistome: Improving Detection and Quantification Accuracy for Comparative Metagenomics.

    PubMed

    Elbehery, Ali H A; Aziz, Ramy K; Siam, Rania

    2016-04-01

    The unprecedented rise of life-threatening antibiotic resistance (AR), combined with the unparalleled advances in DNA sequencing of genomes and metagenomes, has pushed the need for in silico detection of the resistance potential of clinical and environmental metagenomic samples through the quantification of AR genes (i.e., genes conferring antibiotic resistance). Therefore, determining an optimal methodology to quantitatively and accurately assess AR genes in a given environment is pivotal. Here, we optimized and improved existing AR detection methodologies from metagenomic datasets to properly consider AR-generating mutations in antibiotic target genes. Through comparative metagenomic analysis of previously published AR gene abundance in three publicly available metagenomes, we illustrate how mutation-generated resistance genes are either falsely assigned or neglected, which alters the detection and quantitation of the antibiotic resistome. In addition, we inspected factors influencing the outcome of AR gene quantification using metagenome simulation experiments, and identified that genome size, AR gene length, total number of metagenomics reads and selected sequencing platforms had pronounced effects on the level of detected AR. In conclusion, our proposed improvements in the current methodologies for accurate AR detection and resistome assessment show reliable results when tested on real and simulated metagenomic datasets.

  18. Terrestrial Ecosystems - Land Surface Forms of the Conterminous United States

    USGS Publications Warehouse

    Cress, Jill J.; Sayre, Roger G.; Comer, Patrick; Warner, Harumi

    2009-01-01

As part of an effort to map terrestrial ecosystems, the U.S. Geological Survey has generated land surface form classes to be used in creating maps depicting standardized, terrestrial ecosystem models for the conterminous United States, using an ecosystems classification developed by NatureServe. A biophysical stratification approach, developed for South America and now being implemented globally, was used to model the ecosystem distributions. Since land surface forms strongly influence the differentiation and distribution of terrestrial ecosystems, they are one of the key input layers in this biophysical stratification. After extensive investigation into various land surface form mapping methodologies, the decision was made to use the methodology developed by the Missouri Resource Assessment Partnership (MoRAP). MoRAP made modifications to Hammond's land surface form classification, which allowed the use of 30-meter source data and a 1-km2 window for analyzing the data cell and its surrounding cells (neighborhood analysis). While Hammond's methodology was based on three topographic variables, slope, local relief, and profile type, MoRAP's methodology uses only slope and local relief. Using the MoRAP method, slope is classified as gently sloping when more than 50 percent of the area in a 1-km2 neighborhood has slope less than 8 percent; otherwise the area is considered moderately sloping. Local relief, which is the difference between the maximum and minimum elevation in a neighborhood, is classified into five groups: 0-15 m, 16-30 m, 31-90 m, 91-150 m, and >150 m. The land surface form classes are derived by combining the two slope classes with the local relief groups to create eight landform classes, ranging from flat plains (gently sloping with the lowest local relief) through low hills to low mountains (not gently sloping with the highest local relief). However, in the USGS application of the MoRAP methodology, an additional local relief group (>400 m) was used to capture additional local topographic variation. As a result, low mountains were redefined as not gently sloping with local relief of 151-400 m. The final application of the MoRAP methodology was implemented using the USGS 30-meter National Elevation Dataset and an existing USGS slope dataset that had been derived by calculating the slope from the NED in Universal Transverse Mercator (UTM) coordinates in each UTM zone, and then combining all of the zones into a national dataset. This map shows a smoothed image of the nine land surface form classes based on MoRAP's methodology. Additional information about this map and any data developed for the ecosystems modeling of the conterminous United States is available online at http://rmgsc.cr.usgs.gov/ecosystems/.
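A simplified sketch of the slope and local-relief neighborhood classification described above, on a synthetic DEM; the thresholds follow the text where they are legible, and the square window is only an approximation of the 1-km2 neighborhood:

```python
# Simplified MoRAP-style sketch (not the USGS production workflow): slope and
# local relief computed in a moving window over a synthetic DEM, then combined
# into landform codes. Thresholds: 8% slope; relief breaks at 15/30/90/150/400 m.
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter, uniform_filter

cellsize = 30.0                                    # m, as in the National Elevation Dataset
window = 33                                        # ~1 km / 30 m

rng = np.random.default_rng(0)
dem = np.cumsum(rng.normal(0, 2, size=(200, 200)), axis=0)    # toy elevation surface

gy, gx = np.gradient(dem, cellsize)
slope_pct = np.hypot(gx, gy) * 100.0
# gently sloping where >50% of the neighborhood has slope below 8 percent
gentle = uniform_filter((slope_pct < 8).astype(float), size=window) > 0.5

relief = maximum_filter(dem, size=window) - minimum_filter(dem, size=window)
relief_class = np.digitize(relief, [15, 30, 90, 150, 400])    # 0..5

# combine: even codes for gently sloping terrain, odd for moderately sloping
landform = relief_class * 2 + (~gentle).astype(int)
print(np.unique(landform))
```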

  19. GIS measured environmental correlates of active school transport: A systematic review of 14 studies

    PubMed Central

    2011-01-01

    Background Emerging frameworks to examine active school transportation (AST) commonly emphasize the built environment (BE) as having an influence on travel mode decisions. Objective measures of BE attributes have been recommended for advancing knowledge about the influence of the BE on school travel mode choice. An updated systematic review on the relationships between GIS-measured BE attributes and AST is required to inform future research in this area. The objectives of this review are: i) to examine and summarize the relationships between objectively measured BE features and AST in children and adolescents and ii) to critically discuss GIS methodologies used in this context. Methods Six electronic databases, and websites were systematically searched, and reference lists were searched and screened to identify studies examining AST in students aged five to 18 and reporting GIS as an environmental measurement tool. Fourteen cross-sectional studies were identified. The analyses were classified in terms of density, diversity, and design and further differentiated by the measures used or environmental condition examined. Results Only distance was consistently found to be negatively associated with AST. Consistent findings of positive or negative associations were not found for land use mix, residential density, and intersection density. Potential modifiers of any relationship between these attributes and AST included age, school travel mode, route direction (e.g., to/from school), and trip-end (home or school). Methodological limitations included inconsistencies in geocoding, selection of study sites, buffer methods and the shape of zones (Modifiable Areal Unit Problem [MAUP]), the quality of road and pedestrian infrastructure data, and school route estimation. Conclusions The inconsistent use of spatial concepts limits the ability to draw conclusions about the relationship between objectively measured environmental attributes and AST. Future research should explore standardizing buffer size, assess the quality of street network datasets and, if necessary, customize existing datasets, and explore further attributes linked to safety. PMID:21545750

  20. Automatic Diabetic Macular Edema Detection in Fundus Images Using Publicly Available Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment, DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing. Our algorithm is robust to segmentation uncertainties, does not need ground truth at the lesion level, and is very fast, generating a diagnosis in an average of 4.4 seconds per image on a 2.6 GHz platform with an unoptimised Matlab implementation.

  1. Learning discriminative functional network features of schizophrenia

    NASA Astrophysics Data System (ADS)

    Gheiratmand, Mina; Rish, Irina; Cecchi, Guillermo; Brown, Matthew; Greiner, Russell; Bashivan, Pouya; Polosecki, Pablo; Dursun, Serdar

    2017-03-01

Associating schizophrenia with disrupted functional connectivity is a central idea in schizophrenia research. However, identifying neuroimaging-based features that can serve as reliable "statistical biomarkers" of the disease remains a challenging open problem. We argue that generalization accuracy and stability of candidate features ("biomarkers") must be used as additional criteria on top of standard significance tests in order to discover more robust biomarkers. Generalization accuracy refers to the utility of biomarkers for making predictions about individuals, for example discriminating between patients and controls, in novel datasets. Feature stability refers to the reproducibility of the candidate features across different datasets. Here, we extracted functional connectivity network features from fMRI data at both a high (voxel-level) resolution and a spatially down-sampled lower ("supervoxel"-level) resolution. At the supervoxel level, we used whole-brain network links, while at the voxel level, due to the intractably large number of features, we sampled a subset of them. We compared statistical significance, stability and discriminative utility of both feature types in a multi-site fMRI dataset, composed of schizophrenia patients and healthy controls. For both feature types, a considerable fraction of features showed significant differences between the two groups. Also, both feature types were similarly stable across multiple data subsets. However, the whole-brain supervoxel functional connectivity features showed a higher cross-validation classification accuracy of 78.7% vs. 72.4% for the voxel-level features. Cross-site variability and heterogeneity in the patient samples in the multi-site FBIRN dataset made the task more challenging compared to single-site studies. The use of the above methodology, in combination with the fully data-driven approach using whole-brain information, has the potential to shed light on "biomarker discovery" in schizophrenia.
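A toy sketch of the feature construction, not the FBIRN analysis itself: whole-brain connectivity links taken as the upper triangle of a region-by-region correlation matrix, scored with cross-validated classification on simulated data:

```python
# Toy sketch: connectivity "links" = upper triangle of a correlation matrix,
# scored by cross-validated classification. Time series and labels are simulated.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_regions, n_timepoints = 40, 30, 150
labels = np.array([0] * 20 + [1] * 20)                # controls vs patients

features = []
for subj in range(n_subjects):
    ts = rng.standard_normal((n_timepoints, n_regions))
    if labels[subj] == 1:                             # weak shared signal in "patients"
        ts[:, :5] += 0.4 * rng.standard_normal((n_timepoints, 1))
    conn = np.corrcoef(ts.T)                          # region x region correlations
    iu = np.triu_indices(n_regions, k=1)
    features.append(conn[iu])                         # network links as features
features = np.array(features)

acc = cross_val_score(LinearSVC(max_iter=5000), features, labels, cv=5)
print("cross-validated accuracy:", acc.mean().round(2))
```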

  2. Semantic integration of gene expression analysis tools and data sources using software connectors

    PubMed Central

    2013-01-01

Background The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated to the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. Conclusions The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data. PMID:24341380

  3. Semantic integration of gene expression analysis tools and data sources using software connectors.

    PubMed

    Miyazaki, Flávia A; Guardia, Gabriela D A; Vêncio, Ricardo Z N; de Farias, Cléver R G

    2013-10-25

    The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated to the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.

  4. A Systematic Review and Meta-Analysis of a Measure of Staff/Child Interaction Quality (the Classroom Assessment Scoring System) in Early Childhood Education and Care Settings and Child Outcomes.

    PubMed

    Perlman, Michal; Falenchuk, Olesya; Fletcher, Brooke; McMullen, Evelyn; Beyene, Joseph; Shah, Prakesh S

    2016-01-01

    The quality of staff/child interactions as measured by the Classroom Assessment Scoring System (CLASS) in Early Childhood Education and Care (ECEC) programs is thought to be important for children's outcomes. The CLASS is made of three domains that assess Emotional Support, Classroom Organization and Instructional Support. It is a relatively new measure that is being used increasingly for research, quality monitoring/accountability and other applied purposes. Our objective was to evaluate the association between the CLASS and child outcomes. Searches of Medline, PsycINFO, ERIC, websites of large datasets and reference sections of all retrieved articles were conducted up to July 3, 2015. Studies that measured association between the CLASS and child outcomes for preschool-aged children who attended ECEC programs were included after screening by two independent reviewers. Searches and data extraction were conducted by two independent reviewers. Thirty-five studies were systematically reviewed of which 19 provided data for meta-analyses. Most studies had moderate to high risk of bias. Of the 14 meta-analyses we conducted, associations between Classroom Organization and Pencil Tapping and between Instructional Support and SSRS Social Skills were significant with pooled correlations of .06 and .09 respectively. All associations were in the expected direction. In the systematic review, significant correlations were reported mainly from one large dataset. Substantial heterogeneity in use of the CLASS, its dimensions, child outcomes and statistical measures was identified. Greater consistency in study methodology is urgently needed. Given the multitude of factors that impact child development it is encouraging that our analyses revealed some, although small, associations between the CLASS and children's outcomes.
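As an illustration of how study-level correlations of this kind are pooled, a small sketch using the Fisher r-to-z transform with inverse-variance weights (the correlations and sample sizes below are invented, not the review's data):

```python
# Illustrative fixed-effect pooling of correlations via the Fisher transform;
# var(z) = 1/(n-3), so weights are n-3. All numbers are made up.
import numpy as np

r = np.array([0.04, 0.10, 0.07, 0.12, 0.02])   # hypothetical per-study correlations
n = np.array([300, 150, 800, 120, 400])        # hypothetical sample sizes

z = np.arctanh(r)                               # Fisher r-to-z
w = n - 3                                       # inverse-variance weights
z_pooled = np.sum(w * z) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
ci = np.tanh([z_pooled - 1.96 * se, z_pooled + 1.96 * se])
print("pooled r:", np.tanh(z_pooled).round(3), "95% CI:", ci.round(3))
```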

  5. A Systematic Review and Meta-Analysis of a Measure of Staff/Child Interaction Quality (the Classroom Assessment Scoring System) in Early Childhood Education and Care Settings and Child Outcomes

    PubMed Central

    Perlman, Michal; Falenchuk, Olesya; Fletcher, Brooke; McMullen, Evelyn; Beyene, Joseph; Shah, Prakesh S.

    2016-01-01

    The quality of staff/child interactions as measured by the Classroom Assessment Scoring System (CLASS) in Early Childhood Education and Care (ECEC) programs is thought to be important for children’s outcomes. The CLASS is made of three domains that assess Emotional Support, Classroom Organization and Instructional Support. It is a relatively new measure that is being used increasingly for research, quality monitoring/accountability and other applied purposes. Our objective was to evaluate the association between the CLASS and child outcomes. Searches of Medline, PsycINFO, ERIC, websites of large datasets and reference sections of all retrieved articles were conducted up to July 3, 2015. Studies that measured association between the CLASS and child outcomes for preschool-aged children who attended ECEC programs were included after screening by two independent reviewers. Searches and data extraction were conducted by two independent reviewers. Thirty-five studies were systematically reviewed of which 19 provided data for meta-analyses. Most studies had moderate to high risk of bias. Of the 14 meta-analyses we conducted, associations between Classroom Organization and Pencil Tapping and between Instructional Support and SSRS Social Skills were significant with pooled correlations of .06 and .09 respectively. All associations were in the expected direction. In the systematic review, significant correlations were reported mainly from one large dataset. Substantial heterogeneity in use of the CLASS, its dimensions, child outcomes and statistical measures was identified. Greater consistency in study methodology is urgently needed. Given the multitude of factors that impact child development it is encouraging that our analyses revealed some, although small, associations between the CLASS and children’s outcomes. PMID:28036333

  6. A Regression Model for Predicting Shape Deformation after Breast Conserving Surgery

    PubMed Central

    Zolfagharnasab, Hooshiar; Bessa, Sílvia; Oliveira, Sara P.; Faria, Pedro; Teixeira, João F.; Cardoso, Jaime S.

    2018-01-01

Breast cancer treatments can have a negative impact on breast aesthetics, especially when surgery is needed to remove the tumor. For many years mastectomy was the only surgical option, but more recently breast conserving surgery (BCS) has been promoted as a viable alternative to treat cancer while preserving most of the breast. However, there is still a significant number of patients who have undergone BCS and are dissatisfied with the result of the treatment, which leads to self-image issues and emotional distress. Surgeons recognize the value of a tool to predict the breast shape after BCS, to facilitate surgeon/patient communication and allow more educated decisions; however, no such tool suited for clinical usage is available. Such tools could serve as a way of visually anticipating the aesthetic consequences of the treatment. In this research, we propose a methodology for predicting the deformation after BCS using machine learning techniques. However, there is no appropriate dataset containing breast data before and after surgery with which to train a learning model. Therefore, an in-house semi-synthetic dataset is proposed to fulfil the requirements of this research. Using the proposed dataset, several learning methodologies were investigated, and promising outcomes were obtained. PMID:29315279

  7. Statistical Frequency-Dependent Analysis of Trial-to-Trial Variability in Single Time Series by Recurrence Plots.

    PubMed

    Tošić, Tamara; Sellers, Kristin K; Fröhlich, Flavio; Fedotenkova, Mariia; Beim Graben, Peter; Hutt, Axel

    2015-01-01

    For decades, research in neuroscience has supported the hypothesis that brain dynamics exhibits recurrent metastable states connected by transients, which together encode fundamental neural information processing. To understand the system's dynamics it is important to detect such recurrence domains, but it is challenging to extract them from experimental neuroscience datasets due to the large trial-to-trial variability. The proposed methodology extracts recurrent metastable states in univariate time series by transforming datasets into their time-frequency representations and computing recurrence plots based on instantaneous spectral power values in various frequency bands. Additionally, a new statistical inference analysis compares different trial recurrence plots with corresponding surrogates to obtain statistically significant recurrent structures. This combination of methods is validated by applying it to two artificial datasets. In a final study of visually-evoked Local Field Potentials in partially anesthetized ferrets, the methodology is able to reveal recurrence structures of neural responses with trial-to-trial variability. Focusing on different frequency bands, the δ-band activity is much less recurrent than α-band activity. Moreover, α-activity is susceptible to pre-stimuli, while δ-activity is much less sensitive to pre-stimuli. This difference in recurrence structures in different frequency bands indicates diverse underlying information processing steps in the brain.
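A schematic version of the recurrence-plot construction described above, on a synthetic signal: band-limited spectral power over time, then a thresholded pairwise-distance matrix (the alpha band is used here only as an example):

```python
# Schematic sketch (synthetic signal, not the ferret LFP data): compute a
# spectrogram, average power in one frequency band, and threshold pairwise
# distances of that power over time to obtain a recurrence matrix.
import numpy as np
from scipy.signal import spectrogram

fs = 250.0
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 10 * t) * (np.sin(2 * np.pi * 0.2 * t) > 0)   # intermittent 10 Hz
x = x + 0.5 * rng.standard_normal(t.size)

f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
alpha = Sxx[(f >= 8) & (f <= 12)].mean(axis=0)        # instantaneous alpha-band power

d = np.abs(alpha[:, None] - alpha[None, :])           # pairwise distances over time
eps = np.quantile(d, 0.10)                            # recurrence threshold (10% of distances)
recurrence = (d < eps).astype(int)                    # 1 where spectral states recur
print(recurrence.shape, recurrence.mean().round(2))
```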

  8. Statistical Frequency-Dependent Analysis of Trial-to-Trial Variability in Single Time Series by Recurrence Plots

    PubMed Central

    Tošić, Tamara; Sellers, Kristin K.; Fröhlich, Flavio; Fedotenkova, Mariia; beim Graben, Peter; Hutt, Axel

    2016-01-01

    For decades, research in neuroscience has supported the hypothesis that brain dynamics exhibits recurrent metastable states connected by transients, which together encode fundamental neural information processing. To understand the system's dynamics it is important to detect such recurrence domains, but it is challenging to extract them from experimental neuroscience datasets due to the large trial-to-trial variability. The proposed methodology extracts recurrent metastable states in univariate time series by transforming datasets into their time-frequency representations and computing recurrence plots based on instantaneous spectral power values in various frequency bands. Additionally, a new statistical inference analysis compares different trial recurrence plots with corresponding surrogates to obtain statistically significant recurrent structures. This combination of methods is validated by applying it to two artificial datasets. In a final study of visually-evoked Local Field Potentials in partially anesthetized ferrets, the methodology is able to reveal recurrence structures of neural responses with trial-to-trial variability. Focusing on different frequency bands, the δ-band activity is much less recurrent than α-band activity. Moreover, α-activity is susceptible to pre-stimuli, while δ-activity is much less sensitive to pre-stimuli. This difference in recurrence structures in different frequency bands indicates diverse underlying information processing steps in the brain. PMID:26834580

  9. Application of Artificial Neural Networks to the Development of Improved Multi-Sensor Retrievals of Near-Surface Air Temperature and Humidity Over Ocean

    NASA Technical Reports Server (NTRS)

    Roberts, J. Brent; Robertson, Franklin R.; Clayson, Carol Anne

    2012-01-01

    Improved estimates of near-surface air temperature and air humidity are critical to the development of more accurate turbulent surface heat fluxes over the ocean. Recent progress in retrieving these parameters has been made through the application of artificial neural networks (ANN) and the use of multi-sensor passive microwave observations. Details are provided on the development of an improved retrieval algorithm that applies the nonlinear statistical ANN methodology to a set of observations from the Advanced Microwave Scanning Radiometer (AMSR-E) and the Advanced Microwave Sounding Unit (AMSU-A) that are currently available from the NASA AQUA satellite platform. Statistical inversion techniques require an adequate training dataset to properly capture embedded physical relationships. The development of multiple training datasets containing only in-situ observations, only synthetic observations produced using the Community Radiative Transfer Model (CRTM), or a mixture of each is discussed. An intercomparison of results using each training dataset is provided to highlight the relative advantages and disadvantages of each methodology. Particular emphasis will be placed on the development of retrievals in cloudy versus clear-sky conditions. Near-surface air temperature and humidity retrievals using the multi-sensor ANN algorithms are compared to previous linear and non-linear retrieval schemes.
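A hedged sketch of the statistical-inversion idea, not the AMSR-E/AMSU-A algorithm itself: a small neural network maps simulated brightness temperatures to near-surface air temperature and humidity:

```python
# Hedged sketch: multi-output neural-network regression from toy "brightness
# temperatures" to air temperature and specific humidity. All data are simulated.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 2000
air_temp = rng.uniform(270, 305, n)                         # K
humidity = rng.uniform(2, 22, n)                            # g/kg
# toy brightness temperatures in 6 channels, each mixing the two targets
tb = np.column_stack([air_temp * 0.8 + humidity * c + rng.normal(0, 0.5, n)
                      for c in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)])

X_train, X_test, y_train, y_test = train_test_split(
    tb, np.column_stack([air_temp, humidity]), test_size=0.25, random_state=0)

ann = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
ann.fit(X_train, y_train)
print("R^2 on held-out samples:", round(ann.score(X_test, y_test), 3))
```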

  10. A Method for Gene-Based Pathway Analysis Using Genomewide Association Study Summary Statistics Reveals Nine New Type 1 Diabetes Associations

    PubMed Central

    Evangelou, Marina; Smyth, Deborah J; Fortune, Mary D; Burren, Oliver S; Walker, Neil M; Guo, Hui; Onengut-Gumuscu, Suna; Chen, Wei-Min; Concannon, Patrick; Rich, Stephen S; Todd, John A; Wallace, Chris

    2014-01-01

    Pathway analysis can complement point-wise single nucleotide polymorphism (SNP) analysis in exploring genomewide association study (GWAS) data to identify specific disease-associated genes that can be candidate causal genes. We propose a straightforward methodology that can be used for conducting a gene-based pathway analysis using summary GWAS statistics in combination with widely available reference genotype data. We used this method to perform a gene-based pathway analysis of a type 1 diabetes (T1D) meta-analysis GWAS (of 7,514 cases and 9,045 controls). An important feature of the conducted analysis is the removal of the major histocompatibility complex gene region, the major genetic risk factor for T1D. Thirty-one of the 1,583 (2%) tested pathways were identified to be enriched for association with T1D at a 5% false discovery rate. We analyzed these 31 pathways and their genes to identify SNPs in or near these pathway genes that showed potentially novel association with T1D and attempted to replicate the association of 22 SNPs in additional samples. Replication P-values were skewed, with 12 of the 22 SNPs showing evidence of association in the replication samples. Support, including replication evidence, was obtained for nine T1D-associated variants in the genes ITGB7 (rs11170466), NRP1 (rs722988), BAD (rs694739), CTSB (rs1296023), FYN (rs11964650), UBE2G1 (rs9906760), MAP3K14 (rs17759555), ITGB1 (rs1557150), and IL7R (rs1445898). The proposed methodology can be applied to other GWAS datasets for which only summary level data are available. PMID:25371288
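
    The 5% false-discovery-rate step can be illustrated with a short sketch using the Benjamini-Hochberg procedure on made-up pathway P-values (the real analysis tested 1,583 pathways; the values below are synthetic).

        # Minimal sketch: Benjamini-Hochberg FDR control over pathway p-values (synthetic values).
        import numpy as np
        from statsmodels.stats.multitest import multipletests

        rng = np.random.default_rng(9)
        pvals = np.concatenate([rng.uniform(0, 1e-3, 30), rng.uniform(0, 1, 1553)])  # 1,583 toy pathways
        rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
        print("pathways enriched at 5% FDR:", rejected.sum())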

  11. A whole brain morphometric analysis of changes associated with pre-term birth

    NASA Astrophysics Data System (ADS)

    Thomaz, C. E.; Boardman, J. P.; Counsell, S.; Hill, D. L. G.; Hajnal, J. V.; Edwards, A. D.; Rutherford, M. A.; Gillies, D. F.; Rueckert, D.

    2006-03-01

    Pre-term birth is strongly associated with subsequent neuropsychiatric impairment. To identify structural differences in preterm infants we have examined a dataset of magnetic resonance (MR) images containing 88 preterm infants and 19 term born controls. We have analyzed these images by combining image registration, deformation based morphometry (DBM), multivariate statistics, and effect size maps (ESM). The methodology described has been performed directly on the MR intensity images rather than on segmented versions of the images. The results indicate that the approach described makes clear the statistical differences between the control and preterm samples, showing a leave-one-out classification accuracy of 94.74% and 95.45%, respectively. In addition, by finding the most discriminant direction between the groups and using DBM features and ESM, we are able to identify not only what the changes between the preterm and term groups are, but also how relevant they are in terms of volume expansion and contraction.

  12. Predicting the disease of Alzheimer with SNP biomarkers and clinical data using data mining classification approach: decision tree.

    PubMed

    Erdoğan, Onur; Aydin Son, Yeşim

    2014-01-01

    Single Nucleotide Polymorphisms (SNPs) are the most common genomic variations, where only a single nucleotide differs between individuals. Individual SNPs and SNP profiles associated with diseases can be utilized as biological markers. But there is a need to determine the SNP subsets and the patients' clinical data that are informative for the diagnosis. Data mining approaches have the highest potential for extracting knowledge from genomic datasets and for selecting the representative SNPs as well as the most effective and informative clinical features for the clinical diagnosis of diseases. In this study, we have applied one of the most widely used data mining classification methodologies, the "decision tree", for associating SNP biomarkers and significant clinical data with Alzheimer's disease (AD), the most common form of dementia. Different tree construction parameters have been compared for optimization, and the most accurate tree for predicting AD is presented.
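
    A minimal sketch of the classification step, assuming minor-allele-count genotype coding and a single clinical covariate on synthetic data (this is not the study's pipeline or its optimized tree parameters):

        # Minimal sketch: decision-tree classification of SNP genotypes plus a clinical feature.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(2)
        n = 400
        snps = rng.integers(0, 3, size=(n, 10))          # genotypes coded 0/1/2 (minor-allele count)
        age = rng.integers(60, 90, size=(n, 1))          # example clinical feature (years)
        X = np.hstack([snps, age])
        y = (snps[:, 0] + 0.05 * age[:, 0] + rng.normal(0, 1, n) > 4).astype(int)  # toy disease label

        tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
        print("CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())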

  13. Revisiting the reported signal of acute pancreatitis with rasburicase: an object lesson in pharmacovigilance

    PubMed Central

    Hauben, Manfred; Hung, Eric Y.

    2016-01-01

    Introduction: There is an interest in methodologies to expeditiously detect credible signals of drug-induced pancreatitis. An example is the reported signal of pancreatitis with rasburicase emerging from a study [the ‘index publication’ (IP)] combining quantitative signal detection findings from a spontaneous reporting system (SRS) and electronic health records (EHRs). The signal was reportedly supported by a clinical review with a case series manuscript in progress. The reported signal is noteworthy, being initially classified as a false-positive finding for the chosen reference standard, but reclassified as a ‘clinically supported’ signal. Objective: This paper has dual objectives: to revisit the signal of rasburicase and acute pancreatitis and extend the original analysis via reexamination of its findings, in light of more contemporary data; and to motivate discussions on key issues in signal detection and evaluation, including recent findings from a major international pharmacovigilance research initiative. Methodology: We used the same methodology as the IP, including the same disproportionality analysis software/dataset for calculating observed to expected reporting frequencies (O/Es), Medical Dictionary for Regulatory Activities Preferred Term, and O/E metric/threshold combination defining a signal of disproportionate reporting. Baseline analysis results prompted supplementary analyses using alternative analytical choices. We performed a comprehensive literature search to identify additional published case reports of rasburicase and pancreatitis. Results: We could not replicate positive findings (e.g. a signal or statistic of disproportionate reporting) from the SRS data using the same algorithm, software, dataset and vendor specified in the IP. The reporting association was statistically highlighted in default and supplemental analysis when more sensitive forms of disproportionality analysis were used. Two of three reports in the FAERS database were assessed as likely duplicate reports. We did not identify any additional reports in the FAERS corresponding to the three cases identified in the IP using EHRs. We did not identify additional published reports of pancreatitis associated with rasburicase. Discussion: Our exercise stimulated interesting discussions of key points in signal detection and evaluation, including causality assessment, signal detection algorithm performance, pharmacovigilance terminology, duplicate reporting, mechanisms for communicating signals, the structure of the FAERS database, and recent results from a major international pharmacovigilance research initiative. PMID:27298720
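
    For readers unfamiliar with observed-to-expected reporting statistics, the sketch below computes one common disproportionality metric, the proportional reporting ratio (PRR), from a 2x2 table of report counts. The IP and this paper used a specific commercial software/dataset and metric; the counts here are invented for illustration only.

        # Generic illustration of one O/E-style disproportionality statistic (PRR); counts are made up.
        a = 3       # reports: drug of interest with event of interest
        b = 120     # reports: drug of interest, other events
        c = 900     # reports: other drugs with event of interest
        d = 250000  # reports: other drugs, other events

        prr = (a / (a + b)) / (c / (c + d))
        print(f"PRR = {prr:.2f}")   # values well above 1 with adequate counts may flag a signal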

  14. On the quantification and efficient propagation of imprecise probabilities resulting from small datasets

    NASA Astrophysics Data System (ADS)

    Zhang, Jiaxin; Shields, Michael D.

    2018-01-01

    This paper addresses the problem of uncertainty quantification and propagation when data for characterizing probability distributions are scarce. We propose a methodology wherein the full uncertainty associated with probability model form and parameter estimation is retained and efficiently propagated. This is achieved by applying the information-theoretic multimodel inference method to identify plausible candidate probability densities and the associated probabilities that each model is the best in the Kullback-Leibler sense. The joint parameter densities for each plausible model are then estimated using Bayes' rule. We then propagate this full set of probability models by estimating an optimal importance sampling density that is representative of all plausible models, propagating this density, and reweighting the samples according to each of the candidate probability models. This is in contrast with conventional methods that try to identify a single probability model that encapsulates the full uncertainty caused by lack of data and consequently underestimate uncertainty. The result is a complete probabilistic description of both aleatory and epistemic uncertainty achieved with several orders of magnitude reduction in computational cost. It is shown how the model can be updated to adaptively accommodate added data and added candidate probability models. The method is applied for uncertainty analysis of plate buckling strength where it is demonstrated how dataset size affects the confidence (or lack thereof) we can place in statistical estimates of response when data are lacking.
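
    One common way to obtain such model probabilities is through information-criterion (Akaike) weights over a candidate set of distributions; the sketch below is an illustration under assumed candidates and synthetic data, not the authors' implementation.

        # Minimal sketch: fit candidate distributions to a small sample and convert AIC into model probabilities.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        data = rng.lognormal(mean=1.0, sigma=0.4, size=20)   # a scarce dataset (n = 20), synthetic

        candidates = {"lognorm": stats.lognorm, "gamma": stats.gamma, "weibull": stats.weibull_min}
        aic = {}
        for name, dist in candidates.items():
            params = dist.fit(data)
            loglik = np.sum(dist.logpdf(data, *params))
            aic[name] = 2 * len(params) - 2 * loglik

        delta = {k: v - min(aic.values()) for k, v in aic.items()}
        weights = {k: np.exp(-0.5 * d) for k, d in delta.items()}
        total = sum(weights.values())
        for k in weights:
            print(k, "model probability:", round(weights[k] / total, 3))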

  15. Network-Assisted Investigation of Combined Causal Signals from Genome-Wide Association Studies in Schizophrenia

    PubMed Central

    Jia, Peilin; Wang, Lily; Fanous, Ayman H.; Pato, Carlos N.; Edwards, Todd L.; Zhao, Zhongming

    2012-01-01

    With the recent success of genome-wide association studies (GWAS), a wealth of association data has been accumulated for more than 200 complex diseases/traits, posing a strong demand for data integration and interpretation. A combinatory analysis of multiple GWAS datasets, or an integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we proposed an integrative analysis framework of multiple GWAS datasets by overlaying association signals onto the protein-protein interaction network, and demonstrated it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested these genes are involved in neuronal related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had P_meta < 1×10−4, including the gene HLA-DQA1 located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrated that our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex diseases/traits where multiple GWAS datasets are available. PMID:22792057

  16. A Computational Approach to Qualitative Analysis in Large Textual Datasets

    PubMed Central

    Evans, Michael S.

    2014-01-01

    In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on broader discourse, how to validate substantive inferences from small samples of textual data, and how to determine if identified cases are part of a consistent temporal pattern. PMID:24498398
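
    A minimal sketch of the core technique, probabilistic topic modeling, on a toy corpus (the study analyzed 14,952 newspaper documents; the documents and topic count below are placeholders):

        # Minimal sketch: latent Dirichlet allocation topic modeling on a toy corpus.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = [
            "senate debates science funding and research policy",
            "new telescope advances astronomy research",
            "city council debates school funding policy",
            "genome research reveals new disease genes",
        ]
        X = CountVectorizer(stop_words="english").fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
        print(lda.transform(X))   # per-document topic proportions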

  17. APPLICATION OF BENCHMARK DOSE METHODOLOGY TO DATA FROM PRENATAL DEVELOPMENTAL TOXICITY STUDIES

    EPA Science Inventory

    The benchmark dose (BMD) concept was applied to 246 conventional developmental toxicity datasets from government, industry and commercial laboratories. Five modeling approaches were used, two generic and three specific to developmental toxicity (DT models). BMDs for both quantal ...

  18. [Active surveillance of adverse drug reaction in the era of big data: challenge and opportunity for control selection].

    PubMed

    Wang, S F; Zhan, S Y

    2016-07-01

    Electronic healthcare databases have become an important source for active surveillance of drug safety in the era of big data. The traditional epidemiology research designs are needed to confirm the association between drug use and adverse events based on these datasets, and the selection of the comparative control is essential to each design. This article aims to explain the principle and application of each type of control selection, introduce the methods and parameters for method comparison, and describe the latest achievements in the batch processing of control selection, which would provide important methodological reference for the use of electronic healthcare databases to conduct post-marketing drug safety surveillance in China.

  19. Applications of the LBA-ECO Metadata Warehouse

    NASA Astrophysics Data System (ADS)

    Wilcox, L.; Morrell, A.; Griffith, P. C.

    2006-05-01

    The LBA-ECO Project Office has developed a system to harvest and warehouse metadata resulting from the Large-Scale Biosphere Atmosphere Experiment in Amazonia. The harvested metadata is used to create dynamically generated reports, available at www.lbaeco.org, which facilitate access to LBA-ECO datasets. The reports are generated for specific controlled vocabulary terms (such as an investigation team or a geospatial region), and are cross-linked with one another via these terms. This approach creates a rich contextual framework enabling researchers to find datasets relevant to their research. It maximizes data discovery by association and provides a greater understanding of the scientific and social context of each dataset. For example, our website provides a profile (e.g. participants, abstract(s), study sites, and publications) for each LBA-ECO investigation. Linked from each profile is a list of associated registered dataset titles, each of which link to a dataset profile that describes the metadata in a user-friendly way. The dataset profiles are generated from the harvested metadata, and are cross-linked with associated reports via controlled vocabulary terms such as geospatial region. The region name appears on the dataset profile as a hyperlinked term. When researchers click on this link, they find a list of reports relevant to that region, including a list of dataset titles associated with that region. Each dataset title in this list is hyperlinked to its corresponding dataset profile. Moreover, each dataset profile contains hyperlinks to each associated data file at its home data repository and to publications that have used the dataset. We also use the harvested metadata in administrative applications to assist quality assurance efforts. These include processes to check for broken hyperlinks to data files, automated emails that inform our administrators when critical metadata fields are updated, dynamically generated reports of metadata records that link to datasets with questionable file formats, and dynamically generated region/site coordinate quality assurance reports. These applications are as important as those that facilitate access to information because they help ensure a high standard of quality for the information. This presentation will discuss reports currently in use, provide a technical overview of the system, and discuss plans to extend this system to harvest metadata resulting from the North American Carbon Program by drawing on datasets in many different formats, residing in many thematic data centers and also distributed among hundreds of investigators.

  20. High resolution global gridded data for use in population studies

    NASA Astrophysics Data System (ADS)

    Lloyd, Christopher T.; Sorichetta, Alessandro; Tatem, Andrew J.

    2017-01-01

    Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; (iv) and slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. Datasets and production methodology are here described. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website.

  1. High resolution global gridded data for use in population studies.

    PubMed

    Lloyd, Christopher T; Sorichetta, Alessandro; Tatem, Andrew J

    2017-01-31

    Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; (iv) and slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. Datasets and production methodology are here described. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website.

  2. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision threatening complication of diabetic retinopathy. In a large scale screening environment DME can be assessed by detecting exudates (a type of bright lesions) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at lesion level to reject false positives and is computationally efficient, as it generates a diagnosis on an average of 4.4 s (9.3 s, considering the optic nerve localization) per image on a 2.6 GHz platform with an unoptimized Matlab implementation.

  3. Mapping moderate-scale land-cover over very large geographic areas within a collaborative framework: A case study of the Southwest Regional Gap Analysis Project (SWReGAP)

    USGS Publications Warehouse

    Lowry, J.; Ramsey, R.D.; Thomas, K.; Schrupp, D.; Sajwaj, T.; Kirby, J.; Waller, E.; Schrader, S.; Falzarano, S.; Langs, L.; Manis, G.; Wallace, C.; Schulz, K.; Comer, P.; Pohs, K.; Rieth, W.; Velasquez, C.; Wolk, B.; Kepner, W.; Boykin, K.; O'Brien, L.; Bradford, D.; Thompson, B.; Prior-Magee, J.

    2007-01-01

    Land-cover mapping efforts within the USGS Gap Analysis Program have traditionally been state-centered, with each state responsible for implementing a project design for the geographic area within its boundaries. The Southwest Regional Gap Analysis Project (SWReGAP) was the first formal GAP project designed at a regional, multi-state scale. The project area comprises the southwestern states of Arizona, Colorado, Nevada, New Mexico, and Utah. The land-cover map/dataset was generated using regionally consistent geospatial data (Landsat ETM+ imagery (1999-2001) and DEM derivatives), similar field data collection protocols, a standardized land-cover legend, and a common modeling approach (decision tree classifier). Partitioning of mapping responsibilities amongst the five collaborating states was organized around ecoregion-based "mapping zones". Over the course of 2.5 field seasons, approximately 93,000 reference samples were collected directly, or obtained from other contemporary projects, for the land-cover modeling effort. The final map was made public in 2004 and contains 125 land-cover classes. An internal validation of 85 of the classes, representing 91% of the land area, was performed. Agreement between withheld samples and the validated dataset was 61% (KHAT = .60, n = 17,030). This paper presents an overview of the methodologies used to create the regional land-cover dataset and highlights issues associated with large-area mapping within a coordinated, multi-institutional management framework. © 2006 Elsevier Inc. All rights reserved.

  4. Framework for National Flood Risk Assessment for Canada

    NASA Astrophysics Data System (ADS)

    Elshorbagy, A. A.; Raja, B.; Lakhanpal, A.; Razavi, S.; Ceola, S.; Montanari, A.

    2016-12-01

    Worldwide, floods are among the most common catastrophic natural events, resulting in the loss of life and property. These natural hazards cannot be avoided, but their consequences can certainly be reduced by having prior knowledge of their occurrence and impact. In the context of floods, the terms occurrence and impact are substituted by flood hazard and flood vulnerability, respectively, which collectively define the flood risk. There is a high need to identify the flood-prone areas and to quantify the risk associated with them. The present study aims at delivering flood risk maps, which prioritize the potential flood risk areas in Canada. The methodology adopted in this study involves integrating various available spatial datasets, such as nightlights satellite imagery, land use, population and the digital elevation model, to build a flexible framework for national flood risk assessment for Canada. The flood risk framework assists in identifying the flood-prone areas and evaluating the associated risk. All these spatial datasets were brought to a common GIS platform for flood risk analysis. The spatial datasets deliver the socioeconomic and topographical information that is required for evaluating the flood vulnerability and flood hazard, respectively. Nightlights were investigated as a proxy for human activity to identify areas of economic investment. However, other datasets, including existing flood protection measures, were added to arrive at a realistic flood assessment framework. Furthermore, the city of Calgary was used as an example to investigate the effect of using Digital Elevation Models (DEMs) of varying resolutions on risk maps. Along with this, the risk map for the city was further enhanced by including the population data to give a social dimension to the risk map. Flood protection measures play a major role by significantly reducing the flood risk of events with a specific return period. An analysis to update the risk maps when information on protection measures is available was carried out for the city of Winnipeg, Canada. The proposed framework is a promising approach to identify and prioritize flood-prone areas, which are in need of intervention or detailed studies.
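
    The overlay idea can be sketched in a few lines: normalize proxy layers, combine them into hazard and vulnerability surfaces, and multiply the two into a risk surface. The layers, weights, and thresholds below are placeholders, not the Canadian datasets or the study's actual weighting scheme.

        # Minimal sketch: combining raster layers into a toy flood-risk surface.
        import numpy as np

        rng = np.random.default_rng(10)
        elevation = rng.random((200, 200))        # stand-in for a DEM-derived hazard proxy
        nightlights = rng.random((200, 200))      # stand-in for economic activity
        population = rng.random((200, 200))       # stand-in for population density

        def normalize(a):
            return (a - a.min()) / (a.max() - a.min())

        hazard = 1.0 - normalize(elevation)                              # lower ground -> higher hazard (toy rule)
        vulnerability = 0.5 * normalize(nightlights) + 0.5 * normalize(population)
        risk = hazard * vulnerability                                    # risk = hazard x vulnerability
        print("highest-risk cell:", np.unravel_index(risk.argmax(), risk.shape))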

  5. Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis.

    PubMed

    Yi, Ming; Mudunuri, Uma; Che, Anney; Stephens, Robert M

    2009-06-29

    One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself. We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets. This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at http://www.abcc.ncifcrf.gov/wps/wps_index.php.

  6. Understanding ageing in older Australians: The contribution of the Dynamic Analyses to Optimise Ageing (DYNOPTA) project to the evidenced base and policy

    PubMed Central

    Anstey, Kaarin J; Bielak, Allison AM; Birrell, Carole L; Browning, Colette J; Burns, Richard A; Byles, Julie; Kiley, Kim M; Nepal, Binod; Ross, Lesley A; Steel, David; Windsor, Timothy D

    2014-01-01

    Aim: To describe the Dynamic Analyses to Optimise Ageing (DYNOPTA) project and illustrate its contributions to understanding ageing through innovative methodology, and investigations on outcomes based on the project themes. DYNOPTA provides a platform and technical expertise that may be used to combine other national and international datasets. Method: The DYNOPTA project has pooled and harmonized data from nine Australian longitudinal studies to create the largest available longitudinal dataset (N=50,652) on ageing in Australia. Results: A range of findings have resulted from the study to date, including methodological advances, prevalence rates of disease and disability, and mapping trajectories of ageing with and without increasing morbidity. DYNOPTA also forms the basis of a microsimulation model that will provide projections of future costs of disease and disability for the baby boomer cohort. Conclusion: DYNOPTA contributes significantly to the Australian evidence-base on ageing to inform key social and health policy domains. PMID:22032767

  7. Response Surface Methodology Using a Fullest Balanced Model: A Re-Analysis of a Dataset in the Korean Journal for Food Science of Animal Resources.

    PubMed

    Rheem, Sungsue; Rheem, Insoo; Oh, Sejong

    2017-01-01

    Response surface methodology (RSM) is a useful set of statistical techniques for modeling and optimizing responses in research studies of food science. In the analysis of response surface data, a second-order polynomial regression model is usually used. However, sometimes we encounter situations where the fit of the second-order model is poor. If the model fitted to the data has a poor fit, including a lack of fit, the modeling and optimization results might not be accurate. In such a case, using a fullest balanced model, which has no lack of fit, can fix such a problem, enhancing the accuracy of the response surface modeling and optimization. This article presents how to develop and use such a model for better modeling and optimization of the response, through an illustrative re-analysis of a dataset in Park et al. (2014) published in the Korean Journal for Food Science of Animal Resources.
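
    For reference, the standard second-order response surface model mentioned above can be fitted as follows; the sketch uses synthetic data for two coded factors rather than the Park et al. (2014) dataset, and it shows only the conventional second-order fit, not the fullest balanced model proposed in the article.

        # Minimal sketch: second-order (quadratic) response surface fit for two factors.
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(4)
        x1 = rng.uniform(-1, 1, 30)
        x2 = rng.uniform(-1, 1, 30)
        y = 5 + 2 * x1 - 3 * x2 + 1.5 * x1 * x2 - 2 * x1**2 + x2**2 + rng.normal(0, 0.3, 30)
        df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

        # y = b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2
        model = smf.ols("y ~ x1 + x2 + x1:x2 + I(x1**2) + I(x2**2)", data=df).fit()
        print(model.summary().tables[1])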

  8. Internal Consistency of the NVAP Water Vapor Dataset

    NASA Technical Reports Server (NTRS)

    Suggs, Ronnie J.; Jedlovec, Gary J.; Arnold, James E. (Technical Monitor)

    2001-01-01

    The NVAP (NASA Water Vapor Project) dataset is a global dataset at 1 x 1 degree spatial resolution consisting of daily, pentad, and monthly atmospheric precipitable water (PW) products. The analysis blends measurements from the Television and Infrared Operational Satellite (TIROS) Operational Vertical Sounder (TOVS), the Special Sensor Microwave/Imager (SSM/I), and radiosonde observations into a daily collage of PW. The original dataset consisted of five years of data from 1988 to 1992. Recent updates have added three additional years (1993-1995) and incorporated procedural and algorithm changes from the original methodology. Since none of the PW sources (TOVS, SSM/I, and radiosonde) provides global coverage on its own, the sources complement one another by providing spatial coverage over regions and during times where the others are not available. For this type of spatial and temporal blending to be successful, each of the source components should have similar or compatible accuracies. If this is not the case, regional and time varying biases may be manifested in the NVAP dataset. This study examines the consistency of the NVAP source data by comparing daily collocated TOVS and SSM/I PW retrievals with collocated radiosonde PW observations. The daily PW intercomparisons are performed over the time period of the dataset and for various regions.

  9. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation.

    PubMed

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S; Langs, Georg; Simader, Christian; Waldstein, Sebastian M; Schmidt-Erfurth, Ursula M

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited due to the availability of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprised of 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge.

  10. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation

    PubMed Central

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S.; Langs, Georg; Simader, Christian

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited due to the availability of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprised of 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge. PMID:27579177

  11. Perception of Virtual Audiences.

    PubMed

    Chollet, Mathieu; Scherer, Stefan

    2017-01-01

    A growing body of evidence shows that virtual audiences are a valuable tool in the treatment of social anxiety, and recent work shows that they are also useful in public-speaking training programs. However, little research has focused on how such audiences are perceived and on how the behavior of virtual audiences can be manipulated to create various types of stimuli. The authors used a crowdsourcing methodology to create a virtual audience nonverbal behavior model and, with it, created a dataset of videos with virtual audiences containing varying behaviors. Using this dataset, they investigated how virtual audiences are perceived and which factors affect this perception.

  12. Toward a science of tumor forecasting for clinical oncology.

    PubMed

    Yankeelov, Thomas E; Quaranta, Vito; Evans, Katherine J; Rericha, Erin C

    2015-03-15

    We propose that the quantitative cancer biology community make a concerted effort to apply lessons from weather forecasting to develop an analogous methodology for predicting and evaluating tumor growth and treatment response. Currently, the time course of tumor response is not predicted; instead, response is only assessed post hoc by physical examination or imaging methods. This fundamental practice within clinical oncology limits the optimization of a treatment regimen for an individual patient, as well as the ability to determine in real time whether the choice was in fact appropriate. This is especially frustrating at a time when a panoply of molecularly targeted therapies is available, and precision genetic or proteomic analyses of tumors are an established reality. By learning from the methods of weather and climate modeling, we submit that the forecasting power of biophysical and biomathematical modeling can be harnessed to hasten the arrival of a field of predictive oncology. With a successful methodology toward tumor forecasting, it should be possible to integrate large tumor-specific datasets of varied types and effectively defeat cancer one patient at a time. ©2015 American Association for Cancer Research.

  13. An approach for reduction of false predictions in reverse engineering of gene regulatory networks.

    PubMed

    Khan, Abhinandan; Saha, Goutam; Pal, Rajat Kumar

    2018-05-14

    A gene regulatory network discloses the regulatory interactions amongst genes under a particular condition of the human body. The accurate reconstruction of such networks from time-series genetic expression data using computational tools offers a stiff challenge for contemporary computer scientists. This is crucial to facilitate the understanding of the proper functioning of a living organism. Unfortunately, the computational methods produce many false predictions along with the correct predictions, which is unwanted. Investigations in the domain focus on identifying as many correct regulations as possible in the reverse engineering of gene regulatory networks, to make the results more reliable and biologically relevant. One way to achieve this is to reduce the number of incorrect predictions in the reconstructed networks. In the present investigation, we have proposed a novel scheme to decrease the number of false predictions by suitably combining several metaheuristic techniques. We have also implemented the scheme using a dataset ensemble approach (i.e., combining multiple datasets). We have employed the proposed methodology on real-world experimental datasets of the SOS DNA Repair network of Escherichia coli and the IMRA network of Saccharomyces cerevisiae. Subsequently, we have experimented upon somewhat larger, in silico networks, namely, DREAM3 and DREAM4 Challenge networks, and 15-gene and 20-gene networks extracted from the GeneNetWeaver database. To study the effect of multiple datasets on the quality of the inferred networks, we have used four datasets in each experiment. The results are encouraging, as the proposed methodology can reduce the number of false predictions significantly, without using any supplementary prior biological information for larger gene regulatory networks. It is also observed that if a small amount of prior biological information is incorporated, the results improve further with respect to the prediction of true positives. Copyright © 2018 Elsevier Ltd. All rights reserved.

  14. GLEAM version 3: Global Land Evaporation Datasets and Model

    NASA Astrophysics Data System (ADS)

    Martens, B.; Miralles, D. G.; Lievens, H.; van der Schalie, R.; de Jeu, R.; Fernandez-Prieto, D.; Verhoest, N.

    2015-12-01

    Terrestrial evaporation links energy, water and carbon cycles over land and is therefore a key variable of the climate system. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to limitations in in situ measurements. As a result, several methods have arisen to estimate global patterns of land evaporation from satellite observations. However, these algorithms generally differ in their approach to model evaporation, resulting in large differences in their estimates. One of these methods is GLEAM, the Global Land Evaporation: the Amsterdam Methodology. GLEAM estimates terrestrial evaporation based on daily satellite observations of meteorological variables, vegetation characteristics and soil moisture. Since the publication of the first version of the algorithm (2011), the model has been widely applied to analyse trends in the water cycle and land-atmospheric feedbacks during extreme hydrometeorological events. A third version of the GLEAM global datasets is foreseen by the end of 2015. Given the relevance of having a continuous and reliable record of global-scale evaporation estimates for climate and hydrological research, the establishment of an online data portal to make these data publicly available is also foreseen. In this new release of the GLEAM datasets, different components of the model have been updated, with the most significant change being the revision of the data assimilation algorithm. In this presentation, we will highlight the most important changes of the methodology and present three new GLEAM datasets and their validation against in situ observations and an alternative dataset of terrestrial evaporation (ERA-Land). Results of the validation exercise indicate that the magnitude and the spatiotemporal variability of the modelled evaporation agree reasonably well with the estimates of ERA-Land and the in situ observations. It is also shown that the performance of the revised model is higher than that of the original one.

  15. A methodology for the design of experiments in computational intelligence with multiple regression models.

    PubMed

    Fernandez-Lozano, Carlos; Gestal, Marcos; Munteanu, Cristian R; Dorado, Julian; Pazos, Alejandro

    2016-01-01

    The design of experiments and the validation of the results achieved with them are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence and especially on a correct comparison between the different results provided for different methods, as those techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational intelligence is implemented in an R package called RRegrs. This package includes ten simple and complex regression models to carry out predictive modeling using Machine Learning and well-known regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results are different for three out of five state-of-the-art simple datasets and it can be stated that the selection of the best model according to our proposal is statistically significant and relevant. It is of relevance to use a statistical approach to indicate whether the differences are statistically significant using this kind of algorithms. Furthermore, our results with three real complex datasets report different best models than with the previously published methodology. Our final goal is to provide a complete methodology for the use of different steps in order to compare the results obtained in Computational Intelligence problems, as well as from other fields, such as for bioinformatics, cheminformatics, etc., given that our proposal is open and modifiable.
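
    RRegrs itself is an R package; purely as an illustration of the comparison step (repeated cross-validation of several regression models followed by a statistical test of their error distributions), a Python sketch on a synthetic dataset might look like this:

        # Minimal sketch: compare regression models with repeated CV and a nonparametric test.
        import numpy as np
        from scipy.stats import friedmanchisquare
        from sklearn.datasets import make_regression
        from sklearn.linear_model import LinearRegression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.svm import SVR
        from sklearn.model_selection import cross_val_score, RepeatedKFold

        X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
        cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
        models = {"lm": LinearRegression(), "rf": RandomForestRegressor(random_state=0), "svr": SVR()}
        scores = {name: cross_val_score(m, X, y, cv=cv, scoring="neg_root_mean_squared_error")
                  for name, m in models.items()}

        stat, p = friedmanchisquare(*scores.values())   # do the per-fold errors differ across models?
        print({k: round(-v.mean(), 2) for k, v in scores.items()}, "p =", round(p, 4))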

  16. A methodology for the design of experiments in computational intelligence with multiple regression models

    PubMed Central

    Gestal, Marcos; Munteanu, Cristian R.; Dorado, Julian; Pazos, Alejandro

    2016-01-01

    The design of experiments and the validation of the results achieved with them are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence and especially on a correct comparison between the different results provided for different methods, as those techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational intelligence is implemented in an R package called RRegrs. This package includes ten simple and complex regression models to carry out predictive modeling using Machine Learning and well-known regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results are different for three out of five state-of-the-art simple datasets and it can be stated that the selection of the best model according to our proposal is statistically significant and relevant. It is of relevance to use a statistical approach to indicate whether the differences are statistically significant using this kind of algorithms. Furthermore, our results with three real complex datasets report different best models than with the previously published methodology. Our final goal is to provide a complete methodology for the use of different steps in order to compare the results obtained in Computational Intelligence problems, as well as from other fields, such as for bioinformatics, cheminformatics, etc., given that our proposal is open and modifiable. PMID:27920952

  17. Methodology for Computational Fluid Dynamic Validation for Medical Use: Application to Intracranial Aneurysm.

    PubMed

    Paliwal, Nikhil; Damiano, Robert J; Varble, Nicole A; Tutino, Vincent M; Dou, Zhongwang; Siddiqui, Adnan H; Meng, Hui

    2017-12-01

    Computational fluid dynamics (CFD) is a promising tool to aid in clinical diagnoses of cardiovascular diseases. However, it uses assumptions that simplify the complexities of the real cardiovascular flow. Due to the high stakes in the clinical setting, it is critical to quantify the effect of these assumptions on the CFD simulation results. However, existing CFD validation approaches do not quantify error in the simulation results due to the CFD solver's modeling assumptions. Instead, they directly compare CFD simulation results against validation data. Thus, to quantify the accuracy of a CFD solver, we developed a validation methodology that calculates the CFD model error (arising from modeling assumptions). Our methodology identifies independent error sources in CFD and validation experiments, and calculates the model error by parsing out other sources of error inherent in simulation and experiments. To demonstrate the method, we simulated the flow field of a patient-specific intracranial aneurysm (IA) in the commercial CFD software STAR-CCM+. Particle image velocimetry (PIV) provided validation datasets for the flow field on two orthogonal planes. The average model error in the STAR-CCM+ solver was 5.63 ± 5.49% along the intersecting validation line of the orthogonal planes. Furthermore, we demonstrated that our validation method is superior to existing validation approaches by applying three representative existing validation techniques to our CFD and experimental dataset, and comparing the validation results. Our validation methodology offers a streamlined workflow to extract the "true" accuracy of a CFD solver.

  18. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  19. Remote-sensing application for facilitating land resource assessment and monitoring for utility-scale solar energy development

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hamada, Yuki; Grippo, Mark A.

    2015-01-01

    A monitoring plan that incorporates regional datasets and integrates cost-effective data collection methods is necessary to sustain the long-term environmental monitoring of utility-scale solar energy development in expansive, environmentally sensitive desert environments. Using very high spatial resolution (VHSR; 15 cm) multispectral imagery collected in November 2012 and January 2014, an image processing routine was developed to characterize ephemeral streams, vegetation, and land surface in the southwestern United States where increased utility-scale solar development is anticipated. In addition to knowledge about desert landscapes, the methodology integrates existing spectral indices and transformations (e.g., the visible atmospherically resistant index and principal components); a newly developed index, the erosion resistance index (ERI); and digital terrain and surface models, all of which were derived from a common VHSR image. The methodology identified fine-scale ephemeral streams with greater detail than the National Hydrography Dataset and accurately estimated vegetation distribution and fractional cover of various surface types. The ERI classified surface types that have a range of erosive potentials. The remote-sensing methodology could ultimately reduce uncertainty and monitoring costs for all stakeholders by providing a cost-effective monitoring approach that accurately characterizes the land resources at potential development sites.
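
    As an example of the spectral indices mentioned above, the visible atmospherically resistant index (VARI) can be computed per pixel from red, green, and blue reflectance; the arrays and vegetation threshold below are placeholders, not the VHSR imagery or the study's calibration.

        # Minimal sketch: per-pixel VARI from visible bands (placeholder arrays).
        import numpy as np

        rng = np.random.default_rng(5)
        red, green, blue = (rng.random((100, 100)) for _ in range(3))   # stand-ins for VHSR bands

        vari = (green - red) / (green + red - blue + 1e-9)   # small constant avoids division by zero
        vegetation_mask = vari > 0.1                          # threshold is illustrative only
        print("fractional vegetation cover (toy):", vegetation_mask.mean())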

  20. No association between SNP rs498055 on chromosome 10 and late-onset Alzheimer disease in multiple datasets.

    PubMed

    Liang, Xueying; Schnetz-Boutaud, Nathalie; Bartlett, Jackie; Allen, Melissa J; Gwirtsman, Harry; Schmechel, Don E; Carney, Regina M; Gilbert, John R; Pericak-Vance, Margaret A; Haines, Jonathan L

    2008-01-01

    SNP rs498055 in the predicted gene LOC439999 on chromosome 10 was recently identified as being strongly associated with late-onset Alzheimer disease (LOAD). This SNP falls within a chromosomal region that has engendered continued interest from both preliminary genetic linkage and candidate gene studies. To independently evaluate this interesting candidate SNP we examined four independent datasets, three family-based and one case-control. All the cases were late-onset AD Caucasian patients with minimum age at onset ≥ 60 years. None of the three family samples or the combined family-based dataset showed association in either allelic or genotypic family-based association tests at p < 0.05. Both original and OSA two-point LOD scores were calculated. However, there was no evidence indicating linkage no matter what covariates were applied (the highest LOD score was 0.82). The case-control dataset did not demonstrate any association between this SNP and AD (all p-values > 0.52). Our results do not confirm the previous association, but are consistent with a more recent negative association result that used family-based association tests to examine the effect of this SNP in two family datasets. Thus we conclude that rs498055 is not associated with an increased risk of LOAD.

  1. Disentangling methodological and biological sources of gene tree discordance on oryza (poaceae) chromosome 3

    USDA-ARS?s Scientific Manuscript database

    We describe new methods for characterizing gene tree discordance in phylogenomic datasets, which screen for deviations from neutral expectations, summarize variation in statistical support among gene trees, and allow comparison of the patterns of discordance induced by various analysis choices. Usin...

  2. USEEIO v1.1-Matrices

    EPA Science Inventory

    This dataset provides the basic building blocks for the USEEIO v1.1 model and life cycle results per $1 (2013 USD) demand for all goods and services in the model in the producer's price (see BEA 2015). The methodology underlying USEEIO is described in Yang, Ingwersen et al., 2017...

  3. The Development of the Global Citizenship Inventory for Adolescents

    ERIC Educational Resources Information Center

    Van Gent, Marije; Carabain, Christine; De Goede, Irene; Boonstoppel, Evelien; Hogeling, Lette

    2013-01-01

    In this paper we report on the development of an inventory that measures global citizenship among adolescents. The methodology used consists of cognitive interviews for questionnaire design and explorative and confirmatory factor analyses among several datasets. The resulting Global Citizenship Inventory (GCI) includes a global citizenship…

  4. A Reliable Methodology for Determining Seed Viability by Using Hyperspectral Data from Two Sides of Wheat Seeds.

    PubMed

    Zhang, Tingting; Wei, Wensong; Zhao, Bin; Wang, Ranran; Li, Mingliu; Yang, Liming; Wang, Jianhua; Sun, Qun

    2018-03-08

    This study investigated the possibility of using visible and near-infrared (VIS/NIR) hyperspectral imaging techniques to discriminate viable and non-viable wheat seeds. Both sides of individual seeds were subjected to hyperspectral imaging (400-1000 nm) to acquire reflectance spectral data. Four spectral datasets, including the ventral groove side, reverse side, mean (the mean of two sides' spectra of every seed), and mixture datasets (two sides' spectra of every seed), were used to construct the models. Classification models, namely partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM), coupled with several pre-processing methods and the successive projections algorithm (SPA), were built for the identification of viable and non-viable seeds. Our results showed that the standard normal variate (SNV)-SPA-PLS-DA model had high classification accuracy in the prediction set for whole seeds (>85.2%) and for viable seeds (>89.5%), based on the mixture spectral dataset and using only 16 wavebands. After screening with this model, the final germination of the seed lot could be higher than 89.5%. Here, we develop a reliable methodology for predicting the viability of wheat seeds, showing that VIS/NIR hyperspectral imaging is an accurate technique for the classification of viable and non-viable wheat seeds in a non-destructive manner.
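
    A minimal sketch of the preprocessing-plus-classification chain, standard normal variate (SNV) correction followed by a PLS-DA style classifier, on synthetic spectra (band count, labels, and the 0.5 decision threshold are assumptions; the SPA waveband selection step is omitted):

        # Minimal sketch: SNV preprocessing followed by PLS regression on a binary label (PLS-DA style).
        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(6)
        n_seeds, n_bands = 200, 120
        spectra = rng.random((n_seeds, n_bands)) + np.linspace(0, 1, n_bands)   # toy reflectance spectra
        viable = rng.integers(0, 2, n_seeds)                                    # toy viability labels
        spectra[viable == 1, 40:56] += 0.3                                      # plant a band effect for the example

        def snv(X):
            # Standard normal variate: centre and scale each spectrum individually
            return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

        Xtr, Xte, ytr, yte = train_test_split(snv(spectra), viable, random_state=0)
        pls = PLSRegression(n_components=5).fit(Xtr, ytr)
        pred = (pls.predict(Xte).ravel() > 0.5).astype(int)
        print("accuracy:", (pred == yte).mean())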

  5. A Reliable Methodology for Determining Seed Viability by Using Hyperspectral Data from Two Sides of Wheat Seeds

    PubMed Central

    Zhang, Tingting; Wei, Wensong; Zhao, Bin; Wang, Ranran; Li, Mingliu; Yang, Liming; Wang, Jianhua; Sun, Qun

    2018-01-01

    This study investigated the possibility of using visible and near-infrared (VIS/NIR) hyperspectral imaging techniques to discriminate viable and non-viable wheat seeds. Both sides of individual seeds were subjected to hyperspectral imaging (400–1000 nm) to acquire reflectance spectral data. Four spectral datasets, including the ventral groove side, reverse side, mean (the mean of two sides’ spectra of every seed), and mixture datasets (two sides’ spectra of every seed), were used to construct the models. Classification models, namely partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM), coupled with several pre-processing methods and the successive projections algorithm (SPA), were built for the identification of viable and non-viable seeds. Our results showed that the standard normal variate (SNV)-SPA-PLS-DA model had high classification accuracy in the prediction set for whole seeds (>85.2%) and for viable seeds (>89.5%), based on the mixture spectral dataset and using only 16 wavebands. After screening with this model, the final germination of the seed lot could be higher than 89.5%. Here, we develop a reliable methodology for predicting the viability of wheat seeds, showing that VIS/NIR hyperspectral imaging is an accurate technique for the classification of viable and non-viable wheat seeds in a non-destructive manner. PMID:29517991

  6. A Metastatistical Approach to Satellite Estimates of Extreme Rainfall Events

    NASA Astrophysics Data System (ADS)

    Zorzetto, E.; Marani, M.

    2017-12-01

    The estimation of the average recurrence interval of intense rainfall events is a central issue for both hydrologic modeling and engineering design. These estimates require the inference of the properties of the right tail of the statistical distribution of precipitation, a task often performed using the Generalized Extreme Value (GEV) distribution, estimated either from a sample of annual maxima (AM) or with a peaks over threshold (POT) approach. However, these approaches require long and homogeneous rainfall records, which often are not available, especially in the case of remotely sensed rainfall datasets. Here we use an alternative approach, tailored to remotely sensed rainfall estimates and based on the metastatistical extreme value distribution (MEVD), which produces estimates of rainfall extreme values based on the probability distribution function (pdf) of all measured 'ordinary' rainfall events. This methodology also accounts for the interannual variations observed in the pdf of daily rainfall by integrating over the sample space of its random parameters. We illustrate the application of this framework to the TRMM Multi-satellite Precipitation Analysis rainfall dataset, where MEVD optimally exploits the relatively short datasets of satellite-sensed rainfall, while taking full advantage of its high spatial resolution and quasi-global coverage. The accuracy of TRMM precipitation estimates and scale issues are investigated here for a case study located in the Little Washita watershed, Oklahoma, using a dense network of rain gauges for independent ground validation. The methodology contributes to our understanding of the risk of extreme rainfall events, as it allows i) an optimal use of the TRMM datasets in estimating the tail of the probability distribution of daily rainfall, and ii) a global mapping of daily rainfall extremes and distributional tail properties, bridging the existing gaps in rain gauge networks.
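
    As a rough illustration of the MEVD idea, the sketch below fits a Weibull distribution to each year's 'ordinary' wet-day rainfall and compounds the yearly fits into a distribution of annual maxima, F_max(q) ≈ (1/M) Σ_j F_j(q)^n_j; the rainfall record is synthetic and the code is not the authors' implementation.

```python
# Toy MEV calculation on synthetic wet-day rainfall (mm); one Weibull fit per year.
import numpy as np
from scipy.stats import weibull_min
from scipy.optimize import brentq

rng = np.random.default_rng(1)
years = {y: weibull_min.rvs(0.8, scale=10, size=rng.integers(60, 120), random_state=y)
         for y in range(1998, 2018)}

# Shape/scale per year (location fixed at zero) plus the number of wet days.
fits = [(weibull_min.fit(x, floc=0), len(x)) for x in years.values()]

def mev_cdf(q):
    """MEV cdf of the annual maximum: average of the yearly F(q)**n_j terms."""
    return np.mean([weibull_min.cdf(q, c, loc=0, scale=s) ** n for (c, _, s), n in fits])

def return_level(T):
    """Daily rainfall amount exceeded on average once every T years."""
    return brentq(lambda q: mev_cdf(q) - (1.0 - 1.0 / T), 1e-3, 1e4)

print("50-year daily rainfall (mm):", round(return_level(50), 1))
```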

  7. Upscaling river biomass using dimensional analysis and hydrogeomorphic scaling

    NASA Astrophysics Data System (ADS)

    Barnes, Elizabeth A.; Power, Mary E.; Foufoula-Georgiou, Efi; Hondzo, Miki; Dietrich, William E.

    2007-12-01

    We propose a methodology for upscaling biomass in a river using a combination of dimensional analysis and hydro-geomorphologic scaling laws. We first demonstrate the use of dimensional analysis for determining local scaling relationships between Nostoc biomass and hydrologic and geomorphic variables. We then combine these relationships with hydraulic geometry and streamflow scaling in order to upscale biomass from point to reach-averaged quantities. The methodology is demonstrated through an illustrative example using an 18-year dataset of seasonal monitoring of biomass of a stream cyanobacterium (Nostoc parmeloides) in a northern California river.

  8. Using Random Forest Models to Predict Organizational Violence

    NASA Technical Reports Server (NTRS)

    Levine, Burton; Bobashev, Georgly

    2012-01-01

    We present a methodology to assess the proclivity of an organization to commit violence against nongovernment personnel. We fitted a Random Forest model using the Minority at Risk Organizational Behavior (MAROS) dataset. The MAROS data are longitudinal, so individual observations are not independent. We propose a modification to the standard Random Forest methodology to account for the violation of the independence assumption. We present the results of the model fit, an example of predicting violence for an organization; and finally, we present a summary of the forest in a "meta-tree."
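
    One common way to respect the non-independence of repeated observations from the same organization is to bootstrap whole organizations, rather than individual records, when growing each tree. The sketch below illustrates that idea on invented data (the covariates and the `org_id` grouping are hypothetical, not the MAROS schema) and is not the authors' exact modification.

```python
# Cluster-bootstrap random forest: each tree sees a resample of organizations.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n_orgs, years = 50, 10
org_id = np.repeat(np.arange(n_orgs), years)
X = rng.normal(size=(n_orgs * years, 5))                        # hypothetical covariates
y = (X[:, 0] + rng.normal(size=len(org_id)) > 0.5).astype(int)  # hypothetical violence flag

def cluster_bootstrap_forest(X, y, groups, n_trees=200):
    trees, ids = [], np.unique(groups)
    for _ in range(n_trees):
        sampled = rng.choice(ids, size=len(ids), replace=True)   # resample whole orgs
        rows = np.concatenate([np.flatnonzero(groups == g) for g in sampled])
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows]))
    return trees

def forest_score(trees, X):
    """Fraction of trees voting 'violent' for each record."""
    return np.mean([t.predict(X) for t in trees], axis=0)

forest = cluster_bootstrap_forest(X, y, org_id)
print("predicted proclivity, first five records:", forest_score(forest, X[:5]).round(2))
```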

  9. Systematic review of the evidence related to mandated nurse staffing ratios in acute hospitals.

    PubMed

    Olley, Richard; Edwards, Ian; Avery, Mark; Cooper, Helen

    2018-04-17

    Objective The purpose of this systematic review was to evaluate and summarise available research on nurse staffing methods and relate these to outcomes under three overarching themes of: (1) management of clinical risk, quality and safety; (2) development of a new or innovative staffing methodology; and (3) equity of nursing workload. Methods The PRISMA method was used. Relevant articles were located by searching via the Griffith University Library electronic catalogue, including articles on PubMed, Cumulative Index to Nursing and Allied Health Literature (CINAHL) and Medline. Only English language publications published between 1 January 2010 and 30 April 2016 focusing on methodologies in acute hospital in-patient units were included in the present review. Results Two of the four staffing methods were found to have evidence-based articles from empirical studies within the parameters set for inclusion. Of the four staffing methodologies searched, supply and demand returned 10 studies and staffing ratios returned 11. Conclusions There is a need to develop an evidence-based nurse-sensitive outcomes measure upon which staffing for safety, quality and workplace equity can be based, as well as an instrument that reliably and validly projects nurse staffing requirements in a variety of clinical settings. Nurse-sensitive indicators reflect elements of patient care that are directly affected by nursing practice. In addition, these measures must take into account patient satisfaction, workload and staffing, clinical risks and other measures of the quality and safety of care and nurses' work satisfaction. What is known about the topic? Nurse staffing is a controversial topic that has significant patient safety, quality of care, human resources and financial implications. In acute care services, nursing accounts for approximately 70% of salaries and wages paid by health services budgets, and evidence as to the efficacy and effectiveness of any staffing methodology is required because it has workforce and industrial relations implications. Although there is significant literature available on the topic, there is a paucity of empirical evidence supporting claims of increased patient safety in the acute hospital setting, but some evidence exists relating to equity of workload for nurses. What does this paper add? This paper provides a contemporary qualitative analysis of empirical evidence using PRISMA methodology to conduct a systematic review of the available literature. It demonstrates a significant research gap to support claims of increased patient safety in the acute hospital setting. The paper calls for greatly improved datasets upon which research can be undertaken to determine any associations between mandated patient-to-nurse ratios and other staffing methodologies and patient safety and quality of care. What are the implications for practitioners? There is insufficient contemporary research to support staffing methodologies for appropriate staffing, balanced workloads and quality, safe care. Such research would include the establishment of nurse-sensitive patient outcomes measures, and more robust datasets are needed for empirical analysis to produce such evidence.

  10. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research

    PubMed Central

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R

    2018-01-01

    Objectives This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. Methods A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. Results A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Conclusions Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. PMID:29084729

  11. Participant Observation and the Political Scientist: Possibilities, Priorities, and Practicalities

    ERIC Educational Resources Information Center

    Gillespie, Andra; Michelson, Melissa R.

    2011-01-01

    Surveys, experiments, large-"N" datasets and formal models are common instruments in the political scientist's toolkit. In-depth interviews and focus groups play a critical role in helping scholars answer important political questions. In contrast, participant observation techniques are an underused methodological approach. In this article, we…

  12. Initial Development and Validation of the Global Citizenship Scale

    ERIC Educational Resources Information Center

    Morais, Duarte B.; Ogden, Anthony C.

    2011-01-01

    The purpose of this article is to report on the initial development of a theoretically grounded and empirically validated scale to measure global citizenship. The methodology employed is multi-faceted, including two expert face validity trials, extensive exploratory and confirmatory factor analyses with multiple datasets, and a series of three…

  13. Diurnal Soil Temperature Effects within the Globe[R] Program Dataset

    ERIC Educational Resources Information Center

    Witter, Jason D.; Spongberg, Alison L.; Czajkowski, Kevin P.

    2007-01-01

    Long-term collection of soil temperature with depth is important when studying climate change. The international program GLOBE[R] provides an excellent opportunity to collect such data, although currently endorsed temperature collection protocols need to be refined. To enhance data quality, protocol-based methodology and automated data logging,…

  14. Dataset for reporting of thymic epithelial tumours: recommendations from the International Collaboration on Cancer Reporting (ICCR).

    PubMed

    Nicholson, Andrew G; Detterbeck, Frank; Marx, Alexander; Roden, Anja C; Marchevsky, Alberto M; Mukai, Kiyoshi; Chen, Gang; Marino, Mirella; den Bakker, Michael A; Yang, Woo-Ick; Judge, Meagan; Hirschowitz, Lynn

    2017-03-01

    The International Collaboration on Cancer Reporting (ICCR) is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom, the College of American Pathologists, the Canadian Association of Pathologists-Association Canadienne des Pathologists in association with the Canadian Partnership Against Cancer, and the European Society of Pathology. Its goal is to produce standardized, internationally agreed, evidence-based datasets for use throughout the world. This article describes the development of a cancer dataset by the multidisciplinary ICCR expert panel for the reporting of thymic epithelial tumours. The dataset includes 'required' (mandatory) and 'recommended' (non-mandatory) elements, which are validated by a review of current evidence and supported by explanatory text. Seven required elements and 12 recommended elements were agreed by the international dataset authoring committee to represent the essential information for the reporting of thymic epithelial tumours. The use of an internationally agreed, structured pathology dataset for reporting thymic tumours provides all of the necessary information for optimal patient management, facilitates consistent and accurate data collection, and provides valuable data for research and international benchmarking. The dataset also provides a valuable resource for those countries and institutions that are not in a position to develop their own datasets. © 2016 John Wiley & Sons Ltd.

  15. General practitioner (family physician) workforce in Australia: comparing geographic data from surveys, a mailing list and medicare

    PubMed Central

    2013-01-01

    Background Good quality spatial data on Family Physicians or General Practitioners (GPs) are key to accurately measuring geographic access to primary health care. The validity of computed associations between health outcomes and measures of GP access such as GP density is contingent on geographical data quality. This is especially true in rural and remote areas, where GPs are often small in number and geographically dispersed. However, there has been limited effort in assessing the quality of nationally comprehensive, geographically explicit, GP datasets in Australia or elsewhere. Our objective is to assess the extent of association or agreement between different spatially explicit nationwide GP workforce datasets in Australia. This is important since disagreement would imply differential relationships with primary healthcare relevant outcomes with different datasets. We also seek to enumerate these associations across categories of rurality or remoteness. Method We compute correlations of GP headcounts and workload contributions between four different datasets at two different geographical scales, across varying levels of rurality and remoteness. Results The datasets are in general agreement with each other at two different scales. Small numbers of absolute headcounts, with relatively larger fractions of locum GPs in rural areas cause unstable statistical estimates and divergences between datasets. Conclusion In the Australian context, many of the available geographic GP workforce datasets may be used for evaluating valid associations with health outcomes. However, caution must be exercised in interpreting associations between GP headcounts or workloads and outcomes in rural and remote areas. The methods used in these analyses may be replicated in other locales with multiple GP or physician datasets. PMID:24005003

  16. Boosting association rule mining in large datasets via Gibbs sampling.

    PubMed

    Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua

    2016-05-03

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.

  17. Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors.

    PubMed

    Woodard, Dawn B; Crainiceanu, Ciprian; Ruppert, David

    2013-01-01

    We propose a new method for regression using a parsimonious and scientifically interpretable representation of functional predictors. Our approach is designed for data that exhibit features such as spikes, dips, and plateaus whose frequency, location, size, and shape varies stochastically across subjects. We propose Bayesian inference of the joint functional and exposure models, and give a method for efficient computation. We contrast our approach with existing state-of-the-art methods for regression with functional predictors, and show that our method is more effective and efficient for data that include features occurring at varying locations. We apply our methodology to a large and complex dataset from the Sleep Heart Health Study, to quantify the association between sleep characteristics and health outcomes. Software and technical appendices are provided in online supplemental materials.

  18. A Framework for Spatial Interaction Analysis Based on Large-Scale Mobile Phone Data

    PubMed Central

    Li, Weifeng; Cheng, Xiaoyun; Guo, Gaohua

    2014-01-01

    The overall understanding of spatial interaction and exact knowledge of its dynamic evolution are required in urban planning and transportation planning. This study aimed to analyze spatial interaction based on large-scale mobile phone data. This newly available mass dataset required a new methodology compatible with its particular characteristics. A three-stage framework was proposed in this paper, including data preprocessing, critical activity identification, and spatial interaction measurement. The proposed framework introduced frequent pattern mining and measured spatial interaction by the obtained associations. A case study of three communities in Shanghai was carried out as verification of the proposed method and a demonstration of its practical application. The spatial interaction patterns and the representative features proved the rationality of the proposed framework. PMID:25435865
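
    The role of frequent pattern mining in such a framework can be shown with a toy example: counting how often pairs of zones co-occur among a user's identified activity locations gives a simple interaction measure between zones. The zone labels and user traces below are invented.

```python
# Count zone pairs that frequently co-occur in the same user's activity set.
from collections import Counter
from itertools import combinations

user_zones = {                       # hypothetical "critical activity" zones per user
    "u1": {"A", "B"}, "u2": {"A", "B", "C"},
    "u3": {"B", "C"}, "u4": {"A", "C"}, "u5": {"A", "B"},
}
pair_counts = Counter()
for zones in user_zones.values():
    pair_counts.update(combinations(sorted(zones), 2))

# Pairs supported by many users indicate stronger spatial interaction.
for pair, n in pair_counts.most_common():
    print(pair, round(n / len(user_zones), 2))
```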

  19. Application of crowd-sourced data to multi-scale evolutionary exposure and vulnerability models

    NASA Astrophysics Data System (ADS)

    Pittore, Massimiliano

    2016-04-01

    Seismic exposure, defined as the assets (population, buildings, infrastructure) exposed to earthquake hazard and susceptible to damage, is a critical -but often neglected- component of seismic risk assessment. This partly stems from the burden associated with the compilation of a useful and reliable model over wide spatial areas. While detailed engineering data have still to be collected in order to constrain exposure and vulnerability models, the availability of increasingly large crowd-sourced datasets (e.g. OpenStreetMap) opens up the exciting possibility to generate incrementally evolving models. Integrating crowd-sourced and authoritative data using statistical learning methodologies can reduce model uncertainties and also provide additional drive and motivation for volunteered geoinformation collection. A case study in Central Asia will be presented and discussed.

  20. New temperature model of the Netherlands from new data and novel modelling methodology

    NASA Astrophysics Data System (ADS)

    Bonté, Damien; Struijk, Maartje; Békési, Eszter; Cloetingh, Sierd; van Wees, Jan-Diederik

    2017-04-01

    Deep geothermal energy has grown in interest in Western Europe in the last decades, for direct use but also, as knowledge of the subsurface improves, for electricity generation. In the Netherlands, where the sector took off with the first system in 2005, geothermal energy is seen as a key player for a sustainable future. Knowledge of the subsurface temperature, together with the available flow from the reservoir, is an important factor that can determine the success of a geothermal energy project. To support the development of deep geothermal energy systems in the Netherlands, we made a first assessment of the subsurface temperature based on thermal data but also on geological elements (Bonté et al, 2012). An outcome of this work was ThermoGIS, which uses the temperature model. This work is a revision of the model that is used in ThermoGIS. The improvements on the first model are multiple: we have improved not only the dataset used for the calibration and the structural model, but also the methodology, through improved software (called b3t). The temperature dataset has been updated by integrating temperatures from newly accessible wells. The sedimentary description of the basin has been improved by using an updated and refined structural model and an improved lithological definition. A major improvement comes from the methodology used to perform the modelling: with b3t the calibration is made not only using the lithospheric parameters but also using the thermal conductivity of the sediments. The result is a much more accurate definition of the parameters for the model and a better handling of the calibration process. The outcome is a precise and improved temperature model of the Netherlands. The thermal conductivity variation in the sediments, associated with the geometry of the layers, is an important factor in temperature variations, and the influence of the Zechstein salt in the north of the country is important. In addition, the radiogenic heat production in the crust shows a significant impact. From the temperature values we also identify, in the lower part of the basin, deep convective systems that could be major geothermal energy targets in the future.

  1. Computer assisted screening, correction, and analysis of historical weather measurements

    NASA Astrophysics Data System (ADS)

    Burnette, Dorian J.; Stahle, David W.

    2013-04-01

    A computer program, Historical Observation Tools (HOB Tools), has been developed to facilitate many of the calculations used by historical climatologists to develop instrumental and documentary temperature and precipitation datasets and makes them readily accessible to other researchers. The primitive methodology used by the early weather observers makes the application of standard techniques difficult. HOB Tools provides a step-by-step framework to visually and statistically assess, adjust, and reconstruct historical temperature and precipitation datasets. These routines include the ability to check for undocumented discontinuities, adjust temperature data for poor thermometer exposures and diurnal averaging, and assess and adjust daily precipitation data for undercount. This paper provides an overview of the Visual Basic.NET program and a demonstration of how it can assist in the development of extended temperature and precipitation datasets using modern and early instrumental measurements from the United States.

  2. Automatic detection of blood vessels in retinal images for diabetic retinopathy diagnosis.

    PubMed

    Raja, D Siva Sundhara; Vasuki, S

    2015-01-01

    Diabetic retinopathy (DR) is a leading cause of vision loss in diabetic patients. DR is mainly caused by damage to the retinal blood vessels in diabetic patients. It is essential to detect and segment the retinal blood vessels for DR detection and diagnosis, which helps prevent early vision loss in diabetic patients. Computer-aided automatic detection and segmentation of blood vessels, through elimination of the optic disc (OD) region in the retina, is proposed in this paper. The OD region is segmented using an anisotropic diffusion filter, and subsequently the retinal blood vessels are detected using mathematical binary morphological operations. The proposed methodology is tested on two different publicly available datasets and achieved 93.99% sensitivity, 98.37% specificity, and 98.08% accuracy on the DRIVE dataset, and 93.6% sensitivity, 98.96% specificity, and 95.94% accuracy on the STARE dataset.
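
    A simplified morphology-based vessel detector can be sketched with OpenCV as below. It follows the general recipe (smooth, enhance dark elongated structures, threshold) rather than the exact anisotropic-diffusion pipeline evaluated in the paper, and the input filename is a placeholder.

```python
# Rough vessel-enhancement sketch on a fundus photograph (placeholder filename).
import cv2

img = cv2.imread("fundus.png")
green = cv2.GaussianBlur(img[:, :, 1], (5, 5), 0)   # green channel; blur stands in for diffusion

# CLAHE boosts local contrast; black-hat morphology highlights dark, thin vessels.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(green)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
blackhat = cv2.morphologyEx(clahe, cv2.MORPH_BLACKHAT, kernel)

# Binarise (Otsu) and remove small speckles with an opening.
_, vessels = cv2.threshold(blackhat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
vessels = cv2.morphologyEx(vessels, cv2.MORPH_OPEN,
                           cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
cv2.imwrite("vessels.png", vessels)
```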

  3. High resolution global gridded data for use in population studies

    PubMed Central

    Lloyd, Christopher T.; Sorichetta, Alessandro; Tatem, Andrew J.

    2017-01-01

    Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; (iv) and slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. Datasets and production methodology are here described. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website. PMID:28140386

  4. Model methodology for estimating pesticide concentration extremes based on sparse monitoring data

    USGS Publications Warehouse

    Vecchia, Aldo V.

    2018-03-22

    This report describes a new methodology for using sparse (weekly or less frequent observations) and potentially highly censored pesticide monitoring data to simulate daily pesticide concentrations and associated quantities used for acute and chronic exposure assessments, such as the annual maximum daily concentration. The new methodology is based on a statistical model that expresses log-transformed daily pesticide concentration in terms of a seasonal wave, flow-related variability, long-term trend, and serially correlated errors. Methods are described for estimating the model parameters, generating conditional simulations of daily pesticide concentration given sparse (weekly or less frequent) and potentially highly censored observations, and estimating concentration extremes based on the conditional simulations. The model can be applied to datasets with as few as 3 years of record, as few as 30 total observations, and as few as 10 uncensored observations. The model was applied to atrazine, carbaryl, chlorpyrifos, and fipronil data for U.S. Geological Survey pesticide sampling sites with sufficient data for applying the model. A total of 112 sites were analyzed for atrazine, 38 for carbaryl, 34 for chlorpyrifos, and 33 for fipronil. The results are summarized in this report; and, R functions, described in this report and provided in an accompanying model archive, can be used to fit the model parameters and generate conditional simulations of daily concentrations for use in investigations involving pesticide exposure risk and uncertainty.
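
    The model structure described above (seasonal wave, flow-related variability, long-term trend, serially correlated errors) can be illustrated by simulating a daily log-concentration series; the coefficients below are invented for illustration and this is not the report's R implementation.

```python
# Simulate log10 daily concentration = intercept + seasonal wave + flow term
# + trend + AR(1) errors, then pull out annual maxima.
import numpy as np

rng = np.random.default_rng(4)
days = np.arange(3 * 365)
log_flow_anomaly = rng.normal(0, 0.4, size=days.size)

seasonal = 0.8 * np.sin(2 * np.pi * (days - 120) / 365.25)   # seasonal wave
trend = -0.0003 * days                                        # slow long-term decline
flow_term = 0.5 * log_flow_anomaly                            # flow-related variability

errors = np.zeros(days.size)                                  # serially correlated errors
for t in range(1, days.size):
    errors[t] = 0.85 * errors[t - 1] + rng.normal(0, 0.25)

log_conc = -2.0 + seasonal + trend + flow_term + errors       # log10 concentration
annual_max = [10 ** log_conc[y * 365:(y + 1) * 365].max() for y in range(3)]
print("simulated annual maximum concentrations:", np.round(annual_max, 3))
```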

  5. GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome.

    PubMed

    Simovski, Boris; Vodák, Daniel; Gundersen, Sveinung; Domanska, Diana; Azab, Abdulrahman; Holden, Lars; Holden, Marit; Grytten, Ivar; Rand, Knut; Drabløs, Finn; Johansen, Morten; Mora, Antonio; Lund-Andersen, Christin; Fromm, Bastian; Eskeland, Ragnhild; Gabrielsen, Odd Stokke; Ferkingstad, Egil; Nakken, Sigve; Bengtsen, Mads; Nederbragt, Alexander Johan; Thorarensen, Hildur Sif; Akse, Johannes Andreas; Glad, Ingrid; Hovig, Eivind; Sandve, Geir Kjetil

    2017-07-01

    Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no. © The Author 2017. Published by Oxford University Press.

  6. Investigation of Super Learner Methodology on HIV-1 Small Sample: Application on Jaguar Trial Data.

    PubMed

    Houssaïni, Allal; Assoumou, Lambert; Marcelin, Anne Geneviève; Molina, Jean Michel; Calvez, Vincent; Flandre, Philippe

    2012-01-01

    Background. Many statistical models have been tested to predict phenotypic or virological response from genotypic data. A statistical framework called Super Learner has been introduced either to compare different methods/learners (discrete Super Learner) or to combine them in a Super Learner prediction method. Methods. The Jaguar trial is used to apply the Super Learner framework. The Jaguar study is an "add-on" trial comparing the efficacy of adding didanosine to an on-going failing regimen. Our aim was also to investigate the impact of using different cross-validation strategies and different loss functions. Four different splits between training and validation sets were tested with two loss functions. Six statistical methods were compared. We assess performance by evaluating R(2) values and accuracy by calculating the rates of patients being correctly classified. Results. Our results indicated that the more recent Super Learner methodology of building a new predictor based on a weighted combination of different methods/learners provided good performance. A simple linear model provided similar results to those of this new predictor. Slight discrepancies arise between the two loss functions investigated, and also between results based on cross-validated risks and results from the full dataset. The Super Learner methodology and the linear model provided around 80% of patients correctly classified. The difference between the lower and higher rates is around 10 percent. The number of mutations retained in different learners also varies from one to 41. Conclusions. The more recent Super Learner methodology combining the prediction of many learners provided good performance on our small dataset.
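
    The core Super Learner idea, combining cross-validated predictions from several candidate learners with weights chosen to minimise cross-validated loss, can be sketched as follows on synthetic data. The candidate learners and the non-negative least-squares meta-learner below are stand-ins, not the exact learners compared in the study.

```python
# Stacked (Super Learner style) combination of cross-validated predictions.
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 10))                                  # e.g. coded mutations
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, size=120)      # e.g. virological response

learners = [LinearRegression(), Ridge(alpha=1.0),
            RandomForestRegressor(n_estimators=200, random_state=0)]

# Level-one matrix: one column of cross-validated predictions per learner.
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in learners])

weights, _ = nnls(Z, y)                                         # non-negative weights
weights /= weights.sum()
print("learner weights:", np.round(weights, 2))

# Final prediction: weighted combination of learners refit on all the data.
preds = np.column_stack([m.fit(X, y).predict(X) for m in learners]) @ weights
r2 = 1 - ((y - preds) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print("training R^2 of the combination:", round(r2, 3))
```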

  7. Biogeochemical Typing of Paddy Field by a Data-Driven Approach Revealing Sub-Systems within a Complex Environment - A Pipeline to Filtrate, Organize and Frame Massive Dataset from Multi-Omics Analyses

    PubMed Central

    Ogawa, Diogo M. O.; Moriya, Shigeharu; Tsuboi, Yuuri; Date, Yasuhiro; Prieto-da-Silva, Álvaro R. B.; Rádis-Baptista, Gandhi; Yamane, Tetsuo; Kikuchi, Jun

    2014-01-01

    We propose the technique of biogeochemical typing (BGC typing) as a novel methodology to set forth the sub-systems of organismal communities associated with correlated chemical profiles working within a larger complex environment. Given the intricate character of both the organismal and chemical consortia inherent in nature, many environmental studies employ the holistic approach of multi-omics analyses, mining as much information as possible. Due to the massive amount of data produced by applying multi-omics analyses, the results are hard to visualize and to process. The BGC typing analysis is a pipeline built using integrative statistical analysis that can treat such huge datasets, filtering, organizing and framing the information based on the strength of the various mutual trends of the organismal and chemical fluctuations occurring simultaneously in the environment. To test our technique of BGC typing, we chose a rich environment abounding in chemical nutrients and organismal diversity: the surficial freshwater from Japanese paddy fields and surrounding waters. To identify the community consortia profile we employed metagenomics by high-throughput sequencing (HTS) for the fragments amplified from Archaea rRNA, universal 16S rRNA and 18S rRNA; to assess the elemental content we employed ionomics by inductively coupled plasma optical emission spectroscopy (ICP-OES); and for the organic chemical profile, metabolomics employing both Fourier transformed infrared (FT-IR) spectroscopy and proton nuclear magnetic resonance (1H-NMR); all these analyses comprised our multi-omics dataset. Similar trends between the community consortia and the chemical profiles were connected through correlation. The result was then filtered, organized and framed according to correlation strengths and peculiarities. The output gave us four BGC types displaying uniqueness in community and chemical distribution, diversity and richness. We conclude, therefore, that BGC typing is a successful technique for elucidating the sub-systems of organismal communities with associated chemical profiles in complex ecosystems. PMID:25330259

  8. Biogeochemical typing of paddy field by a data-driven approach revealing sub-systems within a complex environment--a pipeline to filtrate, organize and frame massive dataset from multi-omics analyses.

    PubMed

    Ogawa, Diogo M O; Moriya, Shigeharu; Tsuboi, Yuuri; Date, Yasuhiro; Prieto-da-Silva, Álvaro R B; Rádis-Baptista, Gandhi; Yamane, Tetsuo; Kikuchi, Jun

    2014-01-01

    We propose the technique of biogeochemical typing (BGC typing) as a novel methodology to set forth the sub-systems of organismal communities associated with correlated chemical profiles working within a larger complex environment. Given the intricate character of both the organismal and chemical consortia inherent in nature, many environmental studies employ the holistic approach of multi-omics analyses, mining as much information as possible. Due to the massive amount of data produced by applying multi-omics analyses, the results are hard to visualize and to process. The BGC typing analysis is a pipeline built using integrative statistical analysis that can treat such huge datasets, filtering, organizing and framing the information based on the strength of the various mutual trends of the organismal and chemical fluctuations occurring simultaneously in the environment. To test our technique of BGC typing, we chose a rich environment abounding in chemical nutrients and organismal diversity: the surficial freshwater from Japanese paddy fields and surrounding waters. To identify the community consortia profile we employed metagenomics by high-throughput sequencing (HTS) for the fragments amplified from Archaea rRNA, universal 16S rRNA and 18S rRNA; to assess the elemental content we employed ionomics by inductively coupled plasma optical emission spectroscopy (ICP-OES); and for the organic chemical profile, metabolomics employing both Fourier transformed infrared (FT-IR) spectroscopy and proton nuclear magnetic resonance (1H-NMR); all these analyses comprised our multi-omics dataset. Similar trends between the community consortia and the chemical profiles were connected through correlation. The result was then filtered, organized and framed according to correlation strengths and peculiarities. The output gave us four BGC types displaying uniqueness in community and chemical distribution, diversity and richness. We conclude, therefore, that BGC typing is a successful technique for elucidating the sub-systems of organismal communities with associated chemical profiles in complex ecosystems.

  9. LEAP: biomarker inference through learning and evaluating association patterns.

    PubMed

    Jiang, Xia; Neapolitan, Richard E

    2015-03-01

    Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of the genetic risk might be due to undiscovered epistatic interactions, which are interactions in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support that LEAP is a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability. © 2015 The Authors. *Genetic Epidemiology published by Wiley Periodicals, Inc.

  10. Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes.

    PubMed

    Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T

    2015-10-12

    Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86% of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiine genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated the 5-locus dataset cannot be used to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology. Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summary species-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.

  11. Metabolic parameters linked by Phenotype MicroArray to acid resistance profiles of poultry-associated Salmonella enterica.

    USDA-ARS?s Scientific Manuscript database

    Phenotype microarrays were analyzed for 51 datasets derived from Salmonella enterica. The top 4 serovars associated with poultry products and one associated with turkey, respectively Typhimurium, Enteritidis, Heidelberg, Infantis and Senftenberg, were represented. Datasets were clustered into two ...

  12. Vocational Rehabilitation Employment Outcomes and Interagency Collaboration for Youth with Disabilities

    ERIC Educational Resources Information Center

    Awsumb, Jessica M.

    2017-01-01

    This study examines post-school outcomes of youth with disabilities that were served by the Illinois vocational rehabilitation (VR) agency while in Chicago Public Schools (CPS) through a mixed methodology research design. In order to understand how outcomes differ among the study population, a large-scale dataset of the employment outcomes of…

  13. BIOFRAG - a new database for analyzing BIOdiversity responses to forest FRAGmentation

    Treesearch

    M. Pfeifer; Tamara Heartsill Scalley

    2014-01-01

    Habitat fragmentation studies have produced complex results that are challenging to synthesize. Inconsistencies among studies may result from variation in the choice of landscape metrics and response variables, which is often compounded by a lack of key statistical or methodological information. Collating primary datasets on biodiversity responses to fragmentation in a...

  14. Mentoring Educational Leadership Doctoral Students: Using Methodological Diversification to Examine Gender and Identity Intersections

    ERIC Educational Resources Information Center

    Welton, Anjale D.; Mansfield, Katherine Cumings; Lee, Pei-Ling; Young, Michelle D.

    2015-01-01

    An essential component to learning and teaching in educational leadership is mentoring graduate students for successful transition to K-12 and higher education positions. This study integrates quantitative and qualitative datasets to examine doctoral students' experiences with mentoring from macro and micro perspectives. Findings show that…

  15. Analysing the Preferences of Prospective Students for Higher Education Institution Attributes

    ERIC Educational Resources Information Center

    Walsh, Sharon; Flannery, Darragh; Cullinan, John

    2018-01-01

    We utilise a dataset of students in their final year of upper secondary education in Ireland to provide a detailed examination of the preferences of prospective students for higher education institutions (HEIs). Our analysis is based upon a discrete choice experiment methodology with willingness to pay estimates derived for specific HEI attributes…

  16. A public-industry partnership for enhancing corn nitrogen research and datasets: project description, methodology, and outcomes

    USDA-ARS?s Scientific Manuscript database

    Due to economic and environmental consequences of nitrogen (N) lost from fertilizer applications in corn (Zea mays L.), considerable public and industry attention has been devoted to development of N decision tools. Now a wide variety of tools are available to farmers for managing N inputs. However,...

  17. Service Delivery Experiences and Intervention Needs of Military Families with Children with ASD

    ERIC Educational Resources Information Center

    Davis, Jennifer M.; Finke, Erinn; Hickerson, Benjamin

    2016-01-01

    The purpose of this study was to describe the experiences of military families with children with autism spectrum disorder (ASD) specifically as it relates to relocation. Online survey methodology was used to gather information from military spouses with children with ASD. The finalized dataset included 189 cases. Descriptive statistics and…

  18. Expanding Downward: Innovation, Diffusion, and State Policy Adoptions of Universal Preschool

    ERIC Educational Resources Information Center

    Curran, F. Chris

    2015-01-01

    Framed within the theoretical framework of policy innovation and diffusion, this study explores both interstate (diffusion) and intrastate predictors of adoption of state universal preschool policies. Event history analysis methodology is applied to a state level dataset drawn from the Census, the NCES Common Core, the Book of the States, and…

  19. The Concepts of Informational Approach to the Management of Higher Education's Development

    ERIC Educational Resources Information Center

    Levina, Elena Y.; Voronina, Marianna V.; Rybolovleva, Alla A.; Sharafutdinova, Mariya M.; Zhandarova, Larisa F.; Avilova, Vilora V.

    2016-01-01

    The research urgency is caused by necessity to develop the informational support for management of development of higher education in conditions of high turbulence of external and internal environment. The purpose of the paper is the development of methodology for structuring and analyzing datasets of educational activities in order to reduce…

  20. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution dataset mergers, such as the one exemplified here, can serve as a baseline towards comprehensive species distribution datasets.
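
    The two agreement measures mentioned above can be illustrated on two hypothetical presence/absence grids (cells x species): the sketch below computes a per-cell Jaccard similarity and a reduced major axis (RMA) regression of species richness between the two atlases. It is not the authors' workflow; the grids are random stand-ins.

```python
# Jaccard agreement per grid cell and RMA regression of species richness.
import numpy as np

rng = np.random.default_rng(6)
atlas_a = rng.random((100, 30)) < 0.3                     # occurrence atlas (cells x species)
atlas_b = atlas_a ^ (rng.random((100, 30)) < 0.1)         # range atlas with some disagreement

intersection = (atlas_a & atlas_b).sum(axis=1)
union = (atlas_a | atlas_b).sum(axis=1)
jaccard = intersection / np.maximum(union, 1)             # cells with no records score 0

# RMA slope: ratio of standard deviations, signed by the correlation.
rich_a, rich_b = atlas_a.sum(axis=1), atlas_b.sum(axis=1)
r = np.corrcoef(rich_a, rich_b)[0, 1]
slope = np.sign(r) * rich_b.std() / rich_a.std()
intercept = rich_b.mean() - slope * rich_a.mean()
print("mean Jaccard:", jaccard.mean().round(2),
      "| RMA slope:", slope.round(2), "intercept:", intercept.round(2))
```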

  1. The inland water macro-invertebrate occurrences in Flanders, Belgium.

    PubMed

    Vannevel, Rudy; Brosens, Dimitri; Cooman, Ward De; Gabriels, Wim; Lavens, Frank; Mertens, Joost; Vervaeke, Bart

    2018-01-01

    The Flanders Environment Agency (VMM) has been performing biological water quality assessments on inland waters in Flanders (Belgium) since 1989 and sediment quality assessments since 2000. The water quality monitoring network is a combined physico-chemical and biological network, the biological component focusing on macro-invertebrates. The sediment monitoring programme produces biological data to assess the sediment quality. Both monitoring programmes aim to provide index values, applying a similar conceptual methodology based on the presence of macro-invertebrates. The biological data obtained from both monitoring networks are consolidated in the VMM macro-invertebrates database and include identifications at family and genus level of the freshwater phyla Coelenterata, Platyhelminthes, Annelida, Mollusca, and Arthropoda. This paper discusses the content of this database, and the dataset published thereof: 282,309 records of 210 observed taxa from 4,140 monitoring sites located on 657 different water bodies, collected during 22,663 events. This paper provides some background information on the methodology, temporal and spatial coverage, and taxonomy, and describes the content of the dataset. The data are distributed as open data under the Creative Commons CC-BY license.

  2. The species translation challenge—A systems biology perspective on human and rat bronchial epithelial cells

    PubMed Central

    Poussin, Carine; Mathis, Carole; Alexopoulos, Leonidas G; Messinis, Dimitris E; Dulize, Rémi H J; Belcastro, Vincenzo; Melas, Ioannis N; Sakellaropoulos, Theodore; Rhrissorrakrai, Kahn; Bilal, Erhan; Meyer, Pablo; Talikka, Marja; Boué, Stéphanie; Norel, Raquel; Rice, John J; Stolovitzky, Gustavo; Ivanov, Nikolai V; Peitsch, Manuel C; Hoeng, Julia

    2014-01-01

    The biological response to external cues such as drugs, chemicals, viruses and hormones is an essential question in biomedicine and in the field of toxicology, and cannot be easily studied in humans. Thus, biomedical research has continuously relied on animal models for studying the impact of these compounds and attempted to ‘translate’ the results to humans. In this context, the SBV IMPROVER (Systems Biology Verification for Industrial Methodology for PROcess VErification in Research) collaborative initiative, which uses crowd-sourcing techniques to address fundamental questions in systems biology, invited scientists to deploy their own computational methodologies to make predictions on species translatability. A multi-layer systems biology dataset was generated that comprised phosphoproteomics, transcriptomics and cytokine data derived from normal human (NHBE) and rat (NRBE) bronchial epithelial cells exposed in parallel to more than 50 different stimuli under identical conditions. The present manuscript describes in detail the experimental settings, generation, processing and quality control analysis of the multi-layer omics dataset accessible in public repositories for further intra- and inter-species translation studies. PMID:25977767

  3. The species translation challenge-a systems biology perspective on human and rat bronchial epithelial cells.

    PubMed

    Poussin, Carine; Mathis, Carole; Alexopoulos, Leonidas G; Messinis, Dimitris E; Dulize, Rémi H J; Belcastro, Vincenzo; Melas, Ioannis N; Sakellaropoulos, Theodore; Rhrissorrakrai, Kahn; Bilal, Erhan; Meyer, Pablo; Talikka, Marja; Boué, Stéphanie; Norel, Raquel; Rice, John J; Stolovitzky, Gustavo; Ivanov, Nikolai V; Peitsch, Manuel C; Hoeng, Julia

    2014-01-01

    The biological response to external cues such as drugs, chemicals, viruses and hormones is an essential question in biomedicine and in the field of toxicology, and cannot be easily studied in humans. Thus, biomedical research has continuously relied on animal models for studying the impact of these compounds and attempted to 'translate' the results to humans. In this context, the SBV IMPROVER (Systems Biology Verification for Industrial Methodology for PROcess VErification in Research) collaborative initiative, which uses crowd-sourcing techniques to address fundamental questions in systems biology, invited scientists to deploy their own computational methodologies to make predictions on species translatability. A multi-layer systems biology dataset was generated that comprised phosphoproteomics, transcriptomics and cytokine data derived from normal human (NHBE) and rat (NRBE) bronchial epithelial cells exposed in parallel to more than 50 different stimuli under identical conditions. The present manuscript describes in detail the experimental settings, generation, processing and quality control analysis of the multi-layer omics dataset accessible in public repositories for further intra- and inter-species translation studies.

  4. An Evaluation of Database Solutions to Spatial Object Association

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kumar, V S; Kurc, T; Saltz, J

    2008-06-24

    Object association is a common problem encountered in many applications. Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two datasets based on their positions in a common spatial coordinate system--one of the datasets may correspond to a catalog of objects observed over time in a multi-dimensional domain; the other dataset may consist of objects observed in a snapshot of the domain at a time point. The use of database management systems to solve the object association problem provides portability across different platforms and also greater flexibility. Increasing dataset sizes in today's applications, however, have made object association a data/compute-intensive problem that requires targeted optimizations for efficient execution. In this work, we investigate how database-based crossmatch algorithms can be deployed on different database system architectures and evaluate the deployments to understand the impact of architectural choices on crossmatch performance and associated trade-offs. We investigate the execution of two crossmatch algorithms on (1) a parallel database system with active disk style processing capabilities, (2) a high-throughput network database (MySQL Cluster), and (3) shared-nothing databases with replication. We have conducted our study in the context of a large-scale astronomy application with real use-case scenarios.
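
    Independent of the database architectures compared in the study, the positional crossmatch itself can be illustrated with a nearest-neighbour search. The sketch below uses a k-d tree on synthetic flat 2-D coordinates and a hypothetical matching radius; a real astronomical crossmatch would work on the sphere.

```python
# Match each snapshot object to its nearest reference object within a radius.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)
reference = rng.uniform(0, 10, size=(10_000, 2))                        # catalog positions
snapshot = reference[rng.choice(10_000, 500)] + rng.normal(0, 1e-3, size=(500, 2))

tree = cKDTree(reference)
dist, idx = tree.query(snapshot, k=1, distance_upper_bound=0.01)        # radius cut
matched = np.isfinite(dist)                                             # unmatched -> inf
print(f"matched {matched.sum()} of {len(snapshot)} snapshot objects")
```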

  5. Genome-wide pathway-based association analysis identifies risk pathways associated with Parkinson's disease.

    PubMed

    Zhang, Mingming; Mu, Hongbo; Shang, Zhenwei; Kang, Kai; Lv, Hongchao; Duan, Lian; Li, Jin; Chen, Xinren; Teng, Yanbo; Jiang, Yongshuai; Zhang, Ruijie

    2017-01-06

    Parkinson's disease (PD) is the second most common neurodegenerative disease. It is generally believed that it is influenced by both genetic and environmental factors, but the precise pathogenesis of PD is unknown to date. In this study, we performed a pathway analysis based on genome-wide association study (GWAS) data to detect risk pathways of PD in three GWAS datasets. We first mapped all SNP markers to autosomal genes in each GWAS dataset. Then, we evaluated gene risk values using the minimum P-value of the tagSNPs. We took a pathway as a unit to identify the risk pathways based on the cumulative risks of the genes in the pathway. Finally, we combined the analysis results of the three datasets to detect the high-risk pathways associated with PD. We found five pathways that were common to all three datasets, and another five pathways that were shared by two datasets. Most of these pathways are associated with the nervous system. Five pathways had been reported to be PD-related pathways in the previous literature. Our findings also implied that there was a close association between immune response and PD. Continued investigation of these pathways will further help us explain the pathogenesis of PD. Copyright © 2016. Published by Elsevier Ltd.
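
    The scoring steps described above can be illustrated with a small sketch: each gene receives the minimum P-value among its tagSNPs, and a pathway's risk is accumulated over its member genes. The combination rule below (a normalized sum of -log10 P values) and all gene names are illustrative assumptions, not the study's exact implementation.

```python
# Hypothetical sketch of gene- and pathway-level risk scoring from GWAS P-values.
import math

def gene_risk(snp_pvalues_by_gene):
    """Assign each gene the minimum P-value among its tagSNPs."""
    return {gene: min(pvals) for gene, pvals in snp_pvalues_by_gene.items() if pvals}

def pathway_risk(pathway_genes, gene_pvals):
    """Cumulative pathway risk as the mean of -log10(P) over member genes (assumed rule)."""
    members = [g for g in pathway_genes if g in gene_pvals]
    if not members:
        return 0.0
    return sum(-math.log10(gene_pvals[g]) for g in members) / len(members)

# Toy usage with illustrative gene names and P-values.
snp_p = {"SNCA": [3e-8, 1e-4], "LRRK2": [5e-6], "GBA": [2e-5, 0.3]}
gene_p = gene_risk(snp_p)
print(pathway_risk({"SNCA", "LRRK2", "GBA", "MAPT"}, gene_p))
```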

  6. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research.

    PubMed

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R; Beresford, Michael W

    2018-02-01

    This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  7. Hydrodynamic modelling and global datasets: Flow connectivity and SRTM data, a Bangkok case study.

    NASA Astrophysics Data System (ADS)

    Trigg, M. A.; Bates, P. B.; Michaelides, K.

    2012-04-01

    The rise of globally interconnected manufacturing supply chains requires an understanding and consistent quantification of flood risk at a global scale. Flood risk is often better quantified (or at least more precisely defined) in regions where there has been an investment in comprehensive topographical data collection such as LiDAR coupled with detailed hydrodynamic modelling. Yet in regions where these data and modelling are unavailable, the implications of flooding and the knock-on effects for global industries can be dramatic, as evidenced by the recent floods in Bangkok, Thailand. There is growing momentum in global modelling initiatives to address this lack of a consistent understanding of flood risk, and they will rely heavily on the application of available global datasets relevant to hydrodynamic modelling, such as Shuttle Radar Topography Mission (SRTM) data and its derivatives. These global datasets bring opportunities to apply consistent methodologies on an automated basis in all regions, while the use of coarser scale datasets also brings many challenges such as sub-grid process representation and downscaled hydrology data from global climate models. There are significant opportunities for hydrological science in helping define new, realistic and physically based methodologies that can be applied globally, as well as the possibility of gaining new insights into flood risk through analysis of the many large datasets that will be derived from this work. We use Bangkok as a case study to explore some of the issues related to using these available global datasets for hydrodynamic modelling, with particular focus on using SRTM data to represent topography. Research has shown that flow connectivity on the floodplain is an important component in the dynamics of flood flows on to and off the floodplain, and indeed within different areas of the floodplain. A lack of representation of flow connectivity, often due to data resolution limitations, means that important subgrid processes are missing from hydrodynamic models, leading to poor model predictive capabilities. Specifically here, the issue of flow connectivity during flood events is explored using geostatistical techniques to quantify the change of flow connectivity on floodplains due to grid rescaling methods. We also test whether this method of assessing connectivity can be used as a new tool in the quantification of flood risk that moves beyond the simple flood extent approach, encapsulating threshold changes and data limitations.

  8. De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

    PubMed Central

    Arbuckle, Luk; Koru, Gunes; Eze, Benjamin; Gaudette, Lisa; Neri, Emilio; Rose, Sean; Howard, Jeremy; Gluck, Jonathan

    2012-01-01

    Background There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013. Objective To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Methods We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack. Results An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions. Conclusions It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data. PMID:22370452
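
    A minimal sketch of one generic way to estimate re-identification risk for a released table: group records by quasi-identifiers and take the reciprocal of the equivalence-class size as the matching probability under a single attempted attack. The quasi-identifiers and the k-anonymity-style calculation are assumptions for illustration, not the HHP team's exact procedure.

```python
# Generic k-anonymity-style re-identification risk estimate (illustrative only).
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (average, maximum) per-record matching probability across the table."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)                      # size of each equivalence class
    per_record = [1.0 / class_sizes[k] for k in keys]
    return sum(per_record) / len(per_record), max(per_record)

# Toy table with hypothetical quasi-identifiers.
rows = [
    {"age_group": "40-49", "sex": "F", "region": "East"},
    {"age_group": "40-49", "sex": "F", "region": "East"},
    {"age_group": "70-79", "sex": "M", "region": "West"},
]
avg_risk, max_risk = reidentification_risk(rows, ["age_group", "sex", "region"])
print(avg_risk, max_risk)  # compare against a chosen threshold such as 0.05
```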

  9. Variable Star Signature Classification using Slotted Symbolic Markov Modeling

    NASA Astrophysics Data System (ADS)

    Johnston, K. B.; Peter, A. M.

    2017-01-01

    With the advent of digital astronomy, new benefits and new challenges have been presented to the modern day astronomer. No longer can the astronomer rely on manual processing, instead the profession as a whole has begun to adopt more advanced computational means. This paper focuses on the construction and application of a novel time-domain signature extraction methodology and the development of a supporting supervised pattern classification algorithm for the identification of variable stars. A methodology for the reduction of stellar variable observations (time-domain data) into a novel feature space representation is introduced. The methodology presented will be referred to as Slotted Symbolic Markov Modeling (SSMM) and has a number of advantages which will be demonstrated to be beneficial; specifically to the supervised classification of stellar variables. It will be shown that the methodology outperformed a baseline standard methodology on a standardized set of stellar light curve data. The performance on a set of data derived from the LINEAR dataset will also be shown.
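
    A hedged sketch of the general idea behind a slotted, symbolic Markov feature: resample an irregularly sampled light curve onto fixed time slots, discretize the slot values into a small alphabet, and use the symbol-transition matrix as a fixed-length feature vector for a downstream classifier. The slot width and alphabet size are illustrative choices, not the tuned SSMM settings.

```python
# Illustrative slotted + symbolic Markov feature extraction for a light curve.
import numpy as np

def slotted_symbolic_features(times, mags, slot=1.0, n_symbols=4):
    # Slotting: average the observations falling in each fixed-width time slot.
    bins = np.floor((times - times.min()) / slot).astype(int)
    slot_means = np.array([mags[bins == b].mean() for b in np.unique(bins)])
    # Symbolization: equal-frequency binning of the slot means into n_symbols levels.
    edges = np.quantile(slot_means, np.linspace(0, 1, n_symbols + 1)[1:-1])
    symbols = np.digitize(slot_means, edges)
    # First-order Markov transition matrix, row-normalized, flattened as the feature.
    T = np.zeros((n_symbols, n_symbols))
    for a, b in zip(symbols[:-1], symbols[1:]):
        T[a, b] += 1
    T /= np.maximum(T.sum(axis=1, keepdims=True), 1)
    return T.ravel()

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 50, 200))
m = np.sin(2 * np.pi * t / 7.3) + 0.1 * rng.normal(size=t.size)
print(slotted_symbolic_features(t, m).shape)  # (16,)
```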

  10. Variable Star Signature Classification using Slotted Symbolic Markov Modeling

    NASA Astrophysics Data System (ADS)

    Johnston, Kyle B.; Peter, Adrian M.

    2016-01-01

    With the advent of digital astronomy, new benefits and new challenges have been presented to the modern day astronomer. No longer can the astronomer rely on manual processing, instead the profession as a whole has begun to adopt more advanced computational means. Our research focuses on the construction and application of a novel time-domain signature extraction methodology and the development of a supporting supervised pattern classification algorithm for the identification of variable stars. A methodology for the reduction of stellar variable observations (time-domain data) into a novel feature space representation is introduced. The methodology presented will be referred to as Slotted Symbolic Markov Modeling (SSMM) and has a number of advantages which will be demonstrated to be beneficial; specifically to the supervised classification of stellar variables. It will be shown that the methodology outperformed a baseline standard methodology on a standardized set of stellar light curve data. The performance on a set of data derived from the LINEAR dataset will also be shown.

  11. Wet climate and transportation routes accelerate spread of human plague

    PubMed Central

    Xu, Lei; Stige, Leif Chr.; Kausrud, Kyrre Linné; Ben Ari, Tamara; Wang, Shuchun; Fang, Xiye; Schmid, Boris V.; Liu, Qiyong; Stenseth, Nils Chr.; Zhang, Zhibin

    2014-01-01

    Currently, large-scale transmissions of infectious diseases are becoming more closely associated with accelerated globalization and climate change, but quantitative analyses are still rare. By using an extensive dataset consisting of date and location of cases for the third plague pandemic from 1772 to 1964 in China and a novel method (nearest neighbour approach) which deals with both short- and long-distance transmissions, we found the presence of major roads, rivers and coastline accelerated the spread of plague and shaped the transmission patterns. We found that plague spread velocity was positively associated with wet conditions (measured by an index of drought and flood events) in China, probably due to flood-driven transmission by people or rodents. Our study provides new insights on transmission patterns and possible mechanisms behind variability in transmission speed, with implications for prevention and control measures. The methodology may also be applicable to studies of disease dynamics or species movement in other systems. PMID:24523275

  12. Digital database architecture and delineation methodology for deriving drainage basins, and a comparison of digitally and non-digitally derived numeric drainage areas

    USGS Publications Warehouse

    Dupree, Jean A.; Crowfoot, Richard M.

    2012-01-01

    The drainage basin is a fundamental hydrologic entity used for studies of surface-water resources and during planning of water-related projects. Numeric drainage areas published by the U.S. Geological Survey water science centers in Annual Water Data Reports and on the National Water Information Systems (NWIS) Web site are still primarily derived from hard-copy sources and by manual delineation of polygonal basin areas on paper topographic map sheets. To expedite numeric drainage area determinations, the Colorado Water Science Center developed a digital database structure and a delineation methodology based on the hydrologic unit boundaries in the National Watershed Boundary Dataset. This report describes the digital database architecture and delineation methodology and also presents the results of a comparison of the numeric drainage areas derived using this digital methodology with those derived using traditional, non-digital methods. (Please see report for full Abstract)

  13. EnsembleGASVR: a novel ensemble method for classifying missense single nucleotide polymorphisms.

    PubMed

    Rapakoulia, Trisevgeni; Theofilatos, Konstantinos; Kleftogiannis, Dimitrios; Likothanasis, Spiros; Tsakalidis, Athanasios; Mavroudi, Seferina

    2014-08-15

    Single nucleotide polymorphisms (SNPs) are considered the most frequently occurring DNA sequence variations. Several computational methods have been proposed for the classification of missense SNPs as neutral or disease-associated. However, existing computational approaches fail to select relevant features by choosing them arbitrarily without sufficient documentation. Moreover, they are limited by the problems of missing values and imbalance between the learning datasets, and most of them do not support their predictions with confidence scores. To overcome these limitations, a novel ensemble computational methodology is proposed. EnsembleGASVR facilitates a two-step algorithm, which in its first step applies a novel evolutionary embedded algorithm to locate close to optimal Support Vector Regression models. In its second step, these models are combined to extract a universal predictor, which is less prone to overfitting issues, systematizes the rebalancing of the learning sets and uses an internal approach for solving the missing values problem without loss of information. Confidence scores support all the predictions and the model becomes tunable by modifying the classification thresholds. An extensive study was performed for collecting the most relevant features for the problem of classifying SNPs, and a superset of 88 features was constructed. Experimental results show that the proposed framework outperforms well-known algorithms in terms of classification performance in the examined datasets. Finally, the proposed algorithmic framework was able to uncover the significant role of certain features such as the solvent accessibility feature, and the top-scored predictions were further validated by linking them with disease phenotypes. Datasets and codes are freely available on the Web at http://prlab.ceid.upatras.gr/EnsembleGASVR/dataset-codes.zip. All the required information about the article is available through http://prlab.ceid.upatras.gr/EnsembleGASVR/site.html. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
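
    The second step described above, combining several Support Vector Regression models into a single tunable predictor, can be sketched as follows. The averaged output is thresholded for neutral/disease classification and the spread across ensemble members is used as a rough confidence score; the hyperparameters are placeholders rather than the values selected by the evolutionary search.

```python
# Illustrative SVR ensemble with a tunable classification threshold and a
# simple agreement-based confidence score (not the published EnsembleGASVR code).
import numpy as np
from sklearn.svm import SVR

def ensemble_predict(models, X, threshold=0.5):
    preds = np.column_stack([m.predict(X) for m in models])
    score = preds.mean(axis=1)                 # averaged ensemble output
    confidence = 1.0 - preds.std(axis=1)       # higher agreement -> higher confidence
    label = (score >= threshold).astype(int)   # tunable decision threshold
    return label, score, confidence

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)          # simulated 0/1 targets
models = [SVR(C=c, gamma="scale").fit(X, y) for c in (0.5, 1.0, 2.0)]
print(ensemble_predict(models, X[:5])[0])
```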

  14. Independent studies using deep sequencing resolve the same set of core bacterial species dominating gut communities of honey bees.

    PubMed

    Sabree, Zakee L; Hansen, Allison K; Moran, Nancy A

    2012-01-01

    Starting in 2003, numerous studies using culture-independent methodologies to characterize the gut microbiota of honey bees have retrieved a consistent and distinctive set of eight bacterial species, based on near identity of the 16S rRNA gene sequences. A recent study [Mattila HR, Rios D, Walker-Sperling VE, Roeselers G, Newton ILG (2012) Characterization of the active microbiotas associated with honey bees reveals healthier and broader communities when colonies are genetically diverse. PLoS ONE 7(3): e32962], using pyrosequencing of the V1-V2 hypervariable region of the 16S rRNA gene, reported finding entirely novel bacterial species in honey bee guts, and used taxonomic assignments from these reads to predict metabolic activities based on known metabolisms of cultivable species. To better understand this discrepancy, we analyzed the Mattila et al. pyrotag dataset. In contrast to the conclusions of Mattila et al., we found that the large majority of pyrotag sequences belonged to clusters for which representative sequences were identical to sequences from previously identified core species of the bee microbiota. On average, they represent 95% of the bacteria in each worker bee in the Mattila et al. dataset, a slightly lower value than that found in other studies. Some colonies contain small proportions of other bacteria, mostly species of Enterobacteriaceae. Reanalysis of the Mattila et al. dataset also did not support a relationship between abundances of Bifidobacterium and of putative pathogens or a significant difference in gut communities between colonies from queens that were singly or multiply mated. Additionally, consistent with previous studies, the dataset supports the occurrence of considerable strain variation within core species, even within single colonies. The roles of these bacteria within bees, or the implications of the strain variation, are not yet clear.

  15. ConGEMs: Condensed Gene Co-Expression Module Discovery Through Rule-Based Clustering and Its Application to Carcinogenesis.

    PubMed

    Mallik, Saurav; Zhao, Zhongming

    2017-12-28

    For transcriptomic analysis, there are numerous microarray-based genomic data, especially those generated for cancer research. The typical analysis measures the difference between a cancer sample-group and a matched control group for each transcript or gene. Association rule mining is used to discover interesting item sets through rule-based methodology. Thus, it has advantages to find causal effect relationships between the transcripts. In this work, we introduce two new rule-based similarity measures-weighted rank-based Jaccard and Cosine measures-and then propose a novel computational framework to detect condensed gene co-expression modules (ConGEMs) through the association rule-based learning system and the weighted similarity scores. In practice, the list of evolved condensed markers that consists of both singular and complex markers in nature depends on the corresponding condensed gene sets in either antecedent or consequent of the rules of the resultant modules. In our evaluation, these markers could be supported by literature evidence, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and Gene Ontology annotations. Specifically, we preliminarily identified differentially expressed genes using an empirical Bayes test. A recently developed algorithm-RANWAR-was then utilized to determine the association rules from these genes. Based on that, we computed the integrated similarity scores of these rule-based similarity measures between each rule-pair, and the resultant scores were used for clustering to identify the co-expressed rule-modules. We applied our method to a gene expression dataset for lung squamous cell carcinoma and a genome methylation dataset for uterine cervical carcinogenesis. Our proposed module discovery method produced better results than the traditional gene-module discovery measures. In summary, our proposed rule-based method is useful for exploring biomarker modules from transcriptomic data.
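
    A minimal sketch of comparing two association rules by the genes they contain, using plain Jaccard and cosine similarities over the union of antecedent and consequent item sets. The paper's measures are weighted, rank-based variants; the unweighted forms below only illustrate the kind of rule-pair similarity that feeds the clustering step, and the gene names are arbitrary.

```python
# Illustrative rule-pair similarities over rule item sets (unweighted variants).
import math

def rule_items(rule):
    return set(rule["antecedent"]) | set(rule["consequent"])

def jaccard(r1, r2):
    a, b = rule_items(r1), rule_items(r2)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(r1, r2):
    a, b = rule_items(r1), rule_items(r2)
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

r1 = {"antecedent": ["TP53", "EGFR"], "consequent": ["KRAS"]}
r2 = {"antecedent": ["EGFR"], "consequent": ["KRAS", "MYC"]}
print(jaccard(r1, r2), cosine(r1, r2))
```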

  16. Evaluation of the Soil Conservation Service curve number methodology using data from agricultural plots

    NASA Astrophysics Data System (ADS)

    Lal, Mohan; Mishra, S. K.; Pandey, Ashish; Pandey, R. P.; Meena, P. K.; Chaudhary, Anubhav; Jha, Ranjit Kumar; Shreevastava, Ajit Kumar; Kumar, Yogendra

    2017-01-01

    The Soil Conservation Service curve number (SCS-CN) method, also known as the Natural Resources Conservation Service curve number (NRCS-CN) method, is popular for computing the volume of direct surface runoff for a given rainfall event. The performance of the SCS-CN method, based on large rainfall (P) and runoff (Q) datasets of United States watersheds, is evaluated using a large dataset of natural storm events from 27 agricultural plots in India. On the whole, the CN estimates from the National Engineering Handbook (chapter 4) tables do not match those derived from the observed P and Q datasets. As a result, the runoff prediction using the former CNs was poor for the data of 22 (out of 24) plots. However, the match was a little better for higher CN values, consistent with the general notion that the existing SCS-CN method performs better for high rainfall-runoff (high CN) events. Infiltration capacity (fc) was the main explanatory variable for runoff (or CN) production in the study plots, as it exhibited the expected inverse relationship between CN and fc. The plot-data optimization yielded initial abstraction coefficient (λ) values from 0 to 0.659 for the ordered dataset and 0 to 0.208 for the natural dataset (with 0 as the most frequent value). Mean and median λ values were, respectively, 0.030 and 0 for the natural rainfall-runoff dataset and 0.108 and 0 for the ordered rainfall-runoff dataset. Runoff estimation was very sensitive to λ and it improved consistently as λ changed from 0.2 to 0.03.
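
    For reference, the textbook SCS-CN relations evaluated above are, in millimetres, S = 25400/CN - 254, Ia = λS, and Q = (P - Ia)^2 / (P - Ia + S) for P > Ia (otherwise Q = 0). The sketch below restates these standard formulas and is not tied to the plot-specific calibration reported in the study.

```python
# Standard SCS-CN event runoff computation (SI units, millimetres).
def scs_cn_runoff(P_mm, CN, lam=0.2):
    S = 25400.0 / CN - 254.0      # potential maximum retention (mm)
    Ia = lam * S                  # initial abstraction (mm)
    if P_mm <= Ia:
        return 0.0
    return (P_mm - Ia) ** 2 / (P_mm - Ia + S)

# Lowering lambda from 0.2 toward ~0.03 increases computed runoff for small storms.
for lam in (0.2, 0.03):
    print(lam, round(scs_cn_runoff(P_mm=40.0, CN=75, lam=lam), 2))
```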

  17. An ontological system for interoperable spatial generalisation in biodiversity monitoring

    NASA Astrophysics Data System (ADS)

    Nieland, Simon; Moran, Niklas; Kleinschmit, Birgit; Förster, Michael

    2015-11-01

    Semantic heterogeneity remains a barrier to data comparability and standardisation of results in different fields of spatial research. Because of its thematic complexity, differing acquisition methods and national nomenclatures, interoperability of biodiversity monitoring information is especially difficult. Since data collection methods and interpretation manuals broadly vary there is a need for automatised, objective methodologies for the generation of comparable data-sets. Ontology-based applications offer vast opportunities in data management and standardisation. This study examines two data-sets of protected heathlands in Germany and Belgium which are based on remote sensing image classification and semantically formalised in an OWL2 ontology. The proposed methodology uses semantic relations of the two data-sets, which are (semi-)automatically derived from remote sensing imagery, to generate objective and comparable information about the status of protected areas by utilising kernel-based spatial reclassification. This automatised method suggests a generalisation approach, which is able to generate delineation of Special Areas of Conservation (SAC) of the European biodiversity Natura 2000 network. Furthermore, it is able to transfer generalisation rules between areas surveyed with varying acquisition methods in different countries by taking into account automated inference of the underlying semantics. The generalisation results were compared with the manual delineation of terrestrial monitoring. For the different habitats in the two sites an accuracy of above 70% was detected. However, it has to be highlighted that the delineation of the ground-truth data inherits a high degree of uncertainty, which is discussed in this study.

  18. Coalescence computations for large samples drawn from populations of time-varying sizes

    PubMed Central

    Polanski, Andrzej; Szczesna, Agnieszka; Garbulowski, Mateusz; Kimmel, Marek

    2017-01-01

    We present new results concerning probability distributions of times in the coalescence tree and expected allele frequencies for coalescent with large sample size. The obtained results are based on computational methodologies, which involve combining coalescence time scale changes with techniques of integral transformations and using analytical formulae for infinite products. We show applications of the proposed methodologies for computing probability distributions of times in the coalescence tree and their limits, for evaluation of accuracy of approximate expressions for times in the coalescence tree and expected allele frequencies, and for analysis of large human mitochondrial DNA dataset. PMID:28170404
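
    For orientation, in the constant-population-size case the coalescence times discussed above reduce to the standard Kingman-coalescent identities below (in coalescent time units); the paper's contribution lies in computing such distributions efficiently for time-varying sizes and very large samples. This is background material, not a result of the paper.

```latex
% Constant-size Kingman coalescent: while k lineages remain, the waiting time
% T_k to the next coalescence is exponential with rate \binom{k}{2}, hence
\[
  \mathbb{E}[T_k] = \frac{2}{k(k-1)}, \qquad
  \mathbb{E}[T_{\mathrm{MRCA}}] = \sum_{k=2}^{n} \frac{2}{k(k-1)}
    = 2\left(1 - \frac{1}{n}\right).
\]
```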

  19. From Sky to Earth: Data Science Methodology Transfer

    NASA Astrophysics Data System (ADS)

    Mahabal, Ashish A.; Crichton, Daniel; Djorgovski, S. G.; Law, Emily; Hughes, John S.

    2017-06-01

    We describe here the parallels in astronomy and earth science datasets, their analyses, and the opportunities for methodology transfer from astroinformatics to geoinformatics. Using the example of hydrology, we emphasize how meta-data and ontologies are crucial in such an undertaking. Using the infrastructure being designed for EarthCube - the Virtual Observatory for the earth sciences - we discuss essential steps for better transfer of tools and techniques in the future, e.g. domain adaptation. Finally, we point out that it is never a one-way process and there is enough for astroinformatics to learn from geoinformatics as well.

  20. Site-conditions map for Portugal based on VS measurements: methodology and final model

    NASA Astrophysics Data System (ADS)

    Vilanova, Susana; Narciso, João; Carvalho, João; Lopes, Isabel; Quinta Ferreira, Mario; Moura, Rui; Borges, José; Nemser, Eliza; Pinto, Carlos

    2017-04-01

    In this paper we present a statistically significant site-condition model for Portugal based on shear-wave velocity (VS) data and surface geology. We also evaluate the performance of commonly used Vs30 proxies based on exogenous data and analyze the implications of using those proxies for calculating site amplification in seismic hazard assessment. The dataset contains 161 Vs profiles acquired in Portugal in the context of research projects, technical reports, academic theses and academic papers. The methodologies involved in characterizing the Vs structure at the sites in the database include seismic refraction, multichannel analysis of seismic waves and refraction microtremor. Invasive measurements were performed in selected locations in order to compare the Vs profiles obtained from both invasive and non-invasive techniques. In general there was good agreement in the subsurface structure of Vs30 obtained from the different methodologies. The database flat-file includes information on Vs30, surface geology at 1:50.000 and 1:500.000 scales, and elevation and topographic slope derived from the SRTM30 topographic dataset. The procedure used to develop the site-conditions map is based on a three-step process that includes defining a preliminary set of geological units based on the literature, performing statistical tests to assess whether or not the differences in the distributions of Vs30 are statistically significant, and merging the geological units accordingly. The dataset was, to some extent, affected by clustering and/or preferential sampling and therefore a declustering algorithm was applied. The final model includes three geological units: 1) Igneous, metamorphic and old (Paleogene and Mesozoic) sedimentary rocks; 2) Neogene and Pleistocene formations, and 3) Holocene formations. The evaluation of proxies indicates that although geological analogues and topographic slope are in general unbiased, the latter shows significant bias for particular geological units and consequently for some geographical regions.
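
    The unit-merging step described above can be sketched as a pairwise statistical test: compare log(Vs30) samples from two candidate geological units and merge the units when the difference is not significant. The nonparametric test and significance level below are illustrative assumptions; the study's exact statistical procedure may differ.

```python
# Illustrative merge decision between two geological units based on Vs30 samples.
import numpy as np
from scipy.stats import mannwhitneyu

def should_merge(vs30_unit_a, vs30_unit_b, alpha=0.05):
    """Merge the two units when their log(Vs30) distributions are not significantly different."""
    stat, p = mannwhitneyu(np.log(vs30_unit_a), np.log(vs30_unit_b),
                           alternative="two-sided")
    return p >= alpha, p

rng = np.random.default_rng(2)
holocene = rng.lognormal(mean=np.log(200), sigma=0.3, size=40)   # simulated Vs30 (m/s)
neogene = rng.lognormal(mean=np.log(350), sigma=0.3, size=35)
print(should_merge(holocene, neogene))
```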

  1. The Great Lakes Hydrography Dataset: Consistent, binational ...

    EPA Pesticide Factsheets

    Ecosystem-based management of the Laurentian Great Lakes, which spans both the United States and Canada, is hampered by the lack of consistent binational watersheds for the entire Basin. Using comparable data sources and consistent methods, we developed spatially equivalent watershed boundaries for the binational extent of the Basin to create the Great Lakes Hydrography Dataset (GLHD). The GLHD consists of 5,589 watersheds for the entire Basin, covering a total area of approximately 547,967 km2, or about twice the 247,003 km2 surface water area of the Great Lakes. The GLHD improves upon existing watershed efforts by delineating watersheds for the entire Basin using consistent methods; enhancing the precision of watershed delineation by using recently developed flow direction grids that have been hydrologically enforced and vetted by provincial and federal water resource agencies; and increasing the accuracy of watershed boundaries by enforcing embayments, delineating watersheds on islands, and delineating watersheds for all tributaries draining to connecting channels. In addition, the GLHD is packaged in a publicly available geodatabase that includes synthetic stream networks, reach catchments, watershed boundaries, a broad set of attribute data for each tributary, and metadata documenting methodology. The GLHD provides a common set of watersheds and associated hydrography data for the Basin that will enhance binational efforts to protect and restore the Great Lakes.

  2. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets.

    PubMed

    Koren, Omry; Knights, Dan; Gonzalez, Antonio; Waldron, Levi; Segata, Nicola; Knight, Rob; Huttenhower, Curtis; Ley, Ruth E

    2013-01-01

    Recent analyses of human-associated bacterial diversity have categorized individuals into 'enterotypes' or clusters based on the abundances of key bacterial genera in the gut microbiota. There is a lack of consensus, however, on the analytical basis for enterotypes and on the interpretation of these results. We tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region. We included 16S rRNA gene sequences from the Human Microbiome Project (HMP) and from 16 additional studies and WGS sequences from the HMP and MetaHIT. In most body sites, we observed smooth abundance gradients of key genera without discrete clustering of samples. Some body habitats displayed bimodal (e.g., gut) or multimodal (e.g., vagina) distributions of sample abundances, but not all clustering methods and workflows accurately highlight such clusters. Because identifying enterotypes in datasets depends not only on the structure of the data but is also sensitive to the methods applied to identifying clustering strength, we recommend that multiple approaches be used and compared when testing for enterotypes.
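
    A minimal sketch, under simplifying assumptions, of one enterotype-style workflow: build a Jensen-Shannon distance matrix between genus abundance profiles, cluster hierarchically, and score cluster strength across candidate numbers of clusters with the silhouette index. The original analyses additionally compare PAM, other distance metrics, OTU-picking choices and sequencing depths; this shows only the skeleton on simulated profiles.

```python
# Illustrative enterotype-style clustering-strength check on simulated abundances.
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
profiles = rng.dirichlet(alpha=np.ones(20) * 0.5, size=60)  # 60 samples x 20 genera

# Pairwise Jensen-Shannon distance matrix between samples.
n = len(profiles)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = jensenshannon(profiles[i], profiles[j])

# Average-linkage clustering; silhouette scored for several candidate cluster counts.
Z = linkage(squareform(D), method="average")
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, round(silhouette_score(D, labels, metric="precomputed"), 3))
```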

  3. A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets

    PubMed Central

    Waldron, Levi; Segata, Nicola; Knight, Rob; Huttenhower, Curtis; Ley, Ruth E.

    2013-01-01

    Recent analyses of human-associated bacterial diversity have categorized individuals into ‘enterotypes’ or clusters based on the abundances of key bacterial genera in the gut microbiota. There is a lack of consensus, however, on the analytical basis for enterotypes and on the interpretation of these results. We tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region. We included 16S rRNA gene sequences from the Human Microbiome Project (HMP) and from 16 additional studies and WGS sequences from the HMP and MetaHIT. In most body sites, we observed smooth abundance gradients of key genera without discrete clustering of samples. Some body habitats displayed bimodal (e.g., gut) or multimodal (e.g., vagina) distributions of sample abundances, but not all clustering methods and workflows accurately highlight such clusters. Because identifying enterotypes in datasets depends not only on the structure of the data but is also sensitive to the methods applied to identifying clustering strength, we recommend that multiple approaches be used and compared when testing for enterotypes. PMID:23326225

  4. A global wind resource atlas including high-resolution terrain effects

    NASA Astrophysics Data System (ADS)

    Hahmann, Andrea; Badger, Jake; Olsen, Bjarke; Davis, Neil; Larsen, Xiaoli; Badger, Merete

    2015-04-01

    Currently, no accurate global wind resource dataset is available to fill the needs of policy makers and strategic energy planners. Evaluating wind resources directly from coarse resolution reanalysis datasets underestimates the true wind energy resource, as the small-scale spatial variability of winds is missing. This missing variability can account for a large part of the local wind resource. Crucially, it is the windiest sites that suffer the largest wind resource errors: in simple terrain the windiest sites may be underestimated by 25%; in complex terrain the underestimate can be as large as 100%. The small-scale spatial variability of winds can be modelled using novel statistical methods and by application of established microscale models within WAsP developed at DTU Wind Energy. We present the framework for a single global methodology, which is relatively fast and economical to complete. The method employs reanalysis datasets, which are downscaled to high-resolution wind resource datasets via a so-called generalization step, and microscale modelling using WAsP. This method will create the first global wind atlas (GWA) that covers all land areas (except Antarctica) and a 30 km coastal zone over water. Verification of the GWA estimates will be done at carefully selected test regions, against verified estimates from mesoscale modelling and satellite synthetic aperture radar (SAR). This verification exercise will also help in the estimation of the uncertainty of the new wind climate dataset. Uncertainty will be assessed as a function of spatial aggregation. It is expected that the uncertainty at verification sites will be larger than that of dedicated assessments, but the uncertainty will be reduced at levels of aggregation appropriate for energy planning, and importantly much improved relative to what is used today. In this presentation we discuss the methodology used, which includes the generalization of wind climatologies, and the differences in local and spatially aggregated wind resources that result from using different reanalyses in the various verification regions. A prototype web interface for the public access to the data will also be showcased.

  5. Generation and evaluation of typical meteorological year datasets for greenhouse and external conditions on the Mediterranean coast.

    PubMed

    Fernández, M D; López, J C; Baeza, E; Céspedes, A; Meca, D E; Bailey, B

    2015-08-01

    A typical meteorological year (TMY) represents the typical meteorological conditions over many years but still contains the short term fluctuations which are absent from long-term averaged data. Meteorological data were measured at the Experimental Station of Cajamar 'Las Palmerillas' (Cajamar Foundation) in Almeria, Spain, over 19 years at the meteorological station and in a reference greenhouse which is typical of those used in the region. The two sets of measurements were subjected to quality control analysis and then used to create TMY datasets using three different methodologies proposed in the literature. Three TMY datasets were generated for the external conditions and two for the greenhouse. They were assessed by using each as input to seven horticultural models and comparing the model results with those obtained by experiment in practical trials. In addition, the models were used with the meteorological data recorded during the trials. A scoring system was used to identify the best performing TMY in each application and then rank them in overall performance. The best methodology was that of Argiriou for both greenhouse and external conditions. The average relative errors between the seasonal values estimated using the 19-year dataset and those using the Argiriou greenhouse TMY were 2.2 % (reference evapotranspiration), -0.45 % (pepper crop transpiration), 3.4 % (pepper crop nitrogen uptake) and 0.8 % (green bean yield). The values obtained using the Argiriou external TMY were 1.8 % (greenhouse reference evapotranspiration), 0.6 % (external reference evapotranspiration), 4.7 % (greenhouse heat requirement) and 0.9 % (loquat harvest date). Using the models with the 19 individual years in the historical dataset showed that the year to year weather variability gave results which differed from the average values by ± 15 %. By comparison with results from other greenhouses it was shown that the greenhouse TMY is applicable to greenhouses which have a solar radiation transmission of approximately 65 % and rely on manual control of ventilation which constitute the majority in the south-east of Spain and in most Mediterranean greenhouse areas.
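
    TMY construction methods of the kind compared above typically select, for each calendar month, the historical year whose empirical distributions best match the long-term distributions, using the Finkelstein-Schafer statistic combined across weather variables with weights. The sketch below illustrates that selection step on simulated daily data; the weights and variables are placeholders, not those of any specific published formulation.

```python
# Illustrative Finkelstein-Schafer month selection for a TMY-style dataset.
import numpy as np

def fs_statistic(candidate, long_term):
    """Mean absolute difference between candidate and long-term empirical CDFs."""
    grid = np.sort(long_term)
    ecdf = lambda x, data: np.searchsorted(np.sort(data), x, side="right") / len(data)
    return np.mean(np.abs(ecdf(grid, candidate) - ecdf(grid, long_term)))

def pick_tmy_month(daily_by_year, weights):
    """daily_by_year: {year: {variable: 1-D array of daily values for one calendar month}}."""
    long_term = {v: np.concatenate([d[v] for d in daily_by_year.values()]) for v in weights}
    scores = {yr: sum(w * fs_statistic(d[v], long_term[v]) for v, w in weights.items())
              for yr, d in daily_by_year.items()}
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(4)
data = {yr: {"ghi": rng.normal(5 + 0.1 * i, 1, 30), "tair": rng.normal(20, 3, 30)}
        for i, yr in enumerate(range(2000, 2005))}
print(pick_tmy_month(data, weights={"ghi": 0.5, "tair": 0.5})[0])
```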

  6. Generation and evaluation of typical meteorological year datasets for greenhouse and external conditions on the Mediterranean coast

    NASA Astrophysics Data System (ADS)

    Fernández, M. D.; López, J. C.; Baeza, E.; Céspedes, A.; Meca, D. E.; Bailey, B.

    2015-08-01

    A typical meteorological year (TMY) represents the typical meteorological conditions over many years but still contains the short term fluctuations which are absent from long-term averaged data. Meteorological data were measured at the Experimental Station of Cajamar `Las Palmerillas' (Cajamar Foundation) in Almeria, Spain, over 19 years at the meteorological station and in a reference greenhouse which is typical of those used in the region. The two sets of measurements were subjected to quality control analysis and then used to create TMY datasets using three different methodologies proposed in the literature. Three TMY datasets were generated for the external conditions and two for the greenhouse. They were assessed by using each as input to seven horticultural models and comparing the model results with those obtained by experiment in practical trials. In addition, the models were used with the meteorological data recorded during the trials. A scoring system was used to identify the best performing TMY in each application and then rank them in overall performance. The best methodology was that of Argiriou for both greenhouse and external conditions. The average relative errors between the seasonal values estimated using the 19-year dataset and those using the Argiriou greenhouse TMY were 2.2 % (reference evapotranspiration), -0.45 % (pepper crop transpiration), 3.4 % (pepper crop nitrogen uptake) and 0.8 % (green bean yield). The values obtained using the Argiriou external TMY were 1.8 % (greenhouse reference evapotranspiration), 0.6 % (external reference evapotranspiration), 4.7 % (greenhouse heat requirement) and 0.9 % (loquat harvest date). Using the models with the 19 individual years in the historical dataset showed that the year to year weather variability gave results which differed from the average values by ± 15 %. By comparison with results from other greenhouses it was shown that the greenhouse TMY is applicable to greenhouses which have a solar radiation transmission of approximately 65 % and rely on manual control of ventilation which constitute the majority in the south-east of Spain and in most Mediterranean greenhouse areas.

  7. A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning

    NASA Astrophysics Data System (ADS)

    Houborg, Rasmus; McCabe, Matthew F.

    2018-01-01

    With an increasing volume and dimensionality of Earth observation data, enhanced integration of machine-learning methodologies is needed to effectively analyze and utilize these information-rich datasets. In machine-learning, a training dataset is required to establish explicit associations between a suite of explanatory 'predictor' variables and the target property. The specifics of this learning process can significantly influence model validity and portability, with a higher generalization level expected with an increasing number of observable conditions being reflected in the training dataset. Here we propose a hybrid training approach for leaf area index (LAI) estimation, which harnesses synergistic attributes of scattered in-situ measurements and systematically distributed physically based model inversion results to enhance the information content and spatial representativeness of the training data. To do this, a complementary training dataset of independent LAI was derived from a regularized model inversion of RapidEye surface reflectances and subsequently used to guide the development of LAI regression models via Cubist and random forests (RF) decision tree methods. The application of the hybrid training approach to a broad set of Landsat 8 vegetation index (VI) predictor variables resulted in significantly improved LAI prediction accuracies and spatial consistencies, relative to results relying on in-situ measurements alone for model training. In comparing the prediction capacity and portability of the two machine-learning algorithms, a pair of relatively simple multi-variate regression models established by Cubist performed best, with an overall relative mean absolute deviation (rMAD) of ∼11%, determined based on a stringent scene-specific cross-validation approach. In comparison, the portability of RF regression models was less effective (i.e., an overall rMAD of ∼15%), which was attributed partly to model saturation at high LAI in association with inherent extrapolation and transferability limitations. Explanatory VIs formed from bands in the near-infrared (NIR) and shortwave infrared domains (e.g., NDWI) were associated with the highest predictive ability, whereas Cubist models relying entirely on VIs based on NIR and red band combinations (e.g., NDVI) were associated with comparatively high uncertainties (i.e., rMAD ∼ 21%). The most transferable and best performing models were based on combinations of several predictor variables, which included both NDWI- and NDVI-like variables. In this process, prior screening of input VIs based on an assessment of variable relevance served as an effective mechanism for optimizing prediction accuracies from both Cubist and RF. While this study demonstrated benefit in combining data mining operations with physically based constraints via a hybrid training approach, the concept of transferability and portability warrants further investigations in order to realize the full potential of emerging machine-learning techniques for regression purposes.
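
    An illustrative sketch of the hybrid-training idea: stack scarce in-situ LAI samples with spatially distributed LAI from a physically based inversion and fit a tree-based regressor on vegetation-index predictors. The simulated features and the use of scikit-learn's RandomForestRegressor (rather than Cubist) are assumptions made only for the sketch.

```python
# Illustrative hybrid training set for LAI regression on vegetation-index predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
vi_insitu, lai_insitu = rng.uniform(0, 1, (30, 3)), rng.uniform(0, 6, 30)    # scarce field samples
vi_model, lai_model = rng.uniform(0, 1, (300, 3)), rng.uniform(0, 6, 300)    # inversion-derived samples

# Hybrid training set: in-situ points plus physically based inversion points.
X = np.vstack([vi_insitu, vi_model])
y = np.concatenate([lai_insitu, lai_model])

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(rf.predict(vi_insitu[:3]))
```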

  8. Utility and Limitations of Using Gene Expression Data to Identify Functional Associations

    PubMed Central

    Peng, Cheng; Shiu, Shin-Han

    2016-01-01

    Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. By determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, we found that, using Arabidopsis thaliana as an example, many genes in the same pathway do not have similar transcript profiles and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impacts the ability to recover functional associations between genes. Some datasets are more informative in capturing coordinated expression profiles and larger data sets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore multiple dataset combinations, similarity measures, clustering algorithms and parameters rather exhaustively. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. PMID:27935950
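
    The evaluation criterion described above, the fraction of gene pairs within an annotated pathway whose expression profiles are significantly correlated, can be sketched as follows. Pearson correlation and the significance level are illustrative choices; the study also explores other similarity measures, and the gene names below are arbitrary.

```python
# Illustrative pathway co-expression recovery score on simulated expression data.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

def pathway_coexpression_fraction(expr, pathway_genes, alpha=0.05):
    """expr: {gene: 1-D array of expression values across samples}."""
    pairs = list(combinations([g for g in pathway_genes if g in expr], 2))
    if not pairs:
        return float("nan")
    hits = sum(1 for a, b in pairs if pearsonr(expr[a], expr[b])[1] < alpha)
    return hits / len(pairs)

rng = np.random.default_rng(6)
base = rng.normal(size=50)
expr = {"LEU1": base + 0.1 * rng.normal(size=50),
        "LEU2": base + 0.1 * rng.normal(size=50),
        "OTHER": rng.normal(size=50)}
print(pathway_coexpression_fraction(expr, ["LEU1", "LEU2", "OTHER"]))
```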

  9. Picking vs Waveform based detection and location methods for induced seismicity monitoring

    NASA Astrophysics Data System (ADS)

    Grigoli, Francesco; Boese, Maren; Scarabello, Luca; Diehl, Tobias; Weber, Bernd; Wiemer, Stefan; Clinton, John F.

    2017-04-01

    Microseismic monitoring is a common operation in various industrial activities related to geo-resources, such as oil and gas and mining operations or geothermal energy exploitation. In microseismic monitoring we generally deal with large datasets from dense monitoring networks that require robust automated analysis procedures. The seismic sequences being monitored are often characterized by very many events with short inter-event times that can even produce overlapping seismic signatures. In these situations, traditional approaches that identify seismic events using dense seismic networks based on detections, phase identification and event association can fail, leading to missed detections and/or reduced location resolution. In recent years, to improve the quality of automated catalogues, various waveform-based methods for the detection and location of microseismicity have been proposed. These methods exploit the coherence of the waveforms recorded at different stations and do not require any automated picking procedure. Although this family of methods has been applied to different induced seismicity datasets, an extensive comparison with sophisticated pick-based detection and location methods is still lacking. We aim here to perform a systematic comparison in terms of performance using the waveform-based method LOKI and the pick-based detection and location methods (SCAUTOLOC and SCANLOC) implemented within the SeisComP3 software package. SCANLOC is a new detection and location method specifically designed for seismic monitoring at local scale. Although recent applications have proved promising, an extensive test with induced seismicity datasets has not yet been performed. This method is based on a cluster search algorithm to associate detections to one or many potential earthquake sources. On the other hand, SCAUTOLOC is a more "conventional" method and is the basic tool for seismic event detection and location in SeisComp3. This approach was specifically designed for regional and teleseismic applications, thus its performance with microseismic data might be limited. We analyze the performance of the three methodologies for a synthetic dataset with realistic noise conditions as well as for the first hour of continuous waveform data, including the Ml 3.5 St. Gallen earthquake, recorded by a microseismic network deployed in the area. We finally compare the results obtained with all three methods with a manually revised catalogue.

  10. Gene-Gene and Gene-Environment Interactions in Ulcerative Colitis

    PubMed Central

    Wang, Ming-Hsi; Fiocchi, Claudio; Zhu, Xiaofeng; Ripke, Stephan; Kamboh, M. Ilyas; Rebert, Nancy; Duerr, Richard H.; Achkar, Jean-Paul

    2014-01-01

    Genome-wide association studies (GWAS) have identified at least 133 ulcerative colitis (UC) associated loci. The role of genetic factors in clinical practice is not clearly defined. The relevance of genetic variants to disease pathogenesis is still uncertain because gene-gene and gene-environment interactions have not been characterized. We examined the predictive value of combining the 133 UC risk loci with genetic interactions in an ongoing inflammatory bowel disease (IBD) GWAS. The Wellcome Trust Case-Control Consortium (WTCCC) IBD GWAS was used as a replication cohort. We applied logic regression (LR), a novel adaptive regression methodology, to search for high-order interactions. Exploratory genotype correlations with UC sub-phenotypes (extent of disease, need for surgery, age of onset, extra-intestinal manifestations and primary sclerosing cholangitis (PSC)) were conducted. The combination of 133 UC loci yielded good UC risk predictability (area under the curve [AUC] of 0.86). A higher cumulative allele score predicted higher UC risk. Through LR, several lines of evidence for genetic interactions were identified and successfully replicated in the WTCCC cohort. The genetic interactions combined with the gene-smoking interaction significantly improved predictability in the model (AUC, from 0.86 to 0.89, P=3.26E-05). Explained UC variance increased from 37% to 42% after adding the interaction terms. A within-case analysis suggested a genetic association with PSC. Our study demonstrates that the LR methodology allows the identification and replication of high-order genetic interactions in UC GWAS datasets. UC risk can be predicted by the 133 loci and improved by adding gene-gene and gene-environment interactions. PMID:24241240
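
    A hedged sketch of the cumulative allele score referenced above: sum each subject's risk-allele counts across loci, optionally weighted by per-locus log odds ratios, and evaluate discrimination with the area under the ROC curve. Genotypes, weights and case status are simulated here; the logic-regression interaction search is not reproduced.

```python
# Illustrative weighted cumulative allele score and AUC on simulated genotypes.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n_subjects, n_loci = 500, 133
genotypes = rng.binomial(2, 0.3, size=(n_subjects, n_loci))   # risk-allele counts (0/1/2)
log_or = rng.normal(0.1, 0.05, size=n_loci)                   # hypothetical per-locus weights

score = genotypes @ log_or                                    # cumulative allele score
p_case = 1 / (1 + np.exp(-(score - score.mean())))            # simulated disease model
status = rng.binomial(1, p_case)                              # simulated case/control labels

print(round(roc_auc_score(status, score), 3))
```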

  11. Classification of Alzheimer's Patients through Ubiquitous Computing.

    PubMed

    Nieto-Reyes, Alicia; Duque, Rafael; Montaña, José Luis; Lage, Carmen

    2017-07-21

    Functional data analysis and artificial neural networks are the building blocks of the proposed methodology, which distinguishes the movement patterns of Alzheimer's patients at different stages of the disease and classifies new patients to their appropriate stage. The movement patterns are obtained from the accelerometer of Android smartphones that the patients carry while moving freely. The proposed methodology is relevant in that it is flexible with respect to the type of data to which it is applied. To exemplify this, a novel real three-dimensional functional dataset is analyzed in which each datum is observed in a different time domain; not only is each datum observed at a different frequency, but the domains also have different lengths. The obtained classification success rate of 83% indicates the potential of the proposed methodology.

  12. Iterative refinement of implicit boundary models for improved geological feature reproduction

    NASA Astrophysics Data System (ADS)

    Martin, Ryan; Boisvert, Jeff B.

    2017-12-01

    Geological domains contain non-stationary features that cannot be described by a single direction of continuity. Non-stationary estimation frameworks generate more realistic curvilinear interpretations of subsurface geometries. A radial basis function (RBF) based implicit modeling framework using domain decomposition is developed that permits introduction of locally varying orientations and magnitudes of anisotropy for boundary models to better account for the local variability of complex geological deposits. The interpolation framework is paired with a method to automatically infer the locally predominant orientations, which results in a rapid and robust iterative non-stationary boundary modeling technique that can refine locally anisotropic geological shapes automatically from the sample data. The method also permits quantification of the volumetric uncertainty associated with the boundary modeling. The methodology is demonstrated on a porphyry dataset and shows improved local geological features.

  13. Polynomial chaos representation of databases on manifolds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu

    2017-04-15

    Characterizing the polynomial chaos expansion (PCE) of a vector-valued random variable with probability distribution concentrated on a manifold is a relevant problem in data-driven settings. The probability distribution of such random vectors is multimodal in general, leading to potentially very slow convergence of the PCE. In this paper, we build on a recent development for estimating and sampling from probabilities concentrated on a diffusion manifold. The proposed methodology constructs a PCE of the random vector together with an associated generator that samples from the target probability distribution which is estimated from data concentrated in the neighborhood of the manifold. The method is robust and remains efficient for high dimension and large datasets. The resulting polynomial chaos construction on manifolds permits the adaptation of many uncertainty quantification and statistical tools to emerging questions motivated by data-driven queries.

  14. An informatics research agenda to support precision medicine: seven key areas.

    PubMed

    Tenenbaum, Jessica D; Avillach, Paul; Benham-Hutchins, Marge; Breitenstein, Matthew K; Crowgey, Erin L; Hoffman, Mark A; Jiang, Xia; Madhavan, Subha; Mattison, John E; Nagarajan, Radhakrishnan; Ray, Bisakha; Shin, Dmitriy; Visweswaran, Shyam; Zhao, Zhongming; Freimuth, Robert R

    2016-07-01

    The recent announcement of the Precision Medicine Initiative by President Obama has brought precision medicine (PM) to the forefront for healthcare providers, researchers, regulators, innovators, and funders alike. As technologies continue to evolve and datasets grow in magnitude, a strong computational infrastructure will be essential to realize PM's vision of improved healthcare derived from personal data. In addition, informatics research and innovation affords a tremendous opportunity to drive the science underlying PM. The informatics community must lead the development of technologies and methodologies that will increase the discovery and application of biomedical knowledge through close collaboration between researchers, clinicians, and patients. This perspective highlights seven key areas that are in need of further informatics research and innovation to support the realization of PM. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  15. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    NASA Astrophysics Data System (ADS)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The most used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978), considering a specific weighted combination of different meteorological variables, notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed and relative humidity. In 2012, a new approach was proposed in the framework of the European project FP7 ENDORSE. It introduced the concept of "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables to improve the representativeness of the TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time-series of high quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not suitable for CPV systems. For these systems, the TMY datasets obtained using dedicated drivers (DNI only or a more precise one) are more representative for deriving TMY datasets from a limited long-term meteorological dataset.

  16. Investigating Teacher Stress when Using Technology

    ERIC Educational Resources Information Center

    Al-Fudail, Mohammed; Mellar, Harvey

    2008-01-01

    In this study we use a model which we refer to as the "teacher-technology environment interaction model" to explore the issue of the stress experienced by teachers whilst using ICT in the classroom. The methodology we used involved a comparison of three datasets obtained from: direct observation and video-logging of the teachers in the classroom;…

  17. Linguistic and Cultural Effects on the Attainment of Ethnic Minority Students: Some Methodological Considerations

    ERIC Educational Resources Information Center

    Theodosiou-Zipiti, Galatia; Lamprianou, Iasonas

    2016-01-01

    Established literature suggests that language problems lead to lower attainment levels in those subjects that are more language dependent. Also, language has been suggested as a main driver of ethnic minority attainment. We use an original dataset of 2,020 secondary school students to show that ethnic minority students in Cyprus underperform…

  18. A Common Methodology: Using Cluster Analysis to Identify Organizational Culture across Two Workforce Datasets

    ERIC Educational Resources Information Center

    Munn, Sunny L.

    2016-01-01

    Organizational structures are comprised of an organizational culture created by the beliefs, values, traditions, policies and processes carried out by the organization. The work-life system in which individuals use work-life initiatives to achieve a work-life balance can be influenced by the type of organizational culture within one's workplace,…

  19. A Bootstrap Based Measure Robust to the Choice of Normalization Methods for Detecting Rhythmic Features in High Dimensional Data.

    PubMed

    Larriba, Yolanda; Rueda, Cristina; Fernández, Miguel A; Peddada, Shyamal D

    2018-01-01

    Motivation: Gene-expression data obtained from high throughput technologies are subject to various sources of noise and accordingly the raw data are pre-processed before formally analyzed. Normalization of the data is a key pre-processing step, since it removes systematic variations across arrays. There are numerous normalization methods available in the literature. Based on our experience, in the context of oscillatory systems, such as cell-cycle, circadian clock, etc., the choice of the normalization method may substantially impact the determination of a gene to be rhythmic. Thus rhythmicity of a gene can purely be an artifact of how the data were normalized. Since the determination of rhythmic genes is an important component of modern toxicological and pharmacological studies, it is important to determine truly rhythmic genes that are robust to the choice of a normalization method. Results: In this paper we introduce a rhythmicity measure and a bootstrap methodology to detect rhythmic genes in an oscillatory system. Although the proposed methodology can be used for any high-throughput gene expression data, in this paper we illustrate the proposed methodology using several publicly available circadian clock microarray gene-expression datasets. We demonstrate that the choice of normalization method has very little effect on the proposed methodology. Specifically, for any pair of normalization methods considered in this paper, the resulting values of the rhythmicity measure are highly correlated. Thus it suggests that the proposed measure is robust to the choice of a normalization method. Consequently, the rhythmicity of a gene is potentially not a mere artifact of the normalization method used. Lastly, as demonstrated in the paper, the proposed bootstrap methodology can also be used for simulating data for genes participating in an oscillatory system using a reference dataset. Availability: A user friendly code implemented in R language can be downloaded from http://www.eio.uva.es/~miguel/robustdetectionprocedure.html.

  20. A Bootstrap Based Measure Robust to the Choice of Normalization Methods for Detecting Rhythmic Features in High Dimensional Data

    PubMed Central

    Larriba, Yolanda; Rueda, Cristina; Fernández, Miguel A.; Peddada, Shyamal D.

    2018-01-01

    Motivation: Gene-expression data obtained from high throughput technologies are subject to various sources of noise and accordingly the raw data are pre-processed before formally analyzed. Normalization of the data is a key pre-processing step, since it removes systematic variations across arrays. There are numerous normalization methods available in the literature. Based on our experience, in the context of oscillatory systems, such as cell-cycle, circadian clock, etc., the choice of the normalization method may substantially impact the determination of a gene to be rhythmic. Thus rhythmicity of a gene can purely be an artifact of how the data were normalized. Since the determination of rhythmic genes is an important component of modern toxicological and pharmacological studies, it is important to determine truly rhythmic genes that are robust to the choice of a normalization method. Results: In this paper we introduce a rhythmicity measure and a bootstrap methodology to detect rhythmic genes in an oscillatory system. Although the proposed methodology can be used for any high-throughput gene expression data, in this paper we illustrate the proposed methodology using several publicly available circadian clock microarray gene-expression datasets. We demonstrate that the choice of normalization method has very little effect on the proposed methodology. Specifically, for any pair of normalization methods considered in this paper, the resulting values of the rhythmicity measure are highly correlated. Thus it suggests that the proposed measure is robust to the choice of a normalization method. Consequently, the rhythmicity of a gene is potentially not a mere artifact of the normalization method used. Lastly, as demonstrated in the paper, the proposed bootstrap methodology can also be used for simulating data for genes participating in an oscillatory system using a reference dataset. Availability: A user friendly code implemented in R language can be downloaded from http://www.eio.uva.es/~miguel/robustdetectionprocedure.html PMID:29456555
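
    The general idea of testing rhythmicity against a resampled null can be sketched in a few lines. The example below fits a fixed-period cosinor and compares its amplitude against a permutation-based null distribution; this is a generic stand-in for the paper's rhythmicity measure and bootstrap procedure, and the period, statistic and resampling scheme are assumptions for illustration.

```python
# Generic illustration of resampling-based rhythmicity detection,
# assuming evenly spaced time points and a known period (e.g., 24 h).
# This is NOT the exact measure or bootstrap proposed in the paper.
import numpy as np

def cosinor_amplitude(y, t, period=24.0):
    """Least-squares amplitude of a cosine of the given period fitted to y(t)."""
    X = np.column_stack([np.ones_like(t),
                         np.cos(2 * np.pi * t / period),
                         np.sin(2 * np.pi * t / period)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.hypot(beta[1], beta[2])

def rhythmicity_p_value(y, t, n_resamples=1000, period=24.0, seed=0):
    """P-value: how often does a time-shuffled series show an amplitude as large as the observed one?"""
    rng = np.random.default_rng(seed)
    observed = cosinor_amplitude(y, t, period)
    null = np.array([cosinor_amplitude(rng.permutation(y), t, period)
                     for _ in range(n_resamples)])
    return (np.sum(null >= observed) + 1) / (n_resamples + 1)
```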

  1. Climate Model Diagnostic Analyzer Web Service System

    NASA Astrophysics Data System (ADS)

    Lee, S.; Pan, L.; Zhai, C.; Tang, B.; Kubar, T. L.; Li, J.; Zhang, J.; Wang, W.

    2015-12-01

    Both the National Research Council Decadal Survey and the latest Intergovernmental Panel on Climate Change Assessment Report stressed the need for the comprehensive and innovative evaluation of climate models with the synergistic use of global satellite observations in order to improve our weather and climate simulation and prediction capabilities. The abundance of satellite observations for fundamental climate parameters and the availability of coordinated model outputs from CMIP5 for the same parameters offer a great opportunity to understand and diagnose model biases in climate models. In addition, the Obs4MIPs efforts have created several key global observational datasets that are readily usable for model evaluations. However, a model diagnostic evaluation process requires physics-based multi-variable comparisons that typically involve large-volume and heterogeneous datasets, making them both computationally- and data-intensive. In response, we have developed a novel methodology to diagnose model biases in contemporary climate models and implemented it as a web-service-based, cloud-enabled, provenance-supported climate-model evaluation system. The evaluation system is named Climate Model Diagnostic Analyzer (CMDA), which is the product of the research and technology development investments of several current and past NASA ROSES programs. The current technologies and infrastructure of CMDA are designed and selected to address several technical challenges that the Earth science modeling and model analysis community faces in evaluating and diagnosing climate models. In particular, we have three key technology components: (1) diagnostic analysis methodology; (2) web-service based, cloud-enabled technology; (3) provenance-supported technology. The diagnostic analysis methodology includes random forest feature importance ranking, conditional probability distribution function, conditional sampling, and time-lagged correlation map. We have implemented the new methodology as web services and incorporated the system into the Cloud. We have also developed a provenance management system for CMDA where CMDA service semantics modeling, service search and recommendation, and service execution history management are designed and implemented.
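
    Of the diagnostic components listed above, the time-lagged correlation map is the most self-contained to illustrate. The sketch below computes it for two co-gridded variables; the array shapes, variable roles and lag are assumptions for illustration, not CMDA's actual service interface.

```python
# Sketch of a time-lagged correlation map between two gridded variables
# (e.g., a model field vs. an observed field). Assumes `a` and `b` are
# numpy arrays of shape (time, lat, lon) on the same grid.
import numpy as np

def lagged_correlation_map(a, b, lag):
    """Pearson correlation at each grid cell between a(t) and b(t + lag)."""
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    a_anom = a - a.mean(axis=0)
    b_anom = b - b.mean(axis=0)
    cov = (a_anom * b_anom).mean(axis=0)
    return cov / (a_anom.std(axis=0) * b_anom.std(axis=0))

# Example: correlation map with variable `a` leading variable `b` by 5 time steps
# corr_map = lagged_correlation_map(a, b, lag=5)
```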

  2. Acquisition of thin coronal sectional dataset of cadaveric liver.

    PubMed

    Lou, Li; Liu, Shu Wei; Zhao, Zhen Mei; Tang, Yu Chun; Lin, Xiang Tao

    2014-04-01

    To obtain a thin coronal sectional anatomic dataset of the liver using a digital freezing-milling technique. The upper abdomen of one Chinese adult cadaver was selected as the specimen. After CT and MRI examinations verified the absence of liver lesions, the specimen was embedded in gelatin in an erect position and frozen under profound hypothermia; it was then serially sectioned from anterior to posterior, layer by layer, with a digital milling machine in the freezing chamber. The sequential images were captured by means of a digital camera and the dataset was imported to an imaging workstation. The thin serial sections of the liver totaled 699 layers, each 0.2 mm thick. The shape, location, structure, intrahepatic vessels and adjacent structures of the liver were displayed clearly on each coronal sectional slice. CT and MR images through the body were obtained at 1.0 and 3.0 mm intervals, respectively. The methodology reported here is an adaptation of previously described milling methods and provides a new data acquisition method for sectional anatomy. The thin coronal sectional anatomic dataset of the liver obtained by this technique is of high precision and good quality.

  3. Spatial aspects of building and population exposure data and their implications for global earthquake exposure modeling

    USGS Publications Warehouse

    Dell’Acqua, F.; Gamba, P.; Jaiswal, K.

    2012-01-01

    This paper discusses spatial aspects of the global exposure dataset and mapping needs for earthquake risk assessment. We discuss this in the context of development of a Global Exposure Database for the Global Earthquake Model (GED4GEM), which requires compilation of a multi-scale inventory of assets at risk, for example, buildings, populations, and economic exposure. After defining the relevant spatial and geographic scales of interest, different procedures are proposed to disaggregate coarse-resolution data, to map them, and if necessary to infer missing data by using proxies. We discuss the advantages and limitations of these methodologies and detail the potentials of utilizing remote-sensing data. The latter is used especially to homogenize an existing coarser dataset and, where possible, replace it with detailed information extracted from remote sensing using the built-up indicators for different environments. Present research shows that the spatial aspects of earthquake risk computation are tightly connected with the availability of datasets of the resolution necessary for producing sufficiently detailed exposure. The global exposure database designed by the GED4GEM project is able to manage datasets and queries of multiple spatial scales.

  4. A multi-strategy approach to informative gene identification from gene expression data.

    PubMed

    Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D

    2010-02-01

    An unsupervised multi-strategy approach has been developed to identify informative genes from high throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method, in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remaining genes, which form the peripheral region, are subject to exclusion from or inclusion in the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, validated using biological knowledge, biological experiments, and literature searches. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair covering the same disease but generated by different labs using different platforms. The results showed that our method performed considerably better.
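
    The core/peripheral selection can be pictured as a vote across the gene lists produced by the individual techniques. The sketch below is a hypothetical illustration of that consensus step; the vote threshold and list names are assumptions, and the paper's confidence measure is more elaborate than a raw count.

```python
# Hypothetical sketch of a consensus "core / peripheral" gene selection:
# genes selected by many independent methods form the core,
# genes selected by only some methods form the periphery.
from collections import Counter

def core_and_periphery(gene_lists, core_votes=3):
    """Split genes into a high-confidence core and a peripheral set by vote count."""
    votes = Counter(g for genes in gene_lists for g in set(genes))
    core = {g for g, v in votes.items() if v >= core_votes}
    periphery = {g for g, v in votes.items() if 0 < v < core_votes}
    return core, periphery

# Example with lists from three hypothetical methods (t-test, SAM, fold-change):
# core, periphery = core_and_periphery([ttest_genes, sam_genes, fc_genes], core_votes=2)
```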

  5. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    PubMed Central

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identifying potential hits, i.e., compounds capable of interacting with a given target and potentially modulate its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509

  6. Transient Science from Diverse Surveys

    NASA Astrophysics Data System (ADS)

    Mahabal, A.; Crichton, D.; Djorgovski, S. G.; Donalek, C.; Drake, A.; Graham, M.; Law, E.

    2016-12-01

    Over the last several years we have moved closer to being able to make digital movies of the non-static sky with wide-field synoptic telescopes operating at a variety of depths, resolutions, and wavelengths. For optimal combined use of these datasets, it is crucial that they speak and understand the same language and are thus interoperable. Initial steps towards such interoperability (e.g. the footprint service) were taken during the two five-year Virtual Observatory projects viz. National Virtual Observatory (NVO), and later Virtual Astronomical Observatory (VAO). Now with far bigger datasets and in an era of resource excess thanks to the cloud-based workflows, we show how the movement of data and of resources is required - rather than just one or the other - to combine diverse datasets for applications such as real-time astronomical transient characterization. Taking the specific example of ElectroMagnetic (EM) follow-up of Gravitational Wave events and EM transients (such as CRTS but also other optical and non-optical surveys), we discuss the requirements for rapid and flexible response. We show how the same methodology is applicable to Earth Science data with its datasets differing in spatial and temporal resolution as well as differing time-spans.

  7. Inter-fraction variations in respiratory motion models

    NASA Astrophysics Data System (ADS)

    McClelland, J. R.; Hughes, S.; Modat, M.; Qureshi, A.; Ahmad, S.; Landau, D. B.; Ourselin, S.; Hawkes, D. J.

    2011-01-01

    Respiratory motion can vary dramatically between the planning stage and the different fractions of radiotherapy treatment. Motion predictions used when constructing the radiotherapy plan may be unsuitable for later fractions of treatment. This paper presents a methodology for constructing patient-specific respiratory motion models and uses these models to evaluate and analyse the inter-fraction variations in the respiratory motion. The internal respiratory motion is determined from the deformable registration of Cine CT data and related to a respiratory surrogate signal derived from 3D skin surface data. Three different models for relating the internal motion to the surrogate signal have been investigated in this work. Data were acquired from six lung cancer patients. Two full datasets were acquired for each patient, one before the course of radiotherapy treatment and one at the end (approximately 6 weeks later). Separate models were built for each dataset. All models could accurately predict the respiratory motion in the same dataset, but had large errors when predicting the motion in the other dataset. Analysis of the inter-fraction variations revealed that most variations were spatially varying base-line shifts, but changes to the anatomy and the motion trajectories were also observed.

  8. Evaluating global reanalysis datasets for provision of boundary conditions in regional climate modelling

    NASA Astrophysics Data System (ADS)

    Moalafhi, Ditiro B.; Evans, Jason P.; Sharma, Ashish

    2016-11-01

    Regional climate modelling studies often begin by downscaling a reanalysis dataset in order to simulate the observed climate, allowing the investigation of regional climate processes and quantification of the errors associated with the regional model. To date, the choice of reanalysis to perform such downscaling has been made based either on convenience or on the performance of the reanalyses within the regional domain for relevant variables such as near-surface air temperature and precipitation. However, the only information passed from the reanalysis to the regional model is the atmospheric temperature, moisture and winds at the boundaries of the regional domain. Here we present a methodology to evaluate reanalysis-derived lateral boundary conditions for an example domain over southern Africa using satellite data. This study focuses on atmospheric temperature and moisture, which are readily available. Five commonly used global reanalyses (NCEP1, NCEP2, ERA-I, 20CRv2, and MERRA) are evaluated against Atmospheric Infrared Sounder satellite temperature and relative humidity over the boundaries of two domains centred on southern Africa for the years 2003-2012 inclusive. The study reveals that MERRA is the most suitable for the climate mean, with NCEP1 the next most suitable. For climate variability, ERA-I is the best, followed by MERRA. Overall, MERRA is preferred for generating lateral boundary conditions for this domain, followed by ERA-I. While a "better" LBC specification is not the sole precursor to an improved downscaling outcome, any reduction in uncertainty associated with the specification of LBCs is a step in the right direction.
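
    In practice this kind of evaluation reduces to computing a mean bias (for the climate mean) and a temporal correlation (for variability) between reanalysis and satellite profiles sampled along the domain boundary. A minimal sketch is shown below; the array layout, and the assumption that both products have already been interpolated to a common boundary grid, are illustrative rather than taken from the study.

```python
# Sketch of comparing reanalysis temperature to satellite retrievals along a
# regional-domain boundary. Assumes arrays of shape (time, level, boundary_point)
# already matched in space and time; NaNs mark missing retrievals.
import numpy as np

def boundary_scores(reanalysis_t, satellite_t):
    """Return the mean bias and the mean temporal correlation over levels and boundary points."""
    bias = np.nanmean(reanalysis_t - satellite_t)
    ra = reanalysis_t - np.nanmean(reanalysis_t, axis=0)
    sa = satellite_t - np.nanmean(satellite_t, axis=0)
    corr = np.nansum(ra * sa, axis=0) / np.sqrt(np.nansum(ra**2, axis=0) * np.nansum(sa**2, axis=0))
    return bias, np.nanmean(corr)
```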

  9. Genotype Imputation with Thousands of Genomes

    PubMed Central

    Howie, Bryan; Marchini, Jonathan; Stephens, Matthew

    2011-01-01

    Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package. PMID:22384356

  10. Searching data for supporting archaeo-landscapes in Cyprus: an overview of aerial, satellite, and cartographic datasets of the island

    NASA Astrophysics Data System (ADS)

    Agapiou, Athos; Lysandrou, Vasiliki; Themistocleous, Kyriakos; Nisantzi, Argyro; Lasaponara, Rosa; Masini, Nicola; Krauss, Thomas; Cerra, Daniele; Gessner, Ursula; Schreier, Gunter; Hadjimitsis, Diofantos

    2016-08-01

    The landscape of Cyprus is characterized by transformations that occurred during the 20th century, with many of these changes still active today. Landscape changes are due to a variety of causes, including armed conflicts, environmental conditions and modern development, which have often altered or even totally erased important information that could have helped archaeologists comprehend the archaeo-landscape. The present work aims to provide detailed information, from a remote sensing perspective, regarding the different existing datasets that can be used to support archaeologists in understanding the transformations that the landscape of Cyprus has undergone. Such datasets may help archaeologists to visualize a lost landscape and to retrieve valuable information, while also supporting future investigations. They can further be used to highlight, in a predictive manner, and consequently assess the impacts of landscape transformation, whether of natural or anthropogenic origin, on cultural heritage. Three main datasets are presented here: aerial images, satellite datasets including spy satellite datasets acquired during the Cold War, and cadastral maps. The data are presented in chronological order (e.g., year of acquisition), and other important parameters such as cost and accuracy are also given. Individual examples of archaeological sites in Cyprus are provided for each dataset in order to underline both their importance and their performance. Some pre- and post-processing remote sensing methodologies are also briefly described in order to enhance the final results. The paper, written within the framework of the ATHENA project dedicated to remote sensing archaeology/CH, aims to fill a significant gap in the recent literature of remote sensing archaeology of the island and to assist current and future archaeologists in their quest for remote sensing information to support their research.

  11. Assessment of the Long Term Trends in Extreme Heat Events and the Associated Health Impacts in the United States

    NASA Astrophysics Data System (ADS)

    Bell, J.; Rennie, J.; Kunkel, K.; Herring, S.; Cullen, H. M.

    2017-12-01

    Land surface air temperature products have been essential for monitoring the evolution of the climate system. Before a temperature dataset is included in such monitoring products, it is important that non-climatic influences be removed or corrected so the dataset can be considered homogeneous. These inhomogeneities include changes in station location, instrumentation and observing practices. While many homogenized products exist on the monthly time scale, few daily products exist, due to the difficulty of removing only those breakpoints that reflect true inhomogeneities rather than chance variability (for example, sharp changes due to synoptic conditions). Recently, a sub-monthly homogenized dataset has been developed using data and software provided by NOAA's National Centers for Environmental Information (NCEI). Homogeneous daily data are useful for identification and attribution of extreme heat events over a period of time. Projections of increasing temperatures are expected to result in corresponding increases in the frequency, duration, and intensity of extreme heat events. It is also established that extreme heat events can have significant public health impacts, including short-term increases in mortality and morbidity. In addition, extreme heat can exacerbate chronic health conditions in vulnerable populations, including renal and cardiovascular issues. To understand how heat events impact a specific population, it will be important to connect observations on the duration and intensity of extreme heat events with health impact data, including insurance claims and hospital admissions. This presentation will explain the methodology used to identify extreme heat events, provide a climatology of heat event onset, length and severity, and explore a case study of an anomalous heat event with available health data.
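
    Once a homogenized daily series is available, heat events can be identified as runs of consecutive days exceeding a calendar-day percentile threshold. The sketch below shows one such definition; the 95th-percentile threshold and the three-day minimum duration are illustrative assumptions, not the authors' operational criteria.

```python
# Sketch of percentile-based heat event detection on a homogenized daily Tmax series.
# Assumes `tmax` is a pandas Series with a DatetimeIndex; thresholds are illustrative.
import numpy as np
import pandas as pd

def heat_events(tmax: pd.Series, pct=0.95, min_days=3):
    """Return (start_date, end_date, peak_tmax) for each run of at least `min_days` hot days."""
    # calendar-day threshold: the pct-quantile of Tmax for each day of year
    threshold = tmax.groupby(tmax.index.dayofyear).transform(lambda x: x.quantile(pct))
    hot = (tmax > threshold).values
    events, start = [], None
    for i, flag in enumerate(hot):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_days:
                events.append((tmax.index[start], tmax.index[i - 1], tmax.iloc[start:i].max()))
            start = None
    if start is not None and len(hot) - start >= min_days:
        events.append((tmax.index[start], tmax.index[-1], tmax.iloc[start:].max()))
    return events
```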

  12. Digital Mapping and Environmental Characterization of National Wild and Scenic River Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McManamay, Ryan A; Bosnall, Peter; Hetrick, Shelaine L

    2013-09-01

    Spatially accurate geospatial information is required to support decision-making regarding sustainable future hydropower development. Under a memorandum of understanding among several federal agencies, a pilot study was conducted to map a subset of National Wild and Scenic Rivers (WSRs) at a higher resolution and provide a consistent methodology for mapping WSRs across the United States and across agency jurisdictions. A subset of rivers (segments falling under the jurisdiction of the National Park Service) was mapped at a high resolution using the National Hydrography Dataset (NHD). The spatial extent and representation of river segments mapped at NHD scale were compared with the prevailing geospatial coverage mapped at a coarser scale. Accurately digitized river segments were linked to environmental attribution datasets housed within the Oak Ridge National Laboratory's National Hydropower Asset Assessment Program database to characterize the environmental context of WSR segments. The results suggest that both the spatial scale of hydrography datasets and the adherence to written policy descriptions are critical to accurately mapping WSRs. The environmental characterization provided information to deduce generalized trends in either the uniqueness or the commonness of environmental variables associated with WSRs. Although WSRs occur in a wide range of human-modified landscapes, environmental data layers suggest that they provide habitats important to terrestrial and aquatic organisms and recreation important to humans. Ultimately, the research findings herein suggest that there is a need for accurate, consistent mapping of the National WSRs across the agencies responsible for administering each river. Geospatial applications examining potential landscape and energy development require accurate sources of information, such as data layers that portray realistic spatial representations.

  13. A program for handling map projections of small-scale geospatial raster data

    USGS Publications Warehouse

    Finn, Michael P.; Steinwand, Daniel R.; Trent, Jason R.; Buehler, Robert A.; Mattli, David M.; Yamamoto, Kristina H.

    2012-01-01

    Scientists routinely accomplish small-scale geospatial modeling using raster datasets of global extent. Such use often requires the projection of global raster datasets onto a map or the reprojection from a given map projection associated with a dataset. The distortion characteristics of these projection transformations can have significant effects on modeling results. Distortions associated with the reprojection of global data are generally greater than distortions associated with reprojections of larger-scale, localized areas. The accuracy of areas in projected raster datasets of global extent is dependent on spatial resolution. To address these problems of projection and the associated resampling that accompanies it, methods for framing the transformation space, direct point-to-point transformations rather than gridded transformation spaces, a solution to the wrap-around problem, and an approach to alternative resampling methods are presented. The implementations of these methods are provided in an open-source software package called MapImage (or mapIMG, for short), which is designed to function on a variety of computer architectures.
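
    The "direct point-to-point transformation" idea means projecting each cell centre individually rather than resampling through a gridded transformation space. The toy sketch below illustrates that idea for one simple projection (sinusoidal); the projection choice, sphere radius and grid are assumptions for illustration only, since mapIMG itself handles many projections along with wrap-around and resampling.

```python
# Toy illustration of a direct point-to-point forward map projection applied to
# every cell centre of a global grid. The sinusoidal projection and sphere radius
# are illustrative; this is not mapIMG's implementation.
import numpy as np

R = 6371007.0  # sphere radius in metres (illustrative value)

def sinusoidal_forward(lon_deg, lat_deg):
    """Forward sinusoidal projection of geographic coordinates (degrees) to metres."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return R * lon * np.cos(lat), R * lat

# Cell-centre coordinates of a 1-degree global grid, transformed point by point:
lons, lats = np.meshgrid(np.arange(-179.5, 180, 1.0), np.arange(-89.5, 90, 1.0))
x, y = sinusoidal_forward(lons, lats)
```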

  14. Harmonization of Multiple Forest Disturbance Data to Create a 1986-2011 Database for the Conterminous United States

    NASA Astrophysics Data System (ADS)

    Soulard, C. E.; Acevedo, W.; Yang, Z.; Cohen, W. B.; Stehman, S. V.; Taylor, J. L.

    2015-12-01

    A wide range of spatial forest disturbance data exist for the conterminous United States, yet inconsistencies between map products arise because of differing programmatic objectives and methodologies. Researchers on the Land Change Research Project (LCRP) are working to assess spatial agreement, characterize uncertainties, and resolve discrepancies between these national level datasets, in regard to forest disturbance. Disturbance maps from the Global Forest Change (GFC), Landfire Vegetation Disturbance (LVD), National Land Cover Dataset (NLCD), Vegetation Change Tracker (VCT), Web-enabled Landsat Data (WELD), and Monitoring Trends in Burn Severity (MTBS) were harmonized using a pixel-based data fusion process. The harmonization process reconciled forest harvesting, forest fire, and remaining forest disturbance across four intervals (1986-1992, 1992-2001, 2001-2006, and 2006-2011) by relying on convergence of evidence across all datasets available for each interval. Pixels with high agreement across datasets were retained, while moderate-to-low agreement pixels were visually assessed and either manually edited using reference imagery or discarded from the final disturbance map(s). National results show that annual rates of forest harvest and overall fire have increased over the past 25 years. Overall, this study shows that leveraging the best elements of readily-available data improves forest loss monitoring relative to using a single dataset to monitor forest change, particularly by reducing commission errors.
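
    The convergence-of-evidence step can be expressed as a per-pixel vote over the co-registered disturbance products. The sketch below illustrates that idea; the agreement thresholds and the boolean-mask representation are assumptions for illustration rather than the project's exact rules.

```python
# Sketch of per-pixel agreement voting across several forest disturbance products
# for one time interval. Each input is a boolean array on a common grid; thresholds
# are illustrative.
import numpy as np

def harmonize(disturbance_masks, keep_at=3, review_at=2):
    """Return pixels retained automatically and pixels flagged for visual assessment."""
    votes = np.sum(np.stack(disturbance_masks), axis=0)   # number of products flagging each pixel
    keep = votes >= keep_at                                # high agreement: retained
    review = (votes >= review_at) & ~keep                  # moderate agreement: manual review
    return keep, review
```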

  15. Structure Discovery in Large Semantic Graphs Using Extant Ontological Scaling and Descriptive Statistics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    al-Saffar, Sinan; Joslyn, Cliff A.; Chappell, Alan R.

    As semantic datasets grow to be very large and divergent, there is a need to identify and exploit their inherent semantic structure for discovery and optimization. Towards that end, we present here a novel methodology to identify the semantic structures inherent in an arbitrary semantic graph dataset. We first present the concept of an extant ontology as a statistical description of the semantic relations present amongst the typed entities modeled in the graph. This serves as a model of the underlying semantic structure to aid in discovery and visualization. We then describe a method of ontological scaling in which themore » ontology is employed as a hierarchical scaling filter to infer different resolution levels at which the graph structures are to be viewed or analyzed. We illustrate these methods on three large and publicly available semantic datasets containing more than one billion edges each. Keywords-Semantic Web; Visualization; Ontology; Multi-resolution Data Mining;« less

  16. Automated colour identification in melanocytic lesions.

    PubMed

    Sabbaghi, S; Aldeen, M; Garnavi, R; Varigos, G; Doliantis, C; Nicolopoulos, J

    2015-08-01

    Colour information plays an important role in classifying skin lesions. However, colour identification by dermatologists can be very subjective, leading to cases of misdiagnosis. Therefore, a computer-assisted system for quantitative colour identification is highly desirable for dermatologists to use. Although numerous colour detection systems have been developed, few studies have focused on imitating the human visual perception of colours in melanoma applications. In this paper we propose a new methodology based on the QuadTree decomposition technique for automatic colour identification in dermoscopy images. Our approach mimics the human perception of lesion colours. The proposed method is trained on a set of 47 images from the NIH dataset and applied to a test set of 190 skin lesions obtained from the PH2 dataset. The results of our proposed method are compared with a recently reported colour identification method using the same dataset. The effectiveness of our method in detecting colours in dermoscopy images is demonstrated by an accuracy of approximately 93% when the CIELab colour space is used.
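
    A QuadTree decomposition recursively splits the image into quadrants until each block is small or roughly uniform in colour, so that each leaf can be summarised by a single representative colour. The sketch below shows that decomposition step only; the variance threshold, minimum block size, and the subsequent matching of leaf colours to reference dermoscopy colours are assumptions for illustration, not the paper's exact procedure.

```python
# Simplified quadtree decomposition of a lesion image into roughly colour-homogeneous blocks.
# Assumes `img` is a (H, W, 3) array with values in 0-255; thresholds are illustrative.
import numpy as np

def quadtree_blocks(img, y0, x0, h, w, var_thresh=40.0, min_size=8):
    """Yield (y, x, h, w, mean_colour) leaf blocks of an RGB image array."""
    block = img[y0:y0 + h, x0:x0 + w].reshape(-1, 3)
    if h <= min_size or w <= min_size or block.var(axis=0).max() < var_thresh:
        # block is small or homogeneous enough: summarise by its mean colour
        yield y0, x0, h, w, block.mean(axis=0)
        return
    h2, w2 = h // 2, w // 2
    for dy, dx, hh, ww in [(0, 0, h2, w2), (0, w2, h2, w - w2),
                           (h2, 0, h - h2, w2), (h2, w2, h - h2, w - w2)]:
        yield from quadtree_blocks(img, y0 + dy, x0 + dx, hh, ww, var_thresh, min_size)

# leaves = list(quadtree_blocks(img, 0, 0, img.shape[0], img.shape[1]))
```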

  17. Technical challenges for big data in biomedicine and health: data sources, infrastructure, and analytics.

    PubMed

    Peek, N; Holmes, J H; Sun, J

    2014-08-15

    To review technical and methodological challenges for big data research in biomedicine and health. We discuss sources of big datasets, survey infrastructures for big data storage and big data processing, and describe the main challenges that arise when analyzing big data. The life and biomedical sciences are massively contributing to the big data revolution through secondary use of data that were collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls related to big data such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets. The major challenge for the near future is to transform analytical methods that are used in the biomedical and health domain, to fit the distributed storage and processing model that is required to handle big data, while ensuring confidentiality of the data being analyzed.

  18. Topobathymetric elevation model development using a new methodology: Coastal National Elevation Database

    USGS Publications Warehouse

    Danielson, Jeffrey J.; Poppenga, Sandra K.; Brock, John C.; Evans, Gayla A.; Tyler, Dean; Gesch, Dean B.; Thatcher, Cindy A.; Barras, John

    2016-01-01

    During the coming decades, coastlines will respond to widely predicted sea-level rise, storm surge, and coastal inundation flooding from disastrous events. Because physical processes in coastal environments are controlled by the geomorphology of over-the-land topography and underwater bathymetry, many applications of geospatial data in coastal environments require detailed knowledge of the near-shore topography and bathymetry. In this paper, an updated methodology used by the U.S. Geological Survey Coastal National Elevation Database (CoNED) Applications Project is presented for developing coastal topobathymetric elevation models (TBDEMs) from multiple topographic data sources with adjacent intertidal topobathymetric and offshore bathymetric sources to generate seamlessly integrated TBDEMs. This repeatable, updatable, and logically consistent methodology assimilates topographic data (land elevation) and bathymetry (water depth) into a seamless coastal elevation model. Within the overarching framework, vertical datum transformations are standardized in a workflow that interweaves spatially consistent interpolation (gridding) techniques with a land/water boundary mask delineation approach. Output gridded raster TBDEMs are stacked into a file storage system of mosaic datasets within an Esri ArcGIS geodatabase for efficient updating while maintaining current and updated spatially referenced metadata. Topobathymetric data provide a required seamless elevation product for several science application studies, such as shoreline delineation, coastal inundation mapping, sediment-transport, sea-level rise, storm surge models, and tsunami impact assessment. These detailed coastal elevation data are critical to depict regions prone to climate change impacts and are essential to planners and managers responsible for mitigating the associated risks and costs to both human communities and ecosystems. The CoNED methodology approach has been used to construct integrated TBDEM models in Mobile Bay, the northern Gulf of Mexico, San Francisco Bay, the Hurricane Sandy region, and southern California.
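
    At its simplest, the topobathymetric merge comes down to putting land elevations and water depths on a common vertical datum and letting a land/water mask decide which source fills each cell. The sketch below shows only that core step, with an assumed constant datum offset; the CoNED workflow also standardizes datum transformations, gridding, and metadata handling.

```python
# Simplified merge of a topographic DEM (heights) and a bathymetric grid (depths)
# onto one vertical datum using a land/water mask. Array names and the constant
# datum offset are illustrative assumptions.
import numpy as np

def merge_topobathy(topo, bathy, water_mask, bathy_datum_offset=0.0):
    """Return a seamless elevation grid: topography on land, negative depths over water."""
    bathy_elev = -(bathy + bathy_datum_offset)        # depths below datum become negative elevations
    merged = np.where(water_mask, bathy_elev, topo)   # land/water mask decides the source per cell
    return merged
```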

  19. Hyperspectral signature analysis of skin parameters

    NASA Astrophysics Data System (ADS)

    Vyas, Saurabh; Banerjee, Amit; Garza, Luis; Kang, Sewon; Burlina, Philippe

    2013-02-01

    The temporal analysis of changes in biological skin parameters, including melanosome concentration, collagen concentration and blood oxygenation, may serve as a valuable tool in diagnosing the progression of malignant skin cancers and in understanding the pathophysiology of cancerous tumors. Quantitative knowledge of these parameters can also be useful in applications such as wound assessment, and point-of-care diagnostics, amongst others. We propose an approach to estimate in vivo skin parameters using a forward computational model based on Kubelka-Munk theory and the Fresnel Equations. We use this model to map the skin parameters to their corresponding hyperspectral signature. We then use machine learning based regression to develop an inverse map from hyperspectral signatures to skin parameters. In particular, we employ support vector machine based regression to estimate the in vivo skin parameters given their corresponding hyperspectral signature. We build on our work from SPIE 2012, and validate our methodology on an in vivo dataset. This dataset consists of 241 signatures collected from in vivo hyperspectral imaging of patients of both genders and Caucasian, Asian and African American ethnicities. In addition, we also extend our methodology past the visible region and through the short-wave infrared region of the electromagnetic spectrum. We find promising results when comparing the estimated skin parameters to the ground truth, demonstrating good agreement with well-established physiological precepts. This methodology can have potential use in non-invasive skin anomaly detection and for developing minimally invasive pre-screening tools.
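
    The inverse map described above can be sketched as a multi-output support-vector regression from spectra to parameters. The example below uses scikit-learn with placeholder random data standing in for the Kubelka-Munk-simulated training pairs; the kernel and hyperparameters are illustrative assumptions, not the study's settings.

```python
# Sketch of the inverse map: support-vector regression from hyperspectral signatures
# to skin parameters (e.g., melanosome fraction, collagen, blood oxygenation).
# Training data here are placeholders for forward-model simulations.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# X: (n_samples, n_bands) simulated reflectance spectra; Y: (n_samples, 3) skin parameters
X = np.random.rand(500, 120)
Y = np.random.rand(500, 3)

model = MultiOutputRegressor(SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X, Y)

# Estimate parameters for a measured in vivo signature (placeholder vector):
estimated_params = model.predict(np.random.rand(1, 120))
```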

  20. Data splitting for artificial neural networks using SOM-based stratified sampling.

    PubMed

    May, R J; Maier, H R; Dandy, G C

    2010-03-01

    Data splitting is an important consideration during artificial neural network (ANN) development, where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between datasets. Of these approaches, DUPLEX is found to provide benchmark performance, yielding good model performance with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets. Copyright 2009 Elsevier Ltd. All rights reserved.
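
    The SOM-plus-Neyman idea can be pictured as: form strata by clustering the data in input space, then allocate the sample to each stratum in proportion to its size times its within-stratum standard deviation of the target. The sketch below uses k-means as a stand-in for the SOM clustering step, so it illustrates only the allocation rule, not the paper's method; parameter values are assumptions.

```python
# Sketch of stratified data splitting with Neyman allocation (n_h proportional to N_h * S_h).
# K-means stands in here for the SOM used in the paper to define strata.
import numpy as np
from sklearn.cluster import KMeans

def neyman_split(X, y, n_sample, n_strata=16, seed=0):
    """Return indices of a sample of roughly n_sample points allocated across strata."""
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=n_strata, n_init=10, random_state=seed).fit_predict(X)
    # Neyman weights: stratum size times stratum standard deviation of the target
    weights = np.array([(strata == h).sum() * (y[strata == h].std() + 1e-12)
                        for h in range(n_strata)])
    alloc = np.round(n_sample * weights / weights.sum()).astype(int)
    picked = [rng.choice(np.where(strata == h)[0],
                         size=min(alloc[h], (strata == h).sum()), replace=False)
              for h in range(n_strata)]
    return np.concatenate(picked)
```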

  1. A novel, fast and efficient single-sensor automatic sleep-stage classification based on complementary cross-frequency coupling estimates.

    PubMed

    Dimitriadis, Stavros I; Salis, Christos; Linden, David

    2018-04-01

    Limitations of the manual scoring of polysomnograms, which include data from electroencephalogram (EEG), electro-oculogram (EOG), electrocardiogram (ECG) and electromyogram (EMG) channels, have long been recognized. Manual staging is resource intensive and time consuming, and thus considerable effort must be spent to ensure inter-rater reliability. As a result, there is great interest in techniques based on signal processing and machine learning for a completely Automatic Sleep Stage Classification (ASSC). In this paper, we present a single-EEG-sensor ASSC technique based on the dynamic reconfiguration of different aspects of cross-frequency coupling (CFC) estimated between predefined frequency pairs over 5 s epoch lengths. The proposed analytic scheme is demonstrated using the PhysioNet Sleep European Data Format (EDF) Database with repeat recordings from 20 healthy young adults. We validate our methodology in a second sleep dataset. We achieved very high classification sensitivity, specificity and accuracy of 96.2 ± 2.2%, 94.2 ± 2.3%, and 94.4 ± 2.2% across 20 folds, respectively, and also a high mean F1 score (92%, range 90-94%) when a multi-class Naive Bayes classifier was applied. High classification performance was also achieved in the second sleep dataset. Our method outperformed the accuracy of previous studies not only on different datasets but also on the same database. Single-sensor ASSC makes the entire methodology appropriate for longitudinal monitoring using wearable EEG in real-world and laboratory-oriented environments. Crown Copyright © 2018. Published by Elsevier B.V. All rights reserved.
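
    Cross-frequency coupling estimates of the kind used here quantify, within each short epoch, how the phase of a slow EEG rhythm modulates the amplitude of a faster one. The sketch below computes one such estimate (a mean-vector-length phase-amplitude coupling index) for a single epoch; the frequency bands and this particular CFC variant are illustrative assumptions rather than the paper's exact feature set.

```python
# Sketch of one cross-frequency coupling feature for a single EEG epoch:
# phase-amplitude coupling between a slow band and a faster band.
# Band limits and the CFC variant are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    """Zero-phase band-pass filter between lo and hi Hz."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def phase_amplitude_coupling(x, fs, phase_band=(0.5, 4), amp_band=(12, 15)):
    """Mean-vector-length PAC between the phase of the slow band and the amplitude of the fast band."""
    phase = np.angle(hilbert(bandpass(x, *phase_band, fs)))
    amp = np.abs(hilbert(bandpass(x, *amp_band, fs)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)

# Example on a 5 s epoch sampled at 100 Hz (placeholder signal):
# pac = phase_amplitude_coupling(np.random.randn(500), fs=100)
```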

  2. A methodology for translating positional error into measures of attribute error, and combining the two error sources

    Treesearch

    Yohay Carmel; Curtis Flather; Denis Dean

    2006-01-01

    This paper summarizes our efforts to investigate the nature, behavior, and implications of positional error and attribute error in spatiotemporal datasets. Estimating the combined influence of these errors on map analysis has been hindered by the fact that these two error types are traditionally expressed in different units (distance units, and categorical units,...

  3. A User's Guide to CGNS. 1.0

    NASA Technical Reports Server (NTRS)

    Rumsey, Christopher L.; Poirier, Diane M. A.; Bush, Robert H.; Towne, Charles E.

    2001-01-01

    The CFD General Notation System (CGNS) was developed to be a self-descriptive, machine-independent standard for storing CFD aerodynamic data. This guide aids users in the implementation of CGNS. It is intended as a tutorial on the usage of the CGNS mid-level library routines for reading and writing grid and flow solution datasets for both structured and unstructured methodologies.

  4. Alternative Student-Based Revenue Streams for Higher Education Institutions: A Difference-in-Difference Analysis Using Guaranteed Tuition Policies

    ERIC Educational Resources Information Center

    Delaney, Jennifer A.; Kearney, Tyler D.

    2016-01-01

    This study considered the impact of state-level guaranteed tuition programs on alternative student-based revenue streams. It used a quasi-experimental, difference-in-difference methodology with a panel dataset of public four-year institutions from 2000-2012. Illinois' 2004 "Truth-in-Tuition" law was used as the policy of interest and the…

  5. Universal Stochastic Multiscale Image Fusion: An Example Application for Shale Rock.

    PubMed

    Gerke, Kirill M; Karsanina, Marina V; Mallants, Dirk

    2015-11-02

    Spatial data captured with sensors of different resolution would provide a maximum degree of information if the data were to be merged into a single image representing all scales. We develop a general solution for merging multiscale categorical spatial data into a single dataset using stochastic reconstructions with rescaled correlation functions. The versatility of the method is demonstrated by merging three images of shale rock representing macro, micro and nanoscale spatial information on mineral, organic matter and porosity distribution. Merging multiscale images of shale rock is pivotal to quantify more reliably petrophysical properties needed for production optimization and environmental impacts minimization. Images obtained by X-ray microtomography and scanning electron microscopy were fused into a single image with predefined resolution. The methodology is sufficiently generic for implementation of other stochastic reconstruction techniques, any number of scales, any number of material phases, and any number of images for a given scale. The methodology can be further used to assess effective properties of fused porous media images or to compress voluminous spatial datasets for efficient data storage. Practical applications are not limited to petroleum engineering or more broadly geosciences, but will also find their way in material sciences, climatology, and remote sensing.

  6. Storytelling and story testing in domestication.

    PubMed

    Gerbault, Pascale; Allaby, Robin G; Boivin, Nicole; Rudzinski, Anna; Grimaldi, Ilaria M; Pires, J Chris; Climer Vigueira, Cynthia; Dobney, Keith; Gremillion, Kristen J; Barton, Loukas; Arroyo-Kalin, Manuel; Purugganan, Michael D; Rubio de Casas, Rafael; Bollongino, Ruth; Burger, Joachim; Fuller, Dorian Q; Bradley, Daniel G; Balding, David J; Richerson, Peter J; Gilbert, M Thomas P; Larson, Greger; Thomas, Mark G

    2014-04-29

    The domestication of plants and animals marks one of the most significant transitions in human, and indeed global, history. Traditionally, study of the domestication process was the exclusive domain of archaeologists and agricultural scientists; today it is an increasingly multidisciplinary enterprise that has come to involve the skills of evolutionary biologists and geneticists. Although the application of new information sources and methodologies has dramatically transformed our ability to study and understand domestication, it has also generated increasingly large and complex datasets, the interpretation of which is not straightforward. In particular, challenges of equifinality, evolutionary variance, and emergence of unexpected or counter-intuitive patterns all face researchers attempting to infer past processes directly from patterns in data. We argue that explicit modeling approaches, drawing upon emerging methodologies in statistics and population genetics, provide a powerful means of addressing these limitations. Modeling also offers an approach to analyzing datasets that avoids conclusions steered by implicit biases, and makes possible the formal integration of different data types. Here we outline some of the modeling approaches most relevant to current problems in domestication research, and demonstrate the ways in which simulation modeling is beginning to reshape our understanding of the domestication process.

  7. Storytelling and story testing in domestication

    PubMed Central

    Gerbault, Pascale; Allaby, Robin G.; Boivin, Nicole; Rudzinski, Anna; Grimaldi, Ilaria M.; Pires, J. Chris; Climer Vigueira, Cynthia; Dobney, Keith; Gremillion, Kristen J.; Barton, Loukas; Arroyo-Kalin, Manuel; Purugganan, Michael D.; Rubio de Casas, Rafael; Bollongino, Ruth; Burger, Joachim; Fuller, Dorian Q.; Bradley, Daniel G.; Balding, David J.; Richerson, Peter J.; Gilbert, M. Thomas P.; Larson, Greger; Thomas, Mark G.

    2014-01-01

    The domestication of plants and animals marks one of the most significant transitions in human, and indeed global, history. Traditionally, study of the domestication process was the exclusive domain of archaeologists and agricultural scientists; today it is an increasingly multidisciplinary enterprise that has come to involve the skills of evolutionary biologists and geneticists. Although the application of new information sources and methodologies has dramatically transformed our ability to study and understand domestication, it has also generated increasingly large and complex datasets, the interpretation of which is not straightforward. In particular, challenges of equifinality, evolutionary variance, and emergence of unexpected or counter-intuitive patterns all face researchers attempting to infer past processes directly from patterns in data. We argue that explicit modeling approaches, drawing upon emerging methodologies in statistics and population genetics, provide a powerful means of addressing these limitations. Modeling also offers an approach to analyzing datasets that avoids conclusions steered by implicit biases, and makes possible the formal integration of different data types. Here we outline some of the modeling approaches most relevant to current problems in domestication research, and demonstrate the ways in which simulation modeling is beginning to reshape our understanding of the domestication process. PMID:24753572

  8. Universal Stochastic Multiscale Image Fusion: An Example Application for Shale Rock

    PubMed Central

    Gerke, Kirill M.; Karsanina, Marina V.; Mallants, Dirk

    2015-01-01

    Spatial data captured with sensors of different resolution would provide a maximum degree of information if the data were to be merged into a single image representing all scales. We develop a general solution for merging multiscale categorical spatial data into a single dataset using stochastic reconstructions with rescaled correlation functions. The versatility of the method is demonstrated by merging three images of shale rock representing macro, micro and nanoscale spatial information on mineral, organic matter and porosity distribution. Merging multiscale images of shale rock is pivotal to quantify more reliably petrophysical properties needed for production optimization and environmental impacts minimization. Images obtained by X-ray microtomography and scanning electron microscopy were fused into a single image with predefined resolution. The methodology is sufficiently generic for implementation of other stochastic reconstruction techniques, any number of scales, any number of material phases, and any number of images for a given scale. The methodology can be further used to assess effective properties of fused porous media images or to compress voluminous spatial datasets for efficient data storage. Practical applications are not limited to petroleum engineering or more broadly geosciences, but will also find their way in material sciences, climatology, and remote sensing. PMID:26522938

  9. Modern data science for analytical chemical data - A comprehensive review.

    PubMed

    Szymańska, Ewa

    2018-10-22

    Efficient and reliable analysis of chemical analytical data is a great challenge due to the increase in data size, variety and velocity. New methodologies, approaches and methods are being proposed not only by chemometrics but also by other data scientific communities to extract relevant information from big datasets and provide their value to different applications. Besides the common goal of big data analysis, different perspectives on and terms for big data are being discussed in the scientific literature and public media. The aim of this comprehensive review is to present common trends in the analysis of chemical analytical data across different data scientific fields, together with their data type-specific and generic challenges. Firstly, common data science terms used in different data scientific fields are summarized and discussed. Secondly, systematic methodologies to plan and run big data analysis projects are presented together with their steps. Moreover, different analysis aspects, such as assessing data quality, selecting data pre-processing strategies, data visualization and model validation, are considered in more detail. Finally, an overview of standard and new data analysis methods is provided and their suitability for big analytical chemical datasets briefly discussed. Copyright © 2018 Elsevier B.V. All rights reserved.

  10. User-Appropriate Viewer for High Resolution Interactive Engagement with 3d Digital Cultural Artefacts

    NASA Astrophysics Data System (ADS)

    Gillespie, D.; La Pensée, A.; Cooper, M.

    2013-07-01

    Three dimensional (3D) laser scanning is an important documentation technique for cultural heritage. This technology has been adopted from the engineering and aeronautical industry and is an invaluable tool for the documentation of objects within museum collections (La Pensée, 2008). The datasets created via close range laser scanning are extremely accurate and the created 3D dataset allows for a more detailed analysis in comparison to other documentation technologies such as photography. The dataset can be used for a range of different applications including: documentation; archiving; surface monitoring; replication; gallery interactives; educational sessions; conservation and visualization. However, the novel nature of a 3D dataset is presenting a rather unique challenge with respect to its sharing and dissemination. This is in part due to the need for specialised 3D software and a supported graphics card to display high resolution 3D models. This can be detrimental to one of the main goals of cultural institutions, which is to share knowledge and enable activities such as research, education and entertainment. This has limited the presentation of 3D models of cultural heritage objects to mainly either images or videos. Yet with recent developments in computer graphics, increased internet speed and emerging technologies such as Adobe's Stage 3D (Adobe, 2013) and WebGL (Khronos, 2013), it is now possible to share a dataset directly within a webpage. This allows website visitors to interact with the 3D dataset, allowing them to explore every angle of the object and gaining an insight into its shape and nature. This can be very important considering that it is difficult to offer the same level of understanding of the object through the use of traditional mediums such as photographs and videos. Yet this presents a range of problems: this is a very novel experience and very few people have engaged with 3D objects outside of 3D software packages or games. This paper presents results of research that aims to provide a methodology for museums and cultural institutions for prototyping a 3D viewer within a webpage, thereby not only allowing institutions to promote their collections via the internet but also providing a tool for users to engage in a meaningful way with cultural heritage datasets. The design process encompasses evaluation as the central part of the design methodology, focusing on how slight changes to navigation, object engagement and aesthetic appearance can influence the user's experience. The prototype used in this paper was created using WebGL with the Three.Js (Three.JS, 2013) library and datasets were loaded as the OpenCTM (Geelnard, 2010) file format. The overall design is centred on creating an easy-to-learn interface allowing non-skilled users to interact with the datasets, and also providing tools allowing skilled users to discover more about the cultural heritage object. User testing was carried out, allowing users to interact with 3D datasets within the interactive viewer. The results are analysed and the insights learned are discussed in relation to an interface designed to interact with 3D content. The results will lead to the design of interfaces for interacting with 3D objects, which allow both skilled and non-skilled users to engage with 3D cultural heritage objects in a meaningful way.

  11. 76 FR 4904 - Agency Information Collection Request; 30-Day Public Comment Request

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-01-27

    ... datasets that are not specific to individual's personal health information to improve decision making by... making health indicator datasets (data that is not associated with any individuals) and tools available.../health . These datasets and tools are anticipated to benefit development of applications, web-based tools...

  12. Classification of Alzheimer’s Patients through Ubiquitous Computing †

    PubMed Central

    Nieto-Reyes, Alicia; Duque, Rafael; Montaña, José Luis; Lage, Carmen

    2017-01-01

    Functional data analysis and artificial neural networks are the building blocks of the proposed methodology, which distinguishes the movement patterns of Alzheimer's patients at different stages of the disease and classifies new patients into their appropriate stage. The movement patterns are obtained from the accelerometer of Android smartphones that the patients carry while moving freely. The proposed methodology is relevant in that it is flexible in the type of data to which it is applied. To exemplify this, we analyze a novel real three-dimensional functional dataset in which each datum is observed on a different time domain: not only is each datum observed at a different frequency, but the domains also have different lengths. The obtained classification success rate of 83% indicates the potential of the proposed methodology. PMID:28753975

  13. Statistical and Spatial Analysis of Bathymetric Data for the St. Clair River, 1971-2007

    USGS Publications Warehouse

    Bennion, David

    2009-01-01

    To address questions concerning ongoing geomorphic processes in the St. Clair River, selected bathymetric datasets spanning 36 years were analyzed. Comparisons of recent high-resolution datasets covering the upper river indicate a highly variable, active environment. Although statistical and spatial comparisons of the datasets show that some changes to the channel size and shape have taken place during the study period, the uncertainty associated with the various survey methods and interpolation processes limits how many of the results are statistically certain. The methods used to spatially compare the datasets are sensitive to small variations in position and depth that are within the range of uncertainty associated with the datasets. Characteristics of the data, such as the density of measured points and the range of values surveyed, can also influence the results of spatial comparison. With due consideration of these limitations, apparently active and ongoing areas of elevation change in the river are mapped and discussed.

  14. Study of infectious diseases in archaeological bone material - A dataset.

    PubMed

    Pucu, Elisa; Cascardo, Paula; Chame, Marcia; Felice, Gisele; Guidon, Niéde; Cleonice Vergne, Maria; Campos, Guadalupe; Roberto Machado-Silva, José; Leles, Daniela

    2017-08-01

    Bones from human and ground sloth remains were analyzed for the presence of Trypanosoma cruzi by conventional PCR using primers TC, TC1 and TC2. PCR amplified fragments of the expected product sizes for these primers (300 and 350 bp). The amplified PCR products were sequenced and analyzed against GenBank using BLAST. Although the sequences did not match these parasites, they showed strong matches to bacterial species. This article presents the methodology used and the alignment of the sequences. The display of this dataset will allow further analysis of our results and of the discussion presented in the manuscript "Finding the unexpected: a critical view on molecular diagnosis of infectious diseases in archaeological samples" (Pucu et al. 2017) [1].

  15. Use of graph theory measures to identify errors in record linkage.

    PubMed

    Randall, Sean M; Boyd, James H; Ferrante, Anna M; Bauer, Jacqueline K; Semmens, James B

    2014-07-01

    Ensuring high linkage quality is important in many record linkage applications. Current methods for ensuring quality are manual and resource intensive. This paper seeks to determine the effectiveness of graph theory techniques in identifying record linkage errors. A range of graph theory techniques was applied to two linked datasets, with known truth sets. The ability of graph theory techniques to identify groups containing errors was compared to a widely used threshold setting technique. This methodology shows promise; however, further investigations into graph theory techniques are required. The development of more efficient and effective methods of improving linkage quality will result in higher quality datasets that can be delivered to researchers in shorter timeframes. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
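
    As a rough illustration of the kind of graph measures such an approach can draw on, the hedged Python sketch below treats linked record pairs as edges, takes each connected component as a linked group, and flags groups whose low internal density or bridge edges suggest a possible linkage error. The group construction, thresholds, and measures here are illustrative assumptions, not the authors' implementation.

```python
import networkx as nx

# Toy linkage graph: nodes are records, edges are record pairs judged to match.
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # tight, plausible group
    ("b1", "b2"), ("b2", "b3"), ("b3", "b4"),   # chain-like group held together by bridges
])

for component in nx.connected_components(G):
    sub = G.subgraph(component)
    density = nx.density(sub)                   # 1.0 for a fully linked group
    bridges = list(nx.bridges(sub))             # edges whose removal splits the group
    suspicious = density < 0.8 or len(bridges) > 0
    print(sorted(component), f"density={density:.2f}",
          f"bridges={len(bridges)}", "REVIEW" if suspicious else "ok")
```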

  16. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

    Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html dsrodbae@upo.es Supplementary data are available at Bioinformatics online.
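
    The following hedged Python sketch illustrates the general bit-pattern idea that BiBit builds on, not the published algorithm itself: each row of a binary matrix is treated as a bit pattern, pairs of rows are combined with a bitwise AND to propose candidate column patterns, and all rows containing a pattern form a bicluster. The size thresholds and the brute-force pair enumeration are simplifying assumptions.

```python
import numpy as np

def bit_pattern_biclusters(B, min_rows=2, min_cols=2):
    """Naive illustration of bit-pattern biclustering on a binary matrix B."""
    n_rows, _ = B.shape
    seen, biclusters = set(), []
    for i in range(n_rows):
        for j in range(i + 1, n_rows):
            pattern = B[i] & B[j]                  # columns shared by rows i and j
            cols = np.flatnonzero(pattern)
            if len(cols) < min_cols or tuple(cols) in seen:
                continue
            seen.add(tuple(cols))
            # every row that contains the whole pattern belongs to the bicluster
            rows = np.flatnonzero((B[:, cols] == 1).all(axis=1))
            if len(rows) >= min_rows:
                biclusters.append((rows.tolist(), cols.tolist()))
    return biclusters

B = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=int)
print(bit_pattern_biclusters(B))
```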

  17. A framework for automatic creation of gold-standard rigid 3D-2D registration datasets.

    PubMed

    Madan, Hennadii; Pernuš, Franjo; Likar, Boštjan; Špiclin, Žiga

    2017-02-01

    Advanced image-guided medical procedures incorporate 2D intra-interventional information into the pre-interventional 3D image and plan of the procedure through 3D/2D image registration (32R). To enter clinical use, and even for publication purposes, novel and existing 32R methods have to be rigorously validated. The performance of a 32R method can be estimated by comparing it to an accurate reference or gold standard method (usually based on fiducial markers) on the same set of images (a gold standard dataset). Objective validation and comparison of methods are possible only if the evaluation methodology is standardized and the gold standard dataset is made publicly available. Currently, very few such datasets exist and only one contains images of multiple patients acquired during a procedure. To encourage the creation of gold standard 32R datasets, we propose an automatic framework. The framework is based on rigid registration of fiducial markers. The main novelty is spatial grouping of fiducial markers on the carrier device, which enables automatic marker localization and identification across the 3D and 2D images. The proposed framework was demonstrated on clinical angiograms of 20 patients. Rigid 32R computed by the framework was more accurate than that obtained manually, with the respective target registration errors below 0.027 mm compared to 0.040 mm. The framework is applicable for gold standard setup on any rigid anatomy, provided that the acquired images contain spatially grouped fiducial markers. The gold standard datasets and software will be made publicly available.
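
    Two quantities at the heart of such a gold-standard setup are the rigid fit of matched fiducial markers and the target registration error (TRE) at points of interest. The Python sketch below illustrates both on synthetic 3D points using a standard Kabsch least-squares fit; marker localization, identification, and the 2D projection geometry of the actual framework are outside this toy example.

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src points onto dst."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
markers = rng.uniform(-50, 50, size=(6, 3))          # fiducial markers (mm)
targets = rng.uniform(-20, 20, size=(4, 3))          # anatomical target points (mm)

# Simulated "true" pose plus small marker localization noise.
theta = np.radians(10.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([5.0, -3.0, 2.0])
moved = markers @ R_true.T + t_true + rng.normal(0, 0.05, markers.shape)

R, t = rigid_fit(markers, moved)
tre = np.linalg.norm((targets @ R.T + t) - (targets @ R_true.T + t_true), axis=1)
print("mean TRE [mm]:", tre.mean())
```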

  18. Assessing Human Modifications to Floodplains using Large-Scale Hydrogeomorphic Floodplain Modeling

    NASA Astrophysics Data System (ADS)

    Morrison, R. R.; Scheel, K.; Nardi, F.; Annis, A.

    2017-12-01

    Human modifications to floodplains for water resource and flood management purposes have significantly transformed river-floodplain connectivity dynamics in many watersheds. Bridges, levees, reservoirs, shifts in land use, and other hydraulic engineering works have altered flow patterns and caused changes in the timing and extent of floodplain inundation processes. These hydrogeomorphic changes have likely resulted in negative impacts to aquatic habitat and ecological processes. The availability of large-scale topographic datasets at high resolution provides an opportunity for detecting anthropogenic impacts by means of geomorphic mapping. We have developed and are implementing a methodology for comparing a hydrogeomorphic floodplain mapping technique to hydraulically-modeled floodplain boundaries to estimate floodplain loss due to human activities. Our hydrogeomorphic mapping methodology assumes that river valley morphology intrinsically includes information on flood-driven erosion and depositional phenomena. We use a digital elevation model-based algorithm to identify the floodplain as the area of the fluvial corridor lying below water reference levels, which are estimated using a simplified hydrologic model. Results from our hydrogeomorphic method are compared to hydraulically-derived flood zone maps and spatial datasets of levee-protected areas to explore where water management features, such as levees, have changed floodplain dynamics and landscape features. Parameters associated with commonly used F-index functions are quantified and analyzed to better understand how floodplain areas have been reduced within a basin. Preliminary results indicate that the hydrogeomorphic floodplain model is useful for quickly delineating floodplains at large watershed scales, but further analyses are needed to understand the caveats for using the model in determining floodplain loss due to levees. We plan to continue this work by exploring the spatial dependencies of the F-index function. Results from this work have implications for loss of aquatic habitat and ecological functions, and can inform management and restoration activities by highlighting regions with significant floodplain loss.
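
    A minimal sketch of the DEM-thresholding idea described above: cells of a digital elevation model are classified as floodplain when they lie below a water reference level tied to the nearby channel. In this illustration the reference level is simply a constant offset above a synthetic channel profile, whereas the actual methodology derives it from a simplified hydrologic model.

```python
import numpy as np

# Synthetic valley cross-section: a channel down the middle of a gently sloping DEM.
ny, nx = 50, 80
x = np.linspace(0, 1, nx)
dem = 2.0 + 8.0 * np.abs(x - 0.5)[None, :] + 0.01 * np.arange(ny)[:, None]

channel_elevation = dem.min(axis=1, keepdims=True)   # thalweg elevation per row
water_reference = channel_elevation + 1.5            # assumed flood stage above the channel (m)

floodplain_mask = dem <= water_reference             # cells lying below the reference level
print(f"floodplain fraction of valley: {floodplain_mask.mean():.2%}")
```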

  19. Modeling 4D Pathological Changes by Leveraging Normative Models

    PubMed Central

    Wang, Bo; Prastawa, Marcel; Irimia, Andrei; Saha, Avishek; Liu, Wei; Goh, S.Y. Matthew; Vespa, Paul M.; Van Horn, John D.; Gerig, Guido

    2016-01-01

    With the increasing use of efficient multimodal 3D imaging, clinicians are able to access longitudinal imaging to stage pathological diseases, to monitor the efficacy of therapeutic interventions, or to assess and quantify rehabilitation efforts. Analysis of such four-dimensional (4D) image data presenting pathologies, including disappearing and newly appearing lesions, represents a significant challenge due to the presence of complex spatio-temporal changes. Image analysis methods for such 4D image data have to include not only a concept for joint segmentation of 3D datasets to account for inherent correlations of subject-specific repeated scans but also a mechanism to account for large deformations and the destruction and formation of lesions (e.g., edema, bleeding) due to underlying physiological processes associated with damage, intervention, and recovery. In this paper, we propose a novel framework that provides a joint segmentation-registration framework to tackle the inherent problem of image registration in the presence of objects not present in all images of the time series. Our methodology models 4D changes in pathological anatomy across time and also provides an explicit mapping of a healthy normative template to a subject’s image data with pathologies. Since atlas-moderated segmentation methods cannot explain the appearance and location of pathological structures that are not represented in the template atlas, the new framework provides different options for initialization via a supervised learning approach, iterative semisupervised active learning, and also transfer learning, which results in a fully automatic 4D segmentation method. We demonstrate the effectiveness of our novel approach with synthetic experiments and a 4D multimodal MRI dataset of severe traumatic brain injury (TBI), including validation via comparison to expert segmentations. However, the proposed methodology is generic in regard to different clinical applications requiring quantitative analysis of 4D imaging representing spatio-temporal changes of pathologies. PMID:27818606

  20. A test-retest dataset for assessing long-term reliability of brain morphology and resting-state brain activity.

    PubMed

    Huang, Lijie; Huang, Taicheng; Zhen, Zonglei; Liu, Jia

    2016-03-15

    We present a test-retest dataset for evaluation of long-term reliability of measures from structural and resting-state functional magnetic resonance imaging (sMRI and rfMRI) scans. The repeated scan dataset was collected from 61 healthy adults in two sessions using highly similar imaging parameters at an interval of 103-189 days. However, as the imaging parameters were not completely identical, the reliability estimated from this dataset shall reflect the lower bounds of the true reliability of sMRI/rfMRI measures. Furthermore, in conjunction with other test-retest datasets, our dataset may help explore the impact of different imaging parameters on reliability of sMRI/rfMRI measures, which is especially critical for assessing datasets collected from multiple centers. In addition, intelligence quotient (IQ) was measured for each participant using Raven's Advanced Progressive Matrices. The data can thus be used for purposes other than assessing reliability of sMRI/rfMRI alone. For example, data from each single session could be used to associate structural and functional measures of the brain with the IQ metrics to explore brain-IQ association.
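
    The abstract does not name a specific reliability statistic, but test-retest reliability of this kind is commonly summarized with an intraclass correlation coefficient; the hedged sketch below computes ICC(2,1) (Shrout and Fleiss) for a single morphological measure observed in two sessions, using synthetic values in place of the real scans.

```python
import numpy as np

def icc_2_1(session1, session2):
    """Two-way random, single-measure ICC(2,1) (Shrout & Fleiss) for two sessions."""
    data = np.column_stack([session1, session2])     # subjects x sessions
    n, k = data.shape
    grand = data.mean()
    ms_rows = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ss_err = ((data - data.mean(axis=1, keepdims=True)
                    - data.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                                 + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(1)
true_value = rng.normal(0, 1, 61)                    # e.g. one morphological measure per subject
s1 = true_value + rng.normal(0, 0.3, 61)             # session 1 measurement noise
s2 = true_value + rng.normal(0, 0.3, 61)             # session 2, 103-189 days later
print(f"ICC(2,1) = {icc_2_1(s1, s2):.2f}")
```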

  1. Using classification models for the generation of disease-specific medications from biomedical literature and clinical data repository.

    PubMed

    Wang, Liqin; Haug, Peter J; Del Fiol, Guilherme

    2017-05-01

    Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been exploited. However, the challenge remains when those extracted associations contained much noise. It is unreliable to determine the relevance of the association by simply setting up arbitrary cut-off points on multiple scores of relevance; and it would be expensive to ask human experts to manually review a large number of associations. We propose that machine-learning-based classification can be used to separate the signal from the noise, and to provide a feasible approach to create and maintain disease-specific vocabularies. We initially focused on disease-medication associations for the purpose of simplicity. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features) such as frequency of occurrence. For the purpose of machine learning, we formed nine datasets for three diseases, with each disease having two single-source datasets and one formed by combining the previous two. All the datasets were labeled using existing reference standards. Thereafter, we conducted two experiments: (1) to test if adding features from the clinical data repository would improve the performance of classification achieved using features from the biomedical literature only, and (2) to determine if classifier(s) trained with known medication-disease datasets would be generalizable to new disease(s). Simple logistic regression and LogitBoost were two classifiers identified as the preferred models separately for the biomedical-literature datasets and combined datasets. The performance of the classification using combined features provided significant improvement beyond that using biomedical-literature features alone (p-value<0.001). The performance of the classifier built from known diseases to predict associated concepts for new diseases showed no significant difference from the performance of the classifier built and tested using the new disease's dataset. It is feasible to use classification approaches to automatically predict the relevance of a concept to a disease of interest. It is useful to combine features from disparate sources for the task of classification. Classifiers built from known diseases were generalizable to new diseases. Copyright © 2017 Elsevier Inc. All rights reserved.
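
    A hedged sketch of the first experiment's logic: classify candidate drug concepts as relevant or not using literature-derived features alone versus literature plus clinical-repository features, and compare cross-validated performance. The features, labels, and simple logistic regression here are synthetic placeholders standing in for the study's relevance measures and reference standards.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 400                                              # candidate drug concepts

# Hypothetical relevance features for each concept.
lit = rng.poisson(5, size=(n, 2)).astype(float)      # e.g. literature co-occurrence counts
ehr = rng.poisson(3, size=(n, 2)).astype(float)      # e.g. clinical-repository frequencies

# Synthetic reference standard: relevance driven by both sources.
logit = 0.4 * lit[:, 0] + 0.5 * ehr[:, 0] - 4.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

for name, X in [("literature only", lit), ("literature + clinical", np.hstack([lit, ehr]))]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```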

  2. RIGED-RA project - Restoration and management of Coastal Dunes in the Northern Adriatic Coast, Ravenna Area - Italy

    NASA Astrophysics Data System (ADS)

    Giambastiani, Beatrice M. S.; Greggio, Nicolas; Sistilli, Flavia; Fabbri, Stefano; Scarelli, Frederico; Candiago, Sebastian; Anfossi, Giulia; Lipparini, Carlo A.; Cantelli, Luigi; Antonellini, Marco; Gabbianelli, Giovanni

    2016-10-01

    Coastal dunes play an important role in protecting the coastline. Unfortunately, in the last decades dunes have been removed or damaged by human activities. In the Emilia-Romagna region significant residual dune systems are found only along the Ravenna and Ferrara coasts. In this context, the RIGED-RA project “Restoration and management of coastal dunes along the Ravenna coast” (2013-2016) was launched with the aims of identifying the dynamics, erosion and vulnerability of the Northern Adriatic coast and its associated residual dunes, and of defining intervention strategies for dune protection and restoration. The methodology is based on a multidisciplinary approach that integrates the expertise of several researchers and investigates all aspects (biotic and abiotic) that drive the dune-beach system. All datasets were integrated to identify test sites for applying dune restoration. The intervention finished in April 2016; evolution and restoration efficiency will be assessed.

  3. A data-based model to locate mass movements triggered by seismic events in Sichuan, China.

    PubMed

    de Souza, Fabio Teodoro

    2014-01-01

    Earthquakes affect the entire world and have catastrophic consequences. On May 12, 2008, an earthquake of magnitude 7.9 on the Richter scale occurred in the Wenchuan area of Sichuan province in China. This event, together with subsequent aftershocks, caused many avalanches, landslides, debris flows, collapses, and quake lakes and induced numerous unstable slopes. This work proposes a methodology that uses a data mining approach and geographic information systems to predict these mass movements based on their association with the main and aftershock epicenters, geologic faults, riverbeds, and topography. A dataset comprising 3,883 mass movements is analyzed, and some models to predict the location of these mass movements are developed. These predictive models could be used by the Chinese authorities as an important tool for identifying risk areas and rescuing survivors during similar events in the future.

  4. Multilayer Stock Forecasting Model Using Fuzzy Time Series

    PubMed Central

    Javedani Sadaei, Hossein; Lee, Muhammad Hisyam

    2014-01-01

    After reviewing the vast body of literature on using FTS in stock market forecasting, certain deficiencies are identified in how the findings have been hybridized. In addition, the lack of a constructive systematic framework that could indicate the direction of growth for FTS forecasting systems as a whole is notable. In this study, we propose a multilayer model for stock market forecasting comprising five logically significant layers. Each layer has its own detailed concern and assists forecast development by addressing certain problems exclusively. To verify the model, a huge dataset containing the Taiwan Stock Index (TAIEX), National Association of Securities Dealers Automated Quotations (NASDAQ), Dow Jones Industrial Average (DJI), and S&P 500 has been chosen as the experimental dataset. The results indicate that the proposed methodology has the potential to be accepted as a framework for model development in stock market forecasting using FTS. PMID:24605058

  5. Outlier identification in urban soils and its implications for identification of potential contaminated land

    NASA Astrophysics Data System (ADS)

    Zhang, Chaosheng

    2010-05-01

    Outliers in urban soil geochemical databases may imply potential contaminated land. Different easily implemented methodologies for the identification of global and spatial outliers were applied to Pb concentrations in urban soils of Galway City in Ireland. Due to the strongly skewed distribution of the data, a Box-Cox transformation was performed prior to further analyses. The graphic methods of histogram and box-and-whisker plot were effective in identifying global outliers at the original scale of the dataset. Spatial outliers could be identified by a local indicator of spatial association (local Moran's I), cross-validation of kriging, and a geographically weighted regression. The spatial locations of outliers were visualised using a geographical information system. Different methods showed generally consistent results, but differences existed. It is suggested that outliers identified by statistical methods should be confirmed and justified using scientific knowledge before they are properly dealt with.
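
    The global-outlier step lends itself to a short illustration: a Box-Cox transformation to tame the skewed Pb concentrations, followed by the box-and-whisker (Tukey fence) rule on the transformed values. The sketch below uses synthetic concentrations; the spatial steps (local Moran's I, kriging cross-validation, geographically weighted regression) are not reproduced.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
pb = rng.lognormal(mean=3.5, sigma=0.5, size=500)        # skewed Pb concentrations (mg/kg)
pb = np.append(pb, [900.0, 1500.0])                      # plant a few contamination-like values

transformed, lam = stats.boxcox(pb)                      # Box-Cox to reduce the skew

q1, q3 = np.percentile(transformed, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)                       # box-and-whisker upper whisker
outliers = pb[transformed > upper_fence]

print(f"Box-Cox lambda = {lam:.2f}")
print("global outliers (mg/kg):", np.sort(outliers))
```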

  6. Predicting the graft survival for heart-lung transplantation patients: an integrated data mining methodology.

    PubMed

    Oztekin, Asil; Delen, Dursun; Kong, Zhenyu James

    2009-12-01

    Predicting the survival of heart-lung transplant patients has the potential to play a critical role in understanding and improving the matching procedure between the recipient and graft. Although voluminous data related to the transplantation procedures is being collected and stored, only a small subset of the predictive factors has been used in modeling heart-lung transplantation outcomes. The previous studies have mainly focused on applying statistical techniques to a small set of factors selected by the domain-experts in order to reveal the simple linear relationships between the factors and survival. The collection of methods known as 'data mining' offers significant advantages over conventional statistical techniques in dealing with the latter's limitations such as normality assumption of observations, independence of observations from each other, and linearity of the relationship between the observations and the output measure(s). There are statistical methods that overcome these limitations. Yet, they are computationally more expensive and do not provide fast and flexible solutions as do data mining techniques in large datasets. The main objective of this study is to improve the prediction of outcomes following combined heart-lung transplantation by proposing an integrated data-mining methodology. A large and feature-rich dataset (16,604 cases with 283 variables) is used to (1) develop machine learning based predictive models and (2) extract the most important predictive factors. Then, using three different variable selection methods, namely, (i) machine learning methods driven variables-using decision trees, neural networks, logistic regression, (ii) the literature review-based expert-defined variables, and (iii) common sense-based interaction variables, a consolidated set of factors is generated and used to develop Cox regression models for heart-lung graft survival. The predictive models' performance in terms of 10-fold cross-validation accuracy rates for two multi-imputed datasets ranged from 79% to 86% for neural networks, from 78% to 86% for logistic regression, and from 71% to 79% for decision trees. The results indicate that the proposed integrated data mining methodology using Cox hazard models better predicted the graft survival with different variables than the conventional approaches commonly used in the literature. This result is validated by the comparison of the corresponding Gains charts for our proposed methodology and the literature review based Cox results, and by the comparison of Akaike information criteria (AIC) values received from each. Data mining-based methodology proposed in this study reveals that there are undiscovered relationships (i.e. interactions of the existing variables) among the survival-related variables, which helps better predict the survival of the heart-lung transplants. It also brings a different set of variables into the scene to be evaluated by the domain-experts and be considered prior to the organ transplantation.
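
    A hedged sketch of the cross-validated machine-learning stage described above: 10-fold accuracy for logistic regression, a decision tree, and a small neural network on a synthetic stand-in dataset. The variable-selection pipeline and the Cox survival models that follow in the study are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transplant dataset (cases x predictive factors).
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                    random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()     # 10-fold cross-validation accuracy
    print(f"{name}: {acc:.3f}")
```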

  7. Using geometric morphometric visualizations of directional selection gradients to investigate morphological differentiation.

    PubMed

    Weaver, Timothy D; Gunz, Philipp

    2018-04-01

    Researchers studying extant and extinct taxa are often interested in identifying the evolutionary processes that have led to the morphological differences among the taxa. Ideally, one could distinguish the influences of neutral evolutionary processes (genetic drift, mutation) from natural selection, and in situations for which selection is implicated, identify the targets of selection. The directional selection gradient is an effective tool for investigating evolutionary process, because it can relate form (size and shape) differences between taxa to the variation and covariation found within taxa. However, although most modern morphometric analyses use the tools of geometric morphometrics (GM) to analyze landmark data, to date, selection gradients have mainly been calculated from linear measurements. To address this methodological gap, here we present a GM approach for visualizing and comparing between-taxon selection gradients with each other, associated difference vectors, and "selection" gradients from neutral simulations. To exemplify our approach, we use a dataset of 347 three-dimensional landmarks and semilandmarks recorded on the crania of 260 primate specimens (112 humans, 67 common chimpanzees, 36 bonobos, 45 gorillas). Results on this example dataset show how incorporating geometric information can provide important insights into the evolution of the human braincase, and serve to demonstrate the utility of our approach for understanding morphological evolution. © 2018 The Author(s). Evolution © 2018 The Society for the Study of Evolution.
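
    The underlying quantity can be illustrated directly: the directional selection gradient relates the between-taxon difference in means to the pooled within-taxon (co)variance, beta = P^-1 * delta-z (Lande, 1979). The hedged sketch below applies this to arbitrary low-dimensional synthetic traits; with Procrustes shape coordinates the pooled covariance matrix would typically require dimension reduction or regularization before inversion.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4                                              # traits (e.g. shape variables)
cov = np.diag([1.0, 0.8, 0.5, 0.3])

taxon_a = rng.multivariate_normal(np.zeros(p), cov, size=100)
taxon_b = rng.multivariate_normal(np.array([0.6, 0.1, -0.3, 0.0]), cov, size=100)

delta_z = taxon_b.mean(axis=0) - taxon_a.mean(axis=0)      # between-taxon difference vector
pooled_P = 0.5 * (np.cov(taxon_a, rowvar=False) + np.cov(taxon_b, rowvar=False))

beta = np.linalg.solve(pooled_P, delta_z)                  # selection gradient, beta = P^-1 dz
print("difference vector: ", np.round(delta_z, 2))
print("selection gradient:", np.round(beta, 2))
```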

  8. Modelling gene expression profiles related to prostate tumor progression using binary states

    PubMed Central

    2013-01-01

    Background Cancer is a complex disease commonly characterized by the disrupted activity of several cancer-related genes such as oncogenes and tumor-suppressor genes. Previous studies suggest that the process of tumor progression to malignancy is dynamic and can be traced by changes in gene expression. Despite the enormous efforts made for differential expression detection and biomarker discovery, few methods have been designed to model the gene expression level to tumor stage during malignancy progression. Such models could help us understand the dynamics and simplify or reveal the complexity of tumor progression. Methods We have modeled an on-off state of gene activation per sample and then per stage to select gene expression profiles associated with tumor progression. The selection is guided by statistical significance of profiles based on random permutated datasets. Results We show that our method identifies expected profiles corresponding to oncogenes and tumor suppressor genes in a prostate tumor progression dataset. Comparisons with other methods support our findings and indicate that a considerable proportion of significant profiles is not found by other statistical tests commonly used to detect differential expression between tumor stages, nor by other tailored methods. Ontology and pathway analysis concurred with these findings. Conclusions Results suggest that our methodology may be a valuable tool to study tumor malignancy progression, which might reveal novel cancer therapies. PMID:23721350

  9. Identification of pumping influences in long-term water level fluctuations.

    PubMed

    Harp, Dylan R; Vesselinov, Velimir V

    2011-01-01

    Identification of the pumping influences at monitoring wells caused by spatially and temporally variable water supply pumping can be a challenging yet important hydrogeological task. The information obtained can be critical for conceptualization of the hydrogeological conditions and indications of the zone of influence of the individual pumping wells. However, the pumping influences are often intermittent and small in magnitude with variable production rates from multiple pumping wells. While these difficulties may support an inclination to abandon the existing dataset and conduct a dedicated cross-hole pumping test, that option can be challenging and expensive to coordinate and execute. This paper presents a method that applies a simple analytical model to a long-term water level record within an inverse modeling framework. The methodology allows the identification of pumping wells influencing the water level fluctuations. Thus, the analysis provides an efficient and cost-effective alternative to designed and coordinated cross-hole pumping tests. We apply this method on a dataset from the Los Alamos National Laboratory site. Our analysis also provides (1) an evaluation of the information content of the transient water level data; (2) indications of potential structures of the aquifer heterogeneity inhibiting or promoting pressure propagation; and (3) guidance for the development of more complicated models requiring detailed specification of the aquifer heterogeneity. Copyright © 2010 The Author(s). Journal compilation © 2010 National Ground Water Association.
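
    The abstract does not specify the analytical model, but a Theis (1935) drawdown solution is a typical choice for this kind of analysis. The hedged sketch below superposes the drawdowns of two hypothetical supply wells at a monitoring location, which is the forward model such an inverse analysis would repeatedly evaluate; the parameter values and well geometry are assumptions for illustration only.

```python
import numpy as np
from scipy.special import exp1

def theis_drawdown(t, r, Q, T, S):
    """Theis (1935) drawdown at radius r (m) and time t (s) for pumping rate Q (m^3/s)."""
    u = r**2 * S / (4.0 * T * np.maximum(t, 1e-9))
    return Q / (4.0 * np.pi * T) * exp1(u)

T, S = 5e-3, 1e-4                                    # transmissivity (m^2/s), storativity
t = np.linspace(1.0, 30 * 86400.0, 200)              # 30 days of record (s)

# Two candidate supply wells at different distances from the monitoring well.
s_near = theis_drawdown(t, r=150.0, Q=0.02, T=T, S=S)
s_far = theis_drawdown(t, r=900.0, Q=0.02, T=T, S=S)

combined = s_near + s_far                            # superposition of pumping influences
print("drawdown after 30 days: near %.2f m, far %.2f m, combined %.2f m"
      % (s_near[-1], s_far[-1], combined[-1]))
```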

  10. Semantics of data and service registration to advance interdisciplinary information and data access.

    NASA Astrophysics Data System (ADS)

    Fox, P. P.; McGuinness, D. L.; Raskin, R.; Sinha, A. K.

    2008-12-01

    In developing an application of semantic web methods and technologies to address the integration of heterogeneous and interdisciplinary earth-science datasets, we have developed methodologies for creating rich semantic descriptions (ontologies) of the application domains. We have leveraged and extended where possible existing ontology frameworks such as SWEET. As a result of this semantic approach, we have also utilized ontologic descriptions of key enabling elements of the application, such as the registration of datasets with ontologies at several levels of granularity. This has enabled the location and usage of the data across disciplines. We are also realizing the need to develop similar semantic registration of web service data holdings as well as those provided with community and/or standard markup languages (e.g. GeoSciML). This level of semantic enablement extending beyond domain terms and relations significantly enhances our ability to provide a coherent semantic data framework for data and information systems. Much of this work is on the frontier of technology development and we will present the current and near-future capabilities we are developing. This work arises from the Semantically-Enabled Science Data Integration (SESDI) project, which is a NASA/ESTO/ACCESS-funded project involving the High Altitude Observatory at the National Center for Atmospheric Research (NCAR), McGuinness Associates Consulting, NASA/JPL and Virginia Polytechnic University.

  11. Convolutional neural network approach for enhanced capture of breast parenchymal complexity patterns associated with breast cancer risk

    NASA Astrophysics Data System (ADS)

    Oustimov, Andrew; Gastounioti, Aimilia; Hsieh, Meng-Kang; Pantalone, Lauren; Conant, Emily F.; Kontos, Despina

    2017-03-01

    We assess the feasibility of a parenchymal texture feature fusion approach, utilizing a convolutional neural network (ConvNet) architecture, to benefit breast cancer risk assessment. We hypothesize that, by capturing sparse, subtle interactions between localized motifs present in two-dimensional texture feature maps derived from mammographic images, a multitude of texture feature descriptors can be optimally reduced to five meta-features capable of serving as a basis on which a linear classifier, such as logistic regression, can efficiently assess breast cancer risk. We combine this methodology with our previously validated lattice-based strategy for parenchymal texture analysis and we evaluate the feasibility of this approach in a case-control study with 424 digital mammograms. In a randomized split-sample setting, we optimize our framework in training/validation sets (N=300) and evaluate its discriminatory performance in an independent test set (N=124). The discriminatory capacity is assessed in terms of the area under the curve (AUC) of the receiver operating characteristic (ROC). The resulting meta-features exhibited strong classification capability in the test dataset (AUC = 0.90), outperforming conventional, non-fused, texture analysis which previously resulted in an AUC=0.85 on the same case-control dataset. Our results suggest that informative interactions between localized motifs exist and can be extracted and summarized via a fairly simple ConvNet architecture.
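
    A hedged PyTorch sketch of the fusion idea: a small convolutional network collapses a stack of lattice-derived texture feature maps into five meta-features, which feed a linear, logistic-regression-like output. The layer sizes, input dimensions, and channel counts are placeholders, not the published architecture.

```python
import torch
import torch.nn as nn

class TextureFusionNet(nn.Module):
    """Toy ConvNet: texture feature maps -> 5 meta-features -> risk logit."""
    def __init__(self, n_feature_maps=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_feature_maps, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # summarize localized motif responses
        )
        self.meta = nn.Linear(16, 5)          # the five meta-features
        self.classifier = nn.Linear(5, 1)     # linear classifier on the meta-features

    def forward(self, x):
        h = self.features(x).flatten(1)
        meta = self.meta(h)
        return self.classifier(meta), meta

model = TextureFusionNet()
maps = torch.randn(8, 10, 63, 63)             # batch of texture feature-map stacks
logits, meta_features = model(maps)
print(logits.shape, meta_features.shape)       # torch.Size([8, 1]) torch.Size([8, 5])
```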

  12. An automated approach towards detecting complex behaviours in deep brain oscillations.

    PubMed

    Mace, Michael; Yousif, Nada; Naushahi, Mohammad; Abdullah-Al-Mamun, Khondaker; Wang, Shouyan; Nandi, Dipankar; Vaidyanathan, Ravi

    2014-03-15

    Extracting event-related potentials (ERPs) from neurological rhythms is of fundamental importance in neuroscience research. Standard ERP techniques typically require the associated ERP waveform to have low variance, be shape and latency invariant and require many repeated trials. Additionally, the non-ERP part of the signal needs to be sampled from an uncorrelated Gaussian process. This limits methods of analysis to quantifying simple behaviours and movements only when multi-trial data-sets are available. We introduce a method for automatically detecting events associated with complex or large-scale behaviours, where the ERP need not conform to the aforementioned requirements. The algorithm is based on the calculation of a detection contour and adaptive threshold. These are combined using logical operations to produce a binary signal indicating the presence (or absence) of an event with the associated detection parameters tuned using a multi-objective genetic algorithm. To validate the proposed methodology, deep brain signals were recorded from implanted electrodes in patients with Parkinson's disease as they participated in a large movement-based behavioural paradigm. The experiment involved bilateral recordings of local field potentials from the sub-thalamic nucleus (STN) and pedunculopontine nucleus (PPN) during an orientation task. After tuning, the algorithm is able to extract events achieving training set sensitivities and specificities of [87.5 ± 6.5, 76.7 ± 12.8, 90.0 ± 4.1] and [92.6 ± 6.3, 86.0 ± 9.0, 29.8 ± 12.3] (mean ± 1 std) for the three subjects, averaged across the four neural sites. Furthermore, the methodology has the potential for utility in real-time applications as only a single-trial ERP is required. Copyright © 2013 Elsevier B.V. All rights reserved.
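
    A hedged numpy sketch of the detection logic described: a detection contour (here, a smoothed envelope of the signal), an adaptive threshold built from a rolling baseline and spread, and their logical combination into a binary event indicator. The window lengths and the 2.5x multiplier are illustrative assumptions; the multi-objective genetic-algorithm tuning is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n = 200, 4000                                   # sampling rate (Hz), samples
lfp = rng.normal(0, 1.0, n)
lfp[1500:1700] += 4.0 * np.hanning(200)             # an embedded movement-related event

def moving(stat, x, win):
    return np.array([stat(x[max(0, i - win):i + 1]) for i in range(len(x))])

envelope = moving(np.mean, np.abs(lfp), win=25)     # detection contour
baseline = moving(np.median, envelope, win=400)     # adaptive baseline
spread = moving(np.std, envelope, win=400)

threshold = baseline + 2.5 * spread                 # adaptive threshold
events = (envelope > threshold).astype(int)         # binary presence/absence signal

onsets = np.flatnonzero(np.diff(events) == 1)
print("detected event onsets (s):", onsets / fs)
```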

  13. The road to NHDPlus — Advancements in digital stream networks and associated catchments

    USGS Publications Warehouse

    Moore, Richard B.; Dewald, Thomas A.

    2016-01-01

    A progression of advancements in Geographic Information Systems techniques for hydrologic network and associated catchment delineation has led to the production of the National Hydrography Dataset Plus (NHDPlus). NHDPlus is a digital stream network for hydrologic modeling with catchments and a suite of related geospatial data. Digital stream networks with associated catchments provide a geospatial framework for linking and integrating water-related data. Advancements in the development of NHDPlus are expected to continue to improve the capabilities of this national geospatial hydrologic framework. NHDPlus is built upon the medium-resolution NHD and, like NHD, was developed by the U.S. Environmental Protection Agency and U.S. Geological Survey to support the estimation of streamflow and stream velocity used in fate-and-transport modeling. Catchments included with NHDPlus were created by integrating vector information from the NHD and from the Watershed Boundary Dataset with the gridded land surface elevation as represented by the National Elevation Dataset. NHDPlus is an actively used and continually improved dataset. Users recognize the importance of a reliable stream network and associated catchments. The NHDPlus spatial features and associated data tables will continue to be improved to support regional water quality and streamflow models and other user-defined applications.

  14. Data Sources for the Analyses

    EPA Pesticide Factsheets

    Links are provided for the National Wetlands Inventory, National Hydrography Dataset, and the WorldClim-Global Climate Data source data websites. This dataset is associated with the following publication: Lane, C., and E. D'Amico. Identification of Putative Geographically Isolated Wetlands of the Conterminous United States. JAWRA. American Water Resources Association, Middleburg, VA, USA, online, (2016).

  15. The relationship between the Early Childhood Environment Rating Scale and its revised form and child outcomes: A systematic review and meta-analysis.

    PubMed

    Brunsek, Ashley; Perlman, Michal; Falenchuk, Olesya; McMullen, Evelyn; Fletcher, Brooke; Shah, Prakesh S

    2017-01-01

    The Early Childhood Environment Rating Scale (ECERS) and its revised version (ECERS-R) were designed as global measures of quality that assess structural and process aspects of Early Childhood Education and Care (ECEC) programs. Despite frequent use of the ECERS/ECERS-R in research and applied settings, associations between it and child outcomes have not been systematically reviewed. The objective of this research was to evaluate the association between the ECERS/ECERS-R and children's wellbeing. Searches of Medline, PsycINFO, ERIC, websites of large datasets and reference sections of all retrieved articles were completed up to July 3, 2015. Eligible studies provided a statistical link between the ECERS/ECERS-R and child outcomes for preschool-aged children in ECEC programs. Of the 823 studies selected for full review, 73 were included in the systematic review and 16 were meta-analyzed. The combined sample across all eligible studies consisted of 33,318 preschool-aged children. Qualitative systematic review results revealed that ECERS/ECERS-R total scores were more generally associated with positive outcomes than subscales or factors. Seventeen separate meta-analyses were conducted to assess the strength of association between the ECERS/ECERS-R and measures that assessed children's language, math and social-emotional outcomes. Meta-analyses revealed a small number of weak effects (in the expected direction) between the ECERS/ECERS-R total score and children's language and positive behavior outcomes. The Language-Reasoning subscale was weakly related to a language outcome. The enormous heterogeneity in how studies operationalized the ECERS/ECERS-R, the outcomes measured and statistics reported limited our ability to meta-analyze many studies. Greater consistency in study methodology is needed in this area of research. Despite these methodological challenges, the ECERS/ECERS-R does appear to capture aspects of quality that are important for children's wellbeing; however, the strength of association is weak.

  16. The relationship between the Early Childhood Environment Rating Scale and its revised form and child outcomes: A systematic review and meta-analysis

    PubMed Central

    Brunsek, Ashley; Perlman, Michal; Falenchuk, Olesya; McMullen, Evelyn; Fletcher, Brooke; Shah, Prakesh S.

    2017-01-01

    The Early Childhood Environment Rating Scale (ECERS) and its revised version (ECERS-R) were designed as global measures of quality that assess structural and process aspects of Early Childhood Education and Care (ECEC) programs. Despite frequent use of the ECERS/ECERS-R in research and applied settings, associations between it and child outcomes have not been systematically reviewed. The objective of this research was to evaluate the association between the ECERS/ECERS-R and children’s wellbeing. Searches of Medline, PsycINFO, ERIC, websites of large datasets and reference sections of all retrieved articles were completed up to July 3, 2015. Eligible studies provided a statistical link between the ECERS/ECERS-R and child outcomes for preschool-aged children in ECEC programs. Of the 823 studies selected for full review, 73 were included in the systematic review and 16 were meta-analyzed. The combined sample across all eligible studies consisted of 33,318 preschool-aged children. Qualitative systematic review results revealed that ECERS/ECERS-R total scores were more generally associated with positive outcomes than subscales or factors. Seventeen separate meta-analyses were conducted to assess the strength of association between the ECERS/ECERS-R and measures that assessed children’s language, math and social-emotional outcomes. Meta-analyses revealed a small number of weak effects (in the expected direction) between the ECERS/ECERS-R total score and children’s language and positive behavior outcomes. The Language-Reasoning subscale was weakly related to a language outcome. The enormous heterogeneity in how studies operationalized the ECERS/ECERS-R, the outcomes measured and statistics reported limited our ability to meta-analyze many studies. Greater consistency in study methodology is needed in this area of research. Despite these methodological challenges, the ECERS/ECERS-R does appear to capture aspects of quality that are important for children’s wellbeing; however, the strength of association is weak. PMID:28586399

  17. Initialization and Setup of the Coastal Model Test Bed: STWAVE

    DTIC Science & Technology

    2017-01-01

    Laboratory (CHL) Field Research Facility (FRF) in Duck, NC. The improved evaluation methodology will promote rapid enhancement of model capability and focus...Blanton 2008) study. This regional digital elevation model (DEM), with a cell size of 10 m, was generated from numerous datasets collected at different...

  18. A novel statistical methodology to overcome sampling irregularities in the forest inventory data and to model forest changes under dynamic disturbance regimes

    Treesearch

    Nikolay Strigul; Jean Lienard

    2015-01-01

    Forest inventory datasets offer unprecedented opportunities to model forest dynamics under evolving environmental conditions but they are analytically challenging due to irregular sampling time intervals of the same plot across the years. We propose here a novel method to model dynamic changes in forest biomass and basal area using forest inventory data. Our...

  19. Effectiveness of Vertex Nomination via Seeded Graph Matching to Find Bijections Between Similar Networks

    DTIC Science & Technology

    2018-02-01

    similar methodology as the author's example was conducted to prepare this dataset for processing via the SGM algorithm. Since and ′ are...

  20. Dataset for Reporting of Malignant Mesothelioma of the Pleura or Peritoneum: Recommendations From the International Collaboration on Cancer Reporting (ICCR).

    PubMed

    Churg, Andrew; Attanoos, Richard; Borczuk, Alain C; Chirieac, Lucian R; Galateau-Sallé, Françoise; Gibbs, Allen; Henderson, Douglas; Roggli, Victor; Rusch, Valerie; Judge, Meagan J; Srigley, John R

    2016-10-01

    The International Collaboration on Cancer Reporting is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom; the College of American Pathologists; the Canadian Association of Pathologists-Association Canadienne des Pathologists, in association with the Canadian Partnership Against Cancer; and the European Society of Pathology. Its goal is to produce common, internationally agreed upon, evidence-based datasets for use throughout the world. The objective here is to describe a dataset developed by the Expert Panel of the International Collaboration on Cancer Reporting for reporting malignant mesothelioma of both the pleura and peritoneum. The dataset is composed of "required" (mandatory) and "recommended" (nonmandatory) elements, based on a review of the most recent evidence and supported by explanatory commentary. Eight required elements and 7 recommended elements were agreed upon by the Expert Panel to represent the essential information for reporting malignant mesothelioma of the pleura and peritoneum. In time, the widespread use of an internationally agreed upon, structured, pathology dataset for mesothelioma will lead not only to improved patient management but also provide valuable data for research and international benchmarks.

  1. A methodology for generating normal and pathological brain perfusion SPECT images for evaluation of MRI/SPECT fusion methods: application in epilepsy

    NASA Astrophysics Data System (ADS)

    Grova, C.; Jannin, P.; Biraben, A.; Buvat, I.; Benali, H.; Bernard, A. M.; Scarabin, J. M.; Gibaud, B.

    2003-12-01

    Quantitative evaluation of brain MRI/SPECT fusion methods for normal and in particular pathological datasets is difficult, due to the frequent lack of relevant ground truth. We propose a methodology to generate MRI and SPECT datasets dedicated to the evaluation of MRI/SPECT fusion methods and illustrate the method when dealing with ictal SPECT. The method consists in generating normal or pathological SPECT data perfectly aligned with a high-resolution 3D T1-weighted MRI using realistic Monte Carlo simulations that closely reproduce the response of a SPECT imaging system. Anatomical input data for the SPECT simulations are obtained from this 3D T1-weighted MRI, while functional input data result from an inter-individual analysis of anatomically standardized SPECT data. The method makes it possible to control the 'brain perfusion' function by proposing a theoretical model of brain perfusion from measurements performed on real SPECT images. Our method provides an absolute gold standard for assessing MRI/SPECT registration method accuracy since, by construction, the SPECT data are perfectly registered with the MRI data. The proposed methodology has been applied to create a theoretical model of normal brain perfusion and ictal brain perfusion characteristic of mesial temporal lobe epilepsy. To approach realistic and unbiased perfusion models, real SPECT data were corrected for uniform attenuation, scatter and partial volume effect. An anatomic standardization was used to account for anatomic variability between subjects. Realistic simulations of normal and ictal SPECT deduced from these perfusion models are presented. The comparison of real and simulated SPECT images showed relative differences in regional activity concentration of less than 20% in most anatomical structures, for both normal and ictal data, suggesting realistic models of perfusion distributions for evaluation purposes. Inter-hemispheric asymmetry coefficients measured on simulated data were found within the range of asymmetry coefficients measured on corresponding real data. The features of the proposed approach are compared with those of other methods previously described to obtain datasets appropriate for the assessment of fusion methods.

  2. Marginal regression models for clustered count data based on zero-inflated Conway-Maxwell-Poisson distribution with applications.

    PubMed

    Choo-Wosoba, Hyoyoung; Levy, Steven M; Datta, Somnath

    2016-06-01

    Community water fluoridation is an important public health measure to prevent dental caries, but it continues to be somewhat controversial. The Iowa Fluoride Study (IFS) is a longitudinal study on a cohort of Iowa children that began in 1991. The main purposes of this study (http://www.dentistry.uiowa.edu/preventive-fluoride-study) were to quantify fluoride exposures from both dietary and nondietary sources and to associate longitudinal fluoride exposures with dental fluorosis (spots on teeth) and dental caries (cavities). We analyze a subset of the IFS data by a marginal regression model with a zero-inflated version of the Conway-Maxwell-Poisson distribution for count data exhibiting excessive zeros and a wide range of dispersion patterns. In general, we introduce two estimation methods for fitting a ZICMP marginal regression model. Finite sample behaviors of the estimators and the resulting confidence intervals are studied using extensive simulation studies. We apply our methodologies to the dental caries data. Our novel modeling incorporating zero inflation, clustering, and overdispersion sheds some new light on the effect of community water fluoridation and other factors. We also include a second application of our methodology to a genomic (next-generation sequencing) dataset that exhibits underdispersion. © 2015, The International Biometric Society.
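
    For reference, the distribution at the core of the model can be written down directly. The hedged sketch below evaluates the zero-inflated Conway-Maxwell-Poisson pmf, P(Y=0) = pi + (1-pi)/Z(lambda, nu) and P(Y=y) = (1-pi) * lambda^y / ((y!)^nu * Z(lambda, nu)), with the normalizing constant truncated; the marginal regression structure, clustering, and the paper's estimation methods are not reproduced.

```python
import numpy as np
from scipy.special import gammaln

def zicmp_pmf(y, lam, nu, pi, max_terms=200):
    """Zero-inflated Conway-Maxwell-Poisson pmf (normalizing constant truncated)."""
    ks = np.arange(max_terms)
    log_terms = ks * np.log(lam) - nu * gammaln(ks + 1)   # log of lam^k / (k!)^nu
    z = np.exp(log_terms).sum()                            # truncated normalizing constant
    cmp_p = np.exp(y * np.log(lam) - nu * gammaln(y + 1)) / z
    return pi * (y == 0) + (1.0 - pi) * cmp_p

lam, nu, pi = 2.0, 0.7, 0.3                      # nu < 1: overdispersed relative to Poisson
probs = [zicmp_pmf(y, lam, nu, pi) for y in range(8)]
print(np.round(probs, 3), "| total over first 8 counts:", round(sum(probs), 3))
```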

  3. Seamless lesion insertion in digital mammography: methodology and reader study

    NASA Astrophysics Data System (ADS)

    Pezeshk, Aria; Petrick, Nicholas; Sahiner, Berkman

    2016-03-01

    Collection of large repositories of clinical images containing verified cancer locations is costly and time consuming due to difficulties associated with both the accumulation of data and establishment of the ground truth. This problem poses a significant challenge to the development of machine learning algorithms that require large amounts of data to properly train and avoid overfitting. In this paper we expand the methods in our previous publications by making several modifications that significantly increase the speed of our insertion algorithms, thereby allowing them to be used for inserting lesions that are much larger in size. These algorithms have been incorporated into an image composition tool that we have made publicly available. This tool allows users to modify or supplement existing datasets by seamlessly inserting a real breast mass or micro-calcification cluster extracted from a source digital mammogram into a different location on another mammogram. We demonstrate examples of the performance of this tool on clinical cases taken from the University of South Florida Digital Database for Screening Mammography (DDSM). Finally, we report the results of a reader study evaluating the realism of inserted lesions compared to clinical lesions. Analysis of the radiologist scores in the study using receiver operating characteristic (ROC) methodology indicates that inserted lesions cannot be reliably distinguished from clinical lesions.

  4. Parton Distributions based on a Maximally Consistent Dataset

    NASA Astrophysics Data System (ADS)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this has to do with the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective, definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be mutually in agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, finding also good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not affect substantially the global fit PDFs.
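
    The Bayesian reweighting referred to above assigns each Monte Carlo replica a weight based on its chi-squared agreement with the new data, w_k proportional to (chi^2_k)^((n-1)/2) * exp(-chi^2_k/2) for n data points, so that replicas in tension with the data are suppressed. The hedged sketch below uses synthetic chi-squared values; it illustrates the weight formula and the effective number of replicas, not the NNPDF implementation.

```python
import numpy as np

def reweight(chi2, n_data):
    """Bayesian reweighting weights for Monte Carlo PDF replicas (Giele-Keller form)."""
    logw = 0.5 * (n_data - 1) * np.log(chi2) - 0.5 * chi2
    logw -= logw.max()                       # stabilize before exponentiating
    w = np.exp(logw)
    return len(chi2) * w / w.sum()           # normalize so the weights average to 1

rng = np.random.default_rng(0)
n_data = 50
chi2 = rng.chisquare(df=n_data, size=1000)   # synthetic chi^2 of each replica w.r.t. the new data

w = reweight(chi2, n_data)
mask = w > 0
n_eff = np.exp(np.sum(w[mask] * np.log(len(w) / w[mask])) / len(w))
print(f"effective number of replicas: {n_eff:.0f} of {len(w)}")
```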

  5. Learning in data-limited multimodal scenarios: Scandent decision forests and tree-based features.

    PubMed

    Hor, Soheil; Moradi, Mehdi

    2016-12-01

    Incomplete and inconsistent datasets often pose difficulties in multimodal studies. We introduce the concept of scandent decision trees to tackle these difficulties. Scandent trees are decision trees that optimally mimic the partitioning of the data determined by another decision tree, and crucially, use only a subset of the feature set. We show how scandent trees can be used to enhance the performance of decision forests trained on a small number of multimodal samples when we have access to larger datasets with vastly incomplete feature sets. Additionally, we introduce the concept of tree-based feature transforms in the decision forest paradigm. When combined with scandent trees, the tree-based feature transforms enable us to train a classifier on a rich multimodal dataset, and use it to classify samples with only a subset of features of the training data. Using this methodology, we build a model trained on MRI and PET images of the ADNI dataset, and then test it on cases with only MRI data. We show that this is significantly more effective in staging of cognitive impairments compared to a similar decision forest model trained and tested on MRI only, or one that uses other kinds of feature transform applied to the MRI data. Copyright © 2016. Published by Elsevier B.V.
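
    A hedged scikit-learn sketch of the mimicking idea as described: a "teacher" tree trained on the full multimodal feature set defines a partition through its leaf assignments, and a second tree restricted to the features available at test time is trained to reproduce that partition. The actual scandent-tree construction and its integration into a forest differ; this only conveys the core intuition on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=12, n_informative=6,
                           random_state=0)
subset = slice(0, 6)                                   # features available for all samples

# Teacher tree uses the full multimodal feature set.
teacher = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:1000], y[:1000])
leaves = teacher.apply(X[:1000])                       # partition induced by the teacher

# Mimic ("scandent-like") tree reproduces that partition from the feature subset only.
mimic = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X[:1000, subset], leaves)

# Map each mimicked leaf back to the teacher's majority class and predict for
# held-out samples that only have the subset of features.
leaf_to_class = {leaf: np.bincount(y[:1000][leaves == leaf]).argmax()
                 for leaf in np.unique(leaves)}
pred = np.array([leaf_to_class[l] for l in mimic.predict(X[1000:, subset])])
print("accuracy with subset-only test data:", accuracy_score(y[1000:], pred))
```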

  6. Capturing Data Connections within the Climate Data Initiative to Support Resiliency

    NASA Astrophysics Data System (ADS)

    Ramachandran, R.; Bugbee, K.; Weigel, A. M.; Tilmes, C.

    2015-12-01

    The Climate Data Initiative (CDI) focuses on preparing the United States for the impacts of climate change by leveraging existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship supporting national climate-change preparedness. To achieve these goals, relevant data was curated around seven thematic areas relevant to climate change resiliency. Data for each theme was selected by subject matter experts from various Federal agencies and collected in Data.gov at http://climate.data.gov. While the curation effort for each theme has been immensely valuable on its own, in the end the themes essentially become a long directory or list, and the valuable connections between datasets and their intended uses are lost. The user understands that the datasets in the list have been approved by the CDI subject matter experts but has less certainty when making connections between the various datasets and their possible applications. Additionally, the curated list can be overwhelming and its intended uses difficult to interpret. In order to better address the needs of the CDI data end users, the CDI team has been developing a new controlled vocabulary that will assist in capturing connections between datasets. This new vocabulary will be implemented in the Global Change Information System (GCIS), which has the capability to link individual items within the system. This presentation will highlight the methodology used to develop the controlled vocabulary that will aid end users in both understanding and locating relevant datasets for their intended use.

  7. Datasets for supplier selection and order allocation with green criteria, all-unit quantity discounts and varying number of suppliers.

    PubMed

    Hamdan, Sadeque; Cheaitou, Ali

    2017-08-01

    This data article provides detailed optimization input and output datasets and optimization code for the published research work titled "Dynamic green supplier selection and order allocation with quantity discounts and varying supplier availability" (Hamdan and Cheaitou, 2017, in press) [1]. Researchers may use these datasets as a baseline for future comparison and extensive analysis of the green supplier selection and order allocation problem with all-unit quantity discounts and a varying number of suppliers. In particular, the datasets presented in this article allow researchers to generate the exact optimization outputs obtained by the authors of [1] using the provided optimization code, and then to compare them with the outputs of other techniques or methodologies, such as heuristic approaches. Moreover, this article includes the randomly generated optimization input data and the related outputs that are used as input data for the statistical analysis presented in [1], in which two different approaches for ranking potential suppliers are compared. This article also provides the time analysis data used in [1] to study the effect of problem size on computation time, as well as an additional time analysis dataset. The input data for the time study are generated randomly, varying the problem size, and are then used by the optimization problem to obtain the corresponding optimal outputs and computation times.

  8. Performance evaluation of tile-based Fisher Ratio analysis using a benchmark yeast metabolome dataset.

    PubMed

    Watson, Nathanial E; Parsons, Brendon A; Synovec, Robert E

    2016-08-12

    Performance of tile-based Fisher Ratio (F-ratio) data analysis, recently developed for discovery-based studies using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC-TOFMS), is evaluated with a metabolomics dataset that had previously been analyzed in great detail using a brute-force approach. The previously analyzed data (referred to herein as the benchmark dataset) were intracellular extracts from Saccharomyces cerevisiae (yeast), either metabolizing glucose (repressed) or ethanol (derepressed), which define the two classes in the discovery-based analysis to find metabolites that are statistically different in concentration between the two classes. Beneficially, this previously analyzed dataset provides a concrete means to validate the tile-based F-ratio software. Herein, we demonstrate and validate the significant benefits of applying tile-based F-ratio analysis. The yeast metabolomics data are analyzed much more rapidly, in about one week versus roughly one year for the prior studies with this dataset. Furthermore, a null distribution analysis is implemented to statistically determine an adequate F-ratio threshold, whereby the variables with F-ratio values below the threshold can be ignored as not class distinguishing, which provides the analyst with confidence when analyzing the hit table. Forty-six of the fifty-four benchmarked changing metabolites were discovered by the new methodology, while all but one of the nineteen benchmarked false-positive metabolites previously identified were consistently excluded. Copyright © 2016 Elsevier B.V. All rights reserved.
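
    The following sketch illustrates the statistical core of this approach: a per-variable F-ratio between two classes combined with a permutation-based null distribution to set the significance threshold. The chromatographic tiling of GC×GC-TOFMS data is not reproduced here, and the synthetic data are placeholders.

      """Minimal F-ratio screen with a permutation-based null threshold (no tiling)."""
      import numpy as np

      rng = np.random.default_rng(1)
      n_per_class, n_vars = 12, 500
      repressed   = rng.normal(0.0, 1.0, (n_per_class, n_vars))
      derepressed = rng.normal(0.0, 1.0, (n_per_class, n_vars))
      derepressed[:, :20] += 2.0                  # 20 truly class-distinguishing variables

      def f_ratio(a, b):
          """One-way ANOVA F-ratio per variable for two classes."""
          grand = np.vstack([a, b]).mean(axis=0)
          between = len(a) * (a.mean(0) - grand) ** 2 + len(b) * (b.mean(0) - grand) ** 2
          within = ((a - a.mean(0)) ** 2).sum(0) + ((b - b.mean(0)) ** 2).sum(0)
          return (between / 1.0) / (within / (len(a) + len(b) - 2))

      observed = f_ratio(repressed, derepressed)

      # Null distribution: shuffle class labels, keep the maximum F per shuffle
      pooled = np.vstack([repressed, derepressed])
      null_max = []
      for _ in range(200):
          idx = rng.permutation(len(pooled))
          null_max.append(f_ratio(pooled[idx[:n_per_class]], pooled[idx[n_per_class:]]).max())
      threshold = np.quantile(null_max, 0.95)

      print(f"F-ratio threshold: {threshold:.1f}; hits above threshold: {(observed > threshold).sum()}")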

  9. Circadian Gene Variants and Susceptibility to Type 2 Diabetes: A Pilot Study

    PubMed Central

    Kelly, M. Ann; Rees, Simon D.; Hydrie, M. Zafar I.; Shera, A. Samad; Bellary, Srikanth; O’Hare, J. Paul; Kumar, Sudhesh; Taheri, Shahrad; Basit, Abdul; Barnett, Anthony H.

    2012-01-01

    Background Disruption of endogenous circadian rhythms has been shown to increase the risk of developing type 2 diabetes, suggesting that circadian genes might play a role in determining disease susceptibility. We present the results of a pilot study investigating the association between type 2 diabetes and selected single nucleotide polymorphisms (SNPs) in/near nine circadian genes. The variants were chosen based on their previously reported association with prostate cancer, a disease that has been suggested to have a genetic link with type 2 diabetes through a number of shared inherited risk determinants. Methodology/Principal Findings The pilot study was performed using two genetically homogeneous Punjabi cohorts, one resident in the United Kingdom and one indigenous to Pakistan. Subjects with (N = 1732) and without (N = 1780) type 2 diabetes were genotyped for thirteen circadian variants using a competitive allele-specific polymerase chain reaction method. Associations between the SNPs and type 2 diabetes were investigated using logistic regression. The results were also combined with in silico data from other South Asian datasets (SAT2D consortium) and white European cohorts (DIAGRAM+) using meta-analysis. The rs7602358G allele near PER2 was negatively associated with type 2 diabetes in our Punjabi cohorts (combined odds ratio [OR] = 0.75 [0.66–0.86], p = 3.18×10^-5), while the BMAL1 rs11022775T allele was associated with an increased risk of the disease (combined OR = 1.22 [1.07–1.39], p = 0.003). Neither of these associations was replicated in the SAT2D or DIAGRAM+ datasets, however. Meta-analysis of all the cohorts identified disease associations with two variants, rs2292912 in CRY2 and rs12315175 near CRY1, although statistical significance was nominal (combined OR = 1.05 [1.01–1.08], p = 0.008 and OR = 0.95 [0.91–0.99], p = 0.015 respectively). Conclusions/Significance None of the selected circadian gene variants was associated with type 2 diabetes with study-wide significance after meta-analysis. The nominal association observed with the CRY2 SNP, however, complements previous findings and confirms a role for this locus in disease susceptibility. PMID:22485135
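
    The cross-cohort combination described above is a standard fixed-effect (inverse-variance) meta-analysis; the sketch below shows that calculation for a single SNP, with made-up per-cohort odds ratios and standard errors rather than the study's actual values.

      """Fixed-effect (inverse-variance) meta-analysis of per-cohort log odds ratios."""
      import numpy as np
      from scipy import stats

      # Hypothetical per-cohort estimates for one SNP: (odds ratio, standard error of log OR)
      cohorts = [(0.78, 0.08), (0.72, 0.10), (0.98, 0.05)]

      beta = np.array([np.log(or_) for or_, _ in cohorts])   # log odds ratios
      se   = np.array([se_ for _, se_ in cohorts])
      w    = 1.0 / se**2                                      # inverse-variance weights

      beta_pooled = np.sum(w * beta) / np.sum(w)
      se_pooled   = np.sqrt(1.0 / np.sum(w))
      z = beta_pooled / se_pooled
      p = 2 * stats.norm.sf(abs(z))

      print(f"combined OR = {np.exp(beta_pooled):.2f}, p = {p:.2g}")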

  10. National Transportation Atlas Databases : 2013

    DOT National Transportation Integrated Search

    2013-01-01

    The National Transportation Atlas Databases 2013 (NTAD2013) is a set of nationwide geographic datasets of transportation facilities, transportation networks, associated infrastructure, and other political and administrative entities. These datasets i...

  11. A comprehensive NMR methodology to assess the composition of biobased and biodegradable polymers in contact with food.

    PubMed

    Gratia, Audrey; Merlet, Denis; Ducruet, Violette; Lyathaud, Cédric

    2015-01-01

    A nuclear magnetic resonance (NMR) methodology was assessed regarding the identification and quantification of additives in three types of polylactide (PLA) intended as food contact materials. Additives were identified using the LNE/NMR database which clusters NMR datasets on more than 130 substances authorized by European Regulation No. 10/2011. Of the 12 additives spiked in the three types of PLA pellets, 10 were rapidly identified by the database and correlated with spectral comparison. The levels of the 12 additives were estimated using quantitative NMR combined with graphical computation. A comparison with chromatographic methods tended to prove the sensitivity of NMR by demonstrating an analytical difference of less than 15%. Our results therefore demonstrated the efficiency of the proposed NMR methodology for rapid assessment of the composition of PLA. Copyright © 2014 Elsevier B.V. All rights reserved.

  12. Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways.

    PubMed

    Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei

    2017-02-01

    Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.

  13. Deep image mining for diabetic retinopathy screening.

    PubMed

    Quellec, Gwenolé; Charrière, Katia; Boudi, Yassine; Cochener, Béatrice; Lamard, Mathieu

    2017-07-01

    Deep learning is quickly becoming the leading methodology for medical image analysis. Given a large medical archive, where each image is associated with a diagnosis, efficient pathology detectors or classifiers can be trained with virtually no expert knowledge about the target pathologies. However, deep learning algorithms, including the popular ConvNets, are black boxes: little is known about the local patterns analyzed by ConvNets to make a decision at the image level. A solution is proposed in this paper to create heatmaps showing which pixels in images play a role in the image-level predictions. In other words, a ConvNet trained for image-level classification can be used to detect lesions as well. A generalization of the backpropagation method is proposed in order to train ConvNets that produce high-quality heatmaps. The proposed solution is applied to diabetic retinopathy (DR) screening in a dataset of almost 90,000 fundus photographs from the 2015 Kaggle Diabetic Retinopathy competition and a private dataset of almost 110,000 photographs (e-ophtha). For the task of detecting referable DR, very good detection performance was achieved: Az = 0.954 in Kaggle's dataset and Az = 0.949 in e-ophtha. Performance was also evaluated at the image level and at the lesion level in the DiaretDB1 dataset, where four types of lesions are manually segmented: microaneurysms, hemorrhages, exudates and cotton-wool spots. For the task of detecting images containing these four lesion types, the proposed detector, which was trained to detect referable DR, outperforms recent algorithms trained to detect those lesions specifically, with pixel-level supervision. At the lesion level, the proposed detector outperforms heatmap generation algorithms for ConvNets. This detector is part of the Messidor® system for mobile eye pathology screening. Because it does not rely on expert knowledge or manual segmentation for detecting relevant patterns, the proposed solution is a promising image mining tool, which has the potential to discover new biomarkers in images. Copyright © 2017 Elsevier B.V. All rights reserved.
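
    The sketch below shows only the basic idea of attributing an image-level score back to pixels via gradients in PyTorch; the paper's generalized backpropagation for high-quality heatmaps is more elaborate, and the model and input used here are untrained placeholders.

      """Minimal gradient-based heatmap (saliency) sketch, not the paper's method."""
      import torch
      import torchvision.models as models

      model = models.resnet18(weights=None)    # stand-in classifier; DR models are trained separately
      model.eval()

      image = torch.rand(1, 3, 224, 224, requires_grad=True)   # placeholder fundus image
      score = model(image)[0].max()                              # image-level prediction score
      score.backward()                                           # backpropagate to the pixels

      # Heatmap: maximum absolute gradient across colour channels
      heatmap = image.grad.abs().max(dim=1)[0].squeeze()
      print("heatmap shape:", tuple(heatmap.shape))              # (224, 224)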

  14. Otolith reading and multi-model inference for improved estimation of age and growth in the gilthead seabream Sparus aurata (L.)

    NASA Astrophysics Data System (ADS)

    Mercier, Lény; Panfili, Jacques; Paillon, Christelle; N'diaye, Awa; Mouillot, David; Darnaude, Audrey M.

    2011-05-01

    Accurate knowledge of fish age and growth is crucial for species conservation and management of exploited marine stocks. In exploited species, age estimation based on otolith reading is routinely used for building growth curves that are used to implement fishery management models. However, the universal fit of the von Bertalanffy growth function (VBGF) on data from commercial landings can lead to uncertainty in growth parameter inference, preventing accurate comparison of growth-based life-history traits between fish populations. In the present paper, we used a comprehensive annual sample of wild gilthead seabream (Sparus aurata L.) in the Gulf of Lions (France, NW Mediterranean) to test a methodology improving growth modelling for exploited fish populations. After validating the timing of otolith annual increment formation for all life stages, a comprehensive set of growth models (including the VBGF) was fitted to the obtained age-length data, used as a whole or sub-divided between group 0 individuals and those coming from commercial landings (ages 1-6). Comparisons of growth model accuracy based on the Akaike Information Criterion allowed assessment of the best model for each dataset and, when no model correctly fitted the data, a multi-model inference (MMI) based on model averaging was carried out. The results provided evidence that growth parameters inferred with the VBGF must be used with great caution. Indeed, the VBGF turned out to be among the least accurate models for growth prediction irrespective of the dataset, and its fit to the whole population, the juvenile dataset, or the adult dataset provided different growth parameters. The best models for growth prediction were the Tanaka model, for group 0 juveniles, and the MMI, for the older fish, confirming that growth differs substantially between juveniles and adults. All asymptotic models failed to correctly describe the growth of adult S. aurata, probably because of the poor representation of old individuals in the dataset. Multi-model inference associated with separate analysis of juveniles and adult fish is therefore advised to obtain objective estimations of growth parameters when sampling cannot be corrected towards older fish.
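
    A minimal sketch of AIC-based model comparison and model averaging in the spirit of the multi-model inference described above is given below, fitting standard von Bertalanffy and Gompertz forms to invented length-at-age values with SciPy; it is not the paper's analysis or data.

      """AIC comparison and Akaike-weight model averaging for growth curves (toy data)."""
      import numpy as np
      from scipy.optimize import curve_fit

      age = np.array([0.5, 1, 2, 3, 4, 5, 6], dtype=float)
      length = np.array([12.0, 19.0, 27.0, 31.0, 34.0, 35.5, 36.5])   # hypothetical cm

      def vbgf(t, linf, k, t0):
          return linf * (1 - np.exp(-k * (t - t0)))

      def gompertz(t, linf, k, t0):
          return linf * np.exp(-np.exp(-k * (t - t0)))

      def fit_aic(model, p0):
          params, _ = curve_fit(model, age, length, p0=p0, maxfev=10000)
          rss = np.sum((length - model(age, *params)) ** 2)
          n, k = len(age), len(params)
          return n * np.log(rss / n) + 2 * k, params              # least-squares AIC

      results = {"VBGF": fit_aic(vbgf, [40, 0.3, -0.5]), "Gompertz": fit_aic(gompertz, [40, 0.5, 0.5])}
      aics = np.array([v[0] for v in results.values()])
      weights = np.exp(-0.5 * (aics - aics.min()))
      weights /= weights.sum()                                      # Akaike weights

      for (name, (a, _)), w in zip(results.items(), weights):
          print(f"{name}: AIC={a:.1f}, weight={w:.2f}")

      # Model-averaged prediction at age 7 (multi-model inference)
      preds = np.array([m(7.0, *results[n][1]) for n, m in [("VBGF", vbgf), ("Gompertz", gompertz)]])
      print("model-averaged length at age 7:", np.round(np.sum(weights * preds), 1))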

  15. Renewable Energy Zones for Balancing Siting Trade-offs in India

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Deshmukh, Ranjit; Wu, Grace C.; Phadke, Amol

    India’s targets of 175 GW of renewable energy capacity by 2022, and 40% generation capacity from non-fossil fuel sources by 2030 will require a rapid and dramatic increase in solar and wind capacity deployment and overcoming its associated economic, siting, and power system challenges. The objective of this study was to spatially identify the amount and quality of wind and utility-scale solar resource potential in India, and the possible siting-related constraints and opportunities for development of renewable resources. Using the Multi-criteria Analysis for Planning Renewable Energy (MapRE) methodological framework, we estimated several criteria valuable for the selection of sites for development for each identified potential "zone", such as the levelized cost of electricity, distance to nearest substation, capacity value (or the temporal matching of renewable energy generation to demand), and the type of land cover. We find that high quality resources are spatially heterogeneous across India, with most wind and solar resources concentrated in the southern and western states, and the northern state of Rajasthan. Assuming India's Central Electricity Regulatory Commission's norms, we find that the range of levelized costs of generation of wind and solar PV resources overlap, but concentrated solar power (CSP) resources can be approximately twice as expensive. Further, the levelized costs of generation vary much more across wind zones than those across solar zones because of greater heterogeneity in the quality of wind resources compared to that of solar resources. When considering transmission accessibility, we find that about half of all wind zones (47%) and two-thirds of all solar PV zones (66%) are more than 25 km from existing 220 kV and above substations, suggesting potential constraints in access to high voltage transmission infrastructure and opportunities for preemptive transmission planning to scale up RE development. Additionally and importantly, we find that about 84% of all wind zones are on agricultural land, which provide opportunities for multiple-uses of land but may also impose constraints on land availability. We find that only 29% of suitable solar PV sites and 15% of CSP sites are within 10 km of a surface water body suggesting water availability as a significant siting constraint for solar plants. Availability of groundwater resources was not analyzed as part of this study. Lastly, given the possible economic benefits of transmission extensions or upgrades that serve both wind and solar generators, we quantified the co-location opportunities between the two technologies and find that about a quarter (28%) of all solar PV zones overlap with wind zones. Using the planning tools made available as part of this study, these multiple siting constraints and opportunities can be systematically compared and weighted to prioritize development that achieves a particular technology target. Our results are limited by the uncertainties associated with the input datasets, in particular the geospatial wind and solar resource, transmission, and land use land cover datasets. As input datasets get updated and improved, the methodology and tools developed through this study can be easily adapted and applied to these new datasets to improve upon the results presented in this study. India is on a path to significantly decarbonize its electricity grid through wind and solar development. A stakeholder-driven, systematic, and integrated planning approach using data and tools such as those highlighted in this study is essential to not only meet the country's RE targets, but to meet them in a cost-effective, and socially and environmentally sustainable way.
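
    For context, the levelized cost of electricity used to score zones is typically an annualized capital cost plus fixed operation and maintenance, divided by annual generation. The sketch below implements that generic formula with placeholder cost and capacity-factor numbers, not the CERC norms or MapRE parameterization used in the study.

      """Generic levelized-cost-of-electricity (LCOE) calculation with placeholder inputs."""

      def crf(rate: float, years: int) -> float:
          """Capital recovery factor for a discount rate and economic lifetime."""
          return rate * (1 + rate) ** years / ((1 + rate) ** years - 1)

      def lcoe(capex_per_kw: float, fom_per_kw_yr: float, capacity_factor: float,
               rate: float = 0.10, years: int = 25) -> float:
          """LCOE in currency per kWh: annualized capital plus fixed O&M over annual generation."""
          annual_cost = capex_per_kw * crf(rate, years) + fom_per_kw_yr
          annual_kwh = capacity_factor * 8760.0
          return annual_cost / annual_kwh

      # Hypothetical wind vs solar PV zones
      print(f"wind zone  LCOE: {lcoe(1100, 35, 0.32):.3f} $/kWh")
      print(f"solar zone LCOE: {lcoe(800, 15, 0.21):.3f} $/kWh")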

  16. A method for gene-based pathway analysis using genomewide association study summary statistics reveals nine new type 1 diabetes associations.

    PubMed

    Evangelou, Marina; Smyth, Deborah J; Fortune, Mary D; Burren, Oliver S; Walker, Neil M; Guo, Hui; Onengut-Gumuscu, Suna; Chen, Wei-Min; Concannon, Patrick; Rich, Stephen S; Todd, John A; Wallace, Chris

    2014-12-01

    Pathway analysis can complement point-wise single nucleotide polymorphism (SNP) analysis in exploring genomewide association study (GWAS) data to identify specific disease-associated genes that can be candidate causal genes. We propose a straightforward methodology that can be used for conducting a gene-based pathway analysis using summary GWAS statistics in combination with widely available reference genotype data. We used this method to perform a gene-based pathway analysis of a type 1 diabetes (T1D) meta-analysis GWAS (of 7,514 cases and 9,045 controls). An important feature of the conducted analysis is the removal of the major histocompatibility complex gene region, the major genetic risk factor for T1D. Thirty-one of the 1,583 (2%) tested pathways were identified to be enriched for association with T1D at a 5% false discovery rate. We analyzed these 31 pathways and their genes to identify SNPs in or near these pathway genes that showed potentially novel association with T1D and attempted to replicate the association of 22 SNPs in additional samples. Replication P-values were skewed (P=9.85×10^-11) with 12 of the 22 SNPs showing P<0.05. Support, including replication evidence, was obtained for nine T1D associated variants in genes ITGB7 (rs11170466, P=7.86×10^-9), NRP1 (rs722988, 4.88×10^-8), BAD (rs694739, 2.37×10^-7), CTSB (rs1296023, 2.79×10^-7), FYN (rs11964650, P=5.60×10^-7), UBE2G1 (rs9906760, 5.08×10^-7), MAP3K14 (rs17759555, 9.67×10^-7), ITGB1 (rs1557150, 1.93×10^-6), and IL7R (rs1445898, 2.76×10^-6). The proposed methodology can be applied to other GWAS datasets for which only summary level data are available. © 2014 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.

  17. Associations between land use and Perkinsus marinus infection of eastern oysters in a high salinity, partially urbanized estuary

    USGS Publications Warehouse

    Gray, Brian R.; Bushek, David; Drane, J. Wanzer; Porter, Dwayne

    2009-01-01

    Infection levels of eastern oysters by the unicellular pathogen Perkinsus marinus have been associated with anthropogenic influences in laboratory studies. However, these relationships have been difficult to investigate in the field because anthropogenic inputs are often associated with natural influences such as freshwater inflow, which can also affect infection levels. We addressed P. marinus-land use associations using field-collected data from Murrells Inlet, South Carolina, USA, a developed, coastal estuary with relatively minor freshwater inputs. Ten oysters from each of 30 reefs were sampled quarterly in each of 2 years. Distances to nearest urbanized land class and to nearest stormwater outfall were measured via both tidal creeks and an elaboration of Euclidean distance. As the forms of any associations between oyster infection and distance to urbanization were unknown a priori, we used data from the first and second years of the study as exploratory and confirmatory datasets, respectively. With one exception, quarterly land use associations identified using the exploratory dataset were not confirmed using the confirmatory dataset. The exception was an association between the prevalence of moderate to high infection levels in winter and decreasing distance to nearest urban land use. Given that the study design appeared adequate to detect effects inferred from the exploratory dataset, these results suggest that effects of land use gradients were largely insubstantial or were ephemeral with duration less than 3 months.

  18. Galaxy-M: a Galaxy workflow for processing and analyzing direct infusion and liquid chromatography mass spectrometry-based metabolomics data.

    PubMed

    Davidson, Robert L; Weber, Ralf J M; Liu, Haoyu; Sharma-Oates, Archana; Viant, Mark R

    2016-01-01

    Metabolomics is increasingly recognized as an invaluable tool in the biological, medical and environmental sciences yet lags behind the methodological maturity of other omics fields. To achieve its full potential, including the integration of multiple omics modalities, the accessibility, standardization and reproducibility of computational metabolomics tools must be improved significantly. Here we present our end-to-end mass spectrometry metabolomics workflow in the widely used platform, Galaxy. Named Galaxy-M, our workflow has been developed for both direct infusion mass spectrometry (DIMS) and liquid chromatography mass spectrometry (LC-MS) metabolomics. The range of tools presented spans from processing of raw data, e.g. peak picking and alignment, through data cleansing, e.g. missing value imputation, to preparation for statistical analysis, e.g. normalization and scaling, and principal components analysis (PCA) with associated statistical evaluation. We demonstrate the ease of using these Galaxy workflows via the analysis of DIMS and LC-MS datasets, and provide PCA scores and associated statistics to help other users to ensure that they can accurately repeat the processing and analysis of these two datasets. Galaxy and data are all provided pre-installed in a virtual machine (VM) that can be downloaded from the GigaDB repository. Additionally, source code, executables and installation instructions are available from GitHub. The Galaxy platform has enabled us to produce an easily accessible and reproducible computational metabolomics workflow. More tools could be added by the community to expand its functionality. We recommend that Galaxy-M workflow files are included within the supplementary information of publications, enabling metabolomics studies to achieve greater reproducibility.
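
    A minimal preprocessing-and-PCA sketch in the spirit of the steps the workflow chains together (missing-value imputation, log transform, scaling, principal components analysis) is shown below, using scikit-learn directly rather than the Galaxy tools themselves; the intensity matrix is simulated.

      """Toy metabolomics preprocessing pipeline: impute, log, scale, PCA."""
      import numpy as np
      from sklearn.impute import KNNImputer
      from sklearn.preprocessing import StandardScaler
      from sklearn.decomposition import PCA

      rng = np.random.default_rng(42)
      intensities = rng.lognormal(mean=8, sigma=1, size=(30, 200))     # 30 samples x 200 features
      intensities[rng.random(intensities.shape) < 0.05] = np.nan       # simulate missing peaks

      X = KNNImputer(n_neighbors=5).fit_transform(intensities)         # missing value imputation
      X = np.log(X)                                                     # variance-stabilising transform
      X = StandardScaler().fit_transform(X)                             # autoscaling

      pca = PCA(n_components=2)
      scores = pca.fit_transform(X)
      print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))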

  19. Dataset of working conditions and thermo-economic performances for hybrid organic Rankine plants fed by solar and low-grade energy sources.

    PubMed

    Scardigno, Domenico; Fanelli, Emanuele; Viggiano, Annarita; Braccio, Giacobbe; Magi, Vinicio

    2016-06-01

    This article provides the dataset of operating conditions of a hybrid organic Rankine plant generated by the optimization procedure employed in the research article "A genetic optimization of a hybrid organic Rankine plant for solar and low-grade energy sources" (Scardigno et al., 2015) [1]. The methodology used to obtain the data is described. The operating conditions are subdivided into two separate groups: feasible and unfeasible solutions. In both groups, the values of the design variables are given. Besides, the subset of feasible solutions is described in details, by providing the thermodynamic and economic performances, the temperatures at some characteristic sections of the thermodynamic cycle, the net power, the absorbed powers and the area of the heat exchange surfaces.

  20. Genetic Algorithms and Classification Trees in Feature Discovery: Diabetes and the NHANES database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Heredia-Langner, Alejandro; Jarman, Kristin H.; Amidan, Brett G.

    2013-09-01

    This paper presents a feature selection methodology that can be applied to datasets containing a mixture of continuous and categorical variables. Using a Genetic Algorithm (GA), this method explores a dataset and selects a small set of features relevant for the prediction of a binary (1/0) response. Binary classification trees and an objective function based on conditional probabilities are used to measure the fitness of a given subset of features. The method is applied to health data in order to find factors useful for the prediction of diabetes. Results show that our algorithm is capable of narrowing down the set of predictors to around 8 factors that can be validated using reputable medical and public health resources.
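
    The sketch below shows a compact version of this kind of GA-based feature selection: binary chromosomes encode feature subsets and the cross-validated accuracy of a classification tree serves as the fitness. The dataset, population size, and GA settings are illustrative only and do not reproduce the NHANES analysis or its probability-based objective function.

      """Toy genetic-algorithm feature selection with a decision-tree fitness."""
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      rng = np.random.default_rng(0)
      X, y = make_classification(n_samples=300, n_features=30, n_informative=6, random_state=0)

      def fitness(mask):
          """Cross-validated tree accuracy for the feature subset encoded by mask."""
          if mask.sum() == 0:
              return 0.0
          clf = DecisionTreeClassifier(max_depth=4, random_state=0)
          return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

      pop = rng.random((20, X.shape[1])) < 0.3           # initial population of feature masks
      for generation in range(15):
          scores = np.array([fitness(ind) for ind in pop])
          parents = pop[np.argsort(scores)[::-1][:10]]   # truncation selection
          children = []
          for _ in range(10):
              a, b = parents[rng.integers(10)], parents[rng.integers(10)]
              cut = rng.integers(1, X.shape[1])          # one-point crossover
              child = np.concatenate([a[:cut], b[cut:]])
              child ^= rng.random(child.shape) < 0.02    # bit-flip mutation
              children.append(child)
          pop = np.vstack([parents, children])

      best = pop[np.argmax([fitness(ind) for ind in pop])]
      print("selected features:", np.flatnonzero(best), "cv accuracy:", round(fitness(best), 3))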

  1. Estimating Mixture of Gaussian Processes by Kernel Smoothing

    PubMed Central

    Huang, Mian; Li, Runze; Wang, Hansheng; Yao, Weixin

    2014-01-01

    When the functional data are not homogeneous, e.g., there exist multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this paper, we propose a new estimation procedure for the Mixture of Gaussian Processes, to incorporate both functional and inhomogeneous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions. The model is shown to be identifiable, and can be estimated efficiently by a combination of the ideas from EM algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset. PMID:24976675

  2. White blood cells identification system based on convolutional deep neural learning networks.

    PubMed

    Shahin, A I; Guo, Yanhui; Amin, K M; Sharawi, Amr A

    2017-11-16

    White blood cell (WBC) differential counting yields valuable information about human health and disease. Currently available automated cell morphology equipment performs differential counts based on blood smear image analysis. Previous WBC identification systems consist of successive dependent stages: pre-processing, segmentation, feature extraction, feature selection, and classification. There is a real need to employ deep learning methodologies so that the performance of previous WBC identification systems can be increased. Classifying small, limited datasets with deep learning systems is a major challenge and should be investigated. In this paper, we propose a novel identification system for WBCs based on deep convolutional neural networks. Two transfer learning methodologies are followed: transfer learning based on deep activation features, and fine-tuning of existing deep networks. Deep activation features are extracted from several pre-trained networks and employed in a traditional identification system. Moreover, a novel end-to-end deep convolutional architecture called "WBCsNet" is proposed and built from scratch. Finally, classification of a limited, balanced WBC dataset is performed using WBCsNet as a pre-trained network. During our experiments, three different public WBC datasets (2551 images) containing 5 healthy WBC types were used. The overall system accuracy achieved by the proposed WBCsNet is 96.1%, which is higher than that of the different transfer learning approaches or even the previous traditional identification system. We also present feature visualizations for the WBCsNet activations, which show a stronger response than the pre-trained ones. In summary, a novel WBC identification system based on deep learning is proposed, and the high-performance WBCsNet can be employed as a pre-trained network. Copyright © 2017. Published by Elsevier B.V.

  3. Applying Multimodel Ensemble from Regional Climate Models for Improving Runoff Projections on Semiarid Regions of Spain

    NASA Astrophysics Data System (ADS)

    Garcia Galiano, S. G.; Olmos, P.; Giraldo Osorio, J. D.

    2015-12-01

    In the Mediterranean area, significant changes in temperature and precipitation are expected throughout the century. These trends could exacerbate existing conditions in regions already vulnerable to climatic variability, reducing water availability. Improving knowledge about plausible impacts of climate change on water cycle processes at the basin scale is an important step toward building adaptive capacity in this region, where severe water shortages are expected in the coming decades. An ensemble of Regional Climate Models (RCMs), combined with distributed hydrological models with few parameters, constitutes a valid and robust methodology for increasing the reliability of climate and hydrological projections. To this end, a novel methodology for building RCM ensembles of meteorological variables (rainfall and temperature) was applied. The evaluation of RCM goodness-of-fit used to build the ensemble is based on empirical probability density functions (PDFs) extracted from both the RCM datasets and a high-resolution gridded observational dataset for the period 1961-1990, and accounts for the seasonal and annual variability of rainfall and temperature. The RCM ensembles constitute the input to a distributed hydrological model at the basin scale for assessing runoff projections. The selected hydrological model has few parameters in order to reduce the uncertainties involved. The study basin is a headwater basin of the Segura River Basin, located in south-east Spain. The impacts on runoff and its trend were assessed from the observational dataset and the climate projections. Relative to the control period 1961-1990, plausible significant decreases in runoff were identified for the period 2021-2050.

  4. The effects of fossil placement and calibration on divergence times and rates: an example from the termites (Insecta: Isoptera).

    PubMed

    Ware, Jessica L; Grimaldi, David A; Engel, Michael S

    2010-01-01

    Among insects, eusocial behavior occurs in termites, ants, some bees and wasps. Isoptera and Hymenoptera convergently share social behavior, and for both taxa its evolution remains poorly understood. While dating analyses provide researchers with the opportunity to date the origin of eusociality, fossil calibration methodology may mislead subsequent ecological interpretations. Using a comprehensive termite dataset, we explored the effect of fossil placement and calibration methodology. A combined molecular and morphological dataset for 42 extant termite lineages was used, as was a second dataset including these 42 taxa plus an additional 39 fossil lineages for which we had only morphological data. MrBayes doublet-model analyses recovered similar topologies, with one minor exception (Stolotermitidae is sister to the Hodotermitidae, s.s., in the 42-taxon analysis but is in a polytomy with Hodotermitidae and (Kalotermitidae + Neoisoptera) in the 81-taxon analysis). Analyses using the r8s program on these topologies were run with either minimum/maximum constraints (analysis A = 42-taxon and analysis C = 81-taxon analyses) or with the fossil taxon ages fixed (ages fixed to be the geological age of the deposit from which they came, analysis B = 81-taxon analysis). Confidence intervals were determined for the resulting ultrametric trees, and for most major clades there was significant overlap between dates recovered for analyses A and C (with exceptions such as the Neoisoptera and Euisoptera nodes). With the exception of the isopteran and euisopteran node ages, however, none of the major clade ages overlapped when analysis B was compared with either analysis A or C. Future studies on Dictyoptera should note that the age of Kalotermitidae was underestimated in the absence of kalotermitid fossils with fixed ages. Copyright (c) 2009 Elsevier Ltd. All rights reserved.

  5. Fast 3D shape screening of large chemical databases through alignment-recycling

    PubMed Central

    Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H

    2007-01-01

    Background Large chemical databases require fast, efficient, and simple ways of looking for similar structures. Although such tasks are now fairly well resolved for graph-based similarity queries, they remain an issue for 3D approaches, particularly for those based on 3D shape overlays. Inspired by a recent technique developed to compare molecular shapes, we designed a hybrid methodology, alignment-recycling, that enables efficient retrieval and alignment of structures with similar 3D shapes. Results Using a dataset of more than one million PubChem compounds of limited size (< 28 heavy atoms) and flexibility (< 6 rotatable bonds), we obtained a set of a few thousand diverse structures covering entirely the 3D shape space of the conformers of the dataset. Transformation matrices gathered from the overlays between these diverse structures and the 3D conformer dataset allowed us to drastically (100-fold) reduce the CPU time required for shape overlay. The alignment-recycling heuristic produces results consistent with de novo alignment calculation, with better than 80% hit list overlap on average. Conclusion Overlay-based 3D methods are computationally demanding when searching large databases. Alignment-recycling reduces the CPU time to perform shape similarity searches by breaking the alignment problem into three steps: selection of diverse shapes to describe the database shape-space; overlay of the database conformers to the diverse shapes; and non-optimized overlay of query and database conformers using common reference shapes. The precomputation, required by the first two steps, is a significant cost of the method; however, once performed, querying is two orders of magnitude faster. Extensions and variations of this methodology, for example, to handle more flexible and larger small-molecules are discussed. PMID:17880744

  6. Geodatabase of sites, basin boundaries, and topology rules used to store drainage basin boundaries for the U.S. Geological Survey, Colorado Water Science Center

    USGS Publications Warehouse

    Dupree, Jean A.; Crowfoot, Richard M.

    2012-01-01

    This geodatabase and its component datasets are part of U.S. Geological Survey Digital Data Series 650 and were generated to store basin boundaries for U.S. Geological Survey streamgages and other sites in Colorado. The geodatabase and its components were created by the U.S. Geological Survey, Colorado Water Science Center, and are used to derive the numeric drainage areas for Colorado that are input into the U.S. Geological Survey's National Water Information System (NWIS) database and also published in the Annual Water Data Report and on NWISWeb. The foundational dataset used to create the basin boundaries in this geodatabase was the National Watershed Boundary Dataset. This geodatabase accompanies a U.S. Geological Survey Techniques and Methods report (Book 11, Section C, Chapter 6) entitled "Digital Database Architecture and Delineation Methodology for Deriving Drainage Basins, and Comparison of Digitally and Non-Digitally Derived Numeric Drainage Areas." The Techniques and Methods report details the geodatabase architecture, describes the delineation methodology and workflows used to develop these basin boundaries, and compares digitally derived numeric drainage areas in this geodatabase to non-digitally derived areas.
    1. COBasins.gdb: This geodatabase contains site locations and basin boundaries for Colorado. It includes a single feature dataset, called BasinsFD, which groups the component feature classes and topology rules.
    2. BasinsFD: This feature dataset in the "COBasins.gdb" geodatabase is a digital container that holds the feature classes used to archive site locations and basin boundaries as well as the topology rules that govern spatial relations within and among component feature classes. This feature dataset includes three feature classes: the sites for which basins have been delineated (the "Sites" feature class), basin bounding lines (the "BasinLines" feature class), and polygonal basin areas (the "BasinPolys" feature class). The feature dataset also stores the topology rules (the "BasinsFD_Topology") that constrain the relations within and among component feature classes. The feature dataset also forces any feature classes inside it to have a consistent projection system, which is, in this case, an Albers-Equal-Area projection system.
    3. BasinsFD_Topology: This topology contains four persistent topology rules that constrain the spatial relations within the "BasinLines" feature class and between the "BasinLines" feature class and the "BasinPolys" feature classes.
    4. Sites: This point feature class contains the digital representations of the site locations for which Colorado Water Science Center basin boundaries have been delineated. This feature class includes point locations for Colorado Water Science Center active (as of September 30, 2009) gages and for other sites.
    5. BasinLines: This line feature class contains the perimeters of basins delineated for features in the "Sites" feature class, and it also contains information regarding the sources of lines used for the basin boundaries.
    6. BasinPolys: This polygon feature class contains the polygonal basin areas delineated for features in the "Sites" feature class, and it is used to derive the numeric drainage areas published by the Colorado Water Science Center.

  7. Virtual Water Transfers in U.S. Cities from Domestic Commodity Flows

    NASA Astrophysics Data System (ADS)

    Ahams, I. C.; Mejia, A.; Paterson, W.

    2015-12-01

    Cities have imported water into their boundaries for centuries, but how cities indirectly affect watersheds through the commodities they import remains poorly understood. Thus, we present and discuss here a methodology for determining the virtual water transfers to and from U.S. cities associated with domestic commodity flows. For our methodology, we only consider agricultural and industrial commodities and, to represent the commodity flows, we use the Freight Analysis Framework (FAF) dataset from the U.S. Department of Transportation. Accordingly, we determine virtual water transfers for the 123 geographic regions in the FAF, which consist of 17 states, 73 metropolitan statistical areas (MSAs), and 33 remainders of states. Out of the 41 sectors that comprise the FAF data, we consider only the 29 sectors that account for the agricultural and industrial commodities. Using both water use data for macro-sectors and national water use coefficients for different industries, we determine a weighted water use coefficient for each of the 29 sectors considered. Ultimately, we use these weighted coefficients to estimate virtual water transfers and the water footprint for each city. Preliminary comparisons with other water footprint estimates indicate that our methodology yields reasonable results. In terms of the water footprint, we find that cities (i.e. MSAs) are net consumers, can consume a large proportion of their own production, and can have a large agricultural production. We also find that the per capita water footprint of industrial consumption decreases with increasing population in cities, suggesting that large cities may be more efficient.
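
    At its core, the accounting multiplies commodity flows by sector-specific water-use coefficients; the toy pandas sketch below illustrates that step and a simple net-import calculation with invented regions, sectors, and coefficients rather than the FAF data.

      """Toy virtual-water accounting: flows (tonnes) x water-use coefficients (m3/tonne)."""
      import pandas as pd

      flows = pd.DataFrame({
          "origin":      ["RegionA", "RegionB", "RegionA", "RegionB"],
          "destination": ["RegionB", "RegionA", "RegionA", "RegionB"],
          "sector":      ["cereals", "machinery", "cereals", "machinery"],
          "tonnes":      [1000.0, 200.0, 500.0, 50.0],
      })
      water_coeff = {"cereals": 1200.0, "machinery": 80.0}   # hypothetical m3 per tonne

      flows["virtual_water_m3"] = flows["tonnes"] * flows["sector"].map(water_coeff)

      imports = flows.groupby("destination")["virtual_water_m3"].sum()
      exports = flows.groupby("origin")["virtual_water_m3"].sum()
      print("net virtual water imports (m3):")
      print((imports - exports).fillna(0.0))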

  8. A methodology to link national and local information for spatial targeting of ammonia mitigation efforts

    NASA Astrophysics Data System (ADS)

    Carnell, E. J.; Misselbrook, T. H.; Dore, A. J.; Sutton, M. A.; Dragosits, U.

    2017-09-01

    The effects of atmospheric nitrogen (N) deposition are evident in terrestrial ecosystems worldwide, with eutrophication and acidification leading to significant changes in species composition. Substantial reductions in N deposition from nitrogen oxides emissions have been achieved in recent decades. By contrast, ammonia (NH3) emissions from agriculture have not decreased substantially and are typically highly spatially variable, making efficient mitigation challenging. One solution is to target NH3 mitigation measures spatially in source landscapes to maximize the benefits for nature conservation. The paper develops an approach to link national scale data and detailed local data to help identify suitable measures for spatial targeting of local sources near designated Special Areas of Conservation (SACs). The methodology combines high-resolution national data on emissions, deposition and source attribution with local data on agricultural management and site conditions. Application of the methodology for the full set of 240 SACs in England found that agriculture contributes ∼45 % of total N deposition. Activities associated with cattle farming represented 54 % of agricultural NH3 emissions within 2 km of the SACs, making them a major contributor to local N deposition, followed by mineral fertiliser application (21 %). Incorporation of local information on agricultural management practices at seven example SACs provided the means to correct outcomes compared with national-scale emission factors. The outcomes show how national scale datasets can provide information on N deposition threats at landscape to national scales, while local-scale information helps to understand the feasibility of mitigation measures, including the impact of detailed spatial targeting on N deposition rates to designated sites.

  9. A dataset of stereoscopic images and ground-truth disparity mimicking human fixations in peripersonal space

    PubMed Central

    Canessa, Andrea; Gibaldi, Agostino; Chessa, Manuela; Fato, Marco; Solari, Fabio; Sabatini, Silvio P.

    2017-01-01

    Binocular stereopsis is the ability of a visual system, belonging to a living being or a machine, to interpret the different visual information deriving from two eyes/cameras for depth perception. From this perspective, the ground-truth information about three-dimensional visual space, which is rarely available, is an ideal tool both for evaluating human performance and for benchmarking machine vision algorithms. In the present work, we implemented a rendering methodology in which the camera pose mimics realistic eye pose for a fixating observer, thus including convergent eye geometry and cyclotorsion. The virtual environment we developed relies on highly accurate 3D virtual models, and its full controllability allows us to obtain the stereoscopic pairs together with the ground-truth depth and camera pose information. We thus created a stereoscopic dataset: GENUA PESTO—GENoa hUman Active fixation database: PEripersonal space STereoscopic images and grOund truth disparity. The dataset aims to provide a unified framework useful for a number of problems relevant to human and computer vision, from scene exploration and eye movement studies to 3D scene reconstruction. PMID:28350382

  10. Real-time individual predictions of prostate cancer recurrence using joint models

    PubMed Central

    Taylor, Jeremy M. G.; Park, Yongseok; Ankerst, Donna P.; Proust-Lima, Cecile; Williams, Scott; Kestin, Larry; Bae, Kyoungwha; Pickles, Tom; Sandler, Howard

    2012-01-01

    Summary Patients who were previously treated for prostate cancer with radiation therapy are monitored at regular intervals using a laboratory test called Prostate Specific Antigen (PSA). If the value of the PSA test starts to rise, this is an indication that the prostate cancer is more likely to recur, and the patient may wish to initiate new treatments. Such patients could be helped in making medical decisions by an accurate estimate of the probability of recurrence of the cancer in the next few years. In this paper, we describe the methodology for giving the probability of recurrence for a new patient, as implemented on a web-based calculator. The methods use a joint longitudinal survival model. The model is developed on a training dataset of 2,386 patients and tested on a dataset of 846 patients. Bayesian estimation methods are used with one Markov chain Monte Carlo (MCMC) algorithm developed for estimation of the parameters from the training dataset and a second quick MCMC developed for prediction of the risk of recurrence that uses the longitudinal PSA measures from a new patient. PMID:23379600

  11. Glycan array data management at Consortium for Functional Glycomics.

    PubMed

    Venkataraman, Maha; Sasisekharan, Ram; Raman, Rahul

    2015-01-01

    Glycomics, the study of structure-function relationships of complex glycans, has reshaped post-genomics biology. Glycans mediate fundamental biological functions via their specific interactions with a variety of proteins. Recognizing the importance of glycomics and the challenges of analyzing it, large-scale research initiatives such as the Consortium for Functional Glycomics (CFG) were established. Over the past decade, the CFG has generated novel reagents and technologies for glycomics analyses, which in turn have led to the generation of diverse datasets. These datasets have contributed to understanding glycan diversity and structure-function relationships at molecular (glycan-protein interactions), cellular (gene expression and glycan analysis), and whole organism (mouse phenotyping) levels. Among these analyses and datasets, screening of glycan-protein interactions on glycan array platforms has gained much prominence and has contributed to cross-disciplinary realization of the importance of glycomics in areas such as immunology, infectious diseases, cancer biomarkers, etc. This manuscript outlines methodologies for capturing data from glycan array experiments and online tools to access and visualize glycan array data implemented at the CFG.

  12. Handling limited datasets with neural networks in medical applications: A small-data approach.

    PubMed

    Shaikhina, Torgyn; Khovanova, Natalia A

    2017-01-01

    Single-centre studies in medical domain are often characterised by limited samples due to the complexity and high costs of patient data collection. Machine learning methods for regression modelling of small datasets (less than 10 observations per predictor variable) remain scarce. Our work bridges this gap by developing a novel framework for application of artificial neural networks (NNs) for regression tasks involving small medical datasets. In order to address the sporadic fluctuations and validation issues that appear in regression NNs trained on small datasets, the method of multiple runs and surrogate data analysis were proposed in this work. The approach was compared to the state-of-the-art ensemble NNs; the effect of dataset size on NN performance was also investigated. The proposed framework was applied for the prediction of compressive strength (CS) of femoral trabecular bone in patients suffering from severe osteoarthritis. The NN model was able to estimate the CS of osteoarthritic trabecular bone from its structural and biological properties with a standard error of 0.85MPa. When evaluated on independent test samples, the NN achieved accuracy of 98.3%, outperforming an ensemble NN model by 11%. We reproduce this result on CS data of another porous solid (concrete) and demonstrate that the proposed framework allows for an NN modelled with as few as 56 samples to generalise on 300 independent test samples with 86.5% accuracy, which is comparable to the performance of an NN developed with 18 times larger dataset (1030 samples). The significance of this work is two-fold: the practical application allows for non-destructive prediction of bone fracture risk, while the novel methodology extends beyond the task considered in this study and provides a general framework for application of regression NNs to medical problems characterised by limited dataset sizes. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
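
    The multiple-runs idea can be sketched as follows: the same small network is trained from many random initializations and the distribution of validation errors is examined instead of trusting a single fit. The surrogate-data analysis from the paper is not reproduced here, and the data below are synthetic rather than the bone or concrete datasets.

      """Multiple independent training runs of a small regression NN on a small dataset."""
      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.model_selection import train_test_split
      from sklearn.neural_network import MLPRegressor

      X, y = make_regression(n_samples=60, n_features=10, noise=10.0, random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

      errors = []
      for seed in range(25):                                  # multiple independent runs
          net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=seed)
          net.fit(X_tr, y_tr)
          errors.append(np.sqrt(np.mean((net.predict(X_te) - y_te) ** 2)))

      errors = np.array(errors)
      print(f"RMSE over 25 runs: median {np.median(errors):.1f}, "
            f"spread {errors.min():.1f}-{errors.max():.1f}")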

  13. VIS – A database on the distribution of fishes in inland and estuarine waters in Flanders, Belgium

    PubMed Central

    Brosens, Dimitri; Breine, Jan; Van Thuyne, Gerlinde; Belpaire, Claude; Desmet, Peter; Verreycken, Hugo

    2015-01-01

    Abstract The Research Institute for Nature and Forest (INBO) has been performing standardized fish stock assessments in Flanders, Belgium. This Flemish Fish Monitoring Network aims to assess fish populations in public waters at regular time intervals in both inland waters and estuaries. This monitoring was set up in support of the Water Framework Directive, the Habitat Directive, the Eel Regulation, the Red List of fishes, fish stock management, biodiversity research, and to assess the colonization and spreading of non-native fish species. The collected data are consolidated in the Fish Information System or VIS. From VIS, the occurrence data are now published at the INBO IPT as two datasets: ‘VIS - Fishes in inland waters in Flanders, Belgium’ and ‘VIS - Fishes in estuarine waters in Flanders, Belgium’. Together these datasets represent a complete overview of the distribution and abundance of fish species occurring in Flanders from late 1992 to the end of 2012. This data paper discusses both datasets together, as both have a similar methodology and structure. The inland waters dataset contains over 350,000 fish observations, sampled between 1992 and 2012 from over 2,000 locations in inland rivers, streams, canals, and enclosed waters in Flanders. The dataset includes 64 fish species, as well as a number of non-target species (mainly crustaceans). The estuarine waters dataset contains over 44,000 fish observations, sampled between 1995 and 2012 from almost 50 locations in the estuaries of the rivers Yser and Scheldt (“Zeeschelde”), including two sampling sites in the Netherlands. The dataset includes 69 fish species and a number of non-target crustacean species. To foster broad and collaborative use, the data are dedicated to the public domain under a Creative Commons Zero waiver and reference the INBO norms for data use. PMID:25685001

  14. Deep learning-based fine-grained car make/model classification for visual surveillance

    NASA Astrophysics Data System (ADS)

    Gundogdu, Erhan; Parıldı, Enes Sinan; Solmaz, Berkan; Yücesoy, Veysel; Koç, Aykut

    2017-10-01

    Fine-grained object recognition is a challenging computer vision problem that has recently been addressed using deep Convolutional Neural Networks (CNNs). Nevertheless, the main disadvantage of classification methods relying on deep CNN models is their need for a considerably large amount of data. In addition, relatively little annotated data exists for real-world applications such as the recognition of car models in a traffic surveillance system. To this end, we concentrate on fine-grained classification of car make and/or model with the help of two different data domains. First, a large-scale dataset including approximately 900K images is constructed from a website which includes fine-grained car models, and a state-of-the-art CNN model is trained on the constructed dataset using these labels. The second domain is the set of images collected from a camera integrated into a traffic surveillance system. These images, numbering over 260K, are gathered by a license plate detection method running on top of a motion detection algorithm. An appropriately sized patch is cropped from the region of interest provided by the detected license plate location. These sets of images and their provided labels for more than 30 classes are employed to fine-tune the CNN model already trained on the large-scale dataset described above. To fine-tune the network, the last two fully-connected layers are randomly initialized and the remaining layers are fine-tuned on the second dataset. In this work, the transfer of a model learned on a large dataset to a smaller one is successfully performed by utilizing both the limited annotated data from the traffic domain and a large-scale dataset with available annotations. Our experimental results on both the validation dataset and real field data show that the proposed methodology performs favorably against training the CNN model from scratch.
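
    A minimal PyTorch sketch of the described fine-tuning scheme follows: the last two fully-connected layers of a pre-trained classifier are randomly re-initialized for the surveillance classes and the network is then fine-tuned. The backbone choice, class count, and learning rates below are placeholders rather than the paper's actual configuration.

      """Fine-tuning sketch: re-initialize the last two FC layers, then train."""
      import torch
      import torch.nn as nn
      import torchvision.models as models

      num_surveillance_classes = 30              # ">30 classes" in the abstract; exact value assumed

      model = models.vgg16(weights=None)         # stand-in for the model pre-trained on ~900K images
      # Re-initialise the last two fully-connected layers of the classifier head
      model.classifier[3] = nn.Linear(4096, 4096)
      model.classifier[6] = nn.Linear(4096, num_surveillance_classes)

      # Fine-tune everything, giving the freshly initialised layers a larger learning rate
      new_params = list(model.classifier[3].parameters()) + list(model.classifier[6].parameters())
      new_ids = {id(p) for p in new_params}
      base_params = [p for p in model.parameters() if id(p) not in new_ids]
      optimizer = torch.optim.SGD([
          {"params": base_params, "lr": 1e-4},
          {"params": new_params, "lr": 1e-3},
      ], momentum=0.9)

      batch = torch.rand(4, 3, 224, 224)         # placeholder cropped plate-region images
      labels = torch.randint(0, num_surveillance_classes, (4,))
      loss = nn.CrossEntropyLoss()(model(batch), labels)
      loss.backward()
      optimizer.step()
      print("one fine-tuning step done, loss =", float(loss))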

  15. MSWEP V2 global 3-hourly 0.1° precipitation: methodology and quantitative appraisal

    NASA Astrophysics Data System (ADS)

    Beck, H.; Yang, L.; Pan, M.; Wood, E. F.; William, L.

    2017-12-01

    Here, we present Multi-Source Weighted-Ensemble Precipitation (MSWEP) V2, the first fully global gridded precipitation (P) dataset with a 0.1° spatial resolution. The dataset covers the period 1979-2016, has a 3-hourly temporal resolution, and was derived by optimally merging a wide range of data sources based on gauges (WorldClim, GHCN-D, GSOD, and others), satellites (CMORPH, GridSat, GSMaP, and TMPA 3B42RT), and reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR). MSWEP V2 implements some major improvements over V1, such as (i) the correction of distributional P biases using cumulative distribution function matching, (ii) increasing the spatial resolution from 0.25° to 0.1°, (iii) the inclusion of ocean areas, (iv) the addition of NCEP-CFSR P estimates, (v) the addition of thermal infrared-based P estimates for the pre-TRMM era, (vi) the addition of 0.1° daily interpolated gauge data, (vii) the use of a daily gauge correction scheme that accounts for regional differences in the 24-hour accumulation period of gauges, and (viii) extension of the data record to 2016. The gauge-based assessment of the reanalysis and satellite P datasets, necessary for establishing the merging weights, revealed that the reanalysis datasets strongly overestimate the P frequency for the entire globe, and that the satellite (resp. reanalysis) datasets consistently performed better at low (high) latitudes. Compared to other state-of-the-art P datasets, MSWEP V2 exhibits more plausible global patterns in mean annual P, percentiles, and annual number of dry days, and better resolves the small-scale variability over topographically complex terrain. Other P datasets appear to consistently underestimate P amounts over mountainous regions. Long-term mean P estimates for the global, land, and ocean domains based on MSWEP V2 are 959, 796, and 1026 mm/yr, respectively, in close agreement with the best previous published estimates.
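
    The distributional bias correction mentioned in improvement (i) is a cumulative distribution function (quantile) matching; the sketch below applies it to synthetic daily precipitation, mapping each value of a biased series to the value at the same empirical quantile of a reference series.

      """CDF (quantile) matching sketch for precipitation bias correction, on synthetic data."""
      import numpy as np

      rng = np.random.default_rng(7)
      gauge = rng.gamma(shape=0.8, scale=6.0, size=5000)        # reference daily P (mm)
      biased = rng.gamma(shape=0.6, scale=9.0, size=5000)       # estimate with a distributional bias

      def cdf_match(values, biased_sample, reference_sample):
          """Map values through the empirical CDF of the biased sample onto the reference."""
          quantiles = np.searchsorted(np.sort(biased_sample), values) / len(biased_sample)
          quantiles = np.clip(quantiles, 0.0, 1.0)
          return np.quantile(reference_sample, quantiles)

      corrected = cdf_match(biased, biased, gauge)
      print(f"mean P - gauge: {gauge.mean():.2f}, biased: {biased.mean():.2f}, "
            f"corrected: {corrected.mean():.2f}")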

  16. The Role of Work Experiences in College Student Leadership Development: Evidence from a National Dataset and a Text Mining Approach to Examining Beliefs about Leadership

    ERIC Educational Resources Information Center

    Lewis, Jonathan S.

    2017-01-01

    Paid employment is one of the most common extracurricular activities among full-time undergraduates, and an array of studies has attempted to measure its impact. Methodological concerns with the extant literature, however, make it difficult to draw reliable conclusions. Furthermore, the research on working college students has little to say about…

  17. The coordinate-based meta-analysis of neuroimaging data.

    PubMed

    Samartsidis, Pantelis; Montagna, Silvia; Nichols, Thomas E; Johnson, Timothy D

    2017-01-01

    Neuroimaging meta-analysis is an area of growing interest in statistics. The special characteristics of neuroimaging data render classical meta-analysis methods inapplicable and therefore new methods have been developed. We review existing methodologies, explaining the benefits and drawbacks of each. A demonstration on a real dataset of emotion studies is included. We discuss some still-open problems in the field to highlight the need for future research.

  18. The coordinate-based meta-analysis of neuroimaging data

    PubMed Central

    Samartsidis, Pantelis; Montagna, Silvia; Nichols, Thomas E.; Johnson, Timothy D.

    2017-01-01

    Neuroimaging meta-analysis is an area of growing interest in statistics. The special characteristics of neuroimaging data render classical meta-analysis methods inapplicable and therefore new methods have been developed. We review existing methodologies, explaining the benefits and drawbacks of each. A demonstration on a real dataset of emotion studies is included. We discuss some still-open problems in the field to highlight the need for future research. PMID:29545671

  19. Proposal for Development of EBM-CDSS (Evidence-Based Clinical Decision Support System) to Aid Prognostication in Terminally Ill Patients

    DTIC Science & Technology

    2011-10-01

    inconsistency in the representation of the dataset. RST provides a mathematical tool for representing and reasoning about vagueness and inconsistency. Its...use of various mathematical, statistical and soft computing methodologies with the objective of identifying meaningful relationships between condition...Evidence-based Medicine and Health Outcomes Research, University of South Florida, Tampa, FL 2Department of Mathematics, Indiana University Northwest, Gary

  20. Distance Metric Tracking

    DTIC Science & Technology

    2016-03-02

    some closeness constant and dissimilar pairs be more distant than some larger constant. Online and non-linear extensions to the ITML methodology are...is obtained, instead of solving an objective function formed from the entire dataset. Many online learning methods have regret guarantees, that is... function Metric learning seeks to learn a metric that encourages data points marked as similar to be close and data points marked as different to be far

  1. Genomic Datasets for Cancer Research

    Cancer.gov

    A variety of datasets from genome-wide association studies of cancer and other genotype-phenotype studies, including sequencing and molecular diagnostic assays, are available to approved investigators through the Extramural National Cancer Institute Data Access Committee.

  2. Measures and Indicators of Vgi Quality: AN Overview

    NASA Astrophysics Data System (ADS)

    Antoniou, V.; Skopeliti, A.

    2015-08-01

    The evaluation of VGI quality has been a very interesting and popular issue amongst academics and researchers. Various metrics and indicators have been proposed for evaluating VGI quality elements. Various efforts have focused on the use of well-established methodologies for the evaluation of VGI quality elements against authoritative data. In this paper, a number of research papers have been reviewed and summarized in a detailed report on measures for each spatial data quality element. Emphasis is given to the methodology followed and the data used in order to assess and evaluate the quality of the VGI datasets. However, as the use of authoritative data is not always possible, many researchers have turned their focus to the analysis of new quality indicators that can function as proxies for the understanding of VGI quality. In this paper, the difficulties in using authoritative datasets are briefly presented and new proposed quality indicators are discussed, as recorded through the literature review. We classify these new indicators into four main categories that relate to: i) data, ii) demographics, iii) socio-economic situation and iv) contributors. This paper presents a dense, yet comprehensive overview of the research in this field and provides the basis for the ongoing academic effort to create a practical quality evaluation method through the use of appropriate quality indicators.

  3. Performance testing of LiDAR exploitation software

    NASA Astrophysics Data System (ADS)

    Varela-González, M.; González-Jorge, H.; Riveiro, B.; Arias, P.

    2013-04-01

    Mobile LiDAR systems are being used widely in recent years for many applications in the field of geoscience. One of the most important limitations of this technology is the large computational requirement involved in data processing. Several software solutions for data processing are available in the market, but users are often unaware of methodologies to verify their performance accurately. In this work a methodology for LiDAR software performance testing is presented and six different suites are studied: QT Modeler, AutoCAD Civil 3D, Mars 7, Fledermaus, Carlson and TopoDOT (all of them in x64). Results show that QT Modeler, TopoDOT and AutoCAD Civil 3D allow the loading of large datasets, while Fledermaus, Mars 7 and Carlson do not achieve this level of performance. AutoCAD Civil 3D requires long loading times in comparison with the most capable packages, such as QT Modeler and TopoDOT. The Carlson suite shows the poorest results among all the software under study: point clouds larger than 5 million points cannot be loaded, and loading times are very long in comparison with the other suites even for the smaller datasets. AutoCAD Civil 3D, Carlson and TopoDOT use more threads than other packages such as QT Modeler, Mars 7 and Fledermaus.

  4. Analyzing simulation-based PRA data through traditional and topological clustering: A BWR station blackout case study

    DOE PAGES

    Maljovec, D.; Liu, S.; Wang, B.; ...

    2015-07-14

    Here, dynamic probabilistic risk assessment (DPRA) methodologies couple system simulator codes (e.g., RELAP and MELCOR) with simulation controller codes (e.g., RAVEN and ADAPT). Whereas system simulator codes model system dynamics deterministically, simulation controller codes introduce both deterministic (e.g., system control logic and operating procedures) and stochastic (e.g., component failures and parameter uncertainties) elements into the simulation. Typically, a DPRA is performed by sampling values of a set of parameters and simulating the system behavior for that specific set of parameter values. For complex systems, a major challenge in using DPRA methodologies is to analyze the large number of scenarios generated, where clustering techniques are typically employed to better organize and interpret the data. In this paper, we focus on the analysis of two nuclear simulation datasets that are part of the risk-informed safety margin characterization (RISMC) boiling water reactor (BWR) station blackout (SBO) case study. We provide the domain experts a software tool that encodes traditional and topological clustering techniques within an interactive analysis and visualization environment, for understanding the structures of such high-dimensional nuclear simulation datasets. We demonstrate through our case study that both types of clustering techniques complement each other for enhanced structural understanding of the data.

  5. Report of the Association of Coloproctology of Great Britain and Ireland/British Society of Gastroenterology Colorectal Polyp Working Group: the development of a complex colorectal polyp minimum dataset.

    PubMed

    Chattree, A; Barbour, J A; Thomas-Gibson, S; Bhandari, P; Saunders, B P; Veitch, A M; Anderson, J; Rembacken, B J; Loughrey, M B; Pullan, R; Garrett, W V; Lewis, G; Dolwani, S; Rutter, M D

    2017-01-01

    The management of large non-pedunculated colorectal polyps (LNPCPs) is complex, with widespread variation in management and outcome, even amongst experienced clinicians. Variations in the assessment and decision-making processes are likely to be a major factor in this variability. The creation of a standardized minimum dataset to aid decision-making may therefore result in improved clinical management. An official working group of 13 multidisciplinary specialists was appointed by the Association of Coloproctology of Great Britain and Ireland (ACPGBI) and the British Society of Gastroenterology (BSG) to develop a minimum dataset on LNPCPs. The literature review used to structure the ACPGBI/BSG guidelines for the management of LNPCPs was used by a steering subcommittee to identify various parameters pertaining to the decision-making processes in the assessment and management of LNPCPs. A modified Delphi consensus process was then used for voting on proposed parameters over multiple voting rounds with at least 80% agreement defined as consensus. The minimum dataset was used in a pilot process to ensure rigidity and usability. A 23-parameter minimum dataset with parameters relating to patient and lesion factors, including six parameters relating to image retrieval, was formulated over four rounds of voting with two pilot processes to test rigidity and usability. This paper describes the development of the first reported evidence-based and expert consensus minimum dataset for the management of LNPCPs. It is anticipated that this dataset will allow comprehensive and standardized lesion assessment to improve decision-making in the assessment and management of LNPCPs. Colorectal Disease © 2016 The Association of Coloproctology of Great Britain and Ireland.

  6. Individual classification of ADHD patients by integrating multiscale neuroimaging markers and advanced pattern recognition techniques

    PubMed Central

    Cheng, Wei; Ji, Xiaoxi; Zhang, Jie; Feng, Jianfeng

    2012-01-01

    Accurate classification or prediction of the brain state across individual subjects, i.e., healthy, or with brain disorders, is generally a more difficult task than merely finding group differences. The former must be approached with highly informative and sensitive biomarkers as well as effective pattern classification/feature selection approaches. In this paper, we propose a systematic methodology to discriminate attention deficit hyperactivity disorder (ADHD) patients from healthy controls on the individual level. Multiple neuroimaging markers that have proved to be sensitive features are identified, which include multiscale characteristics extracted from blood oxygenation level dependent (BOLD) signals, such as regional homogeneity (ReHo) and amplitude of low-frequency fluctuations. Functional connectivity derived from Pearson, partial, and spatial correlation is also utilized to reflect the abnormal patterns of functional integration, or, dysconnectivity syndromes in the brain. These neuroimaging markers are calculated at either the voxel or regional level. An advanced feature selection approach is then designed, including a brain-wise association study (BWAS). Using identified features and proper feature integration, a support vector machine (SVM) classifier can achieve a cross-validated classification accuracy of 76.15% across individuals from a large dataset consisting of 141 healthy controls and 98 ADHD patients, with the sensitivity being 63.27% and the specificity being 85.11%. Our results show that the most discriminative features for classification are primarily associated with the frontal and cerebellar regions. The proposed methodology is expected to improve clinical diagnosis and evaluation of treatment for ADHD patients, and to have wider applications in the diagnosis of general neuropsychiatric disorders. PMID:22888314
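    A minimal sketch of the final classification stage (univariate feature screening followed by a linear support vector machine evaluated with cross-validation) is given below. The data are synthetic and the F-test screening is only a stand-in for the brain-wise association study; it is not the authors' pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(1)
n_controls, n_patients, n_features = 141, 98, 500   # sample sizes from the abstract; features synthetic
X = rng.normal(size=(n_controls + n_patients, n_features))
y = np.r_[np.zeros(n_controls, dtype=int), np.ones(n_patients, dtype=int)]
X[y == 1, :20] += 0.4                                # inject weak group differences in a few features

# Univariate screening (stand-in for the BWAS) inside the cross-validation,
# followed by a linear SVM.
clf = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=50), SVC(kernel="linear"))
pred = cross_val_predict(clf, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))

print("accuracy   :", round(accuracy_score(y, pred), 3))
print("sensitivity:", round(recall_score(y, pred, pos_label=1), 3))
print("specificity:", round(recall_score(y, pred, pos_label=0), 3))
```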

  7. Incorporating ToxCast and Tox21 Datasets to Rank Biological Activity of Chemicals at Superfund Sites in North Carolina

    PubMed Central

    Tilley, Sloane K.; Reif, David M.; Fry, Rebecca C.

    2017-01-01

    Background The Superfund program of the Environmental Protection Agency (EPA) was established in 1980 to address public health concerns posed by toxic substances released into the environment in the United States. Forty-two of the 1328 hazardous waste sites that remain on the Superfund National Priority List are located in the state of North Carolina. Methods We set out to develop a database that contained information on both the prevalence and biological activity of chemicals present at Superfund sites in North Carolina. A chemical characterization tool, the Toxicological Priority Index (ToxPi), was used to rank the biological activity of these chemicals based on their predicted bioavailability, documented associations with biological pathways, and activity in in vitro assays of the ToxCast and Tox21 programs. Results The ten most prevalent chemicals found at North Carolina Superfund sites were chromium, trichloroethene, lead, tetrachloroethene, arsenic, benzene, manganese, 1,2-dichloroethane, nickel, and barium. For all chemicals found at North Carolina Superfund sites, ToxPi analysis was used to rank their biological activity. Through this data integration, residual pesticides and organic solvents were identified to be some of the most highly-ranking predicted bioactive chemicals. This study provides a novel methodology for creating state or regional databases of Superfund sites. Conclusions These data represent a novel integrated profile of the most prevalent chemicals at North Carolina Superfund sites. This information, and the associated methodology, is useful to toxicologists, risk assessors, and the communities living in close proximity to these sites. PMID:28153528

  8. Incorporating ToxCast and Tox21 datasets to rank biological activity of chemicals at Superfund sites in North Carolina.

    PubMed

    Tilley, Sloane K; Reif, David M; Fry, Rebecca C

    2017-04-01

    The Superfund program of the Environmental Protection Agency (EPA) was established in 1980 to address public health concerns posed by toxic substances released into the environment in the United States. Forty-two of the 1328 hazardous waste sites that remain on the Superfund National Priority List are located in the state of North Carolina. We set out to develop a database that contained information on both the prevalence and biological activity of chemicals present at Superfund sites in North Carolina. A chemical characterization tool, the Toxicological Priority Index (ToxPi), was used to rank the biological activity of these chemicals based on their predicted bioavailability, documented associations with biological pathways, and activity in in vitro assays of the ToxCast and Tox21 programs. The ten most prevalent chemicals found at North Carolina Superfund sites were chromium, trichloroethene, lead, tetrachloroethene, arsenic, benzene, manganese, 1,2-dichloroethane, nickel, and barium. For all chemicals found at North Carolina Superfund sites, ToxPi analysis was used to rank their biological activity. Through this data integration, residual pesticides and organic solvents were identified to be some of the most highly-ranking predicted bioactive chemicals. This study provides a novel methodology for creating state or regional databases of biological activity of contaminants at Superfund sites. These data represent a novel integrated profile of the most prevalent chemicals at North Carolina Superfund sites. This information, and the associated methodology, is useful to toxicologists, risk assessors, and the communities living in close proximity to these sites. Copyright © 2016. Published by Elsevier Ltd.

  9. A web-based system for neural network based classification in temporomandibular joint osteoarthritis.

    PubMed

    de Dumast, Priscille; Mirabel, Clément; Cevidanes, Lucia; Ruellas, Antonio; Yatabe, Marilia; Ioshida, Marcos; Ribera, Nina Tubau; Michoud, Loic; Gomes, Liliane; Huang, Chao; Zhu, Hongtu; Muniz, Luciana; Shoukri, Brandon; Paniagua, Beatriz; Styner, Martin; Pieper, Steve; Budin, Francois; Vimort, Jean-Baptiste; Pascal, Laura; Prieto, Juan Carlos

    2018-07-01

    The purpose of this study is to describe the methodological innovations of a web-based system for storage, integration and computation of biomedical data, using a training imaging dataset to remotely compute a deep neural network classifier of temporomandibular joint osteoarthritis (TMJOA). This study imaging dataset consisted of three-dimensional (3D) surface meshes of mandibular condyles constructed from cone beam computed tomography (CBCT) scans. The training dataset consisted of 259 condyles, 105 from control subjects and 154 from patients with diagnosis of TMJ OA. For the image analysis classification, 34 right and left condyles from 17 patients (39.9 ± 11.7 years), who experienced signs and symptoms of the disease for less than 5 years, were included as the testing dataset. For the integrative statistical model of clinical, biological and imaging markers, the sample consisted of the same 17 test OA subjects and 17 age and sex matched control subjects (39.4 ± 15.4 years), who did not show any sign or symptom of OA. For these 34 subjects, a standardized clinical questionnaire, blood and saliva samples were also collected. The technological methodologies in this study include a deep neural network classifier of 3D condylar morphology (ShapeVariationAnalyzer, SVA), and a flexible web-based system for data storage, computation and integration (DSCI) of high dimensional imaging, clinical, and biological data. The DSCI system trained and tested the neural network, indicating 5 stages of structural degenerative changes in condylar morphology in the TMJ with 91% close agreement between the clinician consensus and the SVA classifier. The DSCI remotely ran with a novel application of a statistical analysis, the Multivariate Functional Shape Data Analysis, that computed high dimensional correlations between shape 3D coordinates, clinical pain levels and levels of biological markers, and then graphically displayed the computation results. The findings of this study demonstrate a comprehensive phenotypic characterization of TMJ health and disease at clinical, imaging and biological levels, using novel flexible and versatile open-source tools for a web-based system that provides advanced shape statistical analysis and a neural network based classification of temporomandibular joint osteoarthritis. Published by Elsevier Ltd.

  10. SU-G-JeP3-01: A Method to Quantify Lung SBRT Target Localization Accuracy Based On Digitally Reconstructed Fluoroscopy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lafata, K; Ren, L; Cai, J

    2016-06-15

    Purpose: To develop a methodology based on digitally-reconstructed-fluoroscopy (DRF) to quantitatively assess target localization accuracy of lung SBRT, and to evaluate it using both a dynamic digital phantom and a patient dataset. Methods: For each treatment field, a 10-phase DRF is generated based on the planning 4DCT. Each frame is pre-processed with a morphological top-hat filter, and corresponding beam apertures are projected to each detector plane. A template-matching algorithm based on cross-correlation is used to detect the tumor location in each frame. Tumor motion relative to the beam aperture is extracted in the superior-inferior direction based on each frame's impulse response to the template, and the mean tumor position (MTP) is calculated as the average tumor displacement. The DRF template coordinates are then transferred to the corresponding MV-cine dataset, which is retrospectively filtered as above. The treatment MTP is calculated within each field's projection space, relative to the DRF-defined template. The field's localization error is defined as the difference between the DRF-derived-MTP (planning) and the MV-cine-derived-MTP (delivery). A dynamic digital phantom was used to assess the algorithm's ability to detect intra-fractional changes in patient alignment, by simulating different spatial variations in the MV-cine and calculating the corresponding change in MTP. Inter- and intra-fractional variation, IGRT accuracy, and filtering effects were investigated on a patient dataset. Results: Phantom results demonstrated a high accuracy in detecting both translational and rotational variation. The lowest localization error of the patient dataset was achieved at each fraction's first field (mean=0.38mm), with Fx3 demonstrating a particularly strong correlation between intra-fractional motion-caused localization error and treatment progress. Filtering significantly improved tracking visibility in both the DRF and MV-cine images. Conclusion: We have developed and evaluated a methodology to quantify lung SBRT target localization accuracy based on digitally-reconstructed-fluoroscopy. Our approach may be useful in potentially reducing treatment margins to optimize lung SBRT outcomes. R01-184173.
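    The template-matching step (cross-correlating a planning-derived template against each frame and averaging the detected positions into an MTP) can be illustrated on synthetic frames. The Gaussian "tumor" blobs, frame sizes, and motion below are toy assumptions, not the authors' data or code.

```python
import numpy as np
from scipy.signal import correlate2d

def make_frame(center, size=64, sigma=3.0):
    """Synthetic fluoroscopy-like frame with a Gaussian 'tumor' blob at center=(x, y)."""
    y, x = np.mgrid[:size, :size]
    return np.exp(-(((x - center[0]) ** 2 + (y - center[1]) ** 2) / (2 * sigma ** 2)))

template = make_frame((32, 32))[24:40, 24:40]   # 16x16 tumor template from 'planning' data

def locate(frame, template):
    """Return the (x, y) of the best zero-mean cross-correlation match."""
    f = frame - frame.mean()
    t = template - template.mean()
    score = correlate2d(f, t, mode="same")
    iy, ix = np.unravel_index(np.argmax(score), score.shape)
    return ix, iy

# Simulate superior-inferior motion over ten frames and track the blob.
positions = [locate(make_frame((32, 28 + k)), template)[1] for k in range(10)]
mean_tumor_position = np.mean(positions)        # analogous to the MTP described above
print("tracked SI positions:", positions, "MTP:", mean_tumor_position)
```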

  11. Integrating genome-wide association study summaries and element-gene interaction datasets identified multiple associations between elements and complex diseases.

    PubMed

    He, Awen; Wang, Wenyu; Prakash, N Tejo; Tinkov, Alexey A; Skalny, Anatoly V; Wen, Yan; Hao, Jingcan; Guo, Xiong; Zhang, Feng

    2018-03-01

    Chemical elements are closely related to human health. Extensive genomic profile data of complex diseases offer us a good opportunity to systematically investigate the relationships between elements and complex diseases/traits. In this study, we applied the gene set enrichment analysis (GSEA) approach to detect the associations between elements and complex diseases/traits through integrating element-gene interaction datasets and genome-wide association study (GWAS) data of complex diseases/traits. To illustrate the performance of GSEA, the element-gene interaction datasets of 24 elements were extracted from the comparative toxicogenomics database (CTD). GWAS summary datasets of 24 complex diseases or traits were downloaded from the dbGaP or GEFOS websites. We observed significant associations between 7 elements and 13 complex diseases or traits (all false discovery rate (FDR) < 0.05), including reported relationships such as aluminum vs. Alzheimer's disease (FDR = 0.042), calcium vs. bone mineral density (FDR = 0.031), magnesium vs. systemic lupus erythematosus (FDR = 0.012) as well as novel associations, such as nickel vs. hypertriglyceridemia (FDR = 0.002) and bipolar disorder (FDR = 0.027). Our study results are consistent with previous biological studies, supporting the good performance of GSEA. Our analysis results based on the GSEA framework provide novel clues for discovering causal relationships between elements and complex diseases. © 2017 WILEY PERIODICALS, INC.
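    The underlying question of whether an element's interacting genes are over-represented among the genes implicated by a GWAS can be illustrated with a simple over-representation test. The counts below are hypothetical, and the hypergeometric test is a simplification of the rank-based GSEA statistic actually used in the study.

```python
from scipy.stats import hypergeom

# Hypothetical counts for one element-disease pair:
genome_genes  = 20000   # background gene universe
element_genes = 350     # genes interacting with the element (e.g., from CTD)
gwas_genes    = 500     # genes implicated by the disease GWAS
overlap       = 22      # genes appearing in both sets

# P(observing >= overlap shared genes by chance)
p_value = hypergeom.sf(overlap - 1, genome_genes, element_genes, gwas_genes)
print(f"enrichment p-value: {p_value:.3g}")
```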

  12. Speeding up the Consensus Clustering methodology for microarray data analysis

    PubMed Central

    2011-01-01

    Background The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus. That is, we provide the closely related algorithm FC (Fast Consensus), which has the same precision as Consensus with substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus. Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations. Conclusions In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures. PMID:21235792
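    The consensus matrix at the heart of this family of methods can be sketched as follows: cluster repeated subsamples of the data and record how often each pair of items lands in the same cluster. This toy version uses k-means on synthetic data and is not the Consensus/FC implementation evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
n, k, n_resamples, frac = X.shape[0], 3, 50, 0.8

counts = np.zeros((n, n))   # times a pair was clustered together
trials = np.zeros((n, n))   # times a pair was sampled together
rng = np.random.default_rng(0)

for _ in range(n_resamples):
    idx = rng.choice(n, size=int(frac * n), replace=False)
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X[idx])
    same = (labels[:, None] == labels[None, :]).astype(float)
    trials[np.ix_(idx, idx)] += 1
    counts[np.ix_(idx, idx)] += same

consensus = np.divide(counts, trials, out=np.zeros_like(counts), where=trials > 0)
# 1 - consensus can then be handed to a clustering algorithm as a dissimilarity matrix.
print("mean pairwise consensus:", round(consensus[trials > 0].mean(), 3))
```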

  13. A model to determine payments associated with radiology procedures.

    PubMed

    Mabotuwana, Thusitha; Hall, Christopher S; Thomas, Shiby; Wald, Christoph

    2017-12-01

    Across the United States, there is a growing number of patients in Accountable Care Organizations and under risk contracts with commercial insurance. This is due to proliferation of new value-based payment models and care delivery reform efforts. In this context, the business model of radiology within a hospital or health system context is shifting from a primary profit-center to a cost-center with a goal of cost savings. Radiology departments need to increasingly understand how the transactional nature of the business relates to financial rewards. The main challenge with current reporting systems is that the information is presented only at an aggregated level, and often not broken down further, for instance, by type of exam. As such, the primary objective of this research is to provide better visibility into payments associated with individual radiology procedures in order to better calibrate expense/capital structure of the imaging enterprise to the actual revenue or value-add to the organization it belongs to. We propose a methodology that can be used to determine technical payments at a procedure level. We use a proportion based model to allocate payments to individual radiology procedures based on total charges (which also includes non-radiology related charges). Using a production dataset containing 424,250 radiology exams we calculated the overall average technical charge for Radiology to be $873.08 per procedure and the corresponding average payment to be $326.43 (range: $48.27 for XR and $2750.11 for PET/CT) resulting in an average payment percentage of 37.39% across all exams. We describe how charges associated with a procedure can be used to approximate technical payments at a more granular level with a focus on Radiology. The methodology is generalizable to approximate payment for other services as well. Understanding payments associated with each procedure can be useful during strategic practice planning. Charge-to-total charge ratio can be used to approximate radiology payments at a procedure level. Copyright © 2017 Elsevier B.V. All rights reserved.
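    The proportion-based allocation described above amounts to splitting an encounter's total payment across line items according to their share of total charges. The figures in the sketch below are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical encounter: the total payment is allocated to each line item in
# proportion to its share of total charges (radiology and non-radiology alike).
line_items = {
    "CT chest (radiology)": 1200.0,
    "XR wrist (radiology)":  150.0,
    "ED facility fee":      2650.0,
}
total_charges = sum(line_items.values())
total_payment = 1500.0   # what the payer actually reimbursed for the encounter

allocated = {name: total_payment * charge / total_charges
             for name, charge in line_items.items()}
for name, pay in allocated.items():
    print(f"{name:25s} charge ${line_items[name]:7.2f} -> allocated payment ${pay:7.2f}")
```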

  14. A dataset on tail risk of commodities markets.

    PubMed

    Powell, Robert J; Vo, Duc H; Pham, Thach N; Singh, Abhay K

    2017-12-01

    This article contains the datasets related to the research article "The long and short of commodity tails and their relationship to Asian equity markets"(Powell et al., 2017) [1]. The datasets contain the daily prices (and price movements) of 24 different commodities decomposed from the S&P GSCI index and the daily prices (and price movements) of three share market indices including World, Asia, and South East Asia for the period 2004-2015. Then, the dataset is divided into annual periods, showing the worst 5% of price movements for each year. The datasets are convenient to examine the tail risk of different commodities as measured by Conditional Value at Risk (CVaR) as well as their changes over periods. The datasets can also be used to investigate the association between commodity markets and share markets.
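    Conditional Value at Risk at the 95% level is simply the mean of the worst 5% of daily price movements, which is straightforward to compute from a return series. The sketch below uses a synthetic series rather than the commodity data described in the article.

```python
import numpy as np

rng = np.random.default_rng(42)
daily_returns = rng.normal(0.0003, 0.015, size=252 * 12)   # synthetic 12-year daily return series

alpha = 0.05
var_95 = np.quantile(daily_returns, alpha)                 # Value at Risk (5th percentile)
cvar_95 = daily_returns[daily_returns <= var_95].mean()    # mean of the worst 5% of movements

print(f"VaR(95%):  {var_95:.4f}")
print(f"CVaR(95%): {cvar_95:.4f}")
```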

  15. Mean composite fire severity metrics computed with Google Earth Engine offer improved accuracy and expanded mapping potential

    USGS Publications Warehouse

    Parks, Sean; Holsinger, Lisa M.; Voss, Morgan; Loehman, Rachel A.; Robinson, Nathaniel P.

    2018-01-01

    Landsat-based fire severity datasets are an invaluable resource for monitoring and research purposes. These gridded fire severity datasets are generally produced with pre- and post-fire imagery to estimate the degree of fire-induced ecological change. Here, we introduce methods to produce three Landsat-based fire severity metrics using the Google Earth Engine (GEE) platform: the delta normalized burn ratio (dNBR), the relativized delta normalized burn ratio (RdNBR), and the relativized burn ratio (RBR). Our methods do not rely on time-consuming a priori scene selection and instead use a mean compositing approach in which all valid pixels (e.g. cloud-free) over a pre-specified date range (pre- and post-fire) are stacked and the mean value for each pixel over each stack is used to produce the resulting fire severity datasets. This approach demonstrates that fire severity datasets can be produced with relative ease and speed compared to the standard approach in which one pre-fire and one post-fire scene are judiciously identified and used to produce fire severity datasets. We also validate the GEE-derived fire severity metrics using field-based fire severity plots for 18 fires in the western US. These validations are compared to Landsat-based fire severity datasets produced using only one pre- and post-fire scene, which has been the standard approach in producing such datasets since their inception. Results indicate that the GEE-derived fire severity datasets show improved validation statistics compared to parallel versions in which only one pre-fire and post-fire scene are used. We provide code and a sample geospatial fire history layer to produce dNBR, RdNBR, and RBR for the 18 fires we evaluated. Although our approach requires that a geospatial fire history layer (i.e. fire perimeters) be produced independently and prior to applying our methods, we suggest our GEE methodology can reasonably be implemented on hundreds to thousands of fires, thereby increasing opportunities for fire severity monitoring and research across the globe.
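    For reference, the three severity metrics are simple per-pixel band-ratio differences computed from the pre- and post-fire composites. The sketch below uses toy reflectance values and the commonly published formulas; scaling conventions (e.g., a factor of 1000 and small offsets) vary between implementations, so treat it as an approximation of the metrics named above rather than the exact GEE code.

```python
import numpy as np

def nbr(nir, swir2):
    """Normalized Burn Ratio from near-infrared and shortwave-infrared-2 reflectance."""
    return (nir - swir2) / (nir + swir2)

# Toy per-pixel mean composites (reflectance, 0-1); real inputs would be
# cloud-free Landsat pixel stacks averaged over the pre- and post-fire windows.
pre_nir,  pre_swir2  = np.array([0.45, 0.40]), np.array([0.15, 0.18])
post_nir, post_swir2 = np.array([0.25, 0.38]), np.array([0.30, 0.20])

nbr_pre, nbr_post = nbr(pre_nir, pre_swir2), nbr(post_nir, post_swir2)

dnbr  = nbr_pre - nbr_post                  # delta NBR
rdnbr = dnbr / np.sqrt(np.abs(nbr_pre))     # relativized delta NBR
rbr   = dnbr / (nbr_pre + 1.001)            # relativized burn ratio

print("dNBR:", dnbr.round(3), "RdNBR:", rdnbr.round(3), "RBR:", rbr.round(3))
```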

  16. cGRNB: a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets.

    PubMed

    Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue

    2013-01-01

    We are witnessing rapid progress in the development of methodologies for building the combinatorial gene regulatory networks involving both TFs (Transcription Factors) and miRNAs (microRNAs). There are a few tools available to do these jobs but most of them are not easy to use and not accessible online. A web server is especially needed in order to allow users to upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R codes of our two separate forward-and-reverse engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released the cGRNB (combinatorial Gene Regulatory Networks Builder): a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. The cGRNB enables two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. A miRNA-centered two-layer combinatorial regulatory cascade is the output of the first module and a comprehensive genome-wide network involving all three types of combinatorial regulations (TF-gene, TF-miRNA, and miRNA-gene) is the output of the second module. In this article we propose cGRNB, a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. Since parallel miRNA/mRNA expression datasets are rapidly accumulated by the advance of next-generation sequencing techniques, cGRNB will be a very useful tool for researchers to build combinatorial gene regulatory networks based on expression datasets. The cGRNB web-server is free and available online at http://www.scbit.org/cgrnb.

  17. Similarity of markers identified from cancer gene expression studies: observations from GEO.

    PubMed

    Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge

    2014-09-01

    Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  18. Development of an Uncertainty Quantification Predictive Chemical Reaction Model for Syngas Combustion

    DOE PAGES

    Slavinskaya, N. A.; Abbasi, M.; Starcke, J. H.; ...

    2017-01-24

    An automated data-centric infrastructure, Process Informatics Model (PrIMe), was applied to validation and optimization of a syngas combustion model. The Bound-to-Bound Data Collaboration (B2BDC) module of PrIMe was employed to discover the limits of parameter modifications based on uncertainty quantification (UQ) and consistency analysis of the model–data system and experimental data, including shock-tube ignition delay times and laminar flame speeds. Existing syngas reaction models are reviewed, and the selected kinetic data are described in detail. Empirical rules were developed and applied to evaluate the uncertainty bounds of the literature experimental data. Here, the initial H2/CO reaction model, assembled from 73 reactions and 17 species, was subjected to a B2BDC analysis. For this purpose, a dataset was constructed that included a total of 167 experimental targets and 55 active model parameters. Consistency analysis of the composed dataset revealed disagreement between models and data. Further analysis suggested that removing 45 experimental targets, 8 of which were self-inconsistent, would lead to a consistent dataset. This dataset was subjected to a correlation analysis, which highlights possible directions for parameter modification and model improvement. Additionally, several methods of parameter optimization were applied, some of them unique to the B2BDC framework. The optimized models demonstrated improved agreement with experiments compared to the initially assembled model, and their predictions for experiments not included in the initial dataset (i.e., a blind prediction) were investigated. The results demonstrate benefits of applying the B2BDC methodology for developing predictive kinetic models.

  19. Development of an Uncertainty Quantification Predictive Chemical Reaction Model for Syngas Combustion

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Slavinskaya, N. A.; Abbasi, M.; Starcke, J. H.

    An automated data-centric infrastructure, Process Informatics Model (PrIMe), was applied to validation and optimization of a syngas combustion model. The Bound-to-Bound Data Collaboration (B2BDC) module of PrIMe was employed to discover the limits of parameter modifications based on uncertainty quantification (UQ) and consistency analysis of the model–data system and experimental data, including shock-tube ignition delay times and laminar flame speeds. Existing syngas reaction models are reviewed, and the selected kinetic data are described in detail. Empirical rules were developed and applied to evaluate the uncertainty bounds of the literature experimental data. Here, the initial H2/CO reaction model, assembled from 73 reactions and 17 species, was subjected to a B2BDC analysis. For this purpose, a dataset was constructed that included a total of 167 experimental targets and 55 active model parameters. Consistency analysis of the composed dataset revealed disagreement between models and data. Further analysis suggested that removing 45 experimental targets, 8 of which were self-inconsistent, would lead to a consistent dataset. This dataset was subjected to a correlation analysis, which highlights possible directions for parameter modification and model improvement. Additionally, several methods of parameter optimization were applied, some of them unique to the B2BDC framework. The optimized models demonstrated improved agreement with experiments compared to the initially assembled model, and their predictions for experiments not included in the initial dataset (i.e., a blind prediction) were investigated. The results demonstrate benefits of applying the B2BDC methodology for developing predictive kinetic models.

  20. Evaluating the evidence for non-monotonic dose-response relationships: A systematic literature review and (re-)analysis of in vivo toxicity data in the area of food safety.

    PubMed

    Varret, C; Beronius, A; Bodin, L; Bokkers, B G H; Boon, P E; Burger, M; De Wit-Bos, L; Fischer, A; Hanberg, A; Litens-Karlsson, S; Slob, W; Wolterink, G; Zilliacus, J; Beausoleil, C; Rousselle, C

    2018-01-15

    This study aims to evaluate the evidence for the existence of non-monotonic dose-responses (NMDRs) of substances in the area of food safety. This review was performed following the systematic review methodology with the aim to identify in vivo studies published between January 2002 and February 2015 containing evidence for potential NMDRs. Inclusion and reliability criteria were defined and used to select relevant and reliable studies. A set of six checkpoints was developed to establish the likelihood that the data retrieved contained evidence for NMDR. In this review, 49 in vivo studies were identified as relevant and reliable, of which 42 were used for dose-response analysis. These studies contained 179 in vivo dose-response datasets with at least five dose groups (and a control group) as fewer doses cannot provide evidence for NMDR. These datasets were extracted and analyzed using the PROAST software package. The resulting dose-response relationships were evaluated for possible evidence of NMDRs by applying the six checkpoints. In total, 10 out of the 179 in vivo datasets fulfilled all six checkpoints. While these datasets could be considered as providing evidence for NMDR, replicated studies would still be needed to check if the results can be reproduced to rule out that the non-monotonicity was caused by incidental anomalies in that specific study. This approach, combining a systematic review with a set of checkpoints, is new and appears useful for future evaluations of the dose response datasets regarding evidence of non-monotonicity. Published by Elsevier Inc.

  1. Identifying and acting on potentially inappropriate care? Inadequacy of current hospital coding for this task.

    PubMed

    Cooper, P David; Smart, David R

    2017-06-01

    Recent Australian attempts to facilitate disinvestment in healthcare, by identifying instances of 'inappropriate' care from large Government datasets, are subject to significant methodological flaws. Amongst other criticisms has been the fact that the Government datasets utilized for this purpose correlate poorly with datasets collected by relevant professional bodies. Government data derive from official hospital coding, collected retrospectively by clerical personnel, whilst professional body data derive from unit-specific databases, collected contemporaneously with care by clinical personnel. Assessment of accuracy of official hospital coding data for hyperbaric services in a tertiary referral hospital. All official hyperbaric-relevant coding data submitted to the relevant Australian Government agencies by the Royal Hobart Hospital, Tasmania, Australia for financial year 2010-2011 were reviewed and compared against actual hyperbaric unit activity as determined by reference to original source documents. Hospital coding data contained one or more errors in diagnoses and/or procedures in 70% of patients treated with hyperbaric oxygen that year. Multiple discrete error types were identified, including (but not limited to): missing patients; missing treatments; 'additional' treatments; 'additional' patients; incorrect procedure codes and incorrect diagnostic codes. Incidental observations of errors in surgical, anaesthetic and intensive care coding within this cohort suggest that the problems are not restricted to the specialty of hyperbaric medicine alone. Publications from other centres indicate that these problems are not unique to this institution or State. Current Government datasets are irretrievably compromised and not fit for purpose. Attempting to inform the healthcare policy debate by reference to these datasets is inappropriate. Urgent clinical engagement with hospital coding departments is warranted.

  2. Improving stability of prediction models based on correlated omics data by using network approaches.

    PubMed

    Tissier, Renaud; Houwing-Duistermaat, Jeanine; Rodríguez-Girondo, Mar

    2018-01-01

    Building prediction models based on complex omics datasets such as transcriptomics, proteomics, metabolomics remains a challenge in bioinformatics and biostatistics. Regularized regression techniques are typically used to deal with the high dimensionality of these datasets. However, due to the presence of correlation in the datasets, it is difficult to select the best model and application of these methods yields unstable results. We propose a novel strategy for model selection where the obtained models also perform well in terms of overall predictability. Several three step approaches are considered, where the steps are 1) network construction, 2) clustering to empirically derive modules or pathways, and 3) building a prediction model incorporating the information on the modules. For the first step, we use weighted correlation networks and Gaussian graphical modelling. Identification of groups of features is performed by hierarchical clustering. The grouping information is included in the prediction model by using group-based variable selection or group-specific penalization. We compare the performance of our new approaches with standard regularized regression via simulations. Based on these results we provide recommendations for selecting a strategy for building a prediction model given the specific goal of the analysis and the sizes of the datasets. Finally we illustrate the advantages of our approach by application of the methodology to two problems, namely prediction of body mass index in the DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome study (DILGOM) and prediction of response of each breast cancer cell line to treatment with specific drugs using a breast cancer cell lines pharmacogenomics dataset.
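    The three-step idea (build a feature network, cluster it into modules, then feed module-level information into a penalized regression) can be sketched as follows. This toy version uses a plain correlation network, average-linkage hierarchical clustering, and module-mean summaries with ridge regression as a stand-in for the group-based selection and group-specific penalties studied in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, p = 120, 200
X = rng.normal(size=(n, p))
X[:, 50:60] += rng.normal(size=(n, 1))               # one block of correlated features
y = X[:, 50:60].mean(axis=1) + rng.normal(scale=0.5, size=n)

# Steps 1-2: correlation network summarized by hierarchical clustering into modules.
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))       # feature-by-feature dissimilarity
condensed = dist[np.triu_indices(p, k=1)]             # condensed form expected by linkage
modules = fcluster(linkage(condensed, method="average"), t=20, criterion="maxclust")

# Step 3: represent each module by its mean profile and fit a penalized model.
Z = np.column_stack([X[:, modules == m].mean(axis=1) for m in np.unique(modules)])
model = RidgeCV(alphas=np.logspace(-2, 2, 20)).fit(Z, y)
print("R^2 on module summaries:", round(model.score(Z, y), 3))
```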

  3. Using the Spatial Distribution of Installers to Define Solar Photovoltaic Markets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    O'Shaughnessy, Eric; Nemet, Gregory F.; Darghouth, Naim

    2016-09-01

    Solar PV market research to date has largely relied on arbitrary jurisdictional boundaries, such as counties, to study solar PV market dynamics. This paper seeks to improve solar PV market research by developing a methodology to define solar PV markets. The methodology is based on the spatial distribution of solar PV installers. An algorithm is developed and applied to a rich dataset of solar PV installations to study the outcomes of the installer-based market definitions. The installer-based approach exhibits several desirable properties. Specifically, the higher market granularity of the installer-based approach will allow future PV market research to study the relationship between market dynamics and pricing with more precision.

  4. Passive Containment DataSet

    EPA Pesticide Factsheets

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system. This dataset is associated with the following publication: Grayman, W., R. Murray, and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  5. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structure Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.

  6. Spatiotemporal integration of molecular and anatomical data in virtual reality using semantic mapping.

    PubMed

    Soh, Jung; Turinsky, Andrei L; Trinh, Quang M; Chang, Jasmine; Sabhaney, Ajay; Dong, Xiaoli; Gordon, Paul Mk; Janzen, Ryan Pw; Hau, David; Xia, Jianguo; Wishart, David S; Sensen, Christoph W

    2009-01-01

    We have developed a computational framework for spatiotemporal integration of molecular and anatomical datasets in a virtual reality environment. Using two case studies involving gene expression data and pharmacokinetic data, respectively, we demonstrate how existing knowledge bases for molecular data can be semantically mapped onto a standardized anatomical context of human body. Our data mapping methodology uses ontological representations of heterogeneous biomedical datasets and an ontology reasoner to create complex semantic descriptions of biomedical processes. This framework provides a means to systematically combine an increasing amount of biomedical imaging and numerical data into spatiotemporally coherent graphical representations. Our work enables medical researchers with different expertise to simulate complex phenomena visually and to develop insights through the use of shared data, thus paving the way for pathological inference, developmental pattern discovery and biomedical hypothesis testing.

  7. A large dataset of protein dynamics in the mammalian heart proteome.

    PubMed

    Lau, Edward; Cao, Quan; Ng, Dominic C M; Bleakley, Brian J; Dincer, T Umut; Bot, Brian M; Wang, Ding; Liem, David A; Lam, Maggie P Y; Ge, Junbo; Ping, Peipei

    2016-03-15

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding on mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegenerations. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems.

  8. Compressive sensing reconstruction of 3D wet refractivity based on GNSS and InSAR observations

    NASA Astrophysics Data System (ADS)

    Heublein, Marion; Alshawaf, Fadwa; Erdnüß, Bastian; Zhu, Xiao Xiang; Hinz, Stefan

    2018-06-01

    In this work, the reconstruction quality of an approach for neutrospheric water vapor tomography based on Slant Wet Delays (SWDs) obtained from Global Navigation Satellite Systems (GNSS) and Interferometric Synthetic Aperture Radar (InSAR) is investigated. The novelties of this approach are (1) the use of both absolute GNSS and absolute InSAR SWDs for tomography and (2) the solution of the tomographic system by means of compressive sensing (CS). The tomographic reconstruction is performed based on (i) a synthetic SWD dataset generated using wet refractivity information from the Weather Research and Forecasting (WRF) model and (ii) a real dataset using GNSS and InSAR SWDs. Thus, the validation of the achieved results focuses (i) on a comparison of the refractivity estimates with the input WRF refractivities and (ii) on radiosonde profiles. In case of the synthetic dataset, the results show that the CS approach yields a more accurate and more precise solution than least squares (LSQ). In addition, the benefit of adding synthetic InSAR SWDs into the tomographic system is analyzed. When applying CS, adding synthetic InSAR SWDs into the tomographic system improves the solution both in magnitude and in scattering. When solving the tomographic system by means of LSQ, no clear behavior is observed. In case of the real dataset, the estimated refractivities of both methodologies show a consistent behavior although the LSQ and CS solution strategies differ.
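    The contrast between a least-squares and a compressive-sensing style solution of an underdetermined tomographic system can be illustrated on a toy problem. The sketch below uses a random design matrix, a synthetically sparse refractivity vector, and an L1-penalized (Lasso) solver as a stand-in for the CS reconstruction; it is not the actual GNSS/InSAR ray geometry or solver used in the study.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_rays, n_voxels = 80, 200                      # fewer observations than unknowns
A = rng.normal(size=(n_rays, n_voxels))         # stand-in ray-geometry matrix
x_true = np.zeros(n_voxels)
x_true[rng.choice(n_voxels, size=10, replace=False)] = rng.normal(5, 1, 10)  # sparse field
b = A @ x_true + rng.normal(scale=0.05, size=n_rays)   # synthetic slant-delay observations

x_lsq = np.linalg.lstsq(A, b, rcond=None)[0]                 # minimum-norm least squares
x_cs = Lasso(alpha=0.05, max_iter=20000).fit(A, b).coef_    # L1-regularized (CS-style) recovery

def rmse(x):
    return np.sqrt(np.mean((x - x_true) ** 2))

print(f"RMSE least squares: {rmse(x_lsq):.3f}   RMSE L1 recovery: {rmse(x_cs):.3f}")
```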

  9. DMirNet: Inferring direct microRNA-mRNA association networks.

    PubMed

    Lee, Minsu; Lee, HyungJune

    2016-12-05

    MicroRNAs (miRNAs) play important regulatory roles in the wide range of biological processes by inducing target mRNA degradation or translational repression. Based on the correlation between expression profiles of a miRNA and its target mRNA, various computational methods have previously been proposed to identify miRNA-mRNA association networks by incorporating the matched miRNA and mRNA expression profiles. However, there remain three major issues to be resolved in the conventional computation approaches for inferring miRNA-mRNA association networks from expression profiles. 1) Inferred correlations from the observed expression profiles using conventional correlation-based methods include numerous erroneous links or over-estimated edge weight due to the transitive information flow among direct associations. 2) Due to the high-dimension-low-sample-size problem on the microarray dataset, it is difficult to obtain an accurate and reliable estimate of the empirical correlations between all pairs of expression profiles. 3) Because the previously proposed computational methods usually suffer from varying performance across different datasets, a more reliable model that guarantees optimal or suboptimal performance across different datasets is highly needed. In this paper, we present DMirNet, a new framework for identifying direct miRNA-mRNA association networks. To tackle the aforementioned issues, DMirNet incorporates 1) three direct correlation estimation methods (namely Corpcor, SPACE, Network deconvolution) to infer direct miRNA-mRNA association networks, 2) the bootstrapping method to fully utilize insufficient training expression profiles, and 3) a rank-based Ensemble aggregation to build a reliable and robust model across different datasets. Our empirical experiments on three datasets demonstrate the combinatorial effects of necessary components in DMirNet. Additional performance comparison experiments show that DMirNet outperforms the state-of-the-art Ensemble-based model [1] which has shown the best performance across the same three datasets, with a factor of up to 1.29. Further, we identify 43 putative novel multi-cancer-related miRNA-mRNA association relationships from an inferred Top 1000 direct miRNA-mRNA association network. We believe that DMirNet is a promising method to identify novel direct miRNA-mRNA relations and to elucidate the direct miRNA-mRNA association networks. Since DMirNet infers direct relationships from the observed data, DMirNet can contribute to reconstructing various direct regulatory pathways, including, but not limited to, the direct miRNA-mRNA association networks.
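    One way to see how direct-association estimators suppress transitive links is via partial correlations computed from the inverse covariance matrix, the idea behind tools such as Corpcor. The sketch below builds a three-variable toy example (a miRNA, its direct target, and a downstream mRNA) with Ledoit-Wolf shrinkage standing in for the shrinkage estimators used in practice; it is not the DMirNet code.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(7)
n = 200
mirna  = rng.normal(size=n)
mrna_a = -0.8 * mirna + rng.normal(scale=0.5, size=n)   # direct target of the miRNA
mrna_b = 0.9 * mrna_a + rng.normal(scale=0.5, size=n)   # only indirectly linked to the miRNA
X = np.column_stack([mirna, mrna_a, mrna_b])

def partial_corr(X):
    """Partial correlations from a shrinkage estimate of the inverse covariance."""
    prec = np.linalg.inv(LedoitWolf().fit(X).covariance_)
    d = np.sqrt(np.diag(prec))
    return -prec / np.outer(d, d)

print("marginal corr miRNA-mRNA_b:", round(np.corrcoef(X.T)[0, 2], 2))
print("partial  corr miRNA-mRNA_b:", round(partial_corr(X)[0, 2], 2))
```

The marginal correlation suggests a strong miRNA-mRNA_b link, while the partial correlation (conditioning on the direct target) shrinks toward zero, which is the behavior direct-association methods exploit.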

  10. Engineering and Functional Characterization of Fusion Genes Identifies Novel Oncogenic Drivers of Cancer. | Office of Cancer Genomics

    Cancer.gov

    Oncogenic gene fusions drive many human cancers, but tools to more quickly unravel their functional contributions are needed. Here we describe methodology permitting fusion gene construction for functional evaluation. Using this strategy, we engineered the known fusion oncogenes, BCR-ABL1, EML4-ALK, and ETV6-NTRK3, as well as 20 previously uncharacterized fusion genes identified in TCGA datasets.

  11. Bayesian Hierarchical Models to Augment the Mediterranean Forecast System

    DTIC Science & Technology

    2010-09-30

    In part 2 (Bonazzi et al., 2010), the impact of the ensemble forecast methodology based on MFS-Wind-BHM perturbations is documented. Forecast...absence of dt data stage inputs, the forecast impact of MFS-Error-BHM is neutral. Experiments are underway now to introduce dt back into the MFS-Error...BHM and quantify forecast impacts at MFS. MFS-SuperEnsemble-BHM We have assembled all needed datasets and completed algorithmic development

  12. Uncovering and Managing the Impact of Methodological Choices for the Computational Construction of Socio-Technical Networks from Texts

    DTIC Science & Technology

    2012-09-01

    supported by the National Science Foundation (NSF) IGERT 9972762, the Army Research Institute (ARI) W91WAW07C0063, the Army Research Laboratory (ARL/CTA...prediction models in AutoMap .................................................. 144   Figure 13: Decision Tree for prediction model selection in...generated for nationally funded initiatives and made available through the Linguistic Data Consortium (LDC). An overview of these datasets is provided in

  13. Methodological issues of genetic association studies.

    PubMed

    Simundic, Ana-Maria

    2010-12-01

    Genetic association studies explore the association between genetic polymorphisms and a certain trait, disease or predisposition to disease. It has long been acknowledged that many genetic association studies fail to replicate their initial positive findings. This raises concern about the methodological quality of these reports. Case-control genetic association studies often suffer from various methodological flaws in study design and data analysis, and are often reported poorly. Flawed methodology and poor reporting leads to distorted results and incorrect conclusions. Many journals have adopted guidelines for reporting genetic association studies. In this review, some major methodological determinants of genetic association studies will be discussed.

  14. Developing an Automated Machine Learning Marine Oil Spill Detection System with Synthetic Aperture Radar

    NASA Astrophysics Data System (ADS)

    Pinales, J. C.; Graber, H. C.; Hargrove, J. T.; Caruso, M. J.

    2016-02-01

    Previous studies have demonstrated the ability to detect and classify marine hydrocarbon films with spaceborne synthetic aperture radar (SAR) imagery. The damping effects of hydrocarbon discharges on small surface capillary-gravity waves render the ocean surface "radar dark" compared with the surrounding wind-roughened ocean surface. Given the scope and impact of events like the Deepwater Horizon oil spill, the need for improved, automated and expedient monitoring of hydrocarbon-related marine anomalies has become a pressing and complex issue for governments and the extraction industry. The research presented here describes the development, training, and utilization of an algorithm that detects marine oil spills in an automated, semi-supervised manner, utilizing X-, C-, or L-band SAR data as the primary input. Ancillary datasets include related radar-borne variables (incidence angle, etc.), environmental data (wind speed, etc.) and textural descriptors. Shapefiles produced by an experienced human analyst served as targets (validation) during the training portion of the investigation. Training and testing datasets were chosen for development and assessment of algorithm effectiveness as well as optimal conditions for oil detection in SAR data. The algorithm detects oil spills by following a 3-step methodology: object detection, feature extraction, and classification. Previous oil spill detection and classification methodologies such as machine learning algorithms, artificial neural networks (ANN), and multivariate classification methods like partial least squares-discriminant analysis (PLS-DA) are evaluated and compared. Statistical, transform, and model-based image texture techniques, commonly used for object mapping directly or as inputs for more complex methodologies, are explored to determine optimal textures for an oil spill detection system. The influence of the ancillary variables is explored, with a particular focus on the role of strong vs. weak wind forcing.
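
    The following is only an illustrative sketch of the generic 3-step flow (dark-object detection, feature extraction, classification), not the authors' system; the synthetic backscatter image, thresholds, features, and training labels are all invented assumptions.

```python
# Sketch: detect dark regions in a synthetic SAR-like image, extract simple
# per-object features, and classify with a supervised model.
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
img = rng.gamma(shape=4.0, scale=0.25, size=(200, 200))        # speckled "ocean" backscatter
img[60:90, 40:120] *= 0.3                                       # dampened (dark) slick-like patch

# 1) Object detection: candidate dark regions below an adaptive threshold
mask = img < 0.5 * np.median(img)
labels, n_objects = ndimage.label(mask)

# 2) Feature extraction per candidate object (area, darkness contrast, elongation proxy)
feats = []
for lab in range(1, n_objects + 1):
    obj = labels == lab
    area = obj.sum()
    contrast = img[~obj].mean() / max(img[obj].mean(), 1e-6)
    rows, cols = np.where(obj)
    elong = (rows.ptp() + 1) / (cols.ptp() + 1)
    feats.append([area, contrast, elong])
feats = np.array(feats)

# 3) Classification: a classifier trained on (here, synthetic) analyst-labelled examples
train_X = rng.normal(loc=[[500, 3, 0.5]], scale=[[200, 0.5, 0.2]], size=(50, 3))
train_y = rng.integers(0, 2, 50)                                 # placeholder labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_X, train_y)
preds = clf.predict(feats) if len(feats) else np.array([])
print("candidate dark objects:", len(feats), "| flagged as oil by the classifier:", int(preds.sum()))
```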

  15. Survival prediction of trauma patients: a study on US National Trauma Data Bank.

    PubMed

    Sefrioui, I; Amadini, R; Mauro, J; El Fallahi, A; Gabbrielli, M

    2017-12-01

    Exceptional circumstances such as major incidents or natural disasters may produce large numbers of victims who cannot all be treated immediately and simultaneously. In these cases it is important to define priorities and avoid wasting time and resources on victims who cannot be saved. The Trauma and Injury Severity Score (TRISS) methodology is the standard system usually used by practitioners to predict the survival probability of trauma patients. However, practitioners have noted that the accuracy of TRISS predictions is unacceptable, especially for severely injured patients, so alternative methods are needed. In this work we evaluate different approaches for predicting whether a patient will survive according to simple and easily measurable observations. We conducted a rigorous comparative study of the most important prediction techniques using real clinical data from the US National Trauma Data Bank. Empirical results show that well-known machine learning classifiers can outperform the TRISS methodology. Based on our findings, the best approach we evaluated is Random Forest: it has the best accuracy, area under the curve, and k-statistic, as well as the second-best sensitivity and specificity, and a good calibration curve. Furthermore, its performance increases monotonically as the dataset size grows, meaning that it can effectively exploit incoming knowledge. Considering the whole dataset, it is always better than TRISS. Finally, we implemented a new tool to compute the survival probability of victims, which will help medical practitioners obtain better accuracy than the TRISS tools. Random Forests may be a good candidate for improving survival predictions over the standard TRISS methodology.
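
    A hedged sketch of the comparison idea follows; it is not the study's code and does not use National Trauma Data Bank data. The observation fields, the synthetic outcome model, and the use of a plain logistic regression as a stand-in for a TRISS-like score are all assumptions.

```python
# Sketch: train a Random Forest on simple trauma observations and compare its
# discrimination (accuracy, AUC) with a logistic-style score on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(42)
n = 5000
age = rng.integers(16, 90, n)
systolic_bp = rng.normal(120, 25, n).clip(40, 220)
gcs = rng.integers(3, 16, n)                        # Glasgow Coma Scale
iss = rng.integers(1, 60, n)                        # Injury Severity Score
logit = -3 + 0.03 * (age - 40) - 0.02 * (systolic_bp - 120) - 0.25 * (gcs - 15) + 0.08 * iss
died = rng.random(n) < 1 / (1 + np.exp(-logit))     # synthetic outcome

X = np.column_stack([age, systolic_bp, gcs, iss])
X_tr, X_te, y_tr, y_te = train_test_split(X, died, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # stand-in for a TRISS-like logistic score

for name, model in [("random forest", rf), ("logistic score", lr)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_te, p > 0.5):.3f}, AUC={roc_auc_score(y_te, p):.3f}")
```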

  16. Land cover mapping for development planning in Eastern and Southern Africa

    NASA Astrophysics Data System (ADS)

    Oduor, P.; Flores Cordova, A. I.; Wakhayanga, J. A.; Kiema, J.; Farah, H.; Mugo, R. M.; Wahome, A.; Limaye, A. S.; Irwin, D.

    2016-12-01

    Africa continues to experience intensification of land use, driven by competition for resources and a growing population. Land cover maps are some of the fundamental datasets required by numerous stakeholders to inform a number of development decisions. For instance, they can be integrated with other datasets to create value added products such as vulnerability impact assessment maps, and natural capital accounting products. In addition, land cover maps are used as inputs into Greenhouse Gas (GHG) inventories to inform the Agriculture, Forestry and other Land Use (AFOLU) sector. However, the processes and methodologies of creating land cover maps consistent with international and national land cover classification schemes can be challenging, especially in developing countries where skills, hardware and software resources can be limiting. To meet this need, SERVIR Eastern and Southern Africa developed methodologies and stakeholder engagement processes that led to a successful initiative in which land cover maps for 9 countries (Malawi, Rwanda, Namibia, Botswana, Lesotho, Ethiopia, Uganda, Zambia and Tanzania) were developed, using 2 major classification schemes. The first sets of maps were developed based on an internationally acceptable classification system, while the second sets of maps were based on a nationally defined classification system. The mapping process benefited from reviews from national experts and also from technical advisory groups. The maps have found diverse uses, among them the definition of the Forest Reference Levels in Zambia. In Ethiopia, the maps have been endorsed by the national mapping agency as part of national data. The data for Rwanda is being used to inform the Natural Capital Accounting process, through the WAVES program, a World Bank Initiative. This work illustrates the methodologies and stakeholder engagement processes that brought success to this land cover mapping initiative.

  17. Linked Records of Children with Traumatic Brain Injury. Probabilistic Linkage without Use of Protected Health Information.

    PubMed

    Bennett, T D; Dean, J M; Keenan, H T; McGlincy, M H; Thomas, A M; Cook, L J

    2015-01-01

    Record linkage may create powerful datasets with which investigators can conduct comparative effectiveness studies evaluating the impact of tests or interventions on health. All linkages of health care data files to date have used protected health information (PHI) in their linkage variables. A technique to link datasets without using PHI would be advantageous both to preserve privacy and to increase the number of potential linkages. We applied probabilistic linkage to records of injured children in the National Trauma Data Bank (NTDB, N = 156,357) and the Pediatric Health Information Systems (PHIS, N = 104,049) databases from 2007 to 2010. 49 match variables without PHI were used, many of them administrative variables and indicators for procedures recorded as International Classification of Diseases, 9th revision, Clinical Modification codes. We validated the accuracy of the linkage using identified data from a single center that submits to both databases. We accurately linked the PHIS and NTDB records for 69% of children with any injury, and 88% of those with severe traumatic brain injury eligible for a study of intervention effectiveness (positive predictive value of 98%, specificity of 99.99%). Accurate linkage was associated with longer lengths of stay, more severe injuries, and multiple injuries. In populations with substantial illness or injury severity, accurate record linkage may be possible in the absence of PHI. This methodology may enable linkages and, in turn, comparative effectiveness studies that would be unlikely or impossible otherwise.
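
    Below is a simplified sketch of probabilistic linkage scoring in the Fellegi-Sunter style, not the authors' 49-variable implementation; the match variables, m/u probabilities, and decision threshold are illustrative assumptions only.

```python
# Sketch: sum log-likelihood-ratio weights over agreeing/disagreeing non-PHI fields.
import math

# Per-variable m-probabilities (agreement given a true match) and
# u-probabilities (agreement by chance among non-matches).
match_vars = {
    "admit_year":       {"m": 0.95, "u": 0.25},
    "age_in_years":     {"m": 0.90, "u": 0.02},
    "icd9_procedure_1": {"m": 0.85, "u": 0.01},
    "discharge_status": {"m": 0.92, "u": 0.10},
}

def link_weight(record_a, record_b):
    """Sum of log2 likelihood ratios over agreeing / disagreeing fields."""
    total = 0.0
    for var, p in match_vars.items():
        if record_a.get(var) == record_b.get(var):
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total

ntdb_rec = {"admit_year": 2009, "age_in_years": 7, "icd9_procedure_1": "01.24", "discharge_status": "home"}
phis_rec = {"admit_year": 2009, "age_in_years": 7, "icd9_procedure_1": "01.24", "discharge_status": "rehab"}

w = link_weight(ntdb_rec, phis_rec)
print(f"link weight = {w:.2f} ->", "accept as match" if w > 8 else "review / reject")
```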

  18. A comparison of three clustering methods for finding subgroups in MRI, SMS or clinical data: SPSS TwoStep Cluster analysis, Latent Gold and SNOB.

    PubMed

    Kent, Peter; Jensen, Rikke K; Kongsted, Alice

    2014-10-02

    There are various methodological approaches to identifying clinically important subgroups, and one method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA). There is a scarcity of head-to-head comparisons that can inform the choice of which clustering method might be suitable for particular clinical datasets and research questions. Therefore, the aim of this study was to perform a head-to-head comparison of three commonly available methods (SPSS TwoStep CA, Latent Gold LCA and SNOB LCA). The performance of these three methods was compared: (i) quantitatively using the number of subgroups detected, the classification probability of individuals into subgroups, the reproducibility of results, and (ii) qualitatively using subjective judgments about each program's ease of use and interpretability of the presentation of results. We analysed five real datasets of varying complexity in a secondary analysis of data from other research projects. Three datasets contained only MRI findings (n = 2,060 to 20,810 vertebral disc levels), one dataset contained only pain intensity data collected for 52 weeks by text (SMS) messaging (n = 1,121 people), and the last dataset contained a range of clinical variables measured in low back pain patients (n = 543 people). Four artificial datasets (n = 1,000 each) containing subgroups of varying complexity were also analysed, testing the ability of these clustering methods to detect subgroups and correctly classify individuals when subgroup membership was known. The results from the real clinical datasets indicated that the number of subgroups detected varied, the certainty of classifying individuals into those subgroups varied, the findings had perfect reproducibility, some programs were easier to use and the interpretability of the presentation of their findings also varied. The results from the artificial datasets indicated that all three clustering methods showed a near-perfect ability to detect known subgroups and correctly classify individuals into those subgroups. Our subjective judgement was that Latent Gold offered the best balance of sensitivity to subgroups, ease of use and presentation of results with these datasets but we recognise that different clustering methods may suit other types of data and clinical research questions.
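
    The compared programs are commercial, so the snippet below is only an analogous open-source illustration of the evaluation idea on artificial data with known subgroups: two clustering approaches are scored by how well individuals are classified (adjusted Rand index). The data generator and cluster counts are assumptions.

```python
# Sketch: compare a distance-based and a model-based clustering on artificial data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, true_subgroup = make_blobs(n_samples=1000, centers=4, cluster_std=1.2, random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
mixture_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)  # model-based, LCA-flavoured

print("k-means ARI:      ", round(adjusted_rand_score(true_subgroup, kmeans_labels), 3))
print("mixture model ARI:", round(adjusted_rand_score(true_subgroup, mixture_labels), 3))
```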

  19. Characterization of Transport Errors in Chemical Forecasts from a Global Tropospheric Chemical Transport Model

    NASA Technical Reports Server (NTRS)

    Bey, I.; Jacob, D. J.; Liu, H.; Yantosca, R. M.; Sachse, G. W.

    2004-01-01

    We propose a new methodology to characterize errors in the representation of transport processes in chemical transport models. We constrain the evaluation of a global three-dimensional chemical transport model (GEOS-CHEM) with an extended dataset of carbon monoxide (CO) concentrations obtained during the Transport and Chemical Evolution over the Pacific (TRACE-P) aircraft campaign. The TRACE-P mission took place over the western Pacific, a region frequently impacted by continental outflow associated with different synoptic-scale weather systems (such as cold fronts) and deep convection, and thus provides a valuable dataset for our analysis. Model simulations using both forecast and assimilated meteorology are examined. Background CO concentrations are computed as a function of latitude and altitude and subsequently subtracted from both the observed and the model datasets to focus on the ability of the model to simulate variability on a synoptic scale. Different sampling strategies (i.e., spatial displacement and smoothing) are applied along the flight tracks to search for systematic model biases. Statistical quantities such as the correlation coefficient and centered root-mean-square difference are computed between the simulated and the observed fields and are further inter-compared using Taylor diagrams. We find no systematic bias in the model for the TRACE-P region when we consider the entire dataset (i.e., from the surface to 12 km). This result indicates that the transport error in our model is globally unbiased, which has important implications for using the model to conduct inverse modeling studies. Using the First-Look assimilated meteorology provides only a small improvement in the correlation compared with the forecast meteorology. These general statements can be refined when the entire dataset is divided into different vertical domains, i.e., the lower troposphere (less than 2 km), the middle troposphere (2-6 km), and the upper troposphere (greater than 6 km). The best agreement between the observations and the model is found in the lower and middle troposphere. Downward displacements in the lower troposphere provide a better fit with the observed values, which could indicate a problem in the representation of boundary layer height in the model. Significant improvement is also found for downward and southward displacements in the upper troposphere. There are several potential sources of errors in our simulation of the continental outflow in the upper troposphere which could lead to such biases, including the location and/or the strength of deep convective cells as well as that of wildfires in Southeast Asia.
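
    As a small illustration of the statistics that underlie a Taylor diagram, the sketch below computes the correlation and centered root-mean-square difference between "observed" and "simulated" fields after removing a mean background. The data are synthetic, not TRACE-P measurements.

```python
# Sketch: Taylor-diagram statistics (correlation, centered RMSD, standard deviations).
import numpy as np

rng = np.random.default_rng(3)
obs = 100 + 30 * rng.random(500)                  # "observed" CO, ppbv
sim = obs + rng.normal(0, 10, 500)                # "simulated" CO with random transport error

def taylor_stats(observed, simulated):
    o = observed - observed.mean()                # background (mean) removed -> centered fields
    s = simulated - simulated.mean()
    corr = np.corrcoef(o, s)[0, 1]
    centered_rmsd = np.sqrt(np.mean((s - o) ** 2))
    return corr, centered_rmsd, o.std(), s.std()

corr, crmsd, sd_obs, sd_sim = taylor_stats(obs, sim)
print(f"correlation={corr:.2f}, centered RMSD={crmsd:.1f} ppbv, "
      f"sigma_obs={sd_obs:.1f}, sigma_sim={sd_sim:.1f}")
# Taylor's relation: crmsd^2 = sd_obs^2 + sd_sim^2 - 2*sd_obs*sd_sim*corr
```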

  20. Moisture transport and Atmospheric circulation in the Arctic

    NASA Astrophysics Data System (ADS)

    Woods, Cian; Caballero, Rodrigo

    2013-04-01

    Cyclones are an important feature of mid-latitude and Arctic climates. They are a main transporter of warm, moist air from the subtropics to the poles. The Arctic winter is dominated by highly stable conditions for most of the season due to a low-level temperature inversion caused by a radiation deficit at the surface. This temperature inversion is a ubiquitous feature of the Arctic winter climate and can persist for up to weeks at a time. The inversion can be destroyed during the passage of a cyclone advecting moisture and warming the surface. In the absence of an inversion, and in the presence of this warm moist air mass, clouds can form quite readily and as such influence the radiative processes and energy budget of the Arctic. Wind stress caused by a passing cyclone also has the tendency to cause break-up of the ice sheet by induced rotation, deformation and divergence at the surface. For these reasons, we wish to understand the mechanisms of warm moisture advection into the Arctic from lower latitudes and how these mechanisms are controlled. The body of work in this area has been growing and gaining momentum in recent years (Stramler et al. 2011; Morrison et al. 2012; Screen et al. 2011). However, there has been no in-depth analysis of the underlying dynamics to date. Improving our understanding of Arctic dynamics becomes increasingly important in the context of climate change. Many models agree that a northward shift of the storm track is likely in the future, which could have large impacts in the Arctic, particularly on the sea ice. A climatology of six-day forward and backward trajectories starting from multiple heights around 70°N is constructed using the 22-year ECMWF reanalysis dataset (ERA-INT). The data are 6-hourly with a horizontal resolution of 1 degree on 16 pressure levels. Our methodology here is inspired by previous studies examining flow patterns through cyclones in the mid-latitudes. We apply these earlier mid-latitude methods in the Arctic. We investigate an Arctic trajectory dataset and provide a phenomenological/descriptive analysis of these trajectories, including key meteorological variables carried along trajectories. The trajectory climatology is linked to a previously established cyclone climatology dataset from Hanley and Caballero (2011). We associate trajectories and the meteorological variables they are carrying to cyclones in this dataset. A climatology of 'Arctic-influencing' cyclones is constructed from the cyclone dataset. The resilience of the polar vortex and its effect on circulation, via blocking and breaking, is examined in relation to our trajectory climatology.

  1. Enlarged leukocyte referent libraries can explain additional variance in blood-based epigenome-wide association studies.

    PubMed

    Kim, Stephanie; Eliot, Melissa; Koestler, Devin C; Houseman, Eugene A; Wetmur, James G; Wiencke, John K; Kelsey, Karl T

    2016-09-01

    We examined whether variation in blood-based epigenome-wide association studies could be more completely explained by augmenting existing reference DNA methylation libraries. We compared existing and enhanced libraries in predicting variability in three publicly available 450K methylation datasets that collected whole-blood samples. Models were fit separately to each CpG site and used to estimate the additional variability explained when adjustments for cell composition were made with each library. Calculation of the mean difference in the CpG-specific residual sums of squares between models for an arthritis, an aging and a metabolic syndrome dataset indicated that the enhanced library explained significantly more variation across all three datasets (p < 10^-3). Pathologically important immune cell subtypes can explain important variability in epigenome-wide association studies done in blood.
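
    The toy sketch below is not the study's code; it only illustrates the comparison idea of fitting two linear cell-composition adjustments per CpG (a basic vs. an enhanced reference library) and comparing residual sums of squares. All cell fractions and methylation values are simulated.

```python
# Sketch: per-CpG residual sums of squares under two cell-composition adjustments.
import numpy as np

rng = np.random.default_rng(7)
n_samples, n_cpgs = 120, 200
basic_cells = rng.dirichlet([5, 3, 2], n_samples)              # e.g. 3 broad leukocyte types
extra_cells = rng.dirichlet([2, 2], n_samples) * 0.1           # additional immune subtypes
beta = (0.5 + basic_cells @ rng.normal(0, 0.05, (3, n_cpgs))
            + extra_cells @ rng.normal(0, 0.05, (2, n_cpgs))
            + rng.normal(0, 0.01, (n_samples, n_cpgs)))        # simulated methylation betas

def rss_per_cpg(covariates, methylation):
    X = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(X, methylation, rcond=None)
    resid = methylation - X @ coef
    return (resid ** 2).sum(axis=0)

rss_basic = rss_per_cpg(basic_cells, beta)
rss_enhanced = rss_per_cpg(np.column_stack([basic_cells, extra_cells]), beta)
print("mean per-CpG reduction in RSS with the enhanced library:",
      round(float((rss_basic - rss_enhanced).mean()), 5))
```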

  2. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets.

    PubMed

    Yu, Yao; Hu, Hao; Bohlender, Ryan J; Hu, Fulan; Chen, Jiun-Sheng; Holt, Carson; Fowler, Jerry; Guthery, Stephen L; Scheet, Paul; Hildebrandt, Michelle A T; Yandell, Mark; Huff, Chad D

    2018-04-06

    High-throughput sequencing data are increasingly being made available to the research community for secondary analyses, providing new opportunities for large-scale association studies. However, heterogeneity in target capture and sequencing technologies often introduces strong technological stratification biases that overwhelm subtle signals of association in studies of complex traits. Here, we introduce the Cross-Platform Association Toolkit, XPAT, which provides a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets. XPAT includes tools to support cross-platform aware variant calling, quality control filtering, gene-based association testing and rare variant effect size estimation. To evaluate the performance of XPAT, we conducted case-control association studies for three diseases, including 783 breast cancer cases, 272 ovarian cancer cases, 205 Crohn disease cases and 3507 shared controls (including 1722 females) using sequencing data from multiple sources. XPAT greatly reduced Type I error inflation in the case-control analyses, while replicating many previously identified disease-gene associations. We also show that association tests conducted with XPAT using cross-platform data have comparable performance to tests using matched platform data. XPAT enables new association studies that combine existing sequencing datasets to identify genetic loci associated with common diseases and other complex traits.

  3. Assisted reproductive technology has no association with autism spectrum disorders: The Taiwan Birth Cohort Study.

    PubMed

    Lung, For-Wey; Chiang, Tung-Liang; Lin, Shio-Jean; Lee, Meng-Chih; Shu, Bih-Ching

    2018-04-01

    The use of assisted reproduction technology has increased over the last two decades. Autism spectrum disorders and assisted reproduction technology share many risk factors. However, previous studies on the association between autism spectrum disorders and assisted reproduction technology have shown inconsistent results. The purpose of this study was to investigate the association between assisted reproduction technology and autism spectrum disorder diagnosis in a national birth cohort database. Furthermore, results from datasets exact-matched on propensity scores for assisted reproduction technology and for autism spectrum disorder were compared. For this study, the 6- and 66-month Taiwan Birth Cohort Study datasets were used (N = 20,095). In all, 744 families were selected by propensity score exact matching as the assisted reproduction technology sample (ratio of assisted reproduction technology to controls: 1:2), and 415 families as the autism spectrum disorder sample (ratio of autism spectrum disorder to controls: 1:4). Using a national birth cohort dataset and controlling for confounding factors of assisted reproduction technology conception and autism spectrum disorder diagnosis, both the assisted reproduction technology and the autism spectrum disorder propensity-score-matched datasets showed the same result: no association between assisted reproduction technology and autism spectrum disorder. Further study of detailed information on the processes and methods of assisted reproduction technology may provide more information on the association between assisted reproduction technology and autism spectrum disorder.
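
    For illustration only, the sketch below shows a generic 1:2 propensity score matching procedure on simulated data; it is not the study's exact-matching design, and the covariates, model, and caliper are assumptions.

```python
# Sketch: estimate propensity scores, match each treated unit to two controls,
# and compare outcome rates in the matched sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 4000
maternal_age = rng.normal(31, 5, n)
parity = rng.integers(0, 4, n)
art = rng.random(n) < 1 / (1 + np.exp(-(-6 + 0.15 * maternal_age - 0.3 * parity)))  # "treatment"
outcome = rng.random(n) < 0.01                       # rare outcome, independent of ART in this toy data

X = np.column_stack([maternal_age, parity])
ps = LogisticRegression(max_iter=1000).fit(X, art).predict_proba(X)[:, 1]

treated, controls = np.where(art)[0], np.where(~art)[0]
used = set()
matched_treated, matched_controls = [], []
for i in treated:
    # two nearest available controls on the propensity score, within a 0.02 caliper
    order = controls[np.argsort(np.abs(ps[controls] - ps[i]))]
    picks = [j for j in order if j not in used and abs(ps[j] - ps[i]) < 0.02][:2]
    if len(picks) == 2:
        used.update(picks)
        matched_treated.append(i)
        matched_controls.extend(picks)

print(f"matched {len(matched_treated)} treated to {len(matched_controls)} controls")
print("outcome rate, treated:", round(outcome[matched_treated].mean(), 4),
      "controls:", round(outcome[matched_controls].mean(), 4))
```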

  4. Diagnosing observed characteristics of the wet season across Africa to identify deficiencies in climate model simulations

    NASA Astrophysics Data System (ADS)

    Dunning, C.; Black, E.; Allan, R. P.

    2017-12-01

    The seasonality of rainfall over Africa plays a key role in determining socio-economic impacts for agricultural stakeholders, influences energy supply from hydropower, affects the length of the malaria transmission season and impacts surface water supplies. Hence, failure or delays of these rains can lead to significant socio-economic impacts. Diagnosing and interpreting interannual variability and long-term trends in seasonality, and analysing the physical driving mechanisms, requires a robust definition of African precipitation seasonality, applicable to both observational datasets and model simulations. Here we present a methodology for objectively determining the onset and cessation of multiple wet seasons across the whole of Africa. Compatibility with known physical drivers of African rainfall, consistency with indigenous methods, and generally strong agreement between satellite-based rainfall datasets confirm that the method is capturing the correct seasonal progression of African rainfall. Application of this method to observational datasets reveals that over East Africa cessation of the short rains is 5 days earlier in La Niña years, and the failure of the rains and subsequent humanitarian disaster is associated with shorter as well as weaker rainy seasons over this region. The method is used to examine the representation of the seasonality of African precipitation in CMIP5 model simulations. Overall, atmosphere-only and fully coupled CMIP5 historical simulations represent essential aspects of the seasonal cycle; patterns of seasonal progression of the rainy season are captured, and for the most part mean model onset/cessation dates agree with mean observational dates to within 18 days. However, unlike the atmosphere-only simulations, the coupled simulations do not capture the biannual regime over the southern West African coastline, linked to errors in Gulf of Guinea sea surface temperature. Application to both observational and climate model datasets, and good agreement with agricultural onset methods, indicates the potential applicability of this method to a variety of meteorological and climate impact studies.
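
    As an illustration of one common family of objective onset/cessation definitions (cumulative daily rainfall anomaly, in the spirit of Liebmann-type methods), the sketch below runs on synthetic data; it is not the paper's exact algorithm, and the synthetic climate is an assumption.

```python
# Sketch: onset = minimum of the cumulative rainfall anomaly, cessation = maximum.
import numpy as np

rng = np.random.default_rng(5)
days = np.arange(365)
# Synthetic single-wet-season climate: wet roughly days 120-260
rain = np.where((days > 120) & (days < 260), rng.gamma(2.0, 4.0, 365), rng.gamma(0.3, 1.0, 365))

daily_mean = rain.mean()                          # long-term daily mean (here, one year only)
cum_anomaly = np.cumsum(rain - daily_mean)        # climatological cumulative anomaly

onset = int(np.argmin(cum_anomaly))               # minimum -> start of sustained surplus
cessation = int(np.argmax(cum_anomaly))           # maximum -> start of sustained deficit
print(f"onset day-of-year: {onset}, cessation day-of-year: {cessation}, "
      f"season length: {cessation - onset} days")
```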

  5. Performance and precision of double digestion RAD (ddRAD) genotyping in large multiplexed datasets of marine fish species.

    PubMed

    Maroso, F; Hillen, J E J; Pardo, B G; Gkagkavouzis, K; Coscia, I; Hermida, M; Franch, R; Hellemans, B; Van Houdt, J; Simionati, B; Taggart, J B; Nielsen, E E; Maes, G; Ciavaglia, S A; Webster, L M I; Volckaert, F A M; Martinez, P; Bargelloni, L; Ogden, R

    2018-06-01

    The development of Genotyping-By-Sequencing (GBS) technologies enables cost-effective analysis of large numbers of Single Nucleotide Polymorphisms (SNPs), especially in "non-model" species. Nevertheless, as such technologies enter a mature phase, biases and errors inherent to GBS are becoming evident. Here, we evaluated the performance of double digest Restriction enzyme Associated DNA (ddRAD) sequencing in SNP genotyping studies involving high numbers of samples. Datasets of sequence data were generated from three marine teleost species (>5500 samples, >2.5 × 10^12 bases in total), using a standardized protocol. A common bioinformatics pipeline based on STACKS was established, with and without the use of a reference genome. We performed analyses throughout the production and analysis of ddRAD data in order to explore (i) the loss of information due to heterogeneous raw read number across samples; (ii) the discrepancy between expected and observed tag length and coverage; (iii) the performance of reference-based vs. de novo approaches; (iv) the sources of potential genotyping errors of the library preparation/bioinformatics protocol, by comparing technical replicates. Our results showed that use of a reference genome and a posteriori genotype correction improved genotyping precision. Individual read coverage was a key variable for reproducibility; variance in sequencing depth between loci in the same individual was also identified as an important factor and found to correlate with tag length. A comparison of downstream analysis carried out with ddRAD vs. single-SNP allele-specific assay genotypes provided information about the levels of genotyping imprecision that can have a significant impact on allele frequency estimations and population assignment. The results and insights presented here will help to select and improve approaches to the analysis of large datasets based on RAD-like methodologies. Crown Copyright © 2018. Published by Elsevier B.V. All rights reserved.

  6. Assessing Data Quality in Emergent Domains of Earth Sciences

    NASA Astrophysics Data System (ADS)

    Darch, P. T.; Borgman, C.

    2016-12-01

    As earth scientists seek to study known phenomena in new ways, and to study new phenomena, they often develop new technologies and new methods such as embedded network sensing, or reapply extant technologies, such as seafloor drilling. Emergent domains are often highly multidisciplinary as researchers from many backgrounds converge on new research questions. They may adapt existing methods, or develop methods de novo. As a result, emerging domains tend to be methodologically heterogeneous. As these domains mature, pressure to standardize methods increases. Standardization promotes trust, reliability, accuracy, and reproducibility, and simplifies data management. However, for standardization to occur, researchers must be able to assess which of the competing methods produces the highest quality data. The exploratory nature of emerging domains discourages standardization. Because competing methods originate in different disciplinary backgrounds, their scientific credibility is difficult to compare. Instead of direct comparison, researchers attempt to conduct meta-analyses. Scientists compare datasets produced by different methods to assess their consistency and efficiency. This paper presents findings from a long-term qualitative case study of research on the deep subseafloor biosphere, an emergent domain. A diverse community converged on the study of microbes in the seafloor and those microbes' interactions with the physical environments they inhabit. Data on this problem are scarce, leading to calls for standardization as a means to acquire and analyze greater volumes of data. Lacking consistent methods, scientists attempted to conduct meta-analyses to determine the most promising methods on which to standardize. Among the factors that inhibited meta-analyses were disparate approaches to metadata and to curating data. Datasets may be deposited in a variety of databases or kept on individual scientists' servers. Associated metadata may be inconsistent or hard to interpret. Incentive structures, including prospects for journal publication, often favor new data over reanalyzing extant datasets. Assessing data quality in emergent domains is extremely difficult and will require adaptations in infrastructure, culture, and incentives.

  7. Climate Trend Detection using Sea-Surface Temperature Data-sets from the (A)ATSR and AVHRR Space Sensors.

    NASA Astrophysics Data System (ADS)

    Llewellyn-Jones, D. T.; Corlett, G. K.; Remedios, J. J.; Noyes, E. J.; Good, S. A.

    2007-05-01

    Sea-Surface Temperature (SST) is an important indicator of global change, designated by GCOS as an Essential Climate Variable (ECV). The detection of trends in global SST requires rigorous measurements that are not only global, but also highly accurate and consistent. Space instruments can provide the means to achieve these required attributes in SST data. This paper presents an analysis of 15 years of SST data from two independent data sets, generated from the (A)ATSR and AVHRR series of sensors respectively. The analyses reveal trends of increasing global temperature of between 0.13 °C and 0.18 °C per decade, closely matching those expected from some current predictions. A high level of consistency in the results from the two independent observing systems is seen, which gives increased confidence in data from both systems and also enables comparative analyses of the accuracy and stability of both data sets to be carried out. The conclusion is that these satellite SST data-sets provide important means to quantify and explore the processes of climate change. An analysis based upon singular value decomposition, allowing the removal of gross transitory disturbances, notably the El Niño, in order to examine regional areas of change other than the tropical Pacific, is also presented. Interestingly, although El Niño events clearly affect SST globally, they are found to have a non-significant (within error) effect on the calculated trends, which changed by only 0.01 K/decade when the pattern of El Niño and the associated variations was removed from the SST record. Although similar global trends were calculated for these two independent data sets, larger regional differences are noted. Evidence of decreased temperatures after the eruption of Mount Pinatubo in 1991 was also observed. The methodology demonstrated here can be applied to other data-sets covering long time series of geophysical observations in order to characterise long-term change.
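
    The basic trend computation can be sketched as a least-squares slope expressed per decade; the example below uses two synthetic anomaly records standing in for the two sensor series, and does not include the singular-value-decomposition removal of El Niño variability described in the abstract.

```python
# Sketch: per-decade linear trend from two independent synthetic SST anomaly records.
import numpy as np

rng = np.random.default_rng(9)
years = np.arange(1991, 2006) + 0.5
true_trend = 0.015                                      # degC per year (~0.15 degC per decade)
atsr_like  = true_trend * (years - years[0]) + rng.normal(0, 0.05, years.size)
avhrr_like = true_trend * (years - years[0]) + rng.normal(0, 0.05, years.size)

for name, series in [("(A)ATSR-like", atsr_like), ("AVHRR-like", avhrr_like)]:
    slope = np.polyfit(years, series, 1)[0]             # degC per year
    print(f"{name} trend: {slope * 10:+.3f} degC per decade")
```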

  8. Integrated dataset of impact of dissolved organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles

    EPA Pesticide Factsheets

    This dataset was generated to both qualitatively and quantitatively examine the interactions between nano-TiO2 and natural organic matter (NOM). This integrated dataset assembles all data generated in this project through a series of experiments. This dataset is associated with the following publication: Li, S., H. Ma, L. Wallis, M. Etterson, B. Riley, D. Hoff, and S. Diamond. Impact of natural organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 542: 324-333, (2016).

  9. Partition dataset according to amino acid type improves the prediction of deleterious non-synonymous SNPs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yang, Jing; Li, Yuan-Yuan; Shanghai Center for Bioinformation Technology, Shanghai 200235

    2012-03-02

    Highlights: Proper dataset partitioning can improve the prediction of deleterious nsSNPs. Partitioning according to the original residue type at the nsSNP site is a good criterion. A similar strategy is expected to be promising in other machine learning problems. -- Abstract: Many non-synonymous SNPs (nsSNPs) are associated with diseases, and numerous machine learning methods have been applied to train classifiers for sorting disease-associated nsSNPs from neutral ones. The continuously accumulating nsSNP data allow us to further explore better prediction approaches. In this work, we partitioned the training data into 20 subsets according to either the original or the substituted amino acid type at the nsSNP site. Using a support vector machine (SVM), training classification models on each subset resulted in an overall accuracy of 76.3% or 74.9%, depending on the two partition criteria, while training on the whole dataset obtained an accuracy of only 72.6%. Moreover, when the dataset was instead divided randomly into 20 subsets, the corresponding accuracy was only 73.2%. Our results demonstrate that properly partitioning the whole training dataset into subsets, i.e., according to the residue type at the nsSNP site, significantly improves the performance of the trained classifiers, which should be valuable in developing better tools for predicting the disease association of nsSNPs.
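
    A schematic sketch of the partitioning idea follows; it is not the paper's code. One SVM is trained per original-residue subset and each new variant is routed to its subset model. The features and labels are synthetic placeholders.

```python
# Sketch: per-residue-type SVM models for nsSNP classification.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")                      # 20 possible original residues

def make_subset(n=200, n_features=10):
    X = rng.normal(size=(n, n_features))                        # stand-in conservation/structure features
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n)) > 0   # stand-in disease labels
    return X, y.astype(int)

# One classifier per residue-type subset
subset_models = {}
for aa in amino_acids:
    X, y = make_subset()
    subset_models[aa] = SVC(kernel="rbf", C=1.0).fit(X, y)

# Prediction routes a new nsSNP to the model for its original residue
new_variant_features = rng.normal(size=(1, 10))
original_residue = "R"
prediction = subset_models[original_residue].predict(new_variant_features)[0]
print(f"nsSNP with original residue {original_residue} predicted as:",
      "deleterious" if prediction else "neutral")
```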

  10. Emory University: High-Throughput Protein-Protein Interaction Dataset for Lung Cancer-Associated Genes | Office of Cancer Genomics

    Cancer.gov

    To discover novel PPI signaling hubs for lung cancer, the CTD2 Center at Emory utilized large-scale genomics datasets and the literature to compile a set of lung cancer-associated genes. A library of expression vectors was generated for these genes and used for detecting pairwise PPIs with cell lysate-based TR-FRET assays in a high-throughput screening format.

  11. Methodology to develop crash modification functions for road safety treatments with fully specified and hierarchical models.

    PubMed

    Chen, Yongsheng; Persaud, Bhagwant

    2014-09-01

    Crash modification factors (CMFs) for road safety treatments are developed as multiplicative factors that are used to reflect the expected changes in safety performance associated with changes in highway design and/or the traffic control features. However, current CMFs have methodological drawbacks. For example, variability with application circumstance is not well understood, and, as importantly, correlation is not addressed when several CMFs are applied multiplicatively. These issues can be addressed by developing safety performance functions (SPFs) with components of crash modification functions (CM-Functions), an approach that includes all CMF related variables, along with others, while capturing quantitative and other effects of factors and accounting for cross-factor correlations. CM-Functions can capture the safety impact of factors through a continuous and quantitative approach, avoiding the problematic categorical analysis that is often used to capture CMF variability. There are two formulations to develop such SPFs with CM-Function components - fully specified models and hierarchical models. Based on sample datasets from two Canadian cities, both approaches are investigated in this paper. While both model formulations yielded promising results and reasonable CM-Functions, the hierarchical model was found to be more suitable in retaining homogeneity of first-level SPFs, while addressing CM-Functions in sub-level modeling. In addition, hierarchical models better capture the correlations between different impact factors. Copyright © 2014 Elsevier Ltd. All rights reserved.
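
    A simplified sketch of the general idea of an SPF with a CM-Function component follows: crash counts are modelled as exp(b0 + b1 ln(AADT) + b2 lane_width), so exp(b2 (w - w_ref)) acts as a continuous crash modification function of lane width. This is not the paper's model structure or estimates; the synthetic data, variables, and reference width are assumptions.

```python
# Sketch: Poisson safety performance function with a continuous CM-Function term.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(4)
n_sites = 800
aadt = rng.uniform(2000, 40000, n_sites)                   # traffic volume
lane_width = rng.uniform(2.7, 3.7, n_sites)                # metres
mu = np.exp(-6.0 + 0.8 * np.log(aadt) - 0.9 * (lane_width - 3.5))
crashes = rng.poisson(mu)                                  # synthetic crash counts

X = np.column_stack([np.log(aadt), lane_width])
model = PoissonRegressor(alpha=0.0, max_iter=1000).fit(X, crashes)

b_lane = model.coef_[1]
def cm_function(width, ref_width=3.5):
    """Continuous crash modification function relative to a reference lane width."""
    return np.exp(b_lane * (width - ref_width))

print("estimated lane-width coefficient:", round(b_lane, 3))
print("CM-Function at 3.0 m vs 3.5 m reference:", round(float(cm_function(3.0)), 2))
```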

  12. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer

    PubMed Central

    Shen, Yi; Wang, Zhanwei; Loo, Lenora WM; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A.; Katsaros, Dionyssios; Yu, Herbert

    2015-01-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized RNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings of high expression of LINC00472 associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482

  13. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer.

    PubMed

    Shen, Yi; Wang, Zhanwei; Loo, Lenora W M; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A; Katsaros, Dionyssios; Yu, Herbert

    2015-12-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized RNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings of high expression of LINC00472 associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management.

  14. Dynamic association rules for gene expression data analysis.

    PubMed

    Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung

    2015-10-14

    The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profiles has been used to determine whether the induction/repression of genes corresponds to phenotypic variations, with applications including cell regulation, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of leukemia patients, the Microarray Quality Control (MAQC) dataset and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of leukemia patients was conducted. We developed a statistical way, based on the concept of a confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in a single step. The DAR algorithm was then developed for gene expression data analysis. Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence intervals and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
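
    To convey the statistical flavour of the approach, the rough sketch below computes support and confidence for a rule "gene A up => gene B up" and a one-sided lower bound on the rule confidence; it is not the published DAR code, and the thresholds and normal-approximation bound are assumptions.

```python
# Sketch: support, confidence, and a one-sided lower confidence bound for a rule.
import numpy as np
from math import sqrt

rng = np.random.default_rng(6)
n_samples = 100
gene_a_up = rng.random(n_samples) < 0.4
gene_b_up = np.where(gene_a_up, rng.random(n_samples) < 0.8, rng.random(n_samples) < 0.3)

both = np.sum(gene_a_up & gene_b_up)
n_a = np.sum(gene_a_up)
support = both / n_samples
confidence = both / max(n_a, 1)

# One-sided (lower) normal-approximation bound on the rule confidence
z = 1.645                                        # 95% one-sided
lower_conf = confidence - z * sqrt(confidence * (1 - confidence) / max(n_a, 1))

min_support, min_confidence = 0.1, 0.5           # illustrative thresholds
print(f"support={support:.2f}, confidence={confidence:.2f}, lower bound={lower_conf:.2f}")
print("rule retained" if support >= min_support and lower_conf >= min_confidence else "rule rejected")
```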

  15. A correlative imaging based methodology for accurate quantitative assessment of bone formation in additive manufactured implants.

    PubMed

    Geng, Hua; Todd, Naomi M; Devlin-Mullin, Aine; Poologasundarampillai, Gowsihan; Kim, Taek Bo; Madi, Kamel; Cartmell, Sarah; Mitchell, Christopher A; Jones, Julian R; Lee, Peter D

    2016-06-01

    A correlative imaging methodology was developed to accurately quantify bone formation in the complex lattice structure of additively manufactured implants. Micro computed tomography (μCT) and histomorphometry were combined, integrating the best features from both, while demonstrating the limitations of each imaging modality. This semi-automatic methodology registered each modality using a coarse-graining technique to speed the registration of 2D histology sections to high-resolution 3D μCT datasets. Once registered, histomorphometric qualitative and quantitative bone descriptors were directly correlated to 3D quantitative bone descriptors, such as bone ingrowth and bone contact. The correlative imaging allowed the significant volumetric shrinkage of histology sections to be quantified for the first time (~15%). This technique also highlighted the importance of the location of the histological section, showing that an offset of up to 30% can be introduced. The results were used to quantitatively demonstrate the effectiveness of 3D printed titanium lattice implants.

  16. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA

    PubMed Central

    Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.

    2018-01-01

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest. PMID:29461513

  17. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA.

    PubMed

    Jiang, Yueyang; Kim, John B; Still, Christopher J; Kerns, Becky K; Kline, Jeffrey D; Cunningham, Patrick G

    2018-02-20

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest.

  18. Birth-death prior on phylogeny and speed dating

    PubMed Central

    2008-01-01

    Background In recent years there has been a trend of leaving the strict molecular clock in order to infer dating of speciations and other evolutionary events. Explicit modeling of substitution rates and divergence times makes formulation of informative prior distributions for branch lengths possible. Models with birth-death priors on tree branching and auto-correlated or iid substitution rates among lineages have been proposed, enabling simultaneous inference of substitution rates and divergence times. This problem has, however, mainly been analysed in the Markov chain Monte Carlo (MCMC) framework, an approach requiring computation times of hours or days when applied to large phylogenies. Results We demonstrate that a hill-climbing maximum a posteriori (MAP) adaptation of the MCMC scheme results in considerable gain in computational efficiency. We demonstrate also that a novel dynamic programming (DP) algorithm for branch length factorization, useful both in the hill-climbing and in the MCMC setting, further reduces computation time. For the problem of inferring rate and time parameters on a fixed tree, we perform simulations, comparisons between hill-climbing and MCMC on a plant rbcL gene dataset, and dating analysis on an animal mtDNA dataset, showing that our methodology enables efficient, highly accurate analysis of very large trees. Datasets requiring a computation time of several days with MCMC can with our MAP algorithm be accurately analysed in less than a minute. From the results of our example analyses, we conclude that our methodology generally avoids getting trapped early in local optima. For the cases where this nevertheless can be a problem, for instance when we in addition to the parameters also infer the tree topology, we show that the problem can be evaded by using a simulated-annealing-like (SAL) method in which we favour tree swaps early in the inference while biasing our focus towards rate and time parameter changes later on. Conclusion Our contribution leaves the field open for fast and accurate dating analysis of nucleotide sequence data. Modeling branch substitution rates and divergence times separately allows us to include birth-death priors on the times without the assumption of a molecular clock. The methodology is easily adapted to take data from fossil records into account and it can be used together with a broad range of rate and substitution models. PMID:18318893
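
    The snippet below is only a generic illustration of the hill-climbing / simulated-annealing-like search idea on a toy two-parameter posterior (a rate and a divergence time); it is not the authors' MAP algorithm, and it omits the dynamic-programming branch-length factorization entirely.

```python
# Sketch: stochastic hill climbing with an early, temperature-controlled tolerance
# for worse moves (SAL-style), converging to a MAP-style point estimate.
import numpy as np

rng = np.random.default_rng(8)

def log_posterior(rate, time):
    """Toy unimodal log-posterior with a weak rate-time trade-off."""
    return -((rate * time - 1.0) ** 2) / 0.02 - ((rate - 0.5) ** 2) / 0.5

rate, time = 2.0, 2.0                                     # deliberately poor starting point
current = log_posterior(rate, time)
for step in range(5000):
    temperature = max(1.0 - step / 2500, 0.0)             # tolerant early, greedy later
    cand_rate = rate + rng.normal(0, 0.05)
    cand_time = time + rng.normal(0, 0.05)
    cand = log_posterior(cand_rate, cand_time)
    accept = cand > current or (
        temperature > 0 and rng.random() < np.exp((cand - current) / max(temperature, 1e-9))
    )
    if accept:
        rate, time, current = cand_rate, cand_time, cand

print(f"MAP-style estimate: rate={rate:.3f}, time={time:.3f}, log posterior={current:.3f}")
```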

  19. Birth-death prior on phylogeny and speed dating.

    PubMed

    Akerborg, Orjan; Sennblad, Bengt; Lagergren, Jens

    2008-03-04

    In recent years there has been a trend of leaving the strict molecular clock in order to infer dating of speciations and other evolutionary events. Explicit modeling of substitution rates and divergence times makes formulation of informative prior distributions for branch lengths possible. Models with birth-death priors on tree branching and auto-correlated or iid substitution rates among lineages have been proposed, enabling simultaneous inference of substitution rates and divergence times. This problem has, however, mainly been analysed in the Markov chain Monte Carlo (MCMC) framework, an approach requiring computation times of hours or days when applied to large phylogenies. We demonstrate that a hill-climbing maximum a posteriori (MAP) adaptation of the MCMC scheme results in considerable gain in computational efficiency. We demonstrate also that a novel dynamic programming (DP) algorithm for branch length factorization, useful both in the hill-climbing and in the MCMC setting, further reduces computation time. For the problem of inferring rate and time parameters on a fixed tree, we perform simulations, comparisons between hill-climbing and MCMC on a plant rbcL gene dataset, and dating analysis on an animal mtDNA dataset, showing that our methodology enables efficient, highly accurate analysis of very large trees. Datasets requiring a computation time of several days with MCMC can with our MAP algorithm be accurately analysed in less than a minute. From the results of our example analyses, we conclude that our methodology generally avoids getting trapped early in local optima. For the cases where this nevertheless can be a problem, for instance when we in addition to the parameters also infer the tree topology, we show that the problem can be evaded by using a simulated-annealing-like (SAL) method in which we favour tree swaps early in the inference while biasing our focus towards rate and time parameter changes later on. Our contribution leaves the field open for fast and accurate dating analysis of nucleotide sequence data. Modeling branch substitution rates and divergence times separately allows us to include birth-death priors on the times without the assumption of a molecular clock. The methodology is easily adapted to take data from fossil records into account and it can be used together with a broad range of rate and substitution models.

  20. Calculating Effective Elastic Properties of Berea Sandstone Using Segmentation-less Method without Targets

    NASA Astrophysics Data System (ADS)

    Ikeda, K.; Goldfarb, E. J.; Tisato, N.

    2017-12-01

    Digital rock physics (DRP) allows common laboratory experiments to be performed on numerical models to estimate, for example, rock hydraulic permeability. The standard procedure of DRP involves turning a rock sample into a numerical array using X-ray micro computed tomography (micro-CT). Each element of the array bears a value proportional to the X-ray attenuation of the rock at that element (voxel). However, the traditional DRP methodology, which includes segmentation, over-predicts rock moduli by significant amounts (e.g., 100%). Recently, a new methodology - the segmentation-less approach - has been proposed, leading to more accurate DRP estimates of elastic moduli. This new method is based on homogenization theory. Typically, the segmentation-less approach requires calibration points from objects of known density, known as targets. Not all micro-CT datasets have these reference points. Here, we describe how we perform segmentation- and target-less DRP to estimate the elastic properties of rocks (i.e., elastic moduli), which are crucial parameters for subsurface modeling. We calculate the elastic properties of a Berea sandstone sample that was scanned at a resolution of 40 microns per voxel. We transformed the CT images into density matrices using a polynomial fitting curve with four calibration points: the whole rock, the center of quartz grains, the center of iron oxide grains, and the center of air-filled volumes. The first calibration point is obtained by assigning the density of the whole rock to the average of all CT-numbers in the dataset. Then, we locate the center of each phase by finding local extremum points in the dataset. The average CT-numbers of these center points are assigned densities equal to those of either the pristine minerals (quartz and iron oxide) or air. Next, density matrices are transformed to porosity and moduli matrices by means of an effective medium theory. Finally, effective static bulk and shear moduli are numerically calculated using a Matlab code derived from the elas3D NIST code. The calculated quasi-static P- and S-wave speeds overestimate the laboratory results by 37% and 5%, respectively. In fact, our approach predicts wave speeds more accurately than traditional DRP methods. Nevertheless, the presented methodology needs to be further investigated and improved.
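
    The sketch below illustrates only the target-less calibration step under stated assumptions: a polynomial is fit from CT numbers to density using internally derived calibration points (whole-rock average, mineral-grain centers, air), then density is converted to porosity. All CT numbers, densities, and the toy volume are invented for the example; the subsequent effective-medium and elas3D moduli calculations are not shown.

```python
# Sketch: CT-number -> density calibration and a density -> porosity conversion.
import numpy as np

# Calibration points: (mean CT number of the region, assigned density in g/cm3)
calibration = np.array([
    [1800.0, 2.20],    # whole-rock average CT number -> measured bulk density
    [2100.0, 2.65],    # centers of quartz grains -> pristine quartz density
    [3500.0, 5.25],    # centers of iron-oxide grains -> hematite density
    [ 300.0, 0.0012],  # centers of air-filled volumes -> air density
])
coeffs = np.polyfit(calibration[:, 0], calibration[:, 1], deg=3)   # cubic fit through 4 points

rng = np.random.default_rng(12)
ct_volume = rng.normal(1800, 400, size=(64, 64, 64)).clip(300, 3500)   # toy CT dataset
density = np.polyval(coeffs, ct_volume)

rho_grain, rho_fluid = 2.65, 0.0012
porosity = np.clip((rho_grain - density) / (rho_grain - rho_fluid), 0, 1)
print(f"mean density={density.mean():.2f} g/cm3, mean porosity={porosity.mean():.2%}")
```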

  1. An early illness recognition framework using a temporal Smith Waterman algorithm and NLP.

    PubMed

    Hajihashemi, Zahra; Popescu, Mihail

    2013-01-01

    In this paper we propose a framework for detecting health patterns based on non-wearable sensor sequence similarity and natural language processing (NLP). In TigerPlace, an aging-in-place facility in Columbia, MO, we deployed 47 sensor networks together with a nursing electronic health record (EHR) system to provide early illness recognition. The proposed framework utilizes sensor sequence similarity and NLP on EHR nursing comments to automatically notify the physician when health problems are detected. The reported methodology is inspired by genomic sequence annotation using similarity algorithms such as Smith-Waterman (SW). Similarly, for each sensor sequence, we associate health concepts extracted from the nursing notes using MetaMap, an NLP tool provided by the Unified Medical Language System (UMLS). Since sensor sequences, unlike genomic ones, have an associated time dimension, we propose a temporal variant of SW (TSW) to account for time. The main challenges presented by our framework are finding the most suitable time-sequence similarity measure and aggregating the retrieved UMLS concepts. On a pilot dataset from three TigerPlace residents, with a total of 1685 sensor days and 626 nursing records, we obtained an average precision of 0.64 and a recall of 0.37.
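
    A simplified sketch of one way a temporal Smith-Waterman variant could look is given below: classic local alignment of two sensor-event sequences, with the match score down-weighted when the time gaps preceding the matched events differ. The scoring constants and the time-penalty form are assumptions, not the TSW parameters used in the paper.

```python
# Sketch: local alignment (Smith-Waterman) of sensor-event sequences with a
# time-difference penalty on matches.
import numpy as np

def temporal_sw(seq_a, seq_b, match=2.0, mismatch=-1.0, gap=-1.0, time_weight=0.1):
    """seq_a/seq_b: lists of (event_label, hours_since_previous_event)."""
    n, m = len(seq_a), len(seq_b)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (ev_a, dt_a), (ev_b, dt_b) = seq_a[i - 1], seq_b[j - 1]
            if ev_a == ev_b:
                s = match - time_weight * abs(dt_a - dt_b)   # temporal penalty on matches
            else:
                s = mismatch
            H[i, j] = max(0.0, H[i - 1, j - 1] + s, H[i - 1, j] + gap, H[i, j - 1] + gap)
    return H.max()

day_1 = [("bed_restless", 0), ("bathroom", 2), ("kitchen", 6), ("bathroom", 1)]
day_2 = [("bed_restless", 0), ("bathroom", 3), ("bathroom", 1), ("living_room", 5)]
print("temporal local-alignment score:", temporal_sw(day_1, day_2))
```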

  2. Enhanced spatio-temporal alignment of plantar pressure image sequences using B-splines.

    PubMed

    Oliveira, Francisco P M; Tavares, João Manuel R S

    2013-03-01

    This article presents an enhanced methodology to align plantar pressure image sequences simultaneously in time and space. The temporal alignment of the sequences is accomplished using B-splines in the time modeling, and the spatial alignment can be attained using several geometric transformation models. The methodology was tested on a dataset of 156 real plantar pressure image sequences (3 sequences for each foot of the 26 subjects) that was acquired using a common commercial plate during barefoot walking. In the alignment of image sequences that were synthetically deformed both in time and space, an outstanding accuracy was achieved with the cubic B-splines. This accuracy was significantly better (p < 0.001) than the one obtained using the best solution proposed in our previous work. When applied to align real image sequences with unknown transformations involved, the alignment based on cubic B-splines also achieved superior results to our previous methodology (p < 0.001). The consequences of the temporal alignment on the dynamic center of pressure (COP) displacement were also assessed by computing the intraclass correlation coefficients (ICC) before and after the temporal alignment of the three image sequence trials of each foot of the associated subject at six time instants. The results showed that, generally, the ICCs related to the medio-lateral COP displacement were greater when the sequences were temporally aligned than the ICCs of the original sequences. Based on the experimental findings, one can conclude that the cubic B-splines are a remarkable solution for the temporal alignment of plantar pressure image sequences. These findings also show that the temporal alignment can increase the consistency of the COP displacement on related acquired plantar pressure image sequences.

  3. The direct and indirect costs of both overweight and obesity: a systematic review

    PubMed Central

    2014-01-01

    Background The rising prevalence of overweight and obesity places a financial burden on health services and on the wider economy. Health service and societal costs of overweight and obesity are typically estimated by top-down approaches which derive population attributable fractions for a range of conditions associated with increased body fat, or by bottom-up methods based on analyses of cross-sectional or longitudinal datasets. The evidence base of cost of obesity studies is continually expanding; however, the scope of these studies varies widely, and a lack of standardised methods limits comparisons nationally and internationally. The objective of this review is to contribute to this knowledge pool by examining direct costs and indirect (lost productivity) costs of both overweight and obesity to provide comparable estimates. This review was undertaken as part of the introductory work for the Irish cost of overweight and obesity study and examines inconsistencies in the methodologies of cost of overweight and obesity studies. Studies which evaluated the direct costs and indirect costs of both overweight and obesity were included. Methods A computerised search of English language studies addressing direct and indirect costs of overweight and obesity in adults between 2001 and 2011 was conducted. Reference lists of reports, articles and earlier reviews were scanned to identify additional studies. Results Five published articles were deemed eligible for inclusion. Despite the limited scope of this review, there was considerable heterogeneity in methodological approaches and findings. In the four studies which presented separate estimates for direct and indirect costs of overweight and obesity, the indirect costs were higher, accounting for between 54% and 59% of the estimated total costs. Conclusion A gradient exists between increasing BMI and direct healthcare costs and indirect costs due to reduced productivity and early premature mortality. Determining precise estimates of these increases is hampered by the considerable heterogeneity of the available cost-estimation literature. To improve the availability of quality evidence, an international consensus on standardised methods for cost of obesity studies is warranted. Analyses of nationally representative cross-sectional datasets augmented by data from primary care are likely to provide the best data for international comparisons. PMID:24739239

  4. The direct and indirect costs of both overweight and obesity: a systematic review.

    PubMed

    Dee, Anne; Kearns, Karen; O'Neill, Ciaran; Sharp, Linda; Staines, Anthony; O'Dwyer, Victoria; Fitzgerald, Sarah; Perry, Ivan J

    2014-04-16

    The rising prevalence of overweight and obesity places a financial burden on health services and on the wider economy. Health service and societal costs of overweight and obesity are typically estimated by top-down approaches which derive population attributable fractions for a range of conditions associated with increased body fat, or by bottom-up methods based on analyses of cross-sectional or longitudinal datasets. The evidence base of cost of obesity studies is continually expanding; however, the scope of these studies varies widely, and a lack of standardised methods limits comparisons nationally and internationally. The objective of this review is to contribute to this knowledge pool by examining direct costs and indirect (lost productivity) costs of both overweight and obesity to provide comparable estimates. This review was undertaken as part of the introductory work for the Irish cost of overweight and obesity study and examines inconsistencies in the methodologies of cost of overweight and obesity studies. Studies which evaluated the direct costs and indirect costs of both overweight and obesity were included. A computerised search of English language studies addressing direct and indirect costs of overweight and obesity in adults between 2001 and 2011 was conducted. Reference lists of reports, articles and earlier reviews were scanned to identify additional studies. Five published articles were deemed eligible for inclusion. Despite the limited scope of this review, there was considerable heterogeneity in methodological approaches and findings. In the four studies which presented separate estimates for direct and indirect costs of overweight and obesity, the indirect costs were higher, accounting for between 54% and 59% of the estimated total costs. A gradient exists between increasing BMI and direct healthcare costs and indirect costs due to reduced productivity and early premature mortality. Determining precise estimates of these increases is hampered by the considerable heterogeneity of the available cost-estimation literature. To improve the availability of quality evidence, an international consensus on standardised methods for cost of obesity studies is warranted. Analyses of nationally representative cross-sectional datasets augmented by data from primary care are likely to provide the best data for international comparisons.

  5. Detection of genomic loci associated with environmental variables using generalized linear mixed models.

    PubMed

    Lobréaux, Stéphane; Melodelima, Christelle

    2015-02-01

    We tested the use of Generalized Linear Mixed Models to detect associations between genetic loci and environmental variables, taking into account the population structure of sampled individuals. We used a simulation approach to generate datasets under demographically and selectively explicit models. These datasets were used to analyze and optimize the capacity of GLMMs to detect associations between markers and selection coefficients used as environmental data, in terms of false and true positive rates. Different sampling strategies were tested, maximizing the number of populations sampled, sites sampled per population, or individuals sampled per site, and the effect of different selective intensities on the efficiency of the method was determined. Finally, we applied these models to an Arabidopsis thaliana SNP dataset from different accessions, looking for loci associated with minimum spring temperature. We identified 25 regions that exhibit unusual correlations with the climatic variable and contain genes with functions related to temperature stress. Copyright © 2014 Elsevier Inc. All rights reserved.
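
    A minimal sketch of an environmental-association scan with a random population effect is shown below, using a linear mixed model from statsmodels as a stand-in for the GLMMs described above. The column names, the simulated data and the linear approximation for a 0/1 allele response are assumptions for illustration, not the authors' model specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "population": rng.integers(0, 10, n),            # sampled population id
    "min_spring_temp": rng.normal(5.0, 3.0, n),      # environmental variable
})
# Simulate one marker whose allele frequency tracks the environment.
p = 1.0 / (1.0 + np.exp(-(df["min_spring_temp"] - 5.0)))
df["allele"] = rng.binomial(1, p)

# Random intercept per population; fixed effect of the environmental variable.
model = smf.mixedlm("allele ~ min_spring_temp", df, groups=df["population"])
fit = model.fit()
print(fit.params["min_spring_temp"], fit.pvalues["min_spring_temp"])
```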

  6. Multivariate Bayesian analysis of Gaussian, right censored Gaussian, ordered categorical and binary traits using Gibbs sampling

    PubMed Central

    Korsgaard, Inge Riis; Lund, Mogens Sandø; Sorensen, Daniel; Gianola, Daniel; Madsen, Per; Jensen, Just

    2003-01-01

    A fully Bayesian analysis using Gibbs sampling and data augmentation in a multivariate model of Gaussian, right censored, and grouped Gaussian traits is described. The grouped Gaussian traits are either ordered categorical traits (with more than two categories) or binary traits, where the grouping is determined via thresholds on the underlying Gaussian scale, the liability scale. Allowances are made for unequal models, unknown covariance matrices and missing data. After the theory is outlined, strategies for implementation are reviewed. These include joint sampling of location parameters; efficient sampling from the fully conditional posterior distribution of augmented data, a multivariate truncated normal distribution; and sampling from the conditional inverse Wishart distribution, the fully conditional posterior distribution of the residual covariance matrix. Finally, a simulated dataset was analysed to illustrate the methodology. This paper concentrates on a model where residuals associated with liabilities of the binary traits are assumed to be independent. A Bayesian analysis using Gibbs sampling is outlined for the model where this assumption is relaxed. PMID:12633531

  7. Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations.

    PubMed

    Martínez-Romero, Marcos; O'Connor, Martin J; Shankar, Ravi D; Panahiazar, Maryam; Willrett, Debra; Egyedi, Attila L; Gevaert, Olivier; Graybeal, John; Musen, Mark A

    2017-01-01

    In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.

  8. Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations

    PubMed Central

    Martínez-Romero, Marcos; O’Connor, Martin J.; Shankar, Ravi D.; Panahiazar, Maryam; Willrett, Debra; Egyedi, Attila L.; Gevaert, Olivier; Graybeal, John; Musen, Mark A.

    2017-01-01

    In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository. PMID:29854196

  9. Clinical diabetes research using data mining: a Canadian perspective.

    PubMed

    Shah, Baiju R; Lipscombe, Lorraine L

    2015-06-01

    With the advent of the digitization of large amounts of information and the computer power capable of analyzing this volume of information, data mining is increasingly being applied to medical research. Datasets created for administration of the healthcare system provide a wealth of information from different healthcare sectors, and Canadian provinces' single-payer universal healthcare systems mean that data are more comprehensive and complete in this country than in many other jurisdictions. The increasing ability to also link clinical information, such as electronic medical records, laboratory test results and disease registries, has broadened the types of data available for analysis. Data-mining methods have been used in many different areas of diabetes clinical research, including classic epidemiology, effectiveness research, population health and health services research. Although methodologic challenges and privacy concerns remain important barriers to using these techniques, data mining remains a powerful tool for clinical research. Copyright © 2015 Canadian Diabetes Association. Published by Elsevier Inc. All rights reserved.

  10. Disentangling the role of floral sensory stimuli in pollination networks.

    PubMed

    Kantsa, Aphrodite; Raguso, Robert A; Dyer, Adrian G; Olesen, Jens M; Tscheulin, Thomas; Petanidou, Theodora

    2018-03-12

    Despite progress in understanding pollination network structure, the functional roles of floral sensory stimuli (visual, olfactory) have never been addressed comprehensively in a community context, even though such traits are known to mediate plant-pollinator interactions. Here, we use a comprehensive dataset of floral traits and a novel dynamic data-pooling methodology to explore the impacts of floral sensory diversity on the structure of a pollination network in a Mediterranean scrubland. Our approach tracks transitions in the network behaviour of each plant species throughout its flowering period and, despite dynamism in visitor composition, reveals significant links to floral scent, and/or colour as perceived by pollinators. Having accounted for floral phenology, abundance and phylogeny, the persistent association between floral sensory traits and visitor guilds supports a deeper role for sensory bias and diffuse coevolution in structuring plant-pollinator networks. This knowledge of floral sensory diversity, by identifying the most influential phenotypes, could help prioritize efforts for plant-pollinator community restoration.

  11. Time series modelling of global mean temperature for managerial decision-making.

    PubMed

    Romilly, Peter

    2005-07-01

    Climate change has important implications for business and economic activity. Effective management of climate change impacts will depend on the availability of accurate and cost-effective forecasts. This paper uses univariate time series techniques to model the properties of a global mean temperature dataset in order to develop a parsimonious forecasting model for managerial decision-making over the short-term horizon. Although the model is estimated on global temperature data, the methodology could also be applied to temperature data at more localised levels. The statistical techniques include seasonal and non-seasonal unit root testing with and without structural breaks, as well as ARIMA and GARCH modelling. A forecasting evaluation shows that the chosen model performs well against rival models. The estimation results confirm the findings of a number of previous studies, namely that global mean temperatures increased significantly throughout the 20th century. The use of GARCH modelling also shows the presence of volatility clustering in the temperature data, and a positive association between volatility and global mean temperature.
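
    The sketch below illustrates the kind of ARIMA plus GARCH fitting sequence described above, applied to a synthetic temperature-anomaly series using statsmodels and the arch package. The model orders and the data are placeholders; the paper's exact specification may differ.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

rng = np.random.default_rng(1)
years = np.arange(1900, 2001)
# Synthetic anomaly series: weak warming trend plus noise (placeholder data).
anomaly = 0.006 * (years - 1900) + rng.normal(0.0, 0.1, years.size)

# Fit a simple ARIMA(1,1,1) to the anomaly series.
arima_fit = ARIMA(anomaly, order=(1, 1, 1)).fit()
print(arima_fit.params)

# Model conditional volatility of the ARIMA residuals with a GARCH(1,1),
# which is one way to check for volatility clustering.
garch_fit = arch_model(arima_fit.resid, vol="GARCH", p=1, q=1).fit(disp="off")
print(garch_fit.params)
```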

  12. Outlier identification and visualization for Pb concentrations in urban soils and its implications for identification of potential contaminated land.

    PubMed

    Zhang, Chaosheng; Tang, Ya; Luo, Lin; Xu, Weilin

    2009-11-01

    Outliers in urban soil geochemical databases may imply potential contaminated land. Different methodologies which can be easily implemented for the identification of global and spatial outliers were applied to Pb concentrations in urban soils of Galway City in Ireland. Because the Pb concentration data are strongly skewed, a Box-Cox transformation was performed prior to further analyses. The graphic methods of histogram and box-and-whisker plot were effective in identifying global outliers at the original scale of the dataset. Spatial outliers could be identified by a local indicator of spatial association (local Moran's I), cross-validation of kriging, and a geographically weighted regression. The spatial locations of outliers were visualised using a geographical information system. Different methods showed generally consistent results, but differences existed. It is suggested that outliers identified by statistical methods should be confirmed and justified using scientific knowledge before they are properly dealt with.
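
    A minimal sketch of the global-outlier step is shown below: Box-Cox transform the skewed concentrations, then flag values outside the box-and-whisker fences. The synthetic Pb concentrations are placeholders, not the Galway City data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pb = rng.lognormal(mean=3.5, sigma=0.6, size=500)        # skewed Pb values (mg/kg)
pb = np.append(pb, [950.0, 1200.0])                      # two suspicious samples

pb_bc, lam = stats.boxcox(pb)                            # Box-Cox transformation
q1, q3 = np.percentile(pb_bc, [25, 75])
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)                # box-and-whisker fences
global_outliers = pb[(pb_bc < fences[0]) | (pb_bc > fences[1])]
print("lambda =", round(lam, 3), "outliers:", np.sort(global_outliers))
```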

  13. A national profile of end-of-life caregiving in the United States

    PubMed Central

    Ornstein, Katherine A.; Kelley, Amy S.; Bollens-Lund, Evan; Wolff, Jennifer L.

    2017-01-01

    Family and friends are the predominant providers of end-of-life (EOL) care. Yet knowledge of the caregiving experience at the EOL has been constrained by a narrow focus on specific diseases or the “primary” caregiver, and by methodological limitations due to reliance on convenience samples or recall biases associated with mortality follow-back study designs. Using prospective, linked nationally representative datasets of Medicare beneficiaries and their caregivers, we found that in 2011, 900,000 older adults at the EOL received support from 2.3 million paid and unpaid caregivers. Nearly 9 in 10 of these caregivers were family members or unpaid. EOL caregivers provided more extensive care and reported more care-related challenges (e.g., physical difficulty) than non-EOL caregivers. EOL challenges were especially prevalent among caregiving spouses. To meet the needs of older adults at the EOL, families and unpaid caregivers must be better recognized and integrated into care delivery, and supportive services must be expanded and made more widely available. PMID:28679804

  14. Claim of solar influence is on thin ice: are 11-year cycle solar minima associated with severe winters in Europe?

    NASA Astrophysics Data System (ADS)

    van Oldenborgh, G. J.; de Laat, A. T. J.; Luterbacher, J.; Ingram, W. J.; Osborn, T. J.

    2013-06-01

    A recent paper in Geophysical Research Letters, ‘Solar influence on winter severity in central Europe’, by Sirocko et al (2012 Geophys. Res. Lett. 39 L16704) claims that ‘weak solar activity is empirically related to extremely cold winter conditions in Europe’ based on analyses of documentary evidence of freezing of the River Rhine in Germany and of the Reanalysis of the Twentieth Century (20C). However, our attempt to reproduce these findings failed. The documentary data appear to be selected subjectively and agree neither with instrumental observations nor with two other reconstructions based on documentary data. None of these datasets shows a significant connection between solar activity and winter severity in Europe beyond a common trend. The analysis by Sirocko et al of the 20C circulation and temperature is inconsistent with their time series analysis. A physically-motivated consistent methodology again fails to support the reported conclusions. We conclude that multiple lines of evidence contradict the findings of Sirocko et al.

  15. Big Data and the Future of Radiology Informatics.

    PubMed

    Kansagra, Akash P; Yu, John-Paul J; Chatterjee, Arindam R; Lenchik, Leon; Chow, Daniel S; Prater, Adam B; Yeh, Jean; Doshi, Ankur M; Hawkins, C Matthew; Heilbrun, Marta E; Smith, Stacy E; Oselkin, Martin; Gupta, Pushpender; Ali, Sayed

    2016-01-01

    Rapid growth in the amount of data that is electronically recorded as part of routine clinical operations has generated great interest in the use of Big Data methodologies to address clinical and research questions. These methods can efficiently analyze and deliver insights from high-volume, high-variety, and high-growth rate datasets generated across the continuum of care, thereby forgoing the time, cost, and effort of more focused and controlled hypothesis-driven research. By virtue of an existing robust information technology infrastructure and years of archived digital data, radiology departments are particularly well positioned to take advantage of emerging Big Data techniques. In this review, we describe four areas in which Big Data is poised to have an immediate impact on radiology practice, research, and operations. In addition, we provide an overview of the Big Data adoption cycle and describe how academic radiology departments can promote Big Data development. Copyright © 2016 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.

  16. Time-dependent efficacy of longitudinal biomarker for clinical endpoint.

    PubMed

    Kolamunnage-Dona, Ruwanthi; Williamson, Paula R

    2018-06-01

    Joint modelling of longitudinal biomarker and event-time processes has gained popularity in recent years because it yields more accurate and precise estimates. Considering this modelling framework, a new methodology for evaluating the time-dependent efficacy of a longitudinal biomarker for a clinical endpoint is proposed in this article. In particular, the proposed model assesses how well longitudinally repeated measurements of a biomarker over various time periods (0,t) distinguish between individuals who develop the disease by time t and individuals who remain disease-free beyond time t. The receiver operating characteristic curve is used to provide the corresponding efficacy summaries at various t based on the association between the longitudinal biomarker trajectory and the risk of the clinical endpoint prior to each time point. The model also allows detection of the time period over which a biomarker should be monitored for its best discriminatory value. The proposed approach is evaluated through simulation and illustrated on the motivating dataset from a prospective observational study of biomarkers to diagnose the onset of sepsis.

  17. Picking ChIP-seq peak detectors for analyzing chromatin modification experiments

    PubMed Central

    Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D.; Kluger, Yuval

    2012-01-01

    Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that, typically, optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development. PMID:22307239
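
    As a hedged illustration of the general idea of joining neighbouring reads into enriched regions and testing them against a background rate, the sketch below merges nearby read positions and keeps regions whose read count exceeds a simple Poisson background. The gap threshold and the background model are assumptions for illustration, not Qeseq's actual algorithm.

```python
import numpy as np
from scipy.stats import poisson

def enriched_regions(read_starts, genome_len, max_gap=200, alpha=1e-3):
    """Merge nearby reads into regions and keep those exceeding a Poisson background."""
    reads = np.sort(np.asarray(read_starts))
    background_rate = reads.size / genome_len            # reads per bp
    regions, start, last = [], reads[0], reads[0]
    for pos in reads[1:]:
        if pos - last > max_gap:                         # a large gap closes the region
            regions.append((start, last))
            start = pos
        last = pos
    regions.append((start, last))

    kept = []
    for lo, hi in regions:
        count = int(np.sum((reads >= lo) & (reads <= hi)))
        expected = background_rate * max(hi - lo, 1)
        if poisson.sf(count - 1, expected) < alpha:      # P(X >= count) under background
            kept.append((int(lo), int(hi), count))
    return kept

reads = np.concatenate([np.random.randint(0, 100_000, 500),
                        np.random.randint(40_000, 40_500, 300)])  # one enriched spot
print(enriched_regions(reads, genome_len=100_000))
```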

  18. Bayesian multiproxy temperature reconstruction with black spruce ring widths and stable isotopes from the northern Quebec taiga

    NASA Astrophysics Data System (ADS)

    Gennaretti, Fabio; Huard, David; Naulier, Maud; Savard, Martine; Bégin, Christian; Arseneault, Dominique; Guiot, Joel

    2017-12-01

    Northeastern North America has very few millennium-long, high-resolution climate proxy records. However, very recently, a new tree-ring dataset suitable for temperature reconstructions over the last millennium was developed in the northern Quebec taiga. This dataset is composed of one δ18O and six ring width chronologies. Until now, these chronologies have only been used in independent temperature reconstructions (from δ18O or ring width) showing some differences. Here, we added to the dataset a δ13C chronology and developed a significantly improved millennium-long multiproxy reconstruction (997-2006 CE) accounting for uncertainties with a Bayesian approach that evaluates the likelihood of each proxy model. We also undertook a methodological sensitivity analysis to assess the different responses of each proxy to abrupt forcings such as strong volcanic eruptions. Ring width showed a larger response to single eruptions and a larger cumulative impact of multiple eruptions during active volcanic periods, δ18O showed intermediate responses, and δ13C was mostly insensitive to volcanic eruptions. We conclude that all reconstructions based on a single proxy can be misleading because of the possible reduced or amplified responses to specific forcing agents.

  19. Picking ChIP-seq peak detectors for analyzing chromatin modification experiments.

    PubMed

    Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D; Kluger, Yuval

    2012-05-01

    Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that, typically, optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development.

  20. Enrichment of OpenStreetMap Data Completeness with Sidewalk Geometries Using Data Mining Techniques.

    PubMed

    Mobasheri, Amin; Huang, Haosheng; Degrossi, Lívia Castro; Zipf, Alexander

    2018-02-08

    Tailored routing and navigation services utilized by wheelchair users require certain information about sidewalk geometries and their attributes to execute efficiently. Except for some minor regions/cities, such detailed information is not present in current versions of crowdsourced mapping databases, including OpenStreetMap. The CAP4Access European project aimed to use (and enrich) OpenStreetMap to make it fit for the purpose of wheelchair routing. In this respect, this study presents a modified methodology based on data mining techniques for constructing sidewalk geometries using multiple GPS traces collected by wheelchair users during an urban travel experiment. The derived sidewalk geometries can be used to enrich OpenStreetMap to support wheelchair routing. The proposed method was applied to a case study in Heidelberg, Germany. The constructed sidewalk geometries were compared to an official reference dataset ("ground truth dataset"). The case study shows that the constructed sidewalk network overlaps with 96% of the official reference dataset. Furthermore, in terms of positional accuracy, a low Root Mean Square Error (RMSE) value (0.93 m) is achieved. The article presents a discussion of the results as well as conclusions and future research directions.

  1. Image-based query-by-example for big databases of galaxy images

    NASA Astrophysics Data System (ADS)

    Shamir, Lior; Kuminski, Evan

    2017-01-01

    Very large astronomical databases containing millions or even billions of galaxy images have been becoming increasingly important tools in astronomy research. However, in many cases the very large size makes it more difficult to analyze these data manually, reinforcing the need for computer algorithms that can automate the data analysis process. An example of such a task is the identification of galaxies of a certain morphology of interest. For instance, if a rare galaxy is identified it is reasonable to expect that more galaxies of similar morphology exist in the database, but it is virtually impossible to manually search these databases to identify such galaxies. Here we describe a computer vision and pattern recognition methodology that receives a galaxy image as input and automatically searches a large dataset of galaxies to return a list of galaxies that are visually similar to the query galaxy. The returned list is not necessarily complete or clean, but it provides a substantial reduction of the original database into a smaller dataset, in which the frequency of objects visually similar to the query galaxy is much higher. Experimental results show that the algorithm can identify rare galaxies such as ring galaxies among datasets of 10,000 astronomical objects.

  2. Progress, pitfalls and parallel universes: a history of insect phylogenetics

    PubMed Central

    Simon, Chris; Yavorskaya, Margarita; Beutel, Rolf G.

    2016-01-01

    The phylogeny of insects has been both extensively studied and vigorously debated for over a century. A relatively accurate deep phylogeny had been produced by 1904. It was not substantially improved in topology until recently when phylogenomics settled many long-standing controversies. Intervening advances came instead through methodological improvement. Early molecular phylogenetic studies (1985–2005), dominated by a few genes, provided datasets that were too small to resolve controversial phylogenetic problems. Adding to the lack of consensus, this period was characterized by a polarization of philosophies, with individuals belonging to either parsimony or maximum-likelihood camps; each largely ignoring the insights of the other. The result was an unfortunate detour in which the few perceived phylogenetic revolutions published by both sides of the philosophical divide were probably erroneous. The size of datasets has been growing exponentially since the mid-1980s accompanied by a wave of confidence that all relationships will soon be known. However, large datasets create new challenges, and a large number of genes does not guarantee reliable results. If history is a guide, then the quality of conclusions will be determined by an improved understanding of both molecular and morphological evolution, and not simply the number of genes analysed. PMID:27558853

  3. Addressing Methodological Challenges in Large Communication Datasets: Collecting and Coding Longitudinal Interactions in Home Hospice Cancer Care

    PubMed Central

    Reblin, Maija; Clayton, Margaret F; John, Kevin K; Ellington, Lee

    2015-01-01

    In this paper, we present strategies for collecting and coding a large longitudinal communication dataset collected across multiple sites, consisting of over 2000 hours of digital audio recordings from approximately 300 families. We describe our methods within the context of implementing a large-scale study of communication during cancer home hospice nurse visits, but this procedure could be adapted to communication datasets across a wide variety of settings. This research is the first study designed to capture home hospice nurse-caregiver communication, a highly understudied location and type of communication event. We present a detailed example protocol encompassing data collection in the home environment, large-scale, multi-site secure data management, the development of theoretically-based communication coding, and strategies for preventing coder drift and ensuring reliability of analyses. Although each of these challenges has the potential to undermine the utility of the data, reliability between coders is often the only issue consistently reported and addressed in the literature. Overall, our approach demonstrates rigor and provides a “how-to” example for managing large, digitally-recorded data sets from collection through analysis. These strategies can inform other large-scale health communication research. PMID:26580414

  4. DNA methylation age is not accelerated in brain or blood of subjects with schizophrenia.

    PubMed

    McKinney, Brandon C; Lin, Huang; Ding, Ying; Lewis, David A; Sweet, Robert A

    2017-10-05

    Individuals with schizophrenia (SZ) exhibit multiple premature age-related phenotypes and die ~20 years prematurely. The accelerated aging hypothesis of SZ has been advanced to explain these observations; it posits that SZ-associated factors accelerate the progressive biological changes associated with normal aging. Testing the hypothesis has been limited by the absence of robust, meaningful, and multi-tissue measures of biological age. Recently, a method was described in which DNA methylation (DNAm) levels at 353 genomic sites are used to produce "DNAm age", an estimate of biological age with advantages over existing measures. We used this method and 3 publicly-available DNAm datasets, 1 from brain and 2 from blood, to test the hypothesis. The brain dataset was composed of data from the dorsolateral prefrontal cortex of 232 non-psychiatric control (NPC) and 195 SZ subjects. Blood dataset #1 was composed of data from whole blood of 304 NPC and 332 SZ subjects, and blood dataset #2 was composed of data from whole blood of 405 NPC and 260 SZ subjects. DNAm age and chronological age correlated strongly (r=0.92-0.95, p<0.0001) in both NPC and SZ subjects in all 3 datasets. DNAm age acceleration did not differ between NPC and SZ subjects in the brain dataset (t=0.52, p=0.60), blood dataset #1 (t=1.51, p=0.13), or blood dataset #2 (t=0.93, p=0.35). Consistent with our previous findings from a smaller study of postmortem brains, our results suggest there is no acceleration of brain or blood aging in SZ and, thus, do not support the accelerated aging hypothesis of SZ. Copyright © 2017 Elsevier B.V. All rights reserved.
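
    A minimal sketch of the age-acceleration comparison described above: regress DNAm age on chronological age, take the residuals as acceleration, and compare groups with a t-test. The data below are simulated placeholders, not the published datasets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
chron_age = rng.uniform(20, 90, 400)
dnam_age = chron_age + rng.normal(0, 4, 400)             # strong correlation, no group effect
group = rng.integers(0, 2, 400)                          # 0 = NPC, 1 = SZ (placeholder labels)

# Age acceleration = residual of DNAm age regressed on chronological age.
slope, intercept = np.polyfit(chron_age, dnam_age, 1)
acceleration = dnam_age - (slope * chron_age + intercept)

r, _ = stats.pearsonr(chron_age, dnam_age)
t, p = stats.ttest_ind(acceleration[group == 1], acceleration[group == 0])
print(f"r = {r:.2f}, t = {t:.2f}, p = {p:.2f}")
```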

  5. GWATCH: a web platform for automated gene association discovery analysis.

    PubMed

    Svitin, Anton; Malov, Sergey; Cherkasov, Nikolay; Geerts, Paul; Rotkevich, Mikhail; Dobrynin, Pavel; Shevchenko, Andrey; Guan, Li; Troyer, Jennifer; Hendrickson, Sher; Dilks, Holli Hutcheson; Oleksyk, Taras K; Donfield, Sharyne; Gomperts, Edward; Jabs, Douglas A; Sezgin, Efe; Van Natta, Mark; Harrigan, P Richard; Brumme, Zabrina L; O'Brien, Stephen J

    2014-01-01

    As genome-wide sequence analyses for complex human disease determinants are expanding, it is increasingly necessary to develop strategies to promote discovery and validation of potential disease-gene associations. Here we present a dynamic web-based platform - GWATCH - that automates and facilitates four steps in genetic epidemiological discovery: 1) Rapid gene association search and discovery analysis of large genome-wide datasets; 2) Expanded visual display of gene associations for genome-wide variants (SNPs, indels, CNVs), including Manhattan plots, 2D and 3D snapshots of any gene region, and a dynamic genome browser illustrating gene association chromosomal regions; 3) Real-time validation/replication of candidate or putative genes suggested from other sources, limiting Bonferroni genome-wide association study (GWAS) penalties; 4) Open data release and sharing by eliminating privacy constraints (The National Human Genome Research Institute (NHGRI) Institutional Review Board (IRB), informed consent, The Health Insurance Portability and Accountability Act (HIPAA) of 1996 etc.) on unabridged results, which allows for open access comparative and meta-analysis. GWATCH is suitable for both GWAS and whole genome sequence association datasets. We illustrate the utility of GWATCH with three large genome-wide association studies for HIV-AIDS resistance genes screened in large multicenter cohorts; however, association datasets from any study can be uploaded and analyzed by GWATCH.

  6. 2013 Workplace and Equal Opportunity Survey of Active Duty Members: Administration, Datasets, and Codebook

    DTIC Science & Technology

    2016-05-01

    and Kroeger (2002) provide details on sampling and weighting. Following the summary of the survey methodology is a description of the survey analysis... description of priority, for the ADDRESS file). At any given time, the current address used corresponded to the address number with the highest priority...types of address updates provided by the postal service. They are detailed below; each includes a description of the processing steps. 1. Postal Non

  7. PuReD-MCL: a graph-based PubMed document clustering methodology.

    PubMed

    Theodosiou, T; Darzentas, N; Angelis, L; Ouzounis, C A

    2008-09-01

    Biomedical literature is the principal repository of biomedical knowledge, with PubMed being the most complete database collecting, organizing and analyzing such textual knowledge. There are numerous efforts that attempt to exploit this information by using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (Pubmed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed. PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources, available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. This process allows users to analyse the results by highlighting important clues, and finally to visualize the clusters and all relevant information using an interactive graph layout algorithm, for instance BioLayout Express 3D. The methodology was applied to two different datasets, previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while at the same time provides additional insights into the clusters and the corresponding documents. Source code in perl and R is available from http://tartara.csd.auth.gr/~theodos/
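
    The sketch below implements a bare-bones version of the Markov Cluster (MCL) procedure that PuReD-MCL builds on: column-normalise the document-similarity graph, then alternate expansion (matrix power) and inflation (element-wise power) until the matrix stabilises. The toy adjacency matrix stands in for PubMed related-document links; MCL's pruning and convergence checks are omitted for brevity.

```python
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iters=50):
    """Simplified MCL: returns lists of document indices grouped by attractor row."""
    M = adjacency + np.eye(adjacency.shape[0])            # add self-loops
    M = M / M.sum(axis=0)                                 # make columns stochastic
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)          # expansion step
        M = M ** inflation                                 # inflation step
        M = M / M.sum(axis=0)
    clusters = {}
    for doc, attractor in enumerate(M.argmax(axis=0)):    # documents sharing an attractor
        clusters.setdefault(int(attractor), []).append(doc)
    return list(clusters.values())

# Toy graph: documents 0-2 related to each other, 3-4 related to each other.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
print(mcl(A))
```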

  8. Epileptic Seizure Forewarning by Nonlinear Techniques

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hively, L.M.

    2002-04-19

    This report describes work that was performed under a Cooperative Research and Development Agreement (CRADA) between UT-Battelle, LLC (Contractor) and a commercial participant, VIASYS Healthcare Inc. (formerly Nicolet Biomedical, Inc.). The Contractor has patented technology that forewarns of impending epileptic events via scalp electroencephalograph (EEG) data and successfully demonstrated this technology on 20 datasets from the Participant under pre-CRADA effort. This CRADA sought to bridge the gap between the Contractor's existing research-class software and a prototype medical device for subsequent commercialization by the Participant. The objectives of this CRADA were (1) development of a combination of existing computer hardware and Contractor-patented software into a clinical process for warning of impending epileptic events in human patients, and (2) validation of the epilepsy warning methodology. This work modified the ORNL research-class FORTRAN for forewarning to run under a graphical user interface (GUI). The GUI-FORTRAN software subsequently was installed on desktop computers at five epilepsy monitoring units. The forewarning prototypes have run for more than one year without any hardware or software failures. This work also reported extensive analysis of model and EEG datasets to demonstrate the usefulness of the methodology. However, the Participant recently chose to stop work on the CRADA, due to a change in business priorities. Much work remains to convert the technology into a commercial clinical or ambulatory device for patient use, as discussed in App. H.

  9. Usage and applications of Semantic Web techniques and technologies to support chemistry research

    PubMed Central

    2014-01-01

    Background The drug discovery process is now highly dependent on the management, curation and integration of large amounts of potentially useful data. Semantics are necessary in order to interpret the information and derive knowledge. Advances in recent years have mitigated concerns that the lack of robust, usable tools has inhibited the adoption of methodologies based on semantics. Results This paper presents three examples of how Semantic Web techniques and technologies can be used in order to support chemistry research: a controlled vocabulary for quantities, units and symbols in physical chemistry; a controlled vocabulary for the classification and labelling of chemical substances and mixtures; and, a database of chemical identifiers. This paper also presents a Web-based service that uses the datasets in order to assist with the completion of risk assessment forms, along with a discussion of the legal implications and value-proposition for the use of such a service. Conclusions We have introduced the Semantic Web concepts, technologies, and methodologies that can be used to support chemistry research, and have demonstrated the application of those techniques in three areas very relevant to modern chemistry research, generating three new datasets that we offer as exemplars of an extensible portfolio of advanced data integration facilities. We have thereby established the importance of Semantic Web techniques and technologies for meeting Wild’s fourth “grand challenge”. PMID:24855494

  10. A deep learning approach for the analysis of masses in mammograms with minimal user intervention.

    PubMed

    Dhungel, Neeraj; Carneiro, Gustavo; Bradley, Andrew P

    2017-04-01

    We present an integrated methodology for detecting, segmenting and classifying breast masses from mammograms with minimal user intervention. This is a long standing problem due to low signal-to-noise ratio in the visualisation of breast masses, combined with their large variability in terms of shape, size, appearance and location. We break the problem down into three stages: mass detection, mass segmentation, and mass classification. For the detection, we propose a cascade of deep learning methods to select hypotheses that are refined based on Bayesian optimisation. For the segmentation, we propose the use of deep structured output learning that is subsequently refined by a level set method. Finally, for the classification, we propose the use of a deep learning classifier, which is pre-trained with a regression to hand-crafted feature values and fine-tuned based on the annotations of the breast mass classification dataset. We test our proposed system on the publicly available INbreast dataset and compare the results with the current state-of-the-art methodologies. This evaluation shows that our system detects 90% of masses at 1 false positive per image, has a segmentation accuracy of around 0.85 (Dice index) on the correctly detected masses, and overall classifies masses as malignant or benign with sensitivity (Se) of 0.98 and specificity (Sp) of 0.7. Copyright © 2017 Elsevier B.V. All rights reserved.

  11. Acoustic Seabed Characterization of the Porcupine Bank, Irish Margin

    NASA Astrophysics Data System (ADS)

    O'Toole, Ronan; Monteys, Xavier

    2010-05-01

    The Porcupine Bank represents a large section of continental shelf situated west of the Irish landmass, located in water depths ranging between 150 and 500m. Under the Irish National Seabed Survey (INSS 1999-2006) this area was comprehensively mapped, generating multiple acoustic datasets including high resolution multibeam echosounder data. The unique nature of the area's datasets in terms of data density, consistency and geographic extent has allowed the development of a large-scale integrated physical characterization of the Porcupine Bank for multidisciplinary applications. Integrated analysis of backscatter and bathymetry data has resulted in a baseline delineation of sediment distribution, seabed geology and geomorphological features on the bank, along with an inclusive set of related database information. The methodology used incorporates a variety of statistical techniques which are necessary in isolating sonar system artefacts and addressing sonar geometry related issues. A number of acoustic backscatter parameters at several angles of incidence have been analysed in order to complement the characterization for both surface and subsurface sediments. Acoustic sub bottom records have also been incorporated in order to investigate the physical characteristics of certain features on the Porcupine Bank. Where available, groundtruthing information in terms of sediment samples, video footage and cores has been applied to add physical descriptors and validation to the characterization. Extensive mapping of different rock outcrops, sediment drifts, seabed features and other geological classes has been achieved using this methodology.

  12. Usage and applications of Semantic Web techniques and technologies to support chemistry research.

    PubMed

    Borkum, Mark I; Frey, Jeremy G

    2014-01-01

    The drug discovery process is now highly dependent on the management, curation and integration of large amounts of potentially useful data. Semantics are necessary in order to interpret the information and derive knowledge. Advances in recent years have mitigated concerns that the lack of robust, usable tools has inhibited the adoption of methodologies based on semantics. This paper presents three examples of how Semantic Web techniques and technologies can be used in order to support chemistry research: a controlled vocabulary for quantities, units and symbols in physical chemistry; a controlled vocabulary for the classification and labelling of chemical substances and mixtures; and, a database of chemical identifiers. This paper also presents a Web-based service that uses the datasets in order to assist with the completion of risk assessment forms, along with a discussion of the legal implications and value-proposition for the use of such a service. We have introduced the Semantic Web concepts, technologies, and methodologies that can be used to support chemistry research, and have demonstrated the application of those techniques in three areas very relevant to modern chemistry research, generating three new datasets that we offer as exemplars of an extensible portfolio of advanced data integration facilities. We have thereby established the importance of Semantic Web techniques and technologies for meeting Wild's fourth "grand challenge".

  13. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction.

    PubMed

    Park, Seong Ho; Han, Kyunghwa

    2018-03-01

    The use of artificial intelligence in medicine is currently an issue of great interest, especially with regard to the diagnostic or predictive analysis of medical images. Adoption of an artificial intelligence tool in clinical practice requires careful confirmation of its clinical utility. Herein, the authors explain key methodology points involved in a clinical evaluation of artificial intelligence technology for use in medicine, especially high-dimensional or overparameterized diagnostic or predictive models in which artificial deep neural networks are used, mainly from the standpoints of clinical epidemiology and biostatistics. First, statistical methods for assessing the discrimination and calibration performances of a diagnostic or predictive model are summarized. Next, the effects of disease manifestation spectrum and disease prevalence on the performance results are explained. This is followed by a discussion of the difference between evaluating performance with internal and external datasets, the importance of using an adequate external dataset obtained from a well-defined clinical cohort to avoid overestimating clinical performance as a result of overfitting in high-dimensional or overparameterized classification models and of spectrum bias, and the essentials for achieving a more robust clinical evaluation. Finally, the authors review the role of clinical trials and observational outcome studies for ultimate clinical verification of diagnostic or predictive artificial intelligence tools through patient outcomes, beyond performance metrics, and how to design such studies. © RSNA, 2018.

  14. SWAT use of gridded observations for simulating runoff - a Vietnam river basin study

    NASA Astrophysics Data System (ADS)

    Vu, M. T.; Raghavan, S. V.; Liong, S. Y.

    2011-12-01

    Many research studies that focus on basin hydrology have used the SWAT model to simulate runoff. One common practice in calibrating the SWAT model is the application of station rainfall data to simulate runoff. Over regions lacking robust station data, however, it is difficult to apply the model to study hydrological responses. For some countries and remote areas, rainfall data availability may be constrained for many reasons, such as a lack of technology, wartime conditions and financial limitations, which make it difficult to construct runoff data. To overcome such a limitation, this research study uses some of the available globally gridded high resolution precipitation datasets to simulate runoff. Five popular gridded observation precipitation datasets: (1) Asian Precipitation Highly Resolved Observational Data Integration Towards the Evaluation of Water Resources (APHRODITE), (2) Tropical Rainfall Measuring Mission (TRMM), (3) Precipitation Estimation from Remote Sensing Information using Artificial Neural Network (PERSIANN), (4) Global Precipitation Climatology Project (GPCP), (5) modified Global Historical Climatology Network version 2 (GHCN2), and one reanalysis dataset, the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis, are used to simulate runoff over the Dakbla River (a small tributary of the Mekong River) in Vietnam. Wherever possible, available station data are also used for comparison. Bilinear interpolation of these gridded datasets is used to input the precipitation data at the closest grid points to the station locations. Sensitivity Analysis and Auto-calibration are performed for the SWAT model. The Nash-Sutcliffe Efficiency (NSE) and Coefficient of Determination (R2) indices are used to benchmark the model performance. This entails a good understanding of the response of the hydrological model to different datasets and a quantification of the uncertainties in these datasets. Such a methodology is also useful for planning rainfall-runoff studies and even reservoir/river management at both rural and urban scales.
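
    A minimal sketch of the model-evaluation step: computing the Nash-Sutcliffe Efficiency (NSE) and the coefficient of determination (R2) between observed and simulated runoff. The runoff values below are placeholders, not data from the Dakbla River study.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of observations around their mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def r_squared(obs, sim):
    """Coefficient of determination as the squared Pearson correlation."""
    r = np.corrcoef(obs, sim)[0, 1]
    return r ** 2

observed = np.array([12.0, 30.5, 55.2, 40.1, 22.3, 15.8])   # m3/s (placeholder)
simulated = np.array([10.5, 33.0, 50.7, 42.9, 25.0, 14.2])
print("NSE =", round(nse(observed, simulated), 3),
      "R2 =", round(r_squared(observed, simulated), 3))
```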

  15. Advances in soil erosion modelling through remote sensing data availability at European scale

    NASA Astrophysics Data System (ADS)

    Panagos, Panos; Karydas, Christos; Borrelli, Pasqualle; Ballabio, Cristiano; Meusburger, Katrin

    2014-08-01

    Under the European Union's Thematic Strategy for Soil Protection, the European Commission's Directorate-General for the Environment (DG Environment) has identified the mitigation of soil losses by erosion as a priority area. Policy makers call for an overall assessment of soil erosion in their geographical area of interest. They have asked that risk areas for soil erosion be mapped under present land use and climate conditions, and that appropriate measures be taken to control erosion within the legal and social context of natural resource management. Remote sensing data help to better assess the factors that control erosion, such as vegetation coverage, slope length and slope angle. In this context, the availability of remote sensing data over the past decade facilitates more precise estimation of soil erosion risk. Following the principles of the Universal Soil Loss Equation (USLE), various options to calculate vegetative cover management (C-factor) have been investigated. The use of the CORINE Land Cover dataset in combination with lookup table values taken from the literature is presented as an option that has the advantage of a coherent input dataset but with the drawback of static input. Recent developments in the Copernicus programme have made detailed datasets available on land cover, leaf area index and base soil characteristics. These dynamic datasets allow for seasonal estimates of vegetation coverage, and their application in the G2 soil erosion model, which represents a recent approach to the seasonal monitoring of soil erosion. The use of phenological datasets and the LUCAS land use/cover survey is proposed as auxiliary information in the selection of the best methodology.
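
    As a hedged illustration of how a land-cover-based lookup of the cover-management factor (C) can feed a USLE-style soil-loss estimate A = R * K * LS * C * P, the sketch below uses placeholder factor values; the class names and numbers are illustrative and are not the literature values used in the work described above.

```python
# Hypothetical C-factor lookup by land-cover class (placeholder values).
C_FACTOR_BY_LAND_COVER = {
    "non-irrigated arable land": 0.20,
    "pastures": 0.05,
    "broad-leaved forest": 0.001,
    "sparsely vegetated areas": 0.30,
}

def usle_soil_loss(r, k, ls, land_cover, p=1.0):
    """Annual soil loss A from the USLE factors: A = R * K * LS * C * P."""
    c = C_FACTOR_BY_LAND_COVER[land_cover]
    return r * k * ls * c * p

# Example cell: rainfall erosivity R, soil erodibility K, slope factor LS (placeholders).
print(usle_soil_loss(r=700.0, k=0.03, ls=1.8, land_cover="pastures"))
```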

  16. Assembling a Protein-Protein Interaction Map of the SSU Processome from Existing Datasets

    PubMed Central

    Baserga, Susan J.

    2011-01-01

    Background The small subunit (SSU) processome is a large ribonucleoprotein complex involved in small ribosomal subunit assembly. It consists of the U3 snoRNA and ∼72 proteins. While most of its components have been identified, the protein-protein interactions (PPIs) among them remain largely unknown, and thus the assembly, architecture and function of the SSU processome remains unclear. Methodology We queried PPI databases for SSU processome proteins to quantify the degree to which the three genome-wide high-throughput yeast two-hybrid (HT-Y2H) studies, the genome-wide protein fragment complementation assay (PCA) and the literature-curated (LC) datasets cover the SSU processome interactome. Conclusions We find that coverage of the SSU processome PPI network is remarkably sparse. Two of the three HT-Y2H studies each account for four and six PPIs between only six of the 72 proteins, while the third study accounts for as little as one PPI and two proteins. The PCA dataset has the highest coverage among the genome-wide studies with 27 PPIs between 25 proteins. The LC dataset was the most extensive, accounting for 34 proteins and 38 PPIs, many of which were validated by independent methods, thereby further increasing their reliability. When the collected data were merged, we found that at least 70% of the predicted PPIs have yet to be determined and 26 proteins (36%) have no known partners. Since the SSU processome is conserved in all Eukaryotes, we also queried HT-Y2H datasets from six additional model organisms, but only four orthologues and three previously known interologous interactions were found. This provides a starting point for further work on SSU processome assembly, and spotlights the need for a more complete genome-wide Y2H analysis. PMID:21423703

  17. Identifying Talent in Youth Sport: A Novel Methodology Using Higher-Dimensional Analysis.

    PubMed

    Till, Kevin; Jones, Ben L; Cobley, Stephen; Morley, David; O'Hara, John; Chapman, Chris; Cooke, Carlton; Beggs, Clive B

    2016-01-01

    Prediction of adult performance from early age talent identification in sport remains difficult. Talent identification research has generally been performed using univariate analysis, which ignores multivariate relationships. To address this issue, this study used a novel higher-dimensional model to orthogonalize multivariate anthropometric and fitness data from junior rugby league players, with the aim of differentiating future career attainment. Anthropometric and fitness data from 257 Under-15 rugby league players were collected. Players were grouped retrospectively according to their future career attainment (i.e., amateur, academy, professional). Players were blindly and randomly divided into an exploratory (n = 165) and validation dataset (n = 92). The exploratory dataset was used to develop and optimize a novel higher-dimensional model, which combined singular value decomposition (SVD) with receiver operating characteristic analysis. Once optimized, the model was tested using the validation dataset. SVD analysis revealed that 60 m sprint and 505 agility performance were the most influential characteristics in distinguishing future professional players from amateur and academy players. The exploratory dataset model was able to distinguish between future amateur and professional players with a high degree of accuracy (sensitivity = 85.7%, specificity = 71.1%; p<0.001), although it could not distinguish between future professional and academy players. The validation dataset model was able to distinguish future professionals from the rest with reasonable accuracy (sensitivity = 83.3%, specificity = 63.8%; p = 0.003). Through the use of SVD analysis it was possible to objectively identify criteria to distinguish future career attainment with a sensitivity over 80% using anthropometric and fitness data alone. As such, this suggests that SVD analysis may be a useful analysis tool for research and practice within talent identification.
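
    A hedged sketch of the higher-dimensional approach described above: orthogonalise the anthropometric and fitness matrix with SVD and evaluate how well the first component separates the two career-outcome groups with a ROC-style summary (AUC). The simulated measurements are placeholders for the real data, and the analysis is simplified relative to the published model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 200
professional = rng.integers(0, 2, n)                              # future career outcome (placeholder)
sprint_60m = 9.0 - 0.4 * professional + rng.normal(0, 0.3, n)     # faster times if professional
agility_505 = 2.6 - 0.1 * professional + rng.normal(0, 0.1, n)
mass = rng.normal(70, 8, n)
X = np.column_stack([sprint_60m, agility_505, mass])

Xc = (X - X.mean(axis=0)) / X.std(axis=0)                         # standardize before SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, 0] * S[0]                                           # first orthogonal component

auc = roc_auc_score(professional, scores)
# The sign of an SVD component is arbitrary, so report the orientation-free AUC.
print("AUC of first SVD component:", round(max(auc, 1 - auc), 3))
```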

  18. Using aerial images for establishing a workflow for the quantification of water management measures

    NASA Astrophysics Data System (ADS)

    Leuschner, Annette; Merz, Christoph; van Gasselt, Stephan; Steidl, Jörg

    2017-04-01

    Quantified landscape characteristics, such as morphology, land use or hydrological conditions, play an important role in hydrological investigations, as landscape parameters directly control the overall water balance. A powerful assimilation and geospatial analysis of remote sensing datasets in combination with hydrological modeling allows landscape parameters and water balances to be quantified efficiently. This study focuses on the development of a workflow to extract hydrologically relevant data from aerial image datasets and derived products in order to allow an effective parametrization of a hydrological model. Consistent and self-contained data sources are indispensable for achieving reasonable modeling results. In order to minimize uncertainties and inconsistencies, input parameters for modeling should, where possible, be extracted mainly from a single remote-sensing dataset. Here, aerial images were chosen because of their high spatial and spectral resolution, which permits the extraction of various model-relevant parameters such as morphology, land use or artificial drainage systems. The methodological repertoire for extracting environmental parameters ranges from analyses of digital terrain models, through multispectral classification and segmentation of land-use distribution maps, to mapping of artificial drainage systems based on spectral and visual inspection. The workflow has been tested for a mesoscale catchment area which forms a characteristic hydrological system of a young moraine landscape located in the state of Brandenburg, Germany. These datasets were used as input for multi-temporal hydrological modelling of water balances to detect and quantify anthropogenic and meteorological impacts. ArcSWAT, a GIS-implemented extension and graphical user interface for the Soil and Water Assessment Tool (SWAT), was chosen. The results of this modeling approach provide the basis for anticipating future development of the hydrological system and for adapting water resource management decisions to system changes.
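    As one illustration of the multispectral classification step, a crude NDVI thresholding of red and near-infrared bands can separate broad land-use classes. The band inputs and thresholds below are assumptions for illustration only, not values used in the study.

        # Illustrative only: NDVI-threshold classification of an aerial image tile
        # into water/shadow, bare or built-up, and vegetated classes.
        import numpy as np

        def classify_ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
            ndvi = (nir - red) / np.clip(nir + red, 1e-6, None)
            classes = np.zeros(ndvi.shape, dtype=np.uint8)   # 0 = water / shadow
            classes[ndvi > 0.1] = 1                          # 1 = bare soil / built-up
            classes[ndvi > 0.4] = 2                          # 2 = vegetation
            return classes

        red = np.random.rand(100, 100).astype(np.float32)    # stand-in band data
        nir = np.random.rand(100, 100).astype(np.float32)
        print(np.bincount(classify_ndvi(red, nir).ravel(), minlength=3))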

  19. Using two on-going HIV studies to obtain clinical data from before, during and after pregnancy for HIV-positive women.

    PubMed

    Huntington, Susie E; Bansi, Loveleen K; Thorne, Claire; Anderson, Jane; Newell, Marie-Louise; Taylor, Graham P; Pillay, Deenan; Hill, Teresa; Tookey, Pat A; Sabin, Caroline A

    2012-07-28

    The UK Collaborative HIV Cohort (UK CHIC) is an observational study that collates data on HIV-positive adults accessing HIV clinical care at (currently) 13 large clinics in the UK but does not collect pregnancy-specific data. The National Study of HIV in Pregnancy and Childhood (NSHPC) collates data on HIV-positive women receiving antenatal care from every maternity unit in the UK and Ireland. Both studies collate pseudonymised data and neither dataset contains unique patient identifiers. A methodology was developed to find and match records for women reported to both studies, thereby obtaining clinical and treatment data on pregnant HIV-positive women not available from either dataset alone. Women in UK CHIC receiving HIV clinical care in 1996-2009 were found in the NSHPC dataset by initially 'linking' records with identical dates of birth; linked records were then accepted as a genuine 'match' if they had further matching fields, including CD4 test date. In total, 2063 women were found in both datasets, representing 23.1% of HIV-positive women with a pregnancy in the UK (n = 8932). Clinical data were available in UK CHIC following most pregnancies (92.0%, 2471/2685 pregnancies starting before 2009). There was bias towards matching women with repeat pregnancies (35.9% (741/2063) of women found in both datasets had a repeat pregnancy compared to 21.9% (1502/6869) of women in NSHPC only) and towards matching women diagnosed with HIV before their first reported pregnancy (54.8% (1131/2063) compared to 47.7% (3278/6869), respectively). Through the use of demographic data and clinical dates, records from two independent studies were successfully matched, providing data not available from either study alone.
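    The two-stage matching logic can be sketched roughly as below, with invented column names and toy rows rather than the actual UK CHIC/NSHPC schemas: candidate pairs are formed on identical dates of birth and then confirmed on a further field such as a CD4 test date.

        # Hedged sketch of deterministic two-stage record linkage with pandas.
        import pandas as pd

        chic = pd.DataFrame({"chic_id": [1, 2],
                             "dob": ["1975-03-02", "1980-11-17"],
                             "cd4_date": ["2005-06-01", "2007-01-15"]})
        nshpc = pd.DataFrame({"nshpc_id": [10, 11],
                              "dob": ["1975-03-02", "1981-05-09"],
                              "cd4_date": ["2005-06-01", "2006-12-03"]})

        # Stage 1: link on identical date of birth.
        candidates = chic.merge(nshpc, on="dob", suffixes=("_chic", "_nshpc"))
        # Stage 2: accept as a match only if a further field also agrees.
        matches = candidates[candidates["cd4_date_chic"] == candidates["cd4_date_nshpc"]]
        print(matches[["chic_id", "nshpc_id"]])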

  20. Identifying Talent in Youth Sport: A Novel Methodology Using Higher-Dimensional Analysis

    PubMed Central

    Till, Kevin; Jones, Ben L.; Cobley, Stephen; Morley, David; O'Hara, John; Chapman, Chris; Cooke, Carlton; Beggs, Clive B.

    2016-01-01

    Prediction of adult performance from early age talent identification in sport remains difficult. Talent identification research has generally been performed using univariate analysis, which ignores multivariate relationships. To address this issue, this study used a novel higher-dimensional model to orthogonalize multivariate anthropometric and fitness data from junior rugby league players, with the aim of differentiating future career attainment. Anthropometric and fitness data from 257 Under-15 rugby league players was collected. Players were grouped retrospectively according to their future career attainment (i.e., amateur, academy, professional). Players were blindly and randomly divided into an exploratory (n = 165) and validation dataset (n = 92). The exploratory dataset was used to develop and optimize a novel higher-dimensional model, which combined singular value decomposition (SVD) with receiver operating characteristic analysis. Once optimized, the model was tested using the validation dataset. SVD analysis revealed 60 m sprint and agility 505 performance were the most influential characteristics in distinguishing future professional players from amateur and academy players. The exploratory dataset model was able to distinguish between future amateur and professional players with a high degree of accuracy (sensitivity = 85.7%, specificity = 71.1%; p<0.001), although it could not distinguish between future professional and academy players. The validation dataset model was able to distinguish future professionals from the rest with reasonable accuracy (sensitivity = 83.3%, specificity = 63.8%; p = 0.003). Through the use of SVD analysis it was possible to objectively identify criteria to distinguish future career attainment with a sensitivity over 80% using anthropometric and fitness data alone. As such, this suggests that SVD analysis may be a useful analysis tool for research and practice within talent identification. PMID:27224653

  1. Investigating Perceptual Biases, Data Reliability, and Data Discovery in a Methodology for Collecting Speech Errors From Audio Recordings.

    PubMed

    Alderete, John; Davies, Monica

    2018-04-01

    This work describes a methodology of collecting speech errors from audio recordings and investigates how some of its assumptions affect data quality and composition. Speech errors of all types (sound, lexical, syntactic, etc.) were collected by eight data collectors from audio recordings of unscripted English speech. Analysis of these errors showed that: (i) different listeners find different errors in the same audio recordings, but (ii) the frequencies of error patterns are similar across listeners; (iii) errors collected "online" using on the spot observational techniques are more likely to be affected by perceptual biases than "offline" errors collected from audio recordings; and (iv) datasets built from audio recordings can be explored and extended in a number of ways that traditional corpus studies cannot be.

  2. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling.

    PubMed

    Mansouri, K; Grulke, C M; Richard, A M; Judson, R S; Williams, A J

    2016-11-01

    The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
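    A rough sketch of one such curation check, assuming RDKit is available (the published workflow is a KNIME pipeline, not this code): parse each record's SMILES, flag unparseable structures, and flag identifiers that map to conflicting canonical structures. The records shown are hypothetical.

        # Sketch only: validate structure-identity pairs and detect identifier conflicts.
        from rdkit import Chem

        records = [
            {"casrn": "50-00-0", "smiles": "C=O"},        # formaldehyde
            {"casrn": "50-00-0", "smiles": "O=C"},        # same structure, different string
            {"casrn": "71-43-2", "smiles": "c1ccccc1("},  # malformed SMILES -> flagged
        ]

        seen = {}
        for rec in records:
            mol = Chem.MolFromSmiles(rec["smiles"])
            if mol is None:
                print(f"{rec['casrn']}: structure fails to parse, needs manual curation")
                continue
            canonical = Chem.MolToSmiles(mol)             # canonical form for comparison
            if rec["casrn"] in seen and seen[rec["casrn"]] != canonical:
                print(f"{rec['casrn']}: identifier maps to conflicting structures")
            seen[rec["casrn"]] = canonical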

  3. Empirical Studies on the Network of Social Groups: The Case of Tencent QQ

    PubMed Central

    You, Zhi-Qiang; Han, Xiao-Pu; Lü, Linyuan; Yeung, Chi Ho

    2015-01-01

    Background Participation in social groups is important, but the collective behavior of humans in groups is difficult to analyze because of the difficulty of quantifying ordinary social relations and group membership and of collecting a comprehensive dataset. Such difficulties can be circumvented by analyzing online social networks. Methodology/Principal Findings In this paper, we analyze a comprehensive dataset released from Tencent QQ, an instant messenger with the highest market share in China. Specifically, we analyze three derivative networks involving groups and their members—the hypergraph of groups, the network of groups and the user network—to reveal social interactions at the microscopic and mesoscopic levels. Conclusions/Significance Our results uncover interesting behaviors in the growth of user groups, the interactions between groups, and their relationship with member age and gender. These findings lead to insights which are difficult to obtain in social networks based on personal contacts. PMID:26176850
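    The derivative networks described above can be illustrated with a toy bipartite membership graph, from which a group-group network is obtained by projection. The memberships below are invented and networkx is assumed to be available; this is only a sketch of the structure, not the study's dataset.

        # Toy sketch: user-group memberships as a bipartite graph, projected onto groups.
        import networkx as nx
        from networkx.algorithms import bipartite

        B = nx.Graph()
        B.add_nodes_from(["u1", "u2", "u3"], kind="user")
        B.add_nodes_from(["gA", "gB", "gC"], kind="group")
        B.add_edges_from([("u1", "gA"), ("u1", "gB"), ("u2", "gB"),
                          ("u2", "gC"), ("u3", "gC")])

        # Groups become linked when they share at least one member; edge weights
        # count the shared members.
        group_net = bipartite.weighted_projected_graph(B, ["gA", "gB", "gC"])
        print(list(group_net.edges(data=True)))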

  4. A large dataset of protein dynamics in the mammalian heart proteome

    PubMed Central

    Lau, Edward; Cao, Quan; Ng, Dominic C.M.; Bleakley, Brian J.; Dincer, T. Umut; Bot, Brian M.; Wang, Ding; Liem, David A.; Lam, Maggie P.Y.; Ge, Junbo; Ping, Peipei

    2016-01-01

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding on mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegenerations. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems. PMID:26977904

  5. Will higher traffic flow lead to more traffic conflicts? A crash surrogate metric based analysis

    PubMed Central

    Kuang, Yan; Yan, Yadan

    2017-01-01

    In this paper, we aim to examine the relationship between traffic flow and potential conflict risks by using crash surrogate metrics. It has been widely recognized that one traffic flow corresponds to two distinct traffic states with different speeds and densities. In view of this, instead of simply aggregating traffic conditions with the same traffic volume, we represent potential conflict risks at a traffic flow fundamental diagram. Two crash surrogate metrics, namely, Aggregated Crash Index and Time to Collision, are used in this study to represent the potential conflict risks with respect to different traffic conditions. Furthermore, Beijing North Ring III and Next Generation SIMulation Interstate 80 datasets are utilized to carry out case studies. By using the proposed procedure, both datasets generate similar trends, which demonstrate the applicability of the proposed methodology and the transferability of our conclusions. PMID:28787022
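    For a single car-following pair, the time-to-collision surrogate reduces to spacing divided by closing speed, defined only while the follower approaches the leader. A minimal sketch with invented values (not the paper's aggregation over the fundamental diagram):

        # Minimal sketch of the time-to-collision (TTC) surrogate for one vehicle pair.
        from typing import Optional

        def time_to_collision(gap_m: float, v_follower: float, v_leader: float) -> Optional[float]:
            closing_speed = v_follower - v_leader
            if closing_speed <= 0:
                return None          # not on a collision course; surrogate undefined
            return gap_m / closing_speed

        print(time_to_collision(gap_m=20.0, v_follower=18.0, v_leader=14.0))  # 5.0 s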

  6. Object-Based Retro-Classification Of Agricultural Land Use: A Case Study Of Irrigated Croplands

    NASA Astrophysics Data System (ADS)

    Dubovyk, Olena; Conrad, Christopher; Khamzina, Asia; Menz, Gunter

    2013-12-01

    Availability of historical crop maps is necessary for assessing land management practices and their effectiveness, as well as for monitoring the environmental impacts of land use. A lack of accurate current and past land-use information forestalls assessment of the changes that have occurred and their consequences and thus complicates knowledge-driven agrarian policy development. At the same time, the lack of sampling datasets for past years often restricts mapping of historical land use. We propose a methodology for retro-assessment of several crops based on multitemporal Landsat 5 TM imagery and a limited sampling dataset. The overall accuracy of the retro-map was 81%, while accuracies for specific crop classes varied from 60% to 93%. If further elaborated, the developed method could be a useful tool for generating historical data on agricultural land use.

  7. Walkability Index

    EPA Pesticide Factsheets

    The Walkability Index dataset characterizes every Census 2010 block group in the U.S. based on its relative walkability. Walkability depends upon characteristics of the built environment that influence the likelihood of walking being used as a mode of travel. The Walkability Index is based on the EPA's previous data product, the Smart Location Database (SLD). Block group data from the SLD was the only input into the Walkability Index, and consisted of four variables from the SLD weighted in a formula to create the new Walkability Index. This dataset shares the SLD's block group boundary definitions from Census 2010. The methodology describing the process of creating the Walkability Index can be found in the documents located at ftp://newftp.epa.gov/EPADataCommons/OP/WalkabilityIndex.zip. You can also learn more about the Smart Location Database at https://edg.epa.gov/data/Public/OP/Smart_Location_DB_v02b.zip.
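    The factsheet describes the index as a weighted combination of four SLD block-group variables. The sketch below shows only the general form of such a weighted index; the variable names and equal weights are placeholders, not EPA's published formula.

        # Placeholder sketch of a weighted block-group index (weights/variables invented).
        import pandas as pd

        weights = {"intersection_density": 0.25, "transit_proximity": 0.25,
                   "employment_mix": 0.25, "employment_housing_mix": 0.25}

        block_groups = pd.DataFrame({
            "geoid": ["010010201001", "010010201002"],
            "intersection_density": [12.0, 4.0],
            "transit_proximity": [8.0, 2.0],
            "employment_mix": [10.0, 6.0],
            "employment_housing_mix": [9.0, 3.0],
        })

        block_groups["walkability_index"] = sum(
            w * block_groups[col] for col, w in weights.items())
        print(block_groups[["geoid", "walkability_index"]])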

  8. Will higher traffic flow lead to more traffic conflicts? A crash surrogate metric based analysis.

    PubMed

    Kuang, Yan; Qu, Xiaobo; Yan, Yadan

    2017-01-01

    In this paper, we aim to examine the relationship between traffic flow and potential conflict risks by using crash surrogate metrics. It has been widely recognized that one traffic flow corresponds to two distinct traffic states with different speeds and densities. In view of this, instead of simply aggregating traffic conditions with the same traffic volume, we represent potential conflict risks at a traffic flow fundamental diagram. Two crash surrogate metrics, namely, Aggregated Crash Index and Time to Collision, are used in this study to represent the potential conflict risks with respect to different traffic conditions. Furthermore, Beijing North Ring III and Next Generation SIMulation Interstate 80 datasets are utilized to carry out case studies. By using the proposed procedure, both datasets generate similar trends, which demonstrate the applicability of the proposed methodology and the transferability of our conclusions.

  9. Investigating the Sensitivity of Streamflow and Water Quality to Climate Change and Urbanization in 20 U.S. Watersheds

    NASA Astrophysics Data System (ADS)

    Johnson, T. E.; Weaver, C. P.; Butcher, J.; Parker, A.

    2011-12-01

    Watershed modeling was conducted in 20 large (15,000-60,000 km2) U.S. watersheds to address gaps in our knowledge of the sensitivity of U.S. streamflow, nutrient (N and P) and sediment loading to potential future climate change, and methodological challenges associated with integrating existing tools (e.g., climate models, watershed models) and datasets to address these questions. Climate change scenarios are based on dynamically downscaled (50x50 km2) output from four of the GCMs used in the Intergovernmental Panel on Climate Change (IPCC) 4th Assessment Report for the period 2041-2070 archived by the North American Regional Climate Change Assessment Program (NARCCAP). To explore the potential interaction of climate change and urbanization, model simulations also include urban and residential development scenarios for each of the 20 study watersheds. Urban and residential development scenarios were acquired from EPA's national-scale Integrated Climate and Land Use Scenarios (ICLUS) project. Watershed modeling was conducted using the Hydrologic Simulation Program-FORTRAN (HSPF) and Soil and Water Assessment Tool (SWAT) models. Here we present a summary of results for 5 of the study watersheds: the Minnesota River, the Susquehanna River, the Apalachicola-Chattahoochee-Flint, the Salt/Verde/San Pedro, and the Willamette River Basins. This set of results provides an overview of the response to climate change in different regions of the U.S. and the different sensitivities of different streamflow and water quality endpoints, and illustrates a number of methodological issues, including the sensitivities and uncertainties associated with the use of different watershed models, approaches for downscaling climate change projections, and the interaction between climate change and other forcing factors, specifically urbanization and changes in atmospheric CO2 concentration.

  10. Markov Chain Ontology Analysis (MCOA)

    PubMed Central

    2012-01-01

    Background Biomedical ontologies have become an increasingly critical lens through which researchers analyze the genomic, clinical and bibliographic data that fuels scientific research. Of particular relevance are methods, such as enrichment analysis, that quantify the importance of ontology classes relative to a collection of domain data. Current analytical techniques, however, remain limited in their ability to handle many important types of structural complexity encountered in real biological systems including class overlaps, continuously valued data, inter-instance relationships, non-hierarchical relationships between classes, semantic distance and sparse data. Results In this paper, we describe a methodology called Markov Chain Ontology Analysis (MCOA) and illustrate its use through a MCOA-based enrichment analysis application based on a generative model of gene activation. MCOA models the classes in an ontology, the instances from an associated dataset and all directional inter-class, class-to-instance and inter-instance relationships as a single finite ergodic Markov chain. The adjusted transition probability matrix for this Markov chain enables the calculation of eigenvector values that quantify the importance of each ontology class relative to other classes and the associated data set members. On both controlled Gene Ontology (GO) data sets created with Escherichia coli, Drosophila melanogaster and Homo sapiens annotations and real gene expression data extracted from the Gene Expression Omnibus (GEO), the MCOA enrichment analysis approach provides the best performance of comparable state-of-the-art methods. Conclusion A methodology based on Markov chain models and network analytic metrics can help detect the relevant signal within large, highly interdependent and noisy data sets and, for applications such as enrichment analysis, has been shown to generate superior performance on both real and simulated data relative to existing state-of-the-art approaches. PMID:22300537
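    The core computation described above amounts to finding the stationary (eigenvector) weights of an ergodic chain. A toy version with an invented 4-state transition matrix is shown below; it illustrates the idea only and is not the MCOA software.

        # Toy sketch: rank states of an ergodic Markov chain by stationary probability.
        import numpy as np

        P = np.array([[0.10, 0.60, 0.20, 0.10],    # row-stochastic transition matrix
                      [0.30, 0.10, 0.40, 0.20],
                      [0.20, 0.30, 0.10, 0.40],
                      [0.25, 0.25, 0.25, 0.25]])

        eigvals, eigvecs = np.linalg.eig(P.T)      # stationary vector: left eigenvector for eigenvalue 1
        stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
        stat = stat / stat.sum()                   # normalize to a probability distribution
        print(stat)                                # relative importance of the four states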

  11. Markov Chain Ontology Analysis (MCOA).

    PubMed

    Frost, H Robert; McCray, Alexa T

    2012-02-03

    Biomedical ontologies have become an increasingly critical lens through which researchers analyze the genomic, clinical and bibliographic data that fuels scientific research. Of particular relevance are methods, such as enrichment analysis, that quantify the importance of ontology classes relative to a collection of domain data. Current analytical techniques, however, remain limited in their ability to handle many important types of structural complexity encountered in real biological systems including class overlaps, continuously valued data, inter-instance relationships, non-hierarchical relationships between classes, semantic distance and sparse data. In this paper, we describe a methodology called Markov Chain Ontology Analysis (MCOA) and illustrate its use through a MCOA-based enrichment analysis application based on a generative model of gene activation. MCOA models the classes in an ontology, the instances from an associated dataset and all directional inter-class, class-to-instance and inter-instance relationships as a single finite ergodic Markov chain. The adjusted transition probability matrix for this Markov chain enables the calculation of eigenvector values that quantify the importance of each ontology class relative to other classes and the associated data set members. On both controlled Gene Ontology (GO) data sets created with Escherichia coli, Drosophila melanogaster and Homo sapiens annotations and real gene expression data extracted from the Gene Expression Omnibus (GEO), the MCOA enrichment analysis approach provides the best performance of comparable state-of-the-art methods. A methodology based on Markov chain models and network analytic metrics can help detect the relevant signal within large, highly interdependent and noisy data sets and, for applications such as enrichment analysis, has been shown to generate superior performance on both real and simulated data relative to existing state-of-the-art approaches.

  12. Dataset from Dick et al published in Sawyer et al 2016

    EPA Pesticide Factsheets

    Dataset is a time course description of lindane disappearance in blood plasma after dermal exposure in human volunteers. This dataset is associated with the following publication: Sawyer, M.E., M.V. Evans, C. Wilson, L.J. Beesley, L. Leon, C. Eklund, E. Croom, and R. Pegram. Development of a Human Physiologically Based Pharmacokinetics (PBPK) Model For Dermal Permeability for Lindane. TOXICOLOGY LETTERS. Elsevier Science Ltd, New York, NY, USA, 14(245): pp106-109, (2016).

  13. NP-PAH Interaction Dataset

    EPA Pesticide Factsheets

    Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration were allowed to equilibrate with a known mass of nanoparticles. The mixture was then ultracentrifuged and sampled for analysis. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).

  14. Artificial intelligence (AI) systems for interpreting complex medical datasets.

    PubMed

    Altman, R B

    2017-05-01

    Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications to medical data face several technical challenges: handling complex, heterogeneous, and noisy medical datasets, and explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.

  15. Methodology to estimate particulate matter emissions from certified commercial aircraft engines.

    PubMed

    Wayson, Roger L; Fleming, Gregg G; Lovinelli, Ralph

    2009-01-01

    Today, about one-fourth of U.S. commercial service airports, including 41 of the busiest 50, are either in nonattainment or maintenance areas per the National Ambient Air Quality Standards. U.S. aviation activity is forecasted to triple by 2025, while at the same time, the U.S. Environmental Protection Agency (EPA) is evaluating stricter particulate matter (PM) standards on the basis of documented human health and welfare impacts. Stricter federal standards are expected to impede capacity and limit aviation growth if regulatory mandated emission reductions occur as for other non-aviation sources (i.e., automobiles, power plants, etc.). In addition, strong interest exists as to the role aviation emissions play in air quality and climate change issues. These reasons underpin the need to quantify and understand PM emissions from certified commercial aircraft engines, which has led to the need for a methodology to predict these emissions. Standardized sampling techniques to measure volatile and nonvolatile PM emissions from aircraft engines do not exist. As such, a first-order approximation (FOA) was derived to fill this need based on available information. FOA1.0 only allowed prediction of nonvolatile PM. FOA2.0 was a change to include volatile PM emissions on the basis of the ratio of nonvolatile to volatile emissions. Recent collaborative efforts by industry (manufacturers and airlines), research establishments, and regulators have begun to provide further insight into the estimation of the PM emissions. The resultant PM measurement datasets are being analyzed to refine sampling techniques and progress towards standardized PM measurements. These preliminary measurement datasets also support the continued refinement of the FOA methodology. FOA3.0 disaggregated the prediction techniques to allow for independent prediction of nonvolatile and volatile emissions on a more theoretical basis. The Committee for Aviation Environmental Protection of the International Civil Aviation Organization endorsed the use of FOA3.0 in February 2007. Further commitment was made to improve the FOA as new data become available, until such time the methodology is rendered obsolete by a fully validated database of PM emission indices for today's certified commercial fleet. This paper discusses related assumptions and derived equations for the FOA3.0 methodology used worldwide to estimate PM emissions from certified commercial aircraft engines within the vicinity of airports.
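    Whatever the exact FOA coefficients, applying the method ultimately comes down to multiplying per-mode emission indices by per-mode fuel burn and summing over the landing/take-off cycle. The figures below are invented for illustration only and are not FOA3.0 values.

        # Generic illustration only: per-mode emission index (g PM per kg fuel) times
        # fuel burned (kg), summed over an LTO cycle. Numbers are invented.
        modes = {
            "taxi":     (0.030, 250.0),
            "take-off": (0.120, 100.0),
            "climb":    (0.100, 220.0),
            "approach": (0.060, 130.0),
        }
        total_pm_g = sum(ei * fuel for ei, fuel in modes.values())
        print(f"PM per LTO cycle: {total_pm_g:.1f} g")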

  16. Utilizing novel diversity estimators to quantify multiple dimensions of microbial biodiversity across domains

    PubMed Central

    2013-01-01

    Background Microbial ecologists often employ methods from classical community ecology to analyze microbial community diversity. However, these methods have limitations because microbial communities differ from macro-organismal communities in key ways. This study sought to quantify microbial diversity using methods that are better suited for data spanning multiple domains of life and dimensions of diversity. Diversity profiles are one novel, promising way to analyze microbial datasets. Diversity profiles encompass many other indices, provide effective numbers of diversity (mathematical generalizations of previous indices that better convey the magnitude of differences in diversity), and can incorporate taxa similarity information. To explore whether these profiles change interpretations of microbial datasets, diversity profiles were calculated for four microbial datasets from different environments spanning all domains of life as well as viruses. Both similarity-based profiles that incorporated phylogenetic relatedness and naïve (not similarity-based) profiles were calculated. Simulated datasets were used to examine the robustness of diversity profiles to varying phylogenetic topology and community composition. Results Diversity profiles provided insights into microbial datasets that were not detectable with classical univariate diversity metrics. For all datasets analyzed, there were key distinctions between calculations that incorporated phylogenetic diversity as a measure of taxa similarity and naïve calculations. The profiles also provided information about the effects of rare species on diversity calculations. Additionally, diversity profiles were used to examine thousands of simulated microbial communities, showing that similarity-based and naïve diversity profiles only agreed approximately 50% of the time in their classification of which sample was most diverse. This is a strong argument for incorporating similarity information and calculating diversity with a range of emphases on rare and abundant species when quantifying microbial community diversity. Conclusions For many datasets, diversity profiles provided a different view of microbial community diversity compared to analyses that did not take into account taxa similarity information, effective diversity, or multiple diversity metrics. These findings are a valuable contribution to data analysis methodology in microbial ecology. PMID:24238386
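    Naive (non similarity-based) diversity profiles are commonly computed as Hill numbers, the effective number of taxa swept across an order parameter q; the similarity-based profiles discussed above additionally weight taxa by phylogenetic relatedness. A short sketch for one invented community, not the study's datasets:

        # Sketch: naive diversity profile (Hill numbers) across orders q for one community.
        import numpy as np

        def hill_number(p: np.ndarray, q: float) -> float:
            p = p[p > 0]
            if np.isclose(q, 1.0):                 # q -> 1 limit is exp(Shannon entropy)
                return float(np.exp(-np.sum(p * np.log(p))))
            return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

        abundances = np.array([50, 30, 10, 5, 3, 1, 1], dtype=float)
        p = abundances / abundances.sum()
        for q in [0.0, 0.5, 1.0, 2.0]:
            print(f"q={q}: effective number of taxa = {hill_number(p, q):.2f}")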

  17. An innovative methodology for the non-destructive diagnosis of architectural elements of ancient historical buildings.

    PubMed

    Fais, Silvana; Casula, Giuseppe; Cuccuru, Francesco; Ligas, Paola; Bianchi, Maria Giovanna

    2018-03-12

    In the following, we present a new non-invasive methodology aimed at the diagnosis of stone building materials used in historical buildings and architectural elements. This methodology consists of the integrated sequential application of in situ proximal sensing methodologies, such as the 3D Terrestrial Laser Scanner for 3D modelling of the investigated objects, together with laboratory and in situ non-invasive multi-technique acoustic data, preceded by an accurate petrographical study of the investigated stone materials by optical and scanning electron microscopy. The increasing necessity to integrate different types of techniques in safeguarding cultural heritage is the result of two interdependent factors: 1) the diagnostic process on the building stone materials of monuments is increasingly focused on difficult targets in critical situations, and in these cases a diagnosis using only one type of non-invasive technique may not be sufficient to investigate the conservation status of the stone materials in the superficial and inner parts of the studied structures; 2) recent technological and scientific developments in the field of non-invasive diagnostic techniques for different types of materials favor and support the acquisition, processing and interpretation of huge multidisciplinary datasets.

  18. CINERGI: Community Inventory of EarthCube Resources for Geoscience Interoperability

    NASA Astrophysics Data System (ADS)

    Zaslavsky, Ilya; Bermudez, Luis; Grethe, Jeffrey; Gupta, Amarnath; Hsu, Leslie; Lehnert, Kerstin; Malik, Tanu; Richard, Stephen; Valentine, David; Whitenack, Thomas

    2014-05-01

    Organizing geoscience data resources to support cross-disciplinary data discovery, interpretation, analysis and integration is challenging because of different information models, semantic frameworks, metadata profiles, catalogs, and services used in different geoscience domains, not to mention different research paradigms and methodologies. The central goal of CINERGI, a new project supported by the US National Science Foundation through its EarthCube Building Blocks program, is to create a methodology and assemble a large inventory of high-quality information resources capable of supporting data discovery needs of researchers in a wide range of geoscience domains. The key characteristics of the inventory are: 1) collaboration with and integration of metadata resources from a number of large data facilities; 2) reliance on international metadata and catalog service standards; 3) assessment of resource "interoperability-readiness"; 4) ability to cross-link and navigate data resources, projects, models, researcher directories, publications, usage information, etc.; 5) efficient inclusion of "long-tail" data, which are not appearing in existing domain repositories; 6) data registration at feature level where appropriate, in addition to common dataset-level registration, and 7) integration with parallel EarthCube efforts, in particular focused on EarthCube governance, information brokering, service-oriented architecture design and management of semantic information. We discuss challenges associated with accomplishing CINERGI goals, including defining the inventory scope; managing different granularity levels of resource registration; interaction with search systems of domain repositories; explicating domain semantics; metadata brokering, harvesting and pruning; managing provenance of the harvested metadata; and cross-linking resources based on the linked open data (LOD) approaches. At the higher level of the inventory, we register domain-wide resources such as domain catalogs, vocabularies, information models, data service specifications, identifier systems, and assess their conformance with international standards (such as those adopted by ISO and OGC, and used by INSPIRE) or de facto community standards using, in part, automatic validation techniques. The main level in CINERGI leverages a metadata aggregation platform (currently Geoportal Server) to organize harvested resources from multiple collections and contributed by community members during EarthCube end-user domain workshops or suggested online. The latter mechanism uses the SciCrunch toolkit originally developed within the Neuroscience Information Framework (NIF) project and now being extended to other communities. The inventory is designed to support requests such as "Find resources with theme X in geographic area S", "Find datasets with subject Y using query concept expansion", "Find geographic regions having data of type Z", "Find datasets that contain property P". With the added LOD support, additional types of requests, such as "Find example implementations of specification X", "Find researchers who have worked in Domain X, dataset Y, location L", "Find resources annotated by person X", will be supported. Project's website (http://workspace.earthcube.org/cinergi) provides access to the initial resource inventory, a gallery of EarthCube researchers, collections of geoscience models, metadata entry forms, and other software modules and inventories being integrated into the CINERGI system. 
Support from the US National Science Foundation under award NSF ICER-1343816 is gratefully acknowledged.

  19. Comparison and validation of shallow landslides susceptibility maps generated by bi-variate and multi-variate linear probabilistic GIS-based techniques. A case study from Ribeira Quente Valley (S. Miguel Island, Azores)

    NASA Astrophysics Data System (ADS)

    Marques, R.; Amaral, P.; Zêzere, J. L.; Queiroz, G.; Goulart, C.

    2009-04-01

    Slope instability research and susceptibility mapping are fundamental components of hazard assessment and are of extreme importance for risk mitigation, land-use management and emergency planning. Landslide susceptibility zonation has been actively pursued during the last two decades and several methodologies are still being improved. Among all the methods presented in the literature, indirect quantitative probabilistic methods have been used extensively. In this work, different linear probabilistic methods, both bi-variate and multi-variate (Informative Value, Fuzzy Logic, Weights of Evidence and Logistic Regression), were used to compute the spatial probability of landslide occurrence, using the pixel as the mapping unit. The methods used are based on linear relationships between landslides and nine conditioning factors (altimetry, slope angle, exposition, curvature, distance to streams, wetness index, contribution area, lithology and land-use). It was assumed that future landslides will be conditioned by the same factors as past landslides in the study area. The work was developed for Ribeira Quente Valley (S. Miguel Island, Azores), a study area of 9.5 km2 mainly composed of volcanic deposits (ash and pumice lapilli) produced by explosive eruptions of Furnas Volcano. These materials, together with the steepness of the slopes (38.9% of the area has slope angles higher than 35°, reaching a maximum of 87.5°), make the area very prone to landslide activity. A total of 1,495 shallow landslides were mapped (at 1:5,000 scale) and included in a GIS database. The total affected area is 401,744 m2 (4.5% of the study area). Most slope movements are translational slides frequently evolving into debris-flows. The landslides are elongated, with maximum length generally equivalent to the slope extent, and their width normally does not exceed 25 m. The failure depth rarely exceeds 1.5 m and the volume is usually smaller than 700 m3. For modelling purposes, the landslides were randomly divided into two sub-datasets: a modelling dataset with 748 events (2.2% of the study area) and a validation dataset with 747 events (2.3% of the study area). The susceptibility algorithms obtained with the different probabilistic techniques were rated individually using success rate and prediction rate curves. The best model performance was obtained with logistic regression, although the results from the different methods do not show significant differences in either success or prediction rate curves. This evidence revealed that: (1) the modelling landslide dataset is representative of the characteristics of the entire landslide population; and (2) increasing the complexity and robustness of the probabilistic methodology did not produce a significant increase in success or prediction rates. It was therefore concluded that the resolution and quality of the input variables are much more important than the probabilistic model chosen to assess landslide susceptibility. This work was developed within the VOLCSOILRISK project (Volcanic Soils Geotechnical Characterization for Landslide Risk Mitigation), supported by Direcção Regional da Ciência e Tecnologia - Governo Regional dos Açores.
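    Of the four methods compared, the logistic regression route can be sketched with synthetic per-pixel conditioning factors: fit the model, rank pixels by predicted probability, and read off how quickly the ranked pixels capture the mapped landslides (the basis of success and prediction rate curves). Data and factor choices below are placeholders, not the Ribeira Quente dataset.

        # Sketch only: pixel-based logistic regression susceptibility and a crude
        # success-rate statistic, using synthetic conditioning factors.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(1)
        n = 5000
        X = np.column_stack([rng.uniform(0, 60, n),     # slope angle (deg)
                             rng.uniform(0, 500, n),    # distance to streams (m)
                             rng.uniform(0, 15, n)])    # wetness index
        y = (X[:, 0] + rng.normal(0, 10, n) > 40).astype(int)   # synthetic landslide pixels

        model = LogisticRegression(max_iter=1000).fit(X, y)
        susceptibility = model.predict_proba(X)[:, 1]

        order = np.argsort(-susceptibility)             # most to least susceptible pixels
        captured = np.cumsum(y[order]) / y.sum()        # success-rate curve ordinate
        print("landslides captured in the top 10% most susceptible pixels:",
              round(float(captured[int(0.1 * n)]), 2))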

  20. HYDRA: Revealing heterogeneity of imaging and genetic patterns through a multiple max-margin discriminative analysis framework.

    PubMed

    Varol, Erdem; Sotiras, Aristeidis; Davatzikos, Christos

    2017-01-15

    Multivariate pattern analysis techniques have been increasingly used over the past decade to derive highly sensitive and specific biomarkers of diseases on an individual basis. The driving assumption behind the vast majority of the existing methodologies is that a single imaging pattern can distinguish between healthy and diseased populations, or between two subgroups of patients (e.g., progressors vs. non-progressors). This assumption effectively ignores the ample evidence for the heterogeneous nature of brain diseases. Neurodegenerative, neuropsychiatric and neurodevelopmental disorders are largely characterized by high clinical heterogeneity, which likely stems in part from underlying neuroanatomical heterogeneity of various pathologies. Detecting and characterizing heterogeneity may deepen our understanding of disease mechanisms and lead to patient-specific treatments. However, few approaches tackle disease subtype discovery in a principled machine learning framework. To address this challenge, we present a novel non-linear learning algorithm for simultaneous binary classification and subtype identification, termed HYDRA (Heterogeneity through Discriminative Analysis). Neuroanatomical subtypes are effectively captured by multiple linear hyperplanes, which form a convex polytope that separates two groups (e.g., healthy controls from pathologic samples); each face of this polytope effectively defines a disease subtype. We validated HYDRA on simulated and clinical data. In the latter case, we applied the proposed method independently to the imaging and genetic datasets of the Alzheimer's Disease Neuroimaging Initiative (ADNI 1) study. The imaging dataset consisted of T1-weighted volumetric magnetic resonance images of 123 AD patients and 177 controls. The genetic dataset consisted of single nucleotide polymorphism information of 103 AD patients and 139 controls. We identified 3 reproducible subtypes of atrophy in AD relative to controls: (1) diffuse and extensive atrophy, (2) precuneus and extensive temporal lobe atrophy, as well as some prefrontal atrophy, and (3) an atrophy pattern largely confined to the hippocampus and the medial temporal lobe. The genetics dataset yielded two subtypes of AD characterized mainly by the presence/absence of the apolipoprotein E (APOE) ε4 genotype, but also involving differential presence of risk alleles of CD2AP, SPON1 and LOC39095 SNPs that were associated with differences in the respective patterns of brain atrophy, especially in the precuneus. The results demonstrate the potential of the proposed approach to map disease heterogeneity in neuroimaging and genetic studies. Copyright © 2016 Elsevier Inc. All rights reserved.
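    The polytope idea can be caricatured with a toy alternating scheme, which is not the published HYDRA optimization: K linear classifiers (faces) are refit while each patient is reassigned to the face that separates it from controls with the largest margin, so face membership plays the role of a subtype label. Everything below is synthetic and simplified.

        # Toy sketch in the spirit of a multi-hyperplane (polytope) classifier.
        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(2)
        controls = rng.normal(0, 1, size=(150, 10))
        patients = np.vstack([rng.normal(+2, 1, size=(75, 10)),   # synthetic subtype 1
                              rng.normal(-2, 1, size=(75, 10))])  # synthetic subtype 2

        K = 2
        assign = rng.integers(0, K, size=len(patients))           # random initial subtypes
        for _ in range(10):
            faces = []
            for k in range(K):
                idx = assign == k
                if not idx.any():
                    idx = np.ones(len(patients), dtype=bool)      # guard against an empty face
                Xk = np.vstack([controls, patients[idx]])
                yk = np.r_[np.zeros(len(controls)), np.ones(int(idx.sum()))]
                faces.append(LinearSVC(C=1.0, max_iter=5000).fit(Xk, yk))
            margins = np.column_stack([f.decision_function(patients) for f in faces])
            assign = margins.argmax(axis=1)                       # reassign to best face
        print(np.bincount(assign, minlength=K))                   # discovered subtype sizes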
