Sample records for high quality datasets

  1. Historical instrumental climate data for Australia - quality and utility for palaeoclimatic studies

    NASA Astrophysics Data System (ADS)

    Nicholls, Neville; Collins, Dean; Trewin, Blair; Hope, Pandora

    2006-10-01

    The quality and availability of climate data suitable for palaeoclimatic calibration and verification for the Australian region are discussed and documented. Details of the various datasets, including problems with the data, are presented. High-quality datasets, where such problems are reduced or even eliminated, are discussed. Many climate datasets are now analysed onto grids, facilitating the preparation of regional-average time series. Work is under way to produce such high-quality, gridded datasets for a variety of hitherto unavailable climate data, including surface humidity, pan evaporation, wind, and cloud. An experiment suggests that only a relatively small number of palaeoclimatic time series could provide a useful estimate of long-term changes in Australian annual average temperature.

  2. Exploring Antarctic Land Surface Temperature Extremes Using Condensed Anomaly Databases

    NASA Astrophysics Data System (ADS)

    Grant, Glenn Edwin

    Satellite observations have revolutionized the Earth Sciences and climate studies. However, data and imagery continue to accumulate at an accelerating rate, and efficient tools for data discovery, analysis, and quality checking lag behind. In particular, studies of long-term, continental-scale processes at high spatiotemporal resolutions are especially problematic. The traditional technique of downloading an entire dataset and using customized analysis code is often impractical or consumes too many resources. The Condensate Database Project was envisioned as an alternative method for data exploration and quality checking. The project's premise was that much of the data in any satellite dataset is unneeded and can be eliminated, compacting massive datasets into more manageable sizes. Dataset sizes are further reduced by retaining only anomalous data of high interest. Hosting the resulting "condensed" datasets in high-speed databases enables immediate availability for queries and exploration. Proof of the project's success relied on demonstrating that the anomaly database methods can enhance and accelerate scientific investigations. The hypothesis of this dissertation is that the condensed datasets are effective tools for exploring many scientific questions, spurring further investigations and revealing important information that might otherwise remain undetected. This dissertation uses condensed databases containing 17 years of Antarctic land surface temperature anomalies as its primary data. The study demonstrates the utility of the condensate database methods by discovering new information. In particular, the process revealed critical quality problems in the source satellite data. The results are used as the starting point for four case studies, investigating Antarctic temperature extremes, cloud detection errors, and the teleconnections between Antarctic temperature anomalies and climate indices. The results confirm the hypothesis that the condensate databases are a highly useful tool for Earth Science analyses. Moreover, the quality checking capabilities provide an important method for independent evaluation of dataset veracity.

  3. An improved filtering algorithm for big read datasets and its application to single-cell assembly.

    PubMed

    Wedemeyer, Axel; Kliemann, Lasse; Srivastav, Anand; Schielke, Christian; Reusch, Thorsten B; Rosenstiel, Philip

    2017-07-03

    For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. Filtering these data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new algorithmic feature is the use of phred quality scores together with a detailed analysis of the k-mer counts to decide which reads to keep. We qualify and recommend parameters for our new read filtering algorithm. Guided by these parameters, we remove a median of 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm. We conclude that read filtering is a practical and efficient method for reducing read data and for speeding up the assembly process. This applies not only to single-cell assembly, as shown in this paper, but also to other projects with high-mean-coverage datasets such as metagenomic sequencing projects. Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at https://git.informatik.uni-kiel.de/axw/Bignorm .
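
    As a rough illustration of the kind of decision rule described above (keep a read only if it contributes enough novel, high-quality k-mers), here is a minimal Python sketch; the k-mer length, thresholds and helper names are illustrative assumptions, not the published Bignorm implementation.

    ```python
    # Hypothetical sketch of abundance- plus quality-aware read filtering
    # (not the published Bignorm implementation; thresholds are illustrative).
    from collections import defaultdict

    K = 21                 # k-mer length (assumed)
    MIN_PHRED = 20         # ignore k-mers containing low-quality bases
    ABUNDANCE_CAP = 20     # k-mers seen this often are "already covered"
    MIN_NOVEL_KMERS = 5    # keep a read only if it adds enough new information

    kmer_counts = defaultdict(int)

    def keep_read(seq: str, quals: list[int]) -> bool:
        """Decide whether a read adds enough novel, high-quality k-mers."""
        novel = 0
        for i in range(len(seq) - K + 1):
            if min(quals[i:i + K]) < MIN_PHRED:
                continue                      # skip k-mers with low phred scores
            if kmer_counts[seq[i:i + K]] < ABUNDANCE_CAP:
                novel += 1
        if novel >= MIN_NOVEL_KMERS:
            for i in range(len(seq) - K + 1):
                kmer_counts[seq[i:i + K]] += 1   # only kept reads update the counts
            return True
        return False
    ```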

  4. The health care and life sciences community profile for dataset descriptions

    PubMed Central

    Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko

    2016-01-01

    Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295
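
    As a rough illustration of the kind of machine-readable description the profile targets, the sketch below assembles a minimal RDF dataset record with Dublin Core and DCAT terms using rdflib; the IRIs, property selection and values are illustrative and do not reproduce the guideline's normative element set.

    ```python
    # Minimal, illustrative RDF description of a versioned dataset.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF, XSD

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dcterms", DCTERMS)

    dataset = URIRef("http://example.org/dataset/example-dataset")  # hypothetical IRI
    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Example biomedical dataset", lang="en")))
    g.add((dataset, DCTERMS.description, Literal("Illustrative dataset description.", lang="en")))
    g.add((dataset, DCTERMS.publisher, URIRef("http://example.org/org/example-publisher")))
    g.add((dataset, DCTERMS.issued, Literal("2016-01-01", datatype=XSD.date)))
    g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
    g.add((dataset, DCAT.keyword, Literal("metadata")))

    print(g.serialize(format="turtle"))
    ```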

  5. Image Quality Ranking Method for Microscopy

    PubMed Central

    Koho, Sami; Fazeli, Elnaz; Eriksson, John E.; Hänninen, Pekka E.

    2016-01-01

    Automated analysis of microscope images is necessitated by the increased need for high-resolution follow-up of events in time. Manually finding the right images to analyze, or to eliminate from data analysis, is a common day-to-day problem in microscopy research today, and the constantly growing size of image datasets does not help matters. We propose a simple method and a software tool for sorting images within a dataset according to their relative quality. We demonstrate the applicability of our method in finding good-quality images in a STED microscope sample preparation optimization image dataset. The results are validated by comparisons to subjective opinion scores, as well as five state-of-the-art blind image quality assessment methods. We also show how our method can be applied to eliminate useless out-of-focus images in a high-content screening experiment. We further evaluate the ability of our image quality ranking method to detect out-of-focus images, by extensive simulations, and by comparing its performance against previously published, well-established microscopy autofocus metrics. PMID:27364703
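
    For readers who want a feel for the ranking workflow, the sketch below orders images by a generic sharpness proxy (variance of the Laplacian); this stands in for, and is not, the quality measure proposed in the paper.

    ```python
    # Illustrative only: rank images by a simple sharpness proxy.
    import numpy as np
    from scipy.ndimage import laplace

    def sharpness_score(image: np.ndarray) -> float:
        """Higher variance of the Laplacian ~ more in-focus detail."""
        return float(laplace(image.astype(float)).var())

    def rank_images(images: dict[str, np.ndarray]) -> list[tuple[str, float]]:
        """Return (name, score) pairs sorted from best to worst."""
        scores = {name: sharpness_score(img) for name, img in images.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ```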

  6. Evaluation of the sparse coding super-resolution method for improving image quality of up-sampled images in computed tomography

    NASA Astrophysics Data System (ADS)

    Ota, Junko; Umehara, Kensuke; Ishimaru, Naoki; Ohno, Shunsuke; Okamoto, Kentaro; Suzuki, Takanori; Shirai, Naoki; Ishida, Takayuki

    2017-02-01

    As the capability of high-resolution displays grows, high-resolution images are often required in Computed Tomography (CT). However, acquiring high-resolution images takes a higher radiation dose and a longer scanning time. In this study, we applied the Sparse-coding-based Super-Resolution (ScSR) method to generate high-resolution images without increasing the radiation dose. We prepared an over-complete dictionary that learned the mapping between low- and high-resolution patches and sought a sparse representation of each patch of the low-resolution input. These coefficients were used to generate the high-resolution output. For evaluation, 44 CT cases were used as the test dataset. We up-sampled images by factors of 2 and 4 and compared the image quality of the ScSR scheme with bilinear and bicubic interpolation, the traditional interpolation schemes. We also compared the image quality obtained with three learning datasets: 45 CT images, 91 non-medical images, and 93 chest radiographs were used for dictionary preparation. The image quality was evaluated by measuring peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). The differences in PSNR and SSIM between the ScSR method and the interpolation methods were statistically significant. Visual assessment confirmed that the ScSR method generated sharp high-resolution images, whereas conventional interpolation methods generated over-smoothed images. Comparing the three training datasets, there were no significant differences between the CT, chest radiograph, and non-medical datasets. These results suggest that ScSR provides a robust approach for up-sampling CT images and yields substantially higher image quality in the enlarged images.
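
    The two evaluation metrics named above can be computed as follows; this is a generic sketch assuming 8-bit grayscale inputs (the data range would differ for CT values), not the authors' evaluation code.

    ```python
    # PSNR and SSIM between a reference image and an up-sampled test image.
    import numpy as np
    from skimage.metrics import structural_similarity

    def psnr(reference: np.ndarray, test: np.ndarray, data_range: float = 255.0) -> float:
        mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

    def ssim(reference: np.ndarray, test: np.ndarray, data_range: float = 255.0) -> float:
        return structural_similarity(reference, test, data_range=data_range)
    ```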

  7. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  8. A gridded hourly rainfall dataset for the UK applied to a national physically-based modelling system

    NASA Astrophysics Data System (ADS)

    Lewis, Elizabeth; Blenkinsop, Stephen; Quinn, Niall; Freer, Jim; Coxon, Gemma; Woods, Ross; Bates, Paul; Fowler, Hayley

    2016-04-01

    An hourly gridded rainfall product has great potential for use in many hydrological applications that require high temporal resolution meteorological data. One important example of this is flood risk management, with flooding in the UK highly dependent on sub-daily rainfall intensities amongst other factors. Knowledge of sub-daily rainfall intensities is therefore critical to designing hydraulic structures or flood defences to appropriate levels of service. Sub-daily rainfall rates are also essential inputs for flood forecasting, allowing for estimates of peak flows and stage for flood warning and response. In addition, an hourly gridded rainfall dataset has significant potential for practical applications such as better representation of extremes and pluvial flash flooding, validation of high resolution climate models and improving the representation of sub-daily rainfall in weather generators. A new 1km gridded hourly rainfall dataset for the UK has been created by disaggregating the daily Gridded Estimates of Areal Rainfall (CEH-GEAR) dataset using comprehensively quality-controlled hourly rain gauge data from over 1300 observation stations across the country. Quality control measures include identification of frequent tips, daily accumulations and dry spells, comparison of daily totals against the CEH-GEAR daily dataset, and nearest neighbour checks. The quality control procedure was validated against historic extreme rainfall events and the UKCP09 5km daily rainfall dataset. General use of the dataset has been demonstrated by testing the sensitivity of a physically-based hydrological modelling system for Great Britain to the distribution and rates of rainfall and potential evapotranspiration. Of the sensitivity tests undertaken, the largest improvements in model performance were seen when an hourly gridded rainfall dataset was combined with potential evapotranspiration disaggregated to hourly intervals, with 61% of catchments showing an increase in NSE between observed and simulated streamflows as a result of more realistic sub-daily meteorological forcing.
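
    The skill score mentioned above, the Nash-Sutcliffe efficiency (NSE), can be computed per catchment as in this short sketch (illustrative, not the study's code):

    ```python
    # NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2); 1 is a perfect fit.
    import numpy as np

    def nse(observed: np.ndarray, simulated: np.ndarray) -> float:
        observed = np.asarray(observed, dtype=float)
        simulated = np.asarray(simulated, dtype=float)
        return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)
    ```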

  9. Relation Between Selected Water-Quality Variables, Climatic Factors, and Lake Levels in Upper Klamath and Agency Lakes, Oregon, 1990-2006

    USGS Publications Warehouse

    Morace, Jennifer L.

    2007-01-01

    Growth and decomposition of dense blooms of Aphanizomenon flos-aquae in Upper Klamath Lake frequently cause extreme water-quality conditions that have led to critical fishery concerns for the region, including the listing of two species of endemic suckers as endangered. The Bureau of Reclamation has asked the U.S. Geological Survey (USGS) to examine water-quality data collected by the Klamath Tribes for relations with lake level. This analysis evaluates a 17-year dataset (1990-2006) and updates a previous USGS analysis of a 5-year dataset (1990-94). Both univariate hypothesis testing and multivariable analyses evaluated using an information-theoretic approach revealed the same result: no single overarching factor emerged from the data. No single factor could be eliminated from consideration either. The lack of statistically significant, strong correlations between water-quality conditions, lake level, and climatic factors does not necessarily show that these factors do not influence water-quality conditions; it is more likely that these conditions work in conjunction with each other to affect water quality. A few different conclusions could be drawn from the larger dataset than from the smaller dataset examined in 1996, but for the most part, the outcome was the same. Using an observational dataset that may not capture all variation in water-quality conditions (samples were collected on a two-week interval) and that has a limited range of conditions for evaluation (confined to the operation of the lake) may have confounded the exploration of explanatory factors. In the end, all years experienced some variation in poor water-quality conditions, either in timing of occurrence of the poor conditions or in their duration. The dataset of 17 years simply provided 17 different patterns of lake level, cumulative degree-days, timing of the bloom onset, and poor water-quality conditions, with no overriding causal factor emerging from the variations. Water-quality conditions were evaluated for their potential to be harmful to the endangered sucker species on the basis of high-stress thresholds: water temperature values greater than 28 degrees Celsius, dissolved-oxygen concentrations less than 4 milligrams per liter, and pH values greater than 9.7. Few water temperatures were greater than 28 degrees Celsius, and dissolved-oxygen concentrations less than 4 milligrams per liter generally were recorded in mid to late summer. In contrast, high pH values were more frequent, occurring earlier in the season and parallel with growth in the algal bloom. The 10 hypotheses relating water-quality variables, lake level, and climatic factors from the earlier USGS study were tested in this analysis for the larger 1990-2006 dataset. These hypotheses proposed relations between lake level and chlorophyll-a, pH, dissolved oxygen, total phosphorus, and water temperature. As in the previous study, no evidence was found in the larger dataset for any of these relations based on a seasonal (May-October) distribution. When analyzing only the June data, the previous 5-year study did find evidence for three hypotheses relating lake level to the onset of the bloom, chlorophyll-a concentrations, and the frequency of high pH values in June. These hypotheses were not supported by the 1990-2006 dataset, but the two hypotheses related to cumulative degree-days from the previous study were supported: chlorophyll-a concentrations were lower and onset of the algal bloom was delayed when spring air temperatures were cooler.
Other relations between water-quality variables and cumulative degree-days were not significant. In an attempt to identify interrelations among variables not detected by univariate analysis, multiple regressions were performed between lakewide measures of low dissolved-oxygen concentrations or high pH values in July and August and six physical and biological variables (peak chlorophyll-a concentrations, degree-days, water temperature, median October-May discharg

  10. An assessment of differences in gridded precipitation datasets in complex terrain

    NASA Astrophysics Data System (ADS)

    Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.

    2018-01-01

    Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation, across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr⁻¹ on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.
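
    A minimal sketch of the kind of per-grid-cell comparison described above (mean differences, percent differences and linear trends), assuming two datasets already regridded to a common grid of water-year totals; the study's actual workflow is more involved.

    ```python
    # Compare two gridded water-year precipitation datasets (years x lat x lon).
    import numpy as np

    def dataset_differences(a: np.ndarray, b: np.ndarray):
        """Mean absolute difference (mm/yr) and percent difference per grid cell."""
        mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
        abs_diff = mean_a - mean_b
        pct_diff = 100.0 * abs_diff / ((mean_a + mean_b) / 2.0)
        return abs_diff, pct_diff

    def linear_trend(a: np.ndarray, years: np.ndarray) -> np.ndarray:
        """Per-cell linear trend (mm per year) across water years."""
        ny, nlat, nlon = a.shape
        slopes = np.polyfit(years, a.reshape(ny, -1), deg=1)[0]
        return slopes.reshape(nlat, nlon)
    ```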

  11. A high-throughput system for high-quality tomographic reconstruction of large datasets at Diamond Light Source

    PubMed Central

    Atwood, Robert C.; Bodey, Andrew J.; Price, Stephen W. T.; Basham, Mark; Drakopoulos, Michael

    2015-01-01

    Tomographic datasets collected at synchrotrons are becoming very large and complex, and, therefore, need to be managed efficiently. Raw images may have high pixel counts, and each pixel can be multidimensional and associated with additional data such as those derived from spectroscopy. In time-resolved studies, hundreds of tomographic datasets can be collected in sequence, yielding terabytes of data. Users of tomographic beamlines are drawn from various scientific disciplines, and many are keen to use tomographic reconstruction software that does not require a deep understanding of reconstruction principles. We have developed Savu, a reconstruction pipeline that enables users to rapidly reconstruct data to consistently create high-quality results. Savu is designed to work in an ‘orthogonal’ fashion, meaning that data can be converted between projection and sinogram space throughout the processing workflow as required. The Savu pipeline is modular and allows processing strategies to be optimized for users' purposes. In addition to the reconstruction algorithms themselves, it can include modules for identification of experimental problems, artefact correction, general image processing and data quality assessment. Savu is open source, open licensed and ‘facility-independent’: it can run on standard cluster infrastructure at any institution. PMID:25939626

  12. Ambiguity of Quality in Remote Sensing Data

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Leptoukh, Greg

    2010-01-01

    This slide presentation reviews some issues in the quality of remote sensing data. Data "quality" is used in several different contexts in remote sensing data, with quite different meanings. At the pixel level, quality typically refers to a quality control process exercised by the processing algorithm, not an explicit declaration of accuracy or precision. File-level quality is usually a statistical summary of the pixel-level quality but is of doubtful use for scenes covering large areal extents. Quality at the dataset or product level, on the other hand, usually refers to how accurately the dataset is believed to represent the physical quantities it purports to measure. This assessment often bears, at best, an indirect relationship to pixel-level quality. In addition to ambiguity at different levels of granularity, ambiguity is endemic within levels. Pixel-level quality terms vary widely, as do recommendations for use of these flags. At the dataset/product level, quality for low-resolution gridded products is often extrapolated from validation campaigns using high spatial resolution swath data, a suspect practice at best. Making use of quality at all levels is complicated by the dependence on application needs. We will present examples of the various meanings of quality in remote sensing data and possible ways forward toward a more unified and usable quality framework.

  13. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    PubMed

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, this method is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
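
    A minimal sketch of the privacy-preserving encoding and comparison step described above: field values are mapped to Bloom filters and compared with a Dice coefficient. The filter length, number of hash functions and bigram tokenisation are illustrative choices, not the parameters used in the paper.

    ```python
    # Field-level Bloom filter encoding plus Dice-coefficient comparison.
    import hashlib

    FILTER_LEN = 1000   # illustrative filter length
    NUM_HASHES = 30     # illustrative number of hash functions

    def bigrams(value: str) -> set[str]:
        padded = f"_{value.lower()}_"
        return {padded[i:i + 2] for i in range(len(padded) - 1)}

    def bloom_encode(value: str) -> set[int]:
        """Return the set of bit positions set for a field value."""
        bits = set()
        for gram in bigrams(value):
            for seed in range(NUM_HASHES):
                digest = hashlib.sha1(f"{seed}:{gram}".encode()).hexdigest()
                bits.add(int(digest, 16) % FILTER_LEN)
        return bits

    def dice_similarity(a: set[int], b: set[int]) -> float:
        return 2.0 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0
    ```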

  14. ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data.

    PubMed

    Ou, Jianhong; Liu, Haibo; Yu, Jun; Kelliher, Michelle A; Castilla, Lucio H; Lawson, Nathan D; Zhu, Lihua Julie

    2018-03-01

    ATAC-seq (Assays for Transposase-Accessible Chromatin using sequencing) is a recently developed technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has higher signal to noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. While several tools have been developed or adopted for assessing read quality, identifying nucleosome occupancy and accessible regions from ATAC-seq data, none of the tools provide a comprehensive set of functionalities for preprocessing and quality assessment of aligned ATAC-seq datasets. We have developed a Bioconductor package, ATACseqQC, for easily generating various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, this package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling. Here we demonstrate the utilities of our package using 25 publicly available ATAC-seq datasets from four studies. We also provide guidelines on what the diagnostic plots should look like for an ideal ATAC-seq dataset. This software package has been used successfully for preprocessing and assessing several in-house and public ATAC-seq datasets. Diagnostic plots generated by this package will facilitate the quality assessment of ATAC-seq data, and help researchers to evaluate their own ATAC-seq experiments as well as select high-quality ATAC-seq datasets from public repositories such as GEO to avoid generating hypotheses or drawing conclusions from low-quality ATAC-seq experiments. The software, source code, and documentation are freely available as a Bioconductor package at https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html .
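
    ATACseqQC itself is an R/Bioconductor package; purely as an illustration of one of the standard post-alignment diagnostics (the fragment-size distribution), the Python sketch below tallies fragment sizes from a coordinate-sorted, indexed BAM file with pysam. It is not part of the package, and the cutoffs are illustrative.

    ```python
    # Tally paired-end fragment sizes from an aligned ATAC-seq BAM file.
    from collections import Counter
    import pysam

    def fragment_size_distribution(bam_path: str, max_size: int = 1000) -> Counter:
        sizes = Counter()
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam:
                # count each properly paired fragment once, via the forward mate
                if read.is_proper_pair and not read.is_reverse:
                    size = abs(read.template_length)
                    if 0 < size <= max_size:
                        sizes[size] += 1
        return sizes
    ```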

  15. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy

    DOE PAGES

    Levin, Barnaby D. A.; Padgett, Elliot; Chen, Chien-Chun; ...

    2016-06-07

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.

  16. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy.

    PubMed

    Levin, Barnaby D A; Padgett, Elliot; Chen, Chien-Chun; Scott, M C; Xu, Rui; Theis, Wolfgang; Jiang, Yi; Yang, Yongsoo; Ophus, Colin; Zhang, Haitao; Ha, Don-Hyung; Wang, Deli; Yu, Yingchao; Abruña, Hector D; Robinson, Richard D; Ercius, Peter; Kourkoutis, Lena F; Miao, Jianwei; Muller, David A; Hovden, Robert

    2016-06-07

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.

  17. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy

    PubMed Central

    Levin, Barnaby D.A.; Padgett, Elliot; Chen, Chien-Chun; Scott, M.C.; Xu, Rui; Theis, Wolfgang; Jiang, Yi; Yang, Yongsoo; Ophus, Colin; Zhang, Haitao; Ha, Don-Hyung; Wang, Deli; Yu, Yingchao; Abruña, Hector D.; Robinson, Richard D.; Ercius, Peter; Kourkoutis, Lena F.; Miao, Jianwei; Muller, David A.; Hovden, Robert

    2016-01-01

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data. PMID:27272459

  18. A standard for measuring metadata quality in spectral libraries

    NASA Astrophysics Data System (ADS)

    Rasaiah, B.; Jones, S. D.; Bellman, C.

    2013-12-01

    There is an urgent need within the international remote sensing community to establish a metadata standard for field spectroscopy that ensures high-quality, interoperable metadata sets that can be archived and shared efficiently within Earth observation data sharing systems. Metadata are an important component in the cataloguing and analysis of in situ spectroscopy datasets because of their central role in identifying and quantifying the quality and reliability of spectral data and the products derived from them. This paper presents approaches to measuring metadata completeness and quality in spectral libraries to determine reliability, interoperability, and re-usability of a dataset. Explored are quality parameters that meet the unique requirements of in situ spectroscopy datasets, across many campaigns. Examined are the challenges of ensuring that data creators, owners, and users maintain a high level of data integrity throughout the lifecycle of a dataset. Issues such as field measurement methods, instrument calibration, and data representativeness are investigated. The proposed metadata standard incorporates expert recommendations that include metadata protocols critical to all campaigns, and those that are restricted to campaigns for specific target measurements. The implications of semantics and syntax for a robust and flexible metadata standard are also considered. Approaches towards an operational and logistically viable implementation of a quality standard are discussed. This paper also proposes a way forward for adapting and enhancing current geospatial metadata standards to the unique requirements of field spectroscopy metadata quality. [0430] BIOGEOSCIENCES / Computational methods and data processing [0480] BIOGEOSCIENCES / Remote sensing [1904] INFORMATICS / Community standards [1912] INFORMATICS / Data management, preservation, rescue [1926] INFORMATICS / Geospatial [1930] INFORMATICS / Data and information governance [1946] INFORMATICS / Metadata [1952] INFORMATICS / Modeling [1976] INFORMATICS / Software tools and services [9810] GENERAL OR MISCELLANEOUS / New fields
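
    As a rough illustration of one way to operationalise metadata completeness, the sketch below scores a spectral library record by the fraction of required fields that are present and non-empty; the field names and the scoring rule are illustrative, and the proposed standard defines its own criticality tiers and quality parameters.

    ```python
    # Simple completeness score for one metadata record (illustrative fields).
    REQUIRED_FIELDS = ["instrument", "calibration_date", "target",
                       "illumination", "measurement_datetime", "operator"]

    def completeness(record: dict) -> float:
        """Fraction of required fields that are present and non-empty."""
        present = sum(1 for field in REQUIRED_FIELDS
                      if str(record.get(field, "")).strip() != "")
        return present / len(REQUIRED_FIELDS)
    ```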

  19. Development and Validation of a High-Quality Composite Real-World Mortality Endpoint.

    PubMed

    Curtis, Melissa D; Griffith, Sandra D; Tucker, Melisa; Taylor, Michael D; Capra, William B; Carrigan, Gillis; Holzman, Ben; Torres, Aracelis Z; You, Paul; Arnieri, Brandon; Abernethy, Amy P

    2018-05-14

    The objective was to create a high-quality electronic health record (EHR)-derived mortality dataset for retrospective and prospective real-world evidence generation, using oncology EHR data supplemented with external commercial and US Social Security Death Index data and benchmarked to the National Death Index (NDI). We developed a recent, linkable, high-quality mortality variable amalgamated from multiple data sources to supplement EHR data, benchmarked against the most complete source of U.S. mortality data, the NDI. Data quality of the mortality variable version 2.0 is reported here. For advanced non-small-cell lung cancer, sensitivity of mortality information improved from 66 percent in EHR structured data to 91 percent in the composite dataset, with high date agreement compared to the NDI. For advanced melanoma, metastatic colorectal cancer, and metastatic breast cancer, sensitivity of the final variable was 85 to 88 percent. Kaplan-Meier survival analyses showed that improving mortality data completeness minimized overestimation of survival relative to NDI-based estimates. For EHR-derived data to yield reliable real-world evidence, it needs to be of known and sufficiently high quality. Considering the impact of mortality data completeness on survival endpoints, we highlight the importance of data quality assessment and advocate benchmarking to the NDI. © 2018 The Authors. Health Services Research published by Wiley Periodicals, Inc. on behalf of Health Research and Educational Trust.
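
    The headline sensitivity figures above follow the usual definition benchmarked against a gold standard; a minimal sketch, assuming patient identifiers can be matched across sources:

    ```python
    # Sensitivity of a composite mortality variable against a gold standard
    # such as the NDI: of gold-standard deaths, what fraction is captured?
    def sensitivity(composite_deaths: set[str], gold_standard_deaths: set[str]) -> float:
        true_positives = len(composite_deaths & gold_standard_deaths)
        return true_positives / len(gold_standard_deaths) if gold_standard_deaths else float("nan")

    # Example: 91 percent sensitivity means 91 of every 100 gold-standard
    # deaths also appear in the composite EHR-derived variable.
    ```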

  20. Extensive validation of CM SAF surface radiation products over Europe.

    PubMed

    Urraca, Ruben; Gracia-Amillo, Ana M; Koubli, Elena; Huld, Thomas; Trentmann, Jörg; Riihelä, Aku; Lindfors, Anders V; Palmer, Diane; Gottschalg, Ralph; Antonanzas-Torres, Fernando

    2017-09-15

    This work presents a validation of three satellite-based radiation products over an extensive network of 313 pyranometers across Europe, from 2005 to 2015. The products used have been developed by the Satellite Application Facility on Climate Monitoring (CM SAF) and are one geostationary climate dataset (SARAH-JRC), one polar-orbiting climate dataset (CLARA-A2) and one geostationary operational product. Further, the ERA-Interim reanalysis is also included in the comparison. The main objective is to determine the quality level of the daily means of the CM SAF datasets, identifying their limitations, as well as analyzing the different factors that can interfere with adequate validation of the products. The quality of the pyranometers was the most critical source of uncertainty identified. In this respect, the use of records from Second Class pyranometers and silicon-based photodiodes increased the absolute error and the bias, as well as the dispersion of both metrics, preventing an adequate validation of the daily means. The best spatial estimates for the three datasets were obtained in Central Europe with a Mean Absolute Deviation (MAD) within 8-13 W/m², whereas the MAD always increased at high latitudes and over snow-covered surfaces, high mountain ranges and coastal areas. Overall, SARAH-JRC's accuracy was demonstrated over a dense network of stations, making it the most consistent dataset for climate monitoring applications. The operational dataset was comparable to SARAH-JRC in Central Europe, but lacked the temporal stability of climate datasets, while CLARA-A2 did not achieve the same level of accuracy, although its estimates showed high uniformity with a small negative bias. The ERA-Interim reanalysis shows by far the largest deviations from the surface reference measurements.
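
    The validation metrics quoted above (MAD and bias of daily means against pyranometer records) reduce to the following sketch for a single station; quality filtering and station handling in the actual study are more elaborate.

    ```python
    # Mean Absolute Deviation and bias of satellite daily means vs. a station.
    import numpy as np

    def mad_and_bias(satellite: np.ndarray, station: np.ndarray) -> tuple[float, float]:
        """Both inputs are co-located daily-mean irradiance series in W/m^2."""
        diff = np.asarray(satellite, dtype=float) - np.asarray(station, dtype=float)
        return float(np.mean(np.abs(diff))), float(np.mean(diff))
    ```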

  1. The Atlanta Urban Heat Island Mitigation and Air Quality Modeling Project: How High-Resolution Remote Sensing Data Can Improve Air Quality Models

    NASA Technical Reports Server (NTRS)

    Quattrochi, Dale A.; Estes, Maurice G., Jr.; Crosson, William L.; Khan, Maudood N.

    2006-01-01

    The Atlanta Urban Heat Island and Air Quality Project had its genesis in Project ATLANTA (ATlanta Land use Analysis: Temperature and Air quality), which began in 1996. Project ATLANTA examined how high-spatial-resolution thermal remote sensing data could be used to derive better measurements of the Urban Heat Island effect over Atlanta. We have explored how these thermal remote sensing data, as well as other image datasets, can be used to better characterize the urban landscape for improved air quality modeling over the Atlanta area. For the air quality modeling project, the National Land Cover Dataset and the local-scale Landpro99 dataset at 30 m spatial resolution have been used to derive land use/land cover characteristics for input into the MM5 mesoscale meteorological model that is one of the foundations for the Community Multiscale Air Quality (CMAQ) model, to assess how these data can improve output from CMAQ. Additionally, land use changes to 2030 have been predicted using a Spatial Growth Model (SGM). SGM simulates growth around a region using population, employment and travel demand forecasts. Air quality modeling simulations were conducted using both current and future land cover. Meteorological modeling simulations indicate a 0.5°C increase in daily maximum air temperatures by 2030. Air quality modeling simulations show substantial differences in relative contributions of individual atmospheric pollutant constituents as a result of land cover change. Enhanced boundary layer mixing over the city tends to offset the increase in ozone concentration expected due to higher surface temperatures as a result of urbanization.

  2. HadISD: a quality-controlled global synoptic report database for selected variables at long-term stations from 1973-2011

    NASA Astrophysics Data System (ADS)

    Dunn, R. J. H.; Willett, K. M.; Thorne, P. W.; Woolley, E. V.; Durre, I.; Dai, A.; Parker, D. E.; Vose, R. S.

    2012-10-01

    This paper describes the creation of HadISD: an automatically quality-controlled synoptic resolution dataset of temperature, dewpoint temperature, sea-level pressure, wind speed, wind direction and cloud cover from global weather stations for 1973-2011. The full dataset consists of over 6000 stations, with 3427 long-term stations deemed to have sufficient sampling and quality for climate applications requiring sub-daily resolution. As with other surface datasets, coverage is heavily skewed towards Northern Hemisphere mid-latitudes. The dataset is constructed from a large pre-existing ASCII flatfile data bank that represents over a decade of substantial effort at data retrieval, reformatting and provision. These raw data have had varying levels of quality control applied to them by individual data providers. The work proceeded in several steps: merging stations with multiple reporting identifiers; reformatting to netCDF; quality control; and then filtering to form a final dataset. Particular attention has been paid to maintaining true extreme values where possible within an automated, objective process. Detailed validation has been performed on a subset of global stations and also on UK data using known extreme events to help finalise the QC tests. Further validation was performed on a selection of extreme events world-wide (Hurricane Katrina in 2005, the cold snap in Alaska in 1989 and heat waves in SE Australia in 2009). Some very initial analyses are performed to illustrate some of the types of problems to which the final data could be applied. Although the filtering has removed the poorest station records, no attempt has been made to homogenise the data thus far, due to the complexity of retaining the true distribution of high-resolution data when applying adjustments. Hence non-climatic, time-varying errors may still exist in many of the individual station records and care is needed in inferring long-term trends from these data. This dataset will allow the study of high frequency variations of temperature, pressure and humidity on a global basis over the last four decades. Both individual extremes and the overall population of extreme events could be investigated in detail to allow for comparison with past and projected climate. A version-control system has been constructed for this dataset to allow for the clear documentation of any updates and corrections in the future.
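
    As a flavour of the automated quality control described above, the sketch below applies three generic checks (gross range, spike and repeated-value) to an hourly temperature series; the operational HadISD suite uses a much larger, carefully tuned set of tests, so treat this only as an illustration with assumed thresholds.

    ```python
    # Flag hourly temperature observations with simple automated QC checks.
    import pandas as pd

    def qc_flags(temps: pd.Series,
                 valid_range=(-90.0, 60.0),   # gross-range limits in deg C (assumed)
                 spike_limit=15.0,            # max plausible hour-to-hour jump (assumed)
                 max_repeats=24) -> pd.DataFrame:
        out_of_range = ~temps.between(*valid_range)
        spike = temps.diff().abs() > spike_limit
        # flag runs of identical consecutive values longer than max_repeats
        run_id = (temps != temps.shift()).cumsum()
        run_len = temps.groupby(run_id).transform("size")
        flatline = run_len > max_repeats
        return pd.DataFrame({"out_of_range": out_of_range,
                             "spike": spike,
                             "flatline": flatline})
    ```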

  3. Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets

    PubMed Central

    2011-01-01

    Background M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand for understanding the function and relationship of proteins in various strains of M. tuberculosis. Protein-protein interaction (PPI) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear. This hampers the effectiveness of research works that rely on these PPI datasets. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first dataset is the high-throughput B2H PPI dataset from Wang et al.'s recent paper in the Journal of Proteome Research. The second dataset is from the STRING database, version 8.3, consisting entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional associations rather than direct physical interactions. Results To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe a significantly greater portion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) having correlated gene expression profiles and coherent informative GO term annotations in both interaction partners than that in the H37Rv B2H PPI dataset. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset, and this similarity level is much lower than that between the S. aureus MRSA252 predicted physical interologs from IntAct and S. aureus MRSA252 pull-down PPIs. Comparative analysis with several representative two-hybrid PPI datasets in other species further confirms that the H37Rv B2H PPI dataset is of low quality. Next, to test the possibility that the H37Rv STRING PPIs are not purely direct physical interactions, we compare M. tuberculosis H37Rv protein pairs that catalyze adjacent steps in enzymatic reactions to both the B2H PPIs and the predicted PPIs in STRING; these protein pairs show much lower similarity with the B2H PPIs than with the STRING PPIs. This result strongly suggests that the H37Rv STRING PPIs more likely correspond to indirect relationships between protein pairs than to B2H PPIs. For more precise support, we turn to S. cerevisiae for its comprehensively studied interactome. We compare S. cerevisiae predicted PPIs in STRING to three independent protein relationship datasets which respectively comprise PPIs reported in Y2H assays, protein pairs reported to be in the same protein complexes, and protein pairs that catalyze successive reaction steps in enzymatic reactions. Our analysis reveals that S. cerevisiae predicted STRING PPIs have much higher similarity to the latter two types of protein pairs than to two-hybrid PPIs. As H37Rv STRING PPIs are predicted using similar methods as S. cerevisiae predicted STRING PPIs, this suggests that these H37Rv STRING PPIs are more likely to correspond to the latter two types of protein pairs than to two-hybrid PPIs as well.
Conclusions The H37Rv B2H PPI dataset has low quality. It should not be used as the gold standard to assess the quality of other (possibly predicted) H37Rv PPI datasets. The H37Rv STRING PPI dataset also has low quality; nevertheless, a subset consisting of STRING PPIs with score ≥770 has satisfactory quality. However, these STRING “PPIs” should be interpreted as functional associations, which include a substantial portion of indirect protein interactions, rather than direct physical interactions. These two factors cause the strikingly low similarity between these two main H37Rv PPI datasets. The results and conclusions from this comparative analysis provide valuable guidance in using these M. tuberculosis H37Rv PPI datasets in subsequent studies for a wide range of purposes. PMID:22369691

  4. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement in the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets are of very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of the Life Cycle Inventories have been derived. Moreover, these results confirm the quality of the electricity-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.

  5. Use of graph theory measures to identify errors in record linkage.

    PubMed

    Randall, Sean M; Boyd, James H; Ferrante, Anna M; Bauer, Jacqueline K; Semmens, James B

    2014-07-01

    Ensuring high linkage quality is important in many record linkage applications. Current methods for ensuring quality are manual and resource intensive. This paper seeks to determine the effectiveness of graph theory techniques in identifying record linkage errors. A range of graph theory techniques was applied to two linked datasets, with known truth sets. The ability of graph theory techniques to identify groups containing errors was compared to a widely used threshold setting technique. This methodology shows promise; however, further investigations into graph theory techniques are required. The development of more efficient and effective methods of improving linkage quality will result in higher quality datasets that can be delivered to researchers in shorter timeframes. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
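
    One simple graph-theoretic check of the kind investigated here is to treat accepted links as edges and flag connected components that look implausible; the sketch below, with illustrative thresholds, is not the paper's method but shows the general idea.

    ```python
    # Flag linkage groups that are unusually large or sparsely connected.
    import networkx as nx

    def suspicious_groups(links: list[tuple[str, str]],
                          max_size: int = 10,
                          min_density: float = 0.5) -> list[set[str]]:
        graph = nx.Graph()
        graph.add_edges_from(links)   # nodes are record IDs, edges are accepted links
        flagged = []
        for component in nx.connected_components(graph):
            sub = graph.subgraph(component)
            if len(component) > max_size or nx.density(sub) < min_density:
                flagged.append(set(component))
        return flagged
    ```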

  6. Validation of a new SAFRAN-based gridded precipitation product for Spain and comparisons to Spain02 and ERA-Interim

    NASA Astrophysics Data System (ADS)

    Quintana-Seguí, Pere; Turco, Marco; Herrera, Sixto; Miguez-Macho, Gonzalo

    2017-04-01

    Offline land surface model (LSM) simulations are useful for studying the continental hydrological cycle. Because of the nonlinearities in the models, the results are very sensitive to the quality of the meteorological forcing; thus, high-quality gridded datasets of screen-level meteorological variables are needed. Precipitation datasets are particularly difficult to produce due to the inherent spatial and temporal heterogeneity of that variable. They do, however, have a large impact on the simulations, and it is thus necessary to carefully evaluate their quality in great detail. This paper reports the quality of two high-resolution precipitation datasets for Spain at the daily time scale: the new SAFRAN-based dataset and Spain02. SAFRAN is a meteorological analysis system that was designed to force LSMs and has recently been extended to the entirety of Spain for a long period of time (1979/1980-2013/2014). Spain02 is a daily precipitation dataset for Spain and was created mainly to validate regional climate models. In addition, ERA-Interim is included in the comparison to show the differences between local high-resolution and global low-resolution products. The study compares the different precipitation analyses with rain gauge data and assesses their temporal and spatial similarities to the observations. The validation of SAFRAN with independent data shows that this is a robust product. SAFRAN and Spain02 have very similar scores, although the latter slightly surpasses the former. The scores are robust with altitude and throughout the year, save perhaps in summer when a diminished skill is observed. As expected, SAFRAN and Spain02 perform better than ERA-Interim, which has difficulty capturing the effects of the relief on precipitation due to its low resolution. However, ERA-Interim reproduces spells remarkably well in contrast to the low skill shown by the high-resolution products. The high-resolution gridded products overestimate the number of precipitation days, which is a problem that affects SAFRAN more than Spain02 and is likely caused by the interpolation method. Both SAFRAN and Spain02 underestimate high precipitation events, but SAFRAN does so more than Spain02. The overestimation of low precipitation events and the underestimation of intense episodes will probably have hydrological consequences once the data are used to force a land surface or hydrological model.

  7. On-line 3D motion estimation using low resolution MRI

    NASA Astrophysics Data System (ADS)

    Glitzner, M.; de Senneville, B. Denis; Lagendijk, J. J. W.; Raaymakers, B. W.; Crijns, S. P. M.

    2015-08-01

    Image processing such as deformable image registration finds its way into radiotherapy as a means to track non-rigid anatomy. With the advent of magnetic resonance imaging (MRI) guided radiotherapy, intrafraction anatomy snapshots become technically feasible. MRI provides the needed tissue signal for high-fidelity image registration. However, acquisitions, especially in 3D, take a considerable amount of time. Pushing towards real-time adaptive radiotherapy, MRI needs to be accelerated without degrading the quality of information. In this paper, we investigate the impact of image resolution on the quality of motion estimations. Potentially, spatially undersampled images yield comparable motion estimations. At the same time, their acquisition times would be greatly reduced due to the sparser sampling. In order to substantiate this hypothesis, exemplary 4D datasets of the abdomen were downsampled gradually. Subsequently, spatiotemporal deformations are extracted consistently using the same motion estimation for each downsampled dataset. Errors between the original and the respectively downsampled version of the dataset are then evaluated. Compared to ground-truth, results show high similarity of deformations estimated from downsampled image data. Using a dataset with (2.5 mm)³ voxel size, deformation fields could be recovered well up to a downsampling factor of 2, i.e. (5 mm)³ voxel size. In an MRI therapy guidance scenario, imaging speed could accordingly increase approximately fourfold, with acceptable loss of estimated motion quality.

  8. Data assimilation and model evaluation experiment datasets

    NASA Technical Reports Server (NTRS)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need for data in the four phases of the experiment are briefly stated. The preparation of DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested these data was incorporated into their refinement. Suggestions for DAMEE data usage include (1) ocean modeling and data assimilation studies, (2) diagnosis and theoretical studies, and (3) comparisons with locally detailed observations.

  9. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models, not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging, for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan-European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, as well as evapotranspiration (potential evapotranspiration, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as inputs to the hydrological calibration and validation of EFAS as well as for establishing long-term discharge "proxy" climatologies which can then in turn be used for statistical analysis to derive return periods or other time series derivatives. In addition, this dataset will be used to assess climatological trends in Europe. Unfortunately, to date no baseline dataset at the European scale exists to test the quality of the herein presented data. Hence, a comparison against other existing datasets can only be an indication of data quality. Due to availability, a comparison was made for precipitation and temperature only, arguably the most important meteorological drivers for hydrologic models. A variety of analyses was undertaken at country scale against data reported to EUROSTAT and E-OBS datasets. The comparison revealed that while the datasets showed overall similar temporal and spatial patterns, there were some differences in magnitude, especially for precipitation. It is not straightforward to define the specific cause for these differences. However, in most cases the comparatively low observation station density appears to be the principal reason for the differences in magnitude.
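
    Since the abstract names the Penman-Monteith equation as the source of the potential evapotranspiration layer, a minimal sketch of the FAO-56 reference form is given below. All input values are hypothetical, and the EFAS-Meteo implementation details (surface parameters, radiation terms) are not reproduced.

      # Minimal FAO-56 Penman-Monteith sketch for daily reference evapotranspiration.
      import math

      def fao56_pet(t_mean, rn, u2, rh_mean, g=0.0, pressure=101.3):
          """Daily reference ET (mm/day).
          t_mean: mean air temperature [degC], rn: net radiation [MJ m-2 day-1],
          u2: wind speed at 2 m [m/s], rh_mean: relative humidity [%],
          g: soil heat flux [MJ m-2 day-1], pressure: air pressure [kPa]."""
          es = 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))  # saturation vapour pressure [kPa]
          ea = es * rh_mean / 100.0                                  # actual vapour pressure [kPa]
          delta = 4098.0 * es / (t_mean + 237.3) ** 2                # slope of the es curve [kPa/degC]
          gamma = 0.000665 * pressure                                # psychrometric constant [kPa/degC]
          num = 0.408 * delta * (rn - g) + gamma * (900.0 / (t_mean + 273.0)) * u2 * (es - ea)
          den = delta + gamma * (1.0 + 0.34 * u2)
          return num / den

      print(f"ET0 = {fao56_pet(t_mean=18.0, rn=14.5, u2=2.1, rh_mean=65.0):.2f} mm/day")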

  10. Background qualitative analysis of the European Reference Life Cycle Database (ELCD) energy datasets - part I: fuel datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this study is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) fuel datasets. The revision is based on the data quality indicators described by the ILCD Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and the appropriateness in terms of completeness, precision and methodology. Results show that the ELCD fuel datasets are of very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of Life Cycle Inventories have been derived. Moreover, these results attest to the quality of the fuel-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD fuel datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall DQR of databases.

  11. MoleculeNet: a benchmark for molecular machine learning

    PubMed Central

    Wu, Zhenqin; Ramsundar, Bharath; Feinberg, Evan N.; Gomes, Joseph; Geniesse, Caleb; Pappu, Aneesh S.; Leswing, Karl

    2017-01-01

    Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm. PMID:29629118
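
    As an illustration of how such a benchmark is typically consumed, the sketch below loads one MoleculeNet task through DeepChem and scores a simple scikit-learn baseline on the test split. The loader name and argument reflect the commonly documented DeepChem API but may differ between versions; the masking of missing labels (carried in the dataset weights) is omitted for brevity.

      # Hedged sketch: load a MoleculeNet task via DeepChem and fit a baseline model.
      import deepchem as dc
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import roc_auc_score

      # Returns task names, (train, valid, test) splits and the transformers applied.
      tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="ECFP")

      task_idx = 0  # evaluate a single task; a full benchmark averages over all tasks
      clf = RandomForestClassifier(n_estimators=100, random_state=0)
      clf.fit(train.X, train.y[:, task_idx])

      # Note: test.w marks missing labels; a proper evaluation would mask them out.
      auc = roc_auc_score(test.y[:, task_idx], clf.predict_proba(test.X)[:, 1])
      print(f"{tasks[task_idx]} ROC-AUC: {auc:.3f}")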

  12. Large-scale machine learning and evaluation platform for real-time traffic surveillance

    NASA Astrophysics Data System (ADS)

    Eichel, Justin A.; Mishra, Akshaya; Miller, Nicholas; Jankovic, Nicholas; Thomas, Mohan A.; Abbott, Tyler; Swanson, Douglas; Keller, Joel

    2016-09-01

    In traffic engineering, vehicle detectors are trained on limited datasets, resulting in poor accuracy when deployed in real-world surveillance applications. Annotating large-scale, high-quality datasets is challenging. Typically, these datasets have limited diversity; they do not reflect the real-world operating environment. There is a need for a large-scale, cloud-based positive and negative mining process and a large-scale learning and evaluation system for the application of automatic traffic measurements and classification. The proposed positive and negative mining process addresses the quality of crowd-sourced ground-truth data through machine learning review and human feedback mechanisms. The proposed learning and evaluation system uses a distributed cloud computing framework to handle data-scaling issues associated with large numbers of samples and a high-dimensional feature space. The system is trained using AdaBoost on 1,000,000 Haar-like features extracted from 70,000 annotated video frames. The trained real-time vehicle detector achieves an accuracy of at least 95% for 1/2 of the time and about 78% for 19/20 of the time when tested on approximately 7,500,000 video frames. At the end of 2016, the dataset is expected to have over 1 billion annotated video frames.
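
    The sketch below illustrates the general technique named in the abstract (Haar-like features plus AdaBoost) on synthetic patches; it is not the authors' cloud-scale pipeline, and the patch size, feature family and sample counts are illustrative assumptions.

      # Haar-like features (scikit-image) + AdaBoost (scikit-learn) on synthetic patches.
      import numpy as np
      from skimage.transform import integral_image
      from skimage.feature import haar_like_feature
      from sklearn.ensemble import AdaBoostClassifier

      rng = np.random.default_rng(0)

      def extract(patch):
          ii = integral_image(patch)
          # One simple feature family only; the detector described above draws on
          # ~1,000,000 Haar-like features.
          return haar_like_feature(ii, 0, 0, patch.shape[1], patch.shape[0],
                                   feature_type="type-2-x")

      # Synthetic stand-ins for annotated vehicle / non-vehicle patches.
      positives = [rng.random((8, 8)) + 0.5 for _ in range(50)]
      negatives = [rng.random((8, 8)) for _ in range(50)]
      X = np.array([extract(p) for p in positives + negatives])
      y = np.array([1] * 50 + [0] * 50)

      clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
      print("training accuracy:", clf.score(X, y))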

  13. Using Third Party Data to Update a Reference Dataset in a Quality Evaluation Service

    NASA Astrophysics Data System (ADS)

    Xavier, E. M. A.; Ariza-López, F. J.; Ureña-Cámara, M. A.

    2016-06-01

    Nowadays it is easy to find many data sources for various regions around the globe. In this 'data overload' scenario there is little, if any, information available about the quality of these data sources. In order to provide this data quality information easily, we presented the architecture of a web service for the automation of quality control of spatial datasets running over a Web Processing Service (WPS). For quality procedures that require an external reference dataset, like positional accuracy or completeness, the architecture permits using a reference dataset. However, this reference dataset is not ageless, since it suffers from the natural time degradation inherent to geospatial features. In order to mitigate this problem we propose the Time Degradation & Updating Module, which uses assessed data as a tool to keep the reference database updated. The main idea is to utilize datasets sent to the quality evaluation service as a source of 'candidate data elements' for the updating of the reference database. After the evaluation, if some elements of a candidate dataset reach a determined quality level, they can be used as input data to improve the current reference database. In this work we present the first design of the Time Degradation & Updating Module. We believe that the outcomes can be applied in the pursuit of a fully automatic on-line quality evaluation platform.
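
    A minimal Python sketch of the described updating logic follows; the class, field names and quality threshold are illustrative assumptions, not the module's actual design.

      # Candidate elements that pass the quality evaluation replace or complement
      # older elements of the reference database.
      from dataclasses import dataclass
      from datetime import date

      @dataclass
      class Feature:
          feature_id: str
          geometry: object        # e.g. a vector geometry in a real system
          quality_score: float    # result of the WPS quality evaluation
          observed: date

      def update_reference(reference: dict, candidates: list, min_quality: float = 0.9) -> dict:
          """Insert candidate features that pass the quality threshold and are
          newer than the stored reference element (if any)."""
          for feat in candidates:
              if feat.quality_score < min_quality:
                  continue
              current = reference.get(feat.feature_id)
              if current is None or feat.observed > current.observed:
                  reference[feat.feature_id] = feat
          return reference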

  14. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Levin, Barnaby D. A.; Padgett, Elliot; Chen, Chien-Chun

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.

  15. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    Abstract The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
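
    For reference, the (plain, non-inferred) NDCG used to score ranked retrieval results can be computed as in the sketch below; the graded relevance labels are hypothetical.

      import numpy as np

      def dcg(relevances):
          """Discounted cumulative gain for a ranked list of graded relevances."""
          relevances = np.asarray(relevances, dtype=float)
          ranks = np.arange(1, relevances.size + 1)
          return np.sum((2 ** relevances - 1) / np.log2(ranks + 1))

      def ndcg(relevances):
          """DCG normalized by the DCG of the ideal (descending) ordering."""
          ideal = dcg(sorted(relevances, reverse=True))
          return dcg(relevances) / ideal if ideal > 0 else 0.0

      # Relevance of the top-5 retrieved datasets (2 = highly relevant, 0 = irrelevant).
      print(f"NDCG@5 = {ndcg([2, 0, 1, 2, 0]):.3f}")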

  16. Maize - GO annotation methods, evaluation, and review (Maize-GAMER)

    USDA-ARS?s Scientific Manuscript database

    Making a genome sequence accessible and useful involves three basic steps: genome assembly, structural annotation, and functional annotation. The quality of data generated at each step influences the accuracy of inferences that can be made, with high-quality analyses producing better datasets resultin...

  17. Primary Datasets for Case Studies of River-Water Quality

    ERIC Educational Resources Information Center

    Goulder, Raymond

    2008-01-01

    Level 6 (final-year BSc) students undertook case studies on between-site and temporal variation in river-water quality. They used professionally-collected datasets supplied by the Environment Agency. The exercise gave students the experience of working with large, real-world datasets and led to their understanding how the quality of river water is…

  18. Applications of the LBA-ECO Metadata Warehouse

    NASA Astrophysics Data System (ADS)

    Wilcox, L.; Morrell, A.; Griffith, P. C.

    2006-05-01

    The LBA-ECO Project Office has developed a system to harvest and warehouse metadata resulting from the Large-Scale Biosphere Atmosphere Experiment in Amazonia. The harvested metadata is used to create dynamically generated reports, available at www.lbaeco.org, which facilitate access to LBA-ECO datasets. The reports are generated for specific controlled vocabulary terms (such as an investigation team or a geospatial region), and are cross-linked with one another via these terms. This approach creates a rich contextual framework enabling researchers to find datasets relevant to their research. It maximizes data discovery by association and provides a greater understanding of the scientific and social context of each dataset. For example, our website provides a profile (e.g. participants, abstract(s), study sites, and publications) for each LBA-ECO investigation. Linked from each profile is a list of associated registered dataset titles, each of which link to a dataset profile that describes the metadata in a user-friendly way. The dataset profiles are generated from the harvested metadata, and are cross-linked with associated reports via controlled vocabulary terms such as geospatial region. The region name appears on the dataset profile as a hyperlinked term. When researchers click on this link, they find a list of reports relevant to that region, including a list of dataset titles associated with that region. Each dataset title in this list is hyperlinked to its corresponding dataset profile. Moreover, each dataset profile contains hyperlinks to each associated data file at its home data repository and to publications that have used the dataset. We also use the harvested metadata in administrative applications to assist quality assurance efforts. These include processes to check for broken hyperlinks to data files, automated emails that inform our administrators when critical metadata fields are updated, dynamically generated reports of metadata records that link to datasets with questionable file formats, and dynamically generated region/site coordinate quality assurance reports. These applications are as important as those that facilitate access to information because they help ensure a high standard of quality for the information. This presentation will discuss reports currently in use, provide a technical overview of the system, and discuss plans to extend this system to harvest metadata resulting from the North American Carbon Program by drawing on datasets in many different formats, residing in many thematic data centers and also distributed among hundreds of investigators.

  19. Quality-control of an hourly rainfall dataset and climatology of extremes for the UK.

    PubMed

    Blenkinsop, Stephen; Lewis, Elizabeth; Chan, Steven C; Fowler, Hayley J

    2017-02-01

    Sub-daily rainfall extremes may be associated with flash flooding, particularly in urban areas but, compared with extremes on daily timescales, have been relatively little studied in many regions. This paper describes a new, hourly rainfall dataset for the UK based on ∼1600 rain gauges from three different data sources. This includes tipping bucket rain gauge data from the UK Environment Agency (EA), which has been collected for operational purposes, principally flood forecasting. Significant problems in the use of such data for the analysis of extreme events include the recording of accumulated totals, high frequency bucket tips, rain gauge recording errors and the non-operation of gauges. Given the prospect of an intensification of short-duration rainfall in a warming climate, the identification of such errors is essential if sub-daily datasets are to be used to better understand extreme events. We therefore first describe a series of procedures developed to quality control this new dataset. We then analyse ∼380 gauges with near-complete hourly records for 1992-2011 and map the seasonal climatology of intense rainfall based on UK hourly extremes using annual maxima, n-largest events and fixed threshold approaches. We find that the highest frequencies and intensities of hourly extreme rainfall occur during summer when the usual orographically defined pattern of extreme rainfall is replaced by a weaker, north-south pattern. A strong diurnal cycle in hourly extremes, peaking in late afternoon to early evening, is also identified in summer and, for some areas, in spring. This likely reflects the different mechanisms that generate sub-daily rainfall, with convection dominating during summer. The resulting quality-controlled hourly rainfall dataset will provide considerable value in several contexts, including the development of standard, globally applicable quality-control procedures for sub-daily data, the validation of the new generation of very high-resolution climate models and improved understanding of the drivers of extreme rainfall.
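
    The three sampling approaches mentioned for the extremes climatology (annual maxima, n-largest events, fixed threshold) can be expressed compactly with pandas, as in the sketch below on a synthetic hourly series; the paper's quality-control procedures themselves are not reproduced.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(1)
      idx = pd.date_range("1992-01-01", "2011-12-31 23:00", freq="h")
      rain = pd.Series(rng.gamma(shape=0.08, scale=2.0, size=idx.size), index=idx)  # mm/h

      annual_maxima = rain.groupby(rain.index.year).max()      # AMAX series
      n_largest = rain.groupby(rain.index.year).nlargest(3)    # 3 largest events per year
      over_threshold = rain[rain > 10.0]                       # fixed-threshold exceedances

      print(annual_maxima.head())
      print(f"{over_threshold.size} hours exceed 10 mm/h")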

  20. Absence of carious lesions at margins of glass-ionomer cement and amalgam restorations: An update of systematic review evidence

    PubMed Central

    2011-01-01

    Background This article aims to update the existing systematic review evidence elicited by Mickenautsch et al. up to 18 January 2008 (published in the European Journal of Paediatric Dentistry in 2009) and addressing the review question of whether, in the same dentition and same cavity class, glass-ionomer cement (GIC) restored cavities show less recurrent carious lesions on cavity margins than cavities restored with amalgam. Methods The systematic literature search was extended beyond the original search date and a further hand-search and reference check was done. The quality of accepted trials was assessed, using updated quality criteria, and the risk of bias was investigated in more depth than previously reported. In addition, the focus of quantitative synthesis was shifted to single datasets extracted from the accepted trials. Results The database search (up to 10 August 2010) identified 1 new trial, in addition to the 9 included in the original systematic review, and 11 further trials were included after a hand-search and reference check. Of these 21 trials, 11 were excluded and 10 were accepted for data extraction and quality assessment. Thirteen dichotomous datasets of primary outcomes and 4 datasets with secondary outcomes were extracted. Meta-analysis and cumulative meta-analysis were used in combining clinically homogenous datasets. The overall results of the computed datasets suggest that GIC has a higher caries-preventive effect than amalgam for restorations in permanent teeth. No difference was found for restorations in the primary dentition. Conclusion This outcome is in agreement with the conclusions of the original systematic review. Although the findings of the trials identified in this update may be considered to be less affected by attrition- and publication bias, their risk of selection- and detection/performance bias is high. Thus, verification of the currently available results requires further high-quality randomised control trials. PMID:21396097

  1. MRIQC: Advancing the automatic prediction of image quality in MRI from unseen sites

    PubMed Central

    2017-01-01

    Quality control of MRI is essential for excluding problematic acquisitions and avoiding bias in subsequent image processing and analysis. Visual inspection is subjective and impractical for large scale datasets. Although automated quality assessments have been demonstrated on single-site datasets, it is unclear that solutions can generalize to unseen data acquired at new sites. Here, we introduce the MRI Quality Control tool (MRIQC), a tool for extracting quality measures and fitting a binary (accept/exclude) classifier. Our tool can be run both locally and as a free online service via the OpenNeuro.org portal. The classifier is trained on a publicly available, multi-site dataset (17 sites, N = 1102). We perform model selection evaluating different normalization and feature exclusion approaches aimed at maximizing across-site generalization and estimate an accuracy of 76%±13% on new sites, using leave-one-site-out cross-validation. We confirm that result on a held-out dataset (2 sites, N = 265) also obtaining a 76% accuracy. Even though the performance of the trained classifier is statistically above chance, we show that it is susceptible to site effects and unable to account for artifacts specific to new sites. MRIQC performs with high accuracy in intra-site prediction, but performance on unseen sites leaves space for improvement which might require more labeled data and new approaches to the between-site variability. Overcoming these limitations is crucial for a more objective quality assessment of neuroimaging data, and to enable the analysis of extremely large and multi-site samples. PMID:28945803
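
    Leave-one-site-out cross-validation of the kind described can be set up directly with scikit-learn, as in the sketch below; the features, labels and site codes are synthetic stand-ins for the MRIQC quality measures.

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 20))         # image quality metrics per scan
      y = rng.integers(0, 2, size=300)       # accept (0) / exclude (1) labels
      sites = rng.integers(0, 10, size=300)  # acquisition site of each scan

      # Each fold holds out every scan from one site, mimicking an unseen site.
      scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                               groups=sites, cv=LeaveOneGroupOut())
      print(f"accuracy on held-out sites: {scores.mean():.2f} +/- {scores.std():.2f}")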

  2. Assessing reliability of protein-protein interactions by integrative analysis of data in model organisms.

    PubMed

    Lin, Xiaotong; Liu, Mei; Chen, Xue-wen

    2009-04-29

    Protein-protein interactions play vital roles in nearly all cellular processes and are involved in the construction of biological pathways such as metabolic and signal transduction pathways. Although large-scale experiments have enabled the discovery of thousands of previously unknown linkages among proteins in many organisms, the high-throughput interaction data is often associated with high error rates. Since protein interaction networks have been utilized in numerous biological inferences, the experimental errors they contain inevitably affect the quality of such predictions. Thus, it is essential to assess the quality of the protein interaction data. In this paper, a novel Bayesian network-based integrative framework is proposed to assess the reliability of protein-protein interactions. We develop a cross-species in silico model that assigns likelihood scores to individual protein pairs based on information entirely extracted from model organisms. Our proposed approach integrates multiple microarray datasets and novel features derived from gene ontology. Furthermore, the confidence scores for cross-species protein mappings are explicitly incorporated into our model. Applying our model to predict protein interactions in the human genome, we achieve 80% sensitivity and 70% specificity. Finally, we assess the overall quality of the experimentally determined yeast protein-protein interaction dataset. We observe that the more high-throughput experiments confirm an interaction, the higher its likelihood score, which confirms the effectiveness of our approach. This study demonstrates that model organisms certainly provide important information for protein-protein interaction inference and assessment. The proposed method is able to assess not only the overall quality of an interaction dataset, but also the quality of individual protein-protein interactions. We expect the method to improve continually as more high-quality interaction data from more model organisms become available; it is readily scalable to a genome-wide application.

  3. A no-reference image and video visual quality metric based on machine learning

    NASA Astrophysics Data System (ADS)

    Frantc, Vladimir; Voronin, Viacheslav; Semenishchev, Evgenii; Minkin, Maxim; Delov, Aliy

    2018-04-01

    The paper presents a novel visual quality metric for the quality assessment of lossy compressed video. A high degree of correlation with subjective quality estimates is achieved by using a convolutional neural network trained on a large number of video sequence-subjective quality score pairs. We demonstrate how our predicted no-reference quality metric correlates with qualitative opinion in a human observer study. Results are shown on the EVVQ dataset with comparisons to existing approaches.

  4. How Diverse are the Protein-Bound Conformations of Small-Molecule Drugs and Cofactors?

    NASA Astrophysics Data System (ADS)

    Friedrich, Nils-Ole; Simsir, Méliné; Kirchmair, Johannes

    2018-03-01

    Knowledge of the bioactive conformations of small molecules or the ability to predict them with theoretical methods is of key importance to the design of bioactive compounds such as drugs, agrochemicals and cosmetics. Using an elaborate cheminformatics pipeline, which also evaluates the support of individual atom coordinates by the measured electron density, we compiled a complete set (“Sperrylite Dataset”) of high-quality structures of protein-bound ligand conformations from the PDB. The Sperrylite Dataset consists of a total of 10,936 high-quality structures of 4548 unique ligands. Based on this dataset, we assessed the variability of the bioactive conformations of 91 small molecules—each represented by a minimum of ten structures—and found it to be largely independent of the number of rotatable bonds. Sixty-nine molecules had at least two distinct conformations (defined by an RMSD greater than 1 Å). For a representative subset of 17 approved drugs and cofactors we observed a clear trend for the formation of few clusters of highly similar conformers. Even for proteins that share a very low sequence identity, ligands were regularly found to adopt similar conformations. For cofactors, a clear trend for extended conformations was measured, although in few cases also coiled conformers were observed. The Sperrylite Dataset is available for download from http://www.zbh.uni-hamburg.de/sperrylite_dataset.
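
    The "distinct conformation" criterion (RMSD greater than 1 Å) rests on an optimal rigid superposition of two conformers; a self-contained sketch using the Kabsch algorithm is given below, with synthetic coordinates and matching atom order assumed.

      import numpy as np

      def kabsch_rmsd(a, b):
          """RMSD (input units) after optimal rigid superposition of coordinate
          arrays a, b of shape (n_atoms, 3) with identical atom ordering."""
          a = a - a.mean(axis=0)
          b = b - b.mean(axis=0)
          u, s, vt = np.linalg.svd(a.T @ b)
          d = np.sign(np.linalg.det(u @ vt))       # guard against reflections
          rot = u @ np.diag([1.0, 1.0, d]) @ vt
          diff = a @ rot - b
          return np.sqrt((diff ** 2).sum() / len(a))

      rng = np.random.default_rng(0)
      conf_a = rng.normal(size=(30, 3))                           # synthetic conformer
      conf_b = conf_a + rng.normal(scale=0.3, size=conf_a.shape)  # perturbed copy
      rmsd = kabsch_rmsd(conf_a, conf_b)
      print(rmsd, "distinct" if rmsd > 1.0 else "similar")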

  5. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

    PubMed

    Ernst, Jason; Kellis, Manolis

    2015-04-01

    With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.

  6. A 30+ Year AVHRR LAI and FAPAR Climate Data Record: Algorithm Description, Validation, and Case Study

    NASA Technical Reports Server (NTRS)

    Claverie, Martin; Matthews, Jessica L.; Vermote, Eric F.; Justice, Christopher O.

    2016-01-01

    In land surface models, which are used to evaluate the role of vegetation in the context of global climate change and variability, LAI and FAPAR play a key role, specifically with respect to the carbon and water cycles. The AVHRR-based LAI/FAPAR dataset offers daily temporal resolution, an improvement over previous products. This climate data record is based on a carefully calibrated and corrected land surface reflectance dataset to provide a high-quality, consistent time series suitable for climate studies. It spans from mid-1981 to the present. Further, this operational dataset is available in near real-time, allowing use for monitoring purposes. The algorithm relies on artificial neural networks calibrated using the MODIS LAI/FAPAR dataset. Evaluation based on cross-comparison with MODIS products and in situ data shows the dataset is consistent and reliable, with overall uncertainties of 1.03 and 0.15 for LAI and FAPAR, respectively. However, a clear saturation effect is observed in the broadleaf forest biomes with high LAI (greater than 4.5) and FAPAR (greater than 0.8) values.

  7. ISED: Constructing a high-resolution elevation road dataset from massive, low-quality in-situ observations derived from geosocial fitness tracking data.

    PubMed

    McKenzie, Grant; Janowicz, Krzysztof

    2017-01-01

    Gaining access to inexpensive, high-resolution, up-to-date, three-dimensional road network data is a top priority beyond research, as such data would fuel applications in industry, governments, and the broader public alike. Road network data are openly available via user-generated content such as OpenStreetMap (OSM) but lack the resolution required for many tasks, e.g., emergency management. More importantly, however, few publicly available data offer information on elevation and slope. For most parts of the world, up-to-date digital elevation products with a resolution of less than 10 meters are a distant dream and, if available, those datasets have to be matched to the road network through an error-prone process. In this paper we present a radically different approach by deriving road network elevation data from massive amounts of in-situ observations extracted from user-contributed data from an online social fitness tracking application. While each individual observation may be of low-quality in terms of resolution and accuracy, taken together they form an accurate, high-resolution, up-to-date, three-dimensional road network that excels where other technologies such as LiDAR fail, e.g., in case of overpasses, overhangs, and so forth. In fact, the 1m spatial resolution dataset created in this research based on 350 million individual 3D location fixes has an RMSE of approximately 3.11m compared to a LiDAR-based ground-truth and can be used to enhance existing road network datasets where individual elevation fixes differ by up to 60m. In contrast, using interpolated data from the National Elevation Dataset (NED) results in 4.75m RMSE compared to the base line. We utilize Linked Data technologies to integrate the proposed high-resolution dataset with OpenStreetMap road geometries without requiring any changes to the OSM data model.
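
    The gridding and evaluation steps can be summarised in a few lines of Python, as in the sketch below (not the authors' full workflow): noisy location fixes are aggregated onto a 1 m grid by their median elevation and compared to a baseline with RMSE; all inputs are synthetic.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(0)
      n = 100_000
      fixes = pd.DataFrame({
          "x": rng.uniform(0, 500, n),               # projected coordinates [m]
          "y": rng.uniform(0, 500, n),
          "z": 100 + rng.normal(scale=5.0, size=n),  # noisy per-fix elevations [m]
      })

      # Snap each fix to its 1 m cell and take the median elevation per cell.
      fixes["col"] = np.floor(fixes["x"]).astype(int)
      fixes["row"] = np.floor(fixes["y"]).astype(int)
      grid = fixes.groupby(["row", "col"])["z"].median()

      # RMSE against a (here: flat, hypothetical) LiDAR ground truth of 100 m.
      lidar = pd.Series(100.0, index=grid.index)
      rmse = np.sqrt(((grid - lidar) ** 2).mean())
      print(f"RMSE vs LiDAR baseline: {rmse:.2f} m")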

  8. jPOSTrepo: an international standard data repository for proteomes

    PubMed Central

    Okuda, Shujiro; Watanabe, Yu; Moriya, Yuki; Kawano, Shin; Yamamoto, Tadashi; Matsumoto, Masaki; Takami, Tomoyo; Kobayashi, Daiki; Araki, Norie; Yoshizawa, Akiyasu C.; Tabata, Tsuyoshi; Sugiyama, Naoyuki; Goto, Susumu; Ishihama, Yasushi

    2017-01-01

    Major advancements have recently been made in mass spectrometry-based proteomics, yielding an increasing number of datasets from various proteomics projects worldwide. In order to facilitate the sharing and reuse of promising datasets, it is important to construct appropriate, high-quality public data repositories. jPOSTrepo (https://repository.jpostdb.org/) has successfully implemented several unique features, including high-speed file uploading, flexible file management and easy-to-use interfaces. This repository has been launched as a public repository containing various proteomic datasets and is available for researchers worldwide. In addition, our repository has joined the ProteomeXchange consortium, which includes the most popular public repositories such as PRIDE in Europe for MS/MS datasets and PASSEL for SRM datasets in the USA. Later MassIVE was introduced in the USA and accepted into the ProteomeXchange, as was our repository in July 2016, providing important datasets from Asia/Oceania. Accordingly, this repository thus contributes to a global alliance to share and store all datasets from a wide variety of proteomics experiments. Thus, the repository is expected to become a major repository, particularly for data collected in the Asia/Oceania region. PMID:27899654

  9. Comparison of High and Low Density Airborne LIDAR Data for Forest Road Quality Assessment

    NASA Astrophysics Data System (ADS)

    Kiss, K.; Malinen, J.; Tokola, T.

    2016-06-01

    Good quality forest roads are important for forest management. Airborne laser scanning data can help create automatized road quality detection, thus avoiding field visits. Two different pulse density datasets have been used to assess road quality: high-density airborne laser scanning data from Kiihtelysvaara and low-density data from Tuusniemi, Finland. The field inventory mainly focused on the surface wear condition, structural condition, flatness, road side vegetation and drying of the road. Observations were divided into poor, satisfactory and good categories based on the current Finnish quality standards used for forest roads. Digital Elevation Models were derived from the laser point cloud, and indices were calculated to determine road quality. The calculated indices assessed the topographic differences on the road surface and road sides. The topographic position index works well in flat terrain only, while the standardized elevation index described the road surface better if the differences are bigger. Both indices require at least a 1 metre resolution. High-density data is necessary for analysis of the road surface, and the indices relate mostly to the surface wear and flatness. The classification was more precise (31-92%) than on low-density data (25-40%). However, ditch detection and classification can be carried out using the sparse dataset as well (with a success rate of 69%). The use of airborne laser scanning data can provide quality information on forest roads.
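
    A topographic position index of the kind used here is simply the elevation of a cell minus the mean elevation of its neighbourhood; the sketch below shows one way to compute it on a synthetic DEM (the paper's window sizes and its standardized elevation index are not reproduced).

      import numpy as np
      from scipy.ndimage import uniform_filter

      rng = np.random.default_rng(0)
      dem = np.cumsum(rng.normal(size=(200, 200)), axis=0)  # stand-in 1 m DEM

      def tpi(dem, window=5):
          """Positive values: locally elevated cells (e.g. the road crown);
          negative values: local depressions such as ditches or ruts."""
          return dem - uniform_filter(dem, size=window, mode="nearest")

      index = tpi(dem, window=5)
      print("TPI range:", index.min(), index.max())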

  10. Improving the Quality of Positive Datasets for the Establishment of Machine Learning Models for pre-microRNA Detection.

    PubMed

    Demirci, Müşerref Duygu Saçar; Allmer, Jens

    2017-07-28

    MicroRNAs (miRNAs) are involved in the post-transcriptional regulation of protein abundance and thus have a great impact on the resulting phenotype. It is, therefore, no wonder that they have been implicated in many diseases ranging from virus infections to cancer. This impact on the phenotype leads to a great interest in establishing the miRNAs of an organism. Experimental methods are complicated which led to the development of computational methods for pre-miRNA detection. Such methods generally employ machine learning to establish models for the discrimination between miRNAs and other sequences. Positive training data for model establishment, for the most part, stems from miRBase, the miRNA registry. The quality of the entries in miRBase has been questioned, though. This unknown quality led to the development of filtering strategies in attempts to produce high quality positive datasets which can lead to a scarcity of positive data. To analyze the quality of filtered data we developed a machine learning model and found it is well able to establish data quality based on intrinsic measures. Additionally, we analyzed which features describing pre-miRNAs could discriminate between low and high quality data. Both models are applicable to data from miRBase and can be used for establishing high quality positive data. This will facilitate the development of better miRNA detection tools which will make the prediction of miRNAs in disease states more accurate. Finally, we applied both models to all miRBase data and provide the list of high quality hairpins.

  11. Common Structure in Different Physical Properties: Electrical Conductivity and Surface Waves Phase Velocity

    NASA Astrophysics Data System (ADS)

    Mandolesi, E.; Jones, A. G.; Roux, E.; Lebedev, S.

    2009-12-01

    Recently, different studies have been undertaken on the correlation between diverse geophysical datasets. Magnetotelluric (MT) data are used to map the electrical conductivity structure beneath the Earth's surface, but one of the problems of the MT method is its lack of resolution in mapping zones beneath a region of high conductivity. Joint inversion of different datasets in which a common structure is recognizable reduces non-uniqueness and may improve the quality of interpretation when the datasets are sensitive to different physical properties with an underlying common structure. A common structure is recognized if the changes in physical properties occur at the same spatial locations. Common structure may be recognized in 1D inversion of seismic and MT datasets, and numerous authors show that a 2D common structure may also lead to an improvement of inversion quality when the datasets are jointly inverted. In this presentation, a tool to constrain 2D MT inversion with the phase velocity of surface-wave seismic data (SW) is proposed; it is being developed and tested on synthetic data. Results obtained suggest that a joint inversion scheme could be applied successfully along a section profile for which the data are compatible with a 2D MT model.

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alam, Ujjaini; Lasue, Jeremie, E-mail: ujjaini.alam@gmail.com, E-mail: jeremie.lasue@irap.omp.eu

    We examine three SNe Type Ia datasets: Union2.1, JLA and Panstarrs to check their consistency using cosmology blind statistical analyses as well as cosmological parameter fitting. We find that the Panstarrs dataset is the most stable of the three to changes in the data, although it does not, at the moment, go to high enough redshifts to tightly constrain the equation of state of dark energy, w. The Union2.1, drawn from several different sources, appears to be somewhat susceptible to changes within the dataset. The JLA reconstructs well for a smaller number of cosmological parameters. At higher degrees of freedom, the dependence of its errors on redshift can lead to varying results between subsets. Panstarrs is inconsistent with the other two datasets at about the 2σ confidence level, and JLA and Union2.1 are about 1σ away from each other. For the Ω₀ₘ − w cosmological reconstruction, with no additional data, the 1σ range of values in w for selected subsets of each dataset is two times larger for JLA and Union2.1 as compared to Panstarrs. The range in Ω₀ₘ for the same subsets remains approximately similar for all three datasets. We find that although there are differences in the fitting and correction techniques used in the different samples, the most important criterion is the selection of the SNe; a slightly different SNe selection can lead to noticeably different results both in the purely statistical analysis and in the cosmological reconstruction. We note that a single, high-quality low-redshift sample could help decrease the uncertainties in the result. We also note that lack of homogeneity in the magnitude errors may bias the results and should either be modeled, or its effect neutralized by using other, complementary datasets. A supernova sample with high-quality data at both high and low redshifts, constructed from a few surveys to avoid heterogeneity in the sample, and with homogeneous errors, would result in a more robust cosmological reconstruction.

  13. Curatr: a web application for creating, curating and sharing a mass spectral library.

    PubMed

    Palmer, Andrew; Phapale, Prasad; Fay, Dominik; Alexandrov, Theodore

    2018-04-15

    We have developed a web application, curatr, for the rapid generation of high-quality mass spectral fragmentation libraries from liquid-chromatography mass spectrometry datasets. Curatr handles datasets from single or multiplexed standards and extracts chromatographic profiles and potential fragmentation spectra for multiple adducts. An intuitive interface helps users to select high-quality spectra that are stored along with searchable molecular information, the provenance of each standard and experimental metadata. Curatr supports exports to several standard formats for use with third-party software or submission to repositories. We demonstrate the use of curatr to generate the EMBL Metabolomics Core Facility spectral library http://curatr.mcf.embl.de. Source code and example data are at http://github.com/alexandrovteam/curatr/. palmer@embl.de. Supplementary data are available at Bioinformatics online.

  14. Automatic retinal interest evaluation system (ARIES).

    PubMed

    Yin, Fengshou; Wong, Damon Wing Kee; Yow, Ai Ping; Lee, Beng Hai; Quan, Ying; Zhang, Zhuo; Gopalakrishnan, Kavitha; Li, Ruoying; Liu, Jiang

    2014-01-01

    In recent years, there has been increasing interest in the use of automatic computer-based systems for the detection of eye diseases such as glaucoma, age-related macular degeneration and diabetic retinopathy. However, in practice, retinal image quality is a big concern as automatic systems without consideration of degraded image quality will likely generate unreliable results. In this paper, an automatic retinal image quality assessment system (ARIES) is introduced to assess both image quality of the whole image and focal regions of interest. ARIES achieves 99.54% accuracy in distinguishing fundus images from other types of images through a retinal image identification step in a dataset of 35342 images. The system employs high level image quality measures (HIQM) to perform image quality assessment, and achieves areas under curve (AUCs) of 0.958 and 0.987 for whole image and optic disk region respectively in a testing dataset of 370 images. ARIES acts as a form of automatic quality control which ensures good quality images are used for processing, and can also be used to alert operators of poor quality images at the time of acquisition.

  15. Public Data Archiving in Ecology and Evolution: How Well Are We Doing?

    PubMed Central

    Roche, Dominique G.; Kruuk, Loeske E. B.; Lanfear, Robert; Binning, Sandra A.

    2015-01-01

    Policies that mandate public data archiving (PDA) successfully increase accessibility to data underlying scientific publications. However, is the data quality sufficient to allow reuse and reanalysis? We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse. We suggest that cultural shifts facilitating clearer benefits to authors are necessary to achieve high-quality PDA and highlight key guidelines to help authors increase their data’s reuse potential and compliance with journal data policies. PMID:26556502

  16. The Application of Satellite-Derived, High-Resolution Land Use/Land Cover Data to Improve Urban Air Quality Model Forecasts

    NASA Technical Reports Server (NTRS)

    Quattrochi, D. A.; Lapenta, W. M.; Crosson, W. L.; Estes, M. G., Jr.; Limaye, A.; Kahn, M.

    2006-01-01

    Local and state agencies are responsible for developing state implementation plans to meet National Ambient Air Quality Standards. Numerical models used for this purpose simulate the transport and transformation of criteria pollutants and their precursors. The specification of land use/land cover (LULC) plays an important role in controlling modeled surface meteorology and emissions. NASA researchers have worked with partners and Atlanta stakeholders to incorporate an improved high-resolution LULC dataset for the Atlanta area within their modeling system and to assess meteorological and air quality impacts of Urban Heat Island (UHI) mitigation strategies. The new LULC dataset provides a more accurate representation of land use, has the potential to improve model accuracy, and facilitates prediction of LULC changes. Use of the new LULC dataset for two summertime episodes improved meteorological forecasts, with an existing daytime cold bias of approximately 3 °C reduced by 30%. Model performance for ozone prediction did not show improvement. In addition, LULC changes due to Atlanta area urbanization were predicted through 2030, for which model simulations predict higher urban air temperatures. The incorporation of UHI mitigation strategies partially offset this warming trend. The data and modeling methods used are generally applicable to other U.S. cities.

  17. The dissolved organic matter as a potential soil quality indicator in arable soils of Hungary.

    PubMed

    Filep, Tibor; Draskovits, Eszter; Szabó, József; Koós, Sándor; László, Péter; Szalai, Zoltán

    2015-07-01

    Although several authors have suggested that the labile fraction of soils could be a potential soil quality indicator, the possibilities and limitations of using the dissolved organic matter (DOM) fraction for this purpose have not yet been investigated. The objective of this study was to evaluate the hypothesis that DOM is an adequate indicator of soil quality. To test this, the soil quality indices (SQI) of 190 arable soils from a Hungarian dataset were estimated, and these values were compared to DOM parameters (DOC and SUVA254). A clear difference in soil quality was found between the soil types, with low soil quality for arenosols (average SQI 0.5) and significantly higher values for gleysols, vertisols, regosols, solonetzes and chernozems. The SQI-DOC relationship could be described by non-linear regression, while a linear connection was observed between SQI and SUVA. The regression equations obtained for the dataset showed only one relatively weak significant correlation between the variables, for DOC (R² = 0.157***; n = 190), while non-significant relationships were found for the DOC and SUVA254 values. However, an envelope curve operated with the datasets showed the robust potential of DOC to indicate soil quality changes, with a high R² value for the envelope curve regression equation. The limitations to using the DOM fraction of soils as a quality indicator are due to the contradictory processes which take place in soils in many cases.

  18. TSPmap, a tool making use of traveling salesperson problem solvers in the efficient and accurate construction of high-density genetic linkage maps.

    PubMed

    Monroe, J Grey; Allen, Zachariah A; Tanger, Paul; Mullen, Jack L; Lovell, John T; Moyers, Brook T; Whitley, Darrell; McKay, John K

    2017-01-01

    Recent advances in nucleic acid sequencing technologies have led to a dramatic increase in the number of markers available to generate genetic linkage maps. This increased marker density can be used to improve genome assemblies as well as add much needed resolution for loci controlling variation in ecologically and agriculturally important traits. However, traditional genetic map construction methods from these large marker datasets can be computationally prohibitive and highly error prone. We present TSPmap, a method which implements both approximate and exact Traveling Salesperson Problem solvers to generate linkage maps. We demonstrate that for datasets with large numbers of genomic markers (e.g. 10,000) and in multiple population types generated from inbred parents, TSPmap can rapidly produce high quality linkage maps with low sensitivity to missing and erroneous genotyping data compared to two other benchmark methods, JoinMap and MSTmap. TSPmap is open source and freely available as an R package. With the advancement of low cost sequencing technologies, the number of markers used in the generation of genetic maps is expected to continue to rise. TSPmap will be a useful tool to handle such large datasets into the future, quickly producing high quality maps using a large number of genomic markers.
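
    To illustrate the core idea (markers as "cities", recombination-based distances, a tour as a marker order), the Python sketch below orders shuffled markers with a greedy nearest-neighbour heuristic; TSPmap itself is an R package and uses proper approximate and exact TSP solvers rather than this toy step, and the distance matrix here is synthetic.

      import numpy as np

      def nearest_neighbour_order(dist):
          """Greedy tour over a symmetric distance matrix; returns marker indices."""
          n = dist.shape[0]
          unvisited = set(range(1, n))
          order = [0]
          while unvisited:
              last = order[-1]
              nxt = min(unvisited, key=lambda j: dist[last, j])
              order.append(nxt)
              unvisited.remove(nxt)
          return order

      # Synthetic pairwise "recombination" distances for markers shuffled from a true order.
      rng = np.random.default_rng(0)
      true_positions = np.sort(rng.uniform(0, 100, 40))
      perm = rng.permutation(40)
      dist = np.abs(true_positions[perm][:, None] - true_positions[perm][None, :])
      print(nearest_neighbour_order(dist))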

  19. Indirectly Estimating International Net Migration Flows by Age and Gender: The Community Demographic Model International Migration (CDM-IM) Dataset

    PubMed Central

    Nawrotzki, Raphael J.; Jiang, Leiwen

    2015-01-01

    Although data for the total number of international migrant flows is now available, no global dataset concerning demographic characteristics, such as the age and gender composition of migrant flows exists. This paper reports on the methods used to generate the CDM-IM dataset of age and gender specific profiles of bilateral net (not gross) migrant flows. We employ raw data from the United Nations Global Migration Database and estimate net migrant flows by age and gender between two time points around the year 2000, accounting for various demographic processes (fertility, mortality). The dataset contains information on 3,713 net migrant flows. Validation analyses against existing data sets and the historical, geopolitical context demonstrate that the CDM-IM dataset is of reasonably high quality. PMID:26692590

  20. How Fit is Your Citizen Science Data?

    NASA Astrophysics Data System (ADS)

    Fischer, H. A.; Gerber, L. R.; Wentz, E. A.

    2017-12-01

    Data quality and accuracy are fundamental concerns with utilizing citizen science data. Although many methods can be used to assess quality and accuracy, these methods may not be sufficient to qualify citizen science data for widespread use in scientific research. While Data Fitness For Use (DFFU) does not provide a blanket assessment of data quality, it does assess the data's ability to be used for a specific application, within a given area (Devillers and Bédard 2007). The STAAq (Spatial, Temporal, Aptness, and Accuracy) assessment was developed to assess the fitness for use of citizen science data; it can be used on a stand-alone dataset or to compare multiple datasets. The citizen science data used in this assessment was collected by volunteers of the Map of Life-Denali project, which is a tourist-centric citizen science project developed through a partnership with Arizona State University, Map of Life at Yale University, and Denali National Park and Preserve. Volunteers use the offline version of the Map of Life app to record their wildlife, insect, and plant observations in the park. To test the STAAq assessment, data from different sources (Map of Life-Denali, Ride Observe and Record (ROAR), and NPS wildlife surveys) were compared to determine which dataset is most fit for use for a specific research question: What is the recent grizzly bear distribution in areas of high visitor use in Denali National Park and Preserve? These datasets were compared and ranked according to how well they performed in each of the components of the STAAq assessment. These components include spatial scale, temporal scale, aptness, and application. The Map of Life-Denali data and the ROAR program data were the most fit for use for this research question. The STAAq assessment can be adjusted to assess the fitness for use of a single dataset or to compare any number of datasets. This data fitness for use assessment provides a means to assess data fitness instead of data quality for citizen science data.

  1. De-identification of health records using Anonym: effectiveness and robustness across datasets.

    PubMed

    Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton

    2014-07-01

    Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and their quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. Findings show that Anonym compares to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations of training size, data type and quality in presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.

  2. Specialized food composition dataset for vitamin D content in foods based on European standards: Application to dietary intake assessment.

    PubMed

    Milešević, Jelena; Samaniego, Lourdes; Kiely, Mairead; Glibetić, Maria; Roe, Mark; Finglas, Paul

    2018-02-01

    A review of national nutrition surveys from 2000 to date, demonstrated high prevalence of vitamin D intakes below the EFSA Adequate Intake (AI) (<15μg/d vitamin D) in adults across Europe. Dietary assessment and modelling are required to monitor efficacy and safety of ongoing strategic vitamin D fortification. To support these studies, a specialized vitamin D food composition dataset, based on EuroFIR standards, was compiled. The FoodEXplorer™ tool was used to retrieve well documented analytical data for vitamin D and arrange the data into two datasets - European (8 European countries, 981 data values) and US (1836 data values). Data were classified, using the LanguaL™, FoodEX2 and ODIN classification systems and ranked according to quality criteria. Significant differences in the content, quality of data values, missing data on vitamin D 2 and 25(OH)D 3 and documentation of analytical methods were observed. The dataset is available through the EuroFIR platform. Copyright © 2017 Elsevier Ltd. All rights reserved.

  3. Cluster Active Archive: lessons learnt

    NASA Astrophysics Data System (ADS)

    Laakso, H. E.; Perry, C. H.; Taylor, M. G.; Escoubet, C. P.; Masson, A.

    2010-12-01

    The ESA Cluster Active Archive (CAA) was opened to the public in February 2006 after an initial three-year development phase. It provides access (both a web GUI and a command-line tool are available) to the calibrated full-resolution datasets of the four-satellite Cluster mission. The data archive is publicly accessible and suitable for science use and publication by the world-wide scientific community. There are more than 350 datasets from each spacecraft, including high-resolution magnetic and electric DC and AC fields as well as full 3-dimensional electron and ion distribution functions and moments from a few eV to hundreds of keV. The Cluster mission has been in operation since February 2001, and although the CAA can currently provide access to some recent observations, the ingestion of other datasets can be delayed by a few years due to the large and difficult calibration routines of aging detectors. The quality of the datasets is the central concern of the CAA. Having the same instrument on four spacecraft allows cross-instrument comparisons and provides confidence in some of the instrumental calibration parameters. Furthermore, it is highly important that many physical parameters are measured by more than one instrument, which allows extensive and continuous cross-calibration analyses. In addition, some of the instruments can be regarded as absolute or reference measurements for other instruments. The CAA attempts to avoid mission-specific acronyms and concepts as much as possible and tends to use more generic terms in describing the datasets and their contents in order to ease the usage of the CAA data by “non-Cluster” scientists. Currently the CAA has more than 1000 users, and every month more than 150 different users log in to the CAA for plotting and/or downloading observations. The users download about 1 TeraByte of data every month. The CAA has separated the graphical tool from the download tool because full-resolution datasets can be visualized in many ways, and so there is no one-to-one correspondence between graphical products and full-resolution datasets. The CAA encourages users to contact the CAA team for all kinds of issues, whether they concern the user interface, the content of the datasets, the quality of the observations or the provision of new types of services. The CAA runs regular annual reviews of the data products and the user services in order to improve the quality and usability of the CAA system for the world-wide user community. The CAA is continuously being upgraded in terms of datasets and services.

  4. The Climate Hazards group InfraRed Precipitation with Stations (CHIRPS) dataset and its applications in drought risk management

    NASA Astrophysics Data System (ADS)

    Shukla, Shraddhanand; Funk, Chris; Peterson, Pete; McNally, Amy; Dinku, Tufa; Barbosa, Humberto; Paredes-Trejo, Franklin; Pedreros, Diego; Husak, Greg

    2017-04-01

    A high-quality, long-term, high-resolution precipitation dataset is key for supporting drought-related risk management and food security early warning. Here, we present the Climate Hazards group InfraRed Precipitation with Stations (CHIRPS) v2.0, developed by scientists at the University of California, Santa Barbara and the U.S. Geological Survey Earth Resources Observation and Science Center under the direction of the Famine Early Warning Systems Network (FEWS NET). CHIRPS is a quasi-global precipitation product and is made available at daily to seasonal time scales with a spatial resolution of 0.05° and a 1981 to near real-time period of record. We begin by describing the three main components of CHIRPS - a high-resolution climatology, time-varying cold cloud duration precipitation estimates, and in situ precipitation estimates - and how they are combined. We then present a validation of this dataset and describe how CHIRPS is being disseminated and used in different applications, such as large-scale hydrologic models and crop water balance models. Validation of CHIRPS has focused on comparisons with precipitation products with global coverage, long periods of record and near real-time availability, such as the CPC-Unified, CFS Reanalysis and ECMWF datasets, and with datasets such as GPCC and GPCP that incorporate high-quality in situ data from places such as Uganda, Colombia, and the Sahel. CHIRPS is shown to have low systematic errors (bias) and low mean absolute errors. We find that CHIRPS performance appears quite similar to research-quality products like GPCC and GPCP, but with higher resolution and lower latency. We also present results from independent validation studies focused on South America and East Africa. CHIRPS is currently being used to drive the FEWS NET Land Data Assimilation System (FLDAS), which incorporates multiple hydrologic models, and the Water Requirement Satisfaction Index (WRSI), a widely used crop water balance model. The outputs (such as soil moisture and runoff) from these models are being used for real-time drought monitoring in Africa. With support from USAID FEWS NET, CHG/USGS has developed a two-way strategy for disseminating CHIRPS and related products (e.g. FLDAS, WRSI) and incorporating contributed station data. For example, we are currently working with partners in Mexico (Conagua), Southern Africa (SASSCAL), Colombia (IDEAM), Nigeria (Kukua), Somalia (SWALIM) and Ethiopia (NMA). These institutions provide in situ observations which enhance CHIRPS, and CHIRPS provides feedback on data quality. CHIRPS is then placed in a web-accessible geospatial database. Partners in these countries can then access CHIRPS and other outputs and display this information using web-based mapping tools. This provides a win-win collaboration, leading to improved globally accessible precipitation estimates and improved climate services in developing nations.

  5. Improved Statistical Method For Hydrographic Climatic Records Quality Control

    NASA Astrophysics Data System (ADS)

    Gourrion, J.; Szekely, T.

    2016-02-01

    Climate research benefits from the continuous development of global in-situ hydrographic networks over the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to the historical extreme values ever observed has been developed, implemented and validated. Temperature, salinity and potential density validity intervals are estimated directly from the minimum and maximum values of an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions about the data distributions, such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows the two main objectives of a quality control strategy to be addressed simultaneously, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to early 2014, 2) three historical CTD datasets and 3) the Sea Mammals CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has been implemented in the latest version of the CORA dataset and will benefit the next version of the Copernicus CMEMS dataset.
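    As a concrete illustration of the validity-interval idea described above, the sketch below flags any observation falling outside the historical minimum-maximum envelope for its location, optionally widened by a margin. It is a minimal reading of the abstract, not CORA code; the function name, the margin parameter and the example numbers are invented for illustration.

      import numpy as np

      def minmax_qc_flags(values, hist_min, hist_max, margin=0.0):
          """Flag observations outside a local validity interval.

          The interval is taken directly from the historical minimum and maximum
          ever observed at that location/depth (optionally widened by a margin),
          rather than from mean +/- k * standard deviation. Returns True for
          suspect values.
          """
          lo = hist_min - margin
          hi = hist_max + margin
          values = np.asarray(values, dtype=float)
          return (values < lo) | (values > hi)

      # Example: historical extremes for one grid cell and depth level
      suspect = minmax_qc_flags([12.1, 35.0, 13.4], hist_min=10.5, hist_max=14.2)
      print(suspect)  # the 35.0 value is flagged as suspect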

  6. Here the data: the new FLUXNET collection and the future for model-data integration

    NASA Astrophysics Data System (ADS)

    Papale, D.; Pastorello, G.; Trotta, C.; Chu, H.; Canfora, E.; Agarwal, D.; Baldocchi, D. D.; Torn, M. S.

    2016-12-01

    Seven years after the release of the La Thuile FLUXNET database, widely used in synthesis activities and model-data fusion exercises, a new FLUXNET collection has been released (FLUXNET 2015 - http://fluxnet.fluxdata.org) with the aim of increasing the quality of the measurements and providing high-quality standardized data obtained through a new processing pipeline. The new FLUXNET collection also includes sites with time series of 20 years of continuous carbon and energy fluxes, opening new opportunities for their use in model parameterization and validation. The main characteristics of the FLUXNET 2015 dataset are the uncertainty quantification, the multiple products (e.g. partitioning into photosynthesis and ecosystem respiration) that allow consistency analysis for each site, and the new long-term downscaled meteorological data provided with the data. Feedback from new users, in particular from the modelling communities, is crucial to further improve the quality of the products and move in the direction of a coherent integration across multi-disciplinary communities. In this presentation, the new FLUXNET2015 dataset will be explained and explored, with particular focus on the meaning of the different products and variables, their potential but also their limitations. The future development of the dataset will be discussed, including the role of the regional networks and the ongoing efforts to provide new and advanced services such as near-real-time data provision and a completely open access policy for high-quality standardized measurements of GHG exchanges and additional ecological quantities.

  7. Method applied to the background analysis of energy data to be considered for the European Reference Life Cycle Database (ELCD).

    PubMed

    Fazio, Simone; Garraín, Daniel; Mathieux, Fabrice; De la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda

    2015-01-01

    Under the framework of the European Platform on Life Cycle Assessment, the European Reference Life-Cycle Database (ELCD - developed by the Joint Research Centre of the European Commission) provides core Life Cycle Inventory (LCI) data from front-running EU-level business associations and other sources. The ELCD contains energy-related data on power and fuels. This study describes the methods to be used for the quality analysis of energy data for European markets (available in third-party LC databases and from authoritative sources) that are, or could be, used in the context of the ELCD. The methodology was developed and tested on the energy datasets most relevant for the EU context, derived from GaBi (the reference database used to derive datasets for the ELCD), Ecoinvent, E3 and Gemis. The criteria for the database selection were based on the availability of EU-related data, the inclusion of comprehensive datasets on energy products and services, and the general approval of the LCA community. The proposed approach was based on the quality indicators developed within the International Reference Life Cycle Data System (ILCD) Handbook, further refined to facilitate their use in the analysis of energy systems. The overall Data Quality Rating (DQR) of an energy dataset can be calculated by summing the quality ratings (ranging from 1 to 5, where 1 represents very good and 5 very poor quality) of the individual quality criteria indicators and dividing by the total number of indicators considered. The quality of each dataset can be estimated for each indicator and then compared across the different databases/sources. The results can be used to highlight the weaknesses of each dataset and to guide further improvements to enhance the data quality with regard to the established criteria. This paper describes the application of the methodology to two exemplary datasets, in order to show the potential of the methodological approach. The analysis helps LCA practitioners to evaluate the usefulness of the ELCD datasets for their purposes, and dataset developers and reviewers to derive information that will help improve the overall DQR of databases.
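    The DQR rule described above is simple enough to state directly: the overall rating is the average of the per-indicator quality ratings. The sketch below assumes a hypothetical set of ILCD-style indicator names and scores; only the 1 (very good) to 5 (very poor) scale and the averaging rule come from the abstract.

      def data_quality_rating(indicator_ratings):
          """Overall DQR: average of per-indicator ratings (1 = very good, 5 = very poor)."""
          ratings = list(indicator_ratings.values())
          return sum(ratings) / len(ratings)

      # Hypothetical ILCD-style indicators scored for one energy dataset
      ratings = {
          "technological_representativeness": 2,
          "geographical_representativeness": 1,
          "time_representativeness": 3,
          "completeness": 2,
          "precision": 2,
          "methodological_appropriateness": 1,
      }
      print(round(data_quality_rating(ratings), 2))  # 1.83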

  8. Substituting values for censored data from Texas, USA, reservoirs inflated and obscured trends in analyses commonly used for water quality target development.

    PubMed

    Grantz, Erin; Haggard, Brian; Scott, J Thad

    2018-06-12

    We calculated four median datasets (of chlorophyll a, Chl a; total phosphorus, TP; and transparency) for approximately 100 Texas, USA reservoirs, using multiple approaches to handling censored observations: substituting fractions of the quantification limit (QL; dataset 1 = 1QL, dataset 2 = 0.5QL) and statistical methods for censored datasets (datasets 3-4). Trend analyses of differences between dataset 1 and 3 medians indicated that the percent difference increased linearly above thresholds in the percentage of censored data (%Cen). This relationship was extrapolated to estimate medians for site-parameter combinations with %Cen > 80%, which were combined with dataset 3 as dataset 4. Changepoint analysis of Chl a-TP and transparency-TP relationships indicated threshold differences of up to 50% between datasets. Recursive analysis identified secondary thresholds in dataset 4. Threshold differences show that information introduced via substitution, or missing due to limitations of the statistical methods, biased values, underestimated error, and inflated the strength of TP thresholds identified in datasets 1-3. Analysis of covariance identified differences in the linear regression models relating transparency to TP between datasets 1, 2, and the more statistically robust datasets 3-4. Study findings identify high-risk scenarios for biased analytical outcomes when using substitution. These include a high probability of median overestimation when %Cen > 50-60% for a single QL, or when %Cen is as low as 16% for multiple QLs. Changepoint analysis was uniquely vulnerable to substitution effects when using medians from sites with %Cen > 50%. Linear regression analysis was less sensitive to substitution and missing-data effects, but differences in model parameters for transparency cannot be discounted and could be magnified by log-transformation of the variables.
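    For readers unfamiliar with the substitution approaches compared above, the sketch below computes a site median after replacing censored observations with a fraction of the quantification limit (1QL and 0.5QL, as in datasets 1 and 2). The statistically robust alternatives used for datasets 3-4 (commonly, censored-data methods such as Kaplan-Meier or regression on order statistics) are not shown, and all numbers here are invented.

      import numpy as np

      def median_with_substitution(values, censored, ql, fraction):
          """Median after replacing censored observations (< QL) with fraction * QL.

          values:   measured concentrations (censored entries may hold the QL itself)
          censored: boolean mask, True where the observation is below the QL
          """
          out = np.asarray(values, dtype=float).copy()
          out[np.asarray(censored)] = fraction * ql
          return np.median(out)

      obs = [0.02, 0.02, 0.03, 0.05, 0.08, 0.02]          # mg/L, QL = 0.02
      cens = [True, True, False, False, False, True]       # 50% censored
      print(median_with_substitution(obs, cens, ql=0.02, fraction=1.0))  # "dataset 1" style: 0.025
      print(median_with_substitution(obs, cens, ql=0.02, fraction=0.5))  # "dataset 2" style: 0.02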

  9. Data quality through a web-based QA/QC system: implementation for atmospheric mercury data from the global mercury observation system.

    PubMed

    D'Amore, Francesco; Bencardino, Mariantonia; Cinnirella, Sergio; Sprovieri, Francesca; Pirrone, Nicola

    2015-08-01

    The overall goal of the ongoing Global Mercury Observation System (GMOS) project is to develop a coordinated global monitoring network for mercury, including ground-based, high-altitude and sea-level stations. In order to ensure data reliability and comparability, a significant effort has been made to implement a centralized system designed to quality assure and quality control atmospheric mercury datasets. This system, GMOS-Data Quality Management (G-DQM), uses a web-based approach with real-time adaptive monitoring procedures aimed at preventing the production of poor-quality data. G-DQM is plugged into a cyberinfrastructure and deployed as a service. Atmospheric mercury datasets produced during the first three years of the GMOS project are used as input to demonstrate the application of G-DQM and how it identifies a number of key issues concerning data quality. The major issues influencing data quality are presented and discussed for the GMOS stations under study. Atmospheric mercury data collected at the Longobucco (Italy) station are used as a detailed case study.

  10. Automatic and Robust Delineation of the Fiducial Points of the Seismocardiogram Signal for Non-invasive Estimation of Cardiac Time Intervals.

    PubMed

    Khosrow-Khavar, Farzad; Tavakolian, Kouhyar; Blaber, Andrew; Menon, Carlo

    2016-10-12

    The purpose of this research was to design a delineation algorithm that could detect specific fiducial points of the seismocardiogram (SCG) signal with or without using the electrocardiogram (ECG) R-wave as the reference point. The detected fiducial points were used to estimate cardiac time intervals. Due to the complexity and sensitivity of the SCG signal, the algorithm was designed to robustly discard low-quality cardiac cycles, i.e. those that contain unrecognizable fiducial points. The algorithm was trained on a dataset containing 48,318 manually annotated cardiac cycles. It was then applied to three test datasets: 65 young healthy individuals (dataset 1), 15 individuals above 44 years old (dataset 2), and 25 patients with previous heart conditions (dataset 3). The algorithm achieved high prediction accuracy, with a root-mean-square error of less than 5 ms for all the test datasets. The algorithm's overall mean detection rates per individual recording (DRI) were 74, 68, and 42 percent for the three test datasets when concurrent ECG and SCG were used. For the standalone SCG case, the mean DRI values were 32, 14 and 21 percent. When the proposed algorithm was applied to concurrent ECG and SCG signals, the desired fiducial points of the SCG signal were successfully estimated with a high detection rate. For the standalone case, however, the algorithm achieved high prediction accuracy and detection rate only for the young-individual dataset. The presented algorithm could be used for accurate and non-invasive estimation of cardiac time intervals.
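    The abstract does not give the algorithm's internals, so the following is only a toy sketch of the general concurrent-ECG idea: gate the SCG by detected R-peaks, search a short window after each R-peak for a prominent SCG peak as a fiducial candidate, and discard cycles whose candidate is not clear enough. The window length, the prominence thresholds and the scipy-based peak detection are all assumptions, not the published method.

      import numpy as np
      from scipy.signal import find_peaks

      def detect_ao_points(ecg, scg, fs, search_window=(0.0, 0.25), min_prominence=0.1):
          """Crude ECG-gated detection of an SCG fiducial point (e.g. aortic opening).

          For each detected R-peak, the most prominent SCG peak inside a short
          search window is taken as the candidate; cycles without a sufficiently
          prominent candidate are discarded as low quality. Returns sample indices.
          """
          ecg = np.asarray(ecg, dtype=float)
          scg = np.asarray(scg, dtype=float)
          r_peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), prominence=np.std(ecg))
          fiducials = []
          for r in r_peaks:
              start = r + int(search_window[0] * fs)
              stop = r + int(search_window[1] * fs)
              if stop > len(scg):
                  continue
              peaks, props = find_peaks(scg[start:stop], prominence=min_prominence * np.ptp(scg))
              if len(peaks) == 0:
                  continue                      # reject low-quality cycle
              best = peaks[np.argmax(props["prominences"])]
              fiducials.append(start + best)
          return fiducials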

  11. GRDC. A Collaborative Framework for Radiological Background and Contextual Data Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Quiter, Brian J.; Ramakrishnan, Lavanya; Bandstra, Mark S.

    The Radiation Mobile Analysis Platform (RadMAP) is unique in its capability to collect both high-quality radiological data from gamma-ray and fast-neutron detectors and a broad array of contextual data that includes positioning and stance data, high-resolution 3D LiDAR data, and data from weather sensors and visual and hyperspectral cameras. The datasets obtained from RadMAP are both voluminous and complex and require analyses from highly diverse communities within both the national laboratory and academic communities. Maintaining a high level of transparency will enable analysis products to further enrich the RadMAP dataset. It is in this spirit of open and collaborative data that the RadMAP team proposed to collect, calibrate, and make available online data from the RadMAP system. The Berkeley Data Cloud (BDC) is a cloud-based data management framework that enables web-based data browsing and visualization, and connects curated datasets to custom workflows such that analysis products can be managed and disseminated while maintaining user access rights. BDC enables cloud-based analyses of large datasets in a manner that simulates real-time data collection, such that BDC can be used to test algorithm performance on real and source-injected datasets. Using the BDC framework, a subset of the RadMAP datasets has been disseminated via the Gamma Ray Data Cloud (GRDC), hosted at the National Energy Research Scientific Computing (NERSC) Center, enabling data access for over 40 users at 10 institutions.

  12. NASA's Applied Remote Sensing Training (ARSET) Webinar Series

    Atmospheric Science Data Center

    2018-01-30

    Wednesday, January 17, 2018: Data Analysis Tools for High Resolution Air Quality Satellite Datasets. For agenda, registration and additional course information, please access https://go.nasa.gov/2jmhRVD

  13. Benchmark datasets for 3D MALDI- and DESI-imaging mass spectrometry.

    PubMed

    Oetjen, Janina; Veselkov, Kirill; Watrous, Jeramie; McKenzie, James S; Becker, Michael; Hauberg-Lotte, Lena; Kobarg, Jan Hendrik; Strittmatter, Nicole; Mróz, Anna K; Hoffmann, Franziska; Trede, Dennis; Palmer, Andrew; Schiffler, Stefan; Steinhorst, Klaus; Aichler, Michaela; Goldin, Robert; Guntinas-Lichius, Orlando; von Eggeling, Ferdinand; Thiele, Herbert; Maedler, Kathrin; Walch, Axel; Maass, Peter; Dorrestein, Pieter C; Takats, Zoltan; Alexandrov, Theodore

    2015-01-01

    Three-dimensional (3D) imaging mass spectrometry (MS) is an analytical chemistry technique for the 3D molecular analysis of a tissue specimen, entire organ, or microbial colonies on an agar plate. 3D-imaging MS has unique advantages over existing 3D imaging techniques, offers novel perspectives for understanding the spatial organization of biological processes, and has growing potential to be introduced into routine use in both biology and medicine. Owing to the sheer quantity of data generated, the visualization, analysis, and interpretation of 3D imaging MS data remain a significant challenge. Bioinformatics research in this field is hampered by the lack of publicly available benchmark datasets needed to evaluate and compare algorithms. High-quality 3D imaging MS datasets from different biological systems were acquired at several labs, supplied with overview images and scripts demonstrating how to read them, and deposited into MetaboLights, an open repository for metabolomics data. 3D imaging MS data were collected from five samples using two types of 3D imaging MS. 3D matrix-assisted laser desorption/ionization (MALDI) imaging MS data were collected from murine pancreas, murine kidney, human oral squamous cell carcinoma, and interacting microbial colonies cultured in Petri dishes. 3D desorption electrospray ionization (DESI) imaging MS data were collected from a human colorectal adenocarcinoma. With the aim of stimulating research in the field of computational 3D imaging MS, selected high-quality 3D imaging MS datasets are provided that could be used by algorithm developers as benchmark datasets.

  14. High-quality mtDNA control region sequences from 680 individuals sampled across the Netherlands to establish a national forensic mtDNA reference database.

    PubMed

    Chaitanya, Lakshmi; van Oven, Mannis; Brauer, Silke; Zimmermann, Bettina; Huber, Gabriela; Xavier, Catarina; Parson, Walther; de Knijff, Peter; Kayser, Manfred

    2016-03-01

    The use of mitochondrial DNA (mtDNA) for maternal lineage identification often marks the last resort when investigating forensic and missing-person cases involving highly degraded biological materials. As with all comparative DNA testing, a match between evidence and reference sample requires a statistical interpretation, for which high-quality mtDNA population frequency data are crucial. Here, we determined, under high quality standards, the complete mtDNA control-region sequences of 680 individuals from across the Netherlands sampled at 54 sites, covering the entire country with 10 geographic sub-regions. The complete mtDNA control region (nucleotide positions 16,024-16,569 and 1-576) was amplified with two PCR primers and sequenced with ten different sequencing primers using the EMPOP protocol. Haplotype diversity of the entire sample set was very high at 99.63% and, accordingly, the random-match probability was 0.37%. No population substructure within the Netherlands was detected with our dataset. Phylogenetic analyses were performed to determine mtDNA haplogroups. Inclusion of these high-quality data in the EMPOP database (accession number: EMP00666) will improve its overall data content and geographic coverage in the interest of all EMPOP users worldwide. Moreover, this dataset will serve as (the start of) a national reference database for mtDNA applications in forensic and missing person casework in the Netherlands. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
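    Two summary statistics quoted above, haplotype diversity and random-match probability, can be computed from haplotype frequencies alone. The sketch below uses Nei's formula for diversity and reports two common conventions for the random-match probability; which convention the paper used is not stated in the abstract (the quoted 99.63%/0.37% pair is consistent with taking it as one minus the diversity). The sample labels are invented.

      from collections import Counter

      def haplotype_statistics(haplotypes):
          """Nei's haplotype diversity and two common random-match-probability conventions."""
          n = len(haplotypes)
          freqs = [c / n for c in Counter(haplotypes).values()]
          sum_p2 = sum(p * p for p in freqs)
          diversity = (n / (n - 1)) * (1 - sum_p2)
          return {
              "haplotype_diversity": diversity,
              "rmp_sum_p_squared": sum_p2,               # sum of squared haplotype frequencies
              "rmp_one_minus_diversity": 1 - diversity,  # convention consistent with 99.63% / 0.37%
          }

      sample = ["H1", "H2", "H2", "H3", "H4", "H5"]   # toy control-region haplotype labels
      print(haplotype_statistics(sample))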

  15. The National Library of Medicine Pill Image Recognition Challenge: An Initial Report.

    PubMed

    Yaniv, Ziv; Faruque, Jessica; Howe, Sally; Dunn, Kathel; Sharlip, David; Bond, Andrew; Perillan, Pablo; Bodenreider, Olivier; Ackerman, Michael J; Yoo, Terry S

    2016-10-01

    In January 2016 the U.S. National Library of Medicine announced a challenge competition calling for the development and discovery of high-quality algorithms and software that rank how well consumer images of prescription pills match reference images of pills in its authoritative RxIMAGE collection. This challenge was motivated by the need to easily identify unknown prescription pills both by healthcare personnel and the general public. Potential benefits of this capability include confirmation of the pill in settings where the documentation and medication have been separated, such as in a disaster or emergency; and confirmation of a pill when the prescribed medication changes from brand to generic, or for any other reason the shape and color of the pill change. The data for the competition consisted of two types of images: high-quality macro photographs (reference images) and consumer-quality photographs of the quality we expect users of a proposed application to acquire. A training dataset consisting of 2000 reference images and 5000 corresponding consumer-quality images acquired from 1000 pills was provided to challenge participants. A second dataset acquired from 1000 pills with similar distributions of shape and color was reserved as a segregated testing set. Challenge submissions were required to produce a ranking of the reference images, given a consumer-quality image as input. Determination of the winning teams was done using the mean average precision quality metric, with the three winners obtaining mean average precision scores of 0.27, 0.09, and 0.08. In the retrieval results, the correct image was amongst the top five ranked images 43%, 12%, and 11% of the time, out of 5000 query/consumer images. This is a promising initial step towards the development of an NLM software system and application programming interface facilitating pill identification. The training dataset will continue to be freely available online at: http://pir.nlm.nih.gov/challenge/submission.html.
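    The winning entries were scored with mean average precision, a standard retrieval metric. The sketch below shows the textbook definition applied to ranked lists of reference-image identifiers; the challenge's exact scoring script may differ in detail, and the IDs used here are invented.

      def average_precision(ranked_ids, relevant_ids):
          """Average precision for one query: mean of precision@k at each rank holding a relevant item."""
          relevant = set(relevant_ids)
          hits, precisions = 0, []
          for k, item in enumerate(ranked_ids, start=1):
              if item in relevant:
                  hits += 1
                  precisions.append(hits / k)
          return sum(precisions) / len(relevant) if relevant else 0.0

      def mean_average_precision(all_rankings, all_relevant):
          """MAP over all consumer-image queries."""
          aps = [average_precision(r, rel) for r, rel in zip(all_rankings, all_relevant)]
          return sum(aps) / len(aps)

      # Toy example: two queries, each with a ranked list of reference-image IDs
      rankings = [["ref_12", "ref_07", "ref_31"], ["ref_44", "ref_02", "ref_12"]]
      relevant = [["ref_07"], ["ref_12"]]
      print(mean_average_precision(rankings, relevant))  # (1/2 + 1/3) / 2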

  16. Improved statistical method for temperature and salinity quality control

    NASA Astrophysics Data System (ADS)

    Gourrion, Jérôme; Szekely, Tanguy

    2017-04-01

    Climate research and ocean monitoring benefit from the continuous development of global in-situ hydrographic networks over the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to the historical extreme values ever observed has been developed, implemented and validated. Temperature, salinity and potential density validity intervals are estimated directly from the minimum and maximum values of an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions about the data distributions, such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows the two main objectives of an automatic quality control strategy to be addressed simultaneously, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to late 2015, 2) three historical CTD datasets and 3) the Sea Mammals CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has already been implemented in the latest version of the delayed-time CMEMS in-situ dataset and will soon be deployed in the equivalent near-real-time products.

  17. The SysteMHC Atlas project

    PubMed Central

    Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal

    2018-01-01

    Abstract Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. PMID:28985418

  18. MIPS bacterial genomes functional annotation benchmark dataset.

    PubMed

    Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Werner

    2005-05-15

    Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as a benchmark) as well as tedious preparatory work to generate the sequence parameters required as input for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters, that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab

  19. Characterization Of Ocean Wind Vector Retrievals Using ERS-2 High-Resolution Long-Term Dataset And Buoy Measurements

    NASA Astrophysics Data System (ADS)

    Polverari, F.; Talone, M.; Crapolicchio, R.; Levy, G.; Marzano, F.

    2013-12-01

    The European Remote-sensing Satellite (ERS)-2 scatterometer provides wind retrievals over the ocean. To satisfy the need for a high-quality and homogeneous set of scatterometer measurements, the European Space Agency (ESA) has developed the Advanced Scatterometer Processing System (ASPS) project, with which a long-term dataset of new ERS-2 wind products, with an enhanced resolution of 25 km, has been generated by reprocessing the entire ERS mission. This paper presents the main results of the validation of this new dataset using in situ measurements provided by the Prediction and Research Moored Array in the Tropical Atlantic (PIRATA). The comparison indicates that, on average, the scatterometer data agree well with the buoy measurements; however, the scatterometer tends to overestimate lower winds and underestimate higher winds.

  20. Temporal performance assessment of wastewater treatment plants by using multivariate statistical analysis.

    PubMed

    Ebrahimi, Milad; Gerber, Erin L; Rockaway, Thomas D

    2017-05-15

    For most water treatment plants, a significant number of performance data variables are recorded on a time-series basis. Due to the interconnectedness of the variables, it is often difficult to assess over-arching trends and quantify operational performance. The objective of this study was to establish simple and reliable predictive models to correlate target variables with specific measured parameters. This study presents a multivariate analysis of the physicochemical parameters of municipal wastewater. Fifteen quality and quantity parameters were analyzed using data recorded from 2010 to 2016. To determine the overall quality condition of raw and treated wastewater, a Wastewater Quality Index (WWQI) was developed. The index summarizes a large number of measured quality parameters into a single water quality term by considering pre-established quality limitation standards. To identify treatment process performance, the interdependencies between the variables were determined using Principal Component Analysis (PCA). The five components extracted from the 15 variables accounted for 75.25% of the total dataset information and adequately represented the organic, nutrient, oxygen-demanding, and ion activity loadings of the influent and effluent streams. The study also utilized the model to predict quality parameters such as Biological Oxygen Demand (BOD), Total Phosphorus (TP), and WWQI. High accuracies ranging from 71% to 97% were achieved when fitting the models to the training dataset, and relative prediction percentage errors of less than 9% were achieved for the testing dataset. The techniques and procedures presented in this paper provide an assessment framework for wastewater treatment monitoring programs. Copyright © 2017 Elsevier Ltd. All rights reserved.
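    A minimal sketch of the PCA step described above, using scikit-learn on standardized data. The random matrix stands in for the 15 influent/effluent parameters; only the standardize-then-extract-five-components workflow is being illustrated, and the printed explained-variance fraction on random data will not match the roughly 75% reported in the study.

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      # Hypothetical matrix: rows = monitoring dates, columns = 15 influent/effluent parameters
      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 15))

      # Standardize, then extract five principal components as in the described workflow
      X_std = StandardScaler().fit_transform(X)
      pca = PCA(n_components=5)
      scores = pca.fit_transform(X_std)

      print(pca.explained_variance_ratio_.sum())   # fraction of total variance retained
      print(pca.components_.shape)                 # (5, 15) loadings used to interpret each component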

  1. Development of innovative computer software to facilitate the setup and computation of water quality index.

    PubMed

    Nabizadeh, Ramin; Valadi Amin, Maryam; Alimohammadi, Mahmood; Naddafi, Kazem; Mahvi, Amir Hossein; Yousefzadeh, Samira

    2013-04-26

    Developing a water quality index, which is used to convert a water quality dataset into a single number, is the most important task of most water quality monitoring programmes. As the setup of a water quality index depends on different local constraints, it is not feasible to introduce a single definitive water quality index to reveal the water quality level. In this study, an innovative software application, the Iranian Water Quality Index Software (IWQIS), is presented in order to facilitate the calculation of a water quality index based on dynamic weight factors, which helps users to compute the water quality index in cases where some parameters are missing from the datasets. A dataset containing 735 samples of drinking water quality from different parts of the country was used to show the performance of this software using different criteria parameters. The software proved to be an efficient tool to facilitate the setup of water quality indices based on flexible use of variables and water quality databases.
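    The abstract does not give the IWQIS formula, so the sketch below shows one generic way a dynamic-weight index can be computed: the weights of missing parameters are dropped and the remaining weights re-normalized, so an index is still defined for incomplete samples. Parameter names, weights and sub-index values are invented.

      def dynamic_wqi(subindices, weights):
          """Weighted water quality index with dynamic weights.

          subindices: {parameter: quality sub-index (e.g. 0-100)}, missing parameters set to None
          weights:    {parameter: nominal weight}
          Weights of missing parameters are dropped and the remainder re-normalized.
          """
          available = {p: q for p, q in subindices.items() if q is not None}
          if not available:
              raise ValueError("no parameters available")
          total_w = sum(weights[p] for p in available)
          return sum(weights[p] * q for p, q in available.items()) / total_w

      weights = {"pH": 0.2, "turbidity": 0.3, "nitrate": 0.25, "coliforms": 0.25}
      sample = {"pH": 85, "turbidity": 60, "nitrate": None, "coliforms": 90}  # nitrate missing
      print(round(dynamic_wqi(sample, weights), 1))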

  2. A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.

    PubMed

    Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai

    2017-10-01

    This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) while they performed a cue-guided target selection task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz. The phase difference between two adjacent frequencies was 0.5 π. For each subject, the data included six blocks of 40 trials corresponding to all 40 flickers indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark to compare methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
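    A common baseline for target identification on such a dataset is canonical correlation analysis (CCA) between the multi-channel EEG and sinusoidal reference signals at each candidate frequency; the sketch below implements that standard baseline with scikit-learn. It is offered as an example of what the benchmark enables, not as a method described in the abstract.

      import numpy as np
      from sklearn.cross_decomposition import CCA

      def ssvep_cca_identify(eeg, fs, freqs, n_harmonics=2):
          """Standard CCA frequency recognition for one SSVEP trial.

          eeg:   array (n_samples, n_channels) for a single trial
          freqs: candidate stimulation frequencies in Hz
          Returns the frequency whose sinusoidal reference set is most correlated with the EEG.
          """
          t = np.arange(eeg.shape[0]) / fs
          scores = []
          for f in freqs:
              ref = np.column_stack(
                  [func(2 * np.pi * (h + 1) * f * t)
                   for h in range(n_harmonics) for func in (np.sin, np.cos)]
              )
              cca = CCA(n_components=1)
              x_c, y_c = cca.fit_transform(eeg, ref)
              scores.append(abs(np.corrcoef(x_c[:, 0], y_c[:, 0])[0, 1]))
          return freqs[int(np.argmax(scores))]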

  3. Who, What, When, Where? Determining the Health Implications of Wildfire Smoke Exposure

    NASA Astrophysics Data System (ADS)

    Ford, B.; Lassman, W.; Gan, R.; Burke, M.; Pfister, G.; Magzamen, S.; Fischer, E. V.; Volckens, J.; Pierce, J. R.

    2016-12-01

    Exposure to poor air quality is associated with negative impacts on human health. A large natural source of PM in the western U.S. is wildland fires. Accurately attributing health endpoints to wildland-fire smoke requires a determination of the exposed population. This is a difficult endeavor because most current methods for monitoring air quality do not provide high temporal and spatial resolution. Therefore, there is a growing effort to include multiple datasets and create blended products of smoke exposure that exploit the strengths of each dataset. In this work, we combine model (WRF-Chem) simulations, NASA satellite (MODIS) observations, and in-situ surface monitors to improve exposure estimates. We will also introduce a social-media dataset of self-reported smoke/haze/pollution to improve population-level exposure estimates for the summer of 2015. Finally, we use these detailed exposure estimates in different epidemiologic study designs to provide an in-depth understanding of the role wildfire exposure plays in health outcomes.

  4. Towards an effective data peer review

    NASA Astrophysics Data System (ADS)

    Düsterhus, André; Hense, Andreas

    2014-05-01

    Peer review is an established procedure to ensure the quality of scientific publications and is currently used as a prerequisite for the acceptance of papers in the scientific community. In recent years, the publication of raw data and its metadata has received increased attention, which led to the idea of bringing it up to the same standards that journals apply to traditional publications. One missing element to achieve this is a comparable peer review scheme. This contribution introduces the idea of a quality evaluation process designed to analyse the technical quality as well as the content of a dataset. It is based on quality tests whose results are evaluated with the help of the knowledge of an expert. The results of the tests and the expert knowledge are evaluated probabilistically and combined statistically. As a result, the quality of a dataset is estimated with a single value only. This approach allows the reviewer to quickly identify the potential weaknesses of a dataset and generate a transparent and comprehensible report. To demonstrate the scheme, an application to a large meteorological dataset will be shown. Furthermore, the potential and risks of such a scheme will be introduced and the practical implications of its possible introduction at data centres investigated. In particular, the effects of reducing the estimate of the quality of a dataset to a single number will be critically discussed.

  5. A proposed framework on hybrid feature selection techniques for handling high dimensional educational data

    NASA Astrophysics Data System (ADS)

    Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd

    2017-10-01

    Huge amounts of data in educational datasets can cause problems in producing quality data. Recently, data mining approaches have been increasingly used by educational data mining researchers for analyzing data patterns. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing a feature selection process. As a result, these data suffer from computational complexity and require longer computational time for classification. The main objective of this research is to provide an overview of feature selection techniques that have been used to analyze the most significant features. This research then proposes a framework to improve the quality of students' datasets. The proposed framework uses filter- and wrapper-based techniques to support the prediction process in future studies.
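    A minimal sketch of the filter-plus-wrapper idea on synthetic data: a univariate filter first trims the feature space, then a wrapper (recursive feature elimination around a classifier) refines the subset before the final model is fit. The specific scorer, classifier and feature counts are arbitrary choices, not the framework's prescribed settings.

      from sklearn.datasets import make_classification
      from sklearn.feature_selection import SelectKBest, RFE, f_classif
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import Pipeline

      # Synthetic stand-in for a high-dimensional educational dataset
      X, y = make_classification(n_samples=500, n_features=80, n_informative=10, random_state=0)

      # Hybrid selection: cheap filter first, then a wrapper refines the subset
      hybrid = Pipeline([
          ("filter", SelectKBest(score_func=f_classif, k=30)),
          ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
          ("model", LogisticRegression(max_iter=1000)),
      ])
      hybrid.fit(X, y)
      print(hybrid.score(X, y))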

  6. Interpretation of fingerprint image quality features extracted by self-organizing maps

    NASA Astrophysics Data System (ADS)

    Danov, Ivan; Olsen, Martin A.; Busch, Christoph

    2014-05-01

    Accurate prediction of fingerprint quality is of significant importance to any fingerprint-based biometric system. Ensuring high-quality samples for both probe and reference can substantially improve the system's performance by lowering false non-matches, thus allowing finer adjustment of the decision threshold of the biometric system. Furthermore, the increasing usage of biometrics in mobile contexts demands the development of lightweight methods for operational environments. A novel two-tier, computationally efficient approach was recently proposed, based on modelling block-wise fingerprint image data using a Self-Organizing Map (SOM) to extract specific ridge pattern features, which are then used as input to a Random Forests (RF) classifier trained to predict the quality score of a propagated sample. This paper conducts an investigative comparative analysis on a publicly available dataset for the improvement of the two-tier approach by additionally proposing three feature interpretation methods, based respectively on SOM, Generative Topographic Mapping and RF. The analysis shows that two of the proposed methods produce promising results on the given dataset.

  7. Rapid underway profiling of water quality in Queensland estuaries.

    PubMed

    Hodge, Jonathan; Longstaff, Ben; Steven, Andy; Thornton, Phillip; Ellis, Peter; McKelvie, Ian

    2005-01-01

    We present an overview of a portable underway water quality monitoring system (RUM: Rapid Underway Monitoring), developed by integrating several off-the-shelf water quality instruments to provide rapid, comprehensive, and spatially referenced 'snapshots' of water quality conditions. We demonstrate the utility of the system with studies in the Northern Great Barrier Reef (Daintree River) and the Moreton Bay region. The Brisbane dataset highlights RUM's utility in characterising plumes as well as its ability to identify the smaller-scale structure of large areas. RUM is shown to be particularly useful when measuring indicators with large small-scale variability, such as turbidity and chlorophyll-a. Additionally, the Daintree dataset shows the ability to integrate other technologies, resulting in a more comprehensive analysis, whilst sampling offshore highlights some of the analytical issues involved in sampling low-concentration waters. RUM is a low-cost, highly flexible solution that can be modified for use in any water type, on most vessels, and is limited only by the available monitoring technologies.

  8. Mineral Potential Mapping in a Frontier Region

    NASA Astrophysics Data System (ADS)

    Ford, A.

    2009-04-01

    Mineral potential mapping using Geographic Information Systems (GIS) allows for rapid evaluation of spatial geoscience data and has the potential to delineate areas which may be prospective for hosting mineral deposits. Popular methods for evaluating digital data include weights of evidence, fuzzy logic and probabilistic neural networks. To date, such methods have mostly been applied to terrains that are well studied, well explored, and for which high-quality data is readily available. However, despite lacking protracted exploration histories and high-quality data, many frontier regions may have high potential for hosting world-class mineral deposits and may benefit from mineral potential mapping exercises. Sovereign risk factors can limit the scope of previous work in a frontier region: previous research in such areas is often limited and/or inaccessible, publicly available literature and data can be restricted, and any available data may also be unreliable. Mineral potential mapping using GIS in a frontier region therefore presents many challenges in terms of data availability (e.g. non-existent information, lack of digital data) and data quality (e.g. inaccuracy, incomplete coverage). The quality of the final mineral potential map is limited by the quality of the input data and, as such, is affected by data availability and quality. Such issues are not limited to frontier regions, but they are often compounded by having multiple weaknesses within the same dataset, which is uncommon for data in more well-explored, data-rich areas. We show how mineral potential mapping can be successfully applied to frontier regions in order to delineate targets with high potential for hosting a mineral deposit despite the data challenges posed. Data are evaluated using the weights of evidence and fuzzy logic methods because of their effectiveness in dealing with incomplete geoscientific datasets. Weights of evidence may be employed as a data-driven method for indirectly evaluating the quality of the data. In a frontier region, the quality of both the training data (mineral deposits) and the evidential layers (geological features) may be questionable. Statistical measures can be used to verify whether the data exhibit logical inconsistencies which may be the result of inaccurate training data or inaccurate data in an evidential layer. Expert geological knowledge may be used to exclude, refine or modify such datasets for further analysis using an iterative weights of evidence process. After verification of the datasets using weights of evidence, fuzzy logic can be used to prepare a mineral potential map using expert geological knowledge. Fuzzy logic is suited to new areas where data availability may be poor, and allows a geologist to select the evidential layers they believe are the most critical for the particular ore deposit style being investigated, as specific deposit models for the area may not yet exist. These critical layers can then be quantified based on expert opinion. The results of the mineral potential mapping can be verified by their ability to predict known ore deposits within the study area.
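    For reference, the weights-of-evidence calculation for a single binary evidential layer reduces to two log-ratios and their contrast, computed from counts of grid cells with and without the evidence and with and without a known deposit. The sketch below uses the textbook form with invented counts; a weak or unstable contrast is the kind of statistical signal the iterative verification described above would act on.

      import math

      def weights_of_evidence(n_bd, n_b_nd, n_nb_d, n_nb_nd):
          """W+, W- and contrast for one binary evidential layer.

          n_bd    : cells where the evidence is present AND contain a deposit
          n_b_nd  : evidence present, no deposit
          n_nb_d  : evidence absent, deposit present
          n_nb_nd : evidence absent, no deposit
          """
          p_b_given_d = n_bd / (n_bd + n_nb_d)
          p_b_given_nd = n_b_nd / (n_b_nd + n_nb_nd)
          p_nb_given_d = n_nb_d / (n_bd + n_nb_d)
          p_nb_given_nd = n_nb_nd / (n_b_nd + n_nb_nd)
          w_plus = math.log(p_b_given_d / p_b_given_nd)
          w_minus = math.log(p_nb_given_d / p_nb_given_nd)
          return w_plus, w_minus, w_plus - w_minus   # contrast C = W+ - W-

      # Hypothetical counts for a single geological layer in a gridded study area
      print(weights_of_evidence(n_bd=40, n_b_nd=960, n_nb_d=10, n_nb_nd=8990))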

  9. Metabarcoding of marine nematodes – evaluation of reference datasets used in tree-based taxonomy assignment approach

    PubMed Central

    2016-01-01

    Background: Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods, used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. New information: In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high-quality reference dataset. The resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of the cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above-mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919

  10. Metabarcoding of marine nematodes - evaluation of reference datasets used in tree-based taxonomy assignment approach.

    PubMed

    Holovachov, Oleksandr

    2016-01-01

    Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods, used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high-quality reference dataset. The resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of the cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above-mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.

  11. A Computational Framework for High-Throughput Isotopic Natural Abundance Correction of Omics-Level Ultra-High Resolution FT-MS Datasets

    PubMed Central

    Carreer, William J.; Flight, Robert M.; Moseley, Hunter N. B.

    2013-01-01

    New metabolomics applications of ultra-high resolution and accuracy mass spectrometry can provide thousands of detectable isotopologues, with the number of potentially detectable isotopologues increasing exponentially with the number of stable isotopes used in newer isotope tracing methods like stable isotope-resolved metabolomics (SIRM) experiments. This huge increase in usable data requires software capable of correcting the large number of isotopologue peaks resulting from SIRM experiments in a timely manner. We describe the design of a new algorithm and software system capable of handling these high volumes of data, while including quality control methods for maintaining data quality. We validate this new algorithm against a previous single isotope correction algorithm in a two-step cross-validation. Next, we demonstrate the algorithm and correct for the effects of natural abundance for both 13C and 15N isotopes on a set of raw isotopologue intensities of UDP-N-acetyl-D-glucosamine derived from a 13C/15N-tracing experiment. Finally, we demonstrate the algorithm on a full omics-level dataset. PMID:24404440
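    To make the correction concrete, the sketch below shows the single-isotope (13C-only) case: a lower-triangular matrix of binomial probabilities maps true labeling states to observed isotopologues, and solving the linear system recovers the corrected intensities. The published algorithm handles multiple isotopes (13C/15N) and ultra-high-resolution data; this simplified version and its example intensities are for illustration only.

      import numpy as np
      from math import comb

      def nat_abundance_correction(observed, n_carbons, p13c=0.0107):
          """Single-isotope (13C) natural abundance correction of an isotopologue vector.

          observed[i] is the measured intensity of the M+i isotopologue (i = 0..n_carbons).
          M[i, j] = P(observe i | true labeling j), assuming each of the n_carbons - j
          unlabeled positions carries natural 13C with probability p13c; then solve
          M x = observed for the corrected intensities x.
          """
          n = n_carbons
          M = np.zeros((n + 1, n + 1))
          for j in range(n + 1):
              for i in range(j, n + 1):
                  k = i - j
                  M[i, j] = comb(n - j, k) * p13c**k * (1 - p13c) ** (n - j - k)
          corrected = np.linalg.solve(M, np.asarray(observed, dtype=float))
          return np.clip(corrected, 0.0, None)   # negative residuals set to zero

      # Toy 3-carbon metabolite measured after a 13C-tracing experiment (arbitrary intensities)
      print(nat_abundance_correction([100.0, 12.0, 5.0, 40.0], n_carbons=3))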

  12. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of the REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic, with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially for surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high-frequency (~1 Hz) and long-period (10-20 s) body-wave measurements into the REM-3D reference dataset.

  13. [Consideration of guidelines, recommendations and quality indicators for treatment of stroke in the dataset "Emergency Department" of DIVI].

    PubMed

    Kulla, M; Friess, M; Schellinger, P D; Harth, A; Busse, O; Walcher, F; Helm, M

    2015-12-01

    The dataset "Emergency Department" of the German Interdisciplinary Association of Critical Care and Emergency Medicine (DIVI) has been developed during several expert meetings. Its goal is an all-encompassing documentation of the early clinical treatment of patients in emergency departments. Using the example of the index disease acute ischemic stroke (stroke), the aim was to analyze how far this approach has been fulfilled. In this study German, European and US American guidelines were used to analyze the extent of coverage of the datasets on current emergency department guidelines and recommendations from professional societies. In addition, it was examined whether the dataset includes recommended quality indicators (QI) for quality management (QM) and in a third step it was examined to what extent national provisions for billing are included. In each case a differentiation was made whether the respective rationale was primary, i.e. directly apparent or whether it was merely secondarily depicted by expertise. In the evaluation an additional differentiation was made between the level of recommendations and further quality relevant criteria. The modular design of the emergency department dataset comprising 676 data fields is briefly described. A total of 401 individual fields, divided into basic documentation, monitoring and specific neurological documentation of the treatment of stroke patients were considered. For 247 data fields a rationale was found. Partially overlapping, 78.9 % of 214 medical recommendations in 3 guidelines and 85.8 % of the 106 identified quality indicators were primarily covered. Of the 67 requirements for billing of performance of services, 55.5 % are primarily part of the emergency department dataset. Through appropriate expertise and documentation by a board certified neurologist, the results can be improved to almost 100 %. The index disease stroke illustrates that the emergency department dataset of the DIVI covers medical guidelines, especially 100 % of the German guidelines with a grade of recommendation. All necessary information to document the specialized stroke treatment procedure in the German diagnosis-related groups (DRG) system is also covered. The dataset is also suitable as a documentation tool of quality management, for example, to participate in the registry of the German Stroke Society (ADSR). Best results are obtained if the dataset is applied by a physician specialized in the treatment of patients with stroke (e.g. board certified neurologist). Finally the results show that changes in medical guidelines and recommendations for quality management as well as billing-relevant content should be implemented in the development of datasets for documentation to avoid duplicate documentation.

  14. HIGH-RESOLUTION DATASET OF URBAN CANOPY PARAMETERS FOR HOUSTON, TEXAS

    EPA Science Inventory

    Urban dispersion and air quality simulation models applied at various horizontal scales require different levels of fidelity for specifying the characteristics of the underlying surfaces. As the modeling scales approach the neighborhood level (~1 km horizontal grid spacing), the...

  15. Data-driven decision support for radiologists: re-using the National Lung Screening Trial dataset for pulmonary nodule management.

    PubMed

    Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L

    2015-02-01

    Real-time mining of large research trial datasets enables the development of case-based clinical decision support tools. Several applicable research datasets exist, including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used as both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large, high-quality clinical trial datasets to drive evidence-based clinical decision-making.
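    The deployed system uses JavaScript over SQL tables; purely as a self-contained illustration of the cohort-matching query, the sketch below builds a tiny in-memory SQLite table in Python with hypothetical column names and returns the cancer prevalence among participants similar to an index patient. None of the field names or tolerances are taken from the NLST schema.

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE participants (
                          id INTEGER PRIMARY KEY, age INTEGER, pack_years REAL,
                          nodule_size_mm REAL, nodule_margin TEXT, cancer INTEGER)""")
      conn.executemany(
          "INSERT INTO participants VALUES (?, ?, ?, ?, ?, ?)",
          [(1, 62, 40, 7.5, "spiculated", 1),
           (2, 58, 30, 5.0, "smooth", 0),
           (3, 65, 55, 8.0, "spiculated", 1),
           (4, 60, 35, 6.5, "smooth", 0)],
      )

      def cohort_cancer_rate(conn, age, size_mm, age_tol=5, size_tol=2.0):
          """Cancer prevalence and cohort size for participants similar to the index patient."""
          return conn.execute(
              """SELECT AVG(cancer), COUNT(*) FROM participants
                 WHERE age BETWEEN ? AND ? AND nodule_size_mm BETWEEN ? AND ?""",
              (age - age_tol, age + age_tol, size_mm - size_tol, size_mm + size_tol),
          ).fetchone()

      print(cohort_cancer_rate(conn, age=61, size_mm=7.0))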

  16. Iterative non-sequential protein structural alignment.

    PubMed

    Salem, Saeed; Zaki, Mohammed J; Bystroff, Christopher

    2009-06-01

    Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software, along with the datasets, is available online at http://www.cs.rpi.edu/~zaki/software/SNAP.
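
    The record gives no implementation detail, so the sketch below is only a toy illustration of the general superpose-then-realign loop it describes: superpose the structures on the current residue pairing with the Kabsch algorithm, re-pair residues greedily by spatial proximity without enforcing sequence order, and iterate until the pairing stabilizes. It is not the SNAP algorithm or its scoring; the distance cutoff and convergence rule are arbitrary choices.

        # Toy iterative non-sequential alignment: Kabsch superposition on the
        # current pairs, then greedy one-to-one re-pairing by distance.
        import numpy as np

        def kabsch(P, Q):
            """Rotation R and translation t so that P @ R.T + t best fits Q (paired rows)."""
            Pc, Qc = P - P.mean(0), Q - Q.mean(0)
            U, _, Vt = np.linalg.svd(Pc.T @ Qc)
            d = np.sign(np.linalg.det(Vt.T @ U.T))
            R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
            return R, Q.mean(0) - R @ P.mean(0)

        def greedy_pairs(A, B, cutoff=5.0):
            """Non-sequential one-to-one pairing of nearest residues within a cutoff."""
            D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
            order = np.dstack(np.unravel_index(np.argsort(D, axis=None), D.shape))[0]
            used_a, used_b, pairs = set(), set(), []
            for i, j in order:
                if D[i, j] > cutoff:
                    break
                if i not in used_a and j not in used_b:
                    pairs.append((int(i), int(j)))
                    used_a.add(i)
                    used_b.add(j)
            return pairs

        def iterative_align(A, B, init_pairs, n_iter=20):
            pairs = list(init_pairs)
            for _ in range(n_iter):
                ia, ib = zip(*pairs)
                R, t = kabsch(A[list(ia)], B[list(ib)])
                new_pairs = greedy_pairs(A @ R.T + t, B)
                if len(new_pairs) < 3 or new_pairs == pairs:   # converged or degenerate
                    break
                pairs = new_pairs
            return pairs, R, t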

  17. Study on a pattern classification method of soil quality based on simplified learning sample dataset

    USGS Publications Warehouse

    Zhang, Jiahua; Liu, S.; Hu, Y.; Tian, Y.

    2011-01-01

    Given the massive amount of soil information involved in current soil quality grade evaluation, this paper constructs an intelligent classification approach for soil quality grade based on classical sampling techniques and an unordered multi-class Logistic regression model. In a case study for Longchuan County in Guangdong Province, the learning sample capacity was determined for a given confidence level and estimation accuracy, the c-means algorithm was used to automatically extract a simplified learning sample dataset from the cultivated soil quality grade evaluation database, an unordered Logistic classifier model was then built, and the calculation and analysis steps of the intelligent soil quality grade classification were given. The results indicate that soil quality grade can be effectively learned and predicted from the extracted simplified dataset using this method, offering an alternative to the traditional approach to soil quality grade evaluation. © 2011 IEEE.
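
    The sketch below illustrates the general workflow the abstract outlines, under stated substitutions: a large labelled pool is condensed into a simplified learning set by clustering within each grade (scikit-learn KMeans standing in for the c-means step), and an unordered multi-class (multinomial) logistic classifier is then fitted on the reduced set. The soil attributes and grades are synthetic placeholders, not the Longchuan County database.

        # Simplified-learning-set idea with KMeans as a stand-in for c-means.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        def simplify_learning_set(X, y, n_per_class=50, random_state=0):
            """Keep, per grade, the samples nearest to cluster centres as the reduced set."""
            keep = []
            for grade in np.unique(y):
                idx = np.where(y == grade)[0]
                k = min(n_per_class, len(idx))
                km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[idx])
                for centre in km.cluster_centers_:
                    keep.append(idx[np.argmin(np.linalg.norm(X[idx] - centre, axis=1))])
            return np.unique(keep)

        # Synthetic stand-in for soil attributes (e.g. pH, organic matter) and grades 1-4.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 5))
        y = np.digitize(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=2000),
                        bins=[-1.0, 0.0, 1.0]) + 1

        subset = simplify_learning_set(X, y)
        clf = LogisticRegression(max_iter=1000).fit(X[subset], y[subset])
        print("accuracy on the full pool:", round(accuracy_score(y, clf.predict(X)), 3))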

  18. Development of innovative computer software to facilitate the setup and computation of water quality index

    PubMed Central

    2013-01-01

    Background: Developing a water quality index which is used to convert the water quality dataset into a single number is the most important task of most water quality monitoring programmes. As the water quality index setup depends on local conditions and constraints, it is not feasible to introduce a single definitive water quality index to reveal the water quality level. Findings: In this study, an innovative software application, the Iranian Water Quality Index Software (IWQIS), is presented in order to facilitate calculation of a water quality index based on dynamic weight factors, which will help users to compute the water quality index in cases where some parameters are missing from the datasets. Conclusion: A dataset containing 735 water samples of drinking water quality in different parts of the country was used to show the performance of this software using different criteria parameters. The software proved to be an efficient tool to facilitate the setup of water quality indices based on flexible use of variables and water quality databases. PMID:24499556
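
    The internals of IWQIS are not given in the record; the sketch below shows one common way to realize "dynamic weight factors": a weighted index whose weights are re-normalized over whichever parameters are actually present, so a sample with missing analytes still yields an index value. The parameter list, weights and 0-100 sub-index ratings are invented placeholders.

        # Weighted index with weights re-normalized over the parameters present.
        import math

        BASE_WEIGHTS = {"pH": 0.12, "turbidity": 0.08, "nitrate": 0.10,
                        "total_coliform": 0.15, "dissolved_oxygen": 0.17,
                        "tds": 0.08, "hardness": 0.05}   # illustrative only

        def wqi(sub_indices):
            """sub_indices: {parameter: 0-100 rating}; missing parameters are simply omitted."""
            present = {p: q for p, q in sub_indices.items()
                       if p in BASE_WEIGHTS and q is not None and not math.isnan(q)}
            if not present:
                raise ValueError("no usable parameters")
            total_w = sum(BASE_WEIGHTS[p] for p in present)   # dynamic re-normalization
            return sum(BASE_WEIGHTS[p] / total_w * q for p, q in present.items())

        # A sample missing several analytes still gets an index value.
        print(round(wqi({"pH": 88, "nitrate": 72, "dissolved_oxygen": 65, "tds": 90}), 1))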

  19. The SysteMHC Atlas project.

    PubMed

    Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Vizcaíno, Juan A; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; Heck, Albert J R; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal; Aebersold, Ruedi; Caron, Etienne

    2018-01-04

    Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. Compilation and analysis of multiple groundwater-quality datasets for Idaho

    USGS Publications Warehouse

    Hundt, Stephen A.; Hopkins, Candice B.

    2018-05-09

    Groundwater is an important source of drinking and irrigation water throughout Idaho, and groundwater quality is monitored by various Federal, State, and local agencies. The historical, multi-agency records of groundwater quality include a valuable dataset that has yet to be compiled or analyzed on a statewide level. The purpose of this study is to combine groundwater-quality data from multiple sources into a single database, to summarize this dataset, and to perform bulk analyses to reveal spatial and temporal patterns of water quality throughout Idaho. Data were retrieved from the Water Quality Portal (https://www.waterqualitydata.us/), the Idaho Department of Environmental Quality, and the Idaho Department of Water Resources. Analyses included counting the number of times a sample location had concentrations above Maximum Contaminant Levels (MCL), performing trends tests, and calculating correlations between water-quality analytes. The water-quality database and the analysis results are available through USGS ScienceBase (https://doi.org/10.5066/F72V2FBG).
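
    A minimal sketch of the kind of bulk analysis described, assuming hypothetical column names and an illustrative MCL table: count exceedances per site and analyte, and run a simple monotonic-trend test per site. SciPy's Kendall tau is used here as a lightweight stand-in for a formal Mann-Kendall test, and sampling is assumed to be roughly regular in time.

        # Exceedance counts and simple trend screening; schema and limits are placeholders.
        import pandas as pd
        from scipy.stats import kendalltau

        MCL = {"nitrate_mg_l": 10.0, "arsenic_ug_l": 10.0}   # illustrative limits

        def exceedance_counts(df):
            """df columns: site_id, analyte, value, sample_date."""
            df = df.copy()
            df["mcl"] = df["analyte"].map(MCL)
            df["exceeds"] = df["value"] > df["mcl"]
            return df.groupby(["site_id", "analyte"])["exceeds"].sum()

        def site_trends(df, min_samples=8):
            rows = []
            for (site, analyte), grp in df.groupby(["site_id", "analyte"]):
                if len(grp) < min_samples:
                    continue
                grp = grp.sort_values("sample_date")
                tau, p = kendalltau(range(len(grp)), grp["value"])
                rows.append({"site_id": site, "analyte": analyte, "tau": tau, "p_value": p})
            return pd.DataFrame(rows)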

  1. T1-weighted in vivo human whole brain MRI dataset with an ultrahigh isotropic resolution of 250 μm.

    PubMed

    Lüsebrink, Falk; Sciarra, Alessandro; Mattern, Hendrik; Yakupov, Renat; Speck, Oliver

    2017-03-14

    We present an ultrahigh resolution in vivo human brain magnetic resonance imaging (MRI) dataset. It consists of T1-weighted whole brain anatomical data acquired at 7 Tesla with a nominal isotropic resolution of 250 μm of a single young healthy Caucasian subject and was recorded using prospective motion correction. The raw data amounts to approximately 1.2 TB and was acquired in eight hours total scan time. The resolution of this dataset is far beyond any previously published in vivo structural whole brain dataset. Its potential use is to build an in vivo MR brain atlas. Methods for image reconstruction and image restoration can be improved as the raw data is made available. Pre-processing and segmentation procedures can possibly be enhanced for high magnetic field strength and ultrahigh resolution data. Furthermore, potential resolution induced changes in quantitative data analysis can be assessed, e.g., cortical thickness or volumetric measures, as high quality images with an isotropic resolution of 1 and 0.5 mm of the same subject are included in the repository as well.

  2. T1-weighted in vivo human whole brain MRI dataset with an ultrahigh isotropic resolution of 250 μm

    NASA Astrophysics Data System (ADS)

    Lüsebrink, Falk; Sciarra, Alessandro; Mattern, Hendrik; Yakupov, Renat; Speck, Oliver

    2017-03-01

    We present an ultrahigh resolution in vivo human brain magnetic resonance imaging (MRI) dataset. It consists of T1-weighted whole brain anatomical data acquired at 7 Tesla with a nominal isotropic resolution of 250 μm of a single young healthy Caucasian subject and was recorded using prospective motion correction. The raw data amounts to approximately 1.2 TB and was acquired in eight hours total scan time. The resolution of this dataset is far beyond any previously published in vivo structural whole brain dataset. Its potential use is to build an in vivo MR brain atlas. Methods for image reconstruction and image restoration can be improved as the raw data is made available. Pre-processing and segmentation procedures can possibly be enhanced for high magnetic field strength and ultrahigh resolution data. Furthermore, potential resolution induced changes in quantitative data analysis can be assessed, e.g., cortical thickness or volumetric measures, as high quality images with an isotropic resolution of 1 and 0.5 mm of the same subject are included in the repository as well.

  3. Multisite Evaluation of a Data Quality Tool for Patient-Level Clinical Data Sets

    PubMed Central

    Huser, Vojtech; DeFalco, Frank J.; Schuemie, Martijn; Ryan, Patrick B.; Shang, Ning; Velez, Mark; Park, Rae Woong; Boyce, Richard D.; Duke, Jon; Khare, Ritu; Utidjian, Levon; Bailey, Charles

    2016-01-01

    Introduction: Data quality and fitness for analysis are crucial if outputs of analyses of electronic health record data or administrative claims data are to be trusted by the public and the research community. Methods: We describe a data quality analysis tool (called Achilles Heel) developed by the Observational Health Data Sciences and Informatics (OHDSI) collaborative and compare outputs from this tool as it was applied to 24 large healthcare datasets across seven different organizations. Results: We highlight 12 data quality rules that identified issues in at least 10 of the 24 datasets and provide a full set of 71 rules identified in at least one dataset. Achilles Heel is freely available software that provides a useful starter set of data quality rules with the ability to add additional rules. We also present results of a structured email-based interview of all participating sites that collected qualitative comments about the value of Achilles Heel for data quality evaluation. Discussion: Our analysis represents the first comparison of outputs from a data quality tool that implements a fixed (but extensible) set of data quality rules. Thanks to a common data model, we were able to quickly compare multiple datasets originating from several countries in America, Europe and Asia. PMID:28154833
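
    The 71 rules themselves are not reproduced in the record; the sketch below shows two illustrative checks of the same flavor (implausible values, referential gaps) over a simplified, hypothetical person/visit layout. It is not the Achilles Heel rule set.

        # Two rule-based data quality checks in the spirit of a Heel-style report.
        import pandas as pd

        def rule_implausible_birth_year(person, min_year=1900, max_year=2025):
            bad = person[(person["year_of_birth"] < min_year) |
                         (person["year_of_birth"] > max_year)]
            return {"rule": "implausible year_of_birth", "n_failing": len(bad)}

        def rule_visits_without_person(visit, person):
            orphans = ~visit["person_id"].isin(person["person_id"])
            return {"rule": "visit rows with unknown person_id", "n_failing": int(orphans.sum())}

        def run_checks(person, visit):
            report = pd.DataFrame([rule_implausible_birth_year(person),
                                   rule_visits_without_person(visit, person)])
            report["status"] = report["n_failing"].map(lambda n: "FAIL" if n else "PASS")
            return report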

  4. Comparison of Four Precipitation Forcing Datasets in Land Information System Simulations over the Continental U.S.

    NASA Technical Reports Server (NTRS)

    Case, Jonathan L.; Kumar, Sujay V.; Kuligowski, Robert J.; Langston, Carrie

    2013-01-01

    This paper and poster presented a description of the current real-time SPoRT-LIS run over the southeastern CONUS to provide high-resolution, land surface initialization grids for local numerical model forecasts at NWS forecast offices. The LIS hourly output also offers a supplemental dataset to aid in situational awareness for convective initiation forecasts, assessing flood potential, and monitoring drought at fine scales. It is a goal of SPoRT and several NWS forecast offices to expand the LIS to an entire CONUS domain, so that LIS output can be utilized by NWS Western Region offices, among others. To make this expansion cleanly so as to provide high-quality land surface output, SPoRT tested new precipitation datasets in LIS as an alternative forcing dataset to the current radar+gauge Stage IV product. Similar to the Stage IV product, the NMQ product showed comparable patterns of precipitation and soil moisture distribution, but suffered from radar gaps in the intermountain West, and incorrectly set values to zero instead of missing in the data-void regions of Mexico and Canada. The other dataset tested was the next-generation GOES-R QPE algorithm, which experienced a high bias in both coverage and intensity of accumulated precipitation relative to the control (NLDAS2), Stage IV, and NMQ simulations. The resulting root zone soil moisture was substantially higher in most areas.

  5. Evaluation of the Global Land Data Assimilation System (GLDAS) air temperature data products

    USGS Publications Warehouse

    Ji, Lei; Senay, Gabriel B.; Verdin, James P.

    2015-01-01

    There is a high demand for agrohydrologic models to use gridded near-surface air temperature data as the model input for estimating regional and global water budgets and cycles. The Global Land Data Assimilation System (GLDAS) developed by combining simulation models with observations provides a long-term gridded meteorological dataset at the global scale. However, the GLDAS air temperature products have not been comprehensively evaluated, although the accuracy of the products was assessed in limited areas. In this study, the daily 0.25° resolution GLDAS air temperature data are compared with two reference datasets: 1) 1-km-resolution gridded Daymet data (2002 and 2010) for the conterminous United States and 2) global meteorological observations (2000–11) archived from the Global Historical Climatology Network (GHCN). The comparison of the GLDAS datasets with the GHCN datasets, including 13 511 weather stations, indicates a fairly high accuracy of the GLDAS data for daily temperature. The quality of the GLDAS air temperature data, however, is not always consistent in different regions of the world; for example, some areas in Africa and South America show relatively low accuracy. Spatial and temporal analyses reveal a high agreement between GLDAS and Daymet daily air temperature datasets, although spatial details in high mountainous areas are not sufficiently estimated by the GLDAS data. The evaluation of the GLDAS data demonstrates that the air temperature estimates are generally accurate, but caution should be taken when the data are used in mountainous areas or places with sparse weather stations.
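
    As a simplified illustration of this kind of grid-versus-station evaluation (not the study's actual workflow), the sketch below pairs each station observation with its nearest grid cell and computes bias, MAE, RMSE and correlation; the nearest-cell lookup and column names are placeholders, and real comparisons also handle time matching and elevation effects.

        # Pair stations with nearest grid cells and compute simple skill metrics.
        import numpy as np
        import pandas as pd

        def nearest_cell(lat, lon, grid_lats, grid_lons):
            return np.abs(grid_lats - lat).argmin(), np.abs(grid_lons - lon).argmin()

        def compare_to_stations(grid, grid_lats, grid_lons, stations):
            """grid: 2-D temperature field; stations: DataFrame with lat, lon, tair_obs."""
            rows = []
            for _, s in stations.iterrows():
                i, j = nearest_cell(s["lat"], s["lon"], grid_lats, grid_lons)
                rows.append({"obs": s["tair_obs"], "est": grid[i, j]})
            d = pd.DataFrame(rows)
            err = d["est"] - d["obs"]
            return {"bias": err.mean(), "mae": err.abs().mean(),
                    "rmse": float(np.sqrt((err ** 2).mean())),
                    "r": d["obs"].corr(d["est"])}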

  6. High-throughput ocular artifact reduction in multichannel electroencephalography (EEG) using component subspace projection.

    PubMed

    Ma, Junshui; Bayram, Sevinç; Tao, Peining; Svetnik, Vladimir

    2011-03-15

    After a review of the ocular artifact reduction literature, a high-throughput method designed to reduce the ocular artifacts in multichannel continuous EEG recordings acquired at clinical EEG laboratories worldwide is proposed. The proposed method belongs to the category of component-based methods, and does not rely on any electrooculography (EOG) signals. Based on a concept that all ocular artifact components exist in a signal component subspace, the method can uniformly handle all types of ocular artifacts, including eye-blinks, saccades, and other eye movements, by automatically identifying ocular components from decomposed signal components. This study also proposes an improved strategy to objectively and quantitatively evaluate artifact reduction methods. The evaluation strategy uses real EEG signals to synthesize realistic simulated datasets with different amounts of ocular artifacts. The simulated datasets enable us to objectively demonstrate that the proposed method outperforms some existing methods when no high-quality EOG signals are available. Moreover, the results of the simulated datasets improve our understanding of the involved signal decomposition algorithms, and provide us with insights into the inconsistency regarding the performance of different methods in the literature. The proposed method was also applied to two independent clinical EEG datasets involving 28 volunteers and over 1000 EEG recordings. This effort further confirms that the proposed method can effectively reduce ocular artifacts in large clinical EEG datasets in a high-throughput fashion. Copyright © 2011 Elsevier B.V. All rights reserved.
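
    The paper's own component-identification rule is not spelled out in the record; the sketch below shows the generic component-subspace pattern it belongs to: decompose the multichannel EEG into components, flag components that look ocular (here, by correlation with a frontal channel where blinks dominate, an assumption of this sketch), zero them, and reconstruct. FastICA is used as a stand-in decomposition; the channel index and threshold are placeholders.

        # Generic component-subspace artifact reduction (not the paper's exact method).
        import numpy as np
        from sklearn.decomposition import FastICA

        def remove_ocular_components(eeg, frontal_idx=0, corr_thresh=0.7, n_components=None):
            """eeg: array of shape (n_samples, n_channels). Returns cleaned EEG."""
            ica = FastICA(n_components=n_components, random_state=0)
            sources = ica.fit_transform(eeg)                 # (n_samples, n_components)
            frontal = eeg[:, frontal_idx]
            corrs = np.array([abs(np.corrcoef(sources[:, k], frontal)[0, 1])
                              for k in range(sources.shape[1])])
            cleaned = sources.copy()
            cleaned[:, corrs >= corr_thresh] = 0.0           # drop ocular-like components
            return ica.inverse_transform(cleaned)            # back to channel space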

  7. A New Approach to Look at the Electrical Conductivity of Streamflow: Decomposing a Bulk Signal to Recover Individual Solute Concentrations at High-Frequency

    NASA Astrophysics Data System (ADS)

    Benettin, P.; Van Breukelen, B. M.

    2017-12-01

    The ability to evaluate stream hydrochemistry is often constrained by the capacity to sample streamwater at an adequate frequency. While technology is no longer a limiting factor, economic and management efforts can still be a barrier to high-resolution water quality instrumentation. We propose a new framework to investigate the electrical conductivity (EC) of streamwater, which can be measured continuously through inexpensive sensors. We show that EC embeds information on ion content which can be isolated to retrieve solute concentrations at high resolution. The approach can already be applied to a number of datasets worldwide where water quality campaigns are conducted, provided continuous EC measurements can be collected. The essence of the approach is the decomposition of the EC signal into its "harmonics", i.e. the specific contributions of the major ions which conduct current in water. The ion contribution is used to explore water quality patterns and to develop algorithms that reconstruct solute concentrations during periods where solute measurements are not available. The approach is validated on a hydrochemical dataset from Plynlimon, Wales. Results show that the decomposition of EC is feasible and for at least two major elements the methodology provided improved estimates of high-frequency solute dynamics. Our results support the installation of EC probes to complement water quality campaigns and suggest that the potential of EC measurements in rivers is currently far from being fully exploited.
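
    In dilute water, EC is approximately a sum of per-ion contributions proportional to concentration, which is what makes the decomposition possible. The sketch below fits EC against major-ion concentrations with non-negative least squares on synthetic "grab samples" and reports each ion's share of the signal; the coefficients and data are placeholders, and this is not the authors' algorithm.

        # Decompose bulk EC into per-ion contributions via non-negative least squares.
        import numpy as np
        from scipy.optimize import nnls

        ions = ["Ca", "Mg", "Na", "Cl", "SO4", "HCO3"]

        rng = np.random.default_rng(1)
        C = rng.uniform(0.05, 1.0, size=(200, len(ions)))            # concentrations, e.g. mmol/L
        true_k = np.array([120.0, 106.0, 50.0, 76.0, 160.0, 44.0])   # illustrative coefficients
        ec = C @ true_k + rng.normal(0, 5, size=200)                 # bulk EC with noise, µS/cm

        k_hat, _ = nnls(C, ec)                 # fitted per-ion coefficients
        contrib = C.mean(0) * k_hat            # mean contribution of each ion to EC
        for name, share in zip(ions, contrib / contrib.sum()):
            print(f"{name:>4}: {share:5.1%} of the mean EC signal")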

  8. CIFAR10-DVS: An Event-Stream Dataset for Object Classification

    PubMed Central

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, which needs to be recorded using the neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named as “CIFAR10-DVS.” The conversion of event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of frame-based images. Unlike the conversion of frame-based images by moving the camera, the image movement is more realistic in respect of its practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm developments in event-driven pattern recognition and object classification. PMID:28611582

  9. CIFAR10-DVS: An Event-Stream Dataset for Object Classification.

    PubMed

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, which needs to be recorded using the neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named as "CIFAR10-DVS." The conversion of event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of frame-based images. Unlike the conversion of frame-based images by moving the camera, the image movement is more realistic in respect of its practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm developments in event-driven pattern recognition and object classification.

  10. Impact of pre-imputation SNP-filtering on genotype imputation results

    PubMed Central

    2014-01-01

    Background: Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research and understanding of the impact of initial SNP-data quality control on imputation results is still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE. Results: We considered three scenarios: imputation of partially missing genotypes with usage of an external reference panel, without usage of an external reference panel, as well as imputation of completely un-typed SNPs using an external reference panel. We first created various datasets applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios by using established and newly proposed measures of imputation quality. While the established measures assess certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP-filtering might be detrimental regarding imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality. Conclusion: Even a moderate filtering has a detrimental effect on the imputation quality. Therefore little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios at the cost of increased computing time. PMID:25112433
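
    The newly proposed measures themselves are not described in the record; the sketch below shows two generic agreement statistics of the kind used when masked true genotypes are available: best-guess concordance and squared dosage correlation. The toy arrays are placeholders.

        # Generic agreement statistics between imputed dosages and masked true genotypes.
        import numpy as np

        def concordance(true_geno, dosages):
            """true_geno: 0/1/2 genotypes; dosages: imputed expected allele counts (0-2)."""
            best_guess = np.rint(dosages).clip(0, 2)
            return float(np.mean(best_guess == true_geno))

        def dosage_r2(true_geno, dosages):
            r = np.corrcoef(true_geno, dosages)[0, 1]
            return r ** 2

        true_geno = np.array([0, 1, 2, 1, 0, 2, 1, 0])
        dosages = np.array([0.1, 1.2, 1.8, 0.9, 0.2, 1.7, 1.1, 0.4])
        print(round(concordance(true_geno, dosages), 2), round(dosage_r2(true_geno, dosages), 2))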

  11. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

    PubMed Central

    Pujar, Shashikant; O’Leary, Nuala A; Farrell, Catherine M; Mudge, Jonathan M; Wallin, Craig; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bult, Carol J; Frankish, Adam; Pruitt, Kim D

    2018-01-01

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. PMID:29126148

  12. Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control

    PubMed Central

    Kirwan, Jennifer A; Weber, Ralf J M; Broadhurst, David I; Viant, Mark R

    2014-01-01

    Direct-infusion mass spectrometry (DIMS) metabolomics is an important approach for characterising molecular responses of organisms to disease, drugs and the environment. Increasingly large-scale metabolomics studies are being conducted, necessitating improvements in both bioanalytical and computational workflows to maintain data quality. This dataset represents a systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of cardiac tissue extracts. It comprises twenty biological samples (cow vs. sheep) that were analysed repeatedly, in 8 batches across 7 days, together with a concurrent set of quality control (QC) samples. Data are presented from each step of the workflow and are available in MetaboLights. The strength of the dataset is that intra- and inter-batch variation can be corrected using QC spectra and the quality of this correction assessed independently using the repeatedly-measured biological samples. Originally designed to test the efficacy of a batch-correction algorithm, it will enable others to evaluate novel data processing algorithms. Furthermore, this dataset serves as a benchmark for DIMS metabolomics, derived using best-practice workflows and rigorous quality assessment. PMID:25977770
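
    The benchmark's own correction algorithm is not reproduced here; the sketch below shows the common QC-anchored drift correction that such datasets are used to evaluate: for each feature, interpolate the QC intensities over injection order and divide every sample by the interpolated QC trend. Linear interpolation stands in for the LOESS fit typically used in practice, and injection order is assumed increasing.

        # QC-sample-anchored signal drift correction (simplified, linear interpolation).
        import numpy as np

        def qc_drift_correct(X, injection_order, is_qc):
            """X: (n_injections, n_features) intensities; is_qc: boolean mask of QC injections."""
            X = np.asarray(X, dtype=float)
            order = np.asarray(injection_order, dtype=float)
            qc_order = order[is_qc]
            corrected = np.empty_like(X)
            for j in range(X.shape[1]):
                qc_vals = X[is_qc, j]
                trend = np.interp(order, qc_order, qc_vals)   # LOESS is more usual in practice
                corrected[:, j] = X[:, j] / trend * np.median(qc_vals)
            return corrected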

  13. Santa Margarita Estuary Water Quality Monitoring Data

    DTIC Science & Technology

    2018-02-01

    ADMINISTRATIVE INFORMATION The work described in this report was performed for the Water Quality Section of the Environmental Security Marine Corps Base...water quality model calibration given interest and the necessary resources. The dataset should also inform the stakeholders and Regional Board on...period. Several additional ancillary datasets were collected during the monitoring timeframe that provide key information though they were not collected

  14. UK surveillance: provision of quality assured information from combined datasets.

    PubMed

    Paiba, G A; Roberts, S R; Houston, C W; Williams, E C; Smith, L H; Gibbens, J C; Holdship, S; Lysons, R

    2007-09-14

    Surveillance information is most useful when provided within a risk framework, which is achieved by presenting results against an appropriate denominator. Often the datasets are captured separately and for different purposes, and will have inherent errors and biases that can be further confounded by the act of merging. The United Kingdom Rapid Analysis and Detection of Animal-related Risks (RADAR) system contains data from several sources and provides both data extracts for research purposes and reports for wider stakeholders. Considerable efforts are made to optimise the data in RADAR during the Extraction, Transformation and Loading (ETL) process. Despite efforts to ensure data quality, the final dataset inevitably contains some data errors and biases, most of which cannot be rectified during subsequent analysis. So, in order for users to establish the 'fitness for purpose' of data merged from more than one data source, Quality Statements are produced as defined within the overarching surveillance Quality Framework. These documents detail identified data errors and biases following ETL and report construction as well as relevant aspects of the datasets from which the data originated. This paper illustrates these issues using RADAR datasets, and describes how they can be minimised.

  15. CNN Based Retinal Image Upscaling Using Zero Component Analysis

    NASA Astrophysics Data System (ADS)

    Nasonov, A.; Chesnakov, K.; Krylov, A.

    2017-05-01

    The aim of the paper is to obtain high-quality image upscaling for noisy images that are typical in medical image processing. A new training scenario for a convolutional neural network based image upscaling method is proposed. Its main idea is a novel dataset preparation method for deep learning. The dataset contains pairs of noisy low-resolution images and corresponding noiseless high-resolution images. To achieve better results at edges and textured areas, Zero Component Analysis is applied to these images. The upscaling results are compared with other state-of-the-art methods like DCCI, SI-3 and SRCNN on noisy medical ophthalmological images. Objective evaluation of the results confirms the high quality of the proposed method. Visual analysis shows that fine details and structures like blood vessels are preserved, noise level is reduced and no artifacts or non-existing details are added. These properties are essential for establishing a retinal diagnosis, so the proposed algorithm is recommended for use in real medical applications.
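
    The ZCA step is not detailed in the record; below is a standard zero-phase component analysis (ZCA) whitening transform applied to flattened image patches, the decorrelation usually meant by the term. The patch size and the regularization epsilon are arbitrary choices of this sketch.

        # Standard ZCA whitening of flattened image patches.
        import numpy as np

        def zca_whiten(X, eps=1e-2):
            """X: (n_patches, n_pixels). Returns whitened patches and the ZCA matrix."""
            Xc = X - X.mean(axis=0)
            cov = Xc.T @ Xc / Xc.shape[0]
            eigvals, eigvecs = np.linalg.eigh(cov)
            W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # ZCA matrix
            return Xc @ W, W

        patches = np.random.default_rng(0).normal(size=(10000, 64))   # e.g. 8x8 patches
        white, W = zca_whiten(patches)
        print(np.allclose(np.cov(white, rowvar=False), np.eye(64), atol=0.1))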

  16. The Centennial Trends Greater Horn of Africa precipitation dataset.

    PubMed

    Funk, Chris; Nicholson, Sharon E; Landsfeld, Martin; Klotter, Douglas; Peterson, Pete; Harrison, Laura

    2015-01-01

    East Africa is a drought prone, food and water insecure region with a highly variable climate. This complexity makes rainfall estimation challenging, and this challenge is compounded by low rain gauge densities and inhomogeneous monitoring networks. The dearth of observations is particularly problematic over the past decade, since the number of records in globally accessible archives has fallen precipitously. This lack of data coincides with an increasing scientific and humanitarian need to place recent seasonal and multi-annual East African precipitation extremes in a deep historic context. To serve this need, scientists from the UC Santa Barbara Climate Hazards Group and Florida State University have pooled their station archives and expertise to produce a high quality gridded 'Centennial Trends' precipitation dataset. Additional observations have been acquired from the national meteorological agencies and augmented with data provided by other universities. Extensive quality control of the data was carried out and seasonal anomalies interpolated using kriging. This paper documents the CenTrends methodology and data.

  17. The Centennial Trends Greater Horn of Africa precipitation dataset

    USGS Publications Warehouse

    Funk, Chris; Nicholson, Sharon E.; Landsfeld, Martin F.; Klotter, Douglas; Peterson, Pete J.; Harrison, Laura

    2015-01-01

    East Africa is a drought prone, food and water insecure region with a highly variable climate. This complexity makes rainfall estimation challenging, and this challenge is compounded by low rain gauge densities and inhomogeneous monitoring networks. The dearth of observations is particularly problematic over the past decade, since the number of records in globally accessible archives has fallen precipitously. This lack of data coincides with an increasing scientific and humanitarian need to place recent seasonal and multi-annual East African precipitation extremes in a deep historic context. To serve this need, scientists from the UC Santa Barbara Climate Hazards Group and Florida State University have pooled their station archives and expertise to produce a high quality gridded ‘Centennial Trends’ precipitation dataset. Additional observations have been acquired from the national meteorological agencies and augmented with data provided by other universities. Extensive quality control of the data was carried out and seasonal anomalies interpolated using kriging. This paper documents the CenTrends methodology and data.

  18. High-quality seamless DEM generation blending SRTM-1, ASTER GDEM v2 and ICESat/GLAS observations

    NASA Astrophysics Data System (ADS)

    Yue, Linwei; Shen, Huanfeng; Zhang, Liangpei; Zheng, Xianwei; Zhang, Fan; Yuan, Qiangqiang

    2017-01-01

    The absence of a high-quality seamless global digital elevation model (DEM) dataset has been a challenge for the Earth-related research fields. Recently, the 1-arc-second Shuttle Radar Topography Mission (SRTM-1) data have been released globally, covering over 80% of the Earth's land surface (60°N-56°S). However, voids and anomalies still exist in some tiles, which has prevented the SRTM-1 dataset from being directly used without further processing. In this paper, we propose a method to generate a seamless DEM dataset blending SRTM-1, ASTER GDEM v2, and ICESat laser altimetry data. The ASTER GDEM v2 data are used as the elevation source for the SRTM void filling. To get a reliable filling source, ICESat GLAS points are incorporated to enhance the accuracy of the ASTER data within the void regions, using an artificial neural network (ANN) model. After correction, the voids in the SRTM-1 data are filled with the corrected ASTER GDEM values. The triangular irregular network based delta surface fill (DSF) method is then employed to eliminate the vertical bias between them. Finally, an adaptive outlier filter is applied to all the data tiles. The final result is a seamless global DEM dataset. ICESat points collected from 2003 to 2009 were used to validate the effectiveness of the proposed method, and to assess the vertical accuracy of the global DEM products in China. Furthermore, channel networks in the Yangtze River Basin were also extracted for the data assessment.

  19. Data exploration, quality control and statistical analysis of ChIP-exo/nexus experiments

    PubMed Central

    Welch, Rene; Chung, Dongjun; Grass, Jeffrey; Landick, Robert

    2017-01-01

    ChIP-exo/nexus experiments rely on innovative modifications of the commonly used ChIP-seq protocol for high resolution mapping of transcription factor binding sites. Although many aspects of the ChIP-exo data analysis are similar to those of ChIP-seq, these high throughput experiments pose a number of unique quality control and analysis challenges. We develop a novel statistical quality control pipeline and accompanying R/Bioconductor package, ChIPexoQual, to enable exploration and analysis of ChIP-exo and related experiments. ChIPexoQual evaluates a number of key issues including strand imbalance, library complexity, and signal enrichment of data. Assessment of these features is facilitated through diagnostic plots and summary statistics computed over regions of the genome with varying levels of coverage. We evaluated our QC pipeline with both large collections of public ChIP-exo/nexus data and multiple, new ChIP-exo datasets from Escherichia coli. ChIPexoQual analysis of these datasets resulted in guidelines for using these QC metrics across a wide range of sequencing depths and provided further insights for modelling ChIP-exo data. PMID:28911122

  20. Data exploration, quality control and statistical analysis of ChIP-exo/nexus experiments.

    PubMed

    Welch, Rene; Chung, Dongjun; Grass, Jeffrey; Landick, Robert; Keles, Sündüz

    2017-09-06

    ChIP-exo/nexus experiments rely on innovative modifications of the commonly used ChIP-seq protocol for high resolution mapping of transcription factor binding sites. Although many aspects of the ChIP-exo data analysis are similar to those of ChIP-seq, these high throughput experiments pose a number of unique quality control and analysis challenges. We develop a novel statistical quality control pipeline and accompanying R/Bioconductor package, ChIPexoQual, to enable exploration and analysis of ChIP-exo and related experiments. ChIPexoQual evaluates a number of key issues including strand imbalance, library complexity, and signal enrichment of data. Assessment of these features is facilitated through diagnostic plots and summary statistics computed over regions of the genome with varying levels of coverage. We evaluated our QC pipeline with both large collections of public ChIP-exo/nexus data and multiple, new ChIP-exo datasets from Escherichia coli. ChIPexoQual analysis of these datasets resulted in guidelines for using these QC metrics across a wide range of sequencing depths and provided further insights for modelling ChIP-exo data. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  1. A Dataset from TIMSS to Examine the Relationship between Computer Use and Mathematics Achievement

    ERIC Educational Resources Information Center

    Kadijevich, Djordje M.

    2015-01-01

    Because the relationship between computer use and achievement is still puzzling, there is a need to prepare and analyze good quality datasets on computer use and achievement. Such a dataset can be derived from TIMSS data. This paper describes how this dataset can be prepared. It also gives an example of how the dataset may be analyzed. The…

  2. Impact assessment of GPS radio occultation data on Antarctic analysis and forecast using WRF 3DVAR

    NASA Astrophysics Data System (ADS)

    Zhang, H.; Wee, T. K.; Liu, Z.; Lin, H. C.; Kuo, Y. H.

    2016-12-01

    This study assesses the impact of Global Positioning System (GPS) Radio Occultation (RO) refractivity data on the analysis and forecast in the Antarctic region. The RO data are continuously assimilated into the Weather Research and Forecasting (WRF) Model using the WRF 3DVAR along with other observations that were operationally available to the National Centers for Environmental Prediction (NCEP) during a one-month period, October 2010, including the Advanced Microwave Sounding Unit (AMSU) radiance data. For the month-long data assimilation experiments, three RO datasets are used: 1) the actual operational dataset, which was produced by the near real-time RO processing at that time and provided to weather forecasting centers; 2) a post-processed dataset with posterior clock and orbit estimates, and with improved RO processing algorithms; and 3) another post-processed dataset, produced with a variational RO processing. The data impact is evaluated by comparing the forecasts and analyses to independent driftsonde observations made available through the Concordiasi field campaign, in addition to utilizing other traditional means of verification. A denial of RO data (while keeping all other observations) resulted in a remarkable quality degradation of analysis and forecast, indicating the high value of RO data over the Antarctic area. The post-processed RO data showed a significantly larger positive impact compared to the near real-time data, due to extra RO data from the TerraSAR-X satellite (unavailable at the time of the near real-time processing) as well as the supposedly improved data quality as a result of the post-processing. This strongly suggests that the future polar constellation of COSMIC-2 is vital. The variational RO processing further reduced the systematic and random errors in both analysis and forecasts, for instance, leading to a smaller background departure of AMSU radiance. This indicates that the variational RO processing provides an improved reference for the bias correction of satellite radiance, making the bias correction more effective. This study finds that advanced RO data processing algorithms may further enhance the high quality of RO data in high Southern latitudes.

  3. Improving resolution of MR images with an adversarial network incorporating images with different contrast.

    PubMed

    Kim, Ki Hwan; Do, Won-Joon; Park, Sung-Hong

    2018-05-04

    The routine MRI scan protocol consists of multiple pulse sequences that acquire images of varying contrast. Since high frequency contents such as edges are not significantly affected by image contrast, down-sampled images in one contrast may be improved by high resolution (HR) images acquired in another contrast, reducing the total scan time. In this study, we propose a new deep learning framework that uses HR MR images in one contrast to generate HR MR images from highly down-sampled MR images in another contrast. The proposed convolutional neural network (CNN) framework consists of two CNNs: (a) a reconstruction CNN for generating HR images from the down-sampled images using HR images acquired with a different MRI sequence and (b) a discriminator CNN for improving the perceptual quality of the generated HR images. The proposed method was evaluated using a public brain tumor database and in vivo datasets. The performance of the proposed method was assessed in tumor and no-tumor cases separately, with perceptual image quality being judged by a radiologist. To overcome the challenge of training the network with a small number of available in vivo datasets, the network was pretrained using the public database and then fine-tuned using the small number of in vivo datasets. The performance of the proposed method was also compared to that of several compressed sensing (CS) algorithms. Incorporating HR images of another contrast improved the quantitative assessments of the generated HR image in reference to ground truth. Also, incorporating a discriminator CNN yielded perceptually higher image quality. These results were verified in regions of normal tissue as well as tumors for various MRI sequences from pseudo k-space data generated from the public database. The combination of pretraining with the public database and fine-tuning with the small number of real k-space datasets enhanced the performance of CNNs in in vivo application compared to training CNNs from scratch. The proposed method outperformed the compressed sensing methods. The proposed method can be a good strategy for accelerating routine MRI scanning. © 2018 American Association of Physicists in Medicine.

  4. Data splitting for artificial neural networks using SOM-based stratified sampling.

    PubMed

    May, R J; Maier, H R; Dandy, G C

    2010-03-01

    Data splitting is an important consideration during artificial neural network (ANN) development where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between data sets. Of these approaches, DUPLEX is found to provide benchmark performance with good model performance, with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets. Copyright 2009 Elsevier Ltd. All rights reserved.
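
    As an illustration of the allocation rule only, the sketch below performs Neyman-style stratified sampling for a hold-out split, with KMeans clusters standing in for the SOM units used in the paper: each stratum contributes samples in proportion to its size times the within-stratum spread of the target.

        # Neyman-allocation hold-out split; KMeans replaces the SOM of the paper.
        import numpy as np
        from sklearn.cluster import KMeans

        def neyman_split(X, y, n_sample, n_strata=10, random_state=0):
            rng = np.random.default_rng(random_state)
            strata = KMeans(n_clusters=n_strata, n_init=10,
                            random_state=random_state).fit_predict(X)
            weights = np.array([(strata == h).sum() * (y[strata == h].std() + 1e-12)
                                for h in range(n_strata)])          # N_h * S_h
            alloc = np.round(n_sample * weights / weights.sum()).astype(int)
            picked = []
            for h in range(n_strata):
                idx = np.where(strata == h)[0]
                picked.extend(rng.choice(idx, size=min(alloc[h], len(idx)), replace=False))
            test = np.array(sorted(picked))
            train = np.setdiff1d(np.arange(len(X)), test)
            return train, test

        # Synthetic function-approximation example
        rng = np.random.default_rng(1)
        X = rng.uniform(-3, 3, size=(2000, 2))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
        train_idx, test_idx = neyman_split(X, y, n_sample=400)
        print(len(train_idx), len(test_idx))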

  5. Statistical Evaluation of Combined Daily Gauge Observations and Rainfall Satellite Estimations over Continental South America

    NASA Technical Reports Server (NTRS)

    Vila, Daniel; deGoncalves, Luis Gustavo; Toll, David L.; Rozante, Jose Roberto

    2008-01-01

    This paper describes a comprehensive assessment of a new high-resolution, high-quality gauge-satellite based analysis of daily precipitation over continental South America during 2004. This methodology is based on a combination of additive and multiplicative bias correction schemes in order to get the lowest bias when compared with the observed values. Inter-comparisons and cross-validation tests have been carried out for the control algorithm (TMPA real-time algorithm) and different merging schemes: additive bias correction (ADD), ratio bias correction (RAT) and TMPA research version, for different months belonging to different seasons and for different network densities. All compared merging schemes produce better results than the control algorithm, but when finer temporal (daily) and spatial scale (regional networks) gauge datasets are included in the analysis, the improvement is remarkable. The Combined Scheme (CoSch) consistently presents the best performance among the five techniques. This is also true when a degraded daily gauge network is used instead of the full dataset. This technique appears to be a suitable tool to produce real-time, high-resolution, high-quality gauge-satellite based analyses of daily precipitation over land in regional domains.
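
    A simplified, point-wise version of the two correction schemes named above (not the paper's gridded implementation) is sketched below: an additive correction removes the mean gauge-satellite difference, while the ratio correction rescales the satellite totals to the gauge totals. The synthetic daily series is a placeholder.

        # Additive vs. ratio (multiplicative) bias correction against gauges.
        import numpy as np

        def additive_correction(sat, gauge):
            """Shift the satellite estimates by the mean (gauge - satellite) difference."""
            return sat + np.mean(gauge - sat)

        def ratio_correction(sat, gauge, eps=1e-6):
            """Scale the satellite estimates by the ratio of gauge to satellite totals."""
            return sat * (gauge.sum() + eps) / (sat.sum() + eps)

        rng = np.random.default_rng(2)
        gauge = rng.gamma(shape=0.6, scale=8.0, size=365)                   # synthetic daily gauge
        sat = 1.3 * gauge + rng.normal(0, 1.0, size=365).clip(min=-gauge)   # biased satellite

        for name, est in [("raw", sat),
                          ("additive", additive_correction(sat, gauge)),
                          ("ratio", ratio_correction(sat, gauge))]:
            print(f"{name:>8}: bias = {np.mean(est - gauge):6.2f} mm/day")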

  6. Demonstrating the robustness of population surveillance data: implications of error rates on demographic and mortality estimates.

    PubMed

    Fottrell, Edward; Byass, Peter; Berhane, Yemane

    2008-03-25

    As in any measurement process, a certain amount of error may be expected in routine population surveillance operations such as those in demographic surveillance sites (DSSs). Vital events are likely to be missed and errors made no matter what method of data capture is used or what quality control procedures are in place. The extent to which random errors in large, longitudinal datasets affect overall health and demographic profiles has important implications for the role of DSSs as platforms for public health research and clinical trials. Such knowledge is also of particular importance if the outputs of DSSs are to be extrapolated and aggregated with realistic margins of error and validity. This study uses the first 10-year dataset from the Butajira Rural Health Project (BRHP) DSS, Ethiopia, covering approximately 336,000 person-years of data. Simple programmes were written to introduce random errors and omissions into new versions of the definitive 10-year Butajira dataset. Key parameters of sex, age, death, literacy and roof material (an indicator of poverty) were selected for the introduction of errors based on their obvious importance in demographic and health surveillance and their established significant associations with mortality. Defining the original 10-year dataset as the 'gold standard' for the purposes of this investigation, population, age and sex compositions and Poisson regression models of mortality rate ratios were compared between each of the intentionally erroneous datasets and the original 'gold standard' 10-year data. The composition of the Butajira population was well represented despite introducing random errors, and differences between population pyramids based on the derived datasets were subtle. Regression analyses of well-established mortality risk factors were largely unaffected even by relatively high levels of random errors in the data. The low sensitivity of parameter estimates and regression analyses to significant amounts of randomly introduced errors indicates a high level of robustness of the dataset. This apparent inertia of population parameter estimates to simulated errors is largely due to the size of the dataset. Tolerable margins of random error in DSS data may exceed 20%. While this is not an argument in favour of poor quality data, reducing the time and valuable resources spent on detecting and correcting random errors in routine DSS operations may be justifiable as the returns from such procedures diminish with increasing overall accuracy. The money and effort currently spent on endlessly correcting DSS datasets would perhaps be better spent on increasing the surveillance population size and geographic spread of DSSs and analysing and disseminating research findings.
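
    A minimal sketch of the error-injection idea, with made-up field names and a generic error model rather than the original Butajira programmes: corrupt a chosen fraction of values in selected fields of a copy of the dataset, then re-run the same summaries on the corrupted copy and compare.

        # Inject random, plausible-but-wrong values into a fraction of one column.
        import numpy as np
        import pandas as pd

        def inject_errors(df, column, error_rate, rng):
            """Replace error_rate of `column` with values resampled from its own distribution."""
            out = df.copy()
            n_bad = int(round(error_rate * len(out)))
            rows = rng.choice(out.index.to_numpy(), size=n_bad, replace=False)
            out.loc[rows, column] = rng.choice(df[column].to_numpy(), size=n_bad, replace=True)
            return out

        rng = np.random.default_rng(0)
        data = pd.DataFrame({"sex": rng.choice(["m", "f"], size=10000),
                             "age": rng.integers(0, 90, size=10000)})

        for rate in (0.05, 0.10, 0.20):
            noisy = inject_errors(data, "sex", rate, rng)
            print(f"{rate:.0%} errors -> share male: {np.mean(noisy['sex'] == 'm'):.3f}")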

  7. A dataset describing brooding in three species of South African brittle stars, comprising seven high-resolution, micro X-ray computed tomography scans.

    PubMed

    Landschoff, Jannes; Du Plessis, Anton; Griffiths, Charles L

    2015-01-01

    Brooding brittle stars have a special mode of reproduction whereby they retain their eggs and juveniles inside respiratory body sacs called bursae. In the past, studying this phenomenon required disturbance of the sample by dissecting the adult. This caused irreversible damage and made the sample unsuitable for future studies. Micro X-ray computed tomography (μCT) is a promising technique, not only to visualise juveniles inside the bursae, but also to keep the sample intact and make the dataset of the scan available for future reference. Seven μCT scans of five freshly fixed (70 % ethanol) individuals, representing three differently sized brittle star species, provided adequate image quality to determine the numbers, sizes and postures of internally brooded young, as well as anatomy and morphology of adults. No staining agents were necessary to achieve high-resolution, high-contrast images, which permitted visualisations of both calcified and soft tissue. The raw data (projection and reconstruction images) are publicly available for download from GigaDB. Brittle stars of all sizes are suitable candidates for μCT imaging. This explicitly adds a new technique to the suite of tools available for studying the development of internally brooded young. The purpose of applying the technique was to visualise juveniles inside the adult, but because of the universally good quality of the dataset, the images can also be used for anatomical or comparative morphology-related studies of adult structures.

  8. Improving data discovery and usability through commentary and user feedback: the CHARMe project

    NASA Astrophysics Data System (ADS)

    Alegre, R.; Blower, J. D.

    2014-12-01

    Earth science datasets are highly diverse. Users of these datasets are similarly varied, ranging from research scientists through industrial users to government decision- and policy-makers. It is very important for these users to understand the applicability of any dataset to their particular problem so that they can select the most appropriate data sources for their needs. Although data providers often provide rich supporting information in the form of metadata, typically this information does not include community usage information that can help other users judge fitness-for-purpose. The CHARMe project (http://www.charme.org.uk) is filling this gap by developing a system for sharing "commentary metadata". These are annotations that are generated and shared by the user community and include: links between publications and datasets (the CHARMe system can record information about why a particular dataset was used, e.g. the paper may describe the dataset, use it as a source, or publish the results of a dataset assessment; such publications may appear in the peer-reviewed literature or may be technical reports, websites or blog posts); free-text comments supplied by the user; provenance information, including links between datasets and descriptions of processing algorithms and sensors; external events that may affect data quality (e.g. large volcanic eruptions or El Niño events), which we call "significant events"; and data quality information, e.g. system maturity indices. Commentary information can be linked to anything that can be uniquely identified (e.g. a dataset with a DOI or a persistent web address). It is also possible to associate commentary with particular subsets of datasets, for example to highlight an issue that is confined to a particular geographic region. We will demonstrate tools that show these capabilities in action, showing how users can apply commentary information during data discovery, visualization and analysis. The CHARMe project has implemented a set of open-source tools to create, store and explore commentary information, using open Web standards. In this presentation we will describe the application of the CHARMe system to the particular case of the climate data community; however the techniques and technologies are generic and can be applied in many fields.

  9. International Metadata Standards and Enterprise Data Quality Metadata Systems

    NASA Astrophysics Data System (ADS)

    Habermann, T.

    2016-12-01

    Well-documented data quality is critical in situations where scientists and decision-makers need to combine multiple datasets from different disciplines and collection systems to address scientific questions or difficult decisions. Standardized data quality metadata could be very helpful in these situations. Many efforts at developing data quality standards falter because of the diversity of approaches to measuring and reporting data quality. The "one size fits all" paradigm does not generally work well in this situation. The ISO data quality standard (ISO 19157) takes a different approach with the goal of systematically describing how data quality is measured rather than how it should be measured. It introduces the idea of standard data quality measures that can be well documented in a measure repository and used for consistently describing how data quality is measured across an enterprise. The standard includes recommendations for properties of these measures that include unique identifiers, references, illustrations and examples. Metadata records can reference these measures using the unique identifier and reuse them along with details (and references) that describe how the measure was applied to a particular dataset. A second important feature of ISO 19157 is the inclusion of citations to existing papers or reports that describe quality of a dataset. This capability allows users to find this information in a single location, i.e. the dataset metadata, rather than searching the web or other catalogs. I will describe these and other capabilities of ISO 19157 with examples of how they are being used to describe data quality across the NASA EOS Enterprise and also compare these approaches with other standards.

  10. Collection and Analysis of Crowd Data with Aerial, Rooftop, and Ground Views

    DTIC Science & Technology

    2014-11-10

    collected these datasets using different aircrafts. Erista 8 HL OctaCopter is a heavy-lift aerial platform capable of using high-resolution cinema ...is another high-resolution camera that is cinema grade and high quality, with the capability of capturing videos with 4K resolution at 30 frames per...292.58 Imaging Systems and Accessories Blackmagic Production Camera 4 Crowd Counting using 4K Cameras High resolution cinema grade digital video

  11. sbtools: A package connecting R to cloud-based data for collaborative online research

    USGS Publications Warehouse

    Winslow, Luke; Chamberlain, Scott; Appling, Alison P.; Read, Jordan S.

    2016-01-01

    The adoption of high-quality tools for collaboration and reproducible research such as R and Github is becoming more common in many research fields. While Github and other version management systems are excellent resources, they were originally designed to handle code and scale poorly to large text-based or binary datasets. A number of scientific data repositories are coming online and are often focused on dataset archival and publication. To handle collaborative workflows using large scientific datasets, there is increasing need to connect cloud-based online data storage to R. In this article, we describe how the new R package sbtools enables direct access to the advanced online data functionality provided by ScienceBase, the U.S. Geological Survey’s online scientific data storage platform.

  12. Automated Quality Assessment of Structural Magnetic Resonance Brain Images Based on a Supervised Machine Learning Algorithm.

    PubMed

    Pizarro, Ricardo A; Cheng, Xi; Barnett, Alan; Lemaitre, Herve; Verchinski, Beth A; Goldman, Aaron L; Xiao, Ena; Luo, Qian; Berman, Karen F; Callicott, Joseph H; Weinberger, Daniel R; Mattay, Venkata S

    2016-01-01

    High-resolution three-dimensional magnetic resonance imaging (3D-MRI) is being increasingly used to delineate morphological changes underlying neuropsychiatric disorders. Unfortunately, artifacts frequently compromise the utility of 3D-MRI, yielding irreproducible results due to both type I and type II errors. It is therefore critical to screen 3D-MRIs for artifacts before use. Currently, quality assessment involves slice-wise visual inspection of 3D-MRI volumes, a procedure that is both subjective and time consuming. Automating the quality rating of 3D-MRI could improve the efficiency and reproducibility of the procedure. The present study is one of the first efforts to apply a support vector machine (SVM) algorithm to the quality assessment of structural brain images, using global and region of interest (ROI) automated image quality features developed in-house. SVM is a supervised machine-learning algorithm that can predict the category of test datasets based on the knowledge acquired from a learning dataset. The performance (accuracy) of the automated SVM approach was assessed by comparing the SVM-predicted quality labels to investigator-determined quality labels. The accuracy for classifying 1457 3D-MRI volumes from our database using the SVM approach is around 80%. These results are promising and illustrate the possibility of using SVM as an automated quality assessment tool for 3D-MRI.
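
    As a rough illustration of the approach described above, the following sketch trains a support vector classifier on per-volume quality features and cross-validates it against rater labels. The features, labels and parameters are synthetic placeholders, not the study's in-house measures.

```python
# Sketch of an SVM-based image-quality classifier in the spirit of the study
# (scikit-learn SVC on precomputed quality features; features and labels here
# are synthetic placeholders, not the authors' in-house measures).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # global/ROI quality features, one row per volume
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # 1 = usable, 0 = artifact

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)   # accuracy against the rater labels
print("mean CV accuracy:", scores.mean())
```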

  13. NLCD - MODIS albedo data

    EPA Pesticide Factsheets

    The NLCD-MODIS land cover-albedo database integrates high-quality MODIS albedo observations with areas of homogeneous land cover from NLCD. The spatial resolution (pixel size) of the database is 480 m x 480 m, aligned to the standardized USGS Albers Equal-Area projection. The spatial extent of the database is the continental United States. This dataset is associated with the following publication: Wickham, J., C.A. Barnes, and T. Wade. Combining NLCD and MODIS to Create a Land Cover-Albedo Dataset for the Continental United States. REMOTE SENSING OF ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 170(0): 143-153, (2015).

  14. Wave equation datuming applied to marine OBS data and to land high resolution seismic profiling

    NASA Astrophysics Data System (ADS)

    Barison, Erika; Brancatelli, Giuseppe; Nicolich, Rinaldo; Accaino, Flavio; Giustiniani, Michela; Tinivella, Umberta

    2011-03-01

    One key step in seismic data processing flows is the computation of static corrections, which relocate shots and receivers to the same datum plane and remove near-surface weathering effects. We applied a standard static correction and wave equation datuming and compared the results in two case studies: 1) a sparse ocean bottom seismometer dataset for deep crustal prospecting; 2) a high resolution land reflection dataset for hydrogeological investigation. In both cases, a detailed velocity field, obtained by tomographic inversion of the first breaks, was adopted to relocate shots and receivers to the datum plane. The results emphasize the importance of wave equation datuming to properly handle complex near-surface conditions. In the first dataset, the deployed ocean bottom seismometers were relocated to sea level (shot positions) and a standard processing sequence was subsequently applied to the output. In the second dataset, the application of wave equation datuming allowed us to remove coherent noise, such as ground roll, and to improve the image quality with respect to the application of static corrections. The comparison of the two approaches shows that the main reflecting markers are better resolved when the wave equation datuming procedure is adopted.

  15. Selecting AGN through Variability in SN Datasets

    NASA Astrophysics Data System (ADS)

    Boutsia, K.; Leibundgut, B.; Trevese, D.; Vagnetti, F.

    2010-07-01

    Variability is a main property of Active Galactic Nuclei (AGN), and it was adopted as a selection criterion using multi-epoch surveys conducted for the detection of supernovae (SNe). We have used two SN datasets. First we selected the AXAF field of the STRESS project, centered on the Chandra Deep Field South where, besides the deep X-ray surveys, various optical catalogs also exist. Our method yielded 132 variable AGN candidates. We then extended our method to include the dataset of the ESSENCE project, which has been active for 6 years, producing high quality light curves in the R and I bands. We obtained a sample of ~4800 variable sources, down to R=22, in the whole 12 deg^2 ESSENCE field. Among them, a subsample of ~500 high priority AGN candidates was created using the shape of the structure function as a secondary criterion. In a pilot spectroscopic run we have confirmed the AGN nature of nearly all of our candidates.
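
    The structure function used as a secondary selection criterion can be sketched as follows; the binning, epochs and magnitudes below are illustrative placeholders rather than the survey's actual pipeline.

```python
# First-order structure function of a light curve: mean squared magnitude
# difference as a function of time lag (illustrative implementation only).
import numpy as np

def structure_function(t, mag, lag_bins):
    """Mean squared magnitude difference per time-lag bin."""
    dt = np.abs(t[:, None] - t[None, :])
    dm2 = (mag[:, None] - mag[None, :]) ** 2
    iu = np.triu_indices(len(t), k=1)          # keep unique epoch pairs only
    dt, dm2 = dt[iu], dm2[iu]
    return np.array([dm2[(dt >= lo) & (dt < hi)].mean()
                     for lo, hi in zip(lag_bins[:-1], lag_bins[1:])])

t = np.sort(np.random.uniform(0, 2000, 60))        # observation epochs (days)
mag = 21.0 + 0.2 * np.random.standard_normal(60)   # placeholder R-band magnitudes
print(structure_function(t, mag, np.array([0, 100, 300, 1000, 2000])))
```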

  16. A new dataset validation system for the Planetary Science Archive

    NASA Astrophysics Data System (ADS)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive is the official archive for the Mars Express mission. It received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets, which are formatted in conformance with the Planetary Data System (PDS) standard. The PI teams are responsible for analyzing and calibrating the instrument data as well as for the production of reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality standards. To do so, an archive peer review is used to control the quality of the Mars Express science data archiving process. However, a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall allow anomalies to be tracked and the completeness of datasets to be controlled. It shall ensure that the PSA end-users: (1) can rely on the results of their queries, (2) will get data products that are suitable for scientific analysis, (3) can find all science data acquired during a mission. We define dataset validation as the verification and assessment process that checks the dataset content against pre-defined top-level criteria, which represent the general characteristics of good quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database. The validation software tool is a multi-mission tool that has been designed to provide the user with the flexibility of defining and implementing various types of validation criteria, to iteratively and incrementally validate datasets, and to generate validation reports.

  17. Assessment of a novel multi-array normalization method based on spike-in control probes suitable for microRNA datasets with global decreases in expression.

    PubMed

    Sewer, Alain; Gubian, Sylvain; Kogel, Ulrike; Veljkovic, Emilija; Han, Wanjiang; Hengstermann, Arnd; Peitsch, Manuel C; Hoeng, Julia

    2014-05-17

    High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Indeed, because of significant differences in platform designs, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays. Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the "common reference design" and processed as "pseudo-single-channel". They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed based on a quantitative reverse transcription-polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression similarly to the cigarette smoke-exposed mouse lung dataset considered in this study. Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository.
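
    The core idea of adjusting arrays with spike-in controls can be sketched in a few lines; this is only an illustration of the principle, not the ExiMiR implementation, which is an R/Bioconductor package and considerably more elaborate.

```python
# Minimal sketch of spike-in-based between-array adjustment (illustrative only;
# the published method, implemented in the R/Bioconductor package ExiMiR,
# involves more than this single scaling step).
import numpy as np

def spike_in_normalize(log_intensities, spike_mask):
    """Shift each array so its spike-in probes match the across-array mean.

    log_intensities: (n_probes, n_arrays) log2 intensities
    spike_mask:      boolean vector marking the spike-in control probes
    """
    spike_means = log_intensities[spike_mask].mean(axis=0)   # one value per array
    target = spike_means.mean()                              # common reference level
    return log_intensities - (spike_means - target)          # broadcast per array

data = np.random.normal(8.0, 1.0, size=(1000, 6))
spikes = np.zeros(1000, dtype=bool)
spikes[:20] = True                 # first 20 probes are spike-in controls
data[~spikes, 3:] -= 1.5           # global decrease in expression in arrays 4-6
data[:, 3:] += 0.4                 # technical shift affecting all probes, incl. spike-ins
norm = spike_in_normalize(data, spikes)
print(norm[spikes].mean(axis=0))   # spike-in levels are now aligned across arrays
```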

  18. Assessment of a novel multi-array normalization method based on spike-in control probes suitable for microRNA datasets with global decreases in expression

    PubMed Central

    2014-01-01

    Background High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Indeed, because of significant differences in platform designs, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays. Results Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the “common reference design” and processed as “pseudo-single-channel”. They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed based on a quantitative reverse transcription–polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression similarly to the cigarette smoke-exposed mouse lung dataset considered in this study. Conclusions Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository. PMID:24886675

  19. Non-model-based correction of respiratory motion using beat-to-beat 3D spiral fat-selective imaging.

    PubMed

    Keegan, Jennifer; Gatehouse, Peter D; Yang, Guang-Zhong; Firmin, David N

    2007-09-01

    To demonstrate the feasibility of retrospective beat-to-beat correction of respiratory motion, without the need for a respiratory motion model. A high-resolution three-dimensional (3D) spiral black-blood scan of the right coronary artery (RCA) of six healthy volunteers was acquired over 160 cardiac cycles without respiratory gating. One spiral interleaf was acquired per cardiac cycle, prior to each of which a complete low-resolution fat-selective 3D spiral dataset was acquired. The respiratory motion (3D translation) on each cardiac cycle was determined by cross-correlating a region of interest (ROI) in the fat around the artery in the low-resolution datasets with that on a reference end-expiratory dataset. The measured translations were used to correct the raw data of the high-resolution spiral interleaves. Beat-to-beat correction provided consistently good results, with the image quality being better than that obtained with a fixed superior-inferior tracking factor of 0.6 and better than (N = 5) or equal to (N = 1) that achieved using a subject-specific retrospective 3D translation motion model. Non-model-based correction of respiratory motion using 3D spiral fat-selective imaging is feasible, and in this small group of volunteers produced better-quality images than a subject-specific retrospective 3D translation motion model. (c) 2007 Wiley-Liss, Inc.
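
    The per-beat translation estimate rests on locating the cross-correlation peak between a low-resolution volume and a reference volume. The sketch below illustrates that step with an FFT-based cross-correlation on synthetic volumes; the actual study used fat-selective ROIs and applied the measured shifts to the raw spiral k-space data.

```python
# Estimating an integer 3D translation from the cross-correlation peak between
# two volumes (illustrative; real data would use a masked ROI around the artery).
import numpy as np

def estimate_shift(reference, moving):
    """Return the integer circular shift that maps `reference` onto `moving`."""
    corr = np.fft.ifftn(np.fft.fftn(moving) * np.conj(np.fft.fftn(reference))).real
    shape = np.array(corr.shape)
    peak = np.array(np.unravel_index(np.argmax(corr), corr.shape))
    wrap = peak > shape // 2
    peak[wrap] -= shape[wrap]      # express large shifts as negative offsets
    return peak

rng = np.random.default_rng(1)
ref = rng.random((32, 32, 32))
mov = np.roll(ref, shift=(2, -3, 1), axis=(0, 1, 2))   # known displacement
print(estimate_shift(ref, mov))                        # expected: [ 2 -3  1]
```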

  20. Using LiDAR datasets to improve HSPF water quality modeling in the Red River of the North Basin

    NASA Astrophysics Data System (ADS)

    Burke, M. P.; Foreman, C. S.

    2013-12-01

    The Red River of the North Basin (RRB), located in the lakebed of ancient glacial Lake Agassiz, comprises one of the flattest landscapes in North America. The topography of the basin, coupled with the Red River's direction of flow from south to north, results in a system that is highly susceptible to flooding. The magnitude and frequency of flood events in the RRB have prompted several multijurisdictional projects and mitigation efforts. In response to the devastating 1997 flood, an International Joint Commission sponsored task force established the need for accurate elevation data to help improve flood forecasting and better understand risks. This led to the International Water Institute's Red River Basin Mapping Initiative and the acquisition of LiDAR data for the entire US portion of the RRB. The resulting 1 meter bare earth digital elevation models have been used to improve hydraulic and hydrologic modeling within the RRB, with a focus on flood prediction and mitigation. More recently, these LiDAR datasets have been incorporated into Hydrological Simulation Program-FORTRAN (HSPF) model applications to improve water quality predictions in the MN portion of the RRB. RESPEC is currently building HSPF model applications for five of MN's 8-digit HUC watersheds draining to the Red River: the Red Lake River, Clearwater River, Sandhill River, Two Rivers, and Tamarac River watersheds. This work is being conducted for the Minnesota Pollution Control Agency (MPCA) as part of MN's statewide watershed approach to restoring and protecting water. The HSPF model applications simulate hydrology (discharge, stage), as well as a number of water quality constituents (sediment, temperature, organic and inorganic nitrogen, total ammonia, organic and inorganic phosphorus, dissolved oxygen and biochemical oxygen demand, and algae) continuously for the period 1995-2009 and are formulated to provide predictions at points of interest within the watersheds, such as observation gages, management boundaries, compliance points, and impaired water body endpoints. Incorporation of the LiDAR datasets has been critical to representing the topographic characteristics that impact hydrologic and water quality processes in the extremely flat, heavily drained sub-basins of the RRB. Beyond providing more detailed elevation and slope measurements, the high resolution LiDAR datasets have helped to identify drainage alterations due to agricultural practices, as well as to improve representation of channel geometry. Additionally, when available, LiDAR-based hydraulic models completed as part of the RRB flood mitigation efforts are incorporated to further improve flow routing. The MPCA will ultimately use these HSPF models to aid in Total Maximum Daily Load (TMDL) development, permit development/compliance, analysis of Best Management Practice (BMP) implementation scenarios, and other watershed planning and management objectives. LiDAR datasets are an essential component of the water quality models built for the watersheds within the RRB and would greatly benefit water quality modeling efforts in similarly characterized areas.

  1. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

    PubMed

    Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D

    2018-01-04

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

  2. SU-F-T-427: Utilization and Evaluation of Diagnostic CT Imaging with MAR Technique for Radiation Therapy Treatment Planning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xu, M; Foster, R; Parks, H

    Purpose: The objective was to utilize and evaluate a diagnostic CT MAR technique for radiation therapy treatment planning. Methods: A Toshiba diagnostic CT acquisition with the SEMAR (single-energy MAR) algorithm was performed to provide metal artifact reduction (MAR) for patient treatment planning. CT imaging datasets with and without SEMAR were taken on a Catphan phantom. Two sets of CT numbers were calibrated with the relative electron densities (RED). A tissue characterization phantom with various Gammex tissue-simulating material rods was used to establish the relationship between known REDs and corresponding CT numbers. A GE CT-sim acquisition was taken on the Catphan for comparison. A patient with bilateral hip arthroplasty was scanned on a flat panel in both the radiotherapy CT-sim and the diagnostic SEMAR CT. The derived SEMAR images were used as the primary CT dataset to create contours for the target and critical structures and for planning. A deformable registration was performed with VelocityAI to track voxel changes between the SEMAR and CT-sim images. The SEMAR CT images, with minimal artifacts and high geometrical and spatial integrity, were employed for a treatment plan. Treatment plans were evaluated based on deformable registration of the SEMAR CT and the CT-sim dataset with assigned CT numbers in the metal artifact regions in the Eclipse v11 TPS. Results: The RED and CT-number relationships were consistent for the CT-sim datasets and the CTs with and without SEMAR. SEMAR datasets with high image quality were used for PTV and organ delineation in the treatment planning process. For dose distribution to the PTV, DVH analysis showed that the plan using the CT-sim with assigned CT numbers agreed well with the plan on the deformable SEMAR CT. Conclusion: A diagnostic CT with a MAR algorithm can be utilized for radiotherapy treatment planning with CT numbers calibrated to the RED. Treatment planning comparison and DVH analysis show good agreement in the PTV and critical organs between the plans on the CT-sim with assigned CT numbers and the deformable SEMAR CT datasets.

  3. Evolutionary Wavelet Neural Network ensembles for breast cancer and Parkinson's disease prediction.

    PubMed

    Khan, Maryam Mahsal; Mendes, Alexandre; Chalup, Stephan K

    2018-01-01

    Wavelet Neural Networks are a combination of neural networks and wavelets and have been mostly used in the area of time-series prediction and control. Recently, Evolutionary Wavelet Neural Networks have been employed to develop cancer prediction models. The present study proposes to use ensembles of Evolutionary Wavelet Neural Networks. The search for a high quality ensemble is directed by a fitness function that incorporates the accuracy of the classifiers both independently and as part of the ensemble itself. The ensemble approach is tested on three publicly available biomedical benchmark datasets, one on Breast Cancer and two on Parkinson's disease, using a 10-fold cross-validation strategy. Our experimental results show that, for the first dataset, the performance was similar to previous studies reported in literature. On the second dataset, the Evolutionary Wavelet Neural Network ensembles performed better than all previous methods. The third dataset is relatively new and this study is the first to report benchmark results.
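
    The fitness idea, rewarding a classifier both for its own accuracy and for the accuracy of the ensemble it joins, can be sketched as below; the equal weighting and simple majority vote are assumptions for illustration, not the authors' exact formulation.

```python
# Illustrative ensemble-aware fitness: combine a candidate classifier's own
# accuracy with the accuracy of the ensemble it would join (the 0.5/0.5
# weighting and the majority vote are assumptions, not the published scheme).
import numpy as np

def ensemble_fitness(member_preds, candidate_pred, y_true, w_individual=0.5):
    individual_acc = np.mean(candidate_pred == y_true)
    votes = np.vstack(member_preds + [candidate_pred])     # rows = classifiers
    majority = (votes.mean(axis=0) >= 0.5).astype(int)     # binary majority vote
    ensemble_acc = np.mean(majority == y_true)
    return w_individual * individual_acc + (1 - w_individual) * ensemble_acc

y = np.array([0, 1, 1, 0, 1, 0, 1, 1])
members = [np.array([0, 1, 1, 0, 0, 0, 1, 1]), np.array([0, 1, 0, 0, 1, 0, 1, 0])]
candidate = np.array([0, 1, 1, 0, 1, 1, 1, 1])
print(ensemble_fitness(members, candidate, y))
```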

  4. Evolutionary Wavelet Neural Network ensembles for breast cancer and Parkinson’s disease prediction

    PubMed Central

    Mendes, Alexandre; Chalup, Stephan K.

    2018-01-01

    Wavelet Neural Networks are a combination of neural networks and wavelets and have been mostly used in the area of time-series prediction and control. Recently, Evolutionary Wavelet Neural Networks have been employed to develop cancer prediction models. The present study proposes to use ensembles of Evolutionary Wavelet Neural Networks. The search for a high quality ensemble is directed by a fitness function that incorporates the accuracy of the classifiers both independently and as part of the ensemble itself. The ensemble approach is tested on three publicly available biomedical benchmark datasets, one on Breast Cancer and two on Parkinson’s disease, using a 10-fold cross-validation strategy. Our experimental results show that, for the first dataset, the performance was similar to previous studies reported in literature. On the second dataset, the Evolutionary Wavelet Neural Network ensembles performed better than all previous methods. The third dataset is relatively new and this study is the first to report benchmark results. PMID:29420578

  5. Contribution of TOMS to Earth Science- An Overview

    NASA Technical Reports Server (NTRS)

    Bhartia, P. K.

    2004-01-01

    The TOMS instrument was launched on the Nimbus-7 satellite in Oct 1978 with the goal of understanding the meteorological influences on the ozone column. The nominal lifetime of the instrument was 1 year. However, in response to the concern over possible man-made influences on the ozone layer, NASA continued to nurse the instrument for 13.5 years and launched a major program to produce an accurate, trend-quality ozone dataset. Despite severe optical degradation and other significant anomalies that developed in the instrument over its lifetime, the effort turned out to be a tremendous success. In 1984, TOMS took center stage as the primary provider of Antarctic ozone hole maps to the world community, a role it continues to play today. An unexpected benefit of the close attention paid to improving the TOMS data quality was that several atmospheric constituents that interfere with ozone measurement were also identified and meticulously converted into long-term datasets of their own. These constituents include clouds, volcanic SO2, aerosols, and ocean phytoplankton. In addition, the high quality of the basic datasets made it possible to produce global maps of surface UV and tropospheric ozone. In most cases there are no other sources of these datasets. Advanced UV instruments currently under development in the US and Europe will continue to exploit the TOMS-developed techniques for several decades.

  6. Support vector machine-an alternative to artificial neuron network for water quality forecasting in an agricultural nonpoint source polluted river?

    PubMed

    Liu, Mei; Lu, Jun

    2014-09-01

    Water quality forecasting in agricultural drainage river basins is difficult because of the complicated nonpoint source (NPS) pollution transport processes and river self-purification processes involved in highly nonlinear problems. Artificial neural network (ANN) and support vector machine (SVM) models were developed to predict total nitrogen (TN) and total phosphorus (TP) concentrations for any location of a river polluted by agricultural NPS pollution in eastern China. River flow, water temperature, flow travel time, rainfall, dissolved oxygen, and upstream TN or TP concentrations were selected as initial inputs of the two models. Monthly, bimonthly, and trimonthly datasets were selected to train the two models, respectively, and the same monthly dataset, which had not been used for training, was chosen to test the models in order to compare their generalization performance. Trial and error analysis and genetic algorithms (GAs) were employed to optimize the parameters of the ANN and SVM models, respectively. The results indicated that the proposed SVM models showed better generalization ability because they avoid overtraining and optimize fewer parameters based on the structural risk minimization (SRM) principle. Furthermore, both TN and TP SVM models trained on trimonthly datasets achieved greater forecasting accuracy than the corresponding ANN models. Thus, SVM models will be a powerful alternative method because they are an efficient and economical tool to accurately predict water quality with low risk. The sensitivity analyses of the two models indicated that decreasing upstream input concentrations during the dry season and NPS emissions along the reach during the average or flood season should be an effective way to improve Changle River water quality. If the necessary water quality and hydrology data, and even trimonthly data, are available, the SVM methodology developed here can easily be applied to other NPS-polluted rivers.
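
    A minimal sketch of the SVM regression setup is given below, with the listed inputs as placeholder columns and grid search standing in for the genetic-algorithm parameter optimization used in the study.

```python
# Sketch of an SVM regression model for a nutrient concentration; the input
# columns are placeholders for the predictors listed above, and grid search is
# a simple stand-in for the genetic-algorithm tuning used in the study.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
# columns: flow, water temperature, travel time, rainfall, DO, upstream TN
X = rng.random((120, 6))
y = 2.0 * X[:, 5] + 0.5 * X[:, 0] + 0.1 * rng.standard_normal(120)  # synthetic TN

model = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    param_grid={"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.1, 1.0]},
    cv=5,
)
model.fit(X, y)
print(model.best_params_, round(model.best_score_, 3))
```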

  7. DeepQA: improving the estimation of single protein model quality with deep belief networks.

    PubMed

    Cao, Renzhi; Bhattacharya, Debswapna; Hou, Jie; Cheng, Jianlin

    2016-12-05

    Protein quality assessment (QA), useful for ranking and selecting protein models, has long been viewed as one of the major challenges for protein tertiary structure prediction. In particular, estimating the quality of a single protein model, which is important for selecting a few good models out of a large model pool consisting of mostly low-quality models, is still a largely unsolved problem. We introduce a novel single-model quality assessment method, DeepQA, based on a deep belief network that utilizes a number of selected features describing the quality of a model from different perspectives, such as energy, physico-chemical characteristics, and structural information. The deep belief network is trained on several large datasets consisting of models from the Critical Assessment of Protein Structure Prediction (CASP) experiments, several publicly available datasets, and models generated by our in-house ab initio method. Our experiments demonstrate that the deep belief network has better performance compared to Support Vector Machines and Neural Networks on the protein model quality assessment problem, and our method DeepQA achieves state-of-the-art performance on the CASP11 dataset. It also outperformed two well-established methods in selecting good outlier models from a large set of models of mostly low quality generated by ab initio modeling methods. DeepQA is a useful deep learning tool for protein single-model quality assessment and protein structure prediction. The source code, executable, documentation and training/test datasets of DeepQA for Linux are freely available to non-commercial users at http://cactus.rnet.missouri.edu/DeepQA/.

  8. Gridded global surface ozone metrics for atmospheric chemistry model evaluation

    NASA Astrophysics Data System (ADS)

    Sofen, E. D.; Bowdalo, D.; Evans, M. J.; Apadula, F.; Bonasoni, P.; Cupeiro, M.; Ellul, R.; Galbally, I. E.; Girgzdiene, R.; Luppo, S.; Mimouni, M.; Nahas, A. C.; Saliba, M.; Tørseth, K.; and all other contributors to the WMO GAW, EPA AQS, EPA CASTNET, CAPMoN, NAPS, AirBase, EMEP, and EANET ozone datasets

    2015-07-01

    The concentration of ozone at the Earth's surface is measured at many locations across the globe for the purposes of air quality monitoring and atmospheric chemistry research. We have brought together all publicly available surface ozone observations from online databases from the modern era to build a consistent dataset for the evaluation of chemical transport and chemistry-climate (Earth System) models for projects such as the Chemistry-Climate Model Initiative and AerChemMIP. From a total dataset of approximately 6600 sites and 500 million hourly observations from 1971-2015, approximately 2200 sites and 200 million hourly observations pass screening as high-quality sites in regional background locations that are appropriate for use in global model evaluation. Data volume is generally good from the start of air quality monitoring networks in 1990 through 2013. Ozone observations are biased heavily toward North America and Europe, with sparse coverage over the rest of the globe. This dataset is made available for the purposes of model evaluation as a set of gridded metrics intended to describe the distribution of ozone concentrations on monthly and annual timescales. Metrics include the moments of the distribution, percentiles, the maximum daily eight-hour average (MDA8), SOMO35, AOT40, and metrics related to air quality regulatory thresholds. Gridded datasets are stored as netCDF-4 files and are available to download from the British Atmospheric Data Centre (doi:10.5285/08fbe63d-fa6d-4a7a-b952-5932e3ab0452). We provide recommendations to the ozone measurement community on improving metadata reporting to simplify ongoing and future efforts to work with ozone data from disparate networks in a consistent manner.
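
    One of the listed metrics, the maximum daily eight-hour average (MDA8), can be sketched from an hourly series as below; the data are synthetic, and real processing also applies data-completeness and window-assignment rules.

```python
# Sketch of the MDA8 metric computed from a synthetic hourly ozone series
# (real processing also applies data-completeness and window-assignment rules).
import numpy as np
import pandas as pd

hours = pd.date_range("2013-07-01", periods=31 * 24, freq="60min")
ozone = pd.Series(
    40 + 15 * np.sin(2 * np.pi * (hours.hour - 8) / 24) + np.random.normal(0, 3, len(hours)),
    index=hours,
    name="o3_ppb",
)

avg8 = ozone.rolling(window=8, min_periods=6).mean()  # trailing 8-hour running means
mda8 = avg8.resample("D").max()                       # daily maximum of those means
print(mda8.head())
```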

  9. Assessing Landscape Connectivity and River Water Quality Changes Using an 8-Day, 30-Meter Land Cover Dataset

    NASA Astrophysics Data System (ADS)

    Kamarinas, I.; Julian, J.; Owsley, B.; de Beurs, K.; Hughes, A.

    2014-12-01

    Water quality is dictated by interactions among geomorphic processes, vegetation characteristics, weather patterns, and anthropogenic land uses over multiple spatio-temporal scales. In order to understand how changes in climate and land use impact river water quality, a suite of data with high temporal resolution over a long period is needed. Further, all of this data must be analyzed with respect to connectivity to the river, thus requiring high spatial resolution data. Here, we present how changes in climate and land use over the past 25 years have affected water quality in the 268 sq. km Hoteo River catchment in New Zealand. Hydro-climatic data included daily solar radiation, temperature, soil moisture, rainfall, drought indices, and runoff at 5-km resolution. Land cover changes were measured every 8 days at 30-m resolution by fusing Landsat and MODIS satellite imagery. Water quality was assessed using 15-min turbidity (2011-2014) and monthly data for a suite of variables (1990-2014). Watershed connectivity was modeled using a corrected 15-m DEM and a high-resolution drainage network. Our analyses revealed that this catchment experiences cyclical droughts which, when combined with intense land uses such as livestock grazing and plantation forest harvesting, leaves many areas in the catchment disturbed (i.e. exposed soil) that are connected to the river through surface runoff. As a result, flow-normalized turbidity was elevated during droughts and remained relatively low during wet periods. For example, disturbed land area decreased from 9% to 4% over 2009-2013, which was a relatively wet period. During the extreme drought of 2013, disturbed area increased to 6% in less than a year due mainly to slow pasture recovery after heavy stocking rates. The relationships found in this study demonstrate that high spatiotemporal resolution land cover datasets are very important to understanding the interactions between landscape and climate, and how these interactions affect water quality.

  10. Unassigned MS/MS Spectra: Who Am I?

    PubMed

    Pathan, Mohashin; Samuel, Monisha; Keerthikumar, Shivakumar; Mathivanan, Suresh

    2017-01-01

    Recent advances in high resolution tandem mass spectrometry (MS) have resulted in the accumulation of high quality data. In parallel with these advances in instrumentation, bioinformatics software has been developed to analyze datasets of such quality. In spite of these advances, data analysis in mass spectrometry still remains critical for protein identification. In addition, the complexity of the generated MS/MS spectra, the unpredictable nature of peptide fragmentation, sequence annotation errors, and posttranslational modifications have impeded the protein identification process. In a typical MS data analysis, about 60% of the MS/MS spectra remain unassigned. While some of these can be attributed to the low quality of the MS/MS spectra, a proportion can be classified as high quality. Further analysis may reveal how much of the unassigned MS spectra can be attributed to search space, sequence annotation errors, mutations, and/or posttranslational modifications. In this chapter, the tools used to identify proteins and ways to assign unassigned tandem MS spectra are discussed.

  11. Comparing Simulated and Observed Spectroscopic Signatures of Mix in Omega Capsules

    NASA Astrophysics Data System (ADS)

    Tregillis, I. L.; Shah, R. C.; Hakel, P.; Cobble, J. A.; Murphy, T. J.; Krasheninnikova, N. S.; Hsu, S. C.; Bradley, P. A.; Schmitt, M. J.; Batha, S. H.; Mancini, R. C.

    2012-10-01

    The Defect-Induced Mix Experiment (DIME) campaign at Los Alamos National Laboratory uses multi-monochromatic X-ray imaging (MMI) [T. Nagayama, R. C. Mancini, R. Florido, et al., J. Appl. Phys. 109, 093303 (2011)] to detect the migration of high-Z spectroscopic dopants into the hot core of an imploded capsule. We have developed an MMI post-processing tool for producing synthetic datasets from two- and three-dimensional Lagrangian numerical simulations of Omega and NIF shots. These synthetic datasets are of sufficient quality, and contain sufficient physics, that they can be analyzed in the same manner as actual MMI data. We have carried out an extensive comparison between simulated and observed MMI data for a series of polar direct-drive shots carried out at the Omega laser facility in January 2011. The capsule diameter was 870 microns; the 15 micron CH ablators contained a 2 micron Ti-doped layer along the inner edge. All capsules were driven with 17 kJ; some capsules were manufactured with an equatorial "trench" defect. This talk will focus on the construction of spectroscopic-quality synthetic MMI datasets from numerical simulations, and their correlation with MMI measurements.

  12. Iterative reconstruction of simulated low count data: a comparison of post-filtering versus regularised OSEM

    NASA Astrophysics Data System (ADS)

    Karaoglanis, K.; Efthimiou, N.; Tsoumpas, C.

    2015-09-01

    Low count PET data is a challenge for medical image reconstruction. The statistical quality of a dataset is a key factor in the quality of the reconstructed images. Reconstruction algorithms able to compensate for low count datasets could provide the means to reduce patient injected doses and/or reduce scan times. It has been shown that the use of priors improves image quality in low count conditions. In this study we compared regularised versus post-filtered OSEM in terms of their performance on challenging simulated low count datasets. Initial visual comparison demonstrated that both algorithms improve the image quality, although, unlike post-filtering, regularisation does not introduce undesired blurring.

  13. Can single empirical algorithms accurately predict inland shallow water quality status from high resolution, multi-sensor, multi-temporal satellite data?

    NASA Astrophysics Data System (ADS)

    Theologou, I.; Patelaki, M.; Karantzalos, K.

    2015-04-01

    Assessing and monitoring water quality status in a timely, cost-effective and accurate manner is of fundamental importance for numerous environmental management and policy making purposes. Therefore, there is a current need for validated methodologies which can effectively exploit, in an unsupervised way, the enormous volume of earth observation imaging datasets from various high-resolution satellite multispectral sensors. To this end, many research efforts are based on building concrete relationships and empirical algorithms from concurrent satellite and in-situ data collection campaigns. We have experimented with Landsat 7 and Landsat 8 multi-temporal satellite data, coupled with hyperspectral data from a field spectroradiometer and in-situ ground truth data for several physico-chemical and other key monitoring indicators. All available datasets, covering a 4-year period for our case study site, Lake Karla in Greece, were processed and fused under a quantitative evaluation framework. The comprehensive analysis raised certain questions regarding the applicability of single empirical models across multi-temporal, multi-sensor datasets for the accurate prediction of key water quality indicators in shallow inland systems. Single linear regression models did not establish concrete relations across multi-temporal, multi-sensor observations. Moreover, the shallower parts of the inland system followed, in accordance with the literature, different regression patterns. Landsat 7 and 8 nevertheless gave quite promising results, indicating that, from the recreation of the lake onward, consistent per-sensor, per-depth prediction models can be successfully established. The highest rates were for chl-a (r2=89.80%), dissolved oxygen (r2=88.53%), conductivity (r2=88.18%), ammonium (r2=87.2%) and pH (r2=86.35%), while total phosphorus (r2=70.55%) and nitrates (r2=55.50%) resulted in lower correlation rates.

  14. Data-driven gating in PET: Influence of respiratory signal noise on motion resolution.

    PubMed

    Büther, Florian; Ernst, Iris; Frohwein, Lynn Johann; Pouw, Joost; Schäfers, Klaus Peter; Stegger, Lars

    2018-05-21

    Data-driven gating (DDG) approaches for positron emission tomography (PET) are interesting alternatives to conventional hardware-based gating methods. In DDG, the measured PET data themselves are utilized to calculate a respiratory signal that is subsequently used for gating purposes. The success of gating is then highly dependent on the statistical quality of the PET data. In this study, we investigate how this quality determines signal noise and thus motion resolution in clinical PET scans using a center-of-mass-based (COM) DDG approach, specifically with regard to motion management of target structures in future radiotherapy planning applications. PET list mode datasets acquired in one bed position of 19 different radiotherapy patients undergoing pretreatment [18F]FDG PET/CT or [18F]FDG PET/MRI were included in this retrospective study. All scans were performed over a region with organs (myocardium, kidneys) or tumor lesions of high tracer uptake and under free breathing. Aside from the original list mode data, datasets with progressively decreasing PET statistics were generated. From these, COM DDG signals were derived for subsequent amplitude-based gating of the original list mode file. The apparent respiratory shift d from end-expiration to end-inspiration was determined from the gated images and expressed as a function of the signal-to-noise ratio (SNR) of the determined gating signals. This relation was tested against an additional 25 [18F]FDG PET/MRI list mode datasets, where high-precision MR navigator-like respiratory signals were available as reference signals for respiratory gating of the PET data, and against data from a dedicated thorax phantom scan. All original 19 high-quality list mode datasets demonstrated the same behavior in terms of motion resolution when reducing the amount of list mode events for DDG signal generation. Ratios and directions of respiratory shifts between end-respiratory gates and the respective nongated image were constant over all statistics levels. Motion resolution d/d_max could be modeled as d/d_max = 1 - exp(-1.52 (SNR - 1)^0.52), with d_max being the actual respiratory shift. Determining d_max from d and SNR in the 25 test datasets and the phantom scan demonstrated no significant differences from the MR navigator-derived shift values and the predefined shift, respectively. The SNR can serve as a general metric to assess the success of COM-based DDG, even across different scanners and patients. The derived formula for motion resolution can be used to estimate the actual motion extent reasonably well in cases of limited PET raw data statistics. This may be of interest for individualized radiotherapy treatment planning procedures of target structures subject to respiratory motion. © 2018 American Association of Physicists in Medicine.
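
    The fitted relation can be inverted to estimate the actual shift d_max from an observed shift d and the gating-signal SNR, as in the sketch below (illustrative values only).

```python
# Inverting the fitted relation d/d_max = 1 - exp(-1.52 (SNR - 1)**0.52) to
# estimate the actual respiratory shift d_max from an apparent shift d and the
# gating-signal SNR (the numbers below are illustrative only).
import math

def estimate_dmax(d_apparent_mm, snr):
    if snr <= 1:
        raise ValueError("relation is defined for SNR > 1")
    resolved_fraction = 1 - math.exp(-1.52 * (snr - 1) ** 0.52)
    return d_apparent_mm / resolved_fraction

print(round(estimate_dmax(8.0, snr=2.5), 1))  # apparent 8 mm shift at SNR = 2.5
```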

  15. Modelling spatio-temporal heterogeneities in groundwater quality in Ghana: a multivariate chemometric approach.

    PubMed

    Armah, Frederick Ato; Paintsil, Arnold; Yawson, David Oscar; Adu, Michael Osei; Odoi, Justice O

    2017-08-01

    Chemometric techniques were applied to evaluate the spatial and temporal heterogeneities in groundwater quality data for approximately 740 goldmining and agriculture-intensive locations in Ghana. The strongest linear and monotonic relationships occurred between Mn and Fe. Sixty-nine per cent of the total variance in the dataset was explained by four variance factors: physicochemical properties, bacteriological quality, natural geologic attributes and anthropogenic factors (artisanal goldmining). There was evidence of significant differences in the means of all trace metals and physicochemical parameters (p < 0.001) between goldmining and non-goldmining locations. Arsenic and turbidity produced very high F values, demonstrating that 'physical properties and chalcophilic elements' was the function that most discriminated between non-goldmining and goldmining locations. Variations in Escherichia coli and total coliforms were observed between the dry and wet seasons. The overall predictive accuracy of the discriminant function showed that non-goldmining locations were classified with slightly better accuracy (89%) than goldmining areas (69.6%). There were significant differences between the underlying distributions of Cd, Mn and Pb in the wet and dry seasons. This study emphasizes the practicality of chemometrics in the assessment and elucidation of complex water quality datasets to promote effective management of groundwater resources for sustaining human health.

  16. Genomic structural differences between cattle and river buffalo identified through a combination of genomic and transcriptomic analysis

    USDA-ARS?s Scientific Manuscript database

    Water buffalo (Bubalus bubalis L.) is an important livestock species worldwide. Like many other livestock species, water buffalo lacks high quality and continuous reference genome assembly required for fine-scale comparative genomics studies. In this work, we present a dataset, which characterizes g...

  17. Exploring the Use of Participatory Information to Improve Monitoring, Mapping and Assessment of Aquatic Ecosystem Services at Landscape Scales

    EPA Science Inventory

    Traditionally, the EPA has monitored aquatic ecosystems using statistically rigorous sample designs and intensive field efforts which provide high quality datasets. But by their nature they leave many aquatic systems unsampled, follow a top down approach, have a long lag between ...

  18. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold.

    PubMed

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel; Ten Have, Arjen

    2018-01-01

    Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER.
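
    The 100% precision-and-recall requirement amounts to accepting a cluster-specific threshold only if some score separates every training member from every non-member, as in the toy sketch below (the scores are placeholders, not real HMMER bit scores).

```python
# Toy illustration of the 100% precision-and-recall threshold idea: a cluster's
# profile gets an inclusion cut-off only if some score separates all cluster
# members from all non-members in the training set (placeholder scores).
def inclusion_threshold(member_scores, nonmember_scores):
    lowest_member = min(member_scores)
    highest_outsider = max(nonmember_scores) if nonmember_scores else float("-inf")
    if lowest_member > highest_outsider:
        return lowest_member        # any score >= threshold is accepted
    return None                     # no cut-off gives 100% P&R; re-cluster instead

print(inclusion_threshold([310.2, 295.7, 288.4], [201.3, 150.8, 96.5]))  # -> 288.4
print(inclusion_threshold([310.2, 190.0], [201.3, 150.8]))               # -> None
```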

  19. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold

    PubMed Central

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel

    2018-01-01

    Background Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. Results HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. Conclusions HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER. PMID:29579071

  20. Benford's Law for Quality Assurance of Manner of Death Counts in Small and Large Databases.

    PubMed

    Daniels, Jeremy; Caetano, Samantha-Jo; Huyer, Dirk; Stephen, Andrew; Fernandes, John; Lytwyn, Alice; Hoppe, Fred M

    2017-09-01

    To assess whether Benford's law, a mathematical law used for quality assurance in accounting, can be applied as a quality assurance measure for the manner of death determination. We examined a regional forensic pathology service's monthly manner of death counts (N = 2352) from 2011 to 2013, and provincial monthly and weekly death counts from 2009 to 2013 (N = 81,831). We tested whether the leading digit of each dataset followed Benford's law via the chi-square test. For each database, we assessed whether the number 1 was the most common leading digit. The first digits of the manner of death counts followed Benford's law in all three datasets. Two of the three datasets had 1 as the most frequent leading digit. The manner of death data in this study showed qualities consistent with Benford's law. The law has potential as a quality assurance metric in the manner of death determination for both small and large databases. © 2017 American Academy of Forensic Sciences.
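
    The leading-digit test itself is straightforward to reproduce: compare observed first-digit frequencies against the Benford proportions log10(1 + 1/d) with a chi-square test, as in the sketch below (synthetic counts, not the study's data).

```python
# Leading-digit test against Benford's law using a chi-square goodness-of-fit
# test (the counts below are synthetic, not the study's death-count data).
import numpy as np
from scipy.stats import chisquare

counts = np.random.lognormal(mean=3.0, sigma=1.0, size=500).astype(int) + 1
first_digits = np.array([int(str(c)[0]) for c in counts])

observed = np.array([(first_digits == d).sum() for d in range(1, 10)])
benford = np.log10(1 + 1 / np.arange(1, 10))      # expected proportions for digits 1-9
expected = benford * observed.sum()

stat, p = chisquare(observed, f_exp=expected)
print("chi-square = %.2f, p = %.3f" % (stat, p))
print("most common leading digit:", np.argmax(observed) + 1)
```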

  1. A computational image analysis glossary for biologists.

    PubMed

    Roeder, Adrienne H K; Cunha, Alexandre; Burl, Michael C; Meyerowitz, Elliot M

    2012-09-01

    Recent advances in biological imaging have resulted in an explosion in the quality and quantity of images obtained in a digital format. Developmental biologists are increasingly acquiring beautiful and complex images, thus creating vast image datasets. In the past, patterns in image data have been detected by the human eye. Larger datasets, however, necessitate high-throughput objective analysis tools to computationally extract quantitative information from the images. These tools have been developed in collaborations between biologists, computer scientists, mathematicians and physicists. In this Primer we present a glossary of image analysis terms to aid biologists and briefly discuss the importance of robust image analysis in developmental studies.

  2. Methods for assessing long-term mean pathogen count in drinking water and risk management implications.

    PubMed

    Englehardt, James D; Ashbolt, Nicholas J; Loewenstine, Chad; Gadzinski, Erik R; Ayenu-Prah, Albert Y

    2012-06-01

    Recently pathogen counts in drinking and source waters were shown theoretically to have the discrete Weibull (DW) or closely related discrete growth distribution (DGD). The result was demonstrated versus nine short-term and three simulated long-term water quality datasets. These distributions are highly skewed such that available datasets seldom represent the rare but important high-count events, making estimation of the long-term mean difficult. In the current work the methods, and data record length, required to assess long-term mean microbial count were evaluated by simulation of representative DW and DGD waterborne pathogen count distributions. Also, microbial count data were analyzed spectrally for correlation and cycles. In general, longer data records were required for more highly skewed distributions, conceptually associated with more highly treated water. In particular, 500-1,000 random samples were required for reliable assessment of the population mean ±10%, though 50-100 samples produced an estimate within one log (45%) below. A simple correlated first order model was shown to produce count series with 1/f signal, and such periodicity over many scales was shown in empirical microbial count data, for consideration in sampling. A tiered management strategy is recommended, including a plan for rapid response to unusual levels of routinely-monitored water quality indicators.

  3. Intraday price dynamics in spot and derivatives markets

    NASA Astrophysics Data System (ADS)

    Kim, Jun Sik; Ryu, Doojin

    2014-01-01

    This study examines intraday relationships among the spot index, index futures, and the implied volatility index based on the VAR(1)-asymmetric BEKK-MGARCH model. Analysis of a high-frequency dataset from the Korean financial market confirms that there is a strong intraday market linkage between the spot index, KOSPI200 futures, and VKOSPI and that asymmetric volatility behaviour is clearly present in the Korean market. The empirical results indicate that the futures return shock affects the spot market more severely than the spot return shock affects the futures market, though there is a bi-directional causal relationship between the spot and futures markets. Our results, based on a high-quality intraday dataset, satisfy both the positive risk-return relationship and asymmetric volatility effect, which are not reconciled in the frameworks of previous studies.

  4. General practitioner (family physician) workforce in Australia: comparing geographic data from surveys, a mailing list and medicare

    PubMed Central

    2013-01-01

    Background Good quality spatial data on Family Physicians or General Practitioners (GPs) are key to accurately measuring geographic access to primary health care. The validity of computed associations between health outcomes and measures of GP access such as GP density is contingent on geographical data quality. This is especially true in rural and remote areas, where GPs are often small in number and geographically dispersed. However, there has been limited effort in assessing the quality of nationally comprehensive, geographically explicit, GP datasets in Australia or elsewhere. Our objective is to assess the extent of association or agreement between different spatially explicit nationwide GP workforce datasets in Australia. This is important since disagreement would imply differential relationships with primary healthcare relevant outcomes with different datasets. We also seek to enumerate these associations across categories of rurality or remoteness. Method We compute correlations of GP headcounts and workload contributions between four different datasets at two different geographical scales, across varying levels of rurality and remoteness. Results The datasets are in general agreement with each other at two different scales. Small numbers of absolute headcounts, with relatively larger fractions of locum GPs in rural areas cause unstable statistical estimates and divergences between datasets. Conclusion In the Australian context, many of the available geographic GP workforce datasets may be used for evaluating valid associations with health outcomes. However, caution must be exercised in interpreting associations between GP headcounts or workloads and outcomes in rural and remote areas. The methods used in these analyses may be replicated in other locales with multiple GP or physician datasets. PMID:24005003

  5. Assessment of the NASA-USGS Global Land Survey (GLS) Datasets

    USGS Publications Warehouse

    Gutman, Garik; Huang, Chengquan; Chander, Gyanesh; Noojipady, Praveen; Masek, Jeffery G.

    2013-01-01

    The Global Land Survey (GLS) datasets are a collection of orthorectified, cloud-minimized Landsat-type satellite images, providing near complete coverage of the global land area decadally since the early 1970s. The global mosaics are centered on 1975, 1990, 2000, 2005, and 2010, and consist of data acquired from four sensors: Enhanced Thematic Mapper Plus, Thematic Mapper, Multispectral Scanner, and Advanced Land Imager. The GLS datasets have been widely used in land-cover and land-use change studies at local, regional, and global scales. This study evaluates the GLS datasets with respect to their spatial coverage, temporal consistency, geodetic accuracy, radiometric calibration consistency, image completeness, extent of cloud contamination, and residual gaps. In general, the three latest GLS datasets are of a better quality than the GLS-1990 and GLS-1975 datasets, with most of the imagery (85%) having cloud cover of less than 10%, the acquisition years clustered much more tightly around their target years, better co-registration relative to GLS-2000, and better radiometric absolute calibration. Probably, the most significant impediment to scientific use of the datasets is the variability of image phenology (i.e., acquisition day of year). This paper provides end-users with an assessment of the quality of the GLS datasets for specific applications, and where possible, suggestions for mitigating their deficiencies.

  6. Separation of pulsar signals from noise using supervised machine learning algorithms

    NASA Astrophysics Data System (ADS)

    Bethapudi, S.; Desai, S.

    2018-04-01

    We evaluate the performance of four different machine learning (ML) algorithms: an Artificial Neural Network Multi-Layer Perceptron (ANN MLP), Adaboost, Gradient Boosting Classifier (GBC), and XGBoost, for the separation of pulsars from radio frequency interference (RFI) and other sources of noise, using a dataset obtained from the post-processing of a pulsar search pipeline. This dataset was previously used for the cross-validation of the SPINN-based machine learning engine, obtained from the reprocessing of the HTRU-S survey data (Morello et al., 2014). We have used the Synthetic Minority Over-sampling Technique (SMOTE) to deal with the high class imbalance in the dataset. We report a variety of quality scores from all four of these algorithms on both the non-SMOTE and SMOTE datasets. For all the above ML methods, we report high accuracy and G-mean for both the non-SMOTE and SMOTE cases. We study the feature importances using Adaboost, GBC, and XGBoost and also from the minimum Redundancy Maximum Relevance approach to report algorithm-agnostic feature ranking. From these methods, we find the signal-to-noise ratio of the folded profile to be the best feature. We find that all the ML algorithms report FPRs about an order of magnitude lower than the corresponding FPRs obtained in Morello et al. (2014), for the same recall value.
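
    A minimal sketch of the imbalance-handling step described above: oversampling the minority (pulsar) class with SMOTE from imbalanced-learn before training a gradient boosting classifier, then reporting recall, false positive rate and G-mean. The candidate file and column names are hypothetical.

    ```python
    import numpy as np
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("pulsar_candidates.csv")         # hypothetical candidate features
    X, y = df.drop(columns="label"), df["label"]      # label: 1 = pulsar, 0 = RFI/noise
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance the classes
    clf = GradientBoostingClassifier().fit(X_res, y_res)

    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    recall, specificity = tp / (tp + fn), tn / (tn + fp)
    print("recall:", recall, "FPR:", fp / (fp + tn), "G-mean:", np.sqrt(recall * specificity))
    ```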

  7. Reefs of tomorrow: eutrophication reduces coral biodiversity in an urbanized seascape.

    PubMed

    Duprey, Nicolas N; Yasuhara, Moriaki; Baker, David M

    2016-11-01

    Although the impacts of nutrient pollution on coral reefs are well known, surprisingly, no statistical relationships have ever been established between water quality parameters, coral biodiversity and coral cover. Hong Kong provides a unique opportunity to assess this relationship. Here, coastal waters have been monitored monthly since 1986 at 76 stations, providing a highly spatially resolved water quality dataset including 68 903 data points. Moreover, a robust coral species richness (S) dataset is available from more than 100 surveyed locations, comprising observations of 3453 individual colonies, as well as a coral cover (CC) dataset including 85 sites. This wealth of data provides a unique opportunity to test the hypothesis that water quality, and in particular nutrients, drives coral biodiversity. The influence of water quality on S and CC was analyzed using GIS and multiple regression modeling. Eutrophication (as chlorophyll-a concentration; CHLA) was negatively correlated with S and CC, whereas physicochemical parameters (DO and salinity) had no significant effect. The modeling further illustrated that particulate suspended matter, dissolved inorganic nitrogen (DIN) and dissolved inorganic phosphorus (DIP) had a negative effect on S and on CC; however, the effect of nutrients was 1.5-fold to twofold greater. The highest S and CC occurred where CHLA < 2 μg L−1, DIN < 2 μM and DIP < 0.1 μM. Where these values were exceeded, S and CC were significantly lower, and no live corals were observed where CHLA > 15 μg L−1, DIN > 9 μM and DIP > 0.33 μM. This study demonstrates the importance of nutrients over other water quality parameters in coral biodiversity loss and highlights the key role of eutrophication in shaping coastal coral reef ecosystems. This work also provides ecological thresholds that may be useful for water quality guidelines and nutrient mitigation policies. © 2016 John Wiley & Sons Ltd.
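
    The regression step can be sketched as an ordinary least-squares model of species richness on the water-quality covariates, plus a simple screen against the reported thresholds. Variable and file names below are hypothetical stand-ins for the monitoring data, not the study's actual model specification.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("hk_sites_water_quality.csv")    # hypothetical: one row per surveyed site
    # richness modelled on chlorophyll-a, suspended solids, DIN, DIP, DO and salinity
    model = smf.ols(
        "richness ~ chla + suspended_solids + din + dip + dissolved_oxygen + salinity",
        data=df,
    ).fit()
    print(model.summary())

    # simple screen mirroring the reported thresholds (CHLA in ug/L, DIN and DIP in uM)
    good = df[(df.chla < 2) & (df.din < 2) & (df.dip < 0.1)]
    print("mean richness below thresholds:", good.richness.mean())
    ```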

  8. High Quality Facade Segmentation Based on Structured Random Forest, Region Proposal Network and Rectangular Fitting

    NASA Astrophysics Data System (ADS)

    Rahmani, K.; Mayer, H.

    2018-05-01

    In this paper we present a pipeline for high quality semantic segmentation of building facades using Structured Random Forest (SRF), Region Proposal Network (RPN) based on a Convolutional Neural Network (CNN) as well as rectangular fitting optimization. Our main contribution is that we employ features created by the RPN as channels in the SRF. We empirically show that this is very effective, especially for doors and windows. Our pipeline is evaluated on two datasets where we outperform current state-of-the-art methods. Additionally, we quantify the contribution of the RPN and the rectangular fitting optimization to the accuracy of the result.

  9. Sub-sampling genetic data to estimate black bear population size: A case study

    USGS Publications Warehouse

    Tredick, C.A.; Vaughan, M.R.; Stauffer, D.F.; Simek, S.L.; Eason, T.

    2007-01-01

    Costs for genetic analysis of hair samples collected for individual identification of bears average approximately US$50 [2004] per sample. This can easily exceed budgetary allowances for large-scale studies or studies of high-density bear populations. We used 2 genetic datasets from 2 areas in the southeastern United States to explore how reducing costs of analysis by sub-sampling affected precision and accuracy of resulting population estimates. We used several sub-sampling scenarios to create subsets of the full datasets and compared summary statistics, population estimates, and precision of estimates generated from these subsets to estimates generated from the complete datasets. Our results suggested that bias and precision of estimates improved as the proportion of total samples used increased, and heterogeneity models (e.g., Mh[CHAO]) were more robust to reduced sample sizes than other models (e.g., behavior models). We recommend that only high-quality samples (>5 hair follicles) be used when budgets are constrained, and efforts should be made to maximize capture and recapture rates in the field.

  10. The role of biological fertility in predicting family size.

    PubMed

    Joffe, M; Key, J; Best, N; Jensen, T K; Keiding, N

    2009-08-01

    It is plausible that a couple's ability to achieve the desired number of children is limited by biological fertility, especially if childbearing is postponed. Family size has declined and semen quality may have deteriorated in much of Europe, although studies have found an increase rather than a decrease in couple fertility. Using four high-quality European datasets, we took the reported time to pregnancy (TTP) as the predictor variable; births reported as following contraceptive failure were an additional category. The outcome variable was final or near-final family size. Potential confounders were maternal age when unprotected sex began prior to the first birth, and maternal smoking. Desired family size was available in only one of the datasets. Couples with a TTP of at least 12 months tended to have smaller families, with odds ratios for the risk of not having a second child approximately 1.8, and for the risk of not having a third child approximately 1.6. Below 12 months no association was observed. Findings were generally consistent across datasets. There was also a more than 2-fold risk of not achieving the desired family size if TTP was 12 months or more for the first child. Within the limits of the available data quality, family size appears to be predicted by biological fertility, even after adjustment for maternal age, if the woman was at least 20 years old when the couple's first attempt at conception started. The contribution of behavioural factors to this result also needs to be investigated.

  11. Postsecondary Schooling and Parental Resources: Evidence from the PSID and HRS

    ERIC Educational Resources Information Center

    Haider, Steven J.; McGarry, Kathleen

    2018-01-01

    We examine the association between young adult postsecondary schooling and parental financial resources using two datasets that contain high-quality data on parental resources: the Panel Study of Income Dynamics (PSID) and the Health and Retirement Study (HRS). We find the association to be pervasive--it exists for income and wealth, it extends…

  12. AN ASSESSMENT OF THE ABILITY OF 3-D AIR QUALITY MODELS WITH CURRENT THERMODYNAMIC EQUILIBRIUM MODELS TO PREDICT AEROSOL NO3

    EPA Science Inventory

    The partitioning of total nitrate (TNO3) and total ammonium (TNH4) between gas and aerosol phases is studied with two thermodynamic equilibrium models, ISORROPIA and AIM, and three datasets: high time-resolution measurement data from the 1999 Atlanta SuperSite Experiment and from...

  13. Identifying Personal and Contextual Factors that Contribute to Attrition Rates for Texas Public School Teachers

    ERIC Educational Resources Information Center

    Sass, Daniel A.; Flores, Belinda Bustos; Claeys, Lorena; Perez, Bertha

    2012-01-01

    Teacher attrition is a significant problem facing schools, with a large percentage of teachers leaving the profession within their first few years. Given the need to retain high-quality teachers, research is needed to identify those teachers with higher retention rates. Using survival analyses and a large state dataset, researchers examined…

  14. High resolution infrared datasets useful for validating stratospheric models

    NASA Technical Reports Server (NTRS)

    Rinsland, Curtis P.

    1992-01-01

    An important objective of the High Speed Research Program (HSRP) is to support research in the atmospheric sciences that will improve the basic understanding of the circulation and chemistry of the stratosphere and lead to an interim assessment of the impact of a projected fleet of High Speed Civil Transports (HSCT's) on the stratosphere. As part of this work, critical comparisons between models and existing high quality measurements are planned. These comparisons will be used to test the reliability of current atmospheric chemistry models. Two suitable sets of high resolution infrared measurements are discussed.

  15. Dataset Lifecycle Policy

    NASA Technical Reports Server (NTRS)

    Armstrong, Edward; Tauer, Eric

    2013-01-01

    The presentation focused on describing a new dataset lifecycle policy that the NASA Physical Oceanography DAAC (PO.DAAC) has implemented for its new and current datasets to foster improved stewardship and consistency across its archive. The overarching goal is to implement this dataset lifecycle policy for all new GHRSST GDS2 datasets and bridge the mission statements from the GHRSST Project Office and PO.DAAC to provide the best quality SST data in a cost-effective, efficient manner, preserving its integrity so that it will be available and usable to a wide audience.

  16. A comprehensive evaluation of assembly scaffolding tools

    PubMed Central

    2014-01-01

    Background Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream analyses, and it is appealing to present larger numbers as metrics of assembly performance. However, scaffolds are highly prone to errors, especially when generated using short reads, which can directly result in inflated assembly statistics. Results Here we provide the first independent evaluation of scaffolding tools for second-generation sequencing data. We find large variations in the quality of results depending on the tool and dataset used. Even extremely simple test cases of perfect input, constructed to elucidate the behaviour of each algorithm, produced some surprising results. We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output. However, at least 10% of joins remain unidentified when using real data. Conclusions The scaffolders vary in their usability, speed and number of correct and missed joins made between contigs. Results from real data highlight opportunities for further improvements of the tools. Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our datasets. However, the quality of the results is highly dependent on the read mapper and genome complexity. PMID:24581555

  17. A microarray whole-genome gene expression dataset in a rat model of inflammatory corneal angiogenesis.

    PubMed

    Mukwaya, Anthony; Lindvall, Jessica M; Xeroudaki, Maria; Peebo, Beatrice; Ali, Zaheer; Lennikov, Anton; Jensen, Lasse Dahl Ejby; Lagali, Neil

    2016-11-22

    In angiogenesis with concurrent inflammation, many pathways are activated, some linked to VEGF and others largely VEGF-independent. Pathways involving inflammatory mediators, chemokines, and micro-RNAs may play important roles in maintaining a pro-angiogenic environment or mediating angiogenic regression. Here, we describe a gene expression dataset to facilitate exploration of pro-angiogenic, pro-inflammatory, and remodelling/normalization-associated genes during both an active capillary sprouting phase, and in the restoration of an avascular phenotype. The dataset was generated by microarray analysis of the whole transcriptome in a rat model of suture-induced inflammatory corneal neovascularisation. Regions of active capillary sprout growth or regression in the cornea were harvested and total RNA extracted from four biological replicates per group. High quality RNA was obtained for gene expression analysis using microarrays. Fold change of selected genes was validated by qPCR, and protein expression was evaluated by immunohistochemistry. We provide a gene expression dataset that may be re-used to investigate corneal neovascularisation, and may also have implications in other contexts of inflammation-mediated angiogenesis.

  18. Image quality of mean temporal arterial and mean temporal portal venous phase images calculated from low dose dynamic volume perfusion CT datasets in patients with hepatocellular carcinoma and pancreatic cancer.

    PubMed

    Wang, X; Henzler, T; Gawlitza, J; Diehl, S; Wilhelm, T; Schoenberg, S O; Jin, Z Y; Xue, H D; Smakic, A

    2016-11-01

    Dynamic volume perfusion CT (dVPCT) provides valuable information on tissue perfusion in patients with hepatocellular carcinoma (HCC) and pancreatic cancer. However, dVPCT is currently often performed in addition to conventional CT acquisitions because of the limited morphologic image quality of dose-optimized dVPCT protocols. The aim of this study was to prospectively compare objective and subjective image quality, lesion detectability and radiation dose between mean temporal arterial (mTA) and mean temporal portal venous (mTPV) images calculated from low-dose dVPCT datasets and linearly blended 120-kVp arterial and portal venous datasets in patients with HCC and pancreatic cancer. All patients gave written informed consent for this institutional review board-approved, HIPAA-compliant study. 27 consecutive patients (18 men, 9 women; mean age 69.1 years ± 9.4) with histologically proven HCC or suspected pancreatic cancer were prospectively enrolled. The study CT protocol included a dVPCT protocol performed with 70 or 80 kVp tube voltage (18 spiral acquisitions, 71.2 s total acquisition time) and standard dual-energy (90/150 kVp Sn) arterial and portal venous acquisitions performed 25 min after the dVPCT. The mTA and mTPV images were manually reconstructed from the 3 to 5 visually best single arterial phases and the 3 to 5 best single portal venous phases of the dVPCT dataset. The linearly blended 120-kVp images were calculated from dual-energy CT (DECT) raw data. Image noise, SNR, and CNR of the liver, abdominal aorta (AA) and main portal vein (PV) were compared between the mTA/mTPV and the linearly blended 120-kVp dual-energy arterial and portal venous datasets, respectively. Subjective image quality was evaluated by two radiologists regarding subjective image noise, sharpness and overall diagnostic image quality using a 5-point Likert scale. In addition, liver lesion detectability was assessed for each liver segment by the two radiologists using the linearly blended 120-kVp arterial and portal venous datasets as the reference standard. Image noise, SNR and CNR values of the mTA and mTPV were significantly higher when compared to the corresponding linearly blended arterial and portal venous 120-kVp datasets (all p<0.001), except for image noise within the PV in the portal venous phases (p=0.136). Subjective image quality of mTA and mTPV was rated significantly better when compared to the linearly blended 120-kVp arterial and portal venous datasets. Both readers were able to detect all liver lesions found on the linearly blended 120-kVp arterial and portal venous datasets using the mTA and mTPV datasets. The effective radiation dose of the dVPCT was 27.6 mSv for the 80 kVp protocol and 14.5 mSv for the 70 kVp protocol. The mean effective radiation dose for the linearly blended 120-kVp arterial and portal venous CT protocol of the upper abdomen was 5.60 ± 1.48 mSv. Our preliminary data suggest that the subjective and objective image quality of mTA and mTPV datasets calculated from low-kVp dVPCT datasets is non-inferior to linearly blended 120-kVp arterial and portal venous acquisitions in patients with HCC and pancreatic cancer. Thus, dVPCT could be used as a stand-alone imaging technique without additionally performed conventional arterial and portal venous CT acquisitions. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
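
    The objective metrics compared above are conventionally computed from region-of-interest statistics. The sketch below uses the standard definitions (SNR = ROI mean / noise SD, CNR = |ROI mean difference| / noise SD) on a toy image; the study's exact ROI placement and formulas may differ.

    ```python
    import numpy as np

    def roi_stats(image, mask):
        vals = image[mask]
        return vals.mean(), vals.std(ddof=1)

    def snr(mean_roi, noise_sd):
        return mean_roi / noise_sd

    def cnr(mean_a, mean_b, noise_sd):
        return abs(mean_a - mean_b) / noise_sd

    # toy "CT slice": noisy background with a brighter enhancing structure in the centre
    rng = np.random.default_rng(1)
    img = rng.normal(60, 15, (256, 256))          # background ~60 HU, noise SD ~15 HU
    img[96:160, 96:160] += 120                    # enhancing vessel ROI

    vessel = np.zeros(img.shape, bool); vessel[96:160, 96:160] = True
    background = np.zeros(img.shape, bool); background[:64, :64] = True

    m_v, _ = roi_stats(img, vessel)
    m_b, sd_b = roi_stats(img, background)        # noise taken from the background ROI
    print(f"SNR = {snr(m_v, sd_b):.1f}, CNR = {cnr(m_v, m_b, sd_b):.1f}")
    ```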

  19. Derivation, Validation and Application of a Pragmatic Risk Prediction Index for Benchmarking of Surgical Outcomes.

    PubMed

    Spence, Richard T; Chang, David C; Kaafarani, Haytham M A; Panieri, Eugenio; Anderson, Geoffrey A; Hutter, Matthew M

    2018-02-01

    Despite the existence of multiple validated risk assessment and quality benchmarking tools in surgery, their utility outside of high-income countries is limited. We sought to derive, validate and apply a scoring system that is both (1) feasible and (2) reliably predictive of mortality in a middle-income country (MIC) context. A 5-step methodology was used: (1) development of a de novo surgical outcomes database modeled around the American College of Surgeons' National Surgical Quality Improvement Program (ACS-NSQIP) in South Africa (SA dataset), (2) use of the resultant data to identify all predictors of in-hospital death with more than 90% capture indicating feasibility of collection, (3) use of these predictors to derive and validate an integer-based score that reliably predicts in-hospital death in the 2012 ACS-NSQIP, (4) application of the score to the original SA dataset to demonstrate its performance, and (5) identification of threshold cutoffs of the score to prompt action and drive quality improvement. Following steps one to three above, the 13-point Codman's score was derived and validated on 211,737 and 109,079 patients, respectively, and includes: age ≥65 (1), partially or completely dependent functional status (1), preoperative transfusions ≥4 units (1), emergency operation (2), sepsis or septic shock (2), American Society of Anesthesiologists score ≥3 (3), and operative procedure (1-3). Application of the score to 373 patients in the SA dataset showed good discrimination and calibration for predicting in-hospital death. A Codman score of 8 is an optimal cutoff point for distinguishing expected from unexpected deaths. We have designed a novel risk prediction score specific to a MIC context. The Codman score can prove useful both for (1) preoperative decision-making and (2) benchmarking the quality of surgical care in MICs.
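
    The scoring logic itself is simple enough to express directly. The sketch below assigns the point values listed in the abstract; the procedure-specific 1-3 points are represented by an assumed lookup that would have to be taken from the original paper, and the age cutoff direction is assumed.

    ```python
    from dataclasses import dataclass

    # assumed grouping for the procedure-specific 1-3 points (see the original paper)
    PROCEDURE_POINTS = {"low_risk": 1, "intermediate_risk": 2, "high_risk": 3}

    @dataclass
    class Patient:
        age: int
        dependent_functional_status: bool
        preop_transfusion_units: int
        emergency_operation: bool
        sepsis_or_septic_shock: bool
        asa_class: int
        procedure_group: str

    def codman_score(p: Patient) -> int:
        score = 0
        score += 1 if p.age >= 65 else 0                      # cutoff direction assumed
        score += 1 if p.dependent_functional_status else 0
        score += 1 if p.preop_transfusion_units >= 4 else 0
        score += 2 if p.emergency_operation else 0
        score += 2 if p.sepsis_or_septic_shock else 0
        score += 3 if p.asa_class >= 3 else 0
        score += PROCEDURE_POINTS[p.procedure_group]
        # the abstract identifies a score of 8 as the cutoff separating
        # expected from unexpected in-hospital deaths
        return score

    print(codman_score(Patient(72, False, 0, True, True, 3, "high_risk")))   # -> 11
    ```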

  20. Multifactorial Optimization of Contrast-Enhanced Nanofocus Computed Tomography for Quantitative Analysis of Neo-Tissue Formation in Tissue Engineering Constructs.

    PubMed

    Sonnaert, Maarten; Kerckhofs, Greet; Papantoniou, Ioannis; Van Vlierberghe, Sandra; Boterberg, Veerle; Dubruel, Peter; Luyten, Frank P; Schrooten, Jan; Geris, Liesbet

    2015-01-01

    To progress the fields of tissue engineering (TE) and regenerative medicine, development of quantitative methods for non-invasive three dimensional characterization of engineered constructs (i.e. cells/tissue combined with scaffolds) becomes essential. In this study, we have defined the most optimal staining conditions for contrast-enhanced nanofocus computed tomography for three dimensional visualization and quantitative analysis of in vitro engineered neo-tissue (i.e. extracellular matrix containing cells) in perfusion bioreactor-developed Ti6Al4V constructs. A fractional factorial 'design of experiments' approach was used to elucidate the influence of the staining time and concentration of two contrast agents (Hexabrix and phosphotungstic acid) and the neo-tissue volume on the image contrast and dataset quality. Additionally, the neo-tissue shrinkage that was induced by phosphotungstic acid staining was quantified to determine the operating window within which this contrast agent can be accurately applied. For Hexabrix the staining concentration was the main parameter influencing image contrast and dataset quality. Using phosphotungstic acid the staining concentration had a significant influence on the image contrast while both staining concentration and neo-tissue volume had an influence on the dataset quality. The use of high concentrations of phosphotungstic acid did however introduce significant shrinkage of the neo-tissue indicating that, despite sub-optimal image contrast, low concentrations of this staining agent should be used to enable quantitative analysis. To conclude, design of experiments allowed us to define the most optimal staining conditions for contrast-enhanced nanofocus computed tomography to be used as a routine screening tool of neo-tissue formation in Ti6Al4V constructs, transforming it into a robust three dimensional quality control methodology.
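
    As a generic illustration of the 'design of experiments' idea (not the study's actual design), the sketch below builds a two-level 2^(3-1) fractional factorial in three hypothetical factors (staining time, concentration, neo-tissue volume) and fits a main-effects model to made-up image-contrast responses.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # 2^(3-1) fractional factorial (coded -1/+1); the volume column is the
    # product of time and conc, i.e. it is aliased with that interaction
    runs = pd.DataFrame(
        [(-1, -1, +1), (+1, -1, -1), (-1, +1, -1), (+1, +1, +1)],
        columns=["time", "conc", "volume"],
    )
    runs["contrast"] = [0.42, 0.55, 0.61, 0.83]   # made-up image-contrast responses

    fit = smf.ols("contrast ~ time + conc + volume", data=runs).fit()
    print(fit.params)   # intercept and main-effect estimates per coded unit
    ```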

  1. Tools for proactive collection and use of quality metadata in GEOSS

    NASA Astrophysics Data System (ADS)

    Bastin, L.; Thum, S.; Maso, J.; Yang, K. X.; Nüst, D.; Van den Broek, M.; Lush, V.; Papeschi, F.; Riverola, A.

    2012-12-01

    The GEOSS Common Infrastructure allows interactive evaluation and selection of Earth Observation datasets by the scientific community and decision makers, but the data quality information needed to assess fitness for use is often patchy and hard to visualise when comparing candidate datasets. In a number of studies over the past decade, users repeatedly identified the same types of gaps in quality metadata, specifying the need for enhancements such as peer and expert review, better traceability and provenance information, information on citations and usage of a dataset, warning about problems identified with a dataset and potential workarounds, and 'soft knowledge' from data producers (e.g. recommendations for use which are not easily encoded using the existing standards). Despite clear identification of these issues in a number of recommendations, the gaps persist in practice and are highlighted once more in our own, more recent, surveys. This continuing deficit may well be the result of a historic paucity of tools to support the easy documentation and continual review of dataset quality. However, more recent developments in tools and standards, as well as more general technological advances, present the opportunity for a community of scientific users to adopt a more proactive attitude by commenting on their uses of data, and for that feedback to be federated with more traditional and static forms of metadata, allowing a user to more accurately assess the suitability of a dataset for their own specific context and reliability thresholds. The EU FP7 GeoViQua project aims to develop this opportunity by adding data quality representations to the existing search and visualisation functionalities of the Geo Portal. Subsequently we will help to close the gap by providing tools to easily create quality information, and to permit user-friendly exploration of that information as the ultimate incentive for improved data quality documentation. Quality information is derived from producer metadata, from the data themselves, from validation of in-situ sensor data, from provenance information and from user feedback, and will be aggregated to produce clear and useful summaries of quality, including a GEO Label. GeoViQua's conceptual quality information models for users and producers are specifically described and illustrated in this presentation. These models (which have been encoded as XML schemas and can be accessed at http://schemas.geoviqua.org/) are designed to satisfy the identified user needs while remaining consistent with current standards such as ISO 19115 and advanced drafts such as ISO 19157. The resulting components being developed for the GEO Portal are designed to lower the entry barrier to users who wish to help to generate and explore rich and useful metadata. This metadata will include reviews, comments and ratings, reports of usage in specific domains and specification of datasets used for benchmarking, as well as rich quantitative information encoded in more traditional data quality elements such as thematic correctness and positional accuracy. The value of the enriched metadata will also be enhanced by graphical tools for visualizing spatially distributed uncertainties. We demonstrate practical example applications in selected environmental application domains.

  2. Extraction and Analysis of Regional Emission and Absorption Events of Greenhouse Gases with GOSAT and OCO-2

    NASA Astrophysics Data System (ADS)

    Kasai, K.; Shiomi, K.; Konno, A.; Tadono, T.; Hori, M.

    2016-12-01

    Global observation of greenhouse gases such as carbon dioxide (CO2) and methane (CH4) with high spatio-temporal resolution, together with accurate estimation of sources and sinks, is important for understanding greenhouse gas dynamics. The Greenhouse Gases Observing Satellite (GOSAT) has observed column-averaged dry-air mole fractions of CO2 (XCO2) and CH4 (XCH4) for over 7 years since January 2009, with a wide swath but sparse pointing. The Orbiting Carbon Observatory-2 (OCO-2) has observed XCO2 jointly on orbit since July 2014, with a narrow swath but high resolution. We use two retrieved datasets as GOSAT observation data: the ACOS GOSAT/TANSO-FTS Level 2 Full Product by NASA/JPL and the NIES TANSO-FTS L2 column amount (SWIR). Using these GOSAT datasets and the OCO-2 L2 Full Product, the biases among datasets, local sources and sinks, and the temporal variability of greenhouse gases are clarified. In addition, CarbonTracker, a global model of atmospheric CO2 and CH4 developed by NOAA/ESRL, is analyzed for comparison between satellite observations and atmospheric model data. Before analysis, outliers are screened using the quality flag, outcome flag, and warn level over land and sea. Time series of XCO2 and XCH4 are obtained globally from the satellite observation and atmospheric model datasets, and functions expressing typical inter-annual and seasonal variation are fitted to each spatial grid cell. Anomalous events in XCO2 and XCH4 are then extracted from the difference between each time series and the fitted function. Regional emission and absorption events are analyzed from the time series variation of the satellite observation data and by comparison with the atmospheric model data.
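
    The anomaly-extraction step amounts to fitting a trend-plus-seasonal function per grid cell and flagging large residuals. A minimal sketch on a synthetic XCO2 series, using a linear trend and one annual harmonic (the study's fitted function may differ):

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    t = np.arange(0, 7, 1 / 36.5)                      # ~7 years in decimal years
    xco2 = 395 + 2.3 * t + 3.0 * np.sin(2 * np.pi * t) + rng.normal(0, 0.8, t.size)
    xco2[150:155] += 4.0                               # an injected "anomalous event"

    # design matrix: constant, linear trend, one annual harmonic (sin/cos)
    X = np.column_stack([np.ones_like(t), t,
                         np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    coef, *_ = np.linalg.lstsq(X, xco2, rcond=None)
    anomaly = xco2 - X @ coef                          # residual from the fitted function

    flagged = np.where(np.abs(anomaly) > 3 * anomaly.std())[0]
    print("flagged indices:", flagged)
    ```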

  3. Effect of room temperature transport vials on DNA quality and phylogenetic composition of faecal microbiota of elderly adults and infants.

    PubMed

    Hill, Cian J; Brown, Jillian R M; Lynch, Denise B; Jeffery, Ian B; Ryan, C Anthony; Ross, R Paul; Stanton, Catherine; O'Toole, Paul W

    2016-05-10

    Alterations in intestinal microbiota have been correlated with a growing number of diseases. Investigating the faecal microbiota is widely used as a non-invasive and ethically simple proxy for intestinal biopsies. There is an urgent need for collection and transport media that would allow faecal sampling at distance from the processing laboratory, obviating the need for same-day DNA extraction recommended by previous studies of freezing and processing methods for stool. We compared the faecal bacterial DNA quality and apparent phylogenetic composition derived using a commercial kit for stool storage and transport (DNA Genotek OMNIgene GUT) with that of freshly extracted samples, 22 from infants and 20 from older adults. Use of the storage vials increased the quality of extracted bacterial DNA by reduction of DNA shearing. When infant and elderly datasets were examined separately, no differences in microbiota composition were observed due to storage. When the two datasets were combined, there was a difference according to a Wilcoxon test in the relative proportions of Faecalibacterium, Sporobacter, Clostridium XVIII, and Clostridium XlVa after 1 week's storage compared to immediately extracted samples. After 2 weeks' storage, Bacteroides abundance was also significantly different, showing an apparent increase from week 1 to week 2. The microbiota composition of infant samples was more affected than that of elderly samples by storage, with significantly higher Spearman distances between paired freshly extracted and stored samples (p < 0.001). When the microbiota profiles were analysed at the operational taxonomic unit (OTU) level, three infant datasets in the study did not cluster together, while only one elderly dataset did not. The lower microbiota diversity of the infant gut microbiota compared to the elderly gut microbiota (p < 0.001) means that any alteration in the infant datasets has a proportionally larger effect. The commercial storage vials appear to be suitable for high diversity microbiota samples, but may be less appropriate for lower diversity samples. Differences between fresh and stored samples mean that where storage is unavoidable, a consistent storage regime should be used. We would recommend extraction ideally within the first week of storage.
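
    The two comparisons mentioned, a paired Wilcoxon test on a taxon's relative abundance and per-subject Spearman distances between fresh and stored profiles, can be sketched with SciPy as below; the input relative-abundance tables and file names are hypothetical.

    ```python
    import pandas as pd
    from scipy.stats import spearmanr, wilcoxon

    # hypothetical relative-abundance tables (subjects x taxa) with identical
    # row (subject) and column (taxon) ordering
    fresh = pd.read_csv("fresh_relabund.csv", index_col=0)
    stored = pd.read_csv("stored_week1_relabund.csv", index_col=0)

    # paired test for one genus of interest
    stat, p = wilcoxon(fresh["Faecalibacterium"], stored["Faecalibacterium"])
    print("Wilcoxon p =", p)

    # per-subject Spearman distance (1 - rho) between fresh and stored profiles
    dist = pd.Series(
        {s: 1 - spearmanr(fresh.loc[s], stored.loc[s])[0] for s in fresh.index}
    )
    print(dist.describe())
    ```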

  4. Water quality assessment of Australian ports using water quality evaluation indices

    PubMed Central

    Jahan, Sayka

    2017-01-01

    Australian ports serve diverse and extensive activities, such as shipping, tourism and fisheries, which may all impact the quality of port water. In this work, a range of water quality evaluation indices was applied to monitoring data from several ports to assess port water quality. Seawater samples from 30 stations, collected in 2016–2017 at six ports in NSW, Australia (Port Jackson, Botany, Kembla, Newcastle, Yamba and Eden), were investigated to determine the physicochemical and biological variables that affect port water quality. The large datasets obtained were used to determine the Water Quality Index, Heavy metal Evaluation Index, Contamination Index and the newly developed Environmental Water Quality Index. The study revealed a medium Water Quality Index and high to medium Heavy metal Evaluation Index at three of the study ports, and a high Contamination Index in almost all study ports. Low levels of dissolved oxygen and elevated levels of total dissolved solids, turbidity, fecal coliforms, copper, iron, lead, zinc, manganese, cadmium and cobalt are mainly responsible for the poor water quality of the port areas. Good water quality in the background samples indicated that various port activities are the likely cause of the poor water quality inside the port areas. PMID:29244876
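
    Such indices are typically weighted aggregates of measured concentrations against guideline values. The sketch below shows one common weighted-arithmetic formulation with illustrative guideline values and measurements; it is not the specific index definitions used in the study, and parameters such as pH and dissolved oxygen need special-case sub-indices in practice.

    ```python
    import pandas as pd

    # assumed guideline values (S) and one station's measurements (C)
    standards = {"turbidity_NTU": 5.0, "TDS_mg_L": 500.0, "Cu_ug_L": 2000.0,
                 "Pb_ug_L": 10.0, "Zn_ug_L": 3000.0}
    sample = {"turbidity_NTU": 12.0, "TDS_mg_L": 640.0, "Cu_ug_L": 150.0,
              "Pb_ug_L": 14.0, "Zn_ug_L": 800.0}

    df = pd.DataFrame({"C": sample, "S": standards})
    df["q"] = 100 * df.C / df.S                 # sub-index: measured over guideline
    df["w"] = (1 / df.S) / (1 / df.S).sum()     # normalised inverse-guideline weights
    wqi = (df.w * df.q).sum()
    print(df)
    print(f"WQI = {wqi:.1f}  (values above 100 indicate overall guideline exceedance)")
    ```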

  5. A 20-year collection of sub-surface salinity and temperature observations for the Australian shelf seas

    NASA Astrophysics Data System (ADS)

    Proctor, R.; Mancini, S.; Hoenner, X.; Tattersall, K.; Pasquer, B.; Galibert, G.; Moltmann, T.

    2016-02-01

    Salinity and temperature measurements from different sources have been assembled into a common data structure in a relational database. Quality control flags have been mapped to a common scheme and associated with each measurement. For datasets such as gliders, moorings or ship-underway data, which are sampled at high temporal resolution (e.g. one record per second), a binning and sub-sampling approach has been applied to reduce the number of measurements to hourly sampling. After averaging, approximately 25 million measurements are available in this dataset collection. A national shelf and coastal data atlas has been created using all the temperature and salinity measurements that pass various quality control checks. These observations have been binned spatially on a horizontal grid of ¼ degree with standard vertical levels (every 10 meters from the surface to 500 m depth) and temporally on a monthly time range over the period January 1995 to December 2014. The number of observations in each bin has been determined and additional statistics (the mean, standard deviation, minimum and maximum values) have been calculated, enabling a degree of uncertainty to be associated with any measurement. The data atlas is available as a Web Feature Service.
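
    The two reduction steps, hourly binning of high-rate platforms and ¼-degree/10 m/monthly gridding with per-bin statistics, can be sketched with pandas as below; column and file names are assumptions.

    ```python
    import pandas as pd

    obs = pd.read_csv("shelf_obs.csv", parse_dates=["time"])  # time, lat, lon, depth, temp, salt

    # (1) hourly binning of a high-rate platform (e.g. glider or mooring)
    hourly = (obs.set_index("time")
                 .resample("1H")[["temp", "salt"]]
                 .mean()
                 .dropna())

    # (2) 1/4-degree x 10 m x monthly bins with count/mean/std/min/max per bin
    obs["lat_bin"] = (obs.lat // 0.25) * 0.25
    obs["lon_bin"] = (obs.lon // 0.25) * 0.25
    obs["depth_bin"] = (obs.depth // 10) * 10
    obs["month"] = obs.time.dt.to_period("M")
    atlas = (obs.groupby(["lat_bin", "lon_bin", "depth_bin", "month"])["temp"]
                .agg(["count", "mean", "std", "min", "max"]))
    print(atlas.head())
    ```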

  6. Improved Hourly and Sub-Hourly Gauge Data for Assessing Precipitation Extremes in the U.S.

    NASA Astrophysics Data System (ADS)

    Lawrimore, J. H.; Wuertz, D.; Palecki, M. A.; Kim, D.; Stevens, S. E.; Leeper, R.; Korzeniewski, B.

    2017-12-01

    The NOAA/National Weather Service (NWS) Fischer-Porter (F&P) weighing bucket precipitation gauge network consists of approximately 2000 stations that comprise a subset of the NWS Cooperative Observers Program network. This network has operated since the mid-20th century, providing one of the longest records of hourly and 15-minute precipitation observations in the U.S. The lengthy record of this dataset, combined with its relatively high spatial density, provides an important source of data for many hydrological applications, including understanding trends and variability in the frequency and intensity of extreme precipitation events. In recent years NOAA's National Centers for Environmental Information initiated an upgrade of its end-to-end processing and quality control system for these data. This involved a change from a largely manual review and edit process to a fully automated system that removes the subjectivity that was previously a necessary part of dataset quality control and processing. An overview of improvements to this dataset is provided along with the results of an analysis of observed variability and trends in U.S. precipitation extremes since the mid-20th century. Multi-decadal trends in many parts of the nation are consistent with model projections of an increase in the frequency and intensity of heavy precipitation in a warming world.

  7. Large-scale seismic waveform quality metric calculation using Hadoop

    NASA Astrophysics Data System (ADS)

    Magana-Zook, S.; Gaylord, J. M.; Knapp, D. R.; Dodge, D. A.; Ruppert, S. D.

    2016-09-01

    In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of 0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. These experiments were conducted multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.

  8. Color imaging of Mars by the High Resolution Imaging Science Experiment (HiRISE)

    USGS Publications Warehouse

    Delamere, W.A.; Tornabene, L.L.; McEwen, A.S.; Becker, K.; Bergstrom, J.W.; Bridges, N.T.; Eliason, E.M.; Gallagher, D.; Herkenhoff, K. E.; Keszthelyi, L.; Mattson, S.; McArthur, G.K.; Mellon, M.T.; Milazzo, M.; Russell, P.S.; Thomas, N.

    2010-01-01

    HiRISE has been producing a large number of scientifically useful color products of Mars and other planetary objects. The three broad spectral bands, coupled with the highly sensitive 14 bit detectors and time delay integration, enable detection of subtle color differences. The very high spatial resolution of HiRISE can augment the mineralogic interpretations based on multispectral (THEMIS) and hyperspectral datasets (TES, OMEGA and CRISM) and thereby enable detailed geologic and stratigraphic interpretations at meter scales. In addition to providing some examples of color images and their interpretation, we describe the processing techniques used to produce them and note some of the minor artifacts in the output. We also provide an example of how HiRISE color products can be effectively used to expand mineral and lithologic mapping provided by CRISM data products that are backed by other spectral datasets. The utility of high quality color data for understanding geologic processes on Mars has been one of the major successes of HiRISE. © 2009 Elsevier Inc.

  9. ChiLin: a comprehensive ChIP-seq and DNase-seq quality control and analysis pipeline.

    PubMed

    Qin, Qian; Mei, Shenglin; Wu, Qiu; Sun, Hanfei; Li, Lewyn; Taing, Len; Chen, Sujun; Li, Fugen; Liu, Tao; Zang, Chongzhi; Xu, Han; Chen, Yiwen; Meyer, Clifford A; Zhang, Yong; Brown, Myles; Long, Henry W; Liu, X Shirley

    2016-10-03

    Transcription factor binding, histone modification, and chromatin accessibility studies are important approaches to understanding the biology of gene regulation. ChIP-seq and DNase-seq have become the standard techniques for studying protein-DNA interactions and chromatin accessibility respectively, and comprehensive quality control (QC) and analysis tools are critical to extracting the most value from these assay types. Although many analysis and QC tools have been reported, few combine ChIP-seq and DNase-seq data analysis and quality control in a unified framework with a comprehensive and unbiased reference of data quality metrics. ChiLin is a computational pipeline that automates the quality control and data analyses of ChIP-seq and DNase-seq data. It is developed using a flexible and modular software framework that can be easily extended and modified. ChiLin is ideal for batch processing of many datasets and is well suited for large collaborative projects involving ChIP-seq and DNase-seq from different designs. ChiLin generates comprehensive quality control reports that include comparisons with historical data derived from over 23,677 public ChIP-seq and DNase-seq samples (11,265 datasets) from eight literature-based classified categories. To the best of our knowledge, this atlas represents the most comprehensive ChIP-seq and DNase-seq related quality metric resource currently available. These historical metrics provide useful heuristic quality references for experiment across all commonly used assay types. Using representative datasets, we demonstrate the versatility of the pipeline by applying it to different assay types of ChIP-seq data. The pipeline software is available open source at https://github.com/cfce/chilin . ChiLin is a scalable and powerful tool to process large batches of ChIP-seq and DNase-seq datasets. The analysis output and quality metrics have been structured into user-friendly directories and reports. We have successfully compiled 23,677 profiles into a comprehensive quality atlas with fine classification for users.

  10. The importance of accurate road data for spatial applications in public health: customizing a road network

    PubMed Central

    Frizzelle, Brian G; Evenson, Kelly R; Rodriguez, Daniel A; Laraia, Barbara A

    2009-01-01

    Background Health researchers have increasingly adopted the use of geographic information systems (GIS) for analyzing environments in which people live and how those environments affect health. One aspect of this research that is often overlooked is the quality and detail of the road data and whether or not it is appropriate for the scale of analysis. Many readily available road datasets, both public domain and commercial, contain positional errors or generalizations that may not be compatible with highly accurate geospatial locations. This study examined the accuracy, completeness, and currency of four readily available public and commercial sources for road data (North Carolina Department of Transportation, StreetMap Pro, TIGER/Line 2000, TIGER/Line 2007) relative to a custom road dataset which we developed and used for comparison. Methods and Results A custom road network dataset was developed to examine associations between health behaviors and the environment among pregnant and postpartum women living in central North Carolina in the United States. Three analytical measures were developed to assess the comparative accuracy and utility of four publicly and commercially available road datasets and the custom dataset in relation to participants' residential locations over three time periods. The exclusion of road segments and positional errors in the four comparison road datasets resulted in between 5.9% and 64.4% of respondents lying farther than 15.24 meters from their nearest road, the distance of the threshold set by the project to facilitate spatial analysis. Agreement, using a Pearson's correlation coefficient, between the customized road dataset and the four comparison road datasets ranged from 0.01 to 0.82. Conclusion This study demonstrates the importance of examining available road datasets and assessing their completeness, accuracy, and currency for their particular study area. This paper serves as an example for assessing the feasibility of readily available commercial or public road datasets, and outlines the steps by which an improved custom dataset for a study area can be developed. PMID:19409088
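
    The positional check described above, distance from each geocoded residence to the nearest road segment against the 15.24 m threshold, can be sketched with GeoPandas; file names and the projected CRS below are assumptions, and both layers must share a metric CRS.

    ```python
    import geopandas as gpd

    # hypothetical inputs reprojected to a metric CRS (UTM zone 17N covers central NC)
    homes = gpd.read_file("participants.gpkg").to_crs(epsg=32617)
    roads = gpd.read_file("roads.gpkg").to_crs(epsg=32617)

    # distance from each residence to the nearest road feature
    joined = gpd.sjoin_nearest(homes, roads, distance_col="dist_m")
    pct_beyond = (joined["dist_m"] > 15.24).mean() * 100
    print(f"{pct_beyond:.1f}% of residences lie farther than 15.24 m from the nearest road")
    ```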

  11. Use of autocorrelation scanning in DNA copy number analysis.

    PubMed

    Zhang, Liangcai; Zhang, Li

    2013-11-01

    Data quality is a critical issue in the analyses of DNA copy number alterations obtained from microarrays. It is commonly assumed that copy number alteration data can be modeled as piecewise constant and the measurement errors of different probes are independent. However, these assumptions do not always hold in practice. In some published datasets, we find that measurement errors are highly correlated between probes that interrogate nearby genomic loci, and the piecewise-constant model does not fit the data well. The correlated errors cause problems in downstream analysis, leading to a large number of DNA segments falsely identified as having copy number gains and losses. We developed a simple tool, called autocorrelation scanning profile, to assess the dependence of measurement error between neighboring probes. Autocorrelation scanning profile can be used to check data quality and refine the analysis of DNA copy number data, which we demonstrate in some typical datasets. lzhangli@mdanderson.org. Supplementary data are available at Bioinformatics online.
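
    A generic re-implementation of the idea (not the authors' exact code) is to slide a window along the probes in genomic order and record the lag-1 autocorrelation of the values after removing the local copy-number level:

    ```python
    import numpy as np

    def autocorr_scan(logratio, window=200, step=50):
        positions, ac = [], []
        for start in range(0, len(logratio) - window, step):
            seg = logratio[start:start + window]
            resid = seg - np.median(seg)                    # remove local copy-number level
            r = np.corrcoef(resid[:-1], resid[1:])[0, 1]    # lag-1 autocorrelation
            positions.append(start + window // 2)
            ac.append(r)
        return np.array(positions), np.array(ac)

    # toy data: independent probe noise vs. correlated (locally smoothed) noise
    rng = np.random.default_rng(7)
    indep = rng.normal(0, 0.3, 5000)
    corr = np.convolve(rng.normal(0, 0.3, 5000), np.ones(5) / 5, mode="same")
    for name, x in [("independent errors", indep), ("correlated errors", corr)]:
        _, ac = autocorr_scan(x)
        print(f"{name}: median lag-1 autocorrelation = {np.median(ac):.2f}")
    ```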

  12. Water quality assessment for groundwater around a municipal waste dumpsite.

    PubMed

    Kayode, Olusola T; Okagbue, Hilary I; Achuka, Justina A

    2018-04-01

    The dataset for this article contains geostatistical analysis of the extent to which groundwater quality around a municipal waste dumpsite located in the Oke-Afa, Oshodi/Isolo area of Lagos state, southwestern Nigeria, has been compromised for drinking. Groundwater samples were collected from eight hand-dug wells and two borehole wells around or near the dumpsite. The pH, turbidity, salinity, conductivity, total hydrocarbon, total dissolved solids (TDS), dissolved oxygen, chloride, sulphate (SO4), nitrate (NO3) and phosphate (PO4) were determined for the water samples and compared with the World Health Organization (WHO) drinking water standard. Notably, the turbidity, TDS, chloride and conductivity of some of the samples were above the WHO acceptable limits. High quantities of heavy metals such as aluminum and barium were also present, as shown in the data. The dataset can provide insights into the health implications of the contaminants, especially where the mean concentration levels of the contaminants are above the recommended WHO drinking water standard.

  13. Factors Influencing Early Adolescents' Mathematics Achievement: High-Quality Teaching Rather than Relationships

    ERIC Educational Resources Information Center

    Winheller, Sandra; Hattie, John A.; Brown, Gavin T. L.

    2013-01-01

    This study used data from the Assessment Tools for Teaching and Learning project, which involved data on the academic performance of more than 90,000 New Zealand students in six subjects (i.e. reading, writing and mathematics in two languages). Two sub-samples of this dataset were included for detailed re-analysis to test the general applicability…

  14. The Effects of Baseline Estimation on the Reliability, Validity, and Precision of CBM-R Growth Estimates

    ERIC Educational Resources Information Center

    Van Norman, Ethan R.; Christ, Theodore J.; Zopluoglu, Cengiz

    2013-01-01

    This study examined the effect of baseline estimation on the quality of trend estimates derived from Curriculum Based Measurement of Oral Reading (CBM-R) progress monitoring data. The authors used a linear mixed effects regression (LMER) model to simulate progress monitoring data for schedules ranging from 6-20 weeks for datasets with high and low…

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Habte, A.; Sengupta, M.; Wilcox, S.

    This report was part of a multiyear collaboration with the University of Wisconsin and the National Oceanic and Atmospheric Administration (NOAA) to produce high-quality, satellite-based solar resource datasets for the United States. High-quality solar resource assessment accelerates technology deployment by making a positive impact on decision making and reducing uncertainty in investment decisions. Satellite-based solar resource datasets are used as a primary source in solar resource assessment. This is mainly because satellites provide larger areal coverage and longer periods of record than ground-based measurements. With the advent of newer satellites with increased information content and faster computers that can process increasingly higher data volumes, methods that were considered too computationally intensive are now feasible. One class of sophisticated methods for retrieving solar resource information from satellites is a two-step, physics-based method that computes cloud properties and uses the information in a radiative transfer model to compute solar radiation. This method has the advantage of adding additional information as satellites with newer channels come on board. This report evaluates the two-step method developed at NOAA and adapted for solar resource assessment for renewable energy with the goal of identifying areas that can be improved in the future.

  16. Global Contrast Based Salient Region Detection.

    PubMed

    Cheng, Ming-Ming; Mitra, Niloy J; Huang, Xiaolei; Torr, Philip H S; Hu, Shi-Min

    2015-03-01

    Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object detection algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut, namely SaliencyCut, for high quality unsupervised salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms 15 existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.
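
    The core of the histogram-based global contrast idea can be sketched in a few lines on a grayscale image: each intensity's saliency is its average distance to all other intensities weighted by their frequency. The published method works on quantised Lab colours with spatial weighting and adds the SaliencyCut segmentation stage, none of which is reproduced here.

    ```python
    import numpy as np

    def global_contrast_saliency(gray):
        # gray: 2-D uint8 array
        hist = np.bincount(gray.ravel(), minlength=256).astype(float)
        prob = hist / hist.sum()
        levels = np.arange(256, dtype=float)
        dist = np.abs(levels[:, None] - levels[None, :])   # intensity distance matrix
        level_saliency = dist @ prob                       # saliency(l) = sum_j p(j)*|l-j|
        sal = level_saliency[gray]
        return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

    # toy image: dark background with a bright square as the "salient object"
    img = np.full((128, 128), 40, np.uint8)
    img[40:90, 40:90] = 200
    sal_map = global_contrast_saliency(img)
    print("object saliency:", sal_map[60, 60], "background saliency:", sal_map[5, 5])
    ```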

  17. Preparing Nursing Home Data from Multiple Sites for Clinical Research – A Case Study Using Observational Health Data Sciences and Informatics

    PubMed Central

    Boyce, Richard D.; Handler, Steven M.; Karp, Jordan F.; Perera, Subashan; Reynolds, Charles F.

    2016-01-01

    Introduction: A potential barrier to nursing home research is the limited availability of research quality data in electronic form. We describe a case study of converting electronic health data from five skilled nursing facilities to a research quality longitudinal dataset by means of open-source tools produced by the Observational Health Data Sciences and Informatics (OHDSI) collaborative. Methods: The Long-Term Care Minimum Data Set (MDS), drug dispensing, and fall incident data from five SNFs were extracted, translated, and loaded into version 4 of the OHDSI common data model. Quality assurance involved identifying errors using the Achilles data characterization tool and comparing both quality measures and drug exposures in the new database for concordance with externally available sources. Findings: Records for a total 4,519 patients (95.1%) made it into the final database. Achilles identified 10 different types of errors that were addressed in the final dataset. Drug exposures based on dispensing were generally accurate when compared with medication administration data from the pharmacy services provider. Quality measures were generally concordant between the new database and Nursing Home Compare for measures with a prevalence ≥ 10%. Fall data recorded in MDS was found to be more complete than data from fall incident reports. Conclusions: The new dataset is ready to support observational research on topics of clinical importance in the nursing home including patient-level prediction of falls. The extraction, translation, and loading process enabled the use of OHDSI data characterization tools that improved the quality of the final dataset. PMID:27891528

  18. A Compilation of Spatial Datasets to Support a Preliminary Assessment of Pesticides and Pesticide Use on Tribal Lands in Oklahoma

    USGS Publications Warehouse

    Mashburn, Shana L.; Winton, Kimberly T.

    2010-01-01

    This CD-ROM contains spatial datasets that describe natural and anthropogenic features and county-level estimates of agricultural pesticide use and pesticide data for surface-water, groundwater, and biological specimens in the state of Oklahoma. County-level estimates of pesticide use were compiled from the Pesticide National Synthesis Project of the U.S. Geological Survey, National Water-Quality Assessment Program. Pesticide data for surface water, groundwater, and biological specimens were compiled from the U.S. Geological Survey National Water Information System database. These spatial datasets that describe natural and manmade features were compiled from several agencies and contain information collected by the U.S. Geological Survey. The U.S. Geological Survey datasets were not collected specifically for this compilation, but were previously collected for projects with various objectives. The spatial datasets were created by different agencies from sources with varied quality. As a result, features common to multiple layers may not overlay exactly. Users should check the metadata to determine proper use of these spatial datasets. These data were not checked for accuracy or completeness. If a question of accuracy or completeness arises, the user should contact the originator cited in the metadata.

  19. A Simple Sampling Method for Estimating the Accuracy of Large Scale Record Linkage Projects.

    PubMed

    Boyd, James H; Guiver, Tenniel; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Anderson, Phil; Dickinson, Teresa

    2016-05-17

    Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (the proportion of accepted links that are correct), given the difficulty of measuring the proportion of false negatives. The aim of this work is to introduce and evaluate a sampling based method to estimate both precision and recall following record linkage. In the sampling based method, record-pairs from each threshold (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. These results are then applied to the entire set of record-pairs, providing estimates of false positives and false negatives. This method was evaluated on a synthetically generated dataset, where the true match status (which records belonged to the same person) was known. The sampled estimates of linkage quality were relatively close to the actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures for seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using the Fleiss Kappa statistic was 0.601). This method offers a possible means of accurately estimating matching quality and refining linkages in population level linkage studies. The sampling approach is especially important for large project linkages where the number of record pairs produced may be very large, often running into millions.
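
    A minimal sketch of the estimation step: sample and clerically review record pairs from each score stratum (including those below the acceptance cut-off), then weight the review results back up to the full set of pairs to estimate false positives and false negatives. The input tables and column names are hypothetical, and 'accepted' is assumed to be a boolean column.

    ```python
    import pandas as pd

    # hypothetical inputs: every candidate pair with its score stratum and acceptance
    # decision, plus a clerically reviewed sample with the true match status
    pairs = pd.read_csv("linked_pairs_scored.csv")      # score_stratum, accepted (bool)
    review = pd.read_csv("clerical_review_sample.csv")  # score_stratum, accepted, is_true_match

    counts = pairs.groupby(["score_stratum", "accepted"]).size().rename("n_pairs")
    p_match = review.groupby(["score_stratum", "accepted"])["is_true_match"].mean().rename("p_match")
    est = pd.concat([counts, p_match], axis=1).dropna().reset_index()
    est["true_matches"] = est.n_pairs * est.p_match     # estimated true matches per stratum

    acc, rej = est[est.accepted], est[~est.accepted]
    tp = acc.true_matches.sum()
    fp = acc.n_pairs.sum() - tp          # accepted pairs estimated to be non-matches
    fn = rej.true_matches.sum()          # rejected pairs estimated to be true matches
    print("precision ~", tp / (tp + fp), "  recall ~", tp / (tp + fn))
    ```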

  20. The Importance of Context: Risk-based De-identification of Biomedical Data.

    PubMed

    Prasser, Fabian; Kohlmayer, Florian; Kuhn, Klaus A

    2016-08-05

    Data sharing is a central aspect of modern biomedical research. It is accompanied by significant privacy concerns and often data needs to be protected from re-identification. With methods of de-identification datasets can be transformed in such a way that it becomes extremely difficult to link their records to identified individuals. The most important challenge in this process is to find an adequate balance between an increase in privacy and a decrease in data quality. Accurately measuring the risk of re-identification in a specific data sharing scenario is an important aspect of data de-identification. Overestimation of risks will significantly deteriorate data quality, while underestimation will leave data prone to attacks on privacy. Several models have been proposed for measuring risks, but there is a lack of generic methods for risk-based data de-identification. The aim of the work described in this article was to bridge this gap and to show how the quality of de-identified datasets can be improved by using risk models to tailor the process of de-identification to a concrete context. We implemented a generic de-identification process and several models for measuring re-identification risks into the ARX de-identification tool for biomedical data. By integrating the methods into an existing framework, we were able to automatically transform datasets in such a way that information loss is minimized while it is ensured that re-identification risks meet a user-defined threshold. We performed an extensive experimental evaluation to analyze the impact of using different risk models and assumptions about the goals and the background knowledge of an attacker on the quality of de-identified data. The results of our experiments show that data quality can be improved significantly by using risk models for data de-identification. On a scale where 100 % represents the original input dataset and 0 % represents a dataset from which all information has been removed, the loss of information content could be reduced by up to 10 % when protecting datasets against strong adversaries and by up to 24 % when protecting datasets against weaker adversaries. The methods studied in this article are well suited for protecting sensitive biomedical data and our implementation is available as open-source software. Our results can be used by data custodians to increase the information content of de-identified data by tailoring the process to a specific data sharing scenario. Improving data quality is important for fostering the adoption of de-identification methods in biomedical research.
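
    A minimal conceptual sketch of risk-based de-identification is given below; it is not the ARX API. It assumes a prosecutor-style risk measure (the reciprocal of the equivalence-class size over the quasi-identifiers) and a user-supplied generalization hierarchy, and it coarsens the data until the maximum risk meets a user-defined threshold.

```python
# Conceptual sketch (not the ARX API): prosecutor-style re-identification risk is
# 1 / (size of the equivalence class formed by the quasi-identifiers). Records are
# generalized step by step until the maximum risk meets a threshold.
from collections import Counter

def max_risk(records, quasi_ids):
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return max(1.0 / size for size in classes.values())

def deidentify(records, quasi_ids, generalize, threshold=0.1):
    """generalize(records, level) -> coarsened copy; the hierarchy of levels is an
    assumption and is expected to eventually suppress the quasi-identifiers."""
    level = 0
    data = records
    while max_risk(data, quasi_ids) > threshold:
        level += 1
        data = generalize(records, level)
    return data, level
```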

  1. Discovering New Global Climate Patterns: Curating a 21-Year High Temporal (Hourly) and Spatial (40km) Resolution Reanalysis Dataset

    NASA Astrophysics Data System (ADS)

    Hou, C. Y.; Dattore, R.; Peng, G. S.

    2014-12-01

    The National Center for Atmospheric Research's Global Climate Four-Dimensional Data Assimilation (CFDDA) Hourly 40km Reanalysis dataset is a dynamically downscaled dataset with high temporal and spatial resolution. The dataset contains three-dimensional hourly analyses in netCDF format for the global atmospheric state from 1985 to 2005 on a 40km horizontal grid (0.4° grid increment) with 28 vertical levels, providing good representation of local forcing and diurnal variation of processes in the planetary boundary layer. This project aimed to make the dataset publicly available, accessible, and usable in order to provide a unique resource to allow and promote studies of new climate characteristics. When the curation project started, it had been five years since the data files were generated. Also, although the Principal Investigator (PI) had generated a user document at the end of the project in 2009, the document had not been maintained. Furthermore, the PI had moved to a new institution, and the remaining team members were reassigned to other projects. These factors made data curation especially challenging in the areas of verifying data quality, harvesting metadata descriptions, and documenting provenance information. As a result, the project's curation process found that: the data curator's skill and knowledge helped inform decisions, such as on file format, structure, and workflow documentation, that had a significant, positive impact on the ease of the dataset's management and long-term preservation; use of data curation tools, such as the Data Curation Profiles Toolkit's guidelines, revealed important information for promoting the data's usability and enhancing preservation planning; and involving data curators during each stage of the data curation life cycle, instead of only at the end, could improve the efficiency of the curation process. Overall, the project showed that proper resources invested in the curation process would give datasets the best chance to fulfill their potential to help with new climate pattern discovery.

  2. A comparison of U.S. geological survey seamless elevation models with shuttle radar topography mission data

    USGS Publications Warehouse

    Gesch, D.; Williams, J.; Miller, W.

    2001-01-01

    Elevation models produced from Shuttle Radar Topography Mission (SRTM) data will be the most comprehensive, consistently processed, highest resolution topographic dataset ever produced for the Earth's land surface. Many applications that currently use elevation data will benefit from the increased availability of data with higher accuracy, quality, and resolution, especially in poorly mapped areas of the globe. SRTM data will be produced as seamless data, thereby avoiding many of the problems inherent in existing multi-source topographic databases. Serving as precursors to SRTM datasets, the U.S. Geological Survey (USGS) has produced and is distributing seamless elevation datasets that facilitate scientific use of elevation data over large areas. GTOPO30 is a global elevation model with a 30 arc-second resolution (approximately 1-kilometer). The National Elevation Dataset (NED) covers the United States at a resolution of 1 arc-second (approximately 30-meters). Due to their seamless format and broad area coverage, both GTOPO30 and NED represent an advance in the usability of elevation data, but each still includes artifacts from the highly variable source data used to produce them. The consistent source data and processing approach for SRTM data will result in elevation products that will be a significant addition to the current availability of seamless datasets, specifically for many areas outside the U.S. One application that demonstrates some advantages that may be realized with SRTM data is delineation of land surface drainage features (watersheds and stream channels). Seamless distribution of elevation data in which a user interactively specifies the area of interest and order parameters via a map server is already being successfully demonstrated with existing USGS datasets. Such an approach for distributing SRTM data is ideal for a dataset that undoubtedly will be of very high interest to the spatial data user community.

  3. Enhancing e-waste estimates: Improving data quality by multivariate Input–Output Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang, Feng, E-mail: fwang@unu.edu; Design for Sustainability Lab, Faculty of Industrial Design Engineering, Delft University of Technology, Landbergstraat 15, 2628CE Delft; Huisman, Jaco

    2013-11-15

    Highlights: • A multivariate Input–Output Analysis method for e-waste estimates is proposed. • Applying multivariate analysis to consolidate data can enhance e-waste estimates. • We examine the influence of model selection and data quality on e-waste estimates. • Datasets of all e-waste-related variables in a Dutch case study have been provided. • Accurate modeling of time-variant lifespan distributions is critical for estimation. - Abstract: Waste electrical and electronic equipment (or e-waste) is one of the fastest growing waste streams, which encompasses a wide and increasing spectrum of products. Accurate estimation of e-waste generation is difficult, mainly due to lack of high-quality data on market and socio-economic dynamics. This paper addresses how to enhance e-waste estimates by providing techniques to increase data quality. An advanced, flexible and multivariate Input–Output Analysis (IOA) method is proposed. It links all three pillars in IOA (product sales, stock and lifespan profiles) to construct mathematical relationships between various data points. By applying this method, the data consolidation steps can generate more accurate time-series datasets from the available data pool. This can consequently increase the reliability of e-waste estimates compared to the approach without data processing. A case study in the Netherlands is used to apply the advanced IOA model. As a result, for the first time ever, complete datasets of all three variables for estimating all types of e-waste have been obtained. The result of this study also demonstrates significant disparity between various estimation models, arising from the use of data under different conditions. It shows the importance of applying a multivariate approach and multiple sources to improve data quality for modelling, specifically using appropriate time-varying lifespan parameters. Following the case study, a roadmap with a procedural guideline is provided to enhance e-waste estimation studies.
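
    The sketch below illustrates, under assumed parameters, the sales/lifespan relationship at the heart of such Input-Output Analysis: e-waste generated in a given year is past sales convolved with a lifespan (discard) distribution, here a Weibull with illustrative shape and scale.

```python
# Illustrative IOA-style sketch: e-waste generated in year t is past sales
# convolved with a lifespan distribution. Weibull parameters are assumptions.
import math

def weibull_discard_prob(age, shape=2.0, scale=8.0):
    """Probability that a unit put on the market is discarded at integer age (years)."""
    cdf = lambda a: 1 - math.exp(-((a / scale) ** shape))
    return cdf(age + 1) - cdf(age)

def ewaste_generated(sales_by_year, year):
    return sum(units * weibull_discard_prob(year - sold)
               for sold, units in sales_by_year.items() if sold <= year)

# Example: hypothetical units put on market per year -> units reaching end of life in 2013.
sales = {2005: 120000, 2006: 150000, 2007: 180000, 2008: 200000}
print(round(ewaste_generated(sales, 2013)))
```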

  4. Building a gold standard to construct search filters: a case study with biomarkers for oral cancer.

    PubMed

    Frazier, John J; Stein, Corey D; Tseytlin, Eugene; Bekhuis, Tanja

    2015-01-01

    To support clinical researchers, librarians and informationists may need search filters for particular tasks. Development of filters typically depends on a "gold standard" dataset. This paper describes generalizable methods for creating a gold standard to support future filter development and evaluation using oral squamous cell carcinoma (OSCC) as a case study. OSCC is the most common malignancy affecting the oral cavity. Investigation of biomarkers with potential prognostic utility is an active area of research in OSCC. The methods discussed here should be useful for designing quality search filters in similar domains. The authors searched MEDLINE for prognostic studies of OSCC, developed annotation guidelines for screeners, ran three calibration trials before annotating the remaining body of citations, and measured inter-annotator agreement (IAA). We retrieved 1,818 citations. After calibration, we screened the remaining citations (n = 1,767; 97.2%); IAA was substantial (kappa = 0.76). The dataset has 497 (27.3%) citations representing OSCC studies of potential prognostic biomarkers. The gold standard dataset is likely to be high quality and useful for future development and evaluation of filters for OSCC studies of potential prognostic biomarkers. The methodology we used is generalizable to other domains requiring a reference standard to evaluate the performance of search filters. A gold standard is essential because the labels regarding relevance enable computation of diagnostic metrics, such as sensitivity and specificity. Librarians and informationists with data analysis skills could contribute to developing gold standard datasets and subsequent filters tuned for their patrons' domains of interest.

  5. Estimating flow-duration and low-flow frequency statistics for unregulated streams in Oregon.

    DOT National Transportation Integrated Search

    2008-08-01

    Flow statistical datasets, basin-characteristic datasets, and regression equations were developed to provide decision makers with surface-water information needed for activities such as water-quality regulation, water-rights adjudication, biological ...

  6. ES-doc-errata: an issue tracker platform for CMIP6

    NASA Astrophysics Data System (ADS)

    Ben Nasser, Atef; Levavasseur, Guillaume; Greenslade, Mark; Denvil, Sébastien

    2017-04-01

    In the context of overseeing the quality of data, and as a result of the inherent complexity of projects such as CMIP5/6, it is a mandatory task to keep track of the status of datasets and the version evolution they sustain in their life-cycle. The ES-doc-errata project aims to keep track of the issues affecting specific versions of datasets/files. It enables users to resolve the history tree of each dataset/file, enabling a better choice of the data used in their work based on the data status. The ES-doc-errata project has been designed and built on top of the Parent-IDentifiers handle service that will be deployed in the next iteration of the CMIP project, ensuring maximum usability of the ESGF ecosystem, and is encapsulated in the ES-doc structure. Consuming PIDs from the handle service is guided by a specifically built algorithm that extracts metadata regarding the issues that may or may not affect the quality of datasets/files and cause a newer version to be published, replacing older deprecated versions. This algorithm is able to deduce the nature of the flaws down to the file granularity, which is of high value to the end-user. This new platform has been designed keeping in mind usability by end-users specialized in the data publishing process, and by other scientists requiring feedback on the reliability of data needed for their work. To this end, a specific set of rules and a code of conduct has been defined. A validation process ensures the quality of this newly introduced errata metadata, an authentication safeguard was implemented to prevent tampering with the archived data, and a wide variety of tools were put at users' disposal to interact safely with the platform, including a command-line client and a dedicated front-end.

  7. Assessing Subjectivity in Sensor Data Post Processing via a Controlled Experiment

    NASA Astrophysics Data System (ADS)

    Jones, A. S.; Horsburgh, J. S.; Eiriksson, D.

    2017-12-01

    Environmental data collected by in situ sensors must be reviewed to verify validity, and conducting quality control often requires making edits in post processing to generate approved datasets. This process involves decisions by technicians, data managers, or data users on how to handle problematic data. Options include: removing data from a series, retaining data with annotations, and altering data based on algorithms related to adjacent data points or the patterns of data at other locations or of other variables. Ideally, given the same dataset and the same quality control guidelines, multiple data quality control technicians would make the same decisions in data post processing. However, despite the development and implementation of guidelines aimed to ensure consistent quality control procedures, we have faced ambiguity when performing post processing, and we have noticed inconsistencies in the practices of individuals performing quality control post processing. Technicians with the same level of training and using the same input datasets may produce different results, affecting the overall quality and comparability of finished data products. Different results may also be produced by technicians that do not have the same level of training. In order to assess the effect of subjective decision making by the individual technician on the end data product, we designed an experiment where multiple users performed quality control post processing on the same datasets using a consistent set of guidelines, field notes, and tools. We also assessed the effect of technician experience and training by conducting the same procedures with a group of novices unfamiliar with the data and the quality control process and compared their results to those generated by a group of more experienced technicians. In this presentation, we report our observations of the degree of subjectivity in sensor data post processing, assessing and quantifying the impacts of individual technician as well as technician experience on quality controlled data products.

  8. High-resolution daily gridded datasets of air temperature and wind speed for Europe

    NASA Astrophysics Data System (ADS)

    Brinckmann, S.; Krähenmann, S.; Bissolli, P.

    2015-08-01

    New high-resolution datasets for near-surface daily air temperature (minimum, maximum and mean) and daily mean wind speed for Europe (the CORDEX domain) are provided for the period 2001-2010 for the purpose of regional model validation in the framework of DecReg, a sub-project of the German MiKlip project, which aims to develop decadal climate predictions. The main input data sources are hourly SYNOP observations, partly supplemented by station data from the ECA&D dataset (http://www.ecad.eu). These data are quality tested to eliminate erroneous data and various kinds of inhomogeneities. Grids at a resolution of 0.044° (5 km) are derived by spatial interpolation of these station data across the CORDEX area. For temperature interpolation a modified version of a regression kriging method developed by Krähenmann et al. (2011) is used. At first, predictor fields of altitude, continentality and zonal mean temperature are chosen for a regression applied to monthly station data. The residuals of the monthly regression and the deviations of the daily data from the monthly averages are interpolated using simple kriging in a second and third step. For wind speed, a new method based on the concept used for temperature was developed, involving predictor fields of exposure, roughness length, coastal distance and ERA Interim reanalysis wind speed at 850 hPa. Interpolation uncertainty is estimated by means of the kriging variance and regression uncertainties. Furthermore, to assess the quality of the final daily grid data, cross validation is performed. Explained variance ranges from 70 to 90 % for monthly temperature and from 50 to 60 % for monthly wind speed. The resulting RMSE for the final daily grid data amounts to 1-2 °C for the daily temperature parameters and 1-1.5 m s-1 for daily mean wind speed, depending on season and parameter. The datasets presented in this article are published at http://dx.doi.org/10.5676/DWD_CDC/DECREG0110v1.
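
    A minimal regression-kriging sketch in the spirit of the interpolation described above is shown below; the predictor handling, the exponential covariance model, and the parameter values are illustrative assumptions, not the published configuration.

```python
# Minimal regression-kriging sketch: a linear regression on predictor fields
# explains the large-scale signal, and station residuals are interpolated by
# simple kriging. Covariance parameters are illustrative assumptions.
import numpy as np

def regression_kriging(X_sta, y_sta, coords_sta, X_grid, coords_grid,
                       sill=1.0, rng=200.0, nugget=0.05):
    # 1) Regression step (ordinary least squares with an intercept).
    A = np.column_stack([np.ones(len(X_sta)), X_sta])
    beta, *_ = np.linalg.lstsq(A, y_sta, rcond=None)
    resid = y_sta - A @ beta

    # 2) Simple kriging of the residuals with an exponential covariance model.
    def cov(d):
        return (sill - nugget) * np.exp(-d / rng)

    d_ss = np.linalg.norm(coords_sta[:, None] - coords_sta[None, :], axis=2)
    d_gs = np.linalg.norm(coords_grid[:, None] - coords_sta[None, :], axis=2)
    C = cov(d_ss) + nugget * np.eye(len(coords_sta))
    weights = np.linalg.solve(C, cov(d_gs).T)          # shape: (n_sta, n_grid)
    resid_grid = weights.T @ resid

    # 3) Combine the regression trend on the grid with the kriged residuals.
    trend_grid = np.column_stack([np.ones(len(X_grid)), X_grid]) @ beta
    return trend_grid + resid_grid
```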

  9. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay

    PubMed Central

    Bray, Mark-Anthony; Gustafsdottir, Sigrun M; Rohban, Mohammad H; Singh, Shantanu; Ljosa, Vebjorn; Sokolnicki, Katherine L; Bittker, Joshua A; Bodycombe, Nicole E; Dančík, Vlado; Hasaka, Thomas P; Hon, Cindy S; Kemp, Melissa M; Li, Kejie; Walpita, Deepika; Wawer, Mathias J; Golub, Todd R; Schreiber, Stuart L; Clemons, Paul A; Shamji, Alykhan F

    2017-01-01

    Abstract Background Large-scale image sets acquired by automated microscopy of perturbed samples enable a detailed comparison of cell states induced by each perturbation, such as a small molecule from a diverse library. Highly multiplexed measurements of cellular morphology can be extracted from each image and subsequently mined for a number of applications. Findings This microscopy dataset includes 919 265 five-channel fields of view, representing 30 616 tested compounds, available at “The Cell Image Library” (CIL) repository. It also includes data files containing morphological features derived from each cell in each image, both at the single-cell level and population-averaged (i.e., per-well) level; the image analysis workflows that generated the morphological features are also provided. Quality-control metrics are provided as metadata, indicating fields of view that are out-of-focus or containing highly fluorescent material or debris. Lastly, chemical annotations are supplied for the compound treatments applied. Conclusions Because computational algorithms and methods for handling single-cell morphological measurements are not yet routine, the dataset serves as a useful resource for the wider scientific community applying morphological (image-based) profiling. The dataset can be mined for many purposes, including small-molecule library enrichment and chemical mechanism-of-action studies, such as target identification. Integration with genetically perturbed datasets could enable identification of small-molecule mimetics of particular disease- or gene-related phenotypes that could be useful as probes or potential starting points for development of future therapeutics. PMID:28327978

  10. Calibrating a numerical model's morphology using high-resolution spatial and temporal datasets from multithread channel flume experiments.

    NASA Astrophysics Data System (ADS)

    Javernick, L.; Bertoldi, W.; Redolfi, M.

    2017-12-01

    Accessing or acquiring high-quality, low-cost topographic data has never been easier due to recent developments in the photogrammetric technique of Structure-from-Motion (SfM). Researchers can acquire the necessary SfM imagery with various platforms, with the ability to capture millimetre resolution and accuracy, or large-scale areas with the help of unmanned platforms. Such datasets in combination with numerical modelling have opened up new opportunities to study the physical and ecological relationships of river environments. While a numerical model's overall predictive accuracy is most influenced by topography, proper model calibration requires hydraulic and morphological data; however, rich hydraulic and morphological datasets remain scarce. This lack of field and laboratory data has limited model advancement through the inability to properly calibrate, assess sensitivity, and validate model performance. However, new time-lapse imagery techniques have shown success in identifying instantaneous sediment transport in flume experiments and in improving hydraulic model calibration. With new capabilities to capture high-resolution spatial and temporal datasets of flume experiments, there is a need to further assess model performance. To address this demand, this research used braided river flume experiments and captured time-lapse observations of sediment transport and repeat SfM elevation surveys to provide unprecedented spatial and temporal datasets. Through newly created metrics that quantified observed and modeled activation, deactivation, and bank erosion rates, the numerical model Delft3d was calibrated. These enhanced temporal data, combining high-resolution time series with long-term coverage, significantly improved the calibration routines and refined the calibration parameterization. Model results show that there is a trade-off between achieving quantitative statistical and qualitative morphological representations. Specifically, simulations tuned for statistical agreement struggled to represent braided planforms (evolving toward meandering), and parameterization that ensured braiding produced exaggerated activation and bank erosion rates. Marie Sklodowska-Curie Individual Fellowship: River-HMV, 656917

  11. In-situ databases and comparison of ESA Ocean Colour Climate Change Initiative (OC-CCI) products with precursor data, towards an integrated approach for ocean colour validation and climate studies

    NASA Astrophysics Data System (ADS)

    Brotas, Vanda; Valente, André; Couto, André B.; Grant, Mike; Chuprin, Andrei; Jackson, Thomas; Groom, Steve; Sathyendranath, Shubha

    2014-05-01

    Ocean colour (OC) is an Oceanic Essential Climate Variable, which is used by climate modellers and researchers. The European Space Agency (ESA) Climate Change Initiative (CCI) project is ESA's response to the need for climate-quality satellite data, with the goal of providing stable, long-term, satellite-based ECV data products. The ESA Ocean Colour CCI focuses on the production of the Ocean Colour ECV, using remote sensing reflectances to derive inherent optical properties and chlorophyll-a concentration from ESA's MERIS (2002-2012) and NASA's SeaWiFS (1997-2010) and MODIS (2002-2012) sensor archives. This work presents an integrated approach by setting up a global database of in situ measurements and by inter-comparing OC-CCI products with precursor datasets. The availability of in situ databases is fundamental for the validation of satellite-derived ocean colour products. A global in situ database was assembled from several pre-existing datasets, with data spanning 1997 to 2012. It includes in-situ measurements of remote sensing reflectances, concentration of chlorophyll-a, inherent optical properties and diffuse attenuation coefficient. The database is composed of observations from the following datasets: NOMAD, SeaBASS, MERMAID, AERONET-OC, BOUSSOLE and HOTS. The result was a merged dataset tuned for the validation of satellite-derived ocean colour products. This was an attempt to gather, homogenize and merge a large body of high-quality bio-optical marine in situ data, as using all datasets in a single validation exercise increases the number of matchups and enhances the representativeness of different marine regimes. An inter-comparison analysis between the OC-CCI chlorophyll-a product and satellite precursor datasets was carried out with single-mission and merged single-mission products. Single-mission datasets considered were SeaWiFS, MODIS-Aqua and MERIS; merged-mission datasets were obtained from GlobColour (GC) as well as the Making Earth Science Data Records for Use in Research Environments (MEaSUREs) project. The OC-CCI product was found to be most similar to the SeaWiFS record and, in general, more similar to records derived from single-mission than from merged-mission initiatives. Results suggest that the CCI product is a more consistent dataset than other available merged-mission products. In conclusion, climate-related science requires long-term data records to provide robust results; the OC-CCI product proves to be a worthy data record for climate research, as it combines multi-sensor OC observations to provide a >15-year, global, error-characterized record.

  12. Sensitivity Gains, Linearity, and Spectral Reproducibility in Nonuniformly Sampled Multidimensional MAS NMR Spectra of High Dynamic Range.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Suiter, Christopher L.; Paramasivam, Sivakumar; Hou, Guangjin

    Recently, we have demonstrated that considerable inherent sensitivity gains are attained in MAS NMR spectra acquired by nonuniform sampling (NUS) and introduced maximum entropy interpolation (MINT) processing that assures the linearity of transformation between the time and frequency domains. In this report, we examine the utility of the NUS/MINT approach in multidimensional datasets possessing high dynamic range, such as homonuclear 13C–13C correlation spectra. We demonstrate on model compounds and on 1–73-(U-13C,15N)/74–108-(U-15N) E. coli thioredoxin reassembly, that with appropriately constructed 50 % NUS schedules inherent sensitivity gains of 1.7–2.1-fold are readily reached in such datasets. We show that both linearity and line width are retained under these experimental conditions throughout the entire dynamic range of the signals. Furthermore, we demonstrate that the reproducibility of the peak intensities is excellent in the NUS/MINT approach when experiments are repeated multiple times and identical experimental and processing conditions are employed. Finally, we discuss the principles for design and implementation of random exponentially biased NUS sampling schedules for homonuclear 13C–13C MAS correlation experiments that yield high-quality artifact-free datasets.

  13. Sensitivity gains, linearity, and spectral reproducibility in nonuniformly sampled multidimensional MAS NMR spectra of high dynamic range

    PubMed Central

    Suiter, Christopher L.; Paramasivam, Sivakumar; Hou, Guangjin; Sun, Shangjin; Rice, David; Hoch, Jeffrey C.; Rovnyak, David

    2014-01-01

    Recently, we have demonstrated that considerable inherent sensitivity gains are attained in MAS NMR spectra acquired by nonuniform sampling (NUS) and introduced maximum entropy interpolation (MINT) processing that assures the linearity of transformation between the time and frequency domains. In this report, we examine the utility of the NUS/MINT approach in multidimensional datasets possessing high dynamic range, such as homonuclear 13C–13C correlation spectra. We demonstrate on model compounds and on 1–73-(U-13C, 15N)/74–108-(U-15N) E. coli thioredoxin reassembly, that with appropriately constructed 50 % NUS schedules inherent sensitivity gains of 1.7–2.1-fold are readily reached in such datasets. We show that both linearity and line width are retained under these experimental conditions throughout the entire dynamic range of the signals. Furthermore, we demonstrate that the reproducibility of the peak intensities is excellent in the NUS/MINT approach when experiments are repeated multiple times and identical experimental and processing conditions are employed. Finally, we discuss the principles for design and implementation of random exponentially biased NUS sampling schedules for homonuclear 13C–13C MAS correlation experiments that yield high-quality artifact-free datasets. PMID:24752819

  14. Quantifying Urban Watershed Stressor Gradients and Evaluating How Different Land Cover Datasets Affect Stream Management.

    PubMed

    Smucker, Nathan J; Kuhn, Anne; Charpentier, Michael A; Cruz-Quinones, Carlos J; Elonen, Colleen M; Whorley, Sarah B; Jicha, Terri M; Serbst, Jonathan R; Hill, Brian H; Wehr, John D

    2016-03-01

    Watershed management and policies affecting downstream ecosystems benefit from identifying relationships between land cover and water quality. However, different data sources can create dissimilarities in land cover estimates and models that characterize ecosystem responses. We used a spatially balanced stream study (1) to effectively sample development and urban stressor gradients while representing the extent of a large coastal watershed (>4400 km2), (2) to document differences between estimates of watershed land cover using the 30-m resolution National Land Cover Database (NLCD) and <1-m resolution land cover data, and (3) to determine if predictive models and relationships between water quality and land cover differed when using these two land cover datasets. Increased concentrations of nutrients, anions, and cations had similarly significant correlations with increased watershed percent impervious cover (IC), regardless of data resolution. The NLCD underestimated percent forest for 71/76 sites by a mean of 11 % and overestimated percent wetlands for 71/76 sites by a mean of 8 %. The NLCD almost always underestimated IC at low development intensities and overestimated IC at high development intensities. As a result of underestimated IC, regression models using NLCD data predicted mean background concentrations of NO3- and Cl- that were 475 and 177 %, respectively, of those predicted when using finer resolution land cover data. Our sampling design could help states and other agencies seeking to create monitoring programs and indicators responsive to anthropogenic impacts. Differences between land cover datasets could affect resource protection due to misguided management targets, watershed development and conservation practices, or water quality criteria.

  15. Big Data solution for CTBT monitoring: CEA-IDC joint global cross correlation project

    NASA Astrophysics Data System (ADS)

    Bobrov, Dmitry; Bell, Randy; Brachet, Nicolas; Gaillard, Pierre; Kitov, Ivan; Rozhkov, Mikhail

    2014-05-01

    Waveform cross-correlation when applied to historical datasets of seismic records provides dramatic improvements in detection, location, and magnitude estimation of natural and manmade seismic events. With correlation techniques, the amplitude threshold of signal detection can be reduced globally by a factor of 2 to 3 relative to currently standard beamforming and STA/LTA detector. The gain in sensitivity corresponds to a body wave magnitude reduction by 0.3 to 0.4 units and doubles the number of events meeting high quality requirements (e.g. detected by three and more seismic stations of the International Monitoring System (IMS). This gain is crucial for seismic monitoring under the Comprehensive Nuclear-Test-Ban Treaty. The International Data Centre (IDC) dataset includes more than 450,000 seismic events, tens of millions of raw detections and continuous seismic data from the primary IMS stations since 2000. This high-quality dataset is a natural candidate for an extensive cross correlation study and the basis of further enhancements in monitoring capabilities. Without this historical dataset recorded by the permanent IMS Seismic Network any improvements would not be feasible. However, due to the mismatch between the volume of data and the performance of the standard Information Technology infrastructure, it becomes impossible to process all the data within tolerable elapsed time. To tackle this problem known as "BigData", the CEA/DASE is part of the French project "DataScale". One objective is to reanalyze 10 years of waveform data from the IMS network with the cross-correlation technique thanks to a dedicated High Performance Computer (HPC) infrastructure operated by the Centre de Calcul Recherche et Technologie (CCRT) at the CEA of Bruyères-le-Châtel. Within 2 years we are planning to enhance detection and phase association algorithms (also using machine learning and automatic classification) and process about 30 terabytes of data provided by the IDC to update the world seismicity map. From the new events and those in the IDC Reviewed Event Bulletin, we will automatically create various sets of master event templates that will be used for the event location globally by the CTBTO and CEA.
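
    The sketch below shows the core of template matching by normalized cross-correlation as described above: a master-event waveform is slid along continuous data and a detection is declared wherever the correlation coefficient exceeds a threshold. The arrays and the 0.7 threshold are illustrative assumptions, not the IDC's operational configuration.

```python
# Hedged sketch of a cross-correlation detector: slide a master-event template
# along a continuous trace and flag windows whose correlation coefficient
# exceeds a threshold. Inputs and threshold are illustrative.
import numpy as np

def correlation_detector(trace, template, threshold=0.7):
    n = len(template)
    t = (template - template.mean()) / (template.std() * n)  # assumes non-constant template
    detections = []
    for i in range(len(trace) - n + 1):
        win = trace[i:i + n]
        std = win.std()
        if std == 0:
            continue
        cc = float(np.dot(t, (win - win.mean()) / std))  # Pearson correlation coefficient
        if cc >= threshold:
            detections.append((i, cc))
    return detections
```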

  16. HC StratoMineR: A Web-Based Tool for the Rapid Analysis of High-Content Datasets.

    PubMed

    Omta, Wienand A; van Heesbeen, Roy G; Pagliero, Romina J; van der Velden, Lieke M; Lelieveld, Daphne; Nellen, Mehdi; Kramer, Maik; Yeong, Marley; Saeidi, Amir M; Medema, Rene H; Spruit, Marco; Brinkkemper, Sjaak; Klumperman, Judith; Egan, David A

    2016-10-01

    High-content screening (HCS) can generate large multidimensional datasets, and when aligned with the appropriate data mining tools, it can yield valuable insights into the mechanism of action of bioactive molecules. However, easy-to-use data mining tools are not widely available, with the result that these datasets are frequently underutilized. Here, we present HC StratoMineR, a web-based tool for high-content data analysis. It is a decision-supportive platform that guides even non-expert users through a high-content data analysis workflow. HC StratoMineR is built using MySQL for storage and querying, PHP as the main programming language, and jQuery for additional user interface functionality. R is used for statistical calculations, logic and data visualizations. Furthermore, C++ and graphics processing unit (GPU) power are embedded throughout R via the Rcpp and rpud libraries for operations that are computationally highly intensive. We show that we can use HC StratoMineR for the analysis of multivariate data from a high-content siRNA knock-down screen and a small-molecule screen. It can be used to rapidly filter out undesirable data; to select relevant data; and to perform quality control, data reduction, data exploration, morphological hit picking, and data clustering. Our results demonstrate that HC StratoMineR can be used to functionally categorize HCS hits and, thus, provide valuable information for hit prioritization.

  17. A high resolution spatial population database of Somalia for disease risk mapping.

    PubMed

    Linard, Catherine; Alegana, Victor A; Noor, Abdisalan M; Snow, Robert W; Tatem, Andrew J

    2010-09-14

    Millions of Somali have been deprived of basic health services due to the unstable political situation of their country. Attempts are being made to reconstruct the health sector, in particular to estimate the extent of infectious disease burden. However, any approach that requires the use of modelled disease rates requires reasonable information on population distribution. In a low-income country such as Somalia, population data are lacking, are of poor quality, or become outdated rapidly. Modelling methods are therefore needed for the production of contemporary and spatially detailed population data. Here land cover information derived from satellite imagery and existing settlement point datasets were used for the spatial reallocation of populations within census units. We used simple and semi-automated methods that can be implemented with free image processing software to produce an easily updatable gridded population dataset at 100 × 100 meters spatial resolution. The 2010 population dataset was matched to administrative population totals projected by the UN. Comparison tests between the new dataset and existing population datasets revealed important differences in population size distributions, and in population at risk of malaria estimates. These differences are particularly important in more densely populated areas and strongly depend on the settlement data used in the modelling approach. The results show that it is possible to produce detailed, contemporary and easily updatable settlement and population distribution datasets of Somalia using existing data. The 2010 population dataset produced is freely available as a product of the AfriPop Project and can be downloaded from: http://www.afripop.org.
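
    The sketch below illustrates the reallocation step described above under assumed land-cover weights: each census unit's population total is redistributed to its grid cells in proportion to the weight assigned to each cell's land-cover class.

```python
# Illustrative dasymetric reallocation sketch: redistribute census-unit totals
# to grid cells in proportion to assumed land-cover class weights.
def reallocate_population(cells, unit_totals, class_weights):
    """cells: list of dicts with 'unit' and 'landcover' keys; returns per-cell counts."""
    unit_weight_sums = {}
    for c in cells:
        w = class_weights.get(c["landcover"], 0.0)
        unit_weight_sums[c["unit"]] = unit_weight_sums.get(c["unit"], 0.0) + w
    out = []
    for c in cells:
        w = class_weights.get(c["landcover"], 0.0)
        total = unit_totals[c["unit"]]
        denom = unit_weight_sums[c["unit"]]
        out.append(total * (w / denom) if denom else 0.0)
    return out

# Hypothetical example: one census unit with 9,000 people and three grid cells.
cells = [{"unit": "A", "landcover": "urban"}, {"unit": "A", "landcover": "bare"},
         {"unit": "A", "landcover": "urban"}]
print(reallocate_population(cells, {"A": 9000}, {"urban": 10.0, "bare": 1.0}))
```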

  18. A high resolution spatial population database of Somalia for disease risk mapping

    PubMed Central

    2010-01-01

    Background Millions of Somali have been deprived of basic health services due to the unstable political situation of their country. Attempts are being made to reconstruct the health sector, in particular to estimate the extent of infectious disease burden. However, any approach that requires the use of modelled disease rates requires reasonable information on population distribution. In a low-income country such as Somalia, population data are lacking, are of poor quality, or become outdated rapidly. Modelling methods are therefore needed for the production of contemporary and spatially detailed population data. Results Here land cover information derived from satellite imagery and existing settlement point datasets were used for the spatial reallocation of populations within census units. We used simple and semi-automated methods that can be implemented with free image processing software to produce an easily updatable gridded population dataset at 100 × 100 meters spatial resolution. The 2010 population dataset was matched to administrative population totals projected by the UN. Comparison tests between the new dataset and existing population datasets revealed important differences in population size distributions, and in population at risk of malaria estimates. These differences are particularly important in more densely populated areas and strongly depend on the settlement data used in the modelling approach. Conclusions The results show that it is possible to produce detailed, contemporary and easily updatable settlement and population distribution datasets of Somalia using existing data. The 2010 population dataset produced is freely available as a product of the AfriPop Project and can be downloaded from: http://www.afripop.org. PMID:20840751

  19. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution dataset mergers, such as the one exemplified here, can serve as a baseline towards comprehensive species distribution datasets.
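
    The per-cell agreement measure used above can be illustrated with a small sketch: the Jaccard similarity for one 50-km grid cell is the number of taxa recorded in both atlases divided by the number recorded in either atlas. The taxa in the example are illustrative.

```python
# Minimal sketch of the Jaccard similarity index for one grid cell:
# shared taxa divided by taxa recorded in either atlas.
def jaccard(cell_taxa_atlas_a, cell_taxa_atlas_b):
    a, b = set(cell_taxa_atlas_a), set(cell_taxa_atlas_b)
    union = a | b
    return len(a & b) / len(union) if union else float("nan")

print(jaccard({"Acer platanoides", "Quercus robur"}, {"Quercus robur", "Tilia cordata"}))
# -> 0.333... (1 shared taxon out of 3 recorded in either atlas)
```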

  20. 75 FR 42680 - Proposed Information Collection; Topographic and Bathymetric Data Survey

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-22

    .... Twenty-one pieces of information about each dataset will be collected to give an accurate picture of data quality and give users of the Topographic and Bathymetric Data Inventory access to each dataset. The end...

  1. Constructing a Reward-Related Quality of Life Statistic in Daily Life—a Proof of Concept Study Using Positive Affect

    PubMed Central

    Verhagen, Simone J. W.; Simons, Claudia J. P.; van Zelst, Catherine; Delespaul, Philippe A. E. G.

    2017-01-01

    Background: Mental healthcare needs person-tailored interventions. Experience Sampling Method (ESM) can provide daily life monitoring of personal experiences. This study aims to operationalize and test a measure of momentary reward-related Quality of Life (rQoL). Intuitively, quality of life improves by spending more time on rewarding experiences. ESM clinical interventions can use this information to coach patients to find a realistic, optimal balance of positive experiences (maximize reward) in daily life. rQoL combines the frequency of engaging in a relevant context (a ‘behavior setting’) with concurrent (positive) affect. High rQoL occurs when the most frequent behavior settings are combined with positive affect or infrequent behavior settings co-occur with low positive affect. Methods: Resampling procedures (Monte Carlo experiments) were applied to assess the reliability of rQoL using various behavior setting definitions under different sampling circumstances, for real or virtual subjects with low-, average- and high contextual variability. Furthermore, resampling was used to assess whether rQoL is a distinct concept from positive affect. Virtual ESM beep datasets were extracted from 1,058 valid ESM observations for virtual and real subjects. Results: Behavior settings defined by Who-What contextual information were most informative. Simulations of at least 100 ESM observations are needed for reliable assessment. Virtual ESM beep datasets of a real subject can be defined by Who-What-Where behavior setting combinations. Large sample sizes are necessary for reliable rQoL assessments, except for subjects with low contextual variability. rQoL is distinct from positive affect. Conclusion: rQoL is a feasible concept. Monte Carlo experiments should be used to assess the reliable implementation of an ESM statistic. Future research in ESM should assess the behavior of summary statistics under different sampling situations. This exploration is especially relevant in clinical implementation, where often only small datasets are available. PMID:29163294

  2. Constructing a Reward-Related Quality of Life Statistic in Daily Life-a Proof of Concept Study Using Positive Affect.

    PubMed

    Verhagen, Simone J W; Simons, Claudia J P; van Zelst, Catherine; Delespaul, Philippe A E G

    2017-01-01

    Background: Mental healthcare needs person-tailored interventions. Experience Sampling Method (ESM) can provide daily life monitoring of personal experiences. This study aims to operationalize and test a measure of momentary reward-related Quality of Life (rQoL). Intuitively, quality of life improves by spending more time on rewarding experiences. ESM clinical interventions can use this information to coach patients to find a realistic, optimal balance of positive experiences (maximize reward) in daily life. rQoL combines the frequency of engaging in a relevant context (a 'behavior setting') with concurrent (positive) affect. High rQoL occurs when the most frequent behavior settings are combined with positive affect or infrequent behavior settings co-occur with low positive affect. Methods: Resampling procedures (Monte Carlo experiments) were applied to assess the reliability of rQoL using various behavior setting definitions under different sampling circumstances, for real or virtual subjects with low-, average- and high contextual variability. Furthermore, resampling was used to assess whether rQoL is a distinct concept from positive affect. Virtual ESM beep datasets were extracted from 1,058 valid ESM observations for virtual and real subjects. Results: Behavior settings defined by Who-What contextual information were most informative. Simulations of at least 100 ESM observations are needed for reliable assessment. Virtual ESM beep datasets of a real subject can be defined by Who-What-Where behavior setting combinations. Large sample sizes are necessary for reliable rQoL assessments, except for subjects with low contextual variability. rQoL is distinct from positive affect. Conclusion: rQoL is a feasible concept. Monte Carlo experiments should be used to assess the reliable implementation of an ESM statistic. Future research in ESM should assess the behavior of summary statistics under different sampling situations. This exploration is especially relevant in clinical implementation, where often only small datasets are available.

  3. An MCMC determination of the primordial helium abundance

    NASA Astrophysics Data System (ADS)

    Aver, Erik; Olive, Keith A.; Skillman, Evan D.

    2012-04-01

    Spectroscopic observations of the chemical abundances in metal-poor H II regions provide an independent method for estimating the primordial helium abundance. H II regions are described by several physical parameters such as electron density, electron temperature, and reddening, in addition to y, the ratio of helium to hydrogen. It had been customary to estimate or determine self-consistently these parameters to calculate y. Frequentist analyses of the parameter space have been shown to be successful in these parameter determinations, and Markov Chain Monte Carlo (MCMC) techniques have proven to be very efficient in sampling this parameter space. Nevertheless, accurate determination of the primordial helium abundance from observations of H II regions is constrained by both systematic and statistical uncertainties. In an attempt to better reduce the latter, and continue to better characterize the former, we apply MCMC methods to the large dataset recently compiled by Izotov, Thuan, & Stasińska (2007). To improve the reliability of the determination, a high quality dataset is needed. In pursuit of this, a variety of cuts are explored. The efficacy of the He I λ4026 emission line as a constraint on the solutions is first examined, revealing the introduction of systematic bias through its absence. As a clear measure of the quality of the physical solution, a χ2 analysis proves instrumental in the selection of data compatible with the theoretical model. Nearly two-thirds of the observations fall outside a standard 95% confidence level cut, which highlights the care necessary in selecting systems and warrants further investigation into potential deficiencies of the model or data. In addition, the method also allows us to exclude systems for which parameter estimations are statistical outliers. As a result, the final selected dataset gains in reliability and exhibits improved consistency. Regression to zero metallicity yields Yp = 0.2534 ± 0.0083, in broad agreement with the WMAP result. The inclusion of more observations shows promise for further reducing the uncertainty, but more high quality spectra are required.

  4. Fragmentation of urban forms and the environmental consequences: results from a high-spatial resolution model system

    NASA Astrophysics Data System (ADS)

    Tang, U. W.; Wang, Z. S.

    2008-10-01

    Each city has its unique urban form. The importance of urban form on sustainable development has been recognized in recent years. Traditionally, air quality modelling in a city is in a mesoscale with grid resolution of kilometers, regardless of its urban form. This paper introduces a GIS-based air quality and noise model system developed to study the built environment of highly compact urban forms. Compared with traditional mesoscale air quality model system, the present model system has a higher spatial resolution down to individual buildings along both sides of the street. Applying the developed model system in the Macao Peninsula with highly compact urban forms, the average spatial resolution of input and output data is as high as 174 receptor points per km2. Based on this input/output dataset with a high spatial resolution, this study shows that even the highly compact urban forms can be fragmented into a very small geographic scale of less than 3 km2. This is due to the significant temporal variation of urban development. The variation of urban form in each fragment in turn affects air dispersion, traffic condition, and thus air quality and noise in a measurable scale.

  5. q-Space Upsampling Using x-q Space Regularization.

    PubMed

    Chen, Geng; Dong, Bin; Zhang, Yong; Shen, Dinggang; Yap, Pew-Thian

    2017-09-01

    Acquisition time in diffusion MRI increases with the number of diffusion-weighted images that need to be acquired. Particularly in clinical settings, scan time is limited and only a sparse coverage of the vast q-space is possible. In this paper, we show how non-local self-similar information in the x-q space of diffusion MRI data can be harnessed for q-space upsampling. More specifically, we establish the relationships between signal measurements in x-q space using a patch matching mechanism that caters to unstructured data. We then encode these relationships in a graph and use it to regularize an inverse problem associated with recovering a high q-space resolution dataset from its low-resolution counterpart. Experimental results indicate that the high-resolution datasets reconstructed using the proposed method exhibit greater quality, both quantitatively and qualitatively, than those obtained using conventional methods, such as interpolation using spherical radial basis functions (SRBFs).

  6. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    NASA Astrophysics Data System (ADS)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global, delayed-mode, validated in-situ ocean temperature and salinity datasets distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the Argo DAC, but profiles are also extracted from the World Ocean Database, together with TESAC profiles from GTSPP. In the case of CORA, data coming from the EuroGOOS Regional Operational Observing Systems (ROOS), operated by European institutes not managed by National Data Centres, and other profile datasets provided by scientific sources can also be found (sea mammal profiles from MEOP, XBT datasets from cruises, ...). (EN4 also takes data from the ASBO dataset to supplement observations in the Arctic.) The first advantage of this new merged product is to enhance the space and time coverage at global and European scales for the period from 1950 to the year before the current year. This product is updated once a year, and T&S gridded fields are also generated for the period 1990 to year n-1. The enhancement compared to the previous CORA product will be presented. Although the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. In 2016 a new study started that aims to compare both validation procedures, to move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation. A reference dataset composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements have been made with a wide range of instruments (XBTs, CTDs, Argo floats, instrumented sea mammals, ...), covering the global ocean. The reference dataset has been validated simultaneously by both teams. An exhaustive comparison of the validation test results is now performed to find the best features of both datasets. The study shows the differences between the EN4 and CORA validation results. It highlights the complementarity between the EN4 and CORA higher-order tests. The design of the CORA and EN4 validation charts is discussed to understand how a different approach to the dataset scope can lead to differences in data validation. The new validation chart of the Copernicus Marine Service dataset is presented.

  7. Thresholds of Toxicological Concern for cosmetics-related substances: New database, thresholds, and enrichment of chemical space.

    PubMed

    Yang, Chihae; Barlow, Susan M; Muldoon Jacobs, Kristi L; Vitcheva, Vessela; Boobis, Alan R; Felter, Susan P; Arvidson, Kirk B; Keller, Detlef; Cronin, Mark T D; Enoch, Steven; Worth, Andrew; Hollnagel, Heli M

    2017-11-01

    A new dataset of cosmetics-related chemicals for the Threshold of Toxicological Concern (TTC) approach has been compiled, comprising 552 chemicals with 219, 40, and 293 chemicals in Cramer Classes I, II, and III, respectively. Data were integrated and curated to create a database of No-/Lowest-Observed-Adverse-Effect Level (NOAEL/LOAEL) values, from which the final COSMOS TTC dataset was developed. Criteria for study inclusion and NOAEL decisions were defined, and rigorous quality control was performed for study details and assignment of Cramer classes. From the final COSMOS TTC dataset, human exposure thresholds of 42 and 7.9 μg/kg-bw/day were derived for Cramer Classes I and III, respectively. The size of Cramer Class II was insufficient for derivation of a TTC value. The COSMOS TTC dataset was then federated with the dataset of Munro and colleagues, previously published in 1996, after updating the latter using the quality control processes for this project. This federated dataset expands the chemical space and provides more robust thresholds. The 966 substances in the federated database comprise 245, 49 and 672 chemicals in Cramer Classes I, II and III, respectively. The corresponding TTC values of 46, 6.2 and 2.3 μg/kg-bw/day are broadly similar to those of the original Munro dataset. Copyright © 2017 The Authors. Published by Elsevier Ltd.. All rights reserved.

  8. A guide to evaluating linkage quality for the analysis of linked data.

    PubMed

    Harron, Katie L; Doidge, James C; Knight, Hannah E; Gilbert, Ruth E; Goldstein, Harvey; Cromwell, David A; van der Meulen, Jan H

    2017-10-01

    Linked datasets are an important resource for epidemiological and clinical studies, but linkage error can lead to biased results. For data security reasons, linkage of personal identifiers is often performed by a third party, making it difficult for researchers to assess the quality of the linked dataset in the context of specific research questions. This is compounded by a lack of guidance on how to determine the potential impact of linkage error. We describe how linkage quality can be evaluated and provide widely applicable guidance for both data providers and researchers. Using an illustrative example of a linked dataset of maternal and baby hospital records, we demonstrate three approaches for evaluating linkage quality: applying the linkage algorithm to a subset of gold standard data to quantify linkage error; comparing characteristics of linked and unlinked data to identify potential sources of bias; and evaluating the sensitivity of results to changes in the linkage procedure. These approaches can inform our understanding of the potential impact of linkage error and provide an opportunity to select the most appropriate linkage procedure for a specific analysis. Evaluating linkage quality in this way will improve the quality and transparency of epidemiological and clinical research using linked data. © The Author 2017. Published by Oxford University Press on behalf of the International Epidemiological Association.
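
    The sketch below illustrates the second of the three approaches described above, under assumed inputs: comparing the characteristics of linked and unlinked records to flag categories with large differences as potential sources of linkage bias. The record fields and the 5-percentage-point flag are hypothetical.

```python
# Illustrative sketch: compare characteristics of linked vs. unlinked records to
# highlight potential sources of linkage bias. Fields and threshold are assumptions.
def compare_linked_unlinked(records, is_linked, attribute, flag_diff=0.05):
    """records: list of dicts; is_linked(record) -> bool; returns per-category rates."""
    linked = [r for r in records if is_linked(r)]
    unlinked = [r for r in records if not is_linked(r)]
    report = {}
    for cat in sorted({r[attribute] for r in records}):
        p_linked = sum(r[attribute] == cat for r in linked) / max(len(linked), 1)
        p_unlinked = sum(r[attribute] == cat for r in unlinked) / max(len(unlinked), 1)
        report[cat] = (p_linked, p_unlinked, abs(p_linked - p_unlinked) >= flag_diff)
    return report
```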

  9. Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques.

    PubMed

    Gandy, Lisa M; Gumm, Jordan; Fertig, Benjamin; Thessen, Anne; Kennish, Michael J; Chavan, Sameer; Marchionni, Luigi; Xia, Xiaoxin; Shankrit, Shambhavi; Fertig, Elana J

    2017-01-01

    Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information retrieval inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Application of the Synthesize algorithm to cancer and ecological datasets had high accuracy (on the order of 85-100%). We further implement Synthesize in an open source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma separated value (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy to use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases.
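
    The abstract does not give the scoring details of Synthesize, but an information-retrieval-style column match can be sketched as follows: two columns are candidates for merging when both their labels and their value sets are similar. The weights and names below are illustrative assumptions, not the published algorithm.

        # Hedged sketch of IR-style column matching on labels and values
        # (illustrative only; not the actual Synthesize scoring).
        def jaccard(a: set, b: set) -> float:
            return len(a & b) / len(a | b) if (a | b) else 0.0

        def column_match_score(label_a, values_a, label_b, values_b) -> float:
            label_sim = jaccard(set(label_a.lower().split()), set(label_b.lower().split()))
            value_sim = jaccard(set(map(str, values_a)), set(map(str, values_b)))
            return 0.5 * label_sim + 0.5 * value_sim  # equal weights, chosen arbitrarily

        # Hypothetical columns from two independently curated spreadsheets:
        score = column_match_score("site name", ["Barnegat Bay", "Chesapeake"],
                                   "Site", ["Chesapeake", "Barnegat Bay"])
        print(round(score, 2))  # higher scores suggest the columns describe the same field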

  10. Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques

    PubMed Central

    Gumm, Jordan; Fertig, Benjamin; Thessen, Anne; Kennish, Michael J.; Chavan, Sameer; Marchionni, Luigi; Xia, Xiaoxin; Shankrit, Shambhavi; Fertig, Elana J.

    2017-01-01

    Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information retrieval inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Application of the Synthesize algorithm to cancer and ecological datasets had high accuracy (on the order of 85–100%). We further implement Synthesize in an open source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma separated value (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy to use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases. PMID:28437440

  11. Using Modern Statistical Methods to Analyze Demographics of Kansas ABE/GED Students Who Transition to Community or Technical College Programs

    ERIC Educational Resources Information Center

    Zacharakis, Jeff; Wang, Haiyan; Patterson, Margaret Becker; Andersen, Lori

    2015-01-01

    This research analyzed linked high-quality state data from K-12, adult education, and postsecondary state datasets in order to better understand the association between student demographics and successful completion of a postsecondary program. Due to the relatively small sample size compared to the large number of features, we analyzed the data…

  12. Continuation of the NVAP Global Water Vapor Data Sets for Pathfinder Science Analysis

    NASA Technical Reports Server (NTRS)

    VonderHaar, Thomas H.; Engelen, Richard J.; Forsythe, John M.; Randel, David L.; Ruston, Benjamin C.; Woo, Shannon; Dodge, James (Technical Monitor)

    2001-01-01

    This annual report covers August 2000 - August 2001 under NASA contract NASW-0032, entitled "Continuation of the NVAP (NASA's Water Vapor Project) Global Water Vapor Data Sets for Pathfinder Science Analysis". NASA has created a list of Earth Science Research Questions which are outlined by Asrar, et al. Particularly relevant to NVAP are the following questions: (a) How are global precipitation, evaporation, and the cycling of water changing? (b) What trends in atmospheric constituents and solar radiation are driving global climate? (c) How well can long-term climatic trends be assessed or predicted? Water vapor is a key greenhouse gas, and an understanding of its behavior is essential in global climate studies. Therefore, NVAP plays a key role in addressing the above climate questions by creating a long-term global water vapor dataset and by updating the dataset with recent advances in satellite instrumentation. The NVAP dataset produced from 1988-1998 has found wide use in the scientific community. Studies of interannual variability are particularly important. A recent paper by Simpson, et al. that examined the NVAP dataset in detail has shown that its relative accuracy is sufficient for the variability studies that contribute toward meeting NASA's goals. In the past year, we have made steady progress towards continuing production of this high-quality dataset as well as performing our own investigations of the data. This report summarizes the past year's work on production of the NVAP dataset and presents results of analyses we have performed in the past year.

  13. YummyData: providing high-quality open life science data

    PubMed Central

    Yamaguchi, Atsuko; Splendiani, Andrea

    2018-01-01

    Abstract Many life science datasets are now available via Linked Data technologies, meaning that they are represented in a common format (the Resource Description Framework), and are accessible via standard APIs (SPARQL endpoints). While this is an important step toward developing an interoperable bioinformatics data landscape, it also creates a new set of obstacles, as it is often difficult for researchers to find the datasets they need. Different providers frequently offer the same datasets, with different levels of support: as well as having more or less up-to-date data, some providers add metadata to describe the content, structures, and ontologies of the stored datasets while others do not. We currently lack a place where researchers can go to easily assess datasets from different providers in terms of metrics such as service stability or metadata richness. We also lack a space for collecting feedback and improving data providers’ awareness of user needs. To address this issue, we have developed YummyData, which consists of two components. One periodically polls a curated list of SPARQL endpoints, monitoring the states of their Linked Data implementations and content. The other presents the information measured for the endpoints and provides a forum for discussion and feedback. YummyData is designed to improve the findability and reusability of life science datasets provided as Linked Data and to foster its adoption. It is freely accessible at http://yummydata.org/. Database URL: http://yummydata.org/ PMID:29688370

  14. Highlights of the Version 8 SBUV and TOMS Datasets Released at this Symposium

    NASA Technical Reports Server (NTRS)

    Bhartia, Pawan K.; McPeters, Richard D.; Flynn, Lawrence E.; Wellemeyer, Charles G.

    2004-01-01

    Last October was the 25th anniversary of the launch of the SBUV and TOMS instruments on NASA's Nimbus-7 satellite. Total ozone and ozone profile datasets produced by these and subsequent instruments form a quarter-century-long record. Over time we have released several versions of these datasets to incorporate advances in UV radiative transfer, inverse modeling, and instrument characterization. In this meeting we are releasing datasets produced from the version 8 algorithms. They replace the previous versions (V6 SBUV, and V7 TOMS) released about a decade ago. About a dozen companion papers in this meeting provide details of the new algorithms and intercomparison of the new data with external data. In this paper we present key features of the new algorithm, and discuss how the new results differ from those released previously. We show that the new datasets have better internal consistency and also agree better with external datasets. A key feature of the V8 SBUV algorithm is that the climatology has no influence on inter-annual variability and trends; it only affects the mean values and, to a limited extent, the seasonal dependence. By contrast, climatology does have some influence on TOMS total O3 trends, particularly at large solar zenith angles. For this reason, and also because the TOMS record has gaps and EP/TOMS is suffering from data quality problems, we recommend using SBUV total ozone data for applications where the high spatial resolution of TOMS is not essential.

  15. Potential for using regional and global datasets for national scale ecosystem service modelling

    NASA Astrophysics Data System (ADS)

    Maxwell, Deborah; Jackson, Bethanna

    2016-04-01

    Ecosystem service models are increasingly being used by planners and policy makers to inform policy development and decisions about national-level resource management. Such models allow ecosystem services to be mapped and quantified, and subsequent changes to these services to be identified and monitored. In some cases, the impact of small scale changes can be modelled at a national scale, providing more detailed information to decision makers about where to best focus investment and management interventions that could address these issues, while moving toward national goals and/or targets. National scale modelling often uses national (or local) data (for example, soils, landcover and topographical information) as input. However, there are some places where fine resolution and/or high quality national datasets cannot be easily obtained, or do not even exist. In the absence of such detailed information, regional or global datasets could be used as input to such models. There are questions, however, about the usefulness of these coarser resolution datasets and the extent to which inaccuracies in this data may degrade predictions of existing and potential ecosystem service provision and subsequent decision making. Using LUCI (the Land Utilisation and Capability Indicator) as an example predictive model, we examine how the reliability of predictions changes when national datasets of soil, landcover and topography are substituted with coarser scale regional and global datasets. We specifically look at how LUCI's predictions of water services, such as flood risk, flood mitigation, erosion and water quality, change when national data inputs are replaced by regional and global datasets. Using the Conwy catchment, Wales, as a case study, the land cover products compared are the UK's Land Cover Map (2007), the European CORINE land cover map and the ESA global land cover map. Soils products include the National Soil Map of England and Wales (NatMap) and the European Soils Database. NEXTmap elevation data, which covers the UK and parts of continental Europe, are compared to global AsterDEM and SRTM30 topographical products. While the regional and global datasets can be used to fill gaps in data requirements, the coarser resolution of these datasets means that there is greater aggregation of information over larger areas. This loss of detail impacts on the reliability of model output, particularly where significant discrepancies between datasets exist. The implications of this loss of detail in terms of spatial planning and decision making are discussed. Finally, in the context of broader development, the need for better nationally and globally available data to allow LUCI and other ecosystem models to become more globally applicable is highlighted.

  16. The phylogeny and evolutionary history of tyrannosauroid dinosaurs.

    PubMed

    Brusatte, Stephen L; Carr, Thomas D

    2016-02-02

    Tyrannosauroids--the group of carnivores including Tyrannosaurus rex--are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work.

  17. The phylogeny and evolutionary history of tyrannosauroid dinosaurs

    PubMed Central

    Brusatte, Stephen L.; Carr, Thomas D.

    2016-01-01

    Tyrannosauroids—the group of carnivores including Tyrannosaurus rex—are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work. PMID:26830019

  18. The phylogeny and evolutionary history of tyrannosauroid dinosaurs

    NASA Astrophysics Data System (ADS)

    Brusatte, Stephen L.; Carr, Thomas D.

    2016-02-01

    Tyrannosauroids—the group of carnivores including Tyrannosaurus rex—are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work.

  19. Real-time quality monitoring in debutanizer column with regression tree and ANFIS

    NASA Astrophysics Data System (ADS)

    Siddharth, Kumar; Pathak, Amey; Pani, Ajaya Kumar

    2018-05-01

    A debutanizer column is an integral part of any petroleum refinery. Online composition monitoring of debutanizer column outlet streams is highly desirable in order to maximize the production of liquefied petroleum gas. In this article, data-driven models for the debutanizer column are developed for real-time composition monitoring. The dataset used has seven process variables as inputs, and the output is the butane concentration in the debutanizer column bottom product. The input-output dataset is divided equally into a training (calibration) set and a validation (testing) set. The training set data were used to develop fuzzy inference, adaptive neuro-fuzzy inference system (ANFIS) and regression tree models for the debutanizer column. The accuracy of the developed models was evaluated by simulation of the models with the validation dataset. It is observed that the ANFIS model has better estimation accuracy than the other models developed in this work and many data-driven models proposed so far in the literature for the debutanizer column.
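
    A hedged sketch of the modelling setup described above (equal calibration/validation split, tree-based soft sensor for the bottom-product butane concentration) is given below with scikit-learn; ANFIS is not part of scikit-learn, so only a regression-tree model is shown, and the seven process variables are simulated stand-ins.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.metrics import mean_squared_error

        # Hedged sketch: regression-tree soft sensor with simulated process data.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 7))                 # seven process variables
        y = 0.3 * X[:, 0] - 0.2 * X[:, 3] + rng.normal(scale=0.05, size=1000)

        # Equal split into calibration (training) and validation (testing) sets.
        X_train, X_val, y_train, y_val = X[:500], X[500:], y[:500], y[500:]

        model = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)
        print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))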

  20. Prediction of Mutagenicity of Chemicals from Their Calculated Molecular Descriptors: A Case Study with Structurally Homogeneous versus Diverse Datasets.

    PubMed

    Basak, Subhash C; Majumdar, Subhabrata

    2015-01-01

    Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are often useful in gathering useful information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures like ridge regression and linear discriminant analysis, and apply them on two data sets which have more predictors than samples (i.e. the n < p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other is a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.

  1. Ontology-based meta-analysis of global collections of high-throughput public data.

    PubMed

    Kupershmidt, Ilya; Su, Qiaojuan Jane; Grewal, Anoop; Sundaresh, Suman; Halperin, Inbal; Flynn, James; Shekar, Mamatha; Wang, Helen; Park, Jenny; Cui, Wenwu; Wall, Gregory D; Wisotzkey, Robert; Alag, Satnam; Akhtari, Saeid; Ronaghi, Mostafa

    2010-09-29

    The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the increasingly huge quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today. We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets. Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and evidence to support existing hypotheses.

  2. Acquisition of thin coronal sectional dataset of cadaveric liver.

    PubMed

    Lou, Li; Liu, Shu Wei; Zhao, Zhen Mei; Tang, Yu Chun; Lin, Xiang Tao

    2014-04-01

    To obtain the thin coronal sectional anatomic dataset of the liver by using a digital freezing milling technique. The upper abdomen of one Chinese adult cadaver was selected as the specimen. After CT and MRI examinations verified the absence of liver lesions, the specimen was embedded with gelatin in an erect position and frozen under profound hypothermia, and the specimen was then serially sectioned from anterior to posterior, layer by layer, with a digital milling machine in the freezing chamber. The sequential images were captured by means of a digital camera and the dataset was imported to an imaging workstation. The thin serial sections of the liver added up to 699 layers, with each layer being 0.2 mm in thickness. The shape, location, structure, intrahepatic vessels and adjacent structures of the liver were displayed clearly on each layer of the coronal sectional slice. CT and MR images through the body were obtained at 1.0 and 3.0 mm intervals, respectively. The methodology reported here is an adaptation of the milling methods previously described, which is a new data acquisition method for sectional anatomy. The thin coronal sectional anatomic dataset of the liver obtained by this technique is of high precision and good quality.

  3. Study of the Integration of LIDAR and Photogrammetric Datasets by in Situ Camera Calibration and Integrated Sensor Orientation

    NASA Astrophysics Data System (ADS)

    Mitishita, E.; Costa, F.; Martins, M.

    2017-05-01

    Photogrammetric and Lidar datasets should be in the same mapping or geodetic frame to be used simultaneously in an engineering project. Nowadays direct sensor orientation is a common procedure used in simultaneous photogrammetric and Lidar surveys. Although direct sensor orientation technologies provide a high degree of automation due to GNSS/INS, the accuracies of the results obtained from the photogrammetric and Lidar surveys are dependent on the quality of a group of parameters that accurately models the conditions under which the system is used at the moment the job is performed. This paper presents a study performed to verify the importance of in situ camera calibration and Integrated Sensor Orientation without control points for increasing the accuracy of photogrammetric and Lidar dataset integration. The horizontal and vertical accuracies of the photogrammetric and Lidar dataset integration improved significantly when the Integrated Sensor Orientation (ISO) approach was performed using Interior Orientation Parameter (IOP) values estimated from the in situ camera calibration. The horizontal and vertical accuracies, estimated by the Root Mean Square Error (RMSE) of the 3D discrepancies from the Lidar check points, improved by around 37% and 198%, respectively.

  4. Detecting Surgical Tools by Modelling Local Appearance and Global Shape.

    PubMed

    Bouget, David; Benenson, Rodrigo; Omran, Mohamed; Riffaud, Laurent; Schiele, Bernt; Jannin, Pierre

    2015-12-01

    Detecting tools in surgical videos is an important ingredient for context-aware computer-assisted surgical systems. To this end, we present a new surgical tool detection dataset and a method for joint tool detection and pose estimation in 2D images. Our two-stage pipeline is data-driven and relaxes strong assumptions made by previous works regarding the geometry, number, and position of tools in the image. The first stage classifies each pixel based on local appearance only, while the second stage evaluates a tool-specific shape template to enforce global shape. Both local appearance and global shape are learned from training data. Our method is validated on a new surgical tool dataset of 2,476 images from neurosurgical microscopes, which is made freely available. It improves over existing datasets in size, diversity and detail of annotation. We show that our method significantly improves over competitive baselines from the computer vision field. We achieve 15% detection miss-rate at 10^-1 false positives per image (for the suction tube) over our surgical tool dataset. Results indicate that performing semantic labelling as an intermediate task is key for high quality detection.

  5. SCPortalen: human and mouse single-cell centric database

    PubMed Central

    Noguchi, Shuhei; Böttcher, Michael; Hasegawa, Akira; Kouno, Tsukasa; Kato, Sachi; Tada, Yuhki; Ura, Hiroki; Abe, Kuniya; Shin, Jay W; Plessy, Charles; Carninci, Piero

    2018-01-01

    Abstract Published single-cell datasets are rich resources for investigators who want to address questions not originally asked by the creators of the datasets. The single-cell datasets might be obtained by different protocols and diverse analysis strategies. The main challenge in utilizing such single-cell data is how we can make the various large-scale datasets to be comparable and reusable in a different context. To challenge this issue, we developed the single-cell centric database ‘SCPortalen’ (http://single-cell.clst.riken.jp/). The current version of the database covers human and mouse single-cell transcriptomics datasets that are publicly available from the INSDC sites. The original metadata was manually curated and single-cell samples were annotated with standard ontology terms. Following that, common quality assessment procedures were conducted to check the quality of the raw sequence. Furthermore, primary data processing of the raw data followed by advanced analyses and interpretation have been performed from scratch using our pipeline. In addition to the transcriptomics data, SCPortalen provides access to single-cell image files whenever available. The target users of SCPortalen are all researchers interested in specific cell types or population heterogeneity. Through the web interface of SCPortalen users are easily able to search, explore and download the single-cell datasets of their interests. PMID:29045713

  6. Enhancement of digital radiography image quality using a convolutional neural network.

    PubMed

    Sun, Yuewen; Li, Litao; Cong, Peng; Wang, Zhentao; Guo, Xiaojing

    2017-01-01

    Digital radiography systems are widely used for noninvasive security checks and medical imaging examinations. However, such systems are limited by lower image quality in terms of spatial resolution and signal-to-noise ratio. In this study, we explored whether the image quality acquired by the digital radiography system can be improved with a modified convolutional neural network to generate high-resolution images with reduced noise from the original low-quality images. The experiment, evaluated on a test dataset containing 5 X-ray images, showed that the proposed method outperformed the traditional methods (i.e., bicubic interpolation and the 3D block-matching approach) by about 1.3 dB in peak signal-to-noise ratio (PSNR), while keeping processing time within one second. Experimental results demonstrated that a residual-to-residual (RTR) convolutional neural network remarkably improved the image quality of object structural details by increasing the image resolution and reducing image noise. Thus, this study indicated that applying this RTR convolutional neural network system is useful for improving the image quality acquired by the digital radiography system.
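
    The roughly 1.3 dB gain reported above is in terms of PSNR; for reference, the standard PSNR computation (not code from the paper) is:

        import numpy as np

        def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
            """Peak signal-to-noise ratio in dB between a reference and a test image."""
            mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
            return float("inf") if mse == 0 else 10.0 * np.log10(max_value ** 2 / mse)

        # Hypothetical 8-bit images: a reference and a noisy version of it.
        ref = np.random.randint(0, 256, (64, 64))
        noisy = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255)
        print(round(psnr(ref, noisy), 1), "dB")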

  7. Embedding the perceptions of people with dementia into quantitative research design.

    PubMed

    O'Rourke, Hannah M; Duggleby, Wendy; Fraser, Kimberly D

    2015-05-01

    Patient perspectives about quality of life are often found in the results of qualitative research and could be applied to steer the direction of future research. The purpose of this paper was to describe how findings from a body of qualitative research on patient perspectives about quality of life were linked to a clinical administrative dataset and then used to design a subsequent quantitative study. Themes from two systematic reviews of qualitative evidence (i.e., metasyntheses) identified what affects quality of life according to people with dementia. Selected themes and their sub-concepts were then mapped to an administrative dataset (the Resident Assessment Instrument 2.0) to determine the study focus, formulate nine hypotheses, and select a patient-reported outcome. A literature review followed to confirm existence of a knowledge gap, identify adjustment variables, and support design decisions. A quantitative study to test the association between conflict and sadness for people with dementia in long-term care was derived from metasynthesis themes. Challenges included (1) mapping broad themes to the administrative dataset; (2) decisions associated with inclusion of variables not identified by people with dementia from the qualitative research; and (3) selecting a patient-reported outcome, when the dataset lacked a valid subjective quality-of-life measure. Themes derived from a body of qualitative research capturing a target populations' perspective can be linked to administrative data and used to design a quantitative study. Using this approach, the quantitative findings will be meaningful with respect to the quality of life of the target population.

  8. Image quality classification for DR screening using deep learning.

    PubMed

    FengLi Yu; Jing Sun; Annan Li; Jun Cheng; Cheng Wan; Jiang Liu

    2017-07-01

    The quality of input images significantly affects the outcome of automated diabetic retinopathy (DR) screening systems. Unlike previous methods that only consider simple low-level features such as hand-crafted geometric and structural features, in this paper we propose a novel method for retinal image quality classification (IQC) that applies computational algorithms imitating the workings of the human visual system. The proposed algorithm combines unsupervised features from a saliency map and supervised features from a convolutional neural network (CNN), which are fed to an SVM to automatically detect high-quality vs poor-quality retinal fundus images. We demonstrate the superior performance of the proposed algorithm on a large retinal fundus image dataset, achieving higher accuracy than competing methods. Although retinal images are used in this study, the methodology is applicable to the image quality assessment and enhancement of other types of medical images.
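
    A hedged sketch of the fusion step described above (unsupervised saliency features concatenated with CNN features and fed to an SVM) is shown below; the feature arrays are random placeholders standing in for the real pipeline outputs.

        import numpy as np
        from sklearn.svm import SVC

        # Hedged sketch: fuse saliency-map features with CNN features and train an
        # SVM to separate high-quality from poor-quality fundus images.
        rng = np.random.default_rng(1)
        saliency_features = rng.normal(size=(200, 32))   # placeholder unsupervised features
        cnn_features = rng.normal(size=(200, 128))       # placeholder CNN features
        labels = rng.integers(0, 2, size=200)            # 1 = high quality, 0 = poor quality

        X = np.hstack([saliency_features, cnn_features])
        clf = SVC(kernel="rbf").fit(X[:150], labels[:150])
        print("held-out accuracy:", clf.score(X[150:], labels[150:]))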

  9. Data quality can make or break a research infrastructure

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Gunter, D.; Chu, H.; Christianson, D. S.; Trotta, C.; Canfora, E.; Faybishenko, B.; Cheah, Y. W.; Beekwilder, N.; Chan, S.; Dengel, S.; Keenan, T. F.; O'Brien, F.; Elbashandy, A.; Poindexter, C.; Humphrey, M.; Papale, D.; Agarwal, D.

    2017-12-01

    Research infrastructures (RIs) commonly support observational data provided by multiple, independent sources. Uniformity in the data distributed by such RIs is important in most applications, e.g., in comparative studies using data from two or more sources. Achieving uniformity in terms of data quality is challenging, especially considering that many data issues are unpredictable and cannot be detected until a first occurrence of the issue. As a result, many data quality control activities within RIs require a manual, human-in-the-loop element, making them expensive. Our motivating example is the FLUXNET2015 dataset - a collection of ecosystem-level carbon, water, and energy fluxes between land and atmosphere from over 200 sites around the world, some sites with over 20 years of data. About 90% of the human effort to create the dataset was spent in data quality related activities. Based on this experience, we have been working on solutions to increase the automation of data quality control procedures. Since it is nearly impossible to fully automate all quality related checks, we have been drawing from the experience with techniques used in software development, which shares a few common constraints. In both managing scientific data and writing software, human time is a precious resource; code bases, like science datasets, can be large, complex, and full of errors; both scientific and software endeavors can be pursued by individuals, but collaborative teams can accomplish a lot more. The lucrative and fast-paced nature of the software industry fueled the creation of methods and tools to increase automation and productivity within these constraints. Issue tracking systems, methods for translating problems into automated tests, and powerful version control tools are a few examples. Terrestrial and aquatic ecosystems research relies heavily on many types of observational data. As the volume of data collected increases, ensuring data quality is becoming an unwieldy challenge for RIs. Business-as-usual approaches to data quality do not work at larger data volumes. We believe RIs can benefit greatly from adapting this body of theory and practice from software quality to data quality, enabling systematic and reproducible safeguards against errors and mistakes in datasets as much as in software.
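
    As an example of the software-testing analogy drawn above, data-quality rules can be written as plain, repeatable assertions that run automatically on every new data delivery; the column names and limits below are hypothetical, not the FLUXNET2015 rules.

        import pandas as pd

        # Hedged sketch: express data-quality rules as automated checks, in the
        # spirit of software unit tests. Columns and limits are hypothetical.
        def check_flux_file(df: pd.DataFrame) -> list:
            issues = []
            if not df["TIMESTAMP"].is_monotonic_increasing:
                issues.append("timestamps are not monotonically increasing")
            if (df["TA"] < -60).any() or (df["TA"] > 60).any():
                issues.append("air temperature outside plausible range (-60..60 C)")
            if df["NEE"].isna().mean() > 0.5:
                issues.append("more than 50% of NEE values are missing")
            return issues

        df = pd.DataFrame({"TIMESTAMP": pd.date_range("2015-01-01", periods=48, freq="30min"),
                           "TA": [5.0] * 48, "NEE": [None] * 48})
        print(check_flux_file(df))  # flags the missing NEE values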

  10. Fine PM2.5 around July 4th

    EPA Pesticide Factsheets

    Data used in these analyses were obtained from publicly available sources, specifically the EPA's AirNow website (https://www.epa.gov/outdoor-air-quality-data). The dataset provided includes the subset of data from AirNow that was used in our analyses. This dataset is associated with the following publication: Dickerson, A., A. Benson, B. Buckley, and E. Chan. Concentrations of individual fine particulate matter components in the United States around July 4th. Air Quality, Atmosphere & Health. Springer Netherlands, NETHERLANDS, 1-10, (2016).

  11. The importance of data curation on QSAR Modeling ...

    EPA Pesticide Factsheets

    During the last few decades many QSAR models and tools have been developed at the US EPA, including the widely used EPISuite. During this period the arsenal of computational capabilities supporting cheminformatics has broadened dramatically with multiple software packages. These modern tools allow for more advanced techniques in terms of chemical structure representation and storage, as well as enabling automated data-mining and standardization approaches to examine and fix data quality issues. This presentation will investigate the impact of data curation on the reliability of QSAR models being developed within the EPA's National Center for Computational Toxicology. As part of this work we have attempted to disentangle the influence of the quality versus quantity of data based on the Syracuse PHYSPROP database partly used by EPISuite software. We will review our automated approaches to examining key datasets related to the EPISuite data to validate across chemical structure representations (e.g., mol file and SMILES) and identifiers (chemical names and registry numbers) and approaches to standardize data into QSAR-ready formats prior to modeling procedures. Our efforts to quantify and segregate data into quality categories have allowed us to evaluate the resulting models that can be developed from these data slices and to quantify to what extent efforts developing high-quality datasets have the expected pay-off in terms of predictive performance. The most accur

  12. High-Quality Seismic Observations of Sonic Booms

    NASA Technical Reports Server (NTRS)

    Wurman, Gilead; Haering, Edward A., Jr.; Price, Michael J.

    2011-01-01

    The SonicBREWS project (Sonic Boom Resistant Earthquake Warning Systems) is a collaborative effort between Seismic Warning Systems, Inc. and NASA Dryden Flight Research Center. This project aims to evaluate the effects of sonic booms on Earthquake Warning Systems in order to prevent such systems from experiencing false alarms due to sonic booms. The airspace above the Antelope Valley, California includes the High Altitude Supersonic Corridor and the Black Mountain Supersonic Corridor. These corridors are among the few places in the US where supersonic flight is permitted, and sonic booms are commonplace in the Antelope Valley. One result of this project is a rich dataset of high-quality accelerometer records of sonic booms which can shed light on the interaction between these atmospheric phenomena and the solid earth. Nearly 100 sonic booms were recorded with low-noise triaxial MEMS accelerometers recording 1000 samples per second. The sonic booms had peak overpressures ranging up to approximately 10 psf and were recorded in three flight series in 2010 and 2011. Each boom was recorded with up to four accelerometers in various array configurations up to 100 meter baseline lengths, both in the built environment and the free field. All sonic booms were also recorded by nearby microphones. We present the results of the project in terms of the potential for sonic-boom-induced false alarms in Earthquake Warning Systems, and highlight some of the interesting features of the dataset.

  13. Genome measures used for quality control are dependent on gene function and ancestry.

    PubMed

    Wang, Jing; Raskin, Leon; Samuels, David C; Shyr, Yu; Guo, Yan

    2015-02-01

    The transition/transversion (Ti/Tv) ratio and heterozygous/nonreference-homozygous (het/nonref-hom) ratio have been commonly computed in genetic studies as a quality control (QC) measurement. Additionally, these two ratios are helpful in our understanding of the patterns of DNA sequence evolution. To thoroughly understand these two genomic measures, we performed a study using 1000 Genomes Project (1000G) released genotype data (N=1092). An additional two datasets (N=581 and N=6) were used to validate our findings from the 1000G dataset. We compared the two ratios among continental ancestry, genome regions and gene functionality. We found that the Ti/Tv ratio can be used as a quality indicator for single nucleotide polymorphisms inferred from high-throughput sequencing data. The Ti/Tv ratio varies greatly by genome region and functionality, but not by ancestry. The het/nonref-hom ratio varies greatly by ancestry, but not by genome regions and functionality. Furthermore, extreme guanine + cytosine content (either high or low) is negatively associated with the Ti/Tv ratio magnitude. Thus, when performing QC assessment using these two measures, care must be taken to apply the correct thresholds based on ancestry and genome region. Failure to take these considerations into account at the QC stage will bias any following analysis. yan.guo@vanderbilt.edu Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
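
    For reference, both QC measures discussed above can be computed directly from variant and genotype calls; the sketch below uses the standard definitions (transitions are A<->G and C<->T) on a small hypothetical call set.

        # Hedged sketch: Ti/Tv and het/nonref-hom ratios from variant calls.
        TRANSITIONS = {frozenset("AG"), frozenset("CT")}

        def ti_tv_ratio(variants):
            """variants: list of (ref_allele, alt_allele) single-nucleotide pairs."""
            ti = sum(1 for ref, alt in variants if frozenset((ref, alt)) in TRANSITIONS)
            tv = len(variants) - ti
            return ti / tv if tv else float("inf")

        def het_nonref_hom_ratio(genotypes):
            """genotypes: list of '0/1' (het) or '1/1' (non-reference homozygous) calls."""
            het = sum(1 for g in genotypes if g == "0/1")
            hom = sum(1 for g in genotypes if g == "1/1")
            return het / hom if hom else float("inf")

        variants = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
        genotypes = ["0/1", "0/1", "1/1", "0/1"]
        print(ti_tv_ratio(variants), het_nonref_hom_ratio(genotypes))  # 1.0 and 3.0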

  14. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, of various types including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water level, vadose zone and groundwater temperature, water quality, and meteorology, as well as biogeochemical analyses of soil and groundwater samples, have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for on-going scientific analyses, and hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.

  15. Modelling land cover change in the Ganga basin

    NASA Astrophysics Data System (ADS)

    Moulds, S.; Tsarouchi, G.; Mijic, A.; Buytaert, W.

    2013-12-01

    Over recent decades the green revolution in India has driven substantial environmental change. Modelling experiments have identified northern India as a 'hot spot' of land-atmosphere coupling strength during the boreal summer. However, there is a wide range of sensitivity of atmospheric variables to soil moisture between individual climate models. The lack of a comprehensive land cover change dataset to force climate models has been identified as a major contributor to model uncertainty. In this work a time series dataset of land cover change between 1970 and 2010 is constructed for northern India to improve the quantification of regional hydrometeorological feedbacks. The MODIS instrument on board the Aqua and Terra satellites provides near-continuous remotely sensed datasets from 2000 to the present day. However, the quality of satellite products before 2000 is poor. To complete the dataset MODIS images are extrapolated back in time using the Conversion of Land Use and its Effects at small regional extent (CLUE-s) modelling framework. Non-spatial estimates of land cover area from national agriculture and forest statistics, available on a state-wise, annual basis, are used as a direct model input. Land cover change is allocated spatially as a function of biophysical and socioeconomic drivers identified using logistic regression. This dataset will provide an essential input to a high resolution, physically based land surface model to generate the lower boundary condition to assess the impact of land cover change on regional climate.

  16. Ontology-Based Search of Genomic Metadata.

    PubMed

    Fernandez, Javier D; Lenzerini, Maurizio; Masseroli, Marco; Venco, Francesco; Ceri, Stefano

    2016-01-01

    The Encyclopedia of DNA Elements (ENCODE) is a huge and still expanding public repository of more than 4,000 experiments and 25,000 data files, assembled by a large international consortium since 2007; unknown biological knowledge can be extracted from these huge and largely unexplored data, leading to data-driven genomic, transcriptomic, and epigenomic discoveries. Yet, searching for relevant datasets for knowledge discovery is only partially supported: metadata describing ENCODE datasets are quite simple and incomplete, and are not described by a coherent underlying ontology. Here, we show how to overcome this limitation, by adopting an ENCODE metadata searching approach which uses high-quality ontological knowledge and state-of-the-art indexing technologies. Specifically, we developed S.O.S. GeM (http://www.bioinformatics.deib.polimi.it/SOSGeM/), a system supporting effective semantic search and retrieval of ENCODE datasets. First, we constructed a Semantic Knowledge Base by starting with concepts extracted from ENCODE metadata, matched to and expanded on biomedical ontologies integrated in the well-established Unified Medical Language System. We prove that this inference method is sound and complete. Then, we leveraged the Semantic Knowledge Base to semantically search ENCODE data from arbitrary biologists' queries. This allows correctly finding more datasets than those extracted by a purely syntactic search, as supported by the other available systems. We empirically show the relevance of found datasets to the biologists' queries.

  17. A Robust Post-Processing Workflow for Datasets with Motion Artifacts in Diffusion Kurtosis Imaging

    PubMed Central

    Li, Xianjun; Yang, Jian; Gao, Jie; Luo, Xue; Zhou, Zhenyu; Hu, Yajie; Wu, Ed X.; Wan, Mingxi

    2014-01-01

    Purpose The aim of this study was to develop a robust post-processing workflow for motion-corrupted datasets in diffusion kurtosis imaging (DKI). Materials and methods The proposed workflow consisted of brain extraction, rigid registration, distortion correction, artifacts rejection, spatial smoothing and tensor estimation. Rigid registration was utilized to correct misalignments. Motion artifacts were rejected by using local Pearson correlation coefficient (LPCC). The performance of LPCC in characterizing relative differences between artifacts and artifact-free images was compared with that of the conventional correlation coefficient in 10 randomly selected DKI datasets. The influence of rejected artifacts with information of gradient directions and b values for the parameter estimation was investigated by using mean square error (MSE). The variance of noise was used as the criterion for MSEs. The clinical practicality of the proposed workflow was evaluated by the image quality and measurements in regions of interest on 36 DKI datasets, including 18 artifact-free (18 pediatric subjects) and 18 motion-corrupted datasets (15 pediatric subjects and 3 essential tremor patients). Results The relative difference between artifacts and artifact-free images calculated by LPCC was larger than that of the conventional correlation coefficient (p<0.05). It indicated that LPCC was more sensitive in detecting motion artifacts. MSEs of all derived parameters from the reserved data after the artifacts rejection were smaller than the variance of the noise. It suggested that influence of rejected artifacts was less than influence of noise on the precision of derived parameters. The proposed workflow improved the image quality and reduced the measurement biases significantly on motion-corrupted datasets (p<0.05). Conclusion The proposed post-processing workflow was reliable to improve the image quality and the measurement precision of the derived parameters on motion-corrupted DKI datasets. The workflow provided an effective post-processing method for clinical applications of DKI in subjects with involuntary movements. PMID:24727862

  18. A robust post-processing workflow for datasets with motion artifacts in diffusion kurtosis imaging.

    PubMed

    Li, Xianjun; Yang, Jian; Gao, Jie; Luo, Xue; Zhou, Zhenyu; Hu, Yajie; Wu, Ed X; Wan, Mingxi

    2014-01-01

    The aim of this study was to develop a robust post-processing workflow for motion-corrupted datasets in diffusion kurtosis imaging (DKI). The proposed workflow consisted of brain extraction, rigid registration, distortion correction, artifacts rejection, spatial smoothing and tensor estimation. Rigid registration was utilized to correct misalignments. Motion artifacts were rejected by using local Pearson correlation coefficient (LPCC). The performance of LPCC in characterizing relative differences between artifacts and artifact-free images was compared with that of the conventional correlation coefficient in 10 randomly selected DKI datasets. The influence of rejected artifacts with information of gradient directions and b values for the parameter estimation was investigated by using mean square error (MSE). The variance of noise was used as the criterion for MSEs. The clinical practicality of the proposed workflow was evaluated by the image quality and measurements in regions of interest on 36 DKI datasets, including 18 artifact-free (18 pediatric subjects) and 18 motion-corrupted datasets (15 pediatric subjects and 3 essential tremor patients). The relative difference between artifacts and artifact-free images calculated by LPCC was larger than that of the conventional correlation coefficient (p<0.05). It indicated that LPCC was more sensitive in detecting motion artifacts. MSEs of all derived parameters from the reserved data after the artifacts rejection were smaller than the variance of the noise. It suggested that influence of rejected artifacts was less than influence of noise on the precision of derived parameters. The proposed workflow improved the image quality and reduced the measurement biases significantly on motion-corrupted datasets (p<0.05). The proposed post-processing workflow was reliable to improve the image quality and the measurement precision of the derived parameters on motion-corrupted DKI datasets. The workflow provided an effective post-processing method for clinical applications of DKI in subjects with involuntary movements.
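
    The abstracts above do not give the exact LPCC formulation, but one hedged reading, a Pearson correlation computed in local patches and averaged over the image, can be sketched as follows; the patch size and the synthetic inputs are purely illustrative.

        import numpy as np

        # Hedged sketch: a local Pearson correlation coefficient (LPCC) between two
        # image slices, computed per patch and averaged (illustrative interpretation).
        def lpcc(img_a: np.ndarray, img_b: np.ndarray, patch: int = 8) -> float:
            scores = []
            for i in range(0, img_a.shape[0] - patch + 1, patch):
                for j in range(0, img_a.shape[1] - patch + 1, patch):
                    a = img_a[i:i + patch, j:j + patch].ravel()
                    b = img_b[i:i + patch, j:j + patch].ravel()
                    if a.std() > 0 and b.std() > 0:
                        scores.append(np.corrcoef(a, b)[0, 1])
            return float(np.mean(scores)) if scores else 0.0

        # A low LPCC against neighbouring volumes would flag a candidate artifact.
        clean = np.random.rand(64, 64)
        corrupted = clean + 0.5 * np.random.rand(64, 64)
        print(round(lpcc(clean, clean), 3), round(lpcc(clean, corrupted), 3))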

  19. Applying Advances in GPM Radiometer Intercalibration and Algorithm Development to a Long-Term TRMM/GPM Global Precipitation Dataset

    NASA Astrophysics Data System (ADS)

    Berg, W. K.

    2016-12-01

    The Global Precipitation Mission (GPM) Core Observatory, which was launched in February of 2014, provides a number of advances for satellite monitoring of precipitation including a dual-frequency radar, high frequency channels on the GPM Microwave Imager (GMI), and coverage over middle and high latitudes. The GPM concept, however, is about producing unified precipitation retrievals from a constellation of microwave radiometers to provide approximately 3-hourly global sampling. This involves intercalibration of the input brightness temperatures from the constellation radiometers, development of an apriori precipitation database using observations from the state-of-the-art GPM radiometer and radars, and accounting for sensor differences in the retrieval algorithm in a physically-consistent way. Efforts by the GPM inter-satellite calibration working group, or XCAL team, and the radiometer algorithm team to create unified precipitation retrievals from the GPM radiometer constellation were fully implemented into the current version 4 GPM precipitation products. These include precipitation estimates from a total of seven conical-scanning and six cross-track scanning radiometers as well as high spatial and temporal resolution global level 3 gridded products. Work is now underway to extend this unified constellation-based approach to the combined TRMM/GPM data record starting in late 1997. The goal is to create a long-term global precipitation dataset employing these state-of-the-art calibration and retrieval algorithm approaches. This new long-term global precipitation dataset will incorporate the physics provided by the combined GPM GMI and DPR sensors into the apriori database, extend prior TRMM constellation observations to high latitudes, and expand the available TRMM precipitation data to the full constellation of available conical and cross-track scanning radiometers. This combined TRMM/GPM precipitation data record will thus provide a high-quality high-temporal resolution global dataset for use in a wide variety of weather and climate research applications.

  20. Wearable Sensor Data Classification for Human Activity Recognition Based on an Iterative Learning Framework.

    PubMed

    Davila, Juan Carlos; Cretu, Ana-Maria; Zaremba, Marek

    2017-06-07

    The design of multiple human activity recognition applications in areas such as healthcare, sports and safety relies on wearable sensor technologies. However, when making decisions based on the data acquired by such sensors in practical situations, several factors related to sensor data alignment, data losses, and noise, among other experimental constraints, deteriorate data quality and model accuracy. To tackle these issues, this paper presents a data-driven iterative learning framework to classify human locomotion activities such as walk, stand, lie, and sit, extracted from the Opportunity dataset. Data acquired by twelve 3-axial acceleration sensors and seven inertial measurement units are initially de-noised using a two-stage consecutive filtering approach combining a band-pass Finite Impulse Response (FIR) and a wavelet filter. A series of statistical parameters are extracted from the kinematical features, including the principal components and singular value decomposition of roll, pitch, yaw and the norm of the axial components. The novel iterative learning procedure is then applied in order to minimize the number of samples required to classify human locomotion activities. Only those samples that are most distant from the centroids of data clusters, according to a measure presented in the paper, are selected as candidates for the training dataset. The newly built dataset is then used to train an SVM multi-class classifier. The latter will produce the lowest prediction error. The proposed learning framework ensures a high level of robustness to variations in the quality of input data, while only using a much lower number of training samples and therefore a much shorter training time, which is an important consideration given the large size of the dataset.
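
    A hedged sketch of the sample-selection idea described above (keep only the samples most distant from the data-cluster centroids, then train a multi-class SVM) is given below on synthetic data; the distance measure and thresholds in the paper may differ.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.svm import SVC

        # Hedged sketch: keep training samples far from their cluster centroid,
        # then fit a multi-class SVM (illustrative, not the paper's exact procedure).
        rng = np.random.default_rng(2)
        X = rng.normal(size=(600, 20))                 # placeholder sensor features
        y = rng.integers(0, 4, size=600)               # walk / stand / lie / sit labels

        km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
        dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
        keep = dist >= np.quantile(dist, 0.75)         # keep the most distant quarter

        clf = SVC(kernel="rbf").fit(X[keep], y[keep])
        print("training set reduced to", int(keep.sum()), "of", len(X), "samples")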

  1. EarthServer: Visualisation and use of uncertainty as a data exploration tool

    NASA Astrophysics Data System (ADS)

    Walker, Peter; Clements, Oliver; Grant, Mike

    2013-04-01

    The Ocean Science/Earth Observation community generates huge datasets from satellite observation. Until recently it has been difficult to obtain matching uncertainty information for these datasets and to apply this to their processing. In order to make use of uncertainty information when analysing "Big Data" we need both the uncertainty itself (attached to the underlying data) and a means of working with the combined product without requiring the entire dataset to be downloaded. The European Commission FP7 project EarthServer (http://earthserver.eu) is addressing the problem of accessing and ad-hoc analysis of extreme-size Earth Science data using cutting-edge Array Database technology. The core software (Rasdaman) and web services wrapper (Petascope) allow huge datasets to be accessed using Open Geospatial Consortium (OGC) standard interfaces including the well established standards, Web Coverage Service (WCS) and Web Map Service (WMS) as well as the emerging standard, Web Coverage Processing Service (WCPS). The WCPS standard allows the running of ad-hoc queries on any of the data stored within Rasdaman, creating an infrastructure where users are not restricted by bandwidth when manipulating or querying huge datasets. The ESA Ocean Colour - Climate Change Initiative (OC-CCI) project (http://www.esa-oceancolour-cci.org/), is producing high-resolution, global ocean colour datasets over the full time period (1998-2012) where high quality observations were available. This climate data record includes per-pixel uncertainty data for each variable, based on an analytic method that classifies how much and which types of water are present in a pixel, and assigns uncertainty based on robust comparisons to global in-situ validation datasets. These uncertainty values take two forms, Root Mean Square (RMS) and Bias uncertainty, respectively representing the expected variability and expected offset error. By combining the data produced through the OC-CCI project with the software from the EarthServer project we can produce a novel data offering that allows the use of traditional exploration and access mechanisms such as WMS and WCS. However the real benefits can be seen when utilising WCPS to explore the data . We will show two major benefits to this infrastructure. Firstly we will show that the visualisation of the combined chlorophyll and uncertainty datasets through a web based GIS portal gives users the ability to instantaneously assess the quality of the data they are exploring using traditional web based plotting techniques as well as through novel web based 3 dimensional visualisation. Secondly we will showcase the benefits available when combining these data with the WCPS standard. The uncertainty data can be utilised in queries using the standard WCPS query language. This allows selection of data either for download or use within the query, based on the respective uncertainty values as well as the possibility of incorporating both the chlorophyll data and uncertainty data into complex queries to produce additional novel data products. By filtering with uncertainty at the data source rather than the client we can minimise traffic over the network allowing huge datasets to be worked on with a minimal time penalty.
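
    For example, an uncertainty-aware selection can be expressed directly in a WCPS query and posted to a Petascope endpoint; the endpoint URL, coverage name and band names below are hypothetical, and the query is only a sketch of the kind of server-side filtering described above.

        import requests

        # Hedged sketch: mask chlorophyll values whose RMS uncertainty exceeds a
        # threshold, server-side, before anything is transferred to the client.
        ENDPOINT = "https://example.org/rasdaman/ows"  # hypothetical Petascope endpoint
        query = """
        for $c in (OCCCI_CHLOR_A)
        return encode($c.chlor_a * ($c.rms_uncertainty < 0.3), "csv")
        """
        response = requests.post(ENDPOINT, data={"query": query}, timeout=60)
        print(response.status_code, response.text[:200])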

  2. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    NASA Astrophysics Data System (ADS)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate long-term analysis of meteorological and pyranometric data is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The most widely used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978), considering a specific weighted combination of different meteorological variables, notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed, and relative humidity. In 2012, a new approach was proposed in the framework of the European FP7 project ENDORSE. It introduced the concept of a "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables, to improve the representativeness of the TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time series of high-quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not suitable for CPV systems. For these systems, the TMY datasets obtained using dedicated drivers (DNI only or a more precise one) are more representative when deriving TMY datasets from a limited long-term meteorological dataset.
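
    A much-simplified sketch of how a driver-based TMY month selection might work, assuming a DNI-only driver and a Finkelstein-Schafer-like closeness statistic between monthly and long-term empirical CDFs. The weighting and drivers used in the study itself are more elaborate; function and variable names here are illustrative.

```python
# Sketch: for each calendar month, pick the year whose daily-DNI distribution is
# closest to the long-term distribution for that month.
import numpy as np
import pandas as pd

def fs_statistic(sample, long_term):
    """Mean absolute difference between two empirical CDFs, evaluated on the long-term values."""
    grid = np.sort(long_term)
    cdf_lt = np.searchsorted(np.sort(long_term), grid, side="right") / len(long_term)
    cdf_s = np.searchsorted(np.sort(sample), grid, side="right") / len(sample)
    return np.abs(cdf_lt - cdf_s).mean()

def select_typical_months(daily_dni: pd.Series) -> dict:
    """daily_dni: daily DNI values indexed by date over 5-15 historical years."""
    typical = {}
    for month in range(1, 13):
        month_data = daily_dni[daily_dni.index.month == month]
        scores = {year: fs_statistic(grp.values, month_data.values)
                  for year, grp in month_data.groupby(month_data.index.year)}
        typical[month] = min(scores, key=scores.get)   # year with the closest CDF
    return typical

# toy usage with synthetic DNI data
dates = pd.date_range("2001-01-01", "2015-12-31", freq="D")
dni = pd.Series(np.random.default_rng(1).gamma(4.0, 50.0, len(dates)), index=dates)
print(select_typical_months(dni))
```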

  3. SDCLIREF - A sub-daily gridded reference dataset

    NASA Astrophysics Data System (ADS)

    Wood, Raul R.; Willkofer, Florian; Schmid, Franz-Josef; Trentini, Fabian; Komischke, Holger; Ludwig, Ralf

    2017-04-01

    Climate change is expected to impact the intensity and frequency of hydrometeorological extreme events. In order to adequately capture and analyze extreme rainfall events, in particular when assessing flood and flash flood situations, data are required at high spatial and sub-daily resolution, which are often not available in sufficient density and over extended time periods. The ClimEx project (Climate Change and Hydrological Extreme Events) addresses the alteration of hydrological extreme events under climate change conditions. In order to differentiate between a clear climate change signal and the limits of natural variability, unique Single-Model Regional Climate Model Ensembles (CRCM5 driven by CanESM2, RCP8.5) were created for a European and North-American domain, each comprising 50 members of 150 years (1951-2100). In combination with the CORDEX-Database, this newly created ClimEx-Ensemble is a one-of-a-kind model dataset to analyze changes of sub-daily extreme events. For the purpose of bias-correcting the regional climate model ensembles as well as for the baseline calibration and validation of hydrological catchment models, a new sub-daily (3h) high-resolution (500m) gridded reference dataset (SDCLIREF) was created for a domain covering the Upper Danube and Main watersheds (approximately 100,000 km2). As the sub-daily observations lack a continuous time series for the reference period 1980-2010, the need for a suitable method to bridge the gap of the discontinuous time series arose. The Method of Fragments (Sharma and Srikanthan (2006); Westra et al. (2012)) was applied to transform daily observations to sub-daily rainfall events to extend the time series and densify the station network. Prior to applying the Method of Fragments and creating the gridded dataset using rigorous interpolation routines, observations operated by several institutions in three countries (Germany, Austria, Switzerland) were collected and subsequently quality controlled. Among others, the quality control checked for steps, extensive dry seasons, temporal consistency and maximum hourly values. The resulting SDCLIREF dataset provides a robust precipitation reference for hydrometeorological applications at an unprecedentedly high spatio-temporal resolution. References: Sharma, A.; Srikanthan, S. (2006): Continuous Rainfall Simulation: A Nonparametric Alternative. In: 30th Hydrology and Water Resources Symposium 4-7 December 2006, Launceston, Tasmania. Westra, S.; Mehrotra, R.; Sharma, A.; Srikanthan, R. (2012): Continuous rainfall simulation. 1. A regionalized subdaily disaggregation approach. In: Water Resour. Res. 48 (1). DOI: 10.1029/2011WR010489.
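
    A bare-bones illustration of the Method of Fragments idea mentioned above: the sub-daily "fragments" (fractions of the daily total) of the most similar observed donor day are rescaled to the target daily total. The actual SDCLIREF workflow uses more elaborate similarity criteria and station matching; the numbers and the nearest-total rule below are assumptions for illustration only.

```python
# Sketch: disaggregate one daily rainfall total to eight 3-hourly values using donor fragments.
import numpy as np

def disaggregate_daily(daily_total, donor_daily_totals, donor_fragments):
    """
    daily_total        : float, daily sum to disaggregate (mm)
    donor_daily_totals : (n,) daily sums of donor days that have sub-daily records
    donor_fragments    : (n, 8) fractions of each donor day's total in its eight 3-h steps
    """
    nearest = np.argmin(np.abs(donor_daily_totals - daily_total))   # most similar donor day
    return daily_total * donor_fragments[nearest]                   # (8,) 3-hourly values

# toy usage: three donor days, each with eight 3-hourly fractions summing to 1
rng = np.random.default_rng(42)
donor_sub = rng.gamma(0.8, 1.0, size=(3, 8))
donor_fragments = donor_sub / donor_sub.sum(axis=1, keepdims=True)
donor_totals = np.array([5.2, 18.7, 41.0])

print(disaggregate_daily(20.0, donor_totals, donor_fragments))   # eight values summing to 20.0
```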

  4. QualityML: a dictionary for quality metadata encoding

    NASA Astrophysics Data System (ADS)

    Ninyerola, Miquel; Sevillano, Eva; Serral, Ivette; Pons, Xavier; Zabala, Alaitz; Bastin, Lucy; Masó, Joan

    2014-05-01

    The scenario of rapidly growing geodata catalogues requires tools focused on facilitating users' choice of products. Having quality fields populated in metadata allows users to rank and then select the best fit-for-purpose products. In this direction, we have developed QualityML (http://qualityml.geoviqua.org), a dictionary that contains hierarchically structured concepts to precisely define and relate quality levels: from quality classes to quality measurements. Generically, a quality element is the path that goes from the highest level (quality class) to the lowest levels (statistics or quality metrics). This path is used to encode the quality of datasets in the corresponding metadata schemas. The benefits of having encoded quality, in the case of data producers, are related to improvements in their product discovery and better transmission of their characteristics. In the case of data users, particularly decision-makers, they would find quality and uncertainty measures to take the best decisions as well as to perform dataset intercomparison. It also allows other components (such as visualization, discovery, or comparison tools) to be quality-aware and interoperable. On one hand, QualityML is a profile of the ISO geospatial metadata standards providing a set of rules for precisely documenting quality indicator parameters, structured in 6 levels. On the other hand, QualityML includes semantics and vocabularies for the quality concepts. Whenever possible, it uses statistical expressions from the UncertML dictionary (http://www.uncertml.org) encoding. However, it also extends UncertML to provide a list of alternative metrics that are commonly used to quantify quality. A specific example, based on a temperature dataset, is shown below. The annual mean temperature map has been validated with independent in-situ measurements to obtain a global error of 0.5 °C. Level 0: Quality class (e.g., Thematic accuracy); Level 1: Quality indicator (e.g., Quantitative attribute correctness); Level 2: Measurement field (e.g., DifferentialErrors1D); Level 3: Statistic or metric (e.g., Half-lengthConfidenceInterval); Level 4: Units (e.g., Celsius degrees); Level 5: Value (e.g., 0.5); Level 6: Specifications, i.e. additional information on how the measurement took place, a citation of the reference data, the traceability of the process and a publication describing the validation process, encoded using new ISO 19157 elements or the GeoViQua (http://www.geoviqua.org) Quality Model (PQM-UQM) extensions to the ISO models. Finally, keep in mind that QualityML is not just suitable for encoding quality at the dataset level but also considers pixel- and object-level uncertainties. This is done by linking the metadata quality descriptions with layers representing not just the data but the uncertainty values associated with each geospatial element.
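
    The hierarchy in the temperature example above can be illustrated in plain Python. This is not the official QualityML or ISO 19157 encoding, only a sketch of how the path from quality class down to value and specification fits together; the "citation" content is a hypothetical placeholder.

```python
# Sketch: assemble the Level 0-6 path of a QualityML-style quality element for the
# temperature example (global error of 0.5 degrees Celsius).
quality_element = {
    "class": "ThematicAccuracy",                     # Level 0
    "indicator": "QuantitativeAttributeCorrectness", # Level 1
    "measurement": "DifferentialErrors1D",           # Level 2
    "statistic": "Half-lengthConfidenceInterval",    # Level 3
    "units": "Celsius degrees",                      # Level 4
    "value": 0.5,                                    # Level 5
    "specification": {                               # Level 6
        "reference_data": "independent in-situ measurements",
        "citation": "validation report (hypothetical reference)",
    },
}

def quality_path(elem):
    """Return the class-to-metric path used to tag the dataset's metadata."""
    return "/".join([elem["class"], elem["indicator"],
                     elem["measurement"], elem["statistic"]])

print(quality_path(quality_element), "=", quality_element["value"], quality_element["units"])
```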

  5. Comparison and validation of gridded precipitation datasets for Spain

    NASA Astrophysics Data System (ADS)

    Quintana-Seguí, Pere; Turco, Marco; Míguez-Macho, Gonzalo

    2016-04-01

    In this study, two gridded precipitation datasets are compared and validated in Spain: the recently developed SAFRAN dataset and the Spain02 dataset. These are validated using rain gauges and they are also compared to the low-resolution ERA-Interim reanalysis. The SAFRAN precipitation dataset has been recently produced, using the SAFRAN meteorological analysis, which is extensively used in France (Durand et al. 1993, 1999; Quintana-Seguí et al. 2008; Vidal et al., 2010) and which has recently been applied to Spain (Quintana-Seguí et al., 2015). SAFRAN uses an optimal interpolation (OI) algorithm and uses all available rain gauges from the Spanish State Meteorological Agency (Agencia Estatal de Meteorología, AEMET). The product has a spatial resolution of 5 km and it spans from September 1979 to August 2014. This dataset has been produced mainly to be used in large-scale hydrological applications. Spain02 (Herrera et al. 2012, 2015) is another high-quality precipitation dataset for Spain based on a dense network of quality-controlled stations, and it has different versions at different resolutions. In this study we used the version with a resolution of 0.11°. The product spans from 1971 to 2010. Spain02 is well tested and widely used, mainly, but not exclusively, for RCM model validation and statistical downscaling. ERA-Interim is a well-known global reanalysis with a spatial resolution of ~79 km. It has been included in the comparison because it is a widely used product for continental and global scale studies and also in smaller-scale studies in data-poor countries. Thus, its comparison with higher-resolution products of a data-rich country, such as Spain, allows us to quantify the errors made when using such datasets for national-scale studies, in line with some of the objectives of the EU-FP7 eartH2Observe project. The comparison shows that SAFRAN and Spain02 perform similarly, even though their underlying principles are different. Both products are largely better than ERA-Interim, which has a much coarser representation of the relief, which is crucial for precipitation. These results are a contribution to the Spanish Case Study of the eartH2Observe project, which is focused on the simulation of drought processes in Spain using Land-Surface Models (LSM). This study will also be helpful in the Spanish MARCO project, which aims at improving the ability of RCMs to simulate hydrometeorological extremes.

  6. Weighted analysis of paired microarray experiments.

    PubMed

    Kristiansson, Erik; Sjögren, Anders; Rudemo, Mats; Nerman, Olle

    2005-01-01

    In microarray experiments quality often varies, for example between samples and between arrays. The need for quality control is therefore strong. A statistical model and a corresponding analysis method are suggested for experiments with pairing, including designs with individuals observed before and after treatment and many experiments with two-colour spotted arrays. The model is of mixed type with some parameters estimated by an empirical Bayes method. Differences in quality are modelled by individual variances and correlations between repetitions. The method is applied to three real and several simulated datasets. Two of the real datasets are of Affymetrix type with patients profiled before and after treatment, and the third dataset is of two-colour spotted cDNA type. In all cases, the patients or arrays had different estimated variances, leading to distinctly unequal weights in the analysis. We also suggest plots that illustrate the variances and correlations that affect the weights computed by our analysis method. For simulated data the improvement relative to previously published methods without weighting is shown to be substantial.
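
    A simplified illustration of why unequal array or patient quality leads to unequal weights: an inverse-variance weighted mean of paired log-ratios. The published method is a mixed model with empirical-Bayes variance estimation; this sketch only shows the weighting principle, not that model.

```python
# Sketch: inverse-variance weighting of paired log-ratios for a single gene.
import numpy as np

def weighted_log_ratio(log_ratios, array_variances):
    """
    log_ratios      : (n_arrays,) paired log-ratios for one gene
    array_variances : (n_arrays,) per-array (or per-patient) variance estimates
    Returns the weighted mean and its standard error.
    """
    w = 1.0 / np.asarray(array_variances)
    mean = np.sum(w * log_ratios) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return mean, se

lr = np.array([0.8, 1.1, 0.2, 0.9])
var = np.array([0.05, 0.04, 0.60, 0.06])   # the noisy third array receives little weight
print(weighted_log_ratio(lr, var))
```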

  7. Maintaining High Quality Data and Consistency Across a Diverse Flux Network: The Ameriflux QA/QC Technical Team

    NASA Astrophysics Data System (ADS)

    Chan, S.; Billesbach, D. P.; Hanson, C. V.; Biraud, S.

    2014-12-01

    The AmeriFlux quality assurance and quality control (QA/QC) technical team conducts short-term (<2 weeks) intercomparisons using a portable eddy covariance system (PECS) to maintain high-quality data observations and data consistency across the AmeriFlux network (http://ameriflux.lbl.gov/). Site intercomparisons identify discrepancies between the in situ and portable measurements and calculated fluxes. Findings are jointly discussed by the site staff and the QA/QC team to improve the in situ observations. Despite the relatively short duration of an individual site intercomparison, the accumulated record of all site visits (numbering over 100 since 2002) is a unique dataset. The ability to deploy redundant sensors provides a rare opportunity to identify, quantify, and understand uncertainties in eddy covariance and ancillary measurements. We present a few specific case studies from QA/QC site visits to highlight and share new and relevant findings related to eddy covariance instrumentation and operation.

  8. Communicating data quality through Web Map Services

    NASA Astrophysics Data System (ADS)

    Blower, Jon; Roberts, Charles; Griffiths, Guy; Lewis, Jane; Yang, Kevin

    2013-04-01

    The sharing and visualization of environmental data through spatial data infrastructures is becoming increasingly common. However, information about the quality of data is frequently unavailable or presented in an inconsistent fashion. ("Data quality" is a phrase with many possible meanings but here we define it as "fitness for purpose" - therefore different users have different notions of what constitutes a "high quality" dataset.) The GeoViQua project (www.geoviqua.org) is developing means for eliciting, formatting, discovering and visualizing quality information using ISO and Open Geospatial Consortium (OGC) standards. Here we describe one aspect of the innovations of the GeoViQua project. In this presentation, we shall demonstrate new developments in using Web Map Services to communicate data quality at the level of datasets, variables and individual samples. We shall outline a new draft set of conventions (known as "WMS-Q"), which describe a set of rules for using WMS to convey quality information (OGC draft Engineering Report 12-160). We shall demonstrate these conventions through new prototype software, based upon the widely-used ncWMS software, that applies these rules to enable the visualization of uncertainties in raster data such as satellite products and the results of numerical simulations. Many conceptual and practical issues have arisen from these experiments. How can source data be formatted so that a WMS implementation can detect the semantic links between variables (e.g. the links between a mean field and its variance)? The visualization of uncertainty can be a complex task - how can we provide users with the power and flexibility to choose an optimal strategy? How can we maintain compatibility (as far as possible) with existing WMS clients? We explore these questions with reference to existing standards and approaches, including UncertML, NetCDF-U and Styled Layer Descriptors.

  9. Urban compaction or dispersion? An air quality modelling study

    NASA Astrophysics Data System (ADS)

    Martins, Helena

    2012-07-01

    Urban sprawl is altering the landscape, with current trends pointing to further changes in land use that will, in turn, lead to changes in population, energy consumption, atmospheric emissions and air quality. Urban planners have debated on the most sustainable urban structure, with arguments in favour and against urban compaction and dispersion. However, it is clear that other areas of expertise have to be involved. Urban air quality and human exposure to atmospheric pollutants as indicators of urban sustainability can contribute to the discussion, namely through the study of the relation between urban structure and air quality. This paper addresses the issue by analysing the impacts of alternative urban growth patterns on the air quality of Porto urban region in Portugal, through a 1-year simulation with the MM5-CAMx modelling system. This region has been experiencing one of the highest European rates of urban sprawl, and at the same time presents a poor air quality. As part of the modelling system setup, a sensitivity study was conducted regarding different land use datasets and spatial distribution of emissions. Two urban development scenarios were defined, SPRAWL and COMPACT, together with their new land use and emission datasets; then meteorological and air quality simulations were performed. Results reveal that SPRAWL land use changes resulted in an average temperature increase of 0.4 °C, with local increases reaching as high as 1.5 °C. SPRAWL results also show an aggravation of PM10 annual average values and an increase in the exceedances to the daily limit value. For ozone, differences between scenarios were smaller, with SPRAWL presenting larger concentration differences than COMPACT. Finally, despite the higher concentrations found in SPRAWL, population exposure to the pollutants is higher for COMPACT because more inhabitants are found in areas of highest concentration levels.

  10. Does chlorhexidine prevent dry socket?

    PubMed

    Richards, Derek

    2012-01-01

    The BBO (Bibliografia Brasileira de Odontologia), Biomed Central, Cochrane Library, Directory of Open Access Journals, LILACS, Open-J-Gate, OpenSIGLE, PubMed, Sabinet and Science-Direct databases were searched. Articles were selected for review from the search results on the basis of their compliance with the broad inclusion criteria: relevant to the review question; and prospective two-arm (or more) clinical study. The primary outcome measure was the incidence of AO reported at the patient level. Two reviewers (VY and SM) independently extracted data and assessed the quality of the accepted articles. Individual dichotomous datasets for the control and test group were extracted from each article. Where possible, missing data were calculated from information given in the text or tables. In addition, authors were contacted in order to obtain missing information. Datasets were assessed for their clinical and methodological heterogeneity following Cochrane guidelines. Meta-analysis was conducted with homogeneous datasets. Publication bias was assessed by use of a funnel plot and Egger's regression. Ten randomised trials were included; almost all involved the removal of third molars. Only two of six identified application protocols (single application of chlorhexidine 0.2% gel or multiple application of 0.12% rinse versus placebo) were found to significantly decrease the incidence of AO. Within the limitations of this review, only two of six identified application protocols were found to significantly decrease the incidence of AO. The evidence for both protocols is weak and may be challenged on the grounds of high risk of selection, detection/performance and attrition bias. This systematic review could not identify sufficient evidence supporting the use of chlorhexidine for the prevention of AO. Chlorhexidine seems not to cause any significantly higher adverse reactions than placebo. Future high-quality randomised control trials are needed to provide conclusive evidence on this topic.

  11. Hidden treasures in "ancient" microarrays: gene-expression portrays biology and potential resistance pathways of major lung cancer subtypes and normal tissue.

    PubMed

    Kerkentzes, Konstantinos; Lagani, Vincenzo; Tsamardinos, Ioannis; Vyberg, Mogens; Røe, Oluf Dimitri

    2014-01-01

    Novel statistical methods and increasingly more accurate gene annotations can transform "old" biological data into a renewed source of knowledge with potential clinical relevance. Here, we provide an in silico proof-of-concept by extracting novel information from a high-quality mRNA expression dataset, originally published in 2001, using state-of-the-art bioinformatics approaches. The dataset consists of histologically defined cases of lung adenocarcinoma (AD), squamous (SQ) cell carcinoma, small-cell lung cancer, carcinoid, metastasis (breast and colon AD), and normal lung specimens (203 samples in total). A battery of statistical tests was used for identifying differential gene expressions, diagnostic and prognostic genes, enriched gene ontologies, and signaling pathways. Our results showed that gene expressions faithfully recapitulate immunohistochemical subtype markers, as chromogranin A in carcinoids, cytokeratin 5, p63 in SQ, and TTF1 in non-squamous types. Moreover, biological information with putative clinical relevance was revealed as potentially novel diagnostic genes for each subtype with specificity 93-100% (AUC = 0.93-1.00). Cancer subtypes were characterized by (a) differential expression of treatment target genes as TYMS, HER2, and HER3 and (b) overrepresentation of treatment-related pathways like cell cycle, DNA repair, and ERBB pathways. The vascular smooth muscle contraction, leukocyte trans-endothelial migration, and actin cytoskeleton pathways were overexpressed in normal tissue. Reanalysis of this public dataset displayed the known biological features of lung cancer subtypes and revealed novel pathways of potentially clinical importance. The findings also support our hypothesis that even old omics data of high quality can be a source of significant biological information when appropriate bioinformatics methods are used.

  12. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling.

    PubMed

    Mansouri, K; Grulke, C M; Richard, A M; Judson, R S; Williams, A J

    2016-11-01

    The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
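
    One curation step in the same spirit as the workflow described above can be sketched as follows: check that the SMILES and MolBlock supplied for a record parse and describe the same structure, flagging failures and mismatches. This uses RDKit rather than the KNIME nodes of the published workflow and is only an illustration of the structure-identity consistency check.

```python
# Sketch: structure-identity consistency check for one chemical record.
from rdkit import Chem

def check_record(name, smiles, molblock):
    issues = []
    mol_s = Chem.MolFromSmiles(smiles) if smiles else None
    mol_b = Chem.MolFromMolBlock(molblock) if molblock else None
    if smiles and mol_s is None:
        issues.append("SMILES does not parse")
    if molblock and mol_b is None:
        issues.append("MolBlock does not parse")
    if mol_s is not None and mol_b is not None:
        # compare canonical SMILES of both representations
        if Chem.MolToSmiles(mol_s) != Chem.MolToSmiles(mol_b):
            issues.append("SMILES and MolBlock disagree")
    return {"name": name, "issues": issues or ["OK"]}

print(check_record("benzene", "c1ccccc1", None))
```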

  13. Multiplatform observations enabling albedo retrievals with high temporal resolution

    NASA Astrophysics Data System (ADS)

    Riihelä, Aku; Manninen, Terhikki; Key, Jeffrey; Sun, Qingsong; Sütterlin, Melanie; Lattanzio, Alessio; Schaaf, Crystal

    2017-04-01

    In this paper we show that combining observations from different polar-orbiting satellite families (such as AVHRR and MODIS) is physically justifiable and technically feasible. Our proposed approach will lead to surface albedo retrievals at higher temporal resolution than the state of the art, with comparable or better accuracy. This study is carried out in the World Meteorological Organization (WMO) Sustained and coordinated processing of Environmental Satellite data for Climate Monitoring (SCOPE-CM) project SCM-02 (http://www.scope-cm.org/projects/scm-02/). Following a spectral homogenization of the Top-of-Atmosphere reflectances of bands 1 & 2 from AVHRR and MODIS, both observation datasets are atmospherically corrected with a coherent atmospheric profile and algorithm. The resulting surface reflectances are then fed into an inversion of the RossThick-LiSparse-Reciprocal surface bidirectional reflectance distribution function (BRDF) model. The results of the inversion (BRDF kernels) may then be integrated to estimate various surface albedo quantities. A key principle here is that the larger number of valid surface observations with multiple satellites allows us to invert the BRDF coefficients within a shorter time span, enabling the monitoring of relatively rapid surface phenomena such as snowmelt. The proposed multiplatform approach is expected to bring benefits in particular to the observation of the albedo of the polar regions, where persistent cloudiness and long atmospheric path lengths present challenges to satellite-based retrievals. Following a similar logic, the retrievals over tropical regions with high cloudiness should also benefit from the method. We present results from a demonstrator dataset of a global combined AVHRR-GAC and MODIS dataset covering the year 2010. The retrieved surface albedo is compared against quality-monitored in situ albedo observations from the Baseline Surface Radiation Network (BSRN). Additionally, the combined retrieval dataset is compared against MODIS C6 albedo/BRDF datasets to assess the quality of the multiplatform approach against the current state of the art. This approach is not limited to AVHRR and MODIS observations. Provided that the spectral homogenization produces an acceptably good match, any instrument observing the Earth's surface in the visible and near-infrared wavelengths could, in principle, be included to further enhance the temporal resolution and accuracy of the retrievals. The SCOPE-CM initiative provides a potential framework for such expansion in the future.
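
    A minimal sketch of the kernel-driven inversion step described above: surface reflectances from multiple observation geometries are fitted to R = f_iso + f_vol * K_vol + f_geo * K_geo by linear least squares. Computing the RossThick and LiSparse-Reciprocal kernel values from the sun and view angles is omitted, the reflectance numbers are synthetic, and the integration weights in the albedo helper are the commonly quoted MODIS-heritage constants rather than values from this study.

```python
# Sketch: linear least-squares BRDF kernel inversion for one pixel and band.
import numpy as np

def invert_brdf(reflectance, k_vol, k_geo):
    """Return (f_iso, f_vol, f_geo) from n multi-angular surface reflectance observations."""
    A = np.column_stack([np.ones_like(k_vol), k_vol, k_geo])
    coeffs, *_ = np.linalg.lstsq(A, reflectance, rcond=None)
    return coeffs

def white_sky_albedo(f_iso, f_vol, f_geo):
    # Bihemispherical integral of the kernels (MODIS-heritage integration weights).
    return f_iso + 0.189184 * f_vol - 1.377622 * f_geo

refl  = np.array([0.61, 0.58, 0.64, 0.60, 0.57])     # synthetic multi-angular reflectances
k_vol = np.array([-0.02, 0.05, -0.08, 0.01, 0.07])   # precomputed RossThick kernel values
k_geo = np.array([-1.10, -1.35, -0.95, -1.20, -1.40])  # precomputed LiSparseR kernel values
f_iso, f_vol, f_geo = invert_brdf(refl, k_vol, k_geo)
print(white_sky_albedo(f_iso, f_vol, f_geo))
```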

  14. Large-scale seismic waveform quality metric calculation using Hadoop

    DOE PAGES

    Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; ...

    2016-05-27

    Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.

  15. Large-scale seismic waveform quality metric calculation using Hadoop

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.

    Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.

  16. Trends of pesticides and nitrate in ground water of the Central Columbia Plateau, Washington, 1993-2003

    USGS Publications Warehouse

    Frans, L.

    2008-01-01

    Pesticide and nitrate data for ground water sampled in the Central Columbia Plateau, Washington, between 1993 and 2003 by the U.S. Geological Survey National Water-Quality Assessment Program were evaluated for trends in concentration. A total of 72 wells were sampled in 1993-1995 and again in 2002-2003 in three well networks that targeted row crop and orchard land use settings as well as the regional basalt aquifer. The Regional Kendall trend test indicated that only deethylatrazine (DEA) concentrations showed a significant trend. Deethylatrazine concentrations were found to increase beneath the row crop land use well network, the regional aquifer well network, and for the dataset as a whole. No other pesticides showed a significant trend (nor did nitrate) in the 72-well dataset. Despite the lack of a trend in nitrate concentrations within the National Water-Quality Assessment dataset, previous work has found a statistically significant decrease in nitrate concentrations from 1998-2002 for wells with nitrate concentrations above 10 mg L-1 within the Columbia Basin ground water management area, which is located within the National Water-Quality Assessment study unit boundary. The increasing trend in DEA concentrations was found to negatively correlate with soil hydrologic group using logistic regression and with soil hydrologic group and drainage class using Spearman's correlation. The decreasing trend in high nitrate concentrations was found to positively correlate with the depth to which the well was cased using logistic regression, to positively correlate with nitrate application rates and sand content of the soil, and to negatively correlate with soil hydrologic group using Spearman's correlation. Copyright © 2008 by the American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America. All rights reserved.
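
    The Regional Kendall idea used above can be sketched as follows: the Mann-Kendall S statistic is computed per well and summed across the network, with the variances summed as well. Corrections for ties and serial correlation, and the exact implementation applied to the Columbia Plateau data, are omitted; the concentrations below are synthetic.

```python
# Sketch: a simplified Regional Kendall trend test across several wells.
import numpy as np
from scipy.stats import norm

def mann_kendall_s(values):
    values = np.asarray(values)
    s = sum(np.sign(values[j] - values[i])
            for i in range(len(values) - 1) for j in range(i + 1, len(values)))
    var = len(values) * (len(values) - 1) * (2 * len(values) + 5) / 18.0
    return s, var

def regional_kendall(series_per_well):
    s_tot = var_tot = 0.0
    for series in series_per_well:
        s, var = mann_kendall_s(series)
        s_tot += s
        var_tot += var
    z = (s_tot - np.sign(s_tot)) / np.sqrt(var_tot)     # continuity-corrected
    return s_tot, 2 * (1 - norm.cdf(abs(z)))            # regional S and two-sided p-value

# toy example: concentrations at three wells across repeated sampling campaigns
wells = [[0.02, 0.03, 0.05], [0.01, 0.04, 0.04], [0.00, 0.02, 0.03]]
print(regional_kendall(wells))
```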

  17. Evaluation of Model Recognition for Grammar-Based Automatic 3d Building Model Reconstruction

    NASA Astrophysics Data System (ADS)

    Yu, Qian; Helmholz, Petra; Belton, David

    2016-06-01

    In recent years, 3D city models have been in high demand by many public and private organisations, and steadily growing capabilities in both quality and quantity are further increasing this demand. The quality evaluation of these 3D models is a relevant issue both from the scientific and practical points of view. In this paper, we present a method for the quality evaluation of 3D building models which are reconstructed automatically from terrestrial laser scanning (TLS) data based on an attributed building grammar. The entire evaluation process has been performed in all three dimensions in terms of completeness and correctness of the reconstruction. Six quality measures are introduced and applied to four datasets of reconstructed building models in order to describe the quality of the automatic reconstruction, and their validity is also assessed from the evaluation point of view.

  18. Quantitative Prediction of Beef Quality Using Visible and NIR Spectroscopy with Large Data Samples Under Industry Conditions

    NASA Astrophysics Data System (ADS)

    Qiao, T.; Ren, J.; Craigie, C.; Zabalza, J.; Maltin, Ch.; Marshall, S.

    2015-03-01

    It is well known that the eating quality of beef has a significant influence on the repurchase behavior of consumers. There are several key factors that affect the perception of quality, including color, tenderness, juiciness, and flavor. To support consumer repurchase choices, there is a need for an objective measurement of quality that could be applied to meat prior to its sale. Objective approaches such as those offered by spectral technologies may be useful, but the analytical algorithms used remain to be optimized. For visible and near infrared (VISNIR) spectroscopy, Partial Least Squares Regression (PLSR) is a widely used technique for meat-related quality modeling and prediction. In this paper, a Support Vector Machine (SVM) based machine learning approach is presented to predict beef eating quality traits. Although SVM has been successfully used in various disciplines, it has not been applied extensively to the analysis of meat quality parameters. To this end, the performance of PLSR and SVM as tools for the analysis of meat tenderness is evaluated, using a large dataset acquired under industrial conditions. The spectral dataset was collected using VISNIR spectroscopy with wavelengths ranging from 350 to 1800 nm on 234 beef M. longissimus thoracis steaks from heifers, steers, and young bulls. As the dimensionality of the VISNIR data is very high (over 1600 spectral bands), the Principal Component Analysis (PCA) technique was applied for feature extraction and data reduction. The extracted principal components (fewer than 100) were then used for data modeling and prediction. The prediction results showed that SVM has a greater potential to predict beef eating quality than PLSR, especially for the prediction of tenderness. The influence of animal gender on beef quality prediction was also investigated, and it was found that beef quality traits were predicted most accurately in beef from young bulls.
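
    A sketch of the modelling chain described above: PCA compresses the >1600-band VISNIR spectra to fewer than 100 components, which then feed a support vector model for a quality trait such as tenderness. The spectra and trait values below are synthetic, and the preprocessing used in the study (e.g. scatter correction, outlier removal) is not reproduced.

```python
# Sketch: PCA feature reduction followed by support vector regression of a quality trait.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.normal(size=(234, 1600))                          # 234 steaks x 1600 spectral bands
y = X[:, :50].mean(axis=1) + 0.1 * rng.normal(size=234)   # surrogate tenderness score

model = make_pipeline(StandardScaler(), PCA(n_components=60), SVR(kernel="rbf", C=10.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```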

  19. LIME: 3D visualisation and interpretation of virtual geoscience models

    NASA Astrophysics Data System (ADS)

    Buckley, Simon; Ringdal, Kari; Dolva, Benjamin; Naumann, Nicole; Kurz, Tobias

    2017-04-01

    Three-dimensional and photorealistic acquisition of surface topography, using methods such as laser scanning and photogrammetry, has become widespread across the geosciences over the last decade. With recent innovations in photogrammetric processing software, robust and automated data capture hardware, and novel sensor platforms, including unmanned aerial vehicles, obtaining 3D representations of exposed topography has never been easier. In addition to 3D datasets, fusion of surface geometry with imaging sensors, such as multi/hyperspectral, thermal and ground-based InSAR, and geophysical methods, create novel and highly visual datasets that provide a fundamental spatial framework to address open geoscience research questions. Although data capture and processing routines are becoming well-established and widely reported in the scientific literature, challenges remain related to the analysis, co-visualisation and presentation of 3D photorealistic models, especially for new users (e.g. students and scientists new to geomatics methods). Interpretation and measurement is essential for quantitative analysis of 3D datasets, and qualitative methods are valuable for presentation purposes, for planning and in education. Motivated by this background, the current contribution presents LIME, a lightweight and high performance 3D software for interpreting and co-visualising 3D models and related image data in geoscience applications. The software focuses on novel data integration and visualisation of 3D topography with image sources such as hyperspectral imagery, logs and interpretation panels, geophysical datasets and georeferenced maps and images. High quality visual output can be generated for dissemination purposes, to aid researchers with communication of their research results. The background of the software is described and case studies from outcrop geology, in hyperspectral mineral mapping and geophysical-geospatial data integration are used to showcase the novel methods developed.

  20. Integrated synoptic surveys of the hydrodynamics and water-quality distributions in two Lake Michigan rivermouth mixing zones using an autonomous underwater vehicle and a manned boat

    USGS Publications Warehouse

    Jackson, P. Ryan; Reneau, Paul C.

    2014-01-01

    The U.S. Geological Survey (USGS), in cooperation with the National Monitoring Network for U.S. Coastal Waters and Tributaries, launched a pilot project in 2010 to determine the value of integrated synoptic surveys of rivermouths using autonomous underwater vehicle technology in response to a call for rivermouth research, which includes study domains that envelop both the fluvial and lacustrine boundaries of the rivermouth mixing zone. The pilot project was implemented at two Lake Michigan rivermouths with largely different scales, hydrodynamics, and settings, but employing primarily the same survey techniques and methods. The Milwaukee River Estuary Area of Concern (AOC) survey included measurements in the lower 2 to 3 miles of the Milwaukee, Menomonee, and Kinnickinnic Rivers and inner and outer Milwaukee Harbor. This estuary is situated in downtown Milwaukee, Wisconsin, and is the most populated basin that flows directly into Lake Michigan. In contrast, the Manitowoc rivermouth has a relatively small harbor separating the rivermouth from Lake Michigan, and the Manitowoc River Watershed is primarily agricultural. Both the Milwaukee and Manitowoc rivermouths are unregulated and allow free exchange of water with Lake Michigan. This pilot study of the Milwaukee River Estuary and Manitowoc rivermouth using an autonomous underwater vehicle (AUV) paired with a manned survey boat resulted in high spatial and temporal resolution datasets of basic water-quality parameter distributions and hydrodynamics. The AUV performed well in these environments and was found primarily well-suited for harbor and nearshore surveys of three-dimensional water-quality distributions. Both case studies revealed that the use of a manned boat equipped with an acoustic Doppler current profiler (ADCP) and multiparameter sonde (and an optional flow-through water-quality sampling system) was the best option for riverine surveys. To ensure that the most accurate and highest resolution velocity data were collected concurrently with the AUV surveys, the pilot study used a manned boat equipped with an ADCP. Combining the AUV and manned boat datasets resulted in datasets that are essentially continuous from the fluvial through the lacustrine zones of a rivermouth. Whereas the pilot studies were completed during low flows on the tributaries, completion of surveys at higher flows using the same techniques is possible, but the use of the AUV would be limited to areas with relatively low velocities (less than 2 feet per second) such as the harbors and nearshore zones of Lake Michigan. Overall, this pilot study aimed at evaluation of AUV technology for integrated synoptic surveys of rivermouth mixing zones was successful, and the techniques and methods employed in this pilot study should be transferrable to other sites with similar success. The use of the AUV provided significant time savings compared to traditional sampling techniques. For example, the survey of outer Milwaukee Harbor using the AUV required less than 7 hours for approximately 600 profiles compared to the 150 hours it would have taken using traditional methods in a manned boat (a 95 percent reduction in man-hours). The integrated datasets resulting from the AUV and manned survey boat are of high value and present a picture of the mixing and hydrodynamics of these highly dynamic, highly variable rivermouth mixing zones from the relatively well-mixed fluvial environment through the rivermouth to the stratified lacustrine receiving body of Lake Michigan. 
Such datasets not only allow researchers to understand more about the physical processes occurring in these rivermouths, but they provide high spatial resolution data required for interpretation of relations between disparate point samples and calibration and validation of numerical models.

  1. Precise positioning with sparse radio tracking: How LRO-LOLA and GRAIL enable future lunar exploration

    NASA Astrophysics Data System (ADS)

    Mazarico, E.; Goossens, S. J.; Barker, M. K.; Neumann, G. A.; Zuber, M. T.; Smith, D. E.

    2017-12-01

    Two recent NASA missions to the Moon, the Lunar Reconnaissance Orbiter (LRO) and the Gravity Recovery and Interior Laboratory (GRAIL), have obtained highly accurate information about the lunar shape and gravity field. These global geodetic datasets resolve long-standing issues with mission planning; the tidal lock of the Moon long prevented collection of accurate gravity measurements over the farside and degraded the precise positioning of topographic data. We describe key datasets and results from the LRO and GRAIL missions that are directly relevant to future lunar missions. SmallSat and CubeSat missions especially would benefit from these recent improvements, as they are typically more resource-constrained. Even with limited radio tracking data, accurate knowledge of topography and gravity enables precise orbit determination (OD) (e.g., limiting the scope of geolocation and co-registration tasks) and long-term predictions of altitude (e.g., dramatically reducing uncertainties in impact time). With one S-band tracking pass per day, LRO OD now routinely achieves total position knowledge better than 10 meters and radial position knowledge around 0.5 meter. Other tracking data, such as Laser Ranging from Earth-based SLR stations, can further support OD. We also show how altimetry can be used to substantially improve orbit reconstruction with the accurate topographic maps now available from Lunar Orbiter Laser Altimeter (LOLA) data. We present new results with SELENE extended mission and LRO orbits processed with direct altimetry measurements. With even a simple laser altimeter onboard, high-quality OD can be achieved for future missions because of the datasets acquired by LRO and GRAIL, without the need for regular radio contact. Onboard processing of altimetric ranges would bring high-quality real-time position knowledge to support autonomous operation. We also describe why optical ranging transponders are ideal payloads for future lunar missions, as they can address both communication and navigation needs with minimal resources.

  2. Comparing Emission Inventories and Model-Ready Emission Datasets between Europe and North America for the AQMEII Project

    EPA Science Inventory

    This paper highlights the similarities and differences in how emission inventories and datasets were developed and processed across North America and Europe for the Air Quality Model Evaluation International Initiative (AQMEII) project and then characterizes the emissions for the...

  3. Assessing Metadata Quality of a Federally Sponsored Health Data Repository.

    PubMed

    Marc, David T; Beattie, James; Herasevich, Vitaly; Gatewood, Laël; Zhang, Rui

    2016-01-01

    The U.S. Federal Government developed HealthData.gov to disseminate healthcare datasets to the public. Metadata is provided for each dataset and is the sole source of information to find and retrieve data. This study employed automated quality assessments of the HealthData.gov metadata published from 2012 to 2014 to measure completeness, accuracy, and consistency of applying standards. The results demonstrated that metadata published in earlier years had lower completeness, accuracy, and consistency. Also, metadata that underwent modifications following their original creation were of higher quality. HealthData.gov did not uniformly apply the Dublin Core Metadata Initiative, a widely accepted metadata standard, to its metadata. These findings suggested that the HealthData.gov metadata suffered from quality issues, particularly related to information that wasn't frequently updated. The results supported the need for policies to standardize metadata and contributed to the development of automated measures of metadata quality.

  4. Assessing Metadata Quality of a Federally Sponsored Health Data Repository

    PubMed Central

    Marc, David T.; Beattie, James; Herasevich, Vitaly; Gatewood, Laël; Zhang, Rui

    2016-01-01

    The U.S. Federal Government developed HealthData.gov to disseminate healthcare datasets to the public. Metadata is provided for each dataset and is the sole source of information to find and retrieve data. This study employed automated quality assessments of the HealthData.gov metadata published from 2012 to 2014 to measure completeness, accuracy, and consistency of applying standards. The results demonstrated that metadata published in earlier years had lower completeness, accuracy, and consistency. Also, metadata that underwent modifications following their original creation were of higher quality. HealthData.gov did not uniformly apply the Dublin Core Metadata Initiative, a widely accepted metadata standard, to its metadata. These findings suggested that the HealthData.gov metadata suffered from quality issues, particularly related to information that wasn’t frequently updated. The results supported the need for policies to standardize metadata and contributed to the development of automated measures of metadata quality. PMID:28269883

  5. American Society for Enhanced Recovery (ASER) and Perioperative Quality Initiative (POQI) joint consensus statement on measurement to maintain and improve quality of enhanced recovery pathways for elective colorectal surgery.

    PubMed

    Moonesinghe, S Ramani; Grocott, Michael P W; Bennett-Guerrero, Elliott; Bergamaschi, Roberto; Gottumukkala, Vijaya; Hopkins, Thomas J; McCluskey, Stuart; Gan, Tong J; Mythen, Michael Monty G; Shaw, Andrew D; Miller, Timothy E

    2017-01-01

    This article sets out a framework for measurement of quality of care relevant to enhanced recovery pathways (ERPs) in elective colorectal surgery. The proposed framework is based on established measurement systems and/or theories, and provides an overview of the different approaches for improving clinical monitoring, and enhancing quality improvement or research in varied settings with different levels of available resources. Using a structure-process-outcome framework, we make recommendations for three hierarchical tiers of data collection. Core, Quality Improvement, and Best Practice datasets are proposed. The suggested datasets incorporate patient data to describe case-mix, process measures to describe delivery of enhanced recovery and clinical outcomes. The fundamental importance of routine collection of data for the initiation, maintenance, and enhancement of enhanced recovery pathways is emphasized.

  6. Assessing microscope image focus quality with deep learning.

    PubMed

    Yang, Samuel J; Berndl, Marc; Michael Ando, D; Barch, Mariya; Narayanaswamy, Arunachalam; Christiansen, Eric; Hoyer, Stephan; Roat, Chris; Hung, Jane; Rueden, Curtis T; Shankar, Asim; Finkbeiner, Steven; Nelson, Philip

    2018-03-15

    Large image datasets acquired on automated microscopes typically have some fraction of low quality, out-of-focus images, despite the use of hardware autofocus systems. Identification of these images using automated image analysis with high accuracy is important for obtaining a clean, unbiased image dataset. Complicating this task is the fact that image focus quality is only well-defined in foreground regions of images, and as a result, most previous approaches only enable a computation of the relative difference in quality between two or more images, rather than an absolute measure of quality. We present a deep neural network model capable of predicting an absolute measure of image focus on a single image in isolation, without any user-specified parameters. The model operates at the image-patch level, and also outputs a measure of prediction certainty, enabling interpretable predictions. The model was trained on only 384 in-focus Hoechst (nuclei) stain images of U2OS cells, which were synthetically defocused to one of 11 absolute defocus levels during training. The trained model can generalize on previously unseen real Hoechst stain images, identifying the absolute image focus to within one defocus level (approximately 3 pixel blur diameter difference) with 95% accuracy. On a simpler binary in/out-of-focus classification task, the trained model outperforms previous approaches on both Hoechst and Phalloidin (actin) stain images (F-scores of 0.89 and 0.86, respectively over 0.84 and 0.83), despite only having been presented Hoechst stain images during training. Lastly, we observe qualitatively that the model generalizes to two additional stains, Hoechst and Tubulin, of an unseen cell type (Human MCF-7) acquired on a different instrument. Our deep neural network enables classification of out-of-focus microscope images with both higher accuracy and greater precision than previous approaches via interpretable patch-level focus and certainty predictions. The use of synthetically defocused images precludes the need for a manually annotated training dataset. The model also generalizes to different image and cell types. The framework for model training and image prediction is available as a free software library and the pre-trained model is available for immediate use in Fiji (ImageJ) and CellProfiler.
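
    The patch-level scheme described above can be illustrated with a small classifier: an 11-class defocus-level network applied to non-overlapping patches, with image-level focus taken as the average over patch predictions and "certainty" as the mean maximum class probability. The architecture, patch size and aggregation rule below are illustrative assumptions, not the published model, and the network shown is untrained.

```python
# Sketch: patch-level defocus classification with image-level aggregation.
import numpy as np
import tensorflow as tf

def build_patch_model(patch=84, n_levels=11):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(patch, patch, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(n_levels, activation="softmax"),
    ])

def image_focus(model, image, patch=84):
    """Split a 2-D image into non-overlapping patches and aggregate patch predictions."""
    h, w = image.shape
    patches = [image[i:i + patch, j:j + patch, None]
               for i in range(0, h - patch + 1, patch)
               for j in range(0, w - patch + 1, patch)]
    probs = model.predict(np.stack(patches), verbose=0)
    return probs.argmax(axis=1).mean(), probs.max(axis=1).mean()  # focus level, certainty

model = build_patch_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# training on synthetically defocused in-focus images would go here
level, certainty = image_focus(model, np.random.rand(504, 504).astype("float32"))
print(level, certainty)
```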

  7. Near real-time qualitative monitoring of lake water chlorophyll globally using GoogleEarth Engine

    NASA Astrophysics Data System (ADS)

    Zlinszky, András; Supan, Peter; Koma, Zsófia

    2017-04-01

    Monitoring ocean chlorophyll and suspended sediment has been made possible using optical satellite imaging, and has contributed immensely to our understanding of the Earth and its climate. However, lake water quality monitoring has limitations due to the optical complexity of shallow, sediment- and organic matter-laden waters. Meanwhile, timely and detailed information on basic lake water quality parameters would be essential for sustainable management of inland waters. Satellite-based remote sensing can deliver area-covering, high-resolution maps of basic lake water quality parameters, but scientific application of these datasets for lake monitoring has been hindered by limitations to calibration and accuracy evaluation, and therefore access to such data has been the privilege of scientific users. Nevertheless, since for many inland waters satellite imaging is the only source of monitoring data, we believe it is urgent to make map products of chlorophyll and suspended sediment concentrations available to a wide range of users. Even if absolute accuracy cannot be validated, patterns, processes and qualitative information delivered by such datasets in near-real time can act as an early warning system, raise awareness of water quality processes and serve education, in addition to complementing local monitoring activities. By making these datasets openly available on the internet through an easy-to-use framework, dialogue between stakeholders, management and governance authorities can be facilitated. We use GoogleEarth Engine to access and process archive and current satellite data. GoogleEarth Engine is a development and visualization framework that provides access to satellite datasets and processing capacity for analysis at the Petabyte scale. Based on earlier investigations, we chose the fluorescence line height (FLH) index to represent water chlorophyll concentration. This index relies on the chlorophyll fluorescence peak at 680 nm, and has been tested for open ocean as well as inland lake situations for MODIS and MERIS satellite sensor data. In addition to being relatively robust and less sensitive to atmospheric influence, this algorithm is also very simple, being based on the height of the 680 nm peak above the linear interpolation of the two neighbouring bands. However, not all satellite datasets suitable for FLH are catalogued for GoogleEarth Engine. In the current testing phase, Landsat 7, Landsat 8 (30 m resolution), and Sentinel 2 (20 m) are being tested. Landsat 7 has a suitable band configuration, but suffers from a striping error due to a sensor problem. Landsat 8 and Sentinel 2 lack a single spectral band optimal for FLH. Sentinel 3 would be an optimal data source and has shown good performance during small-scale initial tests, but is not distributed globally for GoogleEarth Engine. In addition to FLH data from these satellites, our system delivers cloud and ice masking, qualitative suspended sediment data (based on the band closest to 600 nm) and true colour images, all within an easy-to-use Google Maps background. This allows on-demand understanding and interpretation of water quality patterns and processes in near real time. While the system is still under development, we believe it could significantly contribute to lake water quality management and monitoring worldwide.
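
    The index described above is simple enough to write out directly: the radiance (or reflectance) in the band nearest 680 nm minus the value of the straight line joining the two neighbouring bands at that wavelength. The example wavelengths below follow the MERIS/OLCI heritage band set; for other sensors the nearest available bands would be substituted, and the per-pixel values are synthetic.

```python
# Sketch: fluorescence line height (FLH) above the linear baseline between flanking bands.
import numpy as np

def flh(l_left, l_peak, l_right, w_left=665.0, w_peak=681.25, w_right=708.75):
    """Height of the ~680 nm band above the line interpolated between the flanking bands."""
    baseline = l_left + (l_right - l_left) * (w_peak - w_left) / (w_right - w_left)
    return l_peak - baseline

# toy per-pixel arrays (e.g. water-leaving reflectance in the three bands)
left  = np.array([0.012, 0.015, 0.011])
peak  = np.array([0.016, 0.022, 0.012])
right = np.array([0.010, 0.013, 0.010])
print(flh(left, peak, right))   # higher values indicate stronger chlorophyll fluorescence
```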

  8. Recent Advances in WRF Modeling for Air Quality Applications

    EPA Science Inventory

    The USEPA uses WRF in conjunction with the Community Multiscale Air Quality (CMAQ) for air quality regulation and research. Over the years we have added physics options and geophysical datasets to the WRF system to enhance model capabilities especially for extended retrospective...

  9. Meraculous2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2014-06-01

    meraculous2 is a whole genome shotgun assembler for short reads that is capable of assembling large, polymorphic genomes with modest computational requirements. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. Additional features include (1) handling of allelic variation using "bubble" structures within the deBruijn graph, (2) gap closing of repetitive and low quality regions using localized assemblies, and (3) an improved scaffolding algorithm that produces more complete assemblies without compromising on scaffolding accuracy.
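
    A toy version of the conservative extension rule described above: a k-mer is extended only while exactly one outgoing base is observed in the reads, so ambiguous (branching) k-mers stop the walk. Quality filtering, reverse complements and the bubble handling of meraculous2 are all omitted; the reads are made up for illustration.

```python
# Sketch: extend a seed k-mer only through unique extensions in the k-mer graph.
from collections import defaultdict

def build_extensions(reads, k):
    ext = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            ext[read[i:i + k]].add(read[i + k])      # base following this k-mer
    return ext

def extend_unique(seed, ext):
    k = len(seed)
    contig = seed
    while True:
        nexts = ext.get(contig[-k:], set())
        if len(nexts) != 1:                          # stop at branches or dead ends
            return contig
        contig += next(iter(nexts))

reads = ["ACGTACGTGA", "CGTACGTGAT", "GTACGTGATC"]
ext = build_extensions(reads, k=5)
print(extend_unique("ACGTA", ext))                   # walks out to ACGTACGTGATC
```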

  10. Updated population metadata for United States historical climatology network stations

    USGS Publications Warehouse

    Owen, T.W.; Gallo, K.P.

    2000-01-01

    The United States Historical Climatology Network (HCN) serial temperature dataset comprises 1221 high-quality, long-term climate observing stations. The HCN dataset is available in several versions, one of which includes population-based temperature modifications to adjust urban temperatures for the "heat-island" effect. Unfortunately, the decennial population metadata file is not complete, as missing values are present for 17.6% of the 12,210 population values associated with the 1221 individual stations during the 1900-90 interval. Retrospective grid-based populations, within a fixed distance of each HCN station, were estimated through the use of a gridded population density dataset and historically available U.S. Census county data. The grid-based populations for the HCN stations provide values derived from a consistent methodology, unlike the current HCN populations, which can vary as definitions of the area associated with a city change over time. The use of grid-based populations is, at a minimum, appropriate to augment populations for HCN climate stations that lack any population data, and is recommended when consistent and complete population data are required. The recommended urban temperature adjustments based on the HCN and grid-based methods of estimating station population can be significantly different for individual stations within the HCN dataset.

  11. Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts.

    PubMed

    Dashtban, M; Balafar, Mohammadali

    2017-03-01

    Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers keeps this issue challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of the feature space, followed by an integer-coded genetic algorithm with dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors, including convergence trends, mutation and crossover rate changes, and running time, were studied, conceptually discussed, and shown to be coherent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined considering their similarities, the quality of selected genes, and their influence on the evolutionary approach. Several statistical tests concerning choice of classifier, choice of dataset, and choice of filter method were performed, and they revealed some significant differences between the performance of different classifiers and filter methods over the datasets. The proposed method was benchmarked on five popular high-dimensional cancer datasets; for each, the top explored genes were reported. Comparing the experimental results with several state-of-the-art methods revealed that the proposed method outperforms previous methods on the DLBCL dataset. Copyright © 2017 Elsevier Inc. All rights reserved.
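
    One of the filter methods examined above, the Fisher score, can be computed in a few lines of NumPy. The sketch below is a generic illustration on toy data with a hypothetical top-k cut-off, not the authors' code.

    ```python
    import numpy as np

    def fisher_scores(X, y):
        """Fisher score per gene (column of X): variance of the class means around
        the overall mean, divided by the pooled within-class variance."""
        X, y = np.asarray(X, float), np.asarray(y)
        overall = X.mean(axis=0)
        num = np.zeros(X.shape[1])
        den = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            num += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
            den += len(Xc) * Xc.var(axis=0)
        return num / (den + 1e-12)

    # Rank genes and keep the top k before running the evolutionary search.
    X = np.random.rand(40, 500)            # 40 samples, 500 genes (toy data)
    y = np.random.randint(0, 2, 40)
    top_genes = np.argsort(fisher_scores(X, y))[::-1][:50]
    print(top_genes[:10])
    ```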

  12. Helioseismic inferences of the solar cycles 23 and 24: GOLF and VIRGO observations

    NASA Astrophysics Data System (ADS)

    Salabert, D.; García, R. A.; Jiménez, A.

    2014-12-01

    The Sun-as-a-star helioseismic spectrophotometer GOLF and photometer VIRGO instruments onboard the SoHO spacecraft have been collecting high-quality, continuous data since April 1996. Here we analyze these unique datasets in order to investigate the peculiar and weak ongoing solar cycle 24. As cycle 24 reaches its maximum, we compare its rising phase with the rising phase of the previous solar cycle 23.

  13. Measuring Quality of Healthcare Outcomes in Type 2 Diabetes from Routine Data: a Seven-nation Survey Conducted by the IMIA Primary Health Care Working Group.

    PubMed

    Hinton, W; Liyanage, H; McGovern, A; Liaw, S-T; Kuziemsky, C; Munro, N; de Lusignan, S

    2017-08-01

    Background: The Institute of Medicine framework defines six dimensions of quality for healthcare systems: (1) safety, (2) effectiveness, (3) patient centeredness, (4) timeliness of care, (5) efficiency, and (6) equity. Large health datasets provide an opportunity to assess quality in these areas. Objective: To perform an international comparison of the measurability of the delivery of these aims, in people with type 2 diabetes mellitus (T2DM), from large datasets. Method: We conducted a survey to assess the healthcare outcomes data quality of existing databases and disseminated it through professional networks. We examined the data sources used to collect the data, the frequency of data uploads, and the data types used for identifying people with T2DM. We compared data completeness across the six areas of healthcare quality, using selected measures pertinent to T2DM management. Results: We received 14 responses from seven countries (Australia, Canada, Italy, the Netherlands, Norway, Portugal, Turkey and the UK). Most databases reported frequent data uploads and would be capable of near real-time analysis of healthcare quality. The majority of recorded data related to safety (particularly medication adverse events) and treatment efficacy (glycaemic control and microvascular disease). Data potentially measuring equity were less well recorded. Recording levels were lowest for patient-centred care, timeliness of care, and system efficiency, with the majority of databases containing no data in these areas. Databases using primary care sources had higher data quality across all areas measured. Conclusion: Data quality could be improved, particularly in the areas of patient-centred care, timeliness, and efficiency. Primary care derived datasets may be most suited to healthcare quality assessment. Georg Thieme Verlag KG Stuttgart.

  14. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  15. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  16. Avulsion research using flume experiments and highly accurate and temporal-rich SfM datasets

    NASA Astrophysics Data System (ADS)

    Javernick, L.; Bertoldi, W.; Vitti, A.

    2017-12-01

    SfM's ability to produce high-quality, large-scale digital elevation models (DEMs) of complicated and rapidly evolving systems has made it a valuable technique for low-budget researchers and practitioners. While SfM has provided valuable datasets that capture single-flood-event DEMs, there is an increasing scientific need to capture higher temporal resolution datasets that can quantify evolutionary processes instead of pre- and post-flood snapshots. However, the dangerous field conditions during flood events and image matching challenges (e.g. wind, rain) prevent quality SfM image acquisition. Conversely, flume experiments offer opportunities to document flood events, but achieving DEMs that are consistent and accurate enough to detect subtle changes in dry and inundated areas remains a challenge for SfM (e.g. parabolic error signatures). This research aimed at investigating the impact of naturally occurring and manipulated avulsions on braided river morphology and on the encroachment of floodplain vegetation, using laboratory experiments. This required DEMs with millimeter accuracy and precision, at a temporal resolution sufficient to capture the processes; SfM was chosen as it offered the most practical method. Through redundant local network design and a meticulous ground control point (GCP) survey with a Leica Total Station in red laser configuration (reported 2 mm accuracy), the SfM residual errors compared to separate ground-truthing data produced mean errors of 1.5 mm (accuracy) and standard deviations of 1.4 mm (precision), without parabolic error signatures. Lighting conditions in the flume were limited to uniform, oblique, and filtered LED strips, which removed glint and thus improved bed elevation mean errors to 4 mm; errors were further reduced by means of open-source software for refraction correction. The obtained datasets have provided the ability to quantify how small flood events with avulsion can have morphologic and vegetation impacts similar to those of large flood events without avulsion. Further, this research highlights the potential application of SfM in the laboratory and its ability to document physical and biological processes at greater spatial and temporal resolution. Marie Sklodowska-Curie Individual Fellowship: River-HMV, 656917

  17. The Wide-Field Imaging Interferometry Testbed: Enabling Techniques for High Angular Resolution Astronomy

    NASA Technical Reports Server (NTRS)

    Rinehart, S. A.; Armstrong, T.; Frey, Bradley J.; Jung, J.; Kirk, J.; Leisawitz, David T.; Leviton, Douglas B.; Lyon, R.; Maher, Stephen; Martino, Anthony J.

    2007-01-01

    The Wide-Field Imaging Interferometry Testbed (WIIT) was designed to develop techniques for wide-field-of-view imaging interferometry, using "double-Fourier" methods. These techniques will be important for a wide range of future space-based interferometry missions. We have already provided simple demonstrations of the methodology, and continuing development of the testbed will lead to higher data rates, improved data quality, and refined algorithms for image reconstruction. At present, the testbed effort includes five lines of development: automation of the testbed, operation in an improved environment, acquisition of large high-quality datasets, development of image reconstruction algorithms, and analytical modeling of the testbed. We discuss the progress made towards the first four of these goals; the analytical modeling is discussed in a separate paper within this conference.

  18. Lessons learned and way forward from 6 years of Aerosol_cci

    NASA Astrophysics Data System (ADS)

    Popp, Thomas; de Leeuw, Gerrit; Pinnock, Simon

    2017-04-01

    Within the ESA Climate Change Initiative (CCI), Aerosol_cci (2010-2017) has conducted intensive work to improve and qualify algorithms for the retrieval of aerosol information from European sensors. Meanwhile, several validated (multi-)decadal time series of different aerosol parameters from complementary sensors are available: Aerosol Optical Depth (AOD), stratospheric extinction profiles, a qualitative Absorbing Aerosol Index (AAI), fine-mode AOD, and mineral dust AOD; absorption information and aerosol layer height are in an evaluation phase, and the multi-pixel GRASP algorithm for the POLDER instrument is used for selected regions. Validation (vs. AERONET, MAN) and inter-comparison with other satellite datasets (MODIS, MISR, SeaWiFS) proved the high quality of the available datasets, comparable to other satellite retrievals, and revealed needs for algorithm improvement (for example for higher AOD values) which were taken into account in an iterative evolution cycle. The datasets contain pixel-level uncertainty estimates which were also validated and improved in the reprocessing. The use of an ensemble method was tested, where several algorithms are applied to the same sensor. The presentation will summarize and discuss the lessons learned from the 6 years of intensive collaboration and highlight major achievements (significantly improved AOD quality, fine-mode AOD, dust AOD, pixel-level uncertainties, ensemble approach); limitations and remaining deficits will also be discussed. An outlook will discuss the way forward for continuous algorithm improvement and re-processing, together with opportunities for time series extension with successor instruments of the Sentinel family and the complementarity of the different satellite aerosol products.

  19. Enhancing the spatial coverage of a regional high-quality hydraulic conductivity dataset with estimates made from domestic water-well specific-capacity tests

    NASA Astrophysics Data System (ADS)

    Priebe, Elizabeth H.; Neville, C. J.; Rudolph, D. L.

    2018-03-01

    The spatial coverage of hydraulic conductivity (K) values for large-scale groundwater investigations is often poor because of the high costs associated with hydraulic testing and the large areas under investigation. Domestic water wells are ubiquitous, and their well logs represent an untapped resource of information that includes mandatory specific-capacity tests, from which K can be estimated. These specific-capacity tests are routinely conducted at such low pumping rates that well losses are normally insignificant. In this study, a simple and practical approach to augmenting high-quality K values with reconnaissance-level K values from water-well specific-capacity tests is assessed. The integration of lesser-quality K values from specific-capacity tests with a high-quality K dataset is assessed through comparisons at two different scales: study-area-wide (a 600-km2 area in Ontario, Canada) and in a single geological formation within a portion of the broader study area (200 km2). Results of the comparisons demonstrate that reconnaissance-level K estimates from specific-capacity tests approximate the ranges and distributions of the high-quality K values. Sufficient detail about the physical basis and assumptions invoked in the development of the approach is presented here so that it can be applied with confidence by practitioners seeking to enhance their spatial coverage of K values with specific-capacity tests.
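
    The abstract does not state the exact relation used, but a common way to turn a specific-capacity value (Q/s) into a reconnaissance-level transmissivity, and hence K, is to solve the Cooper-Jacob relation iteratively, as in the hedged sketch below. The well radius, storativity, test duration and aquifer thickness are illustrative assumptions, not values from the study.

    ```python
    import math

    def transmissivity_from_specific_capacity(q_over_s, t=86400.0, rw=0.075,
                                              S=1e-4, T0=1e-3, iters=50):
        """Iteratively solve Q/s = 4*pi*T / ln(2.25*T*t / (rw**2 * S)) for the
        transmissivity T [m^2/s], given a specific capacity Q/s [m^2/s]."""
        T = T0
        for _ in range(iters):
            T = q_over_s * math.log(2.25 * T * t / (rw ** 2 * S)) / (4.0 * math.pi)
        return T

    q_over_s = 2.0e-4                          # e.g. 2e-4 m^2/s from a well record
    T = transmissivity_from_specific_capacity(q_over_s)
    K = T / 10.0                               # divide by an assumed thickness b = 10 m
    print(T, K)
    ```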

  20. Sequence Data for Clostridium autoethanogenum using Three Generations of Sequencing Technologies

    DOE PAGES

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Bruno-Barcena, José M.; ...

    2015-04-14

    During the past decade, DNA sequencing output has been mostly dominated by second-generation sequencing platforms, which are characterized by low cost, high throughput and shorter read lengths (for example, Illumina). The emergence and development of so-called third-generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high-quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches, and will encourage interest in the development of innovative experimental and computational methods for NGS data.

  1. Meraculous: De Novo Genome Assembly with Short Paired-End Reads

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chapman, Jarrod A.; Ho, Isaac; Sunkara, Sirisha

    2011-08-18

    We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ~280 bp or ~3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.

  2. A data discovery index for the social sciences

    PubMed Central

    Krämer, Thomas; Klas, Claus-Peter; Hausstein, Brigitte

    2018-01-01

    This paper describes a novel search index for social and economic research data, one that enables users to search up-to-date references for data holdings in these disciplines. The index can be used for comparative analysis of the publication of datasets in different areas of social science. The core of the index is the da|ra registration agency's database for social and economic data, which contains high-quality searchable metadata from registered data publishers. Metadata records for research data are harvested from data providers around the world and included in the index. In this paper, we describe the currently available indices on social science datasets and their shortcomings. Next, we describe the motivation behind and the purpose of the data discovery index as a dedicated and curated platform for finding social science research data, and gesisDataSearch, its user interface. Further, we explain the harvesting, filtering and indexing procedure and give usage instructions for the dataset index. Lastly, we show that the index is currently the most comprehensive and most accessible collection of social science data descriptions available. PMID:29633988

  3. False alarm reduction in BSN-based cardiac monitoring using signal quality and activity type information.

    PubMed

    Tanantong, Tanatorn; Nantajeewarawat, Ekawit; Thiemjarus, Surapa

    2015-02-09

    False alarms in cardiac monitoring affect the quality of medical care, impacting both patients and healthcare providers. In continuous cardiac monitoring using wireless Body Sensor Networks (BSNs), the quality of ECG signals can deteriorate owing to several factors, e.g., noise, low battery power, and network transmission problems, often resulting in high false alarm rates. In addition, body movements arising from activities of daily living (ADLs) can also create false alarms. This paper presents a two-phase framework for false arrhythmia alarm reduction in continuous cardiac monitoring, using signals from an ECG sensor and a 3D accelerometer. In the first phase, classification models constructed using machine learning algorithms are used for labeling input signals: ECG signals are labeled with heartbeat types and signal quality levels, while 3D acceleration signals are labeled with ADL types. In the second phase, a rule-based expert system is used for combining the classification results in order to determine whether arrhythmia alarms should be accepted or suppressed. The proposed framework was validated on datasets acquired using BSNs and the MIT-BIH arrhythmia database. For the BSN dataset, acceleration and ECG signals were collected from 10 young and 10 elderly subjects while they were performing ADLs. The framework reduced the false alarm rate from 9.58% to 1.43% in our experimental study, showing that it can potentially assist physicians in interpreting the vast amounts of data acquired from wireless sensors and enhance the performance of continuous cardiac monitoring.
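
    A minimal sketch of the second-phase idea, combining first-phase labels with simple rules to accept or suppress an alarm, is shown below. The label names and rules are hypothetical placeholders, not the paper's actual expert system.

    ```python
    def decide_alarm(beat_label, signal_quality, activity):
        """Toy rule base: combine heartbeat type, ECG signal quality and ADL type
        (all produced by first-phase classifiers) into an accept/suppress decision."""
        if signal_quality == "poor":
            return "suppress"                      # unreliable ECG segment
        if activity in {"walking", "stairs"} and beat_label == "artifact-like":
            return "suppress"                      # movement-induced distortion
        if beat_label in {"ventricular", "asystole"}:
            return "accept"                        # clinically critical beats
        return "accept"

    print(decide_alarm("ventricular", "poor", "resting"))   # -> suppress
    print(decide_alarm("ventricular", "good", "resting"))   # -> accept
    ```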

  4. Spatial and temporal air quality pattern recognition using environmetric techniques: a case study in Malaysia.

    PubMed

    Syed Abdul Mutalib, Sharifah Norsukhairin; Juahir, Hafizan; Azid, Azman; Mohd Sharif, Sharifah; Latif, Mohd Talib; Aris, Ahmad Zaharin; Zain, Sharifuddin M; Dominick, Doreena

    2013-09-01

    The objective of this study is to identify spatial and temporal patterns in the air quality at three selected Malaysian air monitoring stations based on an eleven-year database (January 2000-December 2010). Four statistical methods, Discriminant Analysis (DA), Hierarchical Agglomerative Cluster Analysis (HACA), Principal Component Analysis (PCA) and Artificial Neural Networks (ANNs), were selected to analyze the datasets of five air quality parameters, namely SO2, NO2, O3, CO and particulate matter with a diameter below 10 μm (PM10). The three selected air monitoring stations share the characteristic of being located in highly urbanized areas and are surrounded by a number of industries. The DA results show that spatial characterization allows successful discrimination between the three stations, while HACA shows the temporal pattern from the monthly and yearly factor analysis, which correlates with severe haze episodes that have occurred in this country at certain periods of time. The PCA results show that the major source of air pollution is mostly the combustion of fossil fuel in motor vehicles and industrial activities. The spatial pattern recognition (S-ANN) results show a better prediction performance in discriminating between the regions, with an excellent percentage of correct classification compared to DA. This study demonstrates the necessity and usefulness of environmetric techniques for the interpretation of large datasets, aiming to obtain better information about air quality patterns based on spatial and temporal characterization at the selected air monitoring stations.
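
    As an illustration of one of the environmetric steps above, the sketch below runs a PCA on a standardised pollutant matrix with scikit-learn; the data are randomly generated stand-ins for the five monitored parameters, not the study's measurements.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Rows are monthly observations; columns stand in for SO2, NO2, O3, CO, PM10.
    X = np.random.lognormal(mean=1.0, sigma=0.4, size=(120, 5))

    pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
    print(pca.explained_variance_ratio_)   # share of variance per component
    print(pca.components_)                 # loadings hint at common pollution sources
    ```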

  5. Enhancing e-waste estimates: improving data quality by multivariate Input-Output Analysis.

    PubMed

    Wang, Feng; Huisman, Jaco; Stevels, Ab; Baldé, Cornelis Peter

    2013-11-01

    Waste electrical and electronic equipment (or e-waste) is one of the fastest growing waste streams, and it encompasses a wide and increasing spectrum of products. Accurate estimation of e-waste generation is difficult, mainly due to a lack of high-quality data on market and socio-economic dynamics. This paper addresses how to enhance e-waste estimates by providing techniques to increase data quality. An advanced, flexible and multivariate Input-Output Analysis (IOA) method is proposed. It links all three pillars in IOA (product sales, stock and lifespan profiles) to construct mathematical relationships between various data points. By applying this method, the data consolidation steps can generate more accurate time-series datasets from the available data pool. This can consequently increase the reliability of e-waste estimates compared to the approach without data processing. A case study in the Netherlands is used to apply the advanced IOA model. As a result, for the first time ever, complete datasets of all three variables for estimating all types of e-waste have been obtained. The results of this study also demonstrate significant disparity between various estimation models, arising from the use of data under different conditions. This shows the importance of applying a multivariate approach and multiple sources to improve data quality for modelling, specifically using appropriate time-varying lifespan parameters. Following the case study, a roadmap with a procedural guideline is provided to enhance e-waste estimation studies. Copyright © 2013 Elsevier Ltd. All rights reserved.
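
    A minimal sketch of the sales-lifespan idea underlying Input-Output Analysis is shown below: e-waste generated in a given year is past sales weighted by a discard-probability (lifespan) profile. The numbers are toy values, and the paper's full multivariate consolidation across sales, stock and time-varying lifespans is not reproduced.

    ```python
    import numpy as np

    def ewaste_generated(sales, lifespan_pdf):
        """Units discarded in year t = sum over ages of sales(t - age) * P(discard at age)."""
        waste = np.zeros(len(sales))
        for t in range(len(sales)):
            for age, p in enumerate(lifespan_pdf):
                if t - age >= 0:
                    waste[t] += sales[t - age] * p
        return waste

    sales = np.array([100, 120, 150, 160, 170, 180], float)   # units put on market per year
    lifespan_pdf = np.array([0.05, 0.15, 0.30, 0.30, 0.20])    # discard probability by age
    print(ewaste_generated(sales, lifespan_pdf))
    ```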

  6. Improving Low-Dose Blood-Brain Barrier Permeability Quantification Using Sparse High-Dose Induced Prior for Patlak Model

    PubMed Central

    Fang, Ruogu; Karlsson, Kolbeinn; Chen, Tsuhan; Sanelli, Pina C.

    2014-01-01

    Blood-brain-barrier permeability (BBBP) measurements extracted from perfusion computed tomography (PCT) using the Patlak model can be a valuable indicator for predicting hemorrhagic transformation in patients with acute stroke. Unfortunately, standard Patlak-model-based PCT requires excessive radiation exposure, which has raised radiation safety concerns. Minimizing the radiation dose is of high value in clinical practice but can degrade image quality due to the severe noise introduced. The purpose of this work is to construct high-quality BBBP maps from low-dose PCT data by using the brain structural similarity between different individuals and the relations between the high- and low-dose maps. The proposed sparse high-dose induced (shd-Patlak) model works by building a high-dose induced prior for the Patlak model with a set of location-adaptive dictionaries, followed by an optimized estimation of the BBBP map with the prior-regularized Patlak model. Evaluation with simulated low-dose clinical brain PCT datasets clearly demonstrates that the shd-Patlak model can achieve more significant gains than the standard Patlak model, with improved visual quality, higher fidelity to the gold standard and more accurate details for clinical analysis. PMID:24200529
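
    For reference, the standard Patlak model mentioned above reduces to a linear regression: the tissue-to-arterial concentration ratio is regressed against the running integral of the arterial curve divided by the arterial curve, and the slope approximates the permeability. The sketch below shows that baseline fit on synthetic curves; it does not include the sparse high-dose induced prior proposed in the paper.

    ```python
    import numpy as np

    def patlak_fit(t, c_art, c_tis):
        """Patlak plot: regress C_tis/C_art against cumulative integral of C_art
        divided by C_art; slope ~ permeability, intercept ~ blood volume term."""
        auc = np.concatenate(([0.0],
                              np.cumsum(0.5 * (c_art[1:] + c_art[:-1]) * np.diff(t))))
        x = auc / c_art
        y = c_tis / c_art
        slope, intercept = np.polyfit(x, y, 1)
        return slope, intercept

    t = np.linspace(0, 60, 61)
    c_art = np.exp(-t / 30.0) + 0.1                      # synthetic arterial input
    c_tis = 0.002 * np.cumsum(c_art) + 0.05 * c_art      # synthetic tissue curve
    print(patlak_fit(t, c_art, c_tis))                   # slope close to 0.002
    ```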

  7. High Quality Data for Grid Integration Studies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Clifton, Andrew; Draxl, Caroline; Sengupta, Manajit

    As variable renewable power penetration levels increase in power systems worldwide, renewable integration studies are crucial to ensure continued economic and reliable operation of the power grid. The existing electric grid infrastructure in the US in particular poses significant limitations on wind power expansion. In this presentation we will shed light on requirements for grid integration studies as far as wind and solar energy are concerned. Because wind and solar plants are strongly impacted by weather, high-resolution and high-quality weather data are required to drive power system simulations. Future datasets will have to push the limits of numerical weather prediction to yield these high-resolution datasets, and wind data will have to be time-synchronized with solar data. Current wind and solar integration datasets are presented. The Wind Integration National Dataset (WIND) Toolkit is the largest and most complete grid integration dataset publicly available to date. A meteorological dataset, wind power production time series, and simulated forecasts created using the Weather Research and Forecasting Model run on a 2-km grid over the continental United States at a 5-min resolution are now publicly available for more than 126,000 land-based and offshore wind power production sites. The National Solar Radiation Database (NSRDB) is a similar high temporal- and spatial-resolution database of 18 years of solar resource data for North America and India. The need for high-resolution weather data pushes modeling towards finer scales and closer synchronization. We also present how we anticipate such datasets developing in the future, their benefits, and the challenges of using and disseminating such large amounts of data.

  8. Low-cost oblique illumination: an image quality assessment.

    PubMed

    Ruiz-Santaquiteria, Jesus; Espinosa-Aranda, Jose Luis; Deniz, Oscar; Sanchez, Carlos; Borrego-Ramos, Maria; Blanco, Saul; Cristobal, Gabriel; Bueno, Gloria

    2018-01-01

    We study the effectiveness of several low-cost oblique illumination filters to improve overall image quality, in comparison with standard bright-field imaging. For this purpose, a dataset composed of 3360 diatom images belonging to 21 taxa was acquired. Subjective and objective image quality assessments were performed. The subjective evaluation was carried out by a group of diatom experts using a psychophysical test in which resolution, focus, and contrast were assessed. Moreover, several objective no-reference image quality metrics were applied to the same image dataset to complete the study, together with the calculation of several texture features to analyze the effect of these filters in terms of textural properties. Both image quality evaluation methods, subjective and objective, showed better results for images acquired using these illumination filters in comparison with the unfiltered images. These promising results confirm that this kind of illumination filter can be a practical way to improve image quality, thanks to the simplicity and low cost of the design and manufacturing process. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
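
    One widely used no-reference focus metric, the variance of the Laplacian, is sketched below as an example of the kind of objective measure mentioned above. The paper's exact metrics are not specified here, so this is an assumption-laden illustration on a synthetic image.

    ```python
    import numpy as np
    from scipy.ndimage import laplace, gaussian_filter

    def sharpness(image):
        """Simple no-reference focus metric: variance of the Laplacian response
        (higher values indicate sharper, better-focused images)."""
        return laplace(image.astype(float)).var()

    img = np.random.rand(256, 256)   # stand-in for a micrograph
    # A blurred copy of the same image scores lower on the metric.
    print(sharpness(img), sharpness(gaussian_filter(img, sigma=2)))
    ```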

  9. Maelstrom Research guidelines for rigorous retrospective data harmonization

    PubMed Central

    Fortier, Isabel; Raina, Parminder; Van den Heuvel, Edwin R; Griffith, Lauren E; Craig, Camille; Saliba, Matilda; Doiron, Dany; Stolk, Ronald P; Knoppers, Bartha M; Ferretti, Vincent; Granda, Peter; Burton, Paul

    2017-01-01

    Background: It is widely accepted and acknowledged that data harmonization is crucial: in its absence, the co-analysis of major tranches of high quality extant data is liable to inefficiency or error. However, despite its widespread practice, no formalized/systematic guidelines exist to ensure high quality retrospective data harmonization. Methods: To better understand real-world harmonization practices and facilitate development of formal guidelines, three interrelated initiatives were undertaken between 2006 and 2015. They included a phone survey with 34 major international research initiatives, a series of workshops with experts, and case studies applying the proposed guidelines. Results: A wide range of projects use retrospective harmonization to support their research activities but even when appropriate approaches are used, the terminologies, procedures, technologies and methods adopted vary markedly. The generic guidelines outlined in this article delineate the essentials required and describe an interdependent step-by-step approach to harmonization: 0) define the research question, objectives and protocol; 1) assemble pre-existing knowledge and select studies; 2) define targeted variables and evaluate harmonization potential; 3) process data; 4) estimate quality of the harmonized dataset(s) generated; and 5) disseminate and preserve final harmonization products. Conclusions: This manuscript provides guidelines aiming to encourage rigorous and effective approaches to harmonization which are comprehensively and transparently documented and straightforward to interpret and implement. This can be seen as a key step towards implementing guiding principles analogous to those that are well recognised as being essential in securing the foundational underpinning of systematic reviews and the meta-analysis of clinical trials. PMID:27272186

  10. PRISM Climate Group, Oregon State U

    Science.gov Websites

    The PRISM Climate Group gathers climate observations from a wide range of monitoring networks, applies sophisticated quality control measures, and develops spatial climate datasets to reveal short- and long-term climate patterns. The resulting datasets incorporate a variety of modeling

  11. The experience of linking Victorian emergency medical service trauma data

    PubMed Central

    Boyle, Malcolm J

    2008-01-01

    Background The linking of a large Emergency Medical Service (EMS) dataset with the Victorian Department of Human Services (DHS) hospital datasets and the Victorian State Trauma Outcome Registry and Monitoring (VSTORM) dataset to determine patient outcomes has not previously been undertaken in Victoria. The objective of this study was to identify the linkage rate of a large EMS trauma dataset with the Department of Human Services hospital datasets and the VSTORM dataset. Methods The linking of an EMS trauma dataset to the hospital datasets utilised deterministic and probabilistic matching. The linking of three EMS trauma datasets to the VSTORM dataset utilised deterministic, probabilistic and manual matching. Results Of the patients in the EMS dataset, 66.7% were located in the VEMD. Ninety-six percent of patients defined in the VEMD as being admitted to hospital were located in the VAED. A further 3.7% of patients located in the VAED could not be found in the VEMD because some hospitals do not report to the VEMD. For the EMS datasets, there was a 146% increase in successful links with the trauma profile dataset, a 221% increase in successful links with the mechanism-of-injury-only dataset, and a 46% increase with the sudden deterioration dataset, when linking to VSTORM using manual rather than deterministic matching. Conclusion This study has demonstrated that EMS data can be successfully linked to other health-related datasets using deterministic and probabilistic matching, with varying levels of success. The quality of EMS data needs to be improved to ensure better linkage success rates with other health-related datasets. PMID:19014622
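
    A toy sketch of a deterministic pass followed by a very crude probabilistic pass is shown below. The field names and scoring threshold are hypothetical, and a real linkage study would use blocking and proper m/u agreement weights rather than a simple agreement count.

    ```python
    import pandas as pd

    ems = pd.DataFrame({"dob": ["1970-01-01", "1985-06-30"],
                        "sex": ["M", "F"],
                        "event_date": ["2006-03-02", "2006-04-11"]})
    hosp = pd.DataFrame({"dob": ["1970-01-01", "1985-06-30"],
                         "sex": ["M", "F"],
                         "event_date": ["2006-03-02", "2006-04-12"]})

    # Deterministic pass: exact agreement on all key fields.
    deterministic = ems.merge(hosp, on=["dob", "sex", "event_date"], how="inner")

    # Crude probabilistic pass: count agreeing fields and keep pairs above a threshold.
    def score(a, b):
        return sum([a.dob == b.dob, a.sex == b.sex, a.event_date == b.event_date])

    links = [(i, j) for i, a in ems.iterrows() for j, b in hosp.iterrows()
             if score(a, b) >= 2]
    print(len(deterministic), links)
    ```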

  12. Improvement of Disease Prediction and Modeling through the Use of Meteorological Ensembles: Human Plague in Uganda

    PubMed Central

    Moore, Sean M.; Monaghan, Andrew; Griffith, Kevin S.; Apangu, Titus; Mead, Paul S.; Eisen, Rebecca J.

    2012-01-01

    Climate and weather influence the occurrence, distribution, and incidence of infectious diseases, particularly those caused by vector-borne or zoonotic pathogens. Thus, models based on meteorological data have helped predict when and where human cases are most likely to occur. Such knowledge aids in targeting limited prevention and control resources and may ultimately reduce the burden of disease. Paradoxically, localities where such models could yield the greatest benefits, such as tropical regions where morbidity and mortality caused by vector-borne diseases are greatest, often lack high-quality in situ local meteorological data. Satellite- and model-based gridded climate datasets can be used to approximate local meteorological conditions in data-sparse regions; however, their accuracy varies. Here we investigate how the selection of a particular dataset can influence the outcomes of disease forecasting models. Our model system focuses on plague (Yersinia pestis infection) in the West Nile region of Uganda. The majority of recent human cases have been reported from East Africa and Madagascar, where meteorological observations are sparse and topography yields complex weather patterns. Using an ensemble of meteorological datasets and model-averaging techniques, we find that the number of suspected cases in the West Nile region was negatively associated with dry season rainfall (December-February) and positively associated with rainfall prior to the plague season. We demonstrate that ensembles of available meteorological datasets can be used to quantify climatic uncertainty and minimize its impacts on infectious disease models. These methods are particularly valuable in regions with sparse observational networks and high morbidity and mortality from vector-borne diseases. PMID:23024750

  13. Integrating multiple immunogenetic data sources for feature extraction and mining somatic hypermutation patterns: the case of "towards analysis" in chronic lymphocytic leukaemia.

    PubMed

    Kavakiotis, Ioannis; Xochelli, Aliki; Agathangelidis, Andreas; Tsoumakas, Grigorios; Maglaveras, Nicos; Stamatopoulos, Kostas; Hadzidimitriou, Anastasia; Vlahavas, Ioannis; Chouvarda, Ioanna

    2016-06-06

    Somatic Hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of Immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostic markers for clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high-quality datasets for SHM analysis in IG receptors of CLL patients. This dataset is used as the basis for a higher-level integration procedure, inspired by social choice theory. This is applied in the Towards Analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families, through SHM. The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards Analysis, performed on the integrated dataset using voting techniques, revealed the distinct behaviour of subset #201 compared to other subsets as regards SHM-related movements among gene clans, both in allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage of movement towards pseudogenes was found in all CLL subsets. This data integration and feature extraction process can set the basis for exploratory analysis or a fully automated computational data mining approach to many as yet unanswered, clinically relevant biological questions.

  14. Analysis of lidar elevation data for improved identification and delineation of lands vulnerable to sea-level rise

    USGS Publications Warehouse

    Gesch, Dean B.

    2009-01-01

    The importance of sea-level rise in shaping coastal landscapes is well recognized within the earth science community, but as with many natural hazards, communicating the risks associated with sea-level rise remains a challenge. Topography is a key parameter that influences many of the processes involved in coastal change, and thus up-to-date, high-resolution, high-accuracy elevation data are required to model the coastal environment. Maps of areas subject to potential inundation have great utility for planners and managers concerned with the effects of sea-level rise. However, most of the maps produced to date are simplistic representations derived from older, coarse elevation data. In the last several years, vast amounts of high-quality elevation data derived from lidar have become available. Because of their high vertical accuracy and spatial resolution, these lidar data are an excellent source of up-to-date information from which to improve identification and delineation of vulnerable lands. Four elevation datasets of varying resolution and accuracy were processed to demonstrate that the improved quality of lidar data leads to more precise delineation of coastal lands vulnerable to inundation. A key component of the comparison was to calculate and account for the vertical uncertainty of the elevation datasets. This comparison shows that lidar allows for a much more detailed delineation of the potential inundation zone when compared to other types of elevation models. It also shows how the certainty of the delineation of lands vulnerable to a given sea-level rise scenario is much improved when derived from higher-resolution lidar data.
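
    A minimal sketch of the delineation step with vertical uncertainty taken into account: cells are flagged as potentially inundated if their elevation lies at or below the sea-level rise scenario plus the DEM error at roughly 95% confidence (1.96 x RMSE). The grid and error value are illustrative, and hydrologic connectivity to the ocean is not checked here.

    ```python
    import numpy as np

    def inundation_mask(dem, slr, rmse, z=1.96):
        """Flag cells vulnerable to a sea-level rise scenario `slr` (metres),
        padding the threshold by the DEM's vertical uncertainty (z * RMSE)."""
        threshold = slr + z * rmse
        return dem <= threshold

    dem = np.array([[0.2, 0.8, 1.5],
                    [0.4, 1.1, 2.3]])        # lidar-derived elevations (m), toy grid
    print(inundation_mask(dem, slr=1.0, rmse=0.15))
    ```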

  15. 3D reconstruction from multi-view VHR-satellite images in MicMac

    NASA Astrophysics Data System (ADS)

    Rupnik, Ewelina; Pierrot-Deseilligny, Marc; Delorme, Arthur

    2018-05-01

    This work addresses the generation of high-quality digital surface models by fusing multiple depth maps calculated with dense image matching. The algorithm is adapted to very high resolution multi-view satellite images, and the main contributions of this work are in the multi-view fusion. The algorithm is insensitive to outliers, takes into account the matching quality indicators, handles non-correlated zones (e.g. occlusions), and is solved with a multi-directional dynamic programming approach. No geometric constraints (e.g. surface planarity) or auxiliary data in the form of ground control points are required for its operation. Prior to the fusion procedure, the RPC geolocation parameters of all images are improved in a bundle block adjustment routine. The performance of the algorithm is evaluated on two VHR (very high resolution) satellite image datasets (Pléiades, WorldView-3), revealing its good performance in reconstructing non-textured areas, repetitive patterns, and surface discontinuities.

  16. Argumentation Based Joint Learning: A Novel Ensemble Learning Approach

    PubMed Central

    Xu, Junyi; Yao, Li; Li, Le

    2015-01-01

    Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high-performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation based joint learning, high-quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high-quality knowledge for the ensemble classifier and improve the performance of classification. PMID:25966359

  17. Severe European winters in a secular perspective

    NASA Astrophysics Data System (ADS)

    Hoy, Andreas; Hänsel, Stephanie

    2017-04-01

    Temperature conditions during the winter season are substantially shaped by strong year-to-year variability. European winters since the late 1980s - compared to previous decades and centuries - were mainly characterised by a high temperature level, including recent record-warm winters. Yet comparably cold winters and severe cold spells still occur nowadays, as recently observed from 2009 to 2013 and in early 2017. In 2010, Central England experienced its second-coldest December since the start of observations more than 350 years ago, and some of the lowest temperatures ever measured in northern Europe (below -50 °C in Lapland) were recorded in January 1999. Analysing the thermal characteristics and spatial distribution of severe (historical) winters - using early instrumental data - helps expand and consolidate our knowledge of past weather extremes. This contribution presents efforts in this direction. We focus on a) compiling and assessing a very long-term, spatially widespread and well-distributed, high-quality instrumental meteorological dataset to b) investigate very cold winter temperatures in Europe from the earliest measurements until today. In a first step, we analyse the longest available time series of monthly temperature averages within Europe. Our dataset extends from the Nordic countries to the Mediterranean and from the British Isles to Russia. We utilise homogenised time series as far as possible in order to ensure reliable results. Homogenised data derive from the NORDHOM (Scandinavia) and HISTALP (greater alpine region) datasets or were obtained from national weather services and universities. Other (not specifically homogenised) data were derived from the ECA&D dataset or national institutions. The employed time series often start already in the 18th century, with Paris and Central England being the longest datasets (from 1659). In a second step, daily temperature averages are included. Only some of those series are homogenised, but those available are sufficiently distributed throughout Europe to ensure reliable results. Furthermore, the comparatively dense network of long-term observations allows appropriate quality checking within the network, and the large collection of homogenised monthly data enables assessing the quality of many daily series. Daily data are used to sum up negative values for the respective winter periods to create time series of "cold summations", which are a good indicator of the severity of winters in most parts of Europe. Additionally, days below certain thresholds may be counted or summed up. Future work will include daily minimum and maximum temperatures, allowing an extensive set of climate indices to be calculated and applied, refining the work presented here.
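
    The "cold summation" index described above is straightforward to compute from daily mean temperatures. The sketch below assigns December to the following winter so that each DJF season is summed together; the temperature series here is synthetic.

    ```python
    import numpy as np
    import pandas as pd

    def cold_sums(daily_mean, winter_months=(12, 1, 2)):
        """Sum of negative daily mean temperatures per winter (DJF); December is
        assigned to the following year so each winter season is summed together."""
        s = daily_mean[daily_mean.index.month.isin(winter_months)]
        winter_year = s.index.year + (s.index.month == 12).astype(int)
        return s.clip(upper=0.0).groupby(winter_year).sum()

    idx = pd.date_range("2009-12-01", "2010-02-28", freq="D")
    temps = pd.Series(np.random.normal(-3, 5, len(idx)), index=idx)  # synthetic winter
    print(cold_sums(temps))
    ```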

  18. GTN-G, WGI, RGI, DCW, GLIMS, WGMS, GCOS - What's all this about? (Invited)

    NASA Astrophysics Data System (ADS)

    Paul, F.; Raup, B. H.; Zemp, M.

    2013-12-01

    In a large collaborative effort, the glaciological community has compiled a new and spatially complete global dataset of glacier outlines, the so-called Randolph Glacier Inventory or RGI. Despite its regional shortcomings in quality (e.g. in regard to geolocation, generalization, and interpretation), this dataset was heavily used for global-scale modelling applications (e.g. determination of total glacier volume and glacier contribution to sea-level rise) in support of the forthcoming 5th Assessment Report (AR5) of Working Group I of the IPCC. The RGI is a merged dataset that is largely based on the GLIMS database and several new datasets provided by the community (both are mostly derived from satellite data), as well as the Digital Chart of the World (DCW) and glacier attribute information (location, size) from the World Glacier Inventory (WGI). There are now two key tasks to be performed: (1) improving the quality of the RGI in all regions where the outlines do not meet the quality required for local-scale applications, and (2) integrating the RGI into the GLIMS glacier database to improve its spatial completeness. While (1) again requires a huge effort and is already ongoing, (2) is mainly a technical issue that is nearly solved. Apart from this technical dimension, there is also a more political or structural one. While GLIMS is responsible for the remote sensing and glacier inventory part (Tier 5) of the Global Terrestrial Network for Glaciers (GTN-G) within the Global Climate Observing System (GCOS), the World Glacier Monitoring Service (WGMS) is collecting and disseminating the field observations. Given new global products derived from satellite data (e.g. elevation changes and velocity fields) and the community's wish to keep a snapshot dataset such as the RGI available, how to make all these datasets available to the community without duplicating efforts, while making best use of the very limited financial resources, must now be discussed. This overview presentation describes the currently available datasets, clarifies the terminology and the international framework, and suggests a way forward to best serve the community.

  19. Water quality of arctic rivers in Finnish Lapland.

    PubMed

    Niemi, Jorma

    2010-02-01

    The water quality monitoring data of eight rivers situated in Finnish Lapland above the Arctic Circle were investigated. These rivers are icebound annually for about 200 days. They belong to the International River Basin District, founded according to the European Union Water Framework Directive and shared with Norway. They are part of the European river monitoring network that includes some 3,400 river sites. The water quality monitoring datasets available varied between the rivers, the longest comprising the period 1975-2003 and the shortest 1989-2003. For each river, annual medians of eight water quality variables were calculated. In addition, medians and 5th and 95th percentiles were calculated for the whole observation periods. The medians indicated good river water quality in comparison to other national or foreign rivers. However, the river water quality varied widely. Some rivers were in practice in a pristine state, whereas some showed slight human impacts, e.g., occasional high values of hygienic indicator bacteria.

  20. Tracing the influence of land-use change on water quality and coral reefs using a Bayesian model.

    PubMed

    Brown, Christopher J; Jupiter, Stacy D; Albert, Simon; Klein, Carissa J; Mangubhai, Sangeeta; Maina, Joseph M; Mumby, Peter; Olley, Jon; Stewart-Koster, Ben; Tulloch, Vivitskaia; Wenger, Amelia

    2017-07-06

    Coastal ecosystems can be degraded by poor water quality. Tracing the causes of poor water quality back to land-use change is necessary to target catchment management for coastal zone management. However, existing models for tracing the sources of pollution require extensive datasets, which are not available for many of the world's coral reef regions that may have severe water quality issues. Here we develop a hierarchical Bayesian model that uses freely available satellite data to infer the connection between land uses in catchments and water clarity in coastal oceans. We apply the model to estimate the influence of land-use change on water clarity in Fiji. We tested the model's predictions against underwater surveys, finding that predictions of poor water quality are consistent with observations of high siltation and low coverage of sediment-sensitive coral genera. The model thus provides a means to link land-use change to declines in coastal water quality.

  1. Guaranteeing the quality and integrity of pork - An Australian case study.

    PubMed

    Channon, H A; D'Souza, D N; Jarrett, R G; Lee, G S H; Watling, R J; Jolley, J Y C; Dunshea, F R

    2018-04-27

    The Australian pork industry is strongly committed to assuring the integrity of its product, with substantial research investment made over the past ten years to develop and implement systems to assure the consistency and quality of fresh pork and to enable accurate tracing of unpackaged fresh pork back to property of origin using trace elemental profiling. These initiatives are pivotal to allow Australian pork of guaranteed eating quality to be successfully positioned as higher value products, across a range of international and domestic markets, whilst managing any threats of product substitution. This paper describes the current status of the development of a predictive eating quality model for Australian pork, utilizing eating quality datasets generated from recent Australian studies. The implementation of trace elemental profiling, by Physi-Trace™, to verify and defend provenance claims and support the supply of consistently high eating quality Australian pork to its customers, is also discussed. Copyright © 2018 Elsevier Ltd. All rights reserved.

  2. A Tool for Creating Regionally Calibrated High-Resolution Land Cover Data Sets for the West African Sahel: Using Machine Learning to Scale Up Hand-Classified Maps in a Data-Sparse Environment

    NASA Astrophysics Data System (ADS)

    Van Gordon, M.; Van Gordon, S.; Min, A.; Sullivan, J.; Weiner, Z.; Tappan, G. G.

    2017-12-01

    Using support vector machine (SVM) learning and high-accuracy hand-classified maps, we have developed a publicly available land cover classification tool for the West African Sahel. Our classifier produces high-resolution, regionally calibrated land cover maps for the Sahel, representing a significant contribution to the data available for this region. Global land cover products are unreliable for the Sahel, and accurate land cover data for the region are sparse. To address this gap, the U.S. Geological Survey and the Regional Center for Agriculture, Hydrology and Meteorology (AGRHYMET) in Niger produced high-quality land cover maps for the region via hand-classification of Landsat images. This method produces highly accurate maps, but the time and labor required constrain the spatial and temporal resolution of the data products. By using these hand-classified maps alongside SVM techniques, we successfully increase the resolution of the land cover maps by 1-2 orders of magnitude, from 2 km decadal resolution to 30 m annual resolution. These high-resolution, regionally calibrated land cover datasets, along with the classifier we developed to produce them, lay the foundation for major advances in studies of land surface processes in the region. These datasets will provide more accurate inputs for food security modeling, hydrologic modeling, analyses of land cover change, and climate change adaptation efforts. The land cover classification tool we have developed will be publicly available for use in creating additional West Africa land cover datasets with future remote sensing data and can be adapted for use in other parts of the world.
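
    A generic sketch of the SVM classification step is shown below using scikit-learn; the features, class labels and hyperparameters are placeholders rather than the tool's actual configuration or training data.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Toy stand-in: rows are pixels, columns are Landsat surface-reflectance bands,
    # labels come from hand-classified maps resampled onto the 30 m pixels.
    X = np.random.rand(2000, 6)
    y = np.random.randint(0, 5, 2000)          # 5 hypothetical land cover classes

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_tr, y_tr)
    print(accuracy_score(y_te, clf.predict(X_te)))
    ```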

  3. GlyQ-IQ: Glycomics Quintavariate-Informed Quantification with High-Performance Computing and GlycoGrid 4D Visualization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kronewitter, Scott R.; Slysz, Gordon W.; Marginean, Ioan

    2014-05-31

    Dense LC-MS datasets have convoluted extracted ion chromatograms with multiple chromatographic peaks that cloud the differentiation between intact compounds with their overlapping isotopic distributions, peaks due to in-source ion fragmentation, and noise. Making this differentiation is critical in glycomics datasets because chromatographic peaks correspond to different intact glycan structural isomers. GlyQ-IQ is targeted, chromatography-centric software designed for chromatogram and mass spectral data processing and subsequent glycan composition annotation. The targeted analysis approach offers several key advantages over traditional algorithms for LC-MS data processing and annotation. A priori information about an individual target's elemental composition allows for exact isotope profile modeling for improved feature detection and increased sensitivity by focusing chromatogram generation and peak fitting on the isotopic species in the distribution having the highest intensity and data quality. Glycan target annotation is corroborated by glycan family relationships and in-source fragmentation detection. The GlyQ-IQ software is developed in this work (Part 1) and was used to profile N-glycan compositions from human serum LC-MS datasets. The companion manuscript, GlyQ-IQ Part 2, discusses developments in human serum N-glycan sample preparation, glycan isomer separation, and glycan electrospray ionization. A case study is presented to demonstrate how GlyQ-IQ identifies and removes confounding chromatographic peaks from high-mannose glycan isomers from human blood serum. In addition, GlyQ-IQ was used to generate a broad N-glycan profile from a high resolution (100K/60K) nESI-LC-MS/MS dataset including CID and HCD fragmentation acquired on a Velos Pro mass spectrometer. 101 glycan compositions and 353 isomer peaks were detected from a single sample. 99% of the GlyQ-IQ glycan-feature assignments passed manual validation and are backed by high-resolution mass spectra and mass accuracies of less than 7 ppm.

  4. Culture and behaviour in the English National Health Service: overview of lessons from a large multimethod study

    PubMed Central

    Dixon-Woods, Mary; Baker, Richard; Charles, Kathryn; Dawson, Jeremy; Jerzembek, Gabi; Martin, Graham; McCarthy, Imelda; McKee, Lorna; Minion, Joel; Ozieranski, Piotr; Willars, Janet; Wilkie, Patricia; West, Michael

    2014-01-01

    Background: Problems of quality and safety persist in health systems worldwide. We conducted a large research programme to examine culture and behaviour in the English National Health Service (NHS). Methods: Mixed-methods study involving collection and triangulation of data from multiple sources, including interviews, surveys, ethnographic case studies, board minutes and publicly available datasets. We narratively synthesised data across the studies to produce a holistic picture and in this paper present a high-level summary. Results: We found an almost universal desire to provide the best quality of care. We identified many ‘bright spots’ of excellent caring and practice and high-quality innovation across the NHS, but also considerable inconsistency. Consistent achievement of high-quality care was challenged by unclear goals, overlapping priorities that distracted attention, and compliance-oriented bureaucratised management. The institutional and regulatory environment was populated by multiple external bodies serving different but overlapping functions. Some organisations found it difficult to obtain valid insights into the quality of the care they provided. Poor organisational and information systems sometimes left staff struggling to deliver care effectively and disempowered them from initiating improvement. Good staff support and management were also highly variable, though they were fundamental to culture and were directly related to patient experience, safety and quality of care. Conclusions: Our results highlight the importance of clear, challenging goals for high-quality care. Organisations need to put the patient at the centre of all they do, get smart intelligence, focus on improving organisational systems, and nurture caring cultures by ensuring that staff feel valued, respected, engaged and supported. PMID:24019507

  5. First archaeointensity catalogue and intensity secular variation curve for Iberia spanning the last 3000 years

    NASA Astrophysics Data System (ADS)

    Molina-Cardín, Alberto; Campuzano, Saioa A.; Rivero, Mercedes; Osete, María Luisa; Gómez-Paccard, Miriam; Pérez-Fuentes, José Carlos; Pavón-Carrasco, F. Javier; Chauvin, Annick; Palencia-Ortas, Alicia

    2017-04-01

    In this work we present the first archaeomagnetic intensity database for the Iberian Peninsula covering the last 3 millennia. In addition to previously published archaeointensities (about 100 data points), we present twenty new high-quality archaeointensities. The new data have been obtained following the Thellier and Thellier method including pTRM-checks and have been corrected for the effect of the anisotropy of thermoremanent magnetization upon archaeointensity estimates. Importantly, about 50% of the new data obtained correspond to the first millennium BC, a period for which it was not previously possible to develop an intensity palaeosecular variation curve due to the lack of high-quality archaeointensity data. The different qualities of the data included in the Iberian dataset have been evaluated following different palaeomagnetic criteria, such as the number of specimens analysed, the laboratory protocol applied and the kind of material analysed. Finally, we present the first intensity palaeosecular variation curve for the Iberian Peninsula centred at Madrid for the last 3000 years. In order to obtain the most reliable secular variation curve, it has been generated using only selected high-quality data from the catalogue.

  6. Toward a complete dataset of drug-drug interaction information from publicly available sources.

    PubMed

    Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D

    2015-06-01

    Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) Corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.

  7. Quality Controlling CMIP datasets at GFDL

    NASA Astrophysics Data System (ADS)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Coupled Model Intercomparison Project (CMIP), its efforts have shifted to testing and, more importantly, to establishing guidelines and protocols for quality controlling and semi-automated data publishing. Every CMIP cycle introduces key challenges, and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before data publishing and before their knowledge makes its way into reports such as the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including Jupyter notebooks, PrePARE, and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system that monitors the status of experiments qualitatively and quantitatively and provides additional metadata and analysis services, along with some in-built controlled-vocabulary validations in the workflow. In addition, we also discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.

  8. Population Consequences of Acoustic Disturbance of Blainville’s Beaked Whales at AUTEC

    DTIC Science & Technology

    2014-09-30

    female, sub-adult male, adult female and adult male (e.g. Ford et al. 2007). The photo-identification dataset consisted of high-quality photographs...the mother and calf to separate, especially at depth. Unlike sperm whale calves that are left at the surface with babysitters while their mothers go...e17009 doi:10.1371/journal.pone.0017009 Whitehead, H. (1996) Babysitting, dive synchrony, and indications of alloparental care in sperm whales

  9. Limited privacy protection and poor sensitivity: Is it time to move on from the statistical linkage key-581?

    PubMed

    Randall, Sean M; Ferrante, Anna M; Boyd, James H; Brown, Adrian P; Semmens, James B

    2016-08-01

    The statistical linkage key (SLK-581) is a common tool for record linkage in Australia, due to its ability to provide some privacy protection. However, newer privacy-preserving approaches may provide greater privacy protection, while allowing high-quality linkage. The aim was to evaluate the standard SLK-581, an encrypted SLK-581 and a newer privacy-preserving approach using Bloom filters, in terms of both privacy and linkage quality. Linkage quality was compared by conducting linkages on Australian health datasets using these three techniques and examining the results. Privacy was compared qualitatively in relation to a series of scenarios where privacy breaches may occur. The Bloom filter technique offered greater privacy protection and linkage quality than the SLK-based method commonly used in Australia. The adoption of new privacy-preserving methods would allow greater confidence in research results while significantly improving privacy protection. © The Author(s) 2016.

  10. Heterogeneous Optimization Framework: Reproducible Preprocessing of Multi-Spectral Clinical MRI for Neuro-Oncology Imaging Research.

    PubMed

    Milchenko, Mikhail; Snyder, Abraham Z; LaMontagne, Pamela; Shimony, Joshua S; Benzinger, Tammie L; Fouke, Sarah Jost; Marcus, Daniel S

    2016-07-01

    Neuroimaging research often relies on clinically acquired magnetic resonance imaging (MRI) datasets that can originate from multiple institutions. Such datasets are characterized by high heterogeneity of modalities and variability of sequence parameters. This heterogeneity complicates the automation of image processing tasks such as spatial co-registration and physiological or functional image analysis. Given this heterogeneity, conventional processing workflows developed for research purposes are not optimal for clinical data. In this work, we describe an approach called Heterogeneous Optimization Framework (HOF) for developing image analysis pipelines that can handle the high degree of clinical data non-uniformity. HOF provides a set of guidelines for configuration, algorithm development, deployment, interpretation of results and quality control for such pipelines. At each step, we illustrate the HOF approach using the implementation of an automated pipeline for Multimodal Glioma Analysis (MGA) as an example. The MGA pipeline computes tissue diffusion characteristics of diffusion tensor imaging (DTI) acquisitions, hemodynamic characteristics using a perfusion model of susceptibility contrast (DSC) MRI, and spatial cross-modal co-registration of available anatomical, physiological and derived patient images. Developing MGA within HOF enabled the processing of neuro-oncology MR imaging studies to be fully automated. MGA has been successfully used to analyze over 160 clinical tumor studies to date within several research projects. Introduction of the MGA pipeline improved image processing throughput and, most importantly, effectively produced co-registered datasets that were suitable for advanced analysis despite high heterogeneity in acquisition protocols.

  11. Adaptation of a Weighted Regression Approach to Evaluate Water Quality Trends in an Estuary

    EPA Science Inventory

    To improve the description of long-term changes in water quality, we adapted a weighted regression approach to analyze a long-term water quality dataset from Tampa Bay, Florida. The weighted regression approach, originally developed to resolve pollutant transport trends in rivers...

  12. Adaptation of a weighted regression approach to evaluate water quality trends in an estuary

    EPA Science Inventory

    To improve the description of long-term changes in water quality, a weighted regression approach developed to describe trends in pollutant transport in rivers was adapted to analyze a long-term water quality dataset from Tampa Bay, Florida. The weighted regression approach allows...

  13. Data Quality Screening Service

    NASA Technical Reports Server (NTRS)

    Strub, Richard; Lynnes, Christopher; Hearty, Thomas; Won, Young-In; Fox, Peter; Zednik, Stephan

    2013-01-01

    A report describes the Data Quality Screening Service (DQSS), which is designed to help automate the filtering of remote sensing data on behalf of science users. Whereas this process often involves much research through quality documents followed by laborious coding, the DQSS is a Web Service that provides data users with data pre-filtered to their particular criteria, while at the same time guiding the user with filtering recommendations of the cognizant data experts. The DQSS design is based on a formal semantic Web ontology that describes data fields and the quality fields for applying quality control within a data product. The accompanying code base handles several remote sensing datasets and quality control schemes for data products stored in Hierarchical Data Format (HDF), a common format for NASA remote sensing data. Together, the ontology and code support a variety of quality control schemes through the implementation of Boolean expressions with simple, reusable conditional expressions as operands. Additional datasets are added to the DQSS simply by registering instances in the ontology if they follow a quality scheme that is already modeled in the ontology. New quality schemes are added by extending the ontology and adding code for each new scheme.
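    The following is a minimal sketch of the kind of screening such a service performs: a Boolean expression built from simple, reusable conditional expressions over quality fields is applied to a data field. The field names, thresholds and use of NumPy are illustrative assumptions, not the DQSS ontology or code base.

```python
# Minimal sketch of quality screening: mask a data field using a Boolean
# combination of simple conditions on its quality fields. Field names and
# thresholds are illustrative, not taken from any specific product.
import numpy as np

data = np.array([290.1, 291.4, 288.7, 293.2, 289.9])   # e.g., retrieved temperatures (K)
qc_flag = np.array([0, 1, 0, 2, 0])                     # 0 = best, higher = worse
cloud_fraction = np.array([0.05, 0.10, 0.60, 0.02, 0.15])

# Reusable conditional expressions (operands) ...
good_qc = qc_flag <= 1
low_cloud = cloud_fraction < 0.5

# ... combined into one Boolean screening expression
keep = good_qc & low_cloud

screened = np.where(keep, data, np.nan)   # filtered copy; rejected values become NaN
print(screened)
```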

  14. Realistic 3D computer model of the gerbil middle ear, featuring accurate morphology of bone and soft tissue structures.

    PubMed

    Buytaert, Jan A N; Salih, Wasil H M; Dierick, Manual; Jacobs, Patric; Dirckx, Joris J J

    2011-12-01

    In order to improve realism in middle ear (ME) finite-element modeling (FEM), comprehensive and precise morphological data are needed. To date, micro-scale X-ray computed tomography (μCT) recordings have been used as geometric input data for FEM models of the ME ossicles. Previously, attempts were made to obtain these data on ME soft tissue structures as well. However, due to low X-ray absorption of soft tissue, quality of these images is limited. Another popular approach is using histological sections as data for 3D models, delivering high in-plane resolution for the sections, but the technique is destructive in nature and registration of the sections is difficult. We combine data from high-resolution μCT recordings with data from high-resolution orthogonal-plane fluorescence optical-sectioning microscopy (OPFOS), both obtained on the same gerbil specimen. State-of-the-art μCT delivers high-resolution data on the 3D shape of ossicles and other ME bony structures, while the OPFOS setup generates data of unprecedented quality both on bone and soft tissue ME structures. Each of these techniques is tomographic and non-destructive and delivers sets of automatically aligned virtual sections. The datasets coming from different techniques need to be registered with respect to each other. By combining both datasets, we obtain a complete high-resolution morphological model of all functional components in the gerbil ME. The resulting 3D model can be readily imported in FEM software and is made freely available to the research community. In this paper, we discuss the methods used, present the resulting merged model, and discuss the morphological properties of the soft tissue structures, such as muscles and ligaments.

  15. The MetabolomeExpress Project: enabling web-based processing, analysis and transparent dissemination of GC/MS metabolomics datasets.

    PubMed

    Carroll, Adam J; Badger, Murray R; Harvey Millar, A

    2010-07-14

    Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g. metabolite, species, organ/biofluid, etc.). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. MetabolomeExpress (https://www.metabolome-express.org) provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications. Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.

  16. [Ecological regionalization of national cotton fiber quality in China using GGE biplot analysis method].

    PubMed

    Xu, Nai Yin; Jin, Shi Qiao; Li, Jian

    2017-01-01

    The distinctive regional characteristics of cotton fiber quality in the major cotton-producing areas in China enhance the textile use efficiency of raw cotton yarn by improving fiber quality through ecological regionalization. The "environment vs. trait" GGE biplot analysis method was adopted to explore the interaction between conventional cotton sub-regions and cotton fiber quality traits based on the datasets collected from the national cotton regional trials from 2011 to 2015. The results showed that the major cotton-producing areas in China were divided into four fiber quality ecological regions, namely, the "high fiber quality ecological region", the "low micronaire ecological region", the "high fiber strength and micronaire ecological region", and the "moderate fiber quality ecological region". The high fiber quality ecological region was characterized by harmonious development of cotton fiber length, strength, micronaire value and the highest spinning consistency index, and located in the conventional cotton regions in the upper and lower reaches of the Yangtze River Valley. The low micronaire value ecological region, composed of the northern and southern Xinjiang cotton regions, was characterized by low micronaire value, relatively lower fiber strength, and relatively high spinning consistency index performance. The high fiber strength and micronaire value ecological region covered the middle reaches of the Yangtze River Valley, Nanxiang Basin and Huaibei Plain, and was prominently characterized by high strength and micronaire value, and moderate performance of other traits. The moderate fiber quality ecological region included the North China Plain and Loess Plateau cotton growing regions in the Yellow River Valley, and was characterized by moderate or lower performances of all fiber quality traits. This study effectively applied the "environment vs. trait" GGE biplot to regionalize cotton fiber quality, which provided a helpful reference for the regionalized cotton growing regions in terms of optimal raw fiber production for the textile industry, and gave a good example for the implementation of similar ecological regionalization of other crops as well.

  17. The 3D Elevation Program: summary for Texas

    USGS Publications Warehouse

    Carswell, William J.

    2013-01-01

    Elevation data are essential to a broad range of applications, including forest resources management, wildlife and habitat management, national security, recreation, and many others. For the State of Texas, elevation data are critical for natural resources conservation; wildfire management, planning, and response; flood risk management; agriculture and precision farming; infrastructure and construction management; water supply and quality; and other business uses. Today, high-quality light detection and ranging (lidar) data are the source for creating elevation models and other elevation datasets. Federal, State, and local agencies work in partnership to (1) replace data, on a national basis, that are (on average) 30 years old and of lower quality and (2) provide coverage where publicly accessible data do not exist. A joint goal of State and Federal partners is to acquire consistent, statewide coverage to support existing and emerging applications enabled by lidar data. The new 3D Elevation Program (3DEP) initiative, managed by the U.S. Geological Survey (USGS), responds to the growing need for high-quality topographic data and a wide range of other three-dimensional representations of the Nation’s natural and constructed features.

  18. The 3D Elevation Program: summary for Minnesota

    USGS Publications Warehouse

    Carswell, William J.

    2013-01-01

    Elevation data are essential to a broad range of applications, including forest resources management, wildlife and habitat management, national security, recreation, and many others. For the State of Minnesota, elevation data are critical for agriculture and precision farming, natural resources conservation, flood risk management, infrastructure and construction management, water supply and quality, coastal zone management, and other business uses. Today, high-quality light detection and ranging (lidar) data are the sources for creating elevation models and other elevation datasets. Federal, State, and local agencies work in partnership to (1) replace data, on a national basis, that are (on average) 30 years old and of lower quality and (2) provide coverage where publicly accessible data do not exist. A joint goal of State and Federal partners is to acquire consistent, statewide coverage to support existing and emerging applications enabled by lidar data. The new 3D Elevation Program (3DEP) initiative, managed by the U.S. Geological Survey (USGS), responds to the growing need for high-quality topographic data and a wide range of other three-dimensional representations of the Nation’s natural and constructed features.

  19. The 3D Elevation Program: summary for Wisconsin

    USGS Publications Warehouse

    Carswell, William J.

    2013-01-01

    Elevation data are essential to a broad range of applications, including forest resources management, wildlife and habitat management, national security, recreation, and many others. For the State of Wisconsin, elevation data are critical for agriculture and precision farming, natural resources conservation, flood risk management, infrastructure and construction management, water supply and quality, and other business uses. Today, high-quality light detection and ranging (lidar) data are the sources for creating elevation models and other elevation datasets. Federal, State, and local agencies work in partnership to (1) replace data, on a national basis, that are (on average) 30 years old and of lower quality and (2) provide coverage where publicly accessible data do not exist. A joint goal of State and Federal partners is to acquire consistent, statewide coverage to support existing and emerging applications enabled by lidar data. The new 3D Elevation Program (3DEP) initiative, managed by the U.S. Geological Survey (USGS), responds to the growing need for high-quality topographic data and a wide range of other three-dimensional representations of the Nation’s natural and constructed features.

  20. The Need for Careful Data Collection for Pattern Recognition in Digital Pathology.

    PubMed

    Marée, Raphaël

    2017-01-01

    Effective pattern recognition requires carefully designed ground-truth datasets. In this technical note, we first summarize potential data collection issues in digital pathology and then propose guidelines to build more realistic ground-truth datasets and to control their quality. We hope our comments will foster the effective application of pattern recognition approaches in digital pathology.

  1. School Opportunity Hoarding? Racial Segregation and Access to High Growth Schools

    PubMed Central

    Fiel, Jeremy E.

    2017-01-01

    Persistent school segregation may allow advantaged groups to hoard educational opportunities and consign minority students to lower-quality educational experiences. Although minority students are concentrated in low-achieving schools, relatively little previous research directly links segregation to measures of school quality based on student achievement growth, which more plausibly reflect learning opportunities. Using a dataset of public elementary schools in California, this study provides the first analysis detailing the distribution of a growth-based measure of school quality using standard inequality indices, allowing disparities to be decomposed across geographic and organizational scales. We find mixed support for the school opportunity hoarding hypothesis. We find small White and Asian advantages in access to high-growth schools, but most of the inequality in exposure to school growth is within racial groups. Growth-based disparities both between and within groups tend to be on a more local scale than disparities in absolute achievement levels, focusing attention on within-district policies to mitigate school-based inequalities in opportunities to learn. PMID:28607527

  2. Implementation of compressive sensing for preclinical cine-MRI

    NASA Astrophysics Data System (ADS)

    Tan, Elliot; Yang, Ming; Ma, Lixin; Zheng, Yahong Rosa

    2014-03-01

    This paper presents a practical implementation of Compressive Sensing (CS) for a preclinical MRI machine to acquire randomly undersampled k-space data in cardiac function imaging applications. First, random undersampling masks were generated based on Gaussian, Cauchy, wrapped Cauchy and von Mises probability distribution functions by the inverse transform method. The best masks for undersampling ratios of 0.3, 0.4 and 0.5 were chosen for animal experimentation, and were programmed into a Bruker Avance III BioSpec 7.0T MRI system through method programming in ParaVision. Three undersampled mouse heart datasets were obtained using a fast low angle shot (FLASH) sequence, along with a control undersampled phantom dataset. ECG and respiratory gating were used to obtain high-quality images. After CS reconstructions were applied to all acquired data, the resulting images were quantitatively analyzed using the performance metrics of reconstruction error and Structural Similarity Index (SSIM). The comparative analysis indicated that CS reconstructed images from MRI machine undersampled data were indeed comparable to CS reconstructed images from retrospective undersampled data, and that CS techniques are practical in a preclinical setting. The implementation achieved 2 to 4 times acceleration for image acquisition and satisfactory quality of image reconstruction.
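    As a hedged sketch of the inverse transform method mentioned above, the snippet below draws phase-encode line indices from a Gaussian density centred on the middle of k-space until a target undersampling ratio is reached; the distribution, its width and the array sizes are illustrative assumptions rather than the parameters used in the study.

```python
# Minimal sketch: variable-density k-space undersampling mask via inverse
# transform sampling of a Gaussian density over phase-encode lines.
# Sizes, sigma and the target ratio are illustrative assumptions.
import numpy as np

def gaussian_mask(n_lines=256, ratio=0.3, sigma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_lines)
    pdf = np.exp(-0.5 * (x / sigma) ** 2)
    pdf /= pdf.sum()
    cdf = np.cumsum(pdf)

    mask = np.zeros(n_lines, dtype=bool)
    target = int(round(ratio * n_lines))
    while mask.sum() < target:
        u = rng.random()
        idx = int(np.searchsorted(cdf, u))   # inverse transform: CDF^{-1}(u)
        mask[min(idx, n_lines - 1)] = True   # repeated draws of a line are ignored
    return mask

mask = gaussian_mask()
print(mask.sum() / mask.size)   # achieved undersampling ratio (~0.3)
```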

  3. a Free and Open Source Tool to Assess the Accuracy of Land Cover Maps: Implementation and Application to Lombardy Region (italy)

    NASA Astrophysics Data System (ADS)

    Bratic, G.; Brovelli, M. A.; Molinari, M. E.

    2018-04-01

    The availability of thematic maps has significantly increased over the last few years. Validation of these maps is a key factor in assessing their suitability for different applications. The evaluation of the accuracy of classified data is carried out through a comparison with a reference dataset and the generation of a confusion matrix from which many quality indexes can be derived. In this work, an ad hoc free and open source Python tool was implemented to automatically compute all the confusion matrix-derived accuracy indexes proposed in the literature. The tool was integrated into the GRASS GIS environment and successfully applied to evaluate the quality of three high-resolution global datasets (GlobeLand30, Global Urban Footprint, Global Human Settlement Layer Built-Up Grid) in the Lombardy Region area (Italy). In addition to the most commonly used accuracy measures, e.g. overall accuracy and Kappa, the tool allowed the computation and investigation of less well-known indexes such as the Ground Truth and the Classification Success Index. The promising tool will be further extended with spatial autocorrelation analysis functions and made available to the researcher and user community.
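    For illustration, the sketch below derives a few of the standard confusion-matrix accuracy indexes (overall accuracy, producer's and user's accuracy, Cohen's Kappa); the matrix values are invented and the code is not taken from the tool described above.

```python
# Minimal sketch: overall accuracy, producer's/user's accuracy and Cohen's Kappa
# from a confusion matrix (rows = classified, columns = reference).
# The example matrix is invented for illustration.
import numpy as np

cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  5, 40]], dtype=float)

n = cm.sum()
overall_accuracy = np.trace(cm) / n

producers_accuracy = np.diag(cm) / cm.sum(axis=0)   # per reference class
users_accuracy = np.diag(cm) / cm.sum(axis=1)       # per classified class

# Cohen's Kappa: agreement corrected for chance agreement
p_o = overall_accuracy
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa = (p_o - p_e) / (1.0 - p_e)

print(round(overall_accuracy, 3), round(kappa, 3))
```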

  4. orthAgogue: an agile tool for the rapid prediction of orthology relations.

    PubMed

    Ekseth, Ole Kristian; Kuiper, Martin; Mironov, Vladimir

    2014-03-01

    The comparison of genes and gene products across species depends on high-quality tools to determine the relationships between gene or protein sequences from various species. Although some excellent applications are available and widely used, their performance leaves room for improvement. We developed orthAgogue: a multithreaded C application for high-speed estimation of homology relations in massive datasets, operated via a flexible and easy command-line interface. The orthAgogue software is distributed under the GNU license. The source code and binaries compiled for Linux are available at https://code.google.com/p/orthagogue/.

  5. The agreement between 3D, standard 2D and triplane 2D speckle tracking: effects of image quality and 3D volume rate.

    PubMed

    Trache, Tudor; Stöbe, Stephan; Tarr, Adrienn; Pfeiffer, Dietrich; Hagendorff, Andreas

    2014-12-01

    We compared 3D and 2D speckle tracking performed on standard 2D and triplane 2D datasets of normal and pathological left ventricular (LV) wall-motion patterns, focusing on the effect that 3D volume rate (3DVR), image quality and tracking artifacts have on the agreement between 2D and 3D speckle tracking. 37 patients with normal LV function and 18 patients with ischaemic wall-motion abnormalities underwent 2D and 3D echocardiography, followed by offline speckle tracking measurements. The values of 3D global, regional and segmental strain were compared with the standard 2D and triplane 2D strain values. Correlation analysis with the LV ejection fraction (LVEF) was also performed. The 3D and 2D global strain values correlated well in both normally and abnormally contracting hearts, though systematic differences between the two methods were observed. Of the 3D strain parameters, the area strain showed the best correlation with the LVEF. The numerical agreement of 3D and 2D analyses varied significantly with the volume rate and image quality of the 3D datasets. The highest correlation between 2D and 3D peak systolic strain values was found between 3D area and standard 2D longitudinal strain. Regional wall-motion abnormalities were similarly detected by 2D and 3D speckle tracking. 2D speckle tracking of triplane datasets showed results similar to those of conventional 2D datasets. 2D and 3D speckle tracking similarly detect normal and pathological wall-motion patterns. Limited image quality has a significant impact on the agreement between 3D and 2D numerical strain values.

  6. Measures and Indicators of Vgi Quality: AN Overview

    NASA Astrophysics Data System (ADS)

    Antoniou, V.; Skopeliti, A.

    2015-08-01

    The evaluation of VGI quality has been a very interesting and popular issue amongst academics and researchers. Various metrics and indicators have been proposed for evaluating VGI quality elements. Various efforts have focused on the use of well-established methodologies for the evaluation of VGI quality elements against authoritative data. In this paper, a number of research papers have been reviewed and summarized in a detailed report on measures for each spatial data quality element. Emphasis is given to the methodology followed and the data used in order to assess and evaluate the quality of the VGI datasets. However, as the use of authoritative data is not always possible, many researchers have turned their focus to the analysis of new quality indicators that can function as proxies for the understanding of VGI quality. In this paper, the difficulties in using authoritative datasets are briefly presented and newly proposed quality indicators are discussed, as recorded through the literature review. We classify these new indicators in four main categories that relate to: i) data, ii) demographics, iii) socio-economic situation and iv) contributors. This paper presents a dense, yet comprehensive overview of the research in this field and provides the basis for the ongoing academic effort to create a practical quality evaluation method through the use of appropriate quality indicators.

  7. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies.

    PubMed

    Mapleson, Daniel; Garcia Accinelli, Gonzalo; Kettleborough, George; Wright, Jonathan; Clavijo, Bernardo J

    2017-02-15

    De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. KAT is available under the GPLv3 license at: https://github.com/TGAC/KAT . bernardo.clavijo@earlham.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
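    KAT itself is a compiled toolkit; as a hedged illustration of the underlying idea only, the sketch below builds a k-mer frequency spectrum (how many distinct k-mers occur at each multiplicity) from a handful of toy reads. Real WGS datasets require memory-efficient counting, which this sketch does not attempt.

```python
# Minimal sketch of a k-mer frequency spectrum for read QC: count k-mers in a
# set of reads, then histogram distinct k-mers by their multiplicity.
# The reads and k are illustrative; real datasets need an efficient counter.
from collections import Counter

def kmer_spectrum(reads, k=5):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # spectrum[m] = number of distinct k-mers seen exactly m times
    spectrum = Counter(counts.values())
    return dict(sorted(spectrum.items()))

reads = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTACGTACGA"]
print(kmer_spectrum(reads, k=5))
```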

  8. EAPhy: A Flexible Tool for High-throughput Quality Filtering of Exon-alignments and Data Processing for Phylogenetic Methods.

    PubMed

    Blom, Mozes P K

    2015-08-05

    Recently developed molecular methods enable geneticists to target and sequence thousands of orthologous loci and infer evolutionary relationships across the tree of life. Large numbers of genetic markers benefit species tree inference, but visual inspection of alignment quality, as traditionally conducted, is challenging with thousands of loci. Furthermore, due to the impracticality of repeated visual inspection with alternative filtering criteria, the potential consequences of using datasets with different degrees of missing data remain only nominally explored in most empirical phylogenomic studies. In this short communication, I describe a flexible high-throughput pipeline designed to assess alignment quality and filter exonic sequence data for subsequent inference. The stringency criteria for alignment quality and missing data can be adapted based on the expected level of sequence divergence. Each alignment is automatically evaluated based on the stringency criteria specified, significantly reducing the number of alignments that require visual inspection. By developing a rapid method for alignment filtering and quality assessment, the consistency of phylogenetic estimation based on exonic sequence alignments can be further explored across distinct inference methods, while accounting for different degrees of missing data.
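    As a hedged sketch of one such stringency criterion (not EAPhy's actual implementation), the snippet below drops alignment columns whose proportion of gaps or ambiguous characters exceeds a user-set threshold; the toy alignment, threshold and missing-character set are illustrative assumptions.

```python
# Minimal sketch: filter alignment columns by the proportion of missing data
# (gaps '-' or ambiguous 'N'/'?'), keeping only columns below a stringency threshold.
# Toy alignment and threshold are illustrative.

def filter_columns(alignment, max_missing=0.3, missing_chars="-N?"):
    n_taxa = len(alignment)
    n_cols = len(alignment[0])
    kept = []
    for j in range(n_cols):
        column = [seq[j] for seq in alignment]
        missing = sum(c in missing_chars for c in column) / n_taxa
        if missing <= max_missing:
            kept.append(j)
    return ["".join(seq[j] for j in kept) for seq in alignment]

alignment = ["ATG-CGTA",
             "ATGNCGTA",
             "AT--CGTA",
             "ATGACGTA"]
print(filter_columns(alignment, max_missing=0.3))  # column 4 (75% missing) is dropped
```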

  9. Investigation of undersampling and reconstruction algorithm dependence on respiratory correlated 4D-MRI for online MR-guided radiation therapy

    NASA Astrophysics Data System (ADS)

    Mickevicius, Nikolai J.; Paulson, Eric S.

    2017-04-01

    The purpose of this work is to investigate the effects of undersampling and reconstruction algorithm on the total processing time and image quality of respiratory phase-resolved 4D-MRI data. Specifically, the goal is to obtain quality 4D-MRI data with a combined acquisition and reconstruction time of five minutes or less, which we reasoned would be satisfactory for pre-treatment 4D-MRI in online MRI-gRT. A 3D stack-of-stars, self-navigated, 4D-MRI acquisition was used to scan three healthy volunteers at three image resolutions and two scan durations. The NUFFT, CG-SENSE, SPIRiT, and XD-GRASP reconstruction algorithms were used to reconstruct each dataset on a high-performance reconstruction computer. The overall image quality, reconstruction time, artifact prevalence, and motion estimates were compared. The CG-SENSE and XD-GRASP reconstructions provided superior image quality over the other algorithms. The combination of a 3D SoS sequence and parallelized reconstruction algorithms, run on computing hardware more advanced than that typically seen on product MRI scanners, can result in acquisition and reconstruction of high-quality respiratory-correlated 4D-MRI images in less than five minutes.

  10. A hybrid Land Cover Dataset for Russia: a new methodology for merging statistics, remote sensing and in-situ information

    NASA Astrophysics Data System (ADS)

    Schepaschenko, D.; McCallum, I.; Shvidenko, A.; Kraxner, F.; Fritz, S.

    2009-04-01

    There is a critical need for accurate land cover information for resource assessment, biophysical modeling, greenhouse gas studies, and for estimating possible terrestrial responses and feedbacks to climate change. However, practically all existing land cover datasets have quite a high level of uncertainty and suffer from a lack of important details, which does not allow for relevant parameterization, e.g., data derived from different forest inventories. The objective of this study is to develop a methodology to create a hybrid land cover dataset at a level that would satisfy the requirements of the verified terrestrial biota full greenhouse gas account (Shvidenko et al., 2008) for large regions, i.e., Russia. Such requirements necessitate a detailed quantification of land classes (e.g., for forests - dominant species, age, growing stock, net primary production, etc.) with additional information on uncertainties of the major biometric and ecological parameters in the range of 10-20% and a confidence interval of around 0.9. The approach taken here allows the integration of different datasets to explore synergies, in particular the merging and harmonization of land and forest inventories, ecological monitoring, remote sensing data and in-situ information. The following datasets have been integrated: Remote sensing: Global Land Cover 2000 (Fritz et al., 2003), Vegetation Continuous Fields (Hansen et al., 2002), Vegetation Fire (Sukhinin, 2007), Regional land cover (Schmullius et al., 2005); GIS: Soil 1:2.5 Mio (Dokuchaev Soil Science Institute, 1996), Administrative Regions 1:2.5 Mio, Vegetation 1:4 Mio, Bioclimatic Zones 1:4 Mio (Stolbovoi & McCallum, 2002), Forest Enterprises 1:2.5 Mio, Rivers/Lakes and Roads/Railways 1:1 Mio (IIASA's database); Inventories and statistics: State Land Account (FARSC RF, 2006), State Forest Account - SFA (FFS RF, 2003), Disturbances in forests (FFS RF, 2006). The resulting hybrid land cover dataset at 1-km resolution comprises the following classes: Forest (each grid cell links to the SFA database, which contains 86,613 records); Agriculture (5 classes, parameterized by 89 administrative units); Wetlands (8 classes, parameterized by 83 zone/region units); Open Woodland; Burnt area; Shrub/grassland (50 classes, parameterized by 300 zone/region units); Water; Unproductive area. This study has demonstrated the ability to produce a highly detailed (both spatially and thematically) land cover dataset over Russia. Future efforts include further validation of the hybrid land cover dataset for Russia, and its use for assessment of the terrestrial biota full greenhouse gas budget across Russia. The methodology proposed in this study could be applied at the global level. Results of such an undertaking would, however, be highly dependent upon the quality of the available ground data. The implementation of the hybrid land cover dataset was undertaken in such a way that it can be regularly updated based on new ground data and remote sensing products (i.e., MODIS).

  11. Integration of prior CT into CBCT reconstruction for improved image quality via reconstruction of difference: first patient studies

    NASA Astrophysics Data System (ADS)

    Zhang, Hao; Gang, Grace J.; Lee, Junghoon; Wong, John; Stayman, J. Webster

    2017-03-01

    Purpose: There are many clinical situations where diagnostic CT is used for an initial diagnosis or treatment planning, followed by one or more CBCT scans that are part of an image-guided intervention. Because the high-quality diagnostic CT scan is a rich source of patient-specific anatomical knowledge, this provides an opportunity to incorporate the prior CT image into subsequent CBCT reconstruction for improved image quality. We propose a penalized-likelihood method called reconstruction of difference (RoD), to directly reconstruct differences between the CBCT scan and the CT prior. In this work, we demonstrate the efficacy of RoD with clinical patient datasets. Methods: We introduce a data processing workflow using the RoD framework to reconstruct anatomical changes between the prior CT and current CBCT. This workflow includes processing steps to account for non-anatomical differences between the two scans including 1) scatter correction for CBCT datasets due to increased scatter fractions in CBCT data; 2) histogram matching for attenuation variations between CT and CBCT; and 3) registration for different patient positioning. CBCT projection data and CT planning volumes for two radiotherapy patients - one abdominal study and one head-and-neck study - were investigated. Results: In comparisons between the proposed RoD framework and more traditional FDK and penalized-likelihood reconstructions, we find a significant improvement in image quality when prior CT information is incorporated into the reconstruction. RoD is able to provide additional low-contrast details while correctly incorporating actual physical changes in patient anatomy. Conclusions: The proposed framework provides an opportunity to either improve image quality or relax data fidelity constraints for CBCT imaging when prior CT studies of the same patient are available. Possible clinical targets include CBCT image-guided radiotherapy and CBCT image-guided surgeries.
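    As a hedged illustration of the histogram-matching step listed above (step 2), the sketch below maps CBCT voxel intensities onto the intensity distribution of the prior CT by standard quantile matching; the array names and stand-in data are illustrative and this is not the authors' processing code.

```python
# Minimal sketch of histogram (quantile) matching: map CBCT voxel intensities so
# that their distribution matches the prior CT. Arrays are illustrative stand-ins.
import numpy as np

def match_histogram(source, reference):
    """Return a copy of `source` whose intensity distribution matches `reference`."""
    src_flat = source.ravel()
    src_sorted = np.sort(src_flat)
    ref_sorted = np.sort(reference.ravel())
    # Quantile of each source voxel within the source distribution
    quantiles = np.searchsorted(src_sorted, src_flat, side="left") / (src_flat.size - 1)
    # Map those quantiles onto the reference distribution
    matched = np.interp(quantiles, np.linspace(0, 1, ref_sorted.size), ref_sorted)
    return matched.reshape(source.shape)

rng = np.random.default_rng(0)
cbct = rng.normal(0.018, 0.004, size=(32, 32, 32))   # stand-in attenuation values
ct = rng.normal(0.020, 0.003, size=(32, 32, 32))
cbct_matched = match_histogram(cbct, ct)
print(cbct_matched.mean(), ct.mean())                # means now closely agree
```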

  12. The advanced quality control techniques planned for the International Soil Moisture Network

    NASA Astrophysics Data System (ADS)

    Xaver, A.; Gruber, A.; Hegiova, A.; Sanchis-Dufau, A. D.; Dorigo, W. A.

    2012-04-01

    In situ soil moisture observations are essential to evaluate and calibrate modeled and remotely sensed soil moisture products. Although a number of meteorological networks and field campaigns measuring soil moisture exist on a global and long-term scale, their observations are not easily accessible and lack standardization of both technique and protocol. Thus, handling and especially comparing these datasets with satellite products or land surface models is a demanding issue. To overcome these limitations the International Soil Moisture Network (ISMN; http://www.ipf.tuwien.ac.at/insitu/) has been initiated to act as a centralized data hosting facility. One advantage of the ISMN is that users are able to access the harmonized datasets easily through a web portal. Another advantage is the fully automated processing chain, including data harmonization in terms of units and sampling interval; even more important is the advanced quality control system each measurement has to run through. The quality of in situ soil moisture measurements is crucial for the validation of satellite- and model-based soil moisture retrievals; therefore a sophisticated quality control system was developed. After a check for plausibility and geophysical limits, a quality flag is added to each measurement. An enhanced flagging mechanism was recently defined using a spectrum-based approach to detect spurious spikes, jumps and plateaus. The International Soil Moisture Network has already evolved into one of the most important distribution platforms for in situ soil moisture observations and is still growing. Currently, data from 27 networks in total, covering more than 800 stations in Europe, North America, Australia, Asia and Africa, are hosted by the ISMN. Available datasets include historical datasets as well as near real-time measurements. The improved quality control system will provide important information for satellite-based as well as land surface model-based validation studies.
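    As a hedged sketch of the first stages of such a quality control chain, the snippet below flags values outside geophysical limits and marks simple spikes in a soil moisture time series; the thresholds and flag codes are illustrative assumptions, not the ISMN's actual flagging scheme.

```python
# Minimal sketch of automated QC for a soil moisture time series: flag values
# outside geophysical limits and flag spikes that differ sharply from both
# neighbours. Thresholds and flag letters are illustrative only.
import numpy as np

def qc_flags(sm, lower=0.0, upper=0.6, spike_threshold=0.15):
    flags = np.array(["G"] * sm.size, dtype="<U1")     # G = good
    flags[(sm < lower) | (sm > upper)] = "C"           # C = outside geophysical range

    # Simple spike test: a point far from both neighbours while neighbours agree
    for i in range(1, sm.size - 1):
        if flags[i] != "G":
            continue
        d_prev, d_next = sm[i] - sm[i - 1], sm[i] - sm[i + 1]
        if abs(d_prev) > spike_threshold and abs(d_next) > spike_threshold \
                and abs(sm[i + 1] - sm[i - 1]) < spike_threshold:
            flags[i] = "D"                              # D = spike
    return flags

sm = np.array([0.21, 0.22, 0.45, 0.23, 0.24, 0.80, 0.25])
print(qc_flags(sm))   # ['G' 'G' 'D' 'G' 'G' 'C' 'G']
```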

  13. Enhancing DInSAR capabilities for landslide monitoring by applying GIS-based multicriteria filtering analysis

    NASA Astrophysics Data System (ADS)

    Beyene, F.; Knospe, S.; Busch, W.

    2015-04-01

    Landslide detection and monitoring remain difficult with conventional differential radar interferometry (DInSAR) because most pixels of radar interferograms around landslides are affected by different error sources. These are mainly related to the high radar viewing angles and related spatial distortions (such as layover and shadow), temporal decorrelation owing to vegetation cover, and the speed and direction of the sliding masses. On the other hand, GIS can be used to integrate spatial datasets obtained from many sources (including radar and non-radar sources). In this paper, a GRID data model is proposed to integrate deformation data derived from DInSAR processing with other radar-origin data (coherence, layover and shadow, slope and aspect, local incidence angle) and external datasets collected from field study of landslide sites and other sources (geology, geomorphology, hydrology). After coordinate transformation and merging of data, candidate landslide-representing pixels with high-quality radar signals were filtered out by applying a GIS-based multicriteria filtering analysis (GIS-MCFA), which excludes grid points in areas of shadow and layover, low coherence, non-detectable and non-landslide deformations, and other possible sources of errors in the DInSAR data processing. Finally, the results obtained from GIS-MCFA were verified using the external datasets (existing landslide sites collected during fieldwork, geological and geomorphologic maps, rainfall data, etc.).
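    As a hedged sketch of the multicriteria filtering idea (not the authors' GIS workflow), the snippet below combines Boolean masks derived from coherence, layover/shadow and slope layers to retain only grid cells with a plausible, reliable DInSAR deformation estimate; the layer names, thresholds and random stand-in rasters are illustrative assumptions.

```python
# Minimal sketch of GIS-based multicriteria filtering: keep only grid cells whose
# DInSAR deformation values pass all criteria layers. Layers, thresholds and
# sizes are illustrative stand-ins for real raster data.
import numpy as np

rng = np.random.default_rng(1)
shape = (200, 200)

deformation = rng.normal(0.0, 5.0, shape)        # mm, from DInSAR processing
coherence = rng.random(shape)                    # 0..1 interferometric coherence
layover_shadow = rng.random(shape) < 0.05        # True where layover/shadow occurs
slope = rng.uniform(0.0, 60.0, shape)            # degrees, from a DEM

criteria = (
    (coherence >= 0.4)            # keep only coherent pixels
    & ~layover_shadow             # exclude geometric distortions
    & (slope >= 5.0)              # landslides unlikely on near-flat terrain
    & (np.abs(deformation) < 50)  # discard implausible deformation values
)

candidates = np.where(criteria, deformation, np.nan)
print(np.count_nonzero(criteria), "candidate cells retained")
```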

  14. Monitoring and long-term assessment of the Mediterranean Sea physical state

    NASA Astrophysics Data System (ADS)

    Simoncelli, Simona; Fratianni, Claudia; Clementi, Emanuela; Drudi, Massimiliano; Pistoia, Jenny; Grandi, Alessandro; Del Rosso, Damiano

    2017-04-01

    The near-real-time monitoring and long-term assessment of the physical state of the ocean are crucial for the wide CMEMS user community, providing a continuous and up-to-date overview of key indicators computed from operational analysis and reanalysis datasets. This constitutes an operational warning system for particular events, stimulating research towards a deeper understanding of them and consequently increasing the uptake of CMEMS products. Ocean Monitoring Indicators (OMIs) of some Essential Ocean Variables have been identified and developed by the Mediterranean Monitoring and Forecasting Centre (MED-MFC) under the umbrella of the CMEMS MYP WG (Multi Year Products Working Group). These OMIs were first operationally implemented for the physical reanalysis products and then applied to the operational analyses product. Sea surface temperature, salinity and height, as well as heat, water and momentum fluxes at the air-sea interface, have been operationally implemented since the development of the reanalysis system as real-time monitoring of the data production. Their consistency analysis against available observational products or budget values recognized in the literature guarantees the high quality of the numerical dataset. The results of the reanalysis validation procedures have been published yearly since 2014 in the QUality Information Document, available through the CMEMS catalogue (http://marine.copernicus.eu), together with the yearly dataset extension. New OMIs of the winter mixed layer depth, the eddy kinetic energy and the heat content will be presented; in particular, we will analyze their time evolution and trends starting from 1987, and then focus on the recent period 2013-2016, when the reanalysis and analyses datasets overlap, to show their consistency despite their different system implementations (i.e. atmospheric forcing, wave coupling, nesting). Finally, the focus will be on the 2016 sea state and circulation of the Mediterranean Sea and its anomaly with respect to the climatological fields, to detect the 2016 peculiarities early.

  15. Data Publication: A Partnership between Scientists, Data Managers and Librarians

    NASA Astrophysics Data System (ADS)

    Raymond, L.; Chandler, C.; Lowry, R.; Urban, E.; Moncoiffe, G.; Pissierssens, P.; Norton, C.; Miller, H.

    2012-04-01

    Current literature on the topic of data publication suggests that success is best achieved when there is a partnership between scientists, data managers, and librarians. The Marine Biological Laboratory/Woods Hole Oceanographic Institution (MBLWHOI) Library and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) have developed tools and processes to automate the ingestion of metadata from BCO-DMO for deposit with datasets into the Institutional Repository (IR) Woods Hole Open Access Server (WHOAS). The system also incorporates functionality for BCO-DMO to request a Digital Object Identifier (DOI) from the Library. This partnership allows the Library to work with a trusted data repository to ensure high quality data while the data repository utilizes library services and is assured of a permanent archive of the copy of the data extracted from the repository database. The assignment of persistent identifiers enables accurate data citation. The Library can assign a DOI to appropriate datasets deposited in WHOAS. A primary activity is working with authors to deposit datasets associated with published articles. The DOI would ideally be assigned before submission and be included in the published paper so readers can link directly to the dataset, but DOIs are also being assigned to datasets related to articles after publication. WHOAS metadata records link the article to the datasets and the datasets to the article. The assignment of DOIs has enabled another important collaboration with Elsevier, publisher of educational and professional science journals. Elsevier can now link from articles in the Science Direct database to the datasets available from WHOAS that are related to that article. The data associated with the article are freely available from WHOAS and accompanied by a Dublin Core metadata record. In addition, the Library has worked with researchers to deposit datasets in WHOAS that are not appropriate for national, international, or domain specific data repositories. These datasets currently include audio, text and image files. This research is being conducted by a team of librarians, data managers and scientists that are collaborating with representatives from the Scientific Committee on Oceanic Research (SCOR) and the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission (IOC). The goal is to identify best practices for tracking data provenance and clearly attributing credit to data collectors/providers.

  16. Component-Level Tuning of Kinematic Features from Composite Therapist Impressions of Movement Quality

    PubMed Central

    Venkataraman, Vinay; Turaga, Pavan; Baran, Michael; Lehrer, Nicole; Du, Tingfang; Cheng, Long; Rikakis, Thanassis; Wolf, Steven L.

    2016-01-01

    In this paper, we propose a general framework for tuning component-level kinematic features using therapists’ overall impressions of movement quality, in the context of a Home-based Adaptive Mixed Reality Rehabilitation (HAMRR) system. We propose a linear combination of non-linear kinematic features to model wrist movement, and propose an approach to learn feature thresholds and weights using high-level labels of overall movement quality provided by a therapist. The kinematic features are chosen such that they relate the quality of wrist movements to clinical assessment scores. Further, the proposed features are designed to be reliably extracted from an inexpensive and portable motion capture system using a single reflective marker on the wrist. Using a dataset collected from ten stroke survivors, we demonstrate that the framework can be reliably used for movement quality assessment in HAMRR systems. The system is currently being deployed for large-scale evaluations, and will represent an increasingly important application area of motion capture and activity analysis. PMID:25438331
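    As a hedged sketch of the general idea of tuning a linear combination of kinematic features against overall quality ratings (not the authors' learning procedure), the snippet below fits feature weights and an intercept by ordinary least squares on synthetic features and ratings; the feature set, data and choice of least squares are illustrative assumptions.

```python
# Minimal sketch: learn weights of a linear combination of kinematic features
# from therapists' overall movement quality ratings via least squares.
# Features, data and the fitting choice are illustrative only.
import numpy as np

rng = np.random.default_rng(2)

# Per-movement kinematic features (e.g., smoothness, peak speed, trajectory error)
n_movements, n_features = 60, 3
X = rng.random((n_movements, n_features))

# Synthetic therapist ratings of overall movement quality (0 = poor ... 10 = excellent)
true_w = np.array([4.0, 2.0, -3.0])
ratings = X @ true_w + 5.0 + rng.normal(0.0, 0.3, n_movements)

# Fit weights (and an intercept) by ordinary least squares
A = np.hstack([X, np.ones((n_movements, 1))])
weights, *_ = np.linalg.lstsq(A, ratings, rcond=None)
print("learned weights:", np.round(weights[:-1], 2), "intercept:", round(weights[-1], 2))

# Predicted composite quality score for a new movement's features
new_features = np.array([0.5, 0.4, 0.2])
score = new_features @ weights[:-1] + weights[-1]
print("predicted quality:", round(float(score), 2))
```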

  17. Quality control and improvement of cancer care: what is needed? 4th European Roundtable Meeting (ERTM) May 5th, 2017, Berlin, Germany.

    PubMed

    Ortmann, Olaf; Helbig, Ulrike; Torode, Julie; Schreck, Stefan; Karjalainen, Sakari; Bettio, Manola; Ringborg, Ulrik; Klinkhammer-Schalke, Monika; Bray, Freddy

    2018-06-01

    National Cancer Control Plans (NCCPs) often describe structural requirements for high-quality cancer care. During the fourth European Roundtable Meeting (ERTM), participants shared lessons from their own national settings to formulate best practice in optimizing communication strategies between parties involved in clinical cancer registries, cancer centers and guideline groups. A decentralized model of data collection close to the patient and caregiver enhances timely completion and the quality of the data captured. Nevertheless, central coordination is necessary to define datasets, indicators, standard settings, education, training and quality control to maintain standards across the network. In particular, the interaction of parties in the cancer care network has to be established and maintained on a regular basis. After establishing the structural requirements of cancer care networks, communication between the different components and parties is required to analyze outcome data, provide regular reporting to all and develop strategies for continuous improvement of quality across the network.

  18. Automatic segmentation and co-registration of gated CT angiography datasets: measuring abdominal aortic pulsatility

    NASA Astrophysics Data System (ADS)

    Wentz, Robert; Manduca, Armando; Fletcher, J. G.; Siddiki, Hassan; Shields, Raymond C.; Vrtiska, Terri; Spencer, Garrett; Primak, Andrew N.; Zhang, Jie; Nielson, Theresa; McCollough, Cynthia; Yu, Lifeng

    2007-03-01

    Purpose: To develop robust, novel segmentation and co-registration software to analyze temporally overlapping CT angiography datasets, with an aim to permit automated measurement of regional aortic pulsatility in patients with abdominal aortic aneurysms. Methods: We perform retrospective gated CT angiography in patients with abdominal aortic aneurysms. Multiple, temporally overlapping, time-resolved CT angiography datasets are reconstructed over the cardiac cycle, with aortic segmentation performed using a priori anatomic assumptions for the aorta and heart. Visual quality assessment is performed following automatic segmentation with manual editing. Following subsequent centerline generation, centerlines are cross-registered across phases, with internal validation of co-registration performed by examining registration at the regions of greatest diameter change (i.e. when the second derivative is maximal). Results: We have performed gated CT angiography in 60 patients. Automatic seed placement is successful in 79% of datasets, requiring either no editing (70%) or minimal editing (less than 1 minute; 12%). Causes of error include segmentation into adjacent, high-attenuating, nonvascular tissues; small segmentation errors associated with calcified plaque; and segmentation of non-renal, small paralumbar arteries. Internal validation of cross-registration demonstrates appropriate registration in our patient population. In general, we observed that aortic pulsatility can vary along the course of the abdominal aorta. Pulsation can also vary within an aneurysm as well as between aneurysms, but the clinical significance of these findings remains unknown. Conclusions: Visualization of large vessel pulsatility is possible using ECG-gated CT angiography, partial scan reconstruction, automatic segmentation, centerline generation, and coregistration of temporally resolved datasets.

  19. USGS Mineral Resources Program; national maps and datasets for research and land planning

    USGS Publications Warehouse

    Nicholson, S.W.; Stoeser, D.B.; Ludington, S.D.; Wilson, Frederic H.

    2001-01-01

    The U.S. Geological Survey, the Nation’s leader in producing and maintaining earth science data, serves as an advisor to Congress, the Department of the Interior, and many other Federal and State agencies. Nationwide datasets that are easily available and of high quality are critical for addressing a wide range of land-planning, resource, and environmental issues. Four types of digital databases (geological, geophysical, geochemical, and mineral occurrence) are being compiled and upgraded by the Mineral Resources Program on regional and national scales to meet these needs. Where existing data are incomplete, new data are being collected to ensure national coverage. Maps and analyses produced from these databases provide basic information essential for mineral resource assessments and environmental studies, as well as fundamental information for regional and national land-use studies. Maps and analyses produced from the databases are instrumental to ongoing basic research, such as the identification of mineral deposit origins, determination of regional background values of chemical elements with known environmental impact, and study of the relationships of toxic elements or mining practices to human health. As datasets are completed or revised, the information is made available through a variety of media, including the Internet. Much of the available information is the result of cooperative activities with State and other Federal agencies. The upgraded Mineral Resources Program datasets make geologic, geophysical, geochemical, and mineral occurrence information at the state, regional, and national scales available to members of Congress, State and Federal government agencies, researchers in academia, and the general public. The status of the Mineral Resources Program datasets is outlined below.

  20. Improving global data infrastructures for more effective and scalable analysis of Earth and environmental data: the Australian NCI NERDIP Approach

    NASA Astrophysics Data System (ADS)

    Evans, Ben; Wyborn, Lesley; Druken, Kelsey; Richards, Clare; Trenham, Claire; Wang, Jingbo; Rozas Larraondo, Pablo; Steer, Adam; Smillie, Jon

    2017-04-01

    The National Computational Infrastructure (NCI) facility hosts one of Australia's largest repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences domains. The data are obtained from national and international sources, spanning a wide range of gridded and ungridded (i.e., line surveys, point clouds) data, and raster imagery, as well as diverse coordinate reference projections and resolutions. Rather than managing these data assets as a digital library, whereby users can discover and download files to personal servers (similar to borrowing 'books' from a 'library'), NCI has built an extensive and well-integrated research data platform, the National Environmental Research Data Interoperability Platform (NERDIP, http://nci.org.au/data-collections/nerdip/). The NERDIP architecture enables programmatic access to data via standards-compliant services for high performance data analysis, and provides a flexible cloud-based environment to facilitate the next generation of transdisciplinary scientific research across all data domains. To improve use of modern scalable data infrastructures that are focused on efficient data analysis, the data organisation needs to be carefully managed, including performance evaluations of projections and coordinate systems, data encoding standards and formats. A complication is that we have often found that multiple domain vocabularies and ontologies are associated with equivalent datasets. It is not practical for individual dataset managers to determine which standards are best to apply to their dataset as this could impact accessibility and interoperability. Instead, they need to work with data custodians across interrelated communities and, in partnership with the data repository, the international scientific community to determine the most useful approach. For the data repository, this approach is essential to enable different disciplines and research communities to invoke new forms of analysis and discovery in an increasingly complex data-rich environment. Driven by the heterogeneity of Earth and environmental datasets, NCI developed a Data Quality/Data Assurance Strategy to ensure consistency is maintained within and across all datasets, as well as functionality testing to ensure smooth interoperability between products, tools, and services. This is particularly so for collections that contain data generated from multiple data acquisition campaigns, often using instruments and models that have evolved over time. By implementing the NCI Data Quality Strategy we have seen progressive improvement in the integration and quality of the datasets across the different subject domains, and through this, the ease by which the users can access data from this major data infrastructure. By both adhering to international standards and also contributing to extensions of these standards, data from the NCI NERDIP platform can be federated with data from other globally distributed data repositories and infrastructures. The NCI approach builds on our experience working with the astronomy and climate science communities, which have been internationally coordinating such interoperability standards within their disciplines for some years. The results of our work so far demonstrate that more could be done in the Earth science, solid earth and environmental communities, particularly through establishing better linkages between international/national community efforts such as EPOS, ENVRIplus, EarthCube, AuScope and the Research Data Alliance.

  1. Advanced multivariate data analysis to determine the root cause of trisulfide bond formation in a novel antibody-peptide fusion.

    PubMed

    Goldrick, Stephen; Holmes, William; Bond, Nicholas J; Lewis, Gareth; Kuiper, Marcel; Turner, Richard; Farid, Suzanne S

    2017-10-01

    Product quality heterogeneities, such as trisulfide bond (TSB) formation, can be influenced by multiple interacting process parameters. Identifying their root cause is a major challenge in biopharmaceutical production. To address this issue, this paper describes the novel application of advanced multivariate data analysis (MVDA) techniques to identify the process parameters influencing TSB formation in a novel recombinant antibody-peptide fusion expressed in mammalian cell culture. The screening dataset was generated with a high-throughput (HT) micro-bioreactor system (Ambr™ 15) using a design of experiments (DoE) approach. The complex dataset was firstly analyzed through the development of a multiple linear regression model focusing solely on the DoE inputs and identified the temperature, pH and initial nutrient feed day as important process parameters influencing this quality attribute. To further scrutinize the dataset, a partial least squares model was subsequently built incorporating both on-line and off-line process parameters and enabled accurate predictions of the TSB concentration at harvest. Process parameters identified by the models to promote and suppress TSB formation were implemented on five 7 L bioreactors and the resultant TSB concentrations were comparable to the model predictions. This study demonstrates the ability of MVDA to enable predictions of the key performance drivers influencing TSB formation that are valid also upon scale-up. Biotechnol. Bioeng. 2017;114: 2222-2234. © 2017 The Authors. Biotechnology and Bioengineering Published by Wiley Periodicals, Inc.
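
    As a rough illustration of the modelling step described above, the sketch below fits a partial least squares (PLS) model relating a few process parameters to a measured quality attribute. It is not the authors' pipeline; the file name and column names (temperature, pH, feed day, TSB concentration) are hypothetical placeholders.

    ```python
    # Hedged sketch of a PLS model linking process parameters to a quality
    # attribute; the CSV file and column names are hypothetical.
    import pandas as pd
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("ambr_runs.csv")             # hypothetical DoE/bioreactor export
    X = df[["temperature", "pH", "feed_day"]]     # assumed process parameters
    y = df["tsb_concentration"]                   # assumed quality attribute (TSB at harvest)

    pls = PLSRegression(n_components=2)
    print("cross-validated R^2:", cross_val_score(pls, X, y, cv=5).mean())

    pls.fit(X, y)
    print("predicted TSB for first runs:", pls.predict(X[:3]).ravel())
    ```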

  2. Implementing a Data Quality Strategy to Simplify Access to Data

    NASA Astrophysics Data System (ADS)

    Druken, K. A.; Trenham, C. E.; Evans, B. J. K.; Richards, C. J.; Wang, J.; Wyborn, L. A.

    2016-12-01

    To ensure seamless programmatic access for data analysis (including machine learning), standardization of both data and services is vital. At the Australian National Computational Infrastructure (NCI) we have developed a Data Quality Strategy (DQS) that currently provides processes for: (1) the consistency of data structures in the underlying High Performance Data (HPD) platform; (2) quality control through compliance with recognized community standards; and (3) data quality assurance through demonstrated functionality across common platforms, tools and services. NCI hosts one of Australia's largest repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences. A key challenge is the application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used. Within these disciplines, data span a wide range of gridded, ungridded (i.e., line surveys, point clouds), and raster image types, as well as diverse coordinate reference projections and resolutions. By implementing our DQS we have seen progressive improvement in the quality of the datasets across the different subject domains, and through this, the ease by which the users can programmatically access the data, either in situ or via web services. As part of its quality control procedures, NCI has developed a compliance checker based upon existing domain standards. The DQS also includes extensive Functionality Testing, which includes readability by commonly used libraries (e.g., netCDF, HDF, GDAL, etc.); accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer); validation against scientific analysis and programming platforms (e.g., Python, Matlab, QGIS); and visualization tools (e.g., ParaView, NASA Web World Wind). These tests ensure smooth interoperability between products and services as well as exposing unforeseen requirements and dependencies. The results provide an important component of quality control within the DQS as well as clarifying the requirement for any extensions to the relevant standards that help support the uptake of data by broader international communities.
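
    As a minimal, illustrative sketch of the kind of readability test described above (not NCI's actual compliance checker), the snippet below opens a netCDF file with the netCDF4 library and flags a few obvious metadata gaps; the file name is a placeholder.

    ```python
    # Illustrative readability/metadata check for a netCDF file; a sketch of
    # the idea only, not the NCI compliance checker.
    from netCDF4 import Dataset

    def basic_check(path):
        issues = []
        with Dataset(path) as nc:
            if "Conventions" not in nc.ncattrs():
                issues.append("missing global 'Conventions' attribute")
            for name, var in nc.variables.items():
                if "units" not in var.ncattrs():
                    issues.append(f"variable '{name}' has no units attribute")
        return issues

    print(basic_check("example.nc"))   # 'example.nc' is a placeholder file name
    ```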

  3. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

    Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
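
    The bit-pattern idea can be illustrated with a short sketch: encode each row of the binary matrix as an integer bit mask, AND two rows to obtain a candidate column pattern, then collect every row that contains that pattern. This is a simplified illustration of the principle, not the published BiBit implementation.

    ```python
    # Simplified bit-pattern bicluster search on a toy binary matrix; an
    # illustration of the principle only, not the BiBit algorithm itself.
    rows = [
        0b101101,
        0b101001,
        0b111101,
        0b000110,
    ]

    def bicluster(i, j, rows):
        pattern = rows[i] & rows[j]               # candidate column pattern
        if pattern == 0:
            return None
        members = [k for k, r in enumerate(rows) if r & pattern == pattern]
        return pattern, members

    print(bicluster(0, 1, rows))                  # pattern 0b101001 (=41), rows [0, 1, 2]
    ```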

  4. Documentation Resources on the ESIP Wiki

    NASA Technical Reports Server (NTRS)

    Habermann, Ted; Kozimor, John; Gordon, Sean

    2017-01-01

    The ESIP community includes data providers and users that communicate with one another through datasets and metadata that describe them. Improving this communication depends on consistent high-quality metadata. The ESIP Documentation Cluster and the wiki play an important central role in facilitating this communication. We will describe and demonstrate sections of the wiki that provide information about metadata concept definitions, metadata recommendations, metadata dialects, and guidance pages. We will also describe and demonstrate the ISO Explorer, a tool that the community is developing to help metadata creators.

  5. Application of Climate Assessment Tool (CAT) to estimate climate variability impacts on nutrient loading from local watersheds

    Treesearch

    Ying Ouyang; Prem B. Parajuli; Gary Feng; Theodor D. Leininger; Yongshan Wan; Padmanava Dash

    2018-01-01

    A vast amount of future climate scenario datasets, created by climate models such as general circulation models (GCMs), have been used in conjunction with watershed models to project future climate variability impact on hydrological processes and water quality. However, these low spatial-temporal resolution datasets are often difficult to downscale spatially and...

  6. The Increasing Availability of Official Datasets: Methods, Limitations and Opportunities for Studies of Education

    ERIC Educational Resources Information Center

    Gorard, Stephen

    2012-01-01

    The re-use of existing and official data has a very long and largely honourable history in education and social science. The principal change in the 60 years since the first issue of the "British Journal of Educational Studies" has been the increasing range, availability and quality of existing numeric datasets. New and valuable fields…

  7. The Impact of 3D Data Quality on Improving GNSS Performance Using City Models - Initial Simulations

    NASA Astrophysics Data System (ADS)

    Ellul, C.; Adjrad, M.; Groves, P.

    2016-10-01

    There is an increasing demand for highly accurate positioning information in urban areas, to support applications such as people and vehicle tracking, real-time air quality detection and navigation. However systems such as GPS typically perform poorly in dense urban areas. A number of authors have made use of 3D city models to enhance accuracy, obtaining good results, but to date the influence of the quality of the 3D city model on these results has not been tested. This paper addresses the following question: how does the quality, and in particular the variation in height, level of generalization and completeness and currency of a 3D dataset, impact the results obtained for the preliminary calculations in a process known as Shadow Matching, which takes into account not only where satellite signals are visible on the street but also where they are predicted to be absent. We describe initial simulations to address this issue, examining the variation in elevation angle - i.e. the angle above which the satellite is visible, for three 3D city models in a test area in London, and note that even within one dataset using different available height values could cause a difference in elevation angle of up to 29°. Missing or extra buildings result in an elevation variation of around 85°. Variations such as these can significantly influence the predicted satellite visibility which will then not correspond to that experienced on the ground, reducing the accuracy of the resulting Shadow Matching process.
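
    As a toy example of the boundary elevation angle discussed above, the sketch below computes the angle subtended by a building edge from a street-level receiver; the heights and distance are invented, and real shadow-matching systems work from full 3D city geometry rather than a single height and distance.

    ```python
    # Toy calculation of the elevation angle above which sky (and hence a
    # satellite) is visible past a building edge; values are invented.
    import math

    def elevation_angle(building_height_m, receiver_height_m, horizontal_dist_m):
        """Angle in degrees above the horizontal at which the building edge sits."""
        rise = building_height_m - receiver_height_m
        return math.degrees(math.atan2(rise, horizontal_dist_m))

    # Same footprint, two candidate roof heights from different 3D models:
    print(elevation_angle(30.0, 1.5, 20.0))   # ~55 degrees
    print(elevation_angle(12.0, 1.5, 20.0))   # ~28 degrees, a large swing in masking angle
    ```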

  8. Chemical elements in the environment: multi-element geochemical datasets from continental to national scale surveys on four continents

    USGS Publications Warehouse

    Caritat, Patrice de; Reimann, Clemens; Smith, David; Wang, Xueqiu

    2017-01-01

    During the last 10-20 years, Geological Surveys around the world have undertaken a major effort towards delivering fully harmonized and tightly quality-controlled low-density multi-element soil geochemical maps and datasets of vast regions including up to whole continents. Concentrations of between 45 and 60 elements commonly have been determined in a variety of different regolith types (e.g., sediment, soil). The multi-element datasets are published as complete geochemical atlases and made available to the general public. Several other geochemical datasets covering smaller areas but generally at a higher spatial density are also available. These datasets may, however, not be found by superficial internet-based searches because the elements are not mentioned individually either in the title or in the keyword lists of the original references. This publication attempts to increase the visibility and discoverability of these fundamental background datasets covering large areas up to whole continents.

  9. Local environmental quality positively predicts breastfeeding in the UK’s Millennium Cohort Study

    PubMed Central

    Sear, Rebecca

    2017-01-01

    Background and Objectives: Breastfeeding is an important form of parental investment with clear health benefits. Despite this, rates remain low in the UK; understanding variation can therefore help improve interventions. Life history theory suggests that environmental quality may pattern maternal investment, including breastfeeding. We analyse a nationally representative dataset to test two predictions: (i) higher local environmental quality predicts higher likelihood of breastfeeding initiation and longer duration; (ii) higher socioeconomic status (SES) provides a buffer against the adverse influences of low local environmental quality. Methodology: We ran factor analysis on a wide range of local-level environmental variables. Two summary measures of local environmental quality were generated by this analysis—one ‘objective’ (based on an independent assessor’s neighbourhood scores) and one ‘subjective’ (based on respondent’s scores). We used mixed-effects regression techniques to test our hypotheses. Results: Higher objective, but not subjective, local environmental quality predicts higher likelihood of starting and maintaining breastfeeding over and above individual SES and area-level measures of environmental quality. Higher individual SES is protective, with women from high-income households having relatively high breastfeeding initiation rates and those with high status jobs being more likely to maintain breastfeeding, even in poor environmental conditions. Conclusions and Implications: Environmental quality is often vaguely measured; here we present a thorough investigation of environmental quality at the local level, controlling for individual- and area-level measures. Our findings support a shift in focus away from individual factors and towards altering the landscape of women’s decision making contexts when considering behaviours relevant to public health. PMID:29354262

  10. Local environmental quality positively predicts breastfeeding in the UK's Millennium Cohort Study.

    PubMed

    Brown, Laura J; Sear, Rebecca

    2017-01-01

    Background and Objectives: Breastfeeding is an important form of parental investment with clear health benefits. Despite this, rates remain low in the UK; understanding variation can therefore help improve interventions. Life history theory suggests that environmental quality may pattern maternal investment, including breastfeeding. We analyse a nationally representative dataset to test two predictions: (i) higher local environmental quality predicts higher likelihood of breastfeeding initiation and longer duration; (ii) higher socioeconomic status (SES) provides a buffer against the adverse influences of low local environmental quality. Methodology: We ran factor analysis on a wide range of local-level environmental variables. Two summary measures of local environmental quality were generated by this analysis-one 'objective' (based on an independent assessor's neighbourhood scores) and one 'subjective' (based on respondent's scores). We used mixed-effects regression techniques to test our hypotheses. Results: Higher objective, but not subjective, local environmental quality predicts higher likelihood of starting and maintaining breastfeeding over and above individual SES and area-level measures of environmental quality. Higher individual SES is protective, with women from high-income households having relatively high breastfeeding initiation rates and those with high status jobs being more likely to maintain breastfeeding, even in poor environmental conditions. Conclusions and Implications: Environmental quality is often vaguely measured; here we present a thorough investigation of environmental quality at the local level, controlling for individual- and area-level measures. Our findings support a shift in focus away from individual factors and towards altering the landscape of women's decision making contexts when considering behaviours relevant to public health.
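
    A hedged sketch of the two-step analysis style described above is shown below: derive a single environmental-quality factor from several local-level variables, then fit a mixed-effects regression with area as a grouping factor. The file and column names are hypothetical, and a linear mixed model is used purely for illustration; the study's own models for binary outcomes such as breastfeeding initiation would more likely be logistic.

    ```python
    # Illustrative two-step analysis: factor analysis for a summary environmental
    # quality score, then a mixed-effects regression grouped by area. All names
    # are hypothetical placeholders, not the study's actual variables.
    import pandas as pd
    from sklearn.decomposition import FactorAnalysis
    import statsmodels.formula.api as smf

    df = pd.read_csv("cohort.csv")                        # hypothetical cohort extract
    env_vars = ["litter", "traffic", "green_space"]       # assumed local-level indicators
    df["env_quality"] = FactorAnalysis(n_components=1).fit_transform(df[env_vars]).ravel()

    model = smf.mixedlm("breastfeeding_weeks ~ env_quality + household_income",
                        data=df, groups=df["area_id"])
    print(model.fit().summary())
    ```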

  11. The satellite-based remote sensing of particulate matter (PM) in support to urban air quality: PM variability and hot spots within the Cordoba city (Argentina) as revealed by the high-resolution MAIAC-algorithm retrievals applied to a ten-years dataset (2

    NASA Astrophysics Data System (ADS)

    Della Ceca, Lara Sofia; Carreras, Hebe A.; Lyapustin, Alexei I.; Barnaba, Francesca

    2016-04-01

    Particulate matter (PM) is one of the major harmful pollutants to public health and the environment [1]. In developed countries, specific air-quality legislation establishes limit values for PM metrics (e.g., PM10, PM2.5) to protect the citizens health (e.g., European Commission Directive 2008/50, US Clean Air Act). Extensive PM measuring networks therefore exist in these countries to comply with the legislation. In less developed countries air quality monitoring networks are still lacking and satellite-based datasets could represent a valid alternative to fill observational gaps. The main PM (or aerosol) parameter retrieved from satellite is the 'aerosol optical depth' (AOD), an optical parameter quantifying the aerosol load in the whole atmospheric column. Datasets from the MODIS sensors on board of the NASA spacecrafts TERRA and AQUA are among the longest records of AOD from space. However, although extremely useful in regional and global studies, the standard 10 km-resolution MODIS AOD product is not suitable to be employed at the urban scale. Recently, a new algorithm called Multi-Angle Implementation of Atmospheric Correction (MAIAC) was developed for MODIS, providing AOD at 1 km resolution [2]. In this work, the MAIAC AOD retrievals over the decade 2003-2013 were employed to investigate the spatiotemporal variation of atmospheric aerosols over the Argentinean city of Cordoba and its surroundings, an area where a very scarce dataset of in situ PM data is available. The MAIAC retrievals over the city were firstly validated using a 'ground truth' AOD dataset from the Cordoba sunphotometer operating within the global AERONET network [3]. This validation showed the good performances of the MAIAC algorithm in the area. The satellite MAIAC AOD dataset was therefore employed to investigate the 10-years trend as well as seasonal and monthly patterns of particulate matter in the Cordoba city. The first showed a marked increase of AOD over time, particularly evident in some areas of the city (hot spots). These hot spots were put in relation with changes in vehicular traffic flows after the construction of new roads in the urban area. The monthly-resolved analysis showed a marked seasonal cycle, evidencing the influence of both meteorological conditions and season-dependent sources on the AOD parameter. For instance, in the Cordoba rural area an increase of AOD is observed during March-April, which is the soybean harvesting period, the main agricultural activity in the region. Furthermore, higher AOD signals were observed in the vicinity of main roads during summer months (December to February), likely related to the increase in vehicular traffic flow due to tourism. Long-range transport is also shown to play a role at the city scale, as high AODs throughout the study area are observed between August and November. In fact, this is the biomass-burning season over the Amazon region and over most of South America, with huge amounts of fire-related particles injected into the atmosphere and transported across the continent [4]. References [1] WHO, 2013; REVIHAAP, Project Technical Report [2] Lyapustin et al., 2011; doi: 10.1029/2010JD014986 [3] Holben et al., 1998, doi:10.1016/S0034-4257(98)00031-5 [4] Castro et al., 2013; doi:10.1016/j.atmosres.2012.10.026
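
    As an illustrative sketch (not the study's processing chain) of how seasonal cycles and multi-year behaviour can be summarised from a pixel-level AOD time series, the snippet below computes a monthly climatology and annual means with pandas; the CSV file and column names are hypothetical.

    ```python
    # Monthly climatology and annual means from a hypothetical AOD time series.
    import pandas as pd

    aod = pd.read_csv("maiac_aod_pixel.csv", parse_dates=["date"], index_col="date")

    monthly_climatology = aod["aod"].groupby(aod.index.month).mean()   # seasonal cycle
    annual_means = aod["aod"].resample("YS").mean()                    # one value per year

    print(monthly_climatology)
    print(annual_means.diff().mean())   # crude indication of a multi-year tendency
    ```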

  12. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics.

    PubMed

    Liu, Jianfang; Lichtenberg, Tara; Hoadley, Katherine A; Poisson, Laila M; Lazar, Alexander J; Cherniack, Andrew D; Kovatich, Albert J; Benz, Christopher C; Levine, Douglas A; Lee, Adrian V; Omberg, Larsson; Wolf, Denise M; Shriver, Craig D; Thorsson, Vesteinn; Hu, Hai

    2018-04-05

    For a decade, The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types. TCGA clinical data contain key features representing the democratized nature of the data collection process. To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints. In addition to detailing major challenges and statistical limitations encountered during the effort of integrating the acquired clinical data, we present a summary that includes endpoint usage recommendations for each cancer type. These TCGA-CDR findings appear to be consistent with cancer genomics studies independent of the TCGA effort and provide opportunities for investigating cancer biology using clinical correlates at an unprecedented scale. Copyright © 2018 Elsevier Inc. All rights reserved.
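
    A minimal sketch of how such a resource can feed survival analysis is given below, using the lifelines package; the file name is a placeholder, and the column names (type, OS, OS.time) follow commonly used TCGA-CDR conventions but should be checked against the actual resource.

    ```python
    # Kaplan-Meier overall-survival curve from a CDR-style clinical table;
    # file and column names are assumptions, not verified against the resource.
    import pandas as pd
    from lifelines import KaplanMeierFitter

    cdr = pd.read_csv("tcga_cdr.csv")                       # placeholder file name
    cohort = cdr[cdr["type"] == "BRCA"].dropna(subset=["OS.time", "OS"])

    kmf = KaplanMeierFitter()
    kmf.fit(durations=cohort["OS.time"], event_observed=cohort["OS"], label="BRCA OS")
    print(kmf.median_survival_time_)
    ```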

  13. Genome-wide assessment of differential translations with ribosome profiling data.

    PubMed

    Xiao, Zhengtao; Zou, Qin; Liu, Yu; Yang, Xuerui

    2016-04-04

    The closely regulated process of mRNA translation is crucial for precise control of protein abundance and quality. Ribosome profiling, a combination of ribosome foot-printing and RNA deep sequencing, has been used in a large variety of studies to quantify genome-wide mRNA translation. Here, we developed Xtail, an analysis pipeline tailored for ribosome profiling data that comprehensively and accurately identifies differentially translated genes in pairwise comparisons. Applied to simulated and real datasets, Xtail exhibits high sensitivity with minimal false-positive rates, outperforming existing methods in the accuracy of quantifying differential translations. With published ribosome profiling datasets, Xtail not only reveals differentially translated genes that make biological sense, but also uncovers new events of differential translation in human cancer cells upon mTOR signalling perturbation and in human primary macrophages upon interferon gamma (IFN-γ) treatment. This demonstrates the value of Xtail in providing novel insights into the molecular mechanisms that involve translational dysregulations.

  14. Construction of a large collection of small genome variations in French dairy and beef breeds using whole-genome sequences.

    PubMed

    Boussaha, Mekki; Michot, Pauline; Letaief, Rabia; Hozé, Chris; Fritz, Sébastien; Grohs, Cécile; Esquerré, Diane; Duchesne, Amandine; Philippe, Romain; Blanquet, Véronique; Phocas, Florence; Floriot, Sandrine; Rocha, Dominique; Klopp, Christophe; Capitan, Aurélien; Boichard, Didier

    2016-11-15

    In recent years, several bovine genome sequencing projects were carried out with the aim of developing genomic tools to improve dairy and beef production efficiency and sustainability. In this study, we describe the first French cattle genome variation dataset obtained by sequencing 274 whole genomes representing several major dairy and beef breeds. This dataset contains over 28 million single nucleotide polymorphisms (SNPs) and small insertions and deletions. Comparisons between sequencing results and SNP array genotypes revealed a very high genotype concordance rate, which indicates the good quality of our data. To our knowledge, this is the first large-scale catalog of small genomic variations in French dairy and beef cattle. This resource will contribute to the study of gene functions and population structure and also help to improve traits through genotype-guided selection.
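
    The concordance check mentioned above reduces to a simple idea: at sites genotyped by both technologies, count the fraction of identical calls. The sketch below uses invented genotypes purely to illustrate the calculation.

    ```python
    # Toy genotype concordance between sequencing-derived and array-derived calls.
    seq_calls   = {"chr1:1001": "AA", "chr1:2050": "AG", "chr2:300": "GG", "chr2:950": "TT"}
    array_calls = {"chr1:1001": "AA", "chr1:2050": "AG", "chr2:300": "GT", "chr2:950": "TT"}

    shared = set(seq_calls) & set(array_calls)
    matches = sum(seq_calls[s] == array_calls[s] for s in shared)
    print(f"concordance: {matches / len(shared):.2%} over {len(shared)} shared sites")
    ```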

  15. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities

    DOE PAGES

    Kang, Dongwan D.; Froula, Jeff; Egan, Rob; ...

    2015-01-01

    Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. Lastly, it automatically forms hundreds of high quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.
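
    One of the two signals MetaBAT combines, tetranucleotide frequency, is easy to illustrate: count all 4-mers in a contig and normalise to frequencies. The sketch below shows only that idea; MetaBAT's actual implementation also collapses reverse complements and uses empirical probabilistic distance models.

    ```python
    # Tetranucleotide frequency (TNF) vector for a toy contig; an illustration
    # of the signal, not MetaBAT's implementation.
    from collections import Counter
    from itertools import product

    def tnf(contig):
        kmers = [contig[i:i + 4] for i in range(len(contig) - 3)]
        counts = Counter(k for k in kmers if set(k) <= set("ACGT"))
        total = sum(counts.values()) or 1
        return {"".join(p): counts["".join(p)] / total for p in product("ACGT", repeat=4)}

    vec = tnf("ATGCGTACGTTAGCATGCGT" * 10)        # made-up sequence
    print(len(vec), "tetranucleotide frequencies; e.g. TACG:", round(vec["TACG"], 4))
    ```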

  16. Experiments to Distribute Map Generalization Processes

    NASA Astrophysics Data System (ADS)

    Berli, Justin; Touya, Guillaume; Lokhat, Imran; Regnauld, Nicolas

    2018-05-01

    Automatic map generalization requires the use of computationally intensive processes often unable to deal with large datasets. Distributing the generalization process is the only way to make these processes scalable and usable in practice. But map generalization is a highly contextual process, and the surroundings of a generalized map feature need to be known to generalize the feature, which is a problem as distribution might partition the dataset and parallelize the processing of each part. This paper proposes experiments to evaluate the past propositions to distribute map generalization, and to identify the main remaining issues. The past propositions to distribute map generalization are first discussed, and then the experiment hypotheses and apparatus are described. The experiments confirmed that regular partitioning was the quickest strategy, but also the least effective in taking context into account. Geographical partitioning, though less efficient for now, is quite promising regarding the quality of the results as it better integrates the geographical context.

  17. AmeriFlux Network Data Activities: updates, progress and plans

    NASA Astrophysics Data System (ADS)

    Yang, B.; Boden, T.; Krassovski, M.; Song, X.

    2013-12-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at the Oak Ridge National Laboratory serves as the long-term data repository for the AmeriFlux network. Datasets currently available include hourly or half-hourly meteorological and flux observations, biological measurement records, and synthesis data products. In this presentation, we provide an update of this network database, including a comprehensive review and evaluation of the biological data from about 70 sites, development of a new product for flux uncertainty estimates, and re-formatting of Level-2 standard files. In 2013, we also provided data support to two synthesis studies: the 2012 drought synthesis and the FACE synthesis. Issues related to data quality and solutions in compiling datasets for these synthesis studies will be discussed. We will also present our work plans in developing and producing other high-level products, such as derivation of phenology from the available measurements at flux sites.

  18. A hierarchical network-based algorithm for multi-scale watershed delineation

    NASA Astrophysics Data System (ADS)

    Castronova, Anthony M.; Goodall, Jonathan L.

    2014-11-01

    Watershed delineation is a process for defining a land area that contributes surface water flow to a single outlet point. It is commonly used in water resources analysis to define the domain in which hydrologic process calculations are applied. There has been a growing effort over the past decade to improve surface elevation measurements in the U.S., which has had a significant impact on the accuracy of hydrologic calculations. Traditional watershed processing on these elevation rasters, however, becomes more burdensome as data resolution increases. As a result, processing of these datasets can be troublesome on standard desktop computers. This challenge has resulted in numerous works that aim to provide high performance computing solutions to large data, high resolution data, or both. This work proposes an efficient watershed delineation algorithm for use in desktop computing environments that leverages existing data, U.S. Geological Survey (USGS) National Hydrography Dataset Plus (NHD+), and open source software tools to construct watershed boundaries. This approach makes use of U.S. national-level hydrography data that has been precomputed using raster processing algorithms coupled with quality control routines. Our approach uses carefully arranged data and mathematical graph theory to traverse river networks and identify catchment boundaries. We demonstrate this new watershed delineation technique, compare its accuracy with traditional algorithms that derive watersheds solely from digital elevation models, and then extend our approach to address subwatershed delineation. Our findings suggest that the open-source hierarchical network-based delineation procedure presented in this work is a promising approach to watershed delineation that can be used to summarize publicly available datasets for hydrologic model input pre-processing. Through our analysis, we explore the benefits of reusing the NHD+ datasets for watershed delineation, and find that our technique offers greater flexibility and extensibility than traditional raster algorithms.
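
    The graph-traversal idea behind this approach can be sketched briefly: treat precomputed catchments as nodes in a directed graph whose edges point downstream, then gather every node upstream of a chosen outlet. The snippet below is a conceptual illustration with made-up identifiers, not the paper's implementation against the actual NHD+ tables.

    ```python
    # Conceptual upstream traversal over a catchment graph using networkx.
    import networkx as nx

    g = nx.DiGraph()
    g.add_edges_from([
        ("c1", "c3"), ("c2", "c3"),   # two headwater catchments drain into c3
        ("c3", "c4"), ("c5", "c4"),   # c3 and c5 drain into the outlet catchment c4
    ])

    outlet = "c4"
    watershed = {outlet} | nx.ancestors(g, outlet)   # outlet plus everything upstream
    print(sorted(watershed))                         # ['c1', 'c2', 'c3', 'c4', 'c5']
    ```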

  19. Who shares? Who doesn't? Factors associated with openly archiving raw research data.

    PubMed

    Piwowar, Heather A

    2011-01-01

    Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.

  20. Characterising droughts in Central America with uncertain hydro-meteorological data

    NASA Astrophysics Data System (ADS)

    Quesada Montano, B.; Westerberg, I.; Wetterhall, F.; Hidalgo, H. G.; Halldin, S.

    2015-12-01

    Drought studies are scarce in Central America, a region frequently affected by droughts that cause significant socio-economic and environmental problems. Drought characterisation is important for water management and planning and can be done with the help of drought indices. Many indices have been developed in the last decades but their ability to suitably characterise droughts depends on the region of application. In Central America, comprehensive and high-quality observational networks of meteorological and hydrological data are not available. This limits the choice of drought indices and highlights the need to evaluate the quality of the data used in their calculation. This paper aimed to find which combination(s) of drought index and meteorological database are most suitable for characterising droughts in Central America. The drought indices evaluated were the standardised precipitation index (SPI), deciles (DI), the standardised precipitation evapotranspiration index (SPEI) and the effective drought index (EDI). These were calculated using precipitation data from the Climate Hazards Group Infra-Red Precipitation with station (CHIRPS), CRN073, the Climate Research Unit (CRU), ERA-Interim and station databases, and temperature data from the CRU database. All the indices were calculated at 1-, 3-, 6-, 9- and 12-month accumulation times. As a first step, the large-scale meteorological precipitation datasets were compared to obtain an overview of the level of agreement between them and to find possible quality problems. Then, the performance of all the combinations of drought indices and meteorological datasets was evaluated against independent river discharge data, in the form of the standardised streamflow index (SSI). Results revealed large disagreement between the precipitation datasets; we found the selection of database to be more important than the selection of drought index. We found that the best combinations of meteorological drought index and database were obtained using the SPI and DI, calculated with CHIRPS and station data.
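
    The SPI used above can be sketched in a few lines: fit a gamma distribution to accumulated precipitation totals and map the fitted cumulative probabilities onto standard normal deviates. The sketch below omits the usual refinements (handling of zero-precipitation months, fitting per calendar month) and uses invented totals.

    ```python
    # Rough SPI calculation for one accumulation period with invented data.
    import numpy as np
    from scipy import stats

    precip = np.array([45.0, 80.0, 120.0, 20.0, 60.0, 95.0, 150.0, 30.0,
                       70.0, 110.0, 55.0, 85.0])          # hypothetical 3-month totals

    shape, loc, scale = stats.gamma.fit(precip, floc=0)   # fix the location at zero
    cdf = stats.gamma.cdf(precip, shape, loc=loc, scale=scale)
    spi = stats.norm.ppf(cdf)                             # transform to standard normal
    print(np.round(spi, 2))                               # negative values = drier than normal
    ```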

  1. Assessing Data Quality in Emergent Domains of Earth Sciences

    NASA Astrophysics Data System (ADS)

    Darch, P. T.; Borgman, C.

    2016-12-01

    As earth scientists seek to study known phenomena in new ways, and to study new phenomena, they often develop new technologies and new methods such as embedded network sensing, or reapply extant technologies, such as seafloor drilling. Emergent domains are often highly multidisciplinary as researchers from many backgrounds converge on new research questions. They may adapt existing methods, or develop methods de novo. As a result, emerging domains tend to be methodologically heterogeneous. As these domains mature, pressure to standardize methods increases. Standardization promotes trust, reliability, accuracy, and reproducibility, and simplifies data management. However, for standardization to occur, researchers must be able to assess which of the competing methods produces the highest quality data. The exploratory nature of emerging domains discourages standardization. Because competing methods originate in different disciplinary backgrounds, their scientific credibility is difficult to compare. Instead of direct comparison, researchers attempt to conduct meta-analyses. Scientists compare datasets produced by different methods to assess their consistency and efficiency. This paper presents findings from a long-term qualitative case study of research on the deep subseafloor biosphere, an emergent domain. A diverse community converged on the study of microbes in the seafloor and those microbes' interactions with the physical environments they inhabit. Data on this problem are scarce, leading to calls for standardization as a means to acquire and analyze greater volumes of data. Lacking consistent methods, scientists attempted to conduct meta-analyses to determine the most promising methods on which to standardize. Among the factors that inhibited meta-analyses were disparate approaches to metadata and to curating data. Datasets may be deposited in a variety of databases or kept on individual scientists' servers. Associated metadata may be inconsistent or hard to interpret. Incentive structures, including prospects for journal publication, often favor new data over reanalyzing extant datasets. Assessing data quality in emergent domains is extremely difficult and will require adaptations in infrastructure, culture, and incentives.

  2. Identification and influence of spatio-temporal outliers in urban air quality measurements.

    PubMed

    O'Leary, Brendan; Reiners, John J; Xu, Xiaohong; Lemke, Lawrence D

    2016-12-15

    Forty eight potential outliers in air pollution measurements taken simultaneously in Detroit, Michigan, USA and Windsor, Ontario, Canada in 2008 and 2009 were identified using four independent methods: box plots, variogram clouds, difference maps, and the Local Moran's I statistic. These methods were subsequently used in combination to reduce and select a final set of 13 outliers for nitrogen dioxide (NO 2 ), volatile organic compounds (VOCs), total benzene, toluene, ethyl benzene, and xylene (BTEX), and particulate matter in two size fractions (PM 2.5 and PM 10 ). The selected outliers were excluded from the measurement datasets and used to revise air pollution models. In addition, a set of temporally-scaled air pollution models was generated using time series measurements from community air quality monitors, with and without the selected outliers. The influence of outlier exclusion on associations with asthma exacerbation rates aggregated at a postal zone scale in both cities was evaluated. Results demonstrate that the inclusion or exclusion of outliers influences the strength of observed associations between intraurban air quality and asthma exacerbation in both cities. The box plot, variogram cloud, and difference map methods largely determined the final list of outliers, due to the high degree of conformity among their results. The Moran's I approach was not useful for outlier identification in the datasets studied. Removing outliers changed the spatial distribution of modeled concentration values and derivative exposure estimates averaged over postal zones. Overall, associations between air pollution and acute asthma exacerbation rates were weaker with outliers removed, but improved with the addition of temporal information. Decreases in statistically significant associations between air pollution and asthma resulted, in part, from smaller pollutant concentration ranges used for linear regression. Nevertheless, the practice of identifying outliers through congruence among multiple methods strengthens confidence in the analysis of outlier presence and influence in environmental datasets. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
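
    The box-plot screen used as one of the four methods above reduces to the familiar interquartile-range rule, sketched below on invented concentration values.

    ```python
    # IQR (box-plot) outlier screen on a toy set of pollutant readings.
    import numpy as np

    no2 = np.array([18.2, 21.5, 19.8, 22.1, 20.4, 55.3, 19.1, 23.0, 21.7, 2.4])

    q1, q3 = np.percentile(no2, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(no2[(no2 < lower) | (no2 > upper)])   # flags the implausibly high and low readings
    ```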

  3. Standardising trauma monitoring: the development of a minimum dataset for trauma registries in Australia and New Zealand.

    PubMed

    Palmer, Cameron S; Davey, Tamzyn M; Mok, Meng Tuck; McClure, Rod J; Farrow, Nathan C; Gruen, Russell L; Pollard, Cliff W

    2013-06-01

    Trauma registries are central to the implementation of effective trauma systems. However, differences between trauma registry datasets make comparisons between trauma systems difficult. In 2005, the collaborative Australian and New Zealand National Trauma Registry Consortium began a process to develop a bi-national minimum dataset (BMDS) for use in Australasian trauma registries. This study aims to describe the steps taken in the development and preliminary evaluation of the BMDS. A working party comprising sixteen representatives from across Australasia identified and discussed the collectability and utility of potential BMDS fields. This included evaluating existing national and international trauma registry datasets, as well as reviewing all quality indicators and audit filters in use in Australasian trauma centres. After the working party activities concluded, this process was continued by a number of interested individuals, with broader feedback sought from the Australasian trauma community on a number of occasions. Once the BMDS had reached a suitable stage of development, an email survey was conducted across Australasian trauma centres to assess whether BMDS fields met an ideal minimum standard of field collectability. The BMDS was also compared with three prominent international datasets to assess the extent of dataset overlap. Following this, the BMDS was encapsulated in a data dictionary, which was introduced in late 2010. The finalised BMDS contained 67 data fields. Forty-seven of these fields met a previously published criterion of 80% collectability across respondent trauma institutions; the majority of the remaining fields either could be collected without any change in resources, or could be calculated from other data fields in the BMDS. However, comparability with international registry datasets was poor. Only nine BMDS fields had corresponding, directly comparable fields in all the national and international-level registry datasets evaluated. A draft BMDS has been developed for use in trauma registries across Australia and New Zealand. The email survey provided strong indications of the utility of the fields contained in the BMDS. The BMDS has been adopted as the dataset to be used by an ongoing Australian Trauma Quality Improvement Program. Copyright © 2012 Elsevier Ltd. All rights reserved.

  4. High-Quality T2-Weighted 4-Dimensional Magnetic Resonance Imaging for Radiation Therapy Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Du, Dongsu; Caruthers, Shelton D.; Glide-Hurst, Carri

    2015-06-01

    Purpose: The purpose of this study was to improve triggering efficiency of the prospective respiratory amplitude-triggered 4-dimensional magnetic resonance imaging (4DMRI) method and to develop a 4DMRI imaging protocol that could offer T2 weighting for better tumor visualization, good spatial coverage and spatial resolution, and respiratory motion sampling within a reasonable amount of time for radiation therapy applications. Methods and Materials: The respiratory state splitting (RSS) and multi-shot acquisition (MSA) methods were analytically compared and validated in a simulation study by using the respiratory signals from 10 healthy human subjects. The RSS method was more effective in improving triggering efficiency. It was implemented in prospective respiratory amplitude-triggered 4DMRI. 4DMRI image datasets were acquired from 5 healthy human subjects. Liver motion was estimated using the acquired 4DMRI image datasets. Results: The simulation study showed the RSS method was more effective for improving triggering efficiency than the MSA method. The average reductions in 4DMRI acquisition times were 36% and 10% for the RSS and MSA methods, respectively. The human subject study showed that T2-weighted 4DMRI with 10 respiratory states, 60 slices at a spatial resolution of 1.5 × 1.5 × 3.0 mm³ could be acquired in 9 to 18 minutes, depending on the individual's breath pattern. Based on the acquired 4DMRI image datasets, the ranges of peak-to-peak liver displacements among 5 human subjects were 9.0 to 12.9 mm, 2.5 to 3.9 mm, and 0.5 to 2.3 mm in superior-inferior, anterior-posterior, and left-right directions, respectively. Conclusions: We demonstrated that with the RSS method, it was feasible to acquire high-quality T2-weighted 4DMRI within a reasonable amount of time for radiation therapy applications.

  5. Carotid dual-energy CT angiography: Evaluation of low keV calculated monoenergetic datasets by means of a frequency-split approach for noise reduction at low keV levels.

    PubMed

    Riffel, Philipp; Haubenreisser, Holger; Meyer, Mathias; Sudarski, Sonja; Morelli, John N; Schmidt, Bernhard; Schoenberg, Stefan O; Henzler, Thomas

    2016-04-01

    Calculated monoenergetic ultra-low keV datasets did not lead to improved contrast-to-noise ratio (CNR) due to the dramatic increase in image noise. The aim of the present study was to evaluate the objective image quality of ultra-low keV monoenergetic images (MEIs) calculated from carotid DECT angiography data with a new monoenergetic imaging algorithm using a frequency-split technique. 20 patients (12 male; mean age 53±17 years) were retrospectively analyzed. MEIs from 40 to 120 keV were reconstructed using the monoenergetic split frequency approach (MFSA). Additionally, MEIs were reconstructed for 40 and 50 keV using a conventional monoenergetic (CM) software application. Signal intensity, noise, signal-to-noise ratio (SNR) and CNR were assessed in the basilar, common carotid, and internal carotid arteries. Ultra-low keV MEIs at 40 keV and 50 keV demonstrated the highest vessel attenuation, significantly greater than that of the polyenergetic images (PEI) (all p-values <0.05). The highest SNR and CNR levels were found at 40 keV and 50 keV (all p-values <0.05). MEIs with MFSA showed significantly lower noise levels than those processed with CM (all p-values <0.05) and no significant differences in vessel attenuation (p>0.05). Thus MEIs with MFSA showed significantly higher SNR and CNR compared to MEIs with CM. Combining the lower spatial frequency stack for contrast at low keV levels with the high spatial frequency stack for noise at high keV levels (frequency-split technique) leads to improved image quality of ultra-low keV monoenergetic DECT datasets when compared to previous monoenergetic reconstruction techniques without the frequency-split technique. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
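
    The frequency-split idea can be illustrated conceptually: take the low spatial frequencies from the low-keV image (which carries the contrast) and the high spatial frequencies from a higher-keV image (which carries less noise), and add them. The sketch below uses random arrays as stand-ins for image stacks and a Gaussian filter as the split; it is not the vendor algorithm.

    ```python
    # Conceptual frequency-split combination of a noisy low-keV image and a
    # quiet high-keV image; arrays are random placeholders, not CT data.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    rng = np.random.default_rng(0)
    low_kev = rng.normal(300, 60, size=(256, 256))    # e.g. 40 keV: strong contrast, noisy
    high_kev = rng.normal(150, 15, size=(256, 256))   # e.g. 120 keV: weak contrast, quiet

    sigma = 3.0                                       # split scale, in pixels
    low_pass = gaussian_filter(low_kev, sigma)                 # contrast band
    high_pass = high_kev - gaussian_filter(high_kev, sigma)    # fine-detail band
    combined = low_pass + high_pass

    print(combined.std(), "vs", low_kev.std())        # combined image is far less noisy
    ```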

  6. Building a high level sample processing and quality assessment model for biogeochemical measurements: a case study from the ocean acidification community

    NASA Astrophysics Data System (ADS)

    Thomas, R.; Connell, D.; Spears, T.; Leadbetter, A.; Burger, E. F.

    2016-12-01

    The scientific literature heavily features small-scale studies with the impact of the results extrapolated to regional/global importance. There are on-going initiatives (e.g. OA-ICC, GOA-ON, GEOTRACES, EMODNet Chemistry) aiming to assemble regional to global-scale datasets that are available for trend or meta-analyses. Assessing the quality and comparability of these data requires information about the processing chain from “sampling to spreadsheet”. This provenance information needs to be captured and readily available to assess data fitness for purpose. The NOAA Ocean Acidification metadata template was designed in consultation with domain experts for this reason; the core carbonate chemistry variables have 23-37 metadata fields each, and for scientists generating these datasets there can appear to be an ever-increasing amount of metadata expected to accompany a dataset. While this provenance metadata should be considered essential by those generating or using the data, for those discovering data there is a sliding scale between what is considered discovery metadata (title, abstract, contacts, etc.) and usage metadata (methodology, environmental setup, lineage, etc.), with the split depending on the intended use of the data. As part of the OA-ICC's activities, the metadata fields from the NOAA template relevant to the sample processing chain and QA criteria have been factored to develop profiles for, and extensions to, the OM-JSON encoding supported by the PROV ontology. While this work started with a focus on metadata specific to carbonate chemistry variables, the factorization could be applied within the O&M model across other disciplines such as trace metals or contaminants. In a linked data world with a suitable high level model for sample processing and QA available, tools and support can be provided to link reproducible units of metadata (e.g. the standard protocol for a variable as adopted by a community) and simplify the provision of metadata and subsequent discovery.

  7. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation. PMID:26087371

  8. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    PubMed

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide range of phonocardiogram (PCG) features in the time and frequency domains, along with morphological and statistical features, to construct a robust and discriminative feature set for dataset-agnostic classification of normal subjects and cardiac patients. The large and open access database, made available in the PhysioNet 2016 challenge, was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smartphone-based digital stethoscope at an Indian hospital, was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior-art approaches when applied to the same dataset.
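
    A few of the simple time- and frequency-domain descriptors of the kind surveyed above can be computed in a handful of lines; the sketch below uses a synthetic burst signal as a stand-in for a real PCG recording, and the sampling rate is an assumption.

    ```python
    # Toy time/frequency features (RMS, zero-crossing rate, spectral centroid)
    # for a synthetic PCG-like signal.
    import numpy as np

    fs = 2000                                    # assumed sampling rate, Hz
    t = np.arange(0, 3.0, 1 / fs)
    pcg = np.sin(2 * np.pi * 60 * t) * np.exp(-((t % 0.8) / 0.05) ** 2)   # toy bursts

    rms = np.sqrt(np.mean(pcg ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(pcg))) > 0)
    spectrum = np.abs(np.fft.rfft(pcg))
    freqs = np.fft.rfftfreq(len(pcg), d=1 / fs)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)

    print({"rms": round(rms, 4), "zcr": round(zcr, 4), "centroid_hz": round(centroid, 1)})
    ```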

  9. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation.

  10. PAIR Comparison between Two Within-Group Conditions of Resting-State fMRI Improves Classification Accuracy

    PubMed Central

    Zhou, Zhen; Wang, Jian-Bao; Zang, Yu-Feng; Pan, Gang

    2018-01-01

    Classification approaches have been increasingly applied to differentiate patients and normal controls using resting-state functional magnetic resonance imaging data (RS-fMRI). Although most previous classification studies have reported promising accuracy within individual datasets, achieving high levels of accuracy with multiple datasets remains challenging for two main reasons: high dimensionality, and high variability across subjects. We used two independent RS-fMRI datasets (n = 31, 46, respectively) both with eyes closed (EC) and eyes open (EO) conditions. For each dataset, we first reduced the number of features to a small number of brain regions with paired t-tests, using the amplitude of low frequency fluctuation (ALFF) as a metric. Second, we employed a new method for feature extraction, named the PAIR method, examining EC and EO as paired conditions rather than independent conditions. Specifically, for each dataset, we obtained EC minus EO (EC—EO) maps of ALFF from half of subjects (n = 15 for dataset-1, n = 23 for dataset-2) and obtained EO—EC maps from the other half (n = 16 for dataset-1, n = 23 for dataset-2). A support vector machine (SVM) method was used for classification of EC RS-fMRI mapping and EO mapping. The mean classification accuracy of the PAIR method was 91.40% for dataset-1, and 92.75% for dataset-2 in the conventional frequency band of 0.01–0.08 Hz. For cross-dataset validation, we applied the classifier from dataset-1 directly to dataset-2, and vice versa. The mean accuracy of cross-dataset validation was 94.93% for dataset-1 to dataset-2 and 90.32% for dataset-2 to dataset-1 in the 0.01–0.08 Hz range. For the UNPAIR method, classification accuracy was substantially lower (mean 69.89% for dataset-1 and 82.97% for dataset-2), and was much lower for cross-dataset validation (64.69% for dataset-1 to dataset-2 and 64.98% for dataset-2 to dataset-1) in the 0.01–0.08 Hz range. In conclusion, for within-group design studies (e.g., paired conditions or follow-up studies), we recommend the PAIR method for feature extraction. In addition, dimensionality reduction with strong prior knowledge of specific brain regions should also be considered for feature selection in neuroimaging studies. PMID:29375288
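
    A minimal sketch of the PAIR idea, under assumed array shapes: paired t-tests on region-wise ALFF values select the most condition-sensitive regions, and a linear SVM then classifies EC versus EO maps. The number of retained regions and the kernel are illustrative choices, not the study's exact settings.

```python
# Minimal sketch of paired-condition feature selection plus SVM classification.
# Array shapes, the number of retained regions and the linear kernel are
# illustrative assumptions.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.svm import SVC

def select_regions(alff_ec, alff_eo, n_keep=20):
    """alff_ec, alff_eo: (n_subjects, n_regions) ALFF values per condition."""
    _, p = ttest_rel(alff_ec, alff_eo, axis=0)    # paired test for each region
    return np.argsort(p)[:n_keep]                 # regions most sensitive to EC vs EO

def fit_pair_classifier(alff_ec, alff_eo, regions):
    X = np.vstack([alff_ec[:, regions], alff_eo[:, regions]])
    y = np.r_[np.ones(len(alff_ec)), np.zeros(len(alff_eo))]   # 1 = EC, 0 = EO
    return SVC(kernel="linear").fit(X, y)
```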

  11. Automated retinal image quality assessment on the UK Biobank dataset for epidemiological studies.

    PubMed

    Welikala, R A; Fraz, M M; Foster, P J; Whincup, P H; Rudnicka, A R; Owen, C G; Strachan, D P; Barman, S A

    2016-04-01

    Morphological changes in the retinal vascular network are associated with future risk of many systemic and vascular diseases. However, uncertainty over the presence and nature of some of these associations exists. Analysis of data from large population based studies will help to resolve these uncertainties. The QUARTZ (QUantitative Analysis of Retinal vessel Topology and siZe) retinal image analysis system allows automated processing of large numbers of retinal images. However, an image quality assessment module is needed to achieve full automation. In this paper, we propose such an algorithm, which uses the segmented vessel map to determine the suitability of retinal images for use in the creation of vessel morphometric data suitable for epidemiological studies. This includes an effective 3-dimensional feature set and support vector machine classification. A random subset of 800 retinal images from UK Biobank (a large prospective study of 500,000 middle aged adults; where 68,151 underwent retinal imaging) was used to examine the performance of the image quality algorithm. The algorithm achieved a sensitivity of 95.33% and a specificity of 91.13% for the detection of inadequate images. The strong performance of this image quality algorithm will make rapid automated analysis of vascular morphometry feasible on the entire UK Biobank dataset (and other large retinal datasets), with minimal operator involvement, and at low cost. Copyright © 2016 Elsevier Ltd. All rights reserved.

  12. MICCA: a complete and accurate software for taxonomic profiling of metagenomic data.

    PubMed

    Albanese, Davide; Fontana, Paolo; De Filippo, Carlotta; Cavalieri, Duccio; Donati, Claudio

    2015-05-19

    The introduction of high throughput sequencing technologies has triggered an increase in the number of studies in which the microbiota of environmental and human samples is characterized through the sequencing of selected marker genes. While experimental protocols have undergone a process of standardization that makes them accessible to a large community of scientists, standard and robust data analysis pipelines are still lacking. Here we introduce MICCA, a software pipeline for the processing of amplicon metagenomic datasets that efficiently combines quality filtering, clustering of Operational Taxonomic Units (OTUs), taxonomy assignment and phylogenetic tree inference. MICCA provides accurate results, reaching a good compromise between modularity and usability. Moreover, we introduce a de-novo clustering algorithm specifically designed for the inference of Operational Taxonomic Units (OTUs). Tests on real and synthetic datasets show that, thanks to the optimized read-filtering process and to the new clustering algorithm, MICCA provides estimates of the number of OTUs and of other common ecological indices that are more accurate and robust than those from currently available pipelines. Analysis of public metagenomic datasets shows that the higher consistency of results improves our understanding of the structure of environmental and human-associated microbial communities. MICCA is an open source project.
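
    The quality-filtering stage of such a pipeline can be pictured with the short sketch below, which drops reads that are too short or have a low mean Phred score. The thresholds and the minimal FASTQ parser are assumptions for illustration and do not reproduce MICCA's actual filtering rules.

```python
# Illustrative read-quality filter in the spirit of the pipeline's first stage.
# Thresholds and parsing are assumptions, not MICCA's actual code.
def parse_fastq(handle):
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()                        # '+' separator line
        qual = handle.readline().strip()
        yield header, seq, qual

def quality_filter(fastq_path, min_len=200, min_mean_q=25):
    kept = []
    with open(fastq_path) as fh:
        for header, seq, qual in parse_fastq(fh):
            phred = [ord(c) - 33 for c in qual]  # Sanger/Illumina 1.8+ encoding
            if len(seq) >= min_len and sum(phred) / len(phred) >= min_mean_q:
                kept.append((header, seq))
    return kept
```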

  13. Seasonal-scale Observational Data Analysis and Atmospheric Phenomenology for the Cold Land Processes Experiment

    NASA Technical Reports Server (NTRS)

    Poulos, Gregory S.; Stamus, Peter A.; Snook, John S.

    2005-01-01

    The Cold Land Processes Experiment (CLPX) emphasized the development of a strong synergism between process-oriented understanding, land surface models and microwave remote sensing. Our work sought to investigate which topographically generated atmospheric phenomena are most relevant to the CLPX MSAs, for the purpose of evaluating their climatic importance to net local moisture fluxes and snow transport through the use of high-resolution data assimilation/atmospheric numerical modeling techniques. Our task was to create three long-term, scientific-quality atmospheric datasets for quantitative analysis (for all CLPX researchers) and provide a summary of the meteorologically relevant phenomena of the three MSAs over northern Colorado. Our efforts required the ingest of a variety of CLPX datasets and the execution of an atmospheric and land surface data assimilation system based on the Navier-Stokes equations (the Local Analysis and Prediction System, LAPS, and an atmospheric numerical weather prediction model, as required) at topographically relevant grid spacing (approx. 500 m). The resulting dataset will be analyzed by the CLPX community as a part of their larger research goals to determine the relative influence of various atmospheric phenomena on processes relevant to CLPX scientific goals.

  14. MICCA: a complete and accurate software for taxonomic profiling of metagenomic data

    PubMed Central

    Albanese, Davide; Fontana, Paolo; De Filippo, Carlotta; Cavalieri, Duccio; Donati, Claudio

    2015-01-01

    The introduction of high throughput sequencing technologies has triggered an increase of the number of studies in which the microbiota of environmental and human samples is characterized through the sequencing of selected marker genes. While experimental protocols have undergone a process of standardization that makes them accessible to a large community of scientist, standard and robust data analysis pipelines are still lacking. Here we introduce MICCA, a software pipeline for the processing of amplicon metagenomic datasets that efficiently combines quality filtering, clustering of Operational Taxonomic Units (OTUs), taxonomy assignment and phylogenetic tree inference. MICCA provides accurate results reaching a good compromise among modularity and usability. Moreover, we introduce a de-novo clustering algorithm specifically designed for the inference of Operational Taxonomic Units (OTUs). Tests on real and synthetic datasets shows that thanks to the optimized reads filtering process and to the new clustering algorithm, MICCA provides estimates of the number of OTUs and of other common ecological indices that are more accurate and robust than currently available pipelines. Analysis of public metagenomic datasets shows that the higher consistency of results improves our understanding of the structure of environmental and human associated microbial communities. MICCA is an open source project. PMID:25988396

  15. Evolution of the Southern Oscillation as observed by the Nimbus-7 ERB experiment

    NASA Technical Reports Server (NTRS)

    Ardanuy, Philip E.; Kyle, H. Lee; Chang, Hyo-Duck

    1987-01-01

    The Nimbus-7 satellite has been in a 955-km, sun-synchronous orbit since October 1978. The Earth Radiation Budget (ERB) experiment has taken approximately 8 years of high-quality data during this time, of which seven complete years have been archived at the National Space Science Data Center. A final reprocessing of the wide-field-of-view channel dataset is underway. Error analyses indicate a long-term stability of 1 percent or better over the length of the data record. As part of the validation of the ERB measurements, the archived 7-year Nimbus-7 ERB dataset is examined for the presence and accuracy of interannual variations including the Southern Oscillation signal. Zonal averages of broadband outgoing longwave radiation indicate a terrestrial response of more than 2 years to the oceanic and atmospheric manifestations of the 1982-83 El Nino/Southern Oscillation (ENSO) event, especially in the tropics. This signal is present in monthly and seasonal averages and is shown here to derive primarily from atmospheric responses to adjustments in the Pacific Ocean. The calibration stability of this dataset thus provides a powerful new tool to examine the physics of the ENSO phenomena.

  16. X-ray computed tomography using curvelet sparse regularization.

    PubMed

    Wieczorek, Matthias; Frikel, Jürgen; Vogel, Jakob; Eggl, Elena; Kopp, Felix; Noël, Peter B; Pfeiffer, Franz; Demaret, Laurent; Lasser, Tobias

    2015-04-01

    Reconstruction of x-ray computed tomography (CT) data remains a mathematically challenging problem in medical imaging. Complementing the standard analytical reconstruction methods, sparse regularization is growing in importance, as it allows inclusion of prior knowledge. The paper presents a method for sparse regularization based on the curvelet frame for the application to iterative reconstruction in x-ray computed tomography. In this work, the authors present an iterative reconstruction approach based on the alternating direction method of multipliers using curvelet sparse regularization. Evaluation of the method is performed on a specifically crafted numerical phantom dataset to highlight the method's strengths. Additional evaluation is performed on two real datasets from commercial scanners with different noise characteristics, a clinical bone sample acquired in a micro-CT and a human abdomen scanned in a diagnostic CT. The results clearly illustrate that curvelet sparse regularization has characteristic strengths. In particular, it improves the restoration and resolution of highly directional, high contrast features with smooth contrast variations. The authors also compare this approach to the popular technique of total variation and to traditional filtered backprojection. The authors conclude that curvelet sparse regularization is able to improve reconstruction quality by reducing noise while preserving highly directional features.
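
    The sketch below shows the general structure of transform-domain sparse regularization, minimizing (1/2)||Ax - b||^2 + lam*||Wx||_1 by proximal gradient (ISTA). It is only an illustration: the paper's solver is ADMM with a curvelet frame, whereas here an orthonormal DCT stands in for the sparsifying transform and a dense matrix for the CT forward projector.

```python
# Generic sketch of transform-domain sparse regularization for CT-like problems:
# minimize (1/2)*||A x - b||^2 + lam * ||W x||_1 via proximal gradient (ISTA).
# The DCT is a stand-in for a curvelet frame, and ISTA for the paper's ADMM.
import numpy as np
from scipy.fft import dct, idct

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_reconstruct(A, b, lam=0.1, n_iter=200):
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        z = x - step * grad
        # prox of lam*||W x||_1 for an orthonormal W: W^T soft(W z, lam * step)
        coeffs = dct(z, norm="ortho")
        x = idct(soft(coeffs, lam * step), norm="ortho")
    return x
```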

  17. Seltzer_et_al_2016

    EPA Pesticide Factsheets

    This dataset supports the modeling study of Seltzer et al. (2016) published in Atmospheric Environment. In this study, techniques typically used for future air quality projections are applied to a historical 11-year period to assess the performance of the modeling system when the driving meteorological conditions are obtained using dynamical downscaling of coarse-scale fields without correcting toward higher resolution observations. The Weather Research and Forecasting model and the Community Multiscale Air Quality model are used to simulate regional climate and air quality over the contiguous United States for 2000-2010. The air quality simulations for that historical period are then compared to observations from four national networks. Comparisons are drawn between defined performance metrics and other published modeling results for predicted ozone, fine particulate matter, and speciated fine particulate matter. The results indicate that the historical air quality simulations driven by dynamically downscaled meteorology are typically within defined modeling performance benchmarks and are consistent with results from other published modeling studies using finer-resolution meteorology. This indicates that the regional climate and air quality modeling framework utilized here does not introduce substantial bias, which provides confidence in the method's use for future air quality projections. This dataset is associated with the following publication: Seltzer, K., C

  18. a Critical Review of Automated Photogrammetric Processing of Large Datasets

    NASA Astrophysics Data System (ADS)

    Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.

    2017-08-01

    The paper reports some comparisons between commercial software packages able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a diverse number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of photogrammetric terms is also provided, to establish rigorous terms of reference for comparisons and critical analyses.

  19. Antarctic and Sub-Antarctic Asteroidea database.

    PubMed

    Moreau, Camille; Mah, Christopher; Agüera, Antonio; Améziane, Nadia; Barnes, David; Crokaert, Guillaume; Eléaume, Marc; Griffiths, Huw; Guillaumot, Charlène; Hemery, Lenaïg G; Jażdżewska, Anna; Jossart, Quentin; Laptikhovsky, Vladimir; Linse, Katrin; Neill, Kate; Sands, Chester; Saucède, Thomas; Schiaparelli, Stefano; Siciński, Jacek; Vasset, Noémie; Danis, Bruno

    2018-01-01

    The present dataset is a compilation of georeferenced occurrences of asteroids (Echinodermata: Asteroidea) in the Southern Ocean. Occurrence data south of 45°S latitude were mined from various sources together with information regarding the taxonomy, the sampling source and sampling sites when available. Records from 1872 to 2016 were thoroughly checked to ensure the quality of a dataset that reaches a total of 13,840 occurrences from 4,580 unique sampling events. Information regarding the reproductive strategy (brooders vs. broadcasters) of 63 species is also made available. This dataset represents the most exhaustive occurrence database on Antarctic and Sub-Antarctic asteroids.

  20. Use of Patient Registries and Administrative Datasets for the Study of Pediatric Cancer

    PubMed Central

    Rice, Henry E.; Englum, Brian R.; Gulack, Brian C.; Adibe, Obinna O.; Tracy, Elizabeth T.; Kreissman, Susan G.; Routh, Jonathan C.

    2015-01-01

    Analysis of data from large administrative databases and patient registries is increasingly being used to study childhood cancer care, although the value of these data sources remains unclear to many clinicians. Interpretation of large databases requires a thorough understanding of how the dataset was designed, how data were collected, and how to assess data quality. This review will detail the role of administrative databases and registry databases for the study of childhood cancer, tools to maximize information from these datasets, and recommendations to improve the use of these databases for the study of pediatric oncology. PMID:25807938

  1. A High-Resolution Merged Wind Dataset for DYNAMO: Progress and Future Plans

    NASA Technical Reports Server (NTRS)

    Lang, Timothy J.; Mecikalski, John; Li, Xuanli; Chronis, Themis; Castillo, Tyler; Hoover, Kacie; Brewer, Alan; Churnside, James; McCarty, Brandi; Hein, Paul

    2015-01-01

    In order to support research on optimal data assimilation methods for the Cyclone Global Navigation Satellite System (CYGNSS), launching in 2016, work has been ongoing to produce a high-resolution merged wind dataset for the Dynamics of the Madden Julian Oscillation (DYNAMO) field campaign, which took place during late 2011/early 2012. The winds are produced by assimilating DYNAMO observations into the Weather Research and Forecasting (WRF) three-dimensional variational (3DVAR) system. Data sources from the DYNAMO campaign include the upper-air sounding network, radial velocities from the radar network, vector winds from the Advanced Scatterometer (ASCAT) and Oceansat-2 Scatterometer (OSCAT) satellite instruments, the NOAA High Resolution Doppler Lidar (HRDL), and several others. In order to prepare them for 3DVAR, significant additional quality control work is being done for the currently available TOGA and SMART-R radar datasets, including automatically dealiasing radial velocities and correcting for intermittent TOGA antenna azimuth angle errors. The assimilated winds are being made available as model output fields from WRF on two separate grids with different horizontal resolutions - a 3-km grid focusing on the main DYNAMO quadrilateral (i.e., Gan Island, the R/V Revelle, the R/V Mirai, and Diego Garcia), and a 1-km grid focusing on the Revelle. The wind dataset is focused on three separate approximately 2-week periods during the Madden Julian Oscillation (MJO) onsets that occurred in October, November, and December 2011. Work is ongoing to convert the 10-m surface winds from these model fields to simulated CYGNSS observations using the CYGNSS End-To-End Simulator (E2ES), and these simulated satellite observations are being compared to radar observations of DYNAMO precipitation systems to document the anticipated ability of CYGNSS to provide information on the relationships between surface winds and oceanic precipitation at the mesoscale level. This research will improve our understanding of the future utility of CYGNSS for documenting key MJO processes.

  2. Diviner lunar radiometer gridded brightness temperatures from geodesic binning of modeled fields of view

    NASA Astrophysics Data System (ADS)

    Sefton-Nash, E.; Williams, J.-P.; Greenhagen, B. T.; Aye, K.-M.; Paige, D. A.

    2017-12-01

    An approach is presented to efficiently produce high quality gridded data records from the large, global point-based dataset returned by the Diviner Lunar Radiometer Experiment aboard NASA's Lunar Reconnaissance Orbiter. The need to minimize data volume and processing time in production of science-ready map products is increasingly important with the growth in data volume of planetary datasets. Diviner makes on average >1400 observations per second of radiance that is reflected and emitted from the lunar surface, using 189 detectors divided into 9 spectral channels. Data management and processing bottlenecks are amplified by modeling every observation as a probability distribution function over the field of view, which can increase the required processing time by 2-3 orders of magnitude. Geometric corrections, such as projection of data points onto a digital elevation model, are numerically intensive and therefore it is desirable to perform them only once. Our approach reduces bottlenecks through parallel binning and efficient storage of a pre-processed database of observations. Database construction is via subdivision of a geodesic icosahedral grid, with a spatial resolution that can be tailored to suit the field of view of the observing instrument. Global geodesic grids with high spatial resolution are normally impractically memory intensive. We therefore demonstrate a minimum storage and highly parallel method to bin very large numbers of data points onto such a grid. A database of the pre-processed and binned points is then used for production of mapped data products, which is significantly faster than if unprocessed points were used. We explore quality controls in the production of gridded data records by conditional interpolation, allowed only where data density is sufficient. The resultant effects on the spatial continuity and uncertainty in maps of lunar brightness temperatures are illustrated. We identify four binning regimes based on trades between the spatial resolution of the grid, the size of the FOV and the on-target spacing of observations. Our approach may be applicable and beneficial for many existing and future point-based planetary datasets.
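
    The binning-with-density-control idea can be sketched as follows: accumulate point observations into grid cells and report only cells with enough samples, leaving the rest as gaps. A regular lat/lon grid stands in for the geodesic icosahedral grid used by the authors; the resolution and the minimum-count threshold are assumptions for illustration.

```python
# Minimal sketch: bin point observations into grid cells and only keep cells
# whose sample count meets a density threshold. A lat/lon grid stands in for
# the authors' geodesic icosahedral grid.
import numpy as np

def bin_observations(lats, lons, values, res_deg=0.5, min_count=5):
    """lats, lons in degrees (lons in [-180, 180)); values: same length."""
    nlat, nlon = int(180 / res_deg), int(360 / res_deg)
    i = np.clip(((lats + 90.0) / res_deg).astype(int), 0, nlat - 1)
    j = np.clip(((lons + 180.0) / res_deg).astype(int), 0, nlon - 1)
    flat = i * nlon + j
    sums = np.bincount(flat, weights=values, minlength=nlat * nlon)
    counts = np.bincount(flat, minlength=nlat * nlon)
    grid = np.full(nlat * nlon, np.nan)
    ok = counts >= min_count                  # density control: otherwise leave a gap
    grid[ok] = sums[ok] / counts[ok]
    return grid.reshape(nlat, nlon), counts.reshape(nlat, nlon)
```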

  3. Scalar and Vector Spherical Harmonics for Assimilation of Global Datasets in the Ionosphere and Thermosphere

    NASA Astrophysics Data System (ADS)

    Miladinovich, D.; Datta-Barua, S.; Bust, G. S.; Ramirez, U.

    2017-12-01

    Understanding physical processes during storm time in the ionosphere-thermosphere (IT) system is limited, in part, due to the inability to obtain accurate estimates of IT states on a global scale. One reason for this inability is the sparsity of spatially distributed high quality data sets. Data assimilation is showing promise toward enabling global estimates by blending high quality observational data sets with established climate models. We are continuing development of an algorithm called Estimating Model Parameters for Ionospheric Reverse Engineering (EMPIRE) to enable assimilation of global datasets for storm time estimates of IT drivers. EMPIRE is a data assimilation algorithm that uses a Kalman filtering routine to ingest model and observational data. The EMPIRE algorithm is based on spherical harmonics which provide a spherically symmetric, smooth, continuous, and orthonormal set of basis functions suitable for a spherical domain such as Earth's IT region (200-600 km altitude). Once the basis function coefficients are determined, the newly fitted function represents the disagreement between observational measurements and models. We apply spherical harmonics to study the March 17, 2015 storm. Data sources include Fabry-Perot interferometer neutral wind measurements and global Ionospheric Data Assimilation 4 Dimensional (IDA4D) assimilated total electron content (TEC). Models include Weimer 2000 electric potential, International Geomagnetic Reference Field (IGRF) magnetic field, and Horizontal Wind Model 2014 (HWM14) neutral winds. We present the EMPIRE assimilation results of Earth's electric potential and thermospheric winds. We also compare EMPIRE storm time E cross B ion drift estimates to measured drifts produced from the Super Dual Auroral Radar Network (SuperDARN) and Active Magnetosphere and Planetary Electrodynamics Response Experiment (AMPERE) measurement datasets. The analysis from these results will enable the generation of globally assimilated storm time IT state estimates for future studies. In particular, the ability to provide data assimilated estimation of the drivers of the IT system from high to low latitudes is a critical step toward forecasting the influence of geomagnetic storms on the near Earth space environment.
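
    A minimal sketch of the basis-function step described above: real spherical harmonics are fitted to the misfit between observations and a background model over the sphere. The maximum degree, the real-basis construction, and the use of plain least squares (rather than the EMPIRE Kalman filter) are assumptions for illustration only.

```python
# Sketch: least-squares fit of real spherical-harmonic coefficients to the
# observation-minus-model residuals at scattered points on a sphere.
import numpy as np
from scipy.special import sph_harm

def real_sh_basis(lon_deg, colat_deg, lmax=4):
    """Design matrix of real spherical harmonics at the given points."""
    theta = np.radians(lon_deg)          # azimuth
    phi = np.radians(colat_deg)          # colatitude
    cols = []
    for n in range(lmax + 1):
        for m in range(0, n + 1):
            Y = sph_harm(m, n, theta, phi)
            if m == 0:
                cols.append(Y.real)
            else:
                cols.append(np.sqrt(2) * (-1) ** m * Y.real)
                cols.append(np.sqrt(2) * (-1) ** m * Y.imag)
    return np.column_stack(cols)

def fit_residual_field(lon_deg, colat_deg, residuals, lmax=4):
    G = real_sh_basis(lon_deg, colat_deg, lmax)
    coeffs, *_ = np.linalg.lstsq(G, residuals, rcond=None)
    return coeffs, G @ coeffs            # coefficients and fitted misfit
```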

  4. Fast directional changes in the geomagnetic field recovered from archaeomagnetism of ancient Israel

    NASA Astrophysics Data System (ADS)

    Shaar, R.; Hassul, E.; Raphael, K.; Ebert, Y.; Marco, S.; Nowaczyk, N. R.; Ben-Yosef, E.; Agnon, A.

    2017-12-01

    Recent archaeomagnetic intensity data from the Levant revealed short-term sub-centennial changes in the geomagnetic field such as `archaeomagnetic jerks' and `geomagnetic spikes'. To fully understand the nature of these fast variations a complementary high-precision time-series of geomagnetic field direction is required. To this end we investigated 35 heat impacted archaeological objects from Israel, including cooking ovens, furnaces, and burnt walls. We combine the new dataset with previously unpublished data and construct the first archaeomagnetic compilation of Israel which, at the moment, consists of a total of 57 directions. Screening out poor quality data leaves 30 acceptable archaeomagnetic directions, 25 of which span the period from 1700 BCE to 400 BCE. The most striking result of this dataset is a large directional anomaly with a deviation of 20°-25° from the geocentric axial dipole direction during the 9th century BCE. This anomaly in field direction is contemporaneous with the Levantine Iron Age Anomaly (LIAA) - a local geomagnetic anomaly over the Levant that was characterized by a high average geomagnetic field (nearly twice today's field) and short decadal-scale geomagnetic spikes.
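
    The deviation quoted above can be computed from a site's expected geocentric axial dipole (GAD) direction (declination 0, tan I = 2 tan latitude) and the observed direction, as in the short worked sketch below. The example values are purely illustrative, not data from the study.

```python
# Worked sketch: expected GAD direction at a site and the angular deviation of
# an observed archaeomagnetic direction from it. Example values are illustrative.
import numpy as np

def dir_to_unit(dec_deg, inc_deg):
    d, i = np.radians(dec_deg), np.radians(inc_deg)
    return np.array([np.cos(i) * np.cos(d), np.cos(i) * np.sin(d), np.sin(i)])

def gad_deviation(dec_deg, inc_deg, lat_deg):
    gad_inc = np.degrees(np.arctan(2.0 * np.tan(np.radians(lat_deg))))  # tan I = 2 tan(lat)
    u_obs = dir_to_unit(dec_deg, inc_deg)
    u_gad = dir_to_unit(0.0, gad_inc)
    return np.degrees(np.arccos(np.clip(np.dot(u_obs, u_gad), -1.0, 1.0)))

# e.g. a hypothetical direction of D = 20 deg, I = 70 deg at 32 deg N latitude
print(round(gad_deviation(20.0, 70.0, 32.0), 1), "degrees from GAD")
```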

  5. Collection, processing, and quality assurance of time-series electromagnetic-induction log datasets, 1995–2016, south Florida

    USGS Publications Warehouse

    Prinos, Scott T.; Valderrama, Robert

    2016-12-13

    Time-series electromagnetic-induction log (TSEMIL) datasets are collected from polyvinyl-chloride cased or uncased monitoring wells to evaluate changes in water conductivity over time. TSEMIL datasets consist of a series of individual electromagnetic-induction logs, generally collected at a frequency of once per month or once per year that have been compiled into a dataset by eliminating small uniform offsets in bulk conductivity between logs probably caused by minor variations in calibration. These offsets are removed by selecting a depth at which no changes are apparent from year to year, and by adjusting individual logs to the median of all logs at the selected depth. Generally, the selected depths are within the freshwater saturated part of the aquifer, well below the water table. TSEMIL datasets can be used to monitor changes in water conductivity throughout the full thickness of an aquifer, without the need for long open-interval wells which have, in some instances, allowed vertical water flow within the well bore that has biased water conductivity profiles. The TSEMIL dataset compilation process enhances the ability to identify small differences between logs that were otherwise obscured by the offsets. As a result of TSEMIL dataset compilation, the root mean squared error of the linear regression between bulk conductivity of the electromagnetic-induction log measurements and the chloride concentration of water samples decreased from 17.4 to 1.7 millisiemens per meter in well G–3611 and from 3.7 to 2.2 millisiemens per meter in well G–3609. The primary use of the TSEMIL datasets in south Florida is to detect temporal changes in bulk conductivity associated with saltwater intrusion in the aquifer; however, other commonly observed changes include (1) variations in bulk conductivity near the water table where water saturation of pore spaces might vary and water temperature might be more variable, and (2) dissipation of conductive water in high-porosity rock layers, which might have entered these layers during drilling. Although TSEMIL dataset processing of even a few logs improves evaluations of the differences between the logs that are related to changes in the salinity, about 16 logs are needed to estimate the bulk conductivity within ±2 millisiemens per meter. Unlike many other types of data published by the U.S. Geological Survey, the median of TSEMIL datasets should not be considered final until 16 logs are collected and the median of the dataset is stable.
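
    The offset-removal step described above amounts to a simple adjustment: pick a reference depth where conductivity is stable from year to year and shift each log so that its value there matches the median of all logs at that depth. The sketch below assumes a logs-by-depths array layout for illustration.

```python
# Minimal sketch of TSEMIL-style offset removal: adjust each log to the median
# of all logs at a chosen reference depth. Array layout is an assumption.
import numpy as np

def compile_tsemil(logs, depths, ref_depth):
    """logs: (n_logs, n_depths) bulk conductivity; depths: (n_depths,) in meters."""
    k = int(np.argmin(np.abs(depths - ref_depth)))   # index of the reference depth
    ref_values = logs[:, k]
    target = np.median(ref_values)                   # median of all logs at that depth
    offsets = ref_values - target
    return logs - offsets[:, None], offsets          # adjusted logs, per-log offsets
```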

  6. INTEGRATION OF SATELLITE, MODELED, AND GROUND BASED AEROSOL DATA FOR USE IN AIR QUALITY AND PUBLIC HEALTH APPLICATIONS

    EPA Science Inventory

    Case studies of severe pollution events due to forest fires/dust storms/industrial haze, from the integrated 2001 aerosol dataset, will be presented within the context of air quality and human health.

  7. Estimates of inorganic nitrogen wet deposition from precipitation for the conterminous United States, 1955-84

    USGS Publications Warehouse

    Gronberg, Jo Ann M.; Ludtke, Amy S.; Knifong, Donna L.

    2014-01-01

    The U.S. Geological Survey’s National Water-Quality Assessment program requires nutrient input information for analysis of national and regional assessment of water quality. Historical data are needed to lengthen the data record for assessment of trends in water quality. This report provides estimates of inorganic nitrogen deposition from precipitation for the conterminous United States for 1955–56, 1961–65, and 1981–84. The estimates were derived from ammonium, nitrate, and inorganic nitrogen concentrations in atmospheric wet deposition and precipitation-depth data. This report documents the sources of these data and the methods that were used to estimate the inorganic nitrogen deposition. Tabular datasets, including the analytical results, precipitation depth, and calculated site-specific precipitation-weighted concentrations, and raster datasets of nitrogen from wet deposition are provided as appendixes in this report.
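
    Two quantities underlie such estimates: a site's precipitation-weighted mean concentration and the wet-deposition flux obtained by multiplying concentration by precipitation depth. The sketch below assumes concentrations in mg/L and depths in mm, which give deposition in mg per square meter; the unit choices are illustrative.

```python
# Sketch of precipitation-weighted concentration and wet-deposition flux.
# Assumed units: concentration in mg/L, precipitation depth in mm.
import numpy as np

def precip_weighted_concentration(conc_mg_per_l, precip_mm):
    """Weight each sampled concentration by its precipitation depth."""
    conc = np.asarray(conc_mg_per_l, dtype=float)
    p = np.asarray(precip_mm, dtype=float)
    return np.sum(conc * p) / np.sum(p)

def wet_deposition(conc_mg_per_l, precip_mm):
    # 1 mm of precipitation over 1 m^2 is 1 L, so mg/L * mm -> mg/m^2
    return precip_weighted_concentration(conc_mg_per_l, precip_mm) * np.sum(precip_mm)
```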

  8. Importance of A Priori Vertical Ozone Profiles for TEMPO Air Quality Retrievals

    NASA Astrophysics Data System (ADS)

    Johnson, M. S.; Sullivan, J. T.; Liu, X.; Zoogman, P.; Newchurch, M.; Kuang, S.; McGee, T. J.; Leblanc, T.

    2017-12-01

    Ozone (O3) is a toxic pollutant which plays a major role in air quality. Typically, monitoring of surface air quality and O3 mixing ratios is conducted using in situ measurement networks. This is partially due to high-quality information related to air quality being limited from space-borne platforms due to coarse spatial resolution, limited temporal frequency, and minimal sensitivity to lower tropospheric and surface-level O3. The Tropospheric Emissions: Monitoring of Pollution (TEMPO) satellite is designed to address the limitations of current space-based platforms and to improve our ability to monitor North American air quality. TEMPO will provide hourly data of total column and vertical profiles of O3 with high spatial resolution to be used as a near-real-time air quality product. TEMPO O3 retrievals will apply the Smithsonian Astrophysical Observatory profile algorithm developed based on work from GOME, GOME-2, and OMI. This algorithm is proposed to use a priori O3 profile information from a climatological database developed from long-term ozonesonde measurements (the tropopause-based (TB-Clim) O3 climatology). This study evaluates the TB-Clim dataset and model simulated O3 profiles, which could potentially serve as a priori O3 profile information in TEMPO retrievals, from near-real-time data assimilation model products (NASA GMAO's operational GEOS-5 FP model and reanalysis data from MERRA2) and a full chemical transport model (CTM), GEOS-Chem. In this study, vertical profile products are evaluated with surface (0-2 km) and tropospheric (0-10 km) TOLNet observations and the theoretical impact of individual a priori profile sources on the accuracy of TEMPO O3 retrievals in the troposphere and at the surface is presented. Results indicate that while the TB-Clim climatological dataset can replicate seasonally-averaged tropospheric O3 profiles, model-simulated profiles from a full CTM resulted in more accurate tropospheric and surface-level O3 retrievals from TEMPO when compared to hourly and daily-averaged TOLNet observations. Furthermore, it is shown that when large surface O3 mixing ratios are observed, TEMPO retrieval values at the surface are most accurate when applying CTM a priori profile information compared to all other data products.

  9. Effect of xylitol versus sorbitol: a quantitative systematic review of clinical trials.

    PubMed

    Mickenautsch, Steffen; Yengopal, Veerasamy

    2012-08-01

    This study aimed to appraise, within the context of tooth caries, the current clinical evidence and its risk for bias regarding the effects of xylitol in comparison with sorbitol. Databases were searched for clinical trials to 19 March 2011. Inclusion criteria required studies to: test a caries-related primary outcome; compare the effects of xylitol with those of sorbitol; describe a clinical trial with two or more arms, and utilise a prospective study design. Articles were excluded if they did not report computable data or did not follow up test and control groups in the same way. Individual dichotomous and continuous datasets were extracted from accepted articles. Selection and performance/detection bias were assessed. Sensitivity analysis was used to investigate attrition bias. Egger's regression and funnel plotting were used to investigate risk for publication bias. Nine articles were identified. Of these, eight were accepted and one was excluded. Ten continuous and eight dichotomous datasets were extracted. Because of high clinical heterogeneity, no meta-analysis was performed. Most of the datasets favoured xylitol, but this was not consistent. The accepted trials may be limited by selection bias. Results of the sensitivity analysis indicate a high risk for attrition bias. The funnel plot and Egger's regression results suggest a low publication bias risk. External fluoride exposure and stimulated saliva flow may have confounded the measured anticariogenic effect of xylitol. The evidence identified in support of xylitol over sorbitol is contradictory, is at high risk for selection and attrition bias and may be limited by confounder effects. Future high-quality randomised controlled trials are needed to show whether xylitol has a greater anticariogenic effect than sorbitol. © 2012 FDI World Dental Federation.

  10. Comparison of Shallow Survey 2012 Multibeam Datasets

    NASA Astrophysics Data System (ADS)

    Ramirez, T. M.

    2012-12-01

    The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It provides equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset so that rigorous comparisons can be made. Five companies (Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics) collected data for the Common Dataset in the Wellington Harbor area in New Zealand between May 2010 and May 2011. The Wellington Harbor and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of Tetrapods and Akmons, aquifers, wharves and marinas. The seabed inside the harbor basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbor on the southern coast is an active environment, with moving sand and exposed reefs. A marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. Using Triton's Perspective processing software, multibeam datasets collected for the Shallow Survey were processed for detailed analysis. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and IHO compliance. Results from an initial analysis indicate that more factors need to be considered to properly compare sonar quality from processed results than just utilizing the same vessel with the same vessel configuration. Survey techniques such as focusing the beams over a narrower beam width can greatly increase data quality. While each sonar manufacturer was required to meet Special Order IHO specifications, line spacing was not specified, allowing for greater data density regardless of equipment specification.

  11. Spatial Data Quality Control Procedure applied to the Okavango Basin Information System

    NASA Astrophysics Data System (ADS)

    Butchart-Kuhlmann, Daniel

    2014-05-01

    Spatial data is a powerful form of information, capable of providing insights of great interest and tremendous use to a variety of users. However, much like other data representing the 'real world', precision and accuracy must be high for the results of data analysis to be deemed reliable and thus applicable to real world projects and undertakings. The spatial data quality control (QC) procedure presented here was developed as the topic of a Master's thesis within the sphere of, and using data from, the Okavango Basin Information System (OBIS), itself a part of The Future Okavango (TFO) project. The aim of the QC procedure was to form the basis of a method through which to determine the quality of spatial data relevant for application to hydrological, solute, and erosion transport modelling using the Jena Adaptable Modelling System (JAMS). As such, the quality of all data present in OBIS classified under the topics of elevation, geoscientific information, or inland waters was evaluated. Now that the initial data quality has been evaluated, efforts are underway to correct the errors found, thus improving the quality of the dataset.

  12. Communication and effectiveness in a US nursing home quality-improvement collaborative.

    PubMed

    Arling, Priscilla A; Abrahamson, Kathleen; Miech, Edward J; Inui, Thomas S; Arling, Greg

    2014-09-01

    In this study, we explored the relationship between changes in resident health outcomes, practitioner communication patterns, and practitioner perceptions of group effectiveness within a quality-improvement collaborative of nursing home clinicians. Survey and interview data were collected from nursing home clinicians participating in a quality-improvement collaborative. Quality-improvement outcomes were evaluated using US Federal and State minimum dataset measures. Models were specified evaluating the relationships between resident outcomes, staff perceptions of communication patterns, and staff perceptions of collaborative effectiveness. Interview data provided deeper understanding of the quantitative findings. Reductions in fall rates were highest in facilities where respondents experienced the highest levels of communication with collaborative members outside of scheduled meetings, and where respondents perceived that the collaborative kept them informed and provided new ideas. Clinicians observed that participation in a quality-improvement collaborative positively influenced the ability to share innovative ideas and expand the quality-improvement program within their nursing home. For practitioners, a high level of communication, both inside and outside of meetings, was key to making measurable gains in resident health outcomes. © 2013 Wiley Publishing Asia Pty Ltd.

  13. Integrating High-Resolution Datasets to Target Mitigation Efforts for Improving Air Quality and Public Health in Urban Neighborhoods

    PubMed Central

    Shandas, Vivek; Voelkel, Jackson; Rao, Meenakshi; George, Linda

    2016-01-01

    Reducing exposure to degraded air quality is essential for building healthy cities. Although air quality and population vary at fine spatial scales, current regulatory and public health frameworks assess human exposures using county- or city-scales. We build on a spatial analysis technique, dasymetric mapping, for allocating urban populations that, together with emerging fine-scale measurements of air pollution, addresses three objectives: (1) evaluate the role of spatial scale in estimating exposure; (2) identify urban communities that are disproportionately burdened by poor air quality; and (3) estimate reduction in mobile sources of pollutants due to local tree-planting efforts using nitrogen dioxide. Our results show a maximum value of 197% difference between cadastrally-informed dasymetric system (CIDS) and standard estimations of population exposure to degraded air quality for small spatial extent analyses, and a lack of substantial difference for large spatial extent analyses. These results provide the foundation for improving policies for managing air quality, and targeting mitigation efforts to address challenges of environmental justice. PMID:27527205
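
    The core of a dasymetric exposure estimate can be sketched in a few lines: census-unit population is redistributed to parcels in proportion to an ancillary variable, then combined with a pollutant value per parcel. The column names and the use of residential floor area as the weighting variable are assumptions for illustration, not the study's exact CIDS implementation.

```python
# Minimal sketch of dasymetric population allocation plus a population-weighted
# exposure estimate. Column names and weighting variable are assumptions.
import pandas as pd

def dasymetric_exposure(parcels: pd.DataFrame) -> float:
    """parcels columns: 'unit_id', 'unit_pop', 'res_floor_area', 'no2'."""
    unit_total = parcels.groupby("unit_id")["res_floor_area"].transform("sum")
    parcels = parcels.assign(
        pop=parcels["unit_pop"] * parcels["res_floor_area"] / unit_total
    )
    # population-weighted mean NO2 across parcels
    return (parcels["pop"] * parcels["no2"]).sum() / parcels["pop"].sum()
```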

  14. TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958-2015.

    PubMed

    Abatzoglou, John T; Dobrowski, Solomon Z; Parks, Sean A; Hegewisch, Katherine C

    2018-01-09

    We present TerraClimate, a dataset of high-spatial resolution (1/24°, ~4-km) monthly climate and climatic water balance for global terrestrial surfaces from 1958-2015. TerraClimate uses climatically aided interpolation, combining high-spatial resolution climatological normals from the WorldClim dataset, with coarser resolution time varying (i.e., monthly) data from other sources to produce a monthly dataset of precipitation, maximum and minimum temperature, wind speed, vapor pressure, and solar radiation. TerraClimate additionally produces monthly surface water balance datasets using a water balance model that incorporates reference evapotranspiration, precipitation, temperature, and interpolated plant extractable soil water capacity. These data provide important inputs for ecological and hydrological studies at global scales that require high spatial resolution and time varying climate and climatic water balance data. We validated spatiotemporal aspects of TerraClimate using annual temperature, precipitation, and calculated reference evapotranspiration from station data, as well as annual runoff from streamflow gauges. TerraClimate datasets showed noted improvement in overall mean absolute error and increased spatial realism relative to coarser resolution gridded datasets.
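
    The climatically aided interpolation described here can be pictured as adding an interpolated coarse-resolution anomaly to a fine-resolution climatological normal. The sketch below uses bilinear resampling and additive anomalies (appropriate for temperature); both choices are assumptions for illustration rather than the dataset's exact procedure.

```python
# Sketch of climatically aided interpolation: fine normal + interpolated coarse
# anomaly. Bilinear zoom and additive anomalies are illustrative assumptions.
import numpy as np
from scipy.ndimage import zoom

def downscale_month(coarse_month, coarse_normal, fine_normal):
    """coarse_month, coarse_normal: (ny, nx); fine_normal: (NY, NX)."""
    anomaly = coarse_month - coarse_normal
    factors = (fine_normal.shape[0] / anomaly.shape[0],
               fine_normal.shape[1] / anomaly.shape[1])
    fine_anomaly = zoom(anomaly, factors, order=1)   # bilinear interpolation
    return fine_normal + fine_anomaly
```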

  15. TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958-2015

    NASA Astrophysics Data System (ADS)

    Abatzoglou, John T.; Dobrowski, Solomon Z.; Parks, Sean A.; Hegewisch, Katherine C.

    2018-01-01

    We present TerraClimate, a dataset of high-spatial resolution (1/24°, ~4-km) monthly climate and climatic water balance for global terrestrial surfaces from 1958-2015. TerraClimate uses climatically aided interpolation, combining high-spatial resolution climatological normals from the WorldClim dataset, with coarser resolution time varying (i.e., monthly) data from other sources to produce a monthly dataset of precipitation, maximum and minimum temperature, wind speed, vapor pressure, and solar radiation. TerraClimate additionally produces monthly surface water balance datasets using a water balance model that incorporates reference evapotranspiration, precipitation, temperature, and interpolated plant extractable soil water capacity. These data provide important inputs for ecological and hydrological studies at global scales that require high spatial resolution and time varying climate and climatic water balance data. We validated spatiotemporal aspects of TerraClimate using annual temperature, precipitation, and calculated reference evapotranspiration from station data, as well as annual runoff from streamflow gauges. TerraClimate datasets showed noted improvement in overall mean absolute error and increased spatial realism relative to coarser resolution gridded datasets.

  16. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences.

    PubMed

    Gao, Song; Sung, Wing-Kin; Nagarajan, Niranjan

    2011-11-01

    Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work, we explored the feasibility of an exact solution for scaffolding and present a first tractable solution for this problem (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes (Availability: http://sourceforge.net/projects/operasf/ ).

  17. Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences

    PubMed Central

    Gao, Song; Sung, Wing-Kin

    2011-01-01

    Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work, we explored the feasibility of an exact solution for scaffolding and present a first tractable solution for this problem (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes (Availability: http://sourceforge.net/projects/operasf/). PMID:21929371

  18. Optimized multiple linear mappings for single image super-resolution

    NASA Astrophysics Data System (ADS)

    Zhang, Kaibing; Li, Jie; Xiong, Zenggang; Liu, Xiuping; Gao, Xinbo

    2017-12-01

    Learning piecewise linear regression has been recognized as an effective approach to example learning-based single image super-resolution (SR) in the literature. In this paper, we employ an expectation-maximization (EM) algorithm to further improve the SR performance of our previous multiple linear mappings (MLM) based SR method. In the training stage, the proposed method starts with a set of linear regressors obtained by the MLM-based method, and then jointly optimizes the clustering results and the low- and high-resolution subdictionary pairs for regression functions by using the metric of the reconstruction errors. In the test stage, we select the optimal regressor for SR reconstruction by accumulating the reconstruction errors of m-nearest neighbors in the training set. Thorough experimental results carried out on six publicly available datasets demonstrate that the proposed SR method can yield high-quality images with finer details and sharper edges in terms of both quantitative and perceptual image quality assessments.

  19. Medicine and democracy: The importance of institutional quality in the relationship between health expenditure and health outcomes in the MENA region.

    PubMed

    Bousmah, Marwân-Al-Qays; Ventelou, Bruno; Abu-Zaineh, Mohammad

    2016-08-01

    Evidence suggests that the effect of health expenditure on health outcomes is highly context-specific and may be driven by other factors. We construct a panel dataset of 18 countries from the Middle East and North Africa region for the period 1995-2012. Panel data models are used to estimate the macro-level determinants of health outcomes. The core finding of the paper is that increasing health expenditure leads to health outcomes improvements only to the extent that the quality of institutions within a country is sufficiently high. The sensitivity of the results is assessed using various measures of health outcomes as well as institutional variables. Overall, it appears that increasing health care expenditure in the MENA region is a necessary but not sufficient condition for health outcomes improvements. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
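
    The conditional effect described above is typically captured by interacting health expenditure with institutional quality in a fixed-effects panel specification. The sketch below is illustrative only: the variable names, the dummy-variable fixed effects, and clustered standard errors are assumptions, not the paper's exact model.

```python
# Illustrative panel specification: health outcomes on expenditure, institutional
# quality, and their interaction, with country and year fixed effects.
import statsmodels.formula.api as smf

def fit_panel(df):
    """df columns: 'life_exp', 'health_exp', 'inst_quality', 'country', 'year'."""
    model = smf.ols(
        "life_exp ~ health_exp * inst_quality + C(country) + C(year)",
        data=df,
    )
    # a positive health_exp:inst_quality coefficient means spending helps more
    # where institutional quality is higher
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["country"]})
```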

  20. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets.

    PubMed

    Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S

    2014-01-01

    A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA - a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA. Copyright © 2014 Elsevier Inc. All rights reserved.

  1. Improving alignment in Tract-based spatial statistics: evaluation and optimization of image registration.

    PubMed

    de Groot, Marius; Vernooij, Meike W; Klein, Stefan; Ikram, M Arfan; Vos, Frans M; Smith, Stephen M; Niessen, Wiro J; Andersson, Jesper L R

    2013-08-01

    Anatomical alignment in neuroimaging studies is of such importance that considerable effort is put into improving the registration used to establish spatial correspondence. Tract-based spatial statistics (TBSS) is a popular method for comparing diffusion characteristics across subjects. TBSS establishes spatial correspondence using a combination of nonlinear registration and a "skeleton projection" that may break topological consistency of the transformed brain images. We therefore investigated the feasibility of replacing the two-stage registration-projection procedure in TBSS with a single, regularized, high-dimensional registration. To optimize registration parameters and to evaluate registration performance in diffusion MRI, we designed an evaluation framework that uses native space probabilistic tractography for 23 white matter tracts, and quantifies tract similarity across subjects in standard space. We optimized parameters for two registration algorithms on two diffusion datasets of different quality. We investigated the reproducibility of the evaluation framework, and of the optimized registration algorithms. Next, we compared registration performance of the regularized registration methods and TBSS. Finally, the feasibility and effect of incorporating the improved registration in TBSS were evaluated in an example study. The evaluation framework was highly reproducible for both algorithms (R² = 0.993; 0.931). The optimal registration parameters depended on the quality of the dataset in a graded and predictable manner. At optimal parameters, both algorithms outperformed the registration of TBSS, showing the feasibility of adopting such approaches in TBSS. This was further confirmed in the example experiment. Copyright © 2013 Elsevier Inc. All rights reserved.

  2. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  3. Physico-chemical characterisation of material fractions in residual and source-segregated household waste in Denmark.

    PubMed

    Götze, R; Pivnenko, K; Boldrin, A; Scheutz, C; Astrup, T Fruergaard

    2016-08-01

    Physico-chemical waste composition data are paramount for the assessment and planning of waste management systems. However, the applicability of data is limited by the regional, temporal and technical scope of waste characterisation studies. As Danish and European legislation aims for higher recycling rates evaluation of source-segregation and recycling chains gain importance. This paper provides a consistent up-to-date dataset for 74 physico-chemical parameters in 49 material fractions from residual and 24 material fractions from source-segregated Danish household waste. Significant differences in the physico-chemical properties of residual and source-segregated waste fractions were found for many parameters related to organic matter, but also for elements of environmental concern. Considerable differences in potentially toxic metal concentrations between the individual recyclable fractions within one material type were observed. This indicates that careful planning and performance evaluation of recycling schemes are important to ensure a high quality of collected recyclables. Rare earth elements (REE) were quantified in all waste fractions analysed, with the highest concentrations of REE found in fractions with high content of mineral raw materials, soil materials and dust. The observed REE concentrations represent the background concentration level in non-hazardous waste materials that may serve as a reference point for future investigations related to hazardous waste management. The detailed dataset provided here can be used for assessments of waste management solutions in Denmark and for the evaluation of the quality of recyclable materials in waste. Copyright © 2016 Elsevier Ltd. All rights reserved.

  4. Comparing historical and modern methods of Sea Surface Temperature measurement - Part 1: Review of methods, field comparisons and dataset adjustments

    NASA Astrophysics Data System (ADS)

    Matthews, J. B. R.

    2012-09-01

    Sea Surface Temperature (SST) measurements have been obtained from a variety of different platforms, instruments and depths over the post-industrial period. Today most measurements come from ships, moored and drifting buoys and satellites. Shipboard methods include temperature measurement of seawater sampled by bucket and in engine cooling water intakes. Engine intake temperatures are generally thought to average a few tenths of a °C warmer than simultaneous bucket temperatures. Here I review SST measurement methods, studies comparing shipboard methods by field experiment and adjustments applied to SST datasets to account for variable methods. In opposition to contemporary thinking, I find average bucket-intake temperature differences reported from field studies inconclusive. Non-zero average differences often have associated standard deviations that are several times larger than the averages themselves. Further, average differences have been found to vary widely between ships and between cruises on the same ship. The cause of non-zero average differences is typically unclear given the general absence of additional temperature observations to those from buckets and engine intakes. Shipboard measurements appear of variable quality, highly dependent upon the accuracy and precision of the thermometer used and the care of the observer where manually read. Methods are generally poorly documented, with written instructions not necessarily reflecting actual practices of merchant mariners. Measurements cannot be expected to be of high quality where obtained by untrained sailors using thermometers of low accuracy and precision.

  5. Onshore and offshore wind resource evaluation in the northeastern area of the Iberian Peninsula: quality assurance of the surface wind observations

    NASA Astrophysics Data System (ADS)

    Hidalgo, A.; González-Rouco, J. F.; Jiménez, P. A.; Navarro, J.; García-Bustamante, E.; Lucio-Eceiza, E. E.; Montávez, J. P.; García, A. Y.; Prieto, L.

    2012-04-01

    Offshore wind energy is becoming increasingly important as a reliable source of electricity generation. The areas located in the vicinity of the Cantabrian and Mediterranean coasts are areas of interest in this regard. This study targets an assessment of the wind resource focused on the two coastal regions and the strip of land between them, thereby including most of the northeastern part of the Iberian Peninsula (IP) and containing the Ebro basin. The analysis of the wind resource in inland areas is crucial as the wind channeling through the existing mountains has a direct impact on the sea circulations near the coast. The thermal circulations generated by the topography near the coast also influence the offshore wind resource. This work summarizes the results of the first steps of a Quality Assurance (QA) procedure applied to the surface wind database available over the area of interest. The dataset consists of 752 stations compiled from different sources: 14 buoys distributed over the IP coast provided by Puertos del Estado (1990-2010); and 738 land sites over the area of interest provided by 8 different Spanish institutions (1933-2010) and the National Center of Atmospheric Research (NCAR; 1978-2010). It is worth noting that the variety of institutional observational protocols leads to different temporal resolutions and peculiarities that somewhat complicate the QA. The QA applied to the dataset is structured in three steps that involve the detection and suppression of: 1) manipulation errors (i.e. repetitions); 2) unrealistic values and ranges in wind module and direction; 3) abnormally low (e.g. long constant periods) and high variations (e.g. extreme values and inhomogeneities) to ensure the temporal consistency of the time series. A quality controlled observational network of wind variables with such spatial density and temporal length is not frequent and specifically for the IP is not documented in the literature. The final observed dataset will allow for a comprehensive understanding of the wind field climatology and variability and its association with the large scale atmospheric circulation as well as their dependence on local/regional features like topography, land-sea contrast, etc. In future steps, a high spatial resolution simulation will be performed with the WRF mesoscale model in order to improve the knowledge of the wind field in the area of interest. Such a simulation will be validated by comparison with the observational dataset. In addition, studies to analyze the sensitivity of the model to different factors such as the parameterizations of the most significant physical processes that the model does not solve explicitly, the boundary conditions that feed the model, etc. will be carried out.
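
    The three-step screening described above lends itself to a compact illustration. The sketch below is a minimal, hypothetical example of such checks on an hourly wind time series using pandas; the column names and thresholds are assumptions for illustration and are not taken from the study.

```python
# Minimal sketch of the three-step wind QA screening described above.
# Thresholds and column names (speed in m/s, direction in degrees) are
# illustrative assumptions, not values taken from the study.
import numpy as np
import pandas as pd

def qa_wind(df, max_speed=75.0, max_constant_hours=24):
    """Flag suspect records in an hourly wind time series."""
    flags = pd.DataFrame(index=df.index)

    # 1) Manipulation errors: exact duplicate timestamps (repetitions).
    flags["duplicate"] = df.index.duplicated(keep="first")

    # 2) Unrealistic values/ranges in wind speed and direction.
    flags["bad_range"] = (
        (df["speed"] < 0) | (df["speed"] > max_speed)
        | (df["direction"] < 0) | (df["direction"] > 360)
    )

    # 3a) Abnormally low variation: long runs of constant non-zero speed.
    run_id = (df["speed"] != df["speed"].shift()).cumsum()
    run_len = df.groupby(run_id)["speed"].transform("size")
    flags["constant_run"] = (run_len > max_constant_hours) & (df["speed"] > 0)

    # 3b) Abnormally high variation: hour-to-hour jumps far outside the
    #     local distribution (simple z-score on first differences).
    diff = df["speed"].diff()
    flags["spike"] = (diff - diff.mean()).abs() > 5 * diff.std()

    return flags

# Example with synthetic hourly data:
idx = pd.date_range("2005-01-01", periods=1000, freq="h")
rng = np.random.default_rng(0)
demo = pd.DataFrame({"speed": rng.gamma(2.0, 2.5, size=1000),
                     "direction": rng.uniform(0, 360, size=1000)}, index=idx)
print(qa_wind(demo).sum())
```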

  6. The medical science DMZ: a network design pattern for data-intensive medical science

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Peisert, Sean; Dart, Eli; Barnett, William

    We describe a detailed solution for maintaining high-capacity, data-intensive network flows (eg, 10, 40, 100 Gbps+) in a scientific, medical context while still adhering to security and privacy laws and regulations. High-end networking, packet-filter firewalls, network intrusion-detection systems. We describe a "Medical Science DMZ" concept as an option for secure, high-volume transport of large, sensitive datasets between research institutions over national research networks, and give 3 detailed descriptions of implemented Medical Science DMZs. The exponentially increasing amounts of "omics" data, high-quality imaging, and other rapidly growing clinical datasets have resulted in the rise of biomedical research "Big Data." The storage, analysis, and network resources required to process these data and integrate them into patient diagnoses and treatments have grown to scales that strain the capabilities of academic health centers. Some data are not generated locally and cannot be sustained locally, and shared data repositories such as those provided by the National Library of Medicine, the National Cancer Institute, and international partners such as the European Bioinformatics Institute are rapidly growing. The ability to store and compute using these data must therefore be addressed by a combination of local, national, and industry resources that exchange large datasets. Maintaining data-intensive flows that comply with the Health Insurance Portability and Accountability Act (HIPAA) and other regulations presents a new challenge for biomedical research. We describe a strategy that marries performance and security by borrowing from and redefining the concept of a Science DMZ, a framework that is used in physical sciences and engineering research to manage high-capacity data flows. By implementing a Medical Science DMZ architecture, biomedical researchers can leverage the scale provided by high-performance computer and cloud storage facilities and national high-speed research networks while preserving privacy and meeting regulatory requirements.

  7. The medical science DMZ: a network design pattern for data-intensive medical science.

    PubMed

    Peisert, Sean; Dart, Eli; Barnett, William; Balas, Edward; Cuff, James; Grossman, Robert L; Berman, Ari; Shankar, Anurag; Tierney, Brian

    2017-10-06

    We describe a detailed solution for maintaining high-capacity, data-intensive network flows (eg, 10, 40, 100 Gbps+) in a scientific, medical context while still adhering to security and privacy laws and regulations. High-end networking, packet-filter firewalls, network intrusion-detection systems. We describe a "Medical Science DMZ" concept as an option for secure, high-volume transport of large, sensitive datasets between research institutions over national research networks, and give 3 detailed descriptions of implemented Medical Science DMZs. The exponentially increasing amounts of "omics" data, high-quality imaging, and other rapidly growing clinical datasets have resulted in the rise of biomedical research "Big Data." The storage, analysis, and network resources required to process these data and integrate them into patient diagnoses and treatments have grown to scales that strain the capabilities of academic health centers. Some data are not generated locally and cannot be sustained locally, and shared data repositories such as those provided by the National Library of Medicine, the National Cancer Institute, and international partners such as the European Bioinformatics Institute are rapidly growing. The ability to store and compute using these data must therefore be addressed by a combination of local, national, and industry resources that exchange large datasets. Maintaining data-intensive flows that comply with the Health Insurance Portability and Accountability Act (HIPAA) and other regulations presents a new challenge for biomedical research. We describe a strategy that marries performance and security by borrowing from and redefining the concept of a Science DMZ, a framework that is used in physical sciences and engineering research to manage high-capacity data flows. By implementing a Medical Science DMZ architecture, biomedical researchers can leverage the scale provided by high-performance computer and cloud storage facilities and national high-speed research networks while preserving privacy and meeting regulatory requirements. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  8. CARDS - comprehensive aerological reference data set. Station history, Version 2.1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    NONE

    1994-03-01

    The possibility of anthropogenic climate change has reached the attention of Government officials and researchers. However, one cannot study climate change without climate data. The CARDS project will produce high-quality upper-air data for the research community and for policy-makers. The authors intend to produce a dataset which is: easy to use, as complete as possible, as free of random errors as possible. They will also attempt to identify biases and remove them whenever possible. In this report, they relate progress toward their goal. They created a robust new format for archiving upper-air data, and designed a relational database structure to hold them. The authors have converted 13 datasets to the new format and have archived over 10,000,000 individual soundings from 10 separate data sources. They produce and archive a metadata summary of each sounding they load. They have researched station histories, and have built a preliminary upper-air station history database. They have converted station-sorted data from their primary database into synoptic-sorted data in a parallel database. They have tested and will soon implement an advanced quality-control procedure, capable of detecting and often repairing errors in geopotential height, temperature, humidity, and wind. This unique quality-control method uses simultaneous vertical, horizontal, and temporal checks of several meteorological variables. It can detect errors other methods cannot. This report contains the station histories for the CARDS data set.

  9. The Most Common Geometric and Semantic Errors in CityGML Datasets

    NASA Astrophysics Data System (ADS)

    Biljecki, F.; Ledoux, H.; Du, X.; Stoter, J.; Soon, K. H.; Khoo, V. H. S.

    2016-10-01

    To be used as input in most simulation and modelling software, 3D city models should be geometrically and topologically valid, and semantically rich. We investigate in this paper the quality of currently available CityGML datasets: we validate the geometry/topology of the 3D primitives (Solid and MultiSurface), and we check whether the semantics of the boundary surfaces of buildings are correct. We have analysed all the CityGML datasets we could find, both from portals of cities and on different websites, plus a few that were made available to us. We have thus validated 40M surfaces in 16M 3D primitives and 3.6M buildings found in 37 CityGML datasets originating from 9 countries, and produced by several companies with diverse software and acquisition techniques. The results indicate that CityGML datasets without errors are rare, and those that are nearly valid are mostly simple LOD1 models. We report on the most common errors we have found, and analyse them. One main observation is that many of these errors could be automatically fixed or prevented with simple modifications to the modelling software. Our principal aim is to highlight the most common errors so that these are not repeated in the future. We hope that our paper and the open-source software we have developed will help raise awareness of data quality among data providers and 3D GIS software producers.

  10. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA

    PubMed Central

    Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.

    2018-01-01

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest. PMID:29461513

  11. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA.

    PubMed

    Jiang, Yueyang; Kim, John B; Still, Christopher J; Kerns, Becky K; Kline, Jeffrey D; Cunningham, Patrick G

    2018-02-20

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest.
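
    As a rough illustration of the kind of pair-wise comparison and watershed aggregation described above, the following sketch computes a mean absolute pair-wise difference across gridded datasets and averages it within watershed labels. The array names, grid and watershed labels are synthetic assumptions, not the study's data.

```python
# Hedged sketch of pair-wise grid comparison and watershed aggregation.
# The datasets are assumed to be on a common grid; everything here is
# synthetic and for illustration only.
import itertools
import numpy as np

def pairwise_disagreement(datasets):
    """Mean absolute pair-wise difference per grid cell across datasets."""
    names = list(datasets)
    diffs = [np.abs(datasets[a] - datasets[b])
             for a, b in itertools.combinations(names, 2)]
    return np.mean(diffs, axis=0)

def aggregate_by_watershed(disagreement, watershed_ids):
    """Average the disagreement metric within each watershed label."""
    return {int(wid): float(disagreement[watershed_ids == wid].mean())
            for wid in np.unique(watershed_ids)}

# Synthetic example: three "downscaled" temperature grids on a 100x100 grid.
rng = np.random.default_rng(1)
base = rng.normal(10.0, 3.0, size=(100, 100))
grids = {name: base + rng.normal(0.0, s, size=base.shape)
         for name, s in [("dsA", 0.5), ("dsB", 1.0), ("dsC", 2.0)]}
sheds = np.arange(100 * 100).reshape(100, 100) // 2500   # 4 fake watersheds

d = pairwise_disagreement(grids)
print(aggregate_by_watershed(d, sheds))
```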

  12. Conducting high-value secondary dataset analysis: an introductory guide and resources.

    PubMed

    Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A

    2011-08-01

    Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium (www.sgim.org/go/datasets). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

  13. Digital data in support of studies and assessments of coal and petroleum resources in the Appalachian basin: Chapter I.1 in Coal and petroleum resources in the Appalachian basin: distribution, geologic framework, and geochemical character

    USGS Publications Warehouse

    Trippi, Michael H.; Kinney, Scott A.; Gunther, Gregory; Ryder, Robert T.; Ruppert, Leslie F.; Ruppert, Leslie F.; Ryder, Robert T.

    2014-01-01

    Metadata for these datasets are available in HTML and XML formats. Metadata files contain information about the sources of data used to create the dataset, the creation process steps, the data quality, the geographic coordinate system and horizontal datum used for the dataset, the values of attributes used in the dataset table, information about the publication and the publishing organization, and other information that may be useful to the reader. All links in the metadata were valid at the time of compilation. Some of these links may no longer be valid. No attempt has been made to determine the new online location (if one exists) for the data.

  14. A Merged Dataset for Solar Probe Plus FIELDS Magnetometers

    NASA Astrophysics Data System (ADS)

    Bowen, T. A.; Dudok de Wit, T.; Bale, S. D.; Revillet, C.; MacDowall, R. J.; Sheppard, D.

    2016-12-01

    The Solar Probe Plus FIELDS experiment will observe turbulent magnetic fluctuations deep in the inner heliosphere. The FIELDS magnetometer suite implements a set of three magnetometers: two vector DC fluxgate magnetometers (MAGs), sensitive from DC to 100 Hz, as well as a vector search coil magnetometer (SCM), sensitive from 10 Hz to 50 kHz. Single axis measurements are additionally made up to 1 MHz. To study the full range of observations, we propose merging data from the individual magnetometers into a single dataset. A merged dataset will improve the quality of observations in the range of frequencies observed by both magnetometers (10-100 Hz). Here we present updates on the individual MAG and SCM calibrations as well as our results on generating a cross-calibrated and merged dataset.
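
    One simple way to picture a merge of this kind is a frequency-domain crossover that takes the fluxgate (MAG) signal at low frequencies and the search coil (SCM) signal at high frequencies, blending them across the shared 10-100 Hz band. The sketch below is only an illustration under that assumption, with both inputs taken as already cross-calibrated and resampled to a common rate; it is not the FIELDS calibration or merging pipeline.

```python
# Illustrative frequency-domain crossover merge of two magnetometer signals.
import numpy as np

def merge_mag_scm(mag, scm, fs, f_lo=10.0, f_hi=100.0):
    """Weighted spectral merge: MAG below f_lo, SCM above f_hi, blend between."""
    n = len(mag)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Linear crossover weight: 0 below f_lo (all MAG), 1 above f_hi (all SCM).
    w = np.clip((freqs - f_lo) / (f_hi - f_lo), 0.0, 1.0)
    merged_spec = (1.0 - w) * np.fft.rfft(mag) + w * np.fft.rfft(scm)
    return np.fft.irfft(merged_spec, n=n)

# Synthetic test: a 2 Hz tone plus a 300 Hz tone, observed by both sensors
# with independent noise.
fs = 2000.0
t = np.arange(0, 2.0, 1.0 / fs)
truth = np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
mag = truth + 0.05 * np.random.default_rng(2).normal(size=t.size)
scm = truth + 0.05 * np.random.default_rng(3).normal(size=t.size)
merged = merge_mag_scm(mag, scm, fs)
print(merged.shape)
```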

  15. Dose reduction in abdominal computed tomography: intraindividual comparison of image quality of full-dose standard and half-dose iterative reconstructions with dual-source computed tomography.

    PubMed

    May, Matthias S; Wüst, Wolfgang; Brand, Michael; Stahl, Christian; Allmendinger, Thomas; Schmidt, Bernhard; Uder, Michael; Lell, Michael M

    2011-07-01

    We sought to evaluate the image quality of iterative reconstruction in image space (IRIS) in half-dose (HD) datasets compared with full-dose (FD) and HD filtered back projection (FBP) reconstruction in abdominal computed tomography (CT). To acquire data with FD and HD simultaneously, contrast-enhanced abdominal CT was performed with a dual-source CT system, both tubes operating at 120 kV, 100 ref.mAs, and pitch 0.8. Three different image datasets were reconstructed from the raw data: standard FD images applying FBP, which served as the reference, HD images applying FBP, and HD images applying IRIS. For the HD datasets, only data from one tube-detector system was used. Quantitative image quality analysis was performed by measuring image noise in tissue and air. Qualitative image quality was evaluated according to the European Guidelines on Quality criteria for CT. Additional assessment of artifacts, lesion conspicuity, and edge sharpness was performed. Image noise in soft tissue was substantially decreased in HD-IRIS (-3.4 HU, -22%) and increased in HD-FBP (+6.2 HU, +39%) images when compared with the reference (mean noise, 15.9 HU). No significant differences between the FD-FBP and HD-IRIS images were found for the visually sharp anatomic reproduction, overall diagnostic acceptability (P = 0.923), lesion conspicuity (P = 0.592), and edge sharpness (P = 0.589), while HD-FBP was rated inferior. Streak artifacts and beam hardening were significantly more prominent in HD-FBP, while HD-IRIS images exhibited a slightly different noise pattern. Direct intrapatient comparison of standard FD body protocols and HD-IRIS reconstruction suggests that the latest iterative reconstruction algorithms allow for approximately 50% dose reduction without deterioration of the high image quality necessary for confident diagnosis.
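
    The quantitative part of the evaluation, measuring image noise, can be illustrated with a small sketch in which noise is taken as the standard deviation of CT numbers (HU) inside a circular region of interest. The image arrays and ROI parameters below are synthetic placeholders loosely mimicking the reported noise levels, not data from the study.

```python
# Minimal sketch: noise as the standard deviation of HU values in a circular ROI.
import numpy as np

def roi_noise(image_hu, center, radius):
    """Standard deviation of HU values inside a circular ROI."""
    yy, xx = np.ogrid[:image_hu.shape[0], :image_hu.shape[1]]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    return float(image_hu[mask].std())

# Fake reconstructions of the same slice with different noise levels,
# roughly mimicking the reported ordering (HD-IRIS < FD-FBP < HD-FBP).
rng = np.random.default_rng(4)
slice_shape = (512, 512)
recons = {
    "FD-FBP":  40 + rng.normal(0, 15.9, slice_shape),
    "HD-FBP":  40 + rng.normal(0, 22.1, slice_shape),
    "HD-IRIS": 40 + rng.normal(0, 12.5, slice_shape),
}
for name, img in recons.items():
    print(name, round(roi_noise(img, center=(256, 256), radius=40), 1), "HU")
```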

  16. Performance comparison of SNP detection tools with illumina exome sequencing data—an assessment using both family pedigree information and sample-matched SNP array data

    PubMed Central

    Yi, Ming; Zhao, Yongmei; Jia, Li; He, Mei; Kebebew, Electron; Stephens, Robert M.

    2014-01-01

    To apply exome-seq-derived variants in the clinical setting, there is an urgent need to identify the best variant caller(s) from a large collection of available options. We have used an Illumina exome-seq dataset as a benchmark, with two validation scenarios—family pedigree information and SNP array data for the same samples, permitting global high-throughput cross-validation, to evaluate the quality of SNP calls derived from several popular variant discovery tools from both the open-source and commercial communities using a set of designated quality metrics. To the best of our knowledge, this is the first large-scale performance comparison of exome-seq variant discovery tools using high-throughput validation with both Mendelian inheritance checking and SNP array data, which allows us to gain insights into the accuracy of SNP calling through such high-throughput validation in an unprecedented way, whereas the previously reported comparison studies have only assessed concordance of these tools without directly assessing the quality of the derived SNPs. More importantly, the main purpose of our study was to establish a reusable procedure that applies high-throughput validation to compare the quality of SNP discovery tools with a focus on exome-seq, which can be used to compare any forthcoming tool(s) of interest. PMID:24831545
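
    The Mendelian-inheritance check used as one of the validation scenarios above can be sketched as follows: a child genotype at a site is consistent if one allele can be inherited from the father and the other from the mother. The genotype encoding and the example trio are illustrative assumptions, not the study's pipeline.

```python
# Hedged sketch of a per-site Mendelian consistency check for a trio.
from itertools import product

def mendelian_consistent(father, mother, child):
    """Genotypes are 2-tuples of alleles, e.g. ('A', 'G'), order-insensitive."""
    return any(sorted((p, m)) == sorted(child)
               for p, m in product(father, mother))

trio_calls = [
    # (father, mother, child) genotype calls at one site
    (("A", "A"), ("A", "G"), ("A", "G")),   # consistent
    (("A", "A"), ("A", "A"), ("A", "G")),   # Mendelian error (or de novo)
]
for f, m, c in trio_calls:
    print(f, m, c, "->", "OK" if mendelian_consistent(f, m, c) else "violation")
```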

  17. Using ZIP Code Business Patterns Data to Measure Alcohol Outlet Density

    PubMed Central

    Matthews, Stephen A.; McCarthy, John D.; Rafail, Patrick S.

    2014-01-01

    Some states maintain high-quality alcohol outlet databases but quality varies by state, making comprehensive comparative analysis across US communities difficult. This study assesses the adequacy of using ZIP Code Business Patterns (ZIP-BP) data on establishments as estimates of the number of alcohol outlets by ZIP code. Specifically, we compare ZIP-BP alcohol outlet counts with high-quality data from state and local records surrounding 44 college campus communities across 10 states plus the District of Columbia. Results show that a composite measure is strongly correlated (R=0.89) with counts of alcohol outlets generated from official state records. Analyses based on Generalized Estimation Equation models show that community and contextual factors have little impact on the concordance between the two data sources. There are also minimal inter-state differences in the level of agreement. To validate the use of a convenient secondary data set (ZIP-BP) it is important to have a high correlation with the more complex, high quality and more costly data product (i.e., datasets based on the acquisition and geocoding of state and local records) and then to clearly demonstrate that the discrepancy between the two is unrelated to relevant explanatory variables. Thus our overall findings support the adequacy of using a conveniently available data set (ZIP-BP data) to estimate alcohol outlet densities in ZIP code areas in future research. PMID:21411233

  18. Using ZIP code business patterns data to measure alcohol outlet density.

    PubMed

    Matthews, Stephen A; McCarthy, John D; Rafail, Patrick S

    2011-07-01

    Some states maintain high-quality alcohol outlet databases but quality varies by state, making comprehensive comparative analysis across US communities difficult. This study assesses the adequacy of using ZIP Code Business Patterns (ZIP-BP) data on establishments as estimates of the number of alcohol outlets by ZIP code. Specifically, we compare ZIP-BP alcohol outlet counts with high-quality data from state and local records surrounding 44 college campus communities across 10 states plus the District of Columbia. Results show that a composite measure is strongly correlated (R=0.89) with counts of alcohol outlets generated from official state records. Analyses based on Generalized Estimation Equation models show that community and contextual factors have little impact on the concordance between the two data sources. There are also minimal inter-state differences in the level of agreement. To validate the use of a convenient secondary data set (ZIP-BP) it is important to have a high correlation with the more complex, high quality and more costly data product (i.e., datasets based on the acquisition and geocoding of state and local records) and then to clearly demonstrate that the discrepancy between the two is unrelated to relevant explanatory variables. Thus our overall findings support the adequacy of using a conveniently available data set (ZIP-BP data) to estimate alcohol outlet densities in ZIP code areas in future research. Copyright © 2011 Elsevier Ltd. All rights reserved.

  19. Validation of a novel technique for creating simulated radiographs using computed tomography datasets.

    PubMed

    Mendoza, Patricia; d'Anjou, Marc-André; Carmel, Eric N; Fournier, Eric; Mai, Wilfried; Alexander, Kate; Winter, Matthew D; Zwingenberger, Allison L; Thrall, Donald E; Theoret, Christine

    2014-01-01

    Understanding radiographic anatomy and the effects of varying patient and radiographic tube positioning on image quality can be a challenge for students. The purposes of this study were to develop and validate a novel technique for creating simulated radiographs using computed tomography (CT) datasets. A DICOM viewer (ORS Visual) plug-in was developed with the ability to move and deform cuboidal volumetric CT datasets, and to produce images simulating the effects of tube-patient-detector distance and angulation. Computed tomographic datasets were acquired from two dogs, one cat, and one horse. Simulated radiographs of different body parts (n = 9) were produced using different angles to mimic conventional projections, before actual digital radiographs were obtained using the same projections. These studies (n = 18) were then submitted to 10 board-certified radiologists who were asked to score visualization of anatomical landmarks, depiction of patient positioning, realism of distortion/magnification, and image quality. No significant differences between simulated and actual radiographs were found for anatomic structure visualization and patient positioning in the majority of body parts. For the assessment of radiographic realism, no significant differences were found between simulated and digital radiographs for canine pelvis, equine tarsus, and feline abdomen body parts. Overall, image quality and contrast resolution of simulated radiographs were considered satisfactory. Findings from the current study indicated that radiographs simulated using this new technique are comparable to actual digital radiographs. Further studies are needed to apply this technique in developing interactive tools for teaching radiographic anatomy and the effects of varying patient and tube positioning. © 2013 American College of Veterinary Radiology.

  20. Operational use of open satellite data for marine water quality monitoring

    NASA Astrophysics Data System (ADS)

    Symeonidis, Panagiotis; Vakkas, Theodoros

    2017-09-01

    The purpose of this study was to develop an operational platform for marine water quality monitoring using near real time satellite data. The developed platform utilizes free and open satellite data available from different data sources like COPERNICUS, the European Earth Observation Initiative, or NASA, from different satellites and instruments. The quality of the marine environment is operationally evaluated using parameters like chlorophyll-a concentration, water color and Sea Surface Temperature (SST). For each parameter, there is more than one dataset available, from different data sources or satellites, to allow users to select the most appropriate dataset for their area or time of interest. The above datasets are automatically downloaded from the data providers' services and ingested to the central, spatial engine. The spatial data platform uses the Postgresql database with the PostGIS extension for spatial data storage and Geoserver for the provision of the spatial data services. The system provides daily, 10-day and monthly maps and time series of the above parameters. The information is provided using a web client which is based on the GET SDI PORTAL, an easy-to-use and feature-rich geospatial visualization and analysis platform. The users can examine the temporal variation of the parameters using a simple time animation tool. In addition, with just one click on the map, the system provides an interactive time series chart for any of the parameters of the available datasets. The platform can be offered as Software as a Service (SaaS) to any area in the Mediterranean region.

  1. Extreme learning machines: a new approach for modeling dissolved oxygen (DO) concentration with and without water quality variables as predictors.

    PubMed

    Heddam, Salim; Kisi, Ozgur

    2017-07-01

    In this paper, several extreme learning machine (ELM) models, including standard extreme learning machine with sigmoid activation function (S-ELM), extreme learning machine with radial basis activation function (R-ELM), online sequential extreme learning machine (OS-ELM), and optimally pruned extreme learning machine (OP-ELM), are newly applied for predicting dissolved oxygen concentration with and without water quality variables as predictors. Firstly, using data from eight United States Geological Survey (USGS) stations located in different river basins in the USA, the S-ELM, R-ELM, OS-ELM, and OP-ELM were compared against the measured dissolved oxygen (DO) using four water quality variables, water temperature, specific conductance, turbidity, and pH, as predictors. For each station, we used data measured at an hourly time step for a period of 4 years. The dataset was divided into a training set (70%) and a validation set (30%). We selected several combinations of the water quality variables as inputs for each ELM model and six different scenarios were compared. Secondly, an attempt was made to predict DO concentration without water quality variables. To achieve this goal, we used the year numbers, 2008, 2009, etc., month numbers from (1) to (12), day numbers from (1) to (31) and hour numbers from (00:00) to (24:00) as predictors. Thirdly, the best ELM models were trained using the validation dataset and tested with the training dataset. The performances of the four ELM models were evaluated using four statistical indices: the coefficient of correlation (R), the Nash-Sutcliffe efficiency (NSE), the root mean squared error (RMSE), and the mean absolute error (MAE). Results obtained from the eight stations indicated that: (i) the best results were obtained by the S-ELM, R-ELM, OS-ELM, and OP-ELM models having four water quality variables as predictors; (ii) out of eight stations, the OP-ELM performed better than the other three ELM models at seven stations while the R-ELM performed the best at one station. The OS-ELM models performed the worst and provided the lowest accuracy; (iii) for predicting DO without water quality variables, the R-ELM performed the best at seven stations followed by the S-ELM in the second place and the OP-ELM performed the worst with low accuracy; (iv) for the final application, where ELM models were trained with the validation dataset and tested with the training dataset, the OP-ELM provided the best accuracy using water quality variables and the R-ELM performed the best at all eight stations without water quality variables. Fourthly, and finally, we compared the results obtained from different ELM models with those obtained using multiple linear regression (MLR) and multilayer perceptron neural network (MLPNN). Results obtained using MLPNN and MLR models reveal that: (i) using water quality variables as predictors, the MLR performed the worst and provided the lowest accuracy in all stations; (ii) MLPNN was ranked in the second place at two stations, in the third place at four stations, and finally, in the fourth place at two stations; (iii) for predicting DO without water quality variables, MLPNN is ranked in the second place at five stations, and ranked in the third, fourth, and fifth places in the remaining three stations, while MLR was ranked in the last place with very low accuracy at all stations. Overall, the results suggest that the ELM is more effective than the MLPNN and MLR for modelling DO concentration in river ecosystems.
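
    The four evaluation indices named above (R, NSE, RMSE and MAE) have standard definitions, sketched below for a pair of observed and predicted dissolved oxygen series. The small example arrays are placeholders, not study data.

```python
# Minimal sketch of the four evaluation indices used above.
import numpy as np

def evaluate(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = pred - obs
    r = np.corrcoef(obs, pred)[0, 1]                                 # correlation
    nse = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)   # Nash-Sutcliffe
    rmse = np.sqrt(np.mean(err ** 2))                                # RMSE
    mae = np.mean(np.abs(err))                                       # MAE
    return {"R": r, "NSE": nse, "RMSE": rmse, "MAE": mae}

obs = [8.1, 7.9, 7.4, 6.8, 6.5, 6.9, 7.6, 8.0]    # DO in mg/L (illustrative)
pred = [8.0, 7.7, 7.5, 7.0, 6.4, 7.1, 7.4, 8.2]
print({k: round(v, 3) for k, v in evaluate(obs, pred).items()})
```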

  2. Finding clean water habitats in urban landscapes: professional researcher vs citizen science approaches.

    PubMed

    McGoff, Elaine; Dunn, Francesca; Cachazo, Luis Moliner; Williams, Penny; Biggs, Jeremy; Nicolet, Pascale; Ewald, Naomi C

    2017-03-01

    This study investigated patterns of nutrient pollution in waterbody types across Greater London. Nitrate and phosphate data were collected by both citizen scientists and professional ecologists and their results were compared. The professional survey comprised 495 randomly selected pond, lake, river, stream and ditch sites. Citizen science survey sites were self-selected and comprised 76 ponds, lakes, rivers and streams. At each site, nutrient concentrations were assessed using field chemistry kits to measure nitrate-N and phosphate-P. The professional and the citizen science datasets both showed that standing waterbodies had significantly lower average nutrient concentrations than running waters. In the professional datasets 46% of ponds and lakes had nutrient levels below the threshold at which biological impairment is likely, whereas only 3% of running waters were unimpaired by nutrients. The citizen science dataset showed the same broad pattern, but there was a trend towards selection of higher quality waterbodies, with 77% of standing waters and 14% of rivers and streams unimpaired. Waterbody nutrient levels in the professional dataset were broadly correlated with land-use intensity. Rivers and streams had a significantly higher proportion of urban and suburban land cover than other waterbody types. Ponds had a higher percentage of semi-natural vegetation within their much smaller catchments. Relationships with land cover and water quality were less apparent in the citizen-collected dataset, probably because the areas visited by citizens were less representative of the landscape as a whole. The results suggest that standing waterbodies, especially ponds, may represent an important clean water resource within urban areas. Small waterbodies, including ponds, small lakes <50 ha and ditches, are rarely part of the statutory water quality monitoring programmes and are frequently overlooked. Citizen scientist data have the potential to partly fill this gap if they are co-ordinated to reduce bias in the type and location of the waterbodies selected. Copyright © 2016 Elsevier B.V. All rights reserved.

  3. Modelling land use change in the Ganga basin

    NASA Astrophysics Data System (ADS)

    Moulds, Simon; Mijic, Ana; Buytaert, Wouter

    2014-05-01

    Over recent decades the green revolution in India has driven substantial environmental change. Modelling experiments have identified northern India as a "hot spot" of land-atmosphere coupling strength during the boreal summer. However, there is a wide range of sensitivity of atmospheric variables to soil moisture between individual climate models. The lack of a comprehensive land use change dataset to force climate models has been identified as a major contributor to model uncertainty. This work aims to construct a monthly time series dataset of land use change for the period 1966 to 2007 for northern India to improve the quantification of regional hydrometeorological feedbacks. The Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on board the Aqua and Terra satellites provides near-continuous remotely sensed datasets from 2000 to the present day. However, the quality and availability of satellite products before 2000 are poor. To complete the dataset, MODIS images are extrapolated back in time using the Conversion of Land Use and its Effects at Small regional extent (CLUE-S) modelling framework, recoded in the R programming language to overcome limitations of the original interface. Non-spatial estimates of land use area published by the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT) for the study period, available on an annual, district-wise basis, are used as a direct model input. Land use change is allocated spatially as a function of biophysical and socioeconomic drivers identified using logistic regression. The dataset will provide an essential input to a high-resolution, physically-based land-surface model to generate the lower boundary condition to assess the impact of land use change on regional climate.

  4. Exploring Genetic Divergence in a Species-Rich Insect Genus Using 2790 DNA Barcodes

    PubMed Central

    Lin, Xiaolong; Stur, Elisabeth; Ekrem, Torbjørn

    2015-01-01

    DNA barcoding using a fragment of the mitochondrial cytochrome c oxidase subunit 1 gene (COI) has proven to be successful for species-level identification in many animal groups. However, most studies have been focused on relatively small datasets or on large datasets of taxonomically high-ranked groups. We explore the quality of DNA barcodes to delimit species in the diverse chironomid genus Tanytarsus (Diptera: Chironomidae) by using different analytical tools. The genus Tanytarsus is the most species-rich taxon of tribe Tanytarsini (Diptera: Chironomidae) with more than 400 species worldwide, some of which can be notoriously difficult to identify to species-level using morphology. Our dataset, based on sequences generated from our own material and publicly available data in BOLD, consists of 2790 DNA barcodes with a fragment length of at least 500 base pairs. A neighbor joining tree of this dataset comprises 131 well-separated clusters representing 121 morphological species of Tanytarsus: 77 named, 16 unnamed and 28 unidentified theoretical species. For our geographically widespread dataset, DNA barcodes unambiguously discriminate 94.6% of the Tanytarsus species recognized through prior morphological study. Deep intraspecific divergences exist in some species complexes, and need further taxonomic studies using appropriate nuclear markers as well as morphological and ecological data to be resolved. The DNA barcodes cluster into 120–242 molecular operational taxonomic units (OTUs) depending on whether Objective Clustering, Automatic Barcode Gap Discovery (ABGD), Generalized Mixed Yule Coalescent model (GMYC), Poisson Tree Process (PTP), subjective evaluation of the neighbor joining tree or Barcode Index Numbers (BINs) are used. We suggest that a 4–5% threshold is appropriate to delineate species of Tanytarsus non-biting midges. PMID:26406595
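
    Threshold-based delimitation of OTUs, in the spirit of the Objective Clustering approach named above, can be sketched as single-linkage clustering of pairwise uncorrected p-distances at a fixed cut-off (here 5%, the upper end of the suggested 4-5% range). The toy sequences below are illustrative and far shorter than real COI barcodes.

```python
# Hedged sketch of threshold clustering of aligned barcode sequences into OTUs.
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_otus(seqs, threshold=0.05):
    """Single-linkage clustering: join sequences closer than the threshold."""
    parent = list(range(len(seqs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(seqs)), 2):
        if p_distance(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

toy_barcodes = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",   # 5% away from the first: same OTU
    "TTTTACGTACGTACGTACGT",   # ~20% away: separate OTU
]
print(cluster_otus(toy_barcodes))
```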

  5. Status update: is smoke on your mind? Using social media to assess smoke exposure

    NASA Astrophysics Data System (ADS)

    Ford, Bonne; Burke, Moira; Lassman, William; Pfister, Gabriele; Pierce, Jeffrey R.

    2017-06-01

    Exposure to wildland fire smoke is associated with negative effects on human health. However, these effects are poorly quantified. Accurately attributing health endpoints to wildland fire smoke requires determining the locations, concentrations, and durations of smoke events. Most current methods for assessing these smoke events (ground-based measurements, satellite observations, and chemical transport modeling) are limited temporally, spatially, and/or by their level of accuracy. In this work, we explore using daily social media posts from Facebook regarding smoke, haze, and air quality to assess population-level exposure for the summer of 2015 in the western US. We compare this de-identified, aggregated Facebook dataset to several other datasets that are commonly used for estimating exposure, such as satellite observations (MODIS aerosol optical depth and Hazard Mapping System smoke plumes), daily (24 h) average surface particulate matter measurements, and model-simulated (WRF-Chem) surface concentrations. After adding population-weighted spatial smoothing to the Facebook data, this dataset is well correlated (R2 generally above 0.5) with the other methods in smoke-impacted regions. The Facebook dataset is better correlated with surface measurements of PM2.5 at a majority of monitoring sites (163 of 293 sites) than the satellite observations and our model simulation. We also present an example case for Washington state in 2015, for which we combine this Facebook dataset with MODIS observations and WRF-Chem-simulated PM2.5 in a regression model. We show that the addition of the Facebook data improves the regression model's ability to predict surface concentrations. This high correlation of the Facebook data with surface monitors and our Washington state example suggests that this social-media-based proxy can be used to estimate smoke exposure in locations without direct ground-based particulate matter measurements.
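
    The regression step described for the Washington state case can be illustrated with a simple ordinary-least-squares fit that predicts surface PM2.5 from a social-media smoke proxy, satellite AOD and model-simulated PM2.5. All predictor and response values below are synthetic placeholders, not the study's data or exact model formulation.

```python
# Illustrative multiple linear regression with synthetic data.
import numpy as np

rng = np.random.default_rng(5)
n = 200
facebook_proxy = rng.uniform(0, 1, n)      # smoothed smoke-post proxy (assumed)
modis_aod = rng.uniform(0, 2, n)           # aerosol optical depth
wrf_chem_pm25 = rng.uniform(0, 80, n)      # modelled surface PM2.5 (ug/m3)
# Synthetic "observed" PM2.5 built from the predictors plus noise.
obs_pm25 = (5 + 40 * facebook_proxy + 10 * modis_aod + 0.5 * wrf_chem_pm25
            + rng.normal(0, 5, n))

# Design matrix with an intercept column; fit by ordinary least squares.
X = np.column_stack([np.ones(n), facebook_proxy, modis_aod, wrf_chem_pm25])
coef, *_ = np.linalg.lstsq(X, obs_pm25, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((obs_pm25 - pred) ** 2) / np.sum((obs_pm25 - obs_pm25.mean()) ** 2)
print("coefficients:", np.round(coef, 2), " R2:", round(r2, 3))
```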

  6. Concept for Future Data Services at the Long-Term Archive of WDCC combining DOIs with common PIDs

    NASA Astrophysics Data System (ADS)

    Stockhause, Martina; Weigel, Tobias; Toussaint, Frank; Höck, Heinke; Thiemann, Hannes; Lautenschlager, Michael

    2013-04-01

    The World Data Center for Climate (WDCC) hosted at the German Climate Computing Center (DKRZ) maintains a long-term archive (LTA) of climate model data as well as observational data. WDCC distinguishes between two types of LTA data. Structured data: data output of an instrument or of a climate model run consists of numerous, highly structured individual datasets in a uniform format. Part of these data is also published on an ESGF (Earth System Grid Federation) data node. Detailed metadata is available, allowing for fine-grained user-defined data access. Unstructured data: LTA data of finished scientific projects are in general unstructured and consist of datasets of different formats, different sizes, and different contents. For these data, compact metadata is available as content information. The structured data is suitable for WDCC's DataCite DOI process; the project data only in exceptional cases. The DOI process includes a thorough quality control process of technical as well as scientific aspects by the publication agent and the data creator. DOIs are assigned to data collections appropriate to be cited in scientific publications, like a simulation run. The data collection is defined in agreement with the data creator. At the moment there is no possibility of identifying and citing individual datasets within this DOI data collection, analogous to the citation of chapters in a book. Also missing is a compact citation regulation for a user-specified collection of data. WDCC therefore complements its existing LTA/DOI concept by Persistent Identifier (PID) assignment to datasets using Handles. In addition to data identification for internal and external use, the concept of PIDs allows relations among PIDs to be defined. Such structural information is stored as key-value pairs directly in the handles. Thus, relations provide basic provenance or lineage information, even if part of the data, such as intermediate results, is lost. WDCC intends to use additional PIDs on metadata entities with a relation to the data PID(s). These add background information on the data creation process (e.g. descriptions of experiment, model, model set-up, and platform for the model run etc.) to the data. These pieces of additional information significantly increase the re-usability of the archived model data. Other valuable additional information for scientific collaboration could be added by the same mechanism, like quality information and annotations. Apart from relations among data and metadata entities, PIDs on collections are advantageous for model data: collections allow for persistent references to single datasets or subsets of data assigned a DOI, and data objects and additional information objects can be consistently connected via relations (provenance, creation, quality information for data).

  7. A Question of Quality: Do Children from Disadvantaged Backgrounds Receive Lower Quality Early Childhood Education and Care?

    ERIC Educational Resources Information Center

    Gambaro, Ludovica; Stewart, Kitty; Waldfogel, Jane

    2015-01-01

    This paper examines how the quality of early childhood education and care accessed by 3- and 4-year-olds in England varies by children's background. Focusing on the free entitlement to early education, the analysis combines information from three administrative datasets for 2010-2011, the Early Years Census, the Schools Census and the Ofsted…

  8. Data preparation techniques for a perinatal psychiatric study based on linked data.

    PubMed

    Xu, Fenglian; Hilder, Lisa; Austin, Marie-Paule; Sullivan, Elizabeth A

    2012-06-08

    In recent years there has been an increase in the use of population-based linked data. However, there is little literature that describes the method of linked data preparation. This paper describes the method for merging data, calculating the statistical variable (SV), recoding psychiatric diagnoses and summarizing hospital admissions for a perinatal psychiatric study. The data preparation techniques described in this paper are based on linked birth data from the New South Wales (NSW) Midwives Data Collection (MDC), the Register of Congenital Conditions (RCC), the Admitted Patient Data Collection (APDC) and the Pharmaceutical Drugs of Addiction System (PHDAS). The master dataset is the meaningfully linked dataset, which includes all or the major study data collections. The master dataset can be used to improve data quality, calculate the SV, and can be tailored for different analyses. To identify hospital admissions in the periods before pregnancy, during pregnancy and after birth, a statistical variable of time interval (SVTI) needs to be calculated. The methods and SPSS syntax for building a master dataset, calculating the SVTI, recoding the principal diagnoses of mental illness and summarizing hospital admissions are described. Linked data preparation, including building the master dataset and calculating the SV, can improve data quality and enhance data function.
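
    A pandas sketch (rather than the SPSS syntax referred to above) of the statistical variable of time interval (SVTI) might look like the following: days between an admission and the birth, followed by a period label. The 280-day gestation assumption and the example records are illustrative only.

```python
# Hedged sketch of an SVTI calculation on linked birth/admission records.
import pandas as pd

linked = pd.DataFrame({
    "mother_id":      [1, 1, 2],
    "birth_date":     pd.to_datetime(["2010-06-01", "2010-06-01", "2010-09-15"]),
    "admission_date": pd.to_datetime(["2009-05-20", "2010-07-10", "2010-03-01"]),
})

# SVTI: admission date minus birth date, in days (negative = before the birth).
linked["svti_days"] = (linked["admission_date"] - linked["birth_date"]).dt.days

def period(days, gestation=280):
    if days < -gestation:
        return "before pregnancy"
    if days < 0:
        return "during pregnancy"
    return "after birth"

linked["period"] = linked["svti_days"].apply(period)
print(linked[["mother_id", "svti_days", "period"]])
```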

  9. Accuracy assessment of the U.S. Geological Survey National Elevation Dataset, and comparison with other large-area elevation datasets: SRTM and ASTER

    USGS Publications Warehouse

    Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.

    2014-01-01

    The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).

  10. Two ultraviolet radiation datasets that cover China

    NASA Astrophysics Data System (ADS)

    Liu, Hui; Hu, Bo; Wang, Yuesi; Liu, Guangren; Tang, Liqin; Ji, Dongsheng; Bai, Yongfei; Bao, Weikai; Chen, Xin; Chen, Yunming; Ding, Weixin; Han, Xiaozeng; He, Fei; Huang, Hui; Huang, Zhenying; Li, Xinrong; Li, Yan; Liu, Wenzhao; Lin, Luxiang; Ouyang, Zhu; Qin, Boqiang; Shen, Weijun; Shen, Yanjun; Su, Hongxin; Song, Changchun; Sun, Bo; Sun, Song; Wang, Anzhi; Wang, Genxu; Wang, Huimin; Wang, Silong; Wang, Youshao; Wei, Wenxue; Xie, Ping; Xie, Zongqiang; Yan, Xiaoyuan; Zeng, Fanjiang; Zhang, Fawei; Zhang, Yangjian; Zhang, Yiping; Zhao, Chengyi; Zhao, Wenzhi; Zhao, Xueyong; Zhou, Guoyi; Zhu, Bo

    2017-07-01

    Ultraviolet (UV) radiation has significant effects on ecosystems, environments, and human health, as well as atmospheric processes and climate change. Two ultraviolet radiation datasets are described in this paper. One contains hourly observations of UV radiation measured at 40 Chinese Ecosystem Research Network stations from 2005 to 2015. CUV3 broadband radiometers were used to observe the UV radiation, with an accuracy of 5%, which meets the World Meteorological Organization's measurement standards. The extremum method was used to control the quality of the measured datasets. The other dataset contains daily cumulative UV radiation estimates that were calculated using an all-sky estimation model combined with a hybrid model. The reconstructed daily UV radiation data span from 1961 to 2014. The mean absolute bias error and root-mean-square error are smaller than 30% at most stations, and most of the mean bias error values are negative, which indicates underestimation of the UV radiation intensity. These datasets can improve our basic knowledge of the spatial and temporal variations in UV radiation. Additionally, these datasets can be used in studies of potential ozone formation and atmospheric oxidation, as well as simulations of ecological processes.

  11. The Physcomitrella patens gene atlas project: large-scale RNA-seq based expression data.

    PubMed

    Perroud, Pierre-François; Haas, Fabian B; Hiss, Manuel; Ullrich, Kristian K; Alboresi, Alessandro; Amirebrahimi, Mojgan; Barry, Kerrie; Bassi, Roberto; Bonhomme, Sandrine; Chen, Haodong; Coates, Juliet C; Fujita, Tomomichi; Guyon-Debast, Anouchka; Lang, Daniel; Lin, Junyan; Lipzen, Anna; Nogué, Fabien; Oliver, Melvin J; Ponce de León, Inés; Quatrano, Ralph S; Rameau, Catherine; Reiss, Bernd; Reski, Ralf; Ricca, Mariana; Saidi, Younousse; Sun, Ning; Szövényi, Péter; Sreedasyam, Avinash; Grimwood, Jane; Stacey, Gary; Schmutz, Jeremy; Rensing, Stefan A

    2018-07-01

    High-throughput RNA sequencing (RNA-seq) has recently become the method of choice to define and analyze transcriptomes. For the model moss Physcomitrella patens, although this method has been used to help analyze specific perturbations, no overall reference dataset has yet been established. In the framework of the Gene Atlas project, the Joint Genome Institute selected P. patens as a flagship genome, opening the way to generate the first comprehensive transcriptome dataset for this moss. The first round of sequencing described here is composed of 99 independent libraries spanning 34 different developmental stages and conditions. Upon dataset quality control and processing through read mapping, 28 509 of the 34 361 v3.3 gene models (83%) were detected to be expressed across the samples. Differentially expressed genes (DEGs) were calculated across the dataset to permit perturbation comparisons between conditions. The analysis of the three most distinct and abundant P. patens growth stages - protonema, gametophore and sporophyte - allowed us to define both general transcriptional patterns and stage-specific transcripts. As an example of variation of physico-chemical growth conditions, we detail here the impact of ammonium supplementation under standard growth conditions on the protonemal transcriptome. Finally, the cooperative nature of this project allowed us to analyze inter-laboratory variation, as 13 different laboratories around the world provided samples. We compare differences in the replication of experiments in a single laboratory and between different laboratories. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.

  12. Recommended GIS Analysis Methods for Global Gridded Population Data

    NASA Astrophysics Data System (ADS)

    Frye, C. E.; Sorichetta, A.; Rose, A.

    2017-12-01

    When using geographic information systems (GIS) to analyze gridded, i.e., raster, population data, analysts need a detailed understanding of several factors that affect raster data processing, and thus, the accuracy of the results. Global raster data is most often provided in an unprojected state, usually in the WGS 1984 geographic coordinate system. Most GIS functions and tools evaluate data based on overlay relationships (area) or proximity (distance). Area and distance for global raster data can be either calculated directly using the various earth ellipsoids or after transforming the data to equal-area/equidistant projected coordinate systems to analyze all locations equally. However, unlike when projecting vector data, not all projected coordinate systems can support such analyses equally, and the process of transforming raster data from one coordinate space to another often results in unmanaged loss of data through a process called resampling. Resampling determines which values to use in the result dataset given an imperfect locational match in the input dataset(s). Cell size or resolution, registration, resampling method, statistical type, and whether the raster represents continuous or discrete information potentially influence the quality of the result. Gridded population data represent estimates of population in each raster cell, and this presentation will provide guidelines for accurately transforming population rasters for analysis in GIS. Resampling impacts the display of high resolution global gridded population data, and we will discuss how to properly handle pyramid creation using the Aggregate tool with the sum option to create overviews for mosaic datasets.
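
    The sum option mentioned above matters because coarsening a population grid must preserve totals. A minimal numpy sketch of block-sum aggregation, with illustrative array sizes, is given below; it mimics the behaviour described for the Aggregate tool but is not ArcGIS code.

```python
# Sum-based aggregation of a gridded population raster by an integer factor.
import numpy as np

def aggregate_sum(pop, factor):
    """Coarsen a 2-D population grid by summing factor x factor blocks."""
    rows, cols = pop.shape
    assert rows % factor == 0 and cols % factor == 0, "pad the grid first"
    return (pop.reshape(rows // factor, factor, cols // factor, factor)
               .sum(axis=(1, 3)))

rng = np.random.default_rng(6)
fine = rng.poisson(3.0, size=(120, 120)).astype(float)   # fine-resolution counts
coarse = aggregate_sum(fine, factor=4)                   # 4x coarser cells

# The population total is preserved by summing, which is the point of the
# sum option; a mean- or nearest-neighbour overview would not preserve it.
print(np.isclose(fine.sum(), coarse.sum()), coarse.shape)
```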

  13. Estimating Gravity Biases with Wavelets in Support of a 1-cm Accurate Geoid Model

    NASA Astrophysics Data System (ADS)

    Ahlgren, K.; Li, X.

    2017-12-01

    Systematic errors that reside in surface gravity datasets are one of the major hurdles in constructing a high-accuracy geoid model at high resolutions. The National Oceanic and Atmospheric Administration's (NOAA) National Geodetic Survey (NGS) has an extensive historical surface gravity dataset consisting of approximately 10 million gravity points that are known to have systematic biases at the mGal level (Saleh et al. 2013). As most relevant metadata is absent, estimating and removing these errors to be consistent with a global geopotential model and airborne data in the corresponding wavelength is quite a difficult endeavor. However, this is crucial to support a 1-cm accurate geoid model for the United States. With recently available independent gravity information from GRACE/GOCE and airborne gravity from the NGS Gravity for the Redefinition of the American Vertical Datum (GRAV-D) project, several different methods of bias estimation are investigated which utilize radial basis functions and wavelet decomposition. We estimate a surface gravity value by incorporating a satellite gravity model, airborne gravity data, and forward-modeled topography at wavelet levels according to each dataset's spatial wavelength. Considering the estimated gravity values over an entire gravity survey, an estimate of the bias and/or correction for the entire survey can be found and applied. In order to assess the accuracy of each bias estimation method, two techniques are used. First, each bias estimation method is used to predict the bias for two high-quality (unbiased and high accuracy) geoid slope validation surveys (GSVS) (Smith et al. 2013 & Wang et al. 2017). Since these surveys are unbiased, the various bias estimation methods should reflect that and provide an absolute accuracy metric for each of the bias estimation methods. Secondly, the corrected gravity datasets from each of the bias estimation methods are used to build a geoid model. The accuracy of each geoid model provides an additional metric to assess the performance of each bias estimation method. The geoid model accuracies are assessed using the two GSVS lines and GPS-leveling data across the United States.

  14. Aerosol Climate Time Series in ESA Aerosol_cci

    NASA Astrophysics Data System (ADS)

    Popp, Thomas; de Leeuw, Gerrit; Pinnock, Simon

    2016-04-01

    Within the ESA Climate Change Initiative (CCI), Aerosol_cci (2010 - 2017) conducts intensive work to improve algorithms for the retrieval of aerosol information from European sensors. Meanwhile, full mission time series of 2 GCOS-required aerosol parameters are completely validated and released: Aerosol Optical Depth (AOD) from dual view ATSR-2 / AATSR radiometers (3 algorithms, 1995 - 2012), and stratospheric extinction profiles from star occultation GOMOS spectrometer (2002 - 2012). Additionally, a 35-year multi-sensor time series of the qualitative Absorbing Aerosol Index (AAI) together with sensitivity information and an AAI model simulator is available. Complementary aerosol properties requested by GCOS are in a "round robin" phase, where various algorithms are inter-compared: fine mode AOD, mineral dust AOD (from the thermal IASI spectrometer, but also from ATSR instruments and the POLDER sensor), absorption information and aerosol layer height. As a quasi-reference for validation in a few selected regions with sparse ground-based observations, the multi-pixel GRASP algorithm for the POLDER instrument is used. Validation of first dataset versions (vs. AERONET, MAN) and inter-comparison to other satellite datasets (MODIS, MISR, SeaWiFS) proved the high quality of the available datasets, comparable to other satellite retrievals, and revealed needs for algorithm improvement (for example for higher AOD values), which were taken into account for a reprocessing. The datasets contain pixel-level uncertainty estimates, which were also validated and improved in the reprocessing. For the three ATSR algorithms the use of an ensemble method was tested. The paper will summarize and discuss the status of dataset reprocessing and validation. The focus will be on the ATSR, GOMOS and IASI datasets. Validation of pixel-level uncertainties will be summarized and discussed, including unknown components and their potential usefulness and limitations. Opportunities for time series extension with successor instruments of the Sentinel family will be described and the complementarity of the different satellite aerosol products (e.g. dust vs. total AOD, ensembles from different algorithms for the same sensor) will be discussed.

  15. A spatio-temporal analysis for regional enhancements of greenhouse gas concentration with GOSAT and OCO-2

    NASA Astrophysics Data System (ADS)

    Kasai, K.; Shiomi, K.; Konno, A.; Tadono, T.; Hori, M.

    2017-12-01

    Global observation of greenhouse gases such as carbon dioxide (CO2) and methane (CH4) with high spatio-temporal resolution, together with accurate estimation of their sources and sinks, is important for understanding greenhouse gas dynamics. The Greenhouse Gases Observing Satellite (GOSAT) has observed column-averaged dry-air mole fractions of CO2 (XCO2) and CH4 (XCH4) for over 8 years, since January 2009, with a 3-day repeat cycle. The Orbiting Carbon Observatory-2 (OCO-2) has observed XCO2 from orbit since July 2014 with a 16-day repeat cycle. The objective of this study is to investigate regional enhancements of greenhouse gas concentrations using GOSAT and OCO-2 data. We use two retrieved datasets as GOSAT observation data: the ACOS GOSAT/TANSO-FTS Level 2 Standard Product B7.3 by NASA/JPL and the NIES TANSO-FTS SWIR L2 Product V02. As OCO-2 observation data, the OCO-2 Operational L2 Data Version 7 is used. The ODIAC dataset is also used to classify regional enhancements into anthropogenic and biogenic sources. Before analyzing these datasets, outliers are screened using the "quality flag", "outcome flag" and "warn level" over land or water, and the "M-gain" data observed by GOSAT are removed. Then, the monthly mean XCO2 and XCH4 of all greenhouse gas datasets are calculated from the daily mean XCO2 and XCH4, to account for differences in the number of observation points. Biases among datasets are assessed by comparing the monthly mean XCO2 and XCH4. Anomalies of XCO2 and XCH4 are then computed by subtracting the monthly mean from individual observations. Positive and negative anomalies are candidates for regional enhancements and uptake, respectively. To detect regional enhancements from the satellite observation datasets, the results of a spatio-temporal analysis of the anomalies are also reported.
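
    The daily-mean, monthly-mean and anomaly computation described above can be sketched with pandas as below. The column names and example values are hypothetical; the real processing also applies the quality-flag screening and dataset-specific handling described in the abstract.

    ```python
    import pandas as pd

    # Hypothetical sounding-level retrievals (one row per observation).
    df = pd.DataFrame({
        "time": pd.to_datetime(["2015-07-01", "2015-07-01", "2015-07-15", "2015-08-02"]),
        "xco2_ppm": [398.1, 399.4, 397.8, 396.9],
    })

    # Daily means first, so days with many soundings do not dominate the month.
    daily = (df.assign(day=df["time"].dt.floor("D"))
               .groupby("day")["xco2_ppm"].mean())

    # Monthly means computed from the daily means.
    monthly = daily.groupby(daily.index.to_period("M")).mean()

    # Anomaly = observation minus its monthly mean; positive anomalies are
    # candidate regional enhancements, negative anomalies candidate uptake.
    df["anomaly"] = df["xco2_ppm"] - df["time"].dt.to_period("M").map(monthly)
    ```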

  16. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
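
    For readers unfamiliar with the qPMS definition used above, the following minimal sketch checks whether a candidate l-mer is a quorum motif, i.e. occurs with at most d mismatches in at least ceil(q*t) of the t sequences. It is a brute-force illustration of the problem statement only, not the SamSelect selection algorithm or an efficient qPMS solver.

    ```python
    import math

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def occurs_with_mismatches(motif, sequence, d):
        """True if `motif` appears somewhere in `sequence` with at most d mismatches."""
        l = len(motif)
        return any(hamming(motif, sequence[i:i + l]) <= d
                   for i in range(len(sequence) - l + 1))

    def is_quorum_motif(motif, sequences, q, d):
        """qPMS membership test: the motif must occur (with <= d mismatches)
        in at least ceil(q * t) of the t input sequences."""
        t = len(sequences)
        hits = sum(occurs_with_mismatches(motif, s, d) for s in sequences)
        return hits >= math.ceil(q * t)

    # Toy example: l = 4, d = 1, q = 0.75 over t = 4 sequences.
    seqs = ["ACGTACGT", "TTACGAAC", "GGGACGTT", "CCCCTTTT"]
    print(is_quorum_motif("ACGT", seqs, q=0.75, d=1))  # True
    ```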

  17. Single-Image Super Resolution for Multispectral Remote Sensing Data Using Convolutional Neural Networks

    NASA Astrophysics Data System (ADS)

    Liebel, L.; Körner, M.

    2016-06-01

    In optical remote sensing, the spatial resolution of images is crucial for numerous applications. Space-borne systems are most likely to be affected by a lack of spatial resolution, due to their natural disadvantage of a large distance between the sensor and the sensed object. Thus, methods for single-image super resolution are desirable to exceed the limits of the sensor. Apart from assisting visual inspection of datasets, post-processing operations, e.g., segmentation or feature extraction, can benefit from detailed and distinguishable structures. In this paper, we show that recently introduced state-of-the-art approaches for single-image super resolution of conventional photographs, making use of deep learning techniques such as convolutional neural networks (CNN), can successfully be applied to remote sensing data. With a huge amount of training data available, end-to-end learning is reasonably easy to apply and can achieve results unattainable using conventional handcrafted algorithms. We trained our CNN on a specifically designed, domain-specific dataset in order to take into account the special characteristics of multispectral remote sensing data. This dataset consists of publicly available SENTINEL-2 images featuring 13 spectral bands, a ground resolution of up to 10 m, and a high radiometric resolution, thus satisfying our requirements in terms of quality and quantity. In experiments, we obtained results superior to competing approaches trained on generic image sets, which failed to reasonably scale satellite images with a high radiometric resolution, as well as to conventional interpolation methods.
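
    A minimal sketch of an SRCNN-style network adapted to 13-band input is shown below (PyTorch). The layer sizes follow the classic 9-1-5 SRCNN recipe, and the sketch assumes the low-resolution input has been pre-upsampled to the target size; it illustrates the general approach, not the architecture or training setup used by the authors.

    ```python
    import torch
    import torch.nn as nn

    class SRCNNMultispectral(nn.Module):
        """Minimal SRCNN-style network sketch for multispectral patches.

        Assumes the low-resolution input has already been upsampled (e.g. bicubic)
        to the target size; the network then learns to restore detail. The 13
        channels mirror the Sentinel-2 band count; layer sizes are illustrative.
        """
        def __init__(self, bands: int = 13):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(bands, 64, kernel_size=9, padding=4),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, 32, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(32, bands, kernel_size=5, padding=2),
            )

        def forward(self, x):
            return self.body(x)

    # Toy forward pass on a single 13-band 64x64 patch.
    model = SRCNNMultispectral()
    patch = torch.randn(1, 13, 64, 64)
    out = model(patch)                  # same spatial size as the pre-upsampled input
    loss = nn.MSELoss()(out, patch)     # in training, compare against the HR target
    ```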

  18. Onshore industrial wind turbine locations for the United States

    USGS Publications Warehouse

    Diffendorfer, Jay E.; Compton, Roger; Kramer, Louisa; Ancona, Zach; Norton, Donna

    2017-01-01

    This dataset provides industrial-scale onshore wind turbine locations in the United States, corresponding facility information, and turbine technical specifications. The database has wind turbine records that have been collected, digitized, locationally verified, and internally quality controlled. Turbines from the Federal Aviation Administration Digital Obstacles File, through product release date July 22, 2013, were used as the primary source of turbine data points. The dataset was subsequently revised and reposted as described in the revision histories for the report. Verification of the turbine positions was done by visual interpretation using high-resolution aerial imagery in Environmental Systems Research Institute (Esri) ArcGIS Desktop. Turbines without Federal Aviation Administration Obstacles Repository System numbers were visually identified and point locations were added to the collection. We estimated a locational error of plus or minus 10 meters for turbine locations. Wind farm facility names were identified from publicly available facility datasets. Facility names were then used in a Web search of additional industry publications and press releases to attribute additional turbine information (such as manufacturer, model, and technical specifications of wind turbines). Wind farm facility location data from various wind and energy industry sources were used to search for and digitize turbines not in existing databases. Technical specifications for turbines were assigned based on the wind turbine make and model as described in literature, specifications listed in the Federal Aviation Administration Digital Obstacles File, and information on the turbine manufacturer’s Web site. Some facility and turbine information on make and model did not exist or was difficult to obtain. Thus, uncertainty may exist for certain turbine specifications. That uncertainty was rated and a confidence was recorded for both location and attribution data quality.

  19. Clear-Sky Longwave Irradiance at the Earth's Surface--Evaluation of Climate Models.

    NASA Astrophysics Data System (ADS)

    Garratt, J. R.

    2001-04-01

    An evaluation of the clear-sky longwave irradiance at the Earth's surface (LI) simulated in climate models and in satellite-based global datasets is presented. Algorithm-based estimates of LI, derived from global observations of column water vapor and surface (or screen air) temperature, serve as proxy 'observations'. All datasets capture the broad zonal variation and seasonal behavior in LI, mainly because the behavior in column water vapor and temperature is reproduced well. Over oceans, the dependence of annual and monthly mean irradiance upon sea surface temperature (SST) closely resembles the observed behavior of column water with SST. In particular, the observed hemispheric difference in the summer minus winter column water dependence on SST is found in all models, though with varying seasonal amplitudes. The analogous behavior in the summer minus winter LI is seen in all datasets. Over land, all models have a more highly scattered dependence of LI upon surface temperature compared with the situation over the oceans. This is related to a much weaker dependence of model column water on the screen-air temperature at both monthly and annual timescales, as observed. The ability of climate models to simulate realistic LI fields depends as much on the quality of model water vapor and temperature fields as on the quality of the longwave radiation codes. In a comparison of models with observations, root-mean-square gridpoint differences in mean monthly column water and temperature are 4-6 mm (5-8 mm) and 0.5-2 K (3-4 K), respectively, over large regions of ocean (land), consistent with the intermodel differences in LI of 5-13 W m-2 (15-28 W m-2).

  20. Developing Cyberinfrastructure Tools and Services for Metadata Quality Evaluation

    NASA Astrophysics Data System (ADS)

    Mecum, B.; Gordon, S.; Habermann, T.; Jones, M. B.; Leinfelder, B.; Powers, L. A.; Slaughter, P.

    2016-12-01

    Metadata and data quality are at the core of reusable and reproducible science. While great progress has been made over the years, much of the metadata collected only addresses data discovery, covering concepts such as titles and keywords. Improving metadata beyond the discoverability plateau means documenting detailed concepts within the data such as sampling protocols, instrumentation used, and variables measured. Given that metadata commonly do not describe their data at this level, how might we improve the state of things? Giving scientists and data managers easy to use tools to evaluate metadata quality that utilize community-driven recommendations is the key to producing high-quality metadata. To achieve this goal, we created a set of cyberinfrastructure tools and services that integrate with existing metadata and data curation workflows which can be used to improve metadata and data quality across the sciences. These tools work across metadata dialects (e.g., ISO19115, FGDC, EML, etc.) and can be used to assess aspects of quality beyond what is internal to the metadata such as the congruence between the metadata and the data it describes. The system makes use of a user-friendly mechanism for expressing a suite of checks as code in popular data science programming languages such as Python and R. This reduces the burden on scientists and data managers to learn yet another language. We demonstrated these services and tools in three ways. First, we evaluated a large corpus of datasets in the DataONE federation of data repositories against a metadata recommendation modeled after existing recommendations such as the LTER best practices and the Attribute Convention for Dataset Discovery (ACDD). Second, we showed how this service can be used to display metadata and data quality information to data producers during the data submission and metadata creation process, and to data consumers through data catalog search and access tools. Third, we showed how the centrally deployed DataONE quality service can achieve major efficiency gains by allowing member repositories to customize and use recommendations that fit their specific needs without having to create de novo infrastructure at their site.
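
    The abstract describes expressing metadata checks as code in languages such as Python and R. The sketch below shows what such a check might look like in Python; the check names, the dictionary-based metadata representation, and the scoring scheme are hypothetical simplifications and do not reflect the DataONE quality service API.

    ```python
    # A minimal sketch of metadata quality checks expressed as code. The metadata
    # record is modelled as a plain dictionary purely for illustration.

    def check_has_abstract(metadata: dict) -> dict:
        abstract = (metadata.get("abstract") or "").strip()
        ok = len(abstract) >= 100  # arbitrary threshold for a "useful" abstract
        return {"check": "has_abstract", "status": "PASS" if ok else "FAIL",
                "message": f"abstract length = {len(abstract)} characters"}

    def check_variables_documented(metadata: dict) -> dict:
        attrs = metadata.get("attributes", [])
        undocumented = [a["name"] for a in attrs if not a.get("definition")]
        ok = bool(attrs) and not undocumented
        return {"check": "variables_documented", "status": "PASS" if ok else "FAIL",
                "message": f"undocumented attributes: {undocumented or 'none'}"}

    def run_suite(metadata: dict) -> list:
        return [check(metadata) for check in (check_has_abstract,
                                              check_variables_documented)]

    record = {"abstract": "Soil temperature measured hourly at 10 cm depth at three sites during 2015, with sensor calibration notes.",
              "attributes": [{"name": "temp_c", "definition": "soil temperature, deg C"},
                             {"name": "site", "definition": ""}]}
    for result in run_suite(record):
        print(result["check"], result["status"], "-", result["message"])
    ```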

  1. A new normalizing algorithm for BAC CGH arrays with quality control metrics.

    PubMed

    Miecznikowski, Jeffrey C; Gaile, Daniel P; Liu, Song; Shepherd, Lori; Nowak, Norma

    2011-01-01

    The main focus in pin-tip (or print-tip) microarray analysis is determining which probes, genes, or oligonucleotides are differentially expressed. Specifically in array comparative genomic hybridization (aCGH) experiments, researchers search for chromosomal imbalances in the genome. To model this data, scientists apply statistical methods to the structure of the experiment and assume that the data consist of the signal plus random noise. In this paper we propose "SmoothArray", a new method to preprocess comparative genomic hybridization (CGH) bacterial artificial chromosome (BAC) arrays, and we show the effects on a cancer dataset. As part of our R software package "aCGHplus," this freely available algorithm removes the variation due to the intensity effects, pin/print-tip, the spatial location on the microarray chip, and the relative location from the well plate. Removal of this variation improves the downstream analysis and subsequent inferences made on the data. Further, we present measures to evaluate the quality of the dataset according to the arrayer pins, 384-well plates, plate rows, and plate columns. We compare our method against competing methods using several metrics to measure the biological signal. With this novel normalization algorithm and quality control measures, the user can improve their inferences on datasets and pinpoint problems that may arise in their BAC aCGH technology.

  2. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.

    PubMed

    Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio

    2017-10-06

    Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output, affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publicly available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 million paired-end (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.

  3. Dataset of anomalies and malicious acts in a cyber-physical subsystem.

    PubMed

    Laso, Pedro Merino; Brosset, David; Puentes, John

    2017-10-01

    This article presents a dataset produced to investigate how data and information quality estimations enable the detection of anomalies and malicious acts in cyber-physical systems. Data were acquired making use of a cyber-physical subsystem consisting of liquid containers for fuel or water, along with its automated control and data acquisition infrastructure. The described data consist of temporal series representing five operational scenarios - normal operation, anomalies, breakdown, sabotage, and cyber-attacks - corresponding to 15 different real situations. The dataset is publicly available in the .zip file published with the article, to investigate and compare faulty operation detection and characterization methods for cyber-physical systems.

  4. An examination of data quality on QSAR Modeling in regards ...

    EPA Pesticide Factsheets

    The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data, and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy. Presentation at UNC-CH.

  5. Multivariate analysis and visualization of soil quality data for no-till systems.

    PubMed

    Villamil, M B; Miguez, F E; Bollero, G A

    2008-01-01

    To capture the multidimensionality of the soil quality concept, we propose the use of data visualization as a tool for exploratory data analyses, model building, and diagnostics. Our objective was to establish the best edaphic indicators for assessing soil quality in four no-till systems with regard to functioning as a medium for crop production and nutrient cycling across two Illinois locations. The compared situations were no-till corn-soybean rotations including either winter fallowing (C/S) or winter cover crops (WCC) of rye (Secale cereale; C-R/S-R), hairy vetch (Vicia villosa; C-R/S-V), or their mixture (C-R/S-VR). The dataset included the variables bulk density (BD), penetration resistance (PR), water aggregate stability (WAS), soil reaction (pH), and the contents of soil organic matter (SOM), total nitrogen (TN), soil nitrates (NO(3)-N), and available phosphorus (P). Interactive data visualization along with canonical discriminant analysis (CDA) allowed us to show that WAS, BD, and the contents of P, TN, and SOM have the greatest potential as soil quality indicators in no-till systems in Illinois. It was more difficult to discriminate among the WCC rotations than to separate these from C/S, considerably inflating the error rate associated with CDA. We predict that observations of no-till C/S will be classified correctly 51% of the time, while observations of no-till WCC rotations will be classified correctly 74% of the time. High error rates in CDA underscore the complexity of no-till systems and the need in this area for more long-term studies with larger datasets to increase accuracy to acceptable levels.

  6. A Satellite-Derived Climate-Quality Data Record of the Clear-Sky Surface Temperature of the Greenland Ice Sheet

    NASA Technical Reports Server (NTRS)

    Hall, Dorothy K.; Comiso, Josefino C.; DiGirolamo, Nicolo E.; Shuman, Christopher A.; Key, Jeffrey R.; Koenig, Lora S.

    2011-01-01

    We have developed a climate-quality data record of the clear-sky surface temperature of the Greenland Ice Sheet using the Moderate-Resolution Imaging Spectroradiometer (MODIS) Terra ice-surface temperature (IST) algorithm. A climate-data record (CDR) is a time series of measurements of sufficient length, consistency, and continuity to determine climate variability and change. We present daily and monthly Terra MODIS ISTs of the Greenland Ice Sheet beginning on 1 March 2000 and continuing through 31 December 2010 at 6.25-km spatial resolution on a polar stereographic grid within +/-3 hours of 17:00Z or 2:00 PM Local Solar Time. Preliminary validation of the ISTs at Summit Camp, Greenland, during the 2008-09 winter, shows that there is a cold bias in the MODIS IST, which underestimates the measured surface temperature by approximately 3 C when temperatures range from approximately -50 C to approximately -35 C. The ultimate goal is to develop a CDR that starts in 1981 with the Advanced Very High Resolution Radiometer (AVHRR) Polar Pathfinder (APP) dataset and continues with MODIS data from 2000 to the present. Differences in the APP and MODIS cloud masks have so far precluded the current IST records from spanning both the APP and MODIS IST time series in a seamless manner, though this will be revisited when the APP dataset has been reprocessed. The Greenland IST climate-quality data record is suitable for continuation using future Visible Infrared Imaging Radiometer Suite (VIIRS) data and will be elevated in status to a CDR when at least 9 more years of climate-quality data become available either from MODIS Terra or Aqua, or from the VIIRS. The complete MODIS IST data record will be available online in the summer of 2011.

  7. Cadastral Database Positional Accuracy Improvement

    NASA Astrophysics Data System (ADS)

    Hashim, N. M.; Omar, A. H.; Ramli, S. N. M.; Omar, K. M.; Din, N.

    2017-10-01

    Positional Accuracy Improvement (PAI) is the process of refining the geometry of features in a geospatial dataset to better represent their actual positions. This actual position relates both to the absolute position in a specific coordinate system and to the relation to neighbouring features. With the growth of spatially based technologies, especially Geographical Information Systems (GIS) and Global Navigation Satellite Systems (GNSS), PAI campaigns are inevitable, especially for legacy cadastral databases. Integration of a legacy dataset with a higher accuracy dataset, such as GNSS observations, is a potential solution for improving the legacy dataset. However, merely integrating both datasets will distort the relative geometry. The improved dataset should be further treated to minimize inherent errors and to fit it to the new, more accurate dataset. The main focus of this study is to describe an angular-based Least Squares Adjustment (LSA) method for the PAI of legacy datasets. The existing high accuracy dataset, known as the National Digital Cadastral Database (NDCDB), is then used as a benchmark to validate the results. It was found that the proposed technique is well suited to the positional accuracy improvement of legacy spatial datasets.
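
    As a simplified illustration of adjusting a legacy dataset to a higher-accuracy benchmark, the sketch below fits a planar four-parameter (Helmert) similarity transform by ordinary least squares using common points. This is a deliberately reduced stand-in for the angular-based LSA described in the abstract, intended only to show the adjustment idea; the point coordinates are hypothetical.

    ```python
    import numpy as np

    def fit_similarity_transform(legacy_xy, control_xy):
        """Estimate a 4-parameter (Helmert) transform legacy -> control by least squares.

        Model: x' = a*x - b*y + tx ;  y' = b*x + a*y + ty
        Returns (a, b, tx, ty). A planar similarity fit, used here for illustration only.
        """
        legacy_xy = np.asarray(legacy_xy, dtype=float)
        control_xy = np.asarray(control_xy, dtype=float)
        n = len(legacy_xy)
        A = np.zeros((2 * n, 4))
        L = control_xy.reshape(-1)          # interleaved x', y' observations
        A[0::2, 0] = legacy_xy[:, 0]        # a * x
        A[0::2, 1] = -legacy_xy[:, 1]       # -b * y
        A[0::2, 2] = 1.0                    # tx
        A[1::2, 0] = legacy_xy[:, 1]        # a * y
        A[1::2, 1] = legacy_xy[:, 0]        # b * x
        A[1::2, 3] = 1.0                    # ty
        params, *_ = np.linalg.lstsq(A, L, rcond=None)
        return params

    def apply_transform(params, xy):
        a, b, tx, ty = params
        xy = np.asarray(xy, dtype=float)
        return np.column_stack([a * xy[:, 0] - b * xy[:, 1] + tx,
                                b * xy[:, 0] + a * xy[:, 1] + ty])

    # Hypothetical common points: legacy coordinates vs. high-accuracy control.
    legacy = [[100.0, 200.0], [150.0, 210.0], [130.0, 260.0], [180.0, 240.0]]
    control = [[102.1, 198.4], [152.0, 208.6], [132.3, 258.5], [182.2, 238.4]]
    p = fit_similarity_transform(legacy, control)
    adjusted = apply_transform(p, legacy)   # legacy points moved toward the benchmark
    ```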

  8. Improving Vintage Seismic Data Quality through Implementation of Advance Processing Techniques

    NASA Astrophysics Data System (ADS)

    Latiff, A. H. Abdul; Boon Hong, P. G.; Jamaludin, S. N. F.

    2017-10-01

    It is essential in petroleum exploration to have high resolution subsurface images, both vertically and horizontally, in uncovering new geological and geophysical aspects of our subsurface. The lack of success may have been from the poor imaging quality which led to inaccurate analysis and interpretation. In this work, we re-processed the existing seismic dataset with an emphasis on two objectives. Firstly, to produce a better 3D seismic data quality with full retention of relative amplitudes and significantly reduce seismic and structural uncertainty. Secondly, to facilitate further prospect delineation through enhanced data resolution, fault definitions and events continuity, particularly in syn-rift section and basement cover contacts and in turn, better understand the geology of the subsurface especially in regard to the distribution of the fluvial and channel sands. By adding recent, state-of-the-art broadband processing techniques such as source and receiver de-ghosting, high density velocity analysis and shallow water de-multiple, the final results produced a better overall reflection detail and frequency in specific target zones, particularly in the deeper section.

  9. The 3D Elevation Program: summary for California

    USGS Publications Warehouse

    Carswell, William J.

    2013-01-01

    Elevation data are essential to a broad range of applications, including forest resources management, wildlife and habitat management, national security, recreation, and many others. For the State of California, elevation data are critical for infrastructure and construction management; natural resources conservation; flood risk management; wildfire management, planning, and response; agriculture and precision farming; geologic resource assessment and hazard mitigation; and other business uses. Today, high-quality light detection and ranging (lidar) data are the sources for creating elevation models and other elevation datasets. Federal, State, and local agencies work in partnership to (1) replace data, on a national basis, that are (on average) 30 years old and of lower quality and (2) provide coverage where publicly accessible data do not exist. A joint goal of State and Federal partners is to acquire consistent, statewide coverage to support existing and emerging applications enabled by lidar data. The new 3D Elevation Program (3DEP) initiative, managed by the U.S. Geological Survey (USGS), responds to the growing need for high-quality topographic data and a wide range of other three-dimensional representations of the Nation’s natural and constructed features.

  10. The 3D Elevation Program: summary for Virginia

    USGS Publications Warehouse

    Carswell, William J.

    2013-01-01

    Elevation data are essential to a broad range of applications, including forest resources management, wildlife and habitat management, national security, recreation, and many others. For the Commonwealth of Virginia, elevation data are critical for urban and regional planning, natural resources conservation, flood risk management, agriculture and precision farming, resource mining, infrastructure and construction management, and other business uses. Today, high-quality light detection and ranging (lidar) data are the sources for creating elevation models and other elevation datasets. Federal, State, and local agencies work in partnership to (1) replace data, on a national basis, that are (on average) 30 years old and of lower quality and (2) provide coverage where publicly accessible data do not exist. A joint goal of State and Federal partners is to acquire consistent, statewide coverage to support existing and emerging applications enabled by lidar data. The new 3D Elevation Program (3DEP) initiative, managed by the U.S. Geological Survey (USGS), responds to the growing need for high-quality topographic data and a wide range of other three-dimensional representations of the Nation’s natural and constructed features.

  11. The 3D Elevation Program: summary for Rhode Island

    USGS Publications Warehouse

    Carswell, William J.

    2013-01-01

    Elevation data are essential to a broad range of applications, including forest resources management, wildlife and habitat management, national security, recreation, and many others. For the State of Rhode Island, elevation data are critical for flood risk management, natural resources conservation, coastal zone management, sea level rise and subsidence, agriculture and precision farming, and other business uses. Today, high-quality light detection and ranging (lidar) data are the sources for creating elevation models and other elevation datasets. Federal, State, and local agencies work in partnership to (1) replace data, on a national basis, that are (on average) 30 years old and of lower quality and (2) provide coverage where publicly accessible data do not exist. A joint goal of State and Federal partners is to acquire consistent, statewide coverage to support existing and emerging applications enabled by lidar data. The new 3D Elevation Program (3DEP) initiative (Snyder, 2012a,b), managed by the U.S. Geological Survey (USGS), responds to the growing need for high-quality topographic data and a wide range of other three-dimensional representations of the Nation’s natural and constructed features.

  12. Use of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells.

    PubMed

    Xin, Yurong; Kim, Jinrang; Ni, Min; Wei, Yi; Okamoto, Haruka; Lee, Joseph; Adler, Christina; Cavino, Katie; Murphy, Andrew J; Yancopoulos, George D; Lin, Hsin Chieh; Gromada, Jesper

    2016-03-22

    This study provides an assessment of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells. The system combines microfluidic technology and nanoliter-scale reactions. We sequenced 622 cells, allowing identification of 341 islet cells with high-quality gene expression profiles. The cells clustered into populations of α-cells (5%), β-cells (92%), δ-cells (1%), and pancreatic polypeptide cells (2%). We identified cell-type-specific transcription factors and pathways primarily involved in nutrient sensing and oxidation and cell signaling. Unexpectedly, 281 cells had to be removed from the analysis due to low viability, low sequencing quality, or contamination resulting in the detection of more than one islet hormone. Collectively, we provide a resource for identification of high-quality gene expression datasets to help expand insights into genes and pathways characterizing islet cell types. We reveal limitations in the C1 Fluidigm cell capture process resulting in contaminated cells with altered gene expression patterns. This calls for caution when interpreting single-cell transcriptomics data using the C1 Fluidigm system.

  13. Assessment of average of normals (AON) procedure for outlier-free datasets including qualitative values below limit of detection (LoD): an application within tumor markers such as CA 15-3, CA 125, and CA 19-9.

    PubMed

    Usta, Murat; Aral, Hale; Mete Çilingirtürk, Ahmet; Kural, Alev; Topaç, Ibrahim; Semerci, Tuna; Hicri Köseoğlu, Mehmet

    2016-11-01

    Average of normals (AON) is a quality control procedure that is sensitive only to systematic errors that can occur in an analytical process in which patient test results are used. The aim of this study was to develop an alternative model in order to apply the AON quality control procedure to datasets that include qualitative values below limit of detection (LoD). The reported patient test results for tumor markers, such as CA 15-3, CA 125, and CA 19-9, analyzed by two instruments, were retrieved from the information system over a period of 5 months, using the calibrator and control materials with the same lot numbers. The median as a measure of central tendency and the median absolute deviation (MAD) as a measure of dispersion were used for the complementary model of AON quality control procedure. The u bias values, which were determined for the bias component of the measurement uncertainty, were partially linked to the percentages of the daily median values of the test results that fall within the control limits. The results for these tumor markers, in which lower limits of reference intervals are not medically important for clinical diagnosis and management, showed that the AON quality control procedure, using the MAD around the median, can be applied for datasets including qualitative values below LoD.
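
    A minimal sketch of median/MAD-based control limits for an AON-style procedure is given below. The limit construction, the k multiplier, and the example data are illustrative assumptions and do not reproduce the paper's full model (e.g. the u bias component of measurement uncertainty or the specific handling of values below the LoD).

    ```python
    import numpy as np

    def aon_median_mad_limits(baseline_values, k=3.0):
        """Control limits for an AON-style procedure using median and MAD.

        baseline_values -- patient results from a stable baseline period
        k               -- multiplier for the MAD-based dispersion (illustrative)
        Returns (median, lower_limit, upper_limit). The factor 1.4826 makes the
        MAD comparable to a standard deviation for roughly Gaussian data.
        """
        x = np.asarray(baseline_values, dtype=float)
        med = np.median(x)
        mad = 1.4826 * np.median(np.abs(x - med))
        return med, med - k * mad, med + k * mad

    # Monitor the daily median of reported results against the baseline limits.
    baseline = np.random.default_rng(0).lognormal(2.5, 0.6, 500)
    center, lo, hi = aon_median_mad_limits(baseline)
    today_median = np.median([10.2, 14.8, 9.1, 22.5, 11.7])
    in_control = lo <= today_median <= hi
    ```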

  14. TH-EF-BRA-03: Assessment of Data-Driven Respiratory Motion-Compensation Methods for 4D-CBCT Image Registration and Reconstruction Using Clinical Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Riblett, MJ; Weiss, E; Hugo, GD

    Purpose: To evaluate the performance of a 4D-CBCT registration and reconstruction method that corrects for respiratory motion and enhances image quality under clinically relevant conditions. Methods: Building on previous work, which tested feasibility of a motion-compensation workflow using image datasets superior to clinical acquisitions, this study assesses workflow performance under clinical conditions in terms of image quality improvement. Evaluated workflows utilized a combination of groupwise deformable image registration (DIR) and image reconstruction. Four-dimensional cone beam CT (4D-CBCT) FDK reconstructions were registered to either mean or respiratory phase reference frame images to model respiratory motion. The resulting 4D transformation was used to deform projection data during the FDK backprojection operation to create a motion-compensated reconstruction. To simulate clinically realistic conditions, superior quality projection datasets were sampled using a phase-binned striding method. Tissue interface sharpness (TIS) was defined as the slope of a sigmoid curve fit to the lung-diaphragm boundary or to the carina tissue-airway boundary when no diaphragm was discernable. Image quality improvement was assessed in 19 clinical cases by evaluating mitigation of view-aliasing artifacts, tissue interface sharpness recovery, and noise reduction. Results: For clinical datasets, evaluated average TIS recovery relative to base 4D-CBCT reconstructions was observed to be 87% using fixed-frame registration alone; 87% using fixed-frame with motion-compensated reconstruction; 92% using mean-frame registration alone; and 90% using mean-frame with motion-compensated reconstruction. Soft tissue noise was reduced on average by 43% and 44% for the fixed-frame registration and registration with motion-compensation methods, respectively, and by 40% and 42% for the corresponding mean-frame methods. Considerable reductions in view aliasing artifacts were observed for each method. Conclusion: Data-driven groupwise registration and motion-compensated reconstruction have the potential to improve the quality of 4D-CBCT images acquired under clinical conditions. For clinical image datasets, the addition of motion compensation after groupwise registration visibly reduced artifact impact. This work was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA166119. Hugo and Weiss hold a research agreement with Philips Healthcare and license agreement with Varian Medical Systems. Weiss receives royalties from UpToDate. Christensen receives funds from Roger Koch to support research.
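
    The tissue interface sharpness (TIS) metric described in the Methods above is the slope of a sigmoid fitted across a tissue boundary. A minimal sketch of such a fit with scipy is shown below; the intensity profile is synthetic and the parameterisation is an assumption, not the authors' exact implementation.

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    def sigmoid(x, lower, upper, x0, slope):
        """Logistic curve: 'slope' controls how sharp the tissue interface is."""
        return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - x0)))

    # Hypothetical 1-D intensity profile sampled across a lung-diaphragm boundary
    # (positions in mm, values in arbitrary CBCT intensity units).
    positions = np.linspace(-10, 10, 41)
    true_profile = sigmoid(positions, 50, 950, 0.0, 1.2)
    profile = true_profile + np.random.default_rng(1).normal(0, 20, positions.size)

    p0 = [profile.min(), profile.max(), 0.0, 1.0]   # rough initial guesses
    params, _ = curve_fit(sigmoid, positions, profile, p0=p0, maxfev=5000)
    tissue_interface_sharpness = params[3]          # the fitted slope term
    ```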

  15. The Isprs Benchmark on Indoor Modelling

    NASA Astrophysics Data System (ADS)

    Khoshelham, K.; Díaz Vilariño, L.; Peter, M.; Kang, Z.; Acharya, D.

    2017-09-01

    Automated generation of 3D indoor models from point cloud data has been a topic of intensive research in recent years. While results on various datasets have been reported in literature, a comparison of the performance of different methods has not been possible due to the lack of benchmark datasets and a common evaluation framework. The ISPRS benchmark on indoor modelling aims to address this issue by providing a public benchmark dataset and an evaluation framework for performance comparison of indoor modelling methods. In this paper, we present the benchmark dataset comprising several point clouds of indoor environments captured by different sensors. We also discuss the evaluation and comparison of indoor modelling methods based on manually created reference models and appropriate quality evaluation criteria. The benchmark dataset is available for download at: http://www2.isprs.org/commissions/comm4/wg5/benchmark-on-indoor-modelling.html.

  16. Big Data in Organ Transplantation: Registries and Administrative Claims

    PubMed Central

    Massie, Allan B.; Kucirka, Lauren; Segev, Dorry L.

    2015-01-01

    The field of organ transplantation benefits from large, comprehensive, transplant-specific national datasets available to researchers. In addition to the widely-used OPTN-based registries (the UNOS and SRTR datasets) and USRDS datasets, there are other publicly available national datasets, not specific to transplantation, which have historically been underutilized in the field of transplantation. Of particular interest are the Nationwide Inpatient Sample (NIS) and State Inpatient Databases (SID), produced by the Agency for Healthcare Research and Quality (AHRQ). The United States Renal Data System (USRDS) database provides extensive data relevant to studies of kidney transplantation. Linkage of publicly available datasets to external data sources such as private claims or pharmacy data provides further resources for registry-based research. Although these resources can transcend some limitations of OPTN-based registry data, they come with their own limitations, which must be understood to avoid biased inference. This review discusses different registry-based data sources available in the United States, as well as the proper design and conduct of registry-based research. PMID:25040084

  17. IMPACTS OF CLIMATE-INDUCED CHANGES IN EXTREME EVENTS ON OZONE AND PARTICULATE MATTER AIR QUALITY

    EPA Science Inventory

    Historical data records of air pollution meteorology from multiple datasets will be compiled and analyzed to identify possible trends in extreme events. Changes in climate and air quality between 2010 and 2050 will be simulated with a suite of models. The consequential effe...

  18. Interactive Visualization and Analysis of Geospatial Data Sets - TrikeND-iGlobe

    NASA Astrophysics Data System (ADS)

    Rosebrock, Uwe; Hogan, Patrick; Chandola, Varun

    2013-04-01

    The visualization of scientific datasets is becoming an ever-increasing challenge as advances in computing technologies have enabled scientists to build high resolution climate models that have produced petabytes of climate data. To interrogate and analyze these large datasets in real-time is a task that pushes the boundaries of computing hardware and software. But integration of climate datasets with geospatial data requires a considerable amount of effort and close familiarity with various data formats and projection systems, which has prevented widespread utilization outside of the climate community. TrikeND-iGlobe is a sophisticated software tool that bridges this gap, allows easy integration of climate datasets with geospatial datasets and provides sophisticated visualization and analysis capabilities. The objective for TrikeND-iGlobe is the continued building of an open source 4D virtual globe application using NASA World Wind technology that integrates analysis of climate model outputs with remote sensing observations as well as demographic and environmental data sets. This will facilitate a better understanding of global and regional phenomena, and the impact analysis of climate extreme events. The critical aim is real-time interactive interrogation. At the data-centric level the primary aim is to enable the user to interact with the data in real-time for the purpose of analysis - locally or remotely. TrikeND-iGlobe provides the basis for the incorporation of modular tools that provide extended interactions with the data, including sub-setting, aggregation, re-shaping, time series analysis methods and animation to produce publication-quality imagery. TrikeND-iGlobe may be run locally or can be accessed via a web interface supported by high-performance visualization compute nodes placed close to the data. It supports visualizing heterogeneous data formats: traditional geospatial datasets along with scientific datasets with geographic coordinates (NetCDF, HDF, etc.). It also supports multiple data access mechanisms, including HTTP, FTP, WMS, WCS, and the THREDDS Data Server for NetCDF data. For scientific data, TrikeND-iGlobe supports various visualization capabilities, including animations and vector field visualization. TrikeND-iGlobe is a collaborative open-source project; contributors include NASA (ARC-PX), ORNL (Oak Ridge National Laboratory), Unidata, Kansas University, CSIRO CMAR Australia and Geoscience Australia.

  19. URJC GB dataset: Community-based seed bank of Mediterranean high-mountain and semi-arid plant species at Universidad Rey Juan Carlos (Spain).

    PubMed

    Alonso, Patricia; Iriondo, José María

    2014-01-01

    The Germplasm Bank of Universidad Rey Juan Carlos was created in 2008 and currently holds 235 accessions and 96 species. This bank focuses on the conservation of wild-plant communities and aims to conserve ex situ a representative sample of the plant biodiversity present in a habitat, emphasizing priority ecosystems identified by the Habitats Directive. It is also used to store plant material for research and teaching purposes. The collection consists of three subcollections, two representative of typical habitats in the center of the Iberian Peninsula: high-mountain pastures (psychroxerophilous pastures) and semi-arid habitats (gypsophilic steppes), and a third representative of the genus Lupinus. The high-mountain subcollection currently holds 153 accessions (63 species), the semi-arid subcollection has 76 accessions (29 species), and the Lupinus subcollection has 6 accessions (4 species). All accessions are stored in a freezer at -18 °C in Kilner jars with silica gel. The Germplasm Bank of Universidad Rey Juan Carlos follows a quality control protocol which describes the workflow performed with seeds from seed collection to storage. All collectors are members of research groups with great experience in species identification. Herbarium specimens associated with seed accessions are preserved, and 63% of the records have been georeferenced with GPS and radio points. The dataset provides unique information concerning the location of populations of plant species that form part of the psychroxerophilous pastures and gypsophilic steppes of Central Spain as well as populations of the genus Lupinus in the Iberian Peninsula. It also provides relevant information concerning mean seed weight and seed germination values under specific incubation conditions. This dataset has already been used by researchers of the Area of Biodiversity and Conservation of URJC as a source of information for the design and implementation of experimental designs in these plant communities. Since they are all active subcollections in continuous growth, data are updated regularly every six months and the latest version can be accessed through the GBIF data portal at http://www.gbif.es:8080/ipt/resource.do?r=germoplasma-urjc. This paper describes the URJC Germplasm Bank and its associated dataset with the aim of disseminating the dataset and explaining how it was derived.

  20. Selecting minimum dataset soil variables using PLSR as a regressive multivariate method

    NASA Astrophysics Data System (ADS)

    Stellacci, Anna Maria; Armenise, Elena; Castellini, Mirko; Rossi, Roberta; Vitti, Carolina; Leogrande, Rita; De Benedetto, Daniela; Ferrara, Rossana M.; Vivaldi, Gaetano A.

    2017-04-01

    Long-term field experiments and science-based tools that characterize soil status (namely the soil quality indices, SQIs) assume a strategic role in assessing the effect of agronomic techniques and thus in improving soil management especially in marginal environments. Selecting key soil variables able to best represent soil status is a critical step for the calculation of SQIs. Current studies show the effectiveness of statistical methods for variable selection to extract relevant information deriving from multivariate datasets. Principal component analysis (PCA) has been mainly used, however supervised multivariate methods and regressive techniques are progressively being evaluated (Armenise et al., 2013; de Paul Obade et al., 2016; Pulido Moncada et al., 2014). The present study explores the effectiveness of partial least square regression (PLSR) in selecting critical soil variables, using a dataset comparing conventional tillage and sod-seeding on durum wheat. The results were compared to those obtained using PCA and stepwise discriminant analysis (SDA). The soil data derived from a long-term field experiment in Southern Italy. On samples collected in April 2015, the following set of variables was quantified: (i) chemical: total organic carbon and nitrogen (TOC and TN), alkali-extractable C (TEC and humic substances - HA-FA), water extractable N and organic C (WEN and WEOC), Olsen extractable P, exchangeable cations, pH and EC; (ii) physical: texture, dry bulk density (BD), macroporosity (Pmac), air capacity (AC), and relative field capacity (RFC); (iii) biological: carbon of the microbial biomass quantified with the fumigation-extraction method. PCA and SDA were previously applied to the multivariate dataset (Stellacci et al., 2016). PLSR was carried out on mean centered and variance scaled data of predictors (soil variables) and response (wheat yield) variables using the PLS procedure of SAS/STAT. In addition, variable importance for projection (VIP) statistics was used to quantitatively assess the predictors most relevant for response variable estimation and then for variable selection (Andersen and Bro, 2010). PCA and SDA returned TOC and RFC as influential variables both on the set of chemical and physical data analyzed separately as well as on the whole dataset (Stellacci et al., 2016). Highly weighted variables in PCA were also TEC, followed by K, and AC, followed by Pmac and BD, in the first PC (41.2% of total variance); Olsen P and HA-FA in the second PC (12.6%), Ca in the third (10.6%) component. Variables enabling maximum discrimination among treatments for SDA were WEOC, on the whole dataset, humic substances, followed by Olsen P, EC and clay, in the separate data analyses. The highest PLS-VIP statistics were recorded for Olsen P and Pmac, followed by TOC, TEC, pH and Mg for chemical variables and clay, RFC and AC for the physical variables. Results show that different methods may provide different ranking of the selected variables and the presence of a response variable, in regressive techniques, may affect variable selection. Further investigation with different response variables and with multi-year datasets would allow to better define advantages and limits of single or combined approaches. 
    Acknowledgment: The work was supported by the projects "BIOTILLAGE, approcci innovative per il miglioramento delle performances ambientali e produttive dei sistemi cerealicoli no-tillage", financed by PSR-Basilicata 2007-2013, and "DESERT, Low-cost water desalination and sensor technology compact module", financed by ERANET-WATERWORKS 2014.
    References:
    Andersen C.M. and Bro R., 2010. Variable selection in regression - a tutorial. Journal of Chemometrics, 24: 728-737.
    Armenise et al., 2013. Developing a soil quality index to compare soil fitness for agricultural use under different managements in the Mediterranean environment. Soil and Tillage Research, 130: 91-98.
    de Paul Obade et al., 2016. A standardized soil quality index for diverse field conditions. Science of the Total Environment, 541: 424-434.
    Pulido Moncada et al., 2014. Data-driven analysis of soil quality indicators using limited data. Geoderma, 235: 271-278.
    Stellacci et al., 2016. Comparison of different multivariate methods to select key soil variables for soil quality indices computation. XLV Congress of the Italian Society of Agronomy (SIA), Sassari, 20-22 September 2016.
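
    The PLS-VIP variable selection step described in this record can be sketched with scikit-learn as below. The soil variables and yield values are random placeholders, and the VIP implementation follows the standard single-response formula (cf. Andersen and Bro, 2010) rather than any code used in the study.

    ```python
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def vip_scores(pls, X):
        """Variable Importance in Projection (VIP) for a fitted PLSRegression.

        Standard single-response VIP: variables with VIP > 1 are usually
        considered influential for predicting the response.
        """
        T = pls.transform(X)     # (n, A) X-scores
        W = pls.x_weights_       # (p, A) X-weights
        Q = pls.y_loadings_      # (1, A) Y-loadings for a single response
        p, A = W.shape
        ss = (Q[0] ** 2) * np.sum(T ** 2, axis=0)          # explained SS per component
        w_norm = W / np.linalg.norm(W, axis=0, keepdims=True)
        return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

    # Hypothetical predictors (soil variables) and response (wheat yield).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(40, 8))                 # stand-ins for TOC, TEC, Olsen P, clay, ...
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=40)

    pls = PLSRegression(n_components=3, scale=True).fit(X, y)
    vip = vip_scores(pls, X)
    selected = np.where(vip > 1.0)[0]            # indices of candidate key variables
    ```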

  1. Quantity and/or Quality? The Importance of Publishing Many Papers

    PubMed Central

    van den Besselaar, Peter

    2016-01-01

    Do highly productive researchers have a significantly higher probability of producing top cited papers? Or do highly productive researchers mainly produce a sea of irrelevant papers; in other words, do we find diminishing marginal returns from productivity? The answer to these questions is important, as it may help to determine whether the increased competition and the increased use of indicators for research evaluation and accountability have perverse effects or not. We use a Swedish author-disambiguated dataset consisting of 48,000 researchers and their WoS publications during the period 2008–2011, with citations until 2014, to investigate the relation between productivity and the production of highly cited papers. As the analysis shows, quantity does make a difference. PMID:27870854

  2. Operational use of spaceborne lidar datasets

    NASA Astrophysics Data System (ADS)

    Marenco, Franco; Halloran, Gemma; Forsythe, Mary

    2018-04-01

    The Met Office plans to use space lidar datasets from CALIPSO, CATS, Aeolus and EarthCARE operationally in near real time (NRT), for the detection of aerosols. The first step is the development of NRT imagery for nowcasting of volcanic events, air quality, and mineral dust episodes. Model verification and possibly assimilation will be explored. Assimilation trials of Aeolus winds are also planned. Here we will present our first in-house imagery and our operational requirements.

  3. Maggie Creek Water Quality Data

    EPA Pesticide Factsheets

    These data are standard water quality parameters collected for surface water condition analysis (for example, pH, conductivity, DO, and TSS). This dataset is associated with the following publication: Kozlowski, D., R. Hall, S. Swanson, and D. Heggem. Linking Management and Riparian Physical Functions to Water Quality and Aquatic Habitat. JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT. American Society of Civil Engineers (ASCE), Reston, VA, USA, 8(8): 797-815, (2016).

  4. Penalized maximum likelihood simultaneous longitudinal PET image reconstruction with difference-image priors.

    PubMed

    Ellis, Sam; Reader, Andrew J

    2018-04-26

    Many clinical contexts require the acquisition of multiple positron emission tomography (PET) scans of a single subject, for example, to observe and quantitate changes in functional behaviour in tumors after treatment in oncology. Typically, the datasets from each of these scans are reconstructed individually, without exploiting the similarities between them. We have recently shown that sharing information between longitudinal PET datasets by penalizing voxel-wise differences during image reconstruction can improve reconstructed images by reducing background noise and increasing the contrast-to-noise ratio of high-activity lesions. Here, we present two additional novel longitudinal difference-image priors and evaluate their performance using two-dimensional (2D) simulation studies and a three-dimensional (3D) real dataset case study. We have previously proposed a simultaneous difference-image-based penalized maximum likelihood (PML) longitudinal image reconstruction method that encourages sparse difference images (DS-PML), and in this work we propose two further novel prior terms. The priors are designed to encourage longitudinal images with corresponding differences which have (a) low entropy (DE-PML), and (b) high sparsity in their spatial gradients (DTV-PML). These two new priors and the originally proposed longitudinal prior were applied to 2D-simulated treatment response [18F]fluorodeoxyglucose (FDG) brain tumor datasets and compared to standard maximum likelihood expectation-maximization (MLEM) reconstructions. These 2D simulation studies explored the effects of penalty strengths, tumor behaviour, and interscan coupling on reconstructed images. Finally, a real two-scan longitudinal data series acquired from a head and neck cancer patient was reconstructed with the proposed methods and the results compared to standard reconstruction methods. Using any of the three priors with an appropriate penalty strength produced images with noise levels equivalent to those seen when using standard reconstructions with increased count levels. In tumor regions, each method produces subtly different results in terms of preservation of tumor quantitation and reconstruction root mean-squared error (RMSE). In particular, in the two-scan simulations, the DE-PML method produced tumor means in close agreement with MLEM reconstructions, while the DTV-PML method produced the lowest errors due to noise reduction within the tumor. Across a range of tumor responses and different numbers of scans, similar results were observed, with DTV-PML producing the lowest errors of the three priors and DE-PML producing the lowest bias. Similar improvements were observed in the reconstructions of the real longitudinal datasets, although imperfect alignment of the two PET images resulted in additional changes in the difference image that affected the performance of the proposed methods. Reconstruction of longitudinal datasets by penalizing difference images between pairs of scans from a data series allows for noise reduction in all reconstructed images. An appropriate choice of penalty term and penalty strength allows for this noise reduction to be achieved while maintaining reconstruction performance in regions of change, either in terms of quantitation of mean intensity via DE-PML, or in terms of tumor RMSE via DTV-PML.
Overall, improving the image quality of longitudinal datasets via simultaneous reconstruction has the potential to improve upon currently used methods, allow dose reduction, or reduce scan time while maintaining image quality at current levels. © 2018 The Authors. Medical Physics published by Wiley Periodicals, Inc. on behalf of American Association of Physicists in Medicine.
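
    As a conceptual illustration of the difference-image penalties discussed above, the sketch below evaluates a total-variation term (sparsity of spatial gradients, in the spirit of DTV-PML) and a histogram-entropy term (in the spirit of DE-PML) on a synthetic longitudinal difference image. It only computes example penalty values; it is not the penalized maximum likelihood reconstruction itself, and the image data are placeholders.

    ```python
    import numpy as np

    def total_variation(diff):
        """Anisotropic TV of a 2-D difference image: the summed magnitude of its
        spatial gradients (the quantity a DTV-style prior would penalize)."""
        dx = np.abs(np.diff(diff, axis=0)).sum()
        dy = np.abs(np.diff(diff, axis=1)).sum()
        return dx + dy

    def histogram_entropy(diff, bins=64):
        """Shannon entropy of the difference-image histogram (a DE-style measure).
        Lower entropy means the voxel-wise differences concentrate on few values."""
        hist, _ = np.histogram(diff, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    # Two hypothetical longitudinal images and their difference.
    rng = np.random.default_rng(0)
    scan1 = rng.poisson(20, size=(64, 64)).astype(float)
    scan2 = scan1.copy()
    scan2[20:30, 20:30] -= 8          # a responding "tumor" region changes
    diff = scan2 - scan1

    penalty_tv = total_variation(diff)
    penalty_entropy = histogram_entropy(diff)
    # In a PML reconstruction, such terms would be added, with a penalty strength,
    # to the negative Poisson log-likelihood of each scan's projection data.
    ```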

  5. Evaluating the effect of a third-party implementation of resolution recovery on the quality of SPECT bone scan imaging using visual grading regression.

    PubMed

    Hay, Peter D; Smith, Julie; O'Connor, Richard A

    2016-02-01

    The aim of this study was to evaluate the benefits to SPECT bone scan image quality when applying resolution recovery (RR) during image reconstruction using software provided by a third-party supplier. Bone SPECT data from 90 clinical studies were reconstructed retrospectively using software supplied independent of the gamma camera manufacturer. The current clinical datasets contain 120×10 s projections and are reconstructed using an iterative method with a Butterworth postfilter. Five further reconstructions were created with the following characteristics: 10 s projections with a Butterworth postfilter (to assess intraobserver variation); 10 s projections with a Gaussian postfilter with and without RR; and 5 s projections with a Gaussian postfilter with and without RR. Two expert observers were asked to rate image quality on a five-point scale relative to our current clinical reconstruction. Datasets were anonymized and presented in random order. The benefits of RR on image scores were evaluated using ordinal logistic regression (visual grading regression). The application of RR during reconstruction increased the probability of both observers of scoring image quality as better than the current clinical reconstruction even where the dataset contained half the normal counts. Type of reconstruction and observer were both statistically significant variables in the ordinal logistic regression model. Visual grading regression was found to be a useful method for validating the local introduction of technological developments in nuclear medicine imaging. RR, as implemented by the independent software supplier, improved bone SPECT image quality when applied during image reconstruction. In the majority of clinical cases, acquisition times for bone SPECT intended for the purposes of localization can safely be halved (from 10 s projections to 5 s) when RR is applied.

  6. A machine learning approach to multi-level ECG signal quality classification.

    PubMed

    Li, Qiao; Rajagopalan, Cadathur; Clifford, Gari D

    2014-12-01

    Current electrocardiogram (ECG) signal quality assessment studies have aimed to provide a two-level classification: clean or noisy. However, clinical usage demands more specific noise level classification for varying applications. This work outlines a five-level ECG signal quality classification algorithm. A total of 13 signal quality metrics were derived from segments of ECG waveforms, which were labeled by experts. A support vector machine (SVM) was trained to perform the classification and tested on a simulated dataset and was validated using data from the MIT-BIH arrhythmia database (MITDB). The simulated training and test datasets were created by selecting clean segments of the ECG in the 2011 PhysioNet/Computing in Cardiology Challenge database, and adding three types of real ECG noise at different signal-to-noise ratio (SNR) levels from the MIT-BIH Noise Stress Test Database (NSTDB). The MITDB was re-annotated for five levels of signal quality. Different combinations of the 13 metrics were trained and tested on the simulated datasets and the best combination that produced the highest classification accuracy was selected and validated on the MITDB. Performance was assessed using classification accuracy (Ac), and a single class overlap accuracy (OAc), which assumes that an individual type classified into an adjacent class is acceptable. An Ac of 80.26% and an OAc of 98.60% on the test set were obtained by selecting 10 metrics while 57.26% (Ac) and 94.23% (OAc) were the numbers for the unseen MITDB validation data without retraining. By performing the fivefold cross validation, an Ac of 88.07±0.32% and OAc of 99.34±0.07% were gained on the validation fold of MITDB. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
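
    A minimal sketch of the classification setup described above (signal quality metrics fed to an SVM, assessed by cross-validation) is given below. The 13-column feature matrix and five-level labels are random placeholders rather than the derived ECG metrics or the expert annotations used in the study.

    ```python
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Hypothetical feature matrix: one row per ECG segment, one column per
    # signal quality metric (the paper derives 13 such metrics); labels are the
    # five quality levels. Values here are random placeholders.
    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 13))
    y = rng.integers(1, 6, size=500)          # quality levels 1 (best) .. 5 (worst)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    acc = cross_val_score(clf, X, y, cv=5).mean()   # five-fold cross-validated accuracy
    ```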

  7. Creating Situational Awareness in Spacecraft Operations with the Machine Learning Approach

    NASA Astrophysics Data System (ADS)

    Li, Z.

    2016-09-01

    This paper presents a machine learning approach to situational awareness in spacecraft operations. There are two types of time-dependent data patterns in spacecraft datasets: the absolute time pattern (ATP) and the relative time pattern (RTP). The machine learning algorithm captures the data patterns of the satellite datasets by training on data from normal operations, representing each pattern by its time-dependent trend. Data monitoring then compares the values of the incoming data with the algorithm's predictions, which can detect any meaningful change in a dataset above the noise level. If the difference between the incoming telemetry value and the machine learning prediction is larger than a threshold defined by the standard deviation of the dataset, it could indicate a potential anomaly that may need special attention. The application of the machine-learning approach to the Advanced Himawari Imager (AHI) on the Japanese Himawari spacecraft series is presented; the AHI has the same configuration as the Advanced Baseline Imager (ABI) on the Geostationary Operational Environmental Satellite (GOES)-R series. The time-dependent trends generated by the data-training algorithm are in excellent agreement with the datasets. The standard deviation about the time-dependent trend provides a metric for measuring data quality, which is particularly useful in evaluating detector quality for both AHI and ABI, each of which has multiple detectors per channel. The machine-learning approach creates a situational awareness capability, enables engineers to handle data volumes that would have been impossible with the existing approach, and leads to more dynamic, proactive, and autonomous spacecraft operations.
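
    The detection rule described here, flagging a sample when it departs from the trained trend by more than a threshold tied to the dataset's standard deviation, can be sketched in a few lines. The polynomial trend model, the threshold multiplier k, and the variable names below are illustrative assumptions, not the paper's implementation:

    ```python
    # Sketch of trend-based telemetry monitoring: flag samples that deviate from a
    # trained time-dependent trend by more than k standard deviations (names illustrative).
    import numpy as np

    def train_trend(t, values, degree=3):
        """Fit a simple polynomial trend to telemetry collected during normal operations."""
        coeffs = np.polyfit(t, values, degree)
        residual_std = np.std(values - np.polyval(coeffs, t))
        return coeffs, residual_std

    def monitor(t_new, values_new, coeffs, residual_std, k=3.0):
        """Return a boolean mask of samples whose deviation from the trend exceeds k*sigma."""
        predicted = np.polyval(coeffs, t_new)
        return np.abs(values_new - predicted) > k * residual_std

    # Example with synthetic data standing in for one telemetry channel.
    t = np.linspace(0, 10, 500)
    normal = 20 + 0.5 * t + np.random.default_rng(1).normal(0, 0.2, t.size)
    coeffs, sigma = train_trend(t, normal)

    incoming = normal.copy()
    incoming[250] += 3.0  # injected anomaly
    flags = monitor(t, incoming, coeffs, sigma)
    print("flagged indices:", np.where(flags)[0])
    ```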

  8. A critical enquiry into the psychometric properties of the professional quality of life scale (ProQol-5) instrument.

    PubMed

    Hemsworth, David; Baregheh, Anahita; Aoun, Samar; Kazanjian, Arminee

    2018-02-01

    This study conducted a comprehensive analysis of the psychometric properties of the ProQol-5 professional quality of life instrument among nurses and palliative care workers, on the basis of three independent datasets. The goal was to assess the general applicability of the instrument across multiple populations. Although the ProQol scale has been widely adopted, few attempts have been made to thoroughly analyze the instrument across multiple datasets drawn from multiple populations. A questionnaire was developed and distributed to palliative care workers in Canada and to nurses at two hospitals in Australia and Canada; this yielded 273 responses from the Australian nurses, 303 from the Canadian nurses, and 503 from the Canadian palliative care workers. A comprehensive psychometric property analysis was conducted, including inter-item correlations, tests of reliability, and convergent, discriminant, and construct validity analyses. In addition, exploratory factor analysis was used to test for reverse-coding artifacts in the burnout (BO) scale. The psychometric properties of the ProQol-5 were satisfactory for the compassion satisfaction construct. However, there are concerns with respect to the burnout and secondary traumatic stress scales, and recommendations are made regarding the coding and specific items which should improve the reliability and validity of these scales. This research establishes the strengths and weaknesses of the ProQol instrument and demonstrates how it can be improved. Through specific recommendations, the academic community is invited to revise the burnout and secondary traumatic stress scales in an effort to improve the ProQol-5 measures. Copyright © 2017. Published by Elsevier Inc.
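
    For readers who want to reproduce the basic reliability checks mentioned above, inter-item correlations and Cronbach's alpha for a subscale can be computed directly from item responses. The sketch below is a generic illustration with simulated data and hypothetical item names, not the authors' analysis code:

    ```python
    # Generic sketch of two reliability checks: inter-item correlations and
    # Cronbach's alpha for one subscale (simulated items, hypothetical names).
    import numpy as np
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical responses: rows are respondents, columns are the items of one subscale.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=500)
    burnout_items = pd.DataFrame(
        {f"bo_{i}": latent + rng.normal(scale=1.0, size=500) for i in range(1, 11)}
    )

    print("inter-item correlations:\n", burnout_items.corr().round(2))
    print("Cronbach's alpha:", round(cronbach_alpha(burnout_items), 3))
    ```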

  9. Improving the quantity, quality and transparency of data used to derive radionuclide transfer parameters for animal products. 2. Cow milk.

    PubMed

    Howard, B J; Wells, C; Barnett, C L; Howard, D C

    2017-02-01

    Under the International Atomic Energy Agency (IAEA) MODARIA (Modelling and Data for Radiological Impact Assessments) Programme, there has been an initiative to improve the derivation, provenance and transparency of transfer parameter values for radionuclides from feed to animal products intended for human consumption. The revised MODARIA 2016 cow milk dataset is described in this paper. As previously reported for the MODARIA goat milk dataset, quality control has led to the discounting of some references used in IAEA's Technical Report Series (TRS) report 472 (IAEA, 2010). The number of concentration ratio (CR) values has been considerably increased by (i) the inclusion of more literature from agricultural studies, which particularly enhanced the stable isotope data for both CR and Fm, and (ii) the estimation of dry matter intake from assumed liveweight. In TRS 472, the data for cow milk comprised 714 transfer coefficient (Fm) values and 254 CR values, describing 31 and 26 elements respectively. In the MODARIA 2016 cow milk dataset, Fm and CR values are now reported for 43 elements, based upon 825 data values for Fm and 824 for CR. The Fm values in the MODARIA 2016 cow milk dataset are within an order of magnitude of those reported in TRS 472. Slightly bigger changes are seen in the CR values, but the increase in the size of the dataset creates greater confidence in them. Data gaps that still remain are identified for elements with isotopes relevant to radiation protection. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.
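
    The two transfer parameters discussed here are commonly linked through the animal's dry matter intake: Fm (d/L) relates the milk activity concentration to the daily radionuclide intake, while CR relates it to the feed concentration on a dry-matter basis, so that CR = Fm × dry matter intake. This relationship is presumably why estimating intake from liveweight allows additional CR values to be derived. A hedged sketch of that conversion (values and function names are illustrative, not MODARIA data):

    ```python
    # Hedged sketch of the commonly used relationship between the milk transfer
    # coefficient Fm (d/L) and the concentration ratio CR, via dry matter intake (kg DM/d):
    #   Fm = C_milk / (C_feed_dm * DMI)  and  CR = C_milk / C_feed_dm,  hence  CR = Fm * DMI.
    # The values below are illustrative only.

    def cr_from_fm(fm_d_per_litre: float, dry_matter_intake_kg_per_day: float) -> float:
        """Convert a transfer coefficient Fm into a concentration ratio CR."""
        return fm_d_per_litre * dry_matter_intake_kg_per_day

    # Example: a hypothetical Fm of 5e-3 d/L and an intake of 16 kg DM/d.
    print(cr_from_fm(5e-3, 16.0))  # -> 0.08 (dimensionless CR on a dry-matter basis)
    ```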

  10. Long-term coastal measurements for large-scale climate trends characterization

    NASA Astrophysics Data System (ADS)

    Pomaro, Angela; Cavaleri, Luigi; Lionello, Piero

    2017-04-01

    Multi-decadal time series of observational wave data beginning in the late 1970s are relatively rare. The present study analyzes the 37-year directional wave time series recorded between 1979 and 2015 at the CNR-ISMAR (Institute of Marine Sciences of the Italian National Research Council) "Acqua Alta" oceanographic research tower, located in the Northern Adriatic Sea, 15 km offshore of the Venice lagoon in 16 m of water. The length of the time series allows its content to be exploited not only for modelling purposes or short-term statistical analyses, but also at the climatological scale, thanks to the distinctive meteorological and oceanographic character of the coastal area where the infrastructure is installed. We explore the dataset both to characterize the local average climate and its variability and to detect possible long-term trends that might reflect, or emphasize, large-scale circulation patterns and trends. Measured data are essential for the assessment, and often for the calibration, of model data and, if long enough, also serve as the reference for climate studies. By applying this analysis to an area that is well characterized meteorologically, we first assess changes over time based on measured data and then compare them to those derived from the ERA-Interim regional simulation over the same area, showing the substantial improvement still needed to obtain reliable climate model projections for coastal areas and the Mediterranean Region as a whole. Moreover, long-term hindcasts aimed at climatic analysis are well known for (1) underestimating actual wave heights if their resolution is not high enough, and (2) being strongly affected by changing conditions over time that are likely to introduce spurious trends of variable magnitude. In particular, the varying amount of data assimilated by the hindcast models over time directly and indirectly affects the results, making it difficult, if not impossible, to distinguish these imposed effects from the climate signal itself, as demonstrated by Aarnes et al. (2015). From this point of view, the problem is that long-term measured datasets are relatively rare, owing to the cost and technical difficulty of maintaining fixed instrumental equipment over time, as well as of assuring the homogeneity and availability of the entire dataset. For this reason we are also working on the publication of the quality-controlled dataset to make it widely available for open-access research purposes. The analysis and homogenization of the original dataset required a substantial part of the time spent on the study, because of the strong impact that data quality may have on the final result. We consider this particularly relevant for coastal areas, where the lack of reliable satellite data makes it difficult to improve the ability of models to resolve the peculiar local oceanographic processes. We describe in detail every step and procedure used in producing the data, including full descriptions of the experimental design, data acquisition, and any computational processing needed to support the technical quality of the dataset.

  11. FLUXNET2015 Dataset: Batteries included

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Papale, D.; Agarwal, D.; Trotta, C.; Chu, H.; Canfora, E.; Torn, M. S.; Baldocchi, D. D.

    2016-12-01

    The synthesis datasets have become one of the signature products of the FLUXNET global network. They are composed of contributions from individual site teams to regional networks, which are then compiled into uniform data products now used in a wide variety of research efforts: from plant-scale microbiology to global-scale climate change. The FLUXNET Marconi Dataset in 2000 was the first in the series, followed by the FLUXNET LaThuile Dataset in 2007, with significant additions of data products and coverage, solidifying the adoption of the datasets as a research tool. The FLUXNET2015 Dataset adds another round of substantial improvements, including extended quality control processes and checks, use of downscaled reanalysis data for filling long gaps in micrometeorological variables, multiple methods for USTAR threshold estimation and flux partitioning, and uncertainty estimates - all of which are accompanied by auxiliary flags. This "batteries included" approach provides a lot of information for anyone who wants to explore the data (and the processing methods) in detail, which inevitably leads to a large number of data variables. Although dealing with all these variables might seem overwhelming at first, especially to someone looking at eddy covariance data for the first time, there is method to our madness. In this work we describe the data products and variables that are part of the FLUXNET2015 Dataset and the rationale behind the organization of the dataset, covering the simplified version (labeled SUBSET), the complete version (labeled FULLSET), and the auxiliary products in the dataset.

  12. Global Lakes Sentinel Services: Water Quality Parameters Retrieval in Lakes Using the MERIS and S3-OLCI Band Sets

    NASA Astrophysics Data System (ADS)

    Peters, Steef; Alikas, Krista; Hommersom, Annelies; Latt, Silver; Reinart, Anu; Giardino, Claudia; Bresciani, Mariano; Philipson, Petra; Ruescas, Ana; Stelzer, Kerstin; Schenk, Karin; Heege, Thomas; Gege, Peter; Koponen, Sampsa; Kallio, Karri; Zhang, Yunlin

    2015-12-01

    The European collaborative project GLaSS aims to prepare for the use of the data streams from Sentinel 2 and Sentinel 3. Its focus is on inland waters, since these are considered to be sentinels for land-use and climate change and need to be monitored closely. One of the objectives of the project is to compare existing water quality algorithms and parameterizations on the basis of in-situ spectral observations and Hydrolight simulations. A first achievement of the project is the collection of over 400 Rrs spectra with accompanying data on CHL, TSM, aCDOM and Secchi depth. The dataset of lake CDOM measurements in particular represents a rather unique reference dataset.

  13. MSWEP: 3-hourly 0.25° global gridded precipitation (1979-2015) by merging gauge, satellite, and reanalysis data

    NASA Astrophysics Data System (ADS)

    Beck, Hylke E.; van Dijk, Albert I. J. M.; Levizzani, Vincenzo; Schellekens, Jaap; Miralles, Diego G.; Martens, Brecht; de Roo, Ad

    2017-01-01

    Current global precipitation (P) datasets do not take full advantage of the complementary nature of satellite and reanalysis data. Here, we present Multi-Source Weighted-Ensemble Precipitation (MSWEP) version 1.1, a global P dataset for the period 1979-2015 with a 3-hourly temporal and 0.25° spatial resolution, specifically designed for hydrological modeling. The design philosophy of MSWEP was to optimally merge the highest quality P data sources available as a function of timescale and location. The long-term mean of MSWEP was based on the CHPclim dataset but replaced with more accurate regional datasets where available. A correction for gauge under-catch and orographic effects was introduced by inferring catchment-average P from streamflow (Q) observations at 13 762 stations across the globe. The temporal variability of MSWEP was determined by weighted averaging of P anomalies from seven datasets; two based solely on interpolation of gauge observations (CPC Unified and GPCC), three on satellite remote sensing (CMORPH, GSMaP-MVK, and TMPA 3B42RT), and two on atmospheric model reanalysis (ERA-Interim and JRA-55). For each grid cell, the weight assigned to the gauge-based estimates was calculated from the gauge network density, while the weights assigned to the satellite- and reanalysis-based estimates were calculated from their comparative performance at the surrounding gauges. The quality of MSWEP was compared against four state-of-the-art gauge-adjusted P datasets (WFDEI-CRU, GPCP-1DD, TMPA 3B42, and CPC Unified) using independent P data from 125 FLUXNET tower stations around the globe. MSWEP obtained the highest daily correlation coefficient (R) among the five P datasets for 60.0 % of the stations and a median R of 0.67 vs. 0.44-0.59 for the other datasets. We further evaluated the performance of MSWEP using hydrological modeling for 9011 catchments (< 50 000 km2) across the globe. Specifically, we calibrated the simple conceptual hydrological model HBV (Hydrologiska Byråns Vattenbalansavdelning) against daily Q observations with P from each of the different datasets. For the 1058 sparsely gauged catchments, representative of 83.9 % of the global land surface (excluding Antarctica), MSWEP obtained a median calibration NSE of 0.52 vs. 0.29-0.39 for the other P datasets. MSWEP is available via http://www.gloh2o.org.
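
    The merging step described above, a per-grid-cell weighted average of precipitation anomalies from gauge-, satellite-, and reanalysis-based sources, can be sketched as follows; the arrays and weights are placeholders and do not reproduce the actual MSWEP weighting scheme:

    ```python
    # Sketch of per-grid-cell weighted merging of precipitation anomalies from several
    # source datasets. Weights are placeholders, not the actual MSWEP weights.
    import numpy as np

    def merge_anomalies(anomalies: np.ndarray, weights: np.ndarray) -> np.ndarray:
        """Weighted average over the source axis.

        anomalies: (n_sources, n_time, n_lat, n_lon) precipitation anomalies
        weights:   (n_sources, n_lat, n_lon) per-cell weights (need not sum to 1)
        """
        w = weights / weights.sum(axis=0, keepdims=True)  # normalize per grid cell
        return np.einsum("styx,syx->tyx", anomalies, w)

    # Toy example: 7 source datasets, 10 time steps, on a 4x8 grid.
    rng = np.random.default_rng(0)
    anoms = rng.normal(size=(7, 10, 4, 8))
    wts = rng.uniform(0.1, 1.0, size=(7, 4, 8))
    merged = merge_anomalies(anoms, wts)
    print(merged.shape)  # (10, 4, 8)
    ```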

  14. Querying Patterns in High-Dimensional Heterogenous Datasets

    ERIC Educational Resources Information Center

    Singh, Vishwakarma

    2012-01-01

    The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…

  15. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    NASA Astrophysics Data System (ADS)

    Lary, D. J.

    2013-12-01

    A BigData case study is described in which multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster with an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of multiple big datasets, in-situ data and machine learning. To greatly reduce development time and enhance functionality, a high-level language capable of parallel processing (Matlab) was used. Key considerations for the system are high-speed access, given the large data volume, persistence of the large data volumes, and a precise process-time scheduling capability.

  16. An assessment of the cultivated cropland class of NLCD 2006 using a multi-source and multi-criteria approach

    USGS Publications Warehouse

    Danielson, Patrick; Yang, Limin; Jin, Suming; Homer, Collin G.; Napton, Darrell

    2016-01-01

    We developed a method that analyzes the quality of the cultivated cropland class mapped in the USA National Land Cover Database (NLCD) 2006. The method integrates multiple geospatial datasets and a Multi Index Integrated Change Analysis (MIICA) change detection method that captures spectral changes to identify the spatial distribution and magnitude of potential commission and omission errors for the cultivated cropland class in NLCD 2006. The majority of the commission and omission errors in NLCD 2006 are in areas where cultivated cropland is not the most dominant land cover type. The errors are primarily attributed to the less accurate training dataset derived from the National Agricultural Statistics Service Cropland Data Layer dataset. In contrast, error rates are low in areas where cultivated cropland is the dominant land cover. Agreement between model-identified commission errors and independently interpreted reference data was high (79%). Agreement was low (40%) for the omission error comparison. Most commission errors in the NLCD 2006 cultivated cropland class involved confusion with low-intensity developed classes, while most omission errors involved herbaceous and shrub classes. Some errors were caused by inaccurate land cover change resulting from misclassification in NLCD 2001 and the subsequent land cover post-classification process.

  17. A Bayesian Multilevel Model for Microcystin Prediction in ...

    EPA Pesticide Factsheets

    The frequency of cyanobacteria blooms in North American lakes is increasing. A major concern with rising cyanobacteria blooms is microcystin, a common cyanobacterial hepatotoxin. To explore the conditions that promote high microcystin concentrations, we analyzed the US EPA National Lake Assessment (NLA) dataset collected in the summer of 2007. The NLA dataset is reported for nine eco-regions. We used the results of random forest modeling as a means of variable selection, from which we developed a Bayesian multilevel model of microcystin concentrations. Model parameters under a multilevel modeling framework are eco-region specific, but they are also assumed to be exchangeable across eco-regions for broad continental scaling. The exchangeability assumption ensures that both the common patterns and the eco-region-specific features will be reflected in the model. Furthermore, the method incorporates appropriate estimates of uncertainty. Our preliminary results show associations between microcystin and turbidity, total nutrients, and N:P ratios. Upon release of a comparable 2012 NLA dataset, we will apply Bayesian updating. The results will help develop management strategies to alleviate microcystin impacts and improve lake quality. This work provides a probabilistic framework for predicting microcystin presence in lakes. It allows insights into how changes in nutrient concentrations could potentially change toxin levels.
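
    A minimal sketch of the multilevel structure described here, with eco-region-specific intercepts that are exchangeable across regions (i.e. partially pooled toward a continental-scale mean), is given below using PyMC. The predictors, priors, and simulated data are illustrative assumptions, not the authors' model:

    ```python
    # Minimal partial-pooling sketch: eco-region-specific intercepts drawn from a common
    # (exchangeable) distribution, with shared slopes for the selected predictors.
    # Priors, predictors, and data below are illustrative assumptions (requires PyMC >= 4).
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(0)
    n_obs, n_regions, n_pred = 400, 9, 3        # e.g. nine NLA eco-regions
    region = rng.integers(0, n_regions, n_obs)  # eco-region index per lake
    X = rng.normal(size=(n_obs, n_pred))        # e.g. turbidity, total nutrients, N:P
    y = rng.normal(size=n_obs)                  # e.g. log microcystin (placeholder)

    with pm.Model() as model:
        mu_a = pm.Normal("mu_a", 0.0, 5.0)       # continental-scale mean intercept
        sigma_a = pm.HalfNormal("sigma_a", 2.0)  # spread across eco-regions
        alpha = pm.Normal("alpha", mu=mu_a, sigma=sigma_a, shape=n_regions)
        beta = pm.Normal("beta", 0.0, 2.0, shape=n_pred)
        sigma = pm.HalfNormal("sigma", 2.0)
        mu = alpha[region] + pm.math.dot(X, beta)
        pm.Normal("obs", mu=mu, sigma=sigma, observed=y)
        idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
    ```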

  18. Point-based warping with optimized weighting factors of displacement vectors

    NASA Astrophysics Data System (ADS)

    Pielot, Ranier; Scholz, Michael; Obermayer, Klaus; Gundelfinger, Eckart D.; Hess, Andreas

    2000-06-01

    The accurate comparison of inter-individual 3D brain image datasets requires non-affine transformation techniques (warping) to reduce geometric variations. Constrained by the biological prerequisites, we use in this study a landmark-based warping method with weighted sums of displacement vectors, which is enhanced by an optimization process. Furthermore, we investigate fast automatic procedures for determining landmarks to improve the practicability of 3D warping. This combined approach was tested on 3D autoradiographs of gerbil brains. The autoradiographs were obtained after injecting a non-metabolized radioactive glucose derivative into the gerbil, thereby visualizing neuronal activity in the brain. Afterwards the brain was processed with standard autoradiographical methods. The landmark generator computes corresponding reference points simultaneously within a given number of datasets by Monte Carlo techniques. The warping function is a distance-weighted exponential function with a landmark-specific weighting factor. These weighting factors are optimized by a computational evolution strategy. The warping quality is quantified by several coefficients (correlation coefficient, overlap index, and registration error). The described approach combines a highly suitable procedure to automatically detect landmarks in autoradiographical brain images with an enhanced point-based warping technique that optimizes the local weighting factors. This optimization process significantly improves the similarity between the warped and the target dataset.
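
    A stripped-down sketch of landmark-based warping with distance-weighted sums of displacement vectors is given below: each point is displaced by a weighted combination of landmark displacements, with an exponential distance kernel and per-landmark weighting factors. The kernel and its parameters are illustrative, and the evolution-strategy optimization of the weighting factors is not reproduced:

    ```python
    # Sketch of landmark-based warping with distance-weighted sums of displacement vectors.
    # Kernel, decay rate, and per-landmark factors are illustrative; the evolution-strategy
    # optimization of the weighting factors is not reproduced here.
    import numpy as np

    def warp_points(points, src_landmarks, dst_landmarks, landmark_weights, decay=0.1):
        """Displace each point by a distance-weighted sum of landmark displacement vectors.

        points:           (n, 3) coordinates to transform
        src_landmarks:    (m, 3) landmark positions in the source dataset
        dst_landmarks:    (m, 3) corresponding positions in the target dataset
        landmark_weights: (m,)   per-landmark weighting factors (to be optimized)
        """
        displacements = dst_landmarks - src_landmarks  # (m, 3)
        d = np.linalg.norm(points[:, None, :] - src_landmarks[None, :, :], axis=2)  # (n, m)
        w = landmark_weights[None, :] * np.exp(-decay * d)  # distance-weighted factors
        w = w / w.sum(axis=1, keepdims=True)                # normalize per point
        return points + w @ displacements

    # Toy example with 4 landmarks in a unit cube.
    rng = np.random.default_rng(0)
    src = rng.uniform(0, 1, (4, 3))
    dst = src + rng.normal(0, 0.05, (4, 3))
    pts = rng.uniform(0, 1, (10, 3))
    print(warp_points(pts, src, dst, np.ones(4)).shape)  # (10, 3)
    ```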

  19. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets.

    PubMed

    Yu, Yao; Hu, Hao; Bohlender, Ryan J; Hu, Fulan; Chen, Jiun-Sheng; Holt, Carson; Fowler, Jerry; Guthery, Stephen L; Scheet, Paul; Hildebrandt, Michelle A T; Yandell, Mark; Huff, Chad D

    2018-04-06

    High-throughput sequencing data are increasingly being made available to the research community for secondary analyses, providing new opportunities for large-scale association studies. However, heterogeneity in target capture and sequencing technologies often introduces strong technological stratification biases that overwhelm subtle signals of association in studies of complex traits. Here, we introduce the Cross-Platform Association Toolkit, XPAT, which provides a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets. XPAT includes tools to support cross-platform-aware variant calling, quality control filtering, gene-based association testing, and rare variant effect size estimation. To evaluate the performance of XPAT, we conducted case-control association studies for three diseases, including 783 breast cancer cases, 272 ovarian cancer cases, 205 Crohn disease cases and 3507 shared controls (including 1722 females), using sequencing data from multiple sources. XPAT greatly reduced Type I error inflation in the case-control analyses while replicating many previously identified disease-gene associations. We also show that association tests conducted with XPAT using cross-platform data have comparable performance to tests using matched-platform data. XPAT enables new association studies that combine existing sequencing datasets to identify genetic loci associated with common diseases and other complex traits.
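
    As a generic illustration of the kind of gene-based case-control test such a toolkit automates (not XPAT's actual statistics), carrier counts of rare qualifying variants in a gene can be compared between cases and controls with a Fisher exact test:

    ```python
    # Generic sketch of a simple gene-based burden test: carriers of rare qualifying
    # variants in cases vs. controls, compared with Fisher's exact test. This illustrates
    # the idea of gene-based association testing only; it is not XPAT's implementation.
    from scipy.stats import fisher_exact

    def burden_test(case_carriers, n_cases, control_carriers, n_controls):
        """2x2 Fisher exact test on carrier vs. non-carrier counts for one gene."""
        table = [
            [case_carriers, n_cases - case_carriers],
            [control_carriers, n_controls - control_carriers],
        ]
        return fisher_exact(table, alternative="two-sided")

    # Hypothetical counts for one gene.
    print(burden_test(case_carriers=18, n_cases=783, control_carriers=12, n_controls=3507))
    ```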

  20. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets.

    PubMed

    Vishnevsky, Oleg V; Bocharnikov, Andrey V; Kolchanov, Nikolay A

    2018-02-01

    The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and has led to the accumulation of information about a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging, and to avoid these difficulties researchers use various stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases, so that only a fraction of the top "peak" ChIP-Seq segments can be analyzed, or the scope of analysis must be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_CUDA (Compute Unified Device Architecture) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
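
    To make the notion of a fixed-length degenerate motif in the 15-letter IUPAC code concrete, the sketch below expands IUPAC symbols to their base sets and counts motif occurrences in a set of sequences. It illustrates only the matching task, not Argo_CUDA's exhaustive GPU enumeration of candidate motifs:

    ```python
    # Sketch of matching a fixed-length degenerate motif written in 15-letter IUPAC code
    # against a set of DNA sequences. This illustrates the matching task only; it is not
    # Argo_CUDA's exhaustive GPU enumeration of candidate motifs.
    IUPAC = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
        "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    }

    def matches(motif: str, window: str) -> bool:
        """True if every base in the window is allowed by the corresponding IUPAC symbol."""
        return all(base in IUPAC[sym] for sym, base in zip(motif, window))

    def count_occurrences(motif, sequences):
        """Count motif occurrences over all windows of all sequences."""
        k = len(motif)
        return sum(
            matches(motif, seq[i:i + k])
            for seq in sequences
            for i in range(len(seq) - k + 1)
        )

    # Toy example: the degenerate motif TGASTCA (S = C or G) against two short sequences.
    print(count_occurrences("TGASTCA", ["ACGTGACTCATT", "TTTGAGTCAAA"]))  # -> 2
    ```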
