Sample records for large volume datasets

  1. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    NASA Astrophysics Data System (ADS)

    Lary, D. J.

    2013-12-01

    A BigData case study is described in which multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster with an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce the development time and enhance the functionality, a high-level language capable of parallel processing (Matlab) has been used. Key considerations for the system are high-speed access to the large data volumes, persistence of those volumes, and precise process-time scheduling.
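
    As an illustration of the general approach described above (collocate multi-source features with in-situ particulate measurements and fit a machine-learning regressor), here is a minimal Python sketch. The feature names, synthetic data and scikit-learn estimator are assumptions for illustration; the study itself used Matlab on a distributed cluster.

    ```python
    # Minimal sketch (not the authors' Matlab pipeline): learn a mapping from collocated
    # satellite + meteorological features to in-situ particulate measurements.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 5000
    # Hypothetical feature columns: AOD, humidity, temperature, wind speed, boundary-layer height
    X = rng.normal(size=(n, 5))
    y = 10 + 3 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)  # synthetic PM-like target

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
    model.fit(X_tr, y_tr)
    print("R^2 on held-out collocations:", model.score(X_te, y_te))
    ```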

  2. Segmentation of Unstructured Datasets

    NASA Technical Reports Server (NTRS)

    Bhat, Smitha

    1996-01-01

    Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.
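
    A compact sketch of the core idea (threshold a scalar field and extract coherent connected regions), using a regular grid and scipy as a stand-in for the unstructured-grid algorithms developed in the thesis:

    ```python
    # Coherent-region extraction on a regular 2D scalar grid (a stand-in for the
    # unstructured-grid case treated in the thesis).
    import numpy as np
    from scipy import ndimage

    rng = np.random.default_rng(1)
    field = ndimage.gaussian_filter(rng.normal(size=(256, 256)), sigma=8)  # smooth scalar field

    mask = field > 0.05                        # cells inside the region of interest
    labels, n_regions = ndimage.label(mask)    # connected-component labeling
    sizes = np.bincount(labels.ravel())[1:]    # cells per labeled region
    print(f"{n_regions} coherent regions; largest covers {int(sizes.max())} cells")
    ```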

  3. A proposed framework for consensus-based lung tumour volume auto-segmentation in 4D computed tomography imaging

    NASA Astrophysics Data System (ADS)

    Martin, Spencer; Brophy, Mark; Palma, David; Louie, Alexander V.; Yu, Edward; Yaremko, Brian; Ahmad, Belal; Barron, John L.; Beauchemin, Steven S.; Rodrigues, George; Gaede, Stewart

    2015-02-01

    This work aims to propose and validate a framework for tumour volume auto-segmentation based on ground-truth estimates derived from multi-physician input contours to expedite 4D-CT based lung tumour volume delineation. 4D-CT datasets of ten non-small cell lung cancer (NSCLC) patients were manually segmented by 6 physicians. Multi-expert ground truth (GT) estimates were constructed using the STAPLE algorithm for the gross tumour volume (GTV) on all respiratory phases. Next, using a deformable model-based method, multi-expert GT on each individual phase of the 4D-CT dataset was propagated to all other phases providing auto-segmented GTVs and motion encompassing internal gross target volumes (IGTVs) based on GT estimates (STAPLE) from each respiratory phase of the 4D-CT dataset. Accuracy assessment of auto-segmentation employed graph cuts for 3D-shape reconstruction and point-set registration-based analysis yielding volumetric and distance-based measures. STAPLE-based auto-segmented GTV accuracy ranged from (81.51  ±  1.92) to (97.27  ±  0.28)% volumetric overlap of the estimated ground truth. IGTV auto-segmentation showed significantly improved accuracies with reduced variance for all patients ranging from 90.87 to 98.57% volumetric overlap of the ground truth volume. Additional metrics supported these observations with statistical significance. Accuracy of auto-segmentation was shown to be largely independent of selection of the initial propagation phase. IGTV construction based on auto-segmented GTVs within the 4D-CT dataset provided accurate and reliable target volumes compared to manual segmentation-based GT estimates. While inter-/intra-observer effects were largely mitigated, the proposed segmentation workflow is more complex than that of current clinical practice and requires further development.

  4. A proposed framework for consensus-based lung tumour volume auto-segmentation in 4D computed tomography imaging.

    PubMed

    Martin, Spencer; Brophy, Mark; Palma, David; Louie, Alexander V; Yu, Edward; Yaremko, Brian; Ahmad, Belal; Barron, John L; Beauchemin, Steven S; Rodrigues, George; Gaede, Stewart

    2015-02-21

    This work aims to propose and validate a framework for tumour volume auto-segmentation based on ground-truth estimates derived from multi-physician input contours to expedite 4D-CT based lung tumour volume delineation. 4D-CT datasets of ten non-small cell lung cancer (NSCLC) patients were manually segmented by 6 physicians. Multi-expert ground truth (GT) estimates were constructed using the STAPLE algorithm for the gross tumour volume (GTV) on all respiratory phases. Next, using a deformable model-based method, multi-expert GT on each individual phase of the 4D-CT dataset was propagated to all other phases providing auto-segmented GTVs and motion encompassing internal gross target volumes (IGTVs) based on GT estimates (STAPLE) from each respiratory phase of the 4D-CT dataset. Accuracy assessment of auto-segmentation employed graph cuts for 3D-shape reconstruction and point-set registration-based analysis yielding volumetric and distance-based measures. STAPLE-based auto-segmented GTV accuracy ranged from (81.51  ±  1.92) to (97.27  ±  0.28)% volumetric overlap of the estimated ground truth. IGTV auto-segmentation showed significantly improved accuracies with reduced variance for all patients ranging from 90.87 to 98.57% volumetric overlap of the ground truth volume. Additional metrics supported these observations with statistical significance. Accuracy of auto-segmentation was shown to be largely independent of selection of the initial propagation phase. IGTV construction based on auto-segmented GTVs within the 4D-CT dataset provided accurate and reliable target volumes compared to manual segmentation-based GT estimates. While inter-/intra-observer effects were largely mitigated, the proposed segmentation workflow is more complex than that of current clinical practice and requires further development.
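
    The consensus step in the two records above relies on the STAPLE algorithm. Below is a deliberately simplified EM sketch of that idea (estimate a consensus mask while learning each rater's sensitivity and specificity); it is an illustration only, not the validated STAPLE implementation used in the study.

    ```python
    # Simplified STAPLE-style EM for a consensus binary segmentation from several raters.
    # Fixed global prior for illustration (real STAPLE can use a spatially varying prior).
    import numpy as np

    def simple_staple(D, n_iter=30, prior=0.5):
        """D: (n_raters, n_voxels) binary decisions. Returns consensus probability per voxel."""
        R, N = D.shape
        p = np.full(R, 0.9)  # initial sensitivity per rater
        q = np.full(R, 0.9)  # initial specificity per rater
        for _ in range(n_iter):
            # E-step: probability that each voxel is truly foreground
            a = prior * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
            b = (1 - prior) * np.prod(np.where(D == 0, q[:, None], 1 - q[:, None]), axis=0)
            W = a / (a + b + 1e-12)
            # M-step: update rater performance parameters
            p = (D * W).sum(axis=1) / (W.sum() + 1e-12)
            q = ((1 - D) * (1 - W)).sum(axis=1) / ((1 - W).sum() + 1e-12)
        return W

    raters = np.random.default_rng(2).integers(0, 2, size=(6, 10000))  # 6 physicians, toy masks
    consensus = simple_staple(raters) > 0.5
    print("consensus foreground voxels:", int(consensus.sum()))
    ```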

  5. Parallel Rendering of Large Time-Varying Volume Data

    NASA Technical Reports Server (NTRS)

    Garbutt, Alexander E.

    2005-01-01

    Interactive visualization of large time-varying 3D volume datasets has been and still is a great challenge to the modern computational world. It stretches the limits of the memory capacity, the disk space, the network bandwidth and the CPU speed of a conventional computer. In this SURF project, we propose to develop a parallel volume rendering program on SGI's Prism, a cluster computer equipped with state-of-the-art graphics hardware. The proposed program combines both parallel computing and hardware rendering in order to achieve an interactive rendering rate. We use 3D texture mapping and a hardware shader to implement 3D volume rendering on each workstation. We use SGI's VisServer to enable remote rendering using Prism's graphics hardware. And last, we will integrate this new program with ParVox, a parallel distributed visualization system developed at JPL. At the end of the project, we will demonstrate remote interactive visualization using this new hardware volume renderer on JPL's Prism System using a time-varying dataset from selected JPL applications.

  6. Extraction and LOD control of colored interval volumes

    NASA Astrophysics Data System (ADS)

    Miyamura, Hiroko N.; Takeshima, Yuriko; Fujishiro, Issei; Saito, Takafumi

    2005-03-01

    Interval volume serves as a generalized isosurface and represents a three-dimensional subvolume for which the associated scalar field values lie within a user-specified closed interval. In general, it is not an easy task for novices to specify the scalar field interval corresponding to their ROIs. In order to extract interval volumes from which desirable geometric features can be mined effectively, we propose a suggestive technique which extracts interval volumes automatically based on the global examination of the field contrast structure. Also proposed here is a simplification scheme for decimating resultant triangle patches to realize efficient transmission and rendition of large-scale interval volumes. Color distributions as well as geometric features are taken into account to select the best edges to be collapsed. In addition, when a user wants to selectively display and analyze the original dataset, the simplified dataset is restructured to the original quality. Several simulated and acquired datasets are used to demonstrate the effectiveness of the present methods.

  7. Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

    PubMed Central

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance

    2013-01-01

    RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets. PMID:25937948

  8. Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies.

    PubMed

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance

    2013-01-01

    RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets.
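
    Stormbow's value lies mainly in per-sample parallelism; the pattern itself is simple, as in the hedged sketch below (the command is a placeholder, not Stormbow's actual interface).

    ```python
    # The per-sample parallel pattern, sketched (placeholder command; not Stormbow's interface).
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    samples = ["sample_%03d" % i for i in range(178)]   # e.g. the 178 RNA-Seq samples in the test

    def process_sample(name):
        # Hypothetical alignment/quantification step; replace with the real pipeline command.
        return subprocess.run(["echo", "aligning", name], capture_output=True, text=True).stdout

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_sample, samples))   # collect per-sample outputs here
    print(len(results), "samples processed")
    ```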

  9. Exploring Relationships in Big Data

    NASA Astrophysics Data System (ADS)

    Mahabal, A.; Djorgovski, S. G.; Crichton, D. J.; Cinquini, L.; Kelly, S.; Colbert, M. A.; Kincaid, H.

    2015-12-01

    Big Data are characterized by several different 'V's: Volume, Veracity, Volatility, Value and so on. For many datasets, inflated Volumes due to redundant features often make the data noisier and more difficult to extract Value from. This is especially true if one is comparing/combining different datasets and the metadata are diverse. We have been exploring ways to exploit such datasets through a variety of statistical machinery and visualization. We show how we have applied this approach to time-series from large astronomical sky-surveys; this was done in the Virtual Observatory framework. More recently we have been doing similar work for a completely different domain, viz. biology/cancer. The methodology reuse involves application to diverse datasets gathered through the various centers associated with the Early Detection Research Network (EDRN) for cancer, an initiative of the National Cancer Institute (NCI). Application to Geo datasets is a natural extension.

  10. 3D geometric split-merge segmentation of brain MRI datasets.

    PubMed

    Marras, Ioannis; Nikolaidis, Nikolaos; Pitas, Ioannis

    2014-05-01

    In this paper, a novel method for MRI volume segmentation based on region adaptive splitting and merging is proposed. The method, called Adaptive Geometric Split Merge (AGSM) segmentation, aims at finding complex geometrical shapes that consist of homogeneous geometrical 3D regions. In each volume splitting step, several splitting strategies are examined and the most appropriate is activated. A way to find the maximal homogeneity axis of the volume is also introduced. Along this axis, the volume splitting technique divides the entire volume in a number of large homogeneous 3D regions, while at the same time, it defines more clearly small homogeneous regions within the volume in such a way that they have greater probabilities of survival at the subsequent merging step. Region merging criteria are proposed to this end. The presented segmentation method has been applied to brain MRI medical datasets to provide segmentation results when each voxel is composed of one tissue type (hard segmentation). The volume splitting procedure does not require training data, while it demonstrates improved segmentation performance in noisy brain MRI datasets, when compared to the state of the art methods. Copyright © 2014 Elsevier Ltd. All rights reserved.
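
    A toy sketch of the splitting half of such a split-merge scheme: recursively bisect a (sub)volume along a chosen axis until each block passes a variance-based homogeneity test. The axis-selection criterion and thresholds are illustrative assumptions; the merging step and the AGSM-specific strategies are not shown.

    ```python
    # Toy recursive split step: bisect a (sub)volume along the axis whose halves are most
    # internally homogeneous, until each block passes a variance test. Merging is omitted.
    import numpy as np

    def split(volume, var_tol=25.0, min_size=8):
        blocks = []
        def recurse(v):
            if v.var() <= var_tol or min(v.shape) <= min_size:
                blocks.append(v)
                return
            # illustrative criterion: pick the axis minimizing the worse half's variance
            axis = int(np.argmin([
                max(half.var() for half in np.array_split(v, 2, axis=a))
                for a in range(v.ndim)
            ]))
            for half in np.array_split(v, 2, axis=axis):
                recurse(half)
        recurse(volume)
        return blocks

    vol = np.random.default_rng(3).normal(100, 5, size=(64, 64, 64))
    print(len(split(vol)), "homogeneous blocks")
    ```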

  11. Organisational and extraorganisational determinants of volume of service delivery by English community pharmacies: a cross-sectional survey and secondary data analysis.

    PubMed

    Hann, Mark; Schafheutle, Ellen I; Bradley, Fay; Elvey, Rebecca; Wagner, Andrew; Halsall, Devina; Hassell, Karen; Jacobs, Sally

    2017-10-10

    This study aimed to identify the organisational and extraorganisational factors associated with existing variation in the volume of services delivered by community pharmacies. Linear and ordered logistic regression of linked national data from secondary sources-community pharmacy activity, socioeconomic and health need datasets-and primary data from a questionnaire survey of community pharmacies in nine diverse geographical areas in England. Annual dispensing volume; annual volume of medicines use reviews (MURs). National dataset (n=10 454 pharmacies): greater dispensing volume was significantly associated with pharmacy ownership type (large chains>independents>supermarkets), greater deprivation, higher local prevalence of cardiovascular disease and depression, older people (aged >75 years) and infants (aged 0-4 years) but lower prevalence of mental health conditions. Greater volume of MURs was significantly associated with pharmacy ownership type (large chains/supermarkets>independents), greater dispensing volume, and lower disease prevalence. Survey dataset (n=285 pharmacies; response=34.6%): greater dispensing volume was significantly associated with staffing, skill-mix, organisational culture, years open and greater deprivation. Greater MUR volume was significantly associated with pharmacy ownership type (large chains/supermarkets>independents), greater dispensing volume, weekly opening hours and lower asthma prevalence. Organisational and extraorganisational factors were found to impact differently on dispensing volume and MUR activity, the latter being driven more by corporate ownership than population need. While levels of staffing and skill-mix were associated with dispensing volume, they did not influence MUR activity. Despite recent changes to the contractual framework, the existing fee-for-service reimbursement may therefore not be the most appropriate for the delivery of cognitive (rather than supply) services, still appearing to incentivise quantity over the quality (in terms of appropriate targeting) of services delivered. Future research should focus on the development of quality measures that could be incorporated into community pharmacy reimbursement mechanisms. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
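
    For orientation, a hedged sketch of the kind of linear model reported for dispensing volume (column names and synthetic data are placeholders; the MUR analysis additionally used ordered logistic regression, which is not shown):

    ```python
    # Sketch of a linear model of dispensing volume on ownership and area-level covariates
    # (placeholder column names and synthetic data; not the study's analysis code).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    n = 500
    df = pd.DataFrame({
        "ownership": rng.choice(["independent", "large_chain", "supermarket"], size=n),
        "deprivation": rng.uniform(0, 1, size=n),
        "cvd_prevalence": rng.normal(3.5, 0.5, size=n),
    })
    df["dispensing_volume"] = (5000 + 2000 * (df.ownership == "large_chain")
                               + 1500 * df.deprivation + rng.normal(0, 500, size=n))

    model = smf.ols("dispensing_volume ~ C(ownership) + deprivation + cvd_prevalence", data=df).fit()
    print(model.summary().tables[1])
    ```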

  12. Diurnal fluctuations in brain volume: Statistical analyses of MRI from large populations.

    PubMed

    Nakamura, Kunio; Brown, Robert A; Narayanan, Sridar; Collins, D Louis; Arnold, Douglas L

    2015-09-01

    We investigated fluctuations in brain volume throughout the day using statistical modeling of magnetic resonance imaging (MRI) from large populations. We applied fully automated image analysis software to measure the brain parenchymal fraction (BPF), defined as the ratio of the brain parenchymal volume and intracranial volume, thus accounting for variations in head size. The MRI data came from serial scans of multiple sclerosis (MS) patients in clinical trials (n=755, 3269 scans) and from subjects participating in the Alzheimer's Disease Neuroimaging Initiative (ADNI, n=834, 6114 scans). The percent change in BPF was modeled with a linear mixed effect (LME) model, and the model was applied separately to the MS and ADNI datasets. The LME model for the MS datasets included random subject effects (intercept and slope over time) and fixed effects for the time-of-day, time from the baseline scan, and trial, which accounted for trial-related effects (for example, different inclusion criteria and imaging protocol). The model for ADNI additionally included the demographics (baseline age, sex, subject type [normal, mild cognitive impairment, or Alzheimer's disease], and interaction between subject type and time from baseline). There was a statistically significant effect of time-of-day on the BPF change in MS clinical trial datasets (-0.180 per day, that is, 0.180% of intracranial volume, p=0.019) as well as the ADNI dataset (-0.438 per day, that is, 0.438% of intracranial volume, p<0.0001), showing that the brain volume is greater in the morning. Linearly correcting the BPF values with the time-of-day reduced the required sample size to detect a 25% treatment effect (80% power and 0.05 significance level) on change in brain volume from 2 time-points over a period of 1 year by 2.6%. Our results have significant implications for future brain volumetric studies, suggesting that there is a potential acquisition time bias that should be randomized or statistically controlled to account for the day-to-day brain volume fluctuations. Copyright © 2015 Elsevier Inc. All rights reserved.
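
    A minimal sketch of such a linear mixed-effects model in statsmodels, with a random intercept and slope per subject and a fixed time-of-day effect (column names and synthetic data are assumptions, not the study's code):

    ```python
    # Linear mixed-effects sketch: random intercept/slope per subject, fixed time-of-day effect.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    subjects = np.repeat(np.arange(100), 4)                 # 100 subjects, 4 scans each
    years = np.tile(np.array([0.0, 0.5, 1.0, 1.5]), 100)
    tod = rng.uniform(8, 18, size=subjects.size) / 24.0     # acquisition time as fraction of a day
    bpf_change = -0.3 * years - 0.18 * tod + rng.normal(0, 0.2, size=subjects.size)

    df = pd.DataFrame({"subject": subjects, "years": years, "tod": tod, "bpf_change": bpf_change})
    m = smf.mixedlm("bpf_change ~ tod + years", data=df,
                    groups=df["subject"], re_formula="~years").fit()
    print(m.summary())
    ```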

  13. Animated analysis of geoscientific datasets: An interactive graphical application

    NASA Astrophysics Data System (ADS)

    Morse, Peter; Reading, Anya; Lueg, Christopher

    2017-12-01

    Geoscientists are required to analyze and draw conclusions from increasingly large volumes of data. There is a need to recognise and characterise features and changing patterns of Earth observables within such large datasets. It is also necessary to identify significant subsets of the data for more detailed analysis. We present an innovative, interactive software tool and workflow to visualise, characterise, sample and tag large geoscientific datasets from both local and cloud-based repositories. It uses an animated interface and human-computer interaction to utilise the capacity of human expert observers to identify features via enhanced visual analytics. 'Tagger' enables users to analyze datasets that are too large in volume to be drawn legibly on a reasonable number of single static plots. Users interact with the moving graphical display, tagging data ranges of interest for subsequent attention. The tool provides a rapid pre-pass process using fast GPU-based OpenGL graphics and data-handling and is coded in the Quartz Composer visual programming language (VPL) on Mac OSX. It makes use of interoperable data formats, and cloud-based (or local) data storage and compute. In a case study, Tagger was used to characterise a decade (2000-2009) of data recorded by the Cape Sorell Waverider Buoy, located approximately 10 km off the west coast of Tasmania, Australia. These data serve as a proxy for the understanding of Southern Ocean storminess, which has both local and global implications. This example shows use of the tool to identify and characterise 4 different types of storm and non-storm events during this time. Events characterised in this way are compared with conventional analysis, noting advantages and limitations of data analysis using animation and human interaction. Tagger provides a new ability to make use of humans as feature detectors in computer-based analysis of large-volume geoscience and other data.

  14. A Lightweight Remote Parallel Visualization Platform for Interactive Massive Time-varying Climate Data Analysis

    NASA Astrophysics Data System (ADS)

    Li, J.; Zhang, T.; Huang, Q.; Liu, Q.

    2014-12-01

    Today's climate datasets are characterized by large volume, a high degree of spatiotemporal complexity, and fast evolution over time. As visualizing large-volume distributed climate datasets is computationally intensive, traditional desktop-based visualization applications fail to handle the computational load. Recently, scientists have developed remote visualization techniques to address this computational issue. Remote visualization techniques usually leverage server-side parallel computing capabilities to perform visualization tasks and deliver visualization results to clients through the network. In this research, we aim to build a remote parallel visualization platform for visualizing and analyzing massive climate data. Our visualization platform was built on ParaView, which is one of the most popular open-source remote visualization and analysis applications. To further enhance the scalability and stability of the platform, we have employed cloud computing techniques to support the deployment of the platform. In this platform, all climate datasets are regular grid data stored in NetCDF format. Three types of data access methods are supported in the platform: accessing remote datasets provided by OpenDAP servers, accessing datasets hosted on the web visualization server and accessing local datasets. Regardless of the data access method, all visualization tasks are completed at the server side to reduce the workload of clients. As a proof of concept, we have implemented a set of scientific visualization methods to show the feasibility of the platform. Preliminary results indicate that the framework can address the computation limitation of desktop-based visualization applications.
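
    For context, a minimal pvpython sketch of the server-side rendering step such a platform builds on (the file name is an assumption; this is not the authors' platform code):

    ```python
    # Minimal pvpython sketch of server-side rendering with ParaView (run with ParaView's pvpython;
    # file name assumed). The platform above builds on the same client/server machinery.
    from paraview.simple import OpenDataFile, Show, Render, SaveScreenshot, GetActiveViewOrCreate

    reader = OpenDataFile("climate_subset.nc")   # NetCDF dataset hosted on the server
    view = GetActiveViewOrCreate("RenderView")
    Show(reader, view)                           # rendering happens where the server runs
    Render(view)
    SaveScreenshot("climate_frame.png", view)    # result can be delivered to a thin client
    ```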

  15. Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E

    NASA Technical Reports Server (NTRS)

    Ma, Kwan-Liu; Crockett, Thomas W.

    1999-01-01

    This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.

  16. Organisational and extraorganisational determinants of volume of service delivery by English community pharmacies: a cross-sectional survey and secondary data analysis

    PubMed Central

    Hann, Mark; Schafheutle, Ellen I; Bradley, Fay; Elvey, Rebecca; Wagner, Andrew; Halsall, Devina; Hassell, Karen

    2017-01-01

    Objectives This study aimed to identify the organisational and extraorganisational factors associated with existing variation in the volume of services delivered by community pharmacies. Design and setting Linear and ordered logistic regression of linked national data from secondary sources—community pharmacy activity, socioeconomic and health need datasets—and primary data from a questionnaire survey of community pharmacies in nine diverse geographical areas in England. Outcome measures Annual dispensing volume; annual volume of medicines use reviews (MURs). Results National dataset (n=10 454 pharmacies): greater dispensing volume was significantly associated with pharmacy ownership type (large chains>independents>supermarkets), greater deprivation, higher local prevalence of cardiovascular disease and depression, older people (aged >75 years) and infants (aged 0–4 years) but lower prevalence of mental health conditions. Greater volume of MURs was significantly associated with pharmacy ownership type (large chains/supermarkets>>independents), greater dispensing volume, and lower disease prevalence. Survey dataset (n=285 pharmacies; response=34.6%): greater dispensing volume was significantly associated with staffing, skill-mix, organisational culture, years open and greater deprivation. Greater MUR volume was significantly associated with pharmacy ownership type (large chains/supermarkets>>independents), greater dispensing volume, weekly opening hours and lower asthma prevalence. Conclusions Organisational and extraorganisational factors were found to impact differently on dispensing volume and MUR activity, the latter being driven more by corporate ownership than population need. While levels of staffing and skill-mix were associated with dispensing volume, they did not influence MUR activity. Despite recent changes to the contractual framework, the existing fee-for-service reimbursement may therefore not be the most appropriate for the delivery of cognitive (rather than supply) services, still appearing to incentivise quantity over the quality (in terms of appropriate targeting) of services delivered. Future research should focus on the development of quality measures that could be incorporated into community pharmacy reimbursement mechanisms. PMID:29018074

  17. Extraction of drainage networks from large terrain datasets using high throughput computing

    NASA Astrophysics Data System (ADS)

    Gong, Jianya; Xie, Jibo

    2009-02-01

    Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage networks extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. A HTC environment is employed to test the proposed methods with real datasets.
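
    The per-cell computation at the heart of drainage extraction is the D8 steepest-descent flow direction; a plain numpy sketch is below. The HTC decomposition along natural watershed boundaries described above is not shown.

    ```python
    # Basic D8 flow-direction step on a small DEM tile (numpy only).
    import numpy as np

    def d8_flow_direction(dem):
        """For each interior cell, return the index (0-7) of the steepest downslope neighbour."""
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
        dist = np.array([np.hypot(dy, dx) for dy, dx in offsets])
        rows, cols = dem.shape
        direction = np.full((rows, cols), -1, dtype=int)
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                drops = [(dem[r, c] - dem[r + dy, c + dx]) / d
                         for (dy, dx), d in zip(offsets, dist)]
                if max(drops) > 0:                  # otherwise a pit or flat cell
                    direction[r, c] = int(np.argmax(drops))
        return direction

    dem = np.random.default_rng(6).random((50, 50)).cumsum(axis=0)  # toy sloping surface
    dirs = d8_flow_direction(dem)
    print("pit/flat interior cells:", int((dirs[1:-1, 1:-1] == -1).sum()))
    ```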

  18. Content-level deduplication on mobile internet datasets

    NASA Astrophysics Data System (ADS)

    Hou, Ziyu; Chen, Xunxun; Wang, Yang

    2017-06-01

    Various systems and applications involve a large volume of duplicate items. Given the high data redundancy in real-world datasets, data deduplication can reduce storage capacity and improve the utilization of network bandwidth. However, the chunks used by existing deduplication systems range in size from 4 KB to over 16 KB, so existing systems are not applicable to datasets consisting of short records. In this paper, we propose a new framework called SF-Dedup which is able to implement the deduplication process on a large set of Mobile Internet records, where the size of a record can be smaller than 100 B, or even smaller than 10 B. SF-Dedup is a short-fingerprint, in-line deduplication scheme that resolves hash collisions. Results of experimental applications illustrate that SF-Dedup is able to reduce storage capacity and shorten query time on a relational database.
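
    A hedged sketch of the short-fingerprint idea with explicit collision resolution (full records that share a fingerprint are compared directly); illustrative only, not the SF-Dedup implementation:

    ```python
    # Short-fingerprint, in-line deduplication with explicit collision resolution.
    import hashlib

    def fingerprint(record: bytes, n_bytes: int = 4) -> bytes:
        return hashlib.sha1(record).digest()[:n_bytes]   # deliberately short fingerprint

    store = {}        # short fingerprint -> list of distinct full records sharing it
    duplicates = 0

    def ingest(record: bytes):
        global duplicates
        bucket = store.setdefault(fingerprint(record), [])
        if record in bucket:      # a fingerprint collision is resolved by comparing full content
            duplicates += 1
        else:
            bucket.append(record)

    for rec in [b"GET /a", b"GET /b", b"GET /a", b"GET /a"]:
        ingest(rec)
    print(f"{duplicates} duplicate records suppressed, {sum(map(len, store.values()))} unique stored")
    ```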

  19. [Quantification of pulmonary emphysema in multislice-CT using different software tools].

    PubMed

    Heussel, C P; Achenbach, T; Buschsieweke, C; Kuhnigk, J; Weinheimer, O; Hammer, G; Düber, C; Kauczor, H-U

    2006-10-01

    Thin-section MSCT datasets of the lung, comprising approximately 300 images, are difficult to evaluate manually. A computer-assisted pre-diagnosis can help with reporting. Furthermore, post-processing techniques, for instance for quantification of emphysema on the basis of three-dimensional anatomical information, might be improved and the workflow might be further automated. The results of 4 programs (Pulmo, Volume, YACTA and PulmoFUNC) for the quantitative analysis of emphysema (lung and emphysema volume, mean lung density and emphysema index) of 30 consecutive thin-section MSCT datasets with different emphysema severity levels were compared. The classification result of the YACTA program for different types of emphysema was also analyzed. Pulmo and Volume have a median operating time of 105 and 59 minutes respectively, due to the necessity for extensive manual correction of the lung segmentation. The largely automated programs PulmoFUNC and YACTA have a median runtime of 26 and 16 minutes, respectively. The evaluation with Pulmo and Volume using 2 different datasets resulted in implausible values. PulmoFUNC crashed with 2 other datasets in a reproducible manner. Only with YACTA could all image datasets be evaluated. The lung volume, emphysema volume, emphysema index and mean lung density determined by YACTA and PulmoFUNC are significantly larger than the corresponding values of Volume and Pulmo (differences: Volume: 119 cm(3)/65 cm(3)/1 %/17 HU, Pulmo: 60 cm(3)/96 cm(3)/1 %/37 HU). Classification of the emphysema type was in agreement with that of the radiologist in 26 panlobular cases, in 22 paraseptal cases and in 15 centrilobular emphysema cases. The substantial time required hinders the use of quantitative emphysema analysis in the clinical routine. The results of YACTA and PulmoFUNC are affected by the dedicated exclusion of the tracheobronchial system. These fully automatic tools enable not only fast quantification without manual interaction, but also reproducible measurements without user dependence.
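
    The densitometric quantities compared above can be sketched as follows; the -950 HU emphysema threshold is a common convention and an assumption here, since the abstract does not state the threshold the programs use.

    ```python
    # Lung volume, emphysema volume, emphysema index and mean lung density from a masked CT volume.
    # The -950 HU threshold is a common convention and an assumption here.
    import numpy as np

    def emphysema_metrics(hu_volume, lung_mask, voxel_volume_mm3, threshold_hu=-950):
        lung_hu = hu_volume[lung_mask]
        emphysema_voxels = (lung_hu < threshold_hu).sum()
        return {
            "lung_volume_cm3": lung_mask.sum() * voxel_volume_mm3 / 1000.0,
            "emphysema_volume_cm3": emphysema_voxels * voxel_volume_mm3 / 1000.0,
            "emphysema_index_pct": 100.0 * emphysema_voxels / lung_hu.size,
            "mean_lung_density_hu": float(lung_hu.mean()),
        }

    hu = np.random.default_rng(7).normal(-820, 90, size=(64, 64, 64))   # toy lung CT values
    mask = np.ones_like(hu, dtype=bool)                                 # toy lung mask
    print(emphysema_metrics(hu, mask, voxel_volume_mm3=0.6 * 0.6 * 1.0))
    ```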

  20. Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.

    PubMed

    Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher

    2008-01-01

    With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established, however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.
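
    To make the idea of a common container concrete, here is a toy binary format with a fixed header (dimensions, component count, dtype code) followed by a raw brick. This illustrates the concept only and is not the UVF specification.

    ```python
    # Toy binary volume container: fixed header + raw voxel data (NOT the actual UVF format).
    import struct
    import numpy as np

    MAGIC, HEADER = b"TVOL", struct.Struct("<4s3IHH")   # magic, nx, ny, nz, components, dtype code

    def write_volume(path, vol):
        with open(path, "wb") as f:
            f.write(HEADER.pack(MAGIC, *vol.shape[:3], 1, 0))   # dtype code 0 := uint8
            f.write(np.ascontiguousarray(vol, dtype=np.uint8).tobytes())

    def read_volume(path):
        with open(path, "rb") as f:
            magic, nx, ny, nz, _, _ = HEADER.unpack(f.read(HEADER.size))
            assert magic == MAGIC
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(nx, ny, nz)

    write_volume("brick.tvol", np.random.default_rng(8).integers(0, 256, size=(32, 32, 32)))
    print(read_volume("brick.tvol").shape)
    ```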

  1. TLEM 2.0 - a comprehensive musculoskeletal geometry dataset for subject-specific modeling of lower extremity.

    PubMed

    Carbone, V; Fluit, R; Pellikaan, P; van der Krogt, M M; Janssen, D; Damsgaard, M; Vigneron, L; Feilkas, T; Koopman, H F J M; Verdonschot, N

    2015-03-18

    When analyzing complex biomechanical problems such as predicting the effects of orthopedic surgery, subject-specific musculoskeletal models are essential to achieve reliable predictions. The aim of this paper is to present the Twente Lower Extremity Model 2.0, a new comprehensive dataset of the musculoskeletal geometry of the lower extremity, which is based on medical imaging data and dissection performed on the right lower extremity of a fresh male cadaver. Bone, muscle and subcutaneous fat (including skin) volumes were segmented from computed tomography and magnetic resonance imaging scans. Inertial parameters were estimated from the image-based segmented volumes. A complete cadaver dissection was performed, in which bony landmarks, attachment sites and lines-of-action of 55 muscle actuators and 12 ligaments, bony wrapping surfaces, and joint geometry were measured. The obtained musculoskeletal geometry dataset was finally implemented in the AnyBody Modeling System (AnyBody Technology A/S, Aalborg, Denmark), resulting in a model consisting of 12 segments, 11 joints and 21 degrees of freedom, and including 166 muscle-tendon elements for each leg. The new TLEM 2.0 dataset was purposely built to be easily combined with novel image-based scaling techniques, such as bone surface morphing, muscle volume registration and muscle-tendon path identification, in order to obtain subject-specific musculoskeletal models in a quick and accurate way. The complete dataset, including CT and MRI scans and segmented volumes and surfaces, is made available at http://www.utwente.nl/ctw/bw/research/projects/TLEMsafe for the biomechanical community, in order to accelerate the development and adoption of subject-specific models on a large scale. TLEM 2.0 is freely shared for non-commercial use only, under acceptance of the TLEMsafe Research License Agreement. Copyright © 2014 Elsevier Ltd. All rights reserved.

  2. Automated Fault Interpretation and Extraction using Improved Supplementary Seismic Datasets

    NASA Astrophysics Data System (ADS)

    Bollmann, T. A.; Shank, R.

    2017-12-01

    During the interpretation of seismic volumes, it is necessary to interpret faults along with horizons of interest. With the improvement of technology, the interpretation of faults can be expedited with the aid of different algorithms that create supplementary seismic attributes, such as semblance and coherency. These products highlight discontinuities, but still need a large amount of human interaction to interpret faults and are plagued by noise and stratigraphic discontinuities. Hale (2013) presents a method to improve on these datasets by creating what is referred to as a Fault Likelihood volume. In general, these volumes contain less noise and do not emphasize stratigraphic features. Instead, planar features within a specified strike and dip range are highlighted. Once a satisfactory Fault Likelihood Volume is created, extraction of fault surfaces is much easier. The extracted fault surfaces are then exported to interpretation software for QC. Numerous software packages have implemented this methodology with varying results. After investigating these platforms, we developed a preferred Automated Fault Interpretation workflow.

  3. Development of global sea ice 6.0 CICE configuration for the Met Office global coupled model

    DOE PAGES

    Rae, J. G. L.; Hewitt, H. T.; Keen, A. B.; ...

    2015-03-05

    The new sea ice configuration GSI6.0, used in the Met Office global coupled configuration GC2.0, is described and the sea ice extent, thickness and volume are compared with the previous configuration and with observationally-based datasets. In the Arctic, the sea ice is thicker in all seasons than in the previous configuration, and there is now better agreement of the modelled concentration and extent with the HadISST dataset. In the Antarctic, a warm bias in the ocean model has been exacerbated at the higher resolution of GC2.0, leading to a large reduction in ice extent and volume; further work is required to rectify this in future configurations.

  4. Comparison of epicardial adipose tissue radiodensity threshold between contrast and non-contrast enhanced computed tomography scans: A cohort study of derivation and validation.

    PubMed

    Xu, Lingyu; Xu, Yuancheng; Coulden, Richard; Sonnex, Emer; Hrybouski, Stanislau; Paterson, Ian; Butler, Craig

    2018-05-11

    Epicardial adipose tissue (EAT) volume derived from contrast enhanced (CE) computed tomography (CT) scans is not well validated. We aim to establish a reliable threshold to accurately quantify EAT volume from CE datasets. We analyzed EAT volume on paired non-contrast (NC) and CE datasets from 25 patients to derive appropriate Hounsfield (HU) cutpoints to equalize the two EAT volume estimates. The gold standard threshold (-190HU, -30HU) was used to assess EAT volume on NC datasets. For CE datasets, EAT volumes were estimated using three previously reported thresholds: (-190HU, -30HU), (-190HU, -15HU), (-175HU, -15HU) and were analyzed by a semi-automated 3D Fat analysis software. Subsequently, we applied a threshold correction to (-190HU, -30HU) based on mean differences in radiodensity between NC and CE images (ΔEATrd = CE radiodensity - NC radiodensity). We then validated our findings on EAT threshold in 21 additional patients with paired CT datasets. EAT volume from CE datasets using previously published thresholds consistently underestimated EAT volume relative to the NC dataset standard by 8.2%-19.1%. Using our corrected threshold (-190HU, -3HU) in CE datasets yielded statistically identical EAT volume to NC EAT volume in the validation cohort (186.1 ± 80.3 vs. 185.5 ± 80.1 cm3, Δ = 0.6 cm3, 0.3%, p = 0.374). Estimating EAT volume from contrast enhanced CT scans using a corrected threshold of -190HU, -3HU provided excellent agreement with EAT volume from non-contrast CT scans using a standard threshold of -190HU, -30HU. Copyright © 2018. Published by Elsevier B.V.
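
    The reported correction is simple arithmetic: shift the upper NC threshold by the mean contrast-induced radiodensity offset, ΔEATrd ≈ +27 HU, so (-190, -30) becomes approximately (-190, -3). A sketch with synthetic values:

    ```python
    # EAT volume from an HU threshold, with the contrast correction applied to the upper bound.
    # Numbers are synthetic and for illustration only.
    import numpy as np

    def eat_volume_cm3(hu, mask, voxel_mm3, lower=-190, upper=-30):
        sel = mask & (hu >= lower) & (hu <= upper)
        return sel.sum() * voxel_mm3 / 1000.0

    nc_hu = np.random.default_rng(9).normal(-80, 40, size=(64, 64, 64))   # toy NC pericardial ROI
    ce_hu = nc_hu + 27.0                                                  # toy contrast enhancement
    roi = np.ones_like(nc_hu, dtype=bool)

    delta = ce_hu[roi].mean() - nc_hu[roi].mean()     # ΔEATrd = CE radiodensity - NC radiodensity
    vol_nc = eat_volume_cm3(nc_hu, roi, 0.7 ** 2 * 2.5)                       # standard (-190, -30)
    vol_ce = eat_volume_cm3(ce_hu, roi, 0.7 ** 2 * 2.5, upper=-30 + delta)    # corrected (~ -190, -3)
    print(round(delta, 1), round(vol_nc, 1), round(vol_ce, 1))
    ```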

  5. Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets

    PubMed Central

    Jeong, Won-Ki; Beyer, Johanna; Hadwiger, Markus; Vazquez, Amelio; Pfister, Hanspeter; Whitaker, Ross T.

    2011-01-01

    Recent advances in scanning technology provide high resolution EM (Electron Microscopy) datasets that allow neuroscientists to reconstruct complex neural connections in a nervous system. However, due to the enormous size and complexity of the resulting data, segmentation and visualization of neural processes in EM data is usually a difficult and very time-consuming task. In this paper, we present NeuroTrace, a novel EM volume segmentation and visualization system that consists of two parts: a semi-automatic multiphase level set segmentation with 3D tracking for reconstruction of neural processes, and a specialized volume rendering approach for visualization of EM volumes. It employs view-dependent on-demand filtering and evaluation of a local histogram edge metric, as well as on-the-fly interpolation and ray-casting of implicit surfaces for segmented neural structures. Both methods are implemented on the GPU for interactive performance. NeuroTrace is designed to be scalable to large datasets and data-parallel hardware architectures. A comparison of NeuroTrace with a commonly used manual EM segmentation tool shows that our interactive workflow is faster and easier to use for the reconstruction of complex neural processes. PMID:19834227

  6. Preliminary Evaluation of a Diagnostic Tool for Prosthetics

    DTIC Science & Technology

    2017-10-01

    volume change. Processing algorithms for data from the activity monitors were modified to run more efficiently so that large datasets could be... [remainder of record truncated; figure-caption residue only: cylindrical (left) and blade style prostheses (right); Figure 4: correct ankle ActiGraph position demonstrated for a left leg below-knee amputee]

  7. Finding Intervals of Abrupt Change in Earth Science Data

    NASA Astrophysics Data System (ADS)

    Zhou, X.; Shekhar, S.; Liess, S.

    2011-12-01

    In earth science data (e.g., climate data), it is often observed that a persistently abrupt change in value occurs in a certain time-period or spatial interval. For example, abrupt climate change is defined as an unusually large shift of precipitation, temperature, etc, that occurs during a relatively short time period. A similar pattern can also be found in geographical space, representing a sharp transition of the environment (e.g., vegetation between different ecological zones). Identifying such intervals of change from earth science datasets is a crucial step for understanding and attributing the underlying phenomenon. However, inconsistencies in these noisy datasets can obstruct the major change trend, and more importantly can complicate the search of the beginning and end points of the interval of change. Also, the large volume of data makes it challenging to process the dataset reasonably fast. In this work, we analyze earth science data using a novel, automated data mining approach to identify spatial/temporal intervals of persistent, abrupt change. We first propose a statistical model to quantitatively evaluate the change abruptness and persistence in an interval. Then we design an algorithm to exhaustively examine all the intervals using this model. Intervals passing a threshold test will be kept as final results. We evaluate the proposed method with the Climate Research Unit (CRU) precipitation data, whereby we focus on the Sahel rainfall index. Results show that this method can find periods of persistent and abrupt value changes with different temporal scales. We also further optimize the algorithm using a smart strategy, which always examines longer intervals before its subsets. By doing this, we reduce the computational cost to only one third of that of the original algorithm for the above test case. More significantly, the optimized algorithm is also proven to scale up well with data volume and number of changes. Particularly, it achieves better performance when dealing with longer change intervals.
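
    A compact sketch of the exhaustive interval scan described above: score each candidate interval by the shift in mean across it relative to pooled variability, examining longer intervals first. The scoring model is an illustrative stand-in, not the authors' statistical model.

    ```python
    # Exhaustively score candidate change intervals in a 1D series (illustrative scoring only).
    import numpy as np

    def abrupt_intervals(x, min_len=3, max_len=20, z_thresh=3.0):
        hits = []
        for length in range(max_len, min_len - 1, -1):     # longer intervals examined first
            for start in range(1, len(x) - length - 1):
                before, after = x[:start], x[start + length:]
                pooled = np.sqrt((before.var() + after.var()) / 2) + 1e-9
                z = abs(after.mean() - before.mean()) / pooled
                if z > z_thresh:
                    hits.append((start, start + length, float(z)))
        return hits

    rng = np.random.default_rng(10)
    series = np.concatenate([rng.normal(0, 1, 60), np.linspace(0, 5, 8), rng.normal(5, 1, 60)])
    print(abrupt_intervals(series)[:3])
    ```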

  8. Measurement and genetics of human subcortical and hippocampal asymmetries in large datasets.

    PubMed

    Guadalupe, Tulio; Zwiers, Marcel P; Teumer, Alexander; Wittfeld, Katharina; Vasquez, Alejandro Arias; Hoogman, Martine; Hagoort, Peter; Fernandez, Guillen; Buitelaar, Jan; Hegenscheid, Katrin; Völzke, Henry; Franke, Barbara; Fisher, Simon E; Grabe, Hans J; Francks, Clyde

    2014-07-01

    Functional and anatomical asymmetries are prevalent features of the human brain, linked to gender, handedness, and cognition. However, little is known about the neurodevelopmental processes involved. In zebrafish, asymmetries arise in the diencephalon before extending within the central nervous system. We aimed to identify genes involved in the development of subtle, left-right volumetric asymmetries of human subcortical structures using large datasets. We first tested the feasibility of measuring left-right volume differences in such large-scale samples, as assessed by two automated methods of subcortical segmentation (FSL|FIRST and FreeSurfer), using data from 235 subjects who had undergone MRI twice. We tested the agreement between the first and second scan, and the agreement between the segmentation methods, for measures of bilateral volumes of six subcortical structures and the hippocampus, and their volumetric asymmetries. We also tested whether there were biases introduced by left-right differences in the regional atlases used by the methods, by analyzing left-right flipped images. While many bilateral volumes were measured well (scan-rescan r = 0.6-0.8), most asymmetries, with the exception of the caudate nucleus, showed lower repeatabilities. We meta-analyzed genome-wide association scan results for caudate nucleus asymmetry in a combined sample of 3,028 adult subjects but did not detect associations at genome-wide significance (P < 5 × 10(-8)). There was no enrichment of genetic association in genes involved in left-right patterning of the viscera. Our results provide important information for researchers who are currently aiming to carry out large-scale genome-wide studies of subcortical and hippocampal volumes, and their asymmetries. Copyright © 2013 Wiley Periodicals, Inc.
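
    The two basic quantities underlying this study can be sketched directly: a left-right asymmetry index per subject and the scan-rescan correlation used to judge repeatability (synthetic volumes; values are illustrative).

    ```python
    # Left-right asymmetry index and scan-rescan repeatability, sketched on synthetic volumes.
    import numpy as np

    rng = np.random.default_rng(11)
    left = rng.normal(3600, 300, size=235)                # e.g. caudate volumes, scan 1 (mm^3)
    right = left * rng.normal(1.02, 0.02, size=235)       # subtle rightward bias
    asym_index = (left - right) / ((left + right) / 2.0)  # AI = (L - R) / mean(L, R)

    rescan_asym = asym_index + rng.normal(0, 0.01, size=235)   # same measure from scan 2
    r = np.corrcoef(asym_index, rescan_asym)[0, 1]
    print(f"mean AI = {asym_index.mean():.4f}, scan-rescan r = {r:.2f}")
    ```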

  9. Modelling the standing timber volume of Baden-Württemberg-A large-scale approach using a fusion of Landsat, airborne LiDAR and National Forest Inventory data

    NASA Astrophysics Data System (ADS)

    Maack, Joachim; Lingenfelder, Marcus; Weinacker, Holger; Koch, Barbara

    2016-07-01

    Remote sensing-based timber volume estimation is key for modelling the regional potential, accessibility and price of lignocellulosic raw material for an emerging bioeconomy. We used a unique wall-to-wall airborne LiDAR dataset and Landsat 7 satellite images in combination with terrestrial inventory data derived from the National Forest Inventory (NFI), and applied generalized additive models (GAM) to estimate spatially explicit timber distribution and volume in forested areas. Since the NFI data showed an underlying structure regarding size and ownership, we additionally constructed a socio-economic predictor to enhance the accuracy of the analysis. Furthermore, we balanced the training dataset with a bootstrap method to achieve unbiased regression weights for interpolating timber volume. Finally, we compared and discussed the model performance of the original approach (r2 = 0.56, NRMSE = 9.65%), the approach with balanced training data (r2 = 0.69, NRMSE = 12.43%) and the final approach with balanced training data and the additional socio-economic predictor (r2 = 0.72, NRMSE = 12.17%). The results demonstrate the usefulness of remote sensing techniques for mapping timber volume for a future lignocellulose-based bioeconomy.

  10. Digital tissue and what it may reveal about the brain.

    PubMed

    Morgan, Josh L; Lichtman, Jeff W

    2017-10-30

    Imaging as a means of scientific data storage has evolved rapidly over the past century from hand drawings, to photography, to digital images. Only recently can sufficiently large datasets be acquired, stored, and processed such that tissue digitization can actually reveal more than direct observation of tissue. One field where this transformation is occurring is connectomics: the mapping of neural connections in large volumes of digitized brain tissue.

  11. Fuzzy hidden Markov chains segmentation for volume determination and quantitation in PET.

    PubMed

    Hatt, M; Lamare, F; Boussion, N; Turzo, A; Collet, C; Salzenstein, F; Roux, C; Jarritt, P; Carson, K; Cheze-Le Rest, C; Visvikis, D

    2007-06-21

    Accurate volume of interest (VOI) estimation in PET is crucial in different oncology applications such as response to therapy evaluation and radiotherapy treatment planning. The objective of our study was to evaluate the performance of the proposed algorithm for automatic lesion volume delineation, namely fuzzy hidden Markov chains (FHMC), with that of the threshold-based techniques that are the current state of the art in clinical practice. Like the classical hidden Markov chain (HMC) algorithm, FHMC takes into account noise, voxel intensity and spatial correlation, in order to classify a voxel as background or functional VOI. However, the novelty of the fuzzy model consists of the inclusion of an estimation of imprecision, which should subsequently lead to a better modelling of the 'fuzzy' nature of the object of interest boundaries in emission tomography data. The performance of the algorithms has been assessed on both simulated and acquired datasets of the IEC phantom, covering a large range of spherical lesion sizes (from 10 to 37 mm), contrast ratios (4:1 and 8:1) and image noise levels. Both lesion activity recovery and VOI determination tasks were assessed in reconstructed images using two different voxel sizes (8 mm3 and 64 mm3). In order to account for both the functional volume location and its size, the concept of % classification errors was introduced in the evaluation of volume segmentation using the simulated datasets. Results reveal that FHMC performs substantially better than the threshold based methodology for functional volume determination or activity concentration recovery considering a contrast ratio of 4:1 and lesion sizes of <28 mm. Furthermore, the differences between classification and volume estimation errors evaluated were smaller for the segmented volumes provided by the FHMC algorithm. Finally, the performance of the automatic algorithms was less susceptible to image noise levels in comparison to the threshold based techniques. The analysis of both simulated and acquired datasets led to similar results and conclusions as far as the performance of segmentation algorithms under evaluation is concerned.
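
    The clinical baseline that FHMC is compared against is a fixed-percentage-of-maximum threshold; a sketch follows. The 42% figure is a commonly used value and an assumption here, not taken from the paper.

    ```python
    # Threshold-based PET VOI baseline: keep voxels above a fixed percentage of the lesion maximum.
    # The 42% value is a common choice and an assumption here.
    import numpy as np

    def percent_max_voi(pet, percent=42.0):
        return pet >= (percent / 100.0) * pet.max()

    pet = np.random.default_rng(12).gamma(2.0, 0.25, size=(64, 64, 64))   # toy uptake values
    pet[28:36, 28:36, 28:36] += 8.0                                       # synthetic hot lesion
    voi = percent_max_voi(pet)
    print("VOI volume (voxels):", int(voi.sum()))
    ```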

  12. Interacting with Petabytes of Earth Science Data using Jupyter Notebooks, IPython Widgets and Google Earth Engine

    NASA Astrophysics Data System (ADS)

    Erickson, T. A.; Granger, B.; Grout, J.; Corlay, S.

    2017-12-01

    The volume of Earth science data gathered from satellites, aircraft, drones, and field instruments continues to increase. For many scientific questions in the Earth sciences, managing this large volume of data is a barrier to progress, as it is difficult to explore and analyze large volumes of data using the traditional paradigm of downloading datasets to a local computer for analysis. Furthermore, methods for communicating Earth science algorithms that operate on large datasets in an easily understandable and reproducible way are needed. Here we describe a system for developing, interacting, and sharing well-documented Earth Science algorithms that combines existing software components: Jupyter Notebook: An open-source, web-based environment that supports documents that combine code and computational results with text narrative, mathematics, images, and other media. These notebooks provide an environment for interactive exploration of data and development of well documented algorithms. Jupyter Widgets / ipyleaflet: An architecture for creating interactive user interface controls (such as sliders, text boxes, etc.) in Jupyter Notebooks that communicate with Python code. This architecture includes a default set of UI controls (sliders, dropboxes, etc.) as well as APIs for building custom UI controls. The ipyleaflet project is one example that offers a custom interactive map control that allows a user to display and manipulate geographic data within the Jupyter Notebook. Google Earth Engine: A cloud-based geospatial analysis platform that provides access to petabytes of Earth science data via a Python API. The combination of Jupyter Notebooks, Jupyter Widgets, ipyleaflet, and Google Earth Engine makes it possible to explore and analyze massive Earth science datasets via a web browser, in an environment suitable for interactive exploration, teaching, and sharing. Using these environments can make Earth science analyses easier to understand and reproducible, which may increase the rate of scientific discoveries and the transition of discoveries into real-world impacts.
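
    A minimal notebook sketch of the widget-to-map wiring described above (run in Jupyter with ipywidgets and ipyleaflet installed); the Earth Engine layer update is indicated only as a comment, since initializing the ee client requires credentials.

    ```python
    # Minimal notebook sketch: an ipywidgets slider wired to an ipyleaflet map.
    import ipywidgets as widgets
    from ipyleaflet import Map
    from IPython.display import display

    m = Map(center=(0.0, 0.0), zoom=3)
    year = widgets.IntSlider(description="Year", min=2000, max=2020, value=2010)
    status = widgets.Label()

    def on_year_change(change):
        # In a full analysis this callback would swap in a new (e.g. Earth Engine) tile layer
        # for the selected year; here it only reports the selection.
        status.value = f"Showing layer for {change['new']}"

    year.observe(on_year_change, names="value")
    display(widgets.VBox([year, status, m]))
    ```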

  13. Reference data on muscle volumes of healthy human pelvis and lower extremity muscles: an in vivo magnetic resonance imaging feasibility study.

    PubMed

    Lube, Juliane; Cotofana, Sebastian; Bechmann, Ingo; Milani, Thomas L; Özkurtul, Orkun; Sakai, Tatsuo; Steinke, Hanno; Hammer, Niels

    2016-01-01

    Muscle volumes are of crucial interest when attempting to analyze individual physical performance and disease- or age-related alterations in muscle morphology. However, very little reference data are available in the literature on pelvis and lower extremity muscle volumes originating from healthy and young individuals. Furthermore, it is of interest if representative muscle volumes, covering large anatomical regions, can be obtained using magnetic resonance imaging (MRI) in a setting similar to the clinical routine. Our objective was therefore to provide encompassing, bilateral, 3-T MRI-based datasets on muscle volumes of the pelvis and the lower limb muscles. T1-weighted 3-T MRI records were obtained bilaterally from six young and healthy participants. Three-dimensional volumes were compiled from 28 muscles and muscle groups of each participant before the muscle volumes were computed. Muscle volumes were obtained from 28 muscles and muscle groups of the pelvis and lower extremity. Volumes were larger in male than in female participants. Volumes of the dominant and non-dominant sides were similar in both genders. The obtained results were in line with volumetric data obtained from smaller anatomical areas, thus extending the available datasets. This study provides an encompassing and feasible approach to obtain data on the muscle volumes of pelvic and limb muscles of healthy, young, and physically active individuals. The respective data form a basis to determine effects of therapeutic approaches, progression of diseases, or technical applications like automated segmentation algorithms applied to different populations.
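
    Given a segmented MRI label map, the reported muscle volumes reduce to counting voxels per label and multiplying by the voxel volume; a sketch with placeholder label IDs and spacing:

    ```python
    # Muscle volumes from a segmented label map (placeholder labels and voxel spacing).
    import numpy as np

    labels = np.random.default_rng(13).integers(0, 29, size=(128, 128, 80))  # 0 = background, 1-28 = muscles
    voxel_volume_cm3 = 0.1 * 0.1 * 0.3          # 1.0 x 1.0 x 3.0 mm spacing, expressed in cm^3

    ids, counts = np.unique(labels, return_counts=True)
    volumes = {int(i): float(c * voxel_volume_cm3) for i, c in zip(ids, counts) if i != 0}
    print(f"{len(volumes)} muscles; e.g. label 1 = {volumes[1]:.1f} cm^3")
    ```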

  14. Remote volume rendering pipeline for mHealth applications

    NASA Astrophysics Data System (ADS)

    Gutenko, Ievgeniia; Petkov, Kaloian; Papadopoulos, Charilaos; Zhao, Xin; Park, Ji Hwan; Kaufman, Arie; Cha, Ronald

    2014-03-01

    We introduce a novel remote volume rendering pipeline for medical visualization targeted for mHealth (mobile health) applications. The necessity of such a pipeline stems from the large size of the medical imaging data produced by current CT and MRI scanners with respect to the complexity of the volumetric rendering algorithms. For example, the resolution of typical CT Angiography (CTA) data easily reaches 512^3 voxels and can exceed 6 gigabytes in size by spanning over the time domain while capturing a beating heart. This explosion in data size makes data transfers to mobile devices challenging, and even when the transfer problem is resolved the rendering performance of the device still remains a bottleneck. To deal with this issue, we propose a thin-client architecture, where the entirety of the data resides on a remote server where the image is rendered and then streamed to the client mobile device. We utilize the display and interaction capabilities of the mobile device, while performing interactive volume rendering on a server capable of handling large datasets. Specifically, upon user interaction the volume is rendered on the server and encoded into an H.264 video stream. H.264 is ubiquitously hardware accelerated, resulting in faster compression and lower power requirements. The choice of low-latency CPU- and GPU-based encoders is particularly important in enabling the interactive nature of our system. We demonstrate a prototype of our framework using various medical datasets on commodity tablet devices.

  15. Description of the U.S. Geological Survey Geo Data Portal data integration framework

    USGS Publications Warehouse

    Blodgett, David L.; Booth, Nathaniel L.; Kunicki, Thomas C.; Walker, Jordan I.; Lucido, Jessica M.

    2012-01-01

    The U.S. Geological Survey has developed an open-standard data integration framework for working efficiently and effectively with large collections of climate and other geoscience data. A web interface accesses catalog datasets to find data services. Data resources can then be rendered for mapping and dataset metadata are derived directly from these web services. Algorithm configuration and information needed to retrieve data for processing are passed to a server where all large-volume data access and manipulation takes place. The data integration strategy described here was implemented by leveraging existing free and open source software. Details of the software used are omitted; rather, emphasis is placed on how open-standard web services and data encodings can be used in an architecture that integrates common geographic and atmospheric data.

  16. COMPARISON OF VOLUMETRIC REGISTRATION ALGORITHMS FOR TENSOR-BASED MORPHOMETRY

    PubMed Central

    Villalon, Julio; Joshi, Anand A.; Toga, Arthur W.; Thompson, Paul M.

    2015-01-01

    Nonlinear registration of brain MRI scans is often used to quantify morphological differences associated with disease or genetic factors. Recently, surface-guided fully 3D volumetric registrations have been developed that combine intensity-guided volume registrations with cortical surface constraints. In this paper, we compare one such algorithm to two popular high-dimensional volumetric registration methods: large-deformation viscous fluid registration, formulated in a Riemannian framework, and the diffeomorphic “Demons” algorithm. We performed an objective morphometric comparison by using a large MRI dataset from 340 young adult twin subjects to examine 3D patterns of correlations in anatomical volumes. Surface-constrained volume registration gave greater effect sizes for detecting morphometric associations near the cortex, while the other two approaches gave greater effect sizes subcortically. These findings suggest novel ways to combine the advantages of multiple methods in the future. PMID:26925198

  17. GPU-based multi-volume ray casting within VTK for medical applications.

    PubMed

    Bozorgi, Mohammadmehdi; Lindseth, Frank

    2015-03-01

    Multi-volume visualization is important for displaying relevant information in multimodal or multitemporal medical imaging studies. The main objective of the current study was to develop an efficient GPU-based multi-volume ray caster (MVRC) and validate the proposed visualization system in the context of image-guided surgical navigation. Ray casting can produce high-quality 2D images from 3D volume data, but the method is computationally demanding, especially when multiple volumes are involved, so a parallel GPU version has been implemented. In the proposed MVRC, imaginary rays are sent through the volumes (one ray for each pixel in the view), and at equal and short intervals along the rays, samples are collected from each volume. Samples from all the volumes are composited using front-to-back α-blending. Since all the rays can be processed simultaneously, the MVRC was implemented in parallel on the GPU to achieve acceptable interactive frame rates. The method is fully integrated within the visualization toolkit (VTK) pipeline with the ability to apply different operations (e.g., transformations, clipping, and cropping) on each volume separately. The implemented method is cross-platform (Windows, Linux and Mac OSX) and runs on different graphics cards (NVidia and AMD). The speed of the MVRC was tested with one to five volumes of varying sizes: 128^3, 256^3, and 512^3 voxels. A Tesla C2070 GPU was used, and the output image size was 600 × 600 pixels. The original VTK single-volume ray caster and the MVRC were compared when rendering only one volume. The multi-volume rendering system achieved an interactive frame rate (> 15 fps) when rendering five small volumes (128^3 voxels), four medium-sized volumes (256^3 voxels), and two large volumes (512^3 voxels). When rendering single volumes, the frame rate of the MVRC was comparable to the original VTK ray caster for small and medium-sized datasets but was approximately 3 frames per second slower for large datasets. The MVRC was successfully integrated in an existing surgical navigation system and was shown to be clinically useful during an ultrasound-guided neurosurgical tumor resection. A GPU-based MVRC for VTK is a useful tool in medical visualization. The proposed multi-volume GPU-based ray caster for VTK provided high-quality images at reasonable frame rates. The MVRC was effective when used in a neurosurgical navigation application.
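    The core compositing step described above, samples from several co-registered volumes blended front to back along each ray, can be illustrated with a small NumPy sketch. The sample colors, opacities, and two-volume interleaving below are hypothetical stand-ins for the GPU kernel described in the record.

```python
import numpy as np

def composite_ray(samples):
    """Front-to-back alpha compositing of (rgb, alpha) samples along one ray.

    samples: iterable of (rgb, alpha) pairs ordered front to back,
             where rgb is a length-3 sequence and alpha is in [0, 1].
    """
    color = np.zeros(3)
    alpha = 0.0
    for rgb, a in samples:
        # This sample contributes in proportion to the remaining transparency.
        color += (1.0 - alpha) * a * np.asarray(rgb, dtype=float)
        alpha += (1.0 - alpha) * a
        if alpha > 0.99:  # early ray termination
            break
    return color, alpha

# Hypothetical samples drawn at the same ray positions from two co-registered volumes,
# already mapped through per-volume transfer functions and interleaved front to back.
ray_samples = [([0.8, 0.2, 0.2], 0.1),   # sample from volume 1
               ([0.2, 0.2, 0.9], 0.3),   # sample from volume 2
               ([0.8, 0.2, 0.2], 0.2)]   # sample from volume 1
print(composite_ray(ray_samples))
```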

  18. Lessons learned in the generation of biomedical research datasets using Semantic Open Data technologies.

    PubMed

    Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2015-01-01

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity makes difficult not only the generation of research-oriented datasets but also their exploitation. In recent years, the Open Data paradigm has proposed new ways of making data available so that sharing and integration are facilitated. Open Data approaches may pursue the generation of content readable only by humans or by both humans and machines; the latter is the case of interest in our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to Berners-Lee's classification, each open dataset can be given a rating between one and five stars depending on its openness, format, and degree of linkage to other datasets. In recent years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four-star datasets, given that the fifth star can only be obtained once the dataset is linked from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.
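    As a rough illustration of what a four-star (RDF, but not yet externally linked) dataset looks like, the sketch below builds a few triples with rdflib. The namespace, resource names, and properties are invented for the example and are not SWIT output.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import FOAF, XSD

# Hypothetical namespace for an illustrative health-record dataset.
EX = Namespace("http://example.org/healthcare/")

g = Graph()
g.bind("ex", EX)

patient = URIRef(EX["patient/001"])
g.add((patient, RDF.type, EX.Patient))
g.add((patient, FOAF.name, Literal("Anonymous patient")))
g.add((patient, EX.birthYear, Literal(1975, datatype=XSD.gYear)))

# Turtle output: machine-readable, non-proprietary, URI-identified (four stars);
# the fifth star would additionally require links to resources in external datasets.
print(g.serialize(format="turtle"))
```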

  19. Effective 2D-3D medical image registration using Support Vector Machine.

    PubMed

    Qi, Wenyuan; Gu, Lixu; Zhao, Qiang

    2008-01-01

    Registration of a pre-operative 3D volume dataset with intra-operative 2D images is gradually becoming an important technique to assist radiologists in diagnosing complicated diseases easily and quickly. In this paper, we propose a novel 2D/3D registration framework based on the Support Vector Machine (SVM) to avoid the disadvantage of generating a large number of digitally reconstructed radiograph (DRR) images intra-operatively. An estimated similarity-metric distribution is built from the relationship between transform parameters and sparse prior target metric values by means of support vector regression (SVR). Based on this estimate, globally optimal transform parameters are then searched by an optimizer in order to guide the 3D volume dataset to match the intra-operative 2D image. Experiments reveal that our proposed registration method improves performance compared to a conventional registration method and provides precise registration results efficiently.
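    A minimal sketch of the idea follows: learn a surrogate of the similarity metric over transform parameters with SVR, then optimize the surrogate. It assumes a 3-parameter rigid 2D transform and a synthetic metric in place of real DRR comparisons, so it only mirrors the structure of the method, not its implementation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical "ground truth": the similarity metric peaks at tx=2, ty=-1, rot=5 deg.
def true_metric(p):
    return -np.sum((p - np.array([2.0, -1.0, 5.0])) ** 2)

# Sparse prior samples of (transform parameters, metric value), as in the training stage.
params = rng.uniform(-10, 10, size=(200, 3))
values = np.array([true_metric(p) for p in params])

# Learn the estimated similarity-metric distribution with support vector regression.
svr = SVR(kernel="rbf", C=100.0, gamma=0.1).fit(params, values)

# Search for globally optimal transform parameters on the learned surrogate
# (several random restarts of a local optimizer stand in for the paper's optimizer).
best = max(
    (minimize(lambda p: -svr.predict(p.reshape(1, -1))[0], x0, method="Nelder-Mead")
     for x0 in rng.uniform(-10, 10, size=(5, 3))),
    key=lambda res: -res.fun,
)
print("estimated optimal transform:", best.x)
```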

  20. Curious or spurious correlations within a national-scale forest inventory?

    Treesearch

    Christopher W. Woodall; James A. Westfall

    2012-01-01

    Foresters are increasingly required to assess trends not only in traditional forest attributes (e.g., growing-stock volumes), but also across suites of forest health indicators and site/climate variables. Given the tenuous relationship between correlation and causality within extremely large datasets, the goal of this study was to use a nationwide annual forest...

  1. Big Data Challenges Indexing Large-Volume, Heterogeneous EO Datasets for Effective Data Discovery

    NASA Astrophysics Data System (ADS)

    Waterfall, Alison; Bennett, Victoria; Donegan, Steve; Juckes, Martin; Kershaw, Phil; Petrie, Ruth; Stephens, Ag; Wilson, Antony

    2016-08-01

    This paper describes the importance and challenges faced in making Earth Observation datasets discoverable and accessible by the widest possible user base. Concentrating on data discovery, it details work that is being undertaken by the Centre for Environmental Data Analysis (CEDA), to ensure that the datasets held within its archive are discoverable and searchable. One aspect of this is in indexing the data using controlled vocabularies, based on a Simple Knowledge Organization System (SKOS) ontology, and hosted in a vocabulary server, to ensure that a consistent understanding and approach to a faceted search of the data can be achieved via a variety of different routes. This approach will be illustrated using the example of the development of the ESA CCI Open Data Portal.

  2. Machine learning techniques for diabetic macular edema (DME) classification on SD-OCT images.

    PubMed

    Alsaih, Khaled; Lemaitre, Guillaume; Rastgoo, Mojdeh; Massich, Joan; Sidibé, Désiré; Meriaudeau, Fabrice

    2017-06-07

    Spectral-domain optical coherence tomography (SD-OCT) is the imaging equipment most widely used in ophthalmology to detect diabetic macular edema (DME). Indeed, it offers an accurate visualization of the morphology of the retina as well as of the retinal layers. The dataset used in this study was acquired by the Singapore Eye Research Institute (SERI) using a CIRRUS TM (Carl Zeiss Meditec, Inc., Dublin, CA, USA) SD-OCT device. The dataset consists of 32 OCT volumes (16 DME and 16 normal cases). Each volume contains 128 B-scans with a resolution of 1024 px × 512 px, resulting in more than 3800 images being processed. All SD-OCT volumes were read and assessed by trained graders and identified as normal or DME cases based on evaluation of retinal thickening, hard exudates, intraretinal cystoid space formation, and subretinal fluid. Within the DME subset, a large number of lesions was selected to create a rather complete and diverse DME dataset. This paper presents an automatic classification framework for SD-OCT volumes in order to identify DME versus normal volumes. In this regard, a generic pipeline including pre-processing, feature detection, feature representation, and classification was investigated. More precisely, extraction of histogram of oriented gradients (HOG) and local binary pattern (LBP) features within a multiresolution approach is used, as well as principal component analysis (PCA) and bag of words (BoW) representations. Besides comparing individual and combined features, different representation approaches and different classifiers are evaluated. The best results are obtained for LBP vectors when represented and classified using PCA and a linear support vector machine (SVM), leading to a sensitivity (SE) and specificity (SP) of 87.5% and 87.5%, respectively.
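    The best-performing combination reported here (LBP features, PCA representation, linear SVM) can be prototyped along the lines below. The image data, labels, and LBP parameters are placeholders, not the SERI dataset or the authors' exact settings.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def lbp_histogram(image, radius=1, n_points=8):
    """Uniform LBP histogram of a single B-scan (2-D array)."""
    codes = local_binary_pattern(image, n_points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=n_points + 2,
                           range=(0, n_points + 2), density=True)
    return hist

# Placeholder data: random "B-scans" and volume-level labels (0 = normal, 1 = DME).
rng = np.random.default_rng(0)
scans = rng.random((64, 128, 128))
labels = rng.integers(0, 2, size=64)

features = np.array([lbp_histogram(s) for s in scans])

# PCA representation followed by a linear SVM classifier.
clf = make_pipeline(StandardScaler(), PCA(n_components=5), LinearSVC())
print("CV accuracy:", cross_val_score(clf, features, labels, cv=5).mean())
```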

  3. EMERALD: Coping with the Explosion of Seismic Data

    NASA Astrophysics Data System (ADS)

    West, J. D.; Fouch, M. J.; Arrowsmith, R.

    2009-12-01

    The geosciences are currently generating an unparalleled quantity of new public broadband seismic data with the establishment of large-scale seismic arrays such as the EarthScope USArray, which are enabling new and transformative scientific discoveries of the structure and dynamics of the Earth’s interior. Much of this explosion of data is a direct result of the formation of the IRIS consortium, which has enabled an unparalleled level of open exchange of seismic instrumentation, data, and methods. The production of these massive volumes of data has generated new and serious data management challenges for the seismological community. A significant challenge is the maintenance and updating of seismic metadata, which includes information such as station location, sensor orientation, instrument response, and clock timing data. This key information changes at unknown intervals, and the changes are not generally communicated to data users who have already downloaded and processed data. Another basic challenge is the ability to handle massive seismic datasets when waveform file volumes exceed the fundamental limitations of a computer’s operating system. A third, long-standing challenge is the difficulty of exchanging seismic processing codes between researchers; each scientist typically develops his or her own unique directory structure and file naming convention, requiring that codes developed by another researcher be rewritten before they can be used. To address these challenges, we are developing EMERALD (Explore, Manage, Edit, Reduce, & Analyze Large Datasets). The overarching goal of the EMERALD project is to enable more efficient and effective use of seismic datasets ranging from just a few hundred to millions of waveforms with a complete database-driven system, leading to higher quality seismic datasets for scientific analysis and enabling faster, more efficient scientific research. We will present a preliminary (beta) version of EMERALD, an integrated, extensible, standalone database server system based on the open-source PostgreSQL database engine. The system is designed for fast and easy processing of seismic datasets, and provides the necessary tools to manage very large datasets and all associated metadata. EMERALD provides methods for efficient preprocessing of seismic records; large record sets can be easily and quickly searched, reviewed, revised, reprocessed, and exported. EMERALD can retrieve and store station metadata and alert the user to metadata changes. The system provides many methods for visualizing data, analyzing dataset statistics, and tracking the processing history of individual datasets. EMERALD allows development and sharing of visualization and processing methods using any of 12 programming languages. EMERALD is designed to integrate existing software tools; the system provides wrapper functionality for existing widely-used programs such as GMT, SOD, and TauP. Users can interact with EMERALD via a web browser interface, or they can directly access their data from a variety of database-enabled external tools. Data can be imported and exported from the system in a variety of file formats, or can be directly requested and downloaded from the IRIS DMC from within EMERALD.

  4. SU-E-J-32: Dosimetric Evaluation Based On Pre-Treatment Cone Beam CT for Spine Stereotactic Body Radiotherapy: Does Region of Interest Focus Matter?

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magnelli, A; Xia, P

    2015-06-15

    Purpose: Spine stereotactic body radiotherapy requires very conformal dose distributions and precise delivery. Prior to treatment, a KV cone-beam CT (KV-CBCT) is registered to the planning CT to provide image-guided positional corrections, which depend on selection of the region of interest (ROI) because of imperfect patient positioning and anatomical deformation. Our objective is to determine the dosimetric impact of ROI selections. Methods: Twelve patients were selected for this study with the treatment regions varied from C-spine to T-spine. For each patient, the KV-CBCT was registered to the planning CT three times using distinct ROIs: one encompassing the entire patient, a large ROI containing large bony anatomy, and a small target-focused ROI. Each registered CBCT volume, saved as an aligned dataset, was then sent to the planning system. The treated plan was applied to each dataset and dose was recalculated. The tumor dose coverage (percentage of target volume receiving prescription dose), maximum point dose to 0.03 cc of the spinal cord, and dose to 10% of the spinal cord volume (V10) for each alignment were compared to the original plan. Results: The average magnitude of tumor coverage deviation was 3.9%±5.8% with external contour, 1.5%±1.1% with large ROI, 1.3%±1.1% with small ROI. Spinal cord V10 deviation from plan was 6.6%±6.6% with external contour, 3.5%±3.1% with large ROI, and 1.2%±1.0% with small ROI. Spinal cord max point dose deviation from plan was: 12.2%±13.3% with external contour, 8.5%±8.4% with large ROI, and 3.7%±2.8% with small ROI. Conclusion: A small ROI focused on the target results in the smallest deviation from planned dose to target and cord although rotations at large distances from the targets were observed. It is recommended that image fusion during CBCT focus narrowly on the target volume to minimize dosimetric error. Improvement in patient setups may further reduce residual errors.

  5. Evaluation of the Soil Conservation Service curve number methodology using data from agricultural plots

    NASA Astrophysics Data System (ADS)

    Lal, Mohan; Mishra, S. K.; Pandey, Ashish; Pandey, R. P.; Meena, P. K.; Chaudhary, Anubhav; Jha, Ranjit Kumar; Shreevastava, Ajit Kumar; Kumar, Yogendra

    2017-01-01

    The Soil Conservation Service curve number (SCS-CN) method, also known as the Natural Resources Conservation Service curve number (NRCS-CN) method, is popular for computing the volume of direct surface runoff for a given rainfall event. The performance of the SCS-CN method, based on large rainfall (P) and runoff (Q) datasets of United States watersheds, is evaluated using a large dataset of natural storm events from 27 agricultural plots in India. On the whole, the CN estimates from the National Engineering Handbook (chapter 4) tables do not match those derived from the observed P and Q datasets. As a result, the runoff prediction using the former CNs was poor for the data of 22 (out of 24) plots. However, the match was a little better for higher CN values, consistent with the general notion that the existing SCS-CN method performs better for high rainfall-runoff (high CN) events. Infiltration capacity (fc) was the main explanatory variable for runoff (or CN) production in the study plots, as it exhibited the expected inverse relationship between CN and fc. The plot-data optimization yielded initial abstraction coefficient (λ) values from 0 to 0.659 for the ordered dataset and 0 to 0.208 for the natural dataset (with 0 as the most frequent value). Mean and median λ values were, respectively, 0.030 and 0 for the natural rainfall-runoff dataset and 0.108 and 0 for the ordered rainfall-runoff dataset. Runoff estimation was very sensitive to λ and improved consistently as λ changed from 0.2 to 0.03.
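    For reference, the standard SCS-CN relationships used in this kind of evaluation fit in a few lines. The rainfall and CN values at the end are made up, and λ is exposed as a parameter so the 0.2 versus 0.03 comparison discussed above can be reproduced.

```python
def scs_cn_runoff(p_mm, cn, lam=0.2):
    """Direct surface runoff Q (mm) from event rainfall P (mm) by the SCS-CN method.

    S  = 25400 / CN - 254          (potential maximum retention, mm)
    Ia = lam * S                   (initial abstraction)
    Q  = (P - Ia)^2 / (P - Ia + S)  for P > Ia, else 0
    """
    s = 25400.0 / cn - 254.0
    ia = lam * s
    if p_mm <= ia:
        return 0.0
    return (p_mm - ia) ** 2 / (p_mm - ia + s)

# Hypothetical 60 mm storm on a plot with CN = 75, using the handbook lambda (0.2)
# and the optimized value reported above for the natural dataset (0.03).
for lam in (0.2, 0.03):
    print(f"lambda = {lam}: Q = {scs_cn_runoff(60.0, 75, lam):.1f} mm")
```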

  6. Estimating the volume of Alpine glacial lakes

    NASA Astrophysics Data System (ADS)

    Cook, S. J.; Quincey, D. J.

    2015-09-01

    Supraglacial, moraine-dammed and ice-dammed lakes represent a potential glacial lake outburst flood (GLOF) threat to downstream communities in many mountain regions. This has motivated the development of empirical relationships to predict lake volume given a measurement of lake surface area obtained from satellite imagery. Such relationships are based on the notion that lake depth, area and volume scale predictably. We critically evaluate the performance of these existing empirical relationships by examining a global database of measured glacial lake depths, areas and volumes. Results show that lake area and depth are not always well correlated (r2 = 0.38), and that although lake volume and area are well correlated (r2 = 0.91), there are distinct outliers in the dataset. These outliers represent situations where it may not be appropriate to apply existing empirical relationships to predict lake volume, and include growing supraglacial lakes, glaciers that recede into basins with complex overdeepened morphologies or that have been deepened by intense erosion, and lakes formed where glaciers advance across and block a main trunk valley. We use the compiled dataset to develop a conceptual model of how the volumes of supraglacial ponds and lakes, moraine-dammed lakes and ice-dammed lakes should be expected to evolve with increasing area. Although a large amount of bathymetric data exist for moraine-dammed and ice-dammed lakes, we suggest that further measurements of growing supraglacial ponds and lakes are needed to better understand their development.
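    Empirical volume-area relationships of the kind evaluated here are usually power laws, V = k·A^b, fitted in log-log space. The sketch below fits such a relationship to a small synthetic table of lake areas and volumes, which stands in for the global database used in the study; the coefficients it returns are therefore illustrative only.

```python
import numpy as np

# Synthetic (area in km^2, volume in 10^6 m^3) pairs standing in for measured glacial lakes.
area = np.array([0.05, 0.12, 0.30, 0.75, 1.60, 3.20])
volume = np.array([0.9, 2.8, 9.5, 33.0, 95.0, 260.0])

# Fit log10(V) = log10(k) + b * log10(A), i.e. V = k * A**b.
b, log_k = np.polyfit(np.log10(area), np.log10(volume), 1)
k = 10 ** log_k
print(f"V ~ {k:.1f} * A^{b:.2f}")

# Predicted volume for a 0.5 km^2 lake; the outliers discussed above show
# where this kind of extrapolation breaks down.
print("predicted V:", k * 0.5 ** b)
```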

  7. The Derivation of Fault Volumetric Properties from 3D Trace Maps Using Outcrop Constrained Discrete Fracture Network Models

    NASA Astrophysics Data System (ADS)

    Hodgetts, David; Seers, Thomas

    2015-04-01

    Fault systems are important structural elements within many petroleum reservoirs, acting as potential conduits, baffles or barriers to hydrocarbon migration. Large, seismic-scale faults often serve as reservoir bounding seals, forming structural traps which have proved to be prolific plays in many petroleum provinces. Though inconspicuous within most seismic datasets, smaller subsidiary faults, commonly within the damage zones of parent structures, may also play an important role. These smaller faults typically form narrow, tabular low-permeability zones which serve to compartmentalize the reservoir, negatively impacting upon hydrocarbon recovery. Though the advent of 3D seismic surveys has brought considerable improvements in the visualization of reservoir-scale fault systems, the occlusion of smaller-scale faults in such datasets is a source of significant uncertainty during prospect evaluation. The limited capacity of conventional subsurface datasets to probe the spatial distribution of these smaller-scale faults has given rise to a large number of outcrop-based studies, allowing their intensity, connectivity and size distributions to be explored in detail. Whilst these studies have yielded an improved theoretical understanding of the style and distribution of sub-seismic-scale faults, the ability to transform observations from outcrop to quantities that are relatable to reservoir volumes remains elusive. These issues arise from the fact that outcrops essentially offer a pseudo-3D window into the rock volume, making the extrapolation of surficial fault properties such as areal density (fracture length per unit area: P21) to equivalent volumetric measures (i.e. fracture area per unit volume: P32) applicable to fracture modelling extremely challenging. Here, we demonstrate an approach which harnesses advances in the extraction of 3D trace maps from surface reconstructions using calibrated image sequences, in combination with a novel semi-deterministic, outcrop-constrained discrete fracture network modelling code, to derive volumetric fault intensity measures (fault area per unit volume / fault volume per unit volume). Producing per-vertex measures of volumetric intensity, our method captures the spatial variability in 3D fault density across a surveyed outcrop, enabling first-order controls to be probed. We demonstrate our approach on pervasively faulted exposures of a Permian-aged reservoir analogue from the Vale of Eden Basin, UK.

  8. A Multi-Cohort Study of ApoE ɛ4 and Amyloid-β Effects on the Hippocampus in Alzheimer’s Disease

    PubMed Central

    Khan, Wasim; Giampietro, Vincent; Banaschewski, Tobias; Barker, Gareth J.; Bokde, Arun L.W.; Büchel, Christian; Conrod, Patricia; Flor, Herta; Frouin, Vincent; Garavan, Hugh; Gowland, Penny; Heinz, Anreas; Ittermann, Bernd; Lemaître, Hervé; Nees, Frauke; Paus, Tomas; Pausova, Zdenka; Rietschel, Marcella; Smolka, Michael N.; Ströhle, Andreas; Gallinat, Jeurgen; Vellas, Bruno; Soininen, Hilkka; Kloszewska, Iwona; Tsolaki, Magda; Mecocci, Patrizia; Spenger, Christian; Villemagne, Victor L.; Masters, Colin L.; Muehlboeck, J-Sebastian; Bäckman, Lars; Fratiglioni, Laura; Kalpouzos, Grégoria; Wahlund, Lars-Olof; Schumann, Gunther; Lovestone, Simon; Williams, Steven C.R.; Westman, Eric; Simmons, Andrew

    2017-01-01

    The apolipoprotein E (APOE) gene has been consistently shown to modulate the risk of Alzheimer’s disease (AD). Here, using an AD and normal aging dataset primarily consisting of three AD multi-center studies (n = 1,781), we compared the effect of APOE and amyloid-β (Aβ) on baseline hippocampal volumes in AD patients, mild cognitive impairment (MCI) subjects, and healthy controls. A large sample of healthy adolescents (n = 1,387) was also used to compare hippocampal volumes between APOE groups. Subjects had undergone a magnetic resonance imaging (MRI) scan and APOE genotyping. Hippocampal volumes were processed using FreeSurfer. In the AD and normal aging dataset, hippocampal comparisons were performed in each APOE group and in ɛ4 carriers with positron emission tomography (PET) Aβ who were dichotomized (Aβ+/Aβ–) using previous cut-offs. We found a linear reduction in hippocampal volumes with ɛ4 carriers possessing the smallest volumes, ɛ3 carriers possessing intermediate volumes, and ɛ2 carriers possessing the largest volumes. Moreover, AD and MCI ɛ4 carriers possessed the smallest hippocampal volumes and control ɛ2 carriers possessed the largest hippocampal volumes. Subjects with both APOE ɛ4 and Aβ positivity had the lowest hippocampal volumes when compared to Aβ- ɛ4 carriers, suggesting a synergistic relationship between APOE ɛ4 and Aβ. However, we found no hippocampal volume differences between APOE groups in healthy 14-year-old adolescents. Our findings suggest that the strongest neuroanatomic effect of APOE ɛ4 on the hippocampus is observed in AD and groups most at risk of developing the disease, whereas hippocampi of old and young healthy individuals remain unaffected. PMID:28157104

  9. Open and scalable analytics of large Earth observation datasets: From scenes to multidimensional arrays using SciDB and GDAL

    NASA Astrophysics Data System (ADS)

    Appel, Marius; Lahn, Florian; Buytaert, Wouter; Pebesma, Edzer

    2018-04-01

    Earth observation (EO) datasets are commonly provided as collection of scenes, where individual scenes represent a temporal snapshot and cover a particular region on the Earth's surface. Using these data in complex spatiotemporal modeling becomes difficult as soon as data volumes exceed a certain capacity or analyses include many scenes, which may spatially overlap and may have been recorded at different dates. In order to facilitate analytics on large EO datasets, we combine and extend the geospatial data abstraction library (GDAL) and the array-based data management and analytics system SciDB. We present an approach to automatically convert collections of scenes to multidimensional arrays and use SciDB to scale computationally intensive analytics. We evaluate the approach in three study cases on national scale land use change monitoring with Landsat imagery, global empirical orthogonal function analysis of daily precipitation, and combining historical climate model projections with satellite-based observations. Results indicate that the approach can be used to represent various EO datasets and that analyses in SciDB scale well with available computational resources. To simplify analyses of higher-dimensional datasets as from climate model output, however, a generalization of the GDAL data model might be needed. All parts of this work have been implemented as open-source software and we discuss how this may facilitate open and reproducible EO analyses.
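    On the GDAL side, converting a scene into an in-memory array together with its georeferencing is a one-screen operation, sketched below; the file name is a placeholder and the SciDB loading step described in the record is omitted.

```python
import numpy as np
from osgeo import gdal

gdal.UseExceptions()

# Placeholder path; any GDAL-readable scene (e.g. a Landsat GeoTIFF band) works here.
dataset = gdal.Open("scene_placeholder.tif")

array = dataset.ReadAsArray()              # (bands, rows, cols) or (rows, cols)
geotransform = dataset.GetGeoTransform()   # origin and pixel size
projection = dataset.GetProjection()       # spatial reference as WKT

print(array.shape, array.dtype)
print("upper-left corner:", geotransform[0], geotransform[3])
print("pixel size:", geotransform[1], geotransform[5])
```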

  10. Filtering Raw Terrestrial Laser Scanning Data for Efficient and Accurate Use in Geomorphologic Modeling

    NASA Astrophysics Data System (ADS)

    Gleason, M. J.; Pitlick, J.; Buttenfield, B. P.

    2011-12-01

    Terrestrial laser scanning (TLS) represents a new and particularly effective remote sensing technique for investigating geomorphologic processes. Unfortunately, TLS data are commonly characterized by extremely large volume, heterogeneous point distribution, and erroneous measurements, raising challenges for applied researchers. To facilitate efficient and accurate use of TLS in geomorphology, and to improve accessibility for TLS processing in commercial software environments, we are developing a filtering method for raw TLS data to: eliminate data redundancy; produce a more uniformly spaced dataset; remove erroneous measurements; and maintain the ability of the TLS dataset to accurately model terrain. Our method conducts local aggregation of raw TLS data using a 3-D search algorithm based on the geometrical expression of expected random errors in the data. This approach accounts for the estimated accuracy and precision limitations of the instruments and procedures used in data collection, thereby allowing for identification and removal of potential erroneous measurements prior to data aggregation. Initial tests of the proposed technique on a sample TLS point cloud required a modest processing time of approximately 100 minutes to reduce dataset volume over 90 percent (from 12,380,074 to 1,145,705 points). Preliminary analysis of the filtered point cloud revealed substantial improvement in homogeneity of point distribution and minimal degradation of derived terrain models. We will test the method on two independent TLS datasets collected in consecutive years along a non-vegetated reach of the North Fork Toutle River in Washington. We will evaluate the tool using various quantitative, qualitative, and statistical methods. The crux of this evaluation will include a bootstrapping analysis to test the ability of the filtered datasets to model the terrain at roughly the same accuracy as the raw datasets.
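    A simplified stand-in for the aggregation step is shown below: it reduces a raw point cloud to one representative point per 3-D cell with NumPy. The real method uses an error-based search radius rather than the fixed cell size assumed here, so this is only a sketch of the general idea.

```python
import numpy as np

def voxel_aggregate(points, cell=0.05):
    """Aggregate an (N, 3) point cloud to one mean point per cubic cell of size `cell` (m)."""
    keys = np.floor(points / cell).astype(np.int64)
    # Group points that fall in the same cell and average them.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

# Synthetic noisy scan of 200k points standing in for a raw TLS cloud.
rng = np.random.default_rng(0)
cloud = rng.random((200_000, 3)) * np.array([20.0, 20.0, 2.0])
thinned = voxel_aggregate(cloud, cell=0.25)
print(cloud.shape[0], "->", thinned.shape[0], "points")
```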

  11. Eleven fetal echocardiographic planes using 4-dimensional ultrasound with spatio-temporal image correlation (STIC): a logical approach to fetal heart volume analysis.

    PubMed

    Jantarasaengaram, Surasak; Vairojanavong, Kittipong

    2010-09-15

    Theoretically, a cross-sectional image of any cardiac plane can be obtained from a STIC fetal heart volume dataset. We described a method to display 11 fetal echocardiographic planes from STIC volumes. Fetal heart volume datasets were acquired by transverse acquisition from 200 normal fetuses at 15 to 40 weeks of gestation. Analysis of the volume datasets using the described technique to display the 11 echocardiographic planes in the multiplanar display mode was performed offline. Volume datasets from 18 fetuses were excluded due to poor image resolution. The mean visualization rates for all echocardiographic planes at 15-17, 18-22, 23-27, 28-32 and 33-40 weeks of gestation were 85.6% (range 45.2-96.8%, N = 31), 92.9% (range 64.0-100%, N = 64), 93.4% (range 51.4-100%, N = 37), 88.7% (range 54.5-100%, N = 33) and 81.8% (range 23.5-100%, N = 17), respectively. Overall, the applied technique can favorably display the pertinent echocardiographic planes. The presented method provides a logical approach to exploring fetal heart volumes.

  12. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server, which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information (e.g. location, date, and instrument) were established. Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database, where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter- and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.

  13. Global surface displacement data for assessing variability of displacement at a point on a fault

    USGS Publications Warehouse

    Hecker, Suzanne; Sickler, Robert; Feigelson, Leah; Abrahamson, Norman; Hassett, Will; Rosa, Carla; Sanquini, Ann

    2014-01-01

    This report presents a global dataset of site-specific surface-displacement data on faults. We have compiled estimates of successive displacements attributed to individual earthquakes, mainly paleoearthquakes, at sites where two or more events have been documented, as a basis for analyzing inter-event variability in surface displacement on continental faults. An earlier version of this composite dataset was used in a recent study relating the variability of surface displacement at a point to the magnitude-frequency distribution of earthquakes on faults, and to hazard from fault rupture (Hecker and others, 2013). The purpose of this follow-on report is to provide potential data users with an updated comprehensive dataset, largely complete through 2010 for studies in English-language publications, as well as in some unpublished reports and abstract volumes.

  14. Big Data challenges and solutions in building the Global Earth Observation System of Systems (GEOSS)

    NASA Astrophysics Data System (ADS)

    Mazzetti, Paolo; Nativi, Stefano; Santoro, Mattia; Boldrini, Enrico

    2014-05-01

    The Group on Earth Observation (GEO) is a voluntary partnership of governments and international organizations launched in response to calls for action by the 2002 World Summit on Sustainable Development and by the G8 (Group of Eight) leading industrialized countries. These high-level meetings recognized that international collaboration is essential for exploiting the growing potential of Earth observations to support decision making in an increasingly complex and environmentally stressed world. To this aim, GEO is constructing the Global Earth Observation System of Systems (GEOSS) on the basis of a 10-Year Implementation Plan for the period 2005 to 2015, when it will become operational. As a large-scale integrated system handling large datasets such as those provided by Earth Observation, GEOSS needs to face several challenges related to big data handling and big data infrastructure management. Referring to the traditional multiple-V characteristics of Big Data (volume, variety, velocity, veracity and visualization), it is evident that most of them apply to the data handled by GEOSS. In particular, concerning Volume, Earth Observation already generates a large amount of data, estimated in the range of Petabytes (10^15 bytes), with Exabytes (10^18 bytes) already targeted. Moreover, the challenge is related not only to data size, but also to the large number of datasets (not necessarily of big size) that systems need to manage. Variety is the other main challenge, since datasets coming from different sensors, processed for different use cases, are published with highly heterogeneous metadata and data models, through different service interfaces. Innovative multidisciplinary applications need to access and use those datasets in a harmonized way. Moreover, Earth Observation data are growing in size and variety at an exceptionally fast rate, and new technologies and applications, including crowdsourcing, will further increase data volume and variety in the near future. The current implementation of GEOSS already addresses several big data challenges. In particular, the brokered architecture adopted in the GEOSS Common Infrastructure, with the deployment of the GEO DAB (Discovery and Access Broker), allows more than 20 large EO infrastructures to be connected while keeping them autonomous, as required by their own mandates and governance. They make more than 60 million unique resources discoverable and accessible through the GEO Portal. Through the GEO DAB, users are able to seamlessly discover resources provided by different infrastructures and access them in a harmonized way, collecting datasets from different sources in a common environment (same coordinate reference system, spatial subset, format, etc.). Through the GEONETCast system, GEOSS is also providing a solution to the Velocity challenge by delivering EO resources to developing countries with low-bandwidth connections. Several research efforts addressing the other Big Data V challenges in GEOSS are ongoing, including quality representation for Veracity (as in the FP7 GeoViQua project), brokering big data analytics platforms for Velocity, and support for other EO resources for Variety (such as modelling resources in the Model Web).

  15. Near Real-time Scientific Data Analysis and Visualization with the ArcGIS Platform

    NASA Astrophysics Data System (ADS)

    Shrestha, S. R.; Viswambharan, V.; Doshi, A.

    2017-12-01

    Scientific multidimensional data are generated from a variety of sources and platforms. These datasets are mostly produced by Earth observation and/or modeling systems. Agencies like NASA, NOAA, USGS, and ESA produce large volumes of near real-time observation, forecast, and historical data that drive fundamental research and its applications, from basic decision making to disaster response. A common big data challenge for organizations working with multidimensional scientific data and imagery collections is the time and resources required to manage and process such large volumes and varieties of data. The challenge of adopting data-driven real-time visualization and analysis, as well as the need to share these large datasets, workflows, and information products with wider and more diverse communities, brings an opportunity to use the ArcGIS platform to handle such demand. In recent years, a significant effort has been put into expanding the capabilities of ArcGIS to support multidimensional scientific data across the platform. New capabilities in ArcGIS for scientific data management, processing, and analysis, as well as for creating information products from large volumes of data using image server technology, are becoming widely used in Earth science and other domains. We will discuss the challenges associated with big data in the geospatial science community and how we have addressed these challenges in the ArcGIS platform. We will share a few use cases, such as NOAA High-Resolution Rapid Refresh (HRRR) data, that demonstrate how we access large collections of near real-time data (stored on-premise or in the cloud), disseminate them dynamically, process and analyze them on the fly, and serve them to a variety of geospatial applications. We will also show how on-the-fly processing with raster function capabilities can be extended to create persisted data and information products using raster analytics capabilities that exploit distributed computing in an enterprise environment.

  16. ESSG-based global spatial reference frame for datasets interrelation

    NASA Astrophysics Data System (ADS)

    Yu, J. Q.; Wu, L. X.; Jia, Y. J.

    2013-10-01

    To understand the highly complex Earth system, a large volume, as well as a large variety, of datasets on the planet Earth are being obtained, distributed, and shared worldwide every day. However, few existing systems concentrate on the distribution and interrelation of different datasets in a common Global Spatial Reference Frame (GSRF), which poses an invisible obstacle to data sharing and scientific collaboration. The Group on Earth Observation (GEO) has recently established a new GSRF, named the Earth System Spatial Grid (ESSG), for global dataset distribution, sharing and interrelation in its 2012-2015 WORKING PLAN. The ESSG may bridge the gap among different spatial datasets and hence overcome these obstacles. This paper presents the implementation of the ESSG-based GSRF. A reference spheroid, a grid subdivision scheme, and a suitable encoding system are required to implement it. The radius of the ESSG reference spheroid was set to double the approximate Earth radius so that datasets from different areas of Earth system science are covered. The same positioning and orientation parameters as Earth Centred Earth Fixed (ECEF) were adopted for the ESSG reference spheroid so that any other GSRF can be freely transformed into the ESSG-based GSRF. The spheroid degenerated octree grid with radius refinement (SDOG-R) and its encoding method were taken as the grid subdivision and encoding scheme for their good performance in many aspects. A triple (C, T, A) model is introduced to represent and link different datasets based on the ESSG-based GSRF. Finally, methods of coordinate transformation between the ESSG-based GSRF and other GSRFs are presented to make the ESSG-based GSRF operable and propagable.

  17. Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2012-01-01

    With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
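    The hybrid pattern (MPI across nodes, shared-memory threads within a node) can be sketched for a toy reduction as below. The per-core "render" work is a placeholder sum over a block of cells, not the actual ray-casting kernel, and the block sizes are arbitrary.

```python
# Run with e.g.: mpiexec -n 4 python hybrid_sketch.py
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Distributed-memory split: each MPI rank owns one block of a placeholder volume.
local_block = np.random.default_rng(rank).random((64, 256, 256))

def process_slab(slab):
    # Placeholder per-core work standing in for compositing samples along rays.
    return slab.sum()

# Shared-memory split: threads within the rank work on slabs of the local block.
with ThreadPoolExecutor(max_workers=4) as pool:
    local_result = sum(pool.map(process_slab, np.array_split(local_block, 4)))

# Communication stage: combine per-node partial results (the costly step at scale).
total = comm.reduce(local_result, op=MPI.SUM, root=0)
if rank == 0:
    print("combined result:", total)
```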

  18. Three-dimensional histology: tools and application to quantitative assessment of cell-type distribution in rabbit heart

    PubMed Central

    Burton, Rebecca A.B.; Lee, Peter; Casero, Ramón; Garny, Alan; Siedlecka, Urszula; Schneider, Jürgen E.; Kohl, Peter; Grau, Vicente

    2014-01-01

    Aims Cardiac histo-anatomical organization is a major determinant of function. Changes in tissue structure are a relevant factor in normal and disease development, and form targets of therapeutic interventions. The purpose of this study was to test tools aimed to allow quantitative assessment of cell-type distribution from large histology and magnetic resonance imaging- (MRI) based datasets. Methods and results Rabbit heart fixation during cardioplegic arrest and MRI were followed by serial sectioning of the whole heart and light-microscopic imaging of trichrome-stained tissue. Segmentation techniques developed specifically for this project were applied to segment myocardial tissue in the MRI and histology datasets. In addition, histology slices were segmented into myocytes, connective tissue, and undefined. A bounding surface, containing the whole heart, was established for both MRI and histology. Volumes contained in the bounding surface (called ‘anatomical volume’), as well as that identified as containing any of the above tissue categories (called ‘morphological volume’), were calculated. The anatomical volume was 7.8 cm3 in MRI, and this reduced to 4.9 cm3 after histological processing, representing an ‘anatomical’ shrinkage by 37.2%. The morphological volume decreased by 48% between MRI and histology, highlighting the presence of additional tissue-level shrinkage (e.g. an increase in interstitial cleft space). The ratio of pixels classified as containing myocytes to pixels identified as non-myocytes was roughly 6:1 (61.6 vs. 9.8%; the remaining fraction of 28.6% was ‘undefined’). Conclusion Qualitative and quantitative differentiation between myocytes and connective tissue, using state-of-the-art high-resolution serial histology techniques, allows identification of cell-type distribution in whole-heart datasets. Comparison with MRI illustrates a pronounced reduction in anatomical and morphological volumes during histology processing. PMID:25362175

  19. A Comparative Study of Point Cloud Data Collection and Processing

    NASA Astrophysics Data System (ADS)

    Pippin, J. E.; Matheney, M.; Gentle, J. N., Jr.; Pierce, S. A.; Fuentes-Pineda, G.

    2016-12-01

    Over the past decade, there has been dramatic growth in the acquisition of publicly funded high-resolution topographic data for scientific, environmental, engineering and planning purposes. These data sets are valuable for applications of interest across a large and varied user community. However, because of the large volumes of data produced by high-resolution mapping technologies and the expense of aerial data collection, it is often difficult to collect and distribute these datasets. Furthermore, the data can be technically challenging to process, requiring software and computing resources not readily available to many users. This study presents a comparison of advanced computing hardware and software used to collect and process point cloud datasets, such as LIDAR scans. Activities included implementation and testing of open source libraries and applications for point cloud data processing such as Meshlab, Blender, PDAL, and PCL. Additionally, a suite of commercial-scale applications, Skanect and Cloudcompare, were applied to raw datasets. Handheld hardware solutions, a Structure Scanner and Xbox 360 Kinect V1, were tested for their ability to scan at three field locations. The resulting projects successfully scanned and processed subsurface karst features ranging from small stalactites to large rooms, as well as a surface waterfall feature. Outcomes support the feasibility of rapid sensing in 3D at field scales.

  20. Potential application of machine learning in health outcomes research and some statistical cautions.

    PubMed

    Crown, William H

    2015-03-01

    Traditional analytic methods are often ill-suited to the evolving world of health care big data characterized by massive volume, complexity, and velocity. In particular, methods are needed that can estimate models efficiently using very large datasets containing healthcare utilization data, clinical data, data from personal devices, and many other sources. Although very large, such datasets can also be quite sparse (e.g., device data may only be available for a small subset of individuals), which creates problems for traditional regression models. Many machine learning methods address such limitations effectively but are still subject to the usual sources of bias that commonly arise in observational studies. Researchers using machine learning methods such as lasso or ridge regression should assess these models using conventional specification tests. Copyright © 2015 International Society for Pharmacoeconomics and Outcomes Research (ISPOR). Published by Elsevier Inc. All rights reserved.
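    As a concrete example of the penalized regression mentioned here, the sketch below fits a cross-validated lasso to a wide, sparse synthetic dataset. It illustrates coefficient shrinkage only; it is not a remedy for the observational biases the abstract warns about.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Wide, noisy synthetic data: many candidate predictors, few truly informative,
# loosely mimicking sparse healthcare utilization and device features.
X, y = make_regression(n_samples=500, n_features=2000, n_informative=20,
                       noise=10.0, random_state=0)

model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen alpha:", model.alpha_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)), "of", X.shape[1])
# Selected coefficients should still be checked with conventional specification tests,
# as the abstract cautions, since regularization does not remove observational bias.
```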

  1. Cone Penetration Testing, a new approach to quantify coastal-deltaic land subsidence by peat consolidation

    NASA Astrophysics Data System (ADS)

    Koster, Kay; Erkens, Gilles; Zwanenburg, Cor

    2016-04-01

    It is undisputed that land subsidence threatens coastal-deltaic lowlands all over the world. Any loss of elevation (on top of sea level rise) increases flood risk in these lowlands, and differential subsidence may cause damage to infrastructure and constructions. Many of these settings embed substantial amounts of peat, which is, due to its mechanically weak organic composition, one of the main drivers of subsidence. Peat is very susceptible to volume reduction by loading- and drainage-induced consolidation, which dissipates pore water, resulting in a tighter packing of the organic components. Often, the current state of consolidation of peat embedded within coastal-deltaic subsidence hotspots (e.g. the Venice lagoon, Mississippi delta, San Joaquin delta, Kalimantan peatlands) is somewhere between its initial (natural) and maximally compressed stage. Quantifying the current state of peat volume loss is of utmost importance to predict potential (near) future subsidence when draining or loading an area. The processes of subsidence often afflict large areas (>10^3 km^2), thus demanding large datasets to assess the current state of the subsurface. In contrast to data describing the vertical motions of the actual surface (geodesy, satellite imagery), subsurface information applicable to subsidence analysis is often lacking in subsiding deltas. This calls for new initiatives to bridge that gap. Here we introduce Cone Penetration Testing (CPT) to quantify the amount of volume loss experienced by peat layers embedded within the Holland coastal plain (the Netherlands). CPT measures soil mechanical strength, and hundreds of thousands of CPTs are conducted each year on all continents. We analyzed 28 coupled CPT-borehole observations and found strong empirical relations between volume loss and increased peat mechanical strength. The peat lost between ~20 - 95% of its initial thickness by dissipation of excess pore water. An increase of 0.1 - 0.4 MPa in peat strength corresponds to 20 - 75% of the volume loss, and 0.4 - 0.7 MPa to 75 - 95% volume loss. This indicates that a large amount of volume has to be lost through pore-water dissipation before peat experiences a serious increase in strength, which subsequently continues to increase with only small additional volume loss. To demonstrate the robustness of our approach to the international field of land subsidence, we applied the obtained empirical relations to previously published CPT logs from the peat-rich San Joaquin-Sacramento delta and the Kalimantan peatlands, and found volume losses that correspond with previously published results. Furthermore, we used the obtained results to predict maximum surface lowering for these areas by consolidation. In conclusion, these promising results, and the large datasets yielded by the worldwide popularity of CPT, open the door for CPT as a generic method to contribute to quantifying the imminent threat of coastal-deltaic land subsidence.
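    To make the reported ranges concrete, the sketch below linearly interpolates a volume-loss estimate from an increase in cone resistance. The breakpoints are taken directly from the figures quoted above, and the interpolation itself is an illustrative simplification, not the authors' fitted relation.

```python
import numpy as np

# Breakpoints from the abstract: +0.1-0.4 MPa -> ~20-75 % volume loss,
#                                +0.4-0.7 MPa -> ~75-95 % volume loss.
strength_increase_mpa = np.array([0.1, 0.4, 0.7])
volume_loss_percent = np.array([20.0, 75.0, 95.0])

def estimate_volume_loss(delta_q_mpa):
    """Rough volume-loss estimate (%) for a given increase in peat strength (MPa)."""
    return float(np.interp(delta_q_mpa, strength_increase_mpa, volume_loss_percent))

for dq in (0.15, 0.3, 0.55):
    print(f"+{dq:.2f} MPa -> ~{estimate_volume_loss(dq):.0f} % volume loss")
```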

  2. SU-C-207B-04: Automated Segmentation of Pectoral Muscle in MR Images of Dense Breasts

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Verburg, E; Waard, SN de; Veldhuis, WB

    Purpose: To develop and evaluate a fully automated method for segmentation of the pectoral muscle boundary in Magnetic Resonance Imaging (MRI) of dense breasts. Methods: Segmentation of the pectoral muscle is an important part of automatic breast image analysis methods. Current methods for segmenting the pectoral muscle in breast MRI have difficulties delineating the muscle border correctly in breasts with a large proportion of fibroglandular tissue (i.e., dense breasts). Hence, an automated method based on dynamic programming was developed, incorporating heuristics aimed at shape, location and gradient features. To assess the method, the pectoral muscle was segmented in 91 randomly selected participants (mean age 56.6 years, range 49.5–75.2 years) from a large MRI screening trial in women with dense breasts (ACR BI-RADS category 4). Each MR dataset consisted of 178 or 179 T1-weighted images with voxel size 0.64 × 0.64 × 1.00 mm^3. All images (n=16,287) were reviewed and scored by a radiologist. In contrast to volume overlap coefficients, such as DICE, the radiologist detected deviations in the segmented muscle border and determined whether the result would impact the ability to accurately determine the volume of fibroglandular tissue and the detection of breast lesions. Results: According to the radiologist’s scores, 95.5% of the slices did not mask breast tissue in such a way that it could affect detection of breast lesions or volume measurements. In 13.1% of the slices a deviation in the segmented muscle border was present that would not impact breast lesion detection. In 70 datasets (78%) at least 95% of the slices were segmented in such a way that it would not affect detection of breast lesions, and in 60 (66%) datasets this was 100%. Conclusion: Dynamic programming with dedicated heuristics shows promising potential to segment the pectoral muscle in women with dense breasts.
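    A generic dynamic-programming boundary search of the kind referred to, finding the minimum-cost path from the left to the right edge of a cost image, can be written compactly. The cost image below is synthetic, and the shape, location and gradient heuristics of the actual method are not modeled here.

```python
import numpy as np

def min_cost_boundary(cost):
    """Minimum-cost left-to-right path through a 2-D cost image via dynamic programming.

    Returns one row index per column; adjacent columns differ by at most one row.
    """
    rows, cols = cost.shape
    acc = cost.astype(float).copy()
    for c in range(1, cols):
        up = np.roll(acc[:, c - 1], 1)
        up[0] = np.inf
        down = np.roll(acc[:, c - 1], -1)
        down[-1] = np.inf
        acc[:, c] += np.minimum(np.minimum(up, acc[:, c - 1]), down)
    # Backtrack from the cheapest endpoint in the last column.
    path = np.empty(cols, dtype=int)
    path[-1] = int(np.argmin(acc[:, -1]))
    for c in range(cols - 2, -1, -1):
        r = path[c + 1]
        lo, hi = max(r - 1, 0), min(r + 2, rows)
        path[c] = lo + int(np.argmin(acc[lo:hi, c]))
    return path

# Synthetic cost image with a low-cost horizontal band standing in for the muscle border.
rng = np.random.default_rng(0)
img = rng.random((128, 178))
img[60:63, :] *= 0.05
print(min_cost_boundary(img)[:10])
```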

  3. TU-AB-BRA-11: Evaluation of Fully Automatic Volumetric GBM Segmentation in the TCGA-GBM Dataset: Prognosis and Correlation with VASARI Features

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rios Velazquez, E; Meier, R; Dunn, W

    Purpose: Reproducible definition and quantification of imaging biomarkers is essential. We evaluated a fully automatic MR-based segmentation method by comparing it to sub-volumes manually defined by experienced radiologists in the TCGA-GBM dataset, in terms of sub-volume prognosis and association with VASARI features. Methods: MRI sets of 67 GBM patients were downloaded from the Cancer Imaging Archive. GBM sub-compartments were defined manually and automatically using the Brain Tumor Image Analysis (BraTumIA) software, including necrosis, edema, contrast-enhancing and non-enhancing tumor. Spearman’s correlation was used to evaluate the agreement with VASARI features. Prognostic significance was assessed using the C-index. Results: Auto-segmented sub-volumes showed high agreement with manually delineated volumes (range (r): 0.65 – 0.91). They also showed higher correlation with VASARI features (auto r = 0.35, 0.60 and 0.59; manual r = 0.29, 0.50, 0.43, for contrast-enhancing, necrosis and edema, respectively). The contrast-enhancing volume and post-contrast abnormal volume showed the highest C-index (0.73 and 0.72), comparable to manually defined volumes (p = 0.22 and p = 0.07, respectively). The non-enhancing region defined by BraTumIA showed a significantly higher prognostic value (CI = 0.71) than the edema (CI = 0.60), both of which could not be distinguished by manual delineation. Conclusion: BraTumIA tumor sub-compartments showed higher correlation with VASARI data, and equivalent performance in terms of prognosis compared to manual sub-volumes. This method can enable more reproducible definition and quantification of imaging-based biomarkers and has large potential in high-throughput medical imaging research.
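    The two evaluation statistics used here are easy to reproduce on toy data: Spearman's rank correlation between a segmented volume and a VASARI-like score, and a concordance index between volumes and survival times. All numbers below are synthetic, and the simple C-index ignores censoring.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic auto-segmented volumes (cc) and an ordinal VASARI-like rating for 67 cases.
volumes = rng.gamma(shape=4.0, scale=10.0, size=67)
vasari_score = np.clip(np.round(volumes / 15 + rng.normal(0, 1, 67)), 1, 5)
rho, p = spearmanr(volumes, vasari_score)
print(f"Spearman r = {rho:.2f} (p = {p:.3g})")

def c_index(risk, time):
    """Concordance index without censoring: fraction of comparable pairs ordered
    consistently (higher risk paired with shorter survival), ties counted as 0.5."""
    concordant = permissible = 0.0
    for i in range(len(time)):
        for j in range(i + 1, len(time)):
            if time[i] == time[j]:
                continue
            permissible += 1
            shorter, longer = (i, j) if time[i] < time[j] else (j, i)
            if risk[shorter] > risk[longer]:
                concordant += 1
            elif risk[shorter] == risk[longer]:
                concordant += 0.5
    return concordant / permissible

# Synthetic survival times that shorten as volume grows, so the C-index exceeds 0.5.
survival = rng.exponential(scale=500.0 / (1 + volumes / 20))
print("C-index:", round(c_index(volumes, survival), 2))
```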

  4. 'The surface management system' (SuMS) database: a surface-based database to aid cortical surface reconstruction, visualization and analysis

    NASA Technical Reports Server (NTRS)

    Dickson, J.; Drury, H.; Van Essen, D. C.

    2001-01-01

    Surface reconstructions of the cerebral cortex are increasingly widely used in the analysis and visualization of cortical structure, function and connectivity. From a neuroinformatics perspective, dealing with surface-related data poses a number of challenges. These include the multiplicity of configurations in which surfaces are routinely viewed (e.g. inflated maps, spheres and flat maps), plus the diversity of experimental data that can be represented on any given surface. To address these challenges, we have developed a surface management system (SuMS) that allows automated storage and retrieval of complex surface-related datasets. SuMS provides a systematic framework for the classification, storage and retrieval of many types of surface-related data and associated volume data. Within this classification framework, it serves as a version-control system capable of handling large numbers of surface and volume datasets. With built-in database management system support, SuMS provides rapid search and retrieval capabilities across all the datasets, while also incorporating multiple security levels to regulate access. SuMS is implemented in Java and can be accessed via a Web interface (WebSuMS) or using downloaded client software. Thus, SuMS is well positioned to act as a multiplatform, multi-user 'surface request broker' for the neuroscience community.

  5. "Tools For Analysis and Visualization of Large Time- Varying CFD Data Sets"

    NASA Technical Reports Server (NTRS)

    Wilhelms, Jane; vanGelder, Allen

    1999-01-01

    During the four years of this grant (including the one year extension), we have explored many aspects of the visualization of large CFD (Computational Fluid Dynamics) datasets. These have included new direct volume rendering approaches, hierarchical methods, volume decimation, error metrics, parallelization, hardware texture mapping, and methods for analyzing and comparing images. First, we implemented an extremely general direct volume rendering approach that can be used to render rectilinear, curvilinear, or tetrahedral grids, including overlapping multiple zone grids, and time-varying grids. Next, we developed techniques for associating the sample data with a k-d tree, a simple hierarchical data model to approximate samples in the regions covered by each node of the tree, and an error metric for the accuracy of the model. We also explored a new method for determining the accuracy of approximate models based on the light field method described at ACM SIGGRAPH (Association for Computing Machinery Special Interest Group on Computer Graphics) '96. In our initial implementation, we automatically image the volume from 32 approximately evenly distributed positions on the surface of an enclosing tessellated sphere. We then calculate differences between these images under different conditions of volume approximation or decimation.

  6. A Comparison of Lung Nodule Segmentation Algorithms: Methods and Results from a Multi-institutional Study.

    PubMed

    Kalpathy-Cramer, Jayashree; Zhao, Binsheng; Goldgof, Dmitry; Gu, Yuhua; Wang, Xingwei; Yang, Hao; Tan, Yongqiang; Gillies, Robert; Napel, Sandy

    2016-08-01

    Tumor volume estimation, as well as accurate and reproducible segmentation of tumor borders in medical images, is important in the diagnosis, staging, and assessment of response to cancer therapy. The goal of this study was to demonstrate the feasibility of a multi-institutional effort to assess the repeatability and reproducibility of nodule borders and volume estimate bias of computerized segmentation algorithms in CT images of lung cancer, and to provide results from such a study. The dataset used for this evaluation consisted of 52 tumors in 41 CT volumes (40 patient datasets and 1 dataset containing scans of 12 phantom nodules of known volume) from five collections available in The Cancer Imaging Archive. Three academic institutions developing lung nodule segmentation algorithms submitted results for three repeat runs for each of the nodules. We compared the performance of lung nodule segmentation algorithms by assessing several measures of spatial overlap and volume measurement. Nodule sizes varied from 29 μl to 66 ml and demonstrated a diversity of shapes. Agreement in spatial overlap of segmentations was significantly higher for multiple runs of the same algorithm than between segmentations generated by different algorithms (p < 0.05) and was significantly higher on the phantom dataset compared to the other datasets (p < 0.05). Algorithms differed significantly in the bias of the measured volumes of the phantom nodules (p < 0.05), underscoring the need for assessing performance on clinical data in addition to phantoms. Algorithms that most accurately estimated nodule volumes were not the most repeatable, emphasizing the need to evaluate both their accuracy and precision. There were considerable differences between algorithms, especially in a subset of heterogeneous nodules, underscoring the recommendation that the same software be used at all time points in longitudinal studies.
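
    Spatial overlap and volume bias, the two classes of measures compared above, can be illustrated with a short sketch. The example below uses the Dice coefficient for overlap and a signed volume error against a known phantom volume; the masks, voxel size and "algorithms" are hypothetical stand-ins, not data from the study.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary segmentation masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def volume_bias(mask, reference_volume_ml, voxel_volume_ml):
    """Signed volume error of a segmentation relative to a known reference volume."""
    measured = mask.sum() * voxel_volume_ml
    return measured - reference_volume_ml

# Hypothetical example: two algorithms segmenting the same synthetic phantom nodule.
rng = np.random.default_rng(2)
truth = np.zeros((64, 64, 64), dtype=bool)
truth[20:44, 20:44, 20:44] = True                      # synthetic nodule
alg1 = truth.copy(); alg1[20:44, 20:44, 43] = False    # slightly under-segments
alg2 = truth | (rng.random(truth.shape) < 0.002)       # adds scattered false positives

voxel_ml = 0.001                                       # 1 mm^3 voxels = 0.001 ml
ref_ml = truth.sum() * voxel_ml
for name, m in (("alg1", alg1), ("alg2", alg2)):
    print(f"{name}: Dice vs truth = {dice(m, truth):.3f}, "
          f"volume bias = {volume_bias(m, ref_ml, voxel_ml):+.3f} ml")
```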

  7. Primate Brain Anatomy: New Volumetric MRI Measurements for Neuroanatomical Studies.

    PubMed

    Navarrete, Ana F; Blezer, Erwin L A; Pagnotta, Murillo; de Viet, Elizabeth S M; Todorov, Orlin S; Lindenfors, Patrik; Laland, Kevin N; Reader, Simon M

    2018-06-12

    Since the publication of the primate brain volumetric dataset of Stephan and colleagues in the early 1980s, no major new comparative datasets covering multiple brain regions and a large number of primate species have become available. However, technological and other advances in the last two decades, particularly magnetic resonance imaging (MRI) and the creation of institutions devoted to the collection and preservation of rare brain specimens, provide opportunities to rectify this situation. Here, we present a new dataset including brain region volumetric measurements of 39 species, including 20 species not previously available in the literature, with measurements of 16 brain areas. These volumes were extracted from MRI of 46 brains of 38 species from the Netherlands Institute of Neuroscience Primate Brain Bank, scanned at high resolution with a 9.4-T scanner, plus a further 7 donated MRI of 4 primate species. Partial measurements were made on an additional 8 brains of 5 species. We make the dataset and MRI scans available online in the hope that they will be of value to researchers conducting comparative studies of primate evolution. © 2018 S. Karger AG, Basel.

  8. Improved Statistical Method For Hydrographic Climatic Records Quality Control

    NASA Astrophysics Data System (ADS)

    Gourrion, J.; Szekely, T.

    2016-02-01

    Climate research benefits from the continuous development of global in-situ hydrographic networks in the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to historical extreme values ever observed is developed, implemented and validated. Temperature, salinity and potential density validity intervals are directly estimated from minimum and maximum values from an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions on the data distributions such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows simultaneously addressing the two main objectives of a quality control strategy, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to early 2014, 2) 3 historical CTD datasets and 3) the Sea Mammals CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has been implemented in the latest version of the CORA dataset and will benefit the next version of the Copernicus CMEMS dataset.
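
    A minimal sketch of the validity-interval idea described above follows: local historical minimum and maximum values define the acceptance range, rather than mean-and-standard-deviation bounds. The gridding scheme, margin and synthetic reference data are hypothetical choices for illustration and do not reproduce the CORA implementation.

```python
import numpy as np

def build_validity_intervals(ref_temp, ref_lat, ref_lon, cell_deg=10.0):
    """Historical min/max temperature per spatial cell, used as validity intervals."""
    key = (np.floor(ref_lat / cell_deg).astype(int),
           np.floor(ref_lon / cell_deg).astype(int))
    intervals = {}
    for i, k in enumerate(zip(*key)):
        lo, hi = intervals.get(k, (np.inf, -np.inf))
        intervals[k] = (min(lo, ref_temp[i]), max(hi, ref_temp[i]))
    return intervals

def qc_flag(temp, lat, lon, intervals, cell_deg=10.0, margin=0.5):
    """Flag an observation if it falls outside the local historical extremes (+/- margin)."""
    k = (int(np.floor(lat / cell_deg)), int(np.floor(lon / cell_deg)))
    if k not in intervals:
        return "no_reference"
    lo, hi = intervals[k]
    return "suspect" if (temp < lo - margin or temp > hi + margin) else "good"

# Hypothetical reference profiles and a few new observations.
rng = np.random.default_rng(3)
ref_lat = rng.uniform(-60, 60, 5000)
ref_lon = rng.uniform(-180, 180, 5000)
ref_temp = 28.0 - 0.3 * np.abs(ref_lat) + rng.normal(0, 1.0, 5000)

intervals = build_validity_intervals(ref_temp, ref_lat, ref_lon)
for t, la, lo in [(22.0, 10.0, 30.0), (45.0, 55.0, -20.0)]:
    print(f"T={t:5.1f} degC at ({la:+.1f},{lo:+.1f}) -> {qc_flag(t, la, lo, intervals)}")
```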

  9. Conjunction Assessment Screening Volume Sizing and Event Filtering in Light of Natural Conjunction Event Development Behaviors

    NASA Technical Reports Server (NTRS)

    Hejduk, M. D.; Pachura, D. A.

    2017-01-01

    Conjunction Assessment screening volumes used in the protection of NASA satellites are constructed as geometric volumes about these satellites, of a size expected to capture a certain percentage of the serious conjunction events by a certain time before closest approach. However, the analyses that established these sizes were grounded on covariance-based projections rather than empirical screening results, did not tailor the volume sizes to ensure operational actionability of those results, and did not consider the adjunct ability to produce data that could provide prevenient assistance for maneuver planning. The present study effort seeks to reconsider these questions based on a six-month dataset of empirical screening results using an extremely large screening volume. The results, pursued here for a highly-populated orbit regime near 700 km altitude, identify theoretical limits of screening volume performance, explore volume configuration to facilitate both maneuver remediation planning as well as basic asset protection, and recommend sizing principles that maximize volume performance while minimizing the capture of "chaff" conjunctions that are unlikely ever to become serious events.

  10. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation

    NASA Astrophysics Data System (ADS)

    Tobon-Gomez, Catalina; Sukno, Federico M.; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F.

    2012-07-01

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  11. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation.

    PubMed

    Tobon-Gomez, Catalina; Sukno, Federico M; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F

    2012-07-07

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  12. TH-E-BRF-05: Comparison of Survival-Time Prediction Models After Radiotherapy for High-Grade Glioma Patients Based On Clinical and DVH Features

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magome, T; Haga, A; Igaki, H

    Purpose: Although many outcome prediction models based on dose-volume information have been proposed, it is well known that the prognosis may also be affected by multiple clinical factors. The purpose of this study is to predict the survival time after radiotherapy for high-grade glioma patients based on features including clinical and dose-volume histogram (DVH) information. Methods: A total of 35 patients with high-grade glioma (oligodendroglioma: 2, anaplastic astrocytoma: 3, glioblastoma: 30) were selected for this study. All patients were treated with a prescribed dose of 30–80 Gy after surgical resection or biopsy from 2006 to 2013 at The University of Tokyo Hospital. All cases were randomly separated into a training dataset (30 cases) and a test dataset (5 cases). The survival time after radiotherapy was predicted based on a multiple linear regression analysis and an artificial neural network (ANN) using 204 candidate features. The candidate features included 12 clinical features (tumor location, extent of surgical resection, treatment duration of radiotherapy, etc.) and 192 DVH features (maximum dose, minimum dose, D95, V60, etc.). The effective features for the prediction were selected with a step-wise method using the 30 training cases. The prediction accuracy was evaluated by the coefficient of determination (R²) between the predicted and actual survival time for the training and test datasets. Results: In the multiple regression analysis, the value of R² between the predicted and actual survival time was 0.460 for the training dataset and 0.375 for the test dataset. In the ANN analysis, on the other hand, the value of R² was 0.806 for the training dataset and 0.811 for the test dataset. Conclusion: Although a large number of patients would be needed for more accurate and robust prediction, our preliminary results showed the potential to predict the outcome in patients with high-grade glioma. This work was partly supported by the JSPS Core-to-Core Program (No. 23003) and a Grant-in-Aid for JSPS Fellows.
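
    The comparison of a multiple linear regression against an ANN via the coefficient of determination can be sketched as follows. The feature matrix, train/test split sizes and model settings below are hypothetical placeholders, and the step-wise feature selection used in the study is omitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the clinical + DVH feature matrix: 35 patients, 20 features,
# split 30 training / 5 test as in the record above.
rng = np.random.default_rng(4)
X = rng.normal(size=(35, 20))
y = 12.0 + 4.0 * X[:, 0] - 2.5 * X[:, 3] + 3.0 * np.maximum(X[:, 5], 0) + rng.normal(0, 1.5, 35)

X_train, X_test, y_train, y_test = X[:30], X[30:], y[:30], y[30:]
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

linear = LinearRegression().fit(X_train_s, y_train)
ann = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0).fit(X_train_s, y_train)

for name, model in (("multiple regression", linear), ("ANN", ann)):
    print(f"{name}: R2 train = {r2_score(y_train, model.predict(X_train_s)):.3f}, "
          f"R2 test = {r2_score(y_test, model.predict(X_test_s)):.3f}")
```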

  13. A Virtual Reality Visualization Tool for Neuron Tracing

    PubMed Central

    Usher, Will; Klacansky, Pavol; Federer, Frederick; Bremer, Peer-Timo; Knoll, Aaron; Angelucci, Alessandra; Pascucci, Valerio

    2017-01-01

    Tracing neurons in large-scale microscopy data is crucial to establishing a wiring diagram of the brain, which is needed to understand how neural circuits in the brain process information and generate behavior. Automatic techniques often fail for large and complex datasets, and connectomics researchers may spend weeks or months manually tracing neurons using 2D image stacks. We present a design study of a new virtual reality (VR) system, developed in collaboration with trained neuroanatomists, to trace neurons in microscope scans of the visual cortex of primates. We hypothesize that using consumer-grade VR technology to interact with neurons directly in 3D will help neuroscientists better resolve complex cases and enable them to trace neurons faster and with less physical and mental strain. We discuss both the design process and technical challenges in developing an interactive system to navigate and manipulate terabyte-sized image volumes in VR. Using a number of different datasets, we demonstrate that, compared to widely used commercial software, consumer-grade VR presents a promising alternative for scientists. PMID:28866520

  14. A Node Linkage Approach for Sequential Pattern Mining

    PubMed Central

    Navarro, Osvaldo; Cumplido, René; Villaseñor-Pineda, Luis; Feregrino-Uribe, Claudia; Carrasco-Ochoa, Jesús Ariel

    2014-01-01

    Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, the current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT), has better performance and scalability in comparison with state of the art algorithms. PMID:24933123
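
    The pseudo-projection idea described above, representing a projected database as pointers into the original sequences rather than materializing it, is the core of pattern-growth mining. The sketch below is a deliberately simplified, PrefixSpan-style miner for single-item sequences; it illustrates pseudo-projection but is not the NLDFT algorithm itself.

```python
from collections import defaultdict

def pattern_growth(sequences, min_support):
    """Minimal pattern-growth miner using pseudo-projections: a projected
    database is just a list of (sequence_id, start_position) pointers."""
    results = []

    def grow(prefix, projection):
        # Count items occurring after the current positions in each projected sequence.
        support = defaultdict(set)
        for sid, start in projection:
            for item in set(sequences[sid][start:]):
                support[item].add(sid)
        for item, sids in support.items():
            if len(sids) < min_support:
                continue
            new_prefix = prefix + [item]
            results.append((new_prefix, len(sids)))
            # Pseudo-project: advance each pointer just past the first match of `item`.
            new_projection = []
            for sid, start in projection:
                seq = sequences[sid]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        new_projection.append((sid, pos + 1))
                        break
            grow(new_prefix, new_projection)

    grow([], [(sid, 0) for sid in range(len(sequences))])
    return results

db = [list("abcab"), list("acbab"), list("babca"), list("abacb")]
for pattern, support in sorted(pattern_growth(db, min_support=3), key=lambda r: (-r[1], r[0])):
    print("".join(pattern), support)
```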

  15. A Virtual Reality Visualization Tool for Neuron Tracing.

    PubMed

    Usher, Will; Klacansky, Pavol; Federer, Frederick; Bremer, Peer-Timo; Knoll, Aaron; Yarch, Jeff; Angelucci, Alessandra; Pascucci, Valerio

    2018-01-01

    Tracing neurons in large-scale microscopy data is crucial to establishing a wiring diagram of the brain, which is needed to understand how neural circuits in the brain process information and generate behavior. Automatic techniques often fail for large and complex datasets, and connectomics researchers may spend weeks or months manually tracing neurons using 2D image stacks. We present a design study of a new virtual reality (VR) system, developed in collaboration with trained neuroanatomists, to trace neurons in microscope scans of the visual cortex of primates. We hypothesize that using consumer-grade VR technology to interact with neurons directly in 3D will help neuroscientists better resolve complex cases and enable them to trace neurons faster and with less physical and mental strain. We discuss both the design process and technical challenges in developing an interactive system to navigate and manipulate terabyte-sized image volumes in VR. Using a number of different datasets, we demonstrate that, compared to widely used commercial software, consumer-grade VR presents a promising alternative for scientists.

  16. An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement.

    PubMed

    Souza, Roberto; Lucena, Oeslle; Garrafa, Julia; Gobbi, David; Saluzzi, Marina; Appenzeller, Simone; Rittner, Letícia; Frayne, Richard; Lotufo, Roberto

    2018-04-15

    This paper presents an open, multi-vendor, multi-field strength magnetic resonance (MR) T1-weighted volumetric brain imaging dataset, named Calgary-Campinas-359 (CC-359). The dataset is composed of images of older healthy adults (29-80 years) acquired on scanners from three vendors (Siemens, Philips and General Electric) at both 1.5 T and 3 T. CC-359 comprises 359 datasets, approximately 60 subjects per vendor and magnetic field strength. The dataset is approximately age and gender balanced, subject to the constraints of the available images. It provides consensus brain extraction masks for all volumes generated using supervised classification. Manual segmentation results for twelve randomly selected subjects performed by an expert are also provided. The CC-359 dataset allows investigation of 1) the influences of both vendor and magnetic field strength on quantitative analysis of brain MR; 2) parameter optimization for automatic segmentation methods; and potentially 3) machine learning classifiers with big data, specifically those based on deep learning methods, as these approaches require a large amount of data. To illustrate the utility of this dataset, we compared the results of eight publicly available skull stripping methods and one publicly available consensus algorithm to the results of a supervised classifier. A linear mixed effects model analysis indicated that vendor (p-value<0.001) and magnetic field strength (p-value<0.001) have statistically significant impacts on skull stripping results. Copyright © 2017 Elsevier Inc. All rights reserved.
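
    The linear mixed effects analysis mentioned above can be sketched with statsmodels: vendor and field strength enter as fixed effects and the subject as a random grouping factor. The Dice-like scores, method names and effect sizes below are simulated placeholders, not CC-359 results, and the exact model specification is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated per-volume overlap scores: 60 subjects per vendor/field-strength combination,
# each processed by three hypothetical skull-stripping methods (repeated measures per subject).
rng = np.random.default_rng(5)
rows = []
methods = ("methodA", "methodB", "methodC")
for vendor in ("GE", "Philips", "Siemens"):
    for field in ("1.5T", "3T"):
        for subj in range(60):
            subj_id = f"{vendor}_{field}_{subj}"
            subj_effect = rng.normal(0, 0.005)          # random subject intercept
            for m_i, method in enumerate(methods):
                dice = (0.95 + 0.005 * m_i
                        + {"GE": 0.0, "Philips": -0.004, "Siemens": 0.003}[vendor]
                        + (0.005 if field == "3T" else 0.0)
                        + subj_effect + rng.normal(0, 0.01))
                rows.append({"vendor": vendor, "field": field, "method": method,
                             "subject": subj_id, "dice": dice})
df = pd.DataFrame(rows)

# Fixed effects for vendor, field strength and method; random intercept per subject.
model = smf.mixedlm("dice ~ vendor + field + method", df, groups=df["subject"]).fit()
print(model.summary())
```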

  17. Improved statistical method for temperature and salinity quality control

    NASA Astrophysics Data System (ADS)

    Gourrion, Jérôme; Szekely, Tanguy

    2017-04-01

    Climate research and ocean monitoring benefit from the continuous development of global in-situ hydrographic networks in the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to historical extreme values ever observed is developed, implemented and validated. Temperature, salinity and potential density validity intervals are directly estimated from minimum and maximum values from an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions on the data distributions such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows simultaneously addressing the two main objectives of an automatic quality control strategy, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to late 2015, 2) 3 historical CTD datasets and 3) the Sea Mammals CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has already been implemented in the latest version of the delayed-time CMEMS in-situ dataset and will be deployed soon in the equivalent near-real time products.

  18. How spatial and temporal rainfall variability affect runoff across basin scales: insights from field observations in the (semi-)urbanised Charlotte watershed

    NASA Astrophysics Data System (ADS)

    Ten Veldhuis, M. C.; Smith, J. A.; Zhou, Z.

    2017-12-01

    Impacts of rainfall variability on runoff response are highly scale-dependent. Sensitivity analyses based on hydrological model simulations have shown that impacts are likely to depend on combinations of storm type, basin versus storm scale, temporal versus spatial rainfall variability. So far, few of these conclusions have been confirmed on observational grounds, since high quality datasets of spatially variable rainfall and runoff over prolonged periods are rare. Here we investigate relationships between rainfall variability and runoff response based on 30 years of radar-rainfall datasets and flow measurements for 16 hydrological basins ranging from 7 to 111 km2. Basins vary not only in scale, but also in their degree of urbanisation. We investigated temporal and spatial variability characteristics of rainfall fields across a range of spatial and temporal scales to identify main drivers for variability in runoff response. We identified 3 ranges of basin size with different temporal versus spatial rainfall variability characteristics. Total rainfall volume proved to be the dominant agent determining runoff response at all basin scales, independent of their degree of urbanisation. Peak rainfall intensity and storm core volume are of secondary importance. This applies to all runoff parameters, including runoff volume, runoff peak, volume-to-peak and lag time. Position and movement of the storm with respect to the basin have a negligible influence on runoff response, with the exception of lag times in some of the larger basins. This highlights the importance of accuracy in rainfall estimation: getting the position right but the volume wrong will inevitably lead to large errors in runoff prediction. Our study helps to identify conditions where rainfall variability matters for correct estimation of the rainfall volume as well as the associated runoff response.

  19. Compositional variations of ignimbrite magmas in the Central Andes over the past 26 Ma - A multivariate statistical perspective

    NASA Astrophysics Data System (ADS)

    Brandmeier, M.; Wörner, G.

    2016-10-01

    Multivariate statistical and geospatial analyses based on a compilation of 890 geochemical and 1200 geochronological data for 194 mapped ignimbrites from the Central Andes document the compositional and temporal patterns of large-volume ignimbrites (so-called "ignimbrite flare-ups") during Neogene times. Rapid advances in computational science during the past decade led to a growing pool of algorithms for multivariate statistics for large datasets with many predictor variables. This study applies cluster analysis (CA) and linear discriminant analysis (LDA) on log-ratio transformed data with the aim of (1) testing a tool for ignimbrite correlation and (2) distinguishing compositional groups that reflect different processes and sources of ignimbrite magmatism during the geodynamic evolution of the Central Andes. CA on major and trace elements allows grouping of ignimbrites according to their geochemical characteristics into rhyolitic and dacitic "end-members" and differentiating characteristic trace element signatures with respect to Eu anomaly, depletions in middle and heavy rare earth elements (REE) and variable enrichments in light REE. To highlight these distinct compositional signatures, we applied LDA to selected ignimbrites for which comprehensive datasets were available. In comparison to traditional geochemical parameters, we found that the advantage of multivariate statistics is their capability of dealing with large datasets and many variables (elements) and of exploiting this n-dimensional space to detect subtle compositional differences contained in the data. The most important predictors for discriminating ignimbrites are La, Yb, Eu, Al2O3, K2O, P2O5, MgO, FeOt, and TiO2. However, other REE such as Gd, Pr, Tm, Sm, Dy and Er also contribute to the discrimination functions. Significant compositional differences were found between (1) the older (> 13 Ma) large-volume plateau-forming ignimbrites in northernmost Chile and southern Peru and (2) the younger (< 10 Ma) Altiplano-Puna-Volcanic-Complex (APVC) ignimbrites that are of similar volumes. Older ignimbrites are less depleted in HREE and less radiogenic in Sr isotopes, indicating smaller crustal contributions during evolution in a thinner and thermally less evolved crust. These compositional variations indicate a relation to crustal thickening with a "transition" from plagioclase to amphibole and garnet residual mineralogy between 13 and 9 Ma. Compositional and volumetric variations correlate to the N-S passage of the Juan Fernández Ridge, crustal shortening and thickening, and increased average crustal temperatures during the past 26 Ma.
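
    A minimal sketch of the workflow described above, log-ratio transformation followed by cluster analysis and linear discriminant analysis, is given below. The two synthetic compositional groups are hypothetical stand-ins for rhyolitic and dacitic end-members and do not reproduce the compiled ignimbrite database.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def clr(compositions):
    """Centred log-ratio transform for compositional (e.g. major-element) data."""
    logx = np.log(compositions)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical major-element compositions (wt%) for two ignimbrite "end-members":
# a rhyolitic group (high SiO2/K2O) and a dacitic group (higher FeO/MgO/TiO2 proxies).
rng = np.random.default_rng(6)
rhyolite = np.abs(rng.normal([74, 13, 1.5, 0.4, 4.5], [1.0, 0.5, 0.3, 0.1, 0.4], size=(60, 5)))
dacite   = np.abs(rng.normal([66, 16, 4.5, 2.0, 3.0], [1.0, 0.5, 0.5, 0.4, 0.4], size=(60, 5)))
X = clr(np.vstack([rhyolite, dacite]))
labels_true = np.array([0] * 60 + [1] * 60)

# (1) unsupervised grouping, (2) supervised discrimination of the known groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
lda = LinearDiscriminantAnalysis().fit(X, labels_true)
print("cluster/group agreement:", max((clusters == labels_true).mean(),
                                       (clusters != labels_true).mean()))
print("LDA training accuracy  :", lda.score(X, labels_true))
```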

  20. Online Visualization and Value Added Services of MERRA-2 Data at GES DISC

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Ostrenga, Dana M.; Vollmer, Bruce E.; Hegde, Mahabaleshwa S.; Wei, Jennifer C.; Bosilovich, Michael G.

    2017-01-01

    NASA climate reanalysis datasets from MERRA-2, distributed at the Goddard Earth Sciences Data and Information Services Center (GES DISC), have been used in broad research areas, such as climate variations, extreme weather, agriculture, renewable energy, and air quality. The datasets contain numerous variables for atmosphere, land, and ocean, grouped into 95 products. The total archived volume is approximately 337 TB (approximately 562K files) at the end of October 2017. Due to the large number of products and files, and the large data volumes, it may be a challenge for a user to find and download the data of interest. The support team at GES DISC, working closely with the MERRA-2 science team, has created and is continuing to work on value-added data services to best meet the needs of a broad user community. This presentation, using aerosol over the Asian monsoon as an example, provides an overview of the MERRA-2 data services at GES DISC, including: How to find the data? How many data access methods are provided? What are the best data access methods for me? How to download subsetted (parameter, spatial, temporal) data and save it in a preferred spatial resolution and data format? How to visualize and explore the data online? In addition, we introduce a future online analytic tool designed for supporting application research, focusing on long-term hourly time-series data access and analysis.

  1. Improving Discoverability of Geophysical Data using Location Based Services

    NASA Astrophysics Data System (ADS)

    Morrison, D.; Barnes, R. J.; Potter, M.; Nylund, S. R.; Patrone, D.; Weiss, M.; Talaat, E. R.; Sarris, T. E.; Smith, D.

    2014-12-01

    The great promise of Virtual Observatories is the ability to perform complex search operations across the metadata of a large variety of different data sets. This allows the researcher to isolate and select the relevant measurements for their topic of study. The Virtual ITM Observatory (VITMO) has many diverse geophysical datasets that cover a large temporal and spatial range that present a unique search problem. VITMO provides many methods by which the user can search for and select data of interest including restricting selections based on geophysical conditions (solar wind speed, Kp, etc) as well as finding those datasets that overlap in time. One of the key challenges in improving discoverability is the ability to identify portions of datasets that overlap in time and in location. The difficulty is that location data is not contained in the metadata for datasets produced by satellites and would be extremely large in volume if it were available, making searching for overlapping data very time consuming. To solve this problem we have developed a series of light-weight web services that can provide a new data search capability for VITMO and others. The services consist of a database of spacecraft ephemerides and instrument fields of view; an overlap calculator to find times when the fields of view of different instruments intersect; and a magnetic field line tracing service that maps in situ and ground based measurements to the equatorial plane in magnetic coordinates for a number of field models and geophysical conditions. These services run in real-time when the user queries for data. They will allow the non-specialist user to select data that they were previously unable to locate, opening up analysis opportunities beyond the instrument teams and specialists, making it easier for future students who come into the field.
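
    The overlap calculator described above reduces, at its core, to intersecting the time intervals during which two instruments have useful coverage. The sketch below shows that core step on hypothetical coverage windows; the actual VITMO services add ephemeris, field-of-view and field-line-tracing computations on top of it.

```python
from datetime import datetime, timedelta

def intersect_intervals(a, b):
    """Times when two instruments' coverage intervals overlap.
    Each input is a time-sorted list of (start, end) tuples."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

t0 = datetime(2014, 1, 1)
hours = lambda h: t0 + timedelta(hours=h)
# Hypothetical coverage windows for a satellite instrument and a ground-based radar.
satellite = [(hours(0), hours(3)), (hours(12), hours(15)), (hours(24), hours(27))]
radar     = [(hours(2), hours(13)), (hours(26), hours(30))]

for start, end in intersect_intervals(satellite, radar):
    print(f"overlap: {start:%Y-%m-%d %H:%M} -> {end:%Y-%m-%d %H:%M}")
```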

  2. Multiline 3D beamforming using micro-beamformed datasets for pediatric transesophageal echocardiography

    NASA Astrophysics Data System (ADS)

    Bera, D.; Raghunathan, S. B.; Chen, C.; Chen, Z.; Pertijs, M. A. P.; Verweij, M. D.; Daeichin, V.; Vos, H. J.; van der Steen, A. F. W.; de Jong, N.; Bosch, J. G.

    2018-04-01

    Until now, no matrix transducer has been realized for 3D transesophageal echocardiography (TEE) in pediatric patients. In 3D TEE with a matrix transducer, the biggest challenges are to connect a large number of elements to a standard ultrasound system, and to achieve a high volume rate (>200 Hz). To address these issues, we have recently developed a prototype miniaturized matrix transducer for pediatric patients with micro-beamforming and a small central transmitter. In this paper we propose two multiline parallel 3D beamforming techniques (µBF25 and µBF169) using the micro-beamformed datasets from 25 and 169 transmit events to achieve volume rates of 300 Hz and 44 Hz, respectively. Both realizations use an angle-weighted combination of neighboring overlapping sub-volumes to avoid artifacts due to sharp intensity changes introduced by parallel beamforming. In simulation, the image quality in terms of the width of the point spread function (PSF), lateral shift invariance and mean clutter level for volumes produced by µBF25 and µBF169 is similar to that of idealized beamforming using a conventional single-line acquisition with a fully-sampled matrix transducer (FS4k, 4225 transmit events). For completeness, we also investigated a 9-transmit scheme (3 × 3) that allows even higher frame rates but found worse B-mode image quality with our probe. The simulations were experimentally verified by acquiring the µBF datasets from the prototype using a Verasonics V1 research ultrasound system. For both µBF169 and µBF25, the experimental PSFs were similar to the simulated PSFs, but in the experimental PSFs, the clutter level was ~10 dB higher. Results indicate that the proposed multiline 3D beamforming techniques with the prototype matrix transducer are promising candidates for real-time pediatric 3D TEE.

  3. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    NASA Astrophysics Data System (ADS)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.

  4. Lensfree diffractive tomography for the imaging of 3D cell cultures

    NASA Astrophysics Data System (ADS)

    Berdeu, Anthony; Momey, Fabien; Dinten, Jean-Marc; Gidrol, Xavier; Picollet-D'hahan, Nathalie; Allier, Cédric

    2017-02-01

    New microscopes are needed to help reach the full potential of 3D organoid culture studies by gathering large quantitative and systematic data over extended periods of time while preserving the integrity of the living sample. In order to reconstruct large volumes while preserving the ability to catch every single cell, we propose new imaging platforms based on lens-free microscopy, a technique that addresses these needs in the context of 2D cell culture, providing label-free and non-phototoxic acquisition of large datasets. We built lens-free diffractive tomography setups performing multi-angle acquisitions of 3D organoid cultures embedded in Matrigel and developed dedicated 3D holographic reconstruction algorithms based on the Fourier diffraction theorem. Nonetheless, holographic setups do not record the phase of the incident wave front, and the biological samples in Petri dishes strongly limit the angular coverage. These limitations introduce numerous artefacts in the sample reconstruction. We developed several methods to overcome them, such as multi-wavelength imaging or iterative phase retrieval. The most promising technique currently developed is based on a regularised inverse problem approach directly applied to the 3D volume to be reconstructed. 3D reconstructions were performed on several complex samples such as 3D networks or spheroids embedded in capsules, with large reconstructed volumes up to 25 mm3 while still being able to identify single cells. To our knowledge, this is the first time that such an inverse problem approach is implemented in the context of lens-free diffractive tomography, enabling the reconstruction of large, fully 3D volumes of unstained biological samples.

  5. Genetically targeted 3D visualisation of Drosophila neurons under Electron Microscopy and X-Ray Microscopy using miniSOG

    PubMed Central

    Ng, Julian; Browning, Alyssa; Lechner, Lorenz; Terada, Masako; Howard, Gillian; Jefferis, Gregory S. X. E.

    2016-01-01

    Large dimension, high-resolution imaging is important for neural circuit visualisation as neurons have both long- and short-range patterns: from axons and dendrites to the numerous synapses at terminal endings. Electron Microscopy (EM) is the favoured approach for synaptic resolution imaging but how such structures can be segmented from high-density images within large volume datasets remains challenging. Fluorescent probes are widely used to localise synapses, identify cell-types and in tracing studies. The equivalent EM approach would benefit visualising such labelled structures from within sub-cellular, cellular, tissue and neuroanatomical contexts. Here we developed genetically-encoded, electron-dense markers using miniSOG. We demonstrate their ability in 1) labelling cellular sub-compartments of genetically-targeted neurons, 2) generating contrast under different EM modalities, and 3) segmenting labelled structures from EM volumes using computer-assisted strategies. We also tested non-destructive X-ray imaging on whole Drosophila brains to evaluate contrast staining. This enabled us to target specific regions for EM volume acquisition. PMID:27958322

  6. Quantification of the thorax-to-abdomen breathing ratio for breathing motion modeling.

    PubMed

    White, Benjamin M; Zhao, Tianyu; Lamb, James; Bradley, Jeffrey D; Low, Daniel A

    2013-06-01

    The purpose of this study was to develop a methodology to quantitatively measure the thorax-to-abdomen breathing ratio from a 4DCT dataset for breathing motion modeling and breathing motion studies. The thorax-to-abdomen breathing ratio was quantified by measuring the rate of cross-sectional volume increase throughout the thorax and abdomen as a function of tidal volume. Twenty-six 16-slice 4DCT patient datasets were acquired during quiet respiration using a protocol that acquired 25 ciné scans at each couch position. Fifteen datasets included data from the neck through the pelvis. Tidal volume, measured using a spirometer, and an abdominal pneumatic bellows signal were used as breathing-cycle surrogates. The cross-sectional volume encompassed by the skin contour, when compared for each CT slice against the tidal volume, exhibited a nearly linear relationship. A robust iteratively reweighted least squares regression analysis was used to determine η(i), defined as the amount of cross-sectional volume expansion at each slice i per unit tidal volume. The sum Ση(i) throughout all slices was predicted to be the ratio of the geometric expansion of the lung and the tidal volume: 1.11. The Xiphoid process was selected as the boundary between the thorax and abdomen. The Xiphoid process slice was identified in a scan acquired at mid-inhalation. The imaging protocol had not originally been designed for the purpose of measuring the thorax-to-abdomen breathing ratio, so the scans did not extend to the anatomy with η(i) = 0. Extrapolation of η(i) to η(i) = 0 was used to include the entire breathing volume. The thorax and abdomen regions were individually analyzed to determine the thorax-to-abdomen breathing ratios. There were 11 image datasets that had been scanned only through the thorax. For these cases, the abdomen breathing component was equal to 1.11 - Ση(i), where the sum was taken throughout the thorax. The average Ση(i) for thorax and abdomen image datasets was found to be 1.20 ± 0.17, close to the expected value of 1.11. The thorax-to-abdomen breathing ratio was 0.32 ± 0.24. The average Ση(i) was 0.26 ± 0.14 in the thorax and 0.93 ± 0.22 in the abdomen. In the scan datasets that encompassed only the thorax, the average Ση(i) was 0.21 ± 0.11. A method to quantify the relationship between abdomen and thoracic breathing was developed and characterized.
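
    The per-slice expansion coefficient η(i) described above is the robust (iteratively reweighted least squares) slope of cross-sectional volume against tidal volume. The sketch below estimates η(i) for synthetic slices and sums it over thorax and abdomen regions; the slice counts, noise levels and true slopes are hypothetical placeholders, not the study's data.

```python
import numpy as np
import statsmodels.api as sm

def eta_per_slice(cross_section_volumes, tidal_volumes):
    """Robust (iteratively reweighted least squares) slope of cross-sectional
    volume against tidal volume for each CT slice: eta(i) in the record above."""
    etas = []
    X = sm.add_constant(tidal_volumes)
    for i in range(cross_section_volumes.shape[1]):
        fit = sm.RLM(cross_section_volumes[:, i], X).fit()   # Huber weighting by default
        etas.append(fit.params[1])
    return np.array(etas)

# Hypothetical data: 25 cine acquisitions x 40 slices, slice volumes expanding with tidal volume.
rng = np.random.default_rng(7)
tidal = rng.uniform(200, 800, size=25)                                  # ml
true_eta = np.concatenate([np.full(20, 0.010), np.full(20, 0.035)])     # thorax vs abdomen slices
baseline = rng.uniform(80, 120, size=40)                                # ml per slice at end exhalation
volumes = baseline + np.outer(tidal, true_eta) + rng.normal(0, 2.0, size=(25, 40))

eta = eta_per_slice(volumes, tidal)
thorax, abdomen = eta[:20].sum(), eta[20:].sum()
print(f"sum eta (thorax) = {thorax:.2f}, sum eta (abdomen) = {abdomen:.2f}, "
      f"total = {thorax + abdomen:.2f}, thorax-to-abdomen ratio = {thorax / abdomen:.2f}")
```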

  7. Stand-volume estimation from multi-source data for coppiced and high forest Eucalyptus spp. silvicultural systems in KwaZulu-Natal, South Africa

    NASA Astrophysics Data System (ADS)

    Dube, Timothy; Sibanda, Mbulisi; Shoko, Cletah; Mutanga, Onisimo

    2017-10-01

    Forest stand volume is one of the crucial stand parameters, which influences the ability of these forests to provide ecosystem goods and services. This study thus aimed at examining the potential of integrating multispectral SPOT 5 imagery with ancillary data (forest age and rainfall metrics) in estimating stand volume between coppiced and planted Eucalyptus spp. in KwaZulu-Natal, South Africa. To achieve this objective, the Partial Least Squares Regression (PLSR) algorithm was used. The PLSR algorithm was implemented in three analysis stages: stage I, using ancillary data as an independent dataset; stage II, using the SPOT 5 spectral bands as an independent dataset; and stage III, using the combined SPOT 5 spectral bands and ancillary data. The results of the study showed that the use of an independent ancillary dataset better explained the volume of Eucalyptus spp. growing from coppices (adjusted R2 (R2Adj) = 0.54, RMSEP = 44.08 m3/ha) when compared with those that were planted (R2Adj = 0.43, RMSEP = 53.29 m3/ha). Similar results were also observed when the SPOT 5 spectral bands were applied as an independent dataset, whereas improved volume estimates were produced when using the combined dataset. For instance, planted Eucalyptus spp. were better predicted (R2 = 0.77, R2Adj = 0.59, RMSEP = 36.02 m3/ha) when compared with those growing from coppices (R2 = 0.76, R2Adj = 0.46, RMSEP = 40.63 m3/ha). Overall, the findings of this study demonstrated the relevance of multi-source data in ecosystems modelling.
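
    A minimal sketch of the PLSR workflow, predicting stand volume from spectral bands combined with ancillary predictors and reporting R2 and RMSEP, is shown below. The synthetic band reflectances, age and rainfall values are hypothetical placeholders, not the SPOT 5 data used in the study.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical predictors: 4 band reflectances + stand age + a rainfall metric,
# with stand volume loosely driven by age and the near-infrared band.
rng = np.random.default_rng(8)
n = 120
bands = rng.uniform(0.05, 0.45, size=(n, 4))
age = rng.uniform(2, 10, size=n)                      # years
rain = rng.uniform(600, 1100, size=n)                 # mm/yr
volume = 25 * age + 150 * bands[:, 3] + 0.05 * rain + rng.normal(0, 25, n)   # m3/ha

X = np.column_stack([bands, age, rain])
X_tr, X_te, y_tr, y_te = train_test_split(X, volume, test_size=0.3, random_state=0)

pls = PLSRegression(n_components=3).fit(X_tr, y_tr)
pred = pls.predict(X_te).ravel()
rmsep = np.sqrt(mean_squared_error(y_te, pred))
print(f"combined dataset: R2 = {r2_score(y_te, pred):.2f}, RMSEP = {rmsep:.1f} m3/ha")
```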

  8. A tool for the estimation of the distribution of landslide area in R

    NASA Astrophysics Data System (ADS)

    Rossi, M.; Cardinali, M.; Fiorucci, F.; Marchesini, I.; Mondini, A. C.; Santangelo, M.; Ghosh, S.; Riguer, D. E. L.; Lahousse, T.; Chang, K. T.; Guzzetti, F.

    2012-04-01

    We have developed a tool in R (the free software environment for statistical computing, http://www.r-project.org/) to estimate the probability density and the frequency density of landslide area. The tool implements parametric and non-parametric approaches to the estimation of the probability density and the frequency density of landslide area, including: (i) Histogram Density Estimation (HDE), (ii) Kernel Density Estimation (KDE), and (iii) Maximum Likelihood Estimation (MLE). The tool is available as a standard Open Geospatial Consortium (OGC) Web Processing Service (WPS), and is accessible through the web using different GIS software clients. We tested the tool to compare Double Pareto and Inverse Gamma models for the probability density of landslide area in different geological, morphological and climatological settings, and to compare landslides shown in inventory maps prepared using different mapping techniques, including (i) field mapping, (ii) visual interpretation of monoscopic and stereoscopic aerial photographs, (iii) visual interpretation of monoscopic and stereoscopic VHR satellite images and (iv) semi-automatic detection and mapping from VHR satellite images. Results show that both models are applicable in different geomorphological settings. In most cases the two models provided very similar results. Non-parametric estimation methods (i.e., HDE and KDE) provided reasonable results for all the tested landslide datasets. For some of the datasets, MLE failed to provide a result due to convergence problems. The two tested models (Double Pareto and Inverse Gamma) produced very similar results for large and very large datasets (> 150 samples). Differences in the modeling results were observed for small datasets affected by systematic biases. A distinct rollover was observed in all analyzed landslide datasets, except for a few datasets obtained from landslide inventories prepared through field mapping or by semi-automatic mapping from VHR satellite imagery. The tool can also be used to evaluate the probability density and the frequency density of landslide volume.
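
    The tool described above is implemented in R as an OGC web processing service; purely as an illustration of the underlying estimators, the sketch below contrasts a non-parametric kernel density estimate with a maximum-likelihood Inverse Gamma fit on synthetic, heavy-tailed landslide areas. The parameter values and probe areas are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical landslide areas (m^2): heavy-tailed, as typically observed in inventories.
rng = np.random.default_rng(9)
areas = stats.invgamma.rvs(a=1.6, scale=6e3, size=800, random_state=rng)

# Non-parametric estimate: Gaussian KDE in log-space (areas span orders of magnitude).
log_a = np.log10(areas)
kde = stats.gaussian_kde(log_a)

# Parametric estimate: maximum-likelihood fit of an Inverse Gamma model.
shape, loc, scale = stats.invgamma.fit(areas, floc=0)
print(f"MLE Inverse Gamma: shape = {shape:.2f}, scale = {scale:.0f}")

# Compare the two estimates at a few probe areas.  Note the different units:
# the KDE is a density per unit log10-area, the parametric pdf per unit area.
for a in (1e3, 1e4, 1e5):
    print(f"area {a:8.0f} m2: KDE density (log10 space) = {kde(np.log10(a))[0]:.3f}, "
          f"Inverse Gamma pdf = {stats.invgamma.pdf(a, shape, loc, scale):.2e}")
```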

  9. Generation of open biomedical datasets through ontology-driven transformation and integration processes.

    PubMed

    Carmen Legaz-García, María Del; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2016-06-03

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating content readable by machines. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine readable semantic formats. We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on the mappings between the entities of the data schema and the ontological infrastructure that provides the meaning to the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be automatically applied to datasets to generate logically consistent content and the mappings can be reused in further transformation processes. The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open, biomedical datasets; (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and some lessons learned. We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT is able to apply the Linked Open Data principles in the generation of the datasets, so allowing for linking their content to external repositories and creating linked open datasets. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.

  10. An Approach to Data Center-Based KDD of Remote Sensing Datasets

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Mack, Robert; Wharton, Stephen W. (Technical Monitor)

    2001-01-01

    The data explosion in remote sensing is straining the ability of data centers to deliver the data to the user community, yet many large-volume users actually seek a relatively small information component within the data, which they extract at their sites using Knowledge Discovery in Databases (KDD) techniques. To improve the efficiency of this process, the Goddard Earth Sciences Distributed Active Archive Center (GES DAAC) has implemented a KDD subsystem that supports execution of the user's KDD algorithm at the data center, dramatically reducing the volume that is sent to the user. The data are extracted from the archive in a planned, organized "campaign"; the algorithms are executed, and the output products sent to the users over the network. The first campaign, now complete, has resulted in overall reductions in shipped volume from 3.3 TB to 0.4 TB.

  11. Precise segmentation of multiple organs in CT volumes using learning-based approach and information theory.

    PubMed

    Lu, Chao; Zheng, Yefeng; Birkbeck, Neil; Zhang, Jingdan; Kohlberger, Timo; Tietjen, Christian; Boettger, Thomas; Duncan, James S; Zhou, S Kevin

    2012-01-01

    In this paper, we present a novel method that incorporates information theory into a learning-based approach for automatic and accurate pelvic organ segmentation (including the prostate, bladder and rectum). We target 3D CT volumes that are generated using different scanning protocols (e.g., contrast and non-contrast, with and without implant in the prostate, various resolutions and positions), and the volumes come from largely diverse sources (e.g., disease in different organs). Three key ingredients are combined to solve this challenging segmentation problem. First, marginal space learning (MSL) is applied to efficiently and effectively localize the multiple organs in the largely diverse CT volumes. Second, learning techniques based on steerable features are applied for robust boundary detection. This enables handling of highly heterogeneous texture patterns. Third, a novel information theoretic scheme is incorporated into the boundary inference process. The incorporation of the Jensen-Shannon divergence further drives the mesh to the best fit of the image, thus improving the segmentation performance. The proposed approach is tested on a challenging dataset containing 188 volumes from diverse sources. Our approach not only produces excellent segmentation accuracy, but also runs about eighty times faster than previous state-of-the-art solutions. The proposed method can be applied to CT images to provide visual guidance to physicians during computer-aided diagnosis, treatment planning and image-guided radiotherapy to treat cancers in the pelvic region.
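
    The Jensen-Shannon divergence used in the boundary inference step measures how far a candidate region's intensity distribution is from a learned reference distribution. The sketch below illustrates that comparison on hypothetical intensity histograms; it is not the paper's segmentation pipeline.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def intensity_histogram(samples, bins, value_range):
    """Normalised intensity histogram used as a discrete probability distribution."""
    hist, _ = np.histogram(samples, bins=bins, range=value_range)
    return hist / hist.sum()

# Hypothetical intensity samples inside vs outside a candidate organ boundary,
# compared against a learned reference distribution for the organ interior.
rng = np.random.default_rng(10)
reference_interior = intensity_histogram(rng.normal(60, 12, 5000), 64, (0, 255))
candidate_inside   = intensity_histogram(rng.normal(62, 13, 800), 64, (0, 255))
candidate_outside  = intensity_histogram(rng.normal(120, 20, 800), 64, (0, 255))

# scipy returns the JS *distance* (square root of the divergence).
for name, hist in (("inside", candidate_inside), ("outside", candidate_outside)):
    d = jensenshannon(reference_interior, hist, base=2.0)
    print(f"{name:7s}: JS divergence = {d**2:.3f}  (lower = better fit to organ interior)")
```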

  12. A new estimate of the volume and distribution of gas hydrate in the northern Gulf of Mexico

    NASA Astrophysics Data System (ADS)

    Majumdar, U.; Cook, A.

    2016-12-01

    In spite of the wealth of information gained over the last several decades about gas hydrate in the northern Gulf of Mexico, there is still considerable uncertainty about the distribution and volume of gas hydrate. In our assessment we build a dataset of basin-wide gas hydrate distribution and thickness, as appraised from publicly available petroleum industry well logs within the gas hydrate stability zone (HSZ), and subsequently develop a Monte Carlo simulation to determine the volumetric estimate of gas hydrate using the dataset. We evaluate the presence of gas hydrate from electrical resistivity well logs, and categorize possible reservoir types (either sand or clay) based on the gamma ray response and resistivity curve characteristics. Out of the 798 wells with resistivity well log data within the HSZ that we analyzed, we found evidence of gas hydrate in 124 wells. In this research we present a new stochastic estimate of the gas hydrate volume in the northern Gulf of Mexico guided by our well log dataset. For our Monte Carlo simulation, we divided our assessment area of 200,000 km2 into 1 km2 grid cells. Our volume assessment model incorporates variables unique to our well log dataset, such as the likelihood of gas hydrate occurrence, the fraction of the HSZ occupied by gas hydrate, the reservoir type, and the gas hydrate saturation depending on the reservoir, in each grid cell, in addition to other basic variables such as HSZ thickness and porosity. Preliminary results from our model suggest that the total volume of gas at standard temperature and pressure in gas hydrate in the northern Gulf of Mexico is in the range of 430 trillion cubic feet (TCF) to 730 TCF, with a mean volume of 585 TCF. While our well log dataset found gas hydrate in sand reservoirs in only 30 of the 124 wells with evidence of gas hydrate (~24%), we find that sand reservoirs contain over half of the total volume of gas hydrate in the Gulf of Mexico, as a result of the relatively high gas hydrate saturation in sand.
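
    The grid-cell Monte Carlo logic described above can be sketched compactly: per-cell occurrence, stability zone thickness, reservoir type, porosity and saturation are drawn from distributions and combined into an in-place gas volume. All ranges below are hypothetical placeholders rather than the study's calibrated, well-log-derived inputs.

```python
import numpy as np

rng = np.random.default_rng(11)
n_cells, n_trials = 200_000, 500            # 1 km^2 cells, Monte Carlo realisations
cell_area_m2 = 1.0e6
expansion_factor = 160.0                    # m3 of gas (STP) per m3 of hydrate (typical literature value)

totals = np.empty(n_trials)
for t in range(n_trials):
    occurrence = rng.random(n_cells) < rng.uniform(0.2, 0.3)          # hydrate present in cell?
    hsz_thickness = rng.uniform(200.0, 600.0, n_cells)                # m
    hydrate_fraction = rng.uniform(0.02, 0.06, n_cells)               # fraction of HSZ hosting hydrate
    porosity = rng.uniform(0.35, 0.55, n_cells)
    is_sand = rng.random(n_cells) < 0.24                              # reservoir type per cell
    saturation = np.where(is_sand,
                          rng.uniform(0.50, 0.80, n_cells),           # sands: high saturation
                          rng.uniform(0.02, 0.10, n_cells))           # clays: low saturation
    hydrate_m3 = (occurrence * cell_area_m2 * hsz_thickness *
                  hydrate_fraction * porosity * saturation)
    totals[t] = hydrate_m3.sum() * expansion_factor                   # m3 of gas at STP

tcf = totals / 2.83168e10                   # 1 TCF = 2.83168e10 m3
print(f"gas in place: mean = {tcf.mean():.0f} TCF, "
      f"5th-95th percentile = {np.percentile(tcf, 5):.0f}-{np.percentile(tcf, 95):.0f} TCF")
```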

  13. Epithelium-Stroma Classification via Convolutional Neural Networks and Unsupervised Domain Adaptation in Histopathological Images.

    PubMed

    Huang, Yue; Zheng, Han; Liu, Chi; Ding, Xinghao; Rohde, Gustavo K

    2017-11-01

    Epithelium-stroma classification is a necessary preprocessing step in histopathological image analysis. Current deep learning based recognition methods for histology data require collection of large volumes of labeled data in order to train a new neural network when there are changes to the image acquisition procedure. However, it is extremely expensive for pathologists to manually label sufficient volumes of data for each pathology study in a professional manner, which results in limitations in real-world applications. A simple but effective deep learning method that introduces the concept of unsupervised domain adaptation to a convolutional neural network (CNN) is proposed in this paper. Inspired by transfer learning, our paper assumes that the training data and testing data follow different distributions, and there is an adaptation operation to more accurately estimate the kernels in the CNN for feature extraction, in order to enhance performance by transferring knowledge from labeled data in the source domain to unlabeled data in the target domain. The model has been evaluated using three independent public epithelium-stroma datasets by cross-dataset validations. The experimental results demonstrate that for epithelium-stroma classification, the proposed framework outperforms the state-of-the-art deep neural network model, and it also achieves better performance than other existing deep domain adaptation methods. The proposed model can be considered to be a better option for real-world applications in histopathological image analysis, since there is no longer a requirement for large-scale labeled data in each specified domain.

  14. Data Prospecting Framework - a new approach to explore "big data" in Earth Science

    NASA Astrophysics Data System (ADS)

    Ramachandran, R.; Rushing, J.; Lin, A.; Kuo, K.

    2012-12-01

    Due to advances in sensors, computation and storage, the cost and effort required to produce large datasets have been significantly reduced. As a result, we are seeing a proliferation of large-scale data sets being assembled in almost every science field, especially in geosciences. Opportunities to exploit the "big data" are enormous as new hypotheses can be generated by combining and analyzing large amounts of data. However, such a data-driven approach to science discovery assumes that scientists can find and isolate relevant subsets from vast amounts of available data. Current Earth Science data systems only provide data discovery through simple metadata and keyword-based searches and are not designed to support data exploration capabilities based on the actual content. Consequently, scientists often find themselves downloading large volumes of data, struggling with large amounts of storage and learning new analysis technologies that will help them separate the wheat from the chaff. New mechanisms of data exploration are needed to help scientists discover the relevant subsets. We present data prospecting, a new content-based data analysis paradigm to support data-intensive science. Data prospecting allows the researchers to explore big data in determining and isolating data subsets for further analysis. This is akin to geo-prospecting in which mineral sites of interest are determined over the landscape through screening methods. The resulting "data prospects" only provide an interaction with and feel for the data through first-look analytics; the researchers would still have to download the relevant datasets and analyze them deeply using their favorite analytical tools to determine if the datasets will yield new hypotheses. Data prospecting combines two traditional categories of data analysis, data exploration and data mining, within the discovery step. Data exploration utilizes manual/interactive methods for data analysis such as standard statistical analysis and visualization, usually on small datasets. On the other hand, data mining utilizes automated algorithms to extract useful information. Humans guide these automated algorithms and specify algorithm parameters (training samples, clustering size, etc.). Data prospecting combines these two approaches using high performance computing and new techniques for efficient distributed file access.

  15. GTN-G, WGI, RGI, DCW, GLIMS, WGMS, GCOS - What's all this about? (Invited)

    NASA Astrophysics Data System (ADS)

    Paul, F.; Raup, B. H.; Zemp, M.

    2013-12-01

    In a large collaborative effort, the glaciological community has compiled a new and spatially complete global dataset of glacier outlines, the so-called Randolph Glacier Inventory or RGI. Despite its regional shortcomings in quality (e.g. in regard to geolocation, generalization, and interpretation), this dataset was heavily used for global-scale modelling applications (e.g. determination of total glacier volume and glacier contribution to sea-level rise) in support of the forthcoming 5th Assessment Report (AR5) of Working Group I of the IPCC. The RGI is a merged dataset that is largely based on the GLIMS database and several new datasets provided by the community (both are mostly derived from satellite data), as well as the Digital Chart of the World (DCW) and glacier attribute information (location, size) from the World Glacier Inventory (WGI). There are now two key tasks to be performed: (1) improving the quality of the RGI in all regions where the outlines do not meet the quality required for local-scale applications, and (2) integrating the RGI into the GLIMS glacier database to improve its spatial completeness. While (1) again requires a huge effort and is already ongoing, (2) is mainly a technical issue that is nearly solved. Apart from this technical dimension, there is also a more political or structural one. While GLIMS is responsible for the remote sensing and glacier inventory part (Tier 5) of the Global Terrestrial Network for Glaciers (GTN-G) within the Global Climate Observing System (GCOS), the World Glacier Monitoring Service (WGMS) is collecting and disseminating the field observations. Given the new global products derived from satellite data (e.g. elevation changes and velocity fields) and the community's wish to keep a snapshot dataset such as the RGI available, it must now be discussed how to make all these datasets available to the community without duplicating efforts while making the best use of the very limited financial resources. This overview presentation describes the currently available datasets, clarifies the terminology and the international framework, and suggests a way forward to best serve the community.

  16. Dynamic analysis, transformation, dissemination and applications of scientific multidimensional data in ArcGIS Platform

    NASA Astrophysics Data System (ADS)

    Shrestha, S. R.; Collow, T. W.; Rose, B.

    2016-12-01

    Scientific datasets are generated from various sources and platforms, but they are typically produced either by earth observation systems or by modelling systems. These are widely used for monitoring, simulating, or analyzing measurements that are associated with physical, chemical, and biological phenomena over the ocean, atmosphere, or land. A significant subset of scientific datasets stores values directly as rasters or in a form that can be rasterized, where a value exists at every cell in a regular grid spanning the spatial extent of the dataset. Government agencies like NOAA, NASA, EPA, and USGS produce large volumes of near real-time, forecast, and historical data that drive climatological and meteorological studies, and underpin operations ranging from weather prediction to sea ice loss. Modern science is computationally intensive because of the availability of an enormous amount of scientific data, the adoption of data-driven analysis, and the need to share these datasets and research results with the public. ArcGIS as a platform is sophisticated and capable of handling such complex domains. We'll discuss constructs and capabilities applicable to multidimensional gridded data that can be conceptualized as a multivariate space-time cube. Building on the concept of a two-dimensional raster, a typical multidimensional raster dataset could contain several "slices" within the same spatial extent. We will share a case from the NOAA Climate Forecast System Reanalysis (CFSR) multidimensional data as an example of how large collections of rasters can be efficiently organized and managed through a data model within a geodatabase called a "mosaic dataset" and dynamically transformed and analyzed using raster functions. A raster function is a lightweight, raster-valued transformation defined over a mixed set of raster and scalar inputs; just like any tool, you can provide a raster function with input parameters. It enables dynamic processing of only the data that is being displayed on the screen or requested by an application. We will present the dynamic processing and analysis of CFSR data using chains of raster functions and share it as a dynamic multidimensional image service. This workflow and these capabilities can be easily applied to any scientific data format that is supported in a mosaic dataset.
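
    The "process only what is requested" behaviour of a raster-function chain can be illustrated, outside of ArcGIS, with a small lazy-evaluation sketch: each step records its transformation and only computes values for the window a client asks for. This is a conceptual analogy under assumed names (LazyRaster, the synthetic grid), not the arcpy API.

    ```python
    import numpy as np

    class LazyRaster:
        """Conceptual analogue of a raster function: a chainable transformation that
        is only evaluated for the window a client actually requests."""
        def __init__(self, source, func=lambda a: a):
            self.source, self.func = source, func

        def then(self, func):
            # Chain another transformation; nothing is computed yet.
            return LazyRaster(self, func)

        def read(self, window):
            # window = (row_slice, col_slice): only this sub-extent is processed.
            base = self.source[window] if isinstance(self.source, np.ndarray) else self.source.read(window)
            return self.func(base)

    # Synthetic "temperature" grid (kelvin); the chain converts to Celsius and masks sub-zero cells.
    grid = 250.0 + 50.0 * np.random.default_rng(0).random((4096, 4096), dtype=np.float32)
    chain = LazyRaster(grid).then(lambda a: a - 273.15).then(lambda a: np.where(a < 0, np.nan, a))

    tile = chain.read((slice(0, 256), slice(0, 256)))   # only a 256 x 256 tile is computed
    print(tile.shape, float(np.nanmean(tile)))
    ```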

  17. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to large datasets in biological and medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106
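
    For readers who want to see the "highest probable topic assignment" step concretely, here is a hedged scikit-learn sketch on synthetic count data (the topic count and data are illustrative, not those of the study): fit an LDA model and use each sample's most probable latent topic as its cluster label.

    ```python
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(0)
    # Synthetic non-negative count matrix standing in for a large biological dataset
    # (e.g. samples x binned PFGE band features).
    X = rng.poisson(lam=2.0, size=(300, 50))

    lda = LatentDirichletAllocation(n_components=5, random_state=0)
    doc_topic = lda.fit_transform(X)          # rows: per-sample topic distributions

    # "Highest probable topic assignment": the dominant latent topic becomes the cluster label.
    clusters = doc_topic.argmax(axis=1)
    print(np.bincount(clusters))              # cluster sizes
    ```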

  18. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to large datasets in biological and medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.

  19. Scalable isosurface visualization of massive datasets on commodity off-the-shelf clusters

    PubMed Central

    Bajaj, Chandrajit

    2009-01-01

    Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by using parallel isosurface extraction and rendering, in conjunction with a new specialized piece of image compositing hardware called Metabuffer. In this paper, we focus on back-end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data to parallel disks with a balanced workload spectrum, and builds I/O-optimal external interval trees to minimize the number of I/O operations needed to load large data from disk. We also describe an isosurface compression scheme that is efficient for progressive extraction, transmission and storage of isosurfaces. PMID:19756231
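
    The I/O-minimisation idea, loading only the data blocks whose value range straddles the isovalue, can be sketched in a few lines (block ranges are synthetic; the paper's external interval trees would replace the linear scan for out-of-core performance):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    # Per-block (min, max) value ranges, e.g. precomputed for 4096 disk-resident bricks.
    block_min = rng.uniform(0.0, 0.8, size=4096)
    block_max = block_min + rng.uniform(0.0, 0.4, size=4096)

    isovalue = 0.75
    # A block can contribute to the isosurface only if min <= isovalue <= max,
    # so only these blocks need to be read from disk.
    active = np.flatnonzero((block_min <= isovalue) & (isovalue <= block_max))
    print(f"{active.size} of {block_min.size} blocks must be loaded")
    ```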

  20. Efficient Record Linkage Algorithms Using Complete Linkage Clustering.

    PubMed

    Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar

    2016-01-01

    Datasets from different agencies often share data on the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. Many of the available algorithms for record linkage are prone to either time inefficiency or low accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete-linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy while consuming reasonable run times.
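
    A toy sketch of the blocking plus complete-linkage idea (records, blocking key and threshold are illustrative, not the authors' implementation): records are blocked on a cheap key, and within each block a complete-linkage dendrogram is cut at a distance threshold so every pair inside a cluster is similar.

    ```python
    from difflib import SequenceMatcher
    from itertools import combinations
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    records = ["jon smith 1980", "john smith 1980", "john smyth 1980",
               "mary jones 1975", "marie jones 1975", "peter brown 1990"]

    def dist(a, b):
        # String dissimilarity in [0, 1]; a real system would compare structured fields.
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    # Blocking: only records that share a cheap key (first letter + year) are compared.
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault((rec[0], rec.split()[-1]), []).append(i)

    for key, idx in blocks.items():
        if len(idx) == 1:
            print(key, [records[idx[0]]])
            continue
        # Condensed pairwise distance vector for this block, in pdist order.
        d = np.array([dist(records[i], records[j]) for i, j in combinations(idx, 2)])
        labels = fcluster(linkage(d, method="complete"), t=0.35, criterion="distance")
        for c in sorted(set(labels)):
            print(key, [records[i] for i, lab in zip(idx, labels) if lab == c])
    ```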

  1. Efficient Record Linkage Algorithms Using Complete Linkage Clustering

    PubMed Central

    Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar

    2016-01-01

    Datasets from different agencies often share data on the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. Many of the available algorithms for record linkage are prone to either time inefficiency or low accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete-linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy while consuming reasonable run times. PMID:27124604

  2. Climate Model Diagnostic Analyzer

    NASA Technical Reports Server (NTRS)

    Lee, Seungwon; Pan, Lei; Zhai, Chengxing; Tang, Benyang; Kubar, Terry; Zhang, Zia; Wang, Wei

    2015-01-01

    The comprehensive and innovative evaluation of climate models with newly available global observations is critically needed for the improvement of climate model current-state representation and future-state predictability. A climate model diagnostic evaluation process requires physics-based multi-variable analyses that typically involve large-volume and heterogeneous datasets, making them both computation- and data-intensive. Given the exploratory nature of climate data analyses and the explosive growth of datasets and service tools, scientists are struggling to keep track of their datasets, tools, and execution/study history, let alone sharing them with others. In response, we have developed a cloud-enabled, provenance-supported, web-service system called Climate Model Diagnostic Analyzer (CMDA). CMDA enables physics-based, multivariable model performance evaluations and diagnoses through the comprehensive and synergistic use of multiple observational data, reanalysis data, and model outputs. At the same time, CMDA provides a crowd-sourcing space where scientists can organize their work efficiently and share their work with others. CMDA is empowered by many current state-of-the-art software packages in web service, provenance, and semantic search.

  3. Re-Organizing Earth Observation Data Storage to Support Temporal Analysis of Big Data

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher

    2017-01-01

    The Earth Observing System Data and Information System archives many datasets that are critical to understanding long-term variations in Earth science properties. Thus, some of these are large, multi-decadal datasets. Yet the challenge in long time series analysis comes less from the sheer volume than the data organization, which is typically one (or a small number of) time steps per file. The overhead of opening and inventorying complex, API-driven data formats such as Hierarchical Data Format introduces a small latency at each time step, which nonetheless adds up for datasets with O(10^6) single-timestep files. Several approaches to reorganizing the data can mitigate this overhead by an order of magnitude: pre-aggregating data along the time axis (time-chunking); storing the data in a highly distributed file system; or storing data in distributed columnar databases. Storing a second copy of the data incurs extra costs, so some selection criteria must be employed, which would be driven by expected or actual usage by the end user community, balanced against the extra cost.
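
    A hedged sketch of the time-chunking approach using xarray (file paths, the "t2m" variable name, dimension order and chunk sizes are all hypothetical): many single-timestep files are aggregated along the time axis and rewritten once with time-contiguous chunks, so a long time series at one grid point touches far fewer I/O units.

    ```python
    import glob
    import xarray as xr

    # Hypothetical archive layout: one NetCDF file per time step.
    paths = sorted(glob.glob("archive/variable_*.nc"))

    # Aggregate along the time dimension (lazy, dask-backed read).
    ds = xr.open_mfdataset(paths, combine="by_coords")

    # Rewrite once with chunks that are long in time and small in space, so a
    # multi-decadal time series at a point reads a handful of chunks instead of
    # opening ~10^6 files. Assumes a 3-D variable ordered (time, lat, lon).
    encoding = {"t2m": {"chunksizes": (len(ds.time), 16, 16), "zlib": True}}
    ds.to_netcdf("archive/t2m_time_chunked.nc", encoding=encoding)
    ```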

  4. Re-organizing Earth Observation Data Storage to Support Temporal Analysis of Big Data

    NASA Astrophysics Data System (ADS)

    Lynnes, C.

    2017-12-01

    The Earth Observing System Data and Information System archives many datasets that are critical to understanding long-term variations in Earth science properties. Thus, some of these are large, multi-decadal datasets. Yet the challenge in long time series analysis comes less from the sheer volume than the data organization, which is typically one (or a small number of) time steps per file. The overhead of opening and inventorying complex, API-driven data formats such as Hierarchical Data Format introduces a small latency at each time step, which nonetheless adds up for datasets with O(10^6) single-timestep files. Several approaches to reorganizing the data can mitigate this overhead by an order of magnitude: pre-aggregating data along the time axis (time-chunking); storing the data in a highly distributed file system; or storing data in distributed columnar databases. Storing a second copy of the data incurs extra costs, so some selection criteria must be employed, which would be driven by expected or actual usage by the end user community, balanced against the extra cost.

  5. Accuracy and robustness evaluation in stereo matching

    NASA Astrophysics Data System (ADS)

    Nguyen, Duc M.; Hanca, Jan; Lu, Shao-Ping; Schelkens, Peter; Munteanu, Adrian

    2016-09-01

    Stereo matching has received a lot of attention from the computer vision community, thanks to its wide range of applications. Despite the large variety of algorithms that have been proposed so far, it is not trivial to select suitable algorithms for the construction of practical systems. One of the main problems is that many algorithms lack sufficient robustness when employed under various operational conditions. This problem is due to the fact that most of the proposed methods in the literature are usually tested and tuned to perform well on one specific dataset. To alleviate this problem, an extensive evaluation in terms of accuracy and robustness of state-of-the-art stereo matching algorithms is presented. Three datasets (Middlebury, KITTI, and MPEG FTV) representing different operational conditions are employed. Based on the analysis, improvements over existing algorithms have been proposed. The experimental results show that our improved versions of cross-based and cost volume filtering algorithms outperform the original versions by large margins on the Middlebury and KITTI datasets. In addition, the latter of the two proposed algorithms ranks among the best local stereo matching approaches on the KITTI benchmark. Under evaluations using specific settings for depth-image-based-rendering applications, our improved belief propagation algorithm is less complex than MPEG's FTV depth estimation reference software (DERS), while yielding similar depth estimation performance. Finally, several conclusions on stereo matching algorithms are also presented.
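
    For readers unfamiliar with the cost-volume formulation referenced above, a bare-bones sum-of-absolute-differences cost volume with winner-take-all disparity selection looks like the sketch below (synthetic images; the improved cross-based and cost-volume-filtering methods of the paper add aggregation and filtering steps on top of this).

    ```python
    import numpy as np

    def sad_cost_volume(left, right, max_disp):
        """Per-pixel matching cost for each candidate disparity (no aggregation)."""
        h, w = left.shape
        cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)
        for d in range(max_disp):
            # Compare left pixel x with right pixel x - d.
            cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])
        return cost

    rng = np.random.default_rng(0)
    left = rng.random((120, 160)).astype(np.float32)
    right = np.roll(left, -3, axis=1)         # synthetic pair with a 3-pixel disparity

    cost = sad_cost_volume(left, right, max_disp=16)
    disparity = cost.argmin(axis=0)           # winner-take-all over the disparity axis
    print(disparity[:, 16:].mean())           # ~3, away from the left border
    ```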

  6. Zircon Age Distributions Provide Magma Fluxes in the Earth's Crust

    NASA Astrophysics Data System (ADS)

    Caricchi, L.; Simpson, G.; Schaltegger, U.

    2014-12-01

    Magma fluxes control the growth of continents, the frequency and magnitude of volcanic eruptions and are important for the genesis of magmatic ore deposits. A significant part of the magma produced in the Earth's mantle solidifies at depth and this limits our capability of determining magma fluxes, which, in turn, compromises our ability to establish a link between global heat transfer and large-scale geological processes. Using thermal modelling in combination with high-precision zircon dating we show that populations of zircon ages provide an accurate means of retrieving magma fluxes. The characteristics of zircon age populations vary significantly and systematically as a function of the flux and total volume of magma accumulated at depth. This new approach provides results that are identical to independent determinations of magma fluxes and volumes of magmatic systems. The analysis of existing age population datasets by our method highlights that porphyry-type deposits, plutons and large eruptions each require magma input over different timescales at characteristic average fluxes.

  7. Development of a global historic monthly mean precipitation dataset

    NASA Astrophysics Data System (ADS)

    Yang, Su; Xu, Wenhui; Xu, Yan; Li, Qingxiang

    2016-04-01

    A global historic precipitation dataset is the basis for climate and water cycle research. Several global historic land surface precipitation datasets have been developed by international data centers such as the US National Climatic Data Center (NCDC), the European Climate Assessment & Dataset project team, and the Met Office, but so far no such dataset has been developed by any research institute in China. In addition, each dataset has its own regional focus, and the existing global precipitation datasets only contain sparse observational stations over China, which may result in uncertainties in East Asian precipitation studies. In order to take comprehensive historic information into account, users might need to employ two or more datasets. However, the non-uniform data formats, data units, station IDs, and so on add extra difficulties for users exploiting these datasets. For this reason, a complete historic precipitation dataset that takes advantage of various datasets has been developed and produced in the National Meteorological Information Center of China. Precipitation observations from 12 sources are aggregated, and the data formats, data units, and station IDs are unified. Duplicated stations with the same ID are identified, with duplicated observations removed. A consistency test, a correlation coefficient test, a significance t-test at the 95% confidence level, and a significance F-test at the 95% confidence level are conducted first to ensure data reliability. Only those datasets that satisfy all the above four criteria are integrated to produce the China Meteorological Administration global precipitation (CGP) historic precipitation dataset version 1.0. It contains observations at 31 thousand stations with 1.87 × 10^7 data records, among which 4152 time series of precipitation are longer than 100 yr. This dataset plays a critical role in climate research due to its large data volume and high station-network density compared to other datasets. Using the Penalized Maximal t-test method, significant inhomogeneity has been detected in historic precipitation datasets at 340 stations. The ratio method is then employed to effectively remove these change points. Global precipitation analysis based on CGP v1.0 shows that rainfall increased during 1901-2013 at a rate of 3.52 ± 0.5 mm (10 yr)^-1, slightly higher than that in the NCDC data. Analysis also reveals distinct long-term trends in different latitude zones.
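
    As a small worked example of the kind of decadal trend estimate quoted above (synthetic annual series, not the CGP data), a least-squares trend per decade with its standard error can be computed as follows:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    years = np.arange(1901, 2014)
    # Synthetic annual precipitation anomalies (mm) with a weak imposed trend.
    precip = 0.35 * (years - years[0]) + rng.normal(0.0, 25.0, years.size)

    res = stats.linregress(years, precip)
    print(f"trend = {res.slope * 10:.2f} ± {res.stderr * 10:.2f} mm (10 yr)^-1, "
          f"p = {res.pvalue:.3g}")
    ```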

  8. Estimation of parameters of dose volume models and their confidence limits

    NASA Astrophysics Data System (ADS)

    van Luijk, P.; Delvigne, T. C.; Schilstra, C.; Schippers, J. M.

    2003-07-01

    Predictions of the normal-tissue complication probability (NTCP) for the ranking of treatment plans are based on fits of dose-volume models to clinical and/or experimental data. In the literature several different fit methods are used. In this work, frequently used methods and techniques for fitting NTCP models to dose-response data to establish dose-volume effects are discussed. The techniques are tested for their usability with dose-volume data and NTCP models. Different methods to estimate the confidence intervals of the model parameters are part of this study. From a critical-volume (CV) model with biologically realistic parameters, a primary dataset was generated, serving as the reference for this study and describable by the NTCP model. The CV model was fitted to this dataset. From the resulting parameters and the CV model, 1000 secondary datasets were generated by Monte Carlo simulation. All secondary datasets were fitted to obtain 1000 parameter sets of the CV model. Thus the 'real' spread in fit results due to statistical spreading in the data is obtained and has been compared with estimates of the confidence intervals obtained by different methods applied to the primary dataset. The confidence limits of the parameters of one dataset were estimated using three methods: the covariance matrix, the jackknife method, and direct evaluation of the likelihood landscape. These results were compared with the spread of the parameters obtained from the secondary parameter sets. For the estimation of confidence intervals on NTCP predictions, three methods were tested. Firstly, propagation of errors using the covariance matrix was used. Secondly, the width of a bundle of curves resulting from parameter sets within the one-standard-deviation region of the likelihood space was investigated. Thirdly, many parameter sets and their likelihoods were used to create a likelihood-weighted probability distribution of the NTCP. It is concluded that for the type of dose-response data used here, only a full likelihood analysis will produce reliable results. The often-used approximations, such as the use of the covariance matrix, produce inconsistent confidence limits on both the parameter sets and the resulting NTCP values.
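
    To make the comparison concrete, here is a hedged sketch using a generic sigmoidal NTCP curve and synthetic dose-response data (not the critical-volume model of the paper): a binomial maximum-likelihood fit with covariance-matrix confidence limits, i.e. the approximation the authors caution against relying on.

    ```python
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    rng = np.random.default_rng(0)
    dose = np.linspace(10, 80, 15)                    # Gy, synthetic dose levels
    n = np.full(dose.size, 40)                        # subjects per dose level
    p_true = expit((dose - 50.0) / 6.0)               # assumed "true" NTCP curve
    events = rng.binomial(n, p_true)

    def neg_log_like(theta):
        d50, k = theta
        p = np.clip(expit((dose - d50) / k), 1e-9, 1 - 1e-9)
        return -np.sum(events * np.log(p) + (n - events) * np.log(1 - p))

    fit = minimize(neg_log_like, x0=[45.0, 5.0], method="BFGS")
    cov = fit.hess_inv                                 # inverse-Hessian (covariance) estimate
    se = np.sqrt(np.diag(cov))
    print(f"D50 = {fit.x[0]:.1f} ± {se[0]:.1f} Gy, k = {fit.x[1]:.1f} ± {se[1]:.1f}")
    # The paper's point: for data like these, a full (profile) likelihood analysis is
    # needed; the covariance-matrix intervals above can be inconsistent.
    ```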

  9. How do you assign persistent identifiers to extracts from large, complex, dynamic data sets that underpin scholarly publications?

    NASA Astrophysics Data System (ADS)

    Wyborn, Lesley; Car, Nicholas; Evans, Benjamin; Klump, Jens

    2016-04-01

    Persistent identifiers in the form of a Digital Object Identifier (DOI) are becoming more mainstream, assigned at both the collection and dataset level. For static datasets, this is a relatively straightforward matter. However, many new data collections are dynamic, with new data being appended, models and derivative products being revised with new data, or the data itself revised as processing methods are improved. Further, because data collections are becoming accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects: they can also easily mix and match data from multiple collections, each of which can have a complex history. Inevitably, extracts from such dynamic data sets underpin scholarly publications, and this presents new challenges. The National Computational Infrastructure (NCI) has been experiencing these issues and making progress towards addressing them. The NCI is a large node of the Research Data Services initiative (RDS) of the Australian Government's research infrastructure, which currently makes available over 10 PBytes of priority research collections, ranging from geosciences, geophysics, environment, and climate, through to astronomy, bioinformatics, and social sciences. Data are replicated to, or are produced at, NCI and then processed there to higher-level data products or directly analysed. Individual datasets range from multi-petabyte computational models and large-volume raster arrays, down to gigabyte-size, ultra-high-resolution datasets. To facilitate access, maximise reuse and enable integration across the disciplines, datasets have been organized on a platform called the National Environmental Research Data Interoperability Platform (NERDIP). Combined, the NERDIP data collections form a rich and diverse asset for researchers: their co-location and standardization optimises the value of existing data, and forms a new resource to underpin data-intensive science. New publication procedures require that a persistent identifier (DOI) be provided for the dataset that underpins the publication. Being able to produce these for data extracts from the NCI data node using only DOIs is proving difficult: preserving a copy of each data extract is not possible due to data scale. A proposal is for researchers to use workflows that capture the provenance of each data extraction, including metadata (e.g., version of the dataset used, the query and time of extraction). In parallel, NCI is now working with the NERDIP dataset providers to ensure that the provenance of data publication is also captured in provenance systems, including references to previous versions and a history of data appended or modified. This proposed solution would require an enhancement to new scholarly publication procedures whereby the reference to the underlying dataset of a scholarly publication would be the persistent identifier of the provenance workflow that created the data extract. In turn, the provenance workflow would itself link to a series of persistent identifiers that, at a minimum, provide complete dataset production transparency and, if required, would facilitate reconstruction of the dataset. Such a solution will require strict adherence to design patterns for provenance representation to ensure that the provenance representation of the workflow does indeed contain the information required to deliver dataset generation transparency and a pathway to reconstruction.

  10. Analyzing How We Do Analysis and Consume Data, Results from the SciDAC-Data Project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ding, P.; Aliaga, L.; Mubarak, M.

    One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.

  11. Analyzing how we do Analysis and Consume Data, Results from the SciDAC-Data Project

    NASA Astrophysics Data System (ADS)

    Ding, P.; Aliaga, L.; Mubarak, M.; Tsaris, A.; Norman, A.; Lyon, A.; Ross, R.

    2017-10-01

    One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.

  12. Towards Universal Screening for Toxoplasmosis: Rapid, Cost-effective and Simultaneous Detection of Toxoplasma Anti-IgG, IgM and IgA Antibodies Using Very Small Serum Volumes

    EPA Pesticide Factsheets

    No dataset associated with this publication. This dataset is associated with the following publication: Augustine, S. Towards Universal Screening for Toxoplasmosis: Rapid, Cost-effective and Simultaneous Detection of Toxoplasma Anti-IgG, IgM and IgA Antibodies Using Very Small Serum Volumes. JOURNAL OF CLINICAL MICROBIOLOGY. American Society for Microbiology, Washington, DC, USA, 56(7): 1-2, (2016).

  13. Sensitivity of quantitative groundwater recharge estimates to volumetric and distribution uncertainty in rainfall forcing products

    NASA Astrophysics Data System (ADS)

    Werner, Micha; Westerhoff, Rogier; Moore, Catherine

    2017-04-01

    Quantitative estimates of recharge due to precipitation excess are an important input to determining sustainable abstraction of groundwater resources, as well as providing one of the boundary conditions required for numerical groundwater modelling. Simple water balance models are widely applied for calculating recharge. In these models, precipitation is partitioned between different processes and stores, including surface runoff and infiltration, storage in the unsaturated zone, evaporation, capillary processes, and recharge to groundwater. Clearly the estimation of recharge amounts will depend on the estimation of precipitation volumes, which may vary depending on the source of precipitation data used. However, the partitioning between the different processes is in many cases governed by (variable) intensity thresholds. This means that the estimates of recharge will be sensitive not only to input parameters such as soil type, texture, land use and potential evaporation, but mainly to the precipitation volume and intensity distribution. In this paper we explore the sensitivity of recharge estimates to differences in precipitation volume and intensity distribution in the rainfall forcing over the Canterbury region in New Zealand. We compare recharge rates and volumes using a simple water balance model that is forced with rainfall and evaporation data from the NIWA Virtual Climate Station Network (VCSN) data (which is considered the reference dataset); the ERA-Interim/WATCH dataset at 0.25 degree and 0.5 degree resolution; the TRMM-3B42 dataset; the CHIRPS dataset; and the recently released MSWEP dataset. Recharge rates are calculated at a daily time step over the 14-year period from 2000 to 2013 for the full Canterbury region, as well as at eight selected points distributed over the region. Lysimeter data with observed estimates of recharge are available at four of these points, as well as recharge estimates from the NGRM model, an independent model constructed using the same base data and forced with the VCSN precipitation dataset. Comparison of the rainfall products shows that there are significant differences in precipitation volume between the forcing products, in the order of 20% at most points. Even more significant differences can be seen, however, in the distribution of precipitation. For the VCSN data, wet days (defined as >0.1 mm precipitation) occur on some 20-30% of days (depending on location). This is reasonably reflected in the TRMM and CHIRPS data, while for the reanalysis-based products some 60% to 80% of days are wet, albeit at lower intensities. These differences are amplified in the recharge estimates. At most points, volumetric differences are in the order of 40-60%, though differences may range over several orders of magnitude. The frequency distributions of recharge also differ significantly, with recharge over 0.1 mm occurring on 4-6% of days for the VCSN, CHIRPS, and TRMM datasets, but on up to around 12% of days for the reanalysis data. Comparison against the lysimeter data shows the estimates to be reasonable, in particular for the reference datasets. Surprisingly, some estimates from the lower-resolution reanalysis datasets are also reasonable, though this does seem to be due to lower recharge per event being compensated by recharge occurring more frequently. These results underline the importance of a correct representation of rainfall volumes, as well as of their distribution, particularly when evaluating possible changes in, for example, precipitation intensity and volume. This holds for precipitation data derived from satellite-based and reanalysis products, but also for interpolated data from gauges, where the distribution of intensities is strongly influenced by the interpolation process.
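
    A minimal single-store water-balance sketch (illustrative parameter values, not the model used in the study) shows why recharge depends on the intensity distribution and not just the precipitation total: the same annual volume delivered as many light events is largely consumed by evaporative demand, while fewer intense events push the soil store past capacity and generate recharge.

    ```python
    import numpy as np

    def annual_recharge(precip, pet=3.0, capacity=30.0):
        """Daily bucket model: recharge is the overflow of a single soil-moisture store (mm)."""
        store, recharge = capacity / 2.0, 0.0
        for p in precip:
            store = max(store + p - pet, 0.0)   # add rain, remove a constant daily ET demand
            if store > capacity:                # excess above the store capacity drains downward
                recharge += store - capacity
                store = capacity
        return recharge

    total = 600.0                                   # identical annual totals (mm) in both cases
    many_light = np.full(365, total / 365)          # light rain every day
    few_heavy = np.zeros(365)
    few_heavy[::61] = total / few_heavy[::61].size  # six intense events

    print("recharge, light events:", round(annual_recharge(many_light), 1), "mm")   # ~0
    print("recharge, heavy events:", round(annual_recharge(few_heavy), 1), "mm")    # several hundred
    ```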

  14. TDat: An Efficient Platform for Processing Petabyte-Scale Whole-Brain Volumetric Images.

    PubMed

    Li, Yuxin; Gong, Hui; Yang, Xiaoquan; Yuan, Jing; Jiang, Tao; Li, Xiangning; Sun, Qingtao; Zhu, Dan; Wang, Zhenyu; Luo, Qingming; Li, Anan

    2017-01-01

    Three-dimensional imaging of whole mammalian brains at single-neuron resolution has generated terabyte (TB)- and even petabyte (PB)-sized datasets. Due to their size, processing these massive image datasets can be hindered by the computer hardware and software typically found in biological laboratories. To fill this gap, we have developed an efficient platform named TDat, which adopts a novel data reformatting strategy based on reading cuboid data and employing parallel computing. In data reformatting, TDat is more efficient than any other software. In data accessing, we adopted parallelization to fully exploit the data transmission capability of the computer. We applied TDat to rigid registration of large-volume data and to neuron tracing in whole-brain data at single-neuron resolution, which has never been demonstrated in other studies. We also showed its compatibility with various computing platforms, image processing software and imaging systems.
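
    The cuboid-read idea, pulling only an axis-aligned sub-volume rather than whole image planes, can be sketched with h5py on a chunked HDF5 volume (the file, dataset name and chunk shape are hypothetical; TDat's own format and parallel I/O layer are not reproduced here):

    ```python
    import h5py

    # Create a small chunked 3D volume standing in for a whole-brain dataset.
    with h5py.File("brain_demo.h5", "w") as f:
        f.create_dataset("volume", shape=(512, 512, 512), dtype="uint16",
                         chunks=(64, 64, 64), compression="gzip")

    # Read a single cuboid; only the chunks overlapping this region are touched on disk.
    with h5py.File("brain_demo.h5", "r") as f:
        cuboid = f["volume"][128:256, 64:192, 300:428]
    print(cuboid.shape, cuboid.dtype)
    ```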

  15. The application of cloud computing to scientific workflows: a study of cost and performance.

    PubMed

    Berriman, G Bruce; Deelman, Ewa; Juve, Gideon; Rynge, Mats; Vöckler, Jens-S

    2013-01-28

    The current model of transferring data from data centres to desktops for analysis will soon be rendered impractical by the accelerating growth in the volume of science datasets. Processing will instead often take place on high-performance servers co-located with data. Evaluations of how new technologies such as cloud computing would support such a new distributed computing model are urgently needed. Cloud computing is a new way of purchasing computing and storage resources on demand through virtualization technologies. We report here the results of investigations of the applicability of commercial cloud computing to scientific computing, with an emphasis on astronomy, including investigations of what types of applications can be run cheaply and efficiently on the cloud, and an example of an application well suited to the cloud: processing a large dataset to create a new science product.

  16. A Practical Approach to Spatiotemporal Data Compression

    NASA Astrophysics Data System (ADS)

    Prudden, Rachel; Robinson, Niall; Arribas, Alberto

    2017-04-01

    Datasets representing the world around us are becoming ever more unwieldy as data volumes grow. This is largely due to increased measurement and modelling resolution, but the problem is often exacerbated when data are stored at spuriously high precisions. In an effort to facilitate analysis of these datasets, computationally intensive calculations are increasingly being performed on specialised remote servers before the reduced data are transferred to the consumer. Due to bandwidth limitations, this often means data are displayed as simple 2D data visualisations, such as scatter plots or images. We present here a novel way to efficiently encode and transmit 4D data fields on-demand so that they can be locally visualised and interrogated. This nascent "4D video" format allows us to more flexibly move the boundary between data server and consumer client. However, it has applications beyond purely scientific visualisation, in the transmission of data to virtual and augmented reality.
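
    One simple way to exploit the "spuriously high precision" observation is scale-and-offset quantisation before standard compression; the sketch below (synthetic field, illustrative 8-bit tolerance) shows the general idea, not the authors' 4D video codec.

    ```python
    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    field = (15.0 + 8.0 * rng.random((50, 180, 360))).astype(np.float32)   # e.g. temperature (degC)

    # Quantise to 8 bits over the field's range: precision ~ range/255, often ample for display.
    lo, hi = float(field.min()), float(field.max())
    quantised = np.round((field - lo) / (hi - lo) * 255).astype(np.uint8)

    raw = zlib.compress(field.tobytes(), 6)
    small = zlib.compress(quantised.tobytes(), 6)
    print(f"float32+zlib: {len(raw)/1e6:.1f} MB   uint8+zlib: {len(small)/1e6:.1f} MB")

    # Client-side reconstruction, accurate to within the quantisation step size.
    restored = lo + quantised.astype(np.float32) / 255 * (hi - lo)
    print("max abs error:", float(np.abs(restored - field).max()))
    ```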

  17. Trend Extraction in Functional Data of Amplitudes of R and T Waves in Exercise Electrocardiogram

    NASA Astrophysics Data System (ADS)

    Cammarota, Camillo; Curione, Mario

    The amplitudes of R and T waves of the electrocardiogram (ECG) recorded during the exercise test show both large inter- and intra-individual variability in response to stress. We analyze a dataset of 65 normal subjects undergoing an ambulatory test. We model the dataset of R and T series in the framework of functional data, assuming that the individual series are realizations of a non-stationary process, centered at the population trend. We test the time variability of this trend by computing a simultaneous confidence band and the zero crossing of its derivative. The analysis shows that the amplitudes of the R and T waves have opposite responses to stress, consisting, respectively, of a bump and a dip at the early recovery stage. Our findings support the existence of a relationship between R and T wave amplitudes and, respectively, diastolic and systolic ventricular volumes.

  18. High-Level Location Based Search Services That Improve Discoverability of Geophysical Data in the Virtual ITM Observatory

    NASA Astrophysics Data System (ADS)

    Schaefer, R. K.; Morrison, D.; Potter, M.; Barnes, R. J.; Nylund, S. R.; Patrone, D.; Aiello, J.; Talaat, E. R.; Sarris, T.

    2015-12-01

    The great promise of Virtual Observatories is the ability to perform complex search operations across the metadata of a large variety of different data sets. This allows the researcher to isolate and select the relevant measurements for their topic of study. The Virtual ITM Observatory (VITMO) has many diverse geophysical datasets that cover a large temporal and spatial range, which presents a unique search problem. VITMO provides many methods by which the user can search for and select data of interest, including restricting selections based on geophysical conditions (solar wind speed, Kp, etc.) as well as finding those datasets that overlap in time. One of the key challenges in improving discoverability is the ability to identify portions of datasets that overlap in time and in location. The difficulty is that location data is not contained in the metadata for datasets produced by satellites and would be extremely large in volume if it were available, making searching for overlapping data very time consuming. To solve this problem we have developed a series of light-weight web services that can provide a new data search capability for VITMO and others. The services consist of a database of spacecraft ephemerides and instrument fields of view; an overlap calculator to find times when the fields of view of different instruments intersect; and a magnetic field line tracing service that maps in situ and ground-based measurements to the equatorial plane in magnetic coordinates for a number of field models and geophysical conditions. These services run in real time when the user queries for data. They will allow the non-specialist user to select data that they were previously unable to locate, opening up analysis opportunities beyond the instrument teams and specialists, and making it easier for future students who come into the field.
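
    The overlap-calculator service can be illustrated with a small two-pointer sweep over sorted time intervals (synthetic coverage windows, not the VITMO implementation):

    ```python
    def interval_overlaps(a, b):
        """Return the time windows where two sorted lists of (start, end) intervals intersect."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            start = max(a[i][0], b[j][0])
            end = min(a[i][1], b[j][1])
            if start < end:                    # the instruments' coverage coincides here
                out.append((start, end))
            # Advance whichever interval finishes first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
        return out

    # Hypothetical coverage windows (hours) for two instruments.
    sat_a = [(0, 4), (6, 10), (15, 20)]
    sat_b = [(3, 7), (9, 16)]
    print(interval_overlaps(sat_a, sat_b))     # [(3, 4), (6, 7), (9, 10), (15, 16)]
    ```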

  19. A method for examining the geospatial distribution of CO2 storage resources applied to the Pre-Punta Gorda Composite and Dollar Bay reservoirs of the South Florida Basin, U.S.A

    USGS Publications Warehouse

    Roberts-Ashby, Tina; Brandon N. Ashby,

    2016-01-01

    This paper demonstrates a geospatial modification of the USGS methodology for assessing geologic CO2 storage resources, applied to the Pre-Punta Gorda Composite and Dollar Bay reservoirs of the South Florida Basin. The study provides a detailed evaluation of porous intervals within these reservoirs and utilizes GIS to evaluate the potential spatial distribution of reservoir parameters and the volume of CO2 that can be stored. This study also shows that incorporating spatial variation of parameters using detailed and robust datasets may improve estimates of storage resources when compared to applying uniform values derived from small datasets across the study area, as many assessment methodologies do. Geospatially derived estimates of storage resources presented here (Pre-Punta Gorda Composite = 105,570 MtCO2; Dollar Bay = 24,760 MtCO2) were greater than those of previous assessments, which was largely attributed to the fact that detailed evaluation of these reservoirs resulted in higher estimates of porosity and net-porous thickness, and areas of high porosity and thick net-porous intervals were incorporated into the model, likely increasing the calculated volume of storage space available for CO2 sequestration. The geospatial method for evaluating CO2 storage resources also provides the ability to identify areas that potentially contain higher volumes of storage resources, as well as areas that might be less favorable.

  20. Unsupervised data mining in nanoscale x-ray spectro-microscopic study of NdFeB magnet

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Duan, Xiaoyue; Yang, Feifei; Antono, Erin

    Novel developments in X-ray based spectro-microscopic characterization techniques have increased the rate of acquisition of spatially resolved spectroscopic data by several orders of magnitude over what was possible a few years ago. This accelerated data acquisition, with high spatial resolution at the nanoscale and sensitivity to subtle differences in chemistry and atomic structure, provides a unique opportunity to investigate hierarchically complex and structurally heterogeneous systems found in functional devices and materials systems. However, handling and analyzing the large volume of data generated poses significant challenges. Here we apply an unsupervised data-mining algorithm known as DBSCAN to study a rare-earth element based permanent magnet material, Nd2Fe14B. We are able to reduce a large spectro-microscopic dataset of over 300,000 spectra to 3 representative spectra, preserving much of the underlying information. Scientists can easily and quickly analyze these three characteristic spectra in detail. Our approach can rapidly provide a concise representation of a large and complex dataset to materials scientists and chemists. For instance, it shows that the surface of a common Nd2Fe14B magnet is chemically and structurally very different from the bulk, suggesting a surface alteration effect, possibly due to corrosion, which could affect the material's overall properties.
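
    For orientation, here is a compact sketch of the DBSCAN reduction step on synthetic spectra (scikit-learn; the eps and min_samples values are illustrative, not those used in the study): each cluster is summarised by its mean spectrum, collapsing many spectra to a handful of characteristic ones.

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # Synthetic dataset: three underlying spectral signatures plus noise.
    energy = np.linspace(0, 1, 120)
    templates = [np.exp(-((energy - c) / 0.05) ** 2) for c in (0.3, 0.5, 0.7)]
    spectra = np.vstack([templates[rng.integers(3)] + rng.normal(0, 0.02, energy.size)
                         for _ in range(3000)])

    labels = DBSCAN(eps=0.5, min_samples=20).fit_predict(spectra)

    # Collapse the dataset to one characteristic (mean) spectrum per cluster; label -1 is noise.
    for lab in sorted(set(labels) - {-1}):
        members = spectra[labels == lab]
        print(f"cluster {lab}: {members.shape[0]} spectra -> 1 mean spectrum")
    ```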

  1. Unsupervised data mining in nanoscale x-ray spectro-microscopic study of NdFeB magnet

    DOE PAGES

    Duan, Xiaoyue; Yang, Feifei; Antono, Erin; ...

    2016-09-29

    Novel developments in X-ray based spectro-microscopic characterization techniques have increased the rate of acquisition of spatially resolved spectroscopic data by several orders of magnitude over what was possible a few years ago. This accelerated data acquisition, with high spatial resolution at the nanoscale and sensitivity to subtle differences in chemistry and atomic structure, provides a unique opportunity to investigate hierarchically complex and structurally heterogeneous systems found in functional devices and materials systems. However, handling and analyzing the large volume of data generated poses significant challenges. Here we apply an unsupervised data-mining algorithm known as DBSCAN to study a rare-earth element based permanent magnet material, Nd2Fe14B. We are able to reduce a large spectro-microscopic dataset of over 300,000 spectra to 3 representative spectra, preserving much of the underlying information. Scientists can easily and quickly analyze these three characteristic spectra in detail. Our approach can rapidly provide a concise representation of a large and complex dataset to materials scientists and chemists. For instance, it shows that the surface of a common Nd2Fe14B magnet is chemically and structurally very different from the bulk, suggesting a surface alteration effect, possibly due to corrosion, which could affect the material's overall properties.

  2. A machine learning pipeline for automated registration and classification of 3D lidar data

    NASA Astrophysics Data System (ADS)

    Rajagopal, Abhejit; Chellappan, Karthik; Chandrasekaran, Shivkumar; Brown, Andrew P.

    2017-05-01

    Despite the large availability of geospatial data, registration and exploitation of these datasets remains a persistent challenge in geoinformatics. Popular signal processing and machine learning algorithms, such as non-linear SVMs and neural networks, rely on well-formatted input models as well as reliable output labels, which are not always immediately available. In this paper we outline a pipeline for gathering, registering, and classifying initially unlabeled wide-area geospatial data. As an illustrative example, we demonstrate the training and testing of a convolutional neural network to recognize 3D models in the OGRIP 2007 LiDAR dataset using fuzzy labels derived from OpenStreetMap as well as other datasets available on OpenTopography.org. When auxiliary label information is required, various text and natural language processing filters are used to extract and cluster keywords useful for identifying potential target classes. A subset of these keywords are subsequently used to form multi-class labels, with no assumption of independence. Finally, we employ class-dependent geometry extraction routines to identify candidates from both training and testing datasets. Our regression networks are able to identify the presence of 6 structural classes, including roads, walls, and buildings, in volumes as big as 8000 m³ in as little as 1.2 seconds on a commodity 4-core Intel CPU. The presented framework is neither dataset nor sensor-modality limited due to the registration process, and is capable of multi-sensor data-fusion.

  3. Ultrastructurally-smooth thick partitioning and volume stitching for larger-scale connectomics

    PubMed Central

    Hayworth, Kenneth J.; Xu, C. Shan; Lu, Zhiyuan; Knott, Graham W.; Fetter, Richard D.; Tapia, Juan Carlos; Lichtman, Jeff W.; Hess, Harald F.

    2015-01-01

    FIB-SEM has become an essential tool for studying neural tissue at resolutions below 10×10×10 nm, producing datasets superior for automatic connectome tracing. We present a technical advance, ultrathick sectioning, which reliably subdivides embedded tissue samples into chunks (20 µm thick) optimally sized and mounted for efficient, parallel FIB-SEM imaging. These chunks are imaged separately and then ‘volume stitched’ back together, producing a final 3D dataset suitable for connectome tracing. PMID:25686390

  4. OpenCL based machine learning labeling of biomedical datasets

    NASA Astrophysics Data System (ADS)

    Amoros, Oscar; Escalera, Sergio; Puig, Anna

    2011-03-01

    In this paper, we propose a two-stage labeling method for large biomedical datasets based on a parallel approach on a single GPU. Diagnostic methods, structure volume measurements, and visualization systems are of major importance for surgery planning, intra-operative imaging and image-guided surgery. In all cases, providing an automatic and interactive method to label or tag the different structures contained in the input data becomes imperative. Several approaches to label or segment biomedical datasets have been proposed to discriminate different anatomical structures in an output tagged dataset. Among existing methods, supervised learning methods for segmentation have been devised to let non-expert users easily analyze biomedical datasets. However, they still have some problems concerning practical application, such as slow learning and testing speeds. In addition, recent technological developments have led to widespread availability of multi-core CPUs and GPUs, as well as new software languages, such as NVIDIA's CUDA and OpenCL, making it possible to apply parallel programming paradigms on conventional personal computers. The Adaboost classifier is one of the most widely applied methods for labeling in the Machine Learning community. In a first stage, Adaboost trains a binary classifier from a set of pre-labeled samples described by a set of features. This binary classifier is defined as a weighted combination of weak classifiers. Each weak classifier is a simple decision function estimated on a single feature value. Then, at the testing stage, each weak classifier is independently applied to the features of a set of unlabeled samples. In this work, we propose an alternative representation of the Adaboost binary classifier. We use this proposed representation to define a new GPU-based parallelized Adaboost testing stage using OpenCL. We provide numerical experiments based on large available data sets and we compare our results to CPU-based strategies in terms of time and labeling speeds.
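
    The testing stage that the paper parallelises can be written compactly: for decision-stump weak classifiers, evaluating all samples against all weak classifiers is one comparison per (sample, classifier) pair, exactly the kind of data-parallel map that transfers naturally to OpenCL. The sketch below is a NumPy analogue on synthetic data with randomly chosen stump parameters, not the GPU code or the authors' alternative representation.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features, n_weak = 10_000, 8, 50

    # Synthetic voxel features and a "trained" stump ensemble (feature index, threshold,
    # polarity and weight per weak classifier) -- values here are random placeholders.
    X = rng.random((n_samples, n_features)).astype(np.float32)
    feat = rng.integers(0, n_features, n_weak)
    thresh = rng.random(n_weak).astype(np.float32)
    polarity = rng.choice([-1.0, 1.0], n_weak).astype(np.float32)
    alpha = rng.random(n_weak).astype(np.float32)

    # Each weak classifier: h_t(x) = polarity_t * sign(x[feat_t] - thresh_t).
    # Evaluating all of them for all samples is a single (n_samples, n_weak) comparison,
    # an embarrassingly parallel operation well suited to a GPU kernel.
    weak_votes = polarity * np.sign(X[:, feat] - thresh)
    labels = np.sign(weak_votes @ alpha)               # strong classifier: weighted vote
    print(labels.shape, np.unique(labels))
    ```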

  5. The first MICCAI challenge on PET tumor segmentation.

    PubMed

    Hatt, Mathieu; Laurent, Baptiste; Ouahabi, Anouar; Fayad, Hadi; Tan, Shan; Li, Laquan; Lu, Wei; Jaouen, Vincent; Tauber, Clovis; Czakon, Jakub; Drapejkowski, Filip; Dyrka, Witold; Camarasu-Pop, Sorina; Cervenansky, Frédéric; Girard, Pascal; Glatard, Tristan; Kain, Michael; Yao, Yao; Barillot, Christian; Kirov, Assen; Visvikis, Dimitris

    2018-02-01

    Automatic functional volume segmentation in PET images is a challenge that has been addressed using a large array of methods. A major limitation for the field has been the lack of a benchmark dataset that would allow direct comparison of the results in the various publications. In the present work, we describe a comparison of recent methods on a large dataset following recommendations by the American Association of Physicists in Medicine (AAPM) task group (TG) 211, which was carried out within a MICCAI (Medical Image Computing and Computer Assisted Intervention) challenge. Organization and funding was provided by France Life Imaging (FLI). A dataset of 176 images combining simulated, phantom and clinical images was assembled. A website allowed the participants to register and download training data (n = 19). Challengers then submitted encapsulated pipelines on an online platform that autonomously ran the algorithms on the testing data (n = 157) and evaluated the results. The methods were ranked according to the arithmetic mean of sensitivity and positive predictive value. Sixteen teams registered but only four provided manuscripts and pipeline(s) for a total of 10 methods. In addition, results using two thresholds and the Fuzzy Locally Adaptive Bayesian (FLAB) were generated. All competing methods except one performed with median accuracy above 0.8. The method with the highest score was the convolutional neural network-based segmentation, which significantly outperformed 9 out of 12 of the other methods, but not the improved K-Means, Gaussian Model Mixture and Fuzzy C-Means methods. The most rigorous comparative study of PET segmentation algorithms to date was carried out using a dataset that is the largest used in such studies so far. The hierarchy amongst the methods in terms of accuracy did not depend strongly on the subset of datasets or the metrics (or combination of metrics). All the methods submitted by the challengers except one demonstrated good performance with median accuracy scores above 0.8. Copyright © 2017 Elsevier B.V. All rights reserved.
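
    The ranking score used in the challenge, the arithmetic mean of sensitivity and positive predictive value between a predicted and a reference binary mask, reduces to a few lines (toy masks shown; this is not the challenge evaluation code):

    ```python
    import numpy as np

    def challenge_score(pred, truth):
        """Arithmetic mean of sensitivity and positive predictive value for binary masks."""
        pred, truth = pred.astype(bool), truth.astype(bool)
        tp = np.logical_and(pred, truth).sum()
        sensitivity = tp / truth.sum()
        ppv = tp / pred.sum()
        return 0.5 * (sensitivity + ppv)

    truth = np.zeros((32, 32, 32), dtype=bool); truth[10:20, 10:20, 10:20] = True
    pred = np.zeros_like(truth);                pred[12:22, 10:20, 10:20] = True
    print(round(challenge_score(pred, truth), 3))    # 0.8: 80% of each volume overlaps
    ```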

  6. Hydrogeophysical Cyberinfrastructure For Real-Time Interactive Browser Controlled Monitoring Of Near Surface Hydrology: Results Of A 13 Month Monitoring Effort At The Hanford 300 Area

    NASA Astrophysics Data System (ADS)

    Versteeg, R. J.; Johnson, T.; Henrie, A.; Johnson, D.

    2013-12-01

    The Hanford 300 Area, located adjacent to the Columbia River in south-central Washington, USA, is the site of former research and uranium fuel rod fabrication facilities. Waste disposal practices at the site included discharging between 33 and 59 metric tons of uranium over a 40-year period into shallow infiltration galleries, resulting in persistent uranium contamination within the vadose and saturated zones. Uranium transport from the vadose zone to the saturated zone is intimately linked with water table fluctuations and river water driven by upstream dam operations. Different remedial efforts have occurred at the site to address uranium contamination. Numerous investigations are occurring at the site, both to investigate remedial performance and to increase the understanding of uranium dynamics. Several of these studies include acquisition of large hydrological and time-lapse electrical geophysical data sets. Such datasets contain large amounts of information on hydrological processes. There are substantial challenges in how to effectively deal with the data volumes of such datasets, how to process them, and how to provide users with the ability to effectively access and synergize the hydrological information contained in raw and processed data. These challenges motivated the development of a cloud-based cyberinfrastructure for dealing with large electrical hydrogeophysical datasets. This cyberinfrastructure is modular and extensible and includes data management, data processing, visualization and result mining capabilities. Specifically, it provides for data transmission to a central server, data parsing into a relational database and processing of the data using a PNNL-developed parallel inversion code on either dedicated or commodity compute clusters. Access to results is through a browser, with interactive tools allowing for on-demand visualization of the inversion results as well as interactive data mining and statistical calculation. This infrastructure was used for the acquisition and processing of an electrical geophysical time-lapse survey collected over a highly instrumented field site in the Hanford 300 Area. Over a 13-month period between November 2011 and December 2012, 1823 time-lapse datasets were collected (roughly 5 datasets a day, for a total of 23 million individual measurements) on three parallel resistivity lines of 30 m each with 0.5 m electrode spacing. In addition, hydrological and environmental data were collected from dedicated and general-purpose sensors. This dataset contains rich information on near-surface processes at a range of different spatial and temporal scales (ranging from hourly to seasonal). We will show how this cyberinfrastructure was used to manage and process this dataset and how it can be used to access, mine and visualize the resulting data and information.

  7. The Ophidia framework: toward cloud-based data analytics for climate change

    NASA Astrophysics Data System (ADS)

    Fiore, Sandro; D'Anca, Alessandro; Elia, Donatello; Mancini, Marco; Mariello, Andrea; Mirto, Maria; Palazzo, Cosimo; Aloisio, Giovanni

    2015-04-01

    The Ophidia project is a research effort on big data analytics facing scientific data analysis challenges in the climate change domain. It provides parallel (server-side) data analysis, an internal storage model and a hierarchical data organization to manage large amounts of multidimensional scientific data. The Ophidia analytics platform provides several MPI-based parallel operators to manipulate large datasets (data cubes) and array-based primitives to perform data analysis on large arrays of scientific data. The most relevant data analytics use cases implemented in national and international projects target fire danger prevention (OFIDIA), interactions between climate change and biodiversity (EUBrazilCC), climate indicators and remote data analysis (CLIP-C), sea situational awareness (TESSA), and large-scale data analytics on CMIP5 data in NetCDF format, compliant with the Climate and Forecast (CF) convention (ExArch). Two use cases regarding the EU FP7 EUBrazil Cloud Connect and the INTERREG OFIDIA projects will be presented during the talk. In the former case (EUBrazilCC) the Ophidia framework is being extended to integrate scalable VM-based solutions for the management of large volumes of scientific data (both climate and satellite data) in a cloud-based environment to study how climate change affects biodiversity. In the latter (OFIDIA) the data analytics framework is being exploited to provide operational support for processing chains devoted to fire danger prevention. To tackle the project challenges, data analytics workflows consisting of about 130 operators perform, among others, parallel data analysis, metadata management, virtual file system tasks, map generation, rolling of datasets, and import/export of datasets in NetCDF format. Finally, the entire Ophidia software stack has been deployed at CMCC on 24 nodes (16 cores/node) of the Athena HPC cluster. Moreover, a cloud-based release tested with OpenNebula is also available and running in the private cloud infrastructure of the CMCC Supercomputing Centre.

  8. Multi-views Fusion CNN for Left Ventricular Volumes Estimation on Cardiac MR Images.

    PubMed

    Luo, Gongning; Dong, Suyu; Wang, Kuanquan; Zuo, Wangmeng; Cao, Shaodong; Zhang, Henggui

    2017-10-13

    Left ventricular (LV) volumes estimation is a critical procedure for cardiac disease diagnosis. The objective of this paper is to address the direct LV volumes prediction task. In this paper, we propose a direct volumes prediction method based on end-to-end deep convolutional neural networks (CNN). We study the end-to-end LV volumes prediction method in terms of data preprocessing, network structure, and multi-views fusion strategy. The main contributions of this paper are the following aspects. First, we propose a new data preprocessing method for cardiac magnetic resonance (CMR). Second, we propose a new network structure for end-to-end LV volumes estimation. Third, we explore the representational capacity of different slices, and propose a fusion strategy to improve the prediction accuracy. The evaluation results show that the proposed method outperforms other state-of-the-art LV volumes estimation methods on openly accessible benchmark datasets. The clinical indexes derived from the predicted volumes agree well with the ground truth (EDV: R = 0.974, RMSE = 9.6 ml; ESV: R = 0.976, RMSE = 7.1 ml; EF: R = 0.828, RMSE = 4.71%). Experimental results prove that the proposed method has high accuracy and efficiency on the LV volumes prediction task. The proposed method not only has application potential for cardiac disease screening on large-scale CMR data, but can also be extended to other medical image research fields.
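
    As a reminder of how the quoted clinical indexes relate, the sketch below derives the ejection fraction from the two volumes and computes the two agreement measures (Pearson R and RMSE); the toy values are hypothetical and not from the paper's dataset.

        import numpy as np

        def ejection_fraction(edv_ml, esv_ml):
            """Ejection fraction (%) from end-diastolic and end-systolic volumes."""
            return 100.0 * (edv_ml - esv_ml) / edv_ml

        def agreement(pred, truth):
            """Pearson correlation and RMSE, the agreement measures quoted above."""
            pred, truth = np.asarray(pred, float), np.asarray(truth, float)
            r = np.corrcoef(pred, truth)[0, 1]
            rmse = np.sqrt(np.mean((pred - truth) ** 2))
            return r, rmse

        # Hypothetical predicted vs. reference EDV values (ml)
        r, rmse = agreement([120.0, 95.0, 150.0, 110.0], [125.0, 90.0, 145.0, 115.0])
        print(round(r, 3), round(rmse, 2), round(ejection_fraction(120.0, 45.0), 1))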

  9. Regional flux analysis for discovering and quantifying anatomical changes: An application to the brain morphometry in Alzheimer's disease.

    PubMed

    Lorenzi, M; Ayache, N; Pennec, X

    2015-07-15

    In this study we introduce regional flux analysis, a novel approach to deformation-based morphometry built on the Helmholtz decomposition of deformations parameterized by stationary velocity fields. We use the scalar pressure map associated with the irrotational component of the deformation to discover the critical regions of volume change. These regions are used to consistently quantify the associated measure of volume change by the probabilistic integration of the flux of the longitudinal deformations across the boundaries. The presented framework unifies voxel-based and regional approaches, and robustly describes the volume changes at both group-wise and subject-specific levels as a spatial process governed by consistently defined regions. Our experiments on the large cohorts of the ADNI dataset show that regional flux analysis is a powerful and flexible instrument for the study of Alzheimer's disease in a wide range of scenarios: cross-sectional deformation-based morphometry, longitudinal discovery and quantification of group-wise volume changes, and statistically powered and robust quantification of hippocampal and ventricular atrophy. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. Reduction in expression of the benign AR transcriptome is a hallmark of localised prostate cancer progression.

    PubMed

    Stuchbery, Ryan; Macintyre, Geoff; Cmero, Marek; Harewood, Laurence M; Peters, Justin S; Costello, Anthony J; Hovens, Christopher M; Corcoran, Niall M

    2016-05-24

    Despite the importance of androgen receptor (AR) signalling to prostate cancer development, little is known about how this signalling pathway changes with increasing grade and stage of the disease. Our objective was to explore changes in the normal AR transcriptome in localised prostate cancer, and its relation to adverse pathological features and disease recurrence. We used publicly accessible human prostate cancer expression arrays as well as RNA sequencing data from the prostate TCGA. Tumour-associated PSA and PSA density (PSAD) were calculated for a large cohort of men (n=1108) undergoing prostatectomy. We performed a meta-analysis of the expression of an androgen-regulated gene set across datasets using Oncomine. Differential expression of selected genes in the prostate TCGA database was probed using the edgeR Bioconductor package. Changes in tumour PSA density with stage and grade were assessed by Student's t-test, and its association with biochemical recurrence explored by Kaplan-Meier curves and Cox regression. Meta-analysis revealed a systematic decline in the expression of a previously identified benign prostate androgen-regulated gene set with increasing tumour grade, reaching significance in nine of 25 genes tested despite increasing AR expression. These results were confirmed in a large independent dataset from the TCGA. At the protein level, when serum PSA was corrected for tumour volume, significantly lower levels were observed with increasing tumour grade and stage, and predicted disease recurrence. Lower PSA secretion per tumour volume is associated with increasing grade and stage of prostate cancer, has prognostic relevance, and reflects a systematic perturbation of androgen signalling.

  11. A quantitative evaluation of pleural effusion on computed tomography scans using B-spline and local clustering level set.

    PubMed

    Song, Lei; Gao, Jungang; Wang, Sheng; Hu, Huasi; Guo, Youmin

    2017-01-01

    Estimation of pleural effusion volume is an important clinical issue. Existing methods cannot assess it accurately when there is a large volume of liquid in the pleural cavity and/or the patient has some other disease (e.g. pneumonia). In order to help solve this issue, the objective of this study is to develop and test a novel algorithm using the B-spline and local clustering level set methods jointly, namely BLL. The BLL algorithm was applied to a dataset involving 27 pleural effusions detected on chest CT examinations of 18 adult patients with free pleural effusion. Study results showed that average volumes of pleural effusion computed using the BLL algorithm and assessed manually by the physicians were 586 ± 339 ml and 604 ± 352 ml, respectively. For the same patient, the volume of the pleural effusion segmented semi-automatically was 101.8% ± 4.6% of that segmented manually. Dice similarity was found to be 0.917 ± 0.031. The study demonstrated the feasibility of applying the new BLL algorithm to accurately measure the volume of pleural effusion.
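
    The two quantitative measures reported above are straightforward to compute from binary masks; a minimal sketch follows, with the voxel spacing and toy masks chosen purely for illustration.

        import numpy as np

        def dice(a, b):
            """Dice similarity coefficient between two binary masks."""
            a, b = a.astype(bool), b.astype(bool)
            denom = a.sum() + b.sum()
            return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

        def effusion_volume_ml(mask, voxel_spacing_mm):
            """Volume (ml) of a segmented region given voxel spacing in mm."""
            voxel_ml = np.prod(voxel_spacing_mm) / 1000.0   # mm^3 -> ml
            return mask.sum() * voxel_ml

        mask = np.zeros((50, 50, 50), dtype=bool)
        mask[10:40, 10:40, 10:40] = True                    # toy "effusion"
        shifted = np.roll(mask, 2, axis=0)                  # stand-in for a second reading
        print(round(dice(mask, shifted), 3),
              round(effusion_volume_ml(mask, (0.8, 0.8, 5.0)), 1))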

  12. Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets

    NASA Astrophysics Data System (ADS)

    Day-Lewis, F. D.; Slater, L. D.; Johnson, T.

    2012-12-01

    Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
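
    As a concrete illustration of one of the listed approaches, the sketch below estimates the lag between a hydrologic forcing and a geophysical response using normalized cross-correlation; the synthetic hourly series and the 6 h lag are invented for the example and do not come from the cited datasets.

        import numpy as np
        from scipy import signal

        # Hypothetical hourly series: river stage and a geophysical response
        rng = np.random.default_rng(0)
        t = np.arange(24 * 90)                                   # 90 days, hourly
        stage = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
        cond = np.roll(stage, 6) + 0.1 * rng.standard_normal(t.size)   # ~6 h delayed copy

        # Normalized cross-correlation to estimate the lag between the two signals
        a = (stage - stage.mean()) / stage.std()
        b = (cond - cond.mean()) / cond.std()
        xcorr = signal.correlate(b, a, mode="full")
        lags = signal.correlation_lags(b.size, a.size, mode="full")
        print("estimated lag (hours):", lags[np.argmax(xcorr)])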

  13. Adaptive Neuron Apoptosis for Accelerating Deep Learning on Large Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Siegel, Charles M.; Daily, Jeffrey A.; Vishnu, Abhinav

    Machine Learning and Data Mining (MLDM) algorithms are becoming ubiquitous in model learning from the large volume of data generated using simulations, experiments and handheld devices. Deep Learning algorithms -- a class of MLDM algorithms -- are applied for automatic feature extraction and for learning non-linear models for unsupervised and supervised tasks. Naturally, several libraries which support large scale Deep Learning -- such as TensorFlow and Caffe -- have become popular. In this paper, we present novel techniques to accelerate the convergence of Deep Learning algorithms by conducting low-overhead removal of redundant neurons -- apoptosis of neurons -- which do not contribute to model learning, during the training phase itself. We provide in-depth theoretical underpinnings of our heuristics (bounding accuracy loss and handling apoptosis of several neuron types), and present the methods to conduct adaptive neuron apoptosis. We implement our proposed heuristics with the recently introduced TensorFlow and its recently proposed MPI extension. Our performance evaluation on two different clusters -- one with Intel Haswell multi-core systems and the other with NVIDIA GPUs, both connected with InfiniBand -- indicates the efficacy of the proposed heuristics and implementations. Specifically, we are able to improve the training time for several datasets by 2-3x, while reducing the number of parameters by up to 30x (4-5x on average) on datasets such as ImageNet classification. For the Higgs Boson dataset, our implementation improves the classification accuracy (measured by Area Under Curve (AUC)) from 0.88/1 to 0.94/1, while reducing the number of parameters by 3x in comparison to existing literature, and achieves a 2.44x speedup in comparison to the default (no apoptosis) algorithm.
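
    The core idea of neuron apoptosis (dropping units that no longer contribute to the model) can be illustrated on a toy fully connected layer; the magnitude-based pruning criterion below is a simplified stand-in for the paper's heuristics, not their implementation.

        import numpy as np

        def prune_dead_neurons(W1, b1, W2, threshold=1e-2):
            """Remove hidden units whose outgoing weights have negligible norm,
            a simplified stand-in for the adaptive apoptosis heuristics."""
            # W1: (n_in, n_hidden), b1: (n_hidden,), W2: (n_hidden, n_out)
            keep = np.linalg.norm(W2, axis=1) > threshold
            return W1[:, keep], b1[keep], W2[keep, :]

        rng = np.random.default_rng(1)
        W1 = rng.standard_normal((10, 64))
        b1 = rng.standard_normal(64)
        W2 = rng.standard_normal((64, 3))
        W2[::2] *= 1e-4                      # make half of the hidden units inactive
        W1p, b1p, W2p = prune_dead_neurons(W1, b1, W2)
        print(W2.shape[0], "->", W2p.shape[0], "hidden units")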

  14. Brain-CODE: A Secure Neuroinformatics Platform for Management, Federation, Sharing and Analysis of Multi-Dimensional Neuroscience Data.

    PubMed

    Vaccarino, Anthony L; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M; Stuss, Donald T; Theriault, Elizabeth; Evans, Kenneth R

    2018-01-01

    Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute's "Brain-CODE" is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care.

  15. Brain-CODE: A Secure Neuroinformatics Platform for Management, Federation, Sharing and Analysis of Multi-Dimensional Neuroscience Data

    PubMed Central

    Vaccarino, Anthony L.; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R.; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G.; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F. Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M.; Stuss, Donald T.; Theriault, Elizabeth; Evans, Kenneth R.

    2018-01-01

    Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute’s “Brain-CODE” is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care. PMID:29875648

  16. Implementing DOIs for Oceanographic Satellite Data at PO.DAAC

    NASA Astrophysics Data System (ADS)

    Hausman, J.; Tauer, E.; Chung, N.; Chen, C.; Moroni, D. F.

    2013-12-01

    The Physical Oceanographic Distributed Active Archive Center (PO.DAAC) is NASA's archive for physical oceanographic satellite data. It distributes over 500 datasets from gravity, ocean wind, sea surface topography, sea ice, ocean currents, salinity, and sea surface temperature satellite missions. A dataset is a collection of granules/files that share the same mission/project, versioning, processing level, spatial, and temporal characteristics. The large number of datasets is partially due to the number of satellite missions, but mostly because a single satellite mission typically has multiple versions or even multiple temporal and spatial resolutions of data. As a result, a user might mistake one dataset for a different dataset from the same satellite mission. Due to the PO.DAAC's vast variety and volume of data and growing requirements to report dataset usage, it has begun implementing DOIs for the datasets it archives and distributes. However, this was not as simple as registering a name for a DOI and providing a URL. Before implementing DOIs, multiple questions needed to be answered. What are the sponsor and end-user expectations regarding DOIs? At what level does a DOI get assigned (dataset, file/granule)? Do all data get a DOI, or only selected data? How do we create a DOI? How do we create landing pages and manage them? What changes need to be made to the data archive, life cycle policy and web portal to accommodate DOIs? What if the data also exists at another archive and a DOI already exists? How is a DOI included if the data were obtained via a subsetting tool? How does a researcher or author provide a unique, definitive reference (standard citation) for a given dataset? This presentation will discuss how these questions were answered through changes in policy, process, and system design. Implementing DOIs is not a trivial undertaking, but as DOIs are rapidly becoming the de facto approach, it is worth the effort. Researchers have historically referenced the source satellite and data center (or archive), but scientific writings do not typically provide enough detail to point to a singular, uniquely identifiable dataset. DOIs provide the means to help researchers be precise in their data citations and provide needed clarity, standardization and permanence.

  17. The Path from Large Earth Science Datasets to Information

    NASA Astrophysics Data System (ADS)

    Vicente, G. A.

    2013-12-01

    The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is one of the major Science Mission Directorate (SMD) data centers for archiving and distribution of Earth Science remote sensing data, products and services. This virtual portal provides convenient access to Atmospheric Composition and Dynamics, Hydrology, Precipitation, Ozone, and model-derived datasets (generated by GSFC's Global Modeling and Assimilation Office), as well as the North American Land Data Assimilation System (NLDAS) and Global Land Data Assimilation System (GLDAS) data products (both generated by GSFC's Hydrological Sciences Branch). This presentation demonstrates various tools and computational technologies developed at the GES DISC to manage the huge volume of data and products acquired from various missions and programs over the years. It explores approaches to archive, document, distribute, access and analyze Earth Science data and information, and addresses the technical and scientific issues, governance and user support problems faced by scientists in need of multi-disciplinary datasets. It also discusses data and product metrics, user distribution profiles and lessons learned through interactions with the science communities around the world. Finally it demonstrates some of the most used data and product visualization and analysis tools developed and maintained by the GES DISC.

  18. ANTONIA perfusion and stroke. A software tool for the multi-purpose analysis of MR perfusion-weighted datasets and quantitative ischemic stroke assessment.

    PubMed

    Forkert, N D; Cheng, B; Kemmling, A; Thomalla, G; Fiehler, J

    2014-01-01

    The objective of this work is to present the software tool ANTONIA, which has been developed to facilitate a quantitative analysis of perfusion-weighted MRI (PWI) datasets in general as well as the subsequent multi-parametric analysis of additional datasets for the specific purpose of acute ischemic stroke patient dataset evaluation. Three different methods for the analysis of DSC or DCE PWI datasets are currently implemented in ANTONIA, which can be case-specifically selected based on the study protocol. These methods comprise a curve fitting method as well as a deconvolution-based and deconvolution-free method integrating a previously defined arterial input function. The perfusion analysis is extended for the purpose of acute ischemic stroke analysis by additional methods that enable an automatic atlas-based selection of the arterial input function, an analysis of the perfusion-diffusion and DWI-FLAIR mismatch as well as segmentation-based volumetric analyses. For reliability evaluation, the described software tool was used by two observers for quantitative analysis of 15 datasets from acute ischemic stroke patients to extract the acute lesion core volume, FLAIR ratio, perfusion-diffusion mismatch volume with manually as well as automatically selected arterial input functions, and follow-up lesion volume. The results of this evaluation revealed that the described software tool leads to highly reproducible results for all parameters if the automatic arterial input function selection method is used. Due to the broad selection of processing methods that are available in the software tool, ANTONIA is especially helpful to support image-based perfusion and acute ischemic stroke research projects.

  19. Adding the missing piece: Spitzer imaging of the HSC-Deep/PFS fields

    NASA Astrophysics Data System (ADS)

    Sajina, Anna; Bezanson, Rachel; Capak, Peter; Egami, Eiichi; Fan, Xiaohui; Farrah, Duncan; Greene, Jenny; Goulding, Andy; Lacy, Mark; Lin, Yen-Ting; Liu, Xin; Marchesini, Danilo; Moutard, Thibaud; Ono, Yoshiaki; Ouchi, Masami; Sawicki, Marcin; Strauss, Michael; Surace, Jason; Whitaker, Katherine

    2018-05-01

    We propose to observe a total of 7 sq. deg. to complete the Spitzer-IRAC coverage of the HSC-Deep survey fields. These fields are the sites of the Prime Focus Spectrograph (PFS) galaxy evolution survey, which will provide spectra of wide wavelength range and resolution for almost all M* galaxies at z ~ 0.7-1.7, and extend out to z ~ 7 for targeted samples. Our fields already have deep broadband and narrowband photometry in 12 bands spanning from u through K and a wealth of other ancillary data. We propose completing the matching-depth IRAC observations in the extended COSMOS, ELAIS-N1 and Deep2-3 fields. By complementing existing Spitzer coverage, this program will lead to a dataset of unprecedented spectro-photometric coverage across a total of 15 sq. deg. This dataset will have significant legacy value, as it samples a cosmic volume large enough to be representative of the full range of environments while providing sufficient information content per galaxy to confidently derive stellar population characteristics. This enables detailed studies of the growth and quenching of galaxies and their supermassive black holes in the context of a galaxy's local and large-scale environment.

  20. Detection of neuron membranes in electron microscopy images using a serial neural network architecture.

    PubMed

    Jurrus, Elizabeth; Paiva, Antonio R C; Watanabe, Shigeki; Anderson, James R; Jones, Bryan W; Whitaker, Ross T; Jorgensen, Erik M; Marc, Robert E; Tasdizen, Tolga

    2010-12-01

    Study of nervous systems via the connectome, the map of connectivities of all neurons in that system, is a challenging problem in neuroscience. Towards this goal, neurobiologists are acquiring large electron microscopy datasets. However, the sheer volume of these datasets renders manual analysis infeasible. Hence, automated image analysis methods are required for reconstructing the connectome from these very large image collections. Segmentation of neurons in these images, an essential step of the reconstruction pipeline, is challenging because of noise, anisotropic shapes and brightness, and the presence of confounding structures. The method described in this paper uses a series of artificial neural networks (ANNs) in a framework combined with a feature vector that is composed of image intensities sampled over a stencil neighborhood. Several ANNs are applied in series allowing each ANN to use the classification context provided by the previous network to improve detection accuracy. We develop the method of serial ANNs and show that the learned context does improve detection over traditional ANNs. We also demonstrate advantages over previous membrane detection methods. The results are a significant step towards an automated system for the reconstruction of the connectome. Copyright 2010 Elsevier B.V. All rights reserved.

  1. Big Data Analytics for Demand Response: Clustering Over Space and Time

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chelmis, Charalampos; Kolte, Jahanvi; Prasanna, Viktor K.

    The pervasive deployment of advanced sensing infrastructure in Cyber-Physical systems, such as the Smart Grid, has resulted in an unprecedented data explosion. Such data exhibit both large volume and high velocity, two of the three pillars of Big Data, and have a time-series nature, as datasets in this context typically consist of successive measurements made over a time interval. Time-series data can be valuable for data mining and analytics tasks such as identifying the “right” customers among a diverse population to target for Demand Response programs. However, time series are challenging to mine due to their high dimensionality. In this paper, we motivate this problem using a real application from the smart grid domain. We explore novel representations of time-series data for Big Data analytics, and propose a clustering technique for determining a natural segmentation of customers and identification of temporal consumption patterns. Our method is generalizable to large-scale, real-world scenarios, without making any assumptions about the data. We evaluate our technique using real datasets from smart meters, totaling ~18,200,000 data points, and show the efficacy of our technique in efficiently detecting the optimal number of clusters.
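
    A minimal sketch of clustering daily load profiles and selecting the number of clusters follows; it uses k-means with a silhouette criterion as one plausible choice, which is not necessarily the representation or selection rule used by the authors, and the synthetic profiles are invented for the example.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        # Hypothetical daily load profiles (customers x 24 hourly readings)
        rng = np.random.default_rng(0)
        hours = np.arange(24)
        morning = np.exp(-0.5 * ((hours - 8) / 2.0) ** 2)
        evening = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
        X = np.vstack([morning + 0.05 * rng.standard_normal((200, 24)),
                       evening + 0.05 * rng.standard_normal((200, 24))])

        # Score candidate cluster counts and keep the best one
        scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                                random_state=0).fit_predict(X))
                  for k in range(2, 6)}
        print("best k:", max(scores, key=scores.get))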

  2. Cardiovascular imaging environment: will the future be cloud-based?

    PubMed

    Kawel-Boehm, Nadine; Bluemke, David A

    2017-07-01

    In cardiovascular CT and MR imaging, large datasets have to be stored, post-processed, analyzed and distributed. Besides basic assessment of volume and function in cardiac magnetic resonance imaging, for example, more sophisticated quantitative analysis is requested, requiring specific software. Several institutions cannot afford various types of software or provide the expertise to perform sophisticated analysis. Areas covered: Various cloud services exist related to data storage and analysis specifically for cardiovascular CT and MR imaging. Instead of on-site data storage, cloud providers offer flexible storage services on a pay-per-use basis. To avoid purchase and maintenance of specialized software for cardiovascular image analysis, e.g. to assess myocardial iron overload, MR 4D flow and fractional flow reserve, evaluation can be performed with cloud-based software by the consumer, or the complete analysis is performed by the cloud provider. However, challenges to widespread implementation of cloud services include regulatory issues regarding patient privacy and data security. Expert commentary: If patient privacy and data security are guaranteed, cloud imaging is a valuable option for coping with the storage of large image datasets and for offering sophisticated cardiovascular image analysis to institutions of all sizes.

  3. Effects of VR system fidelity on analyzing isosurface visualization of volume datasets.

    PubMed

    Laha, Bireswar; Bowman, Doug A; Socha, John J

    2014-04-01

    Volume visualization is an important technique for analyzing datasets from a variety of different scientific domains. Volume data analysis is inherently difficult because volumes are three-dimensional, dense, and unfamiliar, requiring scientists to precisely control the viewpoint and to make precise spatial judgments. Researchers have proposed that more immersive (higher fidelity) VR systems might improve task performance with volume datasets, and significant results tied to different components of display fidelity have been reported. However, more information is needed to generalize these results to different task types, domains, and rendering styles. We visualized isosurfaces extracted from synchrotron microscopic computed tomography (SR-μCT) scans of beetles, in a CAVE-like display. We ran a controlled experiment evaluating the effects of three components of system fidelity (field of regard, stereoscopy, and head tracking) on a variety of abstract task categories that are applicable to various scientific domains, and also compared our results with those from our prior experiment using 3D texture-based rendering. We report many significant findings. For example, for search and spatial judgment tasks with isosurface visualization, a stereoscopic display provides better performance, but for tasks with 3D texture-based rendering, displays with higher field of regard were more effective, independent of the levels of the other display components. We also found that systems with high field of regard and head tracking improve performance in spatial judgment tasks. Our results extend existing knowledge and produce new guidelines for designing VR systems to improve the effectiveness of volume data analysis.

  4. Validation of geometric measurements of the left atrium and pulmonary veins for analysis of reverse structural remodeling following ablation therapy

    NASA Astrophysics Data System (ADS)

    Rettmann, M. E.; Holmes, D. R., III; Gunawan, M. S.; Ge, X.; Karwoski, R. A.; Breen, J. F.; Packer, D. L.; Robb, R. A.

    2012-03-01

    Geometric analysis of the left atrium and pulmonary veins is important for studying reverse structural remodeling following cardiac ablation therapy. It has been shown that the left atrium decreases in volume and the pulmonary vein ostia decrease in diameter following ablation therapy. Most analysis techniques, however, require laborious manual tracing of image cross-sections. Pulmonary vein diameters are typically measured at the junction between the left atrium and pulmonary veins, called the pulmonary vein ostia, with manually drawn lines on volume renderings or on image cross-sections. In this work, we describe a technique for making semi-automatic measurements of the left atrium and pulmonary vein ostial diameters from high resolution CT scans and multi-phase datasets. The left atrium and pulmonary veins are segmented from a CT volume using a 3D volume approach and cut planes are interactively positioned to separate the pulmonary veins from the body of the left atrium. The cut plane is also used to compute the pulmonary vein ostial diameter. Validation experiments are presented which demonstrate the ability to repeatedly measure left atrial volume and pulmonary vein diameters from high resolution CT scans, as well as the feasibility of this approach for analyzing dynamic, multi-phase datasets. In the high resolution CT scans the left atrial volume measurements show high repeatability with approximately 4% intra-rater repeatability and 8% inter-rater repeatability. Intra- and inter-rater repeatability for pulmonary vein diameter measurements range from approximately 2 to 4 mm. For the multi-phase CT datasets, differences in left atrial volumes between a standard slice-by-slice approach and the proposed 3D volume approach are small, with percent differences on the order of 3% to 6%.

  5. Workshop on New Views of the Moon 2: Understanding the Moon Through the Integration of Diverse Datasets

    NASA Technical Reports Server (NTRS)

    1999-01-01

    This volume contains abstracts that have been accepted for presentation at the Workshop on New Views of the Moon II: Understanding the Moon Through the Integration of Diverse Datasets, September 22-24, 1999, in Flagstaff, Arizona. The workshop conveners are Lisa Gaddis (U.S. Geological Survey, Flagstaff) and Charles K. Shearer (University of New Mexico). Color versions of some of the images contained in this volume are available on the meeting Web site (http://cass.jsc.nasa.gov/meetings/moon99/pdf/program.pdf).

  6. Server-based Approach to Web Visualization of Integrated Three-dimensional Brain Imaging Data

    PubMed Central

    Poliakov, Andrew V.; Albright, Evan; Hinshaw, Kevin P.; Corina, David P.; Ojemann, George; Martin, Richard F.; Brinkley, James F.

    2005-01-01

    The authors describe a client-server approach to three-dimensional (3-D) visualization of neuroimaging data, which enables researchers to visualize, manipulate, and analyze large brain imaging datasets over the Internet. All computationally intensive tasks are done by a graphics server that loads and processes image volumes and 3-D models, renders 3-D scenes, and sends the renderings back to the client. The authors discuss the system architecture and implementation and give several examples of client applications that allow visualization and analysis of integrated language map data from single and multiple patients. PMID:15561787

  7. Semi-automated Neuron Boundary Detection and Nonbranching Process Segmentation in Electron Microscopy Images

    PubMed Central

    Jurrus, Elizabeth; Watanabe, Shigeki; Giuly, Richard J.; Paiva, Antonio R. C.; Ellisman, Mark H.; Jorgensen, Erik M.; Tasdizen, Tolga

    2013-01-01

    Neuroscientists are developing new imaging techniques and generating large volumes of data in an effort to understand the complex structure of the nervous system. The complexity and size of this data makes human interpretation a labor-intensive task. To aid in the analysis, new segmentation techniques for identifying neurons in these feature-rich datasets are required. This paper presents a method for neuron boundary detection and nonbranching process segmentation in electron microscopy images and for visualizing them in three dimensions. It combines automated segmentation techniques with a graphical user interface for correction of mistakes in the automated process. The automated process first uses machine learning and image processing techniques to identify neuron membranes that delineate the cells in each two-dimensional section. To segment nonbranching processes, the cell regions in each two-dimensional section are connected in 3D using correlation of regions between sections. The combination of this method with a graphical user interface specially designed for this purpose enables users to quickly segment cellular processes in large volumes. PMID:22644867

  8. Semi-Automated Neuron Boundary Detection and Nonbranching Process Segmentation in Electron Microscopy Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jurrus, Elizabeth R.; Watanabe, Shigeki; Giuly, Richard J.

    2013-01-01

    Neuroscientists are developing new imaging techniques and generating large volumes of data in an effort to understand the complex structure of the nervous system. The complexity and size of this data makes human interpretation a labor-intensive task. To aid in the analysis, new segmentation techniques for identifying neurons in these feature-rich datasets are required. This paper presents a method for neuron boundary detection and nonbranching process segmentation in electron microscopy images and for visualizing them in three dimensions. It combines automated segmentation techniques with a graphical user interface for correction of mistakes in the automated process. The automated process first uses machine learning and image processing techniques to identify neuron membranes that delineate the cells in each two-dimensional section. To segment nonbranching processes, the cell regions in each two-dimensional section are connected in 3D using correlation of regions between sections. The combination of this method with a graphical user interface specially designed for this purpose enables users to quickly segment cellular processes in large volumes.

  9. Multivariate Formation Pressure Prediction with Seismic-derived Petrophysical Properties from Prestack AVO inversion and Poststack Seismic Motion Inversion

    NASA Astrophysics Data System (ADS)

    Yu, H.; Gu, H.

    2017-12-01

    A novel multivariate seismic formation pressure prediction methodology is presented, which incorporates high-resolution seismic velocity data from prestack AVO inversion and petrophysical data (porosity and shale volume) derived from poststack seismic motion inversion. In contrast to traditional seismic formation pressure prediction methods, the proposed methodology is based on a multivariate pressure prediction model and utilizes a trace-by-trace multivariate regression analysis on seismic-derived petrophysical properties to calibrate model parameters, in order to make accurate predictions with higher resolution in both vertical and lateral directions. With the prestack time migration velocity as the initial velocity model, an AVO inversion was first applied to the prestack dataset to obtain higher-frequency, high-resolution seismic velocity to be used as the velocity input for seismic pressure prediction, together with the density dataset used to calculate an accurate overburden pressure (OBP). Seismic Motion Inversion (SMI) is an inversion technique based on Markov Chain Monte Carlo simulation. Both structural variability and similarity of seismic waveform are used to incorporate well log data to characterize the variability of the property to be obtained. In this research, porosity and shale volume are first interpreted on well logs, and then combined with poststack seismic data using SMI to build porosity and shale volume datasets for seismic pressure prediction. A multivariate effective stress model is used to convert the velocity, porosity and shale volume datasets to effective stress. After a thorough study of the regional stratigraphic and sedimentary characteristics, a regional normally compacted interval model is built, and the coefficients in the multivariate prediction model are determined in a trace-by-trace multivariate regression analysis on the petrophysical data. The coefficients are used to convert the velocity, porosity and shale volume datasets to effective stress and then to calculate formation pressure with the OBP. Application of the proposed methodology to a research area in the East China Sea has proved that the method can bridge the gap between seismic and well log pressure prediction and give predicted pressure values close to pressure measurements from well testing.
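
    To make the workflow concrete, the sketch below calibrates a linear multivariate effective-stress relation by least squares and converts it to pore pressure with a Terzaghi-type relation; the functional form, coefficients, synthetic inputs and units are illustrative assumptions, not the paper's model.

        import numpy as np

        # Synthetic per-trace inputs: velocity (m/s), porosity, shale volume
        rng = np.random.default_rng(6)
        v = rng.uniform(2000, 4000, 500)
        phi = rng.uniform(0.05, 0.35, 500)
        vsh = rng.uniform(0.0, 0.6, 500)
        sigma_obs = 0.01 * v - 30.0 * phi - 8.0 * vsh + 5.0 \
                    + 0.5 * rng.standard_normal(500)        # "observed" effective stress (MPa)

        # Calibrate sigma_eff = a*V + b*phi + c*Vsh + d by ordinary least squares
        A = np.column_stack([v, phi, vsh, np.ones_like(v)])
        coef, *_ = np.linalg.lstsq(A, sigma_obs, rcond=None)
        sigma_eff = A @ coef

        obp = 45.0                                           # overburden pressure (MPa), hypothetical
        pore_pressure = obp - sigma_eff                      # Terzaghi-style relation
        print(coef.round(3), pore_pressure[:3].round(2))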

  10. Field Research Facility Data Integration Framework Data Management Plan: Survey Lines Dataset

    DTIC Science & Technology

    2016-08-01

    CHL and its District partners. The beach morphology surveys on which this report focuses provide quantitative measures of the dynamic nature of... topography and volume change. (1.4 Data description) The morphology surveys are conducted over a series of 26 shore-perpendicular profile lines spaced 50... Table 1, "FRF survey lines dataset input data and products," lists each input (e.g., ASCII LARC survey text...) alongside the corresponding FDIF product and description.

  11. An Effective Methodology for Processing and Analyzing Large, Complex Spacecraft Data Streams

    ERIC Educational Resources Information Center

    Teymourlouei, Haydar

    2013-01-01

    The emerging large datasets have made efficient data processing a much more difficult task for the traditional methodologies. Invariably, datasets continue to increase rapidly in size with time. The purpose of this research is to give an overview of some of the tools and techniques that can be utilized to manage and analyze large datasets. We…

  12. Mechanistic simulation of normal-tissue damage in radiotherapy—implications for dose-volume analyses

    NASA Astrophysics Data System (ADS)

    Rutkowska, Eva; Baker, Colin; Nahum, Alan

    2010-04-01

    A radiobiologically based 3D model of normal tissue has been developed in which complications are generated when 'irradiated'. The aim is to provide insight into the connection between dose-distribution characteristics, different organ architectures and complication rates beyond that obtainable with simple DVH-based analytical NTCP models. In this model the organ consists of a large number of functional subunits (FSUs), populated by stem cells which are killed according to the LQ model. A complication is triggered if the density of FSUs in any 'critical functioning volume' (CFV) falls below some threshold. The (fractional) CFV determines the organ architecture and can be varied continuously from small (series-like behaviour) to large (parallel-like). A key feature of the model is its ability to account for the spatial dependence of dose distributions. Simulations were carried out to investigate correlations between dose-volume parameters and the incidence of 'complications' using different pseudo-clinical dose distributions. Correlations between dose-volume parameters and outcome depended on characteristics of the dose distributions and on organ architecture. As anticipated, the mean dose and V20 correlated most strongly with outcome for a parallel organ, and the maximum dose for a serial organ. Interestingly, better correlation was obtained between the 3D computer model and the LKB model with dose distributions typical for serial organs than with those typical for parallel organs. This work links the results of dose-volume analyses to dataset characteristics typical for serial and parallel organs, and it may help investigators interpret the results from clinical studies.
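
    A minimal sketch of the model's two ingredients (LQ cell kill and a critical-functioning-volume trigger) is given below, assuming for simplicity that each voxel stands in for one FSU and using illustrative radiobiological parameters that are not taken from the paper.

        import numpy as np

        def surviving_fraction(dose_gy, n_fractions, alpha=0.15, beta=0.05):
            """Linear-quadratic survival for a uniformly irradiated voxel
            (alpha/beta values are illustrative only)."""
            d = dose_gy / n_fractions
            return np.exp(-n_fractions * (alpha * d + beta * d * d))

        def complication(dose_map_gy, n_fractions, kill_threshold=0.01, cfv=0.3):
            """Trigger a complication if the fraction of surviving FSUs falls
            below the critical functioning volume (CFV) fraction."""
            sf = surviving_fraction(dose_map_gy, n_fractions)
            surviving_fsus = (sf > kill_threshold).mean()
            return surviving_fsus < cfv

        dose_map = np.full((20, 20, 20), 60.0)   # uniform 60 Gy to a toy organ
        print(complication(dose_map, n_fractions=30))   # True: organ "fails"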

  13. Automatic segmentation of tumor-laden lung volumes from the LIDC database

    NASA Astrophysics Data System (ADS)

    O'Dell, Walter G.

    2012-03-01

    The segmentation of the lung parenchyma is often a critical pre-processing step prior to application of computer-aided detection of lung nodules. Segmentation of the lung volume can dramatically decrease computation time and reduce the number of false positive detections by excluding extra-pulmonary tissue from consideration. However, while many algorithms are capable of adequately segmenting the healthy lung, none have been demonstrated to work reliably well on tumor-laden lungs. A particular challenge is preserving tumorous masses attached to the chest wall, mediastinum or major vessels. In this role, lung volume segmentation comprises an important computational step that can adversely affect the performance of the overall CAD algorithm. An automated lung volume segmentation algorithm has been developed with the goals of maximally excluding extra-pulmonary tissue while retaining all true nodules. The algorithm comprises a series of tasks including intensity thresholding, 2-D and 3-D morphological operations, 2-D and 3-D floodfilling, and snake-based clipping of nodules attached to the chest wall. It features the ability to (1) exclude trachea and bowels, (2) snip large attached nodules using snakes, (3) snip small attached nodules using dilation, (4) preserve large masses fully internal to the lung volume, (5) account for basal aspects of the lung where in a 2-D slice the lower sections appear to be disconnected from the main lung, and (6) achieve separation of the right and left hemi-lungs. The algorithm was developed and trained on the first 100 datasets of the LIDC image database.
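
    The early steps of such a pipeline (intensity thresholding, removal of border-connected air, hole filling) can be sketched as below; the threshold and toy volume are illustrative, and the snake-based clipping and hemi-lung separation described above are not reproduced.

        import numpy as np
        from scipy import ndimage

        def rough_lung_mask(ct_hu, air_threshold=-400):
            """Threshold on Hounsfield units, discard air connected to the volume
            border (background), and fill internal holes (first steps only)."""
            air = ct_hu < air_threshold
            labels, _ = ndimage.label(air)
            border = np.unique(np.concatenate([
                labels[0].ravel(), labels[-1].ravel(),
                labels[:, 0].ravel(), labels[:, -1].ravel(),
                labels[:, :, 0].ravel(), labels[:, :, -1].ravel()]))
            lungs = air & ~np.isin(labels, border[border != 0])
            return ndimage.binary_fill_holes(lungs)

        ct = np.full((40, 64, 64), 50.0)       # toy soft-tissue volume (HU)
        ct[5:35, 20:44, 10:30] = -800.0        # internal air pocket standing in for lung
        ct[:, :, :4] = -1000.0                 # background air touching the border
        mask = rough_lung_mask(ct)
        print(mask.any(), mask[:, :, :4].any())   # True False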

  14. Diviner lunar radiometer gridded brightness temperatures from geodesic binning of modeled fields of view

    NASA Astrophysics Data System (ADS)

    Sefton-Nash, E.; Williams, J.-P.; Greenhagen, B. T.; Aye, K.-M.; Paige, D. A.

    2017-12-01

    An approach is presented to efficiently produce high quality gridded data records from the large, global point-based dataset returned by the Diviner Lunar Radiometer Experiment aboard NASA's Lunar Reconnaissance Orbiter. The need to minimize data volume and processing time in production of science-ready map products is increasingly important with the growth in data volume of planetary datasets. Diviner makes on average >1400 observations per second of radiance that is reflected and emitted from the lunar surface, using 189 detectors divided into 9 spectral channels. Data management and processing bottlenecks are amplified by modeling every observation as a probability distribution function over the field of view, which can increase the required processing time by 2-3 orders of magnitude. Geometric corrections, such as projection of data points onto a digital elevation model, are numerically intensive and therefore it is desirable to perform them only once. Our approach reduces bottlenecks through parallel binning and efficient storage of a pre-processed database of observations. Database construction is via subdivision of a geodesic icosahedral grid, with a spatial resolution that can be tailored to suit the field of view of the observing instrument. Global geodesic grids with high spatial resolution are normally impractically memory intensive. We therefore demonstrate a minimum-storage and highly parallel method to bin very large numbers of data points onto such a grid. A database of the pre-processed and binned points is then used for production of mapped data products that is significantly faster than if unprocessed points were used. We explore quality controls in the production of gridded data records by conditional interpolation, allowed only where data density is sufficient. The resultant effects on the spatial continuity and uncertainty in maps of lunar brightness temperatures are illustrated. We identify four binning regimes based on trades between the spatial resolution of the grid, the size of the FOV and the on-target spacing of observations. Our approach may be applicable and beneficial for many existing and future point-based planetary datasets.
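
    The accumulate-then-average pattern behind such gridded products can be sketched with a simple equal-angle grid; the actual pipeline bins onto a subdivided geodesic icosahedral grid and models each detector footprint, which is not reproduced here, and the synthetic brightness temperatures are invented for the example.

        import numpy as np

        def bin_points(lat_deg, lon_deg, values, nlat=180, nlon=360):
            """Accumulate point observations into an equal-angle lat/lon grid
            and return the per-cell mean and sample count."""
            i = np.clip(((lat_deg + 90.0) / 180.0 * nlat).astype(int), 0, nlat - 1)
            j = np.clip(((lon_deg % 360.0) / 360.0 * nlon).astype(int), 0, nlon - 1)
            sums = np.zeros((nlat, nlon))
            counts = np.zeros((nlat, nlon))
            np.add.at(sums, (i, j), values)
            np.add.at(counts, (i, j), 1)
            with np.errstate(invalid="ignore"):
                mean = sums / counts          # NaN where no observations fell
            return mean, counts               # interpolate only where counts suffice

        rng = np.random.default_rng(2)
        lat = rng.uniform(-90, 90, 100_000)
        lon = rng.uniform(0, 360, 100_000)
        tb = 250.0 + 30.0 * np.cos(np.radians(lat)) + rng.standard_normal(lat.size)
        mean, counts = bin_points(lat, lon, tb)
        print(round(float(np.nanmean(mean[80:100])), 1))    # near-equatorial average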

  15. Hydrograph Predictions of Glacial Lake Outburst Floods From an Ice-Dammed Lake

    NASA Astrophysics Data System (ADS)

    McCoy, S. W.; Jacquet, J.; McGrath, D.; Koschitzki, R.; Okuinghttons, J.

    2017-12-01

    Understanding the time evolution of glacial lake outburst floods (GLOFs), and ultimately predicting peak discharge, is crucial to mitigating the impacts of GLOFs on downstream communities and understanding concomitant surface change. The dearth of in situ measurements taken during GLOFs has left many GLOF models currently in use untested. Here we present a dataset of 13 GLOFs from Lago Cachet Dos, Aysen Region, Chile in which we detail measurements of key environmental variables (total volume drained, lake temperature, and lake inflow rate) and high temporal resolution discharge measurements at the source lake, in addition to well-constrained ice thickness and bedrock topography. Using this dataset we test two common empirical equations as well as the physically-based model of Spring-Hutter-Clarke. We find that the commonly used empirical relationships based solely on a dataset of lake volume drained fail to predict the large variability in observed peak discharges from Lago Cachet Dos. This disagreement is likely because these equations do not consider additional environmental variables that we show also control peak discharge, primarily, lake water temperature and the rate of meltwater inflow to the source lake. We find that the Spring-Hutter-Clarke model can accurately simulate the exponentially rising hydrographs that are characteristic of ice-dammed GLOFs, as well as the order of magnitude variation in peak discharge between events if the hydraulic roughness parameter is allowed to be a free fitting parameter. However, the Spring-Hutter-Clarke model over predicts peak discharge in all cases by 10 to 35%. The systematic over prediction of peak discharge by the model is related to its abrupt flood termination that misses the observed steep falling limb of the flood hydrograph. Although satisfactory model fits are produced, the range in hydraulic roughness required to obtain these fits across all events was large, which suggests that current models do not completely capture the physics of these systems, thus limiting their ability to truly predict peak discharges using only independently constrained parameters. We suggest what some of these missing physics might be.

  16. Multi-observation PET image analysis for patient follow-up quantitation and therapy assessment

    NASA Astrophysics Data System (ADS)

    David, S.; Visvikis, D.; Roux, C.; Hatt, M.

    2011-09-01

    In positron emission tomography (PET) imaging, an early therapeutic response is usually characterized by variations of semi-quantitative parameters restricted to maximum SUV measured in PET scans during the treatment. Such measurements do not reflect overall tumor volume and radiotracer uptake variations. The proposed approach is based on multi-observation image analysis for merging several PET acquisitions to assess tumor metabolic volume and uptake variations. The fusion algorithm is based on iterative estimation using a stochastic expectation maximization (SEM) algorithm. The proposed method was applied to simulated and clinical follow-up PET images. We compared the multi-observation fusion performance to threshold-based methods, proposed for the assessment of the therapeutic response based on functional volumes. On simulated datasets the adaptive threshold applied independently on both images led to higher errors than the ASEM fusion and on clinical datasets it failed to provide coherent measurements for four patients out of seven due to aberrant delineations. The ASEM method demonstrated improved and more robust estimation of the evaluation leading to more pertinent measurements. Future work will consist in extending the methodology and applying it to clinical multi-tracer datasets in order to evaluate its potential impact on the biological tumor volume definition for radiotherapy applications.

  17. Approximating scatterplots of large datasets using distribution splats

    NASA Astrophysics Data System (ADS)

    Camuto, Matthew; Crawfis, Roger; Becker, Barry G.

    2000-02-01

    Many situations exist where the plotting of large data sets with categorical attributes is desired in a 3D coordinate system. For example, a marketing company may conduct a survey involving one million subjects and then plot people's favorite car type against their weight, height and annual income. Scatter point plotting, in which each point is individually plotted at its corresponding Cartesian location using a defined primitive, is usually used to render a plot of this type. If the dependent variable is continuous, we can discretize the 3D space into bins or voxels and retain the average value of all records falling within each voxel. Previous work employed volume rendering techniques, in particular splatting, to represent this aggregated data by mapping each average value to a representative color.
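
    The binning step described above (retaining the average of all records falling within each voxel) can be sketched as follows; the survey variables, response and bin counts are invented for the example, and the splat rendering itself is not shown.

        import numpy as np

        # Hypothetical records: three continuous attributes plus a continuous response
        rng = np.random.default_rng(3)
        pts = rng.uniform(0, 1, size=(100_000, 3))
        response = pts[:, 0] + 0.5 * pts[:, 1] + 0.1 * rng.standard_normal(len(pts))

        edges = [np.linspace(0, 1, 17)] * 3            # 16 x 16 x 16 voxels
        sums, _ = np.histogramdd(pts, bins=edges, weights=response)
        counts, _ = np.histogramdd(pts, bins=edges)
        with np.errstate(invalid="ignore"):
            voxel_mean = sums / counts                 # value each splat would be colored by
        print(voxel_mean.shape, round(float(np.nanmax(voxel_mean)), 2))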

  18. SEGMA: An Automatic SEGMentation Approach for Human Brain MRI Using Sliding Window and Random Forests

    PubMed Central

    Serag, Ahmed; Wilkinson, Alastair G.; Telford, Emma J.; Pataky, Rozalia; Sparrow, Sarah A.; Anblagan, Devasuda; Macnaught, Gillian; Semple, Scott I.; Boardman, James P.

    2017-01-01

    Quantitative volumes from brain magnetic resonance imaging (MRI) acquired across the life course may be useful for investigating long term effects of risk and resilience factors for brain development and healthy aging, and for understanding early life determinants of adult brain structure. Therefore, there is an increasing need for automated segmentation tools that can be applied to images acquired at different life stages. We developed an automatic segmentation method for human brain MRI, where a sliding window approach and a multi-class random forest classifier were applied to high-dimensional feature vectors for accurate segmentation. The method performed well on brain MRI data acquired from 179 individuals, analyzed in three age groups: newborns (38–42 weeks gestational age), children and adolescents (4–17 years) and adults (35–71 years). As the method can learn from partially labeled datasets, it can be used to segment large-scale datasets efficiently. It could also be applied to different populations and imaging modalities across the life course. PMID:28163680
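
    A 2-D toy version of the sliding-window-plus-random-forest idea is sketched below; the real method works on 3-D MRI with richer high-dimensional features and multi-class labels, so the window features, toy image and labels here are illustrative only.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def window_features(img, half=2):
            """Flatten a (2*half+1)^2 sliding window around every pixel into a
            feature vector (2-D stand-in for the 3-D approach)."""
            pad = np.pad(img, half, mode="reflect")
            feats = [pad[i:i + img.shape[0], j:j + img.shape[1]]
                     for i in range(2 * half + 1) for j in range(2 * half + 1)]
            return np.stack(feats, axis=-1).reshape(-1, (2 * half + 1) ** 2)

        rng = np.random.default_rng(4)
        img = rng.standard_normal((64, 64))
        img[20:40, 20:40] += 3.0                       # bright toy "structure"
        labels = np.zeros((64, 64), int)
        labels[20:40, 20:40] = 1

        X, y = window_features(img), labels.ravel()
        train = rng.random(y.size) < 0.3               # learn from a partially labeled subset
        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[train], y[train])
        pred = clf.predict(X).reshape(img.shape)
        print(round(float((pred == labels).mean()), 3))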

  19. Gender classification of running subjects using full-body kinematics

    NASA Astrophysics Data System (ADS)

    Williams, Christina M.; Flora, Jeffrey B.; Iftekharuddin, Khan M.

    2016-05-01

    This paper proposes novel automated gender classification of subjects engaged in running activity. The machine learning techniques include preprocessing with principal component analysis followed by classification with linear discriminant analysis, nonlinear support vector machines, and decision stumps with AdaBoost. The dataset consists of 49 subjects (25 males, 24 females, 2 trials each), all equipped with approximately 80 retroreflective markers. The trials are reflective of the subject's entire body moving unrestrained through a capture volume at a self-selected running speed, thus producing highly realistic data. The classification accuracy using leave-one-out cross validation for the 49 subjects is improved from 66.33% using linear discriminant analysis to 86.74% using the nonlinear support vector machine. Results are further improved to 87.76% by implementing a nonlinear decision-stump classifier with AdaBoost. The experimental findings suggest that linear classification approaches are inadequate for classifying gender in a large dataset with subjects running in a moderately uninhibited environment.
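
    A schematic version of the classification pipeline (PCA for dimensionality reduction, then a linear and a nonlinear classifier under leave-one-out cross-validation) is sketched below on synthetic features; the kinematic marker features, component count and classifier settings are assumptions, not the paper's configuration.

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.svm import SVC
        from sklearn.model_selection import LeaveOneOut, cross_val_score

        # Hypothetical gait features (49 subjects x flattened marker trajectories)
        rng = np.random.default_rng(5)
        X = rng.standard_normal((49, 240))
        y = rng.integers(0, 2, 49)
        X[y == 1, :10] += 0.8                      # weak class signal for the toy data

        for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                          ("RBF SVM", SVC(kernel="rbf", C=1.0, gamma="scale"))]:
            pipe = make_pipeline(StandardScaler(), PCA(n_components=20), clf)
            acc = cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()
            print(name, round(float(acc), 3))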

  20. Progress and Challenges in Assessing NOAA Data Management

    NASA Astrophysics Data System (ADS)

    de la Beaujardiere, J.

    2016-12-01

    The US National Oceanic and Atmospheric Administration (NOAA) produces large volumes of environmental data from a great variety of observing systems including satellites, radars, aircraft, ships, buoys, and other platforms. These data are irreplaceable assets that must be properly managed to ensure they are discoverable, accessible, usable, and preserved. A policy framework has been established which informs data producers of their responsibilities and which supports White House-level mandates such as the Executive Order on Open Data and the OSTP Memorandum on Increasing Access to the Results of Federally Funded Scientific Research. However, assessing the current state and progress toward completion for the many NOAA datasets is a challenge. This presentation will discuss work toward establishing assessment methodologies and dashboard-style displays. Ideally, metrics would be gathered though software and be automatically updated whenever an individual improvement was made. In practice, however, some level of manual information collection is required. Differing approaches to dataset granularity in different branches of NOAA yield additional complexity.

  1. NP-PAH Interaction Dataset

    EPA Pesticide Factsheets

    Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration were allowed to equilibrate with a known mass of nanoparticles. The mixture was then ultracentrifuged and sampled for analysis. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).

  2. Really big data: Processing and analysis of large datasets

    USDA-ARS?s Scientific Manuscript database

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, Guthrie; Klumpp, John A.; Melo, Dunstana

    Here, the pharmacokinetic equations of Pierson et al. describing the behavior of bromide in the rat provide a general approach to the modeling of extracellular fluid (ECF). The movement of material into ECF spaces is rapid and is completely characterized by the vascular flow rates to and from a tissue, the volume of the tissue, and the ECF associated with the tissue. Early-time measurements are needed to characterize ECF. Measurements of DTPA disappearance from plasma by Wedeking et al. are discussed as an example of such measurements. In any biokinetic model, the fastest transfer rates are not determinable with the usual datasets, and if determined empirically, these rates will have very large and highly correlated uncertainties, so particular values of these rates, even though the model fits the available data, are not significant. A pharmacokinetic front-end provides values for these fast rates. An example of such a front-end for a 200-g rat is given.

  4. Rising dough and baking bread at the Australian synchrotron

    NASA Astrophysics Data System (ADS)

    Mayo, S. C.; McCann, T.; Day, L.; Favaro, J.; Tuhumury, H.; Thompson, D.; Maksimenko, A.

    2016-01-01

    Wheat protein quality and the amount of common salt added in dough formulation can have a significant effect on the microstructure and loaf volume of bread. High-speed synchrotron micro-CT provides an ideal tool for observing the three-dimensional structure of bread dough in situ during proving (rising) and baking. In this work, the synchrotron micro-CT technique was used to observe the structure and time evolution of doughs made from high and low protein flour and three different salt additives. These experiments showed that, as expected, high protein flour produces a higher volume loaf compared to low protein flour regardless of salt additives. Furthermore, the results show that KCl in particular has a very negative effect on dough properties, resulting in much reduced porosity. The hundreds of datasets produced and analysed during this experiment also provided a valuable test case for handling large quantities of data using tools on the Australian Synchrotron's MASSIVE cluster.

  5. An automatic approach for 3D registration of CT scans

    NASA Astrophysics Data System (ADS)

    Hu, Yang; Saber, Eli; Dianat, Sohail; Vantaram, Sreenath Rao; Abhyankar, Vishwas

    2012-03-01

    CT (Computed tomography) is a widely employed imaging modality in the medical field. Normally, a volume of CT scans is prescribed by a doctor when a specific region of the body (typically neck to groin) is suspected of being abnormal. The doctors are required to make professional diagnoses based upon the obtained datasets. In this paper, we propose an automatic registration algorithm that helps healthcare personnel to automatically align corresponding scans from 'Study' to 'Atlas'. The proposed algorithm is capable of aligning both 'Atlas' and 'Study' into the same resolution through 3D interpolation. After retrieving the scanned slice volume in the 'Study' and the corresponding volume in the original 'Atlas' dataset, a 3D cross correlation method is used to identify and register various body parts.
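
    A minimal sketch of estimating a translational offset between two CT volumes with FFT-based 3D cross-correlation, as a stand-in for the registration step described above; the interpolation of 'Atlas' and 'Study' to a common resolution is assumed to have been done already, and the volumes here are synthetic.

```python
import numpy as np

def estimate_shift(atlas, study):
    """Estimate the integer (dz, dy, dx) translation of `study` relative to `atlas`."""
    corr = np.real(np.fft.ifftn(np.fft.fftn(study) * np.conj(np.fft.fftn(atlas))))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peaks beyond half the volume size back to negative offsets.
    return tuple(p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(0)
atlas = rng.normal(size=(32, 32, 32))
study = np.roll(atlas, shift=(3, -2, 5), axis=(0, 1, 2))
shift = estimate_shift(atlas, study)
print(shift)  # expected (3, -2, 5)
# Rolling `study` by the negative of this shift re-aligns it with `atlas`:
realigned = np.roll(study, [-s for s in shift], axis=(0, 1, 2))
```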

  6. A simple rapid process for semi-automated brain extraction from magnetic resonance images of the whole mouse head.

    PubMed

    Delora, Adam; Gonzales, Aaron; Medina, Christopher S; Mitchell, Adam; Mohed, Abdul Faheem; Jacobs, Russell E; Bearer, Elaine L

    2016-01-15

    Magnetic resonance imaging (MRI) is a well-developed technique in neuroscience. Limitations in applying MRI to rodent models of neuropsychiatric disorders include the large number of animals required to achieve statistical significance, and the paucity of automation tools for the critical early step in processing, brain extraction, which prepares brain images for alignment and voxel-wise statistics. This novel timesaving automation of template-based brain extraction ("skull-stripping") is capable of quickly and reliably extracting the brain from large numbers of whole head images in a single step. The method is simple to install and requires minimal user interaction. This method is equally applicable to different types of MR images. Results were evaluated with Dice and Jaccard similarity indices and compared in 3D surface projections with other stripping approaches. Statistical comparisons demonstrate that individual variation of brain volumes is preserved. A downloadable software package for extraction of brains from whole head images, not otherwise available, is included here. This software tool increases speed, can be used with an atlas or a template from within the dataset, and produces masks that need little further refinement. Our new automation can be applied to any MR dataset, since the starting point is a template mask generated specifically for that dataset. The method reliably and rapidly extracts brain images from whole head images, rendering them usable for subsequent analytical processing. This software tool will accelerate the exploitation of mouse models for the investigation of human brain disorders by MRI. Copyright © 2015 Elsevier B.V. All rights reserved.
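
    For reference, the Dice and Jaccard similarity indices used to evaluate the extracted brain masks can be computed from two binary masks as follows; this is a generic sketch with toy masks, not the authors' evaluation code.

```python
import numpy as np

def dice_jaccard(mask_a, mask_b):
    """Return (Dice, Jaccard) overlap indices for two boolean masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    dice = 2.0 * intersection / (a.sum() + b.sum())
    jaccard = intersection / union
    return dice, jaccard

# Toy automated and manual brain masks that mostly overlap.
auto = np.zeros((64, 64, 64), dtype=bool)
auto[10:50, 10:50, 10:50] = True
manual = np.zeros_like(auto)
manual[12:52, 10:50, 10:50] = True
print(dice_jaccard(auto, manual))
```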

  7. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Herschtal, Alan, E-mail: Alan.Herschtal@petermac.org; Faculty of Health, Arts and Design, Swinburne University of Technology, Melbourne; Te Marvelde, Luc

    Objective: To develop a mathematical tool that can update a patient's planning target volume (PTV) partway through a course of radiation therapy to more precisely target the tumor for the remainder of treatment and reduce dose to surrounding healthy tissue. Methods and Materials: Daily on-board imaging was used to collect large datasets of displacements for patients undergoing external beam radiation therapy for solid tumors. Bayesian statistical modeling of these geometric uncertainties was used to optimally trade off between displacement data collected from previously treated patients and the progressively accumulating data from a patient currently partway through treatment, in order to predict future displacements for that patient. These predictions were used to update the PTV position and margin width for the remainder of treatment, such that the clinical target volume (CTV) was more precisely targeted. Results: Software simulation of dose to CTV and normal tissue for 2 real prostate displacement datasets consisting of 146 and 290 patients treated with a minimum of 30 fractions each showed that re-evaluating the PTV position and margin width after 8 treatment fractions reduced healthy tissue dose by 19% and 17%, respectively, while maintaining CTV dose. Conclusion: Incorporating patient-specific displacement patterns from early in a course of treatment allows PTV adaptation for the remainder of treatment. This substantially reduces the dose to healthy tissues and thus can reduce radiation therapy-induced toxicities, improving patient outcomes.
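
    The trade-off between population displacement data and a patient's own early-treatment displacements can be illustrated with a simple normal-normal shrinkage estimate; this is a hedged, one-dimensional sketch of the general idea, with illustrative numbers, not the trial's actual Bayesian model.

```python
import numpy as np

def posterior_mean_shift(patient_shifts_mm, pop_mean_mm, pop_sd_mm, daily_sd_mm):
    """Precision-weighted estimate of a patient's systematic shift (mm).

    pop_mean_mm / pop_sd_mm describe the population distribution of systematic
    shifts (the prior); daily_sd_mm is the day-to-day random variation about a
    patient's own mean. All values here are illustrative placeholders.
    """
    n = len(patient_shifts_mm)
    prior_precision = 1.0 / pop_sd_mm ** 2
    data_precision = n / daily_sd_mm ** 2
    return (prior_precision * pop_mean_mm +
            data_precision * np.mean(patient_shifts_mm)) / (prior_precision + data_precision)

# Measured target displacements over the first 8 fractions for one patient (mm):
shifts = np.array([4.1, 3.6, 5.0, 4.4, 3.9, 4.7, 4.2, 4.5])
print(posterior_mean_shift(shifts, pop_mean_mm=0.0, pop_sd_mm=3.0, daily_sd_mm=2.0))
```

    As more fractions accumulate, the data precision grows and the estimate moves from the population prior toward the patient's own mean, which is the mechanism that allows the PTV to be re-centred and its margin tightened partway through treatment.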

  8. Quality assurance in the EORTC 22033-26033/CE5 phase III randomized trial for low grade glioma: the digital individual case review.

    PubMed

    Fairchild, Alysa; Weber, Damien C; Bar-Deroma, Raquel; Gulyban, Akos; Fenton, Paul A; Stupp, Roger; Baumert, Brigitta G

    2012-06-01

    The phase III EORTC 22033-26033/NCIC CE5 intergroup trial compares 50.4 Gy radiotherapy with up-front temozolomide in previously untreated low-grade glioma. We describe the digital EORTC individual case review (ICR) performed to evaluate protocol radiotherapy (RT) compliance. Fifty-eight institutions were asked to submit 1-2 randomly selected cases. Digital ICR datasets were uploaded to the EORTC server and accessed by three central reviewers. Twenty-seven parameters were analysed including volume delineation, treatment planning, organ at risk (OAR) dosimetry and verification. Consensus reviews were collated and summary statistics calculated. Fifty-seven of seventy-two requested datasets from forty-eight institutions were technically usable. 31/57 received a major deviation for at least one section. Relocation accuracy was according to protocol in 45 cases. Just over 30% had acceptable target volumes. OAR contours were missing in an average of 25% of cases. Up to one-third of those present were incorrectly drawn, while dosimetry was largely protocol compliant. Beam energy was acceptable in 97% and 48 patients had per-protocol beam arrangements. Digital RT plan submission and review within the EORTC 22033-26033 ICR provide a solid foundation for future quality assurance procedures. Strict evaluation resulted in overall grades of minor and major deviation for 37% and 32% of cases, respectively. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  9. Finding Spatio-Temporal Patterns in Large Sensor Datasets

    ERIC Educational Resources Information Center

    McGuire, Michael Patrick

    2010-01-01

    Spatial or temporal data mining tasks are performed in the context of the relevant space, defined by a spatial neighborhood, and the relevant time period, defined by a specific time interval. Furthermore, when mining large spatio-temporal datasets, interesting patterns typically emerge where the dataset is most dynamic. This dissertation is…

  10. Parallel Index and Query for Large Scale Data Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chou, Jerry; Wu, Kesheng; Ruebel, Oliver

    2011-07-18

    Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.

  11. Exploiting PubChem for Virtual Screening

    PubMed Central

    Xie, Xiang-Qun

    2011-01-01

    Importance of the field PubChem is a public molecular information repository, a scientific showcase of the NIH Roadmap Initiative. The PubChem database holds over 27 million records of unique chemical structures of compounds (CID) derived from nearly 70 million substance depositions (SID), and contains more than 449,000 bioassay records comprising thousands of in vitro biochemical and cell-based screening bioassays that target more than 7,000 proteins and genes and link to over 1.8 million substances. Areas covered in this review This review builds on recent PubChem-related computational chemistry research reported by other authors while providing readers with an overview of the PubChem database, focusing on its increasing role in cheminformatics, virtual screening and toxicity prediction modeling. What the reader will gain These publicly available datasets in PubChem provide great opportunities for scientists to perform cheminformatics and virtual screening research for computer-aided drug design. However, the high volume and complexity of the datasets, in particular the bioassay-associated false positives/negatives and highly imbalanced datasets in PubChem, also create major challenges. Several approaches regarding the modeling of PubChem datasets and development of virtual screening models for bioactivity and toxicity predictions are also reviewed. Take home message Novel data-mining cheminformatics tools and virtual screening algorithms are being developed and used to retrieve, annotate and analyze the large-scale and highly complex PubChem biological screening data for drug design. PMID:21691435

  12. The digital traces of bubbles: feedback cycles between socio-economic signals in the Bitcoin economy.

    PubMed

    Garcia, David; Tessone, Claudio J; Mavrodiev, Pavlin; Perony, Nicolas

    2014-10-06

    What is the role of social interactions in the creation of price bubbles? Answering this question requires obtaining collective behavioural traces generated by the activity of a large number of actors. Digital currencies offer a unique possibility to measure socio-economic signals from such digital traces. Here, we focus on Bitcoin, the most popular cryptocurrency. Bitcoin has experienced periods of rapid increase in exchange rates (price) followed by sharp decline; we hypothesize that these fluctuations are largely driven by the interplay between different social phenomena. We thus quantify four socio-economic signals about Bitcoin from large datasets: price on online exchanges, volume of word-of-mouth communication in online social media, volume of information search and user base growth. By using vector autoregression, we identify two positive feedback loops that lead to price bubbles in the absence of exogenous stimuli: one driven by word of mouth, and the other by new Bitcoin adopters. We also observe that spikes in information search, presumably linked to external events, precede drastic price declines. Understanding the interplay between the socio-economic signals we measured can lead to applications beyond cryptocurrencies to other phenomena that leave digital footprints, such as online social network usage. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
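
    A minimal sketch of fitting a vector autoregression to daily socio-economic signals, in the spirit of the analysis described above; the column names and random data below are placeholders standing in for the real exchange, social-media, search, and user-base series.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Placeholder daily time series: price, word-of-mouth volume, search volume,
# and user adoption.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(500, 4)).cumsum(axis=0),
    columns=["price", "word_of_mouth", "search_volume", "new_users"],
)
# A VAR is usually fitted on differences or (log-)returns, not raw levels.
returns = data.diff().dropna()

results = VAR(returns).fit(maxlags=7, ic="aic")
print(results.summary())

# Granger-style tests can then probe feedback between signals, e.g. whether
# word-of-mouth volume helps predict price changes.
causality = results.test_causality("price", ["word_of_mouth"], kind="f")
print("word_of_mouth -> price, p-value:", causality.pvalue)
```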

  13. The digital traces of bubbles: feedback cycles between socio-economic signals in the Bitcoin economy

    PubMed Central

    Garcia, David; Tessone, Claudio J.; Mavrodiev, Pavlin; Perony, Nicolas

    2014-01-01

    What is the role of social interactions in the creation of price bubbles? Answering this question requires obtaining collective behavioural traces generated by the activity of a large number of actors. Digital currencies offer a unique possibility to measure socio-economic signals from such digital traces. Here, we focus on Bitcoin, the most popular cryptocurrency. Bitcoin has experienced periods of rapid increase in exchange rates (price) followed by sharp decline; we hypothesize that these fluctuations are largely driven by the interplay between different social phenomena. We thus quantify four socio-economic signals about Bitcoin from large datasets: price on online exchanges, volume of word-of-mouth communication in online social media, volume of information search and user base growth. By using vector autoregression, we identify two positive feedback loops that lead to price bubbles in the absence of exogenous stimuli: one driven by word of mouth, and the other by new Bitcoin adopters. We also observe that spikes in information search, presumably linked to external events, precede drastic price declines. Understanding the interplay between the socio-economic signals we measured can lead to applications beyond cryptocurrencies to other phenomena that leave digital footprints, such as online social network usage. PMID:25100315

  14. ALS-based hummock size-distance relationship assessment of Mt Shasta debris avalanche deposit, Northern California, USA

    NASA Astrophysics Data System (ADS)

    Tortini, Riccardo; Carn, Simon; van Wyk de Vries, Benjamin

    2015-04-01

    The failure of destabilized volcano flanks is a likely occurrence during the lifetime of a stratovolcano, generating large debris avalanches and drastically changing landforms around volcanoes. The significant hazards associated with these events in the Cascade range were demonstrated, for example, by the collapse of Mt St Helens (WA), which triggered its devastating explosive eruption in 1980. The rapid modification of the landforms due to these events makes it difficult to estimate the magnitude of prehistoric avalanches. However, the widespread preservation of hummocks along the course of rockslide-debris avalanches is highly significant for understanding the physical characteristics of these landslides. Mt Shasta is a 4,317 m high, snow-capped, steep-sloped stratovolcano located in Northern California. The current edifice began forming on the remnants of an ancestral Mt Shasta that collapsed ~300-380k years ago, producing one of the largest debris avalanches known on Earth. The debris avalanche deposit (DAD) covers a surface of ~450 km2 across the Shasta valley, with an estimated volume of ~26 km3. We analyze ALS data on hummocks from the prehistoric Shasta valley DAD in northern California (USA) to derive the relationship between hummock size and distance from landslide source, and interpret the geomorphic significance of the intercept and slope coefficients of the observed functional relationships. Given the limited extent of the ALS survey (i.e. 40 km2), the high-resolution dataset is used for validation of the morphological parameters extracted from freely available, broader coverage DTMs such as the National Elevation Dataset (NED). The ALS dataset also permits the identification of subtle topographic features not apparent in the field or in coarser resolution datasets, including a previously unmapped fault, of crucial importance for both seismic and volcanic hazard assessment in volcanic areas. We present evidence from the Shasta DAD of neotectonic deformation along a north-south trending fault and a comparison with the NED-derived DTM. This work aims to improve our understanding of the Shasta DAD morphology and dynamics, and provide insight into the cause and timing of events as well as the mode of emplacement of the DAD. The Cascade range includes numerous large extinct, dormant or active stratovolcanoes. Size-distance relationships will enable us to estimate the volume of the collapsed mass and the travel distance of the avalanche, and the knowledge of the link between basement structures and the Shasta DAD will elucidate the causes of edifice instability and may be used to target priority areas for volcanic hazard mapping.
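
    One common way to quantify a hummock size-distance relationship is to regress log-transformed hummock size against runout distance and then interpret the fitted slope and intercept; the sketch below is a generic illustration with synthetic values, not the Shasta ALS dataset.

```python
import numpy as np

# Synthetic hummock areas (m^2) decaying with distance from the source (km).
rng = np.random.default_rng(0)
distance_km = rng.uniform(5, 45, size=200)
area_m2 = 5e4 * np.exp(-0.08 * distance_km) * rng.lognormal(sigma=0.4, size=200)

# Fit log(area) = intercept + slope * distance.
slope, intercept = np.polyfit(distance_km, np.log(area_m2), deg=1)
print(f"slope = {slope:.3f} per km, intercept = {intercept:.2f} (log m^2)")
# The slope measures how quickly hummock size decays along the runout,
# while the intercept extrapolates a characteristic size at the source.
```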

  15. The Ophidia Stack: Toward Large Scale, Big Data Analytics Experiments for Climate Change

    NASA Astrophysics Data System (ADS)

    Fiore, S.; Williams, D. N.; D'Anca, A.; Nassisi, P.; Aloisio, G.

    2015-12-01

    The Ophidia project is a research effort on big data analytics facing scientific data analysis challenges in multiple domains (e.g. climate change). It provides a "datacube-oriented" framework responsible for atomically processing and manipulating scientific datasets, by providing a common way to run distributed tasks on large sets of data fragments (chunks). Ophidia provides declarative, server-side, and parallel data analysis, jointly with an internal storage model able to efficiently deal with multidimensional data and a hierarchical data organization to manage large data volumes. The project relies on a strong background in high performance database management and On-Line Analytical Processing (OLAP) systems to manage large scientific datasets. The Ophidia analytics platform provides several data operators to manipulate datacubes (about 50), and array-based primitives (more than 100) to perform data analysis on large scientific data arrays. To address interoperability, Ophidia provides multiple server interfaces (e.g. OGC-WPS). From a client standpoint, a Python interface enables the exploitation of the framework in Python-based ecosystems/applications (e.g. IPython) and the straightforward adoption of a strong set of related libraries (e.g. SciPy, NumPy). The talk will highlight a key feature of the Ophidia framework stack: the "Analytics Workflow Management System" (AWfMS). The Ophidia AWfMS coordinates, orchestrates, optimises and monitors the execution of multiple scientific data analytics and visualization tasks, thus supporting "complex analytics experiments". Some real use cases related to the CMIP5 experiment will be discussed. In particular, with regard to the "Climate models intercomparison data analysis" case study proposed in the EU H2020 INDIGO-DataCloud project, workflows related to (i) anomalies, (ii) trend, and (iii) climate change signal analysis will be presented. Such workflows will be distributed across multiple sites - according to the datasets distribution - and will include intercomparison, ensemble, and outlier analysis. The two-level workflow solution envisioned in INDIGO (coarse grain for distributed tasks orchestration, and fine grain, at the level of a single data analytics cluster instance) will be presented and discussed.

  16. Space-time patterns in ignimbrite compositions revealed by GIS and R based statistical analysis

    NASA Astrophysics Data System (ADS)

    Brandmeier, Melanie; Wörner, Gerhard

    2017-04-01

    GIS-based multivariate statistical and geospatial analysis of a compilation of 890 geochemical and ca. 1,200 geochronological data for 194 mapped ignimbrites from the Central Andes documents the compositional and temporal pattern of large volume ignimbrites (so-called "ignimbrite flare-ups") during Neogene times. Rapid advances in computational sciences during the past decade have led to a growing pool of algorithms for multivariate statistics on big datasets with many predictor variables. This study uses the potential of R and ArcGIS and applies cluster analysis (CA) and linear discriminant analysis (LDA) to log-ratio transformed spatial data. CA on major and trace element data allows ignimbrites to be grouped according to their geochemical characteristics into rhyolitic and dacitic "end-members" and differentiates characteristic trace element signatures with respect to Eu anomaly, depletion of MREEs and variable enrichment in LREE. To highlight these distinct compositional signatures, we applied LDA to selected ignimbrites for which comprehensive datasets were available. The most important predictors for discriminating ignimbrites are La (LREE), Yb (HREE), Eu, Al2O3, K2O, P2O5, MgO, FeOt and TiO2. However, other REEs such as Gd, Pr, Tm, Sm and Er also contribute to the discrimination functions. Significant compositional differences were found between the older (>14 Ma) large-volume plateau-forming ignimbrites in northernmost Chile and southern Peru and the younger (<10 Ma) Altiplano-Puna Volcanic Complex ignimbrites that are of similar volumes. Older ignimbrites are less depleted in HREEs and less radiogenic in Sr isotopes, indicating smaller crustal contributions during evolution in thinner and thermally less evolved crust. These compositional variations indicate a relation to crustal thickening, with a "transition" from plagioclase to amphibole and garnet residual mineralogy between 13 and 9 Ma. We correlate compositional and volumetric variations to the N-S passage of the Juan Fernández Ridge and to crustal shortening and thickening during the past 26 Ma. The value of GIS and multivariate statistics in comparison to traditional geochemical parameters is highlighted when working with large datasets with many predictors in a spatial and temporal context. Algorithms implemented in R allow taking advantage of an n-dimensional space and, thus, of subtle compositional differences contained in the data, while space-time patterns can be analyzed easily in GIS.
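
    A hedged sketch of the log-ratio transformation plus linear discriminant analysis workflow on compositional geochemical data, written in Python rather than R and using synthetic values in place of the Central Andes compilation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def clr(composition):
    """Centred log-ratio transform of a (n_samples, n_parts) compositional array."""
    logx = np.log(composition)
    return logx - logx.mean(axis=1, keepdims=True)

# Synthetic major/trace element concentrations for two ignimbrite groups.
rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=1.0, sigma=0.2, size=(60, 9))
group_b = rng.lognormal(mean=1.3, sigma=0.2, size=(60, 9))
X = clr(np.vstack([group_a, group_b]))
y = np.array([0] * 60 + [1] * 60)

lda = LinearDiscriminantAnalysis().fit(X, y)
# The absolute scaling coefficients indicate which (log-ratio transformed)
# elements contribute most to the discrimination function.
print(np.abs(lda.scalings_[:, 0]).round(2))
```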

  17. Large-scale image region documentation for fully automated image biomarker algorithm development and evaluation.

    PubMed

    Reeves, Anthony P; Xie, Yiting; Liu, Shuang

    2017-04-01

    With the advent of fully automated image analysis and modern machine learning methods, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. This paper presents a method and implementation for facilitating such datasets that addresses the critical issue of size scaling for algorithm validation and evaluation; current evaluation methods that are usually used in academic studies do not scale to large datasets. This method includes protocols for the documentation of many regions in very large image datasets; the documentation may be incrementally updated by new image data and by improved algorithm outcomes. This method has been used for 5 years in the context of chest health biomarkers from low-dose chest CT images that are now being used with increasing frequency in lung cancer screening practice. The lung scans are segmented into over 100 different anatomical regions, and the method has been applied to a dataset of over 20,000 chest CT images. Using this framework, the computer algorithms have been developed to achieve over 90% acceptable image segmentation on the complete dataset.

  18. Trace Gas/Aerosol Interactions and GMI Modeling Support

    NASA Technical Reports Server (NTRS)

    Penner, Joyce E.; Liu, Xiaohong; Das, Bigyani; Bergmann, Dan; Rodriquez, Jose M.; Strahan, Susan; Wang, Minghuai; Feng, Yan

    2005-01-01

    Current global aerosol models use different physical and chemical schemes and parameters, different meteorological fields, and often different emission sources. Since the physical and chemical parameterization schemes are often tuned to obtain results that are consistent with observations, it is difficult to assess the true uncertainty due to meteorology alone. Under the framework of the NASA global modeling initiative (GMI), the differences and uncertainties in aerosol simulations (for sulfate, organic carbon, black carbon, dust and sea salt) solely due to different meteorological fields are analyzed and quantified. Three meteorological datasets available from the NASA DAO GCM, the GISS-II' GCM, and the NASA finite volume GCM (FVGCM) are used to drive the same aerosol model. The global sulfate and mineral dust burdens with FVGCM fields are 40% and 20% less than those with DAO and GISS fields, respectively due to its heavier rainfall. Meanwhile, the sea salt burden predicted with FVGCM fields is 56% and 43% higher than those with DAO and GISS, respectively, due to its stronger convection especially over the Southern Hemispheric Ocean. Sulfate concentrations at the surface in the Northern Hemisphere extratropics and in the middle to upper troposphere differ by more than a factor of 3 between the three meteorological datasets. The agreement between model calculated and observed aerosol concentrations in the industrial regions (e.g., North America and Europe) is quite similar for all three meteorological datasets. Away from the source regions, however, the comparisons with observations differ greatly for DAO, FVGCM and GISS, and the performance of the model using different datasets varies largely depending on sites and species. Global annual average aerosol optical depth at 550 nm is 0.120-0.131 for the three meteorological datasets.

  19. Optimizing tertiary storage organization and access for spatio-temporal datasets

    NASA Technical Reports Server (NTRS)

    Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith

    1994-01-01

    We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access of subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over physical placement of data on storage devices. We also discuss in some detail the aspects of the interface between the application programs and the mass storage system, as well as a workbench to help scientists to design the best reorganization of a dataset for anticipated access patterns.

  20. BIANCA (Brain Intensity AbNormality Classification Algorithm): A new tool for automated segmentation of white matter hyperintensities.

    PubMed

    Griffanti, Ludovica; Zamboni, Giovanna; Khan, Aamira; Li, Linxin; Bonifacio, Guendalina; Sundaresan, Vaanathi; Schulz, Ursula G; Kuker, Wilhelm; Battaglini, Marco; Rothwell, Peter M; Jenkinson, Mark

    2016-11-01

    Reliable quantification of white matter hyperintensities of presumed vascular origin (WMHs) is increasingly needed, given the presence of these MRI findings in patients with several neurological and vascular disorders, as well as in elderly healthy subjects. We present BIANCA (Brain Intensity AbNormality Classification Algorithm), a fully automated, supervised method for WMH detection, based on the k-nearest neighbour (k-NN) algorithm. Relative to previous k-NN based segmentation methods, BIANCA offers options for weighting the spatial information, for local spatial intensity averaging, and for the choice of the number and location of the training points. BIANCA is multimodal and highly flexible so that the user can adapt the tool to their protocol and specific needs. We optimised and validated BIANCA on two datasets with different MRI protocols and patient populations (a "predominantly neurodegenerative" and a "predominantly vascular" cohort). BIANCA was first optimised on a subset of images for each dataset in terms of overlap and volumetric agreement with a manually segmented WMH mask. The correlation between the volumes extracted with BIANCA (using the optimised set of options), the volumes extracted from the manual masks and visual ratings showed that BIANCA is a valid alternative to manual segmentation. The optimised set of options was then applied to the whole cohorts and the resulting WMH volume estimates showed good correlations with visual ratings and with age. Finally, we performed a reproducibility test to evaluate the robustness of BIANCA, and compared BIANCA performance against existing methods. Our findings suggest that BIANCA, which will be freely available as part of the FSL package, is a reliable method for automated WMH segmentation in large cross-sectional cohort studies. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
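
    A minimal sketch of the underlying idea of k-NN voxel classification with intensity plus weighted spatial features; the feature construction, weights, and synthetic data are illustrative assumptions rather than BIANCA's actual implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def voxel_features(flair, spatial_weight=0.5):
    """Stack intensity and down-weighted, normalised voxel coordinates as features."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in flair.shape], indexing="ij")
    coords = np.stack([zz, yy, xx], axis=-1).reshape(-1, 3).astype(float)
    coords *= spatial_weight / np.array(flair.shape)
    return np.column_stack([flair.ravel(), coords])

rng = np.random.default_rng(0)
flair = rng.normal(size=(24, 24, 24))
flair[8:12, 8:12, 8:12] += 3.0                 # synthetic hyperintense lesion
truth = np.zeros(flair.shape, dtype=int)
truth[8:12, 8:12, 8:12] = 1

X = voxel_features(flair)
y = truth.ravel()
# Train on a random subset of labelled voxels (the "training points"),
# then classify every voxel in the image.
train = rng.choice(X.shape[0], size=2000, replace=False)
knn = KNeighborsClassifier(n_neighbors=5).fit(X[train], y[train])
pred = knn.predict(X).reshape(flair.shape)
print("lesion voxels predicted:", int(pred.sum()))
```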

  1. A fully automated system for quantification of background parenchymal enhancement in breast DCE-MRI

    NASA Astrophysics Data System (ADS)

    Ufuk Dalmiş, Mehmet; Gubern-Mérida, Albert; Borelli, Cristina; Vreemann, Suzan; Mann, Ritse M.; Karssemeijer, Nico

    2016-03-01

    Background parenchymal enhancement (BPE) observed in breast dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) has been identified as an important biomarker associated with risk for developing breast cancer. In this study, we present a fully automated framework for quantification of BPE. We initially segmented fibroglandular tissue (FGT) of the breasts using an improved version of an existing method. Subsequently, we computed BPEabs (volume of the enhancing tissue), BPErf (BPEabs divided by FGT volume) and BPErb (BPEabs divided by breast volume), using different relative enhancement threshold values between 1% and 100%. To evaluate and compare the previous and improved FGT segmentation methods, we used 20 breast DCE-MRI scans and we computed Dice similarity coefficient (DSC) values with respect to manual segmentations. For evaluation of the BPE quantification, we used a dataset of 95 breast DCE-MRI scans. Two radiologists, in individual reading sessions, visually analyzed the dataset and categorized each breast into minimal, mild, moderate and marked BPE. To measure the correlation between automated BPE values and the radiologists' assessments, we converted these values into ordinal categories and we used Spearman's rho as a measure of correlation. According to our results, the new segmentation method obtained an average DSC of 0.81 ± 0.09, which was significantly higher (p<0.001) compared to the previous method (0.76 ± 0.10). The highest correlation values between automated BPE categories and radiologists' assessments were obtained with the BPErf measurement (r=0.55, r=0.49, p<0.001 for both), while the correlation between the scores given by the two radiologists was 0.82 (p<0.001). The presented framework can be used to systematically investigate the correlation between BPE and risk in large screening cohorts.
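
    The three BPE measures reduce to simple volume ratios once the enhancing voxels are counted at a given relative-enhancement threshold; the sketch below is a hedged illustration in which the variable names, threshold, and synthetic volumes are assumptions, not the authors' code.

```python
import numpy as np

def bpe_measures(pre, post, fgt_mask, breast_mask, threshold=0.1, voxel_ml=0.001):
    """BPEabs (ml), BPErf and BPErb at a relative-enhancement threshold.

    pre/post are pre- and post-contrast volumes; fgt_mask and breast_mask are
    boolean masks; threshold=0.1 corresponds to 10% relative enhancement.
    """
    rel_enh = (post - pre) / np.maximum(pre, 1e-6)
    enhancing = (rel_enh > threshold) & fgt_mask
    bpe_abs = enhancing.sum() * voxel_ml                 # volume of enhancing tissue
    bpe_rf = bpe_abs / (fgt_mask.sum() * voxel_ml)       # relative to FGT volume
    bpe_rb = bpe_abs / (breast_mask.sum() * voxel_ml)    # relative to breast volume
    return bpe_abs, bpe_rf, bpe_rb

rng = np.random.default_rng(0)
pre = rng.uniform(50, 150, size=(32, 32, 32))
post = pre * rng.uniform(1.0, 1.4, size=pre.shape)
breast = np.ones(pre.shape, dtype=bool)
fgt = rng.random(pre.shape) < 0.3
print(bpe_measures(pre, post, fgt, breast))
```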

  2. Towards large-scale mapping of urban three-dimensional structure using Landsat imagery and global elevation datasets

    NASA Astrophysics Data System (ADS)

    Wang, P.; Huang, C.

    2017-12-01

    The three-dimensional (3D) structure of buildings and infrastructures is fundamental to understanding and modelling of the impacts and challenges of urbanization in terms of energy use, carbon emissions, and earthquake vulnerabilities. However, spatially detailed maps of urban 3D structure have been scarce, particularly in fast-changing developing countries. We present here a novel methodology to map the volume of buildings and infrastructures at 30 meter resolution using a synergy of Landsat imagery and openly available global digital surface models (DSMs), including the Shuttle Radar Topography Mission (SRTM), ASTER Global Digital Elevation Map (GDEM), ALOS World 3D - 30m (AW3D30), and the recently released global DSM from the TanDEM-X mission. Our method builds on the concept of an object-based height profile to extract height metrics from the DSMs and uses a machine learning algorithm to predict height and volume from the height metrics. We have tested this algorithm across the whole of England and assessed our results using Lidar measurements in 25 English cities. Our initial assessments achieved a RMSE of 1.4 m (R2 = 0.72) for building height and a RMSE of 1208.7 m3 (R2 = 0.69) for building volume, demonstrating the potential of large-scale applications and fully automated mapping of urban structure.
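
    A hedged sketch of the regression step, predicting building height from object-based DSM height metrics with a machine-learning regressor and reporting RMSE and R²; the feature names and data are synthetic placeholders rather than the Landsat/DSM features used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder height metrics derived from DSM-minus-terrain profiles
# (e.g. mean, max, percentiles within an object) and lidar reference heights.
rng = np.random.default_rng(0)
n = 2000
metrics = rng.normal(size=(n, 6))
height_m = 3.0 + 2.0 * metrics[:, 0] + 1.0 * metrics[:, 3] + rng.normal(scale=1.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(metrics, height_m, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"RMSE = {rmse:.2f} m, R^2 = {r2_score(y_te, pred):.2f}")
```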

  3. Data quality can make or break a research infrastructure

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Gunter, D.; Chu, H.; Christianson, D. S.; Trotta, C.; Canfora, E.; Faybishenko, B.; Cheah, Y. W.; Beekwilder, N.; Chan, S.; Dengel, S.; Keenan, T. F.; O'Brien, F.; Elbashandy, A.; Poindexter, C.; Humphrey, M.; Papale, D.; Agarwal, D.

    2017-12-01

    Research infrastructures (RIs) commonly support observational data provided by multiple, independent sources. Uniformity in the data distributed by such RIs is important in most applications, e.g., in comparative studies using data from two or more sources. Achieving uniformity in terms of data quality is challenging, especially considering that many data issues are unpredictable and cannot be detected until a first occurrence of the issue. As a result, many data quality control activities within RIs require a manual, human-in-the-loop element, making quality control an expensive activity. Our motivating example is the FLUXNET2015 dataset - a collection of ecosystem-level carbon, water, and energy fluxes between land and atmosphere from over 200 sites around the world, some sites with over 20 years of data. About 90% of the human effort to create the dataset was spent in data quality related activities. Based on this experience, we have been working on solutions to increase the automation of data quality control procedures. Since it is nearly impossible to fully automate all quality related checks, we have been drawing from the experience with techniques used in software development, which shares a few common constraints. In both managing scientific data and writing software, human time is a precious resource; code bases, like science datasets, can be large, complex, and full of errors; both scientific and software endeavors can be pursued by individuals, but collaborative teams can accomplish a lot more. The lucrative and fast-paced nature of the software industry fueled the creation of methods and tools to increase automation and productivity within these constraints. Issue tracking systems, methods for translating problems into automated tests, and powerful version control tools are a few examples. Terrestrial and aquatic ecosystems research relies heavily on many types of observational data. As the volume of data collected increases, ensuring data quality is becoming an unwieldy challenge for RIs. Business-as-usual approaches to data quality do not work with larger data volumes. We believe RIs can benefit greatly from adapting and imitating this body of theory and practice from software quality into data quality, enabling systematic and reproducible safeguards against errors and mistakes in datasets as much as in software.

  4. Hierarchical storage of large volumes of multidetector CT data using distributed servers

    NASA Astrophysics Data System (ADS)

    Ratib, Osman; Rosset, Antoine; Heuberger, Joris; Bandon, David

    2006-03-01

    Multidetector scanners and hybrid multimodality scanners have the ability to generate large numbers of high-resolution images, resulting in very large datasets. In most cases, these datasets are generated for the sole purpose of producing secondary processed images and 3D rendered images, as well as oblique and curved multiplanar reformatted images. It is therefore not essential to archive the original images after they have been processed. We have developed an architecture of distributed archive servers for temporary storage of large image datasets for 3D rendering and image processing without the need for long-term storage in a PACS archive. With the relatively low cost of storage devices, it is possible to configure these servers to hold several months or even years of data, long enough to allow subsequent re-processing if required by specific clinical situations. We tested the latest generation of RAID servers provided by Apple with a capacity of 5 TBytes. We implemented peer-to-peer data access software based on our open-source image management software called OsiriX, allowing remote workstations to directly access DICOM image files located on the server through a new technology called "Bonjour". This architecture offers a seamless integration of multiple servers and workstations without the need for a central database or complex workflow management tools. It allows efficient access to image data from multiple workstations for image analysis and visualization without the need for image data transfer. It provides a convenient alternative to centralized PACS architecture while avoiding complex and time-consuming data transfer and storage.

  5. I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.

    Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.

  6. Taking Open Innovation to the Molecular Level - Strengths and Limitations.

    PubMed

    Zdrazil, Barbara; Blomberg, Niklas; Ecker, Gerhard F

    2012-08-01

    The ever-growing availability of large-scale open data and its maturation is having a significant impact on industrial drug discovery, as well as on academic and non-profit research. As industry is changing to an 'open innovation' business concept, precompetitive initiatives and strong public-private partnerships including academic research cooperation partners are gaining more and more importance. Now, the bioinformatics and cheminformatics communities are seeking web tools which allow the integration of the large volume of life science datasets available in the public domain. Such a data exploitation tool would ideally be able to answer complex biological questions by formulating only one search query. In this short review/perspective, we outline the use of semantic web approaches for data and knowledge integration. Further, we discuss strengths and current limitations of publicly available data retrieval tools and integrated platforms.

  7. Analysis of the IJCNN 2011 UTL Challenge

    DTIC Science & Technology

    2012-01-13

    This report analyses the IJCNN 2011 UTL Challenge (http://clopinet.com/ul), for which large datasets from various application domains were made available: handwriting recognition, image recognition, video processing, text processing, and ecology. The evaluation sets consist of 4096 examples each. A partial summary of the challenge datasets (dataset, domain, features, sparsity, development examples, transfer examples): AVICENNA, handwriting, 120 features, 0% sparsity, 150,205 development examples, 50,000 transfer examples; HARRY, video, 5,000 features, 98.1% sparsity.

  8. Quantitative Missense Variant Effect Prediction Using Large-Scale Mutagenesis Data.

    PubMed

    Gray, Vanessa E; Hause, Ronald J; Luebeck, Jens; Shendure, Jay; Fowler, Douglas M

    2018-01-24

    Large datasets describing the quantitative effects of mutations on protein function are becoming increasingly available. Here, we leverage these datasets to develop Envision, which predicts the magnitude of a missense variant's molecular effect. Envision combines 21,026 variant effect measurements from nine large-scale experimental mutagenesis datasets, a hitherto untapped training resource, with a supervised, stochastic gradient boosting learning algorithm. Envision outperforms other missense variant effect predictors both on large-scale mutagenesis data and on an independent test dataset comprising 2,312 TP53 variants whose effects were measured using a low-throughput approach. This dataset was never used for hyperparameter tuning or model training and thus serves as an independent validation set. Envision prediction accuracy is also more consistent across amino acids than other predictors. Finally, we demonstrate that Envision's performance improves as more large-scale mutagenesis data are incorporated. We precompute Envision predictions for every possible single amino acid variant in human, mouse, frog, zebrafish, fruit fly, worm, and yeast proteomes (https://envision.gs.washington.edu/). Copyright © 2017 Elsevier Inc. All rights reserved.
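
    A minimal sketch of the supervised gradient-boosting setup described above, with synthetic stand-ins for the per-variant features (e.g. substitution and position descriptors) and measured effect scores; it is an illustration of the technique, not Envision's trained model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Placeholder variant features and quantitative effect scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))
y = X[:, 0] - 0.5 * X[:, 4] + rng.normal(scale=0.3, size=5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3,
                                random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(r2_score(y_te, gbm.predict(X_te)), 3))
```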

  9. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  10. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  11. Locally Downscaled and Spatially Customizable Climate Data for Historical and Future Periods for North America

    PubMed Central

    Wang, Tongli; Hamann, Andreas; Spittlehouse, Dave; Carroll, Carlos

    2016-01-01

    Large volumes of gridded climate data have become available in recent years including interpolated historical data from weather stations and future predictions from general circulation models. These datasets, however, are at various spatial resolutions that need to be converted to scales meaningful for applications such as climate change risk and impact assessments or sample-based ecological research. Extracting climate data for specific locations from large datasets is not a trivial task and typically requires advanced GIS and data management skills. In this study, we developed a software package, ClimateNA, that facilitates this task and provides a user-friendly interface suitable for resource managers and decision makers as well as scientists. The software locally downscales historical and future monthly climate data layers into scale-free point estimates of climate values for the entire North American continent. The software also calculates a large number of biologically relevant climate variables that are usually derived from daily weather data. ClimateNA covers 1) 104 years of historical data (1901–2014) in monthly, annual, decadal and 30-year time steps; 2) three paleoclimatic periods (Last Glacial Maximum, Mid Holocene and Last Millennium); 3) three future periods (2020s, 2050s and 2080s); and 4) annual time-series of model projections for 2011–2100. Multiple general circulation models (GCMs) were included for both paleo and future periods, and two representative concentration pathways (RCP4.5 and 8.5) were chosen for future climate data. PMID:27275583

  12. VarWalker: Personalized Mutation Network Analysis of Putative Cancer Genes from Next-Generation Sequencing Data

    PubMed Central

    Jia, Peilin; Zhao, Zhongming

    2014-01-01

    A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies of single-genes, which lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. VarWalker fits generalized additive models for each sample based on sample-specific mutation profiles and builds on the joint frequency of both mutation genes and their close interactors. These interactors are selected and optimized using the Random Walk with Restart algorithm in a protein-protein interaction network. We applied the method in >300 tumor genomes in two large-scale NGS benchmark datasets: 183 lung adenocarcinoma samples and 121 melanoma samples. In each cancer, we derived a consensus mutation subnetwork containing significantly enriched consensus cancer genes and cancer-related functional pathways. These cancer-specific mutation networks were then validated using independent datasets for each cancer. Importantly, VarWalker prioritizes well-known, infrequently mutated genes, which are shown to interact with highly recurrently mutated genes yet have been ignored by conventional single-gene-based approaches. Utilizing VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of cancer driver genes in NGS data. PMID:24516372
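
    The Random Walk with Restart step can be illustrated on a small network; the sketch below is a generic implementation of the iteration p = (1 - r) W p + r p0 on a toy adjacency matrix, not VarWalker's code or its protein-protein interaction network.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed_probs, restart=0.5, tol=1e-8):
    """Stationary visiting probabilities of p = (1 - r) * W p + r * p0."""
    W = adjacency / adjacency.sum(axis=0, keepdims=True)   # column-normalise
    p = seed_probs.copy()
    while True:
        p_next = (1.0 - restart) * W @ p + restart * seed_probs
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy symmetric interaction network; the mutated genes of one sample
# (nodes 0 and 3) act as restart seeds.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
seeds = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
seeds /= seeds.sum()
print(random_walk_with_restart(A, seeds).round(3))
```

    Genes with high stationary probability are close interactors of the mutated seeds and can be retained alongside them when building the sample-specific mutation subnetwork.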

  13. VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

    PubMed

    Jia, Peilin; Zhao, Zhongming

    2014-02-01

    A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies of single-genes, which lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. VarWalker fits generalized additive models for each sample based on sample-specific mutation profiles and builds on the joint frequency of both mutation genes and their close interactors. These interactors are selected and optimized using the Random Walk with Restart algorithm in a protein-protein interaction network. We applied the method in >300 tumor genomes in two large-scale NGS benchmark datasets: 183 lung adenocarcinoma samples and 121 melanoma samples. In each cancer, we derived a consensus mutation subnetwork containing significantly enriched consensus cancer genes and cancer-related functional pathways. These cancer-specific mutation networks were then validated using independent datasets for each cancer. Importantly, VarWalker prioritizes well-known, infrequently mutated genes, which are shown to interact with highly recurrently mutated genes yet have been ignored by conventional single-gene-based approaches. Utilizing VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of cancer driver genes in NGS data.

  14. Locally Downscaled and Spatially Customizable Climate Data for Historical and Future Periods for North America.

    PubMed

    Wang, Tongli; Hamann, Andreas; Spittlehouse, Dave; Carroll, Carlos

    2016-01-01

    Large volumes of gridded climate data have become available in recent years including interpolated historical data from weather stations and future predictions from general circulation models. These datasets, however, are at various spatial resolutions that need to be converted to scales meaningful for applications such as climate change risk and impact assessments or sample-based ecological research. Extracting climate data for specific locations from large datasets is not a trivial task and typically requires advanced GIS and data management skills. In this study, we developed a software package, ClimateNA, that facilitates this task and provides a user-friendly interface suitable for resource managers and decision makers as well as scientists. The software locally downscales historical and future monthly climate data layers into scale-free point estimates of climate values for the entire North American continent. The software also calculates a large number of biologically relevant climate variables that are usually derived from daily weather data. ClimateNA covers 1) 104 years of historical data (1901-2014) in monthly, annual, decadal and 30-year time steps; 2) three paleoclimatic periods (Last Glacial Maximum, Mid Holocene and Last Millennium); 3) three future periods (2020s, 2050s and 2080s); and 4) annual time-series of model projections for 2011-2100. Multiple general circulation models (GCMs) were included for both paleo and future periods, and two representative concentration pathways (RCP4.5 and 8.5) were chosen for future climate data.

  15. BactoGeNIE: A large-scale comparative genome visualization for big displays

    DOE PAGES

    Aurisano, Jillian; Reda, Khairi; Johnson, Andrew; ...

    2015-08-13

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. In conclusion, BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.

  16. BactoGeNIE: a large-scale comparative genome visualization for big displays

    PubMed Central

    2015-01-01

    Background The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. Results In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. Conclusions BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics. PMID:26329021

  17. BactoGeNIE: A large-scale comparative genome visualization for big displays

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aurisano, Jillian; Reda, Khairi; Johnson, Andrew

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. In conclusion, BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.

  18. Individualized Prediction of Reading Comprehension Ability Using Gray Matter Volume.

    PubMed

    Cui, Zaixu; Su, Mengmeng; Li, Liangjie; Shu, Hua; Gong, Gaolang

    2018-05-01

    Reading comprehension is a crucial reading skill for learning and putatively contains 2 key components: reading decoding and linguistic comprehension. Current understanding of the neural mechanism underlying these reading comprehension components is lacking, and whether and how neuroanatomical features can be used to predict these 2 skills remain largely unexplored. In the present study, we analyzed a large sample from the Human Connectome Project (HCP) dataset and successfully built multivariate predictive models for these 2 skills using whole-brain gray matter volume features. The results showed that these models effectively captured individual differences in these 2 skills and were able to significantly predict these components of reading comprehension for unseen individuals. The strict cross-validation using the HCP cohort and another independent cohort of children demonstrated the model generalizability. The identified gray matter regions contributing to the skill prediction consisted of a wide range of regions covering the putative reading, cerebellum, and subcortical systems. Interestingly, there were gender differences in the predictive models, with the female-specific model overestimating the males' abilities. Moreover, the identified contributing gray matter regions for the female-specific and male-specific models exhibited considerable differences, supporting a gender-dependent neuroanatomical substrate for reading comprehension.

  19. Launch of the I13-2 data beamline at the Diamond Light Source synchrotron

    NASA Astrophysics Data System (ADS)

    Bodey, A. J.; Rau, C.

    2017-06-01

    Users of the Diamond-Manchester Imaging Branchline I13-2 commonly spend many months analysing the large volumes of tomographic data generated in a single beamtime. This is due to the difficulties inherent in performing complicated, computationally-expensive analyses on large datasets with workstations of limited computing power. To improve productivity, a ‘data beamline’ was launched in January 2016. Users are scheduled for visits to the data beamline in the same way as for regular beamlines, with bookings made via the User Administration System and provision of financial support for travel and subsistence. Two high-performance graphics workstations were acquired, with sufficient RAM to enable simultaneous analysis of several tomographic volumes. Users are given high priority on Diamond’s central computing cluster for the duration of their visit, and if necessary, archived data are restored to a high-performance disk array. Within the first six months of operation, thirteen user visits were made, lasting an average of 4.5 days each. The I13-2 data beamline was the first to be launched at Diamond Light Source and, to the authors’ knowledge, the first to be formalised in this way at any synchrotron.

  20. Opportunistic citizen science data transform understanding of species distributions, phenology, and diversity gradients for global change research.

    PubMed

    Soroye, Peter; Ahmed, Najeeba; Kerr, Jeremy T

    2018-06-19

    Opportunistic citizen science (CS) programs allow volunteers to report species observations from anywhere, at any time, and can assemble large volumes of historic and current data at faster rates than more coordinated programs with standardized data collection. This can quickly provide large amounts of species distributional data, but whether this focus on participation comes at a cost in data quality is not clear. While automated and expert vetting can increase data reliability, there is no guarantee that opportunistic data will do anything more than confirm information from professional surveys. Here, we use eButterfly, an opportunistic CS program, and a comparable dataset of professionally collected observations, to measure the amount of new distributional species information that opportunistic CS generates. We also test how well opportunistic CS can estimate regional species richness for a large group of taxa (>300 butterfly species) across a broad area. We find that eButterfly contributes new distributional information for >80% of species, and that opportunistically submitting observations allowed volunteers to spot species ~35 days earlier than professionals. While eButterfly did a relatively poor job at predicting regional species richness by itself (detecting only about 35-57% of species per region), it significantly contributed to regional species richness when used with the professional dataset (adding ~3 species that had gone undetected in professional surveys per region). Overall, we find that the opportunistic CS model can provide substantial complementary species information when used alongside professional survey data. Our results suggest that data from opportunistic CS programs in conjunction with professional datasets can strongly increase the capacity of researchers to estimate species richness, and provide unique information on species distributions and phenologies that are relevant to the detection of the biological consequences of global change. This article is protected by copyright. All rights reserved.

  1. Large-scale image region documentation for fully automated image biomarker algorithm development and evaluation

    PubMed Central

    Reeves, Anthony P.; Xie, Yiting; Liu, Shuang

    2017-01-01

    Abstract. With the advent of fully automated image analysis and modern machine learning methods, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. This paper presents a method and implementation for facilitating such datasets that addresses the critical issue of size scaling for algorithm validation and evaluation; current evaluation methods that are usually used in academic studies do not scale to large datasets. This method includes protocols for the documentation of many regions in very large image datasets; the documentation may be incrementally updated by new image data and by improved algorithm outcomes. This method has been used for 5 years in the context of chest health biomarkers from low-dose chest CT images that are now being used with increasing frequency in lung cancer screening practice. The lung scans are segmented into over 100 different anatomical regions, and the method has been applied to a dataset of over 20,000 chest CT images. Using this framework, the computer algorithms have been developed to achieve over 90% acceptable image segmentation on the complete dataset. PMID:28612037

  2. On the radiobiological impact of metal artifacts in head-and-neck IMRT in terms of tumor control probability (TCP) and normal tissue complication probability (NTCP).

    PubMed

    Kim, Yusung; Tomé, Wolfgang A

    2007-11-01

    To investigate the effects of distorted head-and-neck (H&N) intensity-modulated radiation therapy (IMRT) dose distributions (hot and cold spots) on normal tissue complication probability (NTCP) and tumor control probability (TCP) due to dental-metal artifacts. Five patients' IMRT treatment plans were analyzed, employing five different planning image data-sets: (a) uncorrected (UC); (b) homogeneous uncorrected (HUC); (c) sinogram completion corrected (SCC); (d) minimum-value-corrected (MVC); and (e) streak-artifact-reduction including minimum-value-correction (SAR-MVC), which was taken as the reference data-set. The effects on NTCP and TCP were evaluated using the Lyman NTCP model and the logistic TCP model, respectively. When compared to the predicted NTCP obtained using the reference data-set, the treatment plan based on the original CT data-set (UC) yielded an increase in NTCP of 3.2 and 2.0% for the spared parotid gland and the spinal cord, respectively, whereas the treatment plans based on the MVC CT data-set increased NTCP by 1.1% and 0.1% for the spared parotid glands and the spinal cord, respectively. In addition, the MVC correction method showed a reduction in TCP for target volumes (MVC: delta TCP = -0.6% vs. UC: delta TCP = -1.9%) with respect to that of the reference CT data-set. Our results indicate that the presence of dental-metal artifacts in H&N planning CT data-sets has an impact on the estimates of TCP and NTCP. In particular, dental-metal artifacts lead to an increase in NTCP for the spared parotid glands and a slight decrease in TCP for target volumes.
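
    The Lyman NTCP model referenced in this record has a standard closed form, which the sketch below evaluates from a generalized equivalent uniform dose. This is a generic illustration, not the authors' code, and the TD50, m, and n parameter values as well as the toy DVH are purely illustrative placeholders.

```python
import numpy as np
from scipy.stats import norm

def geud(doses, volumes, n):
    """Generalized equivalent uniform dose from a differential DVH.
    volumes are fractional volumes; n is the volume-effect parameter."""
    v = np.asarray(volumes) / np.sum(volumes)
    return np.sum(v * np.asarray(doses) ** (1.0 / n)) ** n

def lyman_ntcp(doses, volumes, td50, m, n):
    """Lyman-Kutcher-Burman NTCP: Phi((gEUD - TD50) / (m * TD50))."""
    t = (geud(doses, volumes, n) - td50) / (m * td50)
    return norm.cdf(t)

# illustrative parotid-like parameters and a coarse toy DVH (placeholders)
doses = [10, 20, 30, 40]          # Gy, bin centres
volumes = [0.4, 0.3, 0.2, 0.1]    # fractional organ volume per bin
print(lyman_ntcp(doses, volumes, td50=39.9, m=0.40, n=1.0))
```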

  3. A peek into the future of radiology using big data applications.

    PubMed

    Kharat, Amit T; Singhal, Shubham

    2017-01-01

    Big data refers to the extremely large amounts of data available in the radiology department. Big data is identified by four Vs - Volume, Velocity, Variety, and Veracity. By applying different algorithmic tools and converting raw data to transformed data in such large datasets, there is a possibility of understanding and using radiology data for gaining new knowledge and insights. Big data analytics consists of 6Cs - Connection, Cloud, Cyber, Content, Community, and Customization. The global technological prowess and per-capita capacity to save digital information has roughly doubled every 40 months since the 1980s. By using big data, the planning and implementation of radiological procedures in radiology departments can be given a great boost. Potential applications of big data in the future are scheduling of scans, creating patient-specific personalized scanning protocols, radiologist decision support, emergency reporting, virtual quality assurance for the radiologist, etc. Big data applications can be targeted at images by supporting the analytic process. Screening software tools designed on big data can be used to highlight a region of interest, such as subtle changes in parenchymal density, a solitary pulmonary nodule, or focal hepatic lesions, by plotting its multidimensional anatomy. Following this, we can run more complex applications such as three-dimensional multiplanar reconstructions (MPR), volumetric rendering (VR), and curved planar reconstruction, which consume higher system resources, on targeted data subsets rather than querying the complete cross-sectional imaging dataset. This pre-emptive selection of the dataset can substantially reduce system requirements such as memory and server load, and provide prompt results. However, a word of caution: big data should not become "dump data" due to inadequate and poor analysis and non-structured, improperly stored data. In the near future, big data can ring in the era of personalized and individualized healthcare.

  4. Running a distributed virtual observatory: U.S. Virtual Astronomical Observatory operations

    NASA Astrophysics Data System (ADS)

    McGlynn, Thomas A.; Hanisch, Robert J.; Berriman, G. Bruce; Thakar, Aniruddha R.

    2012-09-01

    Operation of the US Virtual Astronomical Observatory shares some issues with modern physical observatories, e.g., intimidating data volumes and rapid technological change, and must also address unique concerns like the lack of direct control of the underlying and scattered data resources, and the distributed nature of the observatory itself. In this paper we discuss how the VAO has addressed these challenges to provide the astronomical community with a coherent set of science-enabling tools and services. The distributed nature of our virtual observatory-with data and personnel spanning geographic, institutional and regime boundaries-is simultaneously a major operational headache and the primary science motivation for the VAO. Most astronomy today uses data from many resources. Facilitation of matching heterogeneous datasets is a fundamental reason for the virtual observatory. Key aspects of our approach include continuous monitoring and validation of VAO and VO services and the datasets provided by the community, monitoring of user requests to optimize access, caching for large datasets, and providing distributed storage services that allow users to collect results near large data repositories. Some elements are now fully implemented, while others are planned for subsequent years. The distributed nature of the VAO requires careful attention to what can be a straightforward operation at a conventional observatory, e.g., the organization of the web site or the collection and combined analysis of logs. Many of these strategies use and extend protocols developed by the international virtual observatory community. Our long-term challenge is working with the underlying data providers to ensure high-quality implementation of VO data access protocols (new and better 'telescopes'), assisting astronomical developers to build robust integrating tools (new 'instruments'), and coordinating with the research community to maximize the science enabled.

  5. A Fast SVD-Hidden-nodes based Extreme Learning Machine for Large-Scale Data Analytics.

    PubMed

    Deng, Wan-Yu; Bai, Zuo; Huang, Guang-Bin; Zheng, Qing-Hua

    2016-05-01

    Big dimensional data is a growing trend that is emerging in many real-world contexts, extending from web mining, gene expression analysis, and protein-protein interactions to high-frequency financial data. Nowadays, there is a growing consensus that increasing dimensionality impedes the performance of classifiers, an effect termed the "peaking phenomenon" in the field of machine intelligence. To address the issue, dimensionality reduction is commonly employed as a preprocessing step on the Big dimensional data before building the classifiers. In this paper, we propose an Extreme Learning Machine (ELM) approach for large-scale data analytics. In contrast to existing approaches, we embed hidden nodes that are designed using singular value decomposition (SVD) into the classical ELM. These SVD nodes in the hidden layer are shown to capture the underlying characteristics of the Big dimensional data well, exhibiting excellent generalization performances. The drawback of using SVD on the entire dataset, however, is the high computational complexity involved. To address this, a fast divide-and-conquer approximation scheme is introduced to maintain computational tractability on high-volume data. The resultant algorithm proposed is labeled here as Fast Singular Value Decomposition-Hidden-nodes based Extreme Learning Machine, or FSVD-H-ELM in short. In FSVD-H-ELM, instead of identifying the SVD hidden nodes directly from the entire dataset, SVD hidden nodes are derived from multiple random subsets of data sampled from the original dataset. Comprehensive experiments and comparisons are conducted to assess the FSVD-H-ELM against other state-of-the-art algorithms. The results obtained demonstrated the superior generalization performance and efficiency of the FSVD-H-ELM. Copyright © 2016 Elsevier Ltd. All rights reserved.
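
    A minimal sketch of the general idea described above (hidden-node weights derived from the SVD of several random data subsets, followed by the usual closed-form ELM output layer). The subset size, regularization, and function names are assumptions for illustration, not the published FSVD-H-ELM code.

```python
import numpy as np

def svd_hidden_weights(X, n_hidden, n_subsets=4, rng=None):
    """Derive ELM hidden-node weights from the top right-singular vectors
    of several random subsets of the data (divide and conquer)."""
    rng = np.random.default_rng(rng)
    per = max(1, n_hidden // n_subsets)
    blocks = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=min(len(X), 200), replace=False)
        _, _, vt = np.linalg.svd(X[idx], full_matrices=False)
        blocks.append(vt[:per])
    return np.vstack(blocks)[:n_hidden]          # shape (n_hidden, n_features)

def elm_fit(X, y, W, reg=1e-3):
    """Closed-form ELM output weights with a sigmoid hidden layer."""
    H = 1.0 / (1.0 + np.exp(-X @ W.T))
    return np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ y)

def elm_predict(X, W, beta):
    H = 1.0 / (1.0 + np.exp(-X @ W.T))
    return H @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # synthetic labels
W = svd_hidden_weights(X, n_hidden=20, rng=0)
beta = elm_fit(X, y, W)
print(np.mean((elm_predict(X, W, beta) > 0.5) == y))
```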

  6. MetaMap: An atlas of metatranscriptomic reads in human disease-related RNA-seq data.

    PubMed

    Simon, L M; Karg, S; Westermann, A J; Engel, M; Elbehery, A H A; Hense, B; Heinig, M; Deng, L; Theis, F J

    2018-06-12

    With the advent of the age of big data in bioinformatics, large volumes of data and high performance computing power enable researchers to perform re-analyses of publicly available datasets at an unprecedented scale. Ever more studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts, but its generic nature also enables the detection of microbial and viral transcripts. We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating outcomes from 6 independent controlled infection experiments of cell line models and by comparison with an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from >17,000 samples from >400 studies relevant to human disease using state-of-the-art high performance computing systems. The resulting data of this large-scale re-analysis are made available in the presented MetaMap resource. Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for hypothesis generation towards the role of the microbiome in human disease. Additionally, code to process new datasets and perform statistical analyses is made available at https://github.com/theislab/MetaMap.
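
    The screening step described above hinges on re-inspecting the non-human-mapping read fraction of an alignment. The sketch below shows only that first step using pysam; the BAM filename is hypothetical and the real MetaMap pipeline at the linked GitHub repository is considerably more involved (classification of the unmapped reads, quality filtering, per-study aggregation).

```python
import pysam

def unmapped_fraction(bam_path):
    """Count primary reads that did not map to the human reference; these
    are the candidates a metatranscriptomic classifier would re-inspect."""
    total = unmapped = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if read.is_unmapped:
                unmapped += 1
    return unmapped, total

# hypothetical input: an RNA-seq alignment against the human genome
unmapped, total = unmapped_fraction("sample_vs_human.bam")
print(f"{unmapped}/{total} reads unmapped -> candidates for microbial screening")
```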

  7. Querying Large Biological Network Datasets

    ERIC Educational Resources Information Center

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in increasing amounts of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of data available requires fast, large-scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…

  8. Measured and Calculated Volumes of Wetland Depressions

    EPA Pesticide Factsheets

    Measured and calculated volumes of wetland depressions. This dataset is associated with the following publication: Wu, Q., and C. Lane. Delineation and quantification of wetland depressions in the Prairie Pothole Region of North Dakota. WETLANDS. The Society of Wetland Scientists, McLean, VA, USA, 36(2): 215-227, (2016).

  9. Exposure Render: An Interactive Photo-Realistic Volume Rendering Framework

    PubMed Central

    Kroes, Thomas; Post, Frits H.; Botha, Charl P.

    2012-01-01

    The field of volume visualization has undergone rapid development during the past years, both due to advances in suitable computing hardware and due to the increasing availability of large volume datasets. Recent work has focused on increasing the visual realism in Direct Volume Rendering (DVR) by integrating a number of visually plausible but often effect-specific rendering techniques, for instance modeling of light occlusion and depth of field. Besides yielding more attractive renderings, the more realistic lighting in particular has a positive effect on perceptual tasks. Although these new rendering techniques yield impressive results, they exhibit limitations in terms of their flexibility and their performance. Monte Carlo ray tracing (MCRT), coupled with physically based light transport, is the de-facto standard for synthesizing highly realistic images in the graphics domain, although usually not from volumetric data. Due to the stochastic sampling of MCRT algorithms, numerous effects can be achieved in a relatively straightforward fashion. For this reason, we have developed a practical framework that also applies MCRT techniques to DVR. With this work, we demonstrate that a host of realistic effects, including physically based lighting, can be simulated in a generic and flexible fashion, leading to interactive DVR with improved realism. In the hope that this improved approach to DVR will see more use in practice, we have made available our framework under a permissive open source license. PMID:22768292
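
    Monte Carlo light transport through a heterogeneous volume, of the kind this framework applies to DVR, is often built on Woodcock (delta) tracking. The sketch below is a generic transmittance estimator along a single ray under that scheme, with a toy extinction field; it is not code from Exposure Render and the sampling budget is arbitrary.

```python
import numpy as np

def woodcock_transmittance(sigma, ray_o, ray_d, t_max, sigma_maj,
                           n_samples=2000, rng=None):
    """Estimate transmittance along a ray by Woodcock (delta) tracking:
    sample free paths against a majorant extinction and count the samples
    that leave the volume without a real collision."""
    rng = np.random.default_rng(rng)
    escaped = 0
    for _ in range(n_samples):
        t = 0.0
        while True:
            t -= np.log(rng.random()) / sigma_maj        # tentative free path
            if t >= t_max:                               # left the volume
                escaped += 1
                break
            x = ray_o + t * ray_d
            if rng.random() < sigma(x) / sigma_maj:      # real collision
                break
    return escaped / n_samples

# toy volume: extinction peaks at the centre of a unit cube (majorant = 4)
sigma = lambda x: 4.0 * np.exp(-np.sum((x - 0.5) ** 2) / 0.05)
print(woodcock_transmittance(sigma, np.array([0.0, 0.5, 0.5]),
                             np.array([1.0, 0.0, 0.0]), t_max=1.0, sigma_maj=4.0))
```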

  10. The observed clustering of damaging extra-tropical cyclones in Europe

    NASA Astrophysics Data System (ADS)

    Cusack, S.

    2015-12-01

    The clustering of severe European windstorms on annual timescales has substantial impacts on the re/insurance industry. Management of the risk is impaired by large uncertainties in estimates of clustering from historical storm datasets typically covering the past few decades. The uncertainties are unusually large because clustering depends on the variance of storm counts. Eight storm datasets are gathered for analysis in this study in order to reduce these uncertainties. Six of the datasets contain more than 100 years of severe storm information to reduce sampling errors, and the diversity of information sources and analysis methods between datasets samples observational errors. All storm severity measures used in this study reflect damage, to suit re/insurance applications. It is found that the shortest storm dataset of 42 years in length provides estimates of clustering with very large sampling and observational errors. The dataset does provide some useful information: indications of stronger clustering for more severe storms, particularly for southern countries off the main storm track. However, substantially different results are produced by removal of one stormy season, 1989/1990, which illustrates the large uncertainties from a 42-year dataset. The extended storm records place 1989/1990 into a much longer historical context to produce more robust estimates of clustering. All the extended storm datasets show a greater degree of clustering with increasing storm severity and suggest clustering of severe storms is much more material than for weaker storms. Further, they contain signs of stronger clustering in areas off the main storm track, and weaker clustering for smaller-sized areas, though these signals are smaller than uncertainties in actual values. Both the improvement of existing storm records and development of new historical storm datasets would help to improve management of this risk.
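
    Clustering of annual storm counts is commonly quantified with the variance-to-mean (overdispersion) ratio, which is why single seasons such as 1989/1990 carry so much weight in short records. The snippet below illustrates that sensitivity with synthetic counts and a simple bootstrap interval; it uses made-up data, not the study's storm records.

```python
import numpy as np

def dispersion_index(counts):
    """Variance-to-mean ratio of annual storm counts: 1 for a Poisson
    process, >1 indicates clustering of storms between seasons."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

def bootstrap_ci(counts, n_boot=10000, rng=None):
    """95% bootstrap interval for the dispersion index."""
    rng = np.random.default_rng(rng)
    stats = [dispersion_index(rng.choice(counts, size=len(counts), replace=True))
             for _ in range(n_boot)]
    return np.percentile(stats, [2.5, 97.5])

# synthetic 42-season record with one exceptionally stormy season appended
counts = np.r_[np.random.default_rng(1).poisson(2.0, 41), 9]
print(dispersion_index(counts), bootstrap_ci(counts, rng=1))
print(dispersion_index(counts[:-1]))   # same record without the extreme season
```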

  11. Spatial contexts for temporal variability in alpine vegetation under ongoing climate change

    USGS Publications Warehouse

    Fagre, Daniel B.; Malanson, George P.

    2013-01-01

    A framework to monitor mountain summit vegetation (The Global Observation Research Initiative in Alpine Environments, GLORIA) was initiated in 1997. GLORIA results should be taken within a regional context of the spatial variability of alpine tundra. Changes observed at GLORIA sites in Glacier National Park, Montana, USA are quantified within the context of the range of variability observed in alpine tundra across much of western North America. Dissimilarity is calculated and used in nonmetric multidimensional scaling for repeated measures of vascular species cover at 14 GLORIA sites with 525 nearby sites and with 436 sites in western North America. The lengths of the trajectories of the GLORIA sites in ordination space are compared to the dimensions of the space created by the larger datasets. The absolute amount of change on the GLORIA summits over 5 years is high, but the degree of change is small relative to the geographical context. The GLORIA sites are on the margin of the ordination volumes with the large datasets. The GLORIA summit vegetation appears to be specialized, arguing for the intrinsic value of early observed change in limited niche space.
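
    The ordination step described above (a dissimilarity matrix followed by non-metric multidimensional scaling, with site trajectories measured in the resulting space) can be reproduced in outline with scipy and scikit-learn. The cover matrix below is random placeholder data, not the GLORIA measurements.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
cover = rng.random((50, 30))          # 50 plots x 30 vascular species (placeholder)

# Bray-Curtis dissimilarity between plots, then non-metric MDS into 2 axes
diss = squareform(pdist(cover, metric="braycurtis"))
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           n_init=10, max_iter=500, random_state=0)
coords = nmds.fit_transform(diss)

# trajectory length of one resurveyed plot between two ordination positions,
# to be compared against the spread of the full site cloud
print(np.linalg.norm(coords[0] - coords[1]))
print(coords.ptp(axis=0))             # extent of the ordination space per axis
```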

  12. A Computational Framework for High-Throughput Isotopic Natural Abundance Correction of Omics-Level Ultra-High Resolution FT-MS Datasets

    PubMed Central

    Carreer, William J.; Flight, Robert M.; Moseley, Hunter N. B.

    2013-01-01

    New metabolomics applications of ultra-high resolution and accuracy mass spectrometry can provide thousands of detectable isotopologues, with the number of potentially detectable isotopologues increasing exponentially with the number of stable isotopes used in newer isotope tracing methods like stable isotope-resolved metabolomics (SIRM) experiments. This huge increase in usable data requires software capable of correcting the large number of isotopologue peaks resulting from SIRM experiments in a timely manner. We describe the design of a new algorithm and software system capable of handling these high volumes of data, while including quality control methods for maintaining data quality. We validate this new algorithm against a previous single isotope correction algorithm in a two-step cross-validation. Next, we demonstrate the algorithm and correct for the effects of natural abundance for both 13C and 15N isotopes on a set of raw isotopologue intensities of UDP-N-acetyl-D-glucosamine derived from a 13C/15N-tracing experiment. Finally, we demonstrate the algorithm on a full omics-level dataset. PMID:24404440
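
    For a single tracer isotope, natural abundance correction reduces to inverting a binomial mixing matrix; the sketch below shows only that simplified case (the published algorithm handles multiple isotopes and ultra-high-resolution peak assignment), and the raw intensities are made-up illustrative numbers.

```python
import numpy as np
from math import comb

def correction_matrix(n_carbons, p13c=0.0107):
    """M[i, j] = probability that a molecule with j labelled carbons is
    observed in isotopologue peak i because of natural 13C abundance."""
    M = np.zeros((n_carbons + 1, n_carbons + 1))
    for j in range(n_carbons + 1):
        for i in range(j, n_carbons + 1):
            M[i, j] = (comb(n_carbons - j, i - j)
                       * p13c ** (i - j)
                       * (1 - p13c) ** (n_carbons - i))
    return M

def correct(observed, n_carbons):
    """Solve M @ corrected = observed for the labelling-only intensities."""
    M = correction_matrix(n_carbons)
    return np.linalg.solve(M, np.asarray(observed, dtype=float))

# illustrative raw isotopologue intensities for a 6-carbon metabolite
raw = [1000.0, 180.0, 40.0, 10.0, 5.0, 2.0, 1.0]
print(np.round(correct(raw, n_carbons=6), 2))
```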

  13. Integrating Intelligent Systems Domain Knowledge Into the Earth Science Curricula

    NASA Astrophysics Data System (ADS)

    Güereque, M.; Pennington, D. D.; Pierce, S. A.

    2017-12-01

    High-volume heterogeneous datasets are becoming ubiquitous, migrating to center stage over the last ten years and transcending the boundaries of computationally intensive disciplines into the mainstream, becoming a fundamental part of every science discipline. Despite the fact that large datasets are now pervasive across industries and academic disciplines, the array of skills is generally absent from earth science programs. This has left the bulk of the student population without access to curricula that systematically teach appropriate intelligent-systems skills, creating a void for skill sets that should be universal given their need and marketability. While some guidance regarding appropriate computational thinking and pedagogy is appearing, there exist few examples where these have been specifically designed and tested within the earth science domain. Furthermore, best practices from learning science have not yet been widely tested for developing intelligent systems-thinking skills. This research developed and tested evidence based computational skill modules that target this deficit with the intention of informing the earth science community as it continues to incorporate intelligent systems techniques and reasoning into its research and classrooms.

  14. Bat trait, genetic and pathogen data from large-scale investigations of African fruit bats, Eidolon helvum.

    PubMed

    Peel, Alison J; Baker, Kate S; Hayman, David T S; Suu-Ire, Richard; Breed, Andrew C; Gembu, Guy-Crispin; Lembo, Tiziana; Fernández-Loras, Andrés; Sargan, David R; Fooks, Anthony R; Cunningham, Andrew A; Wood, James L N

    2016-08-01

    Bats, including African straw-coloured fruit bats (Eidolon helvum), have been highlighted as reservoirs of many recently emerged zoonotic viruses. This common, widespread and ecologically important species was the focus of longitudinal and continent-wide studies of the epidemiological and ecology of Lagos bat virus, henipaviruses and Achimota viruses. Here we present a spatial, morphological, demographic, genetic and serological dataset encompassing 2827 bats from nine countries over an 8-year period. Genetic data comprises cytochrome b mitochondrial sequences (n=608) and microsatellite genotypes from 18 loci (n=544). Tooth-cementum analyses (n=316) allowed derivation of rare age-specific serologic data for a lyssavirus, a henipavirus and two rubulaviruses. This dataset contributes a substantial volume of data on the ecology of E. helvum and its viruses and will be valuable for a wide range of studies, including viral transmission dynamic modelling in age-structured populations, investigation of seasonal reproductive asynchrony in wide-ranging species, ecological niche modelling, inference of island colonisation history, exploration of relationships between island and body size, and various spatial analyses of demographic, morphometric or serological data.

  15. Tracking Provenance of Earth Science Data

    NASA Technical Reports Server (NTRS)

    Tilmes, Curt; Yesha, Yelena; Halem, Milton

    2010-01-01

    Tremendous volumes of data have been captured, archived and analyzed. Sensors, algorithms and processing systems for transforming and analyzing the data are evolving over time. Web Portals and Services can create transient data sets on-demand. Data are transferred from organization to organization with additional transformations at every stage. Provenance in this context refers to the source of data and a record of the process that led to its current state. It encompasses the documentation of a variety of artifacts related to particular data. Provenance is important for understanding and using scientific datasets, and critical for independent confirmation of scientific results. Managing provenance throughout scientific data processing has gained interest lately and there are a variety of approaches. Large scale scientific datasets consisting of thousands to millions of individual data files and processes offer particular challenges. This paper uses the analogy of art history provenance to explore some of the concerns of applying provenance tracking to earth science data. It also illustrates some of the provenance issues with examples drawn from the Ozone Monitoring Instrument (OMI) Data Processing System (OMIDAPS) run at NASA's Goddard Space Flight Center by the first author.

  16. Australia's continental-scale acoustic tracking database and its automated quality control process

    NASA Astrophysics Data System (ADS)

    Hoenner, Xavier; Huveneers, Charlie; Steckenreuter, Andre; Simpfendorfer, Colin; Tattersall, Katherine; Jaine, Fabrice; Atkins, Natalia; Babcock, Russ; Brodie, Stephanie; Burgess, Jonathan; Campbell, Hamish; Heupel, Michelle; Pasquer, Benedicte; Proctor, Roger; Taylor, Matthew D.; Udyawer, Vinay; Harcourt, Robert

    2018-01-01

    Our ability to predict species responses to environmental changes relies on accurate records of animal movement patterns. Continental-scale acoustic telemetry networks are increasingly being established worldwide, producing large volumes of information-rich geospatial data. During the last decade, the Integrated Marine Observing System's Animal Tracking Facility (IMOS ATF) established a permanent array of acoustic receivers around Australia. Simultaneously, IMOS developed a centralised national database to foster collaborative research across the user community and quantify individual behaviour across a broad range of taxa. Here we present the database and quality control procedures developed to collate 49.6 million valid detections from 1891 receiving stations. This dataset consists of detections for 3,777 tags deployed on 117 marine species, with distances travelled ranging from a few to thousands of kilometres. Connectivity between regions was only made possible by the joint contribution of IMOS infrastructure and researcher-funded receivers. This dataset constitutes a valuable resource facilitating meta-analysis of animal movement, distributions, and habitat use, and is important for relating species distribution shifts with environmental covariates.
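
    One standard automated quality-control check for acoustic detections is a speed filter between consecutive detections of the same tag. A minimal version is sketched below; the speed threshold and input layout are assumptions for illustration, not the IMOS ATF procedure itself.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def flag_implausible(times_s, lats, lons, max_speed_ms=10.0):
    """Flag detections that would require an unrealistic swimming speed
    to reach from the previous detection of the same tag."""
    flags = [False]
    for k in range(1, len(times_s)):
        dt = max(times_s[k] - times_s[k - 1], 1.0)
        speed = 1000.0 * haversine_km(lats[k - 1], lons[k - 1],
                                      lats[k], lons[k]) / dt
        flags.append(speed > max_speed_ms)
    return np.array(flags)

# three detections of one tag; the last jump is too fast to be real
print(flag_implausible([0, 3600, 3660],
                       lats=[-35.0, -35.01, -36.0],
                       lons=[151.0, 151.0, 152.0]))
```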

  17. Optimal retinal cyst segmentation from OCT images

    NASA Astrophysics Data System (ADS)

    Oguz, Ipek; Zhang, Li; Abramoff, Michael D.; Sonka, Milan

    2016-03-01

    Accurate and reproducible segmentation of cysts and fluid-filled regions from retinal OCT images is an important step allowing quantification of the disease status, longitudinal disease progression, and response to therapy in wet-pathology retinal diseases. However, segmentation of fluid-filled regions from OCT images is a challenging task due to their inhomogeneous appearance, the unpredictability of their number, size and location, as well as the intensity profile similarity between such regions and certain healthy tissue types. While machine learning techniques can be beneficial for this task, they require large training datasets and are often over-fitted to the appearance models of specific scanner vendors. We propose a knowledge-based approach that leverages a carefully designed cost function and graph-based segmentation techniques to provide a vendor-independent solution to this problem. We illustrate the results of this approach on two publicly available datasets with a variety of scanner vendors and retinal disease status. Compared to a previous machine-learning based approach, the volume similarity error was dramatically reduced from 81.3 +/- 56.4% to 22.2 +/- 21.3% (paired t-test, p << 0.001).

  18. Sentiment analysis of feature ranking methods for classification accuracy

    NASA Astrophysics Data System (ADS)

    Joseph, Shashank; Mugauri, Calvin; Sumathy, S.

    2017-11-01

    Text pre-processing and feature selection are important and critical steps in text mining. Text pre-processing of large volumes of datasets is a difficult task as unstructured raw data is converted into a structured format. Traditional methods of processing and weighting took much time and were less accurate. To overcome this challenge, feature ranking techniques have been devised. A feature set from text pre-processing is fed as input for feature selection. Feature selection helps improve text classification accuracy. Of the three feature selection categories available, the filter category will be the focus. Five feature ranking methods, namely document frequency, standard deviation, information gain, chi-square, and weighted log-likelihood ratio, are analyzed.
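
    Filter-style feature ranking of the kind analysed above is available directly in scikit-learn; the snippet below ranks text features by chi-square score on a toy corpus. The documents, labels, and k value are placeholders, not the paper's experimental setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great phone, love the battery",
        "terrible battery, broke in a week",
        "love this camera, great pictures",
        "awful service, terrible experience"]
labels = [1, 0, 1, 0]                        # toy sentiment labels

vec = CountVectorizer().fit(docs)            # text pre-processing: term counts
counts = vec.transform(docs)

selector = SelectKBest(chi2, k=5).fit(counts, labels)   # filter-based ranking
ranked = sorted(zip(vec.get_feature_names_out(), selector.scores_),
                key=lambda t: t[1], reverse=True)
print(ranked[:5])                            # top chi-square ranked terms
```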

  19. A web-based solution for 3D medical image visualization

    NASA Astrophysics Data System (ADS)

    Hou, Xiaoshuai; Sun, Jianyong; Zhang, Jianguo

    2015-03-01

    In this presentation, we present a web-based 3D medical image visualization solution which enables interactive processing and visualization of large medical image data over the web platform. To improve the efficiency of our solution, we adopt GPU-accelerated techniques to process images on the server side while rapidly transferring images to the HTML5-supported web browser on the client side. Compared to traditional local visualization solutions, our solution does not require users to install extra software or download the whole volume dataset from the PACS server. By designing this web-based solution, it is feasible for users to access the 3D medical image visualization service wherever the internet is available.

  20. X-ray EM simulation tool for ptychography dataset construction

    NASA Astrophysics Data System (ADS)

    Stoevelaar, L. Pjotr; Gerini, Giampiero

    2018-03-01

    In this paper, we present an electromagnetic full-wave modeling framework, as a support EM tool providing data sets for X-ray ptychographic imaging. Modeling the entire scattering problem with Finite Element Method (FEM) tools is, in fact, a prohibitive task, because of the large area illuminated by the beam (due to the poor focusing power at these wavelengths) and the very small features to be imaged. To overcome this problem, the spectrum of the illumination beam is decomposed into a discrete set of plane waves. This allows reducing the electromagnetic modeling volume to the one enclosing the area to be imaged. The total scattered field is reconstructed by superimposing the solutions for each plane wave illumination.

  1. Access NASA Satellite Global Precipitation Data Visualization on YouTube

    NASA Astrophysics Data System (ADS)

    Liu, Z.; Su, J.; Acker, J. G.; Huffman, G. J.; Vollmer, B.; Wei, J.; Meyer, D. J.

    2017-12-01

    Since the satellite era began, NASA has collected a large volume of Earth science observations for research and applications around the world. Satellite data at 12 NASA data centers can also be used for STEM activities on topics such as disaster events, climate change, etc. However, accessing satellite data can be a daunting task for non-professional users such as teachers and students because of unfamiliarity with terminology, disciplines, data formats, data structures, computing resources, processing software, programming languages, etc. Over the years, many efforts have been made to improve satellite data access, but barriers still exist for non-professionals. In this presentation, we will present our latest activity that uses the popular online video sharing web site, YouTube, to access visualization of global precipitation datasets at the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC). With YouTube, users can access and visualize a large volume of satellite data without the need to learn new software or download data. The dataset in this activity is the 3-hourly TRMM (Tropical Rainfall Measuring Mission) Multi-satellite Precipitation Analysis (TMPA). The video is built from over 50,000 data files collected from 1998 onwards, covering a zone between 50°N-S. The YouTube video will last 36 minutes for the entire dataset record (over 19 years). Since the time stamp is on each frame of the video, users can begin at any time by dragging the time progress bar. This precipitation animation will allow viewing precipitation events and processes (e.g., hurricanes, fronts, atmospheric rivers, etc.) on a global scale. The next plan is to develop a similar animation for the GPM (Global Precipitation Measurement) Integrated Multi-satellitE Retrievals for GPM (IMERG). IMERG provides near-global (60°N-S) precipitation coverage at a half-hourly time interval, showing more details on precipitation processes and development compared to the 3-hourly TMPA product. The entire video will contain more than 330,000 files and will last 3.6 hours. Future plans include development of fly-over videos for orbital data for an entire satellite mission or project. All videos will be uploaded and available at the GES DISC site on YouTube (https://www.youtube.com/user/NASAGESDISC).

  2. Regional growth and atlasing of the developing human brain

    PubMed Central

    Makropoulos, Antonios; Aljabar, Paul; Wright, Robert; Hüning, Britta; Merchant, Nazakat; Arichi, Tomoki; Tusor, Nora; Hajnal, Joseph V.; Edwards, A. David; Counsell, Serena J.; Rueckert, Daniel

    2016-01-01

    Detailed morphometric analysis of the neonatal brain is required to characterise brain development and define neuroimaging biomarkers related to impaired brain growth. Accurate automatic segmentation of neonatal brain MRI is a prerequisite to analyse large datasets. We have previously presented an accurate and robust automatic segmentation technique for parcellating the neonatal brain into multiple cortical and subcortical regions. In this study, we further extend our segmentation method to detect cortical sulci and provide a detailed delineation of the cortical ribbon. These detailed segmentations are used to build a 4-dimensional spatio-temporal structural atlas of the brain for 82 cortical and subcortical structures throughout this developmental period. We employ the algorithm to segment an extensive database of 420 MR images of the developing brain, from 27 to 45 weeks post-menstrual age at imaging. Regional volumetric and cortical surface measurements are derived and used to investigate brain growth and development during this critical period and to assess the impact of immaturity at birth. Whole brain volume, the absolute volume of all structures studied, cortical curvature and cortical surface area increased with increasing age at scan. Relative volumes of cortical grey matter, cerebellum and cerebrospinal fluid increased with age at scan, while relative volumes of white matter, ventricles, brainstem and basal ganglia and thalami decreased. Preterm infants at term had smaller whole brain volumes, reduced regional white matter and cortical and subcortical grey matter volumes, and reduced cortical surface area compared with term born controls, while ventricular volume was greater in the preterm group. Increasing prematurity at birth was associated with a reduction in total and regional white matter, cortical and subcortical grey matter volume, an increase in ventricular volume, and reduced cortical surface area. PMID:26499811

  3. Regional growth and atlasing of the developing human brain.

    PubMed

    Makropoulos, Antonios; Aljabar, Paul; Wright, Robert; Hüning, Britta; Merchant, Nazakat; Arichi, Tomoki; Tusor, Nora; Hajnal, Joseph V; Edwards, A David; Counsell, Serena J; Rueckert, Daniel

    2016-01-15

    Detailed morphometric analysis of the neonatal brain is required to characterise brain development and define neuroimaging biomarkers related to impaired brain growth. Accurate automatic segmentation of neonatal brain MRI is a prerequisite to analyse large datasets. We have previously presented an accurate and robust automatic segmentation technique for parcellating the neonatal brain into multiple cortical and subcortical regions. In this study, we further extend our segmentation method to detect cortical sulci and provide a detailed delineation of the cortical ribbon. These detailed segmentations are used to build a 4-dimensional spatio-temporal structural atlas of the brain for 82 cortical and subcortical structures throughout this developmental period. We employ the algorithm to segment an extensive database of 420 MR images of the developing brain, from 27 to 45 weeks post-menstrual age at imaging. Regional volumetric and cortical surface measurements are derived and used to investigate brain growth and development during this critical period and to assess the impact of immaturity at birth. Whole brain volume, the absolute volume of all structures studied, cortical curvature and cortical surface area increased with increasing age at scan. Relative volumes of cortical grey matter, cerebellum and cerebrospinal fluid increased with age at scan, while relative volumes of white matter, ventricles, brainstem and basal ganglia and thalami decreased. Preterm infants at term had smaller whole brain volumes, reduced regional white matter and cortical and subcortical grey matter volumes, and reduced cortical surface area compared with term born controls, while ventricular volume was greater in the preterm group. Increasing prematurity at birth was associated with a reduction in total and regional white matter, cortical and subcortical grey matter volume, an increase in ventricular volume, and reduced cortical surface area. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.

  4. SU-F-J-199: Predictive Models for Cone Beam CT-Based Online Verification of Pencil Beam Scanning Proton Therapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yin, L; Lin, A; Ahn, P

    Purpose: To utilize online CBCT scans to develop models for predicting DVH metrics in proton therapy of head and neck tumors. Methods: Nine patients with locally advanced oropharyngeal cancer were retrospectively selected in this study. Deformable image registration was applied to map the simulation CT, target volumes, and organs at risk (OARs) contours onto each weekly CBCT scan. Intensity modulated proton therapy (IMPT) treatment plans were created on the simulation CT and forward calculated onto each corrected CBCT scan. Thirty-six potentially predictive metrics were extracted from each corrected CBCT. These features include minimum/maximum/mean over- and under-ranges at the proximal and distal surfaces of PTV volumes, and geometrical and water-equivalent distances between the PTV and each OAR. Principal component analysis (PCA) was used to reduce the dimension of the extracted features. Three principal components were found to account for over 90% of the variance in those features. Datasets from eight patients were used to train a machine learning model to fit these principal components with DVH metrics (dose to 95% and 5% of PTV, mean dose or max dose to OARs) from the forward calculated dose on each corrected CBCT. The accuracy of this model was verified on the datasets from the 9th patient. Results: The predicted changes of DVH metrics from the model were in good agreement with actual values calculated on corrected CBCT images. Median differences were within 1 Gy for most DVH metrics except for larynx and constrictor mean dose. However, a large spread of the differences was observed, indicating additional training datasets and predictive features are needed to improve the model. Conclusion: Intensity corrected CBCT scans hold the potential to be used for online verification of proton therapy and prediction of delivered dose distributions.
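
    The modelling step described above (PCA to three principal components followed by a learned mapping to DVH metrics, trained on eight patients and tested on the ninth) can be outlined as follows. The feature and dose arrays here are random placeholders rather than the CBCT-derived metrics of the abstract, and the ridge regressor is an assumed choice of learner.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 36))          # 36 range/distance metrics per weekly CBCT (synthetic)
dvh_d95 = 70 + features[:, :3] @ [0.8, -0.5, 0.3] + rng.normal(0, 0.3, 64)  # synthetic target (Gy)

# PCA to 3 components, then a regression onto the DVH metric; simple
# 48/16 split standing in for the leave-one-patient-out scheme
model = make_pipeline(PCA(n_components=3), Ridge(alpha=1.0))
model.fit(features[:48], dvh_d95[:48])
pred = model.predict(features[48:])
print(np.mean(np.abs(pred - dvh_d95[48:])))   # mean absolute DVH-metric error (Gy)
```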

  5. Wide-Field Megahertz OCT Imaging of Patients with Diabetic Retinopathy

    PubMed Central

    Reznicek, Lukas; Kolb, Jan P.; Klein, Thomas; Mohler, Kathrin J.; Huber, Robert; Kernt, Marcus; Märtz, Josef; Neubauer, Aljoscha S.

    2015-01-01

    Purpose. To evaluate the feasibility of wide-field Megahertz (MHz) OCT imaging in patients with diabetic retinopathy. Methods. A consecutive series of 15 eyes of 15 patients with diagnosed diabetic retinopathy were included. All patients underwent Megahertz OCT imaging, a close clinical examination, slit lamp biomicroscopy, and funduscopic evaluation. To acquire densely sampled, wide-field volumetric datasets, an ophthalmic 1050 nm OCT prototype system based on a Fourier-domain mode-locked (FDML) laser source with 1.68 MHz A-scan rate was employed. Results. We were able to obtain OCT volume scans from all included 15 patients. Acquisition time was 1.8 seconds. Obtained volume datasets consisted of 2088 × 1044 A-scans of 60° of view. Thus, reconstructed en face images had a resolution of 34.8 pixels per degree in x-axis and 17.4 pixels per degree. Due to the densely sampled OCT volume dataset, postprocessed customized cross-sectional B-frames through pathologic changes such as an individual microaneurysm or a retinal neovascularization could be imaged. Conclusions. Wide-field Megahertz OCT is feasible to successfully image patients with diabetic retinopathy at high scanning rates and a wide angle of view, providing information in all three axes. The Megahertz OCT is a useful tool to screen diabetic patients for diabetic retinopathy. PMID:26273665

  6. Wide-Field Megahertz OCT Imaging of Patients with Diabetic Retinopathy.

    PubMed

    Reznicek, Lukas; Kolb, Jan P; Klein, Thomas; Mohler, Kathrin J; Wieser, Wolfgang; Huber, Robert; Kernt, Marcus; Märtz, Josef; Neubauer, Aljoscha S

    2015-01-01

    To evaluate the feasibility of wide-field Megahertz (MHz) OCT imaging in patients with diabetic retinopathy. A consecutive series of 15 eyes of 15 patients with diagnosed diabetic retinopathy were included. All patients underwent Megahertz OCT imaging, a close clinical examination, slit lamp biomicroscopy, and funduscopic evaluation. To acquire densely sampled, wide-field volumetric datasets, an ophthalmic 1050 nm OCT prototype system based on a Fourier-domain mode-locked (FDML) laser source with 1.68 MHz A-scan rate was employed. We were able to obtain OCT volume scans from all included 15 patients. Acquisition time was 1.8 seconds. Obtained volume datasets consisted of 2088 × 1044 A-scans of 60° of view. Thus, reconstructed en face images had a resolution of 34.8 pixels per degree in x-axis and 17.4 pixels per degree. Due to the densely sampled OCT volume dataset, postprocessed customized cross-sectional B-frames through pathologic changes such as an individual microaneurysm or a retinal neovascularization could be imaged. Wide-field Megahertz OCT is feasible to successfully image patients with diabetic retinopathy at high scanning rates and a wide angle of view, providing information in all three axes. The Megahertz OCT is a useful tool to screen diabetic patients for diabetic retinopathy.

  7. Comparative study of standard space and real space analysis of quantitative MR brain data.

    PubMed

    Aribisala, Benjamin S; He, Jiabao; Blamire, Andrew M

    2011-06-01

    To compare the robustness of region of interest (ROI) analysis of magnetic resonance imaging (MRI) brain data in real space with analysis in standard space and to test the hypothesis that standard space image analysis introduces more partial volume effect errors compared to analysis of the same dataset in real space. Twenty healthy adults with no history or evidence of neurological diseases were recruited; high-resolution T(1)-weighted, quantitative T(1), and B(0) field-map measurements were collected. Algorithms were implemented to perform analysis in real and standard space and used to apply a simple standard ROI template to quantitative T(1) datasets. Regional relaxation values and histograms for both gray and white matter tissues classes were then extracted and compared. Regional mean T(1) values for both gray and white matter were significantly lower using real space compared to standard space analysis. Additionally, regional T(1) histograms were more compact in real space, with smaller right-sided tails indicating lower partial volume errors compared to standard space analysis. Standard space analysis of quantitative MRI brain data introduces more partial volume effect errors biasing the analysis of quantitative data compared to analysis of the same dataset in real space. Copyright © 2011 Wiley-Liss, Inc.

  8. Inter-algorithm lesion volumetry comparison of real and 3D simulated lung lesions in CT

    NASA Astrophysics Data System (ADS)

    Robins, Marthony; Solomon, Justin; Hoye, Jocelyn; Smith, Taylor; Ebner, Lukas; Samei, Ehsan

    2017-03-01

    The purpose of this study was to establish volumetric exchangeability between real and computational lung lesions in CT. We compared the overall relative volume estimation performance of segmentation tools when used to measure real lesions in actual patient CT images and computational lesions virtually inserted into the same patient images (i.e., hybrid datasets). Pathologically confirmed malignancies from 30 thoracic patient cases from the Reference Image Database to Evaluate Therapy Response (RIDER) were modeled and used as the basis for the comparison. Lesions included isolated nodules as well as those attached to the pleura or other lung structures. Patient images were acquired using a 16- or 64-detector-row CT scanner (Lightspeed 16 or VCT; GE Healthcare). Scans were acquired using standard chest protocols during a single breath-hold. Virtual 3D lesion models based on real lesions were developed in Duke Lesion Tool (Duke University), and inserted using a validated image-domain insertion program. Nodule volumes were estimated using multiple commercial segmentation tools (iNtuition, TeraRecon, Inc.; Syngo.via, Siemens Healthcare; and IntelliSpace, Philips Healthcare). Consensus-based volume comparison showed consistent trends in volume measurement between real and virtual lesions across all software. The average percent bias (+/- standard error) shows -9.2+/-3.2% for real lesions versus -6.7+/-1.2% for virtual lesions with tool A, 3.9+/-2.5% and 5.0+/-0.9% for tool B, and 5.3+/-2.3% and 1.8+/-0.8% for tool C, respectively. Virtual lesion volumes were statistically similar to those of real lesions (< 4% difference) with p > .05 in most cases. Results suggest that hybrid datasets had similar inter-algorithm variability compared to real datasets.
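
    The percent bias metric quoted above has a simple definition; a short sketch of how such a consensus comparison could be computed is given below, using made-up volumes rather than the RIDER measurements.

```python
import numpy as np

def percent_bias(measured, reference):
    """Mean percent volume error and its standard error across lesions."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    err = 100.0 * (measured - reference) / reference
    return err.mean(), err.std(ddof=1) / np.sqrt(len(err))

# made-up volumes (mm^3): segmented vs. consensus reference for five lesions
measured  = [820, 1490, 355, 2710, 960]
reference = [900, 1500, 400, 2650, 1000]
print(percent_bias(measured, reference))   # (mean percent bias, standard error)
```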

  9. Assessing the Potential for Rooftop Rainwater Harvesting from Large Public Institutions.

    PubMed

    Adugna, Dagnachew; Jensen, Marina Bergen; Lemma, Brook; Gebrie, Geremew Sahilu

    2018-02-14

    As in many other cities, urbanization coupled with population growth worsens the water supply problem of Addis Ababa, Ethiopia, which had a water supply deficit of 41% in 2016. To investigate the potential contribution of rooftop rainwater harvesting (RWH) from large public institutions, 320 such institutions were selected and grouped into 11 categories, from which a representative 25-30% sample (588 rooftops) was digitized and the potential RWH volume computed based on a ten-year rainfall dataset. When comparing the resulting RWH potential with water consumption, up to 2.3% of the annual potable water supply can be provided. If reused only within one's own institution, the self-sufficiency varies from 0.9 to 649%. Non-uniform rainfall patterns add uncertainty to these numbers, since the size of the storage tank becomes critical for coverage in the dry season from October to May. Despite the low replacement potential at the city level, RWH from large institutions will enable a significant volume of potable water to be transferred to localities critically suffering from water shortage. Further, large institutions may demonstrate how RWH can be practiced, thus acting as a frontrunner for the dissemination of RWH to other types of rooftops. To narrow the water supply gap, considering rooftop RWH as an alternative water supply source is recommended. However, financial constraints on installing large storage tanks remain a possible challenge. Thus, future research is needed to investigate the cost-benefit balance, along with the development of cheap storage tanks, as these may affect the potential contribution of RWH from rooftops.
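
    Rooftop harvesting potential in studies of this kind is typically estimated as roof area times rainfall depth times a runoff coefficient. The toy calculation below uses assumed numbers, not the Addis Ababa values, to show how the self-sufficiency percentages above would be derived.

```python
# Rooftop rainwater harvesting potential: V = A * P * Cr
roof_area_m2 = 5000.0        # digitised roof area of one institution (assumed)
annual_rain_m = 1.2          # mean annual rainfall depth in metres (assumed)
runoff_coeff = 0.8           # fraction of rainfall actually captured (assumed)

harvest_m3 = roof_area_m2 * annual_rain_m * runoff_coeff
consumption_m3 = 9000.0      # assumed annual water consumption of the institution

print(f"harvestable volume: {harvest_m3:.0f} m3/year")
print(f"self-sufficiency:   {100 * harvest_m3 / consumption_m3:.1f} %")
```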

  10. Cloud Computing: A model Construct of Real-Time Monitoring for Big Dataset Analytics Using Apache Spark

    NASA Astrophysics Data System (ADS)

    Alkasem, Ameen; Liu, Hongwei; Zuo, Decheng; Algarash, Basheer

    2018-01-01

    The volume of data being collected, analyzed, and stored has exploded in recent years, particularly in relation to activity on cloud computing platforms, and large-scale data processing, analysis, and storage models such as cloud computing continue to grow. Today, the major challenge is how to monitor and control these massive amounts of data and perform analysis in real time at scale. Traditional methods and systems are unable to cope with these quantities of data in real time. Here we present a new methodology for constructing a model that optimizes the performance of real-time monitoring of big datasets, combining machine learning algorithms with Apache Spark Streaming to accomplish fine-grained fault diagnosis and repair. As a case study, we use the failure of Virtual Machines (VMs) to start up. The methodology ensures that the most sensible action is carried out during fine-grained monitoring and yields efficient, cost-saving fault repair through three control steps: (I) data collection; (II) an analysis engine; and (III) a decision engine. We found that running this methodology can save a considerable amount of time compared to the Hadoop model, without sacrificing classification accuracy or performance. The accuracy of the proposed method (92.13%) is an improvement on traditional approaches.

  11. Dosimetric impact of different CT datasets for stereotactic treatment planning using 3D conformal radiotherapy or volumetric modulated arc therapy.

    PubMed

    Oechsner, Markus; Odersky, Leonhard; Berndt, Johannes; Combs, Stephanie Elisabeth; Wilkens, Jan Jakob; Duma, Marciana Nona

    2015-12-01

    The purpose of this study was to assess the impact on dose to the planning target volume (PTV) and organs at risk (OAR) by using four differently generated CT datasets for dose calculation in stereotactic body radiotherapy (SBRT) of lung and liver tumors. Additionally, dose differences between 3D conformal radiotherapy and volumetric modulated arc therapy (VMAT) plans calculated on these CT datasets were determined. Twenty SBRT patients, ten lung cases and ten liver cases, were retrospectively selected for this study. Treatment plans were optimized on average intensity projection (AIP) CTs using 3D conformal radiotherapy (3D-CRT) and volumetric modulated arc therapy (VMAT). Afterwards, the plans were copied to the planning CTs (PCT), maximum intensity projection (MIP) and mid-ventilation (MidV) CT datasets and dose was recalculated keeping all beam parameters and monitor units unchanged. Ipsilateral lung and liver volumes and dosimetric parameters for PTV (Dmean, D2, D98, D95), ipsilateral lung and liver (Dmean, V30, V20, V10) were determined and statistically analysed using Wilcoxon test. Significant but small mean differences were found for PTV dose between the CTs (lung SBRT: ≤2.5 %; liver SBRT: ≤1.6 %). MIPs achieved the smallest lung and the largest liver volumes. OAR mean doses in MIP plans were distinctly smaller than in the other CT datasets. Furthermore, overlapping of tumors with the diaphragm results in underestimated ipsilateral lung dose in MIP plans. Best agreement was found between AIP and MidV (lung SBRT). Overall, differences in liver SBRT were smaller than in lung SBRT and VMAT plans achieved slightly smaller differences than 3D-CRT plans. Only small differences were found for PTV parameters between the four CT datasets. Larger differences occurred for the doses to organs at risk (ipsilateral lung, liver) especially for MIP plans. No relevant differences were observed between 3D-CRT or VMAT plans. MIP CTs are not appropriate for OAR dose assessment. PCT, AIP and MidV resulted in similar doses. If a 4DCT is acquired PCT can be omitted using AIP or MidV for treatment planning.

  12. What will the future of cloud-based astronomical data processing look like?

    NASA Astrophysics Data System (ADS)

    Green, Andrew W.; Mannering, Elizabeth; Harischandra, Lloyd; Vuong, Minh; O'Toole, Simon; Sealey, Katrina; Hopkins, Andrew M.

    2017-06-01

    Astronomy is rapidly approaching an impasse: very large datasets require remote or cloud-based parallel processing, yet many astronomers still try to download the data and develop serial code locally. Astronomers understand the need for change, but the hurdles remain high. We are developing a data archive designed from the ground up to simplify and encourage cloud-based parallel processing. While the volume of data we host remains modest by some standards, it is still large enough that download and processing times are measured in days and even weeks. We plan to implement a python based, notebook-like interface that automatically parallelises execution. Our goal is to provide an interface sufficiently familiar and user-friendly that it encourages the astronomer to run their analysis on our system in the cloud-astroinformatics as a service. We describe how our system addresses the approaching impasse in astronomy using the SAMI Galaxy Survey as an example.

  13. SEGMENTATION OF MITOCHONDRIA IN ELECTRON MICROSCOPY IMAGES USING ALGEBRAIC CURVES.

    PubMed

    Seyedhosseini, Mojtaba; Ellisman, Mark H; Tasdizen, Tolga

    2013-01-01

    High-resolution microscopy techniques have been used to generate large volumes of data with enough details for understanding the complex structure of the nervous system. However, automatic techniques are required to segment cells and intracellular structures in these multi-terabyte datasets and make anatomical analysis possible on a large scale. We propose a fully automated method that exploits both shape information and regional statistics to segment irregularly shaped intracellular structures such as mitochondria in electron microscopy (EM) images. The main idea is to use algebraic curves to extract shape features together with texture features from image patches. Then, these powerful features are used to learn a random forest classifier, which can predict mitochondria locations precisely. Finally, the algebraic curves together with regional information are used to segment the mitochondria at the predicted locations. We demonstrate that our method outperforms the state-of-the-art algorithms in segmentation of mitochondria in EM images.

  14. Learning Supervised Topic Models for Classification and Regression from Crowds.

    PubMed

    Rodrigues, Filipe; Lourenco, Mariana; Ribeiro, Bernardete; Pereira, Francisco C

    2017-12-01

    The growing need to analyze large collections of documents has led to great developments in topic modeling. Since documents are frequently associated with other related variables, such as labels or ratings, much interest has been placed on supervised topic models. However, the nature of most annotation tasks, prone to ambiguity and noise, often with high volumes of documents, deem learning under a single-annotator assumption unrealistic or unpractical for most real-world applications. In this article, we propose two supervised topic models, one for classification and another for regression problems, which account for the heterogeneity and biases among different annotators that are encountered in practice when learning from crowds. We develop an efficient stochastic variational inference algorithm that is able to scale to very large datasets, and we empirically demonstrate the advantages of the proposed model over state-of-the-art approaches.

  15. Scalable parallel distance field construction for large-scale applications

    DOE PAGES

    Yu, Hongfeng; Xie, Jinrong; Ma, Kwan -Liu; ...

    2015-10-01

    Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. In conclusion, our work greatly extends the usability of distance fields for demanding applications.

  16. Scalable Parallel Distance Field Construction for Large-Scale Applications.

    PubMed

    Yu, Hongfeng; Xie, Jinrong; Ma, Kwan-Liu; Kolla, Hemanth; Chen, Jacqueline H

    2015-10-01

    Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
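
    The paper's parallel distance tree is not reproduced here, but the underlying quantity, a distance field over a volume, can be illustrated on a single machine with SciPy's Euclidean distance transform (a serial stand-in, not the authors' scalable method).

```python
import numpy as np
from scipy import ndimage

# Hypothetical binary volume: True inside the surface of interest.
volume = np.zeros((64, 64, 64), dtype=bool)
volume[20:44, 20:44, 20:44] = True

# Unsigned distances to the object boundary, combined into a signed field
# (positive outside the object, negative inside).
dist_outside = ndimage.distance_transform_edt(~volume)
dist_inside = ndimage.distance_transform_edt(volume)
signed_field = dist_outside - dist_inside

print(signed_field.min(), signed_field.max())
```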

  17. Four-dimensional ultrasonography of the fetal heart with spatiotemporal image correlation.

    PubMed

    Gonçalves, Luís F; Lee, Wesley; Chaiworapongsa, Tinnakorn; Espinoza, Jimmy; Schoen, Mary Lou; Falkensammer, Peter; Treadwell, Marjorie; Romero, Roberto

    2003-12-01

    This study was undertaken to describe a new technique for the examination of the fetal heart using four-dimensional ultrasonography with spatiotemporal image correlation (STIC). Volume data sets of the fetal heart were acquired with a new cardiac gating technique (STIC), which uses automated transverse and longitudinal sweeps of the anterior chest wall. These volumes were obtained from 69 fetuses: 35 normal, 16 with congenital anomalies not affecting the cardiovascular system, and 18 with cardiac abnormalities. Dynamic multiplanar slicing and surface rendering of cardiac structures were performed. To illustrate the STIC technique, two representative volumes from a normal fetus were compared with volumes obtained from fetuses with the following congenital heart anomalies: atrioventricular septal defect, tricuspid stenosis, tricuspid atresia, and interrupted inferior vena cava with abnormal venous drainage. Volume datasets obtained with a transverse sweep were utilized to demonstrate the cardiac chambers, moderator band, interatrial and interventricular septae, atrioventricular valves, pulmonary veins, and outflow tracts. With the use of a reference dot to navigate the four-chamber view, intracardiac structures could be simultaneously studied in three orthogonal planes. The same volume dataset was used for surface rendering of the atrioventricular valves. The aortic and ductal arches were best visualized when the original plane of acquisition was sagittal. Volumes could be interactively manipulated to simultaneously visualize both outflow tracts, in addition to the aortic and ductal arches. Novel views of specific structures were generated. For example, the location and extent of a ventricular septal defect was imaged in a sagittal view of the interventricular septum. Furthermore, surface-rendered images of the atrioventricular valves were employed to distinguish between normal and pathologic conditions. Representative video clips were posted on the Journal's Web site to demonstrate the diagnostic capabilities of this new technique. Dynamic multiplanar slicing and surface rendering of the fetal heart are feasible with STIC technology. One good quality volume dataset, obtained from a transverse sweep, can be used to examine the four-chamber view and the outflow tracts. This novel method may assist in the evaluation of fetal cardiac anatomy.

  18. The Role of Extracellular Fluid in Biokinetic Modeling

    DOE PAGES

    Miller, Guthrie; Klumpp, John A.; Melo, Dunstana; ...

    2017-12-01

    Here, the pharmacokinetic equations of Pierson et al. describing the behavior of bromide in rat provide a general approach to the modeling of extracellular fluid (ECF). The movement of material into ECF spaces is rapid and is completely characterized by tissue volumes and vascular flow rates to and from a tissue, the volumes of the tissue, and the ECF associated with the tissue. Early-time measurements are needed to characterize ECF. Measurements of DTPA disappearance from plasma by Wedeking et al. are discussed as an example of such measurements. In any biokinetic model, the fastest transfer rates are not determinable with the usual datasets, and if determined empirically, these rates will have very large and highly correlated uncertainties, so particular values of these rates, even though the model fits the available data, are not significant. A pharmacokinetic front-end provides values for these fast rates. An example of such a front-end for a 200-g rat is given.
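
    A generic two-compartment plasma/ECF exchange model illustrates the kind of fast kinetics described above; this sketch is not Pierson et al.'s system, and the rate constants and volumes are placeholders.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Generic two-compartment exchange between plasma and extracellular fluid (ECF).
# Rates (1/min) are illustrative placeholders, not fitted values.
k_pe, k_ep, k_el = 0.8, 0.5, 0.05   # plasma->ECF, ECF->plasma, elimination

def rhs(t, y):
    plasma, ecf = y
    d_plasma = -k_pe * plasma + k_ep * ecf - k_el * plasma
    d_ecf = k_pe * plasma - k_ep * ecf
    return [d_plasma, d_ecf]

sol = solve_ivp(rhs, t_span=(0.0, 120.0), y0=[1.0, 0.0], dense_output=True)
t = np.linspace(0.0, 120.0, 25)
plasma_curve = sol.sol(t)[0]   # early-time points are what constrain the fast rates
```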

  19. Allometric Analysis Detects Brain Size-Independent Effects of Sex and Sex Chromosome Complement on Human Cerebellar Organization

    PubMed Central

    Mankiw, Catherine; Park, Min Tae M.; Reardon, P.K.; Fish, Ari M.; Clasen, Liv S.; Greenstein, Deanna; Blumenthal, Jonathan D.; Lerch, Jason P.; Chakravarty, M. Mallar

    2017-01-01

    The cerebellum is a large hindbrain structure that is increasingly recognized for its contribution to diverse domains of cognitive and affective processing in human health and disease. Although several of these domains are sex biased, our fundamental understanding of cerebellar sex differences—including their spatial distribution, potential biological determinants, and independence from brain volume variation—lags far behind that for the cerebrum. Here, we harness automated neuroimaging methods for cerebellar morphometrics in 417 individuals to (1) localize normative male–female differences in raw cerebellar volume, (2) compare these to sex chromosome effects estimated across five rare sex (X/Y) chromosome aneuploidy (SCA) syndromes, and (3) clarify brain size-independent effects of sex and SCA on cerebellar anatomy using a generalizable allometric approach that considers scaling relationships between regional cerebellar volume and brain volume in health. The integration of these approaches shows that (1) sex and SCA effects on raw cerebellar volume are large and distributed, but regionally heterogeneous, (2) human cerebellar volume scales with brain volume in a highly nonlinear and regionally heterogeneous fashion that departs from documented patterns of cerebellar scaling in phylogeny, and (3) cerebellar organization is modified in a brain size-independent manner by sex (relative expansion of total cerebellum, flocculus, and Crus II-lobule VIIIB volumes in males) and SCA (contraction of total cerebellar, lobule IV, and Crus I volumes with additional X- or Y-chromosomes; X-specific contraction of Crus II-lobule VIIIB). Our methods and results clarify the shifts in human cerebellar organization that accompany interwoven variations in sex, sex chromosome complement, and brain size. SIGNIFICANCE STATEMENT Cerebellar systems are implicated in diverse domains of sex-biased behavior and pathology, but we lack a basic understanding of how sex differences in the human cerebellum are distributed and determined. We leverage a rare neuroimaging dataset to deconvolve the interwoven effects of sex, sex chromosome complement, and brain size on human cerebellar organization. We reveal topographically variegated scaling relationships between regional cerebellar volume and brain size in humans, which (1) are distinct from those observed in phylogeny, (2) invalidate a traditional neuroimaging method for brain volume correction, and (3) allow more valid and accurate resolution of which cerebellar subcomponents are sensitive to sex and sex chromosome complement. These findings advance understanding of cerebellar organization in health and sex chromosome aneuploidy. PMID:28314818
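
    The allometric approach referenced above amounts to fitting a power-law scaling model, V_region = a * V_brain^b, which is linear in log-log space. A hedged sketch with synthetic numbers (the exponent, coefficients and sample size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: regional cerebellar volume scaling with total brain volume.
brain_vol = rng.uniform(1000.0, 1500.0, size=200)             # cm^3, illustrative
true_b = 1.3                                                   # hypothetical scaling exponent
region_vol = 0.01 * brain_vol**true_b * rng.lognormal(0.0, 0.05, size=200)

# Allometric model V_region = a * V_brain**b becomes linear in log-log space.
b_hat, log_a_hat = np.polyfit(np.log(brain_vol), np.log(region_vol), 1)
print(f"estimated scaling exponent b = {b_hat:.2f}")
```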

  20. Regional magnetic resonance imaging measures for multivariate analysis in Alzheimer's disease and mild cognitive impairment.

    PubMed

    Westman, Eric; Aguilar, Carlos; Muehlboeck, J-Sebastian; Simmons, Andrew

    2013-01-01

    Automated structural magnetic resonance imaging (MRI) processing pipelines are gaining popularity for Alzheimer's disease (AD) research. They generate regional volumes, cortical thickness measures and other measures, which can be used as input for multivariate analysis. It is not clear which combination of measures and normalization approach are most useful for AD classification and to predict mild cognitive impairment (MCI) conversion. The current study includes MRI scans from 699 subjects [AD, MCI and controls (CTL)] from the Alzheimer's disease Neuroimaging Initiative (ADNI). The Freesurfer pipeline was used to generate regional volume, cortical thickness, gray matter volume, surface area, mean curvature, gaussian curvature, folding index and curvature index measures. 259 variables were used for orthogonal partial least square to latent structures (OPLS) multivariate analysis. Normalisation approaches were explored and the optimal combination of measures determined. Results indicate that cortical thickness measures should not be normalized, while volumes should probably be normalized by intracranial volume (ICV). Combining regional cortical thickness measures (not normalized) with cortical and subcortical volumes (normalized with ICV) using OPLS gave a prediction accuracy of 91.5 % when distinguishing AD versus CTL. This model prospectively predicted future decline from MCI to AD with 75.9 % of converters correctly classified. Normalization strategy did not have a significant effect on the accuracies of multivariate models containing multiple MRI measures for this large dataset. The appropriate choice of input for multivariate analysis in AD and MCI is of great importance. The results support the use of un-normalised cortical thickness measures and volumes normalised by ICV.
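
    A minimal sketch of the normalization strategy the study supports (regional volumes divided by intracranial volume, cortical thickness left un-normalized), assuming a pandas table of FreeSurfer-style outputs; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical FreeSurfer-style output: one row per subject.
df = pd.DataFrame({
    "ICV":              [1.45e6, 1.60e6, 1.38e6],   # intracranial volume, mm^3
    "Hippocampus_vol":  [3900.0, 4200.0, 3500.0],   # mm^3
    "Entorhinal_thick": [3.2, 3.4, 2.9],            # mm
})

# Volumes normalized by ICV; thickness left unchanged, as the study recommends.
df["Hippocampus_vol_norm"] = df["Hippocampus_vol"] / df["ICV"]
features = df[["Hippocampus_vol_norm", "Entorhinal_thick"]]
print(features)
```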

  1. Object recognition using deep convolutional neural networks with complete transfer and partial frozen layers

    NASA Astrophysics Data System (ADS)

    Kruithof, Maarten C.; Bouma, Henri; Fischer, Noëlle M.; Schutte, Klamer

    2016-10-01

    Object recognition is important to understand the content of video and allow flexible querying in a large number of cameras, especially for security applications. Recent benchmarks show that deep convolutional neural networks are excellent approaches for object recognition. This paper describes an approach of domain transfer, where features learned from a large annotated dataset are transferred to a target domain where less annotated examples are available as is typical for the security and defense domain. Many of these networks trained on natural images appear to learn features similar to Gabor filters and color blobs in the first layer. These first-layer features appear to be generic for many datasets and tasks while the last layer is specific. In this paper, we study the effect of copying all layers and fine-tuning a variable number. We performed an experiment with a Caffe-based network on 1000 ImageNet classes that are randomly divided in two equal subgroups for the transfer from one to the other. We copy all layers and vary the number of layers that is fine-tuned and the size of the target dataset. We performed additional experiments with the Keras platform on CIFAR-10 dataset to validate general applicability. We show with both platforms and both datasets that the accuracy on the target dataset improves when more target data is used. When the target dataset is large, it is beneficial to freeze only a few layers. For a large target dataset, the network without transfer learning performs better than the transfer network, especially if many layers are frozen. When the target dataset is small, it is beneficial to transfer (and freeze) many layers. For a small target dataset, the transfer network boosts generalization and it performs much better than the network without transfer learning. Learning time can be reduced by freezing many layers in a network.
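
    A minimal Keras-style sketch of the copy-and-freeze idea described above; the architecture, layer counts and number of frozen layers are illustrative stand-ins, not the networks used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Small stand-in CNN. In the transfer setting, its weights would have been
# trained on a large source dataset before being reused on the target domain.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Freeze all layers except the last `n_trainable` before fine-tuning on the
# (smaller) target dataset; varying n_trainable reproduces the paper's setup.
n_trainable = 2
for layer in model.layers[:-n_trainable]:
    layer.trainable = False

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```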

  2. The experience of linking Victorian emergency medical service trauma data

    PubMed Central

    Boyle, Malcolm J

    2008-01-01

    Background The linking of a large Emergency Medical Service (EMS) dataset with the Victorian Department of Human Services (DHS) hospital datasets and Victorian State Trauma Outcome Registry and Monitoring (VSTORM) dataset to determine patient outcomes has not previously been undertaken in Victoria. The objective of this study was to identify the linkage rate of a large EMS trauma dataset with the Department of Human Services hospital datasets and VSTORM dataset. Methods The linking of an EMS trauma dataset to the hospital datasets utilised deterministic and probabilistic matching. The linking of three EMS trauma datasets to the VSTORM dataset utilised deterministic, probabilistic and manual matching. Results There were 66.7% of patients from the EMS dataset located in the VEMD. There were 96% of patients located in the VAED who were defined in the VEMD as being admitted to hospital. 3.7% of patients located in the VAED could not be found in the VEMD due to hospitals not reporting to the VEMD. For the EMS datasets, there was a 146% increase in successful links with the trauma profile dataset, a 221% increase in successful links with the mechanism of injury only dataset, and a 46% increase with sudden deterioration dataset, to VSTORM when using manual compared to deterministic matching. Conclusion This study has demonstrated that EMS data can be successfully linked to other health related datasets using deterministic and probabilistic matching with varying levels of success. The quality of EMS data needs to be improved to ensure better linkage success rates with other health related datasets. PMID:19014622
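
    Deterministic linkage of the kind described above requires exact agreement on a block of identifiers; a hedged pandas sketch follows, with hypothetical field names and records (probabilistic and manual matching are not shown).

```python
import pandas as pd

# Hypothetical extracts of an EMS dataset and a hospital dataset.
ems = pd.DataFrame({
    "dob": ["1970-01-02", "1985-06-30"],
    "sex": ["M", "F"],
    "incident_date": ["2007-03-01", "2007-04-15"],
    "dest_hospital": ["A", "B"],
})
hospital = pd.DataFrame({
    "dob": ["1970-01-02", "1990-12-12"],
    "sex": ["M", "M"],
    "admission_date": ["2007-03-01", "2007-05-20"],
    "hospital": ["A", "C"],
})

# Align field names, then require exact agreement on all linkage keys.
hospital = hospital.rename(columns={"admission_date": "incident_date",
                                    "hospital": "dest_hospital"})
links = ems.merge(hospital,
                  on=["dob", "sex", "incident_date", "dest_hospital"],
                  how="inner")
print(f"{len(links)} deterministic link(s) found")
```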

  3. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819

  4. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
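
    The Wide-Open pipeline itself is not reproduced here; the sketch below only illustrates the general idea of text-mining accession identifiers and checking a repository, assuming a GEO-style accession pattern and the NCBI E-utilities esearch endpoint. The accession, regex and "public vs private" logic are simplifications, not the authors' code.

```python
import re
import requests

# Text-mine GEO-style accession numbers (placeholder text and accession).
article_text = "Raw data have been deposited in GEO under accession GSE12345."
accessions = set(re.findall(r"GSE\d+", article_text))

# Query NCBI E-utilities to see whether each accession is publicly findable;
# an accession with zero hits may still be private (deliberately simplified).
for acc in accessions:
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "gds", "term": acc, "retmode": "json"},
        timeout=30,
    )
    count = int(resp.json()["esearchresult"]["count"])
    print(acc, "public" if count > 0 else "possibly still private")
```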

  5. The ROSCOE Manual. Volume I-1. Program Description

    DTIC Science & Technology

    1980-02-29

    ...dataset is denoted by E8. Every dataset is tied to the basic dataset along with some important lists such as the object list, the radar list, and the ... used for the track initiation and track functions are shown next. Most of these parameters are well defined, with the exception of the range gate

  6. Handling a Small Dataset Problem in Prediction Model by employ Artificial Data Generation Approach: A Review

    NASA Astrophysics Data System (ADS)

    Lateh, Masitah Abdul; Kamilah Muda, Azah; Yusof, Zeratul Izzah Mohd; Azilah Muda, Noor; Sanusi Azmi, Mohd

    2017-09-01

    The emerging era of big data over the past few years has led to large and complex datasets that demand faster and better decision making. However, small dataset problems still arise in certain areas, making analysis and decisions hard to reach. In order to build a prediction model, a large sample is required for training the model; a small dataset is insufficient to produce an accurate prediction model. This paper reviews artificial data generation approaches as one solution to the small dataset problem.
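
    One common family of artificial data generation is resampling the small training set and perturbing it with noise ("virtual samples"). A generic hedged sketch follows; it is not tied to any specific method in the review, and all sizes and noise levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Small original training set: 12 samples, 3 features.
X_small = rng.normal(size=(12, 3))
y_small = X_small @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=12)

def generate_virtual_samples(X, y, n_new, noise_scale=0.05, rng=rng):
    """Create artificial samples by resampling and jittering the originals."""
    idx = rng.integers(0, len(X), size=n_new)
    jitter = rng.normal(scale=noise_scale * X.std(axis=0), size=(n_new, X.shape[1]))
    return X[idx] + jitter, y[idx]

X_virtual, y_virtual = generate_virtual_samples(X_small, y_small, n_new=100)
X_train = np.vstack([X_small, X_virtual])
y_train = np.concatenate([y_small, y_virtual])
```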

  7. Fully automatic GBM segmentation in the TCGA-GBM dataset: Prognosis and correlation with VASARI features.

    PubMed

    Rios Velazquez, Emmanuel; Meier, Raphael; Dunn, William D; Alexander, Brian; Wiest, Roland; Bauer, Stefan; Gutman, David A; Reyes, Mauricio; Aerts, Hugo J W L

    2015-11-18

    Reproducible definition and quantification of imaging biomarkers is essential. We evaluated a fully automatic MR-based segmentation method by comparing it to manually defined sub-volumes by experienced radiologists in the TCGA-GBM dataset, in terms of sub-volume prognosis and association with VASARI features. MRI sets of 109 GBM patients were downloaded from the Cancer Imaging archive. GBM sub-compartments were defined manually and automatically using the Brain Tumor Image Analysis (BraTumIA). Spearman's correlation was used to evaluate the agreement with VASARI features. Prognostic significance was assessed using the C-index. Auto-segmented sub-volumes showed moderate to high agreement with manually delineated volumes (range (r): 0.4 - 0.86). Also, the auto and manual volumes showed similar correlation with VASARI features (auto r = 0.35, 0.43 and 0.36; manual r = 0.17, 0.67, 0.41, for contrast-enhancing, necrosis and edema, respectively). The auto-segmented contrast-enhancing volume and post-contrast abnormal volume showed the highest AUC (0.66, CI: 0.55-0.77 and 0.65, CI: 0.54-0.76), comparable to manually defined volumes (0.64, CI: 0.53-0.75 and 0.63, CI: 0.52-0.74, respectively). BraTumIA and manual tumor sub-compartments showed comparable performance in terms of prognosis and correlation with VASARI features. This method can enable more reproducible definition and quantification of imaging based biomarkers and has potential in high-throughput medical imaging research.

  8. Sea Level Changes: Determination and Effects

    NASA Astrophysics Data System (ADS)

    Woodworth, P. L.; Pugh, D. T.; DeRonde, J. G.; Warrick, R. G.; Hannah, J.

    The measurement of sea level is of fundamental importance to a wide range of research in climatology, oceanography, geology and geodesy. This volume attempts to cover many aspects of the field. The volume opens with a description by Bolduc and Murty of one of the products stemming from the development of tide gauge networks in the northern and tropical Atlantic. This work is relevant to the growth of the Global Sea Level Observing System (GLOSS), the main goal of which is to provide the world with an efficient, coherent sea level monitoring system for oceanographic and climatological research. The subsequent four papers present results from the analysis of existing tide gauge data, including those datasets available from the Permanent Service for Mean Sea Level and the TOGA Sea Level Center. Two of the four, by Wroblewski and by Pasaric and Orlic, are concerned with European sea level changes, while Yu Jiye et al. discuss inter-annual changes in the Pacific, and Wang Baocan et al. describe variability in the Changjiang estuary in China. The papers by El-Abd and Awad, on Red Sea levels, are the only contributions to the volume from the large research community of geologists concerned with sea level changes.

  9. Preface

    NASA Astrophysics Data System (ADS)

    Woodworth, P. L.; Pugh, D. T.; De Ronde, J. G.; Warrick, R. G.; Hannah, J.

    The measurement of sea level is of fundamental importance to a wide range of research in climatology, oceanography, geology and geodesy. This volume attempts to cover many aspects of the field. The volume opens with a description by Bolduc and Murty of one of the products stemming from the development of tide gauge networks in the northern and tropical Atlantic. This work is relevant to the growth of the Global Sea Level Observing System (GLOSS), the main goal of which is to provide the world with an efficient, coherent sea level monitoring system for oceanographic and climatological research. The subsequent four papers present results from the analysis of existing tide gauge data, including those datasets available from the Permanent Service for Mean Sea Level and the TOGA Sea Level Center. Two of the four, by Wróblewski and by Pasarić and Orlić, are concerned with European sea level changes, while Yu Jiye et al. discuss inter-annual changes in the Pacific, and Wang Baocan et al. describe variability in the Changjiang estuary in China. The papers by El-Abd and Awad, on Red Sea levels, are the only contributions to the volume from the large research community of geologists concerned with sea level changes.

  10. A new combined surface and volume registration

    NASA Astrophysics Data System (ADS)

    Lepore, Natasha; Joshi, Anand A.; Leahy, Richard M.; Brun, Caroline; Chou, Yi-Yu; Pennec, Xavier; Lee, Agatha D.; Barysheva, Marina; De Zubicaray, Greig I.; Wright, Margaret J.; McMahon, Katie L.; Toga, Arthur W.; Thompson, Paul M.

    2010-03-01

    3D registration of brain MRI data is vital for many medical imaging applications. However, purely intensity-based approaches for inter-subject matching of brain structure are generally inaccurate in cortical regions, due to the highly complex network of sulci and gyri, which vary widely across subjects. Here we combine a surface-based cortical registration with a 3D fluid one for the first time, enabling precise matching of cortical folds, but allowing large deformations in the enclosed brain volume, which guarantee diffeomorphisms. This greatly improves the matching of anatomy in cortical areas. The cortices are segmented and registered with the software Freesurfer. The deformation field is initially extended to the full 3D brain volume using a 3D harmonic mapping that preserves the matching between cortical surfaces. Finally, these deformation fields are used to initialize a 3D Riemannian fluid registration algorithm that improves the alignment of subcortical brain regions. We validate this method on an MRI dataset from 92 healthy adult twins. Results are compared to those based on volumetric registration without surface constraints; the resulting mean templates resolve consistent anatomical features both subcortically and at the cortex, suggesting that the approach is well-suited for cross-subject integration of functional and anatomic data.

  11. geoknife: Reproducible web-processing of large gridded datasets

    USGS Publications Warehouse

    Read, Jordan S.; Walker, Jordan I.; Appling, Alison P.; Blodgett, David L.; Read, Emily K.; Winslow, Luke A.

    2016-01-01

    Geoprocessing of large gridded data according to overlap with irregular landscape features is common to many large-scale ecological analyses. The geoknife R package was created to facilitate reproducible analyses of gridded datasets found on the U.S. Geological Survey Geo Data Portal web application or elsewhere, using a web-enabled workflow that eliminates the need to download and store large datasets that are reliably hosted on the Internet. The package provides access to several data subset and summarization algorithms that are available on remote web processing servers. Outputs from geoknife include spatial and temporal data subsets, spatially-averaged time series values filtered by user-specified areas of interest, and categorical coverage fractions for various land-use types.

  12. Access NASA Satellite Global Precipitation Data Visualization on YouTube

    NASA Technical Reports Server (NTRS)

    Liu, Z.; Su, J.; Acker, J.; Huffman, G.; Vollmer, B.; Wei, J.; Meyer, D.

    2017-01-01

    Since the satellite era began, NASA has collected a large volume of Earth science observations for research and applications around the world. The collected and archived satellite data at 12 NASA data centers can also be used for STEM education and activities such as disaster events, climate change, etc. However, accessing satellite data can be a daunting task for non-professional users such as teachers and students because of unfamiliarity of terminology, disciplines, data formats, data structures, computing resources, processing software, programming languages, etc. Over the years, many efforts including tools, training classes, and tutorials have been developed to improve satellite data access for users, but barriers still exist for non-professionals. In this presentation, we will present our latest activity that uses a very popular online video sharing Web site, YouTube (https://www.youtube.com/), for accessing visualizations of our global precipitation datasets at the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC). With YouTube, users can access and visualize a large volume of satellite data without the necessity to learn new software or download data. The dataset in this activity is a one-month animation for the GPM (Global Precipitation Measurement) Integrated Multi-satellite Retrievals for GPM (IMERG). IMERG provides precipitation on a near-global (60 deg. N-S) coverage at half-hourly time interval, providing more details on precipitation processes and development compared to the 3-hourly TRMM (Tropical Rainfall Measuring Mission) Multisatellite Precipitation Analysis (TMPA, 3B42) product. When the retro-processing of IMERG during the TRMM era is finished in 2018, the entire video will contain more than 330,000 files and will last 3.6 hours. Future plans include development of flyover videos for orbital data for an entire satellite mission or project. All videos, including the one-month animation, will be uploaded and available at the GES DISC site on YouTube (https://www.youtube.com/user/NASAGESDISC).

  13. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains calculated radiation, which is calculated by using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, and evapotranspiration (potential evapotranspiration, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as inputs to the hydrological calibration and validation of EFAS as well as for establishing long-term discharge "proxy" climatologies which can then in turn be used for statistical analysis to derive return periods or other time series derivatives. In addition, this dataset will be used to assess climatological trends in Europe. Unfortunately, to date no baseline dataset at the European scale exists to test the quality of the herein presented data. Hence, a comparison against other existing datasets can therefore only be an indication of data quality. Due to availability, a comparison was made for precipitation and temperature only, arguably the most important meteorological drivers for hydrologic models. A variety of analyses was undertaken at country scale against data reported to EUROSTAT and E-OBS datasets. The comparison revealed that while the datasets showed overall similar temporal and spatial patterns, there were some differences in magnitudes especially for precipitation. It is not straightforward to define the specific cause for these differences. However, in most cases the comparatively low observation station density appears to be the principal reason for the differences in magnitude.
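
    For reference, the widely used FAO-56 form of the Penman-Monteith equation for reference evapotranspiration is given below; the exact parameterization used for the EFAS-Meteo evapotranspiration layers may differ, so this is shown only as the standard formulation.

```latex
ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\frac{900}{T + 273}\,u_2\,(e_s - e_a)}
            {\Delta + \gamma\,(1 + 0.34\,u_2)}
```

    Here \(\Delta\) is the slope of the saturation vapour pressure curve, \(R_n\) the net radiation, \(G\) the soil heat flux, \(\gamma\) the psychrometric constant, \(T\) the mean daily air temperature at 2 m, \(u_2\) the wind speed at 2 m, and \(e_s - e_a\) the vapour pressure deficit.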

  14. Virtual probing system for medical volume data

    NASA Astrophysics Data System (ADS)

    Xiao, Yongfei; Fu, Yili; Wang, Shuguo

    2007-12-01

    Because of the heavy computation involved in 3D medical data visualization, interactively exploring the interior of a volume remains an open problem. In this paper, we present a novel approach to explore 3D medical datasets in real time by utilizing a 3D widget to manipulate the scanning plane. With the help of the 3D texture capability of modern graphics cards, a virtual scanning probe is used to explore an oblique clipping plane of the medical volume data in real time. A 3D model of the medical dataset is also rendered to illustrate the relationship between the scanning-plane image and the other tissues in the medical data. This will be a valuable tool in anatomy education and in the interpretation of medical images in medical research.

  15. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments.

    PubMed

    Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian

    2014-07-01

    We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.

  16. Computer-based analysis of microvascular alterations in a mouse model for Alzheimer's disease

    NASA Astrophysics Data System (ADS)

    Heinzer, Stefan; Müller, Ralph; Stampanoni, Marco; Abela, Rafael; Meyer, Eric P.; Ulmann-Schuler, Alexandra; Krucker, Thomas

    2007-03-01

    Vascular factors associated with Alzheimer's disease (AD) have recently gained increased attention. To investigate changes in vascular, particularly microvascular architecture, we developed a hierarchical imaging framework to obtain large-volume, high-resolution 3D images from brains of transgenic mice modeling AD. In this paper, we present imaging and data analysis methods which allow compiling unique characteristics from several hundred gigabytes of image data. Image acquisition is based on desktop micro-computed tomography (µCT) and local synchrotron-radiation µCT (SRµCT) scanning with a nominal voxel size of 16 µm and 1.4 µm, respectively. Two visualization approaches were implemented: stacks of Z-buffer projections for fast data browsing, and progressive-mesh based surface rendering for detailed 3D visualization of the large datasets. In a first step, image data was assessed visually via a Java client connected to a central database. Identified characteristics of interest were subsequently quantified using global morphometry software. To obtain even deeper insight into microvascular alterations, tree analysis software was developed providing local morphometric parameters such as number of vessel segments or vessel tortuosity. In the context of ever increasing image resolution and large datasets, computer-aided analysis has proven both powerful and indispensable. The hierarchical approach maintains the context of local phenomena, while proper visualization and morphometry provide the basis for detailed analysis of the pathology related to structure. Beyond analysis of microvascular changes in AD this framework will have significant impact considering that vascular changes are involved in other neurodegenerative diseases as well as in cancer, cardiovascular disease, asthma, and arthritis.

  17. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

    PubMed

    Ernst, Jason; Kellis, Manolis

    2015-04-01

    With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
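
    ChromImpute itself is not reproduced here; the sketch below only illustrates the general idea of predicting one epigenomic signal track from correlated observed tracks with an ensemble of regression trees, using synthetic binned data and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in for binned signal tracks: rows are genomic bins, columns
# are observed marks/samples correlated with the target mark.
n_bins = 5000
observed_tracks = rng.gamma(shape=2.0, scale=1.0, size=(n_bins, 6))
target_track = observed_tracks @ rng.uniform(0.1, 0.5, size=6) + rng.normal(0, 0.2, n_bins)

# Train on bins where the target mark was experimentally observed, then
# "impute" it for held-out bins (standing in for an unmapped sample).
train = slice(0, 4000)
held_out = slice(4000, None)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(observed_tracks[train], target_track[train])
imputed = model.predict(observed_tracks[held_out])
```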

  18. An interactive web application for the dissemination of human systems immunology data.

    PubMed

    Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien

    2015-06-19

    Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State of the art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets were loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples are displayed dynamically; if desired the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.

  19. The use of large scale datasets for understanding traffic network state.

    DOT National Transportation Integrated Search

    2013-09-01

    The goal of this proposal is to develop novel modeling techniques to infer individual activity patterns from the large scale cell phone : datasets and taxi data from NYC. As such this research offers a paradigm shift from traditional transportation m...

  20. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    NASA Astrophysics Data System (ADS)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes pending data availability. The mixture of geodetic datasets such as Global Navigational Satellite Systems (GNSS) and InSAR, seismic waveforms, and when applicable, tsunami waveforms from Deep-Ocean Assessment and Reporting of Tsunami (DART) gauges, provide slightly different observations that when incorporated together lead to a more robust model of fault slip distribution. The merging of different datasets is of particular importance along subduction zones where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained individually from geodetic and tsunami datasets as well as in a combined dataset. We constrain the importance of distance between estimated parameters and observed data and how that varies between land-based and open ocean datasets. Analysis focuses on accurately scaled subduction zone synthetic models as well as analysis of the relationship between slip and data in recent large subduction zone earthquakes. This study shows that seafloor deformation sensitive datasets, like open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large and particularly tsunamigenic megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.

  1. The agreement between 3D, standard 2D and triplane 2D speckle tracking: effects of image quality and 3D volume rate.

    PubMed

    Trache, Tudor; Stöbe, Stephan; Tarr, Adrienn; Pfeiffer, Dietrich; Hagendorff, Andreas

    2014-12-01

    We compared 3D and 2D speckle tracking performed on standard 2D and triplane 2D datasets of normal and pathological left ventricular (LV) wall-motion patterns, with a focus on the effect that 3D volume rate (3DVR), image quality and tracking artifacts have on the agreement between 2D and 3D speckle tracking. 37 patients with normal LV function and 18 patients with ischaemic wall-motion abnormalities underwent 2D and 3D echocardiography, followed by offline speckle tracking measurements. The values of 3D global, regional and segmental strain were compared with the standard 2D and triplane 2D strain values. Correlation analysis with the LV ejection fraction (LVEF) was also performed. The 3D and 2D global strain values correlated well in both normally and abnormally contracting hearts, though systematic differences between the two methods were observed. Of the 3D strain parameters, the area strain showed the best correlation with the LVEF. The numerical agreement of 3D and 2D analyses varied significantly with the volume rate and image quality of the 3D datasets. The highest correlation between 2D and 3D peak systolic strain values was found between 3D area and standard 2D longitudinal strain. Regional wall-motion abnormalities were similarly detected by 2D and 3D speckle tracking. 2D speckle tracking of triplane datasets showed similar results to those of conventional 2D datasets. 2D and 3D speckle tracking similarly detect normal and pathological wall-motion patterns. Limited image quality has a significant impact on the agreement between 3D and 2D numerical strain values.

  2. Segmentation and Visual Analysis of Whole-Body Mouse Skeleton microSPECT

    PubMed Central

    Khmelinskii, Artem; Groen, Harald C.; Baiker, Martin; de Jong, Marion; Lelieveldt, Boudewijn P. F.

    2012-01-01

    Whole-body SPECT small animal imaging is used to study cancer, and plays an important role in the development of new drugs. Comparing and exploring whole-body datasets can be a difficult and time-consuming task due to the inherent heterogeneity of the data (high volume/throughput, multi-modality, postural and positioning variability). The goal of this study was to provide a method to align and compare side-by-side multiple whole-body skeleton SPECT datasets in a common reference, thus eliminating acquisition variability that exists between the subjects in cross-sectional and multi-modal studies. Six whole-body SPECT/CT datasets of BALB/c mice injected with bone targeting tracers 99mTc-methylene diphosphonate (99mTc-MDP) and 99mTc-hydroxymethane diphosphonate (99mTc-HDP) were used to evaluate the proposed method. An articulated version of the MOBY whole-body mouse atlas was used as a common reference. Its individual bones were registered one-by-one to the skeleton extracted from the acquired SPECT data following an anatomical hierarchical tree. Sequential registration was used while constraining the local degrees of freedom (DoFs) of each bone in accordance to the type of joint and its range of motion. The Articulated Planar Reformation (APR) algorithm was applied to the segmented data for side-by-side change visualization and comparison of data. To quantitatively evaluate the proposed algorithm, bone segmentations of extracted skeletons from the correspondent CT datasets were used. Euclidean point to surface distances between each dataset and the MOBY atlas were calculated. The obtained results indicate that after registration, the mean Euclidean distance decreased from 11.5±12.1 to 2.6±2.1 voxels. The proposed approach yielded satisfactory segmentation results with minimal user intervention. It proved to be robust for “incomplete” data (large chunks of skeleton missing) and for an intuitive exploration and comparison of multi-modal SPECT/CT cross-sectional mouse data. PMID:23152834
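
    The evaluation metric used above, Euclidean point-to-surface distance between the registered skeleton and the atlas, can be approximated by nearest-neighbour distances between point clouds using a k-d tree; the arrays below are random placeholders for the segmented skeleton voxels and sampled atlas surface points.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)

# Hypothetical point clouds: voxels of a segmented skeleton and points sampled
# on the registered atlas surface (coordinates in voxel units).
skeleton_points = rng.uniform(0, 100, size=(20000, 3))
atlas_surface_points = rng.uniform(0, 100, size=(50000, 3))

# Approximate point-to-surface distance by the distance to the nearest
# sampled surface point.
tree = cKDTree(atlas_surface_points)
distances, _ = tree.query(skeleton_points, k=1)
print(f"mean distance: {distances.mean():.2f} +/- {distances.std():.2f} voxels")
```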

  3. A Model-Based Approach for Microvasculature Structure Distortion Correction in Two-Photon Fluorescence Microscopy Images

    PubMed Central

    Dao, Lam; Glancy, Brian; Lucotte, Bertrand; Chang, Lin-Ching; Balaban, Robert S; Hsu, Li-Yueh

    2015-01-01

    This paper investigates a post-processing approach to correct spatial distortion in two-photon fluorescence microscopy images for vascular network reconstruction. It is aimed at in vivo imaging of large field-of-view, deep-tissue studies of vascular structures. Based on simple geometric modeling of the object of interest, a distortion function is directly estimated from the image volume by deconvolution analysis. This distortion function is then applied to sub-volumes of the image stack to adaptively adjust for spatially varying distortion and to reduce image blurring through blind deconvolution. The proposed technique was first evaluated in phantom imaging of fluorescent microspheres that are comparable in size to the underlying capillary vascular structures. The effectiveness of restoring the three-dimensional spherical geometry of the microspheres using the estimated distortion function was compared with the empirically measured point-spread function. Next, the proposed approach was applied to in vivo vascular imaging of mouse skeletal muscle to reduce the image distortion of the capillary structures. We show that the proposed method effectively improves image quality and reduces the spatially varying distortion that occurs in large field-of-view, deep-tissue vascular datasets. The proposed method will help in the qualitative interpretation and quantitative analysis of vascular structures from fluorescence microscopy images. PMID:26224257
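
    The paper's model-based, spatially adaptive blind deconvolution is not reproduced here; as a simplified, non-blind illustration of deconvolution-based restoration, the sketch below blurs a synthetic point-like phantom with an anisotropic Gaussian and restores it with scikit-image's Richardson-Lucy routine, using the same kernel as an assumed distortion function.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage import restoration

rng = np.random.default_rng(7)

# Synthetic "microsphere" phantom blurred by an anisotropic Gaussian that
# mimics the axial elongation seen in deep-tissue two-photon stacks.
volume = np.zeros((64, 64, 64))
volume[32, 32, 32] = 1.0
blurred = gaussian_filter(volume, sigma=(4.0, 1.5, 1.5))
blurred += 1e-4 * rng.random(blurred.shape)

# Assumed distortion function: the same Gaussian kernel, normalized to sum 1.
psf = np.zeros((31, 31, 31))
psf[15, 15, 15] = 1.0
psf = gaussian_filter(psf, sigma=(4.0, 1.5, 1.5))
psf /= psf.sum()

# Richardson-Lucy deconvolution (30 iterations) to reduce the blurring.
restored = restoration.richardson_lucy(blurred, psf, 30, clip=False)
```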

  4. Redesigning the DOE Data Explorer to embed dataset relationships at the point of search and to reflect landing page organization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Studwell, Sara; Robinson, Carly; Elliott, Jannean

    Scientific research is producing ever-increasing amounts of data. Organizing and reflecting relationships across data collections, datasets, publications, and other research objects are essential functionalities of the modern science environment, yet challenging to implement. Landing pages are often used for providing ‘big picture’ contextual frameworks for datasets and data collections, and many large-volume data holders are utilizing them in thoughtful, creative ways. The benefits of their organizational efforts, however, are not realized unless the user eventually sees the landing page at the end point of their search. What if that organization and ‘big picture’ context could benefit the user at the beginning of the search? That is a challenging approach, but The Department of Energy’s (DOE) Office of Scientific and Technical Information (OSTI) is redesigning the database functionality of the DOE Data Explorer (DDE) with that goal in mind. Phase I is focused on redesigning the DDE database to leverage relationships between two existing distinct populations in DDE, data Projects and individual Datasets, and then adding a third intermediate population, data Collections. Mapped, structured linkages, designed to show user relationships, will allow users to make informed search choices. These linkages will be sustainable and scalable, created automatically with the use of new metadata fields and existing authorities. Phase II will study selected DOE Data ID Service clients, analyzing how their landing pages are organized, and how that organization might be used to improve DDE search capabilities. At the heart of both phases is the realization that adding more metadata information for cross-referencing may require additional effort for data scientists. Finally, OSTI’s approach seeks to leverage existing metadata and landing page intelligence without imposing an additional burden on the data creators.

  5. Generating and Visualizing Climate Indices using Google Earth Engine

    NASA Astrophysics Data System (ADS)

    Erickson, T. A.; Guentchev, G.; Rood, R. B.

    2017-12-01

    Climate change is expected to have the largest impacts at regional and local scales. Relevant and credible climate information is needed to support the planning and adaptation efforts in our communities. The volume of climate projections of temperature and precipitation is steadily increasing, as datasets are being generated on finer spatial and temporal grids with an increasing number of ensembles to characterize uncertainty. Despite advancements in tools for querying and retrieving subsets of these large, multi-dimensional datasets, ease of access remains a barrier for many existing and potential users who want to derive useful information from these data, particularly for those outside of the climate modelling research community. Climate indices that can be derived from daily temperature and precipitation data, such as annual number of frost days or growing season length, can provide useful information to practitioners and stakeholders. For this work the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset was loaded into Google Earth Engine, a cloud-based geospatial processing platform. Algorithms that use the Earth Engine API to generate several climate indices were written. The indices were chosen from the set developed by the joint CCl/CLIVAR/JCOMM Expert Team on Climate Change Detection and Indices (ETCCDI). Simple user interfaces were created that allow users to query, produce maps and graphs of the indices, as well as download results for additional analyses. These browser-based interfaces could allow users in low-bandwidth environments to access climate information. This research shows that calculating climate indices from global downscaled climate projection datasets and sharing them widely using cloud computing technologies is feasible. Further development will focus on exposing the climate indices to existing applications via the Earth Engine API, and building custom user interfaces for presenting climate indices to a diverse set of user groups.
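
    One of the ETCCDI indices mentioned above, annual frost days (FD: the count of days with daily minimum temperature below 0 °C), is simple to compute; the sketch below uses a plain NumPy array of synthetic daily minima for a single grid cell as a stand-in for the Earth Engine implementation.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical daily minimum temperature (degrees C) for one grid cell, one year.
tmin = rng.normal(loc=5.0, scale=8.0, size=365)

# ETCCDI "frost days" index: number of days with Tmin < 0 degrees C.
frost_days = int(np.sum(tmin < 0.0))
print(f"FD = {frost_days} days")
```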

  6. Basin-fill Aquifer Modeling with Terrestrial Gravity: Assessing Static Offsets in Bulk Datasets using MATLAB; Case Study of Bridgeport, CA

    NASA Astrophysics Data System (ADS)

    Mlawsky, E. T.; Louie, J. N.; Pohll, G.; Carlson, C. W.; Blakely, R. J.

    2015-12-01

    Understanding the potential availability of water resources in Eastern California aquifers is of critical importance to making water management policy decisions and determining best-use practices for California, as well as for downstream use in Nevada. Hydrologic well log data can provide valuable information on aquifer capacity, but are often proprietary and inaccessible, or economically unfeasible to obtain in sufficient quantity. In the case of basin-fill aquifers, it is possible to make estimates of aquifer geometry and volume using geophysical surveys of gravity, constrained by additional geophysical and geological observations. We use terrestrial gravity data to model depth-to-basement about the Bridgeport, CA basin for application in preserving the Walker Lake biome. In constructing the model, we assess several hundred gravity observations, existing and newly collected. We regard these datasets as "bulk," as the data are compiled from multiple sources. Inconsistencies among datasets can result in "static offsets," or artificial bull's-eye contours, within the gradient. Amending suspect offsets requires the attention of the modeler; picking these offsets by hand can be a time-consuming process when modeling large-scale basin features. We develop a MATLAB script for interpolating the residual Bouguer anomaly about the basin using sparse observation points, and leveling offset points with a user-defined sensitivity. The script is also capable of plotting gravity profiles between any two endpoints within the map extent. The resulting anomaly map provides an efficient means of locating and removing static offsets in the data, while also providing a fast visual representation of a bulk dataset. Additionally, we obtain gridded basin gravity models with an open-source alternative to proprietary modeling tools.
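
    The paper describes a MATLAB script; the Python sketch below is only an analogous illustration of the underlying idea: interpolate sparse residual Bouguer anomaly observations and flag points that deviate from their neighbours by more than a user-defined sensitivity. The function name, threshold and synthetic data are hypothetical.

```python
# Illustrative analogue of interpolating sparse residual Bouguer anomaly values
# and flagging suspected static offsets ("bull's-eye" points) whose values
# deviate from a neighbour-based prediction by more than a sensitivity threshold.
import numpy as np
from scipy.interpolate import griddata

def flag_static_offsets(x, y, anomaly_mgal, sensitivity_mgal=1.5):
    """Return a boolean mask of suspect observations (leave-one-out check)."""
    pts = np.column_stack([x, y])
    suspect = np.zeros(len(x), dtype=bool)
    for i in range(len(x)):
        keep = np.ones(len(x), dtype=bool)
        keep[i] = False
        # Predict the held-out point from the remaining observations.
        predicted = griddata(pts[keep], anomaly_mgal[keep], pts[i:i + 1],
                             method='linear')[0]
        if np.isfinite(predicted) and abs(predicted - anomaly_mgal[i]) > sensitivity_mgal:
            suspect[i] = True
    return suspect

# Example with synthetic data: a smooth regional gradient plus two offset points.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 10, 200), rng.uniform(0, 10, 200)
anomaly = -0.8 * x + 0.3 * y + rng.normal(0, 0.1, 200)
anomaly[[10, 50]] += 5.0                      # artificial "bull's-eye" offsets
print(np.where(flag_static_offsets(x, y, anomaly))[0])
```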

  7. Redesigning the DOE Data Explorer to embed dataset relationships at the point of search and to reflect landing page organization

    DOE PAGES

    Studwell, Sara; Robinson, Carly; Elliott, Jannean

    2017-04-04

    Scientific research is producing ever-increasing amounts of data. Organizing and reflecting relationships across data collections, datasets, publications, and other research objects are essential functionalities of the modern science environment, yet challenging to implement. Landing pages are often used for providing ‘big picture’ contextual frameworks for datasets and data collections, and many large-volume data holders are utilizing them in thoughtful, creative ways. The benefits of their organizational efforts, however, are not realized unless the user eventually sees the landing page at the end point of their search. What if that organization and ‘big picture’ context could benefit the user at the beginning of the search? That is a challenging approach, but the Department of Energy’s (DOE) Office of Scientific and Technical Information (OSTI) is redesigning the database functionality of the DOE Data Explorer (DDE) with that goal in mind. Phase I is focused on redesigning the DDE database to leverage relationships between two existing distinct populations in DDE, data Projects and individual Datasets, and then adding a third intermediate population, data Collections. Mapped, structured linkages, designed to show user relationships, will allow users to make informed search choices. These linkages will be sustainable and scalable, created automatically with the use of new metadata fields and existing authorities. Phase II will study selected DOE Data ID Service clients, analyzing how their landing pages are organized, and how that organization might be used to improve DDE search capabilities. At the heart of both phases is the realization that adding more metadata information for cross-referencing may require additional effort for data scientists. Finally, OSTI’s approach seeks to leverage existing metadata and landing page intelligence without imposing an additional burden on the data creators.

  8. A Prospective Study of the Use of Fetal Intelligent Navigation Echocardiography (FINE) to Obtain Standard Fetal Echocardiography Views

    PubMed Central

    Veronese, Paola; Bogana, Gianna; Cerutti, Alessia; Yeo, Lami; Romero, Roberto; Gervasi, Maria Teresa

    2016-01-01

    Objective To evaluate the performance of Fetal Intelligent Navigation Echocardiography (FINE) applied to spatiotemporal image correlation (STIC) volume datasets of the normal fetal heart in generating standard fetal echocardiography views. Methods In this prospective cohort study of patients with normal fetal hearts (19-30 gestational weeks), one or more STIC volume datasets were obtained of the apical four-chamber view. Each STIC volume successfully obtained was evaluated by STICLoop™ to determine its appropriateness before applying the FINE method. Visualization rates for standard fetal echocardiography views using diagnostic planes and/or Virtual Intelligent Sonographer Assistance (VIS-Assistance®) were calculated. Results One or more STIC volumes (n=463 total) were obtained in 246 patients. A single STIC volume per patient was analyzed using the FINE method. In normal cases, FINE was able to generate nine fetal echocardiography views using: 1) diagnostic planes in 76-100% of cases; 2) VIS-Assistance® in 96-100% of cases; and 3) a combination of diagnostic planes and/or VIS-Assistance® in 96-100% of cases. Conclusion FINE applied to STIC volumes can successfully generate nine standard fetal echocardiography views in 96-100% of cases in the second and third trimesters. This suggests that the technology can be used as a method to screen for congenital heart disease. PMID:27309391

  9. Improved biovolume estimation of Microcystis aeruginosa colonies: A statistical approach.

    PubMed

    Alcántara, I; Piccini, C; Segura, A M; Deus, S; González, C; Martínez de la Escalera, G; Kruk, C

    2018-05-27

    The Microcystis aeruginosa complex (MAC) clusters many of the most common freshwater and brackish bloom-forming cyanobacteria. In monitoring protocols, biovolume estimation is a common approach to determine the biomass of MAC colonies and is useful for prediction purposes. Biovolume (μm³ mL⁻¹) is calculated by multiplying organism abundance (org L⁻¹) by colonial volume (μm³ org⁻¹). Colonial volume is estimated based on geometric shapes and requires accurate measurements of dimensions using optical microscopy. This poses a trade-off between easy-to-measure but low-accuracy simple shapes (e.g. sphere) and time-costly but high-accuracy complex shapes (e.g. ellipsoid). The effects of overestimation on ecological studies and on management decisions associated with harmful blooms are significant due to the large sizes of MAC colonies. In this work, we aimed to increase the precision of MAC biovolume estimations by developing a statistical model based on two easy-to-measure dimensions. We analyzed field data from a wide environmental gradient (800 km) spanning freshwater to estuarine and marine waters. We measured length, width and depth from ca. 5700 colonies under an inverted microscope and estimated colonial volume using three different recommended geometrical shapes (sphere, prolate spheroid and ellipsoid). Because of the non-spherical shape of MAC, the ellipsoid was the most accurate approximation, whereas the sphere overestimated colonial volume (3-80), especially for large colonies (MLD higher than 300 μm). The ellipsoid requires measuring three dimensions and is time-consuming. Therefore, we constructed different statistical models to predict organism depth based on length and width. Splitting the data into training (2/3) and test (1/3) sets, all models resulted in low average training (1.41-1.44%) and testing (1.3-2.0%) errors. The models were also evaluated using three other independent datasets. The multiple linear model was finally selected to calculate MAC volume as an ellipsoid based on length and width. This work contributes to a better estimation of MAC volume applicable to monitoring programs as well as to ecological research. Copyright © 2017. Published by Elsevier B.V.
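
    A minimal sketch of the general workflow described, assuming synthetic measurements: fit a linear model predicting the hard-to-measure colony depth from length and width, then compute colonial volume as an ellipsoid. The coefficients and data are illustrative, not the paper's fitted model.

```python
# Sketch: predict colony depth from length and width, then estimate colonial
# volume as an ellipsoid. Synthetic data stands in for microscope measurements.
import numpy as np
from sklearn.linear_model import LinearRegression

def ellipsoid_volume(length_um, width_um, depth_um):
    # V = 4/3 * pi * (L/2) * (W/2) * (D/2), in um^3 per organism
    return (4.0 / 3.0) * np.pi * (length_um / 2) * (width_um / 2) * (depth_um / 2)

# "Training" colonies where all three dimensions were measured (synthetic here).
rng = np.random.default_rng(1)
length = rng.uniform(50, 500, 300)
width = length * rng.uniform(0.6, 0.9, 300)
depth = 0.2 * length + 0.5 * width + rng.normal(0, 10, 300)

model = LinearRegression().fit(np.column_stack([length, width]), depth)

# Apply to routine monitoring data where only length and width are measured.
new_lw = np.array([[120.0, 90.0], [350.0, 260.0]])
pred_depth = model.predict(new_lw)
biovolume = ellipsoid_volume(new_lw[:, 0], new_lw[:, 1], pred_depth)
print(biovolume)  # um^3 per colony; multiply by abundance (org/L) for biovolume
```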

  10. An anomaly detection approach for the identification of DME patients using spectral domain optical coherence tomography images.

    PubMed

    Sidibé, Désiré; Sankar, Shrinivasan; Lemaître, Guillaume; Rastgoo, Mojdeh; Massich, Joan; Cheung, Carol Y; Tan, Gavin S W; Milea, Dan; Lamoureux, Ecosse; Wong, Tien Y; Mériaudeau, Fabrice

    2017-02-01

    This paper proposes a method for automatic classification of spectral domain OCT data for the identification of patients with retinal diseases such as Diabetic Macular Edema (DME). We address this issue as an anomaly detection problem and propose a method that not only allows the classification of the OCT volume, but also allows the identification of the individual diseased B-scans inside the volume. Our approach is based on modeling the appearance of normal OCT images with a Gaussian Mixture Model (GMM) and detecting abnormal OCT images as outliers. The classification of an OCT volume is based on the number of detected outliers. Experimental results with two different datasets show that the proposed method achieves a sensitivity and a specificity of 80% and 93% on the first dataset, and 100% and 80% on the second one. Moreover, the experiments show that the proposed method achieves better classification performance than other recently published works. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  11. Application of the Streamflow Prediction Tool to Estimate Sediment Dredging Volumes in Texas Coastal Waterways

    NASA Astrophysics Data System (ADS)

    Yeates, E.; Dreaper, G.; Afshari, S.; Tavakoly, A. A.

    2017-12-01

    Over the past six fiscal years, the United States Army Corps of Engineers (USACE) has contracted an average of about a billion dollars per year for navigation channel dredging. To allocate these funds effectively, USACE Districts must determine which navigation channels need to be dredged in a given year. Improving this prioritization process results in more efficient waterway maintenance. This study uses the Streamflow Prediction Tool, a runoff routing model based on global weather forecast ensembles, to estimate dredged volumes, establishing regional linear relationships between cumulative flow and dredged volume over a 30-year simulation (1985-2015) using drainage area and shoaling parameters. The study framework integrates the National Hydrography Dataset (NHDPlus Dataset) with parameters from the Corps Shoaling Analysis Tool (CSAT) and dredging record data from USACE District records. Results in the test cases of the Houston Ship Channel and the Sabine and Port Arthur Harbor waterways in Texas indicate a positive correlation between the simulated streamflows and actual dredging records.
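
    A hedged sketch of the core statistical step, a regional linear relationship between cumulative flow and dredged volume, using placeholder numbers rather than USACE records:

```python
# Fit a simple linear relationship between annual cumulative streamflow and
# recorded dredged volume, then apply it to a forecast year. Values are toy data.
import numpy as np
from scipy.stats import linregress

# Annual cumulative flow (m^3) at a waterway outlet and dredged volume (cubic yards).
cumulative_flow = np.array([1.2e9, 0.9e9, 1.8e9, 2.1e9, 1.5e9, 2.4e9])
dredged_volume = np.array([310e3, 240e3, 480e3, 530e3, 400e3, 610e3])

fit = linregress(cumulative_flow, dredged_volume)
print(f"slope={fit.slope:.3e} intercept={fit.intercept:.3e} r={fit.rvalue:.2f}")

# Predict the dredging requirement for a forecast year to aid prioritization.
forecast_flow = 1.7e9
print(fit.slope * forecast_flow + fit.intercept)
```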

  12. Distributed File System Utilities to Manage Large Datasets, Version 0.5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2014-05-21

    FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/O calls. The current suite consists of tools to copy, compare, remove, and list files. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.

  13. Classification of SD-OCT volumes for DME detection: an anomaly detection approach

    NASA Astrophysics Data System (ADS)

    Sankar, S.; Sidibé, D.; Cheung, Y.; Wong, T. Y.; Lamoureux, E.; Milea, D.; Meriaudeau, F.

    2016-03-01

    Diabetic Macular Edema (DME) is the leading cause of blindness amongst diabetic patients worldwide. It is characterized by accumulation of water molecules in the macula, leading to swelling. Early detection of the disease helps prevent further loss of vision. Automated detection of DME from Optical Coherence Tomography (OCT) volumes therefore plays a key role. To this end, a pipeline for detecting DME in OCT volumes is proposed in this paper. The method is based on anomaly detection using a Gaussian Mixture Model (GMM). It starts with pre-processing the B-scans by resizing, flattening, filtering and extracting features from them. Both intensity and Local Binary Pattern (LBP) features are considered. The dimensionality of the extracted features is reduced using PCA. As the last stage, a GMM is fitted with features from normal volumes. During testing, features extracted from the test volume are evaluated against the fitted model for anomalies, and classification is made based on the number of B-scans detected as outliers. The proposed method is tested on two OCT datasets, achieving a sensitivity and a specificity of 80% and 93% on the first dataset, and 100% and 80% on the second one. Moreover, experiments show that the proposed method achieves better classification performance than other recently published works.
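
    The following Python sketch illustrates the described pipeline (LBP features, PCA, a GMM fitted on normal B-scans, and outlier counting per volume) on random arrays standing in for pre-processed B-scans; the feature settings, thresholds and outlier cut-off are illustrative choices, not the authors' parameters.

```python
# Minimal sketch: LBP features -> PCA -> GMM fitted on normal B-scans ->
# per-volume classification by counting outlier B-scans.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def bscan_features(bscan, P=8, R=1):
    """Histogram of uniform LBP codes plus mean intensity for one B-scan."""
    img = (bscan * 255).astype(np.uint8)      # assumes intensities scaled to [0, 1]
    lbp = local_binary_pattern(img, P, R, method='uniform')
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return np.concatenate([hist, [float(bscan.mean())]])

def fit_normal_model(normal_volumes, n_components=3, n_pca=8):
    X = np.array([bscan_features(b) for vol in normal_volumes for b in vol])
    pca = PCA(n_components=n_pca).fit(X)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(pca.transform(X))
    # Outlier threshold: a low percentile of the training log-likelihoods.
    thr = np.percentile(gmm.score_samples(pca.transform(X)), 2)
    return pca, gmm, thr

def classify_volume(volume, pca, gmm, thr, max_outliers=5):
    feats = np.array([bscan_features(b) for b in volume])
    n_outliers = int((gmm.score_samples(pca.transform(feats)) < thr).sum())
    return ('DME' if n_outliers > max_outliers else 'normal'), n_outliers

# Usage with random data standing in for flattened, resized B-scans:
rng = np.random.default_rng(0)
normals = [[rng.random((64, 128)) for _ in range(20)] for _ in range(2)]
pca, gmm, thr = fit_normal_model(normals)
print(classify_volume([rng.random((64, 128)) for _ in range(20)], pca, gmm, thr))
```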

  14. Common and distinct structural features of schizophrenia and bipolar disorder: The European Network on Psychosis, Affective disorders and Cognitive Trajectory (ENPACT) study

    PubMed Central

    Crespo-Facorro, Benedicto; Nenadic, Igor; Benedetti, Francesco; Gaser, Christian; Sauer, Heinrich; Roiz-Santiañez, Roberto; Poletti, Sara; Marinelli, Veronica; Bellani, Marcella; Perlini, Cinzia; Ruggeri, Mirella; Altamura, A. Carlo; Diwadkar, Vaibhav A.; Brambilla, Paolo

    2017-01-01

    Introduction Although schizophrenia (SCZ) and bipolar disorder (BD) share elements of pathology, their neural underpinnings are still under investigation. Here, structural Magnetic Resonance Imaging (MRI) data collected from a large sample of BD and SCZ patients and healthy controls (HC) were analyzed in terms of gray matter volume (GMV) using both voxel based morphometry (VBM) and a region of interest (ROI) approach. Methods The analysis was conducted on two datasets, Dataset1 (802 subjects: 243 SCZ, 176 BD, 383 HC) and Dataset2, a homogeneous subset of Dataset1 (301 subjects: 107 HC, 85 BD and 109 SCZ). General Linear Model analyses were performed 1) at the voxel-level in the whole brain (VBM study), 2) at the regional level in the anatomical regions emerged from the VBM study (ROI study). The GMV comparison across groups was integrated with the analysis of GMV correlates of different clinical dimensions. Results The VBM results of Dataset1 showed 1) in BD compared to HC, GMV deficits in right cingulate, superior temporal and calcarine cortices, 2) in SCZ compared to HC, GMV deficits in widespread cortical and subcortical areas, 3) in SCZ compared to BD, GMV deficits in insula and thalamus (p<0.05, cluster family wise error corrected). The regions showing GMV deficits in the BD group were mostly included in the SCZ ones. The ROI analyses confirmed the VBM results at the regional level in most of the clusters from the SCZ vs. HC comparison (p<0.05, Bonferroni corrected). The VBM and ROI analyses of Dataset2 provided further evidence for the enhanced GMV deficits characterizing SCZ. Based on the clinical-neuroanatomical analyses, we cannot exclude possible confounding effects due to 1) age of onset and medication in BD patients, 2) symptoms severity in SCZ patients. Conclusion Our study reported both shared and specific neuroanatomical characteristics between the two disorders, suggesting more severe and generalized GMV deficits in SCZ, with a specific role for insula and thalamus. PMID:29136642

  15. Subtle In-Scanner Motion Biases Automated Measurement of Brain Anatomy From In Vivo MRI

    PubMed Central

    Alexander-Bloch, Aaron; Clasen, Liv; Stockman, Michael; Ronan, Lisa; Lalonde, Francois; Giedd, Jay; Raznahan, Armin

    2016-01-01

    While the potential for small amounts of motion in functional magnetic resonance imaging (fMRI) scans to bias the results of functional neuroimaging studies is well appreciated, the impact of in-scanner motion on morphological analysis of structural MRI is relatively under-studied. Even among “good quality” structural scans, there may be systematic effects of motion on measures of brain morphometry. In the present study, the subjects’ tendency to move during fMRI scans, acquired in the same scanning sessions as their structural scans, yielded a reliable, continuous estimate of in-scanner motion. Using this approach within a sample of 127 children, adolescents, and young adults, significant relationships were found between this measure and estimates of cortical gray matter volume and mean curvature, as well as trend-level relationships with cortical thickness. Specifically, cortical volume and thickness decreased with greater motion, and mean curvature increased. These effects of subtle motion were anatomically heterogeneous, were present across different automated imaging pipelines, showed convergent validity with effects of frank motion assessed in a separate sample of 274 scans, and could be demonstrated in both pediatric and adult populations. Thus, using different motion assays in two large non-overlapping sets of structural MRI scans, convergent evidence showed that in-scanner motion—even at levels which do not manifest in visible motion artifact—can lead to systematic and regionally specific biases in anatomical estimation. These findings have special relevance to structural neuroimaging in developmental and clinical datasets, and inform ongoing efforts to optimize neuroanatomical analysis of existing and future structural MRI datasets in non-sedated humans. PMID:27004471

  16. Social brain volume is associated with in-degree social network size among older adults

    PubMed Central

    2018-01-01

    The social brain hypothesis proposes that large neocortex size evolved to support cognitively demanding social interactions. Accordingly, previous studies have observed that larger orbitofrontal and amygdala structures predict the size of an individual's social network. However, it remains uncertain how an individual's social connectedness reported by other people is associated with the social brain volume. In this study, we found that a greater in-degree network size, a measure of social ties identified by a subject's social connections rather than by the subject, significantly correlated with a larger regional volume of the orbitofrontal cortex, dorsomedial prefrontal cortex and lingual gyrus. By contrast, out-degree size, which is based on an individual's self-perceived connectedness, showed no associations. Meta-analytic reverse inference further revealed that regional volume pattern of in-degree size was specifically involved in social inference ability. These findings were possible because our dataset contained the social networks of an entire village, i.e. a global network. The results suggest that the in-degree aspect of social network size not only confirms the previously reported brain correlates of the social network but also shows an association in brain regions involved in the ability to infer other people's minds. This study provides insight into understanding how the social brain is uniquely associated with sociocentric measures derived from a global network. PMID:29367402

  17. Statistical analysis of large simulated yield datasets for studying climate effects

    USDA-ARS?s Scientific Manuscript database

    Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...

  18. Application of 3D triangulations of airborne laser scanning data to estimate boreal forest leaf area index

    NASA Astrophysics Data System (ADS)

    Majasalmi, Titta; Korhonen, Lauri; Korpela, Ilkka; Vauhkonen, Jari

    2017-07-01

    We propose 3D triangulations of airborne Laser Scanning (ALS) point clouds as a new approach to derive 3D canopy structures and to estimate forest canopy effective LAI (LAIe). Computational geometry and topological connectivity were employed to filter the triangulations to yield a quasi-optimal relationship with the field-measured LAIe. The optimal filtering parameters were predicted based on ALS height metrics, emulating the production of maps of LAIe and canopy volume for large areas. The LAIe from triangulations was validated with field-measured LAIe and compared with a reference LAIe calculated from ALS data using a logarithmic model based on Beer's law. Canopy transmittance was estimated using the All Echo Cover Index (ACI), and the mean projection of unit foliage area (β) was obtained using no-intercept regression with field-measured LAIe. We investigated the influence of species and season on the triangulated LAIe and demonstrated the relationship between triangulated LAIe and canopy volume. Our data are from 115 forest plots located in the southern boreal forest area in Finland, and for each plot three different ALS datasets were available to apply the triangulations. The triangulation approach was found applicable for both leaf-on and leaf-off datasets after initial calibration. Results showed that the LAIe from triangulations agreed best with field-measured values when using the highest pulse density data (Root Mean Square Error (RMSE) = 0.63, coefficient of determination (R2) = 0.53). Yet, the LAIe calculated using the ACI-index agreed better with the field-measured LAIe (RMSE = 0.53 and R2 = 0.70). The best models to predict the optimal alpha value contained the ACI-index, which indicates that within-crown transmittance is accounted for by the triangulation approach. The cover indices may be recommended for retrieving LAIe only, but for applications which require more sophisticated information on canopy shape and volume, such as radiative transfer models, the triangulation approach may be preferred.

  19. Robust semi-automatic segmentation of pulmonary subsolid nodules in chest computed tomography scans

    NASA Astrophysics Data System (ADS)

    Lassen, B. C.; Jacobs, C.; Kuhnigk, J.-M.; van Ginneken, B.; van Rikxoort, E. M.

    2015-02-01

    The malignancy of lung nodules is most often detected by analyzing changes of the nodule diameter in follow-up scans. A recent study showed that comparing the volume or the mass of a nodule over time is much more significant than comparing the diameter. Since the survival rate is higher when the disease is still in an early stage, it is important to detect the growth rate as soon as possible. However, manual segmentation of a volume is time-consuming. Whereas there are several well-evaluated methods for the segmentation of solid nodules, less work has been done on subsolid nodules, which actually show a higher malignancy rate than solid nodules. In this work we present a fast, semi-automatic method for segmentation of subsolid nodules. As the only user interaction, the method expects a user-drawn stroke on the largest diameter of the nodule. First, a threshold-based region growing is performed based on intensity analysis of the nodule region and surrounding parenchyma. In the next step the chest wall is removed by a combination of connected component analysis and convex hull calculation. Finally, attached vessels are detached by morphological operations. The method was evaluated on all nodules of the publicly available LIDC/IDRI database that were manually segmented and rated as non-solid or part-solid by four radiologists (Dataset 1) and three radiologists (Dataset 2). For these 59 nodules the Jaccard index for the agreement of the proposed method with the manual reference segmentations was 0.52/0.50 (Dataset 1/Dataset 2), compared to an inter-observer agreement of the manual segmentations of 0.54/0.58 (Dataset 1/Dataset 2). Furthermore, the inter-observer agreement using the proposed method (i.e. different input strokes) was analyzed and gave a Jaccard index of 0.74/0.74 (Dataset 1/Dataset 2). The presented method provides satisfactory segmentation results with minimal observer effort in minimal time and can reduce the inter-observer variability for segmentation of subsolid nodules in clinical routine.

  20. Landmark-guided diffeomorphic demons algorithm and its application to automatic segmentation of the whole spine and pelvis in CT images.

    PubMed

    Hanaoka, Shouhei; Masutani, Yoshitaka; Nemoto, Mitsutaka; Nomura, Yukihiro; Miki, Soichiro; Yoshikawa, Takeharu; Hayashi, Naoto; Ohtomo, Kuni; Shimizu, Akinobu

    2017-03-01

    A fully automatic multiatlas-based method for segmentation of the spine and pelvis in a torso CT volume is proposed. A novel landmark-guided diffeomorphic demons algorithm is used to register a given CT image to multiple atlas volumes. This algorithm can utilize both grayscale image information and given landmark coordinate information optimally. The segmentation has four steps. Firstly, 170 bony landmarks are detected in the given volume. Using these landmark positions, an atlas selection procedure is performed to reduce the computational cost of the following registration. Then the chosen atlas volumes are registered to the given CT image. Finally, voxelwise label voting is performed to determine the final segmentation result. The proposed method was evaluated using 50 torso CT datasets as well as the public SpineWeb dataset. As a result, a mean distance error of [Formula: see text] and a mean Dice coefficient of [Formula: see text] were achieved for the whole spine and the pelvic bones, which are competitive with other state-of-the-art methods. From the experimental results, the usefulness of the proposed segmentation method was validated.

  1. A fully automated non-external marker 4D-CT sorting algorithm using a serial cine scanning protocol.

    PubMed

    Carnes, Greg; Gaede, Stewart; Yu, Edward; Van Dyk, Jake; Battista, Jerry; Lee, Ting-Yim

    2009-04-07

    Current 4D-CT methods require external marker data to retrospectively sort image data and generate CT volumes. In this work we develop an automated 4D-CT sorting algorithm that performs without the aid of data collected from an external respiratory surrogate. The sorting algorithm requires an overlapping cine scan protocol. The overlapping protocol provides a spatial link between couch positions. Beginning with a starting scan position, images from the adjacent scan position (which spatially match the starting scan position) are selected by maximizing the normalized cross correlation (NCC) of the images at the overlapping slice position. The process was continued by 'daisy chaining' all couch positions using the selected images until an entire 3D volume was produced. The algorithm produced 16 phase volumes to complete a 4D-CT dataset. Additional 4D-CT datasets were also produced using external marker amplitude and phase angle sorting methods. The image quality of the volumes produced by the different methods was quantified by calculating the mean difference of the sorted overlapping slices from adjacent couch positions. The NCC-sorted images showed a significant decrease in the mean difference (p < 0.01) for the five patients.
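
    A minimal sketch of the selection criterion described: among candidate cine images at the adjacent couch position, choose the one whose overlapping slice maximizes the normalized cross correlation with the reference slice. Array shapes and the candidate set are illustrative.

```python
# Pick the adjacent-couch-position image whose overlapping slice best matches
# the reference slice under normalized cross correlation (NCC).
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized slices."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

def select_matching_image(reference_overlap_slice, candidate_images, overlap_index=0):
    """Return the index of the candidate whose overlap slice best matches, plus scores."""
    scores = [ncc(reference_overlap_slice, img[overlap_index]) for img in candidate_images]
    return int(np.argmax(scores)), scores

# Example: 10 candidate phase images at the next couch position, each a stack of
# slices; slice 0 spatially overlaps the last slice of the current position.
rng = np.random.default_rng(0)
reference = rng.random((128, 128))
candidates = [rng.random((4, 128, 128)) for _ in range(10)]
candidates[3][0] = reference + rng.normal(0, 0.01, reference.shape)  # best match
print(select_matching_image(reference, candidates)[0])
```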

  2. Do pre-trained deep learning models improve computer-aided classification of digital mammograms?

    NASA Astrophysics Data System (ADS)

    Aboutalib, Sarah S.; Mohamed, Aly A.; Zuley, Margarita L.; Berg, Wendie A.; Luo, Yahong; Wu, Shandong

    2018-02-01

    Digital mammography screening is an important exam for the early detection of breast cancer and reduction in mortality. False positives leading to high recall rates, however, result in unnecessary negative consequences for patients and health care systems. In order to better aid radiologists, computer-aided tools can be utilized to improve the distinction between image classes and thus potentially reduce false recalls. The emergence of deep learning has shown promising results in the area of biomedical imaging data analysis. This study aimed to investigate deep learning and transfer learning methods that can improve digital mammography classification performance. In particular, we evaluated the effect of pre-training deep learning models with other imaging datasets in order to boost classification performance on a digital mammography dataset. Two types of datasets were used for pre-training: (1) a digitized film mammography dataset, and (2) a very large non-medical imaging dataset. By using either of these datasets to pre-train the network initially, and then fine-tuning with the digital mammography dataset, we found an increase in overall classification performance in comparison to a model without pre-training, with the very large non-medical dataset providing the greatest improvement in classification accuracy.

  3. Ray Casting of Large Multi-Resolution Volume Datasets

    NASA Astrophysics Data System (ADS)

    Lux, C.; Fröhlich, B.

    2009-04-01

    High quality volume visualization through ray casting on graphics processing units (GPU) has become an important approach for many application domains. We present a GPU-based, multi-resolution ray casting technique for the interactive visualization of massive volume data sets commonly found in the oil and gas industry. Large volume data sets are represented as a multi-resolution hierarchy based on an octree data structure. The original volume data is decomposed into small bricks of a fixed size acting as the leaf nodes of the octree. These nodes are the highest resolution of the volume. Coarser resolutions are represented through inner nodes of the hierarchy, which are generated by downsampling eight neighboring nodes on a finer level. Due to limited memory resources of current desktop workstations and graphics hardware, only a limited working set of bricks can be locally maintained for a frame to be displayed. This working set is chosen to represent the whole volume at different local resolution levels depending on the current viewer position, transfer function and distinct areas of interest. During runtime the working set of bricks is maintained in CPU and GPU memory and is adaptively updated by asynchronously fetching data from external sources like hard drives or a network. The CPU memory hereby acts as a second-level cache for these sources, from which the GPU representation is updated. Our volume ray casting algorithm is based on a 3D texture atlas in GPU memory. This texture atlas contains the complete working set of bricks of the current multi-resolution representation of the volume. This enables the volume ray casting algorithm to access the whole working set of bricks through only a single 3D texture. For traversing rays through the volume, information about the locations and resolution levels of visited bricks is required for correct compositing computations. We encode this information into a small 3D index texture which represents the current octree subdivision on its finest level and spatially organizes the bricked data. This approach allows us to render a bricked multi-resolution volume data set utilizing only a single rendering pass with no loss of compositing precision. In contrast, most state-of-the-art volume rendering systems handle the bricked data as individual 3D textures, which are rendered one at a time while the results are composited into a lower precision frame buffer. Furthermore, our method enables us to integrate advanced volume rendering techniques like empty-space skipping, adaptive sampling and preintegrated transfer functions in a very straightforward manner with virtually no extra cost. Our interactive volume ray casting implementation allows high quality visualizations of massive volume data sets of tens of gigabytes in size on standard desktop workstations.
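
    As a toy, CPU-side illustration of one idea in this work, the sketch below selects a working set of octree bricks whose resolution level depends on distance to the viewer, so nearby regions are refined and distant ones stay coarse. The brick sizes, level count and refinement rule are assumptions for illustration, not the paper's GPU implementation.

```python
# Toy working-set selection over an implicit octree: refine a node while it is
# close to the viewer relative to its size, otherwise keep it at the current level.
import numpy as np

def select_working_set(viewer_pos, volume_extent, max_level, refine_distance):
    """Return a list of (level, brick_origin, brick_size) covering the volume."""
    working_set = []

    def visit(origin, size, level):
        center = origin + size / 2.0
        dist = np.linalg.norm(center - viewer_pos)
        if level < max_level and dist < refine_distance * size:
            half = size / 2.0
            for dz in (0, 1):
                for dy in (0, 1):
                    for dx in (0, 1):
                        visit(origin + half * np.array([dx, dy, dz]), half, level + 1)
        else:
            working_set.append((level, origin.copy(), size))

    visit(np.zeros(3), float(volume_extent), 0)
    return working_set

bricks = select_working_set(viewer_pos=np.array([10.0, 10.0, 0.0]),
                            volume_extent=1024.0, max_level=4, refine_distance=2.0)
print(len(bricks), bricks[0])
```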

  4. Secondary analysis of national survey datasets.

    PubMed

    Boo, Sunjoo; Froelicher, Erika Sivarajan

    2013-06-01

    This paper describes the methodological issues associated with secondary analysis of large national survey datasets. Issues about survey sampling, data collection, and non-response and missing data in terms of methodological validity and reliability are discussed. Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis. Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses. © 2012 The Authors. Japan Journal of Nursing Science © 2012 Japan Academy of Nursing Science.

  5. Indoor Modelling Benchmark for 3D Geometry Extraction

    NASA Astrophysics Data System (ADS)

    Thomson, C.; Boehm, J.

    2014-06-01

    A combination of faster, cheaper and more accurate hardware, more sophisticated software, and greater industry acceptance has laid the foundations for an increased desire for accurate 3D parametric models of buildings. Pointclouds are currently the data source of choice, with static terrestrial laser scanning the predominant tool for large, dense volume measurement. The current importance of pointclouds as the primary source of real world representation is endorsed by CAD software vendor acquisitions of pointcloud engines in 2011. Both the capture and modelling of indoor environments require great effort in time by the operator (and therefore cost). Automation is seen as a way to aid this by reducing the workload of the user, and some commercial packages have appeared that provide automation to some degree. In the data capture phase, advances in indoor mobile mapping systems are speeding up the process, albeit currently with a reduction in accuracy. As a result, this paper presents freely accessible pointcloud datasets of two typical areas of a building, each captured with two different capture methods and each with an accurate wholly manually created model. These datasets are provided as a benchmark for the research community to gauge the performance and improvements of various techniques for indoor geometry extraction. With this in mind, non-proprietary, interoperable formats are provided, such as E57 for the scans and IFC for the reference model. The datasets can be found at: http://indoor-bench.github.io/indoor-bench.

  6. The Climate Data Analytic Services (CDAS) Framework.

    NASA Astrophysics Data System (ADS)

    Maxwell, T. P.; Duffy, D.

    2016-12-01

    Faced with unprecedented growth in climate data volume and demand, NASA has developed the Climate Data Analytic Services (CDAS) framework. This framework enables scientists to execute data processing workflows combining common analysis operations in a high performance environment close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using vetted climate data analysis tools (ESMF, CDAT, NCO, etc.). A dynamic caching architecture enables interactive response times. CDAS utilizes Apache Spark for parallelization and a custom array framework for processing huge datasets within limited memory spaces. CDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be accessed using either direct web service calls, a python script, a unix-like shell client, or a javascript-based web application. Client packages in python, scala, or javascript contain everything needed to make CDAS requests. The CDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale data sets, where the data resides, to ultimately produce societal benefits. It is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service permits decision makers to investigate climate changes around the globe, inspect model trends and variability, and compare multiple reanalysis datasets.

  7. An Improved TA-SVM Method Without Matrix Inversion and Its Fast Implementation for Nonstationary Datasets.

    PubMed

    Shi, Yingzhong; Chung, Fu-Lai; Wang, Shitong

    2015-09-01

    Recently, a time-adaptive support vector machine (TA-SVM) was proposed for handling nonstationary datasets. While attractive performance has been reported, and the new classifier is distinctive in simultaneously solving several SVM subclassifiers locally and globally by using an elegant SVM formulation in an alternative kernel space, the coupling of subclassifiers brings in the computation of a matrix inversion and thus suffers from a high computational burden in large nonstationary dataset applications. To overcome this shortcoming, an improved TA-SVM (ITA-SVM) is proposed using a common vector shared by all the SVM subclassifiers involved. ITA-SVM not only keeps an SVM formulation, but also avoids the computation of matrix inversion. Thus, we can realize its fast version, the improved time-adaptive core vector machine (ITA-CVM), for large nonstationary datasets by using the CVM technique. ITA-CVM has the merit of asymptotically linear time complexity for large nonstationary datasets and inherits the advantages of TA-SVM. The effectiveness of the proposed classifiers ITA-SVM and ITA-CVM is also experimentally confirmed.

  8. Big Data Approaches for the Analysis of Large-Scale fMRI Data Using Apache Spark and GPU Processing: A Demonstration on Resting-State fMRI Data from the Human Connectome Project

    PubMed Central

    Boubela, Roland N.; Kalcher, Klaudius; Huf, Wolfgang; Našel, Christian; Moser, Ewald

    2016-01-01

    Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datasets, in particular providing distributed file input for 4D NIfTI fMRI datasets in Scala for use in an Apache Spark environment. Examples for using this Big Data platform in graph analysis of fMRI datasets are shown to illustrate how processing pipelines employing it can be developed. With more tools for the convenient integration of neuroimaging file formats and typical processing steps, big data technologies could find wider endorsement in the community, leading to a range of potentially useful applications especially in view of the current collaborative creation of a wealth of large data repositories including thousands of individual fMRI datasets. PMID:26778951

  9. The multiple imputation method: a case study involving secondary data analysis.

    PubMed

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
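
    A hedged scikit-learn sketch of the general technique (not the authors' exact software or model): generate several chained-equation imputations, fit the same regression on each completed dataset, and average the coefficients, which is the simple part of Rubin's pooling rules. The variables and data below are simulated, not the nurse survey.

```python
# Multiple imputation by chained equations, sketched with IterativeImputer:
# m imputed datasets, the same regression on each, then pooled point estimates.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.integers(0, 2, n),          # e.g., internationally educated (0/1)
                     rng.normal(15, 8, n)])           # e.g., years of experience
log_wage = 3.0 + 0.05 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.1, n)
data = np.column_stack([X, log_wage])
data[rng.random(data.shape) < 0.1] = np.nan           # ~10% values missing

m = 5
coefs = []
for i in range(m):
    completed = IterativeImputer(random_state=i, sample_posterior=True).fit_transform(data)
    fit = LinearRegression().fit(completed[:, :2], completed[:, 2])
    coefs.append(fit.coef_)
print(np.mean(coefs, axis=0))   # pooled point estimates across the m imputations
```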

  10. Multiclass Classification of Cardiac Arrhythmia Using Improved Feature Selection and SVM Invariants.

    PubMed

    Mustaqeem, Anam; Anwar, Syed Muhammad; Majid, Muahammad

    2018-01-01

    Arrhythmia is considered a life-threatening disease causing serious health issues in patients, when left untreated. An early diagnosis of arrhythmias would be helpful in saving lives. This study is conducted to classify patients into one of the sixteen subclasses, among which one class represents absence of disease and the other fifteen classes represent electrocardiogram records of various subtypes of arrhythmias. The research is carried out on the dataset taken from the University of California at Irvine Machine Learning Data Repository. The dataset contains a large volume of feature dimensions which are reduced using wrapper based feature selection technique. For multiclass classification, support vector machine (SVM) based approaches including one-against-one (OAO), one-against-all (OAA), and error-correction code (ECC) are employed to detect the presence and absence of arrhythmias. The SVM method results are compared with other standard machine learning classifiers using varying parameters and the performance of the classifiers is evaluated using accuracy, kappa statistics, and root mean square error. The results show that OAO method of SVM outperforms all other classifiers by achieving an accuracy rate of 81.11% when used with 80/20 data split and 92.07% using 90/10 data split option.
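
    A brief sketch of the one-against-one SVM evaluation described, using synthetic data in place of the UCI arrhythmia records and omitting the wrapper-based feature selection; the metrics mirror those reported (accuracy, kappa statistics, root mean square error).

```python
# One-against-one multiclass SVM with an 80/20 split, evaluated with accuracy,
# Cohen's kappa and RMSE. Synthetic data stands in for the arrhythmia dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_squared_error

X, y = make_classification(n_samples=450, n_features=50, n_informative=20,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel='rbf', decision_function_shape='ovo').fit(X_tr, y_tr)  # one-against-one
pred = clf.predict(X_te)

print('accuracy:', accuracy_score(y_te, pred))
print('kappa   :', cohen_kappa_score(y_te, pred))
print('rmse    :', mean_squared_error(y_te, pred) ** 0.5)
```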

  11. Comprehensive optical and data management infrastructure for high-throughput light-sheet microscopy of whole mouse brains.

    PubMed

    Müllenbroich, M Caroline; Silvestri, Ludovico; Onofri, Leonardo; Costantini, Irene; Hoff, Marcel Van't; Sacconi, Leonardo; Iannello, Giulio; Pavone, Francesco S

    2015-10-01

    Comprehensive mapping and quantification of neuronal projections in the central nervous system requires high-throughput imaging of large volumes with microscopic resolution. To this end, we have developed a confocal light-sheet microscope that has been optimized for three-dimensional (3-D) imaging of structurally intact clarified whole-mount mouse brains. We describe the optical and electromechanical arrangement of the microscope and give details on the organization of the microscope management software. The software orchestrates all components of the microscope, coordinates critical timing and synchronization, and has been written in a versatile and modular structure using the LabVIEW language. It can easily be adapted and integrated to other microscope systems and has been made freely available to the light-sheet community. The tremendous amount of data routinely generated by light-sheet microscopy further requires novel strategies for data handling and storage. To complete the full imaging pipeline of our high-throughput microscope, we further elaborate on big data management from streaming of raw images up to stitching of 3-D datasets. The mesoscale neuroanatomy imaged at micron-scale resolution in those datasets allows characterization and quantification of neuronal projections in unsectioned mouse brains.

  12. Deep learning-based fine-grained car make/model classification for visual surveillance

    NASA Astrophysics Data System (ADS)

    Gundogdu, Erhan; Parıldı, Enes Sinan; Solmaz, Berkan; Yücesoy, Veysel; Koç, Aykut

    2017-10-01

    Fine-grained object recognition is a challenging computer vision problem that has recently been addressed using deep Convolutional Neural Networks (CNNs). Nevertheless, the main disadvantage of classification methods relying on deep CNN models is the need for a considerably large amount of data. In addition, there is relatively little annotated data for a real-world application such as the recognition of car models in a traffic surveillance system. To this end, we concentrate on the classification of fine-grained car makes and/or models for visual surveillance scenarios with the help of two different domains. First, a large-scale dataset of approximately 900K images is constructed from a website which includes fine-grained car models. According to their labels, a state-of-the-art CNN model is trained on the constructed dataset. The second domain is the set of images collected from a camera integrated into a traffic surveillance system. These images, numbering over 260K, are gathered by a special license plate detection method on top of a motion detection algorithm. An appropriately sized image region is cropped around the region of interest provided by the detected license plate location. These images and their labels, covering more than 30 classes, are employed to fine-tune the CNN model already trained on the large-scale dataset described above. To fine-tune the network, the last two fully-connected layers are randomly initialized and the remaining layers are fine-tuned on the second dataset. In this work, the transfer of a model learned on a large dataset to a smaller one has been successfully performed by utilizing both the limited annotated data of the traffic field and a large-scale dataset with available annotations. Our experimental results on both the validation dataset and the real field show that the proposed methodology performs favorably against training the CNN model from scratch.
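
    A hedged PyTorch sketch of the fine-tuning scheme described: re-initialize the last two fully-connected layers and train them at a higher learning rate than the transferred layers. The VGG-16 backbone, layer indices, class count, checkpoint path and hyperparameters are assumptions for illustration; the paper does not specify this exact setup.

```python
# Fine-tuning sketch: random re-initialization of the last two FC layers, with a
# smaller learning rate for the transferred layers than for the new ones.
import torch
import torch.nn as nn
from torchvision import models

NUM_SURVEILLANCE_CLASSES = 30          # "more than 30 classes" in the surveillance domain

# Backbone assumed to be pre-trained on the large web-collected car dataset beforehand.
model = models.vgg16(weights=None)
# model.load_state_dict(torch.load('web_cars_pretrained.pth'))  # hypothetical checkpoint

# Randomly re-initialize the last two fully-connected layers of the classifier.
model.classifier[3] = nn.Linear(4096, 4096)
model.classifier[6] = nn.Linear(4096, NUM_SURVEILLANCE_CLASSES)

# Smaller learning rate for transferred layers, larger for the new layers.
new_params = list(model.classifier[3].parameters()) + list(model.classifier[6].parameters())
new_ids = {id(p) for p in new_params}
old_params = [p for p in model.parameters() if id(p) not in new_ids]
optimizer = torch.optim.SGD([{'params': old_params, 'lr': 1e-4},
                             {'params': new_params, 'lr': 1e-3}], momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of cropped plate-region images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_SURVEILLANCE_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```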

  13. Identification of hydrometeor mixtures in polarimetric radar measurements and their linear de-mixing

    NASA Astrophysics Data System (ADS)

    Besic, Nikola; Ventura, Jordi Figueras i.; Grazioli, Jacopo; Gabella, Marco; Germann, Urs; Berne, Alexis

    2017-04-01

    The issue of hydrometeor mixtures affects radar sampling volumes without a clear dominant hydrometeor type. Containing a number of different hydrometeor types which significantly contribute to the polarimetric variables, these volumes are likely to occur in the vicinity of the melting layer and, mainly, at large distances from a given radar. Motivated by potential benefits for both quantitative and qualitative applications of dual-pol radar, we propose a method for the identification of hydrometeor mixtures and their subsequent linear de-mixing. This method is intrinsically related to our recently proposed semi-supervised approach for hydrometeor classification. The mentioned classification approach [1] performs labeling of radar sampling volumes by using as a criterion the Euclidean distance with respect to five-dimensional centroids depicting nine hydrometeor classes. The positions of the centroids in the space formed by four radar moments and one external parameter (phase indicator) are derived through k-medoids clustering, applied to a selected representative set of radar observations and coupled with statistical testing that introduces the assumed microphysical properties of the different hydrometeor types. Aside from a hydrometeor type label, each radar sampling volume is characterized by an entropy estimate, indicating the uncertainty of the classification. Here, we revisit the concept of entropy presented in [1], in order to emphasize its presumed potential for the identification of hydrometeor mixtures. The calculation of entropy is based on the estimate of the probability (p_i) that the observation corresponds to hydrometeor type i (i = 1, ..., 9). The probability is derived from the Euclidean distance (d_i) of the observation to the centroid characterizing hydrometeor type i. The parametrization of the d → p transform is conducted in a controlled environment using synthetic polarimetric radar datasets. It ensures balanced entropy values: low for pure volumes, and high for different possible combinations of mixed hydrometeors. The parametrized entropy is then applied to real polarimetric C- and X-band radar datasets, where we demonstrate the potential of linear de-mixing using a simplex formed by a set of pre-defined centroids in the five-dimensional space. As the main outcome, the proposed approach provides plausible proportions of the different hydrometeors contained in a given radar sampling volume. [1] Besic, N., Figueras i Ventura, J., Grazioli, J., Gabella, M., Germann, U., and Berne, A.: Hydrometeor classification through statistical clustering of polarimetric radar measurements: a semi-supervised approach, Atmos. Meas. Tech., 9, 4425-4445, doi:10.5194/amt-9-4425-2016, 2016.
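
    The sketch below illustrates the general idea of the d → p transform and the resulting entropy (it is not the exact parametrization of [1]): distances to the nine class centroids are converted into pseudo-probabilities with a decaying kernel, and a normalized Shannon entropy indicates how mixed the sampling volume is. The kernel shape and scale parameter are illustrative assumptions.

```python
# Convert centroid distances into pseudo-probabilities and compute a normalized
# entropy: near 0 for a pure volume, larger for mixed hydrometeor volumes.
import numpy as np

def membership_probabilities(distances, scale=1.0):
    """p_i proportional to exp(-d_i^2 / scale), normalized over the 9 classes."""
    w = np.exp(-np.asarray(distances, dtype=float) ** 2 / scale)
    return w / w.sum()

def normalized_entropy(p):
    """0 for a pure volume, 1 when all hydrometeor types are equally likely."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(9))

pure = membership_probabilities([0.2, 2.5, 3.0, 2.8, 3.5, 4.0, 3.2, 2.9, 3.8])
mixed = membership_probabilities([0.9, 1.0, 1.1, 2.8, 3.5, 4.0, 3.2, 2.9, 3.8])
print(normalized_entropy(pure), normalized_entropy(mixed))   # low vs. higher
```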

  14. Integrated circuits for volumetric ultrasound imaging with 2-D CMUT arrays.

    PubMed

    Bhuyan, Anshuman; Choe, Jung Woo; Lee, Byung Chul; Wygant, Ira O; Nikoozadeh, Amin; Oralkan, Ömer; Khuri-Yakub, Butrus T

    2013-12-01

    Real-time volumetric ultrasound imaging systems require transmit and receive circuitry to generate ultrasound beams and process received echo signals. The complexity of building such a system is high due to the requirement that the front-end electronics be very close to the transducer. A large number of elements also need to be interfaced to the back-end system, and image processing of a large dataset could affect the imaging volume rate. In this work, we present a 3-D imaging system using capacitive micromachined ultrasonic transducer (CMUT) technology that addresses many of the challenges in building such a system. We demonstrate two approaches to integrating the transducer and the front-end electronics. The transducer is a 5-MHz CMUT array with an 8 mm × 8 mm aperture size. The aperture consists of 1024 elements (32 × 32) with an element pitch of 250 μm. An integrated circuit (IC) consists of a transmit beamformer and receive circuitry to improve the noise performance of the overall system. The assembly was interfaced with an FPGA and a back-end system (comprising a data acquisition system and a PC). The FPGA provided the digital I/O signals for the IC, and the back-end system was used to process the received RF echo data (from the IC) and reconstruct the volume image using a phased array imaging approach. Imaging experiments were performed using wire and spring targets, a ventricle model and a human prostate. Real-time volumetric images were captured at 5 volumes per second and are presented in this paper.

  15. CADDIS Volume 4. Data Analysis: Exploratory Data Analysis

    EPA Pesticide Factsheets

    Intro to exploratory data analysis. Overview of variable distributions, scatter plots, correlation analysis, GIS datasets. Use of conditional probability to examine stressor levels and impairment. Exploring correlations among multiple stressors.

  16. Computational Testing for Automated Preprocessing 2: Practical Demonstration of a System for Scientific Data-Processing Workflow Management for High-Volume EEG

    PubMed Central

    Cowley, Benjamin U.; Korpela, Jussi

    2018-01-01

    Existing tools for the preprocessing of EEG data provide a large choice of methods to suitably prepare and analyse a given dataset. Yet it remains a challenge for the average user to integrate methods for batch processing of the increasingly large datasets of modern research, and compare methods to choose an optimal approach across the many possible parameter configurations. Additionally, many tools still require a high degree of manual decision making for, e.g., the classification of artifacts in channels, epochs or segments. This introduces extra subjectivity, is slow, and is not reproducible. Batching and well-designed automation can help to regularize EEG preprocessing, and thus reduce human effort, subjectivity, and consequent error. The Computational Testing for Automated Preprocessing (CTAP) toolbox facilitates: (i) batch processing that is easy for experts and novices alike; (ii) testing and comparison of preprocessing methods. Here we demonstrate the application of CTAP to high-resolution EEG data in three modes of use. First, a linear processing pipeline with mostly default parameters illustrates ease-of-use for naive users. Second, a branching pipeline illustrates CTAP's support for comparison of competing methods. Third, a pipeline with built-in parameter-sweeping illustrates CTAP's capability to support data-driven method parameterization. CTAP extends the existing functions and data structure from the well-known EEGLAB toolbox, based on Matlab, and produces extensive quality control outputs. CTAP is available under MIT open-source licence from https://github.com/bwrc/ctap. PMID:29692705

  17. Computational Testing for Automated Preprocessing 2: Practical Demonstration of a System for Scientific Data-Processing Workflow Management for High-Volume EEG.

    PubMed

    Cowley, Benjamin U; Korpela, Jussi

    2018-01-01

    Existing tools for the preprocessing of EEG data provide a large choice of methods to suitably prepare and analyse a given dataset. Yet it remains a challenge for the average user to integrate methods for batch processing of the increasingly large datasets of modern research, and compare methods to choose an optimal approach across the many possible parameter configurations. Additionally, many tools still require a high degree of manual decision making for, e.g., the classification of artifacts in channels, epochs or segments. This introduces extra subjectivity, is slow, and is not reproducible. Batching and well-designed automation can help to regularize EEG preprocessing, and thus reduce human effort, subjectivity, and consequent error. The Computational Testing for Automated Preprocessing (CTAP) toolbox facilitates: (i) batch processing that is easy for experts and novices alike; (ii) testing and comparison of preprocessing methods. Here we demonstrate the application of CTAP to high-resolution EEG data in three modes of use. First, a linear processing pipeline with mostly default parameters illustrates ease-of-use for naive users. Second, a branching pipeline illustrates CTAP's support for comparison of competing methods. Third, a pipeline with built-in parameter-sweeping illustrates CTAP's capability to support data-driven method parameterization. CTAP extends the existing functions and data structure from the well-known EEGLAB toolbox, based on Matlab, and produces extensive quality control outputs. CTAP is available under MIT open-source licence from https://github.com/bwrc/ctap.

  18. Integrated Management and Visualization of Electronic Tag Data with Tagbase

    PubMed Central

    Lam, Chi Hin; Tsontos, Vardis M.

    2011-01-01

    Electronic tags have been used widely for more than a decade in studies of diverse marine species. However, despite significant investment in tagging programs and hardware, data management aspects have received insufficient attention, leaving researchers without a comprehensive toolset to manage their data easily. The growing volume of these data holdings, the large diversity of tag types and data formats, and the general lack of data management resources are not only complicating integration and synthesis of electronic tagging data in support of resource management applications but potentially threatening the integrity and longer-term access to these valuable datasets. To address this critical gap, Tagbase has been developed as a well-rounded, yet accessible data management solution for electronic tagging applications. It is based on a unified relational model that accommodates a suite of manufacturer tag data formats in addition to deployment metadata and reprocessed geopositions. Tagbase includes an integrated set of tools for importing tag datasets into the system effortlessly, and provides reporting utilities to interactively view standard outputs in graphical and tabular form. Data from the system can also be easily exported or dynamically coupled to GIS and other analysis packages. Tagbase is scalable and has been ported to a range of database management systems to support the needs of the tagging community, from individual investigators to large scale tagging programs. Tagbase represents a mature initiative with users at several institutions involved in marine electronic tagging research. PMID:21750734

  19. Large-scale machine learning and evaluation platform for real-time traffic surveillance

    NASA Astrophysics Data System (ADS)

    Eichel, Justin A.; Mishra, Akshaya; Miller, Nicholas; Jankovic, Nicholas; Thomas, Mohan A.; Abbott, Tyler; Swanson, Douglas; Keller, Joel

    2016-09-01

    In traffic engineering, vehicle detectors are trained on limited datasets, resulting in poor accuracy when deployed in real-world surveillance applications. Annotating large-scale high-quality datasets is challenging. Typically, these datasets have limited diversity; they do not reflect the real-world operating environment. There is a need for a large-scale, cloud-based positive and negative mining process and a large-scale learning and evaluation system for the application of automatic traffic measurements and classification. The proposed positive and negative mining process addresses the quality of crowd-sourced ground truth data through machine learning review and human feedback mechanisms. The proposed learning and evaluation system uses a distributed cloud computing framework to handle data-scaling issues associated with large numbers of samples and a high-dimensional feature space. The system is trained using AdaBoost on 1,000,000 Haar-like features extracted from 70,000 annotated video frames. The trained real-time vehicle detector achieves an accuracy of at least 95% for half of the time, and of about 78% for 19/20 of the time, when tested on ~7,500,000 video frames. By the end of 2016, the dataset is expected to contain over 1 billion annotated video frames.
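
    For illustration, a minimal sketch of the boosted-detector training step, assuming the Haar-like feature responses have already been extracted into a matrix (the data and names below are hypothetical stand-ins, not the authors' pipeline):

      # Hedged sketch: train an AdaBoost vehicle/background classifier on a
      # precomputed Haar-like feature matrix (synthetic placeholder data).
      import numpy as np
      from sklearn.ensemble import AdaBoostClassifier
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      features = rng.normal(size=(5000, 200))   # rows: image windows, cols: Haar-like responses
      labels = rng.integers(0, 2, size=5000)    # 1 = vehicle, 0 = background

      X_train, X_test, y_train, y_test = train_test_split(
          features, labels, test_size=0.2, random_state=0)

      # AdaBoost over decision stumps, the classic weak learner for Haar features.
      detector = AdaBoostClassifier(n_estimators=100, random_state=0)
      detector.fit(X_train, y_train)
      print("held-out accuracy:", detector.score(X_test, y_test))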

  20. Allometric Analysis Detects Brain Size-Independent Effects of Sex and Sex Chromosome Complement on Human Cerebellar Organization.

    PubMed

    Mankiw, Catherine; Park, Min Tae M; Reardon, P K; Fish, Ari M; Clasen, Liv S; Greenstein, Deanna; Giedd, Jay N; Blumenthal, Jonathan D; Lerch, Jason P; Chakravarty, M Mallar; Raznahan, Armin

    2017-05-24

    The cerebellum is a large hindbrain structure that is increasingly recognized for its contribution to diverse domains of cognitive and affective processing in human health and disease. Although several of these domains are sex biased, our fundamental understanding of cerebellar sex differences-including their spatial distribution, potential biological determinants, and independence from brain volume variation-lags far behind that for the cerebrum. Here, we harness automated neuroimaging methods for cerebellar morphometrics in 417 individuals to (1) localize normative male-female differences in raw cerebellar volume, (2) compare these to sex chromosome effects estimated across five rare sex (X/Y) chromosome aneuploidy (SCA) syndromes, and (3) clarify brain size-independent effects of sex and SCA on cerebellar anatomy using a generalizable allometric approach that considers scaling relationships between regional cerebellar volume and brain volume in health. The integration of these approaches shows that (1) sex and SCA effects on raw cerebellar volume are large and distributed, but regionally heterogeneous, (2) human cerebellar volume scales with brain volume in a highly nonlinear and regionally heterogeneous fashion that departs from documented patterns of cerebellar scaling in phylogeny, and (3) cerebellar organization is modified in a brain size-independent manner by sex (relative expansion of total cerebellum, flocculus, and Crus II-lobule VIIIB volumes in males) and SCA (contraction of total cerebellar, lobule IV, and Crus I volumes with additional X- or Y-chromosomes; X-specific contraction of Crus II-lobule VIIIB). Our methods and results clarify the shifts in human cerebellar organization that accompany interwoven variations in sex, sex chromosome complement, and brain size. SIGNIFICANCE STATEMENT Cerebellar systems are implicated in diverse domains of sex-biased behavior and pathology, but we lack a basic understanding of how sex differences in the human cerebellum are distributed and determined. We leverage a rare neuroimaging dataset to deconvolve the interwoven effects of sex, sex chromosome complement, and brain size on human cerebellar organization. We reveal topographically variegated scaling relationships between regional cerebellar volume and brain size in humans, which (1) are distinct from those observed in phylogeny, (2) invalidate a traditional neuroimaging method for brain volume correction, and (3) allow more valid and accurate resolution of which cerebellar subcomponents are sensitive to sex and sex chromosome complement. These findings advance understanding of cerebellar organization in health and sex chromosome aneuploidy. Copyright © 2017 the authors 0270-6474/17/375222-11$15.00/0.
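
    A minimal sketch of the general allometric (log-log) scaling idea described above, using simulated volumes; the authors' exact model specification is not reproduced here:

      # Regress log(regional cerebellar volume) on log(total brain volume); the
      # slope is the scaling exponent and the residuals act as brain size-adjusted
      # regional measures that can be compared between groups. Data are simulated.
      import numpy as np

      rng = np.random.default_rng(1)
      brain_vol = rng.normal(1200.0, 100.0, size=417)               # cm^3, hypothetical
      region_vol = 0.01 * brain_vol ** 1.3 * rng.lognormal(0.0, 0.05, size=417)

      log_b, log_r = np.log(brain_vol), np.log(region_vol)
      slope, intercept = np.polyfit(log_b, log_r, deg=1)

      residuals = log_r - (intercept + slope * log_b)               # size-adjusted variation
      print(f"estimated scaling exponent: {slope:.2f}")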

  1. Transforming the Geocomputational Battlespace Framework with HDF5

    DTIC Science & Technology

    2010-08-01

    layout level, dataset arrays can be stored in chunks or tiles, enabling fast subsetting of large datasets, including compressed datasets. HDF software... Image Base (CIB) image of the AOI: an orthophoto made from rectified grayscale aerial images b. An IKONOS satellite image made up of 3 spectral
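
    The chunked/tiled storage mentioned in this record is the key to fast subsetting; a minimal sketch with h5py (file name, dataset name and sizes are illustrative only):

      # Store a large raster in 256x256 tiles with compression; HDF5 then reads
      # only the chunks that intersect a requested window.
      import h5py
      import numpy as np

      with h5py.File("terrain_tiles.h5", "w") as f:
          dset = f.create_dataset("elevation", shape=(20000, 20000), dtype="f4",
                                  chunks=(256, 256), compression="gzip")
          dset[0:256, 0:256] = np.random.rand(256, 256)

      with h5py.File("terrain_tiles.h5", "r") as f:
          # Subsetting touches only the relevant tiles, not the whole array.
          window = f["elevation"][1000:1200, 3000:3200]
          print(window.shape)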

  2. A dataset of forest biomass structure for Eurasia.

    PubMed

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-16

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below-ground); green forest floor (above- and below-ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots), consisting of 10,351 unique records of sample plots and 9,613 sample trees from ca 1,200 experiments for the period 1930-2014 where there is overlap between these two datasets. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  3. A dataset of forest biomass structure for Eurasia

    NASA Astrophysics Data System (ADS)

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-01

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below-ground); green forest floor (above- and below-ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots), consisting of 10,351 unique records of sample plots and 9,613 sample trees from ca 1,200 experiments for the period 1930-2014 where there is overlap between these two datasets. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  4. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    PubMed

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  5. An NTCP Analysis of Urethral Complications from Low Doserate Mono- and Bi-Radionuclide Brachytherapy.

    PubMed

    Nuttens, V E; Nahum, A E; Lucas, S

    2011-01-01

    Urethral NTCP has been determined for three prostates implanted with seeds based on (125)I (145 Gy), (103)Pd (125 Gy), (131)Cs (115 Gy), (103)Pd-(125)I (145 Gy), or (103)Pd-(131)Cs (115 Gy or 130 Gy). First, DU(20), meaning that 20% of the urethral volume receives a dose of at least DU(20), is converted into an I-125 LDR equivalent DU(20) in order to use the urethral NTCP model. Second, the propagation of uncertainties through the steps in the NTCP calculation was assessed in order to identify the parameters responsible for large data uncertainties. Two sets of radiobiological parameters were studied. The NTCP results all fall in the 19%-23% range and are associated with large uncertainties, making the comparison difficult. Depending on the dataset chosen, the ranking of NTCP values among the six seed implants studied changes. Moreover, the large uncertainties on the fitting parameters of the urethral NTCP model result in a large uncertainty on the NTCP value. In conclusion, the use of an NTCP model for permanent brachytherapy is feasible, but it is essential that the uncertainties on the parameters in the model be reduced.
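
    To make the uncertainty-propagation step concrete, a hedged sketch using the generic Lyman-Kutcher-Burman NTCP form (not necessarily the urethral model used in the study; the parameter values below are placeholders):

      # NTCP = Phi((EUD - TD50) / (m * TD50)); sampling the fit parameters shows
      # how their uncertainty propagates into the NTCP estimate.
      import numpy as np
      from scipy.stats import norm

      def ntcp_lkb(eud, td50, m):
          return norm.cdf((eud - td50) / (m * td50))

      rng = np.random.default_rng(0)
      eud = 120.0                                   # Gy, hypothetical equivalent uniform dose
      td50 = rng.normal(130.0, 10.0, size=10000)    # sampled fitting parameters
      m = rng.normal(0.30, 0.05, size=10000)

      samples = ntcp_lkb(eud, td50, m)
      print(f"NTCP = {samples.mean():.1%} +/- {samples.std():.1%}")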

  6. Sleep stages identification in patients with sleep disorder using k-means clustering

    NASA Astrophysics Data System (ADS)

    Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.

    2018-05-01

    Data mining is a computational intelligence discipline in which a large dataset is processed with a chosen method to look for patterns within it. These patterns are then used in real-time applications or to develop new knowledge. Data mining is a valuable tool for solving complex problems, discovering new knowledge, analyzing data, and supporting decision making. To extract the patterns that lie inside a large dataset, clustering can be used. Clustering groups data that look similar so that patterns can be seen in the large dataset, and several algorithms exist for assigning data to the corresponding clusters. This research used data from patients who suffer from sleep disorders and aims to help the medical community reduce the time required to classify the sleep stages of such patients. The study used the K-means algorithm with silhouette evaluation and found that 3 clusters are optimal for this dataset, meaning that the data can be divided into 3 sleep stages.
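
    A minimal sketch of the clustering step described above (K-means plus silhouette evaluation), using a simulated feature matrix rather than the sleep-disorder data:

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 8))       # rows: recording epochs, cols: features (synthetic)

      scores = {}
      for k in range(2, 7):
          labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
          scores[k] = silhouette_score(X, labels)    # higher = better separated clusters

      best_k = max(scores, key=scores.get)
      print("silhouette by k:", scores, "-> chosen k:", best_k)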

  7. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes is achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  8. Investigating the effect of plate-mantle interaction in basin creation and associated drainage systems: insights from the North West Shelf of Australia

    NASA Astrophysics Data System (ADS)

    Morón, S.; Gallagher, S. J.; Moresi, L. N.; Salles, T.; Rey, P. F.; Payenberg, T.

    2016-12-01

    The effect of plate-mantle dynamics on surface topography has increasingly been recognized. This concept is particularly useful for understanding the links between plate-mantle dynamics, continental break-up and the creation of sedimentary basins and their associated drainage systems. To unravel these links back in time we present an approach that uses numerical models and the geological record. The sedimentary basins of the North West Shelf (NWS) of Australia contain an exceptional record of the Permian to early Cretaceous polyphased rifting of Australia from Greater India, which is in turn associated with the breakup of Gondwana. This record and the relative tectonic quiescence of the Australian continent since the Late Cretaceous make the NWS a great natural laboratory for investigating the interaction between mantle dynamics, plate tectonics and drainage patterns. Furthermore, as a result of the extensive petroleum exploration and production in the area, a uniquely large dataset containing seismic, lithologic, biostratigraphic and detrital zircon information is already available. This study will first focus on augmenting zircon datasets to refine the current conceptual models of paleodrainage systems associated with the NWS. Current conceptual models of drainage patterns suggest the previous existence of large transcontinental rivers that transported sediments from Antarctica and India, rather than from more proximal Australian sources. From a mass-balance point of view this model seems reasonable, as large transcontinental rivers would be required to transport the significant volume of sediments that are deposited in the thick (15 km) sedimentary sequences of the NWS. Coupling of geodynamic (Underworld) and landscape-dynamics (Badlands) models will allow us to numerically test the likelihood of this conceptual model and also to present an integrated approach to investigate the link between deep Earth processes and surficial processes.

  9. Numericware i: Identical by State Matrix Calculator

    PubMed Central

    Kim, Bongsong; Beavis, William D

    2017-01-01

    We introduce software, Numericware i, to compute an identical-by-state (IBS) matrix from genotypic data. Calculating an IBS matrix for a large dataset requires a large amount of computer memory and lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. Multithreading allows computational routines to run concurrently on multiple central processing unit (CPU) processors. Forward chopping addresses the memory limitation by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values, including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10,000,000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB of memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under a CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375
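
    An illustrative chunked IBS computation in the spirit of "forward chopping" (this is not Numericware i itself; it assumes the common definition IBS(i, j) = mean over SNPs of 2 - |g_i - g_j| for genotypes coded 0/1/2, which yields values between 0 and 2):

      import numpy as np

      def ibs_matrix(genotypes, block=1000):
          # Process SNP columns in blocks so that memory use stays bounded.
          n, n_snps = genotypes.shape
          acc = np.zeros((n, n))
          for start in range(0, n_snps, block):
              g = genotypes[:, start:start + block].astype(np.float64)
              diff = np.abs(g[:, None, :] - g[None, :, :])   # pairwise allele differences
              acc += (2.0 - diff).sum(axis=2)
          return acc / n_snps

      rng = np.random.default_rng(0)
      G = rng.integers(0, 3, size=(50, 5000))   # 50 individuals x 5000 SNPs (synthetic)
      print(ibs_matrix(G)[:3, :3])              # diagonal entries equal 2.0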

  10. Large-Scale Pattern Discovery in Music

    NASA Astrophysics Data System (ADS)

    Bertin-Mahieux, Thierry

    This work focuses on extracting patterns in musical data from very large collections. The problem is split in two parts. First, we build such a large collection, the Million Song Dataset, to provide researchers access to commercial-size datasets. Second, we use this collection to study cover song recognition which involves finding harmonic patterns from audio features. Regarding the Million Song Dataset, we detail how we built the original collection from an online API, and how we encouraged other organizations to participate in the project. The result is the largest research dataset with heterogeneous sources of data available to music technology researchers. We demonstrate some of its potential and discuss the impact it already has on the field. On cover song recognition, we must revisit the existing literature since there are no publicly available results on a dataset of more than a few thousand entries. We present two solutions to tackle the problem, one using a hashing method, and one using a higher-level feature computed from the chromagram (dubbed the 2DFTM). We further investigate the 2DFTM since it has potential to be a relevant representation for any task involving audio harmonic content. Finally, we discuss the future of the dataset and the hope of seeing more work making use of the different sources of data that are linked in the Million Song Dataset. Regarding cover songs, we explain how this might be a first step towards defining a harmonic manifold of music, a space where harmonic similarities between songs would be more apparent.

  11. The use of Data Mining in the categorization of patients with Azoospermia.

    PubMed

    Mikos, Themistoklis; Maglaveras, Nikolaos; Pantazis, Konstantinos; Goulis, Dimitrios G; Bontis, John N; Papadimas, John

    2005-01-01

    Data Mining is a relatively new field of Medical Informatics. The aim of this study was to compare Data Mining diagnosis with clinical diagnosis by applying a Data Miner (DM) to a clinical dataset of infertile men with azoospermia. One hundred and forty-seven azoospermic men were clinically classified into four groups: a) obstructive azoospermia (n=63), b) non-obstructive azoospermia (n=71), c) hypergonadotropic hypogonadism (n=2), and d) hypogonadotropic hypogonadism (n=11). The DM (IBM's DB2/Intelligent Miner for Data 6.1) was asked to reproduce a four-cluster model. DM formed four groups of patients: a) eugonadal men with normal testicular volume and normal FSH levels (n=86), b) eugonadal men with significantly reduced testicular volume (median 6.5 cm3) and very high FSH levels (n=29), c) eugonadal men with moderately reduced testicular volume (median 14.5 cm3) and raised FSH levels (n=20), and d) hypogonadal men (n=12). Overall DM concordance rate in hypogonadal men was 92%, in obstructive azoospermia 73%, and in non-obstructive azoospermia 69%. Data Mining produces clinically meaningful results but different from those of the clinical diagnosis. It is possible that the use of large sets of structured and formalised data and continuous evaluation of DM results will generate a useful methodology for the Clinician.

  12. A probabilistic topic model for clinical risk stratification from electronic health records.

    PubMed

    Huang, Zhengxing; Dong, Wei; Duan, Huilong

    2015-12-01

    Risk stratification aims to provide physicians with the accurate assessment of a patient's clinical risk such that an individualized prevention or management strategy can be developed and delivered. Existing risk stratification techniques mainly focus on predicting the overall risk of an individual patient in a supervised manner, and, at the cohort level, often offer little insight beyond a flat score-based segmentation from the labeled clinical dataset. To this end, in this paper, we propose a new approach for risk stratification by exploring a large volume of electronic health records (EHRs) in an unsupervised fashion. Along this line, this paper proposes a novel probabilistic topic modeling framework called probabilistic risk stratification model (PRSM) based on Latent Dirichlet Allocation (LDA). The proposed PRSM recognizes a patient clinical state as a probabilistic combination of latent sub-profiles, and generates sub-profile-specific risk tiers of patients from their EHRs in a fully unsupervised fashion. The achieved stratification results can be easily recognized as high-, medium- and low-risk, respectively. In addition, we present an extension of PRSM, called weakly supervised PRSM (WS-PRSM) by incorporating minimum prior information into the model, in order to improve the risk stratification accuracy, and to make our models highly portable to risk stratification tasks of various diseases. We verify the effectiveness of the proposed approach on a clinical dataset containing 3463 coronary heart disease (CHD) patient instances. Both PRSM and WS-PRSM were compared with two established supervised risk stratification algorithms, i.e., logistic regression and support vector machine, and showed the effectiveness of our models in risk stratification of CHD in terms of the Area Under the receiver operating characteristic Curve (AUC) analysis. As well, in comparison with PRSM, WS-PRSM has over 2% performance gain, on the experimental dataset, demonstrating that incorporating risk scoring knowledge as prior information can improve the performance in risk stratification. Experimental results reveal that our models achieve competitive performance in risk stratification in comparison with existing supervised approaches. In addition, the unsupervised nature of our models makes them highly portable to the risk stratification tasks of various diseases. Moreover, patient sub-profiles and sub-profile-specific risk tiers generated by our models are coherent and informative, and provide significant potential to be explored for the further tasks, such as patient cohort analysis. We hypothesize that the proposed framework can readily meet the demand for risk stratification from a large volume of EHRs in an open-ended fashion. Copyright © 2015 Elsevier Inc. All rights reserved.
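
    For orientation, a generic topic-model grouping of EHR-style count data (plain LDA from scikit-learn, shown only as an illustration of the idea; it is not the PRSM or WS-PRSM model, and the patient-by-code matrix is simulated):

      import numpy as np
      from sklearn.decomposition import LatentDirichletAllocation

      rng = np.random.default_rng(0)
      counts = rng.poisson(0.3, size=(3463, 200))   # hypothetical patient-by-clinical-code counts

      lda = LatentDirichletAllocation(n_components=3, random_state=0)
      theta = lda.fit_transform(counts)             # patient-by-topic mixture weights

      sub_profile = theta.argmax(axis=1)            # dominant latent sub-profile per patient
      print(np.bincount(sub_profile))               # patients per sub-profile / candidate risk tier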

  13. The medical science DMZ: a network design pattern for data-intensive medical science

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Peisert, Sean; Dart, Eli; Barnett, William

    We describe a detailed solution for maintaining high-capacity, data-intensive network flows (eg, 10, 40, 100 Gbps+) in a scientific, medical context while still adhering to security and privacy laws and regulations. High-end networking, packet-filter firewalls, network intrusion-detection systems. We describe a "Medical Science DMZ" concept as an option for secure, high-volume transport of large, sensitive datasets between research institutions over national research networks, and give 3 detailed descriptions of implemented Medical Science DMZs. The exponentially increasing amounts of "omics" data, high-quality imaging, and other rapidly growing clinical datasets have resulted in the rise of biomedical research "Big Data." The storage, analysis, and network resources required to process these data and integrate them into patient diagnoses and treatments have grown to scales that strain the capabilities of academic health centers. Some data are not generated locally and cannot be sustained locally, and shared data repositories such as those provided by the National Library of Medicine, the National Cancer Institute, and international partners such as the European Bioinformatics Institute are rapidly growing. The ability to store and compute using these data must therefore be addressed by a combination of local, national, and industry resources that exchange large datasets. Maintaining data-intensive flows that comply with the Health Insurance Portability and Accountability Act (HIPAA) and other regulations presents a new challenge for biomedical research. We describe a strategy that marries performance and security by borrowing from and redefining the concept of a Science DMZ, a framework that is used in physical sciences and engineering research to manage high-capacity data flows. By implementing a Medical Science DMZ architecture, biomedical researchers can leverage the scale provided by high-performance computer and cloud storage facilities and national high-speed research networks while preserving privacy and meeting regulatory requirements.

  14. The medical science DMZ: a network design pattern for data-intensive medical science.

    PubMed

    Peisert, Sean; Dart, Eli; Barnett, William; Balas, Edward; Cuff, James; Grossman, Robert L; Berman, Ari; Shankar, Anurag; Tierney, Brian

    2017-10-06

    We describe a detailed solution for maintaining high-capacity, data-intensive network flows (eg, 10, 40, 100 Gbps+) in a scientific, medical context while still adhering to security and privacy laws and regulations. High-end networking, packet-filter firewalls, network intrusion-detection systems. We describe a "Medical Science DMZ" concept as an option for secure, high-volume transport of large, sensitive datasets between research institutions over national research networks, and give 3 detailed descriptions of implemented Medical Science DMZs. The exponentially increasing amounts of "omics" data, high-quality imaging, and other rapidly growing clinical datasets have resulted in the rise of biomedical research "Big Data." The storage, analysis, and network resources required to process these data and integrate them into patient diagnoses and treatments have grown to scales that strain the capabilities of academic health centers. Some data are not generated locally and cannot be sustained locally, and shared data repositories such as those provided by the National Library of Medicine, the National Cancer Institute, and international partners such as the European Bioinformatics Institute are rapidly growing. The ability to store and compute using these data must therefore be addressed by a combination of local, national, and industry resources that exchange large datasets. Maintaining data-intensive flows that comply with the Health Insurance Portability and Accountability Act (HIPAA) and other regulations presents a new challenge for biomedical research. We describe a strategy that marries performance and security by borrowing from and redefining the concept of a Science DMZ, a framework that is used in physical sciences and engineering research to manage high-capacity data flows. By implementing a Medical Science DMZ architecture, biomedical researchers can leverage the scale provided by high-performance computer and cloud storage facilities and national high-speed research networks while preserving privacy and meeting regulatory requirements. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  15. A peek into the future of radiology using big data applications

    PubMed Central

    Kharat, Amit T.; Singhal, Shubham

    2017-01-01

    Big data refers to the extremely large amounts of data available in the radiology department. Big data is identified by four Vs – Volume, Velocity, Variety, and Veracity. By applying different algorithmic tools and converting raw data to transformed data in such large datasets, there is a possibility of understanding and using radiology data for gaining new knowledge and insights. Big data analytics consists of 6Cs – Connection, Cloud, Cyber, Content, Community, and Customization. The global technological prowess and per-capita capacity to save digital information has roughly doubled every 40 months since the 1980s. By using big data, the planning and implementation of radiological procedures in radiology departments can be given a great boost. Potential applications of big data in the future are scheduling of scans, creating patient-specific personalized scanning protocols, radiologist decision support, emergency reporting, virtual quality assurance for the radiologist, etc. Targeted use of big data applications can be done for images by supporting the analytic process. Screening software tools designed on big data can be used to highlight a region of interest, such as subtle changes in parenchymal density, a solitary pulmonary nodule, or focal hepatic lesions, by plotting its multidimensional anatomy. Following this, we can run more complex applications such as three-dimensional multiplanar reconstructions (MPR), volumetric rendering (VR), and curved planar reconstruction, which consume higher system resources, on targeted data subsets rather than querying the complete cross-sectional imaging dataset. This pre-emptive selection of the dataset can substantially reduce system requirements such as memory and server load, and provide prompt results. However, a word of caution: big data should not become "dump data" due to inadequate and poor analysis and non-structured, improperly stored data. In the near future, big data can ring in the era of personalized and individualized healthcare. PMID:28744087

  16. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
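
    The qPMS condition itself is easy to state in code; a brute-force sketch of the definition given above (this is only the occurrence test, not the SamSelect algorithm):

      import math

      def occurs(seq, motif, d):
          # True if motif appears somewhere in seq with at most d mismatches.
          l = len(motif)
          return any(sum(a != b for a, b in zip(seq[i:i + l], motif)) <= d
                     for i in range(len(seq) - l + 1))

      def is_quorum_motif(sequences, motif, q, d):
          hits = sum(occurs(s, motif, d) for s in sequences)
          return hits >= math.ceil(q * len(sequences))

      seqs = ["ACGTACGTGG", "TTACGAACGT", "GGGGACGTTT", "CCCCCCCCCC"]
      print(is_quorum_motif(seqs, "ACGT", q=0.75, d=1))   # True: 3 of the 4 sequences match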

  17. The Triglav Glacier (South-Eastern Alps, Slovenia): Volume Estimation, Internal Characterization and 2000-2013 Temporal Evolution by Means of Ground Penetrating Radar Measurements

    NASA Astrophysics Data System (ADS)

    Del Gobbo, Costanza; Colucci, Renato R.; Forte, Emanuele; Triglav Čekada, Michaela; Zorn, Matija

    2016-08-01

    It is well known that small glaciers of the mid-latitudes, and especially those located at low altitude, respond suddenly to climate changes on both local and global scales. For this reason their monitoring, as well as the evaluation of their extent and volume, is essential. We present a ground penetrating radar (GPR) dataset acquired on September 23 and 24, 2013 on the Triglav glacier to identify layers with different characteristics (snow, firn, ice, debris) within the glacier and to define the extent and volume of the actual ice. Computing an integrated and interpolated 3D model using the whole GPR dataset, we estimate that at the moment of data acquisition the ice area was 3800 m2 and the ice volume 7400 m3. Its average thickness was 1.95 m, while its maximum thickness was slightly more than 5 m. Here we compare the results with a previous GPR survey acquired in 2000. A critical review of the historical data to find the general trend and to forecast a possible evolution is also presented. Between 2000 and 2013, we observed relevant changes in the internal distribution of the different units (snow, firn, ice), and the ice volume was reduced from about 35,000 m3 to about 7400 m3. Such a result can be achieved only by using multiple GPR surveys, which allow us not only to assess the volume occupied by a glacial body, but also to image its internal structure and the actual ice volume. In fact, applying one of the widely used empirical volume-area relations to infer the geometrical parameters of the glacier would lead to a relevant underestimation of ice loss.
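
    The headline figures quoted above can be checked with simple arithmetic (no new data, only the numbers stated in the record):

      area_2013 = 3800.0      # m^2, ice area from the 2013 GPR survey
      volume_2013 = 7400.0    # m^3, ice volume from the 2013 GPR survey
      volume_2000 = 35000.0   # m^3, ice volume from the 2000 survey

      mean_thickness = volume_2013 / area_2013         # ~1.95 m, as reported
      loss_fraction = 1.0 - volume_2013 / volume_2000  # ~79% of the ice volume lost

      print(f"mean ice thickness 2013: {mean_thickness:.2f} m")
      print(f"ice volume lost 2000-2013: {loss_fraction:.0%}")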

  18. Interactive (statistical) visualisation and exploration of a billion objects with vaex

    NASA Astrophysics Data System (ADS)

    Breddels, M. A.

    2017-06-01

    With new catalogues arriving such as the Gaia DR1, containing more than a billion objects, new methods of handling and visualizing these data volumes are needed. We show that by calculating statistics on a regular (N-dimensional) grid, visualizations of a billion objects can be done within a second on a modern desktop computer. This is achieved using memory mapping of hdf5 files together with a simple binning algorithm, which are part of a Python library called vaex. This enables efficient interactive exploration of large datasets, making science exploration of large catalogues feasible. Vaex is a Python library and an application which allows for interactive exploration and visualization. The motivation for developing vaex is the catalogue of the Gaia satellite; however, vaex can also be used on SPH or N-body simulations, any other (future) catalogues such as SDSS, Pan-STARRS, LSST, etc., or other tabular data. The homepage for vaex is http://vaex.astro.rug.nl.
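
    A minimal sketch of the statistics-on-a-regular-grid idea (plain NumPy and h5py, not vaex's own API; file and column names are hypothetical):

      # Stream two HDF5 columns in blocks and accumulate counts on a fixed 2D
      # grid; the resulting array is cheap to render as an image.
      import numpy as np
      import h5py

      def binned_counts(path, xcol, ycol, limits, shape=(256, 256), block=1_000_000):
          counts = np.zeros(shape)
          with h5py.File(path, "r") as f:
              x, y = f[xcol], f[ycol]
              for start in range(0, x.shape[0], block):
                  h, _, _ = np.histogram2d(x[start:start + block],
                                           y[start:start + block],
                                           bins=shape, range=limits)
                  counts += h
          return counts

      # e.g. for a Gaia-like catalogue (hypothetical file/column names):
      # grid = binned_counts("gaia.h5", "ra", "dec", limits=[[0, 360], [-90, 90]])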

  19. GENOME-WIDE GENETIC INTERACTION ANALYSIS OF GLAUCOMA USING EXPERT KNOWLEDGE DERIVED FROM HUMAN PHENOTYPE NETWORKS

    PubMed Central

    HU, TING; DARABOS, CHRISTIAN; CRICCO, MARIA E.; KONG, EMILY; MOORE, JASON H.

    2014-01-01

    The large volume of GWAS data poses great computational challenges for analyzing genetic interactions associated with common human diseases. We propose a computational framework for characterizing epistatic interactions among large sets of genetic attributes in GWAS data. We build the human phenotype network (HPN) and focus around a disease of interest. In this study, we use the GLAUGEN glaucoma GWAS dataset and apply the HPN as a biological knowledge-based filter to prioritize genetic variants. Then, we use the statistical epistasis network (SEN) to identify a significant connected network of pairwise epistatic interactions among the prioritized SNPs. These clearly highlight the complex genetic basis of glaucoma. Furthermore, we identify key SNPs by quantifying structural network characteristics. Through functional annotation of these key SNPs using Biofilter, a software accessing multiple publicly available human genetic data sources, we find supporting biomedical evidences linking glaucoma to an array of genetic diseases, proving our concept. We conclude by suggesting hypotheses for a better understanding of the disease. PMID:25592582

  20. Integration of a neuroimaging processing pipeline into a pan-canadian computing grid

    NASA Astrophysics Data System (ADS)

    Lavoie-Courchesne, S.; Rioux, P.; Chouinard-Decorte, F.; Sherif, T.; Rousseau, M.-E.; Das, S.; Adalat, R.; Doyon, J.; Craddock, C.; Margulies, D.; Chu, C.; Lyttelton, O.; Evans, A. C.; Bellec, P.

    2012-02-01

    The ethos of the neuroimaging field is quickly moving towards the open sharing of resources, including both imaging databases and processing tools. As a neuroimaging database represents a large volume of datasets and as neuroimaging processing pipelines are composed of heterogeneous, computationally intensive tools, such open sharing raises specific computational challenges. This motivates the design of novel dedicated computing infrastructures. This paper describes an interface between PSOM, a code-oriented pipeline development framework, and CBRAIN, a web-oriented platform for grid computing. This interface was used to integrate a PSOM-compliant pipeline for preprocessing of structural and functional magnetic resonance imaging into CBRAIN. We further tested the capacity of our infrastructure to handle a real large-scale project. A neuroimaging database including close to 1000 subjects was preprocessed using our interface and publicly released to help the participants of the ADHD-200 international competition. This successful experiment demonstrated that our integrated grid-computing platform is a powerful solution for high-throughput pipeline analysis in the field of neuroimaging.

  1. A SOA-based approach to geographical data sharing

    NASA Astrophysics Data System (ADS)

    Li, Zonghua; Peng, Mingjun; Fan, Wei

    2009-10-01

    In the last few years, large volumes of spatial data have become available in different government departments in China, but these data are mainly used within those departments. With the e-government project initiated, spatial data sharing has become more and more necessary. Currently, the Web is used not only for document searching but also for the provision and use of services, known as Web services, which are published in a directory and may be automatically discovered by software agents. Particularly in the spatial domain, the possibility of accessing these large spatial datasets via Web services has motivated research into the new field of Spatial Data Infrastructure (SDI) implemented using service-oriented architecture. In this paper a Service-Oriented Architecture (SOA) based Geographical Information System (GIS) is proposed, and a prototype system is deployed based on Open Geospatial Consortium (OGC) standards in Wuhan, China, so that all authorized departments can access the spatial data within the government intranet and the spatial data can be easily integrated into various kinds of applications.
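
    As a small illustration of how clients discover such OGC-compliant services, a standard WMS GetCapabilities request can be composed as follows (the endpoint URL is a placeholder; the query parameters follow the OGC WMS specification):

      from urllib.parse import urlencode

      endpoint = "http://gis.example.gov/wms"   # hypothetical service address
      params = {"service": "WMS", "request": "GetCapabilities", "version": "1.3.0"}

      url = f"{endpoint}?{urlencode(params)}"
      print(url)
      # Fetching this URL (e.g. with urllib.request.urlopen) returns an XML
      # capabilities document listing the layers the service publishes.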

  2. Closing the data gap: Creating an open data environment

    NASA Astrophysics Data System (ADS)

    Hester, J. R.

    2014-02-01

    Poor data management brought on by increasing volumes of complex data undermines both the integrity of the scientific process and the usefulness of datasets. Researchers should endeavour both to make their data citeable and to cite data whenever possible. The reusability of datasets is improved by community adoption of comprehensive metadata standards and public availability of reversibly reduced data. Where standards are not yet defined, as much information as possible about the experiment and samples should be preserved in datafiles written in a standard format.

  3. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome.

    PubMed

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based, customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.

  4. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

    PubMed Central

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based, customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616

  5. Quantifying the tibiofemoral joint space using x-ray tomosynthesis.

    PubMed

    Kalinosky, Benjamin; Sabol, John M; Piacsek, Kelly; Heckel, Beth; Gilat Schmidt, Taly

    2011-12-01

    Digital x-ray tomosynthesis (DTS) has the potential to provide 3D information about the knee joint in a load-bearing posture, which may improve diagnosis and monitoring of knee osteoarthritis compared with projection radiography, the current standard of care. Manually quantifying and visualizing the joint space width (JSW) from 3D tomosynthesis datasets may be challenging. This work developed a semiautomated algorithm for quantifying the 3D tibiofemoral JSW from reconstructed DTS images. The algorithm was validated through anthropomorphic phantom experiments and applied to three clinical datasets. A user-selected volume of interest within the reconstructed DTS volume was enhanced with 1D multiscale gradient kernels. The edge-enhanced volumes were divided by polarity into tibial and femoral edge maps and combined across kernel scales. A 2D connected components algorithm was performed to determine candidate tibial and femoral edges. A 2D joint space width map (JSW) was constructed to represent the 3D tibiofemoral joint space. To quantify the algorithm accuracy, an adjustable knee phantom was constructed, and eleven posterior-anterior (PA) and lateral DTS scans were acquired with the medial minimum JSW of the phantom set to 0-5 mm in 0.5 mm increments (VolumeRad™, GE Healthcare, Chalfont St. Giles, United Kingdom). The accuracy of the algorithm was quantified by comparing the minimum JSW in a region of interest in the medial compartment of the JSW map to the measured phantom setting for each trial. In addition, the algorithm was applied to DTS scans of a static knee phantom and the JSW map compared to values estimated from a manually segmented computed tomography (CT) dataset. The algorithm was also applied to three clinical DTS datasets of osteoarthritic patients. The algorithm segmented the JSW and generated a JSW map for all phantom and clinical datasets. For the adjustable phantom, the estimated minimum JSW values were plotted against the measured values for all trials. A linear fit estimated a slope of 0.887 (R² = 0.962) and a mean error across all trials of 0.34 mm for the PA phantom data. The estimated minimum JSW values for the lateral adjustable phantom acquisitions were found to have low correlation to the measured values (R² = 0.377), with a mean error of 2.13 mm. The error in the lateral adjustable-phantom datasets appeared to be caused by artifacts due to unrealistic features in the phantom bones. JSW maps generated by DTS and CT varied by a mean of 0.6 mm and 0.8 mm across the knee joint, for PA and lateral scans. The tibial and femoral edges were successfully segmented and JSW maps determined for PA and lateral clinical DTS datasets. A semiautomated method is presented for quantifying the 3D joint space in a 2D JSW map using tomosynthesis images. The proposed algorithm quantified the JSW across the knee joint to sub-millimeter accuracy for PA tomosynthesis acquisitions. Overall, the results suggest that x-ray tomosynthesis may be beneficial for diagnosing and monitoring disease progression or treatment of osteoarthritis by providing quantitative images of JSW in the load-bearing knee.

  6. Atlas-guided cluster analysis of large tractography datasets.

    PubMed

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.
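
    A generic sketch of hierarchical clustering on a precomputed pairwise fiber-distance matrix (illustrative only; the toolkit's atlas guidance and outlier-elimination strategies are not reproduced, and the distance matrix is simulated):

      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster
      from scipy.spatial.distance import squareform

      rng = np.random.default_rng(0)
      d = rng.random((500, 500))                  # hypothetical tract-to-tract distances
      dist = (d + d.T) / 2.0                      # symmetrize
      np.fill_diagonal(dist, 0.0)

      Z = linkage(squareform(dist), method="average")    # agglomerative clustering
      labels = fcluster(Z, t=20, criterion="maxclust")   # cut the tree into 20 bundles
      print(np.bincount(labels)[1:])                     # tracts per bundle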

  7. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets

    PubMed Central

    McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr

    2016-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010

  8. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    NASA Astrophysics Data System (ADS)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national scale flood risk analyses, using high resolution Facebook Connectivity Lab population data and data from a hyper resolution flood hazard model. In recent years the field of large scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken both hazard and exposure data should sufficiently resolve local scale features. Global flood frameworks are enabling flood hazard data to produced at 90m resolution, resulting in a mis-match with available population datasets which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5m resolution, representing a resolution increase over previous countrywide data sets of multiple orders of magnitude. Flood risk analysis undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
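
    The core overlay of hazard and exposure grids reduces to a few lines once the two rasters are co-registered; a hedged sketch with synthetic stand-ins for the real datasets:

      import numpy as np

      rng = np.random.default_rng(0)
      depth = rng.gamma(0.3, 1.0, size=(1000, 1000))      # modelled water depth (m), synthetic
      population = rng.poisson(0.05, size=(1000, 1000))   # people per grid cell, synthetic

      exposed = population[depth > 0.1].sum()             # people in cells deeper than 10 cm
      print(f"exposed population: {exposed}")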

  9. Automatic segmentation of airway tree based on local intensity filter and machine learning technique in 3D chest CT volume.

    PubMed

    Meng, Qier; Kitasaka, Takayuki; Nimura, Yukitaka; Oda, Masahiro; Ueno, Junji; Mori, Kensaku

    2017-02-01

    Airway segmentation plays an important role in analyzing chest computed tomography (CT) volumes for computerized lung cancer detection, emphysema diagnosis and pre- and intra-operative bronchoscope navigation. However, obtaining a complete 3D airway tree structure from a CT volume is quite a challenging task. Several researchers have proposed automated airway segmentation algorithms based mainly on region growing and machine learning techniques. However, these methods fail to detect the peripheral bronchial branches, which results in a large amount of leakage. This paper presents a novel approach for more accurate extraction of the complex airway tree. The proposed segmentation method is composed of three steps. First, Hessian analysis is utilized to enhance the tube-like structure in CT volumes; then, an adaptive multiscale cavity enhancement filter is employed to detect the cavity-like structure with different radii. In the second step, support vector machine learning is utilized to remove the false positive (FP) regions from the result obtained in the previous step. Finally, the graph-cut algorithm is used to refine the candidate voxels to form an integrated airway tree. A test dataset including 50 standard-dose chest CT volumes was used for evaluating our proposed method. The average extraction rate was about 79.1% with a significantly decreased FP rate. A new method of airway segmentation based on local intensity structure and machine learning technique was developed. The method was shown to be feasible for airway segmentation in a computer-aided diagnosis system for a lung and bronchoscope guidance system.
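
    A sketch of the Hessian-based tube-enhancement step only, using the vesselness filter available in scikit-image on a single 2D slice (the paper's full pipeline, including the cavity filter, SVM false-positive removal and graph cut, is not reproduced; the input is a random placeholder):

      import numpy as np
      from skimage.filters import frangi

      rng = np.random.default_rng(0)
      ct_slice = rng.normal(size=(512, 512))     # placeholder for a HU-scaled CT slice

      # Airways are dark (air-filled) tubes, so enhance dark ridge-like structures
      # over several scales and threshold the response as a rough candidate mask.
      tubeness = frangi(ct_slice, sigmas=range(1, 6), black_ridges=True)
      candidates = tubeness > np.percentile(tubeness, 99)
      print(candidates.sum(), "candidate airway pixels in this slice")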

  10. Distributed and parallel approach for handle and perform huge datasets

    NASA Astrophysics Data System (ADS)

    Konopko, Joanna

    2015-12-01

    Big Data refers to the dynamic, large and disparate volumes of data that come from many different sources (tools, machines, sensors, mobile devices) and are uncorrelated with each other. It requires new, innovative and scalable technology to collect, host and analytically process the vast amount of data. A proper architecture for systems that process such huge datasets is needed. In this paper, a comparison of distributed and parallel system architectures is presented using the example of the MapReduce (MR) Hadoop platform and a parallel database platform (DBMS). This paper also analyzes the problem of extracting and handling valuable information from petabytes of data. Both paradigms, MapReduce and parallel DBMS, are described and compared. A hybrid architecture approach is also proposed, which could be used to solve the analyzed problem of storing and processing Big Data.
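
    To make the contrast between the two paradigms concrete, a toy map/reduce over partitioned records (this mimics MapReduce semantics in plain Python, not Hadoop, and the record layout is hypothetical):

      from collections import Counter
      from functools import reduce
      from multiprocessing import Pool

      def map_partition(records):
          # Map phase: emit a partial count per key within one partition.
          return Counter(rec["sensor_id"] for rec in records)

      def reduce_counts(a, b):
          # Reduce phase: merge partial counts from the mappers.
          a.update(b)
          return a

      if __name__ == "__main__":
          partitions = [
              [{"sensor_id": "s1"}, {"sensor_id": "s2"}],
              [{"sensor_id": "s1"}, {"sensor_id": "s1"}],
          ]
          with Pool(2) as pool:
              partials = pool.map(map_partition, partitions)
          totals = reduce(reduce_counts, partials, Counter())
          print(totals)   # equivalent to SELECT sensor_id, COUNT(*) ... GROUP BY sensor_id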

  11. Uncertainty Management in Remote Sensing of Climate Data. Summary of A Workshop

    NASA Technical Reports Server (NTRS)

    McConnell, M.; Weidman, S.

    2009-01-01

    Great advances have been made in our understanding of the climate system over the past few decades, and remotely sensed data have played a key role in supporting many of these advances. Improvements in satellites and in computational and data-handling techniques have yielded high quality, readily accessible data. However, rapid increases in data volume have also led to large and complex datasets that pose significant challenges in data analysis (NRC, 2007). Uncertainty characterization is needed for every satellite mission and scientists continue to be challenged by the need to reduce the uncertainty in remotely sensed climate records and projections. The approaches currently used to quantify the uncertainty in remotely sensed data, including statistical methods used to calibrate and validate satellite instruments, lack an overall mathematically based framework.

  12. Clinical high-resolution mapping of the proteoglycan-bound water fraction in articular cartilage of the human knee joint.

    PubMed

    Bouhrara, Mustapha; Reiter, David A; Sexton, Kyle W; Bergeron, Christopher M; Zukley, Linda M; Spencer, Richard G

    2017-11-01

    We applied our recently introduced Bayesian analytic method to achieve clinically-feasible in-vivo mapping of the proteoglycan water fraction (PgWF) of human knee cartilage with improved spatial resolution and stability as compared to existing methods. Multicomponent driven equilibrium single-pulse observation of T 1 and T 2 (mcDESPOT) datasets were acquired from the knees of two healthy young subjects and one older subject with previous knee injury. Each dataset was processed using Bayesian Monte Carlo (BMC) analysis incorporating a two-component tissue model. We assessed the performance and reproducibility of BMC and of the conventional analysis of stochastic region contraction (SRC) in the estimation of PgWF. Stability of the BMC analysis of PgWF was tested by comparing independent high-resolution (HR) datasets from each of the two young subjects. Unlike SRC, the BMC-derived maps from the two HR datasets were essentially identical. Furthermore, SRC maps showed substantial random variation in estimated PgWF, and mean values that differed from those obtained using BMC. In addition, PgWF maps derived from conventional low-resolution (LR) datasets exhibited partial volume and magnetic susceptibility effects. These artifacts were absent in HR PgWF images. Finally, our analysis showed regional variation in PgWF estimates, and substantially higher values in the younger subjects as compared to the older subject. BMC-mcDESPOT permits HR in-vivo mapping of PgWF in human knee cartilage in a clinically-feasible acquisition time. HR mapping reduces the impact of partial volume and magnetic susceptibility artifacts compared to LR mapping. Finally, BMC-mcDESPOT demonstrated excellent reproducibility in the determination of PgWF. Published by Elsevier Inc.

  13. Hierarchical Bayesian modelling of mobility metrics for hazard model input calibration

    NASA Astrophysics Data System (ADS)

    Calder, Eliza; Ogburn, Sarah; Spiller, Elaine; Rutarindwa, Regis; Berger, Jim

    2015-04-01

    In this work we present a method to constrain flow mobility input parameters for pyroclastic flow models using hierarchical Bayes modeling of standard mobility metrics such as H/L and flow volume. The advantage of hierarchical modeling is that it can leverage the information in a global dataset for a particular mobility metric in order to reduce the uncertainty in modeling an individual volcano, which is especially important where individual volcanoes have only sparse datasets. We use compiled pyroclastic flow runout data from Colima, Merapi, Soufriere Hills, Unzen and Semeru volcanoes, presented in an open-source database FlowDat (https://vhub.org/groups/massflowdatabase). While the exact relationship between flow volume and friction varies somewhat between volcanoes, dome collapse flows originating from the same volcano exhibit similar mobility relationships. Instead of fitting separate regression models for each volcano dataset, we use a variation of the hierarchical linear model (Kass and Steffey, 1989). The model presents a hierarchical structure with two levels: all dome collapse flows, and dome collapse flows at specific volcanoes. The hierarchical model allows us to assume that the flows at specific volcanoes share a common distribution of regression slopes, then solves for that distribution. We present comparisons of the 95% confidence intervals on the individual regression lines for the data set from each volcano as well as those obtained from the hierarchical model. The results clearly demonstrate the advantage of considering global datasets using this technique. The technique developed is demonstrated here for mobility metrics, but can be applied to many other global datasets of volcanic parameters. In particular, such methods can provide a means to better constrain parameters for volcanoes for which we only have sparse data, a ubiquitous problem in volcanology.
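    The sketch below illustrates the partial-pooling idea behind such a hierarchical model under simplifying assumptions: per-volcano regression slopes are shrunk toward a global slope, with sparse datasets pulled more strongly toward the pooled estimate. The shrinkage weighting and function names are illustrative and do not reproduce the paper's full Bayesian formulation (Kass and Steffey, 1989).

```python
# Hedged sketch of partial pooling of per-volcano regression slopes.
import numpy as np


def pooled_slopes(x_by_volcano, y_by_volcano):
    """Return per-volcano OLS slopes and simple partially pooled versions."""
    slopes, n_flows = [], []
    for x, y in zip(x_by_volcano, y_by_volcano):
        x, y = np.asarray(x, float), np.asarray(y, float)
        slopes.append(np.polyfit(x, y, 1)[0])   # per-volcano OLS slope
        n_flows.append(len(x))
    slopes, n_flows = np.array(slopes), np.array(n_flows, float)
    global_slope = np.average(slopes, weights=n_flows)
    # Volcanoes with few flows are shrunk toward the global slope; data-rich
    # volcanoes keep slopes close to their own fit.
    w = n_flows / (n_flows + n_flows.mean())
    return slopes, w * slopes + (1.0 - w) * global_slope
```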

  14. Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  15. LANL - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  16. LANL - Neutral - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  17. Primary Datasets for Case Studies of River-Water Quality

    ERIC Educational Resources Information Center

    Goulder, Raymond

    2008-01-01

    Level 6 (final-year BSc) students undertook case studies on between-site and temporal variation in river-water quality. They used professionally-collected datasets supplied by the Environment Agency. The exercise gave students the experience of working with large, real-world datasets and led to their understanding of how the quality of river water is…

  18. Accuracy assessment of the U.S. Geological Survey National Elevation Dataset, and comparison with other large-area elevation datasets: SRTM and ASTER

    USGS Publications Warehouse

    Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.

    2014-01-01

    The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).

  19. Data-driven decision support for radiologists: re-using the National Lung Screening Trial dataset for pulmonary nodule management.

    PubMed

    Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L

    2015-02-01

    Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used as both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
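    A hedged sketch of the kind of cohort query such a tool issues is shown below. The table and column names (nlst_participants, nodule_size_mm, confirmed_cancer, etc.) are hypothetical, and sqlite3 stands in for the paper's JavaScript/SQL web stack.

```python
# Hypothetical cohort query: count outcomes for patients similar to an index case.
import sqlite3


def similar_cohort(db_path, age, pack_years, nodule_mm, tol_mm=2.0):
    """Return cancer-outcome counts for participants resembling the index case."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        """
        SELECT confirmed_cancer, COUNT(*)
        FROM nlst_participants
        WHERE ABS(age - ?) <= 5
          AND ABS(pack_years - ?) <= 10
          AND ABS(nodule_size_mm - ?) <= ?
        GROUP BY confirmed_cancer
        """,
        (age, pack_years, nodule_mm, tol_mm),
    ).fetchall()
    con.close()
    return dict(rows)
```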

  20. Quantification of γ-aminobutyric acid (GABA) in 1H MRS volumes composed heterogeneously of grey and white matter.

    PubMed

    Mikkelsen, Mark; Singh, Krish D; Brealy, Jennifer A; Linden, David E J; Evans, C John

    2016-11-01

    The quantification of γ-aminobutyric acid (GABA) concentration using localised MRS suffers from partial volume effects related to differences in the intrinsic concentration of GABA in grey (GM) and white (WM) matter. These differences can be represented as a ratio between intrinsic GABA in GM and WM: r_M. Individual differences in GM tissue volume can therefore potentially drive apparent concentration differences. Here, a quantification method that corrects for these effects is formulated and empirically validated. Quantification using tissue water as an internal concentration reference has been described previously. Partial volume effects attributed to r_M can be accounted for by incorporating into this established method an additional multiplicative correction factor based on measured or literature values of r_M weighted by the proportion of GM and WM within tissue-segmented MRS volumes. Simulations were performed to test the sensitivity of this correction using different assumptions of r_M taken from previous studies. The tissue correction method was then validated by applying it to an independent dataset of in vivo GABA measurements using an empirically measured value of r_M. It was shown that incorrect assumptions of r_M can lead to overcorrection and inflation of GABA concentration measurements quantified in volumes composed predominantly of WM. For the independent dataset, GABA concentration was linearly related to GM tissue volume when only the water signal was corrected for partial volume effects. Performing a full correction that additionally accounts for partial volume effects ascribed to r_M successfully removed this dependence. With an appropriate assumption of the ratio of intrinsic GABA concentration in GM and WM, GABA measurements can be corrected for partial volume effects, potentially leading to a reduction in between-participant variance, increased power in statistical tests and better discriminability of true effects. Copyright © 2016 John Wiley & Sons, Ltd.
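    The sketch below shows one plausible form such a tissue correction can take, assuming a simple two-compartment forward model in which the voxel-average GABA level is a tissue-fraction-weighted mix of intrinsic GM and WM levels with ratio r_M; the exact multiplicative factor used in the paper may differ, and the default r_M below is purely illustrative.

```python
# Hedged two-compartment sketch: recover an intrinsic GM GABA level from a
# voxel-average measurement given GM/WM fractions and an assumed ratio r_M.
def gm_referenced_gaba(measured, f_gm, f_wm, r_m=2.0):
    """measured: voxel-average GABA over the GM+WM tissue compartment.
    f_gm, f_wm: GM and WM volume fractions from tissue segmentation.
    r_m: assumed ratio of intrinsic GABA in GM vs. WM (illustrative default).
    """
    # Forward model: measured = (c_gm*f_gm + c_wm*f_wm) / (f_gm + f_wm),
    # with c_gm = r_m * c_wm.  Invert for the intrinsic GM level c_gm.
    c_wm = measured * (f_gm + f_wm) / (r_m * f_gm + f_wm)
    return r_m * c_wm
```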

  1. WE-G-18A-03: Cone Artifacts Correction in Iterative Cone Beam CT Reconstruction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yan, H; Folkerts, M; Jiang, S

    Purpose: For iterative reconstruction (IR) in cone-beam CT (CBCT) imaging, data truncation along the superior-inferior (SI) direction causes severe cone artifacts in the reconstructed CBCT volume images. Not only does it reduce the effective SI coverage of the reconstructed volume, it also hinders the IR algorithm convergence. This is particularly a problem for regularization-based IR, where smoothing-type regularization operations tend to propagate the artifacts to a large area. It is our purpose to develop a practical cone artifacts correction solution. Methods: We found it is the missing data residing in the truncated cone area that leads to inconsistency between the calculated forward projections and measured projections. We overcome this problem by using FDK type reconstruction to estimate the missing data and design weighting factors to compensate for the inconsistency caused by the missing data. We validate the proposed methods in our multi-GPU low-dose CBCT reconstruction system on multiple patients' datasets. Results: Compared to the FDK reconstruction with full datasets, while IR is able to reconstruct CBCT images using a subset of projection data, the severe cone artifacts degrade overall image quality. For a head-and-neck case under full-fan mode, 13 out of 80 slices are contaminated. It is even more severe in a pelvis case under half-fan mode, where 36 out of 80 slices are affected, leading to inferior soft-tissue delineation. By applying the proposed method, the cone artifacts are effectively corrected, with a mean intensity difference decreased from ∼497 HU to ∼39 HU for those contaminated slices. Conclusion: A practical and effective solution for cone artifacts correction is proposed and validated in a CBCT IR algorithm. This study is supported in part by NIH (1R01CA154747-01)

  2. Lumped parameter, isotopic model simulations of closed-basin lake response to drought in the Pacific Northwest and implications for lake sediment oxygen isotope records.

    NASA Astrophysics Data System (ADS)

    Steinman, B. A.; Rosenmeier, M.; Abbott, M.

    2008-12-01

    The economy of the Pacific Northwest relies heavily on water resources from the drought-prone Columbia River and its tributaries, as well as the many lakes and reservoirs of the region. Proper management of these water resources requires a thorough understanding of local drought histories that extends well beyond the instrumental record of the twentieth century, a time frame too short to capture the full range of drought variability in the Pacific Northwest. Here we present a lumped parameter, mass-balance model that provides insight into the influence of hydroclimatological changes on two small, closed-basin systems located in north- central Washington. Steady state model simulations of lake water oxygen isotope ratios using modern climate and catchment parameter datasets demonstrate a strong sensitivity to both the amount and timing of precipitation, and to changes in summertime relative humidity, particularly at annual and decadal time scales. Model tests also suggest that basin hypsography can have a significant impact on lake water oxygen isotope variations, largely through surface area to volume and consequent evaporative flux to volume ratio changes in response to drought and pluvial sequences. Additional simulations using input parameters derived from both on-site and National Climatic Data Center historical climate datasets accurately approximate three years of continuous lake observations (seasonal water sampling and continuous lake level monitoring) and twentieth century oxygen isotope ratios in sediment core authigenic carbonate recovered from the lakes. Results from these model simulations suggest that small, closed-basin lakes in north-central Washington are highly sensitive to changes in the drought-related climate variables, and that long (8000 year), high resolution records of quantitative changes in precipitation and evaporation are obtainable from sediment cores recovered from water bodies of the Pacific Northwest.

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Khoo, Eric L.H., E-mail: eric.khoo@roq.net.au; Schick, Karlissa; Plank, Ashley W.

    Purpose: To assess whether an education program on CT and MRI prostate anatomy would reduce inter- and intraobserver prostate contouring variation among experienced radiation oncologists. Methods and Materials: Three patient CT and MRI datasets were selected. Five radiation oncologists contoured the prostate for each patient on CT first, then MRI, and again between 2 and 4 weeks later. Three education sessions were then conducted. The same contouring process was then repeated with the same datasets and oncologists. The observer variation was assessed according to changes in the ratio of the encompassing volume to intersecting volume (volume ratio [VR]), across sets of target volumes. Results: For interobserver variation, there was a 15% reduction in mean VR with CT, from 2.74 to 2.33, and a 40% reduction in mean VR with MRI, from 2.38 to 1.41 after education. A similar trend was found for intraobserver variation, with a mean VR reduction for CT and MRI of 9% (from 1.51 to 1.38) and 16% (from 1.37 to 1.15), respectively. Conclusion: A well-structured education program has reduced both inter- and intraobserver prostate contouring variations. The impact was greater on MRI than on CT. With the ongoing incorporation of new technologies into routine practice, education programs for target contouring should be incorporated as part of the continuing medical education of radiation oncologists.
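    A minimal sketch of the volume ratio (VR) used above, assuming the contours are available as co-registered binary masks: VR is the encompassing (union) volume divided by the intersecting volume, so identical contours give VR = 1 and larger values indicate poorer agreement.

```python
# Volume ratio of a set of contours given as boolean masks of identical shape.
import numpy as np


def volume_ratio(masks):
    masks = [np.asarray(m, bool) for m in masks]
    union = np.logical_or.reduce(masks)          # encompassing volume
    intersection = np.logical_and.reduce(masks)  # intersecting volume
    if intersection.sum() == 0:
        raise ValueError("contours do not overlap; VR is undefined")
    return union.sum() / intersection.sum()
```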

  4. Scalable Visual Analytics of Massive Textual Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.

    2007-04-01

    This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte data sets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.

  5. a Critical Review of Automated Photogrammetric Processing of Large Datasets

    NASA Astrophysics Data System (ADS)

    Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.

    2017-08-01

    The paper reports some comparisons between commercial software able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a diverse number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to establish rigorous terms of reference for comparisons and critical analyses.

  6. Use of Patient Registries and Administrative Datasets for the Study of Pediatric Cancer

    PubMed Central

    Rice, Henry E.; Englum, Brian R.; Gulack, Brian C.; Adibe, Obinna O.; Tracy, Elizabeth T.; Kreissman, Susan G.; Routh, Jonathan C.

    2015-01-01

    Analysis of data from large administrative databases and patient registries is increasingly being used to study childhood cancer care, although the value of these data sources remains unclear to many clinicians. Interpretation of large databases requires a thorough understanding of how the dataset was designed, how data were collected, and how to assess data quality. This review will detail the role of administrative databases and registry databases for the study of childhood cancer, tools to maximize information from these datasets, and recommendations to improve the use of these databases for the study of pediatric oncology. PMID:25807938

  7. Treatment planning constraints to avoid xerostomia in head-and-neck radiotherapy: an independent test of QUANTEC criteria using a prospectively collected dataset.

    PubMed

    Moiseenko, Vitali; Wu, Jonn; Hovan, Allan; Saleh, Ziad; Apte, Aditya; Deasy, Joseph O; Harrow, Stephen; Rabuka, Carman; Muggli, Adam; Thompson, Anna

    2012-03-01

    The severe reduction of salivary function (xerostomia) is a common complication after radiation therapy for head-and-neck cancer. Consequently, guidelines to ensure adequate function based on parotid gland tolerance dose-volume parameters have been suggested by the QUANTEC group and by Ortholan et al. We perform a validation test of these guidelines against a prospectively collected dataset and compare the results with a previously published dataset. Whole-mouth stimulated salivary flow data from 66 head-and-neck cancer patients treated with radiotherapy at the British Columbia Cancer Agency (BCCA) were measured, and treatment planning data were abstracted. Flow measurements were collected from 50 patients at 3 months, and 60 patients at 12-month follow-up. Previously published data from a second institution, Washington University in St. Louis (WUSTL), were used for comparison. A logistic model was used to describe the incidence of Grade 4 xerostomia as a function of the mean dose of the spared parotid gland. The rate of correctly predicting the lack of xerostomia (negative predictive value [NPV]) was computed for both the QUANTEC constraints and the Ortholan et al. recommendation to constrain the total volume of both glands receiving more than 40 Gy to less than 33%. Both datasets showed a rate of xerostomia of less than 20% when the mean dose to the least-irradiated parotid gland is kept to less than 20 Gy. Logistic model parameters for the incidence of xerostomia at 12 months after therapy, based on the least-irradiated gland, were D50 = 32.4 Gy and γ = 0.97. NPVs for the QUANTEC guideline were 94% (BCCA data) and 90% (WUSTL data). For the Ortholan et al. guideline, NPVs were 85% (BCCA) and 86% (WUSTL). These data confirm that the QUANTEC guideline effectively avoids xerostomia, and this is somewhat more effective than constraints on the volume receiving more than 40 Gy. Copyright © 2012 Elsevier Inc. All rights reserved.
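    For illustration, the sketch below evaluates a logistic dose-response of a standard form parameterised by D50 and the normalised slope γ, using the fitted values reported above; the exact parameterisation used in the paper may differ, so this is a hedged reading rather than the authors' model.

```python
# Logistic dose-response sketch: probability of grade 4 xerostomia vs. mean
# dose to the spared (least-irradiated) parotid gland.
import math


def xerostomia_probability(mean_dose_gy, d50=32.4, gamma=0.97):
    return 1.0 / (1.0 + math.exp(4.0 * gamma * (1.0 - mean_dose_gy / d50)))


# At a spared-gland mean dose of 20 Gy this parameterisation gives ~0.18,
# in line with the <20% incidence reported for both datasets.
print(round(xerostomia_probability(20.0), 3))
```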

  8. Computer-aided liver volumetry: performance of a fully-automated, prototype post-processing solution for whole-organ and lobar segmentation based on MDCT imaging.

    PubMed

    Fananapazir, Ghaneh; Bashir, Mustafa R; Marin, Daniele; Boll, Daniel T

    2015-06-01

    To evaluate the performance of a prototype, fully-automated post-processing solution for whole-liver and lobar segmentation based on MDCT datasets. A polymer liver phantom was used to assess accuracy of post-processing applications, comparing phantom volumes determined via Archimedes' principle with MDCT segmented datasets. For the IRB-approved, HIPAA-compliant study, 25 patients were enrolled. Volumetry performance was compared between the manual approach and the automated prototype, assessing intraobserver variability and interclass correlation for whole-organ and lobar segmentation using ANOVA. Fidelity of segmentation was evaluated qualitatively. Phantom volume was 1581.0 ± 44.7 mL; manually segmented datasets estimated 1628.0 ± 47.8 mL, representing a mean overestimation of 3.0%; automatically segmented datasets estimated 1601.9 ± 0 mL, representing a mean overestimation of 1.3%. Whole-liver and segmental volumetry demonstrated no significant intraobserver variability for either manual or automated measurements. For whole-liver volumetry, automated measurement repetitions resulted in identical values; reproducible whole-organ volumetry was also achieved with manual segmentation, p(ANOVA) 0.98. For lobar volumetry, automated segmentation improved reproducibility over the manual approach, without significant measurement differences for either methodology, p(ANOVA) 0.95-0.99. Whole-organ and lobar segmentation results from manual and automated segmentation showed no significant differences, p(ANOVA) 0.96-1.00. Assessment of segmentation fidelity found that segments I-IV/VI showed greater segmentation inaccuracies compared to the remaining right hepatic lobe segments. Fully-automated whole-liver segmentation showed non-inferiority compared to manual approaches, with improved reproducibility and post-processing duration; automated dual-seed lobar segmentation showed slight tendencies for underestimating the right hepatic lobe volume and greater variability in edge detection for the left hepatic lobe compared to manual segmentation.

  9. Treatment Planning Constraints to Avoid Xerostomia in Head-and-Neck Radiotherapy: An Independent Test of QUANTEC Criteria Using a Prospectively Collected Dataset

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moiseenko, Vitali, E-mail: vmoiseenko@bccancer.bc.ca; Wu, Jonn; Hovan, Allan

    2012-03-01

    Purpose: The severe reduction of salivary function (xerostomia) is a common complication after radiation therapy for head-and-neck cancer. Consequently, guidelines to ensure adequate function based on parotid gland tolerance dose-volume parameters have been suggested by the QUANTEC group and by Ortholan et al. We perform a validation test of these guidelines against a prospectively collected dataset and compare the results with a previously published dataset. Methods and Materials: Whole-mouth stimulated salivary flow data from 66 head-and-neck cancer patients treated with radiotherapy at the British Columbia Cancer Agency (BCCA) were measured, and treatment planning data were abstracted. Flow measurements were collected from 50 patients at 3 months, and 60 patients at 12-month follow-up. Previously published data from a second institution, Washington University in St. Louis (WUSTL), were used for comparison. A logistic model was used to describe the incidence of Grade 4 xerostomia as a function of the mean dose of the spared parotid gland. The rate of correctly predicting the lack of xerostomia (negative predictive value [NPV]) was computed for both the QUANTEC constraints and the Ortholan et al. recommendation to constrain the total volume of both glands receiving more than 40 Gy to less than 33%. Results: Both datasets showed a rate of xerostomia of less than 20% when the mean dose to the least-irradiated parotid gland is kept to less than 20 Gy. Logistic model parameters for the incidence of xerostomia at 12 months after therapy, based on the least-irradiated gland, were D50 = 32.4 Gy and γ = 0.97. NPVs for the QUANTEC guideline were 94% (BCCA data) and 90% (WUSTL data). For the Ortholan et al. guideline, NPVs were 85% (BCCA) and 86% (WUSTL). Conclusion: These data confirm that the QUANTEC guideline effectively avoids xerostomia, and this is somewhat more effective than constraints on the volume receiving more than 40 Gy.

  10. On the visualization of water-related big data: extracting insights from drought proxies' datasets

    NASA Astrophysics Data System (ADS)

    Diaz, Vitali; Corzo, Gerald; van Lanen, Henny A. J.; Solomatine, Dimitri

    2017-04-01

    Big data is a growing area of science from which hydroinformatics can benefit greatly. There have been a number of important developments in the area of data science aimed at the analysis of large datasets. Such datasets related to water include measurements, simulations, reanalysis, scenario analyses and proxies. By convention, information contained in these databases is referenced to a specific time and space (i.e., longitude/latitude). This work is motivated by the need to extract insights from large water-related datasets, i.e., transforming large amounts of data into useful information that helps to better understand water-related phenomena, particularly drought. In this context, data visualization, part of data science, involves techniques to create and to communicate data by encoding it as visual graphical objects. These may help to better understand data and detect trends. Based on existing methods of data analysis and visualization, this work aims to develop tools for visualizing water-related large datasets. These tools were developed by taking advantage of existing data-visualization libraries and comprise a group of graphs which includes both polar area diagrams (PADs) and radar charts (RDs). In both graphs, time steps are represented by the polar angles and the percentages of area in drought by the radii. For illustration, three large datasets of drought proxies are chosen to identify trends, prone areas and spatio-temporal variability of drought in a set of case studies. The datasets are (1) SPI-TS2p1 (1901-2002, 11.7 GB), (2) SPI-PRECL0p5 (1948-2016, 7.91 GB) and (3) SPEI-baseV2.3 (1901-2013, 15.3 GB). All of them are on a monthly basis and with a spatial resolution of 0.5 degrees. The first two were retrieved from the repository of the International Research Institute for Climate and Society (IRI). They are included in the Analyses Standardized Precipitation Index (SPI) project (iridl.ldeo.columbia.edu/SOURCES/.IRI/.Analyses/.SPI/). The third dataset was recovered from the Standardized Precipitation Evaporation Index (SPEI) Monitor (digital.csic.es/handle/10261/128892). PADs were found suitable for identifying the spatio-temporal variability and prone areas of drought. Drought trends were visually detected by using both PADs and RDs. A similar approach can be followed to include other types of graphs to deal with the analysis of water-related big data. Key words: Big data, data visualization, drought, SPI, SPEI
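    The polar area diagram idea is easy to sketch: time steps are mapped to polar angles and the percentage of area in drought to the radius. The values below are synthetic placeholders, not taken from the SPI/SPEI datasets listed above.

```python
# Polar area diagram (PAD) sketch: one wedge per month, radius = % area in drought.
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(12)
area_in_drought = np.array([5, 8, 12, 20, 35, 40, 38, 30, 22, 15, 10, 6])  # synthetic %

theta = 2 * np.pi * months / 12
width = 2 * np.pi / 12

ax = plt.subplot(projection="polar")
ax.bar(theta, area_in_drought, width=width, align="edge", alpha=0.7)
ax.set_title("Percentage of area in drought per month (synthetic)")
plt.show()
```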

  11. The MATISSE analysis of large spectral datasets from the ESO Archive

    NASA Astrophysics Data System (ADS)

    Worley, C.; de Laverny, P.; Recio-Blanco, A.; Hill, V.; Vernisse, Y.; Ordenovic, C.; Bijaoui, A.

    2010-12-01

    The automated stellar classification algorithm, MATISSE, has been developed at the Observatoire de la Côte d'Azur (OCA) in order to determine stellar temperatures, gravities and chemical abundances for large datasets of stellar spectra. The Gaia Data Processing and Analysis Consortium (DPAC) has selected MATISSE as one of the key programmes to be used in the analysis of the Gaia Radial Velocity Spectrometer (RVS) spectra. MATISSE is currently being used to analyse large datasets of spectra from the ESO archive with the primary goal of producing advanced data products to be made available in the ESO database via the Virtual Observatory. This is also an invaluable opportunity to identify and address issues that can be encountered with the analysis of large samples of real spectra prior to the launch of Gaia in 2012. The analysis of the archived spectra of the FEROS spectrograph is currently underway and preliminary results are presented.

  12. QAPgrid: A Two Level QAP-Based Approach for Large-Scale Data Analysis and Visualization

    PubMed Central

    Inostroza-Ponta, Mario; Berretta, Regina; Moscato, Pablo

    2011-01-01

    Background The visualization of large volumes of data is a computationally challenging task that often promises rewarding new insights. There is great potential in the application of new algorithms and models from combinatorial optimisation. Datasets often contain “hidden regularities” and a combined identification and visualization method should reveal these structures and present them in a way that helps analysis. While several methodologies exist, including those that use non-linear optimization algorithms, severe limitations exist even when working with only a few hundred objects. Methodology/Principal Findings We present a new data visualization approach (QAPgrid) that reveals patterns of similarities and differences in large datasets of objects for which a similarity measure can be computed. Objects are assigned to positions on an underlying square grid in a two-dimensional space. We use the Quadratic Assignment Problem (QAP) as a mathematical model to provide an objective function for assignment of objects to positions on the grid. We employ a Memetic Algorithm (a powerful metaheuristic) to tackle the large instances of this NP-hard combinatorial optimization problem, and we show its performance on the visualization of real data sets. Conclusions/Significance Overall, the results show that the QAPgrid algorithm is able to produce a layout that represents the relationships between objects in the data set. Furthermore, it also represents the relationships between clusters that are fed into the algorithm. We apply QAPgrid to the 84 Indo-European languages instance, producing a near-optimal layout. Next, we produce a layout of 470 world universities with an observed high degree of correlation with the score used by the Academic Ranking of World Universities compiled by Shanghai Jiao Tong University, without the need for an ad hoc weighting of attributes. Finally, our Gene Ontology-based study on Saccharomyces cerevisiae fully demonstrates the scalability and precision of our method as a novel alternative tool for functional genomics. PMID:21267077
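    A minimal sketch of the QAP objective that drives the layout may clarify the approach: given a similarity matrix between objects and a distance matrix between grid positions, a candidate assignment is scored so that highly similar objects land on nearby cells. The memetic search that optimises this objective is not reproduced here, and the function name is illustrative.

```python
# Evaluate the QAP-style objective for a candidate object-to-grid assignment.
import numpy as np


def qap_cost(similarity, grid_distance, assignment):
    """assignment[i] = index of the grid position given to object i."""
    s = np.asarray(similarity, float)
    d = np.asarray(grid_distance, float)
    a = np.asarray(assignment, int)
    # Sum over object pairs of similarity(i, j) * distance(pos(i), pos(j));
    # minimising this places similar objects close together on the grid.
    return float(np.sum(s * d[np.ix_(a, a)]))
```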

  13. QAPgrid: a two level QAP-based approach for large-scale data analysis and visualization.

    PubMed

    Inostroza-Ponta, Mario; Berretta, Regina; Moscato, Pablo

    2011-01-18

    The visualization of large volumes of data is a computationally challenging task that often promises rewarding new insights. There is great potential in the application of new algorithms and models from combinatorial optimisation. Datasets often contain "hidden regularities" and a combined identification and visualization method should reveal these structures and present them in a way that helps analysis. While several methodologies exist, including those that use non-linear optimization algorithms, severe limitations exist even when working with only a few hundred objects. We present a new data visualization approach (QAPgrid) that reveals patterns of similarities and differences in large datasets of objects for which a similarity measure can be computed. Objects are assigned to positions on an underlying square grid in a two-dimensional space. We use the Quadratic Assignment Problem (QAP) as a mathematical model to provide an objective function for assignment of objects to positions on the grid. We employ a Memetic Algorithm (a powerful metaheuristic) to tackle the large instances of this NP-hard combinatorial optimization problem, and we show its performance on the visualization of real data sets. Overall, the results show that the QAPgrid algorithm is able to produce a layout that represents the relationships between objects in the data set. Furthermore, it also represents the relationships between clusters that are fed into the algorithm. We apply QAPgrid to the 84 Indo-European languages instance, producing a near-optimal layout. Next, we produce a layout of 470 world universities with an observed high degree of correlation with the score used by the Academic Ranking of World Universities compiled by Shanghai Jiao Tong University, without the need for an ad hoc weighting of attributes. Finally, our Gene Ontology-based study on Saccharomyces cerevisiae fully demonstrates the scalability and precision of our method as a novel alternative tool for functional genomics.

  14. Full-motion video analysis for improved gender classification

    NASA Astrophysics Data System (ADS)

    Flora, Jeffrey B.; Lochtefeld, Darrell F.; Iftekharuddin, Khan M.

    2014-06-01

    The ability of computer systems to perform gender classification using the dynamic motion of the human subject has important applications in medicine, human factors, and human-computer interface systems. Previous works in motion analysis have used data from sensors (including gyroscopes, accelerometers, and force plates), radar signatures, and video. However, full-motion video, motion capture, and range data provide datasets with higher temporal and spatial resolution for the analysis of dynamic motion. Works using motion capture data have been limited by small datasets collected in controlled environments. In this paper, we apply machine learning techniques to a new dataset that has a larger number of subjects. Additionally, these subjects move unrestricted through a capture volume, representing a more realistic, less controlled environment. We conclude that existing linear classification methods are insufficient for gender classification on the larger dataset captured in a relatively uncontrolled environment. A method based on a nonlinear support vector machine classifier is proposed to obtain gender classification for the larger dataset. In experimental testing with a dataset consisting of 98 trials (49 subjects, 2 trials per subject), classification rates using leave-one-out cross-validation are improved from 73% using linear discriminant analysis to 88% using the nonlinear support vector machine classifier.
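    A hedged sketch of the classifier and evaluation protocol described above is given below: an RBF-kernel support vector machine scored with leave-one-out cross-validation. Feature extraction from the motion-capture trials is not shown; the arrays X and y stand in for per-trial feature vectors and gender labels.

```python
# RBF-kernel SVM with leave-one-out cross-validation on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 20))        # 98 trials, 20 synthetic motion features
y = rng.integers(0, 2, size=98)      # synthetic gender labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.2f}")
```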

  15. From Streaming Data to Streaming Insights: The Impact of Data Velocities on Mental Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Endert, Alexander; Pike, William A.; Cook, Kristin A.

    The rise of Big Data has influenced the design and technical implementation of visual analytic tools required to handle the increased volumes, velocities, and varieties of data. This has required a set of data management and computational advancements to allow us to store and compute on such datasets. However, as the ultimate goal of visual analytic technology is to enable the discovery and creation of insights from the users, an under-explored area is understanding how these datasets impact their mental models. That is, how have the analytic processes and strategies of users changed? How have users changed their perception of how to leverage, and ask questions of, these datasets?

  16. Universal Batch Steganalysis

    DTIC Science & Technology

    2014-06-30

    Identifying a 'guilty' user (of steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network … floating point operations (1 TFLOPs) for a 1 megapixel image. We designed a new implementation using Compute Unified Device Architecture (CUDA) on NVIDIA …

  17. The role of metadata in managing large environmental science datasets. Proceedings

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Melton, R.B.; DeVaney, D.M.; French, J. C.

    1995-06-01

    The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.

  18. Canopy area of large trees explains aboveground biomass variations across neotropical forest landscapes

    NASA Astrophysics Data System (ADS)

    Meyer, Victoria; Saatchi, Sassan; Clark, David B.; Keller, Michael; Vincent, Grégoire; Ferraz, António; Espírito-Santo, Fernando; d'Oliveira, Marcus V. N.; Kaki, Dahlia; Chave, Jérôme

    2018-06-01

    Large tropical trees store significant amounts of carbon in woody components and their distribution plays an important role in forest carbon stocks and dynamics. Here, we explore the properties of a new lidar-derived index, the large tree canopy area (LCA) defined as the area occupied by canopy above a reference height. We hypothesize that this simple measure of forest structure representing the crown area of large canopy trees could consistently explain the landscape variations in forest volume and aboveground biomass (AGB) across a range of climate and edaphic conditions. To test this hypothesis, we assembled a unique dataset of high-resolution airborne light detection and ranging (lidar) and ground inventory data in nine undisturbed old-growth Neotropical forests, of which four had plots large enough (1 ha) to calibrate our model. We found that the LCA for trees greater than 27 m (˜ 25-30 m) in height and at least 100 m2 crown size in a unit area (1 ha), explains more than 75 % of total forest volume variations, irrespective of the forest biogeographic conditions. When weighted by average wood density of the stand, LCA can be used as an unbiased estimator of AGB across sites (R2 = 0.78, RMSE = 46.02 Mg ha-1, bias = -0.63 Mg ha-1). Unlike other lidar-derived metrics with complex nonlinear relations to biomass, the relationship between LCA and AGB is linear and remains unique across forest types. A comparison with tree inventories across the study sites indicates that LCA correlates best with the crown area (or basal area) of trees with diameter greater than 50 cm. The spatial invariance of the LCA-AGB relationship across the Neotropics suggests a remarkable regularity of forest structure across the landscape and a new technique for systematic monitoring of large trees for their contribution to AGB and changes associated with selective logging, tree mortality and other types of tropical forest disturbance and dynamics.
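    Under stated assumptions, the LCA metric itself is straightforward to compute from a lidar canopy height model: keep the canopy above the reference height, drop connected patches smaller than the minimum crown size, and report the remaining area per unit. The function below is a hedged sketch using the thresholds quoted above; grid resolution handling and crown delineation in the study are more involved.

```python
# Large tree canopy area (LCA) sketch from a canopy height model (CHM).
import numpy as np
from scipy import ndimage


def lca_fraction(chm, pixel_size_m=1.0, height_m=27.0, min_crown_m2=100.0):
    """Fraction of the CHM footprint covered by large-tree canopy."""
    tall = np.asarray(chm) >= height_m
    labels, n = ndimage.label(tall)                    # connected canopy patches
    if n == 0:
        return 0.0
    patch_px = ndimage.sum(tall, labels, index=np.arange(1, n + 1))
    keep = patch_px * pixel_size_m**2 >= min_crown_m2  # drop small crowns
    return float(patch_px[keep].sum()) / tall.size
```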

  19. Thermalnet: a Deep Convolutional Network for Synthetic Thermal Image Generation

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.; Gorbatsevich, V. S.; Mizginov, V. A.

    2017-05-01

    Deep convolutional neural networks have dramatically changed the landscape of modern computer vision. Nowadays, methods based on deep neural networks show the best performance among image recognition and object detection algorithms. While the refinement of network architectures has received a lot of scholarly attention, from a practical point of view the preparation of a large image dataset for successful training of a neural network has become one of the major challenges. This challenge is particularly profound for image recognition in wavelengths lying outside the visible spectrum. For example, no infrared or radar image datasets large enough for successful training of a deep neural network are available to date in the public domain. Recent advances show that deep neural networks are also capable of arbitrary image transformations such as super-resolution image generation, grayscale image colorisation and imitation of the style of a given artist. Thus a natural question arises: how can deep neural networks be used to augment existing large image datasets? This paper is focused on the development of the Thermalnet deep convolutional neural network for augmentation of existing large visible image datasets with synthetic thermal images. The Thermalnet network architecture is inspired by colorisation deep neural networks.

  20. Robust high-performance nanoliter-volume single-cell multiple displacement amplification on planar substrates.

    PubMed

    Leung, Kaston; Klaus, Anders; Lin, Bill K; Laks, Emma; Biele, Justina; Lai, Daniel; Bashashati, Ali; Huang, Yi-Fei; Aniba, Radhouane; Moksa, Michelle; Steif, Adi; Mes-Masson, Anne-Marie; Hirst, Martin; Shah, Sohrab P; Aparicio, Samuel; Hansen, Carl L

    2016-07-26

    The genomes of large numbers of single cells must be sequenced to further understanding of the biological significance of genomic heterogeneity in complex systems. Whole genome amplification (WGA) of single cells is generally the first step in such studies, but is prone to nonuniformity that can compromise genomic measurement accuracy. Despite recent advances, robust performance in high-throughput single-cell WGA remains elusive. Here, we introduce droplet multiple displacement amplification (MDA), a method that uses commercially available liquid dispensing to perform high-throughput single-cell MDA in nanoliter volumes. The performance of droplet MDA is characterized using a large dataset of 129 normal diploid cells, and is shown to exceed previously reported single-cell WGA methods in amplification uniformity, genome coverage, and/or robustness. We achieve up to 80% coverage of a single-cell genome at 5× sequencing depth, and demonstrate excellent single-nucleotide variant (SNV) detection using targeted sequencing of droplet MDA product to achieve a median allelic dropout of 15%, and using whole genome sequencing to achieve false and true positive rates of 9.66 × 10(-6) and 68.8%, respectively, in a G1-phase cell. We further show that droplet MDA allows for the detection of copy number variants (CNVs) as small as 30 kb in single cells of an ovarian cancer cell line and as small as 9 Mb in two high-grade serous ovarian cancer samples using only 0.02× depth. Droplet MDA provides an accessible and scalable method for performing robust and accurate CNV and SNV measurements on large numbers of single cells.

  1. Big data - smart health strategies. Findings from the yearbook 2014 special theme.

    PubMed

    Koutkias, V; Thiessard, F

    2014-08-15

    To select the best papers published in 2013 in the field of big data and smart health strategies, and summarize outstanding research efforts. A systematic search was performed using two major bibliographic databases for relevant journal papers. The references obtained were reviewed in a two-stage process, starting with a blinded review performed by the two section editors, and followed by a peer review process operated by external reviewers recognized as experts in the field. The complete review process selected four best papers, illustrating various aspects of the special theme, among them: (a) using large volumes of unstructured data and, specifically, clinical notes from Electronic Health Records (EHRs) for pharmacovigilance; (b) knowledge discovery via querying large volumes of complex (both structured and unstructured) biological data using big data technologies and relevant tools; (c) methodologies for applying cloud computing and big data technologies in the field of genomics, and (d) system architectures enabling high-performance access to and processing of large datasets extracted from EHRs. The potential of big data in biomedicine has been pinpointed in various viewpoint papers and editorials. The review of current scientific literature illustrated a variety of interesting methods and applications in the field, but still the promises exceed the current outcomes. As we get closer to a solid foundation with respect to a common understanding of relevant concepts and technical aspects, and the use of standardized technologies and tools, we can anticipate realizing the potential that big data offer for personalized medicine and smart health strategies in the near future.

  2. Big Data - Smart Health Strategies

    PubMed Central

    2014-01-01

    Summary Objectives To select the best papers published in 2013 in the field of big data and smart health strategies, and summarize outstanding research efforts. Methods A systematic search was performed using two major bibliographic databases for relevant journal papers. The references obtained were reviewed in a two-stage process, starting with a blinded review performed by the two section editors, and followed by a peer review process operated by external reviewers recognized as experts in the field. Results The complete review process selected four best papers, illustrating various aspects of the special theme, among them: (a) using large volumes of unstructured data and, specifically, clinical notes from Electronic Health Records (EHRs) for pharmacovigilance; (b) knowledge discovery via querying large volumes of complex (both structured and unstructured) biological data using big data technologies and relevant tools; (c) methodologies for applying cloud computing and big data technologies in the field of genomics, and (d) system architectures enabling high-performance access to and processing of large datasets extracted from EHRs. Conclusions The potential of big data in biomedicine has been pinpointed in various viewpoint papers and editorials. The review of current scientific literature illustrated a variety of interesting methods and applications in the field, but still the promises exceed the current outcomes. As we get closer to a solid foundation with respect to a common understanding of relevant concepts and technical aspects, and the use of standardized technologies and tools, we can anticipate realizing the potential that big data offer for personalized medicine and smart health strategies in the near future. PMID:25123721

  3. SERVS: the Spitzer Extragalactic Representative Volume Survey

    NASA Astrophysics Data System (ADS)

    Lacy, Mark; Afonso, Jose; Alexander, Dave; Best, Philip; Bonfield, David; Castro, Nieves; Cava, Antonio; Chapman, Scott; Dunlop, James; Dyke, Eleanor; Edge, Alastair; Farrah, Duncan; Ferguson, Harry; Foucaud, Sebastian; Franceschini, Alberto; Geach, Jim; Gonzales, Eduardo; Hatziminaoglou, Evanthia; Hickey, Samantha; Ivison, Rob; Jarvis, Matt; Le Fèvre, Olivier; Lonsdale, Carol; Maraston, Claudia; McLure, Ross; Mortier, Angela; Oliver, Seb; Ouchi, Masami; Parish, Glen; Perez-Fournon, Ismael; Petric, Andreea; Pierre, Mauguerite; Readhead, Tony; Ridgway, Susan; Romer, Katherine; Rottgering, Huub; Rowan-Robinson, Michael; Sajina, Anna; Seymour, Nick; Smail, Ian; Surace, Jason; Thomas, Peter; Trichas, Markos; Vaccari, Mattia; Verma, Aprajita; Xu, Kevin; van Kampen, Eelco

    2008-12-01

    We will use warm Spitzer to image 18 deg^2 of sky to microJy depth. This is deep enough to undertake a complete census of massive galaxies from z~6 to ~1 in a volume ~0.8Gpc^3, large enough to overcome the effects of cosmic variance, which place severe limitations on the conclusions that can be drawn from smaller fields. We will greatly enhance the diagnostic power of the Spitzer data by performing most of this survey in the region covered by the near-IR VISTA-VIDEO survey, and in other areas covered by near-IR, Herschel and SCUBA2 surveys. We will build complete near-infrared spectral energy distributions using the superb datasets from VIDEO, in conjunction with our Spitzer data, to derive accurate photometric redshifts and the key properties of stellar mass and star formation rates for a large sample of high-z galaxies. Obscured star formation rates and dust-shrouded BH growth phases will be uncovered by combining the Spitzer data with the Herschel and SCUBA2 surveys. We will thus build a complete picture of the formation of massive galaxies from z~6, where only about 1% of the stars in massive galaxies have formed, to z~1 where ~50% of them have formed. Our large volume will allow us to also find examples of rare objects such as high-z quasars (~10-100 at z>6.5), high-z galaxy clusters (~20 at z>1.5 with dark halo masses >10^14 solar masses), and evaluate how quasar activity and galaxy environment affect star formation. This survey makes nearly optimal use of warm Spitzer: (a) all of the complementary data is either taken or will be taken in the very near future, and will be immediately publicly accessible, (b) the slew overheads are relatively small, (c) the observations are deep enough to detect high redshift galaxies but not so deep that source confusion reduces the effective survey area.

  4. A Regional Climate Model Evaluation System based on Satellite and other Observations

    NASA Astrophysics Data System (ADS)

    Lean, P.; Kim, J.; Waliser, D. E.; Hall, A. D.; Mattmann, C. A.; Granger, S. L.; Case, K.; Goodale, C.; Hart, A.; Zimdars, P.; Guan, B.; Molotch, N. P.; Kaki, S.

    2010-12-01

    Regional climate models are a fundamental tool needed for downscaling global climate simulations and projections, such as those contributing to the Coupled Model Intercomparison Projects (CMIPs) that form the basis of the IPCC Assessment Reports. The regional modeling process provides the means to accommodate higher resolution and a greater complexity of Earth System processes. Evaluation of both the global and regional climate models against observations is essential to identify model weaknesses and to direct future model development efforts focused on reducing the uncertainty associated with climate projections. However, the lack of reliable observational data and the lack of formal tools are among the serious limitations to addressing these objectives. Recent satellite observations are particularly useful as they provide a wealth of information on many different aspects of the climate system, but due to their large volume and the difficulties associated with accessing and using the data, these datasets have been generally underutilized in model evaluation studies. Recognizing this problem, NASA JPL / UCLA is developing a model evaluation system to help make satellite observations, in conjunction with in-situ, assimilated, and reanalysis datasets, more readily accessible to the modeling community. The system includes a central database to store multiple datasets in a common format and codes for calculating predefined statistical metrics to assess model performance. This allows the time taken to compare model simulations with satellite observations to be reduced from weeks to days. Early results from the use of this new model evaluation system for evaluating regional climate simulations over California/western US regions will be presented.
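    A hedged sketch of the kind of predefined metric such a system computes, assuming model output and satellite or reanalysis observations have already been put on a common grid and time axis, is shown below; the actual system supports many more metrics and also handles data access and regridding.

```python
# Bias and RMSE fields for model-versus-observation comparison on a common grid.
import numpy as np


def bias_and_rmse(model, obs, axis=0):
    """model, obs: arrays of shape (time, lat, lon) on a common grid."""
    diff = np.asarray(model, float) - np.asarray(obs, float)
    bias = diff.mean(axis=axis)                  # mean model-minus-obs
    rmse = np.sqrt((diff ** 2).mean(axis=axis))  # root-mean-square error
    return bias, rmse
```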

  5. Image quality of mean temporal arterial and mean temporal portal venous phase images calculated from low dose dynamic volume perfusion CT datasets in patients with hepatocellular carcinoma and pancreatic cancer.

    PubMed

    Wang, X; Henzler, T; Gawlitza, J; Diehl, S; Wilhelm, T; Schoenberg, S O; Jin, Z Y; Xue, H D; Smakic, A

    2016-11-01

    Dynamic volume perfusion CT (dVPCT) provides valuable information on tissue perfusion in patients with hepatocellular carcinoma (HCC) and pancreatic cancer. However, currently dVPCT is often performed in addition to conventional CT acquisitions due to the limited morphologic image quality of dose optimized dVPCT protocols. The aim of this study was to prospectively compare objective and subjective image quality, lesion detectability and radiation dose between mean temporal arterial (mTA) and mean temporal portal venous (mTPV) images calculated from low dose dynamic volume perfusion CT (dVPCT) datasets with linearly blended 120-kVp arterial and portal venous datasets in patients with HCC and pancreatic cancer. All patients gave written informed consent for this institutional review board-approved HIPAA compliant study. 27 consecutive patients (18 men, 9 women, mean age, 69.1 years±9.4) with histologically proven HCC or suspected pancreatic cancer were prospectively enrolled. The study CT protocol included a dVPCT protocol performed with 70 or 80kVp tube voltage (18 spiral acquisitions, 71.2s total acquisition times) and standard dual-energy (90/150kVpSn) arterial and portal venous acquisition performed 25min after the dVPCT. The mTA and mTPV images were manually reconstructed from the 3 to 5 visually selected best single arterial and 3 to 5 best single portal venous phases of the dVPCT dataset. The linearly blended 120-kVp images were calculated from dual-energy CT (DECT) raw data. Image noise, SNR, and CNR of the liver, abdominal aorta (AA) and main portal vein (PV) were compared between the mTA/mTPV and the linearly blended 120-kVp dual-energy arterial and portal venous datasets, respectively. Subjective image quality was evaluated by two radiologists regarding subjective image noise, sharpness and overall diagnostic image quality using a 5-point Likert Scale. In addition, liver lesion detectability was assessed for each liver segment by the two radiologists using the linearly blended 120-kVp arterial and portal venous datasets as the reference standard. Image noise, SNR and CNR values of the mTA and mTPV were significantly higher when compared to the corresponding linearly blended arterial and portal venous 120-kVp datasets (all p<0.001) except for image noise within the PV in the portal venous phases (p=0.136). Subjective image quality of mTA and mTPV was rated significantly better when compared to the linearly blended 120-kVp arterial and portal venous datasets. Both readers were able to detect all liver lesions found on the linearly blended 120-kVp arterial and portal venous datasets using the mTA and mTPV datasets. The effective radiation dose of the dVPCT was 27.6mSv for the 80kVp protocol and 14.5mSv for the 70kVp protocol. The mean effective radiation dose for the linearly blended 120-kVp arterial and portal venous CT protocol of the upper abdomen combined was 5.60mSv±1.48mSv. Our preliminary data suggest that subjective and objective image quality of mTA and mTPV datasets calculated from low-kVp dVPCT datasets is non-inferior when compared to linearly blended 120-kVp arterial and portal venous acquisitions in patients with HCC and pancreatic cancer. Thus, dVPCT could be used as a stand-alone imaging technique without additionally performed conventional arterial and portal venous CT acquisitions. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
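    The mean temporal images themselves amount to a voxel-wise average over the selected single-phase volumes, as sketched below under the assumption that the phase volumes are already co-registered; the visual selection of the 3 to 5 best arterial or portal venous phases is manual in the study and not reproduced here.

```python
# Voxel-wise average over selected dVPCT phases (mTA or mTPV sketch).
import numpy as np


def mean_temporal_image(phase_volumes, selected_indices):
    """phase_volumes: array of shape (n_phases, z, y, x)."""
    vols = np.asarray(phase_volumes, dtype=np.float32)
    return vols[list(selected_indices)].mean(axis=0)
```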

  6. Impact of gastric filling on radiation dose delivered to gastroesophageal junction tumors.

    PubMed

    Bouchard, Myriam; McAleer, Mary Frances; Starkschall, George

    2010-05-01

    This study examined the impact of gastric filling variation on target coverage of gastroesophageal junction (GEJ) tumors in three-dimensional conformal radiation therapy (3DCRT), intensity-modulated radiation therapy (IMRT), or IMRT with simultaneous integrated boost (IMRT-SIB) plans. Eight patients previously receiving radiation therapy for esophageal cancer had computed tomography (CT) datasets acquired with full stomach (FS) and empty stomach (ES). We generated treatment plans for 3DCRT, IMRT, or IMRT-SIB for each patient on the ES-CT and on the FS-CT datasets. The 3DCRT and IMRT plans were planned to 50.4 Gy to the clinical target volume (CTV), and the same for IMRT-SIB plus 63.0 Gy to the gross tumor volume (GTV). Target coverage was evaluated using dose-volume histogram data for patient treatments simulated with ES-CT sets, assuming treatment on an FS for the entire course, and vice versa. FS volumes were a mean of 3.3 (range, 1.7-7.5) times greater than ES volumes. The volume of the GTV receiving >or=50.4 Gy (V(50.4Gy)) was 100% in all situations. The planning GTV V(63Gy) became suboptimal when gastric filling varied, regardless of whether simulation was done on the ES-CT or the FS-CT set. Stomach filling has a negligible impact on prescribed dose delivered to the GEJ GTV, using either 3DCRT or IMRT planning. Thus, local relapses are not likely to be related to variations in gastric filling. Dose escalation for GEJ tumors with IMRT-SIB may require gastric filling monitoring.
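    The coverage metric quoted above, V_x, is the percentage of a target volume receiving at least x Gy; a minimal sketch is given below, assuming the dose grid and the target mask are co-registered arrays.

```python
# V_x from a 3D dose grid and a binary target mask (e.g. V_50.4Gy for the GTV).
import numpy as np


def v_x(dose_gy, target_mask, threshold_gy):
    """Percent of the target volume receiving >= threshold_gy."""
    target_dose = np.asarray(dose_gy)[np.asarray(target_mask, bool)]
    if target_dose.size == 0:
        raise ValueError("empty target mask")
    return 100.0 * np.mean(target_dose >= threshold_gy)

# Example: v_x(dose, gtv_mask, 50.4) == 100.0 corresponds to full GTV coverage.
```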

  7. A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video

    DTIC Science & Technology

    2011-06-01

    orders of magnitude larger than existing datasets such as CAVIAR [7]. TRECVID 2008 airport dataset [16] contains 100 hours of video, but it provides only...entire human figure (e.g., above shoulder), amounting to 500% human to video 2Some statistics are approximate, obtained from the CAVIAR 1st scene and...and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16] shown in Fig. 3

  8. Does using different modern climate datasets impact pollen-based paleoclimate reconstructions in North America during the past 2,000 years

    NASA Astrophysics Data System (ADS)

    Ladd, Matthew; Viau, Andre

    2013-04-01

    Paleoclimate reconstructions rely on the accuracy of modern climate datasets for calibration of fossil records under the assumption of climate normality through time, which means that the modern climate operates in a similar manner as over the past 2,000 years. In this study, we show how using different modern climate datasets has an impact on a pollen-based reconstruction of mean temperature of the warmest month (MTWA) during the past 2,000 years for North America. The modern climate datasets used to explore this research question include the Whitmore et al. (2005) modern climate dataset; the North American Regional Reanalysis (NARR); the National Center for Environmental Prediction (NCEP); the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-40 reanalysis; WorldClim; the Global Historical Climate Network (GHCN); and New et al., which is derived from the CRU dataset. Results show that some caution is advised in using the reanalysis data for large-scale reconstructions. Station data appears to dampen out the variability of the reconstruction produced using station-based datasets. The reanalysis or model-based datasets are not recommended for large-scale North American paleoclimate reconstructions, as they appear to lack some of the dynamics observed in station datasets (CRU), which resulted in warm-biased reconstructions compared to the station-based reconstructions. The Whitmore et al. (2005) modern climate dataset appears to be a compromise between CRU-based datasets and model-based datasets, except for the ERA-40. In addition, an ultra-high-resolution gridded climate dataset such as WorldClim may only be useful if the pollen calibration sites in North America have at least the same spatial precision. We reconstruct the MTWA to within +/-0.01°C by using an average of all curves derived from the different modern climate datasets, demonstrating the robustness of the procedure used. The use of an average of different modern datasets may reduce the impact of uncertainty in paleoclimate reconstructions; however, this has yet to be determined with certainty. Future evaluations using, for example, the newly developed Berkeley Earth surface temperature dataset should be tested against the paleoclimate record.

  9. Atlas-Guided Cluster Analysis of Large Tractography Datasets

    PubMed Central

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292
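
    As a rough illustration of the hierarchical grouping step described above (not the authors' toolkit, and without the atlas guidance), the sketch below clusters fiber tracts represented as fixed-length feature vectors with SciPy; the synthetic descriptors and cluster count are assumptions.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster

        # Hypothetical tract descriptors: each streamline resampled to 20 points (x, y, z),
        # flattened to a 60-dimensional feature vector; three synthetic "bundles".
        rng = np.random.default_rng(1)
        tracts = np.vstack([rng.normal(loc, 2.0, size=(300, 60)) for loc in (0.0, 15.0, 40.0)])

        # Agglomerative (hierarchical) clustering with Ward linkage.
        Z = linkage(tracts, method="ward")
        labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 bundles
        print(np.bincount(labels)[1:])                    # streamlines per putative bundle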

  10. Exploiting CMS data popularity to model the evolution of data management for Run-2 and beyond

    NASA Astrophysics Data System (ADS)

    Bonacorsi, D.; Boccali, T.; Giordano, D.; Girone, M.; Neri, M.; Magini, N.; Kuznetsov, V.; Wildish, T.

    2015-12-01

    During the LHC Run-1 data taking, all experiments collected large data volumes from proton-proton and heavy-ion collisions. The collisions data, together with massive volumes of simulated data, were replicated in multiple copies, transferred among various Tier levels, transformed/slimmed in format/content. These data were then accessed (both locally and remotely) by large groups of distributed analysis communities exploiting the WorldWide LHC Computing Grid infrastructure and services. While efficient data placement strategies - together with optimal data redistribution and deletions on demand - have become the core of static versus dynamic data management projects, little effort has so far been invested in understanding the detailed data-access patterns which surfaced in Run-1. These patterns, if understood, can be used as input to simulation of computing models at the LHC, to optimise existing systems by tuning their behaviour, and to explore next-generation CPU/storage/network co-scheduling solutions. This is of great importance, given that the scale of the computing problem will increase far faster than the resources available to the experiments, for Run-2 and beyond. Studying data-access patterns involves the validation of the quality of the monitoring data collected on the “popularity” of each dataset, the analysis of the frequency and pattern of accesses to different datasets by analysis end-users, the exploration of different views of the popularity data (by physics activity, by region, by data type), the study of the evolution of Run-1 data exploitation over time, the evaluation of the impact of different data placement and distribution choices on the available network and storage resources and their impact on the computing operations. This work presents some insights from studies on the popularity data from the CMS experiment. We present the properties of a range of physics analysis activities as seen by the data popularity, and make recommendations for how to tune the initial distribution of data in anticipation of how it will be used in Run-2 and beyond.
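
    A popularity study of this kind reduces, at its simplest, to aggregating access logs by dataset and time window. The sketch below shows such an aggregation with pandas; the dataset names, log schema and weekly binning are hypothetical and do not reflect the actual CMS monitoring pipeline.

        import pandas as pd

        # Hypothetical access log: one row per dataset access recorded by the monitoring system.
        log = pd.DataFrame({
            "dataset": ["/MinBias/RECO", "/MinBias/RECO", "/JetHT/AOD", "/JetHT/AOD", "/JetHT/AOD"],
            "site": ["T2_US", "T2_DE", "T2_US", "T1_IT", "T2_US"],
            "timestamp": pd.to_datetime(
                ["2012-03-01", "2012-03-02", "2012-03-01", "2012-03-09", "2012-03-10"]),
        })

        # Weekly access counts and number of distinct sites per dataset ("popularity").
        popularity = (log.groupby(["dataset", pd.Grouper(key="timestamp", freq="W")])
                         .agg(accesses=("site", "size"), sites=("site", "nunique"))
                         .reset_index())
        print(popularity)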

  11. NREL - SOWFA - Neutral - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  12. PNNL - WRF-LES - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  13. ANL - WRF-LES - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  14. LLNL - WRF-LES - Neutral - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  15. ANL - WRF-LES - Neutral - TTU

    DOE Data Explorer

    Kosovic, Branko

    2018-06-20

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  16. LANL - WRF-LES - Neutral - TTU

    DOE Data Explorer

    Kosovic, Branko

    2018-06-20

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  17. LANL - WRF-LES - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  18. A Computational Approach to Qualitative Analysis in Large Textual Datasets

    PubMed Central

    Evans, Michael S.

    2014-01-01

    In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on broader discourse, how to validate substantive inferences from small samples of textual data, and how to determine if identified cases are part of a consistent temporal pattern. PMID:24498398
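
    Probabilistic topic modeling of the kind described here can be sketched with scikit-learn's latent Dirichlet allocation; the toy corpus, topic count and library choice below are illustrative assumptions, not the author's actual pipeline or the 14,952-document newspaper sample.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = [
            "stem cell research funding debated in congress",
            "new therapy uses stem cells to repair heart tissue",
            "school board debates science curriculum standards",
            "city council votes on transit funding measure",
        ]

        # Bag-of-words representation, then probabilistic topic modeling (LDA).
        vec = CountVectorizer(stop_words="english")
        X = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

        terms = vec.get_feature_names_out()
        for k, topic in enumerate(lda.components_):
            top = [terms[i] for i in topic.argsort()[-5:][::-1]]
            print(f"topic {k}: {top}")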

  19. Evolving Deep Networks Using HPC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Young, Steven R.; Rose, Derek C.; Johnston, Travis

    While a large number of deep learning networks have been studied and published that produce outstanding results on natural image datasets, these datasets only make up a fraction of those to which deep learning can be applied. These datasets include text data, audio data, and arrays of sensors that have very different characteristics than natural images. As these “best” networks for natural images have been largely discovered through experimentation and cannot be proven optimal on some theoretical basis, there is no reason to believe that they are the optimal network for these drastically different datasets. Hyperparameter search is thus often a very important process when applying deep learning to a new problem. In this work we present an evolutionary approach to searching the possible space of network hyperparameters and construction that can scale to 18,000 nodes. This approach is applied to datasets of varying types and characteristics where we demonstrate the ability to rapidly find best hyperparameters in order to enable practitioners to quickly iterate between idea and result.
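
    The core loop of an evolutionary hyperparameter search can be sketched as below; this is not the authors' HPC system, and the search space, selection scheme and placeholder fitness function are assumptions for illustration only. In practice the fitness evaluation trains one network per individual, typically distributed across cluster nodes.

        import random

        SEARCH_SPACE = {
            "learning_rate": [1e-4, 1e-3, 1e-2],
            "num_layers": [2, 4, 6, 8],
            "filters": [16, 32, 64, 128],
        }

        def random_individual():
            return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

        def mutate(ind):
            child = dict(ind)
            key = random.choice(list(SEARCH_SPACE))
            child[key] = random.choice(SEARCH_SPACE[key])
            return child

        def fitness(ind):
            # Placeholder: in practice this trains a network with these hyperparameters
            # (one individual per node on a cluster) and returns validation accuracy.
            return -abs(ind["num_layers"] - 6) - abs(ind["filters"] - 64) / 64

        population = [random_individual() for _ in range(20)]
        for generation in range(10):
            population.sort(key=fitness, reverse=True)
            parents = population[:5]                      # keep the fittest individuals
            population = parents + [mutate(random.choice(parents)) for _ in range(15)]
        print(max(population, key=fitness))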

  20. Improving quantitative structure-activity relationship models using Artificial Neural Networks trained with dropout.

    PubMed

    Mendenhall, Jeffrey; Meiler, Jens

    2016-02-01

    Dropout is an Artificial Neural Network (ANN) training technique that has been shown to improve ANN performance across canonical machine learning (ML) datasets. Quantitative Structure Activity Relationship (QSAR) datasets used to relate chemical structure to biological activity in Ligand-Based Computer-Aided Drug Discovery pose unique challenges for ML techniques, such as heavily biased dataset composition, and relatively large number of descriptors relative to the number of actives. To test the hypothesis that dropout also improves QSAR ANNs, we conduct a benchmark on nine large QSAR datasets. Use of dropout improved both enrichment false positive rate and log-scaled area under the receiver-operating characteristic curve (logAUC) by 22-46 % over conventional ANN implementations. Optimal dropout rates are found to be a function of the signal-to-noise ratio of the descriptor set, and relatively independent of the dataset. Dropout ANNs with 2D and 3D autocorrelation descriptors outperform conventional ANNs as well as optimized fingerprint similarity search methods.
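
    The network architecture described here can be sketched in PyTorch as below; this is not the authors' implementation, and the descriptor count, hidden-layer size and dropout rate are illustrative assumptions (the study found optimal dropout rates to depend on the descriptor set).

        import torch
        import torch.nn as nn

        # Minimal fully connected QSAR-style network with dropout on the hidden layers.
        class DropoutANN(nn.Module):
            def __init__(self, n_descriptors=384, n_hidden=128, p_drop=0.25):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(n_descriptors, n_hidden), nn.ReLU(), nn.Dropout(p_drop),
                    nn.Linear(n_hidden, n_hidden), nn.ReLU(), nn.Dropout(p_drop),
                    nn.Linear(n_hidden, 1),  # activity score (active vs inactive)
                )

            def forward(self, x):
                return self.net(x)

        model = DropoutANN()
        x = torch.randn(8, 384)            # 8 hypothetical compounds x 384 descriptors
        target = torch.randint(0, 2, (8,)).float()
        loss = nn.BCEWithLogitsLoss()(model(x).squeeze(1), target)
        loss.backward()                    # dropout is active in training mode by default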

  1. Improving Quantitative Structure-Activity Relationship Models using Artificial Neural Networks Trained with Dropout

    PubMed Central

    Mendenhall, Jeffrey; Meiler, Jens

    2016-01-01

    Dropout is an Artificial Neural Network (ANN) training technique that has been shown to improve ANN performance across canonical machine learning (ML) datasets. Quantitative Structure Activity Relationship (QSAR) datasets used to relate chemical structure to biological activity in Ligand-Based Computer-Aided Drug Discovery (LB-CADD) pose unique challenges for ML techniques, such as heavily biased dataset composition, and relatively large number of descriptors relative to the number of actives. To test the hypothesis that dropout also improves QSAR ANNs, we conduct a benchmark on nine large QSAR datasets. Use of dropout improved both Enrichment false positive rate (FPR) and log-scaled area under the receiver-operating characteristic curve (logAUC) by 22–46% over conventional ANN implementations. Optimal dropout rates are found to be a function of the signal-to-noise ratio of the descriptor set, and relatively independent of the dataset. Dropout ANNs with 2D and 3D autocorrelation descriptors outperform conventional ANNs as well as optimized fingerprint similarity search methods. PMID:26830599

  2. Registration uncertainties between 3D cone beam computed tomography and different reference CT datasets in lung stereotactic body radiation therapy.

    PubMed

    Oechsner, Markus; Chizzali, Barbara; Devecka, Michal; Combs, Stephanie Elisabeth; Wilkens, Jan Jakob; Duma, Marciana Nona

    2016-10-26

    The aim of this study was to analyze differences in couch shifts (setup errors) resulting from image registration of different CT datasets with free breathing cone beam CTs (FB-CBCT). Both automatic and manual image registrations were performed, and the registration results were correlated to tumor characteristics. FB-CBCT image registration was performed for 49 patients with lung lesions using slow planning CT (PCT), average intensity projection (AIP), maximum intensity projection (MIP) and mid-ventilation CTs (MidV) as reference images. Both automatic and manual image registrations were applied. Shift differences were evaluated between the registered CT datasets for automatic and manual registration, respectively. Furthermore, differences between automatic and manual registration were analyzed for the same CT datasets. The registration results were statistically analyzed and correlated to tumor characteristics (3D tumor motion, tumor volume, superior-inferior (SI) distance, tumor environment). Median 3D shift differences over all patients were between 0.5 mm (AIPvsMIP) and 1.9 mm (MIPvsPCT and MidVvsPCT) for the automatic registration and between 1.8 mm (AIPvsPCT) and 2.8 mm (MIPvsPCT and MidVvsPCT) for the manual registration. For some patients, large shift differences (>5.0 mm) were found (maximum 10.5 mm, automatic registration). Comparing automatic vs manual registrations for the same reference CTs, ∆AIP achieved the smallest (1.1 mm) and ∆MIP the largest (1.9 mm) median 3D shift differences. The standard deviation (variability) of the 3D shift differences was also the smallest for ∆AIP (1.1 mm). Significant correlations (p < 0.01) were found between 3D shift difference and 3D tumor motion (AIPvsMIP, MIPvsMidV) and SI distance (AIPvsMIP) for the automatic registration, and also for 3D tumor motion (∆PCT, ∆MidV; automatic vs manual). Using different CT datasets for image registration with FB-CBCTs can result in different 3D couch shifts. Manual registrations achieved partly different 3D shifts than automatic registrations. AIP CTs yielded the smallest shift differences and might be the most appropriate CT dataset for registration with 3D FB-CBCTs.

  3. Online Visualization and Analysis of Merged Global Geostationary Satellite Infrared Dataset

    NASA Technical Reports Server (NTRS)

    Liu, Zhong; Ostrenga, D.; Leptoukh, G.; Mehta, A.

    2008-01-01

    The NASA Goddard Earth Sciences Data Information Services Center (GES DISC) is home of the Tropical Rainfall Measuring Mission (TRMM) data archive. The global merged IR product, also known as the NCEP/CPC 4-km Global (60 degrees N - 60 degrees S) IR Dataset, is one of the TRMM ancillary datasets. They are globally merged (60 degrees N - 60 degrees S) pixel-resolution (4 km) IR brightness temperature data (equivalent blackbody temperatures), merged from all available geostationary satellites (GOES-8/10, METEOSAT-7/5 and GMS). The availability of data from METEOSAT-5, which is located at 63E at the present time, yields a unique opportunity for total global (60 degrees N - 60 degrees S) coverage. The GES DISC has collected over 8 years of the data beginning from February of 2000. This high temporal resolution dataset can not only provide additional background information to TRMM and other satellite missions, but also allow observing a wide range of meteorological phenomena from space, such as mesoscale convection systems, tropical cyclones, hurricanes, etc. The dataset can also be used to verify model simulations. Although the data can be downloaded via ftp, the large volume poses a challenge for many users. A single file occupies about 70 MB of disk space and there is a total of approximately 73,000 files (approximately 4.5 TB) for the past 8 years. In order to facilitate data access, we have developed a web prototype to allow users to conduct online visualization and analysis of this dataset. With a web browser and a few mouse clicks, users can have full access to over 8 years and over 4.5 TB of data and generate black and white IR imagery and animation without downloading any software and data. In short, you can make your own images! Basic functions include selection of area of interest, single imagery or animation, a time skip capability for different temporal resolution and image size. Users can save an animation as a file (animated gif) and import it in other presentation software, such as Microsoft PowerPoint. The prototype will be integrated into GIOVANNI, and existing GIOVANNI capabilities, such as data download, Google Earth KMZ, etc., will be available. Users will also be able to access other data products in the GIOVANNI family.

  4. A statistical comparison of cirrus particle size distributions measured using the 2-D stereo probe during the TC4, SPARTICUS, and MACPEX flight campaigns with historical cirrus datasets

    NASA Astrophysics Data System (ADS)

    Schwartz, M. Christian

    2017-08-01

    This paper addresses two straightforward questions. First, how similar are the statistics of cirrus particle size distribution (PSD) datasets collected using the Two-Dimensional Stereo (2D-S) probe to cirrus PSD datasets collected using older Particle Measuring Systems (PMS) 2-D Cloud (2DC) and 2-D Precipitation (2DP) probes? Second, how similar are the datasets when shatter-correcting post-processing is applied to the 2DC datasets? To answer these questions, a database of measured and parameterized cirrus PSDs - constructed from measurements taken during the Small Particles in Cirrus (SPARTICUS); Mid-latitude Airborne Cirrus Properties Experiment (MACPEX); and Tropical Composition, Cloud, and Climate Coupling (TC4) flight campaigns - is used. Bulk cloud quantities are computed from the 2D-S database in three ways: first, directly from the 2D-S data; second, by applying the 2D-S data to ice PSD parameterizations developed using sets of cirrus measurements collected using the older PMS probes; and third, by applying the 2D-S data to a similar parameterization developed using the 2D-S data themselves. This is done so that measurements of the same cloud volumes by parameterized versions of the 2DC and 2D-S can be compared with one another. It is thereby seen - given the same cloud field and given the same assumptions concerning ice crystal cross-sectional area, density, and radar cross section - that the parameterized 2D-S and the parameterized 2DC predict similar distributions of inferred shortwave extinction coefficient, ice water content, and 94 GHz radar reflectivity. However, the parameterization of the 2DC based on uncorrected data predicts a statistically significantly higher number of total ice crystals and a larger ratio of small ice crystals to large ice crystals than does the parameterized 2D-S. The 2DC parameterization based on shatter-corrected data also predicts statistically different numbers of ice crystals than does the parameterized 2D-S, but the comparison between the two is nevertheless more favorable. It is concluded that the older datasets continue to be useful for scientific purposes, with certain caveats, and that continuing field investigations of cirrus with more modern probes is desirable.

  5. Visualizing astronomy data using VRML

    NASA Astrophysics Data System (ADS)

    Beeson, Brett; Lancaster, Michael; Barnes, David G.; Bourke, Paul D.; Rixon, Guy T.

    2004-09-01

    Visualisation is a powerful tool for understanding the large data sets typical of astronomical surveys and can reveal unsuspected relationships and anomalous regions of parameter space which may be difficult to find programmatically. Visualisation is a classic information technology for optimising scientific return. We are developing a number of generic on-line visualisation tools as a component of the Australian Virtual Observatory project. The tools will be deployed within the framework of the International Virtual Observatory Alliance (IVOA), and follow agreed-upon standards to make them accessible by other programs and people. We and our IVOA partners plan to utilise new information technologies (such as grid computing and web services) to advance the scientific return of existing and future instrumentation. Here we present a new tool - VOlume - which visualises point data. Visualisation of astronomical data normally requires the local installation of complex software, the downloading of potentially large datasets, and very often time-consuming and tedious data format conversions. VOlume enables the astronomer to visualise data using just a web browser and plug-in. This is achieved using IVOA standards which allow us to pass data between Web Services, Java Servlet Technology and Common Gateway Interface programs. Data from a catalogue server can be streamed in eXtensible Mark-up Language format to a servlet which produces Virtual Reality Modeling Language output. The user selects elements of the catalogue to map to geometry and then visualises the result in a browser plug-in such as Cortona or FreeWRL. Other than requiring an input VOTable format file, VOlume is very general. While its major use will likely be to display and explore astronomical source catalogues, it can easily render other important parameter fields such as the sky and redshift coverage of proposed surveys or the sampling of the visibility plane by a rotation-synthesis interferometer.

  6. [Database supported electronic retrospective analyses in radiation oncology: establishing a workflow using the example of pancreatic cancer].

    PubMed

    Kessel, K A; Habermehl, D; Bohn, C; Jäger, A; Floca, R O; Zhang, L; Bougatf, N; Bendl, R; Debus, J; Combs, S E

    2012-12-01

    Especially in the field of radiation oncology, handling a large variety of voluminous datasets from various information systems in different documentation styles efficiently is crucial for patient care and research. To date, conducting retrospective clinical analyses is rather difficult and time consuming. With the example of patients with pancreatic cancer treated with radio-chemotherapy, we performed a therapy evaluation by using an analysis system connected with a documentation system. A total of 783 patients have been documented into a professional, database-based documentation system. Information about radiation therapy, diagnostic images and dose distributions has been imported into the web-based system. For 36 patients with disease progression after neoadjuvant chemoradiation, we designed and established an analysis workflow. After an automatic registration of the radiation plans with the follow-up images, the recurrence volumes are segmented manually. Based on these volumes, the DVH (dose volume histogram) statistic is calculated, followed by the determination of the dose applied to the region of recurrence. All results are saved in the database and included in statistical calculations. The main goal of using an automatic analysis tool is to reduce the time and effort of conducting clinical analyses, especially with large patient groups. We showed a first approach and use of some existing tools; however, manual interaction is still necessary. Further steps need to be taken to enhance automation. Already, it has become apparent that the benefits of digital data management and analysis lie in the central storage of data and reusability of the results. Therefore, we intend to adapt the analysis system to other types of tumors in radiation oncology.
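
    The DVH statistic mentioned above can be computed from a dose grid and a binary structure mask roughly as sketched below; the array shapes, dose values and D95 readout are assumptions for illustration and not the described analysis system.

        import numpy as np

        def cumulative_dvh(dose, mask, bins=100):
            """Cumulative dose-volume histogram for the voxels inside a binary mask."""
            voxel_doses = dose[mask > 0]
            edges = np.linspace(0.0, voxel_doses.max(), bins)
            # Fraction of the structure volume receiving at least each dose level.
            volume_fraction = [(voxel_doses >= d).mean() * 100.0 for d in edges]
            return edges, np.array(volume_fraction)

        # Hypothetical 3D dose grid (Gy) and recurrence-volume mask of the same shape.
        rng = np.random.default_rng(0)
        dose = rng.normal(45.0, 5.0, size=(50, 50, 30)).clip(min=0)
        mask = np.zeros_like(dose, dtype=bool)
        mask[20:30, 20:30, 10:20] = True

        dose_levels, volume_pct = cumulative_dvh(dose, mask)
        print(f"D95 ~ {dose_levels[np.argmax(volume_pct <= 95)]:.1f} Gy")  # rough D95 estimate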

  7. Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways.

    PubMed

    Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei

    2017-02-01

    Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.

  8. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets.

    PubMed

    McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr

    2017-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension, functionality supporting preservation of file system structure within Dataverse, which is essential for both in-place computation and supporting non-HTTP data transfers. © 2016 New York Academy of Sciences.

  9. Rapid and accurate species tree estimation for phylogeographic investigations using replicated subsampling.

    PubMed

    Hird, Sarah; Kubatko, Laura; Carstens, Bryan

    2010-11-01

    We describe a method for estimating species trees that relies on replicated subsampling of large data matrices. One application of this method is phylogeographic research, which has long depended on large datasets that sample intensively from the geographic range of the focal species; these datasets allow systematicists to identify cryptic diversity and understand how contemporary and historical landscape forces influence genetic diversity. However, analyzing any large dataset can be computationally difficult, particularly when newly developed methods for species tree estimation are used. Here we explore the use of replicated subsampling, a potential solution to the problem posed by large datasets, with both a simulation study and an empirical analysis. In the simulations, we sample different numbers of alleles and loci, estimate species trees using STEM, and compare the estimated to the actual species tree. Our results indicate that subsampling three alleles per species for eight loci nearly always results in an accurate species tree topology, even in cases where the species tree was characterized by extremely rapid divergence. Even more modest subsampling effort, for example one allele per species and two loci, was more likely than not (>50%) to identify the correct species tree topology, indicating that in nearly all cases, computing the majority-rule consensus tree from replicated subsampling provides a good estimate of topology. These results were supported by estimating the correct species tree topology and reasonable branch lengths for an empirical 10-locus great ape dataset. Copyright © 2010 Elsevier Inc. All rights reserved.
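
    The replicated subsampling scheme can be sketched as below; the data structure, allele counts and topology strings are hypothetical, and the tree-estimation step is only a placeholder standing in for running STEM (or another species-tree method) on each subsampled matrix.

        import random
        from collections import Counter

        # Hypothetical multilocus dataset: alleles[species][locus] = list of sampled allele IDs.
        alleles = {sp: {loc: [f"{sp}_a{i}" for i in range(10)] for loc in range(20)}
                   for sp in ("sp1", "sp2", "sp3", "sp4")}

        def subsample(n_alleles=3, n_loci=8):
            """Draw a random subset of loci and a few alleles per species at each locus."""
            loci = random.sample(range(20), n_loci)
            return {sp: {loc: random.sample(alleles[sp][loc], n_alleles) for loc in loci}
                    for sp in alleles}

        def estimate_species_tree(subset):
            # Placeholder: in the study this step runs a species-tree method on the
            # subsampled matrix and returns the estimated topology.
            return random.choice(["((sp1,sp2),(sp3,sp4))", "((sp1,sp3),(sp2,sp4))"])

        replicates = [estimate_species_tree(subsample()) for _ in range(100)]
        print(Counter(replicates).most_common(1))   # majority-rule summary across replicates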

  10. Machine Learning, Sentiment Analysis, and Tweets: An Examination of Alzheimer's Disease Stigma on Twitter.

    PubMed

    Oscar, Nels; Fox, Pamela A; Croucher, Racheal; Wernick, Riana; Keune, Jessica; Hooker, Karen

    2017-09-01

    Social scientists need practical methods for harnessing large, publicly available datasets that inform the social context of aging. We describe our development of a semi-automated text coding method and use a content analysis of Alzheimer's disease (AD) and dementia portrayal on Twitter to demonstrate its use. The approach improves feasibility of examining large publicly available datasets. Machine learning techniques modeled stigmatization expressed in 31,150 AD-related tweets collected via Twitter's search API based on 9 AD-related keywords. Two researchers manually coded 311 random tweets on 6 dimensions. This input from 1% of the dataset was used to train a classifier against the tweet text and code the remaining 99% of the dataset. Our automated process identified that 21.13% of the AD-related tweets used AD-related keywords to perpetuate public stigma, which could impact stereotypes and negative expectations for individuals with the disease and increase "excess disability". This technique could be applied to questions in social gerontology related to how social media outlets reflect and shape attitudes bearing on other developmental outcomes. Recommendations for the collection and analysis of large Twitter datasets are discussed. © The Author 2017. Published by Oxford University Press on behalf of The Gerontological Society of America. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
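
    The semi-automated coding step, training a classifier on a small hand-coded seed set and applying it to the remainder of the collection, can be sketched as below. The example tweets, labels, single binary dimension and scikit-learn pipeline are assumptions for illustration; the study coded six dimensions with its own feature set.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Hypothetical hand-coded seed set (the study manually coded ~1% of tweets).
        seed_tweets = ["grandma forgot my name again lol #alzheimers",
                       "new alzheimer's drug enters phase 3 trial",
                       "omg I'm so alzheimers today, lost my keys twice",
                       "caregiver support group meets thursday for dementia families"]
        seed_labels = [1, 0, 1, 0]          # 1 = stigmatizing use of AD-related keywords

        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
        clf = LogisticRegression().fit(vec.fit_transform(seed_tweets), seed_labels)

        # Apply the trained classifier to the unlabeled remainder of the collection.
        unlabeled = ["feeling so alzheimers, can't remember anything haha"]
        print(clf.predict(vec.transform(unlabeled)))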

  11. Parallel task processing of very large datasets

    NASA Astrophysics Data System (ADS)

    Romig, Phillip Richardson, III

    This research concerns the use of distributed computer technologies for the analysis and management of very large datasets. Improvements in sensor technology, an emphasis on global change research, and greater access to data warehouses all increase the number of non-traditional users of remotely sensed data. We present a framework for distributed solutions to the challenges of datasets which exceed the online storage capacity of individual workstations. This framework, called parallel task processing (PTP), incorporates both the task- and data-level parallelism exemplified by many image processing operations. An implementation based on the principles of PTP, called Tricky, is also presented. Additionally, we describe the challenges and practical issues in modeling the performance of parallel task processing with large datasets. We present a mechanism for estimating the running time of each unit of work within a system and an algorithm that uses these estimates to simulate the execution environment and produce estimated runtimes. Finally, we describe and discuss experimental results which validate the design. Specifically, the system (a) is able to perform computation on datasets which exceed the capacity of any one disk, (b) provides reduction of overall computation time as a result of the task distribution even with the additional cost of data transfer and management, and (c) in the simulation mode accurately predicts the performance of the real execution environment.

  12. Massachusetts Institute of Technology Lincoln Laboratory Journal Volume 6, Number 1, Spring 1993

    DTIC Science & Technology

    1993-01-01

    Snippet (Kreithen et al., on discriminating targets from clutter): fractal dimension clearly has the potential to discriminate targets from clutter. Figure 13 is an example of this dataset; it shows a river with tree-lined banks, and the smooth green areas are open fields.

  13. Semi-automated surface mapping via unsupervised classification

    NASA Astrophysics Data System (ADS)

    D'Amore, M.; Le Scaon, R.; Helbert, J.; Maturilli, A.

    2017-09-01

    Due to the increasing volume of data returned from space missions, the human search for correlations and the identification of interesting features becomes more and more unfeasible. Statistical extraction of features via machine learning methods will increase the scientific output of remote sensing missions and aid the discovery of yet unknown features hidden in the datasets. These methods exploit algorithms trained on features from multiple instruments, returning classification maps that explore intra-dataset correlation and allow for the discovery of unknown features. We present two applications, one for Mercury and one for Vesta.
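
    Unsupervised classification of co-registered, multi-instrument data can be sketched with a simple clustering of per-pixel spectra, as below; k-means is used here only as one common choice, and the synthetic data cube and cluster count are assumptions rather than the authors' actual workflow.

        import numpy as np
        from sklearn.cluster import KMeans

        # Hypothetical hyperspectral cube: 200 x 200 pixels, 32 spectral channels
        # (e.g., co-registered bands from several instruments).
        rng = np.random.default_rng(0)
        cube = rng.normal(size=(200, 200, 32))

        pixels = cube.reshape(-1, cube.shape[-1])           # one spectrum per pixel
        labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pixels)
        class_map = labels.reshape(cube.shape[:2])          # unsupervised surface-unit map
        print(np.unique(class_map, return_counts=True))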

  14. Faster, efficient and secure collection of research images: the utilization of cloud technology to expand the OMI-DB

    NASA Astrophysics Data System (ADS)

    Patel, M. N.; Young, K.; Halling-Brown, M. D.

    2018-03-01

    The demand for medical images for research is ever increasing owing to the rapid rise in novel machine learning approaches for early detection and diagnosis. The OPTIMAM Medical Image Database (OMI-DB)1,2 was created to provide a centralized, fully annotated dataset for research. The database contains both processed and unprocessed images, associated data, annotations and expert-determined ground truths. Since the inception of the database in early 2011, the volume of images and associated data collected has dramatically increased owing to automation of the collection pipeline and inclusion of new sites. Currently, these data are stored at each respective collection site and synced periodically to a central store. This leads to a large data footprint at each site, requiring large physical onsite storage, which is expensive. Here, we propose an update to the OMI-DB collection system, whereby the storage of all the data is automatically transferred to the cloud on collection. This change in the data collection paradigm reduces the reliance on physical servers at each site, allows greater scope for future expansion, removes the need for dedicated backups and improves security. Moreover, with the number of applications to access the data increasing rapidly as the dataset matures, cloud technology facilitates faster sharing of data and better auditing of data access. Such updates, although they may sound trivial, require substantial modification to the existing pipeline to ensure data integrity and security compliance. Here, we describe the extensions to the OMI-DB collection pipeline and discuss the relative merits of the new system.

  15. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE PAGES

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus; ...

    2016-04-12

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzyme sites. Predicted agarose gel visualization of the custom analysis results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple fasta-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS, including CAPS molecular marker development, were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.
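
    The underlying analysis, counting restriction-site motifs in a sequence and deriving predicted fragment sizes, can be sketched in plain Python as below. This is not CisSERS itself: the enzyme list is a small hard-coded sample (real tools pull recognition sites from REBASE), and the digest ignores exact cut offsets within each motif.

        import re

        # A few common recognition sites (enzyme: motif); illustrative subset only.
        ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "TaqI": "TCGA"}

        def predicted_digest(sequence, motif):
            """Cut positions and fragment lengths for one recognition motif."""
            cuts = [m.start() for m in re.finditer(motif, sequence.upper())]
            edges = [0] + cuts + [len(sequence)]
            return cuts, [b - a for a, b in zip(edges, edges[1:])]

        seq = "ATGGAATTCGCTAGCTAAGCTTGGGTCGATTTGAATTCCA"
        for enzyme, motif in ENZYMES.items():
            cuts, fragments = predicted_digest(seq, motif)
            print(enzyme, "sites:", len(cuts), "fragment sizes:", fragments)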

  16. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzyme sites. Predicted agarose gel visualization of the custom analysis results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple fasta-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS, including CAPS molecular marker development, were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.

  17. Data Publishing and Sharing Via the THREDDS Data Repository

    NASA Astrophysics Data System (ADS)

    Wilson, A.; Caron, J.; Davis, E.; Baltzer, T.

    2007-12-01

    The terms "Team Science" and "Networked Science" have been coined to describe a virtual organization of researchers tied via some intellectual challenge, but often located in different organizations and locations. A critical component to these endeavors is publishing and sharing of content, including scientific data. Imagine pointing your web browser to a web page that interactively lets you upload data and metadata to a repository residing on a remote server, which can then be accessed by others in a secure fasion via the web. While any content can be added to this repository, it is designed particularly for storing and sharing scientific data and metadata. Server support includes uploading of data files that can subsequently be subsetted, aggregrated, and served in NetCDF or other scientific data formats. Metadata can be associated with the data and interactively edited. The THREDDS Data Repository (TDR) is a server that provides client initiated, on demand, location transparent storage for data of any type that can then be served by the THREDDS Data Server (TDS). The TDR provides functionality to: * securely store and "own" data files and associated metadata * upload files via HTTP and gridftp * upload a collection of data as single file * modify and restructure repository contents * incorporate metadata provided by the user * generate additional metadata programmatically * edit individual metadata elements The TDR can exist separately from a TDS, serving content via HTTP. Also, it can work in conjunction with the TDS, which includes functionality to provide: * access to data in a variety of formats via -- OPeNDAP -- OGC Web Coverage Service (for gridded datasets) -- bulk HTTP file transfer * a NetCDF view of datasets in NetCDF, OPeNDAP, HDF-5, GRIB, and NEXRAD formats * serving of very large volume datasets, such as NEXRAD radar * aggregation into virtual datasets * subsetting via OPeNDAP and NetCDF Subsetting services This talk will discuss TDR/TDS capabilities as well as how users can install this software to create their own repositories.

  18. Bayesian automated cortical segmentation for neonatal MRI

    NASA Astrophysics Data System (ADS)

    Chou, Zane; Paquette, Natacha; Ganesh, Bhavana; Wang, Yalin; Ceschin, Rafael; Nelson, Marvin D.; Macyszyn, Luke; Gaonkar, Bilwaj; Panigrahy, Ashok; Lepore, Natasha

    2017-11-01

    Several attempts have been made in the past few years to develop and implement an automated segmentation of neonatal brain structural MRI. However, accurate automated MRI segmentation remains challenging in this population because of the low signal-to-noise ratio, large partial volume effects and inter-individual anatomical variability of the neonatal brain. In this paper, we propose a learning method for segmenting the whole brain cortical grey matter on neonatal T2-weighted images. We trained our algorithm using a neonatal dataset composed of 3 fullterm and 4 preterm infants scanned at term equivalent age. Our segmentation pipeline combines the FAST algorithm from the FSL library software and a Bayesian segmentation approach to create a threshold matrix that minimizes the error of mislabeling brain tissue types. Our method shows promising results with our pilot training set. In both preterm and full-term neonates, automated Bayesian segmentation generates a smoother and more consistent parcellation compared to FAST, while successfully removing the subcortical structure and cleaning the edges of the cortical grey matter. This method shows promising refinement of the FAST segmentation by considerably reducing the manual input and editing required from the user, further improving the reliability and processing time of neonatal MR images. Further improvements will include a larger dataset of training images acquired from different manufacturers.

  19. Venus mesospheric sulfur dioxide measurement retrieved from SOIR on board Venus Express

    NASA Astrophysics Data System (ADS)

    Mahieux, A.; Vandaele, A. C.; Robert, S.; Wilquet, V.; Drummond, R.; Chamberlain, S.; Belyaev, D.; Bertaux, J. L.

    2015-08-01

    SOIR on board Venus Express sounds the Venus upper atmosphere using the solar occultation technique. It detects the signature from many Venus atmosphere species, including those of SO2 and CO2. SO2 has a weak absorption structure at 4 μm, from which number density profiles are regularly inferred. SO2 volume mixing ratios (VMR) are calculated from the total number density, which is also derived from the SOIR measurements. This work is an update of the previous work by Belyaev et al. (2012), considering the SO2 profiles on a broader altitude range, from 65 to 85 km. Positive detection VMR profiles are presented. In 68% of the occultation spectral datasets, SO2 is detected. The SO2 VMR profiles show a large variability, up to two orders of magnitude, on short time scales. We present mean VMR profiles for various latitude bins and study the latitudinal variations; the mean latitude variations are much smaller than the short term temporal variations. A permanent minimum showing a weak latitudinal structure is observed. Long term temporal trends are also considered and discussed. The trend observed by Marcq et al. (2013) is not observed in this dataset. Our results are compared to literature data and generally show a good agreement.

  20. Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier.

    PubMed

    Kumar, Mukesh; Rath, Nitish Kumar; Rath, Santanu Kumar

    2016-04-01

    Microarray-based gene expression profiling has emerged as an efficient technique for classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. They often contain a large amount of expression data, but only a fraction of it comprises genes that are significantly expressed. The precise identification of the genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process such as feature selection/extraction followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is also employed to classify microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis is done on these MapReduce-based models using microarray datasets of various dimensions. From the obtained results, it is observed that these models consume much less execution time than conventional models in processing big data. Copyright © 2016 Elsevier Inc. All rights reserved.
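
    The map/reduce decomposition of a K-nearest-neighbor classifier can be sketched in plain Python as below; this is not the paper's Hadoop implementation, and the shard layout, feature counts and distance metric are assumptions chosen only to show how each mapper returns local candidates and the reducer merges them.

        from collections import Counter
        import numpy as np

        def map_phase(shard, query, k):
            """Map: each data shard returns its k locally nearest labeled samples to the query."""
            X, y = shard
            d = np.linalg.norm(X - query, axis=1)
            idx = np.argsort(d)[:k]
            return list(zip(d[idx], y[idx]))

        def reduce_phase(partials, k):
            """Reduce: merge per-shard candidates and vote among the global k nearest."""
            nearest = sorted((p for part in partials for p in part))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]

        # Hypothetical gene-expression shards: (samples x genes, class labels) per mapper.
        rng = np.random.default_rng(0)
        shards = [(rng.normal(c, 1.0, size=(50, 100)), np.full(50, c)) for c in (0, 1, 2)]
        query = rng.normal(1.0, 1.0, size=100)

        k = 5
        partials = [map_phase(s, query, k) for s in shards]   # these map tasks run in parallel
        print("predicted class:", reduce_phase(partials, k))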

  1. Digital Object Identifiers (DOI's) usage and adoption in U.S Geological Survey (USGS)

    NASA Astrophysics Data System (ADS)

    Frame, M. T.; Palanisamy, G.

    2013-12-01

    Addressing grand environmental science challenges requires unprecedented access to easily understood data that cross the breadth of temporal, spatial, and thematic scales. From a scientist's perspective, the big challenges lie in discovering the relevant data, dealing with extreme data heterogeneity, large data volumes, and converting data to information and knowledge. Historical linkages between derived products, i.e., publications, and associated datasets have not existed in the earth science community. The USGS Core Science Analytics and Synthesis, in collaboration with DOE's Oak Ridge National Laboratory (ORNL) Mercury Consortium (funded by NASA, USGS and DOE), established a Digital Object Identifier (DOI) service for USGS data, metadata, and other media. This service is offered in partnership through the University of California Digital Library EZID service. USGS scientists, data managers, and other professionals can generate globally unique, persistent and resolvable identifiers for any kind of digital object. Additional efforts to assign DOIs to historical data and publications have also been underway. These DOI identifiers are being used to cite data in journal articles, web-accessible datasets, and other media for distribution, integration, and in support of improved data management practices. The session will discuss the current DOI efforts within USGS, including a discussion of adoption, challenges, and future efforts necessary to improve access, reuse, sharing, and discoverability of USGS data and information.

  2. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulations in the cell, the gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while taking robustness to large errors or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both the areas under the receiver operating characteristic curves and the areas under the precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated by the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
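
    One plausible way to write such an objective (an assumption for illustration; the paper's exact formulation, weighting and grouping may differ) combines a Huber loss over the D datasets with a group penalty that ties each candidate regulatory coefficient across datasets:

        \min_{\beta_{1},\dots,\beta_{D}}\;
        \sum_{d=1}^{D}\sum_{t}\rho_{\delta}\!\bigl(y_{d}(t)-x_{d}(t)^{\top}\beta_{d}\bigr)
        \;+\;\lambda\sum_{j=1}^{p}\bigl\lVert(\beta_{1j},\dots,\beta_{Dj})\bigr\rVert_{2},
        \qquad
        \rho_{\delta}(r)=
        \begin{cases}
          \tfrac{1}{2}r^{2}, & |r|\le\delta,\\
          \delta|r|-\tfrac{1}{2}\delta^{2}, & |r|>\delta.
        \end{cases}

    Here the l2 norm over each group (beta_1j, ..., beta_Dj) encourages the same candidate edge j to be selected or excluded in every dataset, giving a shared topology, while the Huber loss caps the influence of large residuals caused by outliers.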

  3. Imbalanced class learning in epigenetics.

    PubMed

    Haque, M Muksitul; Skinner, Michael K; Holder, Lawrence B

    2014-07-01

    In machine learning, one of the important criteria for higher classification accuracy is a balanced dataset. Datasets with a large ratio between minority and majority classes face hindrance in learning using any classifier. Datasets having a difference of an order of magnitude in the number of instances of the target concepts result in an imbalanced class distribution. Such datasets can range from biological data, sensor data, medical diagnostics, or any other domain where labeling instances of the minority class can be time-consuming or costly or the data may not be easily available. The current study investigates a number of imbalanced class algorithms for solving the imbalanced class distribution present in epigenetic datasets. Epigenetic (DNA methylation) datasets inherently come with few differentially DNA methylated regions (DMR) and a higher number of non-DMR sites. For this class imbalance problem, a number of algorithms are compared, including the TAN+AdaBoost algorithm. Experiments performed on four epigenetic datasets and several known datasets show that an imbalanced dataset can have similar accuracy as a regular learner on a balanced dataset.
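
    One common strategy for this kind of imbalance is cost-sensitive learning, weighting classes inversely to their frequency; the sketch below illustrates that idea on synthetic DMR-style data with scikit-learn. It is not the TAN+AdaBoost algorithm compared in the study, and the data shapes and class ratio are assumptions.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import balanced_accuracy_score
        from sklearn.model_selection import train_test_split

        # Hypothetical imbalanced epigenetic-style dataset: few DMR sites vs many non-DMR sites.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (950, 20)), rng.normal(1.5, 1, (50, 20))])
        y = np.array([0] * 950 + [1] * 50)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # Cost-sensitive learning: weight classes inversely to their frequency.
        clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)
        print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))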

  4. Quantitative 3D Ultrashort Time-to-Echo (UTE) MRI and Micro-CT (μCT) Evaluation of the Temporomandibular Joint (TMJ) Condylar Morphology

    PubMed Central

    Geiger, Daniel; Bae, Won C.; Statum, Sheronda; Du, Jiang; Chung, Christine B.

    2014-01-01

    Objective Temporomandibular dysfunction involves osteoarthritis of the TMJ, including degeneration and morphologic changes of the mandibular condyle. The purpose of this study was to determine the accuracy of novel 3D-UTE MRI versus micro-CT (μCT) for quantitative evaluation of mandibular condyle morphology. Material & Methods Nine TMJ condyle specimens were harvested from cadavers (2M, 3F; age 85 ± 10 yrs., mean±SD). 3D-UTE MRI (TR=50ms, TE=0.05 ms, 104 μm isotropic voxel) was performed using a 3-T MR scanner, and μCT (18 μm isotropic voxel) was performed. MR datasets were spatially registered with the μCT dataset. Two observers segmented bony contours of the condyles. Fibrocartilage was segmented on the MR dataset. Using a custom program, bone and fibrocartilage surface coordinates, Gaussian curvature, volume of segmented regions and fibrocartilage thickness were determined for quantitative evaluation of joint morphology. Agreement between techniques (MRI vs. μCT) and observers (MRI vs. MRI) for Gaussian curvature, mean curvature and segmented volume of the bone was determined using intraclass correlation coefficient (ICC) analyses. Results Between MRI and μCT, the average deviation of surface coordinates was 0.19±0.15 mm, slightly higher than the spatial resolution of MRI. The average deviation of the Gaussian curvature and volume of segmented regions, from MRI to μCT, was 5.7±6.5% and 6.6±6.2%, respectively. ICC coefficients (MRI vs. μCT) for Gaussian curvature, mean curvature and segmented volumes were respectively 0.892, 0.893 and 0.972. Between observers (MRI vs. MRI), the ICC coefficients were 0.998, 0.999 and 0.997, respectively. Fibrocartilage thickness was 0.55±0.11 mm, as previously described in the literature for grossly normal TMJ samples. Conclusion 3D-UTE MR quantitative evaluation of TMJ condyle morphology ex-vivo, including surface, curvature and segmented volume, shows high correlation against μCT and between observers. In addition, UTE MRI allows quantitative evaluation of the fibrocartilaginous condylar component. PMID:24092237

  5. Database Objects vs Files: Evaluation of alternative strategies for managing large remote sensing data

    NASA Astrophysics Data System (ADS)

    Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram

    2010-05-01

    Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for management and processing of such datasets using binary large object implementations (BLOBs) in database systems versus implementation in Hadoop files using the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as bandwidth and response time performance. This requires partitioning larger files into a set of smaller files, and is accompanied by the concomitant requirement for managing large numbers of files. Storing these sub-files as blobs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these blobs and would run in parallel across the set of nodes in the cluster. On the other hand, there are both storage overheads and constraints, and software licensing dependencies created by such an implementation. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available. Another consideration is the strategy used for partitioning large data collections, and large datasets within collections, using round-robin vs hash partitioning vs range partitioning methods. Each has different characteristics in terms of spatial locality of data and resultant degree of declustering of the computations on the data. Furthermore, we have observed that, in practice, there can be large variations in the frequency of access to different parts of a large data collection and/or dataset, thereby creating "hotspots" in the data. We will evaluate how effectively the different storage approaches and partitioning strategies deal with such hotspots.
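
    The three partitioning strategies mentioned in this record can be sketched in a few lines; the toy code below assigns tile identifiers to storage nodes by round-robin, hash, and range rules. Node counts and range bounds are invented for illustration.

        # Toy illustration of round-robin vs hash vs range partitioning of tile IDs.
        from bisect import bisect_right
        from hashlib import md5

        NODES = 4
        range_bounds = [2500, 5000, 7500]   # upper bounds of the first three range partitions

        def round_robin(tile_id: int) -> int:
            return tile_id % NODES

        def hash_partition(tile_id: int) -> int:
            return int(md5(str(tile_id).encode()).hexdigest(), 16) % NODES

        def range_partition(tile_id: int) -> int:
            return bisect_right(range_bounds, tile_id)   # node 0..3

        for tile in (17, 4242, 9001):
            print(tile, round_robin(tile), hash_partition(tile), range_partition(tile))

    Round-robin and hash assignment tend to spread access hotspots across nodes, while range partitioning preserves spatial locality at the cost of concentrating popular regions on a few nodes.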

  6. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176
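
    As a loose, hedged illustration of the decomposition idea in this record (not the article's actual combination model), the sketch below expresses a new dataset's summary vector as a non-negative combination of vectors standing in for earlier models; all data and dimensions are invented.

        # Generic non-negative least squares stand-in for "decompose a new dataset into
        # contributions from earlier relevant models".
        import numpy as np
        from scipy.optimize import nnls

        rng = np.random.default_rng(1)
        earlier_models = rng.random((50, 8))   # 50 features x 8 archived model signatures
        true_weights = np.array([0.0, 0.7, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0])
        new_dataset = earlier_models @ true_weights + rng.normal(0, 0.01, 50)

        weights, residual = nnls(earlier_models, new_dataset)
        relevant = np.argsort(weights)[::-1][:3]
        print("most relevant earlier models:", relevant, weights[relevant].round(2))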

  7. Toward computational cumulative biology by combining models of biological datasets.

    PubMed

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations-for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.

  8. Population of 224 realistic human subject-based computational breast phantoms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Erickson, David W.; Wells, Jered R., E-mail: jered.wells@duke.edu; Sturgeon, Gregory M.

    Purpose: To create a database of highly realistic and anatomically variable 3D virtual breast phantoms based on dedicated breast computed tomography (bCT) data. Methods: A tissue classification and segmentation algorithm was used to create realistic and detailed 3D computational breast phantoms based on 230+ dedicated bCT datasets from normal human subjects. The breast volume was identified using a coarse three-class fuzzy C-means segmentation algorithm which accounted for and removed motion blur at the breast periphery. Noise in the bCT data was reduced through application of a postreconstruction 3D bilateral filter. A 3D adipose nonuniformity (bias field) correction was then applied followed by glandular segmentation using a 3D bias-corrected fuzzy C-means algorithm. Multiple tissue classes were defined including skin, adipose, and several fractional glandular densities. Following segmentation, a skin mask was produced which preserved the interdigitated skin, adipose, and glandular boundaries of the skin interior. Finally, surface modeling was used to produce digital phantoms with methods complementary to the XCAT suite of digital human phantoms. Results: After rejecting some datasets due to artifacts, 224 virtual breast phantoms were created which emulate the complex breast parenchyma of actual human subjects. The volume breast density (with skin) ranged from 5.5% to 66.3% with a mean value of 25.3% ± 13.2%. Breast volumes ranged from 25.0 to 2099.6 ml with a mean value of 716.3 ± 386.5 ml. Three breast phantoms were selected for imaging with digital compression (using finite element modeling) and simple ray-tracing, and the results show promise in their potential to produce realistic simulated mammograms. Conclusions: This work provides a new population of 224 breast phantoms based on in vivo bCT data for imaging research. Compared to previous studies based on only a few prototype cases, this dataset provides a rich source of new cases spanning a wide range of breast types, volumes, densities, and parenchymal patterns.
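
    The fuzzy C-means step used in this pipeline can be sketched compactly; the code below implements the standard membership and center updates on synthetic 1-D intensity data and is only a stand-in for the study's three-class bCT segmentation.

        # Minimal fuzzy C-means on 1-D intensity values (synthetic stand-in for bCT voxels).
        import numpy as np

        def fuzzy_c_means(x, n_clusters=3, m=2.0, n_iter=100, eps=1e-9, seed=0):
            rng = np.random.default_rng(seed)
            centers = rng.choice(x, n_clusters, replace=False).astype(float)
            for _ in range(n_iter):
                dist = np.abs(x[:, None] - centers[None, :]) + eps         # (N, C)
                u = 1.0 / (dist ** (2.0 / (m - 1.0)))
                u /= u.sum(axis=1, keepdims=True)                          # fuzzy memberships
                um = u ** m
                centers = (um * x[:, None]).sum(axis=0) / um.sum(axis=0)   # updated centers
            return centers, u

        intensities = np.concatenate([np.random.normal(mu, 5, 500) for mu in (20, 60, 110)])
        centers, memberships = fuzzy_c_means(intensities)
        print("class centers:", np.sort(centers).round(1))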

  9. Population of 224 realistic human subject-based computational breast phantoms

    PubMed Central

    Erickson, David W.; Wells, Jered R.; Sturgeon, Gregory M.; Dobbins, James T.; Segars, W. Paul; Lo, Joseph Y.

    2016-01-01

    Purpose: To create a database of highly realistic and anatomically variable 3D virtual breast phantoms based on dedicated breast computed tomography (bCT) data. Methods: A tissue classification and segmentation algorithm was used to create realistic and detailed 3D computational breast phantoms based on 230+ dedicated bCT datasets from normal human subjects. The breast volume was identified using a coarse three-class fuzzy C-means segmentation algorithm which accounted for and removed motion blur at the breast periphery. Noise in the bCT data was reduced through application of a postreconstruction 3D bilateral filter. A 3D adipose nonuniformity (bias field) correction was then applied followed by glandular segmentation using a 3D bias-corrected fuzzy C-means algorithm. Multiple tissue classes were defined including skin, adipose, and several fractional glandular densities. Following segmentation, a skin mask was produced which preserved the interdigitated skin, adipose, and glandular boundaries of the skin interior. Finally, surface modeling was used to produce digital phantoms with methods complementary to the XCAT suite of digital human phantoms. Results: After rejecting some datasets due to artifacts, 224 virtual breast phantoms were created which emulate the complex breast parenchyma of actual human subjects. The volume breast density (with skin) ranged from 5.5% to 66.3% with a mean value of 25.3% ± 13.2%. Breast volumes ranged from 25.0 to 2099.6 ml with a mean value of 716.3 ± 386.5 ml. Three breast phantoms were selected for imaging with digital compression (using finite element modeling) and simple ray-tracing, and the results show promise in their potential to produce realistic simulated mammograms. Conclusions: This work provides a new population of 224 breast phantoms based on in vivo bCT data for imaging research. Compared to previous studies based on only a few prototype cases, this dataset provides a rich source of new cases spanning a wide range of breast types, volumes, densities, and parenchymal patterns. PMID:26745896

  10. Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark.

    PubMed

    Klein, Max; Sharma, Rati; Bohrer, Chris H; Avelis, Cameron M; Roberts, Elijah

    2017-01-15

    Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology. Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark. Contact: eroberts@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  11. A semiparametric graphical modelling approach for large-scale equity selection.

    PubMed

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption.

  12. Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data

    PubMed Central

    Repetski, Stephen; Venkataraman, Girish; Che, Anney; Luke, Brian T.; Girard, F. Pascal; Stephens, Robert M.

    2013-01-01

    As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework. PMID:24312478

  13. Knowledge and theme discovery across very large biological data sets using distributed queries: a prototype combining unstructured and structured data.

    PubMed

    Mudunuri, Uma S; Khouja, Mohamad; Repetski, Stephen; Venkataraman, Girish; Che, Anney; Luke, Brian T; Girard, F Pascal; Stephens, Robert M

    2013-01-01

    As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.

  14. Hydraulic Tomography in Fractured Sedimentary Rocks to Estimate High-Resolution 3-D Distribution of Hydraulic Conductivity

    NASA Astrophysics Data System (ADS)

    Tiedeman, C. R.; Barrash, W.; Thrash, C. J.; Patterson, J.; Johnson, C. D.

    2016-12-01

    Hydraulic tomography was performed in a 100 m² by 20 m thick volume of contaminated fractured mudstones at the former Naval Air Warfare Center (NAWC) in the Newark Basin, New Jersey, with the objective of estimating the detailed distribution of hydraulic conductivity (K). Characterizing the fine-scale K variability is important for designing effective remediation strategies in complex geologic settings such as fractured rock. In the tomography experiment, packers isolated two to six intervals in each of seven boreholes in the volume of investigation, and fiber-optic pressure transducers enabled collection of high-resolution drawdown observations. A hydraulic tomography dataset was obtained by conducting multiple aquifer tests in which a given isolated well interval was pumped and drawdown was monitored in all other intervals. The collective data from all tests display a wide range of behavior indicative of highly heterogeneous K within the tested volume, such as: drawdown curves for different intervals crossing one another on drawdown-time plots; unique drawdown curve shapes for certain intervals; and intervals with negligible drawdown adjacent to intervals with large drawdown. Tomographic inversion of data from 15 tests conducted in the first field season focused on estimating the K distribution at a scale of 1 m3 over approximately 25% of the investigated volume, where observation density was greatest. The estimated K field is consistent with prior geologic, geophysical, and hydraulic information, including: highly variable K within bedding-plane-parting fractures that are the primary flow and transport paths at NAWC, connected high-K features perpendicular to bedding, and a spatially heterogeneous distribution of low-K rock matrix and closed fractures. Subsequent tomographic testing was conducted in the second field season, with the region of high observation density expanded to cover a greater volume of the wellfield.

  15. Atrophy of the cholinergic basal forebrain over the adult age range and in early stages of Alzheimer's disease

    PubMed Central

    Grothe, Michel; Heinsen, Helmut; Teipel, Stefan J.

    2013-01-01

    Background The basal forebrain cholinergic system (BFCS) is known to undergo moderate neurodegenerative changes during normal aging as well as severe atrophy in Alzheimer's disease (AD). However, there is a controversy on how the cholinergic lesion in AD relates to early and incipient stages of the disease. In-vivo imaging studies on the structural integrity of the BFCS in normal and pathological aging are still rare. Methods We applied automated morphometry techniques in combination with high-dimensional image warping and a cytoarchitectonic map of BF cholinergic nuclei to a large cross-sectional dataset of high-resolution MRI scans, covering the whole adult age-range (20–94 years; N=211) as well as patients with very mild AD (vmAD; CDR=0.5; N=69) and clinically manifest AD (AD; CDR=1; N=28). For comparison, we investigated hippocampus volume using automated volumetry. Results Volume of the BFCS declined from early adulthood on and atrophy aggravated in advanced age. Volume reductions in vmAD were most pronounced in posterior parts of the nucleus basalis Meynert, while in AD atrophy was more extensive and included the whole BFCS. In clinically manifest AD, the diagnostic accuracy of BFCS volume reached the diagnostic accuracy of hippocampus volume. Conclusions Our findings indicate that cholinergic degeneration in AD occurs against a background of age-related atrophy and that exacerbated atrophy in AD can be detected at earliest stages of cognitive impairment. Automated in-vivo morphometry of the BFCS may become a useful tool to assess BF cholinergic degeneration in normal and pathological aging. PMID:21816388

  16. Crystallographic Orientation Relationships (CORs) between rutile inclusions and garnet hosts: towards using COR frequencies as a petrogenetic indicator

    NASA Astrophysics Data System (ADS)

    Griffiths, Thomas; Habler, Gerlinde; Schantl, Philip; Abart, Rainer

    2017-04-01

    Crystallographic orientation relationships (CORs) between crystalline inclusions and their hosts are commonly used to support particular inclusion origins, but often interpretations are based on a small fraction of all inclusions in a system. The electron backscatter diffraction (EBSD) method allows collection of large COR datasets more quickly than other methods while maintaining high spatial resolution. Large datasets allow analysis of the relative frequencies of different CORs, and identification of 'statistical CORs', where certain limited degrees of freedom exist in the orientation relationship between two neighbour crystals (Griffiths et al. 2016). Statistical CORs exist in addition to completely fixed 'specific' CORs (previously the only type of COR considered). We present a comparison of three EBSD single point datasets (all N > 200 inclusions) of rutile inclusions in garnet hosts, covering three rock systems, each with a different geological history: 1) magmatic garnet in pegmatite from the Koralpe complex, Eastern Alps, formed at temperatures > 600°C and low pressures; 2) granulite facies garnet rims on ultra-high-pressure garnets from the Kimi complex, Rhodope Massif; and 3) a Moldanubian granulite from the southeastern Bohemian Massif, equilibrated at peak conditions of 1050°C and 1.6 GPa. The present study is unique because all datasets have been analysed using the same catalogue of potential CORs; therefore, relative frequencies and other COR properties can be meaningfully compared. In every dataset > 94% of the inclusions analysed exhibit one of the CORs tested for. Certain CORs are consistently among the most common in all datasets. However, the relative abundances of these common CORs show large variations between datasets (varying from 8 to 42% relative abundance in one case). Other CORs are consistently uncommon but nonetheless present in every dataset. Lastly, there are some CORs that are common in one of the datasets and rare in the remainder. These patterns suggest competing influences on relative COR frequencies. Certain CORs seem consistently favourable, perhaps pointing to very stable low energy configurations, whereas some CORs are favoured in only one system, perhaps due to particulars of the formation mechanism, kinetics or conditions. Variations in COR frequencies between datasets seem to correlate with the conditions of host-inclusion system evolution. The two datasets from granulite-facies metamorphic samples show more similarities to each other than to the pegmatite dataset, and the sample inferred to have experienced the highest temperatures (Moldanubian granulite) shows the lowest diversity of CORs, low frequencies of statistical CORs and the highest frequency of specific CORs. These results provide evidence that petrological information is being encoded in COR distributions. They make a strong case for further studies of the factors influencing COR development and for measurements of COR distributions in other systems and between different phases. Griffiths, T.A., Habler, G., Abart, R. (2016): Crystallographic orientation relationships in host-inclusion systems: New insights from large EBSD data sets. Amer. Miner., 101, 690-705.

  17. Structural identifiability analysis of a cardiovascular system model.

    PubMed

    Pironet, Antoine; Dauby, Pierre C; Chase, J Geoffrey; Docherty, Paul D; Revie, James A; Desaive, Thomas

    2016-05-01

    The six-chamber cardiovascular system model of Burkhoff and Tyberg has been used in several theoretical and experimental studies. However, this cardiovascular system model (and others derived from it) is not identifiable from any output set. In this work, two such cases of structural non-identifiability are first presented. These cases occur when the model output set only contains a single type of information (pressure or volume). A specific output set is thus chosen, mixing pressure and volume information and containing only a limited number of clinically available measurements. Then, by manipulating the model equations involving these outputs, it is demonstrated that the six-chamber cardiovascular system model is structurally globally identifiable. A further simplification is made, assuming known cardiac valve resistances. Because of the poor practical identifiability of these four parameters, this assumption is common. Under this hypothesis, the six-chamber cardiovascular system model is structurally identifiable from an even smaller dataset. As a consequence, parameter values computed from limited but well-chosen datasets are theoretically unique. This means that the parameter identification procedure can safely be performed on the model from such a well-chosen dataset. Thus, the model may be considered suitable for use in diagnosis. Copyright © 2016 IPEM. Published by Elsevier Ltd. All rights reserved.

  18. Chemical Data Reporting rule (CDR)

    EPA Pesticide Factsheets

    This dataset contains information on chemicals that companies produce domestically or import into the United States during the principal reporting year. For the 2012 submission period, reporters provided 2011 manufacturing, processing, and use data and 2010 production volume data for their reportable chemical substances.

  19. Multiresolution persistent homology for excessively large biomolecular datasets

    NASA Astrophysics Data System (ADS)

    Xia, Kelin; Zhao, Zhixiong; Wei, Guo-Wei

    2015-10-01

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273,780 atoms is successfully analyzed, which would otherwise be inaccessible to the normal point cloud method and unreliable by using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to protein domain classification, which is the first time that persistent homology is used for practical protein domain analysis, to our knowledge. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.
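
    A hedged sketch of the rigidity-density idea described above is given below: a sum of Gaussian kernels centred on a toy point set, whose scale parameter eta acts as the resolution knob. This is not the paper's implementation, and the point set and parameter values are invented.

        # Resolution-tunable rigidity density on a toy 3-D point set.
        import numpy as np

        def rigidity_density(grid_points, atoms, eta):
            """Sum of Gaussian kernels centred on the atoms, evaluated at grid_points."""
            d2 = ((grid_points[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / eta ** 2).sum(axis=1)

        rng = np.random.default_rng(0)
        atoms = rng.uniform(0, 10, size=(200, 3))
        grid = rng.uniform(0, 10, size=(5, 3))

        for eta in (0.5, 2.0, 8.0):    # fine to coarse resolution
            print(f"eta={eta}: densities =", rigidity_density(grid, atoms, eta).round(2))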

  20. A Testbed Demonstration of an Intelligent Archive in a Knowledge Building System

    NASA Technical Reports Server (NTRS)

    Ramapriyan, Hampapuram; Isaac, David; Morse, Steve; Yang, Wenli; Bonnlander, Brian; McConaughy, Gail; Di, Liping; Danks, David

    2005-01-01

    The last decade's influx of raw data and derived geophysical parameters from several Earth observing satellites to NASA data centers has created a data-rich environment for Earth science research and applications. While advances in hardware and information management have made it possible to archive petabytes of data and distribute terabytes of data daily to a broad community of users, further progress is necessary in the transformation of data into information, and information into knowledge that can be used in particular applications in order to realize the full potential of these valuable datasets. In examining what is needed to enable this progress in the data provider environment that exists today and is expected to evolve in the next several years, we arrived at the concept of an Intelligent Archive in the context of a Knowledge Building System (IA/KBS). Our prior work and associated papers investigated usage scenarios, required capabilities, system architecture, data volume issues, and supporting technologies. We identified six key capabilities of an IA/KBS: Virtual Product Generation, Significant Event Detection, Automated Data Quality Assessment, Large-Scale Data Mining, Dynamic Feedback Loop, and Data Discovery and Efficient Requesting. Among these capabilities, large-scale data mining is perceived by many in the community to be an area of technical risk. One of the main reasons for this is that standard data mining research and algorithms operate on datasets that are several orders of magnitude smaller than the actual sizes of datasets maintained by realistic earth science data archives. Therefore, we defined a testbed activity to implement a large-scale data mining algorithm in a pseudo-operational scale environment and to examine any issues involved. The application chosen for applying the data mining algorithm is wildfire prediction over the continental U.S. This paper reports a number of observations based on our experience with this testbed. While proof-of-concept for data mining scalability and utility has been a major goal for the research reported here, it was not the only one. The other five capabilities of an IA/KBS named above have been considered as well, and an assessment of the implications of our experience for these other areas will also be presented. The lessons learned through the testbed effort and presented in this paper will benefit technologists, scientists, and system operators as they consider introducing IA/KBS capabilities into production systems.

  1. Comparison of Cortical and Subcortical Measurements in Normal Older Adults across Databases and Software Packages

    PubMed Central

    Rane, Swati; Plassard, Andrew; Landman, Bennett A.; Claassen, Daniel O.; Donahue, Manus J.

    2017-01-01

    This work explores the feasibility of combining anatomical MRI data across two public repositories namely, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Progressive Parkinson’s Markers Initiative (PPMI). We compared cortical thickness and subcortical volumes in cognitively normal older adults between datasets with distinct imaging parameters to assess if they would provide equivalent information. Three distinct datasets were identified. Major differences in data were scanner manufacturer and the use of magnetization inversion to enhance tissue contrast. Equivalent datasets, i.e., those providing similar volumetric measurements in cognitively normal controls, were identified in ADNI and PPMI. These were datasets obtained on the Siemens scanner with TI = 900 ms. Our secondary goal was to assess the agreement between subcortical volumes that are obtained with different software packages. Three subcortical measurement applications (FSL, FreeSurfer, and a recent multi-atlas approach) were compared. Our results show significant agreement in the measurements of caudate, putamen, pallidum, and hippocampus across the packages and poor agreement between measurements of accumbens and amygdala. This is likely due to their smaller size and lack of gray matter-white matter tissue contrast for accurate segmentation. This work provides a segue to combine imaging data from ADNI and PPMI to increase statistical power as well as to interrogate common mechanisms in disparate pathologies such as Alzheimer’s and Parkinson’s diseases. It lays the foundation for comparison of anatomical data acquired with disparate imaging parameters and analyzed with disparate software tools. Furthermore, our work partly explains the variability in the results of studies using different software packages. PMID:29756095

  2. Comparison of Cortical and Subcortical Measurements in Normal Older Adults across Databases and Software Packages.

    PubMed

    Rane, Swati; Plassard, Andrew; Landman, Bennett A; Claassen, Daniel O; Donahue, Manus J

    2017-01-01

    This work explores the feasibility of combining anatomical MRI data across two public repositories namely, the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Progressive Parkinson's Markers Initiative (PPMI). We compared cortical thickness and subcortical volumes in cognitively normal older adults between datasets with distinct imaging parameters to assess if they would provide equivalent information. Three distinct datasets were identified. Major differences in data were scanner manufacturer and the use of magnetization inversion to enhance tissue contrast. Equivalent datasets, i.e., those providing similar volumetric measurements in cognitively normal controls, were identified in ADNI and PPMI. These were datasets obtained on the Siemens scanner with TI = 900 ms. Our secondary goal was to assess the agreement between subcortical volumes that are obtained with different software packages. Three subcortical measurement applications (FSL, FreeSurfer, and a recent multi-atlas approach) were compared. Our results show significant agreement in the measurements of caudate, putamen, pallidum, and hippocampus across the packages and poor agreement between measurements of accumbens and amygdala. This is likely due to their smaller size and lack of gray matter-white matter tissue contrast for accurate segmentation. This work provides a segue to combine imaging data from ADNI and PPMI to increase statistical power as well as to interrogate common mechanisms in disparate pathologies such as Alzheimer's and Parkinson's diseases. It lays the foundation for comparison of anatomical data acquired with disparate imaging parameters and analyzed with disparate software tools. Furthermore, our work partly explains the variability in the results of studies using different software packages.

  3. Large-Scale Sentinel-1 Processing for Solid Earth Science and Urgent Response using Cloud Computing and Machine Learning

    NASA Astrophysics Data System (ADS)

    Hua, H.; Owen, S. E.; Yun, S. H.; Agram, P. S.; Manipon, G.; Starch, M.; Sacco, G. F.; Bue, B. D.; Dang, L. B.; Linick, J. P.; Malarout, N.; Rosen, P. A.; Fielding, E. J.; Lundgren, P.; Moore, A. W.; Liu, Z.; Farr, T.; Webb, F.; Simons, M.; Gurrola, E. M.

    2017-12-01

    With the increased availability of open SAR data (e.g. Sentinel-1 A/B), new challenges arise in processing and analyzing the voluminous SAR datasets to make geodetic measurements. Upcoming SAR missions such as NISAR are expected to generate close to 100TB per day. The Advanced Rapid Imaging and Analysis (ARIA) project can now generate geocoded unwrapped phase and coherence products from Sentinel-1 TOPS mode data in an automated fashion, using the ISCE software. This capability is currently being exercised on various study sites across the United States and around the globe, including Hawaii, Central California, Iceland and South America. The automated and large-scale SAR data processing and analysis capabilities use cloud computing techniques to speed the computations and provide scalable processing power and storage. Aspects such as how to process these voluminous SLCs and interferograms at global scales, how to keep up with the large daily SAR data volumes, and how to handle the high data rates are being explored. Scene-partitioning approaches in the processing pipeline help in handling global-scale processing up to unwrapped interferograms with stitching done at a late stage. We have built an advanced science data system with rapid search functions to enable access to the derived data products. Rapid image processing of Sentinel-1 data to interferograms and time series is already being applied to natural hazards including earthquakes, floods, volcanic eruptions, and land subsidence due to fluid withdrawal. We will present the status of the ARIA science data system for generating science-ready data products and challenges that arise from being able to process SAR datasets to derived time series data products at large scales. For example, how do we perform large-scale data quality screening on interferograms? What approaches can be used to minimize compute, storage, and data movement costs for time series analysis in the cloud? We will also present some of our findings from applying machine learning and data analytics on the processed SAR data streams. Finally, we will present lessons learned on how to ease the SAR community into interfacing with these cloud-based SAR science data systems.

  4. Immersive Interaction, Manipulation and Analysis of Large 3D Datasets for Planetary and Earth Sciences

    NASA Astrophysics Data System (ADS)

    Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.

    2017-12-01

    We will present implementation and study of several use-cases of utilizing Virtual Reality (VR) for immersive display, interaction and analysis of large and complex 3D datasets. These datasets have been acquired by the instruments across several Earth, Planetary and Solar Space Robotics Missions. First, we will describe the architecture of the common application framework that was developed to input data, interface with VR display devices and program input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and advantages of each highlighted. We'll proceed to presenting experimental immersive analytics visual constructs that enable augmentation of 3D datasets with 2D ones such as images and statistical and abstract data. We will conclude by presenting comparative analysis with traditional visualization applications and share the feedback provided by our users: scientists and engineers.

  5. Decision tree methods: applications for classification and prediction.

    PubMed

    Song, Yan-Yan; Lu, Ying

    2015-04-25

    Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model, and the validation dataset is used to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms for developing decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
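
    The train/validation workflow described in this record can be sketched as follows, using CART as implemented in scikit-learn and treating max_depth as a stand-in for "tree size"; the dataset and search range are illustrative choices, not the paper's examples.

        # Fit trees of increasing depth on a training split and pick the depth with the
        # best validation accuracy.
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

        best_depth, best_score = None, 0.0
        for depth in range(1, 11):
            tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
            score = tree.score(X_val, y_val)          # validation accuracy
            if score > best_score:
                best_depth, best_score = depth, score

        print(f"selected tree depth: {best_depth} (validation accuracy {best_score:.3f})")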

  6. Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.

    PubMed

    Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O; Gelfand, Alan E

    2016-01-01

    Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The number of floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, yielding substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online.

  7. Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets

    PubMed Central

    Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O.; Gelfand, Alan E.

    2018-01-01

    Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The number of floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, yielding substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online. PMID:29720777

  8. An innovative privacy preserving technique for incremental datasets on cloud computing.

    PubMed

    Aldeen, Yousra Abdul Alsahib S; Salleh, Mazleena; Aljeroudi, Yazan

    2016-08-01

    Cloud computing (CC) is a service-based delivery model offering vast computer processing power and data storage across connected communications channels. It has given strong technological impetus to the web-mediated IT industry, where users can easily share private data for further analysis and mining, and user-friendly CC services make it economical to deploy a wide range of applications. Meanwhile, easy data sharing has invited phishing attacks and malware-assisted security threats. Privacy-sensitive applications such as cloud-based health services, built for their economic and operational benefits, require enhanced security. Thus, strong cyberspace security and mitigation against phishing attacks are mandatory to protect overall data privacy. Typically, application datasets are anonymized to protect the privacy of their owners, but the secrecy requirements of newly added records are not fully met. Some proposed techniques address this issue by re-anonymizing the datasets from scratch, yet full privacy protection over incremental datasets on CC has not been achieved. In addition, the distribution of huge data volumes across multiple storage nodes limits privacy preservation. In this view, we propose a new anonymization technique to attain better privacy protection with high data utility over distributed and incremental datasets on CC. The effectiveness of the privacy preservation and the improved confidentiality are demonstrated through performance evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.
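
    As a loose, hedged illustration of one basic privacy property behind anonymization (not the article's technique), the sketch below checks whether a toy released table is k-anonymous with respect to its quasi-identifiers; the column names and threshold are invented.

        # Verify k-anonymity: every combination of quasi-identifier values must occur
        # in at least k records.
        import pandas as pd

        def is_k_anonymous(df: pd.DataFrame, quasi_identifiers, k: int) -> bool:
            group_sizes = df.groupby(list(quasi_identifiers)).size()
            return bool(group_sizes.min() >= k)

        records = pd.DataFrame({
            "age_band":   ["20-29", "20-29", "20-29", "30-39", "30-39", "30-39"],
            "zip_prefix": ["100**", "100**", "100**", "101**", "101**", "101**"],
            "diagnosis":  ["A", "B", "A", "C", "A", "B"],
        })
        print(is_k_anonymous(records, ["age_band", "zip_prefix"], k=3))   # True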

  9. geneLAB: Expanding the Impact of NASA's Biological Research in Space

    NASA Technical Reports Server (NTRS)

    Rayl, Nicole; Smith, Jeffrey D.

    2014-01-01

    The geneLAB project is designed to leverage the value of large 'omics' datasets from molecular biology projects conducted on the ISS by making these datasets available, citable, discoverable, interpretable, reusable, and reproducible. geneLAB will create a collaboration space with an integrated set of tools for depositing, accessing, analyzing, and modeling these diverse datasets from spaceflight and related terrestrial studies.

  10. A hierarchical 3D segmentation method and the definition of vertebral body coordinate systems for QCT of the lumbar spine.

    PubMed

    Mastmeyer, André; Engelke, Klaus; Fuchs, Christina; Kalender, Willi A

    2006-08-01

    We have developed a new hierarchical 3D technique to segment the vertebral bodies in order to measure bone mineral density (BMD) with high trueness and precision in volumetric CT datasets. The hierarchical approach starts with a coarse separation of the individual vertebrae, applies a variety of techniques to segment the vertebral bodies with increasing detail and ends with the definition of an anatomic coordinate system for each vertebral body, relative to which up to 41 trabecular and cortical volumes of interest are positioned. In a pre-segmentation step, constraints consisting of Boolean combinations of simple geometric shapes are determined that enclose each individual vertebral body. Bound by these constraints, viscous deformable models are used to segment the main shape of the vertebral bodies. Volume growing and morphological operations then capture the fine details of the bone-soft tissue interface. In the volumes of interest, bone mineral density and content are determined. In addition, in the segmented vertebral bodies, geometric parameters such as volume or the length of the main axes of inertia can be measured. Intra- and inter-operator precision errors of the segmentation procedure were analyzed using existing clinical patient datasets. Results for segmented volume, BMD, and coordinate system position were below 2.0%, 0.6%, and 0.7%, respectively. Trueness was analyzed using phantom scans. The bias of the segmented volume was below 4%; for BMD it was below 1.5%. The long-term goal of this work is improved fracture prediction and patient monitoring in the field of osteoporosis. A true 3D segmentation also enables an accurate measurement of geometrical parameters that may augment the clinical value of a pure BMD analysis.
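
    The kind of morphological post-processing mentioned above can be sketched briefly; the toy binary mask below stands in for a coarse vertebral-body segmentation (it is not the authors' pipeline), with opening removing a small spur and hole filling closing an internal cavity.

        # Morphological refinement of a toy 3-D binary segmentation mask.
        import numpy as np
        from scipy import ndimage

        mask = np.zeros((40, 40, 40), dtype=bool)
        mask[10:30, 10:30, 10:30] = True
        mask[18:22, 18:22, 18:22] = False      # internal cavity to be filled
        mask[20, 20, 30:32] = True             # small spur to be removed

        refined = ndimage.binary_opening(mask, structure=np.ones((3, 3, 3)))  # removes the spur
        refined = ndimage.binary_fill_holes(refined)                          # fills the cavity
        print("voxels before/after refinement:", int(mask.sum()), int(refined.sum()))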

  11. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
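
    The evaluation metric cited in this record can be sketched in a few lines; the code below computes plain NDCG for one toy ranked list of graded relevance judgements (the challenge itself used an inferred variant, which is not reproduced here).

        # NDCG for a single ranked list of graded relevance labels (0 = irrelevant).
        import numpy as np

        def dcg(relevances):
            relevances = np.asarray(relevances, dtype=float)
            ranks = np.arange(1, relevances.size + 1)
            return float((relevances / np.log2(ranks + 1)).sum())

        def ndcg(relevances):
            ideal = dcg(sorted(relevances, reverse=True))
            return dcg(relevances) / ideal if ideal > 0 else 0.0

        retrieved_relevance = [2, 0, 1, 2, 0]   # relevance of the datasets as ranked
        print(f"NDCG = {ndcg(retrieved_relevance):.3f}")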

  12. High-throughput bioinformatics with the Cyrille2 pipeline system

    PubMed Central

    Fiers, Mark WEJ; van der Burgt, Ate; Datema, Erwin; de Groot, Joost CW; van Ham, Roeland CHJ

    2008-01-01

    Background Modern omics research involves the application of high-throughput technologies that generate vast volumes of data. These data need to be pre-processed, analyzed and integrated with existing knowledge through the use of diverse sets of software tools, models and databases. The analyses are often interdependent and chained together to form complex workflows or pipelines. Given the volume of the data used and the multitude of computational resources available, specialized pipeline software is required to make high-throughput analysis of large-scale omics datasets feasible. Results We have developed a generic pipeline system called Cyrille2. The system is modular in design and consists of three functionally distinct parts: 1) a web based, graphical user interface (GUI) that enables a pipeline operator to manage the system; 2) the Scheduler, which forms the functional core of the system and which tracks what data enters the system and determines what jobs must be scheduled for execution, and; 3) the Executor, which searches for scheduled jobs and executes these on a compute cluster. Conclusion The Cyrille2 system is an extensible, modular system, implementing the stated requirements. Cyrille2 enables easy creation and execution of high throughput, flexible bioinformatics pipelines. PMID:18269742

  13. Compensating the intensity fall-off effect in cone-beam tomography by an empirical weight formula.

    PubMed

    Chen, Zikuan; Calhoun, Vince D; Chang, Shengjiang

    2008-11-10

    The Feldkamp-Davis-Kress (FDK) algorithm is widely adopted for cone-beam reconstruction due to its one-dimensional filtered backprojection structure and parallel implementation. In a reconstruction volume, the conspicuous cone-beam artifact manifests as intensity fall-off along the longitudinal direction (the gantry rotation axis). This effect is inherent to circular cone-beam tomography due to the fact that a cone-beam dataset acquired from circular scanning fails to meet the data sufficiency condition for volume reconstruction. Based on observations of the intensity fall-off phenomenon associated with the FDK reconstruction of a ball phantom, we propose an empirical weight formula to compensate for the fall-off degradation. Specifically, a reciprocal cosine can be used to compensate the voxel values along the longitudinal direction during three-dimensional backprojection reconstruction, in particular for boosting the values of voxels at positions with large cone angles. The intensity degradation within the z plane, albeit insignificant, can also be compensated by using the same weight formula through a parameter for radial distance dependence. Computer simulations and phantom experiments are presented to demonstrate the effectiveness of the compensation for the fall-off effect inherent in circular cone-beam tomography.
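
    A hedged sketch of the reciprocal-cosine idea follows: each z slice is boosted by 1/cos(kappa), where kappa is the cone angle subtended by the slice's z offset at an assumed source-to-isocentre distance D. The constants, the mocked-up fall-off profile, and the omission of the radial-distance term are illustrative assumptions, not the paper's calibrated weight formula.

        # Reciprocal-cosine compensation of a mocked-up longitudinal fall-off profile.
        import numpy as np

        D = 600.0                                   # source-to-isocentre distance (mm), assumed
        z = np.linspace(-100.0, 100.0, 5)           # longitudinal voxel positions (mm)
        kappa = np.arctan(z / D)                    # cone angle per z slice
        recon_profile = np.full_like(z, 100.0) * np.cos(kappa)   # simulated fall-off

        compensated = recon_profile / np.cos(kappa)              # reciprocal-cosine boost
        print(np.round(recon_profile, 2), np.round(compensated, 2))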

  14. The maximum vector-angular margin classifier and its fast training on large datasets using a core vector machine.

    PubMed

    Hu, Wenjun; Chung, Fu-Lai; Wang, Shitong

    2012-03-01

    Although pattern classification has been extensively studied in the past decades, how to effectively solve the corresponding training on large datasets is a problem that still requires particular attention. Many kernelized classification methods, such as SVM and SVDD, can be formulated as the corresponding quadratic programming (QP) problems, but computing the associated kernel matrices requires O(n²) (or even up to O(n³)) computational complexity, where n is the number of training patterns, which heavily limits the applicability of these methods for large datasets. In this paper, a new classification method called the maximum vector-angular margin classifier (MAMC) is first proposed based on the vector-angular margin to find an optimal vector c in the pattern feature space, and all the testing patterns can be classified in terms of the maximum vector-angular margin ρ between the vector c and all the training data points. Accordingly, it is proved that the kernelized MAMC can be equivalently formulated as the kernelized Minimum Enclosing Ball (MEB), which leads to a distinctive merit of MAMC, i.e., it has the flexibility of controlling the sum of support vectors like v-SVC and may be extended to a maximum vector-angular margin core vector machine (MAMCVM) by connecting the core vector machine (CVM) method with MAMC such that the corresponding fast training on large datasets can be effectively achieved. Experimental results on artificial and real datasets are provided to validate the power of the proposed methods. Copyright © 2011 Elsevier Ltd. All rights reserved.

  15. Training Scalable Restricted Boltzmann Machines Using a Quantum Annealer

    NASA Astrophysics Data System (ADS)

    Kumar, V.; Bass, G.; Dulny, J., III

    2016-12-01

    Machine learning and the optimization involved therein are of critical importance for commercial and military applications. Due to the computational complexity of many-variable optimization, the conventional approach is to employ meta-heuristic techniques to find suboptimal solutions. Quantum Annealing (QA) hardware offers a completely novel approach with the potential to obtain significantly better solutions with large speed-ups compared to traditional computing. In this presentation, we describe our development of new machine learning algorithms tailored for QA hardware. We are training restricted Boltzmann machines (RBMs) using QA hardware on large, high-dimensional commercial datasets. Traditional optimization heuristics such as contrastive divergence and other closely related techniques are slow to converge, especially on large datasets. Recent studies have indicated that QA hardware, when used as a sampler, provides better training performance compared to conventional approaches. Most of these studies have been limited to moderately-sized datasets due to the hardware restrictions imposed by existing QA devices, which make it difficult to solve real-world problems at scale. In this work we develop novel strategies to circumvent this issue. We discuss scale-up techniques such as enhanced embedding and partitioned RBMs which allow large commercial datasets to be learned using QA hardware. We present our initial results obtained by training an RBM as an autoencoder on an image dataset. The results obtained so far indicate that the convergence rates can be improved significantly by increasing RBM network connectivity. These ideas can be readily applied to generalized Boltzmann machines and we are currently investigating this in an ongoing project.
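
    For context, the conventional baseline mentioned above (contrastive divergence) can be sketched as a single CD-1 update for a tiny binary RBM; the sizes, learning rate, and random mini-batch are arbitrary assumptions, and this is not the authors' quantum-annealing training loop.

        # One CD-1 parameter update for a small binary RBM, using NumPy only.
        import numpy as np

        rng = np.random.default_rng(0)
        n_visible, n_hidden, lr = 6, 4, 0.1
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        v0 = rng.integers(0, 2, size=(8, n_visible)).astype(float)   # a mini-batch of binary data

        # Positive phase: hidden activations driven by the data.
        ph0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step (reconstruction).
        pv1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b_h)

        # CD-1 gradient estimates and parameter update.
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (ph0 - ph1).mean(axis=0)
        print("updated W norm:", np.linalg.norm(W).round(4))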

  16. A dataset of human decision-making in teamwork management.

    PubMed

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-17

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  17. A dataset of human decision-making in teamwork management

    PubMed Central

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members’ capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches. PMID:28094787

  18. A dataset of human decision-making in teamwork management

    NASA Astrophysics Data System (ADS)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  19. GODIVA2: interactive visualization of environmental data on the Web.

    PubMed

    Blower, J D; Haines, K; Santokhee, A; Liu, C L

    2009-03-13

    GODIVA2 is a dynamic website that provides visual access to several terabytes of physically distributed, four-dimensional environmental data. It allows users to explore large datasets interactively without the need to install new software or download and understand complex data. Through the use of open international standards, GODIVA2 maintains a high level of interoperability with third-party systems, allowing diverse datasets to be mutually compared. Scientists can use the system to search for features in large datasets and to diagnose the output from numerical simulations and data processing algorithms. Data providers around Europe have adopted GODIVA2 as an INSPIRE-compliant dynamic quick-view system for providing visual access to their data.

  20. Addressing Methodological Challenges in Large Communication Datasets: Collecting and Coding Longitudinal Interactions in Home Hospice Cancer Care

    PubMed Central

    Reblin, Maija; Clayton, Margaret F; John, Kevin K; Ellington, Lee

    2015-01-01

    In this paper, we present strategies for collecting and coding a large longitudinal communication dataset collected across multiple sites, consisting of over 2000 hours of digital audio recordings from approximately 300 families. We describe our methods within the context of implementing a large-scale study of communication during cancer home hospice nurse visits, but this procedure could be adapted to communication datasets across a wide variety of settings. This research is the first study designed to capture home hospice nurse-caregiver communication, a highly understudied location and type of communication event. We present a detailed example protocol encompassing data collection in the home environment, large-scale, multi-site secure data management, the development of theoretically-based communication coding, and strategies for preventing coder drift and ensuring reliability of analyses. Although each of these challenges has the potential to undermine the utility of the data, reliability between coders is often the only issue consistently reported and addressed in the literature. Overall, our approach demonstrates rigor and provides a “how-to” example for managing large, digitally-recorded data sets from collection through analysis. These strategies can inform other large-scale health communication research. PMID:26580414

  1. Recommendations for the Use of Automated Gray Matter Segmentation Tools: Evidence from Huntington’s Disease

    PubMed Central

    Johnson, Eileanoir B.; Gregory, Sarah; Johnson, Hans J.; Durr, Alexandra; Leavitt, Blair R.; Roos, Raymund A.; Rees, Geraint; Tabrizi, Sarah J.; Scahill, Rachael I.

    2017-01-01

    The selection of an appropriate segmentation tool is a challenge facing any researcher aiming to measure gray matter (GM) volume. Many tools have been compared, yet there is currently no method that can be recommended above all others; in particular, there is a lack of validation in disease cohorts. This work utilizes a clinical dataset to conduct an extensive comparison of segmentation tools. Our results confirm that all tools have advantages and disadvantages, and we present a series of considerations that may be of use when selecting a GM segmentation method, rather than a ranking of these tools. Seven segmentation tools were compared using 3 T MRI data from 20 controls, 40 premanifest Huntington’s disease (HD), and 40 early HD participants. Segmented volumes underwent detailed visual quality control. Reliability and repeatability of total, cortical, and lobular GM were investigated in repeated baseline scans. The relationship between each tool was also examined. Longitudinal within-group change over 3 years was assessed via generalized least squares regression to determine sensitivity of each tool to disease effects. Visual quality control and raw volumes highlighted large variability between tools, especially in occipital and temporal regions. Most tools showed reliable performance and the volumes were generally correlated. Results for longitudinal within-group change varied between tools, especially within lobular regions. These differences highlight the need for careful selection of segmentation methods in clinical neuroimaging studies. This guide acts as a primer aimed at the novice or non-technical imaging scientist providing recommendations for the selection of cohort-appropriate GM segmentation software. PMID:29066997

  2. Recommendations for the Use of Automated Gray Matter Segmentation Tools: Evidence from Huntington's Disease.

    PubMed

    Johnson, Eileanoir B; Gregory, Sarah; Johnson, Hans J; Durr, Alexandra; Leavitt, Blair R; Roos, Raymund A; Rees, Geraint; Tabrizi, Sarah J; Scahill, Rachael I

    2017-01-01

    The selection of an appropriate segmentation tool is a challenge facing any researcher aiming to measure gray matter (GM) volume. Many tools have been compared, yet there is currently no method that can be recommended above all others; in particular, there is a lack of validation in disease cohorts. This work utilizes a clinical dataset to conduct an extensive comparison of segmentation tools. Our results confirm that all tools have advantages and disadvantages, and we present a series of considerations that may be of use when selecting a GM segmentation method, rather than a ranking of these tools. Seven segmentation tools were compared using 3 T MRI data from 20 controls, 40 premanifest Huntington's disease (HD), and 40 early HD participants. Segmented volumes underwent detailed visual quality control. Reliability and repeatability of total, cortical, and lobular GM were investigated in repeated baseline scans. The relationship between each tool was also examined. Longitudinal within-group change over 3 years was assessed via generalized least squares regression to determine sensitivity of each tool to disease effects. Visual quality control and raw volumes highlighted large variability between tools, especially in occipital and temporal regions. Most tools showed reliable performance and the volumes were generally correlated. Results for longitudinal within-group change varied between tools, especially within lobular regions. These differences highlight the need for careful selection of segmentation methods in clinical neuroimaging studies. This guide acts as a primer aimed at the novice or non-technical imaging scientist providing recommendations for the selection of cohort-appropriate GM segmentation software.

  3. Development of a global slope dataset for estimation of landslide occurrence resulting from earthquakes

    USGS Publications Warehouse

    Verdin, Kristine L.; Godt, Jonathan W.; Funk, Christopher C.; Pedreros, Diego; Worstell, Bruce; Verdin, James

    2007-01-01

    Landslides resulting from earthquakes can cause widespread loss of life and damage to critical infrastructure. The U.S. Geological Survey (USGS) has developed an alarm system, PAGER (Prompt Assessment of Global Earthquakes for Response), that aims to provide timely information to emergency relief organizations on the impact of earthquakes. Landslides are responsible for many of the damaging effects following large earthquakes in mountainous regions, and thus data defining the topographic relief and slope are critical to the PAGER system. A new global topographic dataset was developed to aid in rapidly estimating landslide potential following large earthquakes. We used the remotely-sensed elevation data collected as part of the Shuttle Radar Topography Mission (SRTM) to generate a slope dataset with nearly global coverage. Slopes from the SRTM data, computed at 3-arc-second resolution, were summarized at 30-arc-second resolution, along with statistics developed to describe the distribution of slope within each 30-arc-second pixel. Because there are many small areas lacking SRTM data and the northern limit of the SRTM mission was lat 60° N., statistical methods referencing other elevation data were used to fill the voids within the dataset and to extrapolate the data north of 60° N. The dataset will be used in the PAGER system to rapidly assess the susceptibility of areas to landsliding following large earthquakes.
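
    The two-step workflow described above (fine-resolution slope computation followed by coarse-resolution summary statistics) can be illustrated with a short, hedged sketch; the synthetic DEM, 90 m cell size, and 10×10 aggregation factor below are stand-ins, not the USGS processing chain.

    ```python
    # Illustrative sketch of the workflow described above (not USGS code):
    # compute slope on a fine DEM grid, then summarize statistics per coarse cell.
    import numpy as np

    def slope_degrees(dem, cellsize):
        """Slope in degrees from a DEM array using central differences."""
        dz_dy, dz_dx = np.gradient(dem, cellsize)
        return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

    def summarize_blocks(slope, factor):
        """Mean and max slope within non-overlapping factor x factor blocks."""
        h = (slope.shape[0] // factor) * factor
        w = (slope.shape[1] // factor) * factor
        blocks = slope[:h, :w].reshape(h // factor, factor, w // factor, factor)
        return blocks.mean(axis=(1, 3)), blocks.max(axis=(1, 3))

    dem = np.random.default_rng(0).normal(1000, 50, size=(300, 300))   # synthetic DEM
    fine_slope = slope_degrees(dem, cellsize=90.0)     # ~3-arc-second spacing in metres
    mean_slope, max_slope = summarize_blocks(fine_slope, factor=10)    # ~30-arc-second cells
    ```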

  4. Using LiDAR to Estimate Surface Erosion Volumes within the Post-storm 2012 Bagley Fire

    NASA Astrophysics Data System (ADS)

    Mikulovsky, R. P.; De La Fuente, J. A.; Mondry, Z. J.

    2014-12-01

    The total post-storm 2012 Bagley fire sediment budget of the Squaw Creek watershed in the Shasta-Trinity National Forest was estimated using many methods. A portion of the budget was quantitatively estimated using LiDAR. Simple workflows were designed to estimate the eroded volumes of debris slides, fill failures, gullies, altered channels and streams. LiDAR was also used to estimate depositional volumes. Thorough manual mapping of large erosional features using the ArcGIS 10.1 Geographic Information System was required, as these mapped features determined the eroded volume boundaries in 3D space. The 3D pre-erosional surface for each mapped feature was interpolated based on the boundary elevations. A surface difference calculation was run using the estimated pre-erosional surfaces and LiDAR surfaces to determine the volume of sediment potentially delivered into the stream system. In addition, cross sections of altered channels and streams were taken using stratified random selection based on channel gradient and stream order, respectively. The original pre-storm surfaces of channel features were estimated using the cross sections and erosion depth criteria. The open-source software Inkscape was used to estimate cross-sectional areas for randomly selected channel features, which were then averaged for each channel gradient and stream order class. The average areas were then multiplied by the length of each class to estimate total eroded altered channel and stream volume. Finally, reservoir and in-channel depositional volumes were estimated by mapping channel forms and generating specific reservoir elevation zones associated with depositional events. The in-channel areas and zones within the reservoir were multiplied by estimated and field-observed sediment thicknesses to attain a best-guess sediment volume. In-channel estimates included re-occupying stream channel cross sections established before the fire. Once volumes were calculated, other erosion processes of the Bagley sedimentation study, such as surface soil erosion, were combined to estimate the total fire and storm sediment budget for the Squaw Creek watershed. The LiDAR-based measurement workflows can be easily applied to other sediment budget studies using a single high-resolution LiDAR dataset.
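
    A hedged sketch of the core surface-differencing step follows: an interpolated pre-erosion surface minus the post-event LiDAR surface, summed over a mapped feature polygon, yields an eroded volume. The arrays, mask, and 1 m cell size are synthetic placeholders rather than Bagley fire data.

    ```python
    # Hedged sketch of the surface-differencing idea in the abstract: subtract the
    # post-event LiDAR surface from an interpolated pre-event surface inside a
    # mapped feature boundary and sum the depth to a volume. Arrays are synthetic.
    import numpy as np

    def eroded_volume(pre_surface, post_surface, feature_mask, cellsize):
        """Volume (m^3) removed inside the mask, ignoring cells that aggraded."""
        depth = pre_surface - post_surface            # positive where material was lost
        depth = np.where(feature_mask & (depth > 0), depth, 0.0)
        return depth.sum() * cellsize ** 2

    rng = np.random.default_rng(0)
    post = rng.normal(500.0, 2.0, size=(200, 200))    # post-storm LiDAR DEM (m)
    pre = post + rng.uniform(0.0, 1.5, size=post.shape)   # estimated pre-erosion surface
    mask = np.zeros_like(post, dtype=bool)
    mask[50:120, 60:140] = True                       # mapped debris-slide polygon
    print(f"eroded volume ~ {eroded_volume(pre, post, mask, cellsize=1.0):,.0f} m^3")
    ```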

  5. Integrated Strategy Improves the Prediction Accuracy of miRNA in Large Dataset

    PubMed Central

    Lipps, David; Devineni, Sree

    2016-01-01

    MiRNAs are short non-coding RNAs of about 22 nucleotides, which play critical roles in gene expression regulation. The biogenesis of miRNAs is largely determined by the sequence and structural features of their parental RNA molecules. Based on these features, multiple computational tools have been developed to predict whether RNA transcripts contain miRNAs or not. Although very successful, these predictors have started to face multiple challenges in recent years. Many predictors were optimized using datasets of hundreds of miRNA samples. The sizes of these datasets are much smaller than the number of known miRNAs. Consequently, the prediction accuracy of these predictors on large datasets becomes unknown and needs to be re-tested. In addition, many predictors were optimized for either high sensitivity or high specificity. These optimization strategies may introduce serious limitations in applications. Moreover, to meet continuously rising expectations on these computational tools, improving the prediction accuracy becomes extremely important. In this study, a meta-predictor, mirMeta, was developed by integrating a set of non-linear transformations with a meta-strategy. More specifically, the outputs of five individual predictors were first preprocessed using non-linear transformations, and then fed into an artificial neural network to make the meta-prediction. The prediction accuracy of the meta-predictor was validated using both multi-fold cross-validation and an independent dataset. The final accuracy of the meta-predictor on a newly designed large dataset is improved by 7%, to 93%. The meta-predictor is also shown to be less dependent on datasets and to have a refined balance between sensitivity and specificity. This study is important in two ways: first, it shows that the combination of non-linear transformations and artificial neural networks improves the prediction accuracy of individual predictors; second, a new miRNA predictor with significantly improved prediction accuracy is developed for the community for identifying novel miRNAs and the complete set of miRNAs. Source code is available at: https://github.com/xueLab/mirMeta PMID:28002428
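
    The meta-strategy described above can be sketched roughly as follows: non-linearly transform the scores of several base predictors and train a small neural network on the transformed scores. The transformation, network size, and synthetic data below are illustrative assumptions, not the mirMeta implementation.

    ```python
    # Rough sketch of the meta-strategy (not the mirMeta code): apply a non-linear
    # transformation to the scores of several base predictors and feed them to a
    # small neural network that makes the final call. Data are synthetic.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    n_samples, n_base = 5000, 5
    base_scores = rng.random((n_samples, n_base))          # stand-in for 5 predictor outputs
    labels = (base_scores.mean(axis=1) + 0.1 * rng.normal(size=n_samples) > 0.5).astype(int)

    def transform(scores, k=4.0):
        """Example non-linear squashing of raw scores (illustrative choice)."""
        return 1.0 / (1.0 + np.exp(-k * (scores - 0.5)))

    meta = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    meta.fit(transform(base_scores), labels)
    print("training accuracy:", meta.score(transform(base_scores), labels))
    ```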

  6. Intensification and Structure Change of Super Typhoon Flo as Related to the Large-Scale Environment.

    DTIC Science & Technology

    1998-06-01

    ...large dataset is a challenge. Schiavone and Papathomas (1990) summarize methods currently available for visualizing scientific datasets.

  7. Data-driven Applications for the Sun-Earth System

    NASA Astrophysics Data System (ADS)

    Kondrashov, D. A.

    2016-12-01

    Advances in observational and data mining techniques allow extracting information from the large volume of Sun-Earth observational data that can be assimilated into first principles physical models. However, equations governing Sun-Earth phenomena are typically nonlinear, complex, and high-dimensional. The high computational demand of solving the full governing equations over a large range of scales precludes the use of a variety of useful assimilative tools that rely on applied mathematical and statistical techniques for quantifying uncertainty and predictability. Effective use of such tools requires the development of computationally efficient methods to facilitate fusion of data with models. This presentation will provide an overview of various existing as well as newly developed data-driven techniques adopted from atmospheric and oceanic sciences that have proved to be useful for space physics applications, such as computationally efficient implementation of the Kalman Filter in radiation belts modeling, solar wind gap-filling by Singular Spectrum Analysis, and a low-rank procedure for assimilation of low-altitude ionospheric magnetic perturbations into the Lyon-Fedder-Mobarry (LFM) global magnetospheric model. Reduced-order non-Markovian inverse modeling and novel data-adaptive decompositions of Sun-Earth datasets will also be demonstrated.

  8. A decentralized training algorithm for Echo State Networks in distributed big data applications.

    PubMed

    Scardapane, Simone; Wang, Dianhui; Panella, Massimo

    2016-06-01

    The current big data deluge requires innovative solutions for performing efficient inference on large, heterogeneous amounts of information. Apart from the known challenges deriving from high volume and velocity, real-world big data applications may impose additional technological constraints, including the need for a fully decentralized training architecture. While several alternatives exist for training feed-forward neural networks in such a distributed setting, less attention has been devoted to the case of decentralized training of recurrent neural networks (RNNs). In this paper, we propose such an algorithm for a class of RNNs known as Echo State Networks. The algorithm is based on the well-known Alternating Direction Method of Multipliers optimization procedure. It is formulated only in terms of local exchanges between neighboring agents, without reliance on a coordinating node. Additionally, it does not require the communication of training patterns, which is a crucial consideration in realistic big data implementations. Experimental results on large-scale artificial datasets show that it compares favorably with a fully centralized implementation, in terms of speed, efficiency and generalization accuracy. Copyright © 2015 Elsevier Ltd. All rights reserved.
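
    For readers unfamiliar with Echo State Networks, the minimal centralized sketch below shows the reservoir state update and ridge-regression readout that the decentralized ADMM procedure distributes; the reservoir size, spectral radius, and toy signal are illustrative, and the ADMM consensus step itself is not reproduced.

    ```python
    # Minimal single-node Echo State Network sketch (centralized, for orientation
    # only; the paper's contribution is the decentralized ADMM training, which is
    # not reproduced here). Sizes and scaling constants are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res = 1, 200
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
    W = rng.normal(size=(n_res, n_res))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))        # spectral radius < 1

    def run_reservoir(u_seq):
        """Collect reservoir states x(t+1) = tanh(W_in u(t) + W x(t))."""
        x = np.zeros(n_res)
        states = []
        for u in u_seq:
            x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)
            states.append(x.copy())
        return np.array(states)

    u = np.sin(np.linspace(0, 20 * np.pi, 2000))            # toy input signal
    y = np.roll(u, -1)                                       # predict the next sample
    X = run_reservoir(u)
    ridge = 1e-6
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)  # ridge readout
    ```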

  9. Sustainable data and metadata management at the BD2K-LINCS Data Coordination and Integration Center

    PubMed Central

    Stathias, Vasileios; Koleti, Amar; Vidović, Dušica; Cooper, Daniel J.; Jagodnik, Kathleen M.; Terryn, Raymond; Forlin, Michele; Chung, Caty; Torre, Denis; Ayad, Nagi; Medvedovic, Mario; Ma'ayan, Avi; Pillai, Ajay; Schürer, Stephan C.

    2018-01-01

    The NIH-funded LINCS Consortium is creating an extensive reference library of cell-based perturbation response signatures and sophisticated informatics tools incorporating a large number of perturbagens, model systems, and assays. To date, more than 350 datasets have been generated including transcriptomics, proteomics, epigenomics, cell phenotype and competitive binding profiling assays. The large volume and variety of data necessitate rigorous data standards and effective data management including modular data processing pipelines and end-user interfaces to facilitate accurate and reliable data exchange, curation, validation, standardization, aggregation, integration, and end user access. Deep metadata annotations and the use of qualified data standards enable integration with many external resources. Here we describe the end-to-end data processing and management at the DCIC to generate a high-quality and persistent product. Our data management and stewardship solutions enable a functioning Consortium and make LINCS a valuable scientific resource that aligns with big data initiatives such as the BD2K NIH Program and concords with emerging data science best practices including the findable, accessible, interoperable, and reusable (FAIR) principles. PMID:29917015

  10. Study of Hydrokinetic Turbine Arrays with Large Eddy Simulation

    NASA Astrophysics Data System (ADS)

    Sale, Danny; Aliseda, Alberto

    2014-11-01

    Marine renewable energy is advancing towards commercialization, including electrical power generation from ocean, river, and tidal currents. The focus of this work is to develop numerical simulations capable of predicting the power generation potential of hydrokinetic turbine arrays; this includes analysis of unsteady and averaged flow fields, turbulence statistics, and unsteady loadings on turbine rotors and support structures due to interaction with rotor wakes and ambient turbulence. The governing equations of large-eddy simulation (LES) are solved using a finite-volume method, and the presence of turbine blades is approximated by the actuator-line method, in which hydrodynamic forces are projected onto the flow field as a body force. The actuator-line approach captures helical wake formation including vortex shedding from individual blades, and the effects of drag and vorticity generation from the rough seabed surface are accounted for by wall-models. This LES framework was used to replicate a previous flume experiment consisting of three hydrokinetic turbines tested under various operating conditions and array layouts. Predictions of the power generation, velocity deficit and turbulence statistics in the wakes are compared between the LES and experimental datasets.

  11. SAMSA2: a standalone metatranscriptome analysis pipeline.

    PubMed

    Westreich, Samuel T; Treiber, Michelle L; Mills, David A; Korf, Ian; Lemay, Danielle G

    2018-05-21

    Complex microbial communities are an area of growing interest in biology. Metatranscriptomics allows researchers to quantify microbial gene expression in an environmental sample via high-throughput sequencing. Metatranscriptomic experiments are computationally intensive because the experiments generate a large volume of sequence data and each sequence must be compared with reference sequences from thousands of organisms. SAMSA2 is an upgrade to the original Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) pipeline that has been redesigned for standalone use on a supercomputing cluster. SAMSA2 is faster due to the use of the DIAMOND aligner, and more flexible and reproducible because it uses local databases. SAMSA2 is available with detailed documentation, and example input and output files along with examples of master scripts for full pipeline execution. SAMSA2 is a rapid and efficient metatranscriptome pipeline for analyzing large RNA-seq datasets in a supercomputing cluster environment. SAMSA2 provides simplified output that can be examined directly or used for further analyses, and its reference databases may be upgraded, altered or customized to fit the needs of any experiment.

  12. Design and analysis issues in quantitative proteomics studies.

    PubMed

    Karp, Natasha A; Lilley, Kathryn S

    2007-09-01

    Quantitative proteomics is the comparison of distinct proteomes which enables the identification of protein species which exhibit changes in expression or post-translational state in response to a given stimulus. Many different quantitative techniques are being utilized and generate large datasets. Independent of the technique used, these large datasets need robust data analysis to ensure valid conclusions are drawn from such studies. Approaches to address the problems that arise with large datasets are discussed to give insight into the types of statistical analyses of data appropriate for the various experimental strategies that can be employed by quantitative proteomic studies. This review also highlights the importance of employing a robust experimental design and highlights various issues surrounding the design of experiments. The concepts and examples discussed within will show how robust design and analysis will lead to confident results that will ensure quantitative proteomics delivers.

  13. A semiparametric graphical modelling approach for large-scale equity selection

    PubMed Central

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption. PMID:28316507

  14. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata) gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.
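
    A rough Python analogue of the two-level continuous-outcome models compared above is sketched below using statsmodels; the variable names and simulated family clusters are hypothetical stand-ins for the paediatric datasets, and the Stata estimators cited in the abstract are not reproduced exactly.

    ```python
    # Python analogue (not the authors' Stata code) of fitting a two-level model
    # for a continuous outcome with small clusters such as twin/triplet sets.
    # All variable names and simulated values below are hypothetical.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    rows = []
    for fam in range(150):
        u = rng.normal(0, 1.0)                       # family-level random effect
        for _ in range(rng.integers(1, 4)):          # cluster size 1, 2 or 3
            gest_age = rng.normal(30, 2)
            rows.append({"family": fam,
                         "gest_age": gest_age,
                         "birthweight": 1.2 + 0.15 * gest_age + u + rng.normal(0, 0.4)})
    df = pd.DataFrame(rows)

    # A random intercept per family accounts for the non-independence of multiples.
    model = smf.mixedlm("birthweight ~ gest_age", df, groups=df["family"]).fit()
    print(model.summary())
    ```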

  15. SU-E-T-333: Dosimetric Impact of Rotational Error On the Target Coverage in IMPT Lung Cancer Plans

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rana, S; Zheng, Y

    2015-06-15

    Purpose: The main purpose of this study was to investigate the impact of rotational (yaw, roll, and pitch) error on the planning target volume (PTV) coverage in lung cancer plans generated by intensity modulated proton therapy (IMPT). Methods: In this retrospective study, the computed tomography (CT) dataset of a previously treated lung case was used. An IMPT plan was generated on the original CT dataset using left-lateral (LL) and posterior-anterior (PA) beams for a total dose of 74 Gy[RBE] with 2 Gy[RBE] per fraction. In order to investigate the dosimetric impact of rotational error, 12 new CT datasets were generated by re-sampling the original CT dataset for rotational (roll, yaw, and pitch) angles ranging from −5° to +5°, with an increment of 2.5°. A total of 12 new IMPT plans were generated based on the re-sampled CT datasets using beam parameters identical to the ones in the original IMPT plan. All treatment plans were generated in the XiO treatment planning system. The PTV coverage (i.e., dose received by 95% of the PTV volume, D95) in the new IMPT plans was then compared with the PTV coverage in the original IMPT plan. Results: Rotational errors caused a reduction in the PTV coverage in all 12 new IMPT plans when compared to the original IMPT lung plan. Specifically, the PTV coverage was reduced by 4.94% to 50.51% for yaw, by 4.04% to 23.74% for roll, and by 5.21% to 46.88% for pitch errors. Conclusion: Unacceptable dosimetric results were observed in the new IMPT plans, as the PTV coverage was reduced by up to 26.87% and 50.51% for rotational errors of 2.5° and 5°, respectively. Further investigation is underway to evaluate the PTV coverage loss in IMPT lung cancer plans for smaller rotational angle changes.
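
    The re-sampling step can be illustrated generically: rotating a CT volume about one axis produces the perturbed dataset on which a plan would be re-evaluated. The sketch below uses scipy image rotation on a synthetic volume and is not the XiO planning-system workflow used in the study.

    ```python
    # Illustrative sketch of generating rotated ("re-sampled") CT volumes to mimic
    # rotational setup error; generic image resampling, not the planning workflow.
    import numpy as np
    from scipy.ndimage import rotate

    ct = np.random.default_rng(0).normal(0, 1, size=(64, 256, 256))  # toy CT volume (z, y, x)

    def apply_yaw(volume, angle_deg):
        """Rotate in the axial (y-x) plane, i.e. about the superior-inferior axis."""
        return rotate(volume, angle_deg, axes=(1, 2), reshape=False,
                      order=1, mode="nearest")

    for angle in (-5.0, -2.5, 2.5, 5.0):
        rotated = apply_yaw(ct, angle)               # one re-sampled dataset per angle
        print(angle, rotated.shape)
    ```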

  16. LIPS database with LIPService: a microscopic image database of intracellular structures in Arabidopsis guard cells.

    PubMed

    Higaki, Takumi; Kutsuna, Natsumaro; Hasezawa, Seiichiro

    2013-05-16

    Intracellular configuration is an important feature of cell status. Recent advances in microscopic imaging techniques allow us to easily obtain a large number of microscopic images of intracellular structures. In this circumstance, automated microscopic image recognition techniques are of extreme importance to future phenomics/visible screening approaches. However, there was no benchmark microscopic image dataset for intracellular organelles in a specified plant cell type. We previously established the Live Images of Plant Stomata (LIPS) database, a publicly available collection of optical-section images of various intracellular structures of plant guard cells, as a model system of environmental signal perception and transduction. Here we report recent updates to the LIPS database and the establishment of a database table, LIPService. We updated the LIPS dataset and established a new interface named LIPService to promote efficient inspection of intracellular structure configurations. Cell nuclei, microtubules, actin microfilaments, mitochondria, chloroplasts, endoplasmic reticulum, peroxisomes, endosomes, Golgi bodies, and vacuoles can be filtered using probe names or morphometric parameters such as stomatal aperture. In addition to the serial optical sectional images of the original LIPS database, new volume-rendering data for easy web browsing of three-dimensional intracellular structures have been released to allow easy inspection of their configurations or relationships with cell status/morphology. We also demonstrated the utility of the new LIPS image database for automated organelle recognition of images from another plant cell image database with image clustering analyses. The updated LIPS database provides a benchmark image dataset for representative intracellular structures in Arabidopsis guard cells. The newly released LIPService allows users to inspect the relationship between organellar three-dimensional configurations and morphometrical parameters.

  17. Analyzing a Lung Cancer Patient Dataset with the Focus on Predicting Survival Rate One Year after Thoracic Surgery

    PubMed

    Rezaei Hachesu, Peyman; Moftian, Nazila; Dehghani, Mahsa; Samad Soltani, Taha

    2017-06-25

    Background: Data mining, a concept introduced in the mid-1990s, can help researchers to gain new, profound insights and facilitate access to unanticipated knowledge sources in biomedical datasets. Many issues in the medical field are concerned with the diagnosis of diseases based on tests conducted on individuals at risk. Early diagnosis and treatment can provide a better outcome regarding the survival of lung cancer patients. Researchers can use data mining techniques to create effective diagnostic models. The aim of this study was to evaluate patterns existing in risk factor data for mortality one year after thoracic surgery for lung cancer. Methods: The dataset used in this study contained 470 records and 17 features. First, the most important variables involved in the incidence of lung cancer were extracted using knowledge discovery and data mining algorithms such as naive Bayes and expectation maximization; then, using a regression analysis algorithm, a questionnaire was developed to predict the risk of death one year after lung surgery. Outliers in the data were excluded and reported using a clustering algorithm. Finally, a calculator was designed to estimate the risk of one-year post-operative mortality based on a scorecard algorithm. Results: The results revealed the most important factor involved in increased mortality to be large tumor size. Roles for type II diabetes and preoperative dyspnea in lower survival were also identified. The greatest commonality in classification of patients was forced expiratory volume in one second (FEV1), based on levels of which patients could be classified into different categories. Conclusion: Development of a questionnaire based on such calculations to diagnose disease can be used to identify and fill knowledge gaps in clinical practice guidelines.

  18. The Earth Data Analytic Services (EDAS) Framework

    NASA Astrophysics Data System (ADS)

    Maxwell, T. P.; Duffy, D.

    2017-12-01

    Faced with unprecedented growth in earth data volume and demand, NASA has developed the Earth Data Analytic Services (EDAS) framework, a high performance big data analytics framework built on Apache Spark. This framework enables scientists to execute data processing workflows combining common analysis operations close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using vetted earth data analysis tools (ESMF, CDAT, NCO, etc.). EDAS utilizes a dynamic caching architecture, a custom distributed array framework, and a streaming parallel in-memory workflow for efficiently processing huge datasets within limited memory spaces with interactive response times. EDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be accessed using direct web service calls, a Python script, a Unix-like shell client, or a JavaScript-based web application. New analytic operations can be developed in Python, Java, or Scala (with support for other languages planned). Client packages in Python, Java/Scala, or JavaScript contain everything needed to build and submit EDAS requests. The EDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale data sets, where the data resides, to ultimately produce societal benefits. It is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service enables decision makers to compare multiple reanalysis datasets and investigate trends, variability, and anomalies in earth system dynamics around the globe.

  19. Domain Adaptation for Alzheimer’s Disease Diagnostics

    PubMed Central

    Wachinger, Christian; Reuter, Martin

    2016-01-01

    With the increasing prevalence of Alzheimer’s disease, research focuses on the early computer-aided diagnosis of dementia with the goal to understand the disease process, determine risk and preserving factors, and explore preventive therapies. By now, large amounts of data from multi-site studies have been made available for developing, training, and evaluating automated classifiers. Yet, their translation to the clinic remains challenging, in part due to their limited generalizability across different datasets. In this work, we describe a compact classification approach that mitigates overfitting by regularizing the multinomial regression with the mixed ℓ1/ℓ2 norm. We combine volume, thickness, and anatomical shape features from MRI scans to characterize neuroanatomy for the three-class classification of Alzheimer’s disease, mild cognitive impairment and healthy controls. We demonstrate high classification accuracy via independent evaluation within the scope of the CADDementia challenge. We, furthermore, demonstrate that variations between source and target datasets can substantially influence classification accuracy. The main contribution of this work addresses this problem by proposing an approach for supervised domain adaptation based on instance weighting. Integration of this method into our classifier allows us to assess different strategies for domain adaptation. Our results demonstrate (i) that training on only the target training set yields better results than the naïve combination (union) of source and target training sets, and (ii) that domain adaptation with instance weighting yields the best classification results, especially if only a small training component of the target dataset is available. These insights imply that successful deployment of systems for computer-aided diagnostics to the clinic depends not only on accurate classifiers that avoid overfitting, but also on a dedicated domain adaptation strategy. PMID:27262241
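
    The instance-weighting idea can be sketched in a few lines: source-domain samples that resemble the target domain receive larger weights when fitting the classifier. The similarity measure, simulated features, and three-class labels below are illustrative assumptions, not the paper's exact estimator or features.

    ```python
    # Hedged sketch of instance weighting for domain adaptation: source samples
    # that look more like the target domain get larger weights during training.
    # A generic illustration, not the paper's exact method or features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_source = rng.normal(0.0, 1.0, size=(600, 20))
    y_source = rng.integers(0, 3, size=600)            # 3 classes: AD / MCI / control
    X_target = rng.normal(0.5, 1.0, size=(200, 20))    # shifted target-domain features

    # Weight source samples by similarity to the target mean (a crude density-ratio proxy).
    dist = np.linalg.norm(X_source - X_target.mean(axis=0), axis=1)
    weights = np.exp(-0.5 * (dist / dist.std()) ** 2)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_source, y_source, sample_weight=weights)
    ```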

  20. Measuring the effects of morphological changes to sea turtle nesting beaches over time with LiDAR data

    NASA Astrophysics Data System (ADS)

    Yamamoto, Kristina H.; Anderson, Sharolyn J.; Sutton, Paul C.

    2015-10-01

    Sea turtle nesting beaches in southeastern Florida were evaluated for changes from 1999 to 2005 using LiDAR datasets. Changes to beach volume were correlated with changes in several elevation-derived characteristics, such as elevation and slope. In addition, these changes to beach geomorphology were correlated to changes in nest success, illustrating that beach alterations may affect sea turtle nesting behavior. The ability to use LiDAR datasets to quickly and efficiently conduct beach comparisons for habitat use represents another benefit to this high spatial resolution data.

  1. Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data

    PubMed Central

    Wong, Raymond K.; Mohammed, Sabah; Fiaidhi, Jinan; Sung, Yunsick

    2017-01-01

    Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced samples in class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, where the positive samples make up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to empower the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach are not scalable to larger data scales. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets, while the former approach becomes impractical. We also find the latter more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible performances of the classifier and shorten the run time compared to the brute-force method. PMID:28753613
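
    The SMOTE pre-processing step referred to above can be run with the imbalanced-learn package as in the hedged sketch below; the swarm-based tuning of SMOTE's parameters, which is the paper's contribution, is not shown, and the synthetic class sizes are arbitrary.

    ```python
    # Minimal example of the SMOTE rebalancing step, using imbalanced-learn; the
    # swarm-optimized parameter tuning from the paper is not reproduced here.
    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X_majority = rng.normal(0, 1, size=(950, 8))
    X_minority = rng.normal(2, 1, size=(50, 8))          # rare positive (disease) class
    X = np.vstack([X_majority, X_minority])
    y = np.array([0] * 950 + [1] * 50)

    X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    print(np.bincount(y), "->", np.bincount(y_bal))      # classes now balanced
    ```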

  2. A noninvasive method to study regulation of extracellular fluid volume in rats using nuclear magnetic resonance

    EPA Pesticide Factsheets

    NMR fluid measurements of commonly used rat strains when subjected to SQ normotonic or hypertonic salines, as well as physiologic comparisons to sedentary and exercised subjects. This dataset is associated with the following publication: Gordon, C., P. Phillips, and A. Johnstone. A Noninvasive Method to Study Regulation of Extracellular Fluid Volume in Rats Using Nuclear Magnetic Resonance. American Journal of Physiology - Renal Physiology. American Physiological Society, Bethesda, MD, USA, 310(5): 426-31, (2016).

  3. Large scale validation of the M5L lung CAD on heterogeneous CT datasets.

    PubMed

    Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P

    2015-04-01

    M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover the nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based on a small number of features (13), so as to minimize the risk of poor generalization, given the large difference between the sizes of the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which has not yet been reported in the literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground-glass opacity (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGO detection could further improve it, as could an iterative optimization of the training procedure. The main aim of the present study was accomplished: M5L results do not deteriorate when increasing the dataset size, making it a candidate for supporting radiologists on large-scale screenings and clinical programs.

  4. Application of an imputation method for geospatial inventory of forest structural attributes across multiple spatial scales in the Lake States, U.S.A

    NASA Astrophysics Data System (ADS)

    Deo, Ram K.

    Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States. Targeting small-area application of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.

  5. InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor.

    PubMed

    Coletta, Alain; Molter, Colin; Duqué, Robin; Steenhoff, David; Taminau, Jonatan; de Schaetzen, Virginie; Meganck, Stijn; Lazar, Cosmin; Venet, David; Detours, Vincent; Nowé, Ann; Bersini, Hugues; Weiss Solís, David Y

    2012-11-18

    Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.

  6. GeoNotebook: Browser based Interactive analysis and visualization workflow for very large climate and geospatial datasets

    NASA Astrophysics Data System (ADS)

    Ozturk, D.; Chaudhary, A.; Votava, P.; Kotfila, C.

    2016-12-01

    Jointly developed by Kitware and NASA Ames, GeoNotebook is an open source tool designed to give the maximum amount of flexibility to analysts, while dramatically simplifying the process of exploring geospatially indexed datasets. Packages like Fiona (backed by GDAL), Shapely, Descartes, Geopandas, and PySAL provide a stack of technologies for reading, transforming, and analyzing geospatial data. Combined with the Jupyter notebook and libraries like matplotlib/Basemap, it is possible to generate detailed geospatial visualizations. Unfortunately, the visualizations generated are either static or do not perform well for very large datasets. In addition, this setup requires a great deal of boilerplate code to create and maintain. Other extensions exist to remedy these problems, but they provide a separate map for each input cell and do not support map interactions that feed back into the Python environment. To support interactive data exploration and visualization on large datasets, we have developed an extension to the Jupyter notebook that provides a single dynamic map that can be managed from the Python environment, and that can communicate back with a server which can perform operations like data subsetting on a cloud-based cluster.

  7. Precipitation intercomparison of a set of satellite- and raingauge-derived datasets, ERA Interim reanalysis, and a single WRF regional climate simulation over Europe and the North Atlantic

    NASA Astrophysics Data System (ADS)

    Skok, Gregor; Žagar, Nedjeljka; Honzak, Luka; Žabkar, Rahela; Rakovec, Jože; Ceglar, Andrej

    2016-01-01

    The study presents a precipitation intercomparison based on two satellite-derived datasets (TRMM 3B42, CMORPH), four raingauge-based datasets (GPCC, E-OBS, Willmott & Matsuura, CRU), ERA Interim reanalysis (ERAInt), and a single climate simulation using the WRF model. The comparison was performed for a domain encompassing parts of Europe and the North Atlantic over the 11-year period of 2000-2010. The four raingauge-based datasets are similar to the TRMM dataset with biases over Europe ranging from -7 % to +4 %. The spread among the raingauge-based datasets is relatively small over most of Europe, although areas with greater uncertainty (more than 30 %) exist, especially near the Alps and other mountainous regions. There are distinct differences between the datasets over the European land area and the Atlantic Ocean in comparison to the TRMM dataset. ERAInt has a small dry bias over the land; the WRF simulation has a large wet bias (+30 %), whereas CMORPH is characterized by a large and spatially consistent dry bias (-21 %). Over the ocean, both ERAInt and CMORPH have a small wet bias (+8 %) while the wet bias in WRF is significantly larger (+47 %). ERAInt has the highest frequency of low-intensity precipitation while the frequency of high-intensity precipitation is the lowest due to its lower native resolution. Both satellite-derived datasets have more low-intensity precipitation over the ocean than over the land, while the frequency of higher-intensity precipitation is similar or larger over the land. This result is likely related to orography, which triggers more intense convective precipitation, while the Atlantic Ocean is characterized by more homogenous large-scale precipitation systems which are associated with larger areas of lower intensity precipitation. However, this is not observed in ERAInt and WRF, indicating the insufficient representation of convective processes in the models. Finally, the Fraction Skill Score confirmed that both models perform better over the Atlantic Ocean with ERAInt outperforming the WRF at low thresholds and WRF outperforming ERAInt at higher thresholds. The diurnal cycle is simulated better in the WRF simulation than in ERAInt, although WRF could not reproduce well the amplitude of the diurnal cycle. While the evaluation of the WRF model confirms earlier findings related to the model's wet bias over European land, the applied satellite-derived precipitation datasets revealed differences between the land and ocean areas along with uncertainties in the observation datasets.
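
    The Fractions Skill Score mentioned above compares neighbourhood fractions of threshold exceedance between forecast and observed fields. The sketch below is a generic implementation on synthetic fields; the 5 mm threshold and 11-pixel window are illustrative choices, not the study's settings.

    ```python
    # Generic sketch of the Fractions Skill Score (FSS): exceedance fields are
    # smoothed over a neighbourhood and the resulting fractions are compared.
    # Threshold and window size below are illustrative, not the study's settings.
    import numpy as np
    from scipy.ndimage import uniform_filter

    def fss(forecast, observed, threshold, window):
        f = uniform_filter((forecast >= threshold).astype(float), size=window)
        o = uniform_filter((observed >= threshold).astype(float), size=window)
        mse = np.mean((f - o) ** 2)
        mse_ref = np.mean(f ** 2) + np.mean(o ** 2)
        return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

    rng = np.random.default_rng(0)
    obs = rng.gamma(shape=0.5, scale=4.0, size=(200, 200))     # synthetic precipitation (mm)
    fcst = obs + rng.normal(0, 1.0, size=obs.shape)            # a "model" with added noise
    print("FSS(5 mm, 11 px):", round(fss(fcst, obs, 5.0, 11), 3))
    ```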

  8. Non-negative Tensor Factorization for Robust Exploratory Big-Data Analytics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alexandrov, Boian; Vesselinov, Velimir Valentinov; Djidjev, Hristo Nikolov

    Currently, large multidimensional datasets are being accumulated in almost every field. Data are: (1) collected by distributed sensor networks in real-time all over the globe, (2) produced by large-scale experimental measurements or engineering activities, (3) generated by high-performance simulations, and (4) gathered by electronic communications and social-network activities, etc. Simultaneous analysis of these ultra-large heterogeneous multidimensional datasets is often critical for scientific discoveries, decision-making, emergency response, and national and global security. The importance of such analyses mandates the development of the next generation of robust machine learning (ML) methods and tools for big-data exploratory analysis.
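
    As a hedged illustration of the underlying idea, the sketch below performs non-negative factorization by multiplicative updates in the two-dimensional (matrix) case; non-negative tensor factorizations such as CP extend the same principle to more modes. This is generic NumPy code, not the authors' software.

    ```python
    # Hedged sketch of non-negative factorization by multiplicative updates, shown
    # for the 2-D (matrix) case; tensor factorizations generalize this to more modes.
    import numpy as np

    def nmf(V, rank, n_iter=200, eps=1e-9):
        """Lee-Seung multiplicative updates: V ~ W @ H with W, H >= 0."""
        rng = np.random.default_rng(0)
        n, m = V.shape
        W = rng.random((n, rank)) + eps
        H = rng.random((rank, m)) + eps
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    V = np.abs(np.random.default_rng(1).normal(size=(100, 60)))   # non-negative data
    W, H = nmf(V, rank=5)
    print("relative error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
    ```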

  9. A new Glacier Inventory of the Antarctic Peninsula as compiled from pre-existing Datasets

    NASA Astrophysics Data System (ADS)

    Huber, J.; Cook, A. J.; Paul, F.; Zemp, M.

    2016-12-01

    The glaciers on the Antarctic Peninsula (AP) potentially make a large contribution to sea level rise. However, this contribution was difficult to estimate, as no complete glacier inventory (outlines, attributes, separation from the ice sheet) was available so far. This work fills the gap and presents a new glacier inventory of the AP north of 70° S based on digitally combining pre-existing datasets with GIS techniques. Rock outcrops are removed from the glacier basin outlines of Cook et al. (2014) by digital intersection with the latest layer of the Antarctic Digital Database (Burton-Johnson et al. 2016). Glacier-specific topographic parameters (e.g. mean elevation, slope and aspect) as well as hypsometry have been calculated from the DEM of Cook et al. (2012). We also assigned connectivity levels to all glaciers following the concept by Rastner et al. (2012). Moreover, the bedrock dataset of Huss and Farinotti (2014) enabled us to add ice thickness and volume for each glacier. The new inventory is available from the GLIMS database and consists of 1589 glaciers covering an area of 95273 km², slightly more than the 90000 km² covered by glaciers surrounding the Greenland Ice Sheet. The total ice volume is 34590 km³, of which one third is below sea level. The hypsometric curve has a bimodal shape due to the special topography of the AP consisting mainly of ice caps with outlet glaciers. Most of the glacierized area is located at 200-500 m a.s.l. with a secondary maximum at 1500-1900 m. About 63% of the area is drained by marine-terminating glaciers and ice shelf tributary glaciers cover 35% of the area. This combination results in a high sensitivity of the glaciers to climate change for several reasons: (1) only slightly rising equilibrium line altitudes would expose huge additional areas to ablation, (2) rising ocean temperatures increase melting of marine-terminating glaciers, and (3) ice shelves have a buttressing effect on their feeding glaciers and their collapse would alter glacier dynamics and strongly enhance ice loss (Rott et al. 2011). The new inventory should facilitate modeling of the related effects using approaches tailored to glaciers for a more accurate determination of their future evolution and contribution to sea level rise.

  10. 3D Reconstructed Cyto-, Muscarinic M2 Receptor, and Fiber Architecture of the Rat Brain Registered to the Waxholm Space Atlas

    PubMed Central

    Schubert, Nicole; Axer, Markus; Schober, Martin; Huynh, Anh-Minh; Huysegoms, Marcel; Palomero-Gallagher, Nicola; Bjaalie, Jan G.; Leergaard, Trygve B.; Kirlangic, Mehmet E.; Amunts, Katrin; Zilles, Karl

    2016-01-01

    High-resolution multiscale and multimodal 3D models of the brain are essential tools to understand its complex structural and functional organization. Neuroimaging techniques addressing different aspects of brain organization should be integrated in a reference space to enable topographically correct alignment and subsequent analysis of the various datasets and their modalities. The Waxholm Space (http://software.incf.org/software/waxholm-space) is a publicly available 3D coordinate-based standard reference space for the mapping and registration of neuroanatomical data in rodent brains. This paper provides a newly developed pipeline combining imaging and reconstruction steps with a novel registration strategy to integrate new neuroimaging modalities into the Waxholm Space atlas. As a proof of principle, we incorporated large scale high-resolution cyto-, muscarinic M2 receptor, and fiber architectonic images of rat brains into the 3D digital MRI based atlas of the Sprague Dawley rat in Waxholm Space. We describe the whole workflow, from image acquisition to reconstruction and registration of these three modalities into the Waxholm Space rat atlas. The registration of the brain sections into the atlas is performed by using both linear and non-linear transformations. The validity of the procedure is qualitatively demonstrated by visual inspection, and a quantitative evaluation is performed by measurement of the concordance between representative atlas-delineated regions and the same regions based on receptor or fiber architectonic data. This novel approach enables for the first time the generation of 3D reconstructed volumes of nerve fibers and fiber tracts, or of muscarinic M2 receptor density distributions, in an entire rat brain. Additionally, our pipeline facilitates the inclusion of further neuroimaging datasets, e.g., 3D reconstructed volumes of histochemical stainings or of the regional distributions of multiple other receptor types, into the Waxholm Space. Thereby, a multiscale and multimodal rat brain model was created in the Waxholm Space atlas of the rat brain. Since the registration of these multimodal high-resolution datasets into the same coordinate system is an indispensable requisite for multi-parameter analyses, this approach enables combined studies on receptor and cell distributions as well as fiber densities in the same anatomical structures at microscopic scales for the first time. PMID:27199682

  12. Analytics to Better Interpret and Use Large Amounts of Heterogeneous Data

    NASA Astrophysics Data System (ADS)

    Mathews, T. J.; Baskin, W. E.; Rinsland, P. L.

    2014-12-01

    Data scientists at NASA's Atmospheric Science Data Center (ASDC) are seasoned software application developers who have worked with the creation, archival, and distribution of large datasets (multiple terabytes and larger). In order for ASDC data scientists to effectively implement the most efficient processes for cataloging and organizing data access applications, they must be intimately familiar with the data contained in the datasets with which they are working. Key technologies that are critical components of the background of ASDC data scientists include: large RDBMSs (relational database management systems) and NoSQL databases; web services; service-oriented architectures; structured and unstructured data access; as well as processing algorithms. However, as prices of data storage and processing decrease, sources of data increase, and technologies advance - granting more people access to data at real or near-real time - data scientists are being pressured to accelerate their ability to identify and analyze vast amounts of data. With existing tools this is becoming increasingly challenging to accomplish. For example, the NASA Earth Science Data and Information System (ESDIS) alone grew from just over 4 PB of data in 2009 to nearly 6 PB in 2011, and then to roughly 10 PB in 2013. With data from at least ten new missions to be added to the ESDIS holdings by 2017, the current volume will continue to grow exponentially and drive the need to analyze more data even faster. Though there are many highly efficient, off-the-shelf analytics tools available, these tools mainly cater to business data, which is predominantly unstructured. By contrast, there are very few known analytics tools that interface well with archived Earth science data, which is predominantly heterogeneous and structured. This presentation will identify use cases for data analytics from an Earth science perspective in order to begin to identify specific tools that may be able to address those challenges.

  13. Mynodbcsv: lightweight zero-config database solution for handling very large CSV files.

    PubMed

    Adaszewski, Stanisław

    2014-01-01

    Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. The lack of standardized ways of exploring different data layouts requires an effort to solve the problem from scratch each time. The possibility to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) are one of the most common data storage formats. Despite its simplicity, handling the format becomes non-trivial as file size grows. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if the horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns; therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a "no copy" approach -- data stay mostly in the CSV files; "zero configuration" -- no need to specify a database schema; written in C++ with boost [1], SQLite [2] and Qt [3], it doesn't require installation and has a very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files ensure efficient plan execution; effortless support for millions of columns; due to per-value typing, using mixed text/number data is easy; a very simple network protocol provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It doesn't need any prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
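
    The core idea of exposing CSV data through SQL can be sketched with Python's built-in csv and sqlite3 modules. Unlike Mynodbcsv, this sketch copies the rows into an in-memory database and stores every value as text, so it only illustrates the query interface, not the tool's no-copy, per-value-typed design; the file name and query are hypothetical.

```python
import csv
import sqlite3

def sql_over_csv(csv_path, query):
    """Load a CSV into an in-memory SQLite table named 'data' and run a query.

    Column names are taken from the header row; all values are stored as text,
    so numeric comparisons in the query may need CAST(... AS REAL).
    """
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con = sqlite3.connect(":memory:")
        con.execute(f"CREATE TABLE data ({cols})")
        con.executemany(f"INSERT INTO data VALUES ({placeholders})", reader)
    return con.execute(query).fetchall()

# Hypothetical usage:
# rows = sql_over_csv("measurements.csv",
#                     'SELECT "subject", COUNT(*) FROM data GROUP BY "subject"')
```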

  15. Partial Information Community Detection in a Multilayer Network

    DTIC Science & Technology

    2016-06-01

    Network was taken from the CORE Lab at the Naval Postgraduate School [27]. Facebook dataset: We will use a subgraph of the Facebook network to build a ... larger synthetic multilayer network. We want to use this Facebook data as a way to introduce a real-world example of a network into our synthetic network ... This data is provided by the Stanford Large Network Dataset Collection [28]. This is a large anonymous subgraph of Facebook. It contains over 4,000

  16. cellVIEW: a Tool for Illustrative and Multi-Scale Rendering of Large Biomolecular Datasets

    PubMed Central

    Le Muzic, Mathieu; Autin, Ludovic; Parulek, Julius; Viola, Ivan

    2017-01-01

    In this article we introduce cellVIEW, a new system to interactively visualize large biomolecular datasets on the atomic level. Our tool is unique and has been specifically designed to match the ambitions of our domain experts to model and interactively visualize structures comprising several billion atoms. The cellVIEW system integrates acceleration techniques to allow for real-time graphics performance at a 60 Hz display rate on datasets representing large viruses and bacterial organisms. Inspired by the work of scientific illustrators, we propose a level-of-detail scheme whose purpose is twofold: accelerating the rendering and reducing visual clutter. The main part of our datasets is made up of macromolecules, but it also comprises nucleic acid strands which are stored as sets of control points. For that specific case, we extend our rendering method to support the dynamic generation of DNA strands directly on the GPU. It is noteworthy that our tool has been directly implemented inside a game engine. We chose to rely on a third-party engine to reduce software development workload and to make bleeding-edge graphics techniques more accessible to end-users. To our knowledge cellVIEW is the only suitable solution for interactive visualization of large biomolecular landscapes on the atomic level and is freely available to use and extend. PMID:29291131
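
    The twofold purpose of the level-of-detail scheme (faster rendering, less visual clutter) comes down to choosing a coarser representation as an object recedes from the camera. The toy, CPU-side selection rule below illustrates the general idea only; the level names and distance thresholds are invented and have nothing to do with cellVIEW's GPU implementation.

```python
import math

# Illustrative LOD levels, from full atomic detail to a coarse impostor.
LOD_LEVELS = ["atoms", "coarse_beads", "molecular_surface", "impostor"]
LOD_DISTANCES = [50.0, 200.0, 800.0]   # camera-distance thresholds (arbitrary units)

def select_lod(camera_pos, object_pos):
    """Return the LOD level to render for an object, given its distance to the camera."""
    d = math.dist(camera_pos, object_pos)
    for level, threshold in zip(LOD_LEVELS, LOD_DISTANCES):
        if d < threshold:
            return level
    return LOD_LEVELS[-1]

print(select_lod((0, 0, 0), (30, 0, 0)))    # close -> 'atoms'
print(select_lod((0, 0, 0), (1000, 0, 0)))  # far   -> 'impostor'
```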

  17. Mapping and spatiotemporal analysis tool for hydrological data: Spellmap

    USDA-ARS?s Scientific Manuscript database

    A lack of data management and analysis tools is one of the major limitations to effectively evaluating and using large datasets of high-resolution atmospheric, surface, and subsurface observations. High spatial and temporal resolution datasets better represent the spatiotemporal variability of hydrologica...

  18. SU-G-JeP3-12: Use of Cone Beam CT and Deformable Image Registration for Assessing Geometrical and Dosimetric Variations During Lung Radiotherapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jurkovic, I; Stathakis, S; Markovic, M

    Purpose: To assess the value of cone beam CT (CBCT) combined with deformable image registration in estimating the accuracy of the delivered treatment and the suitability of the applied target margins. Methods: Two patients with lung tumors were selected. Using their CT images, intensity-modulated radiation therapy (IMRT) treatment plans were developed to deliver 66 Gy to 95% of the PTV in 2 Gy fractions. Using the Velocity AI software, the planning CT of each patient was registered with the fractional CBCT images that were obtained through the course of the treatment. After a CT to CBCT deformable image registration (DIR), the same fractional deformation matrix was used for the deformation of the planned dose distributions, as well as of all the contoured volumes, to each CBCT dataset. The dosimetric differences between the planning target volume (PTV) and various organs at risk (OARs) were recorded and compared. Results: CBCT data such as CTV volume change and PTV coverage were analyzed. There was a moderate relationship between volume changes and contouring method (automatic contouring using the DIR transformation vs. manual contouring on each CBCT) for patient #1 (r = 0.49), and a strong relationship for patient #2 (r = 0.83). The average PTV volume coverage from all the CBCT datasets was 91.2% for patient #1 and 95.6% for patient #2. Conclusion: Daily setup variations, tumor volume motion and lung deformation due to breathing yield differences between the actual delivered dose distributions and the planned ones. The results presented indicate that these differences are apparent even with the use of daily IGRT. In certain fractions, the margins used seem to be insufficient to ensure acceptable lung tumor coverage. The observed differences depend notably on the tumor volume size and location. A larger cohort of patients is under investigation to verify these findings.

  19. The Relationship between Case-Volume, Care Quality, and Outcomes of Complex Cancer Surgery

    PubMed Central

    Auerbach, Andrew D; Maselli, Judith; Carter, Jonathan; Pekow, Penelope S; Lindenauer, Peter K

    2010-01-01

    Background: How case volume and quality of care relate to each other and to the results of complex cancer surgery is not well understood. Study Design: Observational cohort of 14,170 patients 18 or older who underwent pneumonectomy, esophagectomy, pancreatectomy, or pelvic surgery for cancer between 10/1/2003 and 9/1/2005 at a United States hospital participating in a large benchmarking database. Case volumes were estimated within our dataset. Quality was measured by determining whether ideal patients did not receive appropriate perioperative medications (such as antibiotics to prevent surgical site infections), both as individual ‘missed’ measures and as the overall number missed. We used hierarchical models to estimate effects of volume and quality on 30-day readmission, in-hospital mortality, length of stay, and costs. Results: After adjustment, we noted no consistent associations between higher hospital or surgeon volume and mortality, readmission, length of stay, or costs. Adherence to individual measures was not consistently associated with improvement in readmission, mortality, or other outcomes. For example, continuing antimicrobials past 24 hours was associated with longer length of stay (21.5% higher, 95% CI 19.5% to 23.6%) and higher costs (17% higher, 95% CI 16% to 19%). In contrast, overall adherence, while not associated with differences in mortality or readmission, was consistently associated with longer length of stay (7.4% longer with one missed measure and 16.4% longer with 2 or more) and higher costs (5% higher with one missed measure, and 11% higher with 2 or more). Conclusions: While hospital and surgeon volume were not associated with outcomes, lower overall adherence to quality measures was associated with higher costs but not with improved outcomes. This finding may provide a rationale for improving care systems by maximizing care consistency, even if outcomes are not affected. PMID:20829079

  20. Unsupervised classification of multivariate geostatistical data: Two algorithms

    NASA Astrophysics Data System (ADS)

    Romary, Thomas; Ors, Fabien; Rivoirard, Jacques; Deraisme, Jacques

    2015-12-01

    With the increasing development of remote sensing platforms and the evolution of sampling facilities in the mining and oil industries, spatial datasets are becoming increasingly large, include a growing number of variables and cover wider and wider areas. Therefore, it is often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into homogeneous domains with respect to the values taken by the variables at hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model-free and can handle large volumes of multivariate, irregularly spaced data. The first one proceeds by agglomerative hierarchical clustering. The spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinate space. The hierarchical algorithm can then be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performance of both algorithms is assessed on toy examples and a mining dataset.
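
    The first algorithm, agglomerative clustering in which two clusters may only merge if they are spatial neighbours, can be approximated off the shelf with scikit-learn's connectivity-constrained AgglomerativeClustering, as in the sketch below. The synthetic coordinates, attribute values and parameter choices are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(500, 2))   # irregular sample locations
values = np.column_stack([
    np.sin(coords[:, 0] / 20) + 0.1 * rng.standard_normal(500),
    np.cos(coords[:, 1] / 20) + 0.1 * rng.standard_normal(500),
])

# Graph of spatial neighbours: two clusters may only merge if they are connected
# here, which is what enforces the spatial coherence of the resulting classes.
connectivity = kneighbors_graph(coords, n_neighbors=8, include_self=False)

model = AgglomerativeClustering(n_clusters=4, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(values)   # clusters in attribute space,
print(np.bincount(labels))           # constrained by spatial adjacency
```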

  1. Integrating heterogeneous earth observation data for assessment of high-resolution inundation boundaries generated during flood emergencies.

    NASA Astrophysics Data System (ADS)

    Sava, E.; Cervone, G.; Kalyanapu, A. J.; Sampson, K. M.

    2017-12-01

    The increasing trend in flooding events, paired with rapid urbanization and an aging infrastructure, is projected to enhance the risk of catastrophic losses and increase the frequency of both flash and large-area floods. During such events, it is critical for decision makers and emergency responders to have access to timely actionable knowledge regarding preparedness, emergency response, and recovery before, during and after a disaster. Large volumes of data derived from sophisticated sensors, mobile phones, and social media feeds are increasingly being used to improve citizen services and provide clues to the best way to respond to emergencies through the use of visualization and GIS mapping. Such data, coupled with recent advancements in techniques for fusing remote sensing with near-real-time heterogeneous datasets, have allowed decision makers to more efficiently extract precise and relevant knowledge and better understand how damage caused by disasters has real-time effects on urban populations. This research assesses the feasibility of integrating multiple sources of contributed data into hydrodynamic models for flood inundation simulation and damage assessment. It integrates multiple sources of high-resolution physiographic data, such as satellite remote sensing imagery, with non-authoritative data such as Civil Air Patrol (CAP) and 'during-event' social media observations of flood inundation in order to improve flood mapping. The goal is to augment remote sensing imagery with new open-source datasets to generate flood extent maps at higher temporal and spatial resolution. The proposed methodology is applied to two test cases, relating to the 2013 Boulder, Colorado flood and the 2015 floods in Texas.

  2. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution dataset mergers, such as the one exemplified here, can serve as a baseline towards comprehensive species distribution datasets.
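
    The grid-cell level of agreement used above can be illustrated with a per-cell Jaccard index over two presence/absence rasters, as in the following sketch. The array layout (species x rows x columns) and the random data are assumptions for illustration.

```python
import numpy as np

def jaccard_map(atlas_a, atlas_b):
    """Per-cell Jaccard similarity between two atlases.

    atlas_a, atlas_b -- boolean arrays of shape (n_species, n_rows, n_cols),
                        True where a species is recorded in a grid cell.
    Returns an (n_rows, n_cols) array with values in [0, 1] (NaN where both empty).
    """
    both = np.logical_and(atlas_a, atlas_b).sum(axis=0).astype(float)
    either = np.logical_or(atlas_a, atlas_b).sum(axis=0).astype(float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(either > 0, both / either, np.nan)

# Example with random presence/absence data for 100 taxa on a 20 x 30 grid.
rng = np.random.default_rng(1)
a = rng.random((100, 20, 30)) > 0.7
b = rng.random((100, 20, 30)) > 0.7
print(np.nanmean(jaccard_map(a, b)))
```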

  3. Development of large scale riverine terrain-bathymetry dataset by integrating NHDPlus HR with NED,CoNED and HAND data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Clark, E. P.

    2017-12-01

    Large-scale and fine-resolution riverine bathymetry data is critical for flood inundation modeling but not available over the continental United States (CONUS). Previously we implemented bankfull hydraulic geometry based approaches to simulate bathymetry for individual rivers using NHDPlus v2.1 data and the 10 m National Elevation Dataset (NED). USGS has recently developed High Resolution NHD data (NHDPlus HR Beta) (USGS, 2017), and this enhanced dataset has a significant improvement in its spatial correspondence with the 10 m DEM. In this study, we used this high resolution data, specifically NHDFlowline and NHDArea, to create bathymetry/terrain for CONUS river channels and floodplains. A software package, NHDPlus Inundation Modeler v5.0 Beta, was developed for this project as an Esri ArcGIS hydrological analysis extension. With the updated tools, the raw 10 m DEM was first hydrologically treated to remove artificial blockages (e.g., overpasses, bridges, elevated roadways, etc.) using low-pass moving window filters. Cross sections were then automatically constructed along each flowline to extract elevation from the hydrologically treated DEM. In this study, river channel shapes were approximated using quadratic curves to reduce uncertainties from commonly used trapezoids. We calculated the underwater channel elevation at each cross-section sampling point using bankfull channel dimensions that were estimated from physiographic province/division based regression equations (Bieger et al. 2015). These elevation points were then interpolated to generate a bathymetry raster. The simulated bathymetry raster was integrated with the USGS NED and the Coastal National Elevation Database (CoNED) (wherever available) to make a seamless terrain-bathymetry dataset. The channel bathymetry was also integrated into the HAND (Height Above Nearest Drainage) dataset to improve large-scale inundation modeling. The generated terrain-bathymetry was processed at the Watershed Boundary Dataset Hydrologic Unit 4 (WBDHU4) level.
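
    The step of approximating the channel shape with a quadratic curve from bankfull dimensions can be sketched as a simple parabola between the banks, as below. The bankfull width and depth values are placeholders for the regional regression estimates (Bieger et al. 2015); this is not the NHDPlus Inundation Modeler code.

```python
import numpy as np

def quadratic_channel(bankfull_width, bankfull_depth, bank_elevation, n_points=25):
    """Bed elevations across a channel approximated as a parabola.

    The bed is lowest at the channel centre (bank_elevation - bankfull_depth)
    and rises quadratically to bank_elevation at both banks.
    """
    x = np.linspace(-bankfull_width / 2, bankfull_width / 2, n_points)
    bed = bank_elevation - bankfull_depth * (1 - (2 * x / bankfull_width) ** 2)
    return x, bed

# Hypothetical bankfull dimensions (m), e.g. as estimated from regional regressions.
x, bed = quadratic_channel(bankfull_width=40.0, bankfull_depth=2.5, bank_elevation=115.0)
print(bed.min(), bed.max())   # thalweg elevation vs. bank elevation
```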

  4. Medical imaging informatics based solutions for human performance analytics

    NASA Astrophysics Data System (ADS)

    Verma, Sneha; McNitt-Gray, Jill; Liu, Brent J.

    2018-03-01

    For human performance analysis, extensive experimental trials are often conducted to identify the underlying cause or long-term consequences of certain pathologies and to improve motor function by examining the movement patterns of affected individuals. Data collected for human performance analysis include high-speed video, surveys, spreadsheets, force data recordings from instrumented surfaces, etc. These datasets are recorded from various standalone sources and therefore captured in different folder structures as well as in varying formats depending on the hardware configuration. Therefore, data integration and synchronization present a huge challenge while handling these multimedia datasets, especially when they are large. Another challenge faced by researchers is querying large quantities of unstructured data and designing feedback/reporting tools for users who need to use the datasets at various levels. In the past, database server storage solutions have been introduced to securely store these datasets. However, to automate the process of uploading raw files, various file manipulation steps are required. In the current workflow, this file manipulation and structuring is done manually and is not feasible for large amounts of data. However, by attaching metadata files and data dictionaries to these raw datasets, they can provide the information and structure needed for automated server upload. We introduce one such system for metadata creation for unstructured multimedia data based on the DICOM data model design. We discuss the design and implementation of this system and evaluate it with a dataset collected for a movement analysis study. The broader aim of this paper is to present a solution space, based on medical imaging informatics design and methods, for improving the workflow of human performance analysis in a biomechanics research lab.

  5. A data-driven model for influenza transmission incorporating media effects.

    PubMed

    Mitchell, Lewis; Ross, Joshua V

    2016-10-01

    Numerous studies have attempted to model the effect of mass media on the transmission of diseases such as influenza; however, quantitative data on media engagement has until recently been difficult to obtain. With the recent explosion of 'big data' coming from online social media and the like, large volumes of data on a population's engagement with mass media during an epidemic are becoming available to researchers. In this study, we combine an online dataset comprising millions of shared messages relating to influenza with traditional surveillance data on flu activity to suggest a functional form for the relationship between the two. Using this data, we present a simple deterministic model for influenza dynamics incorporating media effects, and show that such a model helps explain the dynamics of historical influenza outbreaks. Furthermore, through model selection we show that the proposed media function fits historical data better than other media functions proposed in earlier studies.
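
    In the spirit of the model described above, a minimal deterministic sketch is an SIR system whose transmission rate is damped by a saturating function of media activity M(t). The functional form, the synthetic media signal and all parameter values below are illustrative assumptions, not the fitted model from the study.

```python
import numpy as np
from scipy.integrate import odeint

def media_activity(t):
    """Illustrative media signal, e.g. a pulse of flu-related messages mid-epidemic."""
    return 5000.0 * np.exp(-((t - 40.0) / 10.0) ** 2)

def sir_with_media(y, t, beta0, gamma, k):
    """SIR model whose transmission rate is damped by media engagement."""
    S, I, R = y
    beta = beta0 / (1.0 + k * media_activity(t))   # saturating media effect
    dS = -beta * S * I
    dI = beta * S * I - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

t = np.linspace(0, 120, 600)
y0 = [0.999, 0.001, 0.0]   # proportions of the population: S, I, R
sol = odeint(sir_with_media, y0, t, args=(0.4, 0.2, 1e-4))
print("peak infectious proportion:", sol[:, 1].max())
```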

  6. Genetic overlap between Alzheimer’s disease and Parkinson’s disease at the MAPT locus

    PubMed Central

    Desikan, Rahul S.; Schork, Andrew J.; Wang, Yunpeng; Witoelar, Aree; Sharma, Manu; McEvoy, Linda K.; Holland, Dominic; Brewer, James B.; Chen, Chi-Hua; Thompson, Wesley K.; Harold, Denise; Williams, Julie; Owen, Michael J.; O’Donovan, Michael C.; Pericak-Vance, Margaret A.; Mayeux, Richard; Haines, Jonathan L.; Farrer, Lindsay A.; Schellenberg, Gerard D.; Heutink, Peter; Singleton, Andrew B.; Brice, Alexis; Wood, Nicolas W.; Hardy, John; Martinez, Maria; Choi, Seung Hoi; DeStefano, Anita; Ikram, M. Arfan; Bis, Joshua C.; Smith, Albert; Fitzpatrick, Annette L.; Launer, Lenore; van Duijn, Cornelia; Seshadri, Sudha; Ulstein, Ingun Dina; Aarsland, Dag; Fladby, Tormod; Djurovic, Srdjan; Hyman, Bradley T.; Snaedal, Jon; Stefansson, Hreinn; Stefansson, Kari; Gasser, Thomas; Andreassen, Ole A.; Dale, Anders M.

    2015-01-01

    We investigated genetic overlap between Alzheimer’s disease (AD) and Parkinson’s disease (PD). Using summary statistics (p-values) from large recent genomewide association studies (GWAS) (total n = 89,904 individuals), we sought to identify single nucleotide polymorphisms (SNPs) associating with both AD and PD. We found and replicated association of both AD and PD with the A allele of rs393152 within the extended MAPT region on chromosome 17 (meta analysis p-value across 5 independent AD cohorts = 1.65 × 10−7). In independent datasets, we found a dose-dependent effect of the A allele of rs393152 on intra-cerebral MAPT transcript levels and volume loss within the entorhinal cortex and hippocampus. Our findings identify the tau-associated MAPT locus as a site of genetic overlap between AD and PD and extending prior work, we show that the MAPT region increases risk of Alzheimer’s neurodegeneration. PMID:25687773

  7. Application of the actor model to large scale NDE data analysis

    NASA Astrophysics Data System (ADS)

    Coughlin, Chris

    2018-03-01

    The Actor model of concurrent computation discretizes a problem into a series of independent units or actors that interact only through the exchange of messages. Without direct coupling between individual components, an Actor-based system is inherently concurrent and fault-tolerant. These traits lend themselves to so-called "Big Data" applications in which the volume of data to analyze requires a distributed multi-system design. For a practical demonstration of the Actor computational model, a system was developed to assist with the automated analysis of Nondestructive Evaluation (NDE) datasets using the open source Myriad Data Reduction Framework. A machine learning model trained to detect damage in two-dimensional slices of C-Scan data was deployed in a streaming data processing pipeline. To demonstrate the flexibility of the Actor model, the pipeline was deployed on a local system and re-deployed as a distributed system without recompiling, reconfiguring, or restarting the running application.
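
    The decoupling described above can be illustrated with a toy actor in Python: each actor owns a private mailbox drained by its own thread, and other components interact with it only by sending messages. This is a generic sketch of the pattern, not the Myriad Data Reduction Framework.

```python
import queue
import threading

class Actor:
    """Minimal actor: a private mailbox drained by its own worker thread."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """The only way to interact with an actor is to send it a message."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:   # poison pill to stop the actor
                break
            self._handler(message)

    def stop(self):
        self.send(None)
        self._thread.join()

# Example: an 'analysis' actor that pretends to score 2D slices of C-scan data.
results = []
analyzer = Actor(lambda slice_id: results.append((slice_id, "no damage")))
for i in range(5):
    analyzer.send(i)
analyzer.stop()
print(results)
```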

  8. Fully-automated segmentation of fluid regions in exudative age-related macular degeneration subjects: Kernel graph cut in neutrosophic domain

    PubMed Central

    Rashno, Abdolreza; Nazari, Behzad; Koozekanani, Dara D.; Drayna, Paul M.; Sadri, Saeed; Rabbani, Hossein

    2017-01-01

    A fully-automated method based on graph shortest path, graph cut and neutrosophic (NS) sets is presented for fluid segmentation in OCT volumes for exudative age-related macular degeneration (EAMD) subjects. The proposed method includes three main steps: 1) The inner limiting membrane (ILM) and the retinal pigment epithelium (RPE) layers are segmented using proposed methods based on graph shortest path in the NS domain. A flattened RPE boundary is calculated such that all three types of fluid regions, intra-retinal, sub-retinal and sub-RPE, are located above it. 2) Seed points for fluid (object) and tissue (background) are initialized for graph cut by the proposed automated method. 3) A new cost function is proposed in kernel space, and is minimized with max-flow/min-cut algorithms, leading to a binary segmentation. Important properties of the proposed steps are proven and the quantitative performance of each step is analyzed separately. The proposed method is evaluated using a publicly available dataset referred to as Optima and a local dataset from the UMN clinic. For fluid segmentation in 2D individual slices, the proposed method outperforms the previously proposed methods by 18% and 21% with respect to the dice coefficient and sensitivity, respectively, on the Optima dataset, and by 16%, 11% and 12% with respect to the dice coefficient, sensitivity and precision, respectively, on the local UMN dataset. Finally, for 3D fluid volume segmentation, the proposed method achieves a true positive rate (TPR) and false positive rate (FPR) of 90% and 0.74%, respectively, with a correlation of 95% between automated and expert manual segmentations using linear regression analysis. PMID:29059257

  9. Evaluating the intra- and interobserver reliability of three-dimensional ultrasound and power Doppler angiography (3D-PDA) for assessment of placental volume and vascularity in the second trimester of pregnancy.

    PubMed

    Jones, Nia W; Raine-Fenning, Nick J; Mousa, Hatem A; Bradley, Eileen; Bugg, George J

    2011-03-01

    Three-dimensional (3D) power Doppler angiography (3D-PDA) allows visualisation of Doppler signals within the placenta, and their quantification is possible through the generation of vascular indices by the 4D View software programme. This study aimed to investigate the intra- and interobserver reproducibility of 3D-PDA analysis of stored datasets at varying gestations, with the ultimate goal of developing a tool for predicting placental dysfunction. Women with an uncomplicated, viable singleton pregnancy were scanned in the 12-, 16- or 20-week gestational age groups. 3D-PDA datasets acquired of the whole placenta were analysed using the VOCAL software processing tool. Each volume was analysed by three observers twice in the A plane. Intra- and interobserver reliability was assessed by intraclass correlation coefficients (ICCs) and Bland-Altman plots. In each gestational age group, 20 low-risk women were scanned, resulting in 60 datasets in total. The ICC demonstrated a high level of measurement reliability at each gestation, with intraobserver values >0.90 and interobserver values >0.6 for the vascular indices. Bland-Altman plots also showed high levels of agreement. Systematic bias was seen at 20 weeks in the vascular indices obtained by different observers. This study demonstrates that 3D-PDA data can be measured reliably by different observers from stored datasets up to 18 weeks gestation. Measurements become less reliable as gestation advances, with bias between observers evident at 20 weeks. Copyright © 2011 World Federation for Ultrasound in Medicine & Biology. Published by Elsevier Inc. All rights reserved.
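
    The Bland-Altman part of such a reproducibility analysis reduces to the mean difference (bias) between paired observer measurements and the 95% limits of agreement, as in the sketch below. The vascular-index readings are invented for illustration.

```python
import numpy as np

def bland_altman(obs1, obs2):
    """Bias and 95% limits of agreement for paired measurements by two observers."""
    obs1, obs2 = np.asarray(obs1, float), np.asarray(obs2, float)
    diff = obs1 - obs2
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical vascularisation-index readings of the same stored 3D-PDA datasets
# by two observers.
observer_a = [12.1, 15.4, 9.8, 20.3, 17.5, 11.2]
observer_b = [12.9, 14.8, 10.5, 21.0, 16.9, 12.0]
bias, (lo, hi) = bland_altman(observer_a, observer_b)
print(f"bias = {bias:.2f}, 95% limits of agreement = ({lo:.2f}, {hi:.2f})")
```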

  10. Normalized gradient fields cross-correlation for automated detection of prostate in magnetic resonance images

    NASA Astrophysics Data System (ADS)

    Fotin, Sergei V.; Yin, Yin; Periaswamy, Senthil; Kunz, Justin; Haldankar, Hrishikesh; Muradyan, Naira; Cornud, François; Turkbey, Baris; Choyke, Peter L.

    2012-02-01

    Fully automated prostate segmentation helps to address several problems in prostate cancer diagnosis and treatment: it can assist in objective evaluation of multiparametric MR imagery, provides a prostate contour for MR-ultrasound (or CT) image fusion for computer-assisted image-guided biopsy or therapy planning, may facilitate reporting and enables direct prostate volume calculation. Among the challenges in automated analysis of MR images of the prostate are the variations of overall image intensities across scanners, the presence of a nonuniform multiplicative bias field within scans and differences in acquisition setup. Furthermore, images acquired with the presence of an endorectal coil suffer from localized high-intensity artifacts at the posterior part of the prostate. In this work, a three-dimensional method for fast automated prostate detection based on normalized gradient fields cross-correlation, insensitive to intensity variations and coil-induced artifacts, is presented and evaluated. The components of the method, offline template learning and the localization algorithm, are described in detail. The method was validated on a dataset of 522 T2-weighted MR images acquired at the National Cancer Institute, USA, that was split in two halves for development and testing. In addition, a second dataset of 29 MR exams from the Centre d'Imagerie Médicale Tourville, France, was used to test the algorithm. The 95% confidence intervals for the mean Euclidean distance between automatically and manually identified prostate centroids were 4.06 ± 0.33 mm and 3.10 ± 0.43 mm for the first and second test datasets, respectively. Moreover, the algorithm provided the centroid within the true prostate volume in 100% of images from both datasets. The obtained results demonstrate the high utility of the detection method for fully automated prostate segmentation.
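
    The normalized gradient field idea, gradients rescaled to (near) unit length so that edge orientation rather than raw intensity drives the match, can be sketched as follows. The regularisation constant and the similarity definition below are generic assumptions, not the exact formulation used in the paper.

```python
import numpy as np

def normalized_gradient_field(image, eps=1e-3):
    """Gradient field of a 2D image normalised to (approximately) unit length.

    eps suppresses the contribution of near-constant regions, making the
    descriptor insensitive to global intensity scaling and bias fields.
    """
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.sqrt(gx ** 2 + gy ** 2 + eps ** 2)
    return gx / magnitude, gy / magnitude

def ngf_cross_correlation(fixed, moving):
    """Similarity of two same-sized images based on alignment of their NGFs."""
    fx, fy = normalized_gradient_field(fixed)
    mx, my = normalized_gradient_field(moving)
    return np.mean((fx * mx + fy * my) ** 2)   # close to 1 when edges align

# Example: a template compared against itself scores higher than against noise.
rng = np.random.default_rng(0)
template = np.outer(np.hanning(64), np.hanning(64))
print(ngf_cross_correlation(template, template))
print(ngf_cross_correlation(template, rng.random((64, 64))))
```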

  11. Comparison of volume estimation methods for pancreatic islet cells

    NASA Astrophysics Data System (ADS)

    Dvořák, Jiří; Švihlík, Jan; Habart, David; Kybic, Jan

    2016-03-01

    In this contribution we study different methods of automatic volume estimation for pancreatic islets, which can be used in the quality control step prior to islet transplantation. The total islet volume is an important criterion in the quality control. The individual islet volume distribution is also of interest -- it has been indicated that smaller islets can be more effective. A 2D image of a microscopy slice containing the islets is acquired. The inputs to the volume estimation methods are segmented images of individual islets. The segmentation step is not discussed here. We consider simple methods of volume estimation assuming that the islets have a spherical or ellipsoidal shape. We also consider a local stereological method, namely the nucleator. The nucleator does not rely on any shape assumptions and provides unbiased estimates if isotropic sections through the islets are observed. We present a simulation study comparing the performance of the volume estimation methods in different scenarios and an experimental study comparing the methods on a real dataset.
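
    The simple shape-based estimators mentioned above can be written down directly: under a spherical assumption the volume follows from the islet's 2D cross-sectional area, and under an ellipsoidal assumption from its fitted axes. The input values in this sketch are hypothetical.

```python
import math

def volume_sphere_from_area(area_um2):
    """Assume the 2D profile is a great circle of a sphere: A = pi r^2, V = 4/3 pi r^3."""
    radius = math.sqrt(area_um2 / math.pi)
    return 4.0 / 3.0 * math.pi * radius ** 3

def volume_ellipsoid(major_um, minor_um):
    """Prolate-ellipsoid assumption: the unseen third axis is taken equal to the minor axis."""
    a, b = major_um / 2.0, minor_um / 2.0
    return 4.0 / 3.0 * math.pi * a * b * b

# Hypothetical segmented islet: area 31,400 um^2, fitted axes 220 um x 180 um.
print(f"sphere model:    {volume_sphere_from_area(31_400):,.0f} um^3")
print(f"ellipsoid model: {volume_ellipsoid(220, 180):,.0f} um^3")
```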

  12. A novel approach to segmentation and measurement of medical image using level set methods.

    PubMed

    Chen, Yao-Tien

    2017-06-01

    The study proposes a novel approach for segmentation and visualization, plus value-added surface area and volume measurements, for brain medical image analysis. The proposed method comprises edge detection and Bayesian-based level set segmentation, surface and volume rendering, and surface area and volume measurements for 3D objects of interest (i.e., brain tumor, brain tissue, or whole brain). Two extensions based on edge detection and the Bayesian level set are first used to segment 3D objects. Ray casting and a modified marching cubes algorithm are then adopted to facilitate volume and surface visualization of the medical-image dataset. To provide physicians with more useful information for diagnosis, the surface area and volume of an examined 3D object are calculated by the techniques of linear algebra and surface integration. Experimental results are finally reported in terms of 3D object extraction, surface and volume rendering, and surface area and volume measurements for medical image analysis. Copyright © 2017 Elsevier Inc. All rights reserved.
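
    The surface-area and volume measurements for a segmented 3D object can be illustrated with scikit-image's marching cubes as a stand-in for the modified marching cubes algorithm described above. The synthetic sphere and the voxel spacing are assumptions for illustration.

```python
import numpy as np
from skimage import measure

# Synthetic segmented object: a sphere of radius 20 voxels inside a 64^3 volume.
zz, yy, xx = np.mgrid[:64, :64, :64]
segmentation = ((xx - 32) ** 2 + (yy - 32) ** 2 + (zz - 32) ** 2) <= 20 ** 2

voxel_spacing = (1.0, 1.0, 1.0)   # mm per voxel along z, y, x (assumed)

# Volume: number of object voxels times the physical volume of one voxel.
volume_mm3 = segmentation.sum() * np.prod(voxel_spacing)

# Surface area: triangulate the object boundary and sum the triangle areas.
verts, faces, _, _ = measure.marching_cubes(segmentation.astype(float), level=0.5,
                                            spacing=voxel_spacing)
surface_mm2 = measure.mesh_surface_area(verts, faces)

print(f"volume  ~ {volume_mm3:,.0f} mm^3 (analytic sphere: {4/3*np.pi*20**3:,.0f})")
print(f"surface ~ {surface_mm2:,.0f} mm^2 (analytic sphere: {4*np.pi*20**2:,.0f})")
```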

  13. CT-based manual segmentation and evaluation of paranasal sinuses.

    PubMed

    Pirner, S; Tingelhoff, K; Wagner, I; Westphal, R; Rilk, M; Wahl, F M; Bootz, F; Eichhorn, Klaus W G

    2009-04-01

    Manual segmentation of computed tomography (CT) datasets was performed for robot-assisted endoscope movement during functional endoscopic sinus surgery (FESS). Segmented 3D models are needed for the robots' workspace definition. A total of 50 preselected CT datasets were each segmented in 150-200 coronal slices, with 24 landmarks being set. Three different colors for segmentation represent diverse risk areas. Extension and volumetric measurements were performed. A three-dimensional reconstruction was generated after segmentation. Manual segmentation took 8-10 h for each CT dataset. The mean volumes were: right maxillary sinus 17.4 cm³, left side 17.9 cm³, right frontal sinus 4.2 cm³, left side 4.0 cm³, total frontal sinuses 7.9 cm³, sphenoid sinus right side 5.3 cm³, left side 5.5 cm³, total sphenoid sinus volume 11.2 cm³. Our manually segmented 3D models present the patient's individual anatomy with a special focus on structures in danger according to the diversely colored risk areas. For safe robot assistance, the high-accuracy models represent an average of the population for anatomical variations, extension and volumetric measurements. They can be used as a database for automatic model-based segmentation. None of the segmentation methods described so far provide risk segmentation. The robot's maximum distance to the segmented border can be adjusted according to the differently colored areas.

  14. Development and comparison of projection and image space 3D nodule insertion techniques

    NASA Astrophysics Data System (ADS)

    Robins, Marthony; Solomon, Justin; Sahbaee, Pooyan; Samei, Ehsan

    2016-04-01

    This study aimed to develop and compare two methods of inserting computerized virtual lesions into CT datasets. 24 physical (synthetic) nodules of three sizes and four morphologies were inserted into an anthropomorphic chest phantom (LUNGMAN, KYOTO KAGAKU). The phantom was scanned (Somatom Definition Flash, Siemens Healthcare) with and without nodules present, and images were reconstructed with filtered back projection and iterative reconstruction (SAFIRE) at 0.6 mm slice thickness using a standard thoracic CT protocol at multiple dose settings. Virtual 3D CAD models based on the physical nodules were virtually inserted (accounting for the system MTF) into the nodule-free CT data using two techniques: projection-based and image-based insertion. Nodule volumes were estimated using a commercial segmentation tool (iNtuition, TeraRecon, Inc.). Differences were tested using paired t-tests and R² goodness of fit between the virtually and physically inserted nodules. Both insertion techniques resulted in nodule volumes very similar to those of the real nodules (<3% difference), and in most cases the differences were not statistically significant. Also, R² values were all >0.97 for both insertion techniques. These data imply that these techniques can confidently be used as a means of inserting virtual nodules into CT datasets. Such techniques can be instrumental in building hybrid CT datasets composed of patient images with virtually inserted nodules.

  15. An algorithm based on OmniView technology to reconstruct sagittal and coronal planes of the fetal brain from volume datasets acquired by three-dimensional ultrasound.

    PubMed

    Rizzo, G; Capponi, A; Pietrolucci, M E; Capece, A; Aiello, E; Mammarella, S; Arduini, D

    2011-08-01

    To describe a novel algorithm, based on the new display technology 'OmniView', developed to visualize diagnostic sagittal and coronal planes of the fetal brain from volumes obtained by three-dimensional (3D) ultrasonography. We developed an algorithm to image standard neurosonographic planes by drawing dissecting lines through the axial transventricular view of 3D volume datasets acquired transabdominally. The algorithm was tested on 106 normal fetuses at 18-24 weeks of gestation and the visualization rates of brain diagnostic planes were evaluated by two independent reviewers. The algorithm was also applied to nine cases with proven brain defects. The two reviewers, using the algorithm on normal fetuses, found satisfactory images with visualization rates ranging between 71.7% and 96.2% for sagittal planes and between 76.4% and 90.6% for coronal planes. The agreement rate between the two reviewers, as expressed by Cohen's kappa coefficient, was > 0.93 for sagittal planes and > 0.89 for coronal planes. All nine abnormal volumes were identified by a single observer from among a series including normal brains, and eight of these nine cases were diagnosed correctly. This novel algorithm can be used to visualize standard sagittal and coronal planes in the fetal brain. This approach may simplify the examination of the fetal brain and reduce dependency of success on operator skill. Copyright © 2011 ISUOG. Published by John Wiley & Sons, Ltd.
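
    Inter-reviewer agreement of the kind reported above can be quantified with Cohen's kappa, as in this sketch; the per-volume ratings are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-volume ratings by two reviewers: 1 = plane satisfactorily
# visualised, 0 = not visualised.
reviewer_1 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
reviewer_2 = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa between reviewers: {kappa:.2f}")
```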

  16. Fast and Accurate Support Vector Machines on Large Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vishnu, Abhinav; Narasimhan, Jayenthi; Holder, Larry

    Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary --- also known as a hyperplane --- which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminate the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively --- potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm --- the de facto sequential SVM software --- which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as the UCI HIGGS Boson dataset and the Offending URL dataset.
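
    The observation that very few samples define the boundary can be seen directly with an off-the-shelf SVM: after training, only the support vectors matter, which is what motivates eliminating non-contributing samples early. The sketch below uses scikit-learn on synthetic data and is unrelated to the distributed MPI algorithm described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class dataset standing in for a large benchmark.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=0)

model = SVC(kernel="rbf", C=1.0).fit(X, y)

# Only the support vectors contribute to the decision boundary; every other
# sample could have been dropped without changing the trained model.
n_sv = model.support_vectors_.shape[0]
print(f"{n_sv} of {X.shape[0]} samples ({100 * n_sv / X.shape[0]:.1f}%) define the boundary")
```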

  17. Scalable persistent identifier systems for dynamic datasets

    NASA Astrophysics Data System (ADS)

    Golodoniuc, P.; Cox, S. J. D.; Klump, J. F.

    2016-12-01

    Reliable and persistent identification of objects, whether tangible or not, is essential in information management. Many Internet-based systems have been developed to identify digital data objects, e.g., PURL, LSID, Handle, ARK. These were largely designed for the identification of static digital objects. The amount of data made available online has grown exponentially over the last two decades, and fine-grained identification of dynamically generated data objects within large datasets using conventional systems (e.g., PURL) has become impractical. We have compared the capabilities of various technological solutions to enable resolvability of data objects in dynamic datasets, and developed a dataset-centric approach to the resolution of identifiers. This is particularly important in Semantic Linked Data environments where dynamic, frequently changing data is delivered live via web services, so registration of individual data objects to obtain identifiers is impractical. We use identifier patterns and pattern hierarchies for the identification of data objects, which allows relationships between identifiers to be expressed and also provides a means for resolving a single identifier into multiple forms (i.e. views or representations of an object). The latter can be implemented through (a) HTTP content negotiation, or (b) use of URI querystring parameters. The pattern and hierarchy approach has been implemented in the Linked Data API supporting the United Nations Spatial Data Infrastructure (UNSDI) initiative and later in the implementation of geoscientific data delivery for the Capricorn Distal Footprints project using International Geo Sample Numbers (IGSN). This enables flexible resolution of multi-view persistent identifiers and provides a scalable solution for large heterogeneous datasets.
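
    A toy sketch of pattern-based identifier resolution with content negotiation is shown below: an incoming identifier is matched against registered patterns rather than a per-object registry, and the requested media type selects the representation. The patterns, media types and example identifiers are invented for illustration.

```python
import re

# Ordered list of (pattern, representations) pairs: more specific patterns first.
# Each representation maps a media type to a URL template.
PATTERNS = [
    (re.compile(r"^igsn/(?P<sample>[A-Z0-9.]+)$"),
     {"text/html": "https://example.org/samples/{sample}",
      "application/json": "https://example.org/api/samples/{sample}.json"}),
    (re.compile(r"^dataset/(?P<name>[\w-]+)/feature/(?P<fid>\d+)$"),
     {"text/html": "https://example.org/{name}/features/{fid}",
      "application/json": "https://example.org/api/{name}/features/{fid}.json"}),
]

def resolve(identifier, accept="text/html"):
    """Resolve an identifier to one of its representations via pattern matching."""
    for pattern, representations in PATTERNS:
        match = pattern.match(identifier)
        if match:
            template = representations.get(accept) or representations["text/html"]
            return template.format(**match.groupdict())
    raise KeyError(f"no pattern registered for {identifier!r}")

print(resolve("igsn/AU1234", accept="application/json"))
print(resolve("dataset/capricorn/feature/42"))
```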

  18. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    PubMed Central

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    Objective: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials: We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results: There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance the adoption of NLP in practice. PMID:25336595
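
    The 'simple dictionary-based term recognition' end of the trade-off can be sketched as an exact-match scan of clinical text against a term dictionary, as below. The dictionary, concept identifiers and note are invented; real annotators such as the NCBO Annotator add tokenisation, normalisation and ontology mapping on top of this idea.

```python
import re

# Hypothetical dictionary mapping surface forms to concept identifiers.
TERM_DICTIONARY = {
    "myocardial infarction": "C0027051",
    "metformin": "C0025598",
    "type 2 diabetes": "C0011860",
}

def annotate(text, dictionary):
    """Return (concept_id, surface_form, start, end) for every dictionary hit."""
    hits = []
    lowered = text.lower()
    for term, concept_id in dictionary.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((concept_id, term, match.start(), match.end()))
    return sorted(hits, key=lambda h: h[2])

note = "Patient with type 2 diabetes started on metformin; no history of myocardial infarction."
for concept_id, term, start, end in annotate(note, TERM_DICTIONARY):
    print(concept_id, term, (start, end))
```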

  19. Plant databases and data analysis tools

    USDA-ARS?s Scientific Manuscript database

    It is anticipated that the coming years will see the generation of large datasets including diagnostic markers in several plant species with emphasis on crop plants. To use these datasets effectively in any plant breeding program, it is essential to have the information available via public database...

  20. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    PubMed Central

    Jensen, Tue V.; Pinson, Pierre

    2017-01-01

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation over a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation. PMID:29182600
