Sample records for large scale datasets

  1. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    NASA Astrophysics Data System (ADS)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.

  2. Multiresolution comparison of precipitation datasets for large-scale models

    NASA Astrophysics Data System (ADS)

    Chun, K. P.; Sapriza Azuri, G.; Davison, B.; DeBeer, C. M.; Wheater, H. S.

    2014-12-01

    Gridded precipitation datasets are crucial for driving large-scale models which are related to weather forecast and climate research. However, the quality of precipitation products is usually validated individually. Comparisons between gridded precipitation products along with ground observations provide another avenue for investigating how the precipitation uncertainty would affect the performance of large-scale models. In this study, using data from a set of precipitation gauges over British Columbia and Alberta, we evaluate several widely used North America gridded products including the Canadian Gridded Precipitation Anomalies (CANGRD), the National Center for Environmental Prediction (NCEP) reanalysis, the Water and Global Change (WATCH) project, the thin plate spline smoothing algorithms (ANUSPLIN) and Canadian Precipitation Analysis (CaPA). Based on verification criteria for various temporal and spatial scales, results provide an assessment of possible applications for various precipitation datasets. For long-term climate variation studies (~100 years), CANGRD, NCEP, WATCH and ANUSPLIN have different comparative advantages in terms of their resolution and accuracy. For synoptic and mesoscale precipitation patterns, CaPA provides appealing performance of spatial coherence. In addition to the products comparison, various downscaling methods are also surveyed to explore new verification and bias-reduction methods for improving gridded precipitation outputs for large-scale models.

  3. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    NASA Astrophysics Data System (ADS)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national scale flood risk analyses, using high resolution Facebook Connectivity Lab population data and data from a hyper resolution flood hazard model. In recent years the field of large scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken both hazard and exposure data should sufficiently resolve local scale features. Global flood frameworks are enabling flood hazard data to be produced at 90m resolution, resulting in a mismatch with available population datasets which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5m resolution, representing a resolution increase over previous countrywide data sets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
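
    The integration of modelled water depths with a gridded exposure dataset described above can be illustrated with a minimal sketch. It assumes both grids are already aligned cell for cell; the function name, the depth threshold, and the sample values are hypothetical and are not taken from the study.

```python
def exposed_population(depth_m, population, threshold_m=0.0):
    """Sum the population of every cell whose modelled depth exceeds the threshold.

    Both arguments are lists of rows on the same grid (an assumption of this sketch).
    """
    total = 0.0
    for depth_row, pop_row in zip(depth_m, population):
        for depth, people in zip(depth_row, pop_row):
            if depth > threshold_m:
                total += people
    return total


# Hypothetical 2 x 3 grids: water depth in metres and people per cell.
depth_grid = [[0.0, 0.3, 1.2],
              [0.0, 0.0, 0.6]]
population_grid = [[10, 5, 0],
                   [20, 8, 3]]

any_flooding = exposed_population(depth_grid, population_grid)
deep_flooding = exposed_population(depth_grid, population_grid, threshold_m=0.5)
print(any_flooding, deep_flooding)
```

    In practice the hazard and population grids would first have to be resampled to a common resolution, which is the mismatch the abstract discusses.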

  4. The use of large scale datasets for understanding traffic network state.

    DOT National Transportation Integrated Search

    2013-09-01

    The goal of this proposal is to develop novel modeling techniques to infer individual activity patterns from the large scale cell phone datasets and taxi data from NYC. As such this research offers a paradigm shift from traditional transportation m...

  5. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

    PubMed

    Ernst, Jason; Kellis, Manolis

    2015-04-01

    With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.

  6. Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.

    PubMed

    Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T

    2017-01-01

    Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
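
    The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the abstract (158× speedup, 384 cores, 12.5 h baseline); the function names are hypothetical, and parallel efficiency is computed here as speedup divided by core count, which is a common convention rather than a metric stated by the authors.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / cores


def scaled_minutes(baseline_hours: float, speedup: float) -> float:
    """Per-iteration time in minutes after applying a speedup factor."""
    return baseline_hours * 60.0 / speedup


efficiency = parallel_efficiency(158.0, 384)
minutes = scaled_minutes(12.5, 158.0)
print(f"efficiency: {efficiency:.2f}, time per iteration: {minutes:.2f} min")
```

    The result is roughly 41% efficiency and about 4.75 minutes per iteration, consistent with the "<5 min" quoted above.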

  7. Trace: a high-throughput tomographic reconstruction engine for large-scale datasets

    DOE PAGES

    Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...

    2017-01-28

    Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition techniques at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Trace provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.

  8. Trace: a high-throughput tomographic reconstruction engine for large-scale datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De

    Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition techniques at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Trace provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.

  9. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    PubMed Central

    Jensen, Tue V.; Pinson, Pierre

    2017-01-01

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation. PMID:29182600

  10. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system.

    PubMed

    Jensen, Tue V; Pinson, Pierre

    2017-11-28

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  11. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    NASA Astrophysics Data System (ADS)

    Jensen, Tue V.; Pinson, Pierre

    2017-11-01

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  12. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments.

    PubMed

    Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian

    2014-07-01

    We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.

  13. Querying Large Biological Network Datasets

    ERIC Educational Resources Information Center

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in an increasing amount of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast large-scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…

  14. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution
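
    The Jaccard similarity index mentioned in this abstract is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows, assuming each atlas is represented as a mapping from grid cell to the set of taxa recorded there; the cell and species identifiers are hypothetical, and this is an illustration of the index, not the authors' workflow.

```python
def jaccard(set_a: set, set_b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical species lists per 50-km grid cell in two atlases.
atlas_a = {"cell_1": {"sp1", "sp2", "sp3"}, "cell_2": {"sp1"}}
atlas_b = {"cell_1": {"sp2", "sp3", "sp4"}, "cell_2": {"sp1"}}

# Agreement per cell; a cell missing from atlas_b is treated as an empty species list.
agreement = {cell: jaccard(taxa, atlas_b.get(cell, set()))
             for cell, taxa in atlas_a.items()}
print(agreement)
```

    Values near 1 mark cells where the two atlases list nearly the same taxa, and values near 0 mark cells where they disagree.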

  15. Discovering Cortical Folding Patterns in Neonatal Cortical Surfaces Using Large-Scale Dataset

    PubMed Central

    Meng, Yu; Li, Gang; Wang, Li; Lin, Weili; Gilmore, John H.

    2017-01-01

    The cortical folding of the human brain is highly complex and variable across individuals. Mining the major patterns of cortical folding from modern large-scale neuroimaging datasets is of great importance in advancing techniques for neuroimaging analysis and understanding the inter-individual variations of cortical folding and its relationship with cognitive function and disorders. As the primary cortical folding is genetically influenced and has been established at term birth, neonates with minimal exposure to the complicated postnatal environmental influence are the ideal candidates for understanding the major patterns of cortical folding. In this paper, for the first time, we propose a novel method for discovering the major patterns of cortical folding in a large-scale dataset of neonatal brain MR images (N = 677). In our method, first, cortical folding is characterized by the distribution of sulcal pits, which are the locally deepest points in cortical sulci. Because deep sulcal pits are genetically related, relatively consistent across individuals, and also stable during brain development, they are well suited for representing and characterizing cortical folding. Then, the similarities between sulcal pit distributions of any two subjects are measured from spatial, geometrical, and topological points of view. Next, these different measurements are adaptively fused together using a similarity network fusion technique, to preserve their common information and also catch their complementary information. Finally, leveraging the fused similarity measurements, a hierarchical affinity propagation algorithm is used to group similar sulcal folding patterns together. The proposed method has been applied to 677 neonatal brains (the largest neonatal dataset to our knowledge) in the central sulcus, superior temporal sulcus, and cingulate sulcus, and revealed multiple distinct and meaningful folding patterns in each region. PMID:28229131

  16. Multiresolution persistent homology for excessively large biomolecular datasets

    NASA Astrophysics Data System (ADS)

    Xia, Kelin; Zhao, Zhixiong; Wei, Guo-Wei

    2015-10-01

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273 780 atoms is successfully analyzed which would otherwise be inaccessible to the normal point cloud method and unreliable by using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to the protein domain classification, which is the first time that persistent homology is used for practical protein domain analysis, to our knowledge. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.
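
    The idea of tuning the resolution of a density built from atom positions can be sketched as follows. The abstract does not give the kernel, so the Gaussian form, the function name, and the scale parameter `eta` are assumptions made for illustration only; they are not the authors' definition of the rigidity density.

```python
import math


def rigidity_density(point, atoms, eta):
    """Sum of Gaussian kernels exp(-(d / eta)^2) centred on each atom (assumed kernel form)."""
    return sum(math.exp(-(math.dist(point, atom) / eta) ** 2) for atom in atoms)


# Two hypothetical atoms 4 units apart, probed at one atom and at the midpoint.
atoms = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
at_atom = (0.0, 0.0, 0.0)
midpoint = (2.0, 0.0, 0.0)

fine_atom = rigidity_density(at_atom, atoms, eta=1.0)
fine_mid = rigidity_density(midpoint, atoms, eta=1.0)
coarse_atom = rigidity_density(at_atom, atoms, eta=4.0)
coarse_mid = rigidity_density(midpoint, atoms, eta=4.0)
print(fine_atom, fine_mid, coarse_atom, coarse_mid)
```

    With the small scale the density dips between the atoms, so they appear as two separate features. With the large scale the midpoint is the densest location and the pair merges into one feature, which is the sense in which the resolution selects the scale of interest.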

  17. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes are achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  18. OMERO and Bio-Formats 5: flexible access to large bioimaging datasets at scale

    NASA Astrophysics Data System (ADS)

    Moore, Josh; Linkert, Melissa; Blackburn, Colin; Carroll, Mark; Ferguson, Richard K.; Flynn, Helen; Gillen, Kenneth; Leigh, Roger; Li, Simon; Lindner, Dominik; Moore, William J.; Patterson, Andrew J.; Pindelski, Blazej; Ramalingam, Balaji; Rozbicki, Emil; Tarkowska, Aleksandra; Walczysko, Petr; Allan, Chris; Burel, Jean-Marie; Swedlow, Jason

    2015-03-01

    The Open Microscopy Environment (OME) has built and released Bio-Formats, a Java-based proprietary file format conversion tool and OMERO, an enterprise data management platform under open source licenses. In this report, we describe new versions of Bio-Formats and OMERO that are specifically designed to support large, multi-gigabyte or terabyte scale datasets that are routinely collected across most domains of biological and biomedical research. Bio-Formats reads image data directly from native proprietary formats, bypassing the need for conversion into a standard format. It implements the concept of a file set, a container that defines the contents of multi-dimensional data comprised of many files. OMERO uses Bio-Formats to read files natively, and provides a flexible access mechanism that supports several different storage and access strategies. These new capabilities of OMERO and Bio-Formats make them especially useful in imaging applications like digital pathology, high content screening and light sheet microscopy that routinely create large datasets that must be managed and analyzed.

  19. A large-scale dataset of solar event reports from automated feature recognition modules

    NASA Astrophysics Data System (ADS)

    Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.

    2016-05-01

    The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  20. Large scale validation of the M5L lung CAD on heterogeneous CT datasets.

    PubMed

    Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P

    2015-04-01

    M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover the nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based on a small number of features (13), so as to minimize the risk of lacking generalization, which could be possible given the large difference between the size of the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which is not yet found in literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground glass opacities (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGOs detection could further improve it, as well as an iterative optimization of the training procedure. The main aim of the present study was accomplished: M5L results do not deteriorate when increasing the dataset size, making it a candidate for supporting radiologists on large scale screenings and clinical programs.

  1. CImbinator: a web-based tool for drug synergy analysis in small- and large-scale datasets.

    PubMed

    Flobak, Åsmund; Vazquez, Miguel; Lægreid, Astrid; Valencia, Alfonso

    2017-08-01

    Drug synergies are sought to identify drug combinations that are particularly beneficial. User-friendly software solutions that can assist analysis of large-scale datasets are required. CImbinator is a web-service that can aid in batch-wise and in-depth analyses of data from small-scale and large-scale drug combination screens. CImbinator quantifies drug combination effects using both the commonly employed median effect equation and advanced experimental mathematical models describing dose response relationships. CImbinator is written in Ruby and R. It uses the R package drc for advanced drug response modeling. CImbinator is available at http://cimbinator.bioinfo.cnio.es , the source-code is open and available at https://github.com/Rbbt-Workflows/combination_index . A Docker image is also available at https://hub.docker.com/r/mikisvaz/rbbt-ci_mbinator/ . Contact: asmund.flobak@ntnu.no or miguel.vazquez@cnio.es. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
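
    For readers unfamiliar with the median effect equation named above, a minimal sketch follows. It uses the standard form fa / (1 - fa) = (D / Dm)^m and the Chou-Talalay combination index as it is commonly defined; the function names and all dose values are hypothetical, and this is not CImbinator's own code, which is written in Ruby and R.

```python
def fraction_affected(dose, dm, m):
    """Median effect equation: fa / (1 - fa) = (dose / dm) ** m, solved for fa."""
    ratio = (dose / dm) ** m
    return ratio / (1.0 + ratio)


def dose_for_effect(fa, dm, m):
    """Dose of a single drug needed to reach fraction affected fa (0 < fa < 1)."""
    return dm * (fa / (1.0 - fa)) ** (1.0 / m)


def combination_index(d1, d2, fa, dm1, m1, dm2, m2):
    """CI = d1 / Dx1 + d2 / Dx2, where Dx is the single-drug dose giving the same effect."""
    return d1 / dose_for_effect(fa, dm1, m1) + d2 / dose_for_effect(fa, dm2, m2)


# Hypothetical screen: doses 2.5 and 5.0 together reach 50% effect, while the
# single drugs need 10.0 and 20.0 (their median effect doses) to do the same.
ci = combination_index(d1=2.5, d2=5.0, fa=0.5, dm1=10.0, m1=1.0, dm2=20.0, m2=1.0)
print(ci)
```

    Under the usual reading, a combination index below 1 suggests synergy, a value of 1 additivity, and a value above 1 antagonism.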

  2. geoknife: Reproducible web-processing of large gridded datasets

    USGS Publications Warehouse

    Read, Jordan S.; Walker, Jordan I.; Appling, Alison P.; Blodgett, David L.; Read, Emily K.; Winslow, Luke A.

    2016-01-01

    Geoprocessing of large gridded data according to overlap with irregular landscape features is common to many large-scale ecological analyses. The geoknife R package was created to facilitate reproducible analyses of gridded datasets found on the U.S. Geological Survey Geo Data Portal web application or elsewhere, using a web-enabled workflow that eliminates the need to download and store large datasets that are reliably hosted on the Internet. The package provides access to several data subset and summarization algorithms that are available on remote web processing servers. Outputs from geoknife include spatial and temporal data subsets, spatially-averaged time series values filtered by user-specified areas of interest, and categorical coverage fractions for various land-use types.

  3. Determining Scale-dependent Patterns in Spatial and Temporal Datasets

    NASA Astrophysics Data System (ADS)

    Roy, A.; Perfect, E.; Mukerji, T.; Sylvester, L.

    2016-12-01

    Spatial and temporal datasets of interest to Earth scientists often contain plots of one variable against another, e.g., rainfall magnitude vs. time or fracture aperture vs. spacing. Such data, comprised of distributions of events along a transect / timeline along with their magnitudes, can display persistent or antipersistent trends, as well as random behavior, that may contain signatures of underlying physical processes. Lacunarity is a technique that was originally developed for multiscale analysis of data. In a recent study we showed that lacunarity can be used for revealing changes in scale-dependent patterns in fracture spacing data. Here we present a further improvement in our technique, with lacunarity applied to various non-binary datasets comprised of event spacings and magnitudes. We test our technique on a set of four synthetic datasets, three of which are based on an autoregressive model and have magnitudes at every point along the "timeline" thus representing antipersistent, persistent, and random trends. The fourth dataset is made up of five clusters of events, each containing a set of random magnitudes. The concept of lacunarity ratio, LR, is introduced; this is the lacunarity of a given dataset normalized to the lacunarity of its random counterpart. It is demonstrated that LR can successfully delineate scale-dependent changes in terms of antipersistence and persistence in the synthetic datasets. This technique is then applied to three different types of data: a hundred-year rainfall record from Knoxville, TN, USA, a set of varved sediments from Marca Shale, and a set of fracture aperture and spacing data from NE Mexico. While the rainfall data and varved sediments both appear to be persistent at small scales, at larger scales they both become random. On the other hand, the fracture data shows antipersistence at small scale (within cluster) and random behavior at large scales. Such differences in behavior with respect to scale-dependent changes in
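
    The lacunarity ratio described in this record can be sketched for a one-dimensional series as follows. The sketch assumes a gliding-box lacunarity (second moment of the box masses divided by the squared first moment) and builds the "random counterpart" by shuffling the series, averaged over several shuffles. The abstract does not say how the authors construct the random counterpart, so that choice and all function names are assumptions.

```python
import random


def lacunarity(series, box):
    """Gliding-box lacunarity of a non-negative 1-D series at one box size."""
    n = len(series)
    if not 1 <= box <= n:
        raise ValueError("box size must be between 1 and len(series)")
    # Rolling sum gives the "mass" inside each box position.
    mass = sum(series[:box])
    masses = [mass]
    for i in range(box, n):
        mass += series[i] - series[i - box]
        masses.append(mass)
    mean = sum(masses) / len(masses)
    if mean == 0:
        raise ValueError("series must have a positive total")
    second_moment = sum(s * s for s in masses) / len(masses)
    return second_moment / (mean * mean)


def lacunarity_ratio(series, box, shuffles=50, seed=0):
    """Lacunarity of the series divided by the mean lacunarity of shuffled copies.

    Shuffling is an assumed stand-in for the "random counterpart".
    """
    rng = random.Random(seed)
    shuffled = list(series)
    total = 0.0
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        total += lacunarity(shuffled, box)
    return lacunarity(series, box) / (total / shuffles)
```

    Under these assumptions a ratio above 1 at a given box size indicates more clustering than in the shuffled series at that scale, and a ratio near 1 indicates behavior close to random.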

  4. Extraction of drainage networks from large terrain datasets using high throughput computing

    NASA Astrophysics Data System (ADS)

    Gong, Jianya; Xie, Jibo

    2009-02-01

    Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage network extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. An HTC environment is employed to test the proposed methods with real datasets.

  5. A large-scale solar dynamics observatory image dataset for computer vision applications.

    PubMed

    Kucuk, Ahmet; Banda, Juan M; Angryk, Rafal A

    2017-01-01

    The National Aeronautics and Space Administration (NASA) Solar Dynamics Observatory (SDO) mission has given us unprecedented insight into the Sun's activity. By capturing approximately 70,000 images a day, this mission has created one of the richest and biggest repositories of solar image data available to mankind. With such massive amounts of information, researchers have been able to produce great advances in detecting solar events. In this resource, we compile SDO solar data into a single repository in order to provide the computer vision community with a standardized and curated large-scale dataset of several hundred thousand solar events found on high resolution solar images. This publicly available resource, along with the generation source code, will accelerate computer vision research on NASA's solar image data by reducing the amount of time spent performing data acquisition and curation from the multiple sources we have compiled. By improving the quality of the data with thorough curation, we anticipate wider adoption and interest, from the computer vision community to the solar physics community.

  6. Development of large scale riverine terrain-bathymetry dataset by integrating NHDPlus HR with NED, CoNED and HAND data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Clark, E. P.

    2017-12-01

    Large scale and fine resolution riverine bathymetry data is critical for flood inundation modeling but not available over the continental United States (CONUS). Previously we implemented bankfull hydraulic geometry based approaches to simulate bathymetry for individual rivers using NHDPlus v2.1 data and 10 m National Elevation Dataset (NED). USGS has recently developed High Resolution NHD data (NHDPlus HR Beta) (USGS, 2017), and this enhanced dataset has a significant improvement on its spatial correspondence with 10 m DEM. In this study, we used this high resolution data, specifically NHDFlowline and NHDArea, to create bathymetry/terrain for CONUS river channels and floodplains. A software package NHDPlus Inundation Modeler v5.0 Beta was developed for this project as an Esri ArcGIS hydrological analysis extension. With the updated tools, raw 10 m DEM was first hydrologically treated to remove artificial blockages (e.g., overpasses, bridges and even roadways, etc.) using low pass moving window filters. Cross sections were then automatically constructed along each flowline to extract elevation from the hydrologically treated DEM. In this study, river channel shapes were approximated using quadratic curves to reduce uncertainties from commonly used trapezoids. We calculated underneath water channel elevation at each cross section sampling point using bankfull channel dimensions that were estimated from physiographic province/division based regression equations (Bieger et al. 2015). These elevation points were then interpolated to generate bathymetry raster. The simulated bathymetry raster was integrated with USGS NED and Coastal National Elevation Database (CoNED) (wherever available) to make seamless terrain-bathymetry dataset. Channel bathymetry was also integrated to the HAND (Height above Nearest Drainage) dataset to improve large scale inundation modeling. The generated terrain-bathymetry was processed at Watershed Boundary Dataset Hydrologic Unit 4 (WBDHU4) level.
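
    The quadratic channel shape described in this record can be illustrated with a small sketch. It assumes a symmetric parabola that meets the bank elevation at both channel edges and reaches the bankfull depth at the centreline. The bankfull width and depth would come from the regression equations cited in the abstract; here they are plain inputs, and all names are hypothetical.

```python
def parabolic_bed_elevation(offset, bank_elevation, bankfull_width, bankfull_depth):
    """Bed elevation at a lateral offset from the channel centreline.

    All arguments share one length unit. Outside the channel the bank
    elevation is returned unchanged (assumed to come from the treated DEM).
    """
    half_width = bankfull_width / 2.0
    if abs(offset) >= half_width:
        return bank_elevation
    frac = offset / half_width
    # Depth is largest at the centreline and falls to zero at the banks.
    return bank_elevation - bankfull_depth * (1.0 - frac * frac)


def cross_section(bank_elevation, bankfull_width, bankfull_depth, n_points=11):
    """Evenly spaced (offset, elevation) sampling points across the channel."""
    if n_points < 2:
        raise ValueError("need at least two sampling points")
    half_width = bankfull_width / 2.0
    step = bankfull_width / (n_points - 1)
    points = []
    for i in range(n_points):
        offset = -half_width + i * step
        elevation = parabolic_bed_elevation(
            offset, bank_elevation, bankfull_width, bankfull_depth
        )
        points.append((offset, elevation))
    return points
```

    Sampling points produced this way could then be interpolated into a raster, as the abstract describes; that step is not shown.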

  7. MODFLOW-LGR: Practical application to a large regional dataset

    NASA Astrophysics Data System (ADS)

    Barnes, D.; Coulibaly, K. M.

    2011-12-01

    In many areas of the US, including southwest Florida, large regional-scale groundwater models have been developed to aid in decision making and water resources management. These models are subsequently used as a basis for site-specific investigations. Because the large scale of these regional models is not appropriate for local application, refinement is necessary to analyze the local effects of pumping wells and groundwater related projects at specific sites. The most commonly used approach to date is Telescopic Mesh Refinement or TMR. It allows the extraction of a subset of the large regional model with boundary conditions derived from the regional model results. The extracted model is then updated and refined for local use using a variable sized grid focused on the area of interest. MODFLOW-LGR, local grid refinement, is an alternative approach which allows model discretization at a finer resolution in areas of interest and provides coupling between the larger "parent" model and the locally refined "child." In the present work, these two approaches are tested on a mining impact assessment case in southwest Florida using a large regional dataset (The Lower West Coast Surficial Aquifer System Model). Various metrics for performance are considered. They include: computation time, water balance (as compared to the variable sized grid), calibration, implementation effort, and application advantages and limitations. The results indicate that MODFLOW-LGR is a useful tool to improve local resolution of regional scale models. While performance metrics, such as computation time, are case-dependent (model size, refinement level, stresses involved), implementation effort, particularly when regional models of suitable scale are available, can be minimized. The creation of multiple child models within a larger scale parent model makes it possible to reuse the same calibrated regional dataset with minimal modification. In cases similar to the Lower West Coast model, where a

  8. Quantitative Missense Variant Effect Prediction Using Large-Scale Mutagenesis Data.

    PubMed

    Gray, Vanessa E; Hause, Ronald J; Luebeck, Jens; Shendure, Jay; Fowler, Douglas M

    2018-01-24

    Large datasets describing the quantitative effects of mutations on protein function are becoming increasingly available. Here, we leverage these datasets to develop Envision, which predicts the magnitude of a missense variant's molecular effect. Envision combines 21,026 variant effect measurements from nine large-scale experimental mutagenesis datasets, a hitherto untapped training resource, with a supervised, stochastic gradient boosting learning algorithm. Envision outperforms other missense variant effect predictors both on large-scale mutagenesis data and on an independent test dataset comprising 2,312 TP53 variants whose effects were measured using a low-throughput approach. This dataset was never used for hyperparameter tuning or model training and thus serves as an independent validation set. Envision prediction accuracy is also more consistent across amino acids than other predictors. Finally, we demonstrate that Envision's performance improves as more large-scale mutagenesis data are incorporated. We precompute Envision predictions for every possible single amino acid variant in human, mouse, frog, zebrafish, fruit fly, worm, and yeast proteomes (https://envision.gs.washington.edu/). Copyright © 2017 Elsevier Inc. All rights reserved.

  9. cellVIEW: a Tool for Illustrative and Multi-Scale Rendering of Large Biomolecular Datasets

    PubMed Central

    Le Muzic, Mathieu; Autin, Ludovic; Parulek, Julius; Viola, Ivan

    2017-01-01

    In this article we introduce cellVIEW, a new system to interactively visualize large biomolecular datasets on the atomic level. Our tool is unique and has been specifically designed to match the ambitions of our domain experts to model and interactively visualize structures comprised of several billion atoms. The cellVIEW system integrates acceleration techniques to allow for real-time graphics performance of 60 Hz display rate on datasets representing large viruses and bacterial organisms. Inspired by the work of scientific illustrators, we propose a level-of-detail scheme whose purpose is two-fold: accelerating the rendering and reducing visual clutter. The main part of our datasets is made out of macromolecules, but it also comprises nucleic acid strands which are stored as sets of control points. For that specific case, we extend our rendering method to support the dynamic generation of DNA strands directly on the GPU. It is noteworthy that our tool has been directly implemented inside a game engine. We chose to rely on a third party engine to reduce software development work-load and to make bleeding-edge graphics techniques more accessible to the end-users. To our knowledge cellVIEW is the only suitable solution for interactive visualization of large biomolecular landscapes on the atomic level and is freely available to use and extend. PMID:29291131

  10. Impacts of a lengthening open water season on Alaskan coastal communities: deriving locally relevant indices from large-scale datasets and community observations

    NASA Astrophysics Data System (ADS)

    Rolph, Rebecca J.; Mahoney, Andrew R.; Walsh, John; Loring, Philip A.

    2018-05-01

    Using thresholds of physical climate variables developed from community observations, together with two large-scale datasets, we have produced local indices directly relevant to the impacts of a reduced sea ice cover on Alaska coastal communities. The indices include the number of false freeze-ups defined by transient exceedances of ice concentration prior to a corresponding exceedance that persists, false break-ups, timing of freeze-up and break-up, length of the open water duration, number of days when the winds preclude hunting via boat (wind speed threshold exceedances), the number of wind events conducive to geomorphological work or damage to infrastructure from ocean waves, and the number of these wind events with on- and along-shore components promoting water setup along the coastline. We demonstrate how community observations can inform use of large-scale datasets to derive these locally relevant indices. The two primary large-scale datasets are the Historical Sea Ice Atlas for Alaska and the atmospheric output from a regional climate model used to downscale the ERA-Interim atmospheric reanalysis. We illustrate the variability and trends of these indices by application to the rural Alaska communities of Kotzebue, Shishmaref, and Utqiaġvik (previously Barrow), although the same procedure and metrics can be applied to other coastal communities. Over the 1979-2014 time period, there has been a marked increase in the number of combined false freeze-ups and false break-ups as well as the number of days too windy for hunting via boat for all three communities, especially Utqiaġvik. At Utqiaġvik, there has been an approximate tripling of the number of wind events conducive to coastline erosion from 1979 to 2014. We have also found a delay in freeze-up and earlier break-up, leading to a lengthened open water period for all of the communities examined.
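
    The false freeze-up index defined in this record can be sketched as a simple run-length count on a daily ice concentration series. The threshold and persistence values below are placeholders, since the study derives its thresholds from community observations and the abstract does not give them. Function and parameter names are hypothetical.

```python
def freeze_up_events(concentration, threshold=0.15, persist_days=14):
    """Count false freeze-ups and locate the persistent freeze-up.

    concentration: daily ice concentration values for one freeze-up season.
    A run of consecutive days above `threshold` counts as the freeze-up if it
    lasts at least `persist_days`; shorter runs before it are false freeze-ups.
    Returns (false_freeze_up_count, freeze_up_day_index or None).
    """
    runs = []  # (start_index, length) of each run above the threshold
    start = None
    for day, value in enumerate(concentration):
        if value > threshold:
            if start is None:
                start = day
        elif start is not None:
            runs.append((start, day - start))
            start = None
    if start is not None:
        runs.append((start, len(concentration) - start))

    false_count = 0
    for run_start, length in runs:
        if length >= persist_days:
            return false_count, run_start
        false_count += 1
    return false_count, None
```

    False break-ups could be counted the same way by looking for transient drops below a threshold before a persistent one.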

  11. Parallel Index and Query for Large Scale Data Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chou, Jerry; Wu, Kesheng; Ruebel, Oliver

    2011-07-18

    Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.

  12. Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets

    NASA Astrophysics Data System (ADS)

    Day-Lewis, F. D.; Slater, L. D.; Johnson, T.

    2012-12-01

    Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
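
    One of the approaches reviewed in this record, cross-correlation between a geophysical and a hydrologic signal, can be illustrated with a minimal sketch. It computes the Pearson correlation at a range of integer lags and reports the lag with the highest value. It assumes evenly sampled, equal-length series and is not taken from the reviewed studies; all names are hypothetical.

```python
import math


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag] over the overlap."""
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    n = min(len(a), len(b))
    if n < 2:
        raise ValueError("not enough overlapping samples for this lag")
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] with the highest correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: lagged_correlation(x, y, lag))
```

    A positive best lag means the second series trails the first. Non-stationary signals, which the abstract mentions, would need windowed or time-frequency methods rather than this single global estimate.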

  13. Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.

    PubMed

    Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O; Gelfand, Alan E

    2016-01-01

    Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The number of floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online.
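
    The sparse precision matrix mentioned in this abstract can be illustrated with a small NumPy sketch. Each location is conditioned on at most m nearest neighbors among the locations that precede it in the ordering, which gives a precision matrix of the form (I - A)^T F^{-1} (I - A) with A strictly lower triangular. The exponential covariance, the use of the input order as the ordering, and the dense storage are all simplifying assumptions made here for readability. The names are hypothetical, and this is not the authors' code.

```python
import numpy as np


def exp_cov(a, b, sigma2=1.0, phi=1.0):
    """Exponential covariance between two sets of coordinates (assumed kernel)."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * dist)


def nngp_precision(coords, m, sigma2=1.0, phi=1.0):
    """Nearest-neighbor approximation to the precision matrix.

    Dense arrays are used only for clarity; in practice A has at most m
    nonzeros per row and would be stored sparsely.
    """
    n = coords.shape[0]
    A = np.zeros((n, n))
    f = np.empty(n)
    f[0] = sigma2  # the first location has no predecessors
    for i in range(1, n):
        # Neighbors are the (up to) m closest earlier locations.
        dist = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nbrs = np.argsort(dist)[:m]
        c_nn = exp_cov(coords[nbrs], coords[nbrs], sigma2, phi)
        c_ni = exp_cov(coords[nbrs], coords[i:i + 1], sigma2, phi)[:, 0]
        b = np.linalg.solve(c_nn, c_ni)  # kriging weights on the neighbors
        A[i, nbrs] = b
        f[i] = sigma2 - c_ni @ b  # conditional variance
    i_minus_a = np.eye(n) - A
    return i_minus_a.T @ (i_minus_a / f[:, None])
```

    When m is at least n - 1, every location conditions on all of its predecessors and the result matches the exact inverse covariance, which gives a simple way to check the construction.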

  14. Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets

    PubMed Central

    Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O.; Gelfand, Alan E.

    2018-01-01

    Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The number of floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online. PMID:29720777

  15. Interactive Exploration on Large Genomic Datasets.

    PubMed

    Tu, Eric

    2016-01-01

    The prevalence of large genomics datasets has made the need to explore this data more important. Large sequencing projects like the 1000 Genomes Project [1], which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200TB of publicly available data. Meanwhile, existing genomic visualization tools have been unable to scale with the growing amount of larger, more complex data. This difficulty is acute when viewing large regions (over 1 megabase, or 1,000,000 bases of DNA), or when concurrently viewing multiple samples of data. While genomic processing pipelines have shifted towards using distributed computing techniques, such as with ADAM [4], genomic visualization tools have not. In this work we present Mango, a scalable genome browser built on top of ADAM that can run both locally and on a cluster. Mango presents a combination of different optimizations that can be combined in a single application to drive novel genomic visualization techniques over terabytes of genomic data. By building visualization on top of a distributed processing pipeline, we can perform visualization queries over large regions that are not possible with current tools, and decrease the time for viewing large data sets. Mango is part of the Big Data Genomics project at University of California-Berkeley [25] and is published under the Apache 2 license. Mango is available at https://github.com/bigdatagenomics/mango.

  16. A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video

    DTIC Science & Technology

    2011-06-01

    orders of magnitude larger than existing datasets such CAVIAR [7]. TRECVID 2008 airport dataset [16] contains 100 hours of video, but, it provides only...entire human figure (e.g., above shoulder), amounting to 500% human to video 2Some statistics are approximate, obtained from the CAVIAR 1st scene and...and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16] shown in Fig. 3

  17. Assembling Large, Multi-Sensor Climate Datasets Using the SciFlo Grid Workflow System

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; Manipon, G.; Xing, Z.; Fetzer, E.

    2008-12-01

    NASA's Earth Observing System (EOS) is the world's most ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the A-Train platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the cloud scenes from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time matchups between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, and assemble merged datasets for further scientific and statistical analysis. To meet these large-scale challenges, we are utilizing a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data query, access, subsetting, co-registration, mining, fusion, and advanced statistical analysis. SciFlo is a semantically-enabled ("smart") Grid Workflow system that ties together a peer-to-peer network of computers into an efficient engine for distributed computation. The SciFlo workflow engine enables scientists to do multi-instrument Earth Science by assembling remotely-invokable Web Services (SOAP or http GET URLs), native executables, command-line scripts, and Python codes into a distributed computing flow. A scientist visually authors the graph of operation in the VizFlow GUI, or uses a

  18. Assembling Large, Multi-Sensor Climate Datasets Using the SciFlo Grid Workflow System

    NASA Astrophysics Data System (ADS)

    Wilson, B.; Manipon, G.; Xing, Z.; Fetzer, E.

    2009-04-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To meet these large-scale challenges, we are utilizing a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data query, access, subsetting, co-registration, mining, fusion, and advanced statistical analysis. SciFlo is a semantically-enabled ("smart") Grid Workflow system that ties together a peer-to-peer network of computers into an efficient engine for distributed computation. The SciFlo workflow engine enables scientists to do multi-instrument Earth Science by assembling remotely-invokable Web Services (SOAP or http GET URLs), native executables, command-line scripts, and Python codes into a distributed computing flow. A scientist visually authors the graph of operation in the Viz

  19. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background: The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results: In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion: Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than

  20. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting

  1. Large Scale Survey Data in Career Development Research

    ERIC Educational Resources Information Center

    Diemer, Matthew A.

    2008-01-01

    Large scale survey datasets have been underutilized but offer numerous advantages for career development scholars, as they contain numerous career development constructs with large and diverse samples that are followed longitudinally. Constructs such as work salience, vocational expectations, educational expectations, work satisfaction, and…

  2. Large-scale machine learning and evaluation platform for real-time traffic surveillance

    NASA Astrophysics Data System (ADS)

    Eichel, Justin A.; Mishra, Akshaya; Miller, Nicholas; Jankovic, Nicholas; Thomas, Mohan A.; Abbott, Tyler; Swanson, Douglas; Keller, Joel

    2016-09-01

    In traffic engineering, vehicle detectors are trained on limited datasets, resulting in poor accuracy when deployed in real-world surveillance applications. Annotating large-scale high-quality datasets is challenging. Typically, these datasets have limited diversity; they do not reflect the real-world operating environment. There is a need for a large-scale, cloud-based positive and negative mining process and a large-scale learning and evaluation system for the application of automatic traffic measurements and classification. The proposed positive and negative mining process addresses the quality of crowdsourced ground truth data through machine learning review and human feedback mechanisms. The proposed learning and evaluation system uses a distributed cloud computing framework to handle data-scaling issues associated with large numbers of samples and a high-dimensional feature space. The system is trained using AdaBoost on 1,000,000 Haar-like features extracted from 70,000 annotated video frames. The trained real-time vehicle detector achieves an accuracy of at least 95% for 1/2 of the time and about 78% for 19/20 of the time when tested on ~7,500,000 video frames. At the end of 2016, the dataset is expected to have over 1 billion annotated video frames.

  3. Finding Spatio-Temporal Patterns in Large Sensor Datasets

    ERIC Educational Resources Information Center

    McGuire, Michael Patrick

    2010-01-01

    Spatial or temporal data mining tasks are performed in the context of the relevant space, defined by a spatial neighborhood, and the relevant time period, defined by a specific time interval. Furthermore, when mining large spatio-temporal datasets, interesting patterns typically emerge where the dataset is most dynamic. This dissertation is…

  4. Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2011-01-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ("column groups"), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 Grows of PanSTARRS+SDSS data (220 GB) in less than 15 minutes on a dual-CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbits/sec (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
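
    The horizontal partitioning and per-cell kernels described in this abstract can be illustrated with a minimal sketch. It is not LSD's API: the names cell_id, partition, map_reduce and count_bright, the 10-degree cell size and the toy catalog are all assumptions made for illustration, and the sketch ignores the time axis, the cell overlap and the HDF5 storage that the abstract mentions.

```python
from collections import defaultdict
from functools import reduce

CELL_DEG = 10.0  # hypothetical cell size in degrees, chosen only for this sketch


def cell_id(lon, lat, cell_deg=CELL_DEG):
    """Map a (lon, lat) position in degrees to an integer (column, row) cell index."""
    return (int((lon % 360.0) // cell_deg), int((lat + 90.0) // cell_deg))


def partition(rows, cell_deg=CELL_DEG):
    """Group rows (dicts with 'lon' and 'lat' keys) by the cell they fall in."""
    cells = defaultdict(list)
    for row in rows:
        cells[cell_id(row["lon"], row["lat"], cell_deg)].append(row)
    return cells


def map_reduce(cells, mapper, reducer, initial):
    """Apply `mapper` to each cell independently, then fold the per-cell results."""
    per_cell = (mapper(cid, rows) for cid, rows in cells.items())
    return reduce(reducer, per_cell, initial)


def count_bright(cid, rows, limit=20.0):
    """Example kernel: count the rows of one cell brighter than `limit` magnitudes."""
    return sum(1 for r in rows if r["mag"] < limit)


# Toy catalog (made-up values, not survey data).
catalog = [
    {"lon": 12.3, "lat": -5.0, "mag": 18.2},
    {"lon": 14.9, "lat": -4.1, "mag": 21.7},
    {"lon": 200.0, "lat": 45.0, "mag": 19.5},
]
cells = partition(catalog)
total = map_reduce(cells, count_bright, lambda a, b: a + b, 0)
```

    Because each mapper call sees only the rows of one cell, the calls could be handed to separate workers without changing the result, which is the property a per-cell programming model relies on.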

  5. A Hybrid Neuro-Fuzzy Model For Integrating Large Earth-Science Datasets

    NASA Astrophysics Data System (ADS)

    Porwal, A.; Carranza, J.; Hale, M.

    2004-12-01

    A GIS-based hybrid neuro-fuzzy approach to integration of large earth-science datasets for mineral prospectivity mapping is described. It implements a Takagi-Sugeno type fuzzy inference system in the framework of a four-layered feed-forward adaptive neural network. Each unique combination of the datasets is considered a feature vector whose components are derived by knowledge-based ordinal encoding of the constituent datasets. A subset of feature vectors with a known output target vector (i.e., unique conditions known to be associated with either a mineralized or a barren location) is used for the training of an adaptive neuro-fuzzy inference system. Training involves iterative adjustment of parameters of the adaptive neuro-fuzzy inference system using a hybrid learning procedure for mapping each training vector to its output target vector with minimum sum of squared error. The trained adaptive neuro-fuzzy inference system is used to process all feature vectors. The output for each feature vector is a value that indicates the extent to which a feature vector belongs to the mineralized class or the barren class. These values are used to generate a prospectivity map. The procedure is demonstrated by an application to regional-scale base metal prospectivity mapping in a study area located in the Aravalli metallogenic province (western India). A comparison of the hybrid neuro-fuzzy approach with pure knowledge-driven fuzzy and pure data-driven neural network approaches indicates that the former offers a superior method for integrating large earth-science datasets for predictive spatial mathematical modelling.
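
    As a rough illustration of how a Takagi-Sugeno type system turns a feature vector into a class-membership value, the sketch below evaluates a small rule base. It is a sketch under stated assumptions, not the authors' model: Gaussian membership functions, product firing strengths, the function names and the two example rules are all hypothetical, and the hybrid training procedure is not shown.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership value of x for a fuzzy set (center, sigma)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def ts_infer(x, rules):
    """Evaluate a first-order Takagi-Sugeno rule base for one feature vector.

    Each rule is (centers, sigmas, coeffs, bias): the firing strength is the
    product of the per-feature memberships, and the rule output is the linear
    function bias + sum(coeffs[i] * x[i]). The result is the firing-strength
    weighted average of the rule outputs.
    """
    num = 0.0
    den = 0.0
    for centers, sigmas, coeffs, bias in rules:
        weight = 1.0
        for xi, c, s in zip(x, centers, sigmas):
            weight *= gaussian(xi, c, s)
        output = bias + sum(a * xi for a, xi in zip(coeffs, x))
        num += weight * output
        den += weight
    return num / den if den > 0.0 else 0.0


# Two hypothetical rules on a single encoded feature:
# "low evidence -> barren (0)" and "high evidence -> mineralized (1)".
rules = [
    ([0.0], [0.3], [0.0], 0.0),
    ([1.0], [0.3], [0.0], 1.0),
]
score_mid = ts_infer([0.5], rules)
score_high = ts_infer([1.0], rules)
```

    In an adaptive neuro-fuzzy system, the centers, widths and consequent coefficients would be the parameters adjusted during training; here they are fixed by hand.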

  6. Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.

    PubMed

    Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher

    2008-01-01

    With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established; however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.

  7. Addressing Methodological Challenges in Large Communication Datasets: Collecting and Coding Longitudinal Interactions in Home Hospice Cancer Care

    PubMed Central

    Reblin, Maija; Clayton, Margaret F; John, Kevin K; Ellington, Lee

    2015-01-01

    In this paper, we present strategies for collecting and coding a large longitudinal communication dataset collected across multiple sites, consisting of over 2000 hours of digital audio recordings from approximately 300 families. We describe our methods within the context of implementing a large-scale study of communication during cancer home hospice nurse visits, but this procedure could be adapted to communication datasets across a wide variety of settings. This research is the first study designed to capture home hospice nurse-caregiver communication, a highly understudied location and type of communication event. We present a detailed example protocol encompassing data collection in the home environment, large-scale, multi-site secure data management, the development of theoretically-based communication coding, and strategies for preventing coder drift and ensuring reliability of analyses. Although each of these challenges has the potential to undermine the utility of the data, reliability between coders is often the only issue consistently reported and addressed in the literature. Overall, our approach demonstrates rigor and provides a "how-to" example for managing large, digitally-recorded data sets from collection through analysis. These strategies can inform other large-scale health communication research. PMID:26580414

  8. Parallel task processing of very large datasets

    NASA Astrophysics Data System (ADS)

    Romig, Phillip Richardson, III

    This research concerns the use of distributed computer technologies for the analysis and management of very large datasets. Improvements in sensor technology, an emphasis on global change research, and greater access to data warehouses are all increasing the number of non-traditional users of remotely sensed data. We present a framework for distributed solutions to the challenges of datasets which exceed the online storage capacity of individual workstations. This framework, called parallel task processing (PTP), incorporates both the task- and data-level parallelism exemplified by many image processing operations. An implementation based on the principles of PTP, called Tricky, is also presented. Additionally, we describe the challenges and practical issues in modeling the performance of parallel task processing with large datasets. We present a mechanism for estimating the running time of each unit of work within a system and an algorithm that uses these estimates to simulate the execution environment and produce estimated runtimes. Finally, we describe and discuss experimental results which validate the design. Specifically, the system (a) is able to perform computation on datasets which exceed the capacity of any one disk, (b) provides reduction of overall computation time as a result of the task distribution even with the additional cost of data transfer and management, and (c) in the simulation mode accurately predicts the performance of the real execution environment.
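
    The idea of feeding per-task runtime estimates into a simulation of the execution environment can be sketched as follows. This is an assumption-laden illustration rather than the algorithm used in PTP or Tricky: the function name simulate_makespan, the greedy "next free worker" policy and the example estimates are hypothetical, and each estimate is taken to already include data-transfer cost.

```python
import heapq


def simulate_makespan(task_estimates, n_workers):
    """Estimate the total runtime of a task list on `n_workers` identical workers.

    Tasks are dispatched in list order to whichever worker becomes free first.
    The return value is the time at which the last worker finishes.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    free_at = [0.0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for estimate in task_estimates:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + estimate)
    return max(free_at)


# Hypothetical per-task estimates in seconds (compute plus transfer).
estimates = [4.0, 3.0, 2.0, 2.0, 1.0]
serial_time = simulate_makespan(estimates, 1)
parallel_time = simulate_makespan(estimates, 2)
```

    Comparing the simulated times for different worker counts gives a quick view of whether distributing the tasks is expected to pay off once transfer cost is included in the estimates.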

  9. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets

    PubMed Central

    Heath, Allison P; Greenway, Matthew; Powell, Raymond; Spring, Jonathan; Suarez, Rafael; Hanley, David; Bandlamudi, Chai; McNerney, Megan E; White, Kevin P; Grossman, Robert L

    2014-01-01

    Background: As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it. Methods: Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required. Results: Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample. Conclusions: Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics. PMID:24464852

  10. Atlas-Guided Cluster Analysis of Large Tractography Datasets

    PubMed Central

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292

  11. Atlas-guided cluster analysis of large tractography datasets.

    PubMed

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.

  12. Really big data: Processing and analysis of large datasets

    USDA-ARS's Scientific Manuscript database

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  13. Large-scale image region documentation for fully automated image biomarker algorithm development and evaluation.

    PubMed

    Reeves, Anthony P; Xie, Yiting; Liu, Shuang

    2017-04-01

    With the advent of fully automated image analysis and modern machine learning methods, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. This paper presents a method and implementation for facilitating such datasets that addresses the critical issue of size scaling for algorithm validation and evaluation; current evaluation methods that are usually used in academic studies do not scale to large datasets. This method includes protocols for the documentation of many regions in very large image datasets; the documentation may be incrementally updated by new image data and by improved algorithm outcomes. This method has been used for 5 years in the context of chest health biomarkers from low-dose chest CT images that are now being used with increasing frequency in lung cancer screening practice. The lung scans are segmented into over 100 different anatomical regions, and the method has been applied to a dataset of over 20,000 chest CT images. Using this framework, the computer algorithms have been developed to achieve over 90% acceptable image segmentation on the complete dataset.

  14. TIMPs of parasitic helminths - a large-scale analysis of high-throughput sequence datasets.

    PubMed

    Cantacessi, Cinzia; Hofmann, Andreas; Pickering, Darren; Navarro, Severine; Mitreva, Makedonka; Loukas, Alex

    2013-05-30

    Tissue inhibitors of metalloproteases (TIMPs) are a multifunctional family of proteins that orchestrate extracellular matrix turnover, tissue remodelling and other cellular processes. In parasitic helminths, such as hookworms, TIMPs have been proposed to play key roles in the host-parasite interplay, including invasion of and establishment in the vertebrate animal hosts. Currently, knowledge of helminth TIMPs is limited to a small number of studies on canine hookworms, whereas no information is available on the occurrence of TIMPs in other parasitic helminths causing neglected diseases. In the present study, we conducted a large-scale investigation of TIMP proteins of a range of neglected human parasites including the hookworm Necator americanus, the roundworm Ascaris suum, the liver flukes Clonorchis sinensis and Opisthorchis viverrini, as well as the schistosome blood flukes. This entailed mining available transcriptomic and/or genomic sequence datasets for the presence of homologues of known TIMPs, predicting secondary structures of defined protein sequences, systematic phylogenetic analyses and assessment of differential expression of genes encoding putative TIMPs in the developmental stages of A. suum, N. americanus and Schistosoma haematobium which infect the mammalian hosts. A total of 15 protein sequences with high homology to known eukaryotic TIMPs were predicted from the complement of sequence data available for parasitic helminths and subjected to in-depth bioinformatic analyses. Supported by the availability of gene manipulation technologies such as RNA interference and/or transgenesis, this work provides a basis for future functional explorations of helminth TIMPs and, in particular, of their role/s in fundamental biological pathways linked to long-term establishment in the vertebrate hosts, with a view towards the development of novel approaches for the control of neglected helminthiases.

  15. Large-Scale Astrophysical Visualization on Smartphones

    NASA Astrophysics Data System (ADS)

    Becciani, U.; Massimino, P.; Costa, A.; Gheller, C.; Grillo, A.; Krokos, M.; Petta, C.

    2011-07-01

    Nowadays digital sky surveys and long-duration, high-resolution numerical simulations using high performance computing and grid systems produce multidimensional astrophysical datasets in the order of several Petabytes. Sharing visualizations of such datasets within communities and collaborating research groups is of paramount importance for disseminating results and advancing astrophysical research. Moreover educational and public outreach programs can benefit greatly from novel ways of presenting these datasets by promoting understanding of complex astrophysical processes, e.g., formation of stars and galaxies. We have previously developed VisIVO Server, a grid-enabled platform for high-performance large-scale astrophysical visualization. This article reviews the latest developments on VisIVO Web, a custom designed web portal wrapped around VisIVO Server, then introduces VisIVO Smartphone, a gateway connecting VisIVO Web and data repositories for mobile astrophysical visualization. We discuss current work and summarize future developments.

  16. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets.

    PubMed

    Heath, Allison P; Greenway, Matthew; Powell, Raymond; Spring, Jonathan; Suarez, Rafael; Hanley, David; Bandlamudi, Chai; McNerney, Megan E; White, Kevin P; Grossman, Robert L

    2014-01-01

    As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it. Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required. Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample. Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  17. Large-scale image region documentation for fully automated image biomarker algorithm development and evaluation

    PubMed Central

    Reeves, Anthony P.; Xie, Yiting; Liu, Shuang

    2017-01-01

    With the advent of fully automated image analysis and modern machine learning methods, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. This paper presents a method and implementation for facilitating such datasets that addresses the critical issue of size scaling for algorithm validation and evaluation; current evaluation methods that are usually used in academic studies do not scale to large datasets. This method includes protocols for the documentation of many regions in very large image datasets; the documentation may be incrementally updated by new image data and by improved algorithm outcomes. This method has been used for 5 years in the context of chest health biomarkers from low-dose chest CT images that are now being used with increasing frequency in lung cancer screening practice. The lung scans are segmented into over 100 different anatomical regions, and the method has been applied to a dataset of over 20,000 chest CT images. Using this framework, the computer algorithms have been developed to achieve over 90% acceptable image segmentation on the complete dataset. PMID:28612037

  18. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset.

    PubMed

    Seashore-Ludlow, Brinton; Rees, Matthew G; Cheah, Jaime H; Cokol, Murat; Price, Edmund V; Coletti, Matthew E; Jones, Victor; Bodycombe, Nicole E; Soule, Christian K; Gould, Joshua; Alexander, Benjamin; Li, Ava; Montgomery, Philip; Wawer, Mathias J; Kuru, Nurdan; Kotz, Joanne D; Hon, C Suk-Yee; Munoz, Benito; Liefeld, Ted; Dančík, Vlado; Bittker, Joshua A; Palmer, Michelle; Bradner, James E; Shamji, Alykhan F; Clemons, Paul A; Schreiber, Stuart L

    2015-11-01

    Identifying genetic alterations that prime a cancer cell to respond to a particular therapeutic agent can facilitate the development of precision cancer medicines. Cancer cell-line (CCL) profiling of small-molecule sensitivity has emerged as an unbiased method to assess the relationships between genetic or cellular features of CCLs and small-molecule response. Here, we developed annotated cluster multidimensional enrichment analysis to explore the associations between groups of small molecules and groups of CCLs in a new, quantitative sensitivity dataset. This analysis reveals insights into small-molecule mechanisms of action, and genomic features that associate with CCL response to small-molecule treatment. We are able to recapitulate known relationships between FDA-approved therapies and cancer dependencies and to uncover new relationships, including for KRAS-mutant cancers and neuroblastoma. To enable the cancer community to explore these data, and to generate novel hypotheses, we created an updated version of the Cancer Therapeutic Response Portal (CTRP v2). We present the largest CCL sensitivity dataset yet available, and an analysis method integrating information from multiple CCLs and multiple small molecules to identify CCL response predictors robustly. We updated the CTRP to enable the cancer research community to leverage these data and analyses. ©2015 American Association for Cancer Research.

  19. Unified Access Architecture for Large-Scale Scientific Datasets

    NASA Astrophysics Data System (ADS)

    Karna, Risav

    2014-05-01

    Data-intensive sciences have to deploy diverse large scale database technologies for data analytics as scientists have now been dealing with much larger volume than ever before. While array databases have bridged many gaps between the needs of data-intensive research fields and DBMS technologies (Zhang 2011), invocation of other big data tools accompanying these databases is still manual and separate from the database management system's interface. We identify this as an architectural challenge that will increasingly complicate the user's work flow owing to the growing number of useful but isolated and niche database tools. Such use of data analysis tools in effect leaves the burden on the user's end to synchronize the results from other data manipulation analysis tools with the database management system. To this end, we propose a unified access interface for using big data tools within a large-scale scientific array database, using the database queries themselves to embed foreign routines belonging to the big data tools. Such an invocation of foreign data manipulation routines inside a query into a database can be made possible through a user-defined function (UDF). UDFs that allow such levels of freedom as to call modules from another language and interface back and forth between the query body and the side-loaded functions would be needed for this purpose. For the purpose of this research we attempt coupling of four widely used tools, Hadoop (hadoop1), Matlab (matlab1), R (r1) and ScaLAPACK (scalapack1), with the UDF feature of rasdaman (Baumann 98), an array-based data manager, for investigating this concept. The native array data model used by an array-based data manager provides compact data storage and high performance operations on ordered data such as spatial data, temporal data, and matrix-based data for linear algebra operations (scidbusr1). Performance issues arising due to coupling of tools with different paradigms, niche functionalities, separate processes and output

  20. paraGSEA: a scalable approach for large-scale gene expression profiling

    PubMed Central

    Peng, Shaoliang; Yang, Shunyun

    2017-01-01

    More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the estimation of significance level step and multiple hypothesis testing step, the computation scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters in an efficient manner with high scalability and performance on large-scale datasets. The analysis time of whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000 node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
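
    To make the O(mn) versus O(m+n) contrast concrete, the sketch below computes the classical unweighted GSEA running-sum enrichment score in a single pass, using a hash set for membership tests. The abstract does not describe how paraGSEA reaches its bound, so this is not paraGSEA's algorithm; the function name enrichment_score and the toy gene lists are assumptions for illustration only.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of `gene_set` in a ranked gene list.

    Building the set costs O(m) and the scans cost O(n), so the total is
    O(m + n). Testing membership against a plain list inside the loop would
    instead cost O(m * n).
    """
    members = set(gene_set)
    n = len(ranked_genes)
    hits = sum(1 for g in ranked_genes if g in members)
    if hits == 0 or hits == n:
        return 0.0  # score is undefined without both hits and misses
    step_up = 1.0 / hits          # increment when a set member is met
    step_down = 1.0 / (n - hits)  # decrement for every non-member
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in members else -step_down
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Toy ranked list and gene sets (hypothetical gene names).
ranked = ["A", "B", "C", "D"]
es_top = enrichment_score(ranked, ["A", "B"])
es_bottom = enrichment_score(ranked, ["C", "D"])
```

    A positive score indicates that the set's members concentrate near the top of the ranking and a negative score that they concentrate near the bottom.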

  1. Remote visualization and scale analysis of large turbulence datasets

    NASA Astrophysics Data System (ADS)

    Livescu, D.; Pulido, J.; Burns, R.; Canada, C.; Ahrens, J.; Hamann, B.

    2015-12-01

    Accurate simulations of turbulent flows require solving all the dynamically relevant scales of motions. This technique, called Direct Numerical Simulation, has been successfully applied to a variety of simple flows; however, the large-scale flows encountered in Geophysical Fluid Dynamics (GFD) would require meshes outside the range of the most powerful supercomputers for the foreseeable future. Nevertheless, the current generation of petascale computers has enabled unprecedented simulations of many types of turbulent flows which focus on various GFD aspects, from the idealized configurations extensively studied in the past to more complex flows closer to the practical applications. The pace at which such simulations are performed only continues to increase; however, the simulations themselves are restricted to a small number of groups with access to large computational platforms. Yet the petabytes of turbulence data offer almost limitless information on many different aspects of the flow, from the hierarchy of turbulence moments, spectra and correlations, to structure-functions, geometrical properties, etc. The ability to share such datasets with other groups can significantly reduce the time to analyze the data, help the creative process and increase the pace of discovery. Using the largest DOE supercomputing platforms, we have performed some of the biggest turbulence simulations to date, in various configurations, addressing specific aspects of turbulence production and mixing mechanisms. Until recently, the visualization and analysis of such datasets was restricted by access to large supercomputers. The public Johns Hopkins Turbulence database simplifies the access to multi-Terabyte turbulence datasets and facilitates turbulence analysis through the use of commodity hardware. First, one of our datasets, which is part of the database, will be described and then a framework that adds high-speed visualization and wavelet support for multi-resolution analysis of

  2. How large a dataset should be in order to estimate scaling exponents and other statistics correctly in studies of solar wind turbulence

    NASA Astrophysics Data System (ADS)

    Rowlands, G.; Kiyani, K. H.; Chapman, S. C.; Watkins, N. W.

    2009-12-01

    Quantitative analyses of solar wind fluctuations are often performed in the context of intermittent turbulence and center around methods to quantify statistical scaling, such as power spectra and structure functions, which assume a stationary process. The solar wind exhibits large-scale secular changes, and so the question arises as to whether the time series of the fluctuations is non-stationary. One approach is to seek a local stationarity by parsing the time interval over which statistical analysis is performed. Hence, natural systems such as the solar wind unavoidably provide observations over restricted intervals. Consequently, due to a reduction of sample size leading to poorer estimates, a stationary stochastic process (time series) can yield anomalous time variation in the scaling exponents, suggestive of nonstationarity. The variance in the estimates of scaling exponents computed from an interval of N observations is known for finite variance processes to vary as ~1/N as N becomes large for certain statistical estimators; however, the convergence to this behavior will depend on the details of the process, and may be slow. We study the variation in the scaling of second-order moments of the time-series increments with N for a variety of synthetic and “real world” time series, and we find that in particular for heavy tailed processes, for realizable N, one is far from this ~1/N limiting behavior. We propose a semiempirical estimate for the minimum N needed to make a meaningful estimate of the scaling exponents for model stochastic processes and compare these with some “real world” time series from the solar wind. With fewer datapoints the stationary time series becomes indistinguishable from a nonstationary process and we illustrate this with nonstationary synthetic datasets. Reference article: K. H. Kiyani, S. C. Chapman and N. W. Watkins, Phys. Rev. E 79, 036109 (2009).
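
    The dependence of the exponent estimate on sample size can be explored with a small numerical experiment. The sketch below is not the authors' estimator or data. It assumes a Gaussian random walk (a finite-variance process whose second-order structure function scales linearly with lag, so the expected exponent is 1), fits the log-log slope by ordinary least squares, and reports the mean and standard deviation of the estimate over repeated realizations. All function names, the choice of lags and the number of trials are hypothetical.

```python
import math
import random


def structure_function(x, tau):
    """Second-order structure function: mean squared increment at lag tau."""
    diffs = [(x[i + tau] - x[i]) ** 2 for i in range(len(x) - tau)]
    return sum(diffs) / len(diffs)


def scaling_exponent(x, taus=(1, 2, 4, 8, 16)):
    """Least-squares slope of log S2(tau) against log tau."""
    lx = [math.log(t) for t in taus]
    ly = [math.log(structure_function(x, t)) for t in taus]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


def random_walk(n, rng):
    """Cumulative sum of n standard Gaussian steps."""
    walk, pos = [], 0.0
    for _ in range(n):
        pos += rng.gauss(0.0, 1.0)
        walk.append(pos)
    return walk


def exponent_spread(n_points, n_trials=30, seed=0):
    """Mean and standard deviation of the exponent estimate over many walks."""
    rng = random.Random(seed)
    est = [scaling_exponent(random_walk(n_points, rng)) for _ in range(n_trials)]
    mean = sum(est) / n_trials
    var = sum((e - mean) ** 2 for e in est) / n_trials
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        mean, std = exponent_spread(n)
        print(f"N={n:5d}  mean exponent={mean:.3f}  std={std:.3f}")
```

    Replacing the Gaussian steps with draws from a heavy-tailed distribution would be the natural way to probe the slow convergence described in the abstract; that variation is left out here to keep the sketch short.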

  3. A semiparametric graphical modelling approach for large-scale equity selection

    PubMed Central

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption. PMID:28316507

  4. A semiparametric graphical modelling approach for large-scale equity selection.

    PubMed

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption.

  5. Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories.

    PubMed

    Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas

    2016-09-19

    Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.

  6. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains calculated radiation, which is calculated by using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, and evapotranspiration (potential evapotranspiration, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. 
The dataset variables are used as

  7. TIMPs of parasitic helminths – a large-scale analysis of high-throughput sequence datasets

    PubMed Central

    2013-01-01

    Background Tissue inhibitors of metalloproteases (TIMPs) are a multifunctional family of proteins that orchestrate extracellular matrix turnover, tissue remodelling and other cellular processes. In parasitic helminths, such as hookworms, TIMPs have been proposed to play key roles in the host-parasite interplay, including invasion of and establishment in the vertebrate animal hosts. Currently, knowledge of helminth TIMPs is limited to a small number of studies on canine hookworms, whereas no information is available on the occurrence of TIMPs in other parasitic helminths causing neglected diseases. Methods In the present study, we conducted a large-scale investigation of TIMP proteins of a range of neglected human parasites including the hookworm Necator americanus, the roundworm Ascaris suum, the liver flukes Clonorchis sinensis and Opisthorchis viverrini, as well as the schistosome blood flukes. This entailed mining available transcriptomic and/or genomic sequence datasets for the presence of homologues of known TIMPs, predicting secondary structures of defined protein sequences, systematic phylogenetic analyses and assessment of differential expression of genes encoding putative TIMPs in the developmental stages of A. suum, N. americanus and Schistosoma haematobium which infect the mammalian hosts. Results A total of 15 protein sequences with high homology to known eukaryotic TIMPs were predicted from the complement of sequence data available for parasitic helminths and subjected to in-depth bioinformatic analyses. Conclusions Supported by the availability of gene manipulation technologies such as RNA interference and/or transgenesis, this work provides a basis for future functional explorations of helminth TIMPs and, in particular, of their role/s in fundamental biological pathways linked to long-term establishment in the vertebrate hosts, with a view towards the development of novel approaches for the control of neglected helminthiases. PMID:23721526

  8. VisIVO: A Library and Integrated Tools for Large Astrophysical Dataset Exploration

    NASA Astrophysics Data System (ADS)

    Becciani, U.; Costa, A.; Ersotelos, N.; Krokos, M.; Massimino, P.; Petta, C.; Vitello, F.

    2012-09-01

    VisIVO provides an integrated suite of tools and services that can be used in many scientific fields. VisIVO development starts in the Virtual Observatory framework. VisIVO allows users to visualize meaningfully highly-complex, large-scale datasets and create movies of these visualizations based on distributed infrastructures. VisIVO supports high-performance, multi-dimensional visualization of large-scale astrophysical datasets. Users can rapidly obtain meaningful visualizations while preserving full and intuitive control of the relevant parameters. VisIVO consists of VisIVO Desktop - a stand-alone application for interactive visualization on standard PCs, VisIVO Server - a platform for high performance visualization, VisIVO Web - a custom designed web portal, VisIVOSmartphone - an application to exploit the VisIVO Server functionality and the latest VisIVO features: VisIVO Library allows a job running on a computational system (grid, HPC, etc.) to produce movies directly with the code internal data arrays without the need to produce intermediate files. This is particularly important when running on large computational facilities, where the user wants to have a look at the results during the data production phase. For example, in grid computing facilities, images can be produced directly in the grid catalogue while the user code is running in a system that cannot be directly accessed by the user (a worker node). The deployment of VisIVO on the DG and gLite is carried out with the support of EDGI and EGI-Inspire projects. Depending on the structure and size of datasets under consideration, the data exploration process could take several hours of CPU for creating customized views and the production of movies could potentially last several days. For this reason an MPI parallel version of VisIVO could play a fundamental role in increasing performance, e.g. it could be automatically deployed on nodes that are MPI aware. 
A central concept in our development is thus to

  9. Structure Discovery in Large Semantic Graphs Using Extant Ontological Scaling and Descriptive Statistics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    al-Saffar, Sinan; Joslyn, Cliff A.; Chappell, Alan R.

    As semantic datasets grow to be very large and divergent, there is a need to identify and exploit their inherent semantic structure for discovery and optimization. Towards that end, we present here a novel methodology to identify the semantic structures inherent in an arbitrary semantic graph dataset. We first present the concept of an extant ontology as a statistical description of the semantic relations present amongst the typed entities modeled in the graph. This serves as a model of the underlying semantic structure to aid in discovery and visualization. We then describe a method of ontological scaling in which the ontology is employed as a hierarchical scaling filter to infer different resolution levels at which the graph structures are to be viewed or analyzed. We illustrate these methods on three large and publicly available semantic datasets containing more than one billion edges each. Keywords: Semantic Web; Visualization; Ontology; Multi-resolution Data Mining

  10. The Large Scale Distribution of Water Ice in the Polar Regions of the Moon

    NASA Astrophysics Data System (ADS)

    Jordan, A.; Wilson, J. K.; Schwadron, N.; Spence, H. E.

    2017-12-01

    For in situ resource utilization, one must know where water ice is on the Moon. Many datasets have revealed both surface deposits of water ice and subsurface deposits of hydrogen near the lunar poles, but it has proved difficult to resolve the differences among the locations of these deposits. Despite these datasets disagreeing on how deposits are distributed on small scales, we show that most of these datasets do agree on the large scale distribution of water ice. We present data from the Cosmic Ray Telescope for the Effects of Radiation (CRaTER) on the Lunar Reconnaissance Orbiter (LRO), LRO's Lunar Exploration Neutron Detector (LEND), the Neutron Spectrometer on Lunar Prospector (LPNS), LRO's Lyman Alpha Mapping Project (LAMP), LRO's Lunar Orbiter Laser Altimeter (LOLA), and Chandrayaan-1's Moon Mineralogy Mapper (M3). All, including those that show clear evidence for water ice, reveal surprisingly similar trends with latitude, suggesting that both surface and subsurface datasets are measuring ice. All show that water ice increases towards the poles, and most demonstrate that its signature appears at about ±70° latitude and increases poleward. This is consistent with simulations of how surface and subsurface cold traps are distributed with latitude. This large scale agreement constrains the origin of the ice, suggesting that an ancient cometary impact (or impacts) created a large scale deposit that has been rendered locally heterogeneous by subsequent impacts. Furthermore, it also shows that water ice may be available down to ±70°—latitudes that are more accessible than the poles for landing.

  11. The impact of the resolution of meteorological datasets on catchment-scale drought studies

    NASA Astrophysics Data System (ADS)

    Hellwig, Jost; Stahl, Kerstin

    2017-04-01

    Gridded meteorological datasets provide the basis to study drought at a range of scales, including catchment-scale drought studies in hydrology. They are readily available to study past weather conditions and often serve real-time monitoring as well. As these datasets differ in spatial/temporal coverage and spatial/temporal resolution, for most studies there is a tradeoff between these features. Our investigation examines whether biases occur when studying drought at the catchment scale with low-resolution input data. For that, a comparison among the datasets HYRAS (covering Central Europe, 1x1 km grid, daily data, 1951 - 2005), E-OBS (Europe, 0.25° grid, daily data, 1950-2015) and GPCC (whole world, 0.5° grid, monthly data, 1901 - 2013) is carried out. Generally, biases in precipitation increase with decreasing resolution. The most important variations are found during summer. In the low mountain ranges of Central Europe the coarse-resolution datasets (E-OBS, GPCC) overestimate dry days and underestimate total precipitation since they are not able to describe high spatial variability. However, relative measures like the correlation coefficient reveal good consistency of dry and wet periods, both for absolute precipitation values and standardized indices like the Standardized Precipitation Index (SPI) or Standardized Precipitation Evaporation Index (SPEI). In particular, the most severe droughts derived from the different datasets match very well. These results indicate that absolute values of coarse-resolution datasets applied at the catchment scale might be critical to use for an assessment of hydrological drought, whereas relative measures for determining periods of drought are more trustworthy. Therefore, studies on drought that downscale meteorological data should carefully consider their data needs and focus on relative measures for dry periods if sufficient for the task.

  12. The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets.

    PubMed

    Karapetyan, Karen; Batchelor, Colin; Sharpe, David; Tkachenko, Valery; Williams, Antony J

    2015-01-01

    There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, as well as the naivety of the software users, there are a myriad of issues that can exist with chemical structure representations online. In order to help facilitate validation and standardization of chemical structure datasets from various sources we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets. The chemical validation and standardization platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or potentially requiring manual review. Each identified issue is assigned one of three levels of severity - Information, Warning, and Error - in order to conveniently inform the user of the need to browse and review subsets of their data. The validation process includes validation of atoms and bonds (e.g., making aware of query atoms and bonds), valences, and stereo. The standard form of submission of collections of data, the SDF file, allows the user to map the data fields to predefined CVSP fields for the purpose of cross-validating associated SMILES and InChIs with the connection tables contained within the SDF file. This platform has been applied to the analysis of a large number of data sets prepared for deposition to our ChemSpider database and in preparation of data for the Open PHACTS project. 
In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and ChEMBL 17 data set

  13. Open and scalable analytics of large Earth observation datasets: From scenes to multidimensional arrays using SciDB and GDAL

    NASA Astrophysics Data System (ADS)

    Appel, Marius; Lahn, Florian; Buytaert, Wouter; Pebesma, Edzer

    2018-04-01

    Earth observation (EO) datasets are commonly provided as collections of scenes, where individual scenes represent a temporal snapshot and cover a particular region on the Earth's surface. Using these data in complex spatiotemporal modeling becomes difficult as soon as data volumes exceed a certain capacity or analyses include many scenes, which may spatially overlap and may have been recorded at different dates. In order to facilitate analytics on large EO datasets, we combine and extend the geospatial data abstraction library (GDAL) and the array-based data management and analytics system SciDB. We present an approach to automatically convert collections of scenes to multidimensional arrays and use SciDB to scale computationally intensive analytics. We evaluate the approach in three case studies on national-scale land use change monitoring with Landsat imagery, global empirical orthogonal function analysis of daily precipitation, and combining historical climate model projections with satellite-based observations. Results indicate that the approach can be used to represent various EO datasets and that analyses in SciDB scale well with available computational resources. To simplify analyses of higher-dimensional datasets such as climate model output, however, a generalization of the GDAL data model might be needed. All parts of this work have been implemented as open-source software and we discuss how this may facilitate open and reproducible EO analyses.

  14. Statistical analysis of mesoscale rainfall: Dependence of a random cascade generator on large-scale forcing

    NASA Technical Reports Server (NTRS)

    Over, Thomas M.; Gupta, Vijay K.

    1994-01-01

    Under the theory of independent and identically distributed random cascades, the probability distribution of the cascade generator determines the spatial and the ensemble properties of spatial rainfall. Three sets of radar-derived rainfall data in space and time are analyzed to estimate the probability distribution of the generator. A detailed comparison between instantaneous scans of spatial rainfall and simulated cascades using the scaling properties of the marginal moments is carried out. This comparison highlights important similarities and differences between the data and the random cascade theory. Differences are quantified and measured for the three datasets. Evidence is presented to show that the scaling properties of the rainfall can be captured to the first order by a random cascade with a single parameter. The dependence of this parameter on forcing by the large-scale meteorological conditions, as measured by the large-scale spatial average rain rate, is investigated for these three datasets. The data show that this dependence can be captured by a one-to-one function. Since the large-scale average rain rate can be diagnosed from the large-scale dynamics, this relationship demonstrates an important linkage between the large-scale atmospheric dynamics and the statistical cascade theory of mesoscale rainfall. Potential application of this research to parameterization of runoff from the land surface and regional flood frequency analysis is briefly discussed, and open problems for further research are presented.
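
    A single-parameter random cascade of the kind discussed above can be sketched in a few lines. The code below uses the so-called beta model, a standard one-parameter generator in the random-cascade literature; whether it matches the generator fitted in this paper is an assumption, as are the one-dimensional layout, the default branching number of 2 and the name `beta_cascade`. Each cell is split into `branching` children, and each child is multiplied by an independent weight W that equals `branching ** beta` with probability `branching ** (-beta)` and 0 otherwise, so that E[W] = 1 and the ensemble-mean density is preserved at every level.

```python
import random


def beta_cascade(levels, beta, branching=2, seed=None):
    """Generate a 1-D beta-model random cascade (illustrative sketch only).

    Returns a list of branching**levels non-negative densities.
    """
    rng = random.Random(seed)
    p_wet = branching ** (-beta)      # probability that a child cell stays "wet"
    weight = branching ** beta        # weight of a wet child, chosen so E[W] = 1
    field = [1.0]                     # start from a uniform unit density
    for _level in range(levels):
        next_field = []
        for density in field:
            for _child in range(branching):
                w = weight if rng.random() < p_wet else 0.0
                next_field.append(density * w)
        field = next_field
    return field


if __name__ == "__main__":
    field = beta_cascade(levels=6, beta=0.3, seed=42)
    wet_fraction = sum(1 for v in field if v > 0.0) / len(field)
    print(f"cells: {len(field)}, wet fraction: {wet_fraction:.2f}")
```

    With beta = 0 every weight is 1 and the field stays uniform; larger beta values produce sparser, more intermittent fields, which is the sense in which a single parameter can be tied to large-scale forcing.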

  15. A Computational Approach to Qualitative Analysis in Large Textual Datasets

    PubMed Central

    Evans, Michael S.

    2014-01-01

    In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on broader discourse, how to validate substantive inferences from small samples of textual data, and how to determine if identified cases are part of a consistent temporal pattern. PMID:24498398

  16. Robust regression for large-scale neuroimaging studies.

    PubMed

    Fritsch, Virgile; Da Mota, Benoit; Loth, Eva; Varoquaux, Gaël; Banaschewski, Tobias; Barker, Gareth J; Bokde, Arun L W; Brühl, Rüdiger; Butzek, Brigitte; Conrod, Patricia; Flor, Herta; Garavan, Hugh; Lemaitre, Hervé; Mann, Karl; Nees, Frauke; Paus, Tomas; Schad, Daniel J; Schümann, Gunter; Frouin, Vincent; Poline, Jean-Baptiste; Thirion, Bertrand

    2015-05-01

    Multi-subject datasets used in neuroimaging group studies have a complex structure, as they exhibit non-stationary statistical properties across regions and display various artifacts. While studies with small sample sizes can rarely be shown to deviate from standard hypotheses (such as the normality of the residuals) due to the poor sensitivity of normality tests with low degrees of freedom, large-scale studies (e.g. >100 subjects) exhibit more obvious deviations from these hypotheses and call for more refined models for statistical inference. Here, we demonstrate the benefits of robust regression as a tool for analyzing large neuroimaging cohorts. First, we use an analytic test based on robust parameter estimates; based on simulations, this procedure is shown to provide an accurate statistical control without resorting to permutations. Second, we show that robust regression yields more detections than standard algorithms using as an example an imaging genetics study with 392 subjects. Third, we show that robust regression can avoid false positives in a large-scale analysis of brain-behavior relationships with over 1500 subjects. Finally we embed robust regression in the Randomized Parcellation Based Inference (RPBI) method and demonstrate that this combination further improves the sensitivity of tests carried out across the whole brain. Altogether, our results show that robust procedures provide important advantages in large-scale neuroimaging group studies. Copyright © 2015 Elsevier Inc. All rights reserved.

  17. Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

    PubMed Central

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance

    2013-01-01

    RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets. PMID:25937948

  18. Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies.

    PubMed

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance

    2013-01-01

    RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets.

  19. Approaching the exa-scale: a real-world evaluation of rendering extremely large data sets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Patchett, John M; Ahrens, James P; Lo, Li-Ta

    2010-10-15

    Extremely large scale analysis is becoming increasingly important as supercomputers and their simulations move from petascale to exascale. The lack of dedicated hardware acceleration for rendering on today's supercomputing platforms motivates our detailed evaluation of the possibility of interactive rendering on the supercomputer. In order to facilitate our understanding of rendering on the supercomputing platform, we focus on scalability of rendering algorithms and architecture envisioned for exascale datasets. To understand tradeoffs for dealing with extremely large datasets, we compare three different rendering algorithms for large polygonal data: software based ray tracing, software based rasterization and hardware accelerated rasterization. We present a case study of strong and weak scaling of rendering extremely large data on both GPU and CPU based parallel supercomputers using ParaView, a parallel visualization tool. We use three different data sets: two synthetic and one from a scientific application. At an extreme scale, algorithmic rendering choices make a difference and should be considered while approaching exascale computing, visualization, and analysis. We find software based ray-tracing offers a viable approach for scalable rendering of the projected future massive data sizes.

  20. High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms

    PubMed Central

    Teodoro, George; Pan, Tony; Kurc, Tahsin M.; Kong, Jun; Cooper, Lee A. D.; Podhorszki, Norbert; Klasky, Scott; Saltz, Joel H.

    2014-01-01

    Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance aware scheduling techniques along with several optimizations, including architecture aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system. PMID:25419546

  1. Learning visual balance from large-scale datasets of aesthetically highly rated images

    NASA Astrophysics Data System (ADS)

    Jahanian, Ali; Vishwanathan, S. V. N.; Allebach, Jan P.

    2015-03-01

    The concept of visual balance is innate for humans, and influences how we perceive visual aesthetics and cognize harmony. Although visual balance is a vital principle of design and is taught in schools of design, it is barely quantified. On the other hand, with the emergence of automatic/semi-automatic visual designs for self-publishing, learning visual balance and computationally modeling it may improve the aesthetics of such designs. In this paper, we present how the quest to understand visual balance inspired us to revisit one of the well-known theories in visual arts, the so-called theory of "visual rightness", elucidated by Arnheim. We define Arnheim's hypothesis as a design mining problem with the goal of learning visual balance from the work of professionals. We collected a dataset of 120K images that are aesthetically highly rated, from a professional photography website. We then computed factors that contribute to visual balance based on the notion of visual saliency. We fitted a mixture of Gaussians to the saliency maps of the images, and obtained the hotspots of the images. Our inferred Gaussians align with Arnheim's hotspots, and confirm his theory. Moreover, the results support the viability of the center of mass, symmetry, as well as the Rule of Thirds in our dataset.

  2. A Critical Review of Automated Photogrammetric Processing of Large Datasets

    NASA Astrophysics Data System (ADS)

    Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.

    2017-08-01

    The paper reports some comparisons between commercial software packages able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a diverse number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to provide rigorous terms of reference for comparisons and critical analyses.

  3. Potential for using regional and global datasets for national scale ecosystem service modelling

    NASA Astrophysics Data System (ADS)

    Maxwell, Deborah; Jackson, Bethanna

    2016-04-01

    Ecosystem service models are increasingly being used by planners and policy makers to inform policy development and decisions about national-level resource management. Such models allow ecosystem services to be mapped and quantified, and subsequent changes to these services to be identified and monitored. In some cases, the impact of small scale changes can be modelled at a national scale, providing more detailed information to decision makers about where to best focus investment and management interventions that could address these issues, while moving toward national goals and/or targets. National scale modelling often uses national (or local) data (for example, soils, landcover and topographical information) as input. However, there are some places where fine resolution and/or high quality national datasets cannot be easily obtained, or do not even exist. In the absence of such detailed information, regional or global datasets could be used as input to such models. There are questions, however, about the usefulness of these coarser resolution datasets and the extent to which inaccuracies in these data may degrade predictions of existing and potential ecosystem service provision and subsequent decision making. Using LUCI (the Land Utilisation and Capability Indicator) as an example predictive model, we examine how the reliability of predictions changes when national datasets of soil, landcover and topography are substituted with coarser scale regional and global datasets. We specifically look at how LUCI's predictions of where water services, such as flood risk, flood mitigation, erosion and water quality, change when national data inputs are replaced by regional and global datasets. Using the Conwy catchment, Wales, as a case study, the land cover products compared are the UK's Land Cover Map (2007), the European CORINE land cover map and the ESA global land cover map. Soils products include the National Soil Map of England and Wales (NatMap) and the European

  4. Contractual Duration and Investment Incentives: Evidence from Large Scale Production Units in China

    NASA Astrophysics Data System (ADS)

    Li, Fang; Feng, Shuyi; D'Haese, Marijke; Lu, Hualiang; Qu, Futian

    2017-04-01

    Large Scale Production Units have become important forces in the supply of agricultural commodities and agricultural modernization in China. Contractual duration in farmland transfer to Large Scale Production Units can be considered to reflect land tenure security. Theoretically, long-term tenancy contracts can encourage Large Scale Production Units to increase long-term investments by ensuring land rights stability or favoring access to credit. Using a unique Large Scale Production Unit- and plot-level field survey dataset from Jiangsu and Jiangxi Provinces, this study aims to examine the effect of contractual duration on Large Scale Production Units' soil conservation behaviours. An instrumental variable (IV) method is applied to take into account the endogeneity of contractual duration and unobserved household heterogeneity. Results indicate that farmland transfer contract duration significantly and positively affects land-improving investments. Policies aimed at improving transaction platforms and intermediary organizations in farmland transfer, so that Large Scale Production Units can access farmland with long-term tenancy contracts, may therefore play an important role in improving soil quality and land productivity.

  5. The MATISSE analysis of large spectral datasets from the ESO Archive

    NASA Astrophysics Data System (ADS)

    Worley, C.; de Laverny, P.; Recio-Blanco, A.; Hill, V.; Vernisse, Y.; Ordenovic, C.; Bijaoui, A.

    2010-12-01

    The automated stellar classification algorithm, MATISSE, has been developed at the Observatoire de la Côte d'Azur (OCA) in order to determine stellar temperatures, gravities and chemical abundances for large datasets of stellar spectra. The Gaia Data Processing and Analysis Consortium (DPAC) has selected MATISSE as one of the key programmes to be used in the analysis of the Gaia Radial Velocity Spectrometer (RVS) spectra. MATISSE is currently being used to analyse large datasets of spectra from the ESO archive with the primary goal of producing advanced data products to be made available in the ESO database via the Virtual Observatory. This is also an invaluable opportunity to identify and address issues that can be encountered with the analysis of large samples of real spectra prior to the launch of Gaia in 2012. The analysis of the archived spectra of the FEROS spectrograph is currently underway and preliminary results are presented.

  6. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study.

    PubMed

    Dolz, Jose; Desrosiers, Christian; Ben Ayed, Ismail

    2018-04-15

    This study investigates a 3D and fully convolutional neural network (CNN) for subcortical brain structure segmentation in MRI. 3D CNN architectures have been generally avoided due to their computational and memory requirements during inference. We address the problem via small kernels, allowing deeper architectures. We further model both local and global context by embedding intermediate-layer outputs in the final prediction, which encourages consistency between features extracted at different scales and embeds fine-grained information directly in the segmentation process. Our model is efficiently trained end-to-end on a graphics processing unit (GPU), in a single stage, exploiting the dense inference capabilities of fully CNNs. We performed comprehensive experiments over two publicly available datasets. First, we demonstrate a state-of-the-art performance on the IBSR dataset. Then, we report a large-scale multi-site evaluation over 1112 unregistered subject datasets acquired from 17 different sites (ABIDE dataset), with ages ranging from 7 to 64 years, showing that our method is robust to various acquisition protocols, demographics and clinical factors. Our method yielded segmentations that are highly consistent with a standard atlas-based approach, while running in a fraction of the time needed by atlas-based methods and avoiding registration/normalization steps. This makes it convenient for massive multi-site neuroanatomical imaging studies. To the best of our knowledge, our work is the first to study subcortical structure segmentation on such large-scale and heterogeneous data. Copyright © 2017 Elsevier Inc. All rights reserved.

  7. Collaboration tools and techniques for large model datasets

    USGS Publications Warehouse

    Signell, R.P.; Carniel, S.; Chiggiato, J.; Janekovic, I.; Pullen, J.; Sherwood, C.R.

    2008-01-01

    In MREA and many other marine applications, it is common to have multiple models running with different grids, run by different institutions. Techniques and tools are described for low-bandwidth delivery of data from large multidimensional datasets, such as those from meteorological and oceanographic models, directly into generic analysis and visualization tools. Output is stored using the NetCDF CF Metadata Conventions, and then delivered to collaborators over the web via OPeNDAP. OPeNDAP datasets served by different institutions are then organized via THREDDS catalogs. Tools and procedures are then used which enable scientists to explore data on the original model grids using tools they are familiar with. It is also low-bandwidth, enabling users to extract just the data they require, an important feature for access from ship or remote areas. The entire implementation is simple enough to be handled by modelers working with their webmasters - no advanced programming support is necessary. © 2007 Elsevier B.V. All rights reserved.

  8. FLUXNET2015 Dataset: Batteries included

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Papale, D.; Agarwal, D.; Trotta, C.; Chu, H.; Canfora, E.; Torn, M. S.; Baldocchi, D. D.

    2016-12-01

    The synthesis datasets have become one of the signature products of the FLUXNET global network. They are composed from contributions of individual site teams to regional networks, and are then compiled into uniform data products - now used in a wide variety of research efforts: from plant-scale microbiology to global-scale climate change. The FLUXNET Marconi Dataset in 2000 was the first in the series, followed by the FLUXNET LaThuile Dataset in 2007, with significant additions of data products and coverage, solidifying the adoption of the datasets as a research tool. The FLUXNET2015 Dataset comes with another round of substantial improvements, including extended quality control processes and checks, use of downscaled reanalysis data for filling long gaps in micrometeorological variables, multiple methods for USTAR threshold estimation and flux partitioning, and uncertainty estimates - all of which are accompanied by auxiliary flags. This "batteries included" approach provides a lot of information for someone who wants to explore the data (and the processing methods) in detail. This inevitably leads to a large number of data variables. Although dealing with all these variables might seem overwhelming at first, especially to someone looking at eddy covariance data for the first time, there is method to our madness. In this work we describe the data products and variables that are part of the FLUXNET2015 Dataset, and the rationale behind the organization of the dataset, covering the simplified version (labeled SUBSET), the complete version (labeled FULLSET), and the auxiliary products in the dataset.

  9. Large-scale seismic waveform quality metric calculation using Hadoop

    DOE PAGES

    Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; ...

    2016-05-27

    Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will

  10. Large-scale seismic waveform quality metric calculation using Hadoop

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.

    Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will

  11. Large-scale seismic waveform quality metric calculation using Hadoop

    NASA Astrophysics Data System (ADS)

    Magana-Zook, S.; Gaylord, J. M.; Knapp, D. R.; Dodge, D. A.; Ruppert, S. D.

    2016-09-01

    In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of 0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. These experiments were conducted multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely
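
    The throughput and speed-up figures quoted in this abstract can be turned into rough processing times. The sketch below is plain arithmetic on those numbers, not the authors' performance model; it assumes the maximum throughput of 0.56 TB per hour was sustained for the whole run, so the baseline time is an optimistic lower bound.

```python
def hours_to_process(size_tb, throughput_tb_per_hour):
    """Time to process a dataset at a constant throughput (an idealisation)."""
    return size_tb / throughput_tb_per_hour

# Figures taken from the abstract.
SUBSET_TB = 5.1             # data processed by the traditional implementation
BASELINE_TB_PER_HOUR = 0.56 # maximum throughput on 5 nodes
SPEEDUP = 15                # "about 15 times faster" for Spark and MapReduce

baseline_hours = hours_to_process(SUBSET_TB, BASELINE_TB_PER_HOUR)
hadoop_hours = baseline_hours / SPEEDUP

print(f"traditional: {baseline_hours:.1f} h, Hadoop/Spark: {hadoop_hours * 60:.0f} min")
```

    On these assumptions, a run of roughly nine hours on the traditional system corresponds to well under an hour on the Hadoop cluster.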

  12. Integrated Strategy Improves the Prediction Accuracy of miRNA in Large Dataset

    PubMed Central

    Lipps, David; Devineni, Sree

    2016-01-01

    MiRNAs are short non-coding RNAs of about 22 nucleotides, which play critical roles in gene expression regulation. The biogenesis of miRNAs is largely determined by the sequence and structural features of their parental RNA molecules. Based on these features, multiple computational tools have been developed to predict whether RNA transcripts contain miRNAs. Although very successful, these predictors have started to face multiple challenges in recent years. Many predictors were optimized using datasets of hundreds of miRNA samples. The sizes of these datasets are much smaller than the number of known miRNAs. Consequently, the prediction accuracy of these predictors on large datasets is unknown and needs to be re-tested. In addition, many predictors were optimized for either high sensitivity or high specificity. These optimization strategies may introduce serious limitations in applications. Moreover, to meet continuously rising expectations of these computational tools, improving the prediction accuracy becomes extremely important. In this study, a meta-predictor, mirMeta, was developed by integrating a set of non-linear transformations with a meta-strategy. More specifically, the outputs of five individual predictors were first preprocessed using non-linear transformations, and then fed into an artificial neural network to make the meta-prediction. The prediction accuracy of the meta-predictor was validated using both multi-fold cross-validation and an independent dataset. The final accuracy of the meta-predictor on a newly designed large dataset is improved by 7% to 93%. The meta-predictor is also shown to be less dependent on datasets and to have a more refined balance between sensitivity and specificity. The importance of this study is twofold: First, it shows that the combination of non-linear transformations and artificial neural networks improves the prediction accuracy of individual predictors. Second, a new miRNA predictor with significantly improved prediction accuracy

  13. Distributed File System Utilities to Manage Large Datasets Version 0.5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2014-05-21

    FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/O calls. The current suite consists of tools to copy, compare, remove, and list. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.

  14. Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; Manipon, G.; Hua, H.

    2012-12-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instrument swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel Python, that is designed from the ground up for Earth science. SciReduce executes inside VMware images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within Sci

  15. Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud

    NASA Astrophysics Data System (ADS)

    Wilson, B.; Manipon, G.; Hua, H.

    2012-04-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instrument swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel Python, that is designed from the ground up for Earth science. SciReduce executes inside VMware images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within Sci
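
    The idea of passing bundles of named numeric arrays between map and reduce functions, rather than simple key/value tuples, can be illustrated in a few lines. The sketch below is not SciReduce's API, which the abstract does not describe; the variable names, the quality-flag convention (0 means good) and the sample values are all hypothetical, and plain Python lists stand in for numeric arrays.

```python
from functools import reduce

def map_granule(bundle):
    """Map step: turn one bundle of named arrays into partial statistics,
    keeping only retrievals whose quality flag is 0 (assumed to mean 'good')."""
    good = [t for t, q in zip(bundle["temperature"], bundle["quality"]) if q == 0]
    return {"sum": sum(good), "count": len(good)}

def combine(a, b):
    """Reduce step: merge two partial results."""
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"]}

# Two made-up granules, each a bundle of named arrays.
granules = [
    {"temperature": [280.0, 282.0, 290.0], "quality": [0, 0, 1]},
    {"temperature": [284.0, 286.0], "quality": [0, 0]},
]

total = reduce(combine, map(map_granule, granules), {"sum": 0.0, "count": 0})
mean_temperature = total["sum"] / total["count"]
print(mean_temperature)
```

    Because each map call depends only on its own bundle, the granules could be distributed across nodes and the partial results combined in any order.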

  16. Tools for Large-Scale Mobile Malware Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bierma, Michael

    Analyzing mobile applications for malicious behavior is an important area of research, and is made difficult, in part, by the increasingly large number of applications available for the major operating systems. There are currently over 1.2 million apps available in both the Google Play and Apple App stores (the respective official marketplaces for the Android and iOS operating systems)[1, 2]. Our research provides two large-scale analysis tools to aid in the detection and analysis of mobile malware. The first tool we present, Andlantis, is a scalable dynamic analysis system capable of processing over 3000 Android applications per hour. Traditionally, Android dynamic analysis techniques have been relatively limited in scale due to the computational resources required to emulate the full Android system to achieve accurate execution. Andlantis is the most scalable Android dynamic analysis framework to date, and is able to collect valuable forensic data, which helps reverse-engineers and malware researchers identify and understand anomalous application behavior. We discuss the results of running 1261 malware samples through the system, and provide examples of malware analysis performed with the resulting data. While techniques exist to perform static analysis on a large number of applications, large-scale analysis of iOS applications has been relatively small scale due to the closed nature of the iOS ecosystem, and the difficulty of acquiring applications for analysis. The second tool we present, iClone, addresses the challenges associated with iOS research in order to detect application clones within a dataset of over 20,000 iOS applications.

  17. Fast randomization of large genomic datasets while preserving alteration counts.

    PubMed

    Gobbi, Andrea; Iorio, Francesco; Dawson, Kevin J; Wedge, David C; Tamborero, David; Alexandrov, Ludmil B; Lopez-Bigas, Nuria; Garnett, Mathew J; Jurman, Giuseppe; Saez-Rodriguez, Julio

    2014-09-01

    Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirement, with respect to existing implementations/bounds and equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
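
    A switching-step on a bipartite network can be sketched as follows. This is a minimal illustration of the general technique, not BiRewire's implementation (which is an R package); the gene and patient labels are invented, and the number of steps is an arbitrary placeholder rather than the lower bound derived in the paper.

```python
import random
from collections import Counter

def switching_step(edges, rng):
    """Try one switching-step in place: pick edges (a, b) and (c, d) and
    rewire them to (a, d) and (c, b). Returns True if the swap was applied."""
    (a, b), (c, d) = rng.sample(sorted(edges), 2)
    if a == c or b == d:
        return False                      # swap would change nothing
    if (a, d) in edges or (c, b) in edges:
        return False                      # swap would create a duplicate edge
    edges -= {(a, b), (c, d)}
    edges |= {(a, d), (c, b)}
    return True

def randomize(edges, n_steps, seed=0):
    """Apply n_steps attempted switching-steps to a copy of the edge set."""
    rng = random.Random(seed)
    edges = set(edges)
    for _ in range(n_steps):
        switching_step(edges, rng)
    return edges

def degrees(edges):
    """Gene-wise and patient-wise mutation counts."""
    return Counter(g for g, _ in edges), Counter(p for _, p in edges)

# Hypothetical (gene, patient) mutation pairs.
original = {("g1", "p1"), ("g1", "p2"), ("g2", "p2"),
            ("g2", "p3"), ("g3", "p1"), ("g3", "p3")}
shuffled = randomize(original, n_steps=100)
print(degrees(shuffled) == degrees(original))
```

    Each accepted swap removes one edge from each of two genes and two patients and adds one back to each, which is why the gene-wise and patient-wise mutation counts are preserved. Sorting the edge set on every step keeps the sketch short but would be too slow for real datasets.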

  18. Large-Scale Pattern Discovery in Music

    NASA Astrophysics Data System (ADS)

    Bertin-Mahieux, Thierry

    This work focuses on extracting patterns in musical data from very large collections. The problem is split in two parts. First, we build such a large collection, the Million Song Dataset, to provide researchers access to commercial-size datasets. Second, we use this collection to study cover song recognition which involves finding harmonic patterns from audio features. Regarding the Million Song Dataset, we detail how we built the original collection from an online API, and how we encouraged other organizations to participate in the project. The result is the largest research dataset with heterogeneous sources of data available to music technology researchers. We demonstrate some of its potential and discuss the impact it already has on the field. On cover song recognition, we must revisit the existing literature since there are no publicly available results on a dataset of more than a few thousand entries. We present two solutions to tackle the problem, one using a hashing method, and one using a higher-level feature computed from the chromagram (dubbed the 2DFTM). We further investigate the 2DFTM since it has potential to be a relevant representation for any task involving audio harmonic content. Finally, we discuss the future of the dataset and the hope of seeing more work making use of the different sources of data that are linked in the Million Song Dataset. Regarding cover songs, we explain how this might be a first step towards defining a harmonic manifold of music, a space where harmonic similarities between songs would be more apparent.
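
    As a rough illustration of why a feature such as the 2DFTM suits cover song recognition, the sketch below assumes that the name stands for the magnitude of the 2D Fourier transform of a chromagram patch; the abstract itself does not spell this out. The patch size (12 pitch classes by 75 frames) and the function name `two_d_ftm` are hypothetical.

```python
import numpy as np

def two_d_ftm(chroma_patch):
    """Magnitude of the 2D Fourier transform of a (12 x n_frames) chroma patch."""
    return np.abs(np.fft.fft2(chroma_patch))

rng = np.random.default_rng(0)
patch = rng.random((12, 75))             # hypothetical chroma patch
transposed = np.roll(patch, 3, axis=0)   # circular pitch shift, i.e. a key change

# The magnitude discards phase, so a circular shift leaves the feature unchanged
same_feature = np.allclose(two_d_ftm(patch), two_d_ftm(transposed))
```

    Under this assumption, a cover played in a different key maps to the same feature as the original, which is one plausible reason such a representation is useful for harmonic content.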

  19. A Combined Eulerian-Lagrangian Data Representation for Large-Scale Applications.

    PubMed

    Sauer, Franz; Xie, Jinrong; Ma, Kwan-Liu

    2017-10-01

    The Eulerian and Lagrangian reference frames each provide a unique perspective when studying and visualizing results from scientific systems. As a result, many large-scale simulations produce data in both formats, and analysis tasks that simultaneously utilize information from both representations are becoming increasingly popular. However, due to their fundamentally different nature, drawing correlations between these data formats is a computationally difficult task, especially in a large-scale setting. In this work, we present a new data representation which combines both reference frames into a joint Eulerian-Lagrangian format. By reorganizing Lagrangian information according to the Eulerian simulation grid into a "unit cell" based approach, we can provide an efficient out-of-core means of sampling, querying, and operating with both representations simultaneously. We also extend this design to generate multi-resolution subsets of the full data to suit the viewer's needs and provide a fast flow-aware trajectory construction scheme. We demonstrate the effectiveness of our method using three large-scale real world scientific datasets and provide insight into the types of performance gains that can be achieved.
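
    The "unit cell" idea, reorganizing Lagrangian samples according to the Eulerian grid, can be sketched as a simple binning step. The record layout, the uniform cubic cells, and the function name `build_unit_cells` are assumptions for illustration; the paper's out-of-core format is not described in enough detail here to reproduce.

```python
from collections import defaultdict

def build_unit_cells(particles, origin, cell_size):
    """Group Lagrangian samples by the Eulerian grid cell that contains them.

    particles: iterable of (particle_id, time_step, (x, y, z))
    Returns {cell_index: [(particle_id, time_step, position), ...]}.
    """
    cells = defaultdict(list)
    for pid, step, pos in particles:
        # Floor division maps a position to the integer index of its cell
        index = tuple(int((p - o) // cell_size) for p, o in zip(pos, origin))
        cells[index].append((pid, step, pos))
    return cells

# Hypothetical trajectory samples: particle 0 moves from cell (0,0,0) to (1,0,0)
samples = [(0, 0, (0.2, 0.1, 0.9)),
           (1, 0, (1.5, 0.1, 0.2)),
           (0, 1, (1.1, 0.4, 0.3))]
cells = build_unit_cells(samples, origin=(0.0, 0.0, 0.0), cell_size=1.0)
```

    With such an index, a query for all particles passing through a grid region only touches the cells in that region, which is what makes combined Eulerian and Lagrangian queries cheap.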

  20. Considerations for Observational Research using Large Datasets in Radiation Oncology

    PubMed Central

    Jagsi, Reshma; Bekelman, Justin E.; Chen, Aileen; Chen, Ronald C.; Hoffman, Karen; Shih, Ya-Chen Tina; Smith, Benjamin D.; Yu, James B.

    2014-01-01

    The radiation oncology community has witnessed growing interest in observational research conducted using large-scale data sources such as registries and claims-based datasets. With the growing emphasis on observational analyses in health care, the radiation oncology community must possess a sophisticated understanding of the methodological considerations of such studies in order to evaluate evidence appropriately to guide practice and policy. Because observational research has unique features that distinguish it from clinical trials and other forms of traditional radiation oncology research, the Red Journal assembled a panel of experts in health services research to provide a concise and well-referenced review, intended to be informative for the lay reader, as well as for scholars who wish to embark on such research without prior experience. This review begins by discussing the types of research questions relevant to radiation oncology that large-scale databases may help illuminate. It then describes major potential data sources for such endeavors, including information regarding access and insights regarding the strengths and limitations of each. Finally, it provides guidance regarding the analytic challenges that observational studies must confront, along with discussion of the techniques that have been developed to help minimize the impact of certain common analytical issues in observational analysis. Features characterizing a well-designed observational study include clearly defined research questions, careful selection of an appropriate data source, consultation with investigators with relevant methodological expertise, inclusion of sensitivity analyses, caution not to overinterpret small but significant differences, and recognition of limitations when trying to evaluate causality. This review concludes that carefully designed and executed studies using observational data that possess these qualities hold substantial promise for advancing our understanding of

  1. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets.

    PubMed

    McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr

    2017-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension (functionality supporting preservation of file system structure within Dataverse), which is essential for both in-place computation and supporting non-HTTP data transfers. © 2016 New York Academy of Sciences.

  2. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets

    PubMed Central

    McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr

    2016-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010

  3. Statistical analysis of large simulated yield datasets for studying climate effects

    USDA-ARS's Scientific Manuscript database

    Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...

  4. Measurement and genetics of human subcortical and hippocampal asymmetries in large datasets.

    PubMed

    Guadalupe, Tulio; Zwiers, Marcel P; Teumer, Alexander; Wittfeld, Katharina; Vasquez, Alejandro Arias; Hoogman, Martine; Hagoort, Peter; Fernandez, Guillen; Buitelaar, Jan; Hegenscheid, Katrin; Völzke, Henry; Franke, Barbara; Fisher, Simon E; Grabe, Hans J; Francks, Clyde

    2014-07-01

    Functional and anatomical asymmetries are prevalent features of the human brain, linked to gender, handedness, and cognition. However, little is known about the neurodevelopmental processes involved. In zebrafish, asymmetries arise in the diencephalon before extending within the central nervous system. We aimed to identify genes involved in the development of subtle, left-right volumetric asymmetries of human subcortical structures using large datasets. We first tested the feasibility of measuring left-right volume differences in such large-scale samples, as assessed by two automated methods of subcortical segmentation (FSL|FIRST and FreeSurfer), using data from 235 subjects who had undergone MRI twice. We tested the agreement between the first and second scan, and the agreement between the segmentation methods, for measures of bilateral volumes of six subcortical structures and the hippocampus, and their volumetric asymmetries. We also tested whether there were biases introduced by left-right differences in the regional atlases used by the methods, by analyzing left-right flipped images. While many bilateral volumes were measured well (scan-rescan r = 0.6-0.8), most asymmetries, with the exception of the caudate nucleus, showed lower repeatabilities. We meta-analyzed genome-wide association scan results for caudate nucleus asymmetry in a combined sample of 3,028 adult subjects but did not detect associations at genome-wide significance (P < 5 × 10^-8). There was no enrichment of genetic association in genes involved in left-right patterning of the viscera. Our results provide important information for researchers who are currently aiming to carry out large-scale genome-wide studies of subcortical and hippocampal volumes, and their asymmetries. Copyright © 2013 Wiley Periodicals, Inc.

  5. [No relationship between blood type and personality: evidence from large-scale surveys in Japan and the US].

    PubMed

    Nawata, Kengo

    2014-06-01

    Despite the widespread popular belief in Japan about a relationship between personality and ABO blood type, this association has not been empirically substantiated. This study provides more robust evidence that there is no relationship between blood type and personality, through a secondary analysis of large-scale survey data. Recent data (after 2000) were collected using large-scale random sampling from over 10,000 people in total from both Japan and the US. Effect sizes were calculated. Japanese datasets from 2004 (N = 2,878-2,938) and 2005 (N = 3,618-3,692), as well as one dataset from the US in 2004 (N = 3,037-3,092), were used. In all the datasets, 65 of 68 items yielded non-significant differences between blood groups. Effect sizes (η²) were less than .003. This means that blood type explained less than 0.3% of the total variance in personality. These results show the non-relevance of blood type for personality.
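
    The step from an effect size below .003 to "less than 0.3% of the total variance" follows from the definition of eta squared as the between-group sum of squares divided by the total sum of squares. A minimal sketch of that computation follows; the function name and the toy scores are illustrative and are not data from the study.

```python
def eta_squared(groups):
    """Eta squared = between-group sum of squares / total sum of squares."""
    scores = [x for g in groups for x in g]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical item scores for two groups (not data from the surveys)
effect = eta_squared([[1, 2, 3], [2, 3, 4]])
share_of_variance_percent = 100 * effect
```

    Multiplying eta squared by 100 gives the percentage of variance explained by group membership, so a value of .003 corresponds to 0.3%.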

  6. Large Dataset of Acute Oral Toxicity Data Created for Testing ...

    EPA Pesticide Factsheets

    Acute toxicity data is a common requirement for substance registration in the US. Currently only data derived from animal tests are accepted by regulatory agencies, and the standard in vivo tests use lethality as the endpoint. Non-animal alternatives such as in silico models are being developed due to animal welfare and resource considerations. We compiled a large dataset of oral rat LD50 values to assess the predictive performance of currently available in silico models. Our dataset combines LD50 values from five different sources: literature data provided by The Dow Chemical Company, REACH data from eChemportal, HSDB (Hazardous Substances Data Bank), RTECS data from Leadscope, and the training set underpinning TEST (Toxicity Estimation Software Tool). Combined, these data sources yield 33848 chemical-LD50 pairs (data points), with 23475 unique data points covering 16439 compounds. The entire dataset was loaded into a chemical properties database. All of the compounds were registered in DSSTox and 59.5% have publicly available structures. Compounds without a structure in DSSTox are currently having their structures registered. The structural data will be used to evaluate the predictive performance and applicable chemical domains of three QSAR models (TIMES, PROTOX, and TEST). Future work will combine the dataset with information from ToxCast assays, and using random forest modeling, assess whether ToxCast assays are useful in predicting acute oral toxicity. Pre

  7. Scalable Visual Analytics of Massive Textual Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.

    2007-04-01

    This paper describes the first scalable implementation of a text processing engine used in Visual Analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte data sets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.

  8. A dataset of human decision-making in teamwork management.

    PubMed

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-17

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  9. A dataset of human decision-making in teamwork management

    PubMed Central

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members’ capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches. PMID:28094787

  10. A dataset of human decision-making in teamwork management

    NASA Astrophysics Data System (ADS)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  11. Large-Scale Image Analytics Using Deep Learning

    NASA Astrophysics Data System (ADS)

    Ganguly, S.; Nemani, R. R.; Basu, S.; Mukhopadhyay, S.; Michaelis, A.; Votava, P.

    2014-12-01

    High resolution land cover classification maps are needed to increase the accuracy of current land ecosystem and climate model outputs. Few studies demonstrate the state of the art in deriving very high resolution (VHR) land cover products. In addition, most methods rely heavily on commercial software that is difficult to scale given the region of study (e.g. continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agricultural Imaging Program (NAIP) dataset for the Continental United States at a spatial resolution of 1-m. This data comes as image tiles (a total of a quarter million image scenes with ~60 million pixels) and has a total size of ~100 terabytes for a single acquisition. Features extracted from the entire dataset would amount to ~8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted in the principles of "deep learning" to delineate the percentage of tree cover. In order to perform image analytics in such a granular system, it is mandatory to devise an intelligent archiving and query system for image retrieval, file structuring, metadata processing and filtering of all available image scenes. Using the Open NASA Earth Exchange (NEX) initiative, which is a partnership with Amazon Web Services (AWS), we have developed an end-to-end architecture for designing the database and the deep belief network (following the distbelief computing model) to solve a grand challenge of scaling this process across a quarter million NAIP tiles that cover the entire Continental United States. The

  12. Publicly Releasing a Large Simulation Dataset with NDS Labs

    NASA Astrophysics Data System (ADS)

    Goldbaum, Nathan

    2016-03-01

    Optimally, all publicly funded research should be accompanied by the tools, code, and data necessary to fully reproduce the analysis performed in journal articles describing the research. This ideal can be difficult to attain, particularly when dealing with large (>10 TB) simulation datasets. In this lightning talk, we describe the process of publicly releasing a large simulation dataset to accompany the submission of a journal article. The simulation was performed using Enzo, an open source, community-developed N-body/hydrodynamics code and was analyzed using a wide range of community-developed tools in the scientific Python ecosystem. Although the simulation was performed and analyzed using an ecosystem of sustainably developed tools, we enable sustainable science using our data by making it publicly available. Combining the data release with the NDS Labs infrastructure allows a substantial amount of added value, including web-based access to analysis and visualization using the yt analysis package through an IPython notebook interface. In addition, we are able to accompany the paper submission to the arXiv preprint server with links to the raw simulation data as well as interactive real-time data visualizations that readers can explore on their own or share with colleagues during journal club discussions. It is our hope that the value added by these services will substantially increase the impact and readership of the paper.

  13. Spectral methods in machine learning and new strategies for very large datasets

    PubMed Central

    Belabbas, Mohamed-Ali; Wolfe, Patrick J.

    2009-01-01

    Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here 2 new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these—based on sampling—leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach—based on sorting—provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods. PMID:19129490
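
    The Nyström method that both strategies build on can be sketched in a few lines of NumPy. The function below shows only the generic approximation step for a given subset of columns; the selection rules proposed in the paper (sampling-based and sorting-based) are not reproduced, and the uniform random choice in the usage lines is merely a placeholder assumption.

```python
import numpy as np

def nystrom_approximation(K, subset):
    """Approximate a positive-semidefinite kernel matrix K from selected columns."""
    C = K[:, subset]                     # n x k block of the selected columns
    W = K[np.ix_(subset, subset)]        # k x k block among the selected points
    return C @ np.linalg.pinv(W) @ C.T   # approximation of rank at most k

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = X @ X.T                                       # linear kernel, rank at most 5
subset = rng.choice(200, size=20, replace=False)  # placeholder: uniform sampling
K_approx = nystrom_approximation(K, subset)
relative_error = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

    The cost is dominated by the pseudo-inverse of the k x k block rather than a decomposition of the full n x n kernel, which is the source of the savings on very large datasets. How well the approximation performs depends on which subset is chosen, and that choice is the problem the abstract addresses.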

  14. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    PubMed

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  15. Developing a "Semi-Systematic" Approach to Using Large-Scale Data-Sets for Small-Scale Interventions: The "Baby Matterz" Initiative as a Case Study

    ERIC Educational Resources Information Center

    O'Brien, Mark

    2011-01-01

    The appropriateness of using statistical data to inform the design of any given service development or initiative often depends upon judgements regarding scale. Large-scale data sets, perhaps national in scope, whilst potentially important in informing the design, implementation and roll-out of experimental initiatives, will often remain unused…

  16. An Application of Hydraulic Tomography to a Large-Scale Fractured Granite Site, Mizunami, Japan.

    PubMed

    Zha, Yuanyuan; Yeh, Tian-Chyi J; Illman, Walter A; Tanaka, Tatsuya; Bruines, Patrick; Onoe, Hironori; Saegusa, Hiromitsu; Mao, Deqiang; Takeuchi, Shinji; Wen, Jet-Chau

    2016-11-01

    While hydraulic tomography (HT) is a mature aquifer characterization technology, its applications to characterize hydrogeology of kilometer-scale fault and fracture zones are rare. This paper sequentially analyzes datasets from two new pumping tests as well as those from two previous pumping tests analyzed by Illman et al. (2009) at a fractured granite site in Mizunami, Japan. Results of this analysis show that datasets from two previous pumping tests at one side of a fault zone as used in the previous study led to inaccurate mapping of fracture and fault zones. Inclusion of the datasets from the two new pumping tests (one of which was conducted on the other side of the fault) yields locations of the fault zone consistent with those based on geological mapping. The new datasets also produce a detailed image of the irregular fault zone, which is not available from geological investigation alone and the previous study. As a result, we conclude that if prior knowledge about geological structures at a field site is considered during the design of HT surveys, valuable non-redundant datasets about the fracture and fault zones can be collected. Only with these non-redundant data sets, can HT then be a viable and robust tool for delineating fracture and fault distributions over kilometer scales, even when only a limited number of boreholes are available. In essence, this paper proves that HT is a new tool for geologists, geophysicists, and engineers for mapping large-scale fracture and fault zone distributions. © 2016, National Ground Water Association.

  17. Production of a national 1:1,000,000-scale hydrography dataset for the United States: feature selection, simplification, and refinement

    USGS Publications Warehouse

    Gary, Robin H.; Wilson, Zachary D.; Archuleta, Christy-Ann M.; Thompson, Florence E.; Vrabel, Joseph

    2009-01-01

    During 2006-09, the U.S. Geological Survey, in cooperation with the National Atlas of the United States, produced a 1:1,000,000-scale (1:1M) hydrography dataset comprising streams and waterbodies for the entire United States, including Puerto Rico and the U.S. Virgin Islands, for inclusion in the recompiled National Atlas. This report documents the methods used to select, simplify, and refine features in the 1:100,000-scale (1:100K) (1:63,360-scale in Alaska) National Hydrography Dataset to create the national 1:1M hydrography dataset. Custom tools and semi-automated processes were created to facilitate generalization of the 1:100K National Hydrography Dataset (1:63,360-scale in Alaska) to 1:1M on the basis of existing small-scale hydrography datasets. The first step in creating the new 1:1M dataset was to address feature selection and optimal data density in the streams network. Several existing methods were evaluated. The production method that was established for selecting features for inclusion in the 1:1M dataset uses a combination of the existing attributes and network in the National Hydrography Dataset and several of the concepts from the methods evaluated. The process for creating the 1:1M waterbodies dataset required a similar approach to that used for the streams dataset. Geometric simplification of features was the next step. Stream reaches and waterbodies indicated in the feature selection process were exported as new feature classes and then simplified using a geographic information system tool. The final step was refinement of the 1:1M streams and waterbodies. Refinement was done through the use of additional geographic information system tools.

  18. Panoptes: web-based exploration of large scale genome variation data.

    PubMed

    Vauterin, Paul; Jeffery, Ben; Miles, Alistair; Amato, Roberto; Hart, Lee; Wright, Ian; Kwiatkowski, Dominic

    2017-10-15

    The size and complexity of modern large-scale genome variation studies demand novel approaches for exploring and sharing the data. In order to unlock the potential of these data for a broad audience of scientists with various areas of expertise, a unified exploration framework is required that is accessible, coherent and user-friendly. Panoptes is an open-source software framework for collaborative visual exploration of large-scale genome variation data and associated metadata in a web browser. It relies on technology choices that allow it to operate in near real-time on very large datasets. It can be used to browse rich, hybrid content in a coherent way, and offers interactive visual analytics approaches to assist the exploration. We illustrate its application using genome variation data of Anopheles gambiae, Plasmodium falciparum and Plasmodium vivax. Freely available at https://github.com/cggh/panoptes, under the GNU Affero General Public License. © The Author 2017. Published by Oxford University Press.

  19. 3D granulometry: grain-scale shape and size distribution from point cloud dataset of river environments

    NASA Astrophysics Data System (ADS)

    Steer, Philippe; Lague, Dimitri; Gourdon, Aurélie; Croissant, Thomas; Crave, Alain

    2016-04-01

    The grain-scale morphology of river sediments and their size distribution are important factors controlling the efficiency of fluvial erosion and transport. In turn, constraining the spatial evolution of these two metrics offers deep insights into the dynamics of river erosion and sediment transport from hillslopes to the sea. However, the size distribution of river sediments is generally assessed using statistically biased field measurements, and determining the grain-scale shape of river sediments remains a real challenge in geomorphology. Here we determine, with new methodological approaches based on the segmentation and geomorphological fitting of 3D point cloud datasets, the size distribution and grain-scale shape of sediments located in river environments. Point cloud segmentation is performed using either machine-learning algorithms or geometrical criteria, such as local plane fitting or curvature analysis. Once the grains are individualized into several sub-clouds, each grain-scale morphology is determined using a 3D geometrical fitting algorithm applied to the sub-cloud. Although different geometrical models can be conceived and tested, only ellipsoidal models were used in this study. A results-checking phase is then performed to remove grains whose best-fitting model has a low level of confidence. The main benefits of this automatic method are that it provides 1) an unbiased estimate of grain-size distribution on a large range of scales, from centimeters to tens of meters; 2) access to a very large amount of data, limited only by the number of grains in the point-cloud dataset; 3) access to the 3D morphology of grains, in turn allowing the development of new metrics characterizing the size and shape of grains. The main limitation of this method is that it is only able to detect grains with a characteristic size greater than the resolution of the point cloud. This new 3D granulometric method is then applied to river terraces both in the Poerua catchment in New Zealand and

  20. The role of metadata in managing large environmental science datasets. Proceedings

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Melton, R.B.; DeVaney, D.M.; French, J. C.

    1995-06-01

    The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.

  1. Large-Scale, Multi-Sensor Atmospheric Data Fusion Using Hybrid Cloud Computing

    NASA Astrophysics Data System (ADS)

    Wilson, Brian; Manipon, Gerald; Hua, Hook; Fetzer, Eric

    2014-05-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map-reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in a hybrid Cloud (private eucalyptus & public Amazon). Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Multi-year datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the

  2. Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E.

    2013-05-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will

  3. Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E. J.

    2013-12-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the 'A-Train' platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (MERRA), stratify the comparisons using a classification of the 'cloud scenes' from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically 'sharded' by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will

  4. Algorithm of OMA for large-scale orthology inference

    PubMed Central

    Roth, Alexander CJ; Gonnet, Gaston H; Dessimoz, Christophe

    2008-01-01

    Background: OMA is a project that aims to identify orthologs within publicly available, complete genomes. With 657 genomes analyzed to date, OMA is one of the largest projects of its kind. Results: The algorithm of OMA improves upon the standard bidirectional best-hit approach in several respects: it uses evolutionary distances instead of scores, considers distance inference uncertainty, includes many-to-many orthologous relations, and accounts for differential gene losses. Herein, we describe in detail the algorithm for inference of orthology and provide the rationale for parameter selection through multiple tests. Conclusion: OMA contains several novel improvement ideas for orthology inference and provides a unique dataset of large-scale orthology assignments. PMID:19055798
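
    For orientation, the standard bidirectional best-hit approach that this record takes as its baseline can be sketched as follows. This is a minimal illustration, not the OMA algorithm: the gene names and scores are made up, and the sketch assumes one symmetric similarity score per gene pair.

```python
def bidirectional_best_hits(scores):
    """Return gene pairs that are each other's best-scoring match.

    `scores` maps (gene_a, gene_b) -> similarity score between a gene of
    genome A and a gene of genome B (hypothetical input format).
    """
    best_for_a = {}  # gene_a -> (best gene_b, score)
    best_for_b = {}  # gene_b -> (best gene_a, score)
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    # Keep a pair only when the best hit holds in both directions.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Illustrative scores only.
example_scores = {
    ("a1", "b1"): 90,
    ("a1", "b2"): 50,
    ("a2", "b1"): 80,
    ("a2", "b2"): 60,
}
pairs = bidirectional_best_hits(example_scores)
```

    In this toy input only a1 and b1 are paired; a2 and b2 stay unassigned because a2's best hit is already taken. One-to-one output of this kind is what the many-to-many relations mentioned in the abstract are meant to go beyond.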

  5. The PREP pipeline: standardized preprocessing for large-scale EEG analysis.

    PubMed

    Bigdely-Shamlo, Nima; Mullen, Tim; Kothe, Christian; Su, Kyung-Min; Robbins, Kay A

    2015-01-01

    The technology to collect brain imaging and physiological measures has become portable and ubiquitous, opening the possibility of large-scale analysis of real-world human imaging. By its nature, such data is large and complex, making automated processing essential. This paper shows how lack of attention to the very early stages of an EEG preprocessing pipeline can reduce the signal-to-noise ratio and introduce unwanted artifacts into the data, particularly for computations done in single precision. We demonstrate that ordinary average referencing improves the signal-to-noise ratio, but that noisy channels can contaminate the results. We also show that identification of noisy channels depends on the reference and examine the complex interaction of filtering, noisy channel identification, and referencing. We introduce a multi-stage robust referencing scheme to deal with the noisy channel-reference interaction. We propose a standardized early-stage EEG processing pipeline (PREP) and discuss the application of the pipeline to more than 600 EEG datasets. The pipeline includes an automatically generated report for each dataset processed. Users can download the PREP pipeline as a freely available MATLAB library from http://eegstudy.org/prepcode.
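
    The ordinary average referencing discussed in this abstract can be sketched in a few lines. This is an illustration of the plain technique only, not the multi-stage robust referencing scheme of PREP (which is distributed as a MATLAB library); the data layout (a list of channels, each a list of samples) is an assumption made for the example.

```python
def average_reference(data):
    """Subtract the across-channel mean from every channel at each sample.

    `data` is a list of channels; each channel is a list of samples of
    equal length (hypothetical layout chosen for this sketch).
    """
    n_channels = len(data)
    n_samples = len(data[0])
    # Mean over channels, computed separately for each time point.
    means = [sum(ch[t] for ch in data) / n_channels for t in range(n_samples)]
    return [[ch[t] - means[t] for t in range(n_samples)] for ch in data]


# Two flat channels plus one channel carrying a large artifact.
noisy = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [300.0, -300.0, 300.0],
]
rereferenced = average_reference(noisy)
```

    After re-referencing, the two originally flat channels carry an inverted copy of the artifact at one third of its amplitude. This is the contamination by noisy channels that the abstract describes and that motivates detecting bad channels before computing the reference.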

  6. The PREP pipeline: standardized preprocessing for large-scale EEG analysis

    PubMed Central

    Bigdely-Shamlo, Nima; Mullen, Tim; Kothe, Christian; Su, Kyung-Min; Robbins, Kay A.

    2015-01-01

    The technology to collect brain imaging and physiological measures has become portable and ubiquitous, opening the possibility of large-scale analysis of real-world human imaging. By its nature, such data is large and complex, making automated processing essential. This paper shows how lack of attention to the very early stages of an EEG preprocessing pipeline can reduce the signal-to-noise ratio and introduce unwanted artifacts into the data, particularly for computations done in single precision. We demonstrate that ordinary average referencing improves the signal-to-noise ratio, but that noisy channels can contaminate the results. We also show that identification of noisy channels depends on the reference and examine the complex interaction of filtering, noisy channel identification, and referencing. We introduce a multi-stage robust referencing scheme to deal with the noisy channel-reference interaction. We propose a standardized early-stage EEG processing pipeline (PREP) and discuss the application of the pipeline to more than 600 EEG datasets. The pipeline includes an automatically generated report for each dataset processed. Users can download the PREP pipeline as a freely available MATLAB library from http://eegstudy.org/prepcode. PMID:26150785

  7. On sample size and different interpretations of snow stability datasets

    NASA Astrophysics Data System (ADS)

    Schirmer, M.; Mitterer, C.; Schweizer, J.

    2009-04-01

    Interpretations of snow stability variations need an assessment of the stability itself, independent of the scale investigated in the study. Studies on stability variations at a regional scale have often chosen stability tests such as the Rutschblock test or combinations of various tests in order to detect differences in aspect and elevation. The question arose: 'how capable are such stability interpretations in drawing conclusions?' There are at least three possible error sources: (i) the variance of the stability test itself; (ii) the stability variance at an underlying slope scale; and (iii) that the stability interpretation might not be directly related to the probability of skier triggering. Various stability interpretations have been proposed in the past that provide partly different results. We compared a subjective one based on expert knowledge with a more objective one based on a measure derived from comparing skier-triggered slopes vs. slopes that have been skied but not triggered. In this study, the uncertainties are discussed and their effects on regional scale stability variations will be quantified in a pragmatic way. An existing dataset with very large sample sizes was revisited. This dataset contained the variance of stability at a regional scale for several situations. The stability in this dataset was determined using the subjective interpretation scheme based on expert knowledge. The question to be answered was how many measurements were needed to obtain similar results (mainly stability differences in aspect or elevation) as with the complete dataset. The optimal sample size was obtained in several ways: (i) assuming a nominal data scale, the sample size was determined with a given test, significance level and power, and by calculating the mean and standard deviation of the complete dataset. With this method it can also be determined if the complete dataset consists of an appropriate sample size. (ii) Smaller subsets were created with similar

  8. A Large-Scale Assessment of Nucleic Acids Binding Site Prediction Programs

    PubMed Central

    Miao, Zhichao; Westhof, Eric

    2015-01-01

    Computational prediction of nucleic acid binding sites in proteins is necessary to disentangle functional mechanisms in most biological processes and to explore the binding mechanisms. Several strategies have been proposed, but the state-of-the-art approaches display a great diversity in i) the definition of nucleic acid binding sites; ii) the training and test datasets; iii) the algorithmic methods for the prediction strategies; iv) the performance measures and v) the distribution and availability of the prediction programs. Here we report a large-scale assessment of 19 web servers and 3 stand-alone programs on 41 datasets including more than 5000 proteins derived from 3D structures of protein-nucleic acid complexes. Well-defined binary assessment criteria (specificity, sensitivity, precision, accuracy…) are applied. We found that i) the tools have been greatly improved over the years; ii) some of the approaches suffer from theoretical defects and there is still room for sorting out the essential mechanisms of binding; iii) RNA binding and DNA binding appear to follow similar driving forces and iv) dataset bias may exist in some methods. PMID:26681179

  9. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  10. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  11. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
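
    The problem statement above (find l-length strings that occur in at least qt input sequences with up to d mismatches) can be made concrete with a brute-force sketch. This is neither SamSelect nor any of the existing qPMS algorithms the abstract refers to. It enumerates all 4^l candidate strings, so it is only usable for very small l, and the input sequences below are invented for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, seq, d):
    """True if some window of `seq` is within d mismatches of `motif`."""
    l = len(motif)
    return any(
        hamming(motif, seq[i:i + l]) <= d for i in range(len(seq) - l + 1)
    )


def qpms_bruteforce(seqs, l, d, q):
    """Return all l-length strings occurring in at least q*t sequences."""
    t = len(seqs)
    hits = []
    for cand in product("ACGT", repeat=l):
        motif = "".join(cand)
        count = sum(occurs(motif, s, d) for s in seqs)
        # Small tolerance guards against floating-point error in q * t.
        if count + 1e-9 >= q * t:
            hits.append(motif)
    return hits


# Toy input (t = 4, n = 8); real datasets are far larger.
toy = ["ACGTACGT", "TTACGATT", "GGGGGGGG", "CCACGTCC"]
exact_half = qpms_bruteforce(toy, l=4, d=0, q=0.5)
```

    Each candidate is checked against every input sequence, which shows directly why a large t increases running time. That cost is what selecting a smaller sample set D' is meant to reduce.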

  12. Accuracy assessment of the U.S. Geological Survey National Elevation Dataset, and comparison with other large-area elevation datasets: SRTM and ASTER

    USGS Publications Warehouse

    Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.

    2014-01-01

    The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).

  13. Fast large-scale object retrieval with binary quantization

    NASA Astrophysics Data System (ADS)

    Zhou, Shifu; Zeng, Dan; Shen, Wei; Zhang, Zhijiang; Tian, Qi

    2015-11-01

    The objective of large-scale object retrieval systems is to search for images that contain the target object in an image database. Where state-of-the-art approaches rely on global image representations to conduct searches, we consider many boxes per image as candidates to search locally in a picture. In this paper, a feature quantization algorithm called binary quantization is proposed. In binary quantization, a scale-invariant feature transform (SIFT) feature is quantized into a descriptive and discriminative bit-vector, which allows itself to adapt to the classic inverted file structure for box indexing. The inverted file, which stores the bit-vector and box ID where the SIFT feature is located inside, is compact and can be loaded into the main memory for efficient box indexing. We evaluate our approach on available object retrieval datasets. Experimental results demonstrate that the proposed approach is fast and achieves excellent search quality. Therefore, the proposed approach is an improvement over state-of-the-art approaches for object retrieval.

  14. The allometric exponent for scaling clearance varies with age: a study on seven propofol datasets ranging from preterm neonates to adults.

    PubMed

    Wang, Chenguang; Allegaert, Karel; Peeters, Mariska Y M; Tibboel, Dick; Danhof, Meindert; Knibbe, Catherijne A J

    2014-01-01

    For scaling clearance between adults and children, allometric scaling with a fixed exponent of 0.75 is often applied. In this analysis, we performed a systematic study on the allometric exponent for scaling propofol clearance between two subpopulations selected from neonates, infants, toddlers, children, adolescents and adults. Seven propofol studies were included in the analysis (neonates, infants, toddlers, children, adolescents, adults1 and adults2). In a systematic manner, two out of the six study populations were selected resulting in 15 combined datasets. In addition, the data of the seven studies were regrouped into five age groups (FDA Guidance 1998), from which four combined datasets were prepared consisting of one paediatric age group and the adult group. In each of these 19 combined datasets, the allometric scaling exponent for clearance was estimated using population pharmacokinetic modelling (NONMEM 7.2). The allometric exponent for propofol clearance varied between 1.11 and 2.01 in cases where the neonate dataset was included. When two paediatric datasets were analyzed, the exponent varied between 0.2 and 2.01, while it varied between 0.56 and 0.81 when the adult population and a paediatric dataset except for neonates were selected. Scaling from adults to adolescents, children, infants and neonates resulted in exponents of 0.74, 0.70, 0.60 and 1.11 respectively. For scaling clearance, ¾ allometric scaling may be of value for scaling between adults and adolescents or children, while it can neither be used for neonates nor for two paediatric populations. For scaling to neonates an exponent between 1 and 2 was identified. © 2013 The British Pharmacological Society.
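
    The effect of the exponent can be illustrated with the usual power-law form of allometric scaling, CL = CL_ref × (W / W_ref)^exponent. The sketch below is a hedged illustration: the 70 kg reference weight, the adult clearance value and the body weights are assumptions chosen for the example, and only the exponents (0.75, 0.74, 0.70, 0.60, 1.11) come from the abstract.

```python
def scale_clearance(cl_ref, weight_kg, exponent=0.75, ref_weight_kg=70.0):
    """Scale a reference clearance to another body weight by a power law."""
    return cl_ref * (weight_kg / ref_weight_kg) ** exponent


# Exponents reported in the abstract for scaling from adults.
reported_exponents = {
    "adolescents": 0.74,
    "children": 0.70,
    "infants": 0.60,
    "neonates": 1.11,
}

# Hypothetical adult clearance (arbitrary units) and neonate weight.
adult_cl = 2.0
neonate_weight = 3.5

fixed = scale_clearance(adult_cl, neonate_weight, exponent=0.75)
estimated = scale_clearance(
    adult_cl, neonate_weight, exponent=reported_exponents["neonates"]
)
```

    Because the weight ratio is below 1, the larger neonatal exponent gives a lower predicted clearance than the fixed 0.75 exponent, so under these assumed inputs the fixed exponent would overpredict neonatal clearance.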

  15. Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E

    NASA Technical Reports Server (NTRS)

    Ma, Kwan-Liu; Crockett, Thomas W.

    1999-01-01

    This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.

  16. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset | Office of Cancer Genomics

    Cancer.gov

    Identifying genetic alterations that prime a cancer cell to respond to a particular therapeutic agent can facilitate the development of precision cancer medicines. Cancer cell-line (CCL) profiling of small-molecule sensitivity has emerged as an unbiased method to assess the relationships between genetic or cellular features of CCLs and small-molecule response. Here, we developed annotated cluster multidimensional enrichment analysis to explore the associations between groups of small molecules and groups of CCLs in a new, quantitative sensitivity dataset.

  17. Multi-GNSS PPP-RTK: From Large- to Small-Scale Networks.

    PubMed

    Nadarajah, Nandakumaran; Khodabandeh, Amir; Wang, Kan; Choudhury, Mazher; Teunissen, Peter J G

    2018-04-03

    Precise point positioning (PPP) and its integer ambiguity resolution-enabled variant, PPP-RTK (real-time kinematic), can benefit enormously from the integration of multiple global navigation satellite systems (GNSS). In such a multi-GNSS landscape, the positioning convergence time is expected to be reduced considerably as compared to the one obtained by a single-GNSS setup. It is therefore the goal of the present contribution to provide numerical insights into the role taken by the multi-GNSS integration in delivering fast and high-precision positioning solutions (sub-decimeter and centimeter levels) using PPP-RTK. To that end, we employ the Curtin PPP-RTK platform and process data-sets of GPS, BeiDou Navigation Satellite System (BDS) and Galileo in stand-alone and combined forms. The data-sets are collected by various receiver types, ranging from high-end multi-frequency geodetic receivers to low-cost single-frequency mass-market receivers. The corresponding stations form a large-scale (Australia-wide) network as well as a small-scale network with inter-station distances less than 30 km. In case of the Australia-wide GPS-only ambiguity-float setup, 90% of the horizontal positioning errors (kinematic mode) are shown to become less than five centimeters after 103 min. The stated required time is reduced to 66 min for the corresponding GPS + BDS + Galileo setup. The time is further reduced to 15 min by applying single-receiver ambiguity resolution. The outcomes are supported by the positioning results of the small-scale network.

  18. I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.

    Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.

  19. Storm-tracks interannual variability and large-scale climate modes

    NASA Astrophysics Data System (ADS)

    Liberato, Margarida L. R.; Trigo, Isabel F.; Trigo, Ricardo M.

    2013-04-01

    In this study we focus on the interannual variability and observed changes in northern hemisphere mid-latitude storm-tracks and relate them to large-scale atmospheric circulation variability modes. Extratropical storminess, cyclones' dominant paths, frequency and intensity have long been the object of climatological studies. The analysis of storm characteristics and historical trends presented here is based on the cyclone detecting and tracking algorithm first developed for the Mediterranean region (Trigo et al. 1999) and recently extended to a larger Euro-Atlantic region (Trigo 2006). The objective methodology, which identifies and follows individual lows as minima in SLP fields, fulfilling a set of conditions regarding the central pressure and the pressure gradient, is applied to the northern hemisphere 6-hourly geopotential data at 1000 hPa from the 20th Century Reanalyses (20CRv2) project and from reanalysis datasets provided by the European Centre for Medium-Range Weather Forecasts (ECMWF): ERA-40 and ERA-Interim reanalyses. First, we assess the interannual variability and cyclone frequency trends for each of the datasets, for the 20th century and for the period between 1958 and 2002 using the highest spatial resolution available (1.125° x 1.125°) from the ERA-40 data. Results show that winter variability of storm paths, cyclone frequency and travel times is in agreement with the reported variability in a number of large-scale climate patterns (including the North Atlantic Oscillation, the East Atlantic Pattern and the Scandinavian Pattern). In addition, three storm-track databases are built spanning the common available extended winter seasons from October 1979 to March 2002. Although relatively short, this common period allows a comparison of systems represented in reanalysis datasets with distinct horizontal resolutions. This exercise is mostly focused on the key areas of cyclogenesis and cyclolysis and main cyclone characteristics over the northern

  20. Orthology detection combining clustering and synteny for very large datasets.

    PubMed

    Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K; Prohaska, Sonja J; Stadler, Peter F

    2014-01-01

    The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.

  1. Orthology Detection Combining Clustering and Synteny for Very Large Datasets

    PubMed Central

    Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K.; Prohaska, Sonja J.; Stadler, Peter F.

    2014-01-01

    The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets. PMID:25137074

  2. Assessment of the effects and limitations of the 1998 to 2008 Abbreviated Injury Scale map using a large population-based dataset.

    PubMed

    Palmer, Cameron S; Franklyn, Melanie

    2011-01-07

    Trauma systems should consistently monitor a given trauma population over a period of time. The Abbreviated Injury Scale (AIS) and derived scores such as the Injury Severity Score (ISS) are commonly used to quantify injury severities in trauma registries. To reflect contemporary trauma management and treatment, the most recent version of the AIS (AIS08) contains many codes which differ in severity from their equivalents in the earlier 1998 version (AIS98). Consequently, the adoption of AIS08 may impede comparisons between data coded using different AIS versions. It may also affect the number of patients classified as major trauma. The entire AIS98-coded injury dataset of a large population based trauma registry was retrieved and mapped to AIS08 using the currently available AIS98-AIS08 dictionary map. The percentage of codes which had increased or decreased in severity, or could not be mapped, was examined in conjunction with the effect of these changes to the calculated ISS. The potential for free text information accompanying AIS coding to improve the quality of AIS mapping was explored. A total of 128280 AIS98-coded injuries were evaluated in 32134 patients, 15471 patients of whom were classified as major trauma. Although only 4.5% of dictionary codes decreased in severity from AIS98 to AIS08, this represented almost 13% of injuries in the registry. In 4.9% of patients, no injuries could be mapped. ISS was potentially unreliable in one-third of patients, as they had at least one AIS98 code which could not be mapped. Using AIS08, the number of patients classified as major trauma decreased by between 17.3% and 30.3%. Evaluation of free text descriptions for some injuries demonstrated the potential to improve mapping between AIS versions. Converting AIS98-coded data to AIS08 results in a significant decrease in the number of patients classified as major trauma. Many AIS98 codes are missing from the existing AIS map, and across a trauma population the AIS08 dataset
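
    The effect of a severity change on the ISS can be illustrated with a short sketch. It assumes the standard ISS definition (sum of the squared highest AIS severities in the three most severely injured body regions, with any AIS 6 injury giving 75) and the commonly used ISS > 15 cut-off for major trauma, which the abstract does not state. The body-region names and the sample patient are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Assumption: the common ISS > 15 cut-off; the abstract does not give the threshold used.
MAJOR_TRAUMA_THRESHOLD = 15


def injury_severity_score(injuries: Iterable[Tuple[str, Optional[int]]]) -> Optional[int]:
    """Compute the ISS from (body_region, AIS severity) pairs.

    Returns None if any severity is missing (for example an unmappable code),
    because the score cannot then be calculated reliably.
    """
    records = list(injuries)
    if any(severity is None for _, severity in records):
        return None
    if any(severity == 6 for _, severity in records):
        return 75  # by convention, an AIS 6 injury sets the ISS to its maximum
    worst_per_region = {}
    for region, severity in records:
        worst_per_region[region] = max(worst_per_region.get(region, 0), severity)
    top_three = sorted(worst_per_region.values(), reverse=True)[:3]
    return sum(s * s for s in top_three)


def is_major_trauma(iss: Optional[int]) -> bool:
    """Classify a patient as major trauma under the assumed threshold."""
    return iss is not None and iss > MAJOR_TRAUMA_THRESHOLD


# Hypothetical patient whose head injury drops from severity 3 to 2 after mapping.
patient_ais98 = [("head", 3), ("chest", 2), ("extremities", 2), ("extremities", 1)]
patient_ais08 = [("head", 2), ("chest", 2), ("extremities", 2), ("extremities", 1)]

print(injury_severity_score(patient_ais98), is_major_trauma(injury_severity_score(patient_ais98)))
print(injury_severity_score(patient_ais08), is_major_trauma(injury_severity_score(patient_ais08)))
```

    In this toy case a single one-point decrease moves the patient from 17 to 12 and out of the major trauma group, which is the kind of shift the abstract reports at registry scale.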

  3. SLIDE - a web-based tool for interactive visualization of large-scale -omics data.

    PubMed

    Ghosh, Soumita; Datta, Abhik; Tan, Kaisen; Choi, Hyungwon

    2018-06-28

    Data visualization is often regarded as a post hoc step for verifying statistically significant results in the analysis of high-throughput data sets. This common practice leaves a large amount of raw data behind, from which more information can be extracted. However, existing solutions do not provide capabilities to explore large-scale raw datasets using biologically sensible queries, nor do they allow user interaction based real-time customization of graphics. To address these drawbacks, we have designed an open-source, web-based tool called Systems-Level Interactive Data Exploration, or SLIDE to visualize large-scale -omics data interactively. SLIDE's interface makes it easier for scientists to explore quantitative expression data in multiple resolutions in a single screen. SLIDE is publicly available under BSD license both as an online version as well as a stand-alone version at https://github.com/soumitag/SLIDE. Supplementary Information are available at Bioinformatics online.

  4. Approximate registration of point clouds with large scale differences

    NASA Astrophysics Data System (ADS)

    Novak, D.; Schindler, K.

    2013-10-01

    3D reconstruction of objects is a basic task in many fields, including surveying, engineering, entertainment and cultural heritage. The task is nowadays often accomplished with a laser scanner, which produces dense point clouds, but lacks accurate colour information, and lacks per-point accuracy measures. An obvious solution is to combine laser scanning with photogrammetric recording. In that context, the problem arises to register the two datasets, which feature large scale, translation and rotation differences. The absence of approximate registration parameters (3D translation, 3D rotation and scale) precludes the use of fine-registration methods such as ICP. Here, we present a method to register realistic photogrammetric and laser point clouds in a fully automated fashion. The proposed method decomposes the registration into a sequence of simpler steps: first, two rotation angles are determined by finding dominant surface normal directions, then the remaining parameters are found with RANSAC followed by ICP and scale refinement. These two steps are carried out at low resolution, before computing a precise final registration at higher resolution.

  5. A Global View of Large-Scale Commercial Fishing

    NASA Astrophysics Data System (ADS)

    Kroodsma, D.

    2016-12-01

    Advances in big data processing and satellite technology, combined with the widespread adoption of Automatic Identification System (AIS) devices, now allow the monitoring of fishing activity at a global scale and in high resolution. We analyzed AIS data from more than 40,000 vessels from 2012-2015 to produce 0.1 degree global daily maps of apparent fishing effort. Vessels were matched to publicly accessible fishing vessel registries and identified as fishing vessels through AIS Type 5 and Type 24 self-reported messages. Fishing vessels that broadcast false locations in AIS data were excluded from the analysis. To model fishing pattern classification, a subset of fishing vessels was analyzed and specific movements were classified as "fishing" or "not fishing." A logistic regression model was fitted to these classifications using the following features: a vessel's average speed, the standard deviation of its speed, and the standard deviation of its course over a 12 hour time window. We then applied this model to the entire fishing vessel dataset and time normalized it to produce a global map of fishing hours. The resulting dataset allows for numerous new analyses. For instance, it can assist with monitoring apparent fishing activity in large pelagic marine protected areas and restricted gear use areas, or it can quantify how activity may be affected by seasonal or annual changes in biological productivity. This dataset is now published and freely available in Google's Earth Engine platform, available for researchers to answer a host of questions related to global fishing effort.
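
    The three features and the logistic regression described above can be sketched as follows. This is an illustration only, not the authors' implementation: the batch gradient-descent fit, the feature standardization, the plain (non-circular) standard deviation of course, and all sample windows are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def window_features(speeds, courses):
    """Return [mean speed, speed std, course std] for one 12-hour window."""
    # Assumption: plain standard deviation of course in degrees; 0/360 wrap-around is ignored.
    return [mean(speeds), pstdev(speeds), pstdev(courses)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent on standardized features."""
    columns = list(zip(*X))
    mu = [mean(c) for c in columns]
    sd = [pstdev(c) or 1.0 for c in columns]  # guard against constant features
    Z = [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in X]
    w = [0.0] * (len(mu) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for z, label in zip(Z, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], z))) - label
            grad[0] += err
            for j, v in enumerate(z):
                grad[j + 1] += err * v
        w = [a - lr * g / len(Z) for a, g in zip(w, grad)]
    return {"w": w, "mu": mu, "sd": sd}


def fishing_probability(model, features):
    """Probability that a window with the given features is 'fishing'."""
    z = [(v - m) / s for v, m, s in zip(features, model["mu"], model["sd"])]
    return sigmoid(model["w"][0] + sum(a * b for a, b in zip(model["w"][1:], z)))


# Hypothetical hand-labelled windows: (speeds in knots, courses in degrees, label).
labelled = [
    ([10.1, 9.8, 10.3, 10.0], [90, 92, 91, 89], 0),       # steady transit
    ([11.5, 11.2, 11.4, 11.6], [180, 181, 179, 182], 0),
    ([9.0, 9.3, 8.8, 9.1], [45, 44, 47, 46], 0),
    ([2.5, 4.0, 1.5, 3.5], [10, 120, 250, 60], 1),        # slow and erratic
    ([3.0, 1.0, 4.5, 2.0], [200, 80, 300, 150], 1),
    ([2.0, 3.8, 2.9, 1.2], [30, 170, 95, 260], 1),
]
X = [window_features(s, c) for s, c, _ in labelled]
y = [label for _, _, label in labelled]
model = fit_logistic(X, y)

new_window = window_features([2.2, 3.9, 1.4, 3.0], [20, 140, 230, 70])
print(fishing_probability(model, new_window) > 0.5)
```

    Summing the windows classified as fishing per grid cell and day would then give an effort map in hours, along the lines the abstract describes.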

  6. Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks

    PubMed Central

    Yamanaka, Ryota; Kitano, Hiroaki

    2013-01-01

    Elucidating gene regulatory networks (GRNs) from large-scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus-driven approaches combining different algorithms, have become a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy to combine many algorithms does not always lead to performance improvement compared to the cost of consensus and (ii) TopkNet integrating only high-performance algorithms provides significant performance improvement compared to the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful to determine potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network are similar to those associated with an unknown regulatory network, optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if similarity between the two expression datasets is high, TopkNet integrating algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study provides a powerful strategy towards harnessing the wisdom of the crowds in reconstruction of unknown regulatory networks. PMID:24278007

  7. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history

  8. Development of the Large-Scale Forcing Data to Support MC3E Cloud Modeling Studies

    NASA Astrophysics Data System (ADS)

    Xie, S.; Zhang, Y.

    2011-12-01

    The large-scale forcing fields (e.g., vertical velocity and advective tendencies) are required to run single-column and cloud-resolving models (SCMs/CRMs), which are the two key modeling frameworks widely used to link field data to climate model developments. In this study, we use an advanced objective analysis approach to derive the required forcing data from the soundings collected by the Midlatitude Continental Convective Cloud Experiment (MC3E) in support of its cloud modeling studies. MC3E is the latest major field campaign conducted during the period 22 April 2011 to 06 June 2011 in south-central Oklahoma through a joint effort between the DOE ARM program and the NASA Global Precipitation Measurement Program. One of its primary goals is to provide a comprehensive dataset that can be used to describe the large-scale environment of convective cloud systems and evaluate model cumulus parameterizations. The objective analysis used in this study is the constrained variational analysis method. A unique feature of this approach is the use of domain-averaged surface and top-of-the-atmosphere (TOA) observations (e.g., precipitation and radiative and turbulent fluxes) as constraints to adjust atmospheric state variables from soundings by the smallest possible amount to conserve column-integrated mass, moisture, and static energy so that the final analysis data are dynamically and thermodynamically consistent. To address potential uncertainties in the surface observations, an ensemble forcing dataset will be developed. Multi-scale forcing will also be created for simulating convective systems of various scales. At the meeting, we will provide more details about the forcing development and present some preliminary analysis of the characteristics of the large-scale forcing structures for several selected convective systems observed during MC3E.

  9. Comparison and validation of gridded precipitation datasets for Spain

    NASA Astrophysics Data System (ADS)

    Quintana-Seguí, Pere; Turco, Marco; Míguez-Macho, Gonzalo

    2016-04-01

    In this study, two gridded precipitation datasets are compared and validated in Spain: the recently developed SAFRAN dataset and the Spain02 dataset. These are validated using rain gauges and they are also compared to the low resolution ERA-Interim reanalysis. The SAFRAN precipitation dataset has been recently produced, using the SAFRAN meteorological analysis, which is extensively used in France (Durand et al. 1993, 1999; Quintana-Seguí et al. 2008; Vidal et al., 2010) and which has recently been applied to Spain (Quintana-Seguí et al., 2015). SAFRAN uses an optimal interpolation (OI) algorithm and uses all available rain gauges from the Spanish State Meteorological Agency (Agencia Estatal de Meteorología, AEMET). The product has a spatial resolution of 5 km and it spans from September 1979 to August 2014. This dataset has been produced mainly to be used in large scale hydrological applications. Spain02 (Herrera et al. 2012, 2015) is another high quality precipitation dataset for Spain based on a dense network of quality-controlled stations and it has different versions at different resolutions. In this study we used the version with a resolution of 0.11°. The product spans from 1971 to 2010. Spain02 is well tested and widely used, mainly, but not exclusively, for RCM model validation and statistical downscaling. ERA-Interim is a well known global reanalysis with a spatial resolution of ~79 km. It has been included in the comparison because it is a widely used product for continental and global scale studies and also in smaller scale studies in data poor countries. Thus, its comparison with higher resolution products of a data rich country, such as Spain, allows us to quantify the errors made when using such datasets for national scale studies, in line with some of the objectives of the EU-FP7 eartH2Observe project. The comparison shows that SAFRAN and Spain02 perform similarly, even though their underlying principles are different. Both products are largely

  10. Energy transfers in large-scale and small-scale dynamos

    NASA Astrophysics Data System (ADS)

    Samtaney, Ravi; Kumar, Rohit; Verma, Mahendra

    2015-11-01

    We present the energy transfers, mainly energy fluxes and shell-to-shell energy transfers, in small-scale dynamo (SSD) and large-scale dynamo (LSD) using numerical simulations of MHD turbulence for Pm = 20 (SSD) and for Pm = 0.2 (LSD) on a 1024³ grid. For SSD, we demonstrate that the magnetic energy growth is caused by nonlocal energy transfers from the large-scale or forcing-scale velocity field to the small-scale magnetic field. The peak of these energy transfers moves towards lower wavenumbers as the dynamo evolves, which is the reason for the growth of the magnetic fields at the large scales. The energy transfers U2U (velocity to velocity) and B2B (magnetic to magnetic) are forward and local. For LSD, we show that the magnetic energy growth takes place via energy transfers from the large-scale velocity field to the large-scale magnetic field. We observe forward U2U and B2B energy flux, similar to SSD.

  11. Multi-GNSS PPP-RTK: From Large- to Small-Scale Networks

    PubMed Central

    Nadarajah, Nandakumaran; Wang, Kan; Choudhury, Mazher

    2018-01-01

    Precise point positioning (PPP) and its integer ambiguity resolution-enabled variant, PPP-RTK (real-time kinematic), can benefit enormously from the integration of multiple global navigation satellite systems (GNSS). In such a multi-GNSS landscape, the positioning convergence time is expected to be reduced considerably as compared to the one obtained by a single-GNSS setup. It is therefore the goal of the present contribution to provide numerical insights into the role taken by the multi-GNSS integration in delivering fast and high-precision positioning solutions (sub-decimeter and centimeter levels) using PPP-RTK. To that end, we employ the Curtin PPP-RTK platform and process data-sets of GPS, BeiDou Navigation Satellite System (BDS) and Galileo in stand-alone and combined forms. The data-sets are collected by various receiver types, ranging from high-end multi-frequency geodetic receivers to low-cost single-frequency mass-market receivers. The corresponding stations form a large-scale (Australia-wide) network as well as a small-scale network with inter-station distances less than 30 km. In case of the Australia-wide GPS-only ambiguity-float setup, 90% of the horizontal positioning errors (kinematic mode) are shown to become less than five centimeters after 103 min. The stated required time is reduced to 66 min for the corresponding GPS + BDS + Galileo setup. The time is further reduced to 15 min by applying single-receiver ambiguity resolution. The outcomes are supported by the positioning results of the small-scale network. PMID:29614040
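
    A statement such as "90% of the horizontal positioning errors become less than five centimeters after 103 min" can be turned into a small computation. The sketch below assumes one plausible definition of convergence time, namely the first epoch from which the 90th percentile of horizontal errors across sessions stays below the threshold; the abstract does not spell out the exact definition, and the error values are made up.

```python
import math


def horizontal_error(east_m, north_m):
    """Horizontal positioning error in metres from east and north components."""
    return math.hypot(east_m, north_m)


def percentile(values, q):
    """Nearest-rank percentile for an integer q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[rank - 1]


def convergence_time(errors_by_epoch, epoch_minutes, threshold_m=0.05, q=90):
    """Return the time (minutes) of the first epoch from which the q-th percentile
    of horizontal errors stays below threshold_m, or None if that never happens.

    errors_by_epoch[k] holds the horizontal errors (m) of all sessions at epoch k.
    """
    converged_at = None
    for k, errors in enumerate(errors_by_epoch):
        if percentile(errors, q) < threshold_m:
            if converged_at is None:
                converged_at = k
        else:
            converged_at = None  # a later violation resets the convergence point
    return None if converged_at is None else converged_at * epoch_minutes


# Hypothetical errors (m) for five sessions sampled every 5 minutes.
epochs = [
    [0.30, 0.25, 0.40, 0.20, 0.35],
    [0.04, 0.03, 0.06, 0.02, 0.04],
    [0.04, 0.03, 0.045, 0.02, 0.04],
    [0.03, 0.02, 0.04, 0.02, 0.03],
]
print(convergence_time(epochs, epoch_minutes=5))
```

    Running the same function on GPS-only and multi-GNSS error series would give the kind of side-by-side convergence times quoted in the abstract.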

  12. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    NASA Astrophysics Data System (ADS)

    Beck, Hylke E.; Vergopolan, Noemi; Pan, Ming; Levizzani, Vincenzo; van Dijk, Albert I. J. M.; Weedon, Graham P.; Brocca, Luca; Pappenberger, Florian; Huffman, George J.; Wood, Eric F.

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized (< 50 000 km²) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR) and the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1). Our results highlight large differences in estimation accuracy, and hence the importance of P

  13. Towards large scale modelling of wetland water dynamics in northern basins.

    NASA Astrophysics Data System (ADS)

    Pedinotti, V.; Sapriza, G.; Stone, L.; Davison, B.; Pietroniro, A.; Quinton, W. L.; Spence, C.; Wheater, H. S.

    2015-12-01

    Understanding the hydrological behaviour of low topography, wetland-dominated sub-arctic areas is a major issue for the improvement of large scale hydrological models. These wet organic soils cover a large extent of Northern America and have a considerable impact on the rainfall-runoff response of a catchment. Moreover, their strong interactions with the lower atmosphere and the carbon cycle make these areas a noteworthy component of the regional climate system. In the framework of the Changing Cold Regions Network (CCRN), this study aims at providing a model for wetland water dynamics that can be used for large scale applications in cold regions. The modelling system has two main components: a) the simulation of surface runoff using the Modélisation Environmentale Communautaire - Surface and Hydrology (MESH) land surface model driven with several gridded atmospheric datasets and b) the routing of surface runoff using the WATROUTE channel scheme. As a preliminary study, we focus on two small representative study basins in Northern Canada: Scotty Creek in the lower Liard River valley of the Northwest Territories and Baker Creek, located a few kilometers north of Yellowknife. Both areas present characteristic landscapes dominated by a series of peat plateaus, channel fens, small lakes and bogs. Moreover, they constitute important fieldwork sites with detailed data to support our modelling study. The challenge of our new wetland model is to represent the hydrological functioning of the various landscape units encountered in those watersheds and their interactions using simple numerical formulations that can be later extended to larger basins such as the Mackenzie river basin. Using observed datasets, the performance of the model in simulating the temporal evolution of hydrological variables such as the water table depth, frost table depth and discharge is assessed.

  14. Analyzing large scale genomic data on the cloud with Sparkhit

    PubMed Central

    Huang, Liren; Krüger, Jan

    2018-01-01

    Motivation: The increasing amount of next-generation sequencing data poses a fundamental challenge to large scale genomic analytics. Existing tools use different distributed computational platforms to scale out bioinformatics workloads. However, these tools do not scale efficiently. Moreover, they have heavy run time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. Results: Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92–157 times faster than MetaSpark on metagenomic fragment recruitment and 18–32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, which includes the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 h, presenting an approach to easily associate large amounts of public datasets with reference data. Availability and implementation: Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/. Contact: asczyrba@cebitec.uni-bielefeld.de. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:29253074

  15. Reverse engineering and analysis of large genome-scale gene networks

    PubMed Central

    Aluru, Maneesha; Zola, Jaroslaw; Nettleton, Dan; Aluru, Srinivas

    2013-01-01

    Reverse engineering the whole-genome networks of complex multicellular organisms continues to remain a challenge. While simpler models easily scale to large number of genes and gene expression datasets, more accurate models are compute intensive limiting their scale of applicability. To enable fast and accurate reconstruction of large networks, we developed Tool for Inferring Network of Genes (TINGe), a parallel mutual information (MI)-based program. The novel features of our approach include: (i) B-spline-based formulation for linear-time computation of MI, (ii) a novel algorithm for direct permutation testing and (iii) development of parallel algorithms to reduce run-time and facilitate construction of large networks. We assess the quality of our method by comparison with ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) and GeneNet and demonstrate its unique capability by reverse engineering the whole-genome network of Arabidopsis thaliana from 3137 Affymetrix ATH1 GeneChips in just 9 min on a 1024-core cluster. We further report on the development of a new software Gene Network Analyzer (GeNA) for extracting context-specific subnetworks from a given set of seed genes. Using TINGe and GeNA, we performed analysis of 241 Arabidopsis AraCyc 8.0 pathways, and the results are made available through the web. PMID:23042249

  16. A fast boosting-based screening method for large-scale association study in complex traits with genetic heterogeneity.

    PubMed

    Wang, Lu-Yong; Fasulo, D

    2006-01-01

    Genome-wide association studies for complex diseases generate massive amounts of single nucleotide polymorphism (SNP) data. Univariate statistical tests (i.e., the Fisher exact test) have been used to single out non-associated SNPs. However, disease-susceptible SNPs may have little marginal effect in the population and are unlikely to be retained after the univariate tests. Also, model-based methods are impractical for large-scale datasets. Moreover, genetic heterogeneity makes it harder for traditional methods to identify the genetic causes of diseases. A more recent random forest method provides a more robust way of screening SNPs at the scale of thousands. However, for larger-scale data, such as Affymetrix Human Mapping 100K GeneChip data, a faster screening method is required to screen SNPs in whole-genome, large-scale association analysis with genetic heterogeneity. We propose a boosting-based method for rapid screening in large-scale analysis of complex traits in the presence of genetic heterogeneity. It provides a relatively fast and fairly good tool for screening and limiting the candidate SNPs for further, more complex computational modeling tasks.
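
    The univariate baseline mentioned above, a per-SNP Fisher exact test, can be sketched in a few lines. This illustrates only that conventional filter, not the boosting-based method the abstract proposes; the 2x2 allele-count layout, the 0.05 cut-off and the SNP names are assumptions chosen for the example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for k in range(k_min, k_max + 1):
        p = prob(k)
        if p <= p_observed * (1 + 1e-9):  # small tolerance for floating-point ties
            total += p
    return min(1.0, total)


def screen_snps(tables, alpha=0.05):
    """Keep SNPs whose allele counts differ between cases and controls.

    tables maps a SNP name to (case_minor, case_major, control_minor, control_major).
    """
    return [snp for snp, t in tables.items() if fisher_exact_two_sided(*t) < alpha]


# Hypothetical allele counts for three SNPs.
tables = {
    "snp_A": (10, 0, 0, 10),   # strongly associated
    "snp_B": (5, 5, 5, 5),     # no association
    "snp_C": (3, 1, 1, 3),     # weak marginal signal
}
print(screen_snps(tables))
```

    Here snp_C is dropped despite a visible trend, which mirrors the abstract's concern that SNPs with small marginal effects are unlikely to survive a univariate filter.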

  17. Knowledge-Guided Robust MRI Brain Extraction for Diverse Large-Scale Neuroimaging Studies on Humans and Non-Human Primates

    PubMed Central

    Wang, Yaping; Nie, Jingxin; Yap, Pew-Thian; Li, Gang; Shi, Feng; Geng, Xiujuan; Guo, Lei; Shen, Dinggang

    2014-01-01

    Accurate and robust brain extraction is a critical step in most neuroimaging analysis pipelines. In particular, for the large-scale multi-site neuroimaging studies involving a significant number of subjects with diverse age and diagnostic groups, accurate and robust extraction of the brain automatically and consistently is highly desirable. In this paper, we introduce population-specific probability maps to guide the brain extraction of diverse subject groups, including both healthy and diseased adult human populations, both developing and aging human populations, as well as non-human primates. Specifically, the proposed method combines an atlas-based approach, for coarse skull-stripping, with a deformable-surface-based approach that is guided by local intensity information and population-specific prior information learned from a set of real brain images for more localized refinement. Comprehensive quantitative evaluations were performed on the diverse large-scale populations of ADNI dataset with over 800 subjects (55∼90 years of age, multi-site, various diagnosis groups), OASIS dataset with over 400 subjects (18∼96 years of age, wide age range, various diagnosis groups), and NIH pediatrics dataset with 150 subjects (5∼18 years of age, multi-site, wide age range as a complementary age group to the adult dataset). The results demonstrate that our method consistently yields the best overall results across almost the entire human life span, with only a single set of parameters. To demonstrate its capability to work on non-human primates, the proposed method is further evaluated using a rhesus macaque dataset with 20 subjects. Quantitative comparisons with popularly used state-of-the-art methods, including BET, Two-pass BET, BET-B, BSE, HWA, ROBEX and AFNI, demonstrate that the proposed method performs favorably with superior performance on all testing datasets, indicating its robustness and effectiveness. PMID:24489639

  18. Extracting Prior Distributions from a Large Dataset of In-Situ Measurements to Support SWOT-based Estimation of River Discharge

    NASA Astrophysics Data System (ADS)

    Hagemann, M.; Gleason, C. J.

    2017-12-01

    The upcoming (2021) Surface Water and Ocean Topography (SWOT) NASA satellite mission aims, in part, to estimate discharge on major rivers worldwide using reach-scale measurements of stream width, slope, and height. Current formalizations of channel and floodplain hydraulics are insufficient to fully constrain this problem mathematically, resulting in an infinitely large solution set for any set of satellite observations. Recent work has reformulated this problem in a Bayesian statistical setting, in which the likelihood distributions derive directly from hydraulic flow-law equations. When coupled with prior distributions on unknown flow-law parameters, this formulation probabilistically constrains the parameter space, and results in a computationally tractable description of discharge. Using a curated dataset of over 200,000 in-situ acoustic Doppler current profiler (ADCP) discharge measurements from over 10,000 USGS gaging stations throughout the United States, we developed empirical prior distributions for flow-law parameters that are not observable by SWOT, but that are required in order to estimate discharge. This analysis quantified prior uncertainties on quantities including cross-sectional area, at-a-station hydraulic geometry width exponent, and discharge variability, that are dependent on SWOT-observable variables including reach-scale statistics of width and height. When compared against discharge estimation approaches that do not use this prior information, the Bayesian approach using ADCP-derived priors demonstrated consistently improved performance across a range of performance metrics. This Bayesian approach formally transfers information from in-situ gaging stations to remote-sensed estimation of discharge, in which the desired quantities are not directly observable. Further investigation using large in-situ datasets is therefore a promising way forward in improving satellite-based estimates of river discharge.
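
    One of the prior quantities named above, the at-a-station hydraulic geometry width exponent, is conventionally the exponent b in the power law w = a Q^b relating width w to discharge Q. The sketch below assumes that conventional form and an ordinary least-squares fit in log space, then summarizes the exponents across stations by their mean and standard deviation as a crude empirical prior. The abstract does not say how the exponents or priors were actually estimated, and the station data are synthetic.

```python
import math
from statistics import mean, pstdev


def fit_width_power_law(discharges, widths):
    """Fit w = a * Q**b by least squares on log-transformed data; return (a, b)."""
    log_q = [math.log(q) for q in discharges]
    log_w = [math.log(w) for w in widths]
    mq, mw = mean(log_q), mean(log_w)
    sxx = sum((x - mq) ** 2 for x in log_q)
    sxy = sum((x - mq) * (y - mw) for x, y in zip(log_q, log_w))
    b = sxy / sxx
    a = math.exp(mw - b * mq)
    return a, b


def empirical_prior(exponents):
    """Summarize station exponents as (mean, standard deviation)."""
    return mean(exponents), pstdev(exponents)


# Synthetic stations: (discharge in m^3/s, width in m), generated from known power laws.
stations = {
    "station_1": ([10.0, 100.0, 1000.0], [5.0 * q ** 0.30 for q in (10.0, 100.0, 1000.0)]),
    "station_2": ([5.0, 50.0, 500.0], [8.0 * q ** 0.20 for q in (5.0, 50.0, 500.0)]),
    "station_3": ([20.0, 80.0, 320.0], [3.0 * q ** 0.10 for q in (20.0, 80.0, 320.0)]),
}
exponents = [fit_width_power_law(q, w)[1] for q, w in stations.values()]
prior_mean, prior_sd = empirical_prior(exponents)
print(round(prior_mean, 3), round(prior_sd, 3))
```

    In a Bayesian discharge estimate, a summary like this would stand in for the unobservable exponent at a river reach that has no gauge.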

  19. Drought Variability in Eastern Part of Romania and its Connection with Large-Scale Air Circulation

    NASA Astrophysics Data System (ADS)

    Barbu, Nicu; Stefan, Sabina; Georgescu, Florinela

    2014-05-01

    Drought is a phenomenon that appears due to precipitation deficit and is intensified by strong winds, high temperatures, low relative humidity and high insolation; all of these factors increase evapotranspiration, which contributes to soil water deficit. The Standardized Precipitation Evapotranspiration Index (SPEI) takes into account all of the factors listed above. The temporal variability of drought in the eastern part of Romania over the 50 years from 1961 to 2010 is investigated. This study focuses on drought variability related to large-scale air circulation. The gridded SPEI dataset with a spatial resolution of 0.5° lat/lon (https://digital.csic.es/handle/10261/72264) was used to analyze drought periods in connection with large-scale air circulation determined from two catalogues (GWT - GrossWetter-Typen and WLK - WetterLagenKlassifikation) defined in COST Action 733. The GWT catalogue uses sea level pressure as its input dataset, and the WLK catalogue uses as input the geopotential field at 925 hPa and 500 hPa, wind at 700 hPa and total water content for the entire atmospheric column. In this study we use the GWT catalogue with 18 circulation types and the WLK catalogue with 40 circulation types. The analysis for the Barlad Hydrological Basin indicated that negative values of SPEI (water deficit, i.e. a drought period) are associated with a prevailing anticyclonic regime and positive values of SPEI (water excess, i.e. a rainy period) are associated with a prevailing cyclonic regime, as expected. In the last decade, an increase in dry periods was observed, associated with an increase in anticyclonic activity over Romania. Using the GWT18 catalogue, droughts are associated with the north-eastern anticyclonic circulation type (NE-A). According to the WLK40 catalogue, the dominant circulation type associated with drought is the north-west anticyclonic-dry anticyclonic (NW-AAD) type. Keywords: drought, SPEI

  20. Merging Station Observations with Large-Scale Gridded Data to Improve Hydrological Predictions over Chile

    NASA Astrophysics Data System (ADS)

    Peng, L.; Sheffield, J.; Verbist, K. M. J.

    2016-12-01

    Hydrological predictions at regional-to-global scales are often hampered by the lack of meteorological forcing data. Large-scale gridded meteorological data can overcome this limitation, but these data are subject to regional biases and unrealistic values at the local scale. This is especially challenging in regions such as Chile, where climate exhibits high spatial heterogeneity as a result of the country's long latitudinal span and dramatic elevation changes. However, regional station-based observational datasets are not fully exploited and have the potential to constrain biases and spatial patterns. This study aims to adjust precipitation and temperature estimates from the Princeton University global meteorological forcing (PGF) gridded dataset to improve hydrological simulations over Chile, by assimilating 982 gauges from the Dirección General de Aguas (DGA). To merge station data with the gridded dataset, we use a state-space estimation method to produce optimal gridded estimates, considering both the error of the station measurements and the gridded PGF product. The PGF daily precipitation, maximum and minimum temperature at 0.25° spatial resolution are adjusted for the period of 1979-2010. Precipitation and temperature gauges with long and continuous records (>70% temporal coverage) are selected, while the remaining stations are used for validation. The leave-one-out cross validation verifies the robustness of this data assimilation approach. The merged dataset is then used to force the Variable Infiltration Capacity (VIC) hydrological model over Chile at a daily time step, and the simulations are compared to streamflow observations. Our initial results show that the station-merged PGF precipitation effectively captures drizzle and the spatial pattern of storms. Overall, the merged dataset shows significant improvements over the original PGF, with reduced biases and stronger inter-annual variability. The invariant spatial pattern of errors between the station
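
    The idea of weighting a station measurement against a gridded value according to the error of each can be illustrated with a single-cell, single-time-step update. The sketch below is a scalar inverse-variance (Kalman-style) update and is only a simplified stand-in for the state-space estimation method mentioned in the abstract; the function name and the numbers are hypothetical.

```python
def merge_estimates(grid_value, grid_var, station_value, station_var):
    """Combine a gridded value and a station value, weighting by error variance."""
    # Gain approaches 1 when the grid is much less certain than the station
    gain = grid_var / (grid_var + station_var)
    merged = grid_value + gain * (station_value - grid_value)
    merged_var = (1.0 - gain) * grid_var  # merged estimate is more certain than either input
    return merged, merged_var

# Hypothetical daily precipitation (mm): uncertain grid cell, more reliable gauge
merged, merged_var = merge_estimates(grid_value=4.0, grid_var=4.0,
                                     station_value=10.0, station_var=1.0)
```

    With these assumed variances the merged value sits much closer to the gauge than to the grid cell, which is the qualitative behavior a station-merged product would be expected to show near gauges.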

  1. [Parallel virtual reality visualization of extreme large medical datasets].

    PubMed

    Tang, Min

    2010-04-01

    On the basis of a brief description of grid computing, the essence and critical techniques of parallel visualization of extremely large medical datasets are discussed in connection with the intranets and common-configuration computers of hospitals. This paper introduces several kernel techniques, including the hardware structure, software framework, load balancing and virtual reality visualization. The Maximum Intensity Projection algorithm is implemented in parallel on a common PC cluster. In the virtual reality world, three-dimensional models can be rotated, zoomed, translated and cut interactively and conveniently through a control panel built on the Virtual Reality Modeling Language (VRML). Experimental results demonstrate that this method provides promising, real-time results and can serve as a good assistant in making clinical diagnoses.

  2. Large scale cardiac modeling on the Blue Gene supercomputer.

    PubMed

    Reumann, Matthias; Fitch, Blake G; Rayshubskiy, Aleksandr; Keller, David U; Weiss, Daniel L; Seemann, Gunnar; Dössel, Olaf; Pitman, Michael C; Rice, John J

    2008-01-01

    Multi-scale, multi-physical heart models have not yet been able to include a high degree of accuracy and resolution with respect to model detail and spatial resolution due to computational limitations of current systems. We propose a framework to compute large scale cardiac models. Decomposition of anatomical data in segments to be distributed on a parallel computer is carried out by optimal recursive bisection (ORB). The algorithm takes into account a computational load parameter which has to be adjusted according to the cell models used. The diffusion term is realized by the monodomain equations. The anatomical data-set was given by both ventricles of the Visible Female data-set in a 0.2 mm resolution. Heterogeneous anisotropy was included in the computation. Model weights as input for the decomposition and load balancing were set to (a) 1 for tissue and 0 for non-tissue elements; (b) 10 for tissue and 1 for non-tissue elements. Scaling results for 512, 1024, 2048, 4096 and 8192 computational nodes were obtained for 10 ms simulation time. The simulations were carried out on an IBM Blue Gene/L parallel computer. A 1 s simulation was then carried out on 2048 nodes for the optimal model load. Load balances did not differ significantly across computational nodes even if the number of data elements distributed to each node differed greatly. Since the ORB algorithm did not take into account computational load due to communication cycles, the speedup is close to optimal for the computation time but not optimal overall due to the communication overhead. However, the simulation times were reduced from 87 minutes on 512 nodes to 11 minutes on 8192 nodes. This work demonstrates that it is possible to run simulations of the presented detailed cardiac model within hours for the simulation of a heart beat.
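
    The two timings quoted above (87 minutes on 512 nodes, 11 minutes on 8192 nodes) are enough to work out the overall strong-scaling speedup and parallel efficiency. The short sketch below does that arithmetic; the function name is an illustrative choice, and the calculation assumes both timings refer to the same problem size.

```python
def strong_scaling(t_base, n_base, t_new, n_new):
    """Return (speedup, efficiency) of the larger run relative to the baseline."""
    speedup = t_base / t_new   # how much faster the larger run finished
    ideal = n_new / n_base     # speedup expected under perfect scaling
    return speedup, speedup / ideal

# Timings in minutes, taken from the abstract
speedup, efficiency = strong_scaling(t_base=87.0, n_base=512, t_new=11.0, n_new=8192)
```

    Under that assumption the speedup is about 7.9 on 16 times as many nodes, an efficiency of roughly 49%, consistent with the statement that the overall speedup is not optimal because of communication overhead.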

  3. Assessment of the effects and limitations of the 1998 to 2008 Abbreviated Injury Scale map using a large population-based dataset

    PubMed Central

    2011-01-01

    Background: Trauma systems should consistently monitor a given trauma population over a period of time. The Abbreviated Injury Scale (AIS) and derived scores such as the Injury Severity Score (ISS) are commonly used to quantify injury severities in trauma registries. To reflect contemporary trauma management and treatment, the most recent version of the AIS (AIS08) contains many codes which differ in severity from their equivalents in the earlier 1998 version (AIS98). Consequently, the adoption of AIS08 may impede comparisons between data coded using different AIS versions. It may also affect the number of patients classified as major trauma. Methods: The entire AIS98-coded injury dataset of a large population based trauma registry was retrieved and mapped to AIS08 using the currently available AIS98-AIS08 dictionary map. The percentage of codes which had increased or decreased in severity, or could not be mapped, was examined in conjunction with the effect of these changes to the calculated ISS. The potential for free text information accompanying AIS coding to improve the quality of AIS mapping was explored. Results: A total of 128280 AIS98-coded injuries were evaluated in 32134 patients, 15471 patients of whom were classified as major trauma. Although only 4.5% of dictionary codes decreased in severity from AIS98 to AIS08, this represented almost 13% of injuries in the registry. In 4.9% of patients, no injuries could be mapped. ISS was potentially unreliable in one-third of patients, as they had at least one AIS98 code which could not be mapped. Using AIS08, the number of patients classified as major trauma decreased by between 17.3% and 30.3%. Evaluation of free text descriptions for some injuries demonstrated the potential to improve mapping between AIS versions. Conclusions: Converting AIS98-coded data to AIS08 results in a significant decrease in the number of patients classified as major trauma. Many AIS98 codes are missing from the existing AIS map, and across
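
    To see why a one-point change in a single AIS severity can move a patient across the major trauma boundary, it helps to compute the ISS directly: the sum of squares of the highest AIS severity in each of the three most severely injured body regions, with any severity of 6 setting the score to 75. The sketch below uses a hypothetical patient and assumes the commonly used ISS > 15 cut-off for major trauma, which the abstract itself does not state.

```python
def injury_severity_score(injuries):
    """injuries: iterable of (body_region, ais_severity) pairs, severity 1-6."""
    worst = {}
    for region, severity in injuries:
        worst[region] = max(severity, worst.get(region, 0))
    if 6 in worst.values():
        return 75  # an unsurvivable (AIS 6) injury fixes the ISS at its maximum
    top_three = sorted(worst.values(), reverse=True)[:3]
    return sum(s * s for s in top_three)

MAJOR_TRAUMA_THRESHOLD = 15  # assumed cut-off: major trauma if ISS > 15

# Hypothetical patient whose head injury code drops from severity 4 to 3 when remapped
coded_ais98 = [("head", 4), ("chest", 1), ("extremities", 1)]
coded_ais08 = [("head", 3), ("chest", 1), ("extremities", 1)]
iss98 = injury_severity_score(coded_ais98)
iss08 = injury_severity_score(coded_ais08)
major98 = iss98 > MAJOR_TRAUMA_THRESHOLD
major08 = iss08 > MAJOR_TRAUMA_THRESHOLD
```

    Because severities are squared, the remapped patient falls from an ISS of 18 to 11 and is no longer counted as major trauma under the assumed threshold.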

  4. Dynamical links between small- and large-scale mantle heterogeneity: Seismological evidence

    NASA Astrophysics Data System (ADS)

    Frost, Daniel A.; Garnero, Edward J.; Rost, Sebastian

    2018-01-01

    We identify PKP•PKP scattered waves (also known as P′•P′) from earthquakes recorded at small-aperture seismic arrays at distances less than 65°. P′•P′ energy travels as a PKP wave through the core, up into the mantle, then scatters back down through the core to the receiver as a second PKP. P′•P′ waves are unique in that they allow scattering heterogeneities throughout the mantle to be imaged. We use array-processing methods to amplify low amplitude, coherent scattered energy signals and resolve their incoming direction. We deterministically map scattering heterogeneity locations from the core-mantle boundary to the surface. We use an extensive dataset with sensitivity to a large volume of the mantle and a location method allowing us to resolve and map more heterogeneities than have previously been possible, representing a significant increase in our understanding of small-scale structure within the mantle. Our results demonstrate that the distribution of scattering heterogeneities varies both radially and laterally. Scattering is most abundant in the uppermost and lowermost mantle, and a minimum in the mid-mantle, resembling the radial distribution of tomographically derived whole-mantle velocity heterogeneity. We investigate the spatial correlation of scattering heterogeneities with large-scale tomographic velocities, lateral velocity gradients, the locations of deep-seated hotspots and subducted slabs. In the lowermost 1500 km of the mantle, small-scale heterogeneities correlate with regions of low seismic velocity, high lateral seismic gradient, and proximity to hotspots. In the upper 1000 km of the mantle there is no significant correlation between scattering heterogeneity location and subducted slabs. Between 600 and 900 km depth, scattering heterogeneities are more common in the regions most remote from slabs, and close to hotspots. Scattering heterogeneities show an affinity for regions close to slabs within the upper 200 km of the

  5. Large-scale Scanning Transmission Electron Microscopy (Nanotomy) of Healthy and Injured Zebrafish Brain.

    PubMed

    Kuipers, Jeroen; Kalicharan, Ruby D; Wolters, Anouk H G; van Ham, Tjakko J; Giepmans, Ben N G

    2016-05-25

    Large-scale 2D electron microscopy (EM), or nanotomy, is the tissue-wide application of nanoscale resolution electron microscopy. Others and we previously applied large scale EM to human skin, pancreatic islets, tissue culture and whole zebrafish larvae(1-7). Here we describe a universally applicable method for tissue-scale scanning EM for unbiased detection of sub-cellular and molecular features. Nanotomy was applied to investigate the healthy and a neurodegenerative zebrafish brain. Our method is based on standardized EM sample preparation protocols: Fixation with glutaraldehyde and osmium, followed by epoxy-resin embedding, ultrathin sectioning and mounting of ultrathin-sections on one-hole grids, followed by post staining with uranyl and lead. Large-scale 2D EM mosaic images are acquired using a scanning EM connected to an external large area scan generator using scanning transmission EM (STEM). Large scale EM images are typically ~5-50 gigapixels in size, and best viewed using zoomable HTML files, which can be opened in any web browser, similar to online geographical HTML maps. This method can be applied to (human) tissue, cross sections of whole animals as well as tissue culture(1-5). Here, zebrafish brains were analyzed in a non-invasive neuronal ablation model. We visualize within a single dataset tissue, cellular and subcellular changes which can be quantified in various cell types including neurons and microglia, the brain's macrophages. In addition, nanotomy facilitates the correlation of EM with light microscopy (CLEM)(8) on the same tissue, as large surface areas previously imaged using fluorescent microscopy, can subsequently be subjected to large area EM, resulting in the nano-anatomy (nanotomy) of tissues. In all, nanotomy allows unbiased detection of features at EM level in a tissue-wide quantifiable manner.

  6. Large-scale Scanning Transmission Electron Microscopy (Nanotomy) of Healthy and Injured Zebrafish Brain

    PubMed Central

    Kuipers, Jeroen; Kalicharan, Ruby D.; Wolters, Anouk H. G.

    2016-01-01

    Large-scale 2D electron microscopy (EM), or nanotomy, is the tissue-wide application of nanoscale resolution electron microscopy. Others and we previously applied large scale EM to human skin, pancreatic islets, tissue culture and whole zebrafish larvae(1-7). Here we describe a universally applicable method for tissue-scale scanning EM for unbiased detection of sub-cellular and molecular features. Nanotomy was applied to investigate the healthy and a neurodegenerative zebrafish brain. Our method is based on standardized EM sample preparation protocols: Fixation with glutaraldehyde and osmium, followed by epoxy-resin embedding, ultrathin sectioning and mounting of ultrathin-sections on one-hole grids, followed by post staining with uranyl and lead. Large-scale 2D EM mosaic images are acquired using a scanning EM connected to an external large area scan generator using scanning transmission EM (STEM). Large scale EM images are typically ~5-50 gigapixels in size, and best viewed using zoomable HTML files, which can be opened in any web browser, similar to online geographical HTML maps. This method can be applied to (human) tissue, cross sections of whole animals as well as tissue culture(1-5). Here, zebrafish brains were analyzed in a non-invasive neuronal ablation model. We visualize within a single dataset tissue, cellular and subcellular changes which can be quantified in various cell types including neurons and microglia, the brain's macrophages. In addition, nanotomy facilitates the correlation of EM with light microscopy (CLEM)(8) on the same tissue, as large surface areas previously imaged using fluorescent microscopy, can subsequently be subjected to large area EM, resulting in the nano-anatomy (nanotomy) of tissues. In all, nanotomy allows unbiased detection of features at EM level in a tissue-wide quantifiable manner. PMID:27285162

  7. Using First Differences to Reduce Inhomogeneity in Radiosonde Temperature Datasets.

    NASA Astrophysics Data System (ADS)

    Free, Melissa; Angell, James K.; Durre, Imke; Lanzante, John; Peterson, Thomas C.; Seidel, Dian J.

    2004-11-01

    The utility of a “first difference” method for producing temporally homogeneous large-scale mean time series is assessed. Starting with monthly averages, the method involves dropping data around the time of suspected discontinuities and then calculating differences in temperature from one year to the next, resulting in a time series of year-to-year differences for each month at each station. These first difference time series are then combined to form large-scale means, and mean temperature time series are constructed from the first difference series. When applied to radiosonde temperature data, the method introduces random errors that decrease with the number of station time series used to create the large-scale time series and increase with the number of temporal gaps in the station time series. Root-mean-square errors for annual means of datasets produced with this method using over 500 stations are estimated at no more than 0.03 K, with errors in trends less than 0.02 K decade⁻¹ for 1960-97 at 500 mb. For a 50-station dataset, errors in trends in annual global means introduced by the first differencing procedure may be as large as 0.06 K decade⁻¹ (for six breaks per series), which is greater than the standard error of the trend. Although the first difference method offers significant resource and labor advantages over methods that attempt to adjust the data, it introduces an error in large-scale mean time series that may be unacceptable in some cases.
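
    The steps described above (drop data near a suspected break, difference year to year, average the differences across stations, then accumulate) can be sketched in a few lines. This is a minimal illustration for one calendar month with unweighted station averaging; the function name, the use of None for dropped values, and the two synthetic stations are assumptions, not details from the paper.

```python
def first_difference_mean(stations):
    """stations: equal-length yearly series; None marks data dropped near a break.

    Returns the large-scale mean series as anomalies relative to the first year.
    """
    n_years = len(stations[0])
    series = [0.0]
    for t in range(1, n_years):
        # Year-to-year differences, using only stations with data in both years
        diffs = [s[t] - s[t - 1] for s in stations
                 if s[t] is not None and s[t - 1] is not None]
        if not diffs:
            raise ValueError(f"no station covers the step into year index {t}")
        # Accumulate the mean difference to rebuild a temperature series
        series.append(series[-1] + sum(diffs) / len(diffs))
    return series

# Both synthetic stations warm by 0.5 K per year; station_b also has an artificial
# +2.0 K instrument jump at index 2, so that value has been dropped.
station_a = [10.0, 10.5, 11.0, 11.5]
station_b = [20.0, 20.5, None, 23.5]
large_scale = first_difference_mean([station_a, station_b])
```

    In this toy case the artificial jump never enters the result, but the steps across the gap rest on a single station, which illustrates why errors grow with the number of gaps and shrink with the number of stations.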


  8. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    NASA Astrophysics Data System (ADS)

    Beck, H.; Vergopolan, N.; Pan, M.; Levizzani, V.; van Dijk, A.; Weedon, G. P.; Brocca, L.; Huffman, G. J.; Wood, E. F.; William, L.

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Twelve non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76,086 gauges worldwide. Another ten gauge-corrected ones were evaluated using hydrological modeling, by calibrating the conceptual model HBV against streamflow records for each of 9053 small to medium-sized (<50,000 km²) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR), the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed those indirectly incorporating gauge data through other multi-source datasets (PERSIANN-CDR V1R1 and PGF). Our results highlight large differences in estimation accuracy, and hence, the importance of P dataset selection in both research and operational applications

  9. Towards Development of Clustering Applications for Large-Scale Comparative Genotyping and Kinship Analysis Using Y-Short Tandem Repeats.

    PubMed

    Seman, Ali; Sapawi, Azizian Mohd; Salleh, Mohd Zaki

    2015-06-01

    Y-chromosome short tandem repeats (Y-STRs) are genetic markers with practical applications in human identification. However, where mass identification is required (e.g., in the aftermath of disasters with significant fatalities), the efficiency of the process could be improved with new statistical approaches. Clustering applications are relatively new tools for large-scale comparative genotyping, and the k-Approximate Modal Haplotype (k-AMH), an efficient algorithm for clustering large-scale Y-STR data, represents a promising method for developing these tools. In this study we improved the k-AMH and produced three new algorithms: the Nk-AMH I (including a new initial cluster center selection), the Nk-AMH II (including a new dominant weighting value), and the Nk-AMH III (combining I and II). The Nk-AMH III was the superior algorithm, with mean clustering accuracy that increased in four out of six datasets and remained at 100% in the other two. Additionally, the Nk-AMH III achieved a 2% higher overall mean clustering accuracy score than the k-AMH, as well as optimal accuracy for all datasets (0.84-1.00). With inclusion of the two new methods, the Nk-AMH III produced an optimal solution for clustering Y-STR data; thus, the algorithm has potential for further development towards fully automatic clustering of any large-scale genotypic data.

  10. Target-decoy Based False Discovery Rate Estimation for Large-scale Metabolite Identification.

    PubMed

    Wang, Xusheng; Jones, Drew R; Shaw, Timothy I; Cho, Ji-Hoon; Wang, Yuanyuan; Tan, Haiyan; Xie, Boer; Zhou, Suiping; Li, Yuxin; Peng, Junmin

    2018-05-23

    Metabolite identification is a crucial step in mass spectrometry (MS)-based metabolomics. However, it is still challenging to assess the confidence of assigned metabolites. In this study, we report a novel method for estimating false discovery rate (FDR) of metabolite assignment with a target-decoy strategy, in which the decoys are generated through violating the octet rule of chemistry by adding small odd numbers of hydrogen atoms. The target-decoy strategy was integrated into JUMPm, an automated metabolite identification pipeline for large-scale MS analysis, and was also evaluated with two other metabolomics tools, mzMatch and mzMine 2. The reliability of FDR calculation was examined by false datasets, which were simulated by altering MS1 or MS2 spectra. Finally, we used the JUMPm pipeline coupled with the target-decoy strategy to process unlabeled and stable-isotope labeled metabolomic datasets. The results demonstrate that the target-decoy strategy is a simple and effective method for evaluating the confidence of high-throughput metabolite identification.
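
    The target-decoy idea reduces to a simple ratio: at a given score threshold, the number of accepted decoy matches estimates the number of false target matches. The sketch below shows that generic estimate and a threshold search; the scores are made up and the function names are hypothetical, so this is not the JUMPm implementation or its scoring scheme.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR = accepted decoys / accepted targets at this threshold."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing can be falsely discovered
    return min(1.0, decoys / targets)

def threshold_for_fdr(target_scores, decoy_scores, max_fdr):
    """Lowest target score at which the estimated FDR is still <= max_fdr."""
    passing = [t for t in set(target_scores)
               if estimate_fdr(target_scores, decoy_scores, t) <= max_fdr]
    return min(passing) if passing else None

# Made-up match scores (higher is better)
targets = [9.1, 8.7, 8.0, 7.2, 6.5, 5.9, 5.0, 4.1, 3.3, 2.0]
decoys = [5.5, 4.0, 3.0, 2.5, 1.0]
fdr_at_5 = estimate_fdr(targets, decoys, 5.0)
cutoff = threshold_for_fdr(targets, decoys, max_fdr=0.10)
```

    With these numbers, accepting everything scoring at least 5.0 admits one decoy against seven targets, so a 10% FDR limit forces the cut-off up to the next target score.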

  11. Large-scale mapping of mutations affecting zebrafish development.

    PubMed

    Geisler, Robert; Rauch, Gerd-Jörg; Geiger-Rudolph, Silke; Albrecht, Andrea; van Bebber, Frauke; Berger, Andrea; Busch-Nentwich, Elisabeth; Dahm, Ralf; Dekens, Marcus P S; Dooley, Christopher; Elli, Alexandra F; Gehring, Ines; Geiger, Horst; Geisler, Maria; Glaser, Stefanie; Holley, Scott; Huber, Matthias; Kerr, Andy; Kirn, Anette; Knirsch, Martina; Konantz, Martina; Küchler, Axel M; Maderspacher, Florian; Neuhauss, Stephan C; Nicolson, Teresa; Ober, Elke A; Praeg, Elke; Ray, Russell; Rentzsch, Brit; Rick, Jens M; Rief, Eva; Schauerte, Heike E; Schepp, Carsten P; Schönberger, Ulrike; Schonthaler, Helia B; Seiler, Christoph; Sidi, Samuel; Söllner, Christian; Wehner, Anja; Weiler, Christian; Nüsslein-Volhard, Christiane

    2007-01-09

    Large-scale mutagenesis screens in the zebrafish employing the mutagen ENU have isolated several hundred mutant loci that represent putative developmental control genes. In order to realize the potential of such screens, systematic genetic mapping of the mutations is necessary. Here we report on a large-scale effort to map the mutations generated in mutagenesis screening at the Max Planck Institute for Developmental Biology by genome scanning with microsatellite markers. We have selected a set of microsatellite markers and developed methods and scoring criteria suitable for efficient, high-throughput genome scanning. We have used these methods to successfully obtain a rough map position for 319 mutant loci from the Tübingen I mutagenesis screen and subsequent screening of the mutant collection. For 277 of these the corresponding gene is not yet identified. Mapping was successful for 80 % of the tested loci. By comparing 21 mutation and gene positions of cloned mutations we have validated the correctness of our linkage group assignments and estimated the standard error of our map positions to be approximately 6 cM. By obtaining rough map positions for over 300 zebrafish loci with developmental phenotypes, we have generated a dataset that will be useful not only for cloning of the affected genes, but also to suggest allelism of mutations with similar phenotypes that will be identified in future screens. Furthermore this work validates the usefulness of our methodology for rapid, systematic and inexpensive microsatellite mapping of zebrafish mutations.

  12. Immersive Interaction, Manipulation and Analysis of Large 3D Datasets for Planetary and Earth Sciences

    NASA Astrophysics Data System (ADS)

    Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.

    2017-12-01

    We will present the implementation and study of several use cases of Virtual Reality (VR) for immersive display, interaction and analysis of large and complex 3D datasets. These datasets have been acquired by instruments across several Earth, Planetary and Solar Space Robotics Missions. First, we will describe the architecture of the common application framework that was developed to input data, interface with VR display devices and program input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and the advantages of each highlighted. We will then present experimental immersive analytics visual constructs that enable augmentation of 3D datasets with 2D ones such as images and statistical and abstract data. We will conclude by presenting a comparative analysis with traditional visualization applications and share the feedback provided by our users: scientists and engineers.

  13. Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization

    PubMed Central

    Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip

    2013-01-01

    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http

  14. Evaluation of Global Observations-Based Evapotranspiration Datasets and IPCC AR4 Simulations

    NASA Technical Reports Server (NTRS)

    Mueller, B.; Seneviratne, S. I.; Jimenez, C.; Corti, T.; Hirschi, M.; Balsamo, G.; Ciais, P.; Dirmeyer, P.; Fisher, J. B.; Guo, Z.; et al.

    2011-01-01

    Quantification of global land evapotranspiration (ET) has long been associated with large uncertainties due to the lack of reference observations. Several recently developed products now provide the capacity to estimate ET at global scales. These products, partly based on observational data, include satellite-based products, land surface model (LSM) simulations, atmospheric reanalysis output, estimates based on empirical upscaling of eddy-covariance flux measurements, and atmospheric water balance datasets. The LandFlux-EVAL project aims to evaluate and compare these newly developed datasets. Additionally, an evaluation of IPCC AR4 global climate model (GCM) simulations is presented, providing an assessment of their capacity to reproduce flux behavior relative to the observations-based products. Though differently constrained with observations, the analyzed reference datasets display similar large-scale ET patterns. ET from the IPCC AR4 simulations was significantly smaller than that from the other products for India (up to 1 mm/d) and parts of eastern South America, and larger in the western USA, Australia and China. The inter-product variance is lower across the IPCC AR4 simulations than across the reference datasets in several regions, which indicates that uncertainties may be underestimated in the IPCC AR4 models due to shared biases of these simulations.

  15. Development of a large-sample catchment-scale hydro-meteorological, land cover and physical dataset for Chile

    NASA Astrophysics Data System (ADS)

    Alvarez-Garreton, C. D.; Mendoza, P. A.; Zambrano-Bigiarini, M.; Galleguillos, M. H.; Boisier, J. P.; Lara, A.; Cortés, G.; Garreaud, R.; McPhee, J. P.; Addor, N.; Puelma, C.

    2017-12-01

    We provide the first catchment-based hydrometeorological, vegetation and physical data set over 531 catchments in Chile (17.8° S - 55.0° S). We compiled publicly available streamflow records at daily time steps for the period 1980-2015, and generated basin-averaged time series of the following hydrometeorological variables: 1) daily precipitation coming from three different gridded sources (re-analysis and satellite-based); 2) daily maximum and minimum temperature; 3) 8-day potential evapotranspiration (PET) based on MODIS imagery and daily PET based on the Hargreaves formula; and 4) daily snow water equivalent. Additionally, catchments are characterized by their main physical (area, mean elevation, mean slope) and land cover characteristics. We synthesized these datasets with several indices characterizing the spatial distribution of climatic, hydrological, topographic and vegetation attributes. The new catchment-based dataset is unprecedented in the region and provides information that can be used in a myriad of applications, including catchment classification and regionalization studies, impacts of different land cover types on catchment response, characterization of drought history and projections, climate change impacts on hydrological processes, etc. Derived practical applications include water management and allocation strategies, decision making and adaptation planning to climate change. This data set will be publicly available and we encourage the community to use it.
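
    For readers unfamiliar with the Hargreaves formula mentioned above, its widely used form estimates daily PET from temperature and extraterrestrial radiation alone: PET = 0.0023 · Ra · (Tmean + 17.8) · sqrt(Tmax − Tmin), with Ra expressed as an equivalent evaporation depth in mm/day. The sketch below applies that standard form to made-up inputs; how the dataset authors computed Ra or averaged over basins is not stated in the abstract, so Ra is simply passed in here.

```python
import math

def hargreaves_pet(t_max, t_min, ra_mm_per_day):
    """Daily potential evapotranspiration (mm/day) from the Hargreaves formula.

    t_max, t_min: daily maximum and minimum air temperature in degrees Celsius.
    ra_mm_per_day: extraterrestrial radiation as equivalent evaporation (mm/day).
    """
    t_mean = (t_max + t_min) / 2.0
    t_range = max(t_max - t_min, 0.0)  # guard against inverted inputs
    return 0.0023 * ra_mm_per_day * (t_mean + 17.8) * math.sqrt(t_range)

# Illustrative summer day (hypothetical values)
pet = hargreaves_pet(t_max=25.0, t_min=9.0, ra_mm_per_day=15.0)
```

    Because only daily temperature extremes and Ra are needed, the formula suits basin-averaged series where humidity and wind observations are sparse.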

  16. On the scaling of small-scale jet noise to large scale

    NASA Technical Reports Server (NTRS)

    Soderman, Paul T.; Allen, Christopher S.

    1992-01-01

    An examination was made of several published jet noise studies for the purpose of evaluating scale effects important to the simulation of jet aeroacoustics. Several studies confirmed that small conical jets, one as small as 59 mm diameter, could be used to correctly simulate the overall or perceived noise level (PNL) noise of large jets dominated by mixing noise. However, the detailed acoustic spectra of large jets are more difficult to simulate because of the lack of broad-band turbulence spectra in small jets. One study indicated that a jet Reynolds number of 5 x 10^6 based on exhaust diameter enabled the generation of broad-band noise representative of large jet mixing noise. Jet suppressor aeroacoustics is even more difficult to simulate at small scale because of the small mixer nozzles with flows sensitive to Reynolds number. Likewise, one study showed incorrect ejector mixing and entrainment using a small-scale, short ejector that led to poor acoustic scaling. Conversely, fairly good results were found with a longer ejector and, in a different study, with a 32-chute suppressor nozzle. Finally, it was found that small-scale aeroacoustic resonance produced by jets impacting ground boards does not reproduce at large scale.

  17. Statistical Analysis of Large Simulated Yield Datasets for Studying Climate Effects

    NASA Technical Reports Server (NTRS)

    Makowski, David; Asseng, Senthold; Ewert, Frank; Bassu, Simona; Durand, Jean-Louis; Martre, Pierre; Adam, Myriam; Aggarwal, Pramod K.; Angulo, Carlos; Baron, Chritian; ...

    2015-01-01

    Many studies have been carried out during the last decade to study the effect of climate change on crop yields and other key crop characteristics. In these studies, one or several crop models were used to simulate crop growth and development for different climate scenarios that correspond to different projections of atmospheric CO2 concentration, temperature, and rainfall changes (Semenov et al., 1996; Tubiello and Ewert, 2002; White et al., 2011). The Agricultural Model Intercomparison and Improvement Project (AgMIP; Rosenzweig et al., 2013) builds on these studies with the goal of using an ensemble of multiple crop models in order to assess effects of climate change scenarios for several crops in contrasting environments. These studies generate large datasets, including thousands of simulated crop yield data. They include series of yield values obtained by combining several crop models with different climate scenarios that are defined by several climatic variables (temperature, CO2, rainfall, etc.). Such datasets potentially provide useful information on the possible effects of different climate change scenarios on crop yields. However, it is sometimes difficult to analyze these datasets and to summarize them in a useful way due to their structural complexity; simulated yield data can differ among contrasting climate scenarios, sites, and crop models. Another issue is that it is not straightforward to extrapolate the results obtained for the scenarios to alternative climate change scenarios not initially included in the simulation protocols. Additional dynamic crop model simulations for new climate change scenarios are an option but this approach is costly, especially when a large number of crop models are used to generate the simulated data, as in AgMIP. 
    Statistical models have been used to analyze responses of measured yield data to climate variables in past studies (Lobell et al., 2011), but the use of a statistical model to analyze yields simulated by complex

  18. An Intercomparison of Large-Extent Tree Canopy Cover Geospatial Datasets

    NASA Astrophysics Data System (ADS)

    Bender, S.; Liknes, G.; Ruefenacht, B.; Reynolds, J.; Miller, W. P.

    2017-12-01

    As a member of the Multi-Resolution Land Characteristics Consortium (MRLC), the U.S. Forest Service (USFS) is responsible for producing and maintaining the tree canopy cover (TCC) component of the National Land Cover Database (NLCD). The NLCD-TCC data are available for the conterminous United States (CONUS), coastal Alaska, Hawai'i, Puerto Rico, and the U.S. Virgin Islands. The most recent official version of the NLCD-TCC data is based primarily on reference data from 2010-2011 and is part of the multi-component 2011 version of the NLCD. NLCD data are updated on a five-year cycle. The USFS is currently producing the next official version (2016) of the NLCD-TCC data for the United States, and it will be made publicly available in early 2018. In this presentation, we describe the model inputs, modeling methods, and tools used to produce the 30-m NLCD-TCC data. Several tree cover datasets at 30-m, as well as datasets at finer resolution, have become available in recent years due to advancements in earth observation data and their availability, computing, and sensors. We compare multiple tree cover datasets that have similar resolution to the NLCD-TCC data. We also aggregate the tree class from fine-resolution land cover datasets to a percent canopy value on a 30-m pixel, in order to compare the fine-resolution datasets to the datasets created directly from 30-m Landsat data. The extent of the tree canopy cover datasets included in the study ranges from global and national to the state level. Preliminary investigation of multiple tree cover datasets over the CONUS indicates a high amount of spatial variability. For example, in a comparison of the NLCD-TCC and the Global Land Cover Facility's Landsat Tree Cover Continuous Fields (2010) data by MRLC mapping zones, the zone-level root mean-square deviation ranges from 2% to 39% (mean=17%, median=15%).
    The analysis outcomes are expected to inform USFS decisions with regard to the next cycle (2021) of NLCD-TCC production.
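
    The zone-level root mean-square deviation quoted above can be illustrated with a short sketch. It assumes two canopy-cover products already resampled to the same pixels and a per-pixel zone label; the function and argument names are hypothetical and the abstract does not describe the authors' actual tooling.

```python
import math
from collections import defaultdict


def zone_rmsd(cover_a, cover_b, zones):
    """Root mean-square deviation between two co-registered products, per zone.

    cover_a, cover_b: flat sequences of percent canopy cover, one value per pixel.
    zones: flat sequence of zone identifiers, one per pixel.
    Returns a dict mapping zone identifier to RMSD in percent canopy cover.
    """
    squared_sum = defaultdict(float)
    pixel_count = defaultdict(int)
    for a, b, zone in zip(cover_a, cover_b, zones):
        squared_sum[zone] += (a - b) ** 2
        pixel_count[zone] += 1
    return {z: math.sqrt(squared_sum[z] / pixel_count[z]) for z in squared_sum}
```

    Summarizing the returned values (range, mean, median) gives the kind of zone-level statistics reported in the abstract.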

  19. Large-Scale Disasters

    NASA Astrophysics Data System (ADS)

    Gad-El-Hak, Mohamed

    "Extreme" events - including climatic events, such as hurricanes, tornadoes, and drought - can cause massive disruption to society, including large death tolls and property damage in the billions of dollars. Events in recent years have shown the importance of being prepared and that countries need to work together to help alleviate the resulting pain and suffering. This volume presents a review of the broad research field of large-scale disasters. It establishes a common framework for predicting, controlling and managing both manmade and natural disasters. There is a particular focus on events caused by weather and climate change. Other topics include air pollution, tsunamis, disaster modeling, the use of remote sensing and the logistics of disaster management. It will appeal to scientists, engineers, first responders and health-care professionals, in addition to graduate students and researchers who have an interest in the prediction, prevention or mitigation of large-scale disasters.

  20. Impacts of spatial resolution and representation of flow connectivity on large-scale simulation of floods

    NASA Astrophysics Data System (ADS)

    Mateo, Cherry May R.; Yamazaki, Dai; Kim, Hyungjun; Champathong, Adisorn; Vaze, Jai; Oki, Taikan

    2017-10-01

    Global-scale river models (GRMs) are core tools for providing consistent estimates of global flood hazard, especially in data-scarce regions. Due to former limitations in computational power and input datasets, most GRMs have been developed to use simplified representations of flow physics and run at coarse spatial resolutions. With increasing computational power and improved datasets, the application of GRMs to finer resolutions is becoming a reality. To support development in this direction, the suitability of GRMs for application to finer resolutions needs to be assessed. This study investigates the impacts of spatial resolution and flow connectivity representation on the predictive capability of a GRM, CaMa-Flood, in simulating the 2011 extreme flood in Thailand. Analyses show that when single downstream connectivity (SDC) is assumed, simulation results deteriorate with finer spatial resolution; Nash-Sutcliffe efficiency coefficients decreased by more than 50 % between simulation results at 10 km resolution and 1 km resolution. When multiple downstream connectivity (MDC) is represented, simulation results slightly improve with finer spatial resolution. The SDC simulations result in excessive backflows on very flat floodplains due to the restrictive flow directions at finer resolutions. MDC channels attenuated these effects by maintaining flow connectivity and flow capacity between floodplains in varying spatial resolutions. While a regional-scale flood was chosen as a test case, these findings should be universal and may have significant impacts on large- to global-scale simulations, especially in regions where mega deltas exist. These results demonstrate that a GRM can be used for higher resolution simulations of large-scale floods, provided that MDC in rivers and floodplains is adequately represented in the model structure.
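
    For readers unfamiliar with the Nash-Sutcliffe efficiency (NSE) used above to score the simulations, the following minimal sketch shows the standard definition: one minus the ratio of the squared simulation error to the variance of the observations about their mean. The function name is illustrative and this is not code from the study.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency of a simulated series against observations.

    1.0 is a perfect match; 0.0 means the simulation is no better than
    always predicting the mean of the observations.
    """
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - error / spread
```

    Because NSE is unbounded below, a relative decrease of more than 50 % between two resolutions, as reported here, indicates a substantial loss of skill.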

  1. Consistency and interpretation of changes in millimeter-scale cortical intrinsic curvature across three independent datasets in schizophrenia

    PubMed Central

    Ronan, Lisa; Voets, Natalie L.; Hough, Morgan; Mackay, Clare; Roberts, Neil; Suckling, John; Bullmore, Edward; James, Anthony; Fletcher, Paul C.

    2012-01-01

    Several studies have sought to test the neurodevelopmental hypothesis of schizophrenia through analysis of cortical gyrification. However, to date, results have been inconsistent. A possible reason for this is that gyrification measures at the centimeter scale may be insensitive to subtle morphological changes at smaller scales. The lack of consistency in such studies may impede further interpretation of cortical morphology as an aid to understanding the etiology of schizophrenia. In this study we developed a new approach, examining whether millimeter-scale measures of cortical curvature are sensitive to changes in fundamental geometric properties of the cortical surface in schizophrenia. We determined and compared millimeter-scale and centimeter-scale curvature in three separate case–control studies; specifically two adult groups and one adolescent group. The datasets were of different sizes, with different ages and gender-spreads. The results clearly show that millimeter-scale intrinsic curvature measures were more robust and consistent in identifying reduced gyrification in patients across all three datasets. To further interpret this finding we quantified the ratio of expansion in the upper and lower cortical layers. The results suggest that reduced gyrification in schizophrenia is driven by a reduction in the expansion of upper cortical layers. This may plausibly be related to a reduction in short-range connectivity. PMID:22743195

  2. Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark.

    PubMed

    Klein, Max; Sharma, Rati; Bohrer, Chris H; Avelis, Cameron M; Roberts, Elijah

    2017-01-15

    Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology. Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark CONTACT: eroberts@jhu.edu Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  3. GeoNotebook: Browser based Interactive analysis and visualization workflow for very large climate and geospatial datasets

    NASA Astrophysics Data System (ADS)

    Ozturk, D.; Chaudhary, A.; Votava, P.; Kotfila, C.

    2016-12-01

    Jointly developed by Kitware and NASA Ames, GeoNotebook is an open source tool designed to give the maximum amount of flexibility to analysts, while dramatically simplifying the process of exploring geospatially indexed datasets. Packages like Fiona (backed by GDAL), Shapely, Descartes, Geopandas, and PySAL provide a stack of technologies for reading, transforming, and analyzing geospatial data. Combined with the Jupyter notebook and libraries like matplotlib/Basemap, it is possible to generate detailed geospatial visualizations. Unfortunately, the generated visualizations are either static or do not perform well for very large datasets. Also, this setup requires a great deal of boilerplate code to create and maintain. Other extensions exist to remedy these problems, but they provide a separate map for each input cell and do not support map interactions that feed back into the Python environment. To support interactive data exploration and visualization on large datasets we have developed an extension to the Jupyter notebook that provides a single dynamic map that can be managed from the Python environment, and that can communicate back with a server which can perform operations like data subsetting on a cloud-based cluster.

  4. Assessing Human Modifications to Floodplains using Large-Scale Hydrogeomorphic Floodplain Modeling

    NASA Astrophysics Data System (ADS)

    Morrison, R. R.; Scheel, K.; Nardi, F.; Annis, A.

    2017-12-01

    Human modifications to floodplains for water resource and flood management purposes have significantly transformed river-floodplain connectivity dynamics in many watersheds. Bridges, levees, reservoirs, shifts in land use, and other hydraulic engineering works have altered flow patterns and caused changes in the timing and extent of floodplain inundation processes. These hydrogeomorphic changes have likely resulted in negative impacts to aquatic habitat and ecological processes. The availability of large-scale topographic datasets at high resolution provides an opportunity for detecting anthropogenic impacts by means of geomorphic mapping. We have developed and are implementing a methodology for comparing a hydrogeomorphic floodplain mapping technique to hydraulically-modeled floodplain boundaries to estimate floodplain loss due to human activities. Our hydrogeomorphic mapping methodology assumes that river valley morphology intrinsically includes information on flood-driven erosion and depositional phenomena. We use a digital elevation model-based algorithm to identify the floodplain as the area of the fluvial corridor lying below water reference levels, which are estimated using a simplified hydrologic model. Results from our hydrogeomorphic method are compared to hydraulically-derived flood zone maps and spatial datasets of levee-protected areas to explore where water management features, such as levees, have changed floodplain dynamics and landscape features. Parameters associated with commonly used F-index functions are quantified and analyzed to better understand how floodplain areas have been reduced within a basin. Preliminary results indicate that the hydrogeomorphic floodplain model is useful for quickly delineating floodplains at large watershed scales, but further analyses are needed to understand the caveats for using the model in determining floodplain loss due to levees. We plan to continue this work by exploring the spatial dependencies of the F
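
    The core test described in this abstract, marking cells of the fluvial corridor that lie below a water reference level, can be sketched as a simple grid comparison. This is only an illustration under stated assumptions: the digital elevation model and the reference water levels are given as equally sized grids in the same vertical datum, and how the reference levels are estimated (the simplified hydrologic model) is outside the sketch. All names are hypothetical.

```python
def floodplain_mask(dem, water_level):
    """Mark grid cells whose elevation lies below the local water reference level.

    dem: 2-D list of ground elevations (m).
    water_level: 2-D list of reference water surface elevations (m), same shape.
    Returns a 2-D list of booleans, True where the cell counts as floodplain.
    """
    return [
        [elevation < level for elevation, level in zip(dem_row, level_row)]
        for dem_row, level_row in zip(dem, water_level)
    ]


def flooded_fraction(mask):
    """Fraction of grid cells flagged as floodplain."""
    cells = [flag for row in mask for flag in row]
    return sum(cells) / len(cells)
```

    Comparing such a mask with a hydraulically modeled flood zone or a levee-protected area layer is one way the reported floodplain loss could be expressed as a difference in flagged area.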

  5. On the scaling of small-scale jet noise to large scale

    NASA Technical Reports Server (NTRS)

    Soderman, Paul T.; Allen, Christopher S.

    1992-01-01

    An examination was made of several published jet noise studies for the purpose of evaluating scale effects important to the simulation of jet aeroacoustics. Several studies confirmed that small conical jets, one as small as 59 mm diameter, could be used to correctly simulate the overall or perceived noise level (PNL) noise of large jets dominated by mixing noise. However, the detailed acoustic spectra of large jets are more difficult to simulate because of the lack of broad-band turbulence spectra in small jets. One study indicated that a jet Reynolds number of 5 x 10^6 based on exhaust diameter enabled the generation of broad-band noise representative of large jet mixing noise. Jet suppressor aeroacoustics is even more difficult to simulate at small scale because of the small mixer nozzles with flows sensitive to Reynolds number. Likewise, one study showed incorrect ejector mixing and entrainment using a small-scale, short ejector that led to poor acoustic scaling. Conversely, fairly good results were found with a longer ejector and, in a different study, with a 32-chute suppressor nozzle. Finally, it was found that small-scale aeroacoustic resonance produced by jets impacting ground boards does not reproduce at large scale.

  6. Large-scale runoff generation - parsimonious parameterisation using high-resolution topography

    NASA Astrophysics Data System (ADS)

    Gong, L.; Halldin, S.; Xu, C.-Y.

    2011-08-01

    World water resources have primarily been analysed by global-scale hydrological models in the last decades. Runoff generation in many of these models is based on process formulations developed at catchment scales. The division between slow runoff (baseflow) and fast runoff is primarily governed by slope and spatial distribution of effective water storage capacity, both acting at very small scales. Many hydrological models, e.g. VIC, account for the spatial storage variability in terms of statistical distributions; such models are generally proven to perform well. The statistical approaches, however, use the same runoff-generation parameters everywhere in a basin. The TOPMODEL concept, on the other hand, links the effective maximum storage capacity with real-world topography. Recent availability of global high-quality, high-resolution topographic data makes TOPMODEL attractive as a basis for a physically-based runoff-generation algorithm at large scales, even if its assumptions are not valid in flat terrain or for deep groundwater systems. We present a new runoff-generation algorithm for large-scale hydrology based on TOPMODEL concepts intended to overcome these problems. The TRG (topography-derived runoff generation) algorithm relaxes the TOPMODEL equilibrium assumption so baseflow generation is not tied to topography. TRG only uses the topographic index to distribute average storage to each topographic index class. The maximum storage capacity is proportional to the range of topographic index and is scaled by one parameter. The distribution of storage capacity within large-scale grid cells is obtained numerically through topographic analysis. The new topography-derived distribution function is then inserted into a runoff-generation framework similar to VIC's. Different basin parts are parameterised by different storage capacities, and different shapes of the storage-distribution curves depend on their topographic characteristics.
    The TRG algorithm is driven by the
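
    The topographic index on which TOPMODEL-type approaches rest is conventionally defined as ln(a / tan(beta)), where a is the upslope contributing area per unit contour length and beta is the local slope angle. The sketch below computes that index only; it is not the TRG algorithm itself. The function name and the minimum-slope floor are assumptions added for illustration, the latter because the index diverges on flat terrain, the case the abstract flags as problematic.

```python
import math


def topographic_index(specific_area_m, slope_rad, min_tan_slope=1e-4):
    """TOPMODEL topographic index ln(a / tan(beta)) for one grid cell.

    specific_area_m: upslope contributing area per unit contour length (m).
    slope_rad: local slope angle (radians).
    min_tan_slope: assumed floor on tan(beta) so flat cells stay finite.
    """
    tan_beta = max(math.tan(slope_rad), min_tan_slope)
    return math.log(specific_area_m / tan_beta)
```

    High index values (large contributing area, gentle slope) mark cells that saturate easily, which is why the index is a natural variable for distributing storage capacity within a coarse grid cell.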

  7. Large-scale runoff generation - parsimonious parameterisation using high-resolution topography

    NASA Astrophysics Data System (ADS)

    Gong, L.; Halldin, S.; Xu, C.-Y.

    2010-09-01

    World water resources have primarily been analysed by global-scale hydrological models in the last decades. Runoff generation in many of these models is based on process formulations developed at catchment scales. The division between slow runoff (baseflow) and fast runoff is primarily governed by slope and spatial distribution of effective water storage capacity, both acting at very small scales. Many hydrological models, e.g. VIC, account for the spatial storage variability in terms of statistical distributions; such models are generally proven to perform well. The statistical approaches, however, use the same runoff-generation parameters everywhere in a basin. The TOPMODEL concept, on the other hand, links the effective maximum storage capacity with real-world topography. Recent availability of global high-quality, high-resolution topographic data makes TOPMODEL attractive as a basis for a physically-based runoff-generation algorithm at large scales, even if its assumptions are not valid in flat terrain or for deep groundwater systems. We present a new runoff-generation algorithm for large-scale hydrology based on TOPMODEL concepts intended to overcome these problems. The TRG (topography-derived runoff generation) algorithm relaxes the TOPMODEL equilibrium assumption so baseflow generation is not tied to topography. TRG only uses the topographic index to distribute average storage to each topographic index class. The maximum storage capacity is proportional to the range of topographic index and is scaled by one parameter. The distribution of storage capacity within large-scale grid cells is obtained numerically through topographic analysis. The new topography-derived distribution function is then inserted into a runoff-generation framework similar to VIC's. Different basin parts are parameterised by different storage capacities, and different shapes of the storage-distribution curves depend on their topographic characteristics.
    The TRG algorithm is driven by the

  8. Adaptive Neuron Apoptosis for Accelerating Deep Learning on Large Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Siegel, Charles M.; Daily, Jeffrey A.; Vishnu, Abhinav

    Machine Learning and Data Mining (MLDM) algorithms are becoming ubiquitous in model learning from the large volume of data generated using simulations, experiments and handheld devices. Deep Learning algorithms -- a class of MLDM algorithms -- are applied for automatic feature extraction, and learning non-linear models for unsupervised and supervised algorithms. Naturally, several libraries which support large scale Deep Learning -- such as TensorFlow and Caffe -- have become popular. In this paper, we present novel techniques to accelerate the convergence of Deep Learning algorithms by conducting low overhead removal of redundant neurons -- apoptosis of neurons -- which do not contribute to model learning, during the training phase itself. We provide in-depth theoretical underpinnings of our heuristics (bounding accuracy loss and handling apoptosis of several neuron types), and present the methods to conduct adaptive neuron apoptosis. We implement our proposed heuristics with the recently introduced TensorFlow and using its recently proposed extension with MPI. Our performance evaluation on two different clusters -- one connected with Intel Haswell multi-core systems, and the other with NVIDIA GPUs -- using InfiniBand, indicates the efficacy of the proposed heuristics and implementations. Specifically, we are able to improve the training time for several datasets by 2-3x, while reducing the number of parameters by 30x (4-5x on average) on datasets such as ImageNet classification. For the Higgs Boson dataset, our implementation improves the accuracy (measured by Area Under Curve (AUC)) for classification from 0.88/1 to 0.94/1, while reducing the number of parameters by 3x in comparison to existing literature, while achieving a 2.44x speedup in comparison to the default (no apoptosis) algorithm.

  9. Large-scale structural optimization

    NASA Technical Reports Server (NTRS)

    Sobieszczanski-Sobieski, J.

    1983-01-01

    Problems encountered by aerospace designers in attempting to optimize whole aircraft are discussed, along with possible solutions. Large scale optimization, as opposed to component-by-component optimization, is hindered by computational costs, software inflexibility, concentration on a single, rather than trade-off, design methodology and the incompatibility of large-scale optimization with single program, single computer methods. The software problem can be approached by placing the full analysis outside of the optimization loop. Full analysis is then performed only periodically. Problem-dependent software can be removed from the generic code using a systems programming technique, and then embodies the definitions of design variables, objective function and design constraints. Trade-off algorithms can be used at the design points to obtain quantitative answers. Finally, decomposing the large-scale problem into independent subproblems allows systematic optimization of the problems by an organization of people and machines.

  10. The LANDFIRE Refresh strategy: updating the national dataset

    USGS Publications Warehouse

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  11. A unique large-scale undergraduate research experience in molecular systems biology for non-mathematics majors.

    PubMed

    Kappler, Ulrike; Rowland, Susan L; Pedwell, Rhianna K

    2017-05-01

    Systems biology is frequently taught with an emphasis on mathematical modeling approaches. This focus effectively excludes most biology, biochemistry, and molecular biology students, who are not mathematics majors. The mathematical focus can also present a misleading picture of systems biology, which is a multi-disciplinary pursuit requiring collaboration between biochemists, bioinformaticians, and mathematicians. This article describes an authentic large-scale undergraduate research experience (ALURE) in systems biology that incorporates proteomics, bacterial genomics, and bioinformatics in the one exercise. This project is designed to engage students who have a basic grounding in protein chemistry and metabolism and no mathematical modeling skills. The pedagogy around the research experience is designed to help students attack complex datasets and use their emergent metabolic knowledge to make meaning from large amounts of raw data. On completing the ALURE, participants reported a significant increase in their confidence around analyzing large datasets, while the majority of the cohort reported good or great gains in a variety of skills including "analysing data for patterns" and "conducting database or internet searches." An environmental scan shows that this ALURE is the only undergraduate-level systems-biology research project offered on a large scale in Australia; this speaks to the perceived difficulty of implementing such an opportunity for students. We argue, however, that based on the student feedback, allowing undergraduate students to complete a systems-biology project is both feasible and desirable, even if the students are not maths and computing majors. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):235-248, 2017.

  12. Large Scale Metal Additive Techniques Review

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nycz, Andrzej; Adediran, Adeola I; Noakes, Mark W

    2016-01-01

    In recent years additive manufacturing made long strides toward becoming a mainstream production technology. Particularly strong progress has been made in large-scale polymer deposition. However, large scale metal additive has not yet reached parity with large scale polymer. This paper is a review study of the metal additive techniques in the context of building large structures. Current commercial devices are capable of printing metal parts on the order of several cubic feet compared to hundreds of cubic feet for the polymer side. In order to follow the polymer progress path several factors are considered: potential to scale, economy, environment friendliness, material properties, feedstock availability, robustness of the process, quality and accuracy, potential for defects, and post processing as well as potential applications. This paper focuses on the current state of the art of large scale metal additive technology with a focus on expanding the geometric limits.

  13. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

    PubMed

    Privé, Florian; Aschard, Hugues; Ziyatdinov, Andrey; Blum, Michael G B

    2017-03-30

    Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge severely slowing down genomic analyses, leading to some software becoming obsolete and researchers having limited access to diverse analysis tools. Here we present two R packages, bigstatsr and bigsnpr, allowing for the analysis of large scale genomic data to be performed within R. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementation of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove SNPs in linkage disequilibrium and algorithms to learn polygenic risk scores on millions of SNPs. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/. florian.prive@univ-grenoble-alpes.fr & michael.blum@univ-grenoble-alpes.fr. Supplementary materials are available at Bioinformatics online.
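
    The memory-mapping idea mentioned above, reading a matrix from disk on demand rather than loading it into RAM, can be illustrated independently of the R packages. The sketch below is not the bigstatsr or bigsnpr API; it uses Python's standard mmap and struct modules and assumes a hypothetical file layout of little-endian 64-bit floats stored row by row.

```python
import mmap
import struct


def column_means(path, n_rows, n_cols):
    """Column means of a row-major float64 matrix read through a memory map.

    Only one row is decoded at a time, so peak memory use is independent
    of the number of rows stored in the file.
    """
    row_format = struct.Struct(f"<{n_cols}d")  # one row of little-endian doubles
    sums = [0.0] * n_cols
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for i in range(n_rows):
                # The operating system pages in only the bytes touched here.
                row = row_format.unpack_from(mapped, i * row_format.size)
                for j, value in enumerate(row):
                    sums[j] += value
    return [total / n_rows for total in sums]
```

    The design choice is the same one the abstract describes: the file on disk is the matrix, and the operating system's page cache decides which parts are resident at any moment.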

  14. Segmentation of Unstructured Datasets

    NASA Technical Reports Server (NTRS)

    Bhat, Smitha

    1996-01-01

    Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.
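
    As a rough illustration of extracting coherent regions from a scalar field on an unstructured grid, the sketch below grows connected regions of nodes whose value meets a threshold, following node adjacency. This is a generic threshold-plus-connectivity approach chosen for illustration, not the thesis's algorithms; the function name and the adjacency-mapping input are assumptions.

```python
from collections import deque


def segment_regions(values, neighbors, threshold):
    """Group nodes with value >= threshold into connected regions.

    values: scalar value per node, indexed by node id.
    neighbors: mapping from node id to an iterable of adjacent node ids.
    Returns a list of regions, each a sorted list of node ids.
    """
    visited = set()
    regions = []
    for start in range(len(values)):
        if start in visited or values[start] < threshold:
            continue
        # Breadth-first growth from an unvisited seed node.
        queue = deque([start])
        visited.add(start)
        region = []
        while queue:
            node = queue.popleft()
            region.append(node)
            for adjacent in neighbors[node]:
                if adjacent not in visited and values[adjacent] >= threshold:
                    visited.add(adjacent)
                    queue.append(adjacent)
        regions.append(sorted(region))
    return regions
```

    Because the traversal relies only on adjacency, the same routine applies to two-dimensional and three-dimensional meshes; each returned region can then be quantified, for example by node count or summed cell volume.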

  15. GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets.

    PubMed

    Jeong, Seongmun; Kim, Jae-Yoon; Jeong, Soon-Chun; Kang, Sung-Taeg; Moon, Jung-Kyung; Kim, Namshin

    2017-01-01

    Selecting core subsets from plant genotype datasets is important for enhancing cost-effectiveness and to shorten the time required for analyses of genome-wide association studies (GWAS), and genomics-assisted breeding of crop species, etc. Recently, a large number of genetic markers (>100,000 single nucleotide polymorphisms) have been identified from high-density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) data. However, there is no software available for picking out the efficient and consistent core subset from such a huge dataset. It is necessary to develop software that can extract genetically important samples in a population with coherence. We here present a new program, GenoCore, which can find quickly and efficiently the core subset representing the entire population. We introduce simple measures of coverage and diversity scores, which reflect genotype errors and genetic variations, and can help to select a sample rapidly and accurately for crop genotype dataset. Comparison of our method to other core collection software using example datasets are performed to validate the performance according to genetic distance, diversity, coverage, required system resources, and the number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, using less memory with more efficient scores, and shows greater genetic coverage compared to the other software tested. GenoCore was written in R language, and can be accessed online with an example dataset and test results at https://github.com/lovemun/Genocore.

  16. Large-scale seismic signal analysis with Hadoop

    DOE PAGES

    Addair, T. G.; Dodge, D. A.; Walter, W. R.; ...

    2014-02-11

    In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. This was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.
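
    The core operation named in this abstract, waveform cross correlation, can be sketched as a search for the lag that maximizes the normalized correlation between two traces. The sketch below is a single-process Python illustration, not the authors' pipeline or its MapReduce decomposition. The function name best_lag_correlation is hypothetical, and the traces are assumed to be already filtered and demeaned.

```python
import math

def best_lag_correlation(a, b):
    """Return (lag, cc): the shift of b (in samples) that best matches a,
    and the normalized cross-correlation value at that shift."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("both traces must contain nonzero samples")
    best_lag, best_cc = 0, -1.0
    for lag in range(-(len(b) - 1), len(a)):
        # a[i] is paired with b[i - lag]; restrict i to the overlapping window.
        lo = max(0, lag)
        hi = min(len(a), len(b) + lag)
        cc = sum(a[i] * b[i - lag] for i in range(lo, hi)) / norm
        if cc > best_cc:
            best_lag, best_cc = lag, cc
    return best_lag, best_cc

# Two toy traces containing the same pulse, offset by two samples.
trace_a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
trace_b = [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
lag, cc = best_lag_correlation(trace_a, trace_b)
```

    This direct form costs on the order of len(a) * len(b) operations per pair, which gives a sense of why correlating hundreds of millions of seismograms calls for a parallel design.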

  17. Large-scale seismic signal analysis with Hadoop

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Addair, T. G.; Dodge, D. A.; Walter, W. R.

    In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. This was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.

  18. Recent Regional Climate State and Change - Derived through Downscaling Homogeneous Large-scale Components of Re-analyses

    NASA Astrophysics Data System (ADS)

    Von Storch, H.; Klehmet, K.; Geyer, B.; Li, D.; Schubert-Frisius, M.; Tim, N.; Zorita, E.

    2015-12-01

    Global re-analyses suffer from inhomogeneities, as they process data from networks under development. However, the large-scale component of such re-analyses is mostly homogeneous; additional observational data contribute in most cases to a better description of regional details and less so to that of large-scale states. Therefore, the concept of downscaling may be applied to homogeneously complement the large-scale state of the re-analyses with regional detail, wherever the condition of homogeneity of the large scales is fulfilled. Technically this can be done by using a regional climate model, or a global climate model, which is constrained on the large scale by spectral nudging. This approach has been developed and tested for the region of Europe, and a skillful representation of regional risks, in particular marine risks, was identified. While the data density in Europe is considerably better than in most other regions of the world, even here insufficient spatial and temporal coverage is limiting risk assessments. Therefore, downscaled data-sets are frequently used by off-shore industries. We have also run this system in regions with reduced or absent data coverage, such as the Lena catchment in Siberia, the Yellow Sea/Bo Hai region in East Asia, and Namibia and the adjacent Atlantic Ocean. A global (large-scale constrained) simulation has also been run. It turns out that spatially detailed reconstruction of the state and change of climate over the last three to six decades is doable for any region of the world. The different data sets are archived and may be freely used for scientific purposes. Of course, before application, a careful analysis of the quality for the intended application is needed, as sometimes unexpected changes in the quality of the description of large-scale driving states prevail.

  19. PharmacoGx: an R package for analysis of large pharmacogenomic datasets.

    PubMed

    Smirnov, Petr; Safikhani, Zhaleh; El-Hachem, Nehme; Wang, Dong; She, Adrian; Olsen, Catharina; Freeman, Mark; Selby, Heather; Gendoo, Deena M A; Grossmann, Patrick; Beck, Andrew H; Aerts, Hugo J W L; Lupien, Mathieu; Goldenberg, Anna; Haibe-Kains, Benjamin

    2016-04-15

    Pharmacogenomics holds great promise for the development of biomarkers of drug response and the design of new therapeutic options, which are key challenges in precision medicine. However, such data are scattered and lack standards for efficient access and analysis, consequently preventing the realization of the full potential of pharmacogenomics. To address these issues, we implemented PharmacoGx, an easy-to-use, open source package for integrative analysis of multiple pharmacogenomic datasets. We demonstrate the utility of our package in comparing large drug sensitivity datasets, such as the Genomics of Drug Sensitivity in Cancer and the Cancer Cell Line Encyclopedia. Moreover, we show how to use our package to easily perform Connectivity Map analysis. With increasing availability of drug-related data, our package will open new avenues of research for meta-analysis of pharmacogenomic data. PharmacoGx is implemented in R and can be easily installed on any system. The package is available from CRAN and its source code is available from GitHub. Contact: bhaibeka@uhnresearch.ca or benjamin.haibe.kains@utoronto.ca. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  20. Sensitivity of local air quality to the interplay between small- and large-scale circulations: a large-eddy simulation study

    NASA Astrophysics Data System (ADS)

    Wolf-Grosse, Tobias; Esau, Igor; Reuder, Joachim

    2017-06-01

    Street-level urban air pollution is a challenging concern for modern urban societies. Pollution dispersion models assume that the concentrations decrease monotonically with rising wind speed. This convenient assumption breaks down when applied to flows with local recirculations such as those found in topographically complex coastal areas. This study looks at a practically important and sufficiently common case of air pollution in a coastal valley city. Here, the observed concentrations are determined by the interaction between large-scale topographically forced and local-scale breeze-like recirculations. Analysis of a long observational dataset in Bergen, Norway, revealed that the most extreme cases of recurring wintertime air pollution episodes were accompanied by increased large-scale wind speeds above the valley. Contrary to the theoretical assumption and intuitive expectations, the maximum NO₂ concentrations were not found for the lowest 10 m ERA-Interim wind speeds but in situations with wind speeds of 3 m s⁻¹. To explain this phenomenon, we investigated empirical relationships between the large-scale forcing and the local wind and air quality parameters. We conducted 16 large-eddy simulation (LES) experiments with the Parallelised Large-Eddy Simulation Model (PALM) for atmospheric and oceanic flows. The LES accounted for the realistic relief and coastal configuration as well as for the large-scale forcing and local surface condition heterogeneity in Bergen. They revealed that emerging local breeze-like circulations strongly enhance the urban ventilation and dispersion of the air pollutants in situations with weak large-scale winds. Slightly stronger large-scale winds, however, can counteract these local recirculations, leading to enhanced surface air stagnation. Furthermore, this study looks at the concrete impact of the relative configuration of warmer water bodies in the city and the major transport corridor. We found that a relatively small local water

  1. Large-scale physical activity data reveal worldwide activity inequality

    PubMed Central

    Althoff, Tim; Sosič, Rok; Hicks, Jennifer L.; King, Abby C.; Delp, Scott L.; Leskovec, Jure

    2018-01-01

    Understanding the basic principles that govern physical activity is needed to curb the global pandemic of physical inactivity [1–7] and the 5.3 million deaths per year associated with inactivity [2]. Our knowledge, however, remains limited owing to the lack of large-scale measurements of physical activity patterns across free-living populations worldwide [1, 6]. Here, we leverage the wide usage of smartphones with built-in accelerometry to measure physical activity at planetary scale. We study a dataset consisting of 68 million days of physical activity for 717,527 people, giving us a window into activity in 111 countries across the globe. We find inequality in how activity is distributed within countries and that this inequality is a better predictor of obesity prevalence in the population than average activity volume. Reduced activity in females contributes to a large portion of the observed activity inequality. Aspects of the built environment, such as the walkability of a city, were associated with a smaller gender gap in activity and lower activity inequality. In more walkable cities, activity is greater throughout the day and throughout the week, across age, gender, and body mass index (BMI) groups, with the greatest increases in activity for females. Our findings have implications for global public health policy and urban planning and highlight the role of activity inequality and the built environment for improving physical activity and health. PMID:28693034

  2. Skin Friction Reduction Through Large-Scale Forcing

    NASA Astrophysics Data System (ADS)

    Bhatt, Shibani; Artham, Sravan; Gnanamanickam, Ebenezer

    2017-11-01

    Flow structures in a turbulent boundary layer larger than an integral length scale (δ), referred to as large-scales, interact with the finer scales in a non-linear manner. By targeting these large-scales and exploiting this non-linear interaction, wall shear stress (WSS) reduction of over 10% has been achieved. The plane wall jet (PWJ), a boundary layer which has highly energetic large-scales that become turbulent independent of the near-wall finer scales, is the chosen model flow field. Its unique configuration allows for the independent control of the large-scales through acoustic forcing. Perturbation wavelengths from about 1 δ to 14 δ were considered, with a reduction in WSS for all wavelengths considered. This reduction, over a large subset of the wavelengths, scales with both inner and outer variables, indicating a mixed scaling to the underlying physics, while also showing dependence on the PWJ global properties. A triple decomposition of the velocity fields shows an increase in coherence due to forcing, with a clear organization of the small-scale turbulence with respect to the introduced large-scale. The maximum reduction in WSS occurs when the introduced large-scale acts in a manner so as to reduce the turbulent activity in the very near wall region. This material is based upon work supported by the Air Force Office of Scientific Research under Award Number FA9550-16-1-0194 monitored by Dr. Douglas Smith.

  3. SWAP-Assembler 2: Optimization of De Novo Genome Assembler at Large Scale

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meng, Jintao; Seo, Sangmin; Balaji, Pavan

    2016-08-16

    In this paper, we analyze and optimize the most time-consuming steps of the SWAP-Assembler, a parallel genome assembler, so that it can scale to a large number of cores for huge genomes with the size of sequencing data ranging from terabytes to petabytes. According to the performance analysis results, the most time-consuming steps are input parallelization, k-mer graph construction, and graph simplification (edge merging). For the input parallelization, the input data is divided into virtual fragments with nearly equal size, and the start position and end position of each fragment are automatically separated at the beginning of the reads. In k-mer graph construction, in order to improve the communication efficiency, the message size is kept constant between any two processes by proportionally increasing the number of nucleotides to the number of processes in the input parallelization step for each round. The memory usage is also decreased because only a small part of the input data is processed in each round. With graph simplification, the communication protocol reduces the number of communication loops from four to two loops and decreases the idle communication time. The optimized assembler is denoted as SWAP-Assembler 2 (SWAP2). In our experiments using a 1000 Genomes project dataset of 4 terabytes (the largest dataset ever used for assembling) on the supercomputer Mira, the results show that SWAP2 scales to 131,072 cores with an efficiency of 40%. We also compared our work with both the HipMer assembler and the SWAP-Assembler. On the Yanhuang dataset of 300 gigabytes, SWAP2 shows a 3X speedup and 4X better scalability compared with the HipMer assembler and is 45 times faster than the SWAP-Assembler. The SWAP2 software is available at https://sourceforge.net/projects/swapassembler.
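
    For readers unfamiliar with the k-mer graph construction step mentioned above, the following is a minimal single-process sketch in Python. It is not the SWAP2 implementation, which distributes this work across processes and controls message sizes. The function name kmer_graph and the toy reads are hypothetical.

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Build a de Bruijn-style graph: each (k-1)-mer maps to the (k-1)-mers
    that follow it, with a count of how many k-mers support each edge."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

reads = ["ACGTAC", "CGTACG"]   # toy input, not real sequencing data
graph = kmer_graph(reads, k=3)
```

    Graph simplification (edge merging) would then collapse unbranched chains such as AC -> CG -> GT into longer sequences. That step is omitted here.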

  4. Secure Access Control and Large Scale Robust Representation for Online Multimedia Event Detection

    PubMed Central

    Liu, Changyu; Li, Huiling

    2014-01-01

    We developed an online multimedia event detection (MED) system. However, integrating traditional event detection algorithms into the online environment raises two issues: secure access control and large-scale robust representation. For the first issue, we proposed a tree proxy-based and service-oriented access control (TPSAC) model based on the traditional role-based access control model. Verification experiments were conducted on the CloudSim simulation platform, and the results showed that the TPSAC model is suitable for the access control of dynamic online environments. For the second issue, inspired by the object-bank scene descriptor, we proposed a 1000-object-bank (1000OBK) event descriptor. Feature vectors of the 1000OBK were extracted from response pyramids of 1000 generic object detectors which were trained on standard annotated image datasets, such as the ImageNet dataset. A spatial bag of words tiling approach was then adopted to encode these feature vectors for bridging the gap between the objects and events. Furthermore, we performed experiments in the context of event classification on the challenging TRECVID MED 2012 dataset, and the results showed that the robust 1000OBK event descriptor outperforms the state-of-the-art approaches. PMID:25147840

  5. Secure access control and large scale robust representation for online multimedia event detection.

    PubMed

    Liu, Changyu; Lu, Bin; Li, Huiling

    2014-01-01

    We developed an online multimedia event detection (MED) system. However, integrating traditional event detection algorithms into the online environment raises two issues: secure access control and large-scale robust representation. For the first issue, we proposed a tree proxy-based and service-oriented access control (TPSAC) model based on the traditional role-based access control model. Verification experiments were conducted on the CloudSim simulation platform, and the results showed that the TPSAC model is suitable for the access control of dynamic online environments. For the second issue, inspired by the object-bank scene descriptor, we proposed a 1000-object-bank (1000OBK) event descriptor. Feature vectors of the 1000OBK were extracted from response pyramids of 1000 generic object detectors which were trained on standard annotated image datasets, such as the ImageNet dataset. A spatial bag of words tiling approach was then adopted to encode these feature vectors for bridging the gap between the objects and events. Furthermore, we performed experiments in the context of event classification on the challenging TRECVID MED 2012 dataset, and the results showed that the robust 1000OBK event descriptor outperforms the state-of-the-art approaches.

  6. BactoGeNIE: A large-scale comparative genome visualization for big displays

    DOE PAGES

    Aurisano, Jillian; Reda, Khairi; Johnson, Andrew; ...

    2015-08-13

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. In conclusion, BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.

  7. BactoGeNIE: a large-scale comparative genome visualization for big displays

    PubMed Central

    2015-01-01

    Background: The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. Results: In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. Conclusions: BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics. PMID:26329021

  8. BactoGeNIE: A large-scale comparative genome visualization for big displays

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aurisano, Jillian; Reda, Khairi; Johnson, Andrew

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. In conclusion, BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.

  9. Large-scale Estimates of Leaf Area Index from Active Remote Sensing Laser Altimetry

    NASA Astrophysics Data System (ADS)

    Hopkinson, C.; Mahoney, C.

    2016-12-01

    Leaf area index (LAI) is a key parameter that describes the spatial distribution of foliage within forest canopies, which in turn controls numerous relationships between the ground, canopy, and atmosphere. The retrieval of LAI has demonstrated success by in-situ (digital) hemispherical photography (DHP) and airborne laser scanning (ALS) data; however, field and ALS acquisitions are often spatially limited (100s of km²) and costly. Large-scale (>1000s of km²) retrievals have been demonstrated by optical sensors; however, accuracies remain uncertain due to the sensor's inability to penetrate the canopy. The spaceborne Geoscience Laser Altimeter System (GLAS) provides a possible solution in retrieving large-scale derivations whilst simultaneously penetrating the canopy. LAI retrieved by multiple DHP from 6 Australian sites, representing a cross-section of Australian ecosystems, were employed to model ALS LAI, which in turn were used to infer LAI from GLAS data at 5 other sites. An optimally filtered GLAS dataset was then employed in conjunction with a host of supplementary data to build a Random Forest (RF) model to infer predictions (and uncertainties) of LAI at a 250 m resolution across the forested regions of Australia. Predictions were validated against ALS-based LAI from 20 sites (R² = 0.64, RMSE = 1.1 m² m⁻²); MODIS-based LAI were also assessed against these sites (R² = 0.30, RMSE = 1.78 m² m⁻²) to demonstrate the strength of GLAS-based predictions. The large-scale nature of current predictions was also leveraged to demonstrate large-scale relationships of LAI with other environmental characteristics, such as canopy height, elevation, and slope. The need for such wide-scale quantification of LAI is key in the assessment and modification of forest management strategies across Australia. Such work also assists Australia's Terrestrial Ecosystem Research Network in fulfilling its government-issued mandates.
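
    The two validation statistics quoted above can be computed as in the sketch below. It assumes that R² denotes the coefficient of determination (1 minus the ratio of residual to total sum of squares); the abstract does not say whether the authors used this definition or the squared correlation coefficient. The sample values are made up for illustration and are not from the study.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical reference LAI and model predictions (m^2 m^-2).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```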

  10. Contribution of Road Grade to the Energy Use of Modern Automobiles Across Large Datasets of Real-World Drive Cycles: Preprint

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wood, E.; Burton, E.; Duran, A.

    Understanding the real-world power demand of modern automobiles is of critical importance to engineers using modeling and simulation to inform the intelligent design of increasingly efficient powertrains. Increased use of global positioning system (GPS) devices has made large-scale data collection of vehicle speed (and associated power demand) a reality. While the availability of real-world GPS data has improved the industry's understanding of in-use vehicle power demand, relatively little attention has been paid to the incremental power requirements imposed by road grade. This analysis quantifies the incremental efficiency impacts of real-world road grade by appending high fidelity elevation profiles to GPS speed traces and performing a large simulation study. Employing a large real-world dataset from the National Renewable Energy Laboratory's Transportation Secure Data Center, vehicle powertrain simulations are performed with and without road grade under five vehicle models. Aggregate results of this study suggest that road grade could be responsible for 1% to 3% of fuel use in light-duty automobiles.

  11. The Plant Phenology Ontology: A New Informatics Resource for Large-Scale Integration of Plant Phenology Data.

    PubMed

    Stucky, Brian J; Guralnick, Rob; Deck, John; Denny, Ellen G; Bolmgren, Kjell; Walls, Ramona

    2018-01-01

    Plant phenology - the timing of plant life-cycle events, such as flowering or leafing out - plays a fundamental role in the functioning of terrestrial ecosystems, including human agricultural systems. Because plant phenology is often linked with climatic variables, there is widespread interest in developing a deeper understanding of global plant phenology patterns and trends. Although phenology data from around the world are currently available, truly global analyses of plant phenology have so far been difficult because the organizations producing large-scale phenology data are using non-standardized terminologies and metrics during data collection and data processing. To address this problem, we have developed the Plant Phenology Ontology (PPO). The PPO provides the standardized vocabulary and semantic framework that is needed for large-scale integration of heterogeneous plant phenology data. Here, we describe the PPO, and we also report preliminary results of using the PPO and a new data processing pipeline to build a large dataset of phenology information from North America and Europe.

  12. Large-scale neuromorphic computing systems

    NASA Astrophysics Data System (ADS)

    Furber, Steve

    2016-10-01

    Neuromorphic computing covers a diverse range of approaches to information processing all of which demonstrate some degree of neurobiological inspiration that differentiates them from mainstream conventional computing systems. The philosophy behind neuromorphic computing has its origins in the seminal work carried out by Carver Mead at Caltech in the late 1980s. This early work influenced others to carry developments forward, and advances in VLSI technology supported steady growth in the scale and capability of neuromorphic devices. Recently, a number of large-scale neuromorphic projects have emerged, taking the approach to unprecedented scales and capabilities. These large-scale projects are associated with major new funding initiatives for brain-related research, creating a sense that the time and circumstances are right for progress in our understanding of information processing in the brain. In this review we present a brief history of neuromorphic engineering then focus on some of the principal current large-scale projects, their main features, how their approaches are complementary and distinct, their advantages and drawbacks, and highlight the sorts of capabilities that each can deliver to neural modellers.

  13. Large Scale EOF Analysis of Climate Data

    NASA Astrophysics Data System (ADS)

    Prabhat, M.; Gittens, A.; Kashinath, K.; Cavanaugh, N. R.; Mahoney, M.

    2016-12-01

    We present a distributed approach towards extracting EOFs from 3D climate data. We implement the method in Apache Spark, and process multi-TB sized datasets on O(1000-10,000) cores. We apply this method to latitude-weighted ocean temperature data from CFSR, a 2.2 terabyte-sized data set comprising ocean and subsurface reanalysis measurements collected at 41 levels in the ocean, at 6 hour intervals over 31 years. We extract the first 100 EOFs of this full data set and compare to the EOFs computed simply on the surface temperature field. Our analyses provide evidence of Kelvin and Rossby waves and components of large-scale modes of oscillation including the ENSO and PDO that are not visible in the usual SST EOFs. Further, they provide information on the most influential parts of the ocean, such as the thermocline, that exist below the surface. Work is ongoing to understand the factors determining the depth-varying spatial patterns observed in the EOFs. We will experiment with weighting schemes to appropriately account for the differing depths of the observations. We also plan to apply the same distributed approach to analysis of 3D atmospheric climatic data sets, including multiple variables. Because the atmosphere changes on a quicker time-scale than the ocean, we expect that the results will demonstrate an even greater advantage to computing 3D EOFs in lieu of 2D EOFs.
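
    As a small-scale illustration of what extracting an EOF from latitude-weighted data involves, the sketch below computes only the leading EOF of a tiny time-by-gridpoint array using power iteration on the covariance matrix. It is a single-machine Python sketch and not the authors' Spark implementation. The square-root-of-cosine-latitude weighting, the function name leading_eof, and the toy data are assumptions.

```python
import math

def leading_eof(field, lats_deg, n_iter=200):
    """field[t][s] is the value at time t and grid point s.
    Returns (eof, fraction_of_variance) for the leading mode."""
    n_t, n_s = len(field), len(field[0])
    # Area weighting: sqrt(cos(latitude)) per grid point (assumed scheme).
    w = [math.sqrt(math.cos(math.radians(lat))) for lat in lats_deg]
    means = [sum(row[s] for row in field) / n_t for s in range(n_s)]
    # Weighted anomalies about the time mean.
    x = [[(row[s] - means[s]) * w[s] for s in range(n_s)] for row in field]
    cov = [[sum(r[i] * r[j] for r in x) / (n_t - 1) for j in range(n_s)]
           for i in range(n_s)]
    # Power iteration; the start vector must not be orthogonal to the EOF.
    v = [1.0 / math.sqrt(n_s)] * n_s
    for _ in range(n_iter):
        u = [sum(cov[i][j] * v[j] for j in range(n_s)) for i in range(n_s)]
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data variance")
        v = [c / norm for c in u]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n_s))
                     for i in range(n_s))
    total_variance = sum(cov[i][i] for i in range(n_s))
    return v, eigenvalue / total_variance

# Toy field: one spatial pattern [1, 2, 1] scaled by a time-varying amplitude.
amplitudes = [-1.0, 0.0, 1.0, 2.0]
field = [[a * 1.0, a * 2.0, a * 1.0] for a in amplitudes]
eof, explained = leading_eof(field, lats_deg=[0.0, 60.0, 0.0])
```

    At the scale described in the abstract, the covariance matrix cannot be formed explicitly, which is why distributed and truncated decompositions are used instead.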

  14. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets of various types, including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets, across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water level, vadose zone and groundwater temperature, water quality, and meteorology, as well as biogeochemical analyses of soil and groundwater samples, have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for on-going scientific analyses and for hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.

  15. The maximum vector-angular margin classifier and its fast training on large datasets using a core vector machine.

    PubMed

    Hu, Wenjun; Chung, Fu-Lai; Wang, Shitong

    2012-03-01

    Although pattern classification has been extensively studied in the past decades, how to effectively solve the corresponding training on large datasets is a problem that still requires particular attention. Many kernelized classification methods, such as SVM and SVDD, can be formulated as the corresponding quadratic programming (QP) problems, but computing the associated kernel matrices requires O(n^2) (or even up to O(n^3)) computational complexity, where n is the size of the training patterns, which heavily limits the applicability of these methods for large datasets. In this paper, a new classification method called the maximum vector-angular margin classifier (MAMC) is first proposed based on the vector-angular margin to find an optimal vector c in the pattern feature space, and all the testing patterns can be classified in terms of the maximum vector-angular margin ρ, between the vector c and all the training data points. Accordingly, it is proved that the kernelized MAMC can be equivalently formulated as the kernelized Minimum Enclosing Ball (MEB), which leads to a distinctive merit of MAMC, i.e., it has the flexibility of controlling the sum of support vectors like ν-SVC and may be extended to a maximum vector-angular margin core vector machine (MAMCVM) by connecting the core vector machine (CVM) method with MAMC such that the corresponding fast training on large datasets can be effectively achieved. Experimental results on artificial and real datasets are provided to validate the power of the proposed methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
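
    The link between fast training and the Minimum Enclosing Ball can be illustrated with a minimal sketch. It uses the simple farthest-point iteration of Bădoiu and Clarkson in plain Euclidean space, which is an assumption made here for illustration; it is not the MAMC or MAMCVM procedure of the paper, and the names `approx_meb` and `points` are hypothetical. Each pass only looks for the farthest point, so no n-by-n kernel matrix is ever formed.

```python
import math


def approx_meb(points, iterations=200):
    """Approximate the minimum enclosing ball of a list of points.

    Farthest-point iteration: move the centre a shrinking step
    towards whichever point is currently farthest away.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        # One linear scan per iteration; no pairwise matrix is built.
        far = max(points, key=lambda p: math.dist(p, center))
        step = 1.0 / (i + 1)
        center = [c + step * (f - c) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


# Corners of a 2 x 2 square plus its midpoint; the exact ball is
# centred at (1, 1) with radius sqrt(2).
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
center, radius = approx_meb(points)
print(center, radius)
```

    The number of iterations controls the approximation quality rather than the number of points, which is the property that core-set methods exploit on large datasets.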

  16. Combinatorial Approach for Large-scale Identification of Linked Peptides from Tandem Mass Spectrometry Spectra

    PubMed Central

    Wang, Jian; Anania, Veronica G.; Knott, Jeff; Rush, John; Lill, Jennie R.; Bourne, Philip E.; Bandeira, Nuno

    2014-01-01

    The combination of chemical cross-linking and mass spectrometry has recently been shown to constitute a powerful tool for studying protein–protein interactions and elucidating the structure of large protein complexes. However, computational methods for interpreting the complex MS/MS spectra from linked peptides are still in their infancy, making the high-throughput application of this approach largely impractical. Because of the lack of large annotated datasets, most current approaches do not capture the specific fragmentation patterns of linked peptides and therefore are not optimal for the identification of cross-linked peptides. Here we propose a generic approach to address this problem and demonstrate it using disulfide-bridged peptide libraries to (i) efficiently generate large mass spectral reference data for linked peptides at a low cost and (ii) automatically train an algorithm that can efficiently and accurately identify linked peptides from MS/MS spectra. We show that using this approach we were able to identify thousands of MS/MS spectra from disulfide-bridged peptides through comparison with proteome-scale sequence databases and significantly improve the sensitivity of cross-linked peptide identification. This allowed us to identify 60% more direct pairwise interactions between the protein subunits in the 20S proteasome complex than existing tools on cross-linking studies of the proteasome complexes. The basic framework of this approach and the MS/MS reference dataset generated should be valuable resources for the future development of new tools for the identification of linked peptides. PMID:24493012

  17. A Frequency-Domain Implementation of a Sliding-Window Traffic Sign Detector for Large Scale Panoramic Datasets

    NASA Astrophysics Data System (ADS)

    Creusen, I. M.; Hazelhoff, L.; De With, P. H. N.

    2013-10-01

    In large-scale automatic traffic sign surveying systems, the primary computational effort is concentrated at the traffic sign detection stage. This paper focuses on reducing the computational load of the sliding-window object detection algorithm which is employed for traffic sign detection. Sliding-window object detectors often use a linear SVM to classify the features in a window. In this case, the classification can be seen as a convolution of the feature maps with the SVM kernel. It is well known that convolution can be efficiently implemented in the frequency domain for kernels larger than a certain size. We show that by careful reordering of sliding-window operations, most of the frequency-domain transformations can be eliminated, leading to a substantial increase in efficiency. Additionally, we suggest using the overlap-add method to keep the memory use within reasonable bounds. This allows us to keep all the transformed kernels in memory, thereby eliminating even more domain transformations, and allows all scales in a multiscale pyramid to be processed using the same set of transformed kernels. For a typical sliding-window implementation, we have found that the detector execution performance improves by a factor of 5.3. As a bonus, many of the detector improvements from the literature, e.g. chi-squared kernel approximations, sub-class splitting algorithms etc., can be more easily applied at a lower performance penalty because of the improved scalability.
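
    The overlap-add idea mentioned above can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the authors' detector: the signal is 1D rather than a 2D feature map, the helper names are hypothetical, and each block is convolved directly where a frequency-domain implementation would multiply transforms instead. The point shown is only that fixed-size blocks, convolved separately and summed where their tails overlap, reproduce the full convolution.

```python
def convolve(signal, kernel):
    """Direct (full) linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def overlap_add(signal, kernel, block_size):
    """Convolve block by block and add the overlapping tails."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        # Each block result is len(kernel) - 1 samples longer than the
        # block, so its tail overlaps the start of the next block.
        for offset, value in enumerate(convolve(block, kernel)):
            out[start + offset] += value
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, -1.0, 0.5]
full = convolve(signal, kernel)
blocked = overlap_add(signal, kernel, block_size=3)
print(full)
print(blocked)
```

    Because every block has the same bounded size, the working memory per transform stays fixed regardless of the image size, which is the reason the abstract gives for using the method.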

  18. Chemical elements in the environment: multi-element geochemical datasets from continental to national scale surveys on four continents

    USGS Publications Warehouse

    Caritat, Patrice de; Reimann, Clemens; Smith, David; Wang, Xueqiu

    2017-01-01

    During the last 10-20 years, Geological Surveys around the world have undertaken a major effort towards delivering fully harmonized and tightly quality-controlled low-density multi-element soil geochemical maps and datasets of vast regions including up to whole continents. Concentrations of between 45 and 60 elements commonly have been determined in a variety of different regolith types (e.g., sediment, soil). The multi-element datasets are published as complete geochemical atlases and made available to the general public. Several other geochemical datasets covering smaller areas but generally at a higher spatial density are also available. These datasets may, however, not be found by superficial internet-based searches because the elements are not mentioned individually either in the title or in the keyword lists of the original references. This publication attempts to increase the visibility and discoverability of these fundamental background datasets covering large areas up to whole continents.

  19. Bundle block adjustment of large-scale remote sensing data with Block-based Sparse Matrix Compression combined with Preconditioned Conjugate Gradient

    NASA Astrophysics Data System (ADS)

    Zheng, Maoteng; Zhang, Yongjun; Zhou, Shunping; Zhu, Junfeng; Xiong, Xiaodong

    2016-07-01

    In recent years, new platforms and sensors in photogrammetry, remote sensing and computer vision areas have become available, such as Unmanned Aircraft Vehicles (UAV), oblique camera systems, common digital cameras and even mobile phone cameras. Images collected by all these kinds of sensors could be used as remote sensing data sources. These sensors can obtain large-scale remote sensing data which consist of a great number of images. Bundle block adjustment of large-scale data with conventional algorithms is very time and space (memory) consuming due to the very large normal matrix arising from large-scale data. In this paper, an efficient Block-based Sparse Matrix Compression (BSMC) method combined with the Preconditioned Conjugate Gradient (PCG) algorithm is chosen to develop a stable and efficient bundle block adjustment system in order to deal with the large-scale remote sensing data. The main contribution of this work is the BSMC-based PCG algorithm, which is more efficient in time and memory than the traditional algorithm without compromising the accuracy. A total of 8 datasets of real data are used to test our proposed method. Preliminary results have shown that the BSMC method can efficiently decrease the time and memory requirements of large-scale data.
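
    For readers unfamiliar with PCG, the following is a minimal sketch of the textbook preconditioned conjugate gradient iteration with a Jacobi (diagonal) preconditioner. The dense 3-by-3 matrix, the choice of preconditioner and all function names are assumptions made for illustration; the block-based sparse compression that is the paper's contribution is not reproduced here.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A.

    Uses conjugate gradients with a Jacobi preconditioner,
    i.e. M^-1 = diag(A)^-1.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Small symmetric, diagonally dominant (hence positive-definite) system.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
print(x)
```

    The solver only touches the matrix through `matvec`, so a compressed sparse storage scheme can be substituted without changing the iteration itself.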

  20. Scalable parallel distance field construction for large-scale applications

    DOE PAGES

    Yu, Hongfeng; Xie, Jinrong; Ma, Kwan-Liu; ...

    2015-10-01

    Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. In conclusion, our work greatly extends the usability of distance fields for demanding applications.

  1. Scalable Parallel Distance Field Construction for Large-Scale Applications.

    PubMed

    Yu, Hongfeng; Xie, Jinrong; Ma, Kwan-Liu; Kolla, Hemanth; Chen, Jacqueline H

    2015-10-01

    Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
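
    As a point of reference for what a distance field is, the sketch below computes one by brute force on a small 2D grid. It is a serial baseline written for illustration only, with hypothetical names and a 2D grid assumed in place of the paper's 3D volumes; it is not the parallel distance tree. Its cost grows with the number of cells times the number of surface points, which is the cost a spatial data structure is meant to reduce.

```python
import math


def distance_field(width, height, surface_points):
    """Return field[y][x], the Euclidean distance from grid cell
    (x, y) to the nearest point of the surface."""
    return [
        [min(math.hypot(x - sx, y - sy) for sx, sy in surface_points)
         for x in range(width)]
        for y in range(height)
    ]


# Two surface samples on a 5 x 4 grid.
surface = [(0, 0), (4, 3)]
field = distance_field(5, 4, surface)
for row in field:
    print(" ".join(f"{d:4.2f}" for d in row))
```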

  2. Large-Scale Outflows in Seyfert Galaxies

    NASA Astrophysics Data System (ADS)

    Colbert, E. J. M.; Baum, S. A.

    1995-12-01

    Highly collimated outflows extend out to Mpc scales in many radio-loud active galaxies. In Seyfert galaxies, which are radio-quiet, the outflows extend out to kpc scales and do not appear to be as highly collimated. In order to study the nature of large-scale (≳1 kpc) outflows in Seyferts, we have conducted optical, radio and X-ray surveys of a distance-limited sample of 22 edge-on Seyfert galaxies. Results of the optical emission-line imaging and spectroscopic survey imply that large-scale outflows are present in ≳1/4 of all Seyferts. The radio (VLA) and X-ray (ROSAT) surveys show that large-scale radio and X-ray emission is present at about the same frequency. Kinetic luminosities of the outflows in Seyferts are comparable to those in starburst-driven superwinds. Large-scale radio sources in Seyferts appear diffuse, but do not resemble radio halos found in some edge-on starburst galaxies (e.g. M82). We discuss the feasibility of the outflows being powered by the active nucleus (e.g. a jet) or a circumnuclear starburst.

  3. Long-term coastal measurements for large-scale climate trends characterization

    NASA Astrophysics Data System (ADS)

    Pomaro, Angela; Cavaleri, Luigi; Lionello, Piero

    2017-04-01

    Multi-decadal time-series of observational wave data beginning in the late 1970s are relatively rare. The present study analyses the 37-year long directional wave time-series recorded between 1979 and 2015 at the CNR-ISMAR (Institute of Marine Sciences of the Italian National Research Council) "Acqua Alta" oceanographic research tower, located in the Northern Adriatic Sea, 15 km off the Venice lagoon, at 16 m depth. The extent of the time series allows its content to be exploited not only for modelling purposes or short-term statistical analyses, but also at the climatological scale, thanks to the peculiar meteorological and oceanographic aspects of the coastal area where this infrastructure has been installed. We explore the dataset both to characterize the local average climate and its variability, and to detect the possible long-term trends that might be suggestive of, or emphasize, large-scale circulation patterns and trends. Measured data are essential for the assessment, and often for the calibration, of model data and, if long enough, generally serve as the reference for climate studies as well. By applying this analysis to an area well characterized from the meteorological point of view, we first assess the changes in time based on measured data, and then we compare them to the ones derived from the ERA-Interim regional simulation over the same area, thus showing the strong improvement that is still needed to get reliable climate model projections for coastal areas and the Mediterranean Region as a whole. Moreover, long-term hindcasts aimed at climatic considerations are well known for 1) underestimating, if their resolution is not high enough, the actual wave heights, as well as for 2) being strongly affected by different conditions over time that are likely to introduce spurious trends of variable magnitude. In particular, the different amount, in time, of data assimilated by the hindcast models directly and indirectly affects the results

  4. Xray: N-dimensional, labeled arrays for analyzing physical datasets in Python

    NASA Astrophysics Data System (ADS)

    Hoyer, S.

    2015-12-01

    Efficient analysis of geophysical datasets requires tools that both preserve and utilize metadata, and that transparently scale to process large datasets. Xray is such a tool, in the form of an open source Python library for analyzing the labeled, multi-dimensional array (tensor) datasets that are ubiquitous in the Earth sciences. Xray's approach pairs Python data structures based on the data model of the netCDF file format with the proven design and user interface of pandas, the popular Python data analysis library for labeled tabular data. On top of the NumPy array, xray adds labeled dimensions (e.g., "time") and coordinate values (e.g., "2015-04-10"), which it uses to enable a host of operations powered by these labels: selection, aggregation, alignment, broadcasting, split-apply-combine, interoperability with pandas and serialization to netCDF/HDF5. Many of these operations are enabled by xray's tight integration with pandas. Finally, to allow for easy parallelism and to enable its labeled data operations to scale to datasets that do not fit into memory, xray integrates with the parallel processing library dask.

  5. Synchronization of coupled large-scale Boolean networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Fangfei

    2014-03-15

    This paper investigates the complete synchronization and partial synchronization of two large-scale Boolean networks. First, the aggregation algorithm for large-scale Boolean networks is reviewed. Second, the aggregation algorithm is applied to study the complete synchronization and partial synchronization of large-scale Boolean networks. Finally, an illustrative example is presented to show the efficiency of the proposed results.

  6. Fast and Accurate Support Vector Machines on Large Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vishnu, Abhinav; Narasimhan, Jayenthi; Holder, Larry

    Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary --- also known as hyperplane --- which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminate the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively --- potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm --- de facto sequential SVM software --- which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as UCI HIGGS Boson dataset and Offending URL dataset.

  7. Bridging the gap between small and large scale sediment budgets? - A scaling challenge in the Upper Rhone Basin, Switzerland

    NASA Astrophysics Data System (ADS)

    Schoch, Anna; Blöthe, Jan; Hoffmann, Thomas; Schrott, Lothar

    2016-04-01

    -regions cover all three litho-tectonic units of the URB (Helvetic nappes, Penninic nappes, External massifs) and different catchment sizes to capture the inherent variability. Different parameters characterizing topography, surface characteristics, and vegetation cover are analyzed for each storage type. The data is then used in geostatistical models (PCA, stepwise logistic regression) to predict the spatial distribution of sediment storage for the whole URB. We further conduct morphometric analyses of the URB to gain information on the varying degree of glacial imprint and postglacial landscape evolution and their control on the spatial distribution of sediment storage in a large scale drainage basin. Geophysical methods (ground penetrating radar and electrical resistivity tomography) are applied on different sediment storage types on the local scale to estimate mean thicknesses. Additional data from published studies are used to complement our dataset. We integrate the local data in the statistical model on the spatial distribution of sediment storages for the whole URB. Hence, we can extrapolate the stored sediment volumes to the regional scale in order to bridge the gap between small and large scale studies.

  8. Consensus properties and their large-scale applications for the gene duplication problem.

    PubMed

    Moon, Jucheol; Lin, Harris T; Eulenstein, Oliver

    2016-06-01

    Solving the gene duplication problem is a classical approach for species tree inference from gene trees that are confounded by gene duplications. This problem takes a collection of gene trees and seeks a species tree that implies the minimum number of gene duplications. Wilkinson et al. posed the conjecture that the gene duplication problem satisfies the desirable Pareto property for clusters. That is, for every instance of the problem, all clusters that are commonly present in the input gene trees of this instance, called strict consensus, will also be found in every solution to this instance. We prove that this conjecture does not generally hold. Despite this negative result we show that the gene duplication problem satisfies a weaker version of the Pareto property where the strict consensus is found in at least one solution (rather than all solutions). This weaker property contributes to our design of an efficient scalable algorithm for the gene duplication problem. We demonstrate the performance of our algorithm in analyzing large-scale empirical datasets. Finally, we utilize the algorithm to evaluate the accuracy of standard heuristics for the gene duplication problem using simulated datasets.
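
    The notion of strict consensus used above can be made concrete with a short sketch. It assumes, for simplicity, rooted gene trees written as nested tuples over the same set of uniquely labelled leaves; real gene trees in the duplication problem may carry several copies per species, so this is an illustration of the cluster definition only, with hypothetical names, and not the authors' algorithm.

```python
def clusters(tree):
    """Return the clusters (frozensets of leaf names) of a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            leaf_set = frozenset([node])
        else:
            leaf_set = frozenset().union(*(leaves(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    leaves(tree)
    return found


def strict_consensus(trees):
    """Clusters that are present in every input tree."""
    return set.intersection(*(clusters(t) for t in trees))


gene_trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "d"), "c"),
    (("a", "b"), ("c", "d")),
]
shared = strict_consensus(gene_trees)
# Drop single leaves and the full leaf set, which every tree contains.
nontrivial = {c for c in shared if 1 < len(c) < 4}
print(nontrivial)
```

    In this toy input only the cluster {a, b} survives. The weaker Pareto property stated in the abstract guarantees that such a cluster appears in at least one optimal species tree, which allows a search to fix it in advance.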

  9. Dissecting the large-scale galactic conformity

    NASA Astrophysics Data System (ADS)

    Seo, Seongu

    2018-01-01

    Galactic conformity is an observed phenomenon in which galaxies located in the same region have similar properties such as star formation rate, color, gas fraction, and so on. The conformity was first observed among galaxies within the same halos ("one-halo conformity"). The one-halo conformity can be readily explained by mutual interactions among galaxies within a halo. Recent observations, however, further witnessed a puzzling connection among galaxies with no direct interaction. In particular, galaxies located within a sphere of ~5 Mpc radius tend to show similarities, even though the galaxies do not share common halos with each other ("two-halo conformity" or "large-scale conformity"). Using a cosmological hydrodynamic simulation, Illustris, we investigate the physical origin of the two-halo conformity and put forward two scenarios. First, back-splash galaxies are likely responsible for the large-scale conformity. They have evolved into red galaxies due to ram-pressure stripping in a given galaxy cluster and happen to reside now within a ~5 Mpc sphere. Second, galaxies in a strong tidal field induced by large-scale structure also seem to give rise to the large-scale conformity. The strong tides suppress star formation in the galaxies. We discuss the importance of the large-scale conformity in the context of galaxy evolution.

  10. Efficient genotype compression and analysis of large genetic variation datasets

    PubMed Central

    Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.

    2015-01-01

    Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and performance relative to existing methods improves with cohort size. We show substantial (up to 443 fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772

  11. The Large-scale Distribution of Galaxies

    NASA Astrophysics Data System (ADS)

    Flin, Piotr

    A review of the large-scale structure of the Universe is given. A connection is made with the titanic work by Johannes Kepler in many areas of astronomy and cosmology. Special attention is given to the spatial distribution of galaxies, voids and walls (the cellular structure of the Universe). Finally, the author concludes that the large-scale structure of the Universe can be observed on much greater scales than was thought twenty years ago.

  12. Large-scale urban point cloud labeling and reconstruction

    NASA Astrophysics Data System (ADS)

    Zhang, Liqiang; Li, Zhuqiang; Li, Anjian; Liu, Fangyu

    2018-04-01

    The large number of object categories and many overlapping or closely neighboring objects in large-scale urban scenes pose great challenges in point cloud classification. In this paper, a novel framework is proposed for classification and reconstruction of airborne laser scanning point cloud data. To label point clouds, we present a rectified linear units neural network named ReLu-NN where the rectified linear units (ReLu) instead of the traditional sigmoid are taken as the activation function in order to speed up the convergence. Since the features of the point cloud are sparse, we reduce the number of neurons by the dropout to avoid over-fitting of the training process. The set of feature descriptors for each 3D point is encoded through self-taught learning, and forms a discriminative feature representation which is taken as the input of the ReLu-NN. The segmented building points are consolidated through an edge-aware point set resampling algorithm, and then they are reconstructed into 3D lightweight models using the 2.5D contouring method (Zhou and Neumann, 2010). Compared with deep learning approaches, the ReLu-NN introduced can easily classify unorganized point clouds without rasterizing the data, and it does not need a large number of training samples. Most of the parameters in the network are learned, and thus the intensive parameter tuning cost is significantly reduced. Experimental results on various datasets demonstrate that the proposed framework achieves better performance than other related algorithms in terms of classification accuracy and reconstruction quality.

  13. Large-scale virtual screening on public cloud resources with Apache Spark.

    PubMed

    Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola

    2017-01-01

    Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on message passing interface, relying on low failure rate hardware and fast network connection. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, docking a publicly available target receptor against [Formula: see text]2.2 M compounds. The performance experiments show a good parallel efficiency (87%) when running in a public cloud environment. Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries. Our implementation is named Spark-VS and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).

  14. A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research

    PubMed Central

    Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien

    2016-01-01

    Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452

  15. On Feature Extraction from Large Scale Linear LiDAR Data

    NASA Astrophysics Data System (ADS)

    Acharjee, Partha Pratim

    Airborne light detection and ranging (LiDAR) can generate co-registered elevation and intensity maps over large terrain. The co-registered 3D map and intensity information can be used efficiently for different feature extraction applications. In this dissertation, we developed two algorithms for feature extraction, and usages of features for practical applications. One of the developed algorithms can map still and flowing waterbody features, and another one can extract building features and estimate solar potential on rooftops and facades. Remote sensing capabilities, distinguishing characteristics of laser returns from water surface and specific data collection procedures provide LiDAR data an edge in this application domain. Furthermore, water surface mapping solutions must work on extremely large datasets, from a thousand square miles to hundreds of thousands of square miles. National and state-wide map generation/upgradation and hydro-flattening of LiDAR data for many other applications are two leading needs of water surface mapping. These call for as much automation as possible. Researchers have developed many semi-automated algorithms using multiple semi-automated tools and human interventions. This reported work describes a consolidated algorithm and toolbox developed for large scale, automated water surface mapping. Geometric features, such as flatness of water surface and higher elevation change in the water-land interface, and optical properties, such as dropouts caused by specular reflection and bimodal intensity distributions, were some of the linear LiDAR features exploited for water surface mapping. Large-scale data handling capabilities are incorporated by automated and intelligent windowing, by resolving boundary issues and integrating all results to a single output. This whole algorithm is developed as an ArcGIS toolbox using Python libraries. Testing and validation are performed on large datasets to determine the effectiveness of the toolbox and results are

  16. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing.

    PubMed

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance; Messina, Thomas; Fan, Hongtao; Jaeger, Edward; Stephens, Susan

    2013-06-27

    Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. Rainbow is available

  17. Boosting association rule mining in large datasets via Gibbs sampling.

    PubMed

    Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua

    2016-05-03

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.
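
    The enumeration cost mentioned above can be made concrete with a small sketch. This is not the authors' Gibbs-sampling procedure; it is a plain exhaustive computation of support and confidence on a toy transaction set, and the function names and data are hypothetical. The number of candidate itemsets grows combinatorially with the number of items, which is the search space a stochastic search tries to avoid enumerating.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    joint = frozenset(antecedent) | frozenset(consequent)
    return support(joint, transactions) / base


def frequent_itemsets(transactions, min_support, max_size=2):
    """Exhaustively enumerate itemsets of size 1..max_size with enough support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


# Toy data (hypothetical): four transactions over three items.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

    Here `confidence({"a"}, {"b"}, transactions)` estimates how often "b" appears in transactions that contain "a". A practical miner would prune candidates (as Apriori does) rather than test every combination.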

  18. Large scale dynamic systems

    NASA Technical Reports Server (NTRS)

    Doolin, B. F.

    1975-01-01

    Classes of large scale dynamic systems were discussed in the context of modern control theory. Specific examples discussed were in the technical fields of aeronautics, water resources and electric power.

  19. Large-Scale Sentinel-1 Processing for Solid Earth Science and Urgent Response using Cloud Computing and Machine Learning

    NASA Astrophysics Data System (ADS)

    Hua, H.; Owen, S. E.; Yun, S. H.; Agram, P. S.; Manipon, G.; Starch, M.; Sacco, G. F.; Bue, B. D.; Dang, L. B.; Linick, J. P.; Malarout, N.; Rosen, P. A.; Fielding, E. J.; Lundgren, P.; Moore, A. W.; Liu, Z.; Farr, T.; Webb, F.; Simons, M.; Gurrola, E. M.

    2017-12-01

    With the increased availability of open SAR data (e.g. Sentinel-1 A/B), new challenges are being faced with processing and analyzing the voluminous SAR datasets to make geodetic measurements. Upcoming SAR missions such as NISAR are expected to generate close to 100 TB per day. The Advanced Rapid Imaging and Analysis (ARIA) project can now generate geocoded unwrapped phase and coherence products from Sentinel-1 TOPS mode data in an automated fashion, using the ISCE software. This capability is currently being exercised on various study sites across the United States and around the globe, including Hawaii, Central California, Iceland and South America. The automated and large-scale SAR data processing and analysis capabilities use cloud computing techniques to speed the computations and provide scalable processing power and storage. Aspects such as how to process these voluminous SLCs and interferograms at global scales, how to keep up with the large daily SAR data volumes, and how to handle the voluminous data rates are being explored. Scene-partitioning approaches in the processing pipeline help in handling global-scale processing up to unwrapped interferograms with stitching done at a late stage. We have built an advanced science data system with rapid search functions to enable access to the derived data products. Rapid image processing of Sentinel-1 data to interferograms and time series is already being applied to natural hazards including earthquakes, floods, volcanic eruptions, and land subsidence due to fluid withdrawal. We will present the status of the ARIA science data system for generating science-ready data products and challenges that arise from being able to process SAR datasets to derived time series data products at large scales. For example, how do we perform large-scale data quality screening on interferograms? What approaches can be used to minimize compute, storage, and data movement costs for time series analysis in the cloud? We will also

  20. Nanocubes for real-time exploration of spatiotemporal datasets.

    PubMed

    Lins, Lauro; Klosowski, James T; Scheidegger, Carlos

    2013-12-01

    Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop's main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
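
    The idea of answering aggregate queries from precomputed counts can be illustrated with a minimal sketch. This is not the nanocube data structure itself: it only precomputes per-cell counts on a spatial grid at several resolutions, and the function names, the unit-square coordinates, and the sample points are all assumptions made for illustration.

```python
from collections import Counter


def build_counts(points, max_level):
    """Precompute per-cell point counts at every grid level 0..max_level.

    points: iterable of (x, y) pairs with 0 <= x < 1 and 0 <= y < 1.
    Level k splits the unit square into 2**k by 2**k cells.
    """
    points = list(points)
    levels = []
    for level in range(max_level + 1):
        n = 2 ** level
        counts = Counter()
        for x, y in points:
            counts[(int(x * n), int(y * n))] += 1
        levels.append(counts)
    return levels


def query(levels, level, cell):
    """Return the precomputed count for one cell; no scan of the raw points."""
    return levels[level].get(cell, 0)


# Hypothetical sample data: four points in the unit square.
points = [(0.1, 0.1), (0.2, 0.3), (0.6, 0.7), (0.9, 0.9)]
levels = build_counts(points, max_level=2)
```

    A heatmap at a given zoom level then reads one count per visible cell. Storing every level naively is wasteful; sharing structure between levels and dimensions is the kind of saving the abstract describes.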

  1. Cloud-Scale Genomic Signals Processing for Robust Large-Scale Cancer Genomic Microarray Data Analysis.

    PubMed

    Harvey, Benjamin Simeon; Ji, Soo-Yeon

    2017-01-01

    As microarray data available to scientists continues to increase in size and complexity, it has become overwhelmingly important to find multiple ways to bring forth oncological inference to the bioinformatics community through the analysis of large-scale cancer genomic (LSCG) DNA and mRNA microarray data that is useful to scientists. Though there have been many attempts to elucidate the issue of bringing forth biological interpretation by means of wavelet preprocessing and classification, there has not been a research effort that focuses on a cloud-scale distributed parallel (CSDP) separable 1-D wavelet decomposition technique for denoising through differential expression thresholding and classification of LSCG microarray data. This research presents a novel methodology that utilizes a CSDP separable 1-D method for wavelet-based transformation in order to initialize a threshold which will retain significantly expressed genes through the denoising process for robust classification of cancer patients. Additionally, the overall study was implemented and encompassed within CSDP environment. The utilization of cloud computing and wavelet-based thresholding for denoising was used for the classification of samples within the Global Cancer Map, Cancer Cell Line Encyclopedia, and The Cancer Genome Atlas. The results proved that separable 1-D parallel distributed wavelet denoising in the cloud and differential expression thresholding increased the computational performance and enabled the generation of higher quality LSCG microarray datasets, which led to more accurate classification results.
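
    As a rough illustration of wavelet-based thresholding for denoising, the sketch below applies a single-level 1-D Haar transform and zeroes small detail coefficients. The choice of the Haar wavelet, hard thresholding, a single decomposition level, and a single machine are all assumptions; the abstract does not specify these details, and nothing here reproduces the distributed implementation.

```python
import math


def haar_step(signal):
    """One level of the orthonormal 1-D Haar transform (even-length input)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def denoise(signal, threshold):
    """Hard-threshold the detail coefficients, then reconstruct the signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx, detail = haar_step(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    return inverse_haar_step(approx, kept)


# Hypothetical expression profile: small within-pair differences are treated as noise.
smoothed = denoise([4.0, 4.2, 2.0, 2.1], threshold=0.5)
```

    Small differences within a pair are averaged away, while a large difference (a detail coefficient above the threshold) is kept unchanged.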

  2. Large scale analysis of signal reachability.

    PubMed

    Todor, Andrei; Gabr, Haitham; Dobra, Alin; Kahveci, Tamer

    2014-06-15

    Major disorders, such as leukemia, have been shown to alter the transcription of genes. Understanding how gene regulation is affected by such aberrations is of utmost importance. One promising strategy toward this objective is to compute whether signals can reach the transcription factors through the transcription regulatory network (TRN). Due to the uncertainty of the regulatory interactions, this is a #P-complete problem, and thus solving it for very large TRNs remains a challenge. We develop a novel and scalable method to compute the probability that a signal originating at any given set of source genes can arrive at any given set of target genes (i.e., transcription factors) when the topology of the underlying signaling network is uncertain. Our method tackles this problem for large networks while providing a provably accurate result. Our method follows a divide-and-conquer strategy. We break down the given network into a sequence of non-overlapping subnetworks such that reachability can be computed autonomously and sequentially on each subnetwork. We represent each interaction using a small polynomial. The product of these polynomials expresses the different scenarios in which a signal can or cannot reach the target genes from the source genes. We introduce polynomial collapsing operators for each subnetwork. These operators dramatically reduce the size of the resulting polynomial and thus the computational complexity. We show that our method scales to entire human regulatory networks in only seconds, while the existing methods fail beyond a few tens of genes and interactions. We demonstrate that our method can successfully characterize key reachability characteristics of the entire transcription regulatory networks of patients affected by eight different subtypes of leukemia, as well as those from healthy control samples. All the datasets and code used in this article are available at bioinformatics.cise.ufl.edu/PReach/scalable.htm. © The Author 2014
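
    To show why the problem is hard, the following sketch computes a reachability probability by brute force on a tiny network. It is not the polynomial-collapsing method of the abstract. It assumes each interaction is present independently with a given probability and that "reaching the targets" means reaching at least one target; both are assumptions, as are the function name and the example network. The loop visits all 2^m edge subsets, which is why this baseline fails beyond a few tens of interactions.

```python
from itertools import product


def reach_probability(edges, sources, targets):
    """Probability that some target is reachable from some source.

    edges: dict mapping a directed edge (u, v) to the probability it exists.
    Enumerates every subset of edges, so the cost is exponential in len(edges).
    """
    edge_list = list(edges.items())
    targets = set(targets)
    total = 0.0
    for present in product([False, True], repeat=len(edge_list)):
        weight = 1.0
        adjacency = {}
        for ((u, v), p), on in zip(edge_list, present):
            weight *= p if on else 1.0 - p
            if on:
                adjacency.setdefault(u, []).append(v)
        if weight == 0.0:
            continue
        # Depth-first search over the edges present in this scenario.
        seen = set(sources)
        stack = list(sources)
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen & targets:
            total += weight
    return total


# Hypothetical network: a direct path s -> t and an indirect path s -> a -> t.
edges = {("s", "a"): 0.5, ("a", "t"): 0.5, ("s", "t"): 0.5}
probability = reach_probability(edges, sources=["s"], targets=["t"])
```

    For this example the two paths fail independently with probabilities 0.5 and 0.75, so the signal arrives with probability 1 - 0.5 * 0.75 = 0.625.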

  3. Transition from large-scale to small-scale dynamo.

    PubMed

    Ponty, Y; Plunian, F

    2011-04-15

    The dynamo equations are solved numerically with a helical forcing corresponding to the Roberts flow. In the fully turbulent regime the flow behaves as a Roberts flow on long time scales, plus turbulent fluctuations at short time scales. The dynamo onset is controlled by the long time scales of the flow, in agreement with the former Karlsruhe experimental results. The dynamo mechanism is governed by a generalized α effect, which includes both the usual α effect and turbulent diffusion, plus all higher order effects. Beyond the onset we find that this generalized α effect scales as O(Rm^{-1}), suggesting the takeover of small-scale dynamo action. This is confirmed by simulations in which dynamo occurs even if the large-scale field is artificially suppressed.

  4. Combining semi-automated image analysis techniques with machine learning algorithms to accelerate large-scale genetic studies.

    PubMed

    Atkinson, Jonathan A; Lobet, Guillaume; Noll, Manuel; Meyer, Patrick E; Griffiths, Marcus; Wells, Darren M

    2017-10-01

    Genetic analyses of plant root systems require large datasets of extracted architectural traits. To quantify such traits from images of root systems, researchers often have to choose between automated tools (that are prone to error and extract only a limited number of architectural traits) or semi-automated ones (that are highly time consuming). We trained a Random Forest algorithm to infer architectural traits from automatically extracted image descriptors. The training was performed on a subset of the dataset, then applied to its entirety. This strategy allowed us to (i) decrease the image analysis time by 73% and (ii) extract meaningful architectural traits based on image descriptors. We also show that these traits are sufficient to identify the quantitative trait loci that had previously been discovered using a semi-automated method. We have shown that combining semi-automated image analysis with machine learning algorithms has the power to increase the throughput of large-scale root studies. We expect that such an approach will enable the quantification of more complex root systems for genetic studies. We also believe that our approach could be extended to other areas of plant phenotyping. © The Authors 2017. Published by Oxford University Press.

  5. Combining semi-automated image analysis techniques with machine learning algorithms to accelerate large-scale genetic studies

    PubMed Central

    Atkinson, Jonathan A.; Lobet, Guillaume; Noll, Manuel; Meyer, Patrick E.; Griffiths, Marcus

    2017-01-01

    Genetic analyses of plant root systems require large datasets of extracted architectural traits. To quantify such traits from images of root systems, researchers often have to choose between automated tools (that are prone to error and extract only a limited number of architectural traits) or semi-automated ones (that are highly time consuming). We trained a Random Forest algorithm to infer architectural traits from automatically extracted image descriptors. The training was performed on a subset of the dataset, then applied to its entirety. This strategy allowed us to (i) decrease the image analysis time by 73% and (ii) extract meaningful architectural traits based on image descriptors. We also show that these traits are sufficient to identify the quantitative trait loci that had previously been discovered using a semi-automated method. We have shown that combining semi-automated image analysis with machine learning algorithms has the power to increase the throughput of large-scale root studies. We expect that such an approach will enable the quantification of more complex root systems for genetic studies. We also believe that our approach could be extended to other areas of plant phenotyping. PMID:29020748

  6. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Gray, A.

    2014-04-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex

  7. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Astronomy Data Centre, Canadian

    2014-01-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors, and the local outlier factor. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.

  8. Large-Scale Hybrid Motor Testing. Chapter 10

    NASA Technical Reports Server (NTRS)

    Story, George

    2006-01-01

    Hybrid rocket motors can be successfully demonstrated at a small scale virtually anywhere. There have been many suitcase sized portable test stands assembled for demonstration of hybrids. They show the safety of hybrid rockets to the audiences. These small show motors and small laboratory scale motors can give comparative burn rate data for development of different fuel/oxidizer combinations; however, the questions always asked when hybrids are mentioned for large scale applications are: how do they scale, and has it been shown in a large motor? To answer those questions, large scale motor testing is required to verify the hybrid motor at its true size. The necessity to conduct large-scale hybrid rocket motor tests to validate the burn rate from the small motors to application size has been documented in several places. Comparison of small scale hybrid data to that of larger scale data indicates that the fuel burn rate goes down with increasing port size, even with the same oxidizer flux. This trend holds for conventional hybrid motors with forward oxidizer injection and HTPB based fuels. While the reason this is occurring would make a great paper or study or thesis, it is not thoroughly understood at this time. Potential causes include the fact that since hybrid combustion is boundary layer driven, the larger port sizes reduce the interaction (radiation, mixing and heat transfer) from the core region of the port. This chapter focuses on some of the large, prototype sized testing of hybrid motors. The largest motors tested have been AMROC's 250K-lbf thrust motor at Edwards Air Force Base and the Hybrid Propulsion Demonstration Program's 250K-lbf thrust motor at Stennis Space Center. Numerous smaller tests were performed to support the burn rate, stability and scaling concepts that went into the development of those large motors.

  9. Why small-scale cannabis growers stay small: five mechanisms that prevent small-scale growers from going large scale.

    PubMed

    Hammersvik, Eirik; Sandberg, Sveinung; Pedersen, Willy

    2012-11-01

    Over the past 15-20 years, domestic cultivation of cannabis has been established in a number of European countries. New techniques have made such cultivation easier; however, the bulk of growers remain small-scale. In this study, we explore the factors that prevent small-scale growers from increasing their production. The study is based on 1 year of ethnographic fieldwork and qualitative interviews conducted with 45 Norwegian cannabis growers, 10 of whom were growing on a large-scale and 35 on a small-scale. The study identifies five mechanisms that prevent small-scale indoor growers from going large-scale. First, large-scale operations involve a number of people, large sums of money, a high work-load and a high risk of detection, and thus demand a higher level of organizational skills than for small growing operations. Second, financial assets are needed to start a large 'grow-site'. Housing rent, electricity, equipment and nutrients are expensive. Third, to be able to sell large quantities of cannabis, growers need access to an illegal distribution network and knowledge of how to act according to black market norms and structures. Fourth, large-scale operations require advanced horticultural skills to maximize yield and quality, which demands greater skills and knowledge than does small-scale cultivation. Fifth, small-scale growers are often embedded in the 'cannabis culture', which emphasizes anti-commercialism, anti-violence and ecological and community values. Hence, starting up large-scale production will imply having to renegotiate or abandon these values. Going from small- to large-scale cannabis production is a demanding task-ideologically, technically, economically and personally. The many obstacles that small-scale growers face and the lack of interest and motivation for going large-scale suggest that the risk of a 'slippery slope' from small-scale to large-scale growing is limited. Possible political implications of the findings are discussed. 

  10. The Role of SST and Large-Scale Dynamical Motions on the Onset and Shutdown of the Super Greenhouse Effect

    NASA Astrophysics Data System (ADS)

    O'Brien, T. A.; Kashinath, K.; Collins, W.

    2015-12-01

    Over warm tropical oceans the increase in greenhouse trapping with increasing SST is faster than that of the surface emission, resulting in a decrease in outgoing longwave radiation at the top of the atmosphere (OLR) when SST increases, also known as the super greenhouse effect (SGE). If SGE is directly linked to SST changes, there are profound implications for positive climate feedbacks in the tropics. However, a number of studies in the last 20 years have provided compelling evidence that the OLR-SST relationship is coincidental rather than causal. These studies suggested that the onset of SGE is dominated by the large-scale dynamics, and that the apparent OLR-SST relationships disappear when individual large-scale regimes are considered. We show that these conclusions are contingent on the quality of the datasets used in the analysis, and that modern satellite observations and reanalyses support a strong relationship between SGE and SST. We find that the SGE occurs across all dynamical regimes, suggesting that this may be related primarily to SST rather than large-scale dynamics. We also find that the discontinuity in the relationship between OLR and SST at high SST (29.5 C), i.e. the shutdown of SGE, also occurs across almost all dynamical regimes, suggesting that this behavior may also be strongly linked to SST. Collectively, these results suggest that the SGE may actually be controlled by SST. Work is ongoing to understand the robustness of this new result to other datasets, to understand whether SST is truly the controlling variable, and to understand the mechanism by which OLR could decrease with increasing SST even under strongly subsiding conditions.

  11. Discriminant WSRC for Large-Scale Plant Species Recognition.

    PubMed

    Zhang, Shanwen; Zhang, Chuanlei; Zhu, Yihai; You, Zhuhong

    2017-01-01

    In sparse representation based classification (SRC) and weighted SRC (WSRC), it is time-consuming to solve the global sparse representation problem. A discriminant WSRC (DWSRC) comprising two stages is proposed for large-scale plant species recognition. Firstly, several subdictionaries are constructed by dividing the dataset into several similar classes, and a subdictionary is chosen by the maximum similarity between the test sample and the typical sample of each similar class. Secondly, the weighted sparse representation of the test image is calculated with respect to the chosen subdictionary, and then the leaf category is assigned through the minimum reconstruction error. Different from the traditional SRC and its improved approaches, we sparsely represent the test sample on a subdictionary whose base elements are the training samples of the selected similar class, instead of using the generic overcomplete dictionary on the entire training samples. Thus, the complexity of solving the sparse representation problem is reduced. Moreover, DWSRC is adapted to newly added leaf species without rebuilding the dictionary. Experimental results on the ICL plant leaf database show that the method has low computational complexity and high recognition rate and can be clearly interpreted.

  12. Generation of Large-Scale Magnetic Fields by Small-Scale Dynamo in Shear Flows.

    PubMed

    Squire, J; Bhattacharjee, A

    2015-10-23

    We propose a new mechanism for a turbulent mean-field dynamo in which the magnetic fluctuations resulting from a small-scale dynamo drive the generation of large-scale magnetic fields. This is in stark contrast to the common idea that small-scale magnetic fields should be harmful to large-scale dynamo action. These dynamos occur in the presence of a large-scale velocity shear and do not require net helicity, resulting from off-diagonal components of the turbulent resistivity tensor as the magnetic analogue of the "shear-current" effect. Given the inevitable existence of nonhelical small-scale magnetic fields in turbulent plasmas, as well as the generic nature of velocity shear, the suggested mechanism may help explain the generation of large-scale magnetic fields across a wide range of astrophysical objects.

  13. Automated Topographic Change Detection via Dem Differencing at Large Scales Using The Arcticdem Database

    NASA Astrophysics Data System (ADS)

    Candela, S. G.; Howat, I.; Noh, M. J.; Porter, C. C.; Morin, P. J.

    2016-12-01

    In the last decade, high resolution satellite imagery has become an increasingly accessible tool for geoscientists to quantify changes in the Arctic land surface due to geophysical, ecological and anthropogenic processes. However, the trade-off between spatial coverage and spatial-temporal resolution has limited detailed, process-level change detection over large (i.e. continental) scales. The ArcticDEM project utilized over 300,000 Worldview image pairs to produce a nearly 100% coverage elevation model (above 60°N) offering the first polar, high spatial - high resolution (2-8m by region) dataset, often with multiple repeats in areas of particular interest to geoscientists. A dataset of this size (nearly 250 TB) offers endless new avenues of scientific inquiry, but quickly becomes unmanageable computationally and logistically for the computing resources available to the average scientist. Here we present TopoDiff, a framework for a generalized, automated workflow that requires minimal input from the end user about a study site, and utilizes cloud computing resources to provide a temporally sorted and differenced dataset, ready for geostatistical analysis. This hands-off approach allows the end user to focus on the science, without having to manage thousands of files, or petabytes of data. At the same time, TopoDiff provides a consistent and accurate workflow for image sorting, selection, and co-registration enabling cross-comparisons between research projects.

  14. Large Scale Traffic Simulations

    DOT National Transportation Integrated Search

    1997-01-01

    Large scale microscopic (i.e. vehicle-based) traffic simulations pose high demands on computation speed in at least two application areas: (i) real-time traffic forecasting, and (ii) long-term planning applications (where repeated "looping" between t...

  15. Generation of large-scale magnetic fields by small-scale dynamo in shear flows

    DOE PAGES

    Squire, J.; Bhattacharjee, A.

    2015-10-20

    We propose a new mechanism for a turbulent mean-field dynamo in which the magnetic fluctuations resulting from a small-scale dynamo drive the generation of large-scale magnetic fields. This is in stark contrast to the common idea that small-scale magnetic fields should be harmful to large-scale dynamo action. These dynamos occur in the presence of a large-scale velocity shear and do not require net helicity, resulting from off-diagonal components of the turbulent resistivity tensor as the magnetic analogue of the "shear-current" effect. Furthermore, given the inevitable existence of nonhelical small-scale magnetic fields in turbulent plasmas, as well as the generic nature of velocity shear, the suggested mechanism may help explain the generation of large-scale magnetic fields across a wide range of astrophysical objects.

  16. Benchmarking Deep Learning Models on Large Healthcare Datasets.

    PubMed

    Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan

    2018-06-04

    Deep learning models (aka Deep Neural Networks) have revolutionized many fields including computer vision, natural language processing, and speech recognition, and are being increasingly used in clinical healthcare applications. However, few works exist which have benchmarked the performance of the deep learning models with respect to the state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present the benchmarking results for several clinical prediction tasks such as mortality prediction, length of stay prediction, and ICD-9 code group prediction using Deep Learning models, ensemble of machine learning models (Super Learner algorithm), SAPS II and SOFA scores. We used the Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) publicly available dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches especially when the 'raw' clinical time series data is used as input features to the models. Copyright © 2018 Elsevier Inc. All rights reserved.

  17. Personality in 100,000 Words: A large-scale analysis of personality and word use among bloggers

    PubMed Central

    Yarkoni, Tal

    2010-01-01

    Previous studies have found systematic associations between personality and individual differences in word use. Such studies have typically focused on broad associations between major personality domains and aggregate word categories, potentially masking more specific associations. Here I report the results of a large-scale analysis of personality and word use in a large sample of blogs (N=694). The size of the dataset enabled pervasive correlations with personality to be identified for a broad range of lexical variables, including both aggregate word categories and individual English words. The results replicated category-level findings from previous offline studies, identified numerous novel associations at both a categorical and single-word level, and underscored the value of complementary approaches to the study of personality and word use. PMID:20563301

  18. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    NASA Astrophysics Data System (ADS)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes pending data availability. The mixture of geodetic datasets such as Global Navigational Satellite Systems (GNSS) and InSAR, seismic waveforms, and when applicable, tsunami waveforms from Deep-Ocean Assessment and Reporting of Tsunami (DART) gauges, provide slightly different observations that when incorporated together lead to a more robust model of fault slip distribution. The merging of different datasets is of particular importance along subduction zones where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained individually from geodetic and tsunami datasets as well as in a combined dataset. We constrain the importance of distance between estimated parameters and observed data and how that varies between land-based and open ocean datasets. Analysis focuses on accurately scaled subduction zone synthetic models as well as analysis of the relationship between slip and data in recent large subduction zone earthquakes. This study shows that seafloor deformation sensitive datasets, like open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large and particularly tsunamigenic megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.

  19. The Development of the Older Persons and Informal Caregivers Survey Minimum DataSet (TOPICS-MDS): A Large-Scale Data Sharing Initiative

    PubMed Central

    Lutomski, Jennifer E.; Baars, Maria A. E.; Schalk, Bianca W. M.; Boter, Han; Buurman, Bianca M.; den Elzen, Wendy P. J.; Jansen, Aaltje P. D.; Kempen, Gertrudis I. J. M.; Steunenberg, Bas; Steyerberg, Ewout W.; Olde Rikkert, Marcel G. M.; Melis, René J. F.

    2013-01-01

    feasibility of constructing a large (>30,000 observations), standardised dataset pooled from various study protocols with different sampling frameworks. This unique implementation strategy improves efficiency and facilitates individual-level data meta-analysis. PMID:24324716

  20. Improving Large-Scale Image Retrieval Through Robust Aggregation of Local Descriptors.

    PubMed

    Husain, Syed Sameed; Bober, Miroslaw

    2017-09-01

    Visual search and image retrieval underpin numerous applications, however the task is still challenging predominantly due to the variability of object appearance and ever increasing size of the databases, often exceeding billions of images. Prior art methods rely on aggregation of local scale-invariant descriptors, such as SIFT, via mechanisms including Bag of Visual Words (BoW), Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors (FV). However, their performance is still short of what is required. This paper presents a novel method for deriving a compact and distinctive representation of image content called Robust Visual Descriptor with Whitening (RVD-W). It significantly advances the state of the art and delivers world-class performance. In our approach local descriptors are rank-assigned to multiple clusters. Residual vectors are then computed in each cluster, normalized using a direction-preserving normalization function and aggregated based on the neighborhood rank. Importantly, the residual vectors are de-correlated and whitened in each cluster before aggregation, leading to a balanced energy distribution in each dimension and significantly improved performance. We also propose a new post-PCA normalization approach which improves separability between the matching and non-matching global descriptors. This new normalization benefits not only our RVD-W descriptor but also improves existing approaches based on FV and VLAD aggregation. Furthermore, we show that the aggregation framework developed using hand-crafted SIFT features also performs exceptionally well with Convolutional Neural Network (CNN) based features. The RVD-W pipeline outperforms state-of-the-art global descriptors on both the Holidays and Oxford datasets. 
On the large-scale datasets, Holidays1M and Oxford1M, the SIFT-based RVD-W representation obtains a mAP of 45.1 and 35.1 percent, while the CNN-based RVD-W achieves a mAP of 63.5 and 44.8 percent, all yielding superior performance to the

  1. Toward Increasing Fairness in Score Scale Calibrations Employed in International Large-Scale Assessments

    ERIC Educational Resources Information Center

    Oliveri, Maria Elena; von Davier, Matthias

    2014-01-01

    In this article, we investigate the creation of comparable score scales across countries in international assessments. We examine potential improvements to current score scale calibration procedures used in international large-scale assessments. Our approach seeks to improve fairness in scoring international large-scale assessments, which often…

  2. Secondary analysis of national survey datasets.

    PubMed

    Boo, Sunjoo; Froelicher, Erika Sivarajan

    2013-06-01

    This paper describes the methodological issues associated with secondary analysis of large national survey datasets. Issues about survey sampling, data collection, and non-response and missing data in terms of methodological validity and reliability are discussed. Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis. Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses. © 2012 The Authors. Japan Journal of Nursing Science © 2012 Japan Academy of Nursing Science.

  3. Very Large Scale Integration (VLSI).

    ERIC Educational Resources Information Center

    Yeaman, Andrew R. J.

    Very Large Scale Integration (VLSI), the state-of-the-art production techniques for computer chips, promises such powerful, inexpensive computing that, in the future, people will be able to communicate with computer devices in natural language or even speech. However, before full-scale VLSI implementation can occur, certain salient factors must be…

  4. Large Scale Ice Water Path and 3-D Ice Water Content

    DOE Data Explorer

    Liu, Guosheng

    2008-01-15

    Cloud ice water concentration is one of the most important, yet poorly observed, cloud properties. Developing physical parameterizations used in general circulation models through single-column modeling is one of the key foci of the ARM program. In addition to the vertical profiles of temperature, water vapor and condensed water at the model grids, large-scale horizontal advective tendencies of these variables are also required as forcing terms in the single-column models. Observed horizontal advection of condensed water has not been available because the radar/lidar/radiometer observations at the ARM site are single-point measurements and therefore do not provide the horizontal distribution of condensed water. The intention of this product is to provide the large-scale distribution of cloud ice water by merging available surface and satellite measurements. The satellite cloud ice water algorithm uses ARM ground-based measurements as a baseline and produces datasets for 3-D cloud ice water distributions in a 10 deg x 10 deg area near the ARM site. The approach of the study is to expand a (surface) point measurement to a (satellite) areal measurement. That is, this study takes advantage of the high-quality cloud measurements at the ARM site. We use the cloud characteristics derived from the point measurement to guide/constrain satellite retrieval, then use the satellite algorithm to derive the cloud ice water distributions within an area, i.e., 10 deg x 10 deg centered at the ARM site.

  5. Task Dependence, Tissue Specificity, and Spatial Distribution of Widespread Activations in Large Single-Subject Functional MRI Datasets at 7T

    PubMed Central

    Gonzalez-Castillo, Javier; Hoy, Colin W.; Handwerker, Daniel A.; Roopchansingh, Vinai; Inati, Souheil J.; Saad, Ziad S.; Cox, Robert W.; Bandettini, Peter A.

    2015-01-01

    It was recently shown that when large amounts of task-based blood oxygen level–dependent (BOLD) data are combined to increase contrast- and temporal signal-to-noise ratios, the majority of the brain shows significant hemodynamic responses time-locked with the experimental paradigm. Here, we investigate the biological significance of such widespread activations. First, the relationship between activation extent and task demands was investigated by varying cognitive load across participants. Second, the tissue specificity of responses was probed using the better BOLD signal localization capabilities of a 7T scanner. Finally, the spatial distribution of 3 primary response types—namely positively sustained (pSUS), negatively sustained (nSUS), and transient—was evaluated using a newly defined voxel-wise waveshape index that permits separation of responses based on their temporal signature. About 86% of gray matter (GM) became significantly active when all data entered the analysis for the most complex task. Activation extent scaled with task load and largely followed the GM contour. The most common response type was nSUS BOLD, irrespective of the task. Our results suggest that widespread activations associated with extremely large single-subject functional magnetic resonance imaging datasets can provide valuable information about the functional organization of the brain that goes undetected in smaller sample sizes. PMID:25405938

  6. GIEMS-D3: A new long-term, dynamical, high-spatial resolution inundation extent dataset at global scale

    NASA Astrophysics Data System (ADS)

    Aires, Filipe; Miolane, Léo; Prigent, Catherine; Pham Duc, Binh; Papa, Fabrice; Fluet-Chouinard, Etienne; Lehner, Bernhard

    2017-04-01

    The Global Inundation Extent from Multi-Satellites (GIEMS) provides multi-year monthly variations of the global surface water extent at 25 km x 25 km resolution. It is derived from multiple satellite observations. Its spatial resolution is usually compatible with climate model outputs and with global land surface model grids but is clearly not adequate for local applications that require the characterization of small individual water bodies. There is today a strong demand for high-resolution inundation extent datasets, for a large variety of applications such as water management, regional hydrological modeling, or the analysis of mosquito-related diseases. A new procedure is introduced to downscale the GIEMS low spatial resolution inundations to a 3 arc second (90 m) dataset. The methodology is based on topography and hydrography information from the HydroSHEDS database. A new floodability index is adopted and an innovative smoothing procedure is developed to ensure a smooth transition, in the high-resolution maps, between the low-resolution boxes from GIEMS. Topography information is relevant for natural hydrology environments controlled by elevation, but is more limited in human-modified basins. However, the proposed downscaling approach is compatible with forthcoming fusion with other more pertinent satellite information in these difficult regions. The resulting GIEMS-D3 database is the only high spatial resolution inundation database available globally at the monthly time scale over the 1993-2007 period. GIEMS-D3 is assessed by analyzing its spatial and temporal variability, and evaluated by comparisons to other independent satellite observations from visible (Google Earth and Landsat), infrared (MODIS) and active microwave (SAR).

  7. Survey on large scale system control methods

    NASA Technical Reports Server (NTRS)

    Mercadal, Mathieu

    1987-01-01

    The problems inherent to large scale systems such as power networks, communication networks and economic or ecological systems were studied. The increase in size and flexibility of future spacecraft has put those dynamical systems into the category of large scale systems, and tools specific to the class of large systems are being sought to design control systems that can guarantee more stability and better performance. Among several survey papers, reference was found to a thorough investigation on decentralized control methods. Especially helpful was the classification made of the different existing approaches to deal with large scale systems. A very similar classification is used, even though the papers surveyed are somewhat different from the ones reviewed in other papers. Special attention is brought to the applicability of the existing methods to controlling large mechanical systems like large space structures. Some recent developments are added to this survey.

  8. Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering.

    PubMed

    Sun, Peng; Speicher, Nora K; Röttger, Richard; Guo, Jiong; Baumbach, Jan

    2014-05-01

    The explosion of the biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering

    PubMed Central

    Sun, Peng; Speicher, Nora K.; Röttger, Richard; Guo, Jiong; Baumbach, Jan

    2014-01-01

    Abstract The explosion of the biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as ‘simultaneous clustering’ or ‘co-clustering’, has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: ‘Bi-Force’. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279–292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de. PMID:24682815

  10. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome.

    PubMed

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once annotated in GXB, multiple sample grouping has been made in order to create rank lists to allow easy data interpretation and comparison. The GXB tool also allows the users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in a web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.

  11. Large-Scale 3D Printing: The Way Forward

    NASA Astrophysics Data System (ADS)

    Jassmi, Hamad Al; Najjar, Fady Al; Ismail Mourad, Abdel-Hamid

    2018-03-01

    Research on small-scale 3D printing has rapidly evolved, and numerous industrial products have been tested and successfully applied. Nonetheless, research on large-scale 3D printing, directed to large-scale applications such as construction and automotive manufacturing, still demands a great deal of effort. Large-scale 3D printing is considered an interdisciplinary topic and requires establishing a blended knowledge base from numerous research fields including structural engineering, materials science, mechatronics, software engineering, artificial intelligence and architectural engineering. This review article summarizes key topics of relevance to new research trends on large-scale 3D printing, particularly pertaining to (1) technological solutions of additive construction (i.e. the 3D printers themselves), (2) materials science challenges, and (3) new design opportunities.

  12. Novel method to construct large-scale design space in lubrication process utilizing Bayesian estimation based on a small-scale design-of-experiment and small sets of large-scale manufacturing data.

    PubMed

    Maeda, Jin; Suzuki, Tatsuya; Takayama, Kozo

    2012-12-01

    A large-scale design space was constructed using a Bayesian estimation method with a small-scale design of experiments (DoE) and small sets of large-scale manufacturing data without enforcing a large-scale DoE. The small-scale DoE was conducted using various Froude numbers (X(1)) and blending times (X(2)) in the lubricant blending process for theophylline tablets. The response surfaces, design space, and their reliability of the compression rate of the powder mixture (Y(1)), tablet hardness (Y(2)), and dissolution rate (Y(3)) on a small scale were calculated using multivariate spline interpolation, a bootstrap resampling technique, and self-organizing map clustering. The constant Froude number was applied as a scale-up rule. Three experiments under an optimal condition and two experiments under other conditions were performed on a large scale. The response surfaces on the small scale were corrected to those on a large scale by Bayesian estimation using the large-scale results. Large-scale experiments under three additional sets of conditions showed that the corrected design space was more reliable than that on the small scale, even if there was some discrepancy in the pharmaceutical quality between the manufacturing scales. This approach is useful for setting up a design space in pharmaceutical development when a DoE cannot be performed at a commercial large manufacturing scale.
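
    The scale-up rule named in the abstract above, a constant Froude number, can be written out as a short worked relation. This is only a sketch: it assumes the common rotational form of the Froude number for a blender, with rotation speed N, characteristic diameter D, and gravitational acceleration g. The abstract does not state which form the authors used, and the subscripts s (small scale) and l (large scale) are introduced here purely for illustration.

```latex
\mathrm{Fr} = \frac{N^{2} D}{g},
\qquad
\mathrm{Fr}_{s} = \mathrm{Fr}_{l}
\;\Longrightarrow\;
N_{l} = N_{s}\sqrt{\frac{D_{s}}{D_{l}}}
```

    Under this assumption, a blender with four times the diameter would be run at half the rotation speed to keep the Froude number unchanged.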

  13. Scale-dependent habitat use by a large free-ranging predator, the Mediterranean fin whale

    NASA Astrophysics Data System (ADS)

    Cotté, Cédric; Guinet, Christophe; Taupier-Letage, Isabelle; Mate, Bruce; Petiau, Estelle

    2009-05-01

    Since the heterogeneity of oceanographic conditions drives abundance, distribution, and availability of prey, it is essential to understand how foraging predators interact with their dynamic environment at various spatial and temporal scales. We examined the spatio-temporal relationships between oceanographic features and abundance of fin whales (Balaenoptera physalus), the largest free-ranging predator in the Western Mediterranean Sea (WM), through two independent approaches. First, spatial modeling was used to estimate whale density, using waiting distance (the distance between detections) for fin whales along ferry routes across the WM, in relation to remotely sensed oceanographic parameters. At a large scale (basin and year), fin whales exhibited fidelity to the northern WM with a summer-aggregated and winter-dispersed pattern. At mesoscale (20-100 km), whales were found in colder, saltier (from an on-board system) and dynamic areas defined by steep altimetric and temperature gradients. Second, using an independent fin whale satellite tracking dataset, we showed that tracked whales were preferentially located in favorable habitats, i.e. in areas of high predicted densities as identified by our previous model using oceanographic data contemporaneous to the tracking period. We suggest that the large-scale fidelity corresponds to temporally and spatially predictable habitat of the whales' favorite prey, the northern krill (Meganyctiphanes norvegica), while mesoscale relationships are likely to identify areas of high prey concentration and availability.

  14. Clinical Correlations of Brain Lesion Location in Multiple Sclerosis: Voxel-Based Analysis of a Large Clinical Trial Dataset.

    PubMed

    Altermatt, Anna; Gaetano, Laura; Magon, Stefano; Häring, Dieter A; Tomic, Davorka; Wuerfel, Jens; Radue, Ernst-Wilhelm; Kappos, Ludwig; Sprenger, Till

    2018-05-29

    There is a limited correlation between white matter (WM) lesion load as determined by magnetic resonance imaging and disability in multiple sclerosis (MS). The reasons for this so-called clinico-radiological paradox are diverse and may, at least partly, relate to the fact that not just the overall lesion burden, but also the exact anatomical location of lesions predict the severity and type of disability. We aimed at studying the relationship between lesion distribution and disability using a voxel-based lesion probability mapping approach in a very large dataset of MS patients. T2-weighted lesion masks of 2348 relapsing-remitting MS patients were spatially normalized to standard stereotaxic space by non-linear registration. Relations between supratentorial WM lesion locations and disability measures were assessed using a non-parametric ANCOVA (Expanded Disability Status Scale [EDSS]; Multiple Sclerosis Functional Composite, and subscores; Modified Fatigue Impact Scale) or multinomial ordinal logistic regression (EDSS functional subscores). Data from 1907 (81%) patients were included in the analysis because of successful registration. The lesion mapping showed similar areas to be associated with the different disability scales: periventricular regions in temporal, frontal, and limbic lobes were predictive, mainly affecting the posterior thalamic radiation, the anterior, posterior, and superior parts of the corona radiata. In summary, significant associations between lesion location and clinical scores were found in periventricular areas. Such lesion clusters appear to be associated with impairment of different physical and cognitive abilities, probably because they affect commissural and long projection fibers, which are relevant WM pathways supporting many different brain functions.

  15. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

    NASA Astrophysics Data System (ADS)

    Li, Y.; Jiang, Y.; Yang, C. P.; Armstrong, E. M.; Huang, T.; Moroni, D. F.; McGibbney, L. J.

    2016-12-01

    Big oceanographic data have been produced, archived and made available online, but finding the right data for scientific research and application development is still a significant challenge. A long-standing problem in data discovery is how to find the interrelationships between keywords and data, as well as the intrarelationships of the two individually. Most previous research attempted to solve this problem by building domain-specific ontology either manually or through automatic machine learning techniques. The former is costly, labor intensive and hard to keep up-to-date, while the latter is prone to noise and may be difficult for humans to understand. Large-scale user behavior data modelling represents a largely untapped, unique, and valuable source for discovering semantic relationships among domain-specific vocabulary. In this article, we propose a search engine framework for mining and utilizing dataset relevancy from oceanographic dataset metadata, user behaviors, and existing ontology. The objective is to improve discovery accuracy of oceanographic data and reduce the time scientists need to discover, download and reformat data for their projects. Experiments and a search example show that the proposed search engine helps both scientists and general users search with better ranking results, recommendation, and ontology navigation.

  16. Large-scale structures of solar wind and dynamics of parameters in them

    NASA Astrophysics Data System (ADS)

    Yermolaev, Yuri; Lodkina, Irina; Yermolaev, Michael

    2017-04-01

    On the basis of the OMNI dataset and our catalog of large-scale solar wind (SW) phenomena (see web-site ftp://ftp.iki.rssi.ru/pub/omni/ and paper by Yermolaev et al., 2009) we study the temporal profiles of interplanetary and magnetospheric parameters in the following SW phenomena: interplanetary manifestation of coronal mass ejection (ICME) including magnetic cloud (MC) and Ejecta, Sheath (the compression region before ICME) and corotating interaction region (CIR, the compression region before a high-speed stream (HSS) of solar wind). To take into account a possible influence of other SW types, the following sequences of phenomena, which include all typical sequences of non-stationary SW events, are analyzed: (1) SW/CIR/SW, (2) SW/IS/CIR/SW, (3) SW/Ejecta/SW, (4) SW/Sheath/Ejecta/SW, (5) SW/IS/Sheath/Ejecta/SW, (6) SW/MC/SW, (7) SW/Sheath/MC/SW, (8) SW/IS/Sheath/MC/SW (where SW is undisturbed solar wind, and IS is interplanetary shock) (Yermolaev et al., 2015) using the method of double superposed epoch analysis for large numbers of events (Yermolaev et al., 2010). Similarities and distinctions of different SW phenomena depending on neighboring SW types and their geoeffectiveness are discussed. The work was supported by the Russian Science Foundation, project 16-12-10062. References: Yermolaev, Yu. I., N. S. Nikolaeva, I. G. Lodkina, and M. Yu. Yermolaev (2009), Catalog of Large-Scale Solar Wind Phenomena during 1976-2000, Cosmic Research, Vol. 47, No. 2, pp. 81-94. Yermolaev, Y. I., N. S. Nikolaeva, I. G. Lodkina, and M. Y. Yermolaev (2010), Specific interplanetary conditions for CIR-induced, Sheath-induced, and ICME-induced geomagnetic storms obtained by double superposed epoch analysis, Ann. Geophys., 28, pp. 2177-2186. Yermolaev, Yu. I., I. G. Lodkina, N. S. Nikolaeva, and M. Yu. Yermolaev (2015), Dynamics of large-scale solar wind streams obtained by the double superposed epoch analysis, J. Geophys. Res. Space Physics, 120, doi:10.1002/2015JA021274.
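
    The superposed epoch idea mentioned in the abstract above can be illustrated with a minimal Python sketch. It shows only the plain, single-reference form: windows of a time series are aligned on each event time and averaged. The double variant cited in the abstract is not reproduced here, and the function name, parameters, and use of NumPy are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def superposed_epoch(series, event_indices, before, after):
    """Average `series` in windows aligned on each event index.

    Returns (lags, mean_profile). Events whose window would run off
    either end of the series are skipped.
    """
    series = np.asarray(series, dtype=float)
    lags = np.arange(-before, after + 1)
    windows = [
        series[i - before : i + after + 1]
        for i in event_indices
        if i - before >= 0 and i + after < len(series)
    ]
    if not windows:
        raise ValueError("no event has a complete window")
    # nanmean ignores NaN samples, so data gaps do not wipe out a lag.
    return lags, np.nanmean(np.vstack(windows), axis=0)
```

    A call such as `superposed_epoch(values, onsets, before=24, after=48)` would return the mean profile from 24 samples before to 48 samples after each onset, where `values` and `onsets` are hypothetical inputs.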

  17. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing

    PubMed Central

    2013-01-01

    Background: Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Results: Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Conclusions: Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of

  18. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were established (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter- and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.

  19. Functional CAR models for large spatially correlated functional datasets.

    PubMed

    Zhang, Lin; Baladandayuthapani, Veerabhadran; Zhu, Hongxiao; Baggerly, Keith A; Majewski, Tadeusz; Czerniak, Bogdan A; Morris, Jeffrey S

    2016-01-01

    We develop a functional conditional autoregressive (CAR) model for spatially correlated data for which functions are collected on areal units of a lattice. Our model performs functional response regression while accounting for spatial correlations with potentially nonseparable and nonstationary covariance structure, in both the space and functional domains. We show theoretically that our construction leads to a CAR model at each functional location, with spatial covariance parameters varying and borrowing strength across the functional domain. Using basis transformation strategies, the nonseparable spatial-functional model is computationally scalable to enormous functional datasets, generalizable to different basis functions, and can be used on functions defined on higher dimensional domains such as images. Through simulation studies, we demonstrate that accounting for the spatial correlation in our modeling leads to improved functional regression performance. Applied to a high-throughput spatially correlated copy number dataset, the model identifies genetic markers not identified by comparable methods that ignore spatial correlations.

  20. ProteinInferencer: Confident protein identification and multiple experiment comparison for large scale proteomics projects.

    PubMed

    Zhang, Yaoyang; Xu, Tao; Shan, Bing; Hart, Jonathan; Aslanian, Aaron; Han, Xuemei; Zong, Nobel; Li, Haomin; Choi, Howard; Wang, Dong; Acharya, Lipi; Du, Lisa; Vogt, Peter K; Ping, Peipei; Yates, John R

    2015-11-03

    Shotgun proteomics generates valuable information from large-scale and target protein characterizations, including protein expression, protein quantification, protein post-translational modifications (PTMs), protein localization, and protein-protein interactions. Typically, peptides derived from proteolytic digestion, rather than intact proteins, are analyzed by mass spectrometers because peptides are more readily separated, ionized and fragmented. The amino acid sequences of peptides can be interpreted by matching the observed tandem mass spectra to theoretical spectra derived from a protein sequence database. Identified peptides serve as surrogates for their proteins and are often used to establish what proteins were present in the original mixture and to quantify protein abundance. Two major issues exist for assigning peptides to their originating protein. The first issue is maintaining a desired false discovery rate (FDR) when comparing or combining multiple large datasets generated by shotgun analysis and the second issue is properly assigning peptides to proteins when homologous proteins are present in the database. Herein we demonstrate a new computational tool, ProteinInferencer, which can be used for protein inference with both small- or large-scale data sets to produce a well-controlled protein FDR. In addition, ProteinInferencer introduces confidence scoring for individual proteins, which makes protein identifications evaluable. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015. Published by Elsevier B.V.

  1. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    NASA Astrophysics Data System (ADS)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global, delayed-time-mode, validated in-situ ocean temperature and salinity datasets, distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the ARGO DAC, but profiles are also extracted from the World Ocean Database, along with TESAC profiles from GTSPP. In the case of CORA, data from the EUROGOOS Regional Operational Observing System (ROOS), operated by European institutes not managed by National Data Centres, and other profile datasets provided by scientific sources can also be found (sea mammal profiles from MEOP, XBT datasets from cruises ...). (EN4 also takes data from the ASBO dataset to supplement observations in the Arctic.) The first advantage of this new merged product is enhanced space and time coverage at global and European scales for the period from 1950 until a year before the current year. This product is updated once a year, and T&S gridded fields are also generated for the period 1990 to year n-1. The enhancement compared to the previous CORA product will be presented. Despite the fact that the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. A new study, started in 2016, aims to compare both validation procedures to move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation. A reference dataset composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements were made with a wide range of instruments (XBTs, CTDs, Argo floats, instrumented sea mammals, ...), covering the global ocean. The reference dataset has been validated simultaneously by both teams. An exhaustive comparison of the

  2. A relativistic signature in large-scale structure

    NASA Astrophysics Data System (ADS)

    Bartolo, Nicola; Bertacca, Daniele; Bruni, Marco; Koyama, Kazuya; Maartens, Roy; Matarrese, Sabino; Sasaki, Misao; Verde, Licia; Wands, David

    2016-09-01

    In General Relativity, the constraint equation relating metric and density perturbations is inherently nonlinear, leading to an effective non-Gaussianity in the dark matter density field on large scales-even if the primordial metric perturbation is Gaussian. Intrinsic non-Gaussianity in the large-scale dark matter overdensity in GR is real and physical. However, the variance smoothed on a local physical scale is not correlated with the large-scale curvature perturbation, so that there is no relativistic signature in the galaxy bias when using the simplest model of bias. It is an open question whether the observable mass proxies such as luminosity or weak lensing correspond directly to the physical mass in the simple halo bias model. If not, there may be observables that encode this relativistic signature.

  3. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata) gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.

  4. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

    PubMed Central

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, samples were grouped in multiple ways in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616

  5. CLAST: CUDA implemented large-scale alignment search tool.

    PubMed

    Yano, Masahiro; Mori, Hiroshi; Akiyama, Yutaka; Yamada, Takuji; Kurokawa, Ken

    2014-12-11

    Metagenomics is a powerful methodology to study microbial communities, but it is highly dependent on nucleotide sequence similarity searching against sequence databases. Metagenomic analyses with next-generation sequencing technologies produce enormous numbers of reads from microbial communities, and many reads are derived from microbes whose genomes have not yet been sequenced, limiting the usefulness of existing sequence similarity search tools. Therefore, there is a clear need for a sequence similarity search tool that can rapidly detect weak similarity in large datasets. We developed a tool, which we named CLAST (CUDA implemented large-scale alignment search tool), that enables analyses of millions of reads and thousands of reference genome sequences, and runs on NVIDIA Fermi architecture graphics processing units. CLAST has four main advantages over existing alignment tools. First, CLAST was capable of identifying sequence similarities ~80.8 times faster than BLAST and 9.6 times faster than BLAT. Second, CLAST executes global alignment as the default (local alignment is also an option), enabling CLAST to assign reads to taxonomic and functional groups based on evolutionarily distant nucleotide sequences with high accuracy. Third, CLAST does not need a preprocessed sequence database like Burrows-Wheeler Transform-based tools, and this enables CLAST to incorporate large, frequently updated sequence databases. Fourth, CLAST requires <2 GB of main memory, making it possible to run CLAST on a standard desktop computer or server node. CLAST achieved very high speed (similar to the Burrows-Wheeler Transform-based Bowtie 2 for long reads) and sensitivity (equal to BLAST, BLAT, and FR-HIT) without the need for extensive database preprocessing or a specialized computing platform. Our results demonstrate that CLAST has the potential to be one of the most powerful and realistic approaches to analyze the massive amount of sequence data from next-generation sequencing

  6. Sound production due to large-scale coherent structures

    NASA Technical Reports Server (NTRS)

    Gatski, T. B.

    1979-01-01

    The acoustic pressure fluctuations due to large-scale finite amplitude disturbances in a free turbulent shear flow are calculated. The flow is decomposed into three component scales; the mean motion, the large-scale wave-like disturbance, and the small-scale random turbulence. The effect of the large-scale structure on the flow is isolated by applying both a spatial and phase average on the governing differential equations and by initially taking the small-scale turbulence to be in energetic equilibrium with the mean flow. The subsequent temporal evolution of the flow is computed from global energetic rate equations for the different component scales. Lighthill's theory is then applied to the region with the flowfield as the source and an observer located outside the flowfield in a region of uniform velocity. Since the time history of all flow variables is known, a minimum of simplifying assumptions for the Lighthill stress tensor is required, including no far-field approximations. A phase average is used to isolate the pressure fluctuations due to the large-scale structure, and also to isolate the dynamic process responsible. Variation of mean square pressure with distance from the source is computed to determine the acoustic far-field location and decay rate, and, in addition, spectra at various acoustic field locations are computed and analyzed. Also included are the effects of varying the growth and decay of the large-scale disturbance on the sound produced.

  7. Hydrometeorological variability on a large French catchment and its relation to large-scale circulation across temporal scales

    NASA Astrophysics Data System (ADS)

    Massei, Nicolas; Dieppois, Bastien; Fritier, Nicolas; Laignel, Benoit; Debret, Maxime; Lavers, David; Hannah, David

    2015-04-01

    In the present context of global changes, considerable efforts have been deployed by the hydrological scientific community to improve our understanding of the impacts of climate fluctuations on water resources. Both observational and modeling studies have been extensively employed to characterize hydrological changes and trends, assess the impact of climate variability or provide future scenarios of water resources. With the aim of a better understanding of hydrological changes, it is of crucial importance to determine how and to what extent trends and long-term oscillations detectable in hydrological variables are linked to global climate oscillations. In this work, we develop an approach associating large-scale/local-scale correlation, empirical statistical downscaling and wavelet multiresolution decomposition of monthly precipitation and streamflow over the Seine river watershed, and the North Atlantic sea level pressure (SLP) in order to gain additional insights on the atmospheric patterns associated with the regional hydrology. We hypothesized that: i) atmospheric patterns may change according to the different temporal wavelengths defining the variability of the signals; and ii) definition of those hydrological/circulation relationships for each temporal wavelength may improve the determination of large-scale predictors of local variations. The results showed that the large-scale/local-scale links were not necessarily constant according to time-scale (i.e. for the different frequencies characterizing the signals), resulting in changing spatial patterns across scales. This was then taken into account by developing an empirical statistical downscaling (ESD) modeling approach which integrated discrete wavelet multiresolution analysis for reconstructing local hydrometeorological processes (predictand: precipitation and streamflow on the Seine river catchment) based on a large-scale predictor (SLP over the Euro-Atlantic sector) on a monthly time-step. This approach

  8. The Challenge of Large-Scale Literacy Improvement

    ERIC Educational Resources Information Center

    Levin, Ben

    2010-01-01

    This paper discusses the challenge of making large-scale improvements in literacy in schools across an entire education system. Despite growing interest and rhetoric, there are very few examples of sustained, large-scale change efforts around school-age literacy. The paper reviews 2 instances of such efforts, in England and Ontario. After…

  9. Large-scale influences in near-wall turbulence.

    PubMed

    Hutchins, Nicholas; Marusic, Ivan

    2007-03-15

    Hot-wire data acquired in a high Reynolds number facility are used to illustrate the need for adequate scale separation when considering the coherent structure in wall-bounded turbulence. It is found that a large-scale motion in the log region becomes increasingly comparable in energy to the near-wall cycle as the Reynolds number increases. Through decomposition of fluctuating velocity signals, it is shown that this large-scale motion has a distinct modulating influence on the small-scale energy (akin to amplitude modulation). Reassessment of DNS data, in light of these results, shows similar trends, with the rate and intensity of production due to the near-wall cycle subject to a modulating influence from the largest-scale motions.

  10. Generation of large-scale magnetic fields by small-scale dynamo in shear flows

    NASA Astrophysics Data System (ADS)

    Squire, Jonathan; Bhattacharjee, Amitava

    2015-11-01

    A new mechanism for turbulent mean-field dynamo is proposed, in which the magnetic fluctuations resulting from a small-scale dynamo drive the generation of large-scale magnetic fields. This is in stark contrast to the common idea that small-scale magnetic fields should be harmful to large-scale dynamo action. These dynamos occur in the presence of large-scale velocity shear and do not require net helicity, resulting from off-diagonal components of the turbulent resistivity tensor as the magnetic analogue of the "shear-current" effect. The dynamo is studied using a variety of computational and analytic techniques, both when the magnetic fluctuations arise self-consistently through the small-scale dynamo and in lower Reynolds number regimes. Given the inevitable existence of non-helical small-scale magnetic fields in turbulent plasmas, as well as the generic nature of velocity shear, the suggested mechanism may help to explain generation of large-scale magnetic fields across a wide range of astrophysical objects. This work was supported by a Procter Fellowship at Princeton University, and the US Department of Energy Grant DE-AC02-09-CH11466.

  11. A Framework for Spatial Interaction Analysis Based on Large-Scale Mobile Phone Data

    PubMed Central

    Li, Weifeng; Cheng, Xiaoyun; Guo, Gaohua

    2014-01-01

    An overall understanding of spatial interaction and exact knowledge of its dynamic evolution are required in urban and transportation planning. This study aimed to analyze spatial interaction based on large-scale mobile phone data. The newly arisen mass dataset required a new methodology compatible with its peculiar characteristics. A three-stage framework was proposed in this paper, including data preprocessing, critical activity identification, and spatial interaction measurement. The proposed framework introduced frequent pattern mining and measured the spatial interaction by the obtained association. A case study of three communities in Shanghai was carried out as verification of the proposed method and demonstration of its practical application. The spatial interaction patterns and the representative features proved the rationality of the proposed framework. PMID:25435865

  12. Nearest neighbor density ratio estimation for large-scale applications in astronomy

    NASA Astrophysics Data System (ADS)

    Kremer, J.; Gieseke, F.; Steenstrup Pedersen, K.; Igel, C.

    2015-09-01

    In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art.
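    The re-weighting idea in this abstract can be illustrated with a minimal sketch. It assumes one-dimensional features and one simple form of a nearest-neighbour ratio estimator: for each labeled (training) point, count the training and unlabeled (test) points that fall inside the ball reaching its k-th nearest training neighbour, then scale by the sample sizes. The function names and the exact normalisation are illustrative assumptions, not taken from the paper, and the cross-validated choice of neighbourhood size is not shown.

```python
from bisect import bisect_left, bisect_right


def count_within(sorted_values, center, radius):
    """Count values v with center - radius <= v <= center + radius."""
    lo = bisect_left(sorted_values, center - radius)
    hi = bisect_right(sorted_values, center + radius)
    return hi - lo


def knn_density_ratio_weights(train, test, k):
    """Estimate p_test(x) / p_train(x) at each 1-D training point.

    Illustrative assumption: the neighbourhood of x is the interval that
    reaches its k-th nearest training point (x itself counts as the first).
    """
    n_train, n_test = len(train), len(test)
    if not 2 <= k <= n_train:
        raise ValueError("k must satisfy 2 <= k <= len(train)")
    if n_test == 0:
        raise ValueError("test sample must not be empty")
    train_sorted = sorted(train)
    test_sorted = sorted(test)
    weights = []
    for x in train:
        # Radius of the ball that contains the k nearest training points.
        radius = sorted(abs(x - t) for t in train)[k - 1]
        in_train = count_within(train_sorted, x, radius)  # >= k when ties occur
        in_test = count_within(test_sorted, x, radius)
        # Ratio of local counts, corrected for the different sample sizes.
        weights.append((in_test / in_train) * (n_train / n_test))
    return weights


if __name__ == "__main__":
    labeled = [0.0, 1.0, 2.0, 3.0]
    unlabeled = [2.0, 3.0]  # the unlabeled data are concentrated at larger x
    print(knn_density_ratio_weights(labeled, unlabeled, k=2))
```

    Training points in regions where unlabeled data are dense receive larger weights, and points in regions with no unlabeled data receive zero weight. The brute-force distance sort is quadratic in the training size; a spatial index would be needed at the sample sizes the abstract has in mind.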

  13. PKI security in large-scale healthcare networks.

    PubMed

    Mantas, Georgios; Lymberopoulos, Dimitrios; Komninos, Nikos

    2012-06-01

    During the past few years, many Public Key Infrastructures (PKIs) have been proposed for healthcare networks in order to ensure secure communication services and exchange of data among healthcare professionals. However, these healthcare PKIs face a plethora of challenges, particularly when deployed over large-scale healthcare networks. In this paper, we propose a PKI to ensure security in a large-scale Internet-based healthcare network connecting a wide spectrum of healthcare units geographically distributed within a wide region. Furthermore, the proposed PKI helps address the trust issues that arise in a large-scale healthcare network including multi-domain PKIs.

  14. Using Browser Notebooks to Analyse Big Atmospheric Data-sets in the Cloud

    NASA Astrophysics Data System (ADS)

    Robinson, N.; Tomlinson, J.; Arribas, A.; Prudden, R.

    2016-12-01

    We present an account of our experience building an ecosystem for the analysis of big atmospheric data-sets. By using modern technologies we have developed a prototype platform which is scalable and capable of analysing very large atmospheric datasets. We tested different big-data ecosystems such as Hadoop MapReduce, Spark and Dask, in order to find the one which was best suited for analysis of multidimensional binary data such as NetCDF. We make extensive use of infrastructure-as-code and containerisation to provide a platform which is reusable, and which can scale to accommodate changes in demand. We make this platform readily accessible using browser based notebooks. As a result, analysts with minimal technology experience can, in tens of lines of Python, make interactive data-visualisation web pages, which can analyse very large amounts of data using cutting edge big-data technology.

  15. Soil chemistry in lithologically diverse datasets: the quartz dilution effect

    USGS Publications Warehouse

    Bern, Carleton R.

    2009-01-01

    National- and continental-scale soil geochemical datasets are likely to move our understanding of broad soil geochemistry patterns forward significantly. Patterns of chemistry and mineralogy delineated from these datasets are strongly influenced by the composition of the soil parent material, which itself is largely a function of lithology and particle size sorting. Such controls present a challenge by obscuring subtler patterns arising from subsequent pedogenic processes. Here the effect of quartz concentration is examined in moist-climate soils from a pilot dataset of the North American Soil Geochemical Landscapes Project. Due to variable and high quartz contents (6.2–81.7 wt.%), and its residual and inert nature in soil, quartz is demonstrated to influence broad patterns in soil chemistry. A dilution effect is observed whereby concentrations of various elements are significantly and strongly negatively correlated with quartz. Quartz content drives artificial positive correlations between concentrations of some elements and obscures negative correlations between others. Unadjusted soil data show the highly mobile base cations Ca, Mg, and Na to be often strongly positively correlated with intermediately mobile Al or Fe, and generally uncorrelated with the relatively immobile high-field-strength elements (HFS) Ti and Nb. Both patterns are contrary to broad expectations for soils being weathered and leached. After transforming bulk soil chemistry to a quartz-free basis, the base cations are generally uncorrelated with Al and Fe, and negative correlations generally emerge with the HFS elements. Quartz-free element data may be a useful tool for elucidating patterns of weathering or parent-material chemistry in large soil datasets.
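    The quartz-free transformation mentioned above can be sketched as a simple renormalisation. The abstract does not state the formula, so the version below is an assumption: each bulk concentration is divided by the non-quartz mass fraction of the sample. The function name and the example values are hypothetical.

```python
def to_quartz_free(concentration, quartz_wt_pct):
    """Re-express a bulk concentration on a quartz-free basis.

    Assumed formula (not given in the abstract):
        c_quartz_free = c_bulk / (1 - quartz_wt_pct / 100)
    """
    if not 0.0 <= quartz_wt_pct < 100.0:
        raise ValueError("quartz content must be in [0, 100) wt.%")
    return concentration / (1.0 - quartz_wt_pct / 100.0)


if __name__ == "__main__":
    # Two hypothetical soils whose non-quartz fraction has the same Al content
    # but which differ in how strongly quartz dilutes the bulk sample.
    print(to_quartz_free(4.0, 50.0))  # 8.0
    print(to_quartz_free(2.0, 75.0))  # 8.0
```

    In this toy case the bulk values (4.0 and 2.0) differ only because of dilution, and the transformation removes that difference, which is the kind of artificial pattern the abstract describes.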

  16. Large-scale velocities and primordial non-Gaussianity

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schmidt, Fabian

    2010-09-15

    We study the peculiar velocities of density peaks in the presence of primordial non-Gaussianity. Rare, high-density peaks in the initial density field can be identified with tracers such as galaxies and clusters in the evolved matter distribution. The distribution of relative velocities of peaks is derived in the large-scale limit using two different approaches based on a local biasing scheme. Both approaches agree, and show that halos still stream with the dark matter locally as well as statistically, i.e. they do not acquire a velocity bias. Nonetheless, even a moderate degree of (not necessarily local) non-Gaussianity induces a significant skewness (≈0.1-0.2) in the relative velocity distribution, making it a potentially interesting probe of non-Gaussianity on intermediate to large scales. We also study two-point correlations in redshift space. The well-known Kaiser formula is still a good approximation on large scales, if the Gaussian halo bias is replaced with its (scale-dependent) non-Gaussian generalization. However, there are additional terms not encompassed by this simple formula which become relevant on smaller scales (k ≳ 0.01 h/Mpc). Depending on the allowed level of non-Gaussianity, these could be of relevance for future large spectroscopic surveys.

  17. Privacy-preserving record linkage on large real world datasets.

    PubMed

    Randall, Sean M; Ferrante, Anna M; Boyd, James H; Bauer, Jacqueline K; Semmens, James B

    2014-08-01

    Record linkage typically involves the use of dedicated linkage units who are supplied with personally identifying information to determine individuals from within and across datasets. The personally identifying information supplied to linkage units is separated from clinical information prior to release by data custodians. While this substantially reduces the risk of disclosure of sensitive information, some residual risks still exist and remain a concern for some custodians. In this paper we trial a method of record linkage which reduces privacy risk still further on large real world administrative data. The method uses encrypted personal identifying information (bloom filters) in a probability-based linkage framework. The privacy preserving linkage method was tested on ten years of New South Wales (NSW) and Western Australian (WA) hospital admissions data, comprising in total over 26 million records. No difference in linkage quality was found when the results were compared to traditional probabilistic methods using full unencrypted personal identifiers. This presents as a possible means of reducing privacy risks related to record linkage in population level research studies. It is hoped that through adaptations of this method or similar privacy preserving methods, risks related to information disclosure can be reduced so that the benefits of linked research taking place can be fully realised. Copyright © 2013 Elsevier Inc. All rights reserved.
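    The Bloom-filter encoding referred to above can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration used in the study: the filter length, number of hash functions, bigram padding, shared key, and the use of HMAC-SHA256 are all hypothetical choices, and the Dice coefficient stands in for whatever comparison score feeds the probabilistic linkage framework.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, size=256, num_hashes=4, secret=b"shared-secret"):
    """Encode a field as the set of bit positions switched on in a Bloom filter.

    All parameters are illustrative; `secret` is a hypothetical key agreed
    between data custodians so that the linkage unit cannot rebuild the filter.
    """
    bits = set()
    for gram in bigrams(text):
        for seed in range(num_hashes):
            message = bytes([seed]) + gram.encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(bits)


def dice_similarity(bits_a, bits_b):
    """Dice coefficient between two encoded fields (1.0 means identical)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    smith = bloom_encode("Smith")
    print(dice_similarity(smith, bloom_encode("Smyth")))
    print(dice_similarity(smith, bloom_encode("Jones")))
```

    Because similar strings share most of their bigrams, their filters share most of their set bits, so approximate matching remains possible without the linkage unit seeing the names themselves.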

  18. Automatic Diabetic Macular Edema Detection in Fundus Images Using Publicly Available Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large scale screening environment DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing. Our algorithm is robust to segmentation uncertainties, does not need ground truth at lesion level, and is very fast, generating a diagnosis in an average of 4.4 seconds per image on a 2.6 GHz platform with an unoptimised Matlab implementation.

  19. Large-scale regions of antimatter

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grobov, A. V., E-mail: alexey.grobov@gmail.com; Rubin, S. G., E-mail: sgrubin@mephi.ru

    2015-07-15

    A modified mechanism of the formation of large-scale antimatter regions is proposed. Antimatter appears owing to fluctuations of a complex scalar field that carries a baryon charge in the inflation era.

  20. Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes.

    PubMed

    Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T

    2015-10-12

    Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86 % of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. 
In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species
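
    The Robinson-Foulds comparison of tree topologies mentioned in this record can be sketched as follows. This is an illustrative simplification, not the software used in the study: it assumes rooted trees written as nested tuples of taxon names and counts clades present in one tree but not the other, whereas the standard definition works on bipartitions of unrooted trees. The taxon labels are placeholders.

```python
def clades(tree):
    """Return (all leaves, set of non-root clades) for a rooted tree.

    A tree is a taxon name (str) or a tuple of subtrees.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these taxa
    return all_leaves, found

def rf_distance(tree_a, tree_b):
    """Number of clades found in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must be defined on the same taxa")
    return len(clades_a ^ clades_b)

if __name__ == "__main__":
    tree_1 = ((("A", "B"), "C"), ("D", "E"))
    tree_2 = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(tree_1, tree_2))
```

    A distance of zero means the two trees imply the same set of clades; larger values indicate more topological disagreement.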

  1. Towards large-scale mapping of urban three-dimensional structure using Landsat imagery and global elevation datasets

    NASA Astrophysics Data System (ADS)

    Wang, P.; Huang, C.

    2017-12-01

    The three-dimensional (3D) structure of buildings and infrastructures is fundamental to understanding and modelling the impacts and challenges of urbanization in terms of energy use, carbon emissions, and earthquake vulnerabilities. However, spatially detailed maps of urban 3D structure have been scarce, particularly in fast-changing developing countries. We present here a novel methodology to map the volume of buildings and infrastructures at 30 meter resolution using a synergy of Landsat imagery and openly available global digital surface models (DSMs), including the Shuttle Radar Topography Mission (SRTM), ASTER Global Digital Elevation Map (GDEM), ALOS World 3D - 30m (AW3D30), and the recently released global DSM from the TanDEM-X mission. Our method builds on the concept of object-based height profile to extract height metrics from the DSMs and uses a machine learning algorithm to predict height and volume from the height metrics. We have tested this algorithm across all of England and assessed our results using Lidar measurements in 25 English cities. Our initial assessments achieved an RMSE of 1.4 m (R² = 0.72) for building height and an RMSE of 1208.7 m³ (R² = 0.69) for building volume, demonstrating the potential of large-scale applications and fully automated mapping of urban structure.
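
    For readers unfamiliar with the two accuracy figures quoted above, the sketch below shows how RMSE and R-squared are commonly computed from reference and predicted values. The abstract does not say whether R-squared is the coefficient of determination or a squared correlation; the coefficient of determination is assumed here, and the sample numbers are invented for illustration.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    lidar_height = [1.0, 2.0, 3.0, 4.0]   # hypothetical reference heights (m)
    model_height = [2.0, 2.0, 3.0, 3.0]   # hypothetical predicted heights (m)
    print(rmse(lidar_height, model_height))
    print(r_squared(lidar_height, model_height))
```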

  2. Climate variability rather than overstocking causes recent large scale cover changes of Tibetan pastures.

    PubMed

    Lehnert, L W; Wesche, K; Trachte, K; Reudenbach, C; Bendix, J

    2016-04-13

    The Tibetan Plateau (TP) is a globally important "water tower" that provides water for nearly 40% of the world's population. This supply function is claimed to be threatened by pasture degradation on the TP and the associated loss of water regulation functions. However, neither potential large scale degradation changes nor their drivers are known. Here, we analyse trends in a high-resolution dataset of grassland cover to determine the interactions among vegetation dynamics, climate change and human impacts on the TP. The results reveal that vegetation changes have regionally different triggers: While the vegetation cover has increased since the year 2000 in the north-eastern part of the TP due to an increase in precipitation, it has declined in the central and western parts of the TP due to rising air temperature and declining precipitation. Increasing livestock numbers as a result of land use changes exacerbated the negative trends but were not their exclusive driver. Thus, we conclude that climate variability instead of overgrazing has been the primary cause for large scale vegetation cover changes on the TP since the new millennium. Since areas of positive and negative changes are almost equal in extent, pasture degradation is not generally proceeding.

  3. Climate variability rather than overstocking causes recent large scale cover changes of Tibetan pastures

    PubMed Central

    Lehnert, L. W.; Wesche, K.; Trachte, K.; Reudenbach, C.; Bendix, J.

    2016-01-01

    The Tibetan Plateau (TP) is a globally important “water tower” that provides water for nearly 40% of the world’s population. This supply function is claimed to be threatened by pasture degradation on the TP and the associated loss of water regulation functions. However, neither potential large scale degradation changes nor their drivers are known. Here, we analyse trends in a high-resolution dataset of grassland cover to determine the interactions among vegetation dynamics, climate change and human impacts on the TP. The results reveal that vegetation changes have regionally different triggers: While the vegetation cover has increased since the year 2000 in the north-eastern part of the TP due to an increase in precipitation, it has declined in the central and western parts of the TP due to rising air temperature and declining precipitation. Increasing livestock numbers as a result of land use changes exacerbated the negative trends but were not their exclusive driver. Thus, we conclude that climate variability instead of overgrazing has been the primary cause for large scale vegetation cover changes on the TP since the new millennium. Since areas of positive and negative changes are almost equal in extent, pasture degradation is not generally proceeding. PMID:27073126

  4. Climate variability rather than overstocking causes recent large scale cover changes of Tibetan pastures

    NASA Astrophysics Data System (ADS)

    Lehnert, L. W.; Wesche, K.; Trachte, K.; Reudenbach, C.; Bendix, J.

    2016-04-01

    The Tibetan Plateau (TP) is a globally important “water tower” that provides water for nearly 40% of the world’s population. This supply function is claimed to be threatened by pasture degradation on the TP and the associated loss of water regulation functions. However, neither potential large scale degradation changes nor their drivers are known. Here, we analyse trends in a high-resolution dataset of grassland cover to determine the interactions among vegetation dynamics, climate change and human impacts on the TP. The results reveal that vegetation changes have regionally different triggers: While the vegetation cover has increased since the year 2000 in the north-eastern part of the TP due to an increase in precipitation, it has declined in the central and western parts of the TP due to rising air temperature and declining precipitation. Increasing livestock numbers as a result of land use changes exacerbated the negative trends but were not their exclusive driver. Thus, we conclude that climate variability instead of overgrazing has been the primary cause for large scale vegetation cover changes on the TP since the new millennium. Since areas of positive and negative changes are almost equal in extent, pasture degradation is not generally proceeding.

  5. Towards large scale multi-target tracking

    NASA Astrophysics Data System (ADS)

    Vo, Ba-Ngu; Vo, Ba-Tuong; Reuter, Stephan; Lam, Quang; Dietmayer, Klaus

    2014-06-01

    Multi-target tracking is intrinsically an NP-hard problem and the complexity of multi-target tracking solutions usually does not scale gracefully with problem size. Multi-target tracking for on-line applications involving a large number of targets is extremely challenging. This article demonstrates the capability of the random finite set approach to provide large scale multi-target tracking algorithms. In particular it is shown that an approximate filter known as the labeled multi-Bernoulli filter can simultaneously track one thousand five hundred targets in clutter on a standard laptop computer.

  6. The Expanded Large Scale Gap Test

    DTIC Science & Technology

    1987-03-01

    NSWC TR 86-32 DTIC THE EXPANDED LARGE SCALE GAP TEST BY T. P. LIDDIARD D. PRICE RESEARCH AND TECHNOLOGY DEPARTMENT MARCH 1987 Approved for public...arises, to reduce the spread in the LSGT 50% gap value.) The worst charges, such as those with the highest or lowest densities, the largest re-pressed...Arlington, VA 22217 PE 62314N INS3A 1 RJ14E31 7R4TBK 11 TITLE (Include Security Classification) The Expanded Large Scale Gap Test. 12. PERSONAL AUTHOR(S) T

  7. An informal paper on large-scale dynamic systems

    NASA Technical Reports Server (NTRS)

    Ho, Y. C.

    1975-01-01

    Large scale systems are defined as systems requiring more than one decision maker to control the system. Decentralized control and decomposition are discussed for large scale dynamic systems. Information and many-person decision problems are analyzed.

  8. On large-scale dynamo action at high magnetic Reynolds number

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cattaneo, F.; Tobias, S. M., E-mail: smt@maths.leeds.ac.uk

    2014-07-01

    We consider the generation of magnetic activity (dynamo waves) in the astrophysical limit of very large magnetic Reynolds number. We consider kinematic dynamo action for a system consisting of helical flow and large-scale shear. We demonstrate that large-scale dynamo waves persist at high Rm if the helical flow is characterized by a narrow band of spatial scales and the shear is large enough. However, for a wide band of scales the dynamo becomes small scale with a further increase of Rm, with dynamo waves re-emerging only if the shear is then increased. We show that at high Rm, the key effect of the shear is to suppress small-scale dynamo action, allowing large-scale dynamo action to be observed. We conjecture that this supports a general 'suppression principle': large-scale dynamo action can only be observed if there is a mechanism that suppresses the small-scale fluctuations.

  9. Large-scale dynamos in rapidly rotating plane layer convection

    NASA Astrophysics Data System (ADS)

    Bushby, P. J.; Käpylä, P. J.; Masada, Y.; Brandenburg, A.; Favier, B.; Guervilly, C.; Käpylä, M. J.

    2018-05-01

    Context. Convectively driven flows play a crucial role in the dynamo processes that are responsible for producing magnetic activity in stars and planets. It is still not fully understood why many astrophysical magnetic fields have a significant large-scale component. Aims: Our aim is to investigate the dynamo properties of compressible convection in a rapidly rotating Cartesian domain, focusing upon a parameter regime in which the underlying hydrodynamic flow is known to be unstable to a large-scale vortex instability. Methods: The governing equations of three-dimensional non-linear magnetohydrodynamics (MHD) are solved numerically. Different numerical schemes are compared and we propose a possible benchmark case for other similar codes. Results: In keeping with previous related studies, we find that convection in this parameter regime can drive a large-scale dynamo. The components of the mean horizontal magnetic field oscillate, leading to a continuous overall rotation of the mean field. Whilst the large-scale vortex instability dominates the early evolution of the system, the large-scale vortex is suppressed by the magnetic field and makes a negligible contribution to the mean electromotive force that is responsible for driving the large-scale dynamo. The cycle period of the dynamo is comparable to the ohmic decay time, with longer cycles for dynamos in convective systems that are closer to onset. In these particular simulations, large-scale dynamo action is found only when vertical magnetic field boundary conditions are adopted at the upper and lower boundaries. Strongly modulated large-scale dynamos are found at higher Rayleigh numbers, with periods of reduced activity (grand minima-like events) occurring during transient phases in which the large-scale vortex temporarily re-establishes itself, before being suppressed again by the magnetic field.

  10. ConTour: Data-Driven Exploration of Multi-Relational Datasets for Drug Discovery.

    PubMed

    Partl, Christian; Lex, Alexander; Streit, Marc; Strobelt, Hendrik; Wassermann, Anne-Mai; Pfister, Hanspeter; Schmalstieg, Dieter

    2014-12-01

    Large scale data analysis is nowadays a crucial part of drug discovery. Biologists and chemists need to quickly explore and evaluate potentially effective yet safe compounds based on many datasets that are in relationship with each other. However, there is a lack of tools that support them in these processes. To remedy this, we developed ConTour, an interactive visual analytics technique that enables the exploration of these complex, multi-relational datasets. At its core ConTour lists all items of each dataset in a column. Relationships between the columns are revealed through interaction: selecting one or multiple items in one column highlights and re-sorts the items in other columns. Filters based on relationships enable drilling down into the large data space. To identify interesting items in the first place, ConTour employs advanced sorting strategies, including strategies based on connectivity strength and uniqueness, as well as sorting based on item attributes. ConTour also introduces interactive nesting of columns, a powerful method to show the related items of a child column for each item in the parent column. Within the columns, ConTour shows rich attribute data about the items as well as information about the connection strengths to other datasets. Finally, ConTour provides a number of detail views, which can show items from multiple datasets and their associated data at the same time. We demonstrate the utility of our system in case studies conducted with a team of chemical biologists, who investigate the effects of chemical compounds on cells and need to understand the underlying mechanisms.

  11. Large-scale anisotropy of the cosmic microwave background radiation

    NASA Technical Reports Server (NTRS)

    Silk, J.; Wilson, M. L.

    1981-01-01

    Inhomogeneities in the large-scale distribution of matter inevitably lead to the generation of large-scale anisotropy in the cosmic background radiation. The dipole, quadrupole, and higher order fluctuations expected in an Einstein-de Sitter cosmological model have been computed. The dipole and quadrupole anisotropies are comparable to the measured values, and impose important constraints on the allowable spectrum of large-scale matter density fluctuations. A significant dipole anisotropy is generated by the matter distribution on scales greater than approximately 100 Mpc. The large-scale anisotropy is insensitive to the ionization history of the universe since decoupling, and cannot easily be reconciled with a galaxy formation theory that is based on primordial adiabatic density fluctuations.

  12. Landscape Characterization of Arctic Ecosystems Using Data Mining Algorithms and Large Geospatial Datasets

    NASA Astrophysics Data System (ADS)

    Langford, Z. L.; Kumar, J.; Hoffman, F. M.

    2015-12-01

    Observations indicate that over the past several decades, landscape processes in the Arctic have been changing or intensifying. A dynamic Arctic landscape has the potential to alter ecosystems across a broad range of scales. Accurate characterization is useful to understand the properties and organization of the landscape, optimal sampling network design, measurement and process upscaling and to establish a landscape-based framework for multi-scale modeling of ecosystem processes. This study seeks to delineate the landscape at Seward Peninsula of Alaska into ecoregions using large volumes (terabytes) of high spatial resolution satellite remote-sensing data. Defining high-resolution ecoregion boundaries is difficult because many ecosystem processes in Arctic ecosystems occur at small local to regional scales, which are often not resolved by coarse resolution satellites (e.g., MODIS). We seek to use data-fusion techniques and data analytics algorithms applied to Phased Array type L-band Synthetic Aperture Radar (PALSAR), Interferometric Synthetic Aperture Radar (IFSAR), Satellite for Observation of Earth (SPOT), WorldView-2, WorldView-3, and QuickBird-2 to develop high-resolution (~5 m) ecoregion maps for multiple time periods. Traditional analysis methods and algorithms are insufficient for analyzing and synthesizing such large geospatial data sets, and those algorithms rarely scale out onto large distributed-memory parallel computer systems. We seek to develop computationally efficient algorithms and techniques using high-performance computing for characterization of Arctic landscapes. We will apply a variety of data analytics algorithms, such as cluster analysis, complex object-based image analysis (COBIA), and neural networks. We also propose to use representativeness analysis within the Seward Peninsula domain to determine optimal sampling locations for fine-scale measurements. 
This methodology should provide an initial framework for analyzing dynamic landscape

  13. A reanalysis dataset of the South China Sea.

    PubMed

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992-2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability.

  14. Economically viable large-scale hydrogen liquefaction

    NASA Astrophysics Data System (ADS)

    Cardella, U.; Decker, L.; Klein, H.

    2017-02-01

    The liquid hydrogen demand, particularly driven by clean energy applications, will rise in the near future. As industrial large scale liquefiers will play a major role within the hydrogen supply chain, production capacity will have to increase by a multiple of today’s typical sizes. The main goal is to reduce the total cost of ownership for these plants by increasing energy efficiency with innovative and simple process designs, optimized in capital expenditure. New concepts must ensure a manageable plant complexity and flexible operability. In the phase of process development and selection, a dimensioning of key equipment for large scale liquefiers, such as turbines and compressors as well as heat exchangers, must be performed iteratively to ensure technological feasibility and maturity. Further critical aspects related to hydrogen liquefaction, e.g. fluid properties, ortho-para hydrogen conversion, and coldbox configuration, must be analysed in detail. This paper provides an overview on the approach, challenges and preliminary results in the development of efficient as well as economically viable concepts for large-scale hydrogen liquefaction.

  15. Large-Scale Coronal Heating from the Solar Magnetic Network

    NASA Technical Reports Server (NTRS)

    Falconer, David A.; Moore, Ronald L.; Porter, Jason G.; Hathaway, David H.

    1999-01-01

    In Fe XII images from SOHO/EIT, the quiet solar corona shows structure on scales ranging from sub-supergranular (i.e., bright points and coronal network) to multi-supergranular. In Falconer et al. 1998 (Ap.J., 501, 386) we suppressed the large-scale background and found that the network-scale features are predominantly rooted in the magnetic network lanes at the boundaries of the supergranules. The emission of the coronal network and bright points contributes only about 5% of the entire quiet solar coronal Fe XII emission. Here we investigate the large-scale corona, the supergranular and larger-scale structure that we had previously treated as a background, and that emits 95% of the total Fe XII emission. We compare the dim and bright halves of the large-scale corona and find that the bright half is 1.5 times brighter than the dim half, has an order of magnitude greater area of bright point coverage, has three times brighter coronal network, and has about 1.5 times more magnetic flux than the dim half. These results suggest that the brightness of the large-scale corona is more closely related to the large-scale total magnetic flux than to bright point activity. We conclude that in the quiet sun: (1) Magnetic flux is modulated (concentrated/diluted) on size scales larger than supergranules. (2) The large-scale enhanced magnetic flux gives an enhanced, more active, magnetic network and an increased incidence of network bright point formation. (3) The heating of the large-scale corona is dominated by more widespread, but weaker, network activity than that which heats the bright points. This work was funded by the Solar Physics Branch of NASA's Office of Space Science through the SR&T Program and the SEC Guest Investigator Program.

  16. Large- and Very-Large-Scale Motions in Katabatic Flows Over Steep Slopes

    NASA Astrophysics Data System (ADS)

    Giometto, M. G.; Fang, J.; Salesky, S.; Parlange, M. B.

    2016-12-01

    Evidence of large- and very-large-scale motions populating the boundary layer in katabatic flows over steep slopes is presented via direct numerical simulations (DNSs). DNSs are performed at a modified Reynolds number (Rem = 967), considering four sloping angles (α = 60°, 70°, 80° and 90°). Large coherent structures prove to be strongly dependent on the inclination of the underlying surface. Spectra and co-spectra consistently show signatures of large-scale motions (LSMs), with streamwise extension on the order of the boundary layer thickness. A second low-wavenumber mode characterizes pre-multiplied spectra and co-spectra when the slope angle is below 70°, indicative of very-large-scale motions (VLSMs). In addition, conditional sampling and averaging shows how LSMs and VLSMs are induced by counter-rotating roll modes, in agreement with findings from canonical wall-bounded flows. VLSMs contribute to the stream-wise velocity variance and shear stress in the above-jet regions up to 30% and 45% respectively, whereas both LSMs and VLSMs are inactive in the near-wall regions.

  17. Characterization and visualization of the accuracy of FIA's CONUS-wide tree species datasets

    Treesearch

    Rachel Riemann; Barry T. Wilson

    2014-01-01

    Modeled geospatial datasets have been created for 325 tree species across the contiguous United States (CONUS). Effective application of all geospatial datasets depends on their accuracy. Dataset error can be systematic (bias) or unsystematic (scatter), and their magnitude can vary by region and scale. Each of these characteristics affects the locations, scales, uses,...

  18. ATLANTIC-CAMTRAPS: a dataset of medium and large terrestrial mammal communities in the Atlantic Forest of South America.

    PubMed

    Lima, Fernando; Beca, Gabrielle; Muylaert, Renata L; Jenkins, Clinton N; Perilli, Miriam L L; Paschoal, Ana Maria O; Massara, Rodrigo L; Paglia, Adriano P; Chiarello, Adriano G; Graipel, Maurício E; Cherem, Jorge J; Regolin, André L; Oliveira Santos, Luiz Gustavo R; Brocardo, Carlos R; Paviolo, Agustín; Di Bitetti, Mario S; Scoss, Leandro M; Rocha, Fabiana L; Fusco-Costa, Roberto; Rosa, Clarissa A; Da Silva, Marina X; Hufnagell, Ludmila; Santos, Paloma M; Duarte, Gabriela T; Guimarães, Luiza N; Bailey, Larissa L; Rodrigues, Flávio Henrique G; Cunha, Heitor M; Fantacini, Felipe M; Batista, Graziele O; Bogoni, Juliano A; Tortato, Marco A; Luiz, Micheli R; Peroni, Nivaldo; De Castilho, Pedro V; Maccarini, Thiago B; Filho, Vilmar Picinatto; Angelo, Carlos De; Cruz, Paula; Quiroga, Verónica; Iezzi, María E; Varela, Diego; Cavalcanti, Sandra M C; Martensen, Alexandre C; Maggiorini, Erica V; Keesen, Fabíola F; Nunes, André V; Lessa, Gisele M; Cordeiro-Estrela, Pedro; Beltrão, Mayara G; De Albuquerque, Anna Carolina F; Ingberman, Bianca; Cassano, Camila R; Junior, Laury Cullen; Ribeiro, Milton C; Galetti, Mauro

    2017-11-01

    Our understanding of mammal ecology has always been hindered by the difficulties of observing species in closed tropical forests. Camera trapping has become a major advance for monitoring terrestrial mammals in biodiversity rich ecosystems. Here we compiled one of the largest datasets of inventories of terrestrial mammal communities for the Neotropical region based on camera trapping studies. The dataset comprises 170 surveys of medium to large terrestrial mammals using camera traps conducted in 144 areas by 74 studies, covering six vegetation types of tropical and subtropical Atlantic Forest of South America (Brazil and Argentina), and presents data on species composition and richness. The complete dataset comprises 53,438 independent records of 83 species of mammals, including 10 species of marsupials, 15 rodents, 20 carnivores, eight ungulates and six armadillos. Species richness averaged 13 species (±6.07 SD) per site. Only six species occurred in more than 50% of the sites: the domestic dog Canis familiaris, crab-eating fox Cerdocyon thous, tayra Eira barbara, South American coati Nasua nasua, crab-eating raccoon Procyon cancrivorus and the nine-banded armadillo Dasypus novemcinctus. The information contained in this dataset can be used to understand macroecological patterns of biodiversity, community, and population structure, but also to evaluate the ecological consequences of fragmentation, defaunation, and trophic interactions. © 2017 by the Ecological Society of America.

  19. Large Scale Processes and Extreme Floods in Brazil

    NASA Astrophysics Data System (ADS)

    Ribeiro Lima, C. H.; AghaKouchak, A.; Lall, U.

    2016-12-01

    Persistent large scale anomalies in the atmospheric circulation and ocean state have been associated with heavy rainfall and extreme floods in water basins of different sizes across the world. Such studies have emerged in the last years as a new tool to improve the traditional, stationary based approach in flood frequency analysis and flood prediction. Here we seek to advance previous studies by evaluating the dominance of large scale processes (e.g. atmospheric rivers/moisture transport) over local processes (e.g. local convection) in producing floods. We consider flood-prone regions in Brazil as case studies and the role of large scale climate processes in generating extreme floods in such regions is explored by means of observed streamflow, reanalysis data and machine learning methods. The dynamics of the large scale atmospheric circulation in the days prior to the flood events are evaluated based on the vertically integrated moisture flux and its divergence field, which are interpreted in a low-dimensional space as obtained by machine learning techniques, particularly supervised kernel principal component analysis. In such reduced dimensional space, clusters are obtained in order to better understand the role of regional moisture recycling or teleconnected moisture in producing floods of a given magnitude. The convective available potential energy (CAPE) is also used as a measure of local convection activities. We investigate for individual sites the exceedance probability in which large scale atmospheric fluxes dominate the flood process. Finally, we analyze regional patterns of floods and how the scaling law of floods with drainage area responds to changes in the climate forcing mechanisms (e.g. local vs large scale).

  20. Generation of large-scale density fluctuations by buoyancy

    NASA Technical Reports Server (NTRS)

    Chasnov, J. R.; Rogallo, R. S.

    1990-01-01

    The generation of fluid motion from a state of rest by buoyancy forces acting on a homogeneous isotropic small-scale density field is considered. Nonlinear interactions between the generated fluid motion and the initial isotropic small-scale density field are found to create an anisotropic large-scale density field with spectrum proportional to κ⁴. This large-scale density field is observed to result in an increasing Reynolds number of the fluid turbulence in its final period of decay.

  1. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    PubMed Central

    Hero, Alfred O.; Rajaratnam, Bala

    2015-01-01

    When can reliable inference be drawn in the “Big Data” context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for “Big Data”. Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks. PMID:27087700
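
    The "variable-rich but sample-starved" regime described above can be made concrete with a small simulation. The sketch below is an illustration only, not the framework developed in the paper: it draws p mutually independent Gaussian variables with n samples each and reports the largest absolute sample correlation over all pairs, which is purely spurious by construction. The function names and the chosen values of n and p are assumptions.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_spurious_correlation(n, p, seed=0):
    """Largest |correlation| among p independent variables observed n times."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    return max(
        abs(pearson(columns[i], columns[j]))
        for i in range(p)
        for j in range(i + 1, p)
    )

if __name__ == "__main__":
    # Same number of variables, very different sample sizes.
    print(max_spurious_correlation(n=10, p=40))
    print(max_spurious_correlation(n=500, p=40))
```

    With few samples the largest spurious correlation is typically far from zero even though every true correlation is zero, which is why thresholds for correlation screening have to account for n as well as p.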

  2. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining.

    PubMed

    Hero, Alfred O; Rajaratnam, Bala

    2016-01-01

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.

  3. Large-scale dark diversity estimates: new perspectives with combined methods.

    PubMed

    Ronk, Argo; de Bello, Francesco; Fibich, Pavel; Pärtel, Meelis

    2016-09-01

    Large-scale biodiversity studies can be more informative if observed diversity in a study site is accompanied by dark diversity, the set of absent although ecologically suitable species. Dark diversity methodology is still being developed and a comparison of different approaches is needed. We used plant data at two different scales (European and seven large regions) and compared dark diversity estimates from two mathematical methods: species co-occurrence (SCO) and species distribution modeling (SDM). We used plant distribution data from the Atlas Florae Europaeae (50 × 50 km grid cells) and seven different European regions (10 × 10 km grid cells). Dark diversity was estimated by SCO and SDM for both datasets. We examined the relationship between the dark diversity sizes (type II regression) and the overlap in species composition (overlap coefficient). We tested the overlap probability according to the hypergeometric distribution. We combined the estimates of the two methods to determine consensus dark diversity and composite dark diversity. We tested whether dark diversity and completeness of site diversity (log ratio of observed and dark diversity) are related to various natural and anthropogenic factors differently than simple observed diversity. Both methods provided similar dark diversity sizes and distribution patterns; dark diversity is greater in southern Europe. The regression line, however, deviated from a 1:1 relationship. The species composition overlap of two methods was about 75%, which is much greater than expected by chance. Both consensus and composite dark diversity estimates showed similar distribution patterns. Both dark diversity and completeness measures exhibit relationships to natural and anthropogenic factors different than those exhibited by observed richness. In summary, dark diversity revealed new biodiversity patterns which were not evident when only observed diversity was examined. A new perspective in dark diversity studies

  4. North Atlantic cyclones; trends, impacts and links to large-scale variability

    NASA Astrophysics Data System (ADS)

    Trigo, R. M.; Trigo, I. F.; Ramos, A. M.; Paredes, D.; Garcia-Herrera, R.; Liberato, M. L. R.; Valente, M. A.

    2009-04-01

    Based on the cyclone detection and tracking algorithm previously developed (Trigo, 2006) we have assessed the inter-annual variability and cyclone frequency trends between 1960 and 2000 for the Euro-Atlantic sector using the highest spatial resolution available (1.125° x 1.125°) from the ERA-40 Surface Level Pressure. Additionally, trends for the u and v wind speed components are also computed at the monthly and seasonal scales, using the same dataset. All cyclone and wind speed trend maps were computed with the corresponding statistical significance field. Results reveal a significant frequency decrease (increase) in the western Mediterranean (Greenland and Scandinavia), particularly in December, February and March. Seasonal and monthly analysis of wind speed trends shows similar spatial patterns. We show that these changes in the frequency of low pressure centers and the associated wind patterns are partially responsible for trends of the significant height of waves. Throughout the extended winter months (ONDJFM), regions with positive (negative) wind magnitude trends, of up to 5 cm/s per year, often correspond to regions of positive (negative) significant wave height trends. The cyclone and wind speed trends computed for the JFM months are well matched by the corresponding trends in significant wave height, with February being the month with the highest trends (negative south of 50°N up to -3 cm/year, and positive up to 5 cm/year just north of Scotland). Using precipitation data from ECMWF reanalyses and a CRU high resolution dataset we show the impact of these trends in cyclone frequencies upon the corresponding precipitation trends in the influenced areas. It is also shown that these changes are partially linked to major shifts in the indices of large-scale pattern modes, namely the North Atlantic Oscillation (NAO), the Eastern Atlantic (EA) and the Scandinavian Patterns (SCAN). Trigo, I. F. 2006: Climatology and Interannual Variability of Storm-Tracks in

  5. Large-Scale Aerosol Modeling and Analysis

    DTIC Science & Technology

    2009-09-30

    Modeling of Burning Emissions ( FLAMBE ) project, and other related parameters. Our plans to embed NAAPS inside NOGAPS may need to be put on hold...AOD, FLAMBE and FAROP at FNMOC are supported by 6.4 funding from PMW-120 for “Large-scale Atmospheric Models”, “Small-scale Atmospheric Models

  6. Information Tailoring Enhancements for Large-Scale Social Data

    DTIC Science & Technology

    2016-06-15

    Intelligent Automation Incorporated Information Tailoring Enhancements for Large-Scale... Automation Incorporated Progress Report No. 3 Information Tailoring Enhancements for Large-Scale Social Data Submitted in accordance with...1 Work Performed within This Reporting Period .................................................... 2 1.1 Enhanced Named Entity Recognition (NER

  7. Does using different modern climate datasets impact pollen-based paleoclimate reconstructions in North America during the past 2,000 years

    NASA Astrophysics Data System (ADS)

    Ladd, Matthew; Viau, Andre

    2013-04-01

    Paleoclimate reconstructions rely on the accuracy of modern climate datasets for calibration of fossil records under the assumption of climate normality through time, which means that the modern climate operates in a similar manner as over the past 2,000 years. In this study, we show how using different modern climate datasets has an impact on a pollen-based reconstruction of mean temperature of the warmest month (MTWA) during the past 2,000 years for North America. The modern climate datasets used to explore this research question include: the Whitmore et al. (2005) modern climate dataset; North American Regional Reanalysis (NARR); National Center For Environmental Prediction (NCEP); European Center for Medium Range Weather Forecasting (ECMWF) ERA-40 reanalysis; WorldClim; Global Historical Climate Network (GHCN); and New et al., which is derived from the CRU dataset. Results show that some caution is advised in using the reanalysis data on large-scale reconstructions. Station data appears to dampen out the variability of the reconstruction produced using station based datasets. The reanalysis or model-based datasets are not recommended for paleoclimate large-scale North American reconstructions as they appear to lack some of the dynamics observed in station datasets (CRU), which resulted in warm-biased reconstructions as compared to the station-based reconstructions. The Whitmore et al. (2005) modern climate dataset appears to be a compromise between CRU-based datasets and model-based datasets except for the ERA-40. In addition, an ultra-high resolution gridded climate dataset such as WorldClim may only be useful if the pollen calibration sites in North America have at least the same spatial precision. We reconstruct the MTWA to within +/-0.01°C by using an average of all curves derived from the different modern climate datasets, demonstrating the robustness of the procedure used. It may be that the use of an average of different modern datasets may reduce the

  8. The Path from Large Earth Science Datasets to Information

    NASA Astrophysics Data System (ADS)

    Vicente, G. A.

    2013-12-01

    The NASA Goddard Earth Sciences Data (GES) and Information Services Center (DISC) is one of the major Science Mission Directorate (SMD) for archiving and distribution of Earth Science remote sensing data, products and services. This virtual portal provides convenient access to Atmospheric Composition and Dynamics, Hydrology, Precipitation, Ozone, and model derived datasets (generated by GSFC's Global Modeling and Assimilation Office), the North American Land Data Assimilation System (NLDAS) and the Global Land Data Assimilation System (GLDAS) data products (both generated by GSFC's Hydrological Sciences Branch). This presentation demonstrates various tools and computational technologies developed in the GES DISC to manage the huge volume of data and products acquired from various missions and programs over the years. It explores approaches to archive, document, distribute, access and analyze Earth Science data and information, and addresses the technical and scientific issues, governance and user support problems faced by scientists in need of multi-disciplinary datasets. It also discusses data and product metrics, user distribution profiles and lessons learned through interactions with the science communities around the world. Finally, it demonstrates some of the most used data and product visualization and analysis tools developed and maintained by the GES DISC.

  9. A bibliographical survey of large-scale systems

    NASA Technical Reports Server (NTRS)

    Corliss, W. R.

    1970-01-01

    A limited, partly annotated bibliography was prepared on the subject of large-scale system control. Approximately 400 references are divided into thirteen application areas, such as large societal systems and large communication systems. A first-author index is provided.

  10. Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping.

    PubMed

    Pinho, Ana Luísa; Amadon, Alexis; Ruest, Torsten; Fabre, Murielle; Dohmatob, Elvis; Denghien, Isabelle; Ginisty, Chantal; Becuwe-Desmidt, Séverine; Roger, Séverine; Laurier, Laurence; Joly-Testault, Véronique; Médiouni-Cloarec, Gaëlle; Doublé, Christine; Martins, Bernadette; Pinel, Philippe; Eger, Evelyn; Varoquaux, Gaël; Pallier, Christophe; Dehaene, Stanislas; Hertz-Pannier, Lucie; Thirion, Bertrand

    2018-06-12

    Functional Magnetic Resonance Imaging (fMRI) has furthered brain mapping on perceptual, motor, as well as higher-level cognitive functions. However, to date, no data collection has systematically addressed the functional mapping of cognitive mechanisms at a fine spatial scale. The Individual Brain Charting (IBC) project stands for a high-resolution multi-task fMRI dataset that intends to provide the objective basis toward a comprehensive functional atlas of the human brain. The data refer to a cohort of 12 participants performing many different tasks. The large amount of task-fMRI data on the same subjects yields a precise mapping of the underlying functions, free from both inter-subject and inter-site variability. The present article gives a detailed description of the first release of the IBC dataset. It comprises a dozen tasks, addressing both low- and high-level cognitive functions. This openly available dataset is thus intended to become a reference for cognitive brain mapping.

  11. Scalable Iterative Classification for Sanitizing Large-Scale Datasets

    PubMed Central

    Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun; Malin, Bradley

    2017-01-01

    Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations. PMID:28943741

  12. Spatial Structure of Above-Ground Biomass Limits Accuracy of Carbon Mapping in Rainforest but Large Scale Forest Inventories Can Help to Overcome

    PubMed Central

    Guitet, Stéphane; Hérault, Bruno; Molto, Quentin; Brunaux, Olivier; Couteron, Pierre

    2015-01-01

    Precise mapping of above-ground biomass (AGB) is a major challenge for the success of REDD+ processes in tropical rainforest. The usual mapping methods are based on two hypotheses: a large and long-ranged spatial autocorrelation and a strong environment influence at the regional scale. However, there are no studies of the spatial structure of AGB at the landscapes scale to support these assumptions. We studied spatial variation in AGB at various scales using two large forest inventories conducted in French Guiana. The dataset comprised 2507 plots (0.4 to 0.5 ha) of undisturbed rainforest distributed over the whole region. After checking the uncertainties of estimates obtained from these data, we used half of the dataset to develop explicit predictive models including spatial and environmental effects and tested the accuracy of the resulting maps according to their resolution using the rest of the data. Forest inventories provided accurate AGB estimates at the plot scale, for a mean of 325 Mg.ha-1. They revealed high local variability combined with a weak autocorrelation up to distances of no more than 10 km. Environmental variables accounted for a minor part of spatial variation. Accuracy of the best model including spatial effects was 90 Mg.ha-1 at plot scale but coarse graining up to 2-km resolution allowed mapping AGB with accuracy lower than 50 Mg.ha-1. Whatever the resolution, no agreement was found with available pan-tropical reference maps at all resolutions. We concluded that the combined weak autocorrelation and weak environmental effect limit AGB maps accuracy in rainforest, and that a trade-off has to be found between spatial resolution and effective accuracy until adequate “wall-to-wall” remote sensing signals provide reliable AGB predictions. Waiting for this, using large forest inventories with low sampling rate (<0.5%) may be an efficient way to increase the global coverage of AGB maps with acceptable accuracy at kilometric resolution. PMID

  13. Spatial Structure of Above-Ground Biomass Limits Accuracy of Carbon Mapping in Rainforest but Large Scale Forest Inventories Can Help to Overcome.

    PubMed

    Guitet, Stéphane; Hérault, Bruno; Molto, Quentin; Brunaux, Olivier; Couteron, Pierre

    2015-01-01

    Precise mapping of above-ground biomass (AGB) is a major challenge for the success of REDD+ processes in tropical rainforest. The usual mapping methods are based on two hypotheses: a large and long-ranged spatial autocorrelation and a strong environment influence at the regional scale. However, there are no studies of the spatial structure of AGB at the landscapes scale to support these assumptions. We studied spatial variation in AGB at various scales using two large forest inventories conducted in French Guiana. The dataset comprised 2507 plots (0.4 to 0.5 ha) of undisturbed rainforest distributed over the whole region. After checking the uncertainties of estimates obtained from these data, we used half of the dataset to develop explicit predictive models including spatial and environmental effects and tested the accuracy of the resulting maps according to their resolution using the rest of the data. Forest inventories provided accurate AGB estimates at the plot scale, for a mean of 325 Mg.ha-1. They revealed high local variability combined with a weak autocorrelation up to distances of no more than 10 km. Environmental variables accounted for a minor part of spatial variation. Accuracy of the best model including spatial effects was 90 Mg.ha-1 at plot scale but coarse graining up to 2-km resolution allowed mapping AGB with accuracy lower than 50 Mg.ha-1. Whatever the resolution, no agreement was found with available pan-tropical reference maps at all resolutions. We concluded that the combined weak autocorrelation and weak environmental effect limit AGB maps accuracy in rainforest, and that a trade-off has to be found between spatial resolution and effective accuracy until adequate "wall-to-wall" remote sensing signals provide reliable AGB predictions. Waiting for this, using large forest inventories with low sampling rate (<0.5%) may be an efficient way to increase the global coverage of AGB maps with acceptable accuracy at kilometric resolution.

  14. Evolution of scaling emergence in large-scale spatial epidemic spreading.

    PubMed

    Wang, Lin; Li, Xiang; Zhang, Yi-Qing; Zhang, Yan; Zhang, Kan

    2011-01-01

    Zipf's law and Heaps' law are two representatives of the scaling concepts, which play a significant role in the study of complexity science. The coexistence of the Zipf's law and the Heaps' law motivates different understandings on the dependence between these two scalings, which has still hardly been clarified. In this article, we observe an evolution process of the scalings: the Zipf's law and the Heaps' law are naturally shaped to coexist at the initial time, while the crossover comes with the emergence of their inconsistency at the larger time before reaching a stable state, where the Heaps' law still exists with the disappearance of strict Zipf's law. Such findings are illustrated with a scenario of large-scale spatial epidemic spreading, and the empirical results of pandemic disease support a universal analysis of the relation between the two laws regardless of the biological details of disease. Employing the United States domestic air transportation and demographic data to construct a metapopulation model for simulating the pandemic spread at the U.S. country level, we uncover that the broad heterogeneity of the infrastructure plays a key role in the evolution of scaling emergence. The analyses of large-scale spatial epidemic spreading help understand the temporal evolution of scalings, indicating the coexistence of the Zipf's law and the Heaps' law depends on the collective dynamics of epidemic processes, and the heterogeneity of epidemic spread indicates the significance of performing targeted containment strategies at the early time of a pandemic disease.

  15. Moon-based Earth Observation for Large Scale Geoscience Phenomena

    NASA Astrophysics Data System (ADS)

    Guo, Huadong; Liu, Guang; Ding, Yixing

    2016-07-01

    The capability of Earth observation for large, global-scale natural phenomena needs to be improved, and new observing platforms are expected. We have studied the concept of the Moon as an Earth observation platform in recent years. Compared with man-made satellite platforms, Moon-based Earth observation can obtain multi-spherical, full-band, active and passive information, and has the following advantages: large observation range, variable view angle, long-term continuous observation, and extra-long life cycle, with the characteristics of longevity, consistency, integrity, stability and uniqueness. Moon-based Earth observation is suitable for monitoring large-scale geoscience phenomena, including large-scale atmosphere change, large-scale ocean change, large-scale land surface dynamic change, solid earth dynamic change, etc. For the purpose of establishing a Moon-based Earth observation platform, we already have a plan to study the following five aspects: mechanism and models of Moon-based observation of macroscopic Earth science phenomena; sensor parameter optimization and methods of Moon-based Earth observation; site selection and environment of Moon-based Earth observation; the Moon-based Earth observation platform; and the fundamental scientific framework of Moon-based Earth observation.

  16. Large-Scale Structure and Hyperuniformity of Amorphous Ices

    NASA Astrophysics Data System (ADS)

    Martelli, Fausto; Torquato, Salvatore; Giovambattista, Nicolas; Car, Roberto

    2017-09-01

    We investigate the large-scale structure of amorphous ices and transitions between their different forms by quantifying their large-scale density fluctuations. Specifically, we simulate the isothermal compression of low-density amorphous ice (LDA) and hexagonal ice to produce high-density amorphous ice (HDA). Both HDA and LDA are nearly hyperuniform; i.e., they are characterized by an anomalous suppression of large-scale density fluctuations. By contrast, in correspondence with the nonequilibrium phase transitions to HDA, the presence of structural heterogeneities strongly suppresses the hyperuniformity and the system becomes hyposurficial (devoid of "surface-area fluctuations"). Our investigation challenges the largely accepted "frozen-liquid" picture, which views glasses as structurally arrested liquids. Beyond implications for water, our findings enrich our understanding of pressure-induced structural transformations in glasses.

  17. MEMHDX: an interactive tool to expedite the statistical validation and visualization of large HDX-MS datasets.

    PubMed

    Hourdel, Véronique; Volant, Stevenn; O'Brien, Darragh P; Chenal, Alexandre; Chamot-Rooke, Julia; Dillies, Marie-Agnès; Brier, Sébastien

    2016-11-15

    With the continued improvement of requisite mass spectrometers and UHPLC systems, Hydrogen/Deuterium eXchange Mass Spectrometry (HDX-MS) workflows are rapidly evolving towards the investigation of more challenging biological systems, including large protein complexes and membrane proteins. The analysis of such extensive systems results in very large HDX-MS datasets for which specific analysis tools are required to speed up data validation and interpretation. We introduce a web application and a new R-package named 'MEMHDX' to help users analyze, validate and visualize large HDX-MS datasets. MEMHDX is composed of two elements. A statistical tool aids in the validation of the results by applying a mixed-effects model for each peptide, in each experimental condition, and at each time point, taking into account the time dependency of the HDX reaction and number of independent replicates. Two adjusted P-values are generated per peptide, one for the 'Change in dynamics' and one for the 'Magnitude of ΔD', and are used to classify the data by means of a 'Logit' representation. A user-friendly interface developed with Shiny by RStudio facilitates the use of the package. This interactive tool allows the user to easily and rapidly validate, visualize and compare the relative deuterium incorporation on the amino acid sequence and 3D structure, providing both spatial and temporal information. MEMHDX is freely available as a web tool at the project home page http://memhdx.c3bi.pasteur.fr. Contact: marie-agnes.dillies@pasteur.fr or sebastien.brier@pasteur.fr. Supplementary information: Supplementary data is available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  18. Optimizing tertiary storage organization and access for spatio-temporal datasets

    NASA Technical Reports Server (NTRS)

    Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith

    1994-01-01

    We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access of subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over physical placement of data on storage devices. We also discuss in some detail the aspects of the interface between the application programs and the mass storage system, as well as a workbench to help scientists to design the best reorganization of a dataset for anticipated access patterns.

  19. Large Scale Cross Drive Correlation Of Digital Media

    DTIC Science & Technology

    2016-03-01

    NAVAL POSTGRADUATE SCHOOL MONTEREY, CALIFORNIA THESIS LARGE SCALE CROSS-DRIVE CORRELATION OF DIGITAL MEDIA by Joseph Van Bruaene March 2016 Thesis Co...CROSS-DRIVE CORRELATION OF DIGITAL MEDIA 5. FUNDING NUMBERS 6. AUTHOR(S) Joseph Van Bruaene 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval...the ability to make large scale cross-drive correlations among a large corpus of digital media becomes increasingly important. We propose a

  20. Facing the Challenges of Accessing, Managing, and Integrating Large Observational Datasets in Ecology: Enabling and Enriching the Use of NEON's Observational Data

    NASA Astrophysics Data System (ADS)

    Thibault, K. M.

    2013-12-01

    As the construction of NEON and its transition to operations progresses, more and more data will become available to the scientific community, both from NEON directly and from the concomitant growth of existing data repositories. Many of these datasets include ecological observations of a diversity of taxa in both aquatic and terrestrial environments. Although observational data have been collected and used throughout the history of organismal biology, the field has not yet fully developed a culture of data management, documentation, standardization, sharing and discoverability to facilitate the integration and synthesis of datasets. Moreover, the tools required to accomplish these goals, namely database design, implementation, and management, and automation and parallelization of analytical tasks through computational techniques, have not historically been included in biology curricula, at either the undergraduate or graduate levels. To ensure the success of data-generating projects like NEON in advancing organismal ecology and to increase transparency and reproducibility of scientific analyses, an acceleration of the cultural shift to open science practices, the development and adoption of data standards, such as the DarwinCore standard for taxonomic data, and increased training in computational approaches for biologists need to be realized. Here I highlight several initiatives that are intended to increase access to and discoverability of publicly available datasets and equip biologists and other scientists with the skills that are needed to manage, integrate, and analyze data from multiple large-scale projects. The EcoData Retriever (ecodataretriever.org) is a tool that downloads publicly available datasets, re-formats the data into an efficient relational database structure, and then automatically imports the data tables onto a user's local drive into the database tool of the user's choice. The automation of these tasks results in nearly instantaneous execution

  1. Simultaneous analysis of large INTEGRAL/SPI datasets: Optimizing the computation of the solution and its variance using sparse matrix algorithms

    NASA Astrophysics Data System (ADS)

    Bouchet, L.; Amestoy, P.; Buttari, A.; Rouet, F.-H.; Chauvin, M.

    2013-02-01

    Nowadays, analyzing and reducing the ever larger astronomical datasets is becoming a crucial challenge, especially for long cumulated observation times. The INTEGRAL/SPI X/γ-ray spectrometer is an instrument for which it is essential to process many exposures at the same time in order to increase the low signal-to-noise ratio of the weakest sources. In this context, the conventional methods for data reduction are inefficient and sometimes not feasible at all. Processing several years of data simultaneously requires computing not only the solution of a large system of equations, but also the associated uncertainties. We aim at reducing the computation time and the memory usage. Since the SPI transfer function is sparse, we have used some popular methods for the solution of large sparse linear systems; we briefly review these methods. We use the Multifrontal Massively Parallel Solver (MUMPS) to compute the solution of the system of equations. We also need to compute the variance of the solution, which amounts to computing selected entries of the inverse of the sparse matrix corresponding to our linear system. This can be achieved through one of the latest features of the MUMPS software that has been partly motivated by this work. In this paper we provide a brief presentation of this feature and evaluate its effectiveness on astrophysical problems requiring the processing of large datasets simultaneously, such as the study of the entire emission of the Galaxy. We used these algorithms to solve the large sparse systems arising from SPI data processing and to obtain both their solutions and the associated variances. In conclusion, thanks to these newly developed tools, processing large datasets arising from SPI is now feasible with both a reasonable execution time and a low memory usage.

  2. RACORO continental boundary layer cloud investigations. Part I: Case study development and ensemble large-scale forcings

    DOE PAGES

    Vogelmann, Andrew M.; Fridlind, Ann M.; Toto, Tami; ...

    2015-06-19

    Observation-based modeling case studies of continental boundary layer clouds have been developed to study cloudy boundary layers, aerosol influences upon them, and their representation in cloud- and global-scale models. Three 60-hour case study periods span the temporal evolution of cumulus, stratiform, and drizzling boundary layer cloud systems, representing mixed and transitional states rather than idealized or canonical cases. Based on in-situ measurements from the RACORO field campaign and remote-sensing observations, the cases are designed with a modular configuration to simplify use in large-eddy simulations (LES) and single-column models. Aircraft measurements of aerosol number size distribution are fit to lognormal functions for concise representation in models. Values of the aerosol hygroscopicity parameter, κ, are derived from observations to be ~0.10, which are lower than the 0.3 typical over continents and suggestive of a large aerosol organic fraction. Ensemble large-scale forcing datasets are derived from the ARM variational analysis, ECMWF forecasts, and a multi-scale data assimilation system. The forcings are assessed through comparison of measured bulk atmospheric and cloud properties to those computed in 'trial' large-eddy simulations, where more efficient run times are enabled through modest reductions in grid resolution and domain size compared to the full-sized LES grid. Simulations capture many of the general features observed, but the state-of-the-art forcings were limited at representing details of cloud onset, and tight gradients and high-resolution transients of importance. Methods for improving the initial conditions and forcings are discussed. The cases developed are available to the general modeling community for studying continental boundary clouds.

  3. The large-scale distribution of galaxies

    NASA Technical Reports Server (NTRS)

    Geller, Margaret J.

    1989-01-01

    The spatial distribution of galaxies in the universe is characterized on the basis of the six completed strips of the Harvard-Smithsonian Center for Astrophysics redshift-survey extension. The design of the survey is briefly reviewed, and the results are presented graphically. Vast low-density voids similar to the void in Bootes are found, almost completely surrounded by thin sheets of galaxies. Also discussed are the implications of the results for the survey sampling problem, the two-point correlation function of the galaxy distribution, the possibility of detecting large-scale coherent flows, theoretical models of large-scale structure, and the identification of groups and clusters of galaxies.

  4. Scaling up Psycholinguistics

    ERIC Educational Resources Information Center

    Smith, Nathaniel J.

    2011-01-01

    This dissertation contains several projects, each addressing different questions with different techniques. In chapter 1, I argue that they are unified thematically by their goal of "scaling up psycholinguistics"; they are all aimed at analyzing large data-sets using tools that reveal patterns to propose and test mechanism-neutral hypotheses about…

  5. Appraising city-scale pollution monitoring capabilities of multi-satellite datasets using portable pollutant monitors

    NASA Astrophysics Data System (ADS)

    Aliyu, Yahaya A.; Botai, Joel O.

    2018-04-01

    The retrieval characteristics of a city-scale satellite experiment were explored over a Nigerian city. The study evaluated carbon monoxide and aerosol contents in the city atmosphere. We utilized the MSA Altair 5× gas detector and CW-HAT200 particulate counter to investigate the city-scale monitoring capabilities of satellite pollution observing instruments: the atmospheric infrared sounder (AIRS), measurement of pollution in the troposphere (MOPITT), moderate resolution imaging spectroradiometer (MODIS), multi-angle imaging spectroradiometer (MISR) and ozone monitoring instrument (OMI). To achieve this, we employed the Kriging interpolation technique to collocate the satellite pollutant estimations over 19 ground sample sites for the period of 2015-2016. The portable pollutant devices were validated using the WHO air filter sampling model. To determine the city-scale performance of the satellite datasets, four performance indicators were adopted as measures: correlation coefficient, model efficiency, reliability index and root mean square error. The comparative analysis revealed that MOPITT carbon monoxide (CO) and MODIS aerosol optical depth (AOD) estimates are the appropriate satellite measurements for ground equivalents in Zaria, Nigeria. Our findings were within the acceptable limits of similar studies that utilized reference stations. In conclusion, this study offers direction to Nigeria's air quality policy organizers about available alternative air pollution measurements for mitigating air quality effects within its limited resource environment.
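
    Three of the performance indicators named above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, "model efficiency" is assumed here to mean the Nash-Sutcliffe form, and the reliability index is omitted because the abstract does not give its definition.

```python
import math

def rmse(obs, est):
    """Root mean square error between ground values and satellite estimates."""
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))

def pearson_r(obs, est):
    """Pearson correlation coefficient."""
    n = len(obs)
    mo, me = sum(obs) / n, sum(est) / n
    cov = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    se = math.sqrt(sum((e - me) ** 2 for e in est))
    return cov / (so * se)

def model_efficiency(obs, est):
    """Nash-Sutcliffe efficiency (assumed definition): 1 is a perfect match,
    0 means no better than predicting the observed mean."""
    mo = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot
```

    A constant offset between the two series leaves the correlation at 1 while lowering the efficiency and raising the RMSE, which is one reason to report several indicators together.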

  6. BABAR: an R package to simplify the normalisation of common reference design microarray-based transcriptomic datasets

    PubMed Central

    2010-01-01

    Background: The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2-ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results: The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions: BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets.

  7. Large-scale environments of narrow-line Seyfert 1 galaxies

    NASA Astrophysics Data System (ADS)

    Järvelä, E.; Lähteenmäki, A.; Lietzen, H.; Poudel, A.; Heinämäki, P.; Einasto, M.

    2017-09-01

    Studying large-scale environments of narrow-line Seyfert 1 (NLS1) galaxies gives a new perspective on their properties, particularly their radio loudness. The large-scale environment is believed to have an impact on the evolution and intrinsic properties of galaxies; however, NLS1 sources have not been studied in this context before. We have a large and diverse sample of 1341 NLS1 galaxies and three separate environment data sets constructed using the Sloan Digital Sky Survey. We use various statistical methods to investigate how the properties of NLS1 galaxies are connected to the large-scale environment, and compare the large-scale environments of NLS1 galaxies with other active galactic nuclei (AGN) classes, for example, other jetted AGN and broad-line Seyfert 1 (BLS1) galaxies, to study how they are related. NLS1 galaxies reside in less dense environments than any of the comparison samples, thus confirming their young age. The average large-scale environment density and environmental distribution of NLS1 sources are clearly different from those of BLS1 galaxies; thus it is improbable that they could be the parent population of NLS1 galaxies and unified by orientation. Within the NLS1 class there is a trend of increasing radio loudness with increasing large-scale environment density, indicating that the large-scale environment affects their intrinsic properties. Our results suggest that the NLS1 class of sources is not homogeneous, and furthermore, that a considerable fraction of them are misclassified. We further support a published proposal to replace the traditional classification into radio-loud, and radio-quiet or radio-silent sources with a division into jetted and non-jetted sources.

  8. Double inflation - A possible resolution of the large-scale structure problem

    NASA Technical Reports Server (NTRS)

    Turner, Michael S.; Villumsen, Jens V.; Vittorio, Nicola; Silk, Joseph; Juszkiewicz, Roman

    1987-01-01

    A model is presented for the large-scale structure of the universe in which two successive inflationary phases resulted in large small-scale and small large-scale density fluctuations. This bimodal density fluctuation spectrum in an Omega = 1 universe dominated by hot dark matter leads to large-scale structure of the galaxy distribution that is consistent with recent observational results. In particular, large, nearly empty voids and significant large-scale peculiar velocity fields are produced over scales of about 100 Mpc, while the small-scale structure over less than about 10 Mpc resembles that in a low-density universe, as observed. Detailed analytical calculations and numerical simulations are given of the spatial and velocity correlations.

  9. Measuring the Large-scale Solar Magnetic Field

    NASA Astrophysics Data System (ADS)

    Hoeksema, J. T.; Scherrer, P. H.; Peterson, E.; Svalgaard, L.

    2017-12-01

    The Sun's large-scale magnetic field is important for determining the global structure of the corona and for quantifying the evolution of the polar field, which is sometimes used for predicting the strength of the next solar cycle. Having confidence in the determination of the large-scale magnetic field of the Sun is difficult because the field is often near the detection limit, various observing methods all measure something a little different, and various systematic effects can be very important. We compare resolved and unresolved observations of the large-scale magnetic field from the Wilcox Solar Observatory (WSO), Helioseismic and Magnetic Imager (HMI), Michelson Doppler Imager (MDI), and SOLIS. Cross comparison does not enable us to establish an absolute calibration, but it does allow us to discover and compensate for instrument problems, such as the sensitivity decrease seen in the WSO measurements in late 2016 and early 2017.

  10. A reanalysis dataset of the South China Sea

    PubMed Central

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803

  11. Development of a Watershed Boundary Dataset for Mississippi

    USGS Publications Warehouse

    Van Wilson, K.; Clair, Michael G.; Turnipseed, D. Phil; Rebich, Richard A.

    2009-01-01

    The U.S. Geological Survey, in cooperation with the Mississippi Department of Environmental Quality, U.S. Department of Agriculture-Natural Resources Conservation Service, Mississippi Department of Transportation, U.S. Department of Agriculture-Forest Service, and the Mississippi Automated Resource Information System, developed a 1:24,000-scale Watershed Boundary Dataset for Mississippi including watershed and subwatershed boundaries, codes, names, and drainage areas. The Watershed Boundary Dataset for Mississippi provides a standard geographical framework for water-resources and selected land-resources planning. The original 8-digit subbasins (hydrologic unit codes) were further subdivided into 10-digit watersheds and 12-digit subwatersheds - the exceptions are the Lower Mississippi River Alluvial Plain (known locally as the Delta) and the Mississippi River inside levees, which were only subdivided into 10-digit watersheds. Also, large water bodies in the Mississippi Sound along the coast were not delineated as small as a typical 12-digit subwatershed. All of the data - including watershed and subwatershed boundaries, hydrologic unit codes and names, and drainage-area data - are stored in a Geographic Information System database.

  12. Spectral fingerprints of large-scale neuronal interactions.

    PubMed

    Siegel, Markus; Donner, Tobias H; Engel, Andreas K

    2012-01-11

    Cognition results from interactions among functionally specialized but widely distributed brain regions; however, neuroscience has so far largely focused on characterizing the function of individual brain regions and neurons therein. Here we discuss recent studies that have instead investigated the interactions between brain regions during cognitive processes by assessing correlations between neuronal oscillations in different regions of the primate cerebral cortex. These studies have opened a new window onto the large-scale circuit mechanisms underlying sensorimotor decision-making and top-down attention. We propose that frequency-specific neuronal correlations in large-scale cortical networks may be 'fingerprints' of canonical neuronal computations underlying cognitive processes.

  13. A unified large/small-scale dynamo in helical turbulence

    NASA Astrophysics Data System (ADS)

    Bhat, Pallavi; Subramanian, Kandaswamy; Brandenburg, Axel

    2016-09-01

    We use high resolution direct numerical simulations (DNS) to show that helical turbulence can generate significant large-scale fields even in the presence of strong small-scale dynamo action. During the kinematic stage, the unified large/small-scale dynamo grows fields with a shape-invariant eigenfunction, with most power peaked at small scales or large $k$, as in Subramanian & Brandenburg. Nevertheless, the large-scale field can be clearly detected as an excess power at small $k$ in the negatively polarized component of the energy spectrum for a forcing with positively polarized waves. Its strength $\overline{B}$, relative to the total rms field $B_{\rm rms}$, decreases with increasing magnetic Reynolds number, ${\rm Re}_M$. However, as the Lorentz force becomes important, the field generated by the unified dynamo orders itself by saturating on successively larger scales. The magnetic integral scale for the positively polarized waves, characterizing the small-scale field, increases significantly from the kinematic stage to saturation. This implies that the small-scale field becomes as coherent as possible for a given forcing scale, which averts the ${\rm Re}_M$-dependent quenching of $\overline{B}/B_{\rm rms}$. These results are obtained for $1024^3$ DNS with magnetic Prandtl numbers of ${\rm Pr}_M = 0.1$ and 10. For ${\rm Pr}_M = 0.1$, $\overline{B}/B_{\rm rms}$ grows from about 0.04 to about 0.4 at saturation, aided in the final stages by helicity dissipation. For ${\rm Pr}_M = 10$, $\overline{B}/B_{\rm rms}$ grows from much less than 0.01 to values of the order of 0.2. Our results confirm that there is a unified large/small-scale dynamo in helical turbulence.

  14. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    NASA Astrophysics Data System (ADS)

    Lary, D. J.

    2013-12-01

    A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster using an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce the development time and enhance the functionality, a high-level language capable of parallel processing has been used (Matlab). A key consideration for the system is high-speed access due to the large data volume, persistence of the large data volumes and a precise process time scheduling capability.

  15. Modulation of Small-scale Turbulence Structure by Large-scale Motions in the Absence of Direct Energy Transfer.

    NASA Astrophysics Data System (ADS)

    Brasseur, James G.; Juneja, Anurag

    1996-11-01

    Previous DNS studies indicate that small-scale structure can be directly altered through "distant" dynamical interactions by energetic forcing of the large scales. To remove the possibility of stimulating energy transfer between the large- and small-scale motions in these long-range interactions, we here perturb the large-scale structure without altering its energy content by suddenly altering only the phases of large-scale Fourier modes. Scale-dependent changes in turbulence structure appear as a nonzero difference field between two simulations from identical initial conditions of isotropic decaying turbulence, one perturbed and one unperturbed. We find that the large-scale phase perturbations leave the evolution of the energy spectrum virtually unchanged relative to the unperturbed turbulence. The difference field, on the other hand, is strongly affected by the perturbation. Most importantly, the time scale τ characterizing the change in turbulence structure at spatial scale r shortly after initiating a change in large-scale structure decreases with decreasing turbulence scale r. Thus, structural information is transferred directly from the large- to the smallest-scale motions in the absence of direct energy transfer, a long-range effect which cannot be explained by a linear mechanism such as rapid distortion theory. * Supported by ARO grant DAAL03-92-G-0117
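
    The idea of changing only the phases of large-scale Fourier modes, so that the energy in each mode is untouched, can be shown on a toy signal. The sketch below is a one-dimensional periodic analogue, not the three-dimensional turbulence simulation of the abstract; the function name, the cutoff `k_cut` and the choice of uniformly random new phases are assumptions made for illustration.

```python
import numpy as np

def perturb_large_scale_phases(u, k_cut, rng):
    """Randomize the phases of modes 1..k_cut of a real periodic signal
    while keeping every Fourier amplitude (and thus the energy) fixed."""
    if not 1 <= k_cut < u.size // 2:
        raise ValueError("k_cut must lie between 1 and below the Nyquist mode")
    u_hat = np.fft.rfft(u)
    k = np.arange(u_hat.size)
    large = (k >= 1) & (k <= k_cut)           # large scales = low wavenumbers
    theta = rng.uniform(0.0, 2.0 * np.pi, size=int(large.sum()))
    u_hat[large] = np.abs(u_hat[large]) * np.exp(1j * theta)
    return np.fft.irfft(u_hat, n=u.size)

rng = np.random.default_rng(0)
u = rng.standard_normal(64)
v = perturb_large_scale_phases(u, k_cut=4, rng=rng)
diff = v - u   # analogue of the "difference field" between the two runs
```

    Because the mean mode and the Nyquist mode are left alone, the perturbed signal stays real and its amplitude spectrum matches the original mode by mode.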

  16. Evolution of Scaling Emergence in Large-Scale Spatial Epidemic Spreading

    PubMed Central

    Wang, Lin; Li, Xiang; Zhang, Yi-Qing; Zhang, Yan; Zhang, Kan

    2011-01-01

    Background: Zipf's law and Heaps' law are two representatives of the scaling concepts, which play a significant role in the study of complexity science. The coexistence of the Zipf's law and the Heaps' law motivates different understandings on the dependence between these two scalings, which has still hardly been clarified. Methodology/Principal Findings: In this article, we observe an evolution process of the scalings: the Zipf's law and the Heaps' law are naturally shaped to coexist at the initial time, while the crossover comes with the emergence of their inconsistency at the larger time before reaching a stable state, where the Heaps' law still exists with the disappearance of strict Zipf's law. Such findings are illustrated with a scenario of large-scale spatial epidemic spreading, and the empirical results of pandemic disease support a universal analysis of the relation between the two laws regardless of the biological details of disease. Employing the United States domestic air transportation and demographic data to construct a metapopulation model for simulating the pandemic spread at the U.S. country level, we uncover that the broad heterogeneity of the infrastructure plays a key role in the evolution of scaling emergence. Conclusions/Significance: The analyses of large-scale spatial epidemic spreading help understand the temporal evolution of scalings, indicating the coexistence of the Zipf's law and the Heaps' law depends on the collective dynamics of epidemic processes, and the heterogeneity of epidemic spread indicates the significance of performing targeted containment strategies at the early time of a pandemic disease. PMID:21747932
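
    For readers unfamiliar with the two scalings, the quantities behind them are easy to compute from any sequence of labelled events. The sketch below is generic and not taken from the article: the function names and the toy event labels are hypothetical. Zipf's law concerns how frequency falls off with rank, and Heaps' law concerns how the number of distinct labels grows with the length of the sequence.

```python
from collections import Counter

def zipf_rank_frequency(events):
    """Frequencies sorted from most to least common (rank 1, 2, ...)."""
    return sorted(Counter(events).values(), reverse=True)

def heaps_curve(events):
    """Number of distinct labels seen after each event in the sequence."""
    seen = set()
    curve = []
    for label in events:
        seen.add(label)
        curve.append(len(seen))
    return curve

# Toy sequence of labelled events (e.g. the region of each reported case).
events = ["a", "b", "a", "c", "a", "b"]
ranked = zipf_rank_frequency(events)   # [3, 2, 1]
growth = heaps_curve(events)           # [1, 2, 2, 3, 3, 3]
```

    Plotting `ranked` against rank and `growth` against sequence length on log-log axes is the usual way to check whether either quantity follows a power law.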

  17. A Functional Model for Management of Large Scale Assessments.

    ERIC Educational Resources Information Center

    Banta, Trudy W.; And Others

    This functional model for managing large-scale program evaluations was developed and validated in connection with the assessment of Tennessee's Nutrition Education and Training Program. Management of such a large-scale assessment requires the development of a structure for the organization; distribution and recovery of large quantities of…

  18. Current Scientific Issues in Large Scale Atmospheric Dynamics

    NASA Technical Reports Server (NTRS)

    Miller, T. L. (Compiler)

    1986-01-01

    Topics in large scale atmospheric dynamics are discussed. Aspects of atmospheric blocking, the influence of transient baroclinic eddies on planetary-scale waves, cyclogenesis, the effects of orography on planetary scale flow, small scale frontal structure, and simulations of gravity waves in frontal zones are discussed.

  19. Food appropriation through large scale land acquisitions

    NASA Astrophysics Data System (ADS)

    Rulli, Maria Cristina; D'Odorico, Paolo

    2014-05-01

    The increasing demand for agricultural products and the uncertainty of international food markets have recently drawn the attention of governments and agribusiness firms toward investments in productive agricultural land, mostly in the developing world. The targeted countries are typically located in regions that have remained only marginally utilized because of lack of modern technology. It is expected that in the long run large scale land acquisitions (LSLAs) for commercial farming will bring the technology required to close the existing crop yield gaps. While the extent of the acquired land and the associated appropriation of freshwater resources have been investigated in detail, the amount of food this land can produce and the number of people it could feed still need to be quantified. Here we use a unique dataset of land deals to provide a global quantitative assessment of the rates of crop and food appropriation potentially associated with LSLAs. We show that up to 300-550 million people could be fed by crops grown in the acquired land, should these investments in agriculture improve crop production and close the yield gap. In contrast, about 190-370 million people could be supported by this land without closing of the yield gap. These numbers raise some concern because the food produced in the acquired land is typically exported to other regions, while the target countries exhibit high levels of malnourishment. Conversely, if used for domestic consumption, the crops harvested in the acquired land could ensure food security to the local populations.

  20. Satellite-derived pan-Arctic melt onset dataset, 2000-2009

    NASA Astrophysics Data System (ADS)

    Wang, L.; Derksen, C.; Howell, S.; Wolken, G. J.; Sharp, M. J.; Markus, T.

    2009-12-01

    The SeaWinds Scatterometer on QuikSCAT (QS) has been in orbit for over a decade since its launch in June 1999. Due to its high sensitivity to the appearance of liquid water in snow and day/night all weather capability, QS data have been successfully used to detect melt onset and melt duration for various elements of the cryosphere. These melt datasets are especially useful in the polar regions where the application of imagery from optical sensors is hindered by polar nights and frequent cloud cover. In this study, we generate a pan-Arctic, pan-cryosphere melt onset dataset by combining estimates from previously published algorithms optimized for individual cryospheric elements and applied to QS and Special Sensor Microwave Imager (SSM/I) data for the northern high latitude land surface, ice caps, large lakes, and sea ice. Comparisons of melt onset along the boundaries between different components of the cryosphere show that in general the integrated dataset provides consistent and spatially coherent melt onset estimates across the pan-Arctic. We present the climatology and the anomaly patterns in melt onset during 2000-2009, and identify synoptic-scale linkages between atmospheric conditions and the observed patterns. We also investigate the possible trends in melt onset in the pan-Arctic during the 10-year period.

  1. Condition Monitoring of Large-Scale Facilities

    NASA Technical Reports Server (NTRS)

    Hall, David L.

    1999-01-01

    This document provides a summary of the research conducted for the NASA Ames Research Center under grant NAG2-1182 (Condition-Based Monitoring of Large-Scale Facilities). The information includes copies of view graphs presented at NASA Ames in the final Workshop (held during December of 1998), as well as a copy of a technical report provided to the COTR (Dr. Anne Patterson-Hine) subsequent to the workshop. The material describes the experimental design, collection of data, and analysis results associated with monitoring the health of large-scale facilities. In addition to this material, a copy of the Pennsylvania State University Applied Research Laboratory data fusion visual programming tool kit was also provided to NASA Ames researchers.

  2. Seismic safety in conducting large-scale blasts

    NASA Astrophysics Data System (ADS)

    Mashukov, I. V.; Chaplygin, V. V.; Domanov, V. P.; Semin, A. A.; Klimkin, M. A.

    2017-09-01

    Mining enterprises use drilling and blasting to prepare hard rock for excavation. As mining operations approach settlements, the negative effects of large-scale blasts increase. To assess the level of seismic impact of large-scale blasts, the scientific staff of Siberian State Industrial University carried out expert assessments for coal mines and iron ore enterprises. The magnitude of surface seismic vibrations caused by mass explosions was determined using seismic receivers and an analog-to-digital converter with recording on a laptop. The registration results of surface seismic vibrations during more than 280 large-scale blasts at 17 mining enterprises in 22 settlements are presented. The maximum velocity values of the Earth’s surface vibrations are determined. The safety of the seismic effect was evaluated against the permissible value of vibration velocity. For cases where permissible values were exceeded, recommendations were developed to reduce the level of seismic impact.

  3. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets.

    PubMed

    Yu, Yao; Hu, Hao; Bohlender, Ryan J; Hu, Fulan; Chen, Jiun-Sheng; Holt, Carson; Fowler, Jerry; Guthery, Stephen L; Scheet, Paul; Hildebrandt, Michelle A T; Yandell, Mark; Huff, Chad D

    2018-04-06

    High-throughput sequencing data are increasingly being made available to the research community for secondary analyses, providing new opportunities for large-scale association studies. However, heterogeneity in target capture and sequencing technologies often introduces strong technological stratification biases that overwhelm subtle signals of association in studies of complex traits. Here, we introduce the Cross-Platform Association Toolkit, XPAT, which provides a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets. XPAT includes tools to support cross-platform aware variant calling, quality control filtering, gene-based association testing and rare variant effect size estimation. To evaluate the performance of XPAT, we conducted case-control association studies for three diseases, including 783 breast cancer cases, 272 ovarian cancer cases, 205 Crohn disease cases and 3507 shared controls (including 1722 females) using sequencing data from multiple sources. XPAT greatly reduced Type I error inflation in the case-control analyses, while replicating many previously identified disease-gene associations. We also show that association tests conducted with XPAT using cross-platform data have comparable performance to tests using matched platform data. XPAT enables new association studies that combine existing sequencing datasets to identify genetic loci associated with common diseases and other complex traits.

  4. Semi-supervised tracking of extreme weather events in global spatio-temporal climate datasets

    NASA Astrophysics Data System (ADS)

    Kim, S. K.; Prabhat, M.; Williams, D. N.

    2017-12-01

    Deep neural networks have been successfully applied to detect extreme weather events in large-scale climate datasets and attain performance superior to all previous hand-crafted methods. Recent work has shown that a multichannel spatiotemporal encoder-decoder CNN architecture is able to localize events with semi-supervised bounding boxes. Motivated by this work, we propose a new learning metric based on Variational Auto-Encoders (VAE) and Long Short-Term Memory (LSTM) to track extreme weather events in spatio-temporal datasets. We treat spatio-temporal object tracking as learning the probabilistic distribution of the continuous latent features of an auto-encoder using stochastic variational inference. For this, we assume that our datasets are i.i.d. and that the latent features can be modeled by a Gaussian distribution. In the proposed metric, we first train the VAE to generate an approximate posterior given multichannel climate input with an extreme climate event at a fixed time. Then, we predict the bounding box, location and class of extreme climate events using convolutional layers, given an input concatenating three features: the embedding, the sampled mean and the standard deviation. Lastly, we train the LSTM with the concatenated input to learn temporal information from the dataset by recurrently feeding the output back to the VAE input at the next time step. Our contribution is two-fold. First, we show the first semi-supervised end-to-end architecture based on VAE to track extreme weather events, which can be applied to massive unlabeled climate datasets. Second, the temporal movement of events is considered for bounding box prediction using the LSTM, which can improve localization accuracy. To our knowledge, this technique has not been explored in either the climate community or the machine learning community.

  5. A Fast SVD-Hidden-nodes based Extreme Learning Machine for Large-Scale Data Analytics.

    PubMed

    Deng, Wan-Yu; Bai, Zuo; Huang, Guang-Bin; Zheng, Qing-Hua

    2016-05-01

    Big dimensional data is a growing trend that is emerging in many real world contexts, extending from web mining, gene expression analysis, protein-protein interaction to high-frequency financial data. Nowadays, there is a growing consensus that the increasing dimensionality poses impeding effects on the performances of classifiers, which is termed the "peaking phenomenon" in the field of machine intelligence. To address the issue, dimensionality reduction is commonly employed as a preprocessing step on the Big dimensional data before building the classifiers. In this paper, we propose an Extreme Learning Machine (ELM) approach for large-scale data analytics. In contrast to existing approaches, we embed hidden nodes that are designed using singular value decomposition (SVD) into the classical ELM. These SVD nodes in the hidden layer are shown to capture the underlying characteristics of the Big dimensional data well, exhibiting excellent generalization performances. The drawback of using SVD on the entire dataset, however, is the high computational complexity involved. To address this, a fast divide and conquer approximation scheme is introduced to maintain computational tractability on high volume data. The resultant algorithm proposed is labeled here as Fast Singular Value Decomposition-Hidden-nodes based Extreme Learning Machine, or FSVD-H-ELM for short. In FSVD-H-ELM, instead of identifying the SVD hidden nodes directly from the entire dataset, SVD hidden nodes are derived from multiple random subsets of data sampled from the original dataset. Comprehensive experiments and comparisons are conducted to assess the FSVD-H-ELM against other state-of-the-art algorithms. The results obtained demonstrated the superior generalization performance and efficiency of the FSVD-H-ELM.
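
    The divide-and-conquer idea described above, hidden nodes taken from SVDs of random subsets rather than from one SVD of the full dataset, can be sketched roughly as follows. This is a loose illustration and not the published FSVD-H-ELM algorithm: the function names, the tanh activation, the ridge-regularized output layer and every default value are assumptions.

```python
import numpy as np

def fit_svd_elm(X, T, n_subsets=4, subset_size=50, nodes_per_subset=3,
                ridge=1e-3, seed=0):
    """Build hidden-node weights from SVDs of random row subsets of X,
    then fit the output weights by regularized least squares."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(n_subsets):
        size = min(subset_size, X.shape[0])
        idx = rng.choice(X.shape[0], size=size, replace=False)
        # Leading right singular vectors of the subset act as input weights.
        _, _, Vt = np.linalg.svd(X[idx], full_matrices=False)
        blocks.append(Vt[:nodes_per_subset].T)
    W = np.hstack(blocks)                     # (n_features, n_hidden)
    H = np.tanh(X @ W)                        # hidden-layer activations
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ T)
    return W, beta

def predict(X, W, beta):
    """Apply the fixed hidden layer and the learned output weights."""
    return np.tanh(X @ W) @ beta
```

    Each SVD here costs only as much as its subset, so the total cost scales with the number and size of the subsets rather than with the full dataset.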

  6. Potential for geophysical experiments in large scale tests.

    USGS Publications Warehouse

    Dieterich, J.H.

    1981-01-01

    Potential research applications for large-specimen geophysical experiments include measurements of scale dependence of physical parameters and examination of interactions with heterogeneities, especially flaws such as cracks. In addition, increased specimen size provides opportunities for improved recording resolution and greater control of experimental variables. Large-scale experiments using a special purpose low stress (100MPa).-Author

  7. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large scale screening environment DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at lesion level to reject false positives and is computationally efficient, as it generates a diagnosis in an average of 4.4 s (9.3 s, considering the optic nerve localization) per image on a 2.6 GHz platform with an unoptimized Matlab implementation.

  8. InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor.

    PubMed

    Coletta, Alain; Molter, Colin; Duqué, Robin; Steenhoff, David; Taminau, Jonatan; de Schaetzen, Virginie; Meganck, Stijn; Lazar, Cosmin; Venet, David; Detours, Vincent; Nowé, Ann; Bersini, Hugues; Weiss Solís, David Y

    2012-11-18

    Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.

  9. Homogenization of Large-Scale Movement Models in Ecology

    USGS Publications Warehouse

    Garlick, M.J.; Powell, J.A.; Hooten, M.B.; McFarlane, L.R.

    2011-01-01

    A difficulty in using diffusion models to predict large scale animal population dispersal is that individuals move differently based on local information (as opposed to gradients) in differing habitat types. This can be accommodated by using ecological diffusion. However, real environments are often spatially complex, limiting application of a direct approach. Homogenization for partial differential equations has long been applied to Fickian diffusion (in which average individual movement is organized along gradients of habitat and population density). We derive a homogenization procedure for ecological diffusion and apply it to a simple model for chronic wasting disease in mule deer. Homogenization allows us to determine the impact of small scale (10-100 m) habitat variability on large scale (10-100 km) movement. The procedure generates asymptotic equations for solutions on the large scale with parameters defined by small-scale variation. The simplicity of this homogenization procedure is striking when compared to the multi-dimensional homogenization procedure for Fickian diffusion, and the method will be equally straightforward for more complex models. © 2010 Society for Mathematical Biology.

  10. Felyx : A Free Open Software Solution for the Analysis of Large Earth Observation Datasets

    NASA Astrophysics Data System (ADS)

    Piolle, Jean-Francois; Shutler, Jamie; Poulter, David; Guidetti, Veronica; Donlon, Craig

    2014-05-01

    The GHRSST project, by assembling large collections of Earth observation data from various sources and agencies, has also raised the need to provide the user community with tools to inter-compare them and to assess and monitor their quality. The ESA/Medspiration project, which implemented the first operating node of the GHRSST system for Europe, also successfully paved the way towards such generic analytics tools by developing the High Resolution Diagnostic Dataset System (HR-DDS) and Satellite to In situ Multi-sensor Match-up Databases. Building on this heritage, ESA is now funding the development by IFREMER, PML and Pelamis of felyx, a web tool merging the two capabilities into a single software solution. It will consist of a free open software solution, written in Python and JavaScript, whose aim is to provide Earth Observation data producers and users with an open-source, flexible and reusable tool to allow the quality and performance of data streams (satellite, in situ and model) to be easily monitored and studied. The primary concept of Felyx is to work as an extraction tool, subsetting source data over predefined target areas (which can be static or moving): these data subsets, and associated metrics, can then be accessed by users or client applications either as raw files, automatic alerts and reports generated periodically, or through a flexible web interface enabling statistical analysis and visualization. Felyx presents itself as an open-source suite of tools, written in Python and JavaScript, enabling: * subsetting large local or remote collections of Earth Observation data over predefined sites (geographical boxes) or moving targets (ship, buoy, hurricane), storing locally the extracted data (referred to as miniProds). These miniProds constitute a much smaller representative subset of the original collection on which one can perform any kind of processing or assessment without having to cope with heavy volumes of data. * computing statistical metrics over these
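
    The extraction concept described above (subset a gridded field over a predefined geographical box, then compute metrics on the small subset) can be illustrated with a short sketch. This does not use the felyx API; the function names, the regular 1-degree grid, and the synthetic temperature-like field are all assumptions made here for illustration.

```python
import numpy as np

def subset_box(lat, lon, field, box):
    """Return the part of `field` inside box = (lat_min, lat_max, lon_min, lon_max).

    `field` is assumed to be a 2-D array indexed as [lat, lon].
    """
    lat_min, lat_max, lon_min, lon_max = box
    rows = (lat >= lat_min) & (lat <= lat_max)
    cols = (lon >= lon_min) & (lon <= lon_max)
    return field[np.ix_(rows, cols)]

def box_metrics(sub):
    """Compute simple statistics over the extracted subset, ignoring NaNs."""
    return {
        "mean": float(np.nanmean(sub)),
        "std": float(np.nanstd(sub)),
        "count": int(np.count_nonzero(~np.isnan(sub))),
    }

# Synthetic global grid: values depend on latitude only.
lat = np.arange(-90.0, 91.0, 1.0)
lon = np.arange(-180.0, 180.0, 1.0)
field = 290.0 + 0.1 * lat[:, None] + 0.0 * lon[None, :]

site = (40.0, 50.0, -20.0, -10.0)   # hypothetical static target area
sub = subset_box(lat, lon, field, site)
metrics = box_metrics(sub)
```

    The subset here holds 121 grid cells out of 65,160, which shows why working on such extracts is far lighter than processing the full collection.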

  11. The large-scale organization of metabolic networks

    NASA Astrophysics Data System (ADS)

    Jeong, H.; Tombor, B.; Albert, R.; Oltvai, Z. N.; Barabási, A.-L.

    2000-10-01

    In a cell or microorganism, the processes that generate mass, energy, information transfer and cell-fate specification are seamlessly integrated through a complex network of cellular constituents and reactions. However, despite the key role of these networks in sustaining cellular functions, their large-scale structure is essentially unknown. Here we present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems. This may indicate that metabolic organization is not only identical for all living organisms, but also complies with the design principles of robust and error-tolerant scale-free networks, and may represent a common blueprint for the large-scale organization of interactions among all cellular constituents.

  12. Large-scale weakly supervised object localization via latent category learning.

    PubMed

    Chong Wang; Kaiqi Huang; Weiqiang Ren; Junge Zhang; Maybank, Steve

    2015-04-01

    Localizing objects in cluttered backgrounds is challenging under large-scale weakly supervised conditions. Due to the cluttered image condition, objects usually have large ambiguity with backgrounds. Besides, there is also a lack of effective algorithms for large-scale weakly supervised localization in cluttered backgrounds. However, backgrounds contain useful latent information, e.g., the sky in the aeroplane class. If this latent information can be learned, object-background ambiguity can be largely reduced and background can be suppressed effectively. In this paper, we propose latent category learning (LCL) in large-scale cluttered conditions. LCL is an unsupervised learning method which requires only image-level class labels. First, we use latent semantic analysis with semantic object representation to learn the latent categories, which represent objects, object parts or backgrounds. Second, to determine which category contains the target object, we propose a category selection strategy that evaluates each category's discrimination. Finally, we propose the online LCL for use in large-scale conditions. Evaluation on the challenging PASCAL Visual Object Class (VOC) 2007 and the ImageNet Large-Scale Visual Recognition Challenge 2013 detection data sets shows that the method can improve the annotation precision by 10% over previous methods. More importantly, we achieve a detection precision which outperforms previous results by a large margin and is competitive with the supervised deformable part model 5.0 baseline on both data sets.

  13. Subsampling for dataset optimisation

    NASA Astrophysics Data System (ADS)

    Ließ, Mareike

    2017-04-01

    Soil-landscapes have formed by the interaction of soil-forming factors and pedogenic processes. In modelling these landscapes in their pedodiversity and the underlying processes, a representative unbiased dataset is required. This concerns model input as well as output data. However, very often big datasets are available which are highly heterogeneous and were gathered for various purposes, but not to model a particular process or data space. As a first step, the overall data space and/or landscape section to be modelled needs to be identified including considerations regarding scale and resolution. Then the available dataset needs to be optimised via subsampling to well represent this n-dimensional data space. A couple of well-known sampling designs may be adapted to suit this purpose. The overall approach follows three main strategies: (1) the data space may be condensed and de-correlated by a factor analysis to facilitate the subsampling process. (2) Different methods of pattern recognition serve to structure the n-dimensional data space to be modelled into units which then form the basis for the optimisation of an existing dataset through a sensible selection of samples. Along the way, data units for which there is currently insufficient soil data available may be identified. And (3) random samples from the n-dimensional data space may be replaced by similar samples from the available dataset. While being a presupposition to develop data-driven statistical models, this approach may also help to develop universal process models and identify limitations in existing models.
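
    Strategy (3) above, replacing random samples from the data space with similar samples from the available dataset, might look like the sketch below. It is a minimal illustration only: uniform sampling over the bounding box of the data, range-scaled Euclidean distance as the similarity measure, and selection without replacement are assumptions made here, not details given in the abstract.

```python
import numpy as np

def match_random_targets(data, n_targets, seed=0):
    """Pick dataset rows nearest to random points drawn from the data space.

    Each random target is replaced by its most similar unused sample.
    Requires n_targets <= len(data).
    """
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    # Scale each dimension by its range so no single variable dominates.
    scale = np.where(hi > lo, hi - lo, 1.0)
    targets = rng.uniform(lo, hi, size=(n_targets, data.shape[1]))
    available = np.ones(len(data), dtype=bool)
    chosen = []
    for t in targets:
        dist = np.linalg.norm((data - t) / scale, axis=1)
        dist[~available] = np.inf            # never pick the same sample twice
        i = int(np.argmin(dist))
        chosen.append(i)
        available[i] = False
    return np.array(chosen)

# Synthetic, clustered 3-dimensional dataset for demonstration.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 0.3, size=(80, 3)),
                  rng.normal(3.0, 1.0, size=(20, 3))])
subsample = match_random_targets(data, n_targets=10)
```

    Because the targets are spread evenly over the data space, the selected subsample is less dominated by the densely sampled cluster than a simple random draw of rows would be.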

  14. GLEAM version 3: Global Land Evaporation Datasets and Model

    NASA Astrophysics Data System (ADS)

    Martens, B.; Miralles, D. G.; Lievens, H.; van der Schalie, R.; de Jeu, R.; Fernandez-Prieto, D.; Verhoest, N.

    2015-12-01

    Terrestrial evaporation links energy, water and carbon cycles over land and is therefore a key variable of the climate system. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to limitations in in situ measurements. As a result, several methods have risen to estimate global patterns of land evaporation from satellite observations. However, these algorithms generally differ in their approach to model evaporation, resulting in large differences in their estimates. One of these methods is GLEAM, the Global Land Evaporation: the Amsterdam Methodology. GLEAM estimates terrestrial evaporation based on daily satellite observations of meteorological variables, vegetation characteristics and soil moisture. Since the publication of the first version of the algorithm (2011), the model has been widely applied to analyse trends in the water cycle and land-atmospheric feedbacks during extreme hydrometeorological events. A third version of the GLEAM global datasets is foreseen by the end of 2015. Given the relevance of having a continuous and reliable record of global-scale evaporation estimates for climate and hydrological research, the establishment of an online data portal to host these data to the public is also foreseen. In this new release of the GLEAM datasets, different components of the model have been updated, with the most significant change being the revision of the data assimilation algorithm. In this presentation, we will highlight the most important changes of the methodology and present three new GLEAM datasets and their validation against in situ observations and an alternative dataset of terrestrial evaporation (ERA-Land). Results of the validation exercise indicate that the magnitude and the spatiotemporal variability of the modelled evaporation agree reasonably well with the estimates of ERA-Land and the in situ

  15. Large-scale structure of randomly jammed spheres

    NASA Astrophysics Data System (ADS)

    Ikeda, Atsushi; Berthier, Ludovic; Parisi, Giorgio

    2017-05-01

    We numerically analyze the density field of three-dimensional randomly jammed packings of monodisperse soft frictionless spherical particles, paying special attention to fluctuations occurring at large length scales. We study in detail the two-point static structure factor at low wave vectors in Fourier space. We also analyze the nature of the density field in real space by studying the large-distance behavior of the two-point pair correlation function, of density fluctuations in subsystems of increasing sizes, and of the direct correlation function. We show that such real space analysis can be greatly improved by introducing a coarse-grained density field to disentangle genuine large-scale correlations from purely local effects. Our results confirm that both Fourier and real space signatures of vanishing density fluctuations at large scale are absent, indicating that randomly jammed packings are not hyperuniform. In addition, we establish that the pair correlation function displays a surprisingly complex structure at large distances, which is however not compatible with the long-range negative correlation of hyperuniform systems but fully compatible with an analytic form for the structure factor. This implies that the direct correlation function is short ranged, as we also demonstrate directly. Our results reveal that density fluctuations in jammed packings do not follow the behavior expected for random hyperuniform materials, but display instead a more complex behavior.
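
    For readers unfamiliar with the quantity studied here, the two-point static structure factor of N particles in a periodic cubic box can be estimated as S(k) = |sum_j exp(-i k·r_j)|^2 / N at wave vectors k = 2πn/L. The sketch below is a generic illustration, not the authors' analysis code; the function name, the box size, and the uncorrelated random test configuration are assumptions. A hyperuniform system would show S(k) vanishing as k goes to 0, whereas uncorrelated points give S(k) close to 1 on average.

```python
import numpy as np

def structure_factor(positions, box_length, n_max=3):
    """Return (|k|, S(k)) for wave vectors k = 2*pi*n/L with integer n != 0."""
    n_particles = positions.shape[0]
    ns = np.arange(-n_max, n_max + 1)
    grid = np.array(np.meshgrid(ns, ns, ns, indexing="ij")).reshape(3, -1).T
    grid = grid[np.any(grid != 0, axis=1)]        # drop the k = 0 vector
    k_vecs = 2.0 * np.pi * grid / box_length
    phases = positions @ k_vecs.T                 # shape (N, number of k)
    rho_k = np.exp(-1j * phases).sum(axis=0)      # Fourier density modes
    s_k = (rho_k * rho_k.conj()).real / n_particles
    return np.linalg.norm(k_vecs, axis=1), s_k

# Uncorrelated (ideal-gas-like) points as a reference configuration.
rng = np.random.default_rng(0)
box_length = 10.0
positions = rng.uniform(0.0, box_length, size=(2000, 3))
k_norm, s_k = structure_factor(positions, box_length)
```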

  16. National Hydrography Dataset Plus (NHDPlus)

    EPA Pesticide Factsheets

    The NHDPlus Version 1.0 is an integrated suite of application-ready geospatial data sets that incorporate many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage enforcement technique first broadly applied in New England, and thus dubbed The New-England Method. This technique involves burning-in the 1:100,000-scale NHD and when available building walls using the national Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. An interdisciplinary team from the U.S. Geological Survey (USGS), U.S. Environmental Protection Agency (USEPA), and contractors, over the last two years has found this method to produce the best quality NHD catchments using an automated process. The VAAs include greatly enhanced capabilities for upstream and downstream navigation, analysis and modeling. Examples include: retrieve all flowlines (predominantly confluence-to-confluence stream segments) and catchments upstream of a given flowline using queries rather than by slower flowline-by-flowline navigation; retrieve flowlines by stream order; subset a stream level path sorted in hydrologic order for st

  17. Analysis of Radar and Optical Space Borne Data for Large Scale Topographical Mapping

    NASA Astrophysics Data System (ADS)

    Tampubolon, W.; Reinhardt, W.

    2015-03-01

    Normally, in order to provide high resolution three-dimensional (3D) geospatial data, large scale topographical mapping needs input from conventional airborne campaigns, which in Indonesia are bureaucratically complicated, especially during legal administration procedures, i.e. security clearance from the military/defense ministry. This often causes additional time delays besides technical constraints such as weather and limited aircraft availability for airborne campaigns. Of course the geospatial data quality is an important issue for many applications. The increasing demand for geospatial data nowadays consequently requires high resolution datasets as well as a sufficient level of accuracy. Therefore an integration of different technologies is required in many cases to gain the expected result, especially in the context of disaster preparedness and emergency response. Another important issue in this context is the fast delivery of relevant data, which is expressed by the term "Rapid Mapping". In this paper we present first results of on-going research to integrate different data sources like space borne radar and optical platforms. Initially the orthorectification of Very High Resolution Satellite (VHRS) imagery, i.e. SPOT-6, has been done as a continuous process to the DEM generation using TerraSAR-X/TanDEM-X data. The role of Ground Control Points (GCPs) from GNSS surveys is mandatory in order to fulfil geometrical accuracy. In addition, this research aims at providing a suitable processing algorithm for space borne data for large scale topographical mapping as described in section 3.2. Recently, radar space borne data has been used for medium scale topographical mapping, e.g. for 1:50.000 map scale in Indonesian territories. The goal of this on-going research is to increase the accuracy of remote sensing data by different activities, e.g. the integration of different data sources (optical and radar) or the usage of the GCPs in both the optical and the radar satellite data

  18. Environmental Impacts of Large Scale Biochar Application Through Spatial Modeling

    NASA Astrophysics Data System (ADS)

    Huber, I.; Archontoulis, S.

    2017-12-01

    In an effort to study the environmental (emissions, soil quality) and production (yield) impacts of biochar application at regional scales we coupled the APSIM-Biochar model with the pSIMS parallel platform. So far the majority of biochar research has been concentrated on lab to field studies to advance scientific knowledge. Regional scale assessments are highly needed to assist decision making. The overall objective of this simulation study was to identify areas in the USA that gain the most environmentally from biochar application, as well as areas for which our model predicts a notable yield increase due to the addition of biochar. We present the modifications in both APSIM biochar and pSIMS components that were necessary to facilitate these large scale model runs across several regions in the United States at a resolution of 5 arcminutes. This study uses the AgMERRA global climate data set (1980-2010) and the Global Soil Dataset for Earth Systems modeling as a basis for creating its simulations, as well as local management operations for maize and soybean cropping systems and different biochar application rates. The regional scale simulation analysis is in progress. Preliminary results showed that the model predicts that high quality soils (particularly those common to Iowa cropping systems) do not receive much, if any, production benefit from biochar. However, soils with low soil organic matter ( 0.5%) do get a noteworthy yield increase of around 5-10% in the best cases. We also found N2O emissions to be spatially and temporally specific, increasing in some areas and decreasing in others due to biochar application. In contrast, we found increases in soil organic carbon and plant available water in all soils (top 30 cm) due to biochar application. The magnitude of these increases (% change from the control) was larger in soils with low organic matter (below 1.5%) and smaller in soils with high organic matter (above 3%) and also dependent on biochar

  19. An Novel Architecture of Large-scale Communication in IOT

    NASA Astrophysics Data System (ADS)

    Ma, Wubin; Deng, Su; Huang, Hongbin

    2018-03-01

    In recent years, many scholars have done a great deal of research on the development of Internet of Things and networked physical systems. However, few people have made the detailed visualization of the large-scale communications architecture in the IOT. In fact, the non-uniform technology between IPv6 and access points has led to a lack of broad principles of large-scale communications architectures. Therefore, this paper presents the Uni-IPv6 Access and Information Exchange Method (UAIEM), a new architecture and algorithm that addresses large-scale communications in the IOT.

  20. MOBBED: a computational data infrastructure for handling large collections of event-rich time series datasets in MATLAB

    PubMed Central

    Cockfield, Jeremy; Su, Kyungmin; Robbins, Kay A.

    2013-01-01

    Experiments to monitor human brain activity during active behavior record a variety of modalities (e.g., EEG, eye tracking, motion capture, respiration monitoring) and capture a complex environmental context leading to large, event-rich time series datasets. The considerable variability of responses within and among subjects in more realistic behavioral scenarios requires experiments to assess many more subjects over longer periods of time. This explosion of data requires better computational infrastructure to more systematically explore and process these collections. MOBBED is a lightweight, easy-to-use, extensible toolkit that allows users to incorporate a computational database into their normal MATLAB workflow. Although capable of storing quite general types of annotated data, MOBBED is particularly oriented to multichannel time series such as EEG that have event streams overlaid with sensor data. MOBBED directly supports access to individual events, data frames, and time-stamped feature vectors, allowing users to ask questions such as what types of events or features co-occur under various experimental conditions. A database provides several advantages not available to users who process one dataset at a time from the local file system. In addition to archiving primary data in a central place to save space and avoid inconsistencies, such a database allows users to manage, search, and retrieve events across multiple datasets without reading the entire dataset. The database also provides infrastructure for handling more complex event patterns that include environmental and contextual conditions. The database can also be used as a cache for expensive intermediate results that are reused in such activities as cross-validation of machine learning algorithms. MOBBED is implemented over PostgreSQL, a widely used open source database, and is freely available under the GNU general public license at http://visual.cs.utsa.edu/mobbed. Source and issue reports for MOBBED

  1. MOBBED: a computational data infrastructure for handling large collections of event-rich time series datasets in MATLAB.

    PubMed

    Cockfield, Jeremy; Su, Kyungmin; Robbins, Kay A

    2013-01-01

    Experiments to monitor human brain activity during active behavior record a variety of modalities (e.g., EEG, eye tracking, motion capture, respiration monitoring) and capture a complex environmental context leading to large, event-rich time series datasets. The considerable variability of responses within and among subjects in more realistic behavioral scenarios requires experiments to assess many more subjects over longer periods of time. This explosion of data requires better computational infrastructure to more systematically explore and process these collections. MOBBED is a lightweight, easy-to-use, extensible toolkit that allows users to incorporate a computational database into their normal MATLAB workflow. Although capable of storing quite general types of annotated data, MOBBED is particularly oriented to multichannel time series such as EEG that have event streams overlaid with sensor data. MOBBED directly supports access to individual events, data frames, and time-stamped feature vectors, allowing users to ask questions such as what types of events or features co-occur under various experimental conditions. A database provides several advantages not available to users who process one dataset at a time from the local file system. In addition to archiving primary data in a central place to save space and avoid inconsistencies, such a database allows users to manage, search, and retrieve events across multiple datasets without reading the entire dataset. The database also provides infrastructure for handling more complex event patterns that include environmental and contextual conditions. The database can also be used as a cache for expensive intermediate results that are reused in such activities as cross-validation of machine learning algorithms. MOBBED is implemented over PostgreSQL, a widely used open source database, and is freely available under the GNU general public license at http://visual.cs.utsa.edu/mobbed. Source and issue reports for MOBBED

  2. Gravitational lenses and large scale structure

    NASA Technical Reports Server (NTRS)

    Turner, Edwin L.

    1987-01-01

    Four possible statistical tests of the large scale distribution of cosmic material are described. Each is based on gravitational lensing effects. The current observational status of these tests is also summarized.

  3. Large-Scale 1:1 Computing Initiatives: An Open Access Database

    ERIC Educational Resources Information Center

    Richardson, Jayson W.; McLeod, Scott; Flora, Kevin; Sauers, Nick J.; Kannan, Sathiamoorthy; Sincar, Mehmet

    2013-01-01

    This article details the spread and scope of large-scale 1:1 computing initiatives around the world. What follows is a review of the existing literature around 1:1 programs followed by a description of the large-scale 1:1 database. Main findings include: 1) the XO and the Classmate PC dominate large-scale 1:1 initiatives; 2) if professional…

  4. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation

    PubMed Central

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B.

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and that uses digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware
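
    Of the encoding techniques listed above, rate-based Poisson spike generation is the simplest to sketch: each pixel fires at a rate proportional to its intensity. The code below is a generic illustration and not the dataset authors' generator; the function name, the 100 Hz maximum rate, the 1 ms time step, and the per-step Bernoulli approximation of a Poisson process are all assumptions made here.

```python
import numpy as np

def poisson_spike_train(intensities, max_rate_hz=100.0, duration_s=1.0,
                        dt_s=0.001, seed=0):
    """Convert pixel intensities in [0, 1] to boolean spike trains.

    Returns an array of shape (n_steps, n_pixels). Each time step is an
    independent Bernoulli draw with probability rate * dt, which approximates
    a Poisson process when rate * dt is much smaller than 1.
    """
    rng = np.random.default_rng(seed)
    rates = np.clip(np.asarray(intensities, dtype=float), 0.0, 1.0) * max_rate_hz
    n_steps = int(round(duration_s / dt_s))
    p_spike = rates.ravel() * dt_s
    return rng.random((n_steps, p_spike.size)) < p_spike

# Tiny 2x2 "image" standing in for a normalised MNIST digit.
image = np.array([[0.0, 0.5],
                  [1.0, 0.25]])
spikes = poisson_spike_train(image)
spike_counts = spikes.sum(axis=0)   # spikes per pixel over the 1 s window
```

    A pixel with intensity 0 never fires, while a pixel with intensity 1 fires roughly 100 times in one second under these assumed settings.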

  5. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation.

    PubMed

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and that uses digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware

  6. Collaborative Visualization for Large-Scale Accelerator Electromagnetic Modeling (Final Report)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    William J. Schroeder

    2011-11-13

    This report contains the comprehensive summary of the work performed on the SBIR Phase II, Collaborative Visualization for Large-Scale Accelerator Electromagnetic Modeling at Kitware Inc. in collaboration with Stanford Linear Accelerator Center (SLAC). The goal of the work was to develop collaborative visualization tools for large-scale data as illustrated in the figure below. The solutions we proposed address the typical problems faced by geographically and organizationally separated research and engineering teams, who produce large data (either through simulation or experimental measurement) and wish to work together to analyze and understand their data. Because the data is large, we expect that it cannot be easily transported to each team member's work site, and that the visualization server must reside near the data. Further, we also expect that each work site has heterogeneous resources: some have large computing clients, tiled (or large) displays and high bandwidth; other sites are as simple as a team member on a laptop computer. Our solution is based on the open-source, widely used ParaView large-data visualization application. We extended this tool to support multiple collaborative clients who may locally visualize data, and then periodically rejoin and synchronize with the group to discuss their findings. Options for managing session control, adding annotation, and defining the visualization pipeline, among others, were incorporated. We also developed and deployed a Web visualization framework based on ParaView that enables the Web browser to act as a participating client in a collaborative session. The ParaView Web Visualization framework leverages various Web technologies including WebGL, JavaScript, Java and Flash to enable interactive 3D visualization over the web using ParaView as the visualization server. We steered the development of this technology by teaming with the SLAC National Accelerator Laboratory. SLAC has a computationally

  7. Parallel Multivariate Spatio-Temporal Clustering of Large Ecological Datasets on Hybrid Supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sreepathi, Sarat; Kumar, Jitendra; Mills, Richard T.

    A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offers unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements have led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like the Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and specifically, large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.
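
    For orientation, the k-means step on which MSTC is based can be sketched serially as below. This is plain Lloyd's algorithm in pure Python, not the hybrid MPI, CUDA and OpenACC implementation described in the abstract; the function names, random initialization, and iteration limit are illustrative assumptions.

```python
import random


def _nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iters=50, seed=0):
    """Serial Lloyd's k-means. Returns (centroids, labels)."""
    points = [tuple(p) for p in points]
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    for _ in range(iters):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[_nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]  # keep an empty cluster's centroid
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [_nearest(p, centroids) for p in points]
    return centroids, labels
```

    The assignment step is independent for every point, which is the part that parallel implementations typically distribute across nodes and accelerators.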

  8. Spatiotemporal property and predictability of large-scale human mobility

    NASA Astrophysics Data System (ADS)

    Zhang, Hai-Tao; Zhu, Tao; Fu, Dongfei; Xu, Bowen; Han, Xiao-Pu; Chen, Duxin

    2018-04-01

    Spatiotemporal characteristics of human mobility emerging from complexity on the individual scale have been extensively studied because of their potential application to human behavior prediction and recommendation and to the control of epidemic spreading. We collect and investigate a comprehensive data set of human activities on large geographical scales, including both website browsing and mobile tower visits. Numerical results show that the degree of activity decays as a power law, indicating that human behaviors are reminiscent of scale-free random walks known as Lévy flight. More significantly, this study suggests that human activities on large geographical scales have specific non-Markovian characteristics, such as a two-segment power-law distribution of dwelling time and a high possibility for prediction. Furthermore, a scale-free featured mobility model with two essential ingredients, i.e., preferential return and exploration, and a Gaussian distribution assumption on the exploration tendency parameter is proposed, which outperforms existing human mobility models under scenarios of large geographical scales.

  9. A Novel Strategy for Very-Large-Scale Cash-Crop Mapping in the Context of Weather-Related Risk Assessment, Combining Global Satellite Multispectral Datasets, Environmental Constraints, and In Situ Acquisition of Geospatial Data.

    PubMed

    Dell'Acqua, Fabio; Iannelli, Gianni Cristian; Torres, Marco A; Martina, Mario L V

    2018-02-14

    Cash crops are agricultural crops intended to be sold for profit as opposed to subsistence crops, meant to support the producer, or to support livestock. Since cash crops are intended for future sale, they translate into large financial value when considered on a wide geographical scale, so their production directly involves financial risk. At a national level, extreme weather events including destructive rain or hail, as well as drought, can have a significant impact on the overall economic balance. It is thus important to map such crops in order to set up insurance and mitigation strategies. Using locally generated data (such as municipality-level records of crop seeding) for mapping purposes implies facing a series of issues like data availability, quality, homogeneity, etc. We thus opted for a different approach relying on global datasets. Global datasets ensure homogeneity and availability of data, although sometimes at the expense of precision and accuracy. A typical global approach makes use of spaceborne remote sensing, for which different land cover classification strategies are available in literature at different levels of cost and accuracy. We selected the optimal strategy in the perspective of a global processing chain. Thanks to a specifically developed strategy for fusing unsupervised classification results with environmental constraints and other geospatial inputs including ground-based data, we managed to obtain good classification results despite the constraints placed. The overall production process was composed using "good-enough" algorithms at each step, ensuring that the precision, accuracy, and data-hunger of each algorithm was commensurate to the precision, accuracy, and amount of data available. This paper describes the tailored strategy developed on the occasion as a cooperation among different groups with diverse backgrounds, a strategy which is believed to be profitably reusable in other, similar contexts. The paper presents the problem

  10. A Novel Strategy for Very-Large-Scale Cash-Crop Mapping in the Context of Weather-Related Risk Assessment, Combining Global Satellite Multispectral Datasets, Environmental Constraints, and In Situ Acquisition of Geospatial Data

    PubMed Central

    Iannelli, Gianni Cristian; Torres, Marco A.

    2018-01-01

    Cash crops are agricultural crops intended to be sold for profit as opposed to subsistence crops, meant to support the producer, or to support livestock. Since cash crops are intended for future sale, they translate into large financial value when considered on a wide geographical scale, so their production directly involves financial risk. At a national level, extreme weather events including destructive rain or hail, as well as drought, can have a significant impact on the overall economic balance. It is thus important to map such crops in order to set up insurance and mitigation strategies. Using locally generated data (such as municipality-level records of crop seeding) for mapping purposes implies facing a series of issues like data availability, quality, homogeneity, etc. We thus opted for a different approach relying on global datasets. Global datasets ensure homogeneity and availability of data, although sometimes at the expense of precision and accuracy. A typical global approach makes use of spaceborne remote sensing, for which different land cover classification strategies are available in literature at different levels of cost and accuracy. We selected the optimal strategy in the perspective of a global processing chain. Thanks to a specifically developed strategy for fusing unsupervised classification results with environmental constraints and other geospatial inputs including ground-based data, we managed to obtain good classification results despite the constraints placed. The overall production process was composed using "good-enough" algorithms at each step, ensuring that the precision, accuracy, and data-hunger of each algorithm was commensurate to the precision, accuracy, and amount of data available. This paper describes the tailored strategy developed on the occasion as a cooperation among different groups with diverse backgrounds, a strategy which is believed to be profitably reusable in other, similar contexts. The paper presents the

  11. Wedge measures parallax separations...on large-scale 70-mm

    Treesearch

    Steven L. Wert; Richard J. Myhre

    1967-01-01

    A new parallax wedge (range: 1.5 to 2 inches) has been designed for use with large-scaled 70-mm. aerial photographs. The narrow separation of the wedge allows the user to measure small parallax separations that are characteristic of large-scale photographs.

  12. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

    PubMed Central

    Azad, Ariful; Ouzounis, Christos A; Kyrpides, Nikos C; Buluç, Aydin

    2018-01-01

    Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license. PMID:29315405
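
    As background for this record, the serial Markov Clustering iteration (alternating expansion and inflation of a column-stochastic matrix) can be sketched for a small dense graph as below. This is an illustration only, not HipMCL or the reference MCL code: the dense list-of-lists matrix, the added self-loops, the default inflation of 2, and the rule of labelling each node by its strongest attractor row are assumptions.

```python
def _normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def _matmul(a, b):
    """Dense product of two square matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mcl(adj, inflation=2.0, max_iters=100, tol=1e-9):
    """Cluster the graph given by adjacency matrix adj; returns one label per node."""
    n = len(adj)
    # Self-loops guarantee that no column is empty before normalization.
    m = [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = _normalize_columns(m)
    for _ in range(max_iters):
        expanded = _matmul(m, m)  # expansion: two-step random walks
        inflated = _normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )  # inflation: strengthen strong flows, weaken weak ones
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Label each node (column) by the row that attracts most of its flow.
    return [max(range(n), key=lambda i: m[i][j]) for j in range(n)]
```

    The dense matrix product in the expansion step dominates the cost, which is why large networks call for sparse, distributed implementations.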

  13. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

    DOE PAGES

    Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.; ...

    2018-01-05

    Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.

  14. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.

    Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.

  15. Intensification and Structure Change of Super Typhoon Flo as Related to the Large-Scale Environment.

    DTIC Science & Technology

    1998-06-01

    large dataset is a challenge. Schiavone and Papathomas (1990) summarize methods currently available for visualizing scientific 116 datasets. These...Prediction and Dynamic Meteorology, Second Edition. John Wiley and Sons, 477 pp. Hardy, R. L., 1971: Multiquadric equations of topography and other...Inter. Corp., Monterey CA, 40 pp. Sawyer, J. S., 1947: Notes on the theory of tropical cyclones. Quart. J. Roy. Meteor. Soc, 73, 101-126. Schiavone

  16. Large datasets, logistics, sharing and workflow in screening.

    PubMed

    Cook, Tessa S

    2018-03-29

    Cancer screening initiatives exist around the world for different malignancies, most frequently breast, colorectal, and cervical cancer. A number of cancer registries exist to collect relevant data, but while these data may include imaging findings, they rarely, if ever, include actual images. Additionally, the data submitted to the registry are usually correlated with eventual cancer diagnoses and patient outcomes, rather than used with the individual's future screenings. Developing screening programs that allow for images to be submitted to a central location in addition to patient metadata and used for comparison to future screening exams would be very valuable in increasing access to care and ensuring that individuals are effectively screened at appropriate intervals. It would also change the way imaging results and additional patient data are correlated to eventual outcomes. However, it introduces logistical challenges surrounding secure storage and transmission of data to subsequent screening sites. In addition, in the absence of standardized protocols for screening, comparing current and prior imaging, especially from different equipment, can be challenging. Implementing a large-scale screening program with an image-enriched screening registry (effectively, an image-enriched electronic screening record) also requires that incentives exist for screening sites, physicians, and patients to participate; to maximize coverage, participation may have to be supported by government agencies. Workflows will also have to be adjusted to support registry participation for all screening patients in an effort to create a large, robust data set that can be used for future screening efforts as well as research initiatives.

  17. A computationally efficient Bayesian sequential simulation approach for the assimilation of vast and diverse hydrogeophysical datasets

    NASA Astrophysics Data System (ADS)

    Nussbaumer, Raphaël; Gloaguen, Erwan; Mariéthoz, Grégoire; Holliger, Klaus

    2016-04-01

    Bayesian sequential simulation (BSS) is a powerful geostatistical technique, which notably has shown significant potential for the assimilation of datasets that are diverse with regard to the spatial resolution and their relationship. However, these types of applications of BSS require a large number of realizations to adequately explore the solution space and to assess the corresponding uncertainties. Moreover, such simulations generally need to be performed on very fine grids in order to adequately exploit the technique's potential for characterizing heterogeneous environments. Correspondingly, the computational cost of BSS algorithms in their classical form is very high, which so far has limited an effective application of this method to large models and/or vast datasets. In this context, it is also important to note that the inherent assumption regarding the independence of the considered datasets is generally regarded as being too strong in the context of sequential simulation. To alleviate these problems, we have revisited the classical implementation of BSS and incorporated two key features to increase the computational efficiency. The first feature is a combined quadrant spiral - superblock search, which targets run-time savings on large grids and adds flexibility with regard to the selection of neighboring points using equal directional sampling and treating hard data and previously simulated points separately. The second feature is a constant path of simulation, which enhances the efficiency for multiple realizations. We have also modified the aggregation operator to be more flexible with regard to the assumption of independence of the considered datasets. This is achieved through log-linear pooling, which essentially allows for attributing weights to the various data components. Finally, a multi-grid simulating path was created to enforce large-scale variance and to allow for adapting parameters, such as, for example, the log-linear weights or the type
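
    The log-linear pooling mentioned in this abstract can be illustrated with a short sketch. The abstract does not give the authors' exact aggregation operator, so the code assumes the standard form, in which the pooled probability of each outcome is proportional to the product of the component probabilities raised to their weights; the function name and the discrete representation are illustrative.

```python
import math


def log_linear_pool(distributions, weights):
    """Pool discrete distributions as p(x) proportional to prod_i p_i(x) ** w_i.

    distributions: list of equal-length probability lists.
    weights: one non-negative weight per distribution.
    """
    n_outcomes = len(distributions[0])
    log_p = []
    for x in range(n_outcomes):
        total = 0.0
        for dist, w in zip(distributions, weights):
            if w == 0:
                continue  # a zero weight removes the component entirely
            total += w * (math.log(dist[x]) if dist[x] > 0 else float("-inf"))
        log_p.append(total)
    peak = max(log_p)
    if peak == float("-inf"):
        raise ValueError("the components assign zero probability to every outcome")
    # Subtract the peak before exponentiating for numerical stability.
    unnormalized = [math.exp(v - peak) for v in log_p]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]
```

    With all weights equal to 1 this reduces to the usual product of independent components, so lowering a weight relaxes the independence assumption for that component.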

  18. CANFAR + Skytree: Mining Massive Datasets as an Essential Part of the Future of Astronomy

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.

    2013-01-01

    The future study of large astronomical datasets, consisting of hundreds of millions to billions of objects, will be dominated by large computing resources, and by analysis tools of the necessary scalability and sophistication to extract useful information. Significant effort will be required to fulfil their potential as a provider of the next generation of science results. To date, computing systems have allowed either sophisticated analysis of small datasets, e.g., most astronomy software, or simple analysis of large datasets, e.g., database queries. At the Canadian Astronomy Data Centre, we have combined our cloud computing system, the Canadian Advanced Network for Astronomical Research (CANFAR), with the world's most advanced machine learning software, Skytree, to create the world's first cloud computing system for data mining in astronomy. This allows the full sophistication of the huge fields of data mining and machine learning to be applied to the hundreds of millions of objects that make up current large datasets. CANFAR works by utilizing virtual machines, which appear to the user as equivalent to a desktop. Each machine is replicated as desired to perform large-scale parallel processing. Such an arrangement carries far more flexibility than other cloud systems, because it enables the user to immediately install and run the same code that they already utilize for science on their desktop. We demonstrate the utility of the CANFAR + Skytree system by showing science results obtained, including assigning photometric redshifts with full probability density functions (PDFs) to a catalog of approximately 133 million galaxies from the MegaPipe reductions of the Canada-France-Hawaii Telescope Legacy Wide and Deep surveys. Each PDF is produced nonparametrically from 100 instances of the photometric parameters for each galaxy, generated by perturbing within the errors on the measurements. Hence, we produce, store, and assign redshifts to, a catalog of over 13 billion
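
    The perturbation scheme described in this abstract (100 perturbed instances per galaxy, each assigned a redshift, giving a nonparametric PDF) can be sketched as follows. This is not the CANFAR + Skytree pipeline: the Gaussian perturbation, the histogram estimate, the binning, and the caller-supplied `predict` function are hypothetical choices made for illustration.

```python
import random


def redshift_pdf(magnitudes, errors, predict, n_instances=100,
                 bins=10, z_range=(0.0, 2.0), seed=0):
    """Histogram estimate of a redshift PDF for one galaxy.

    predict is a hypothetical callable mapping a list of photometric
    parameters to a redshift; it stands in for whatever trained model is used.
    """
    rng = random.Random(seed)
    lo, hi = z_range
    width = (hi - lo) / bins
    counts = [0] * bins
    for _ in range(n_instances):
        # Assumption: measurement errors are Gaussian standard deviations.
        perturbed = [rng.gauss(m, e) for m, e in zip(magnitudes, errors)]
        z = predict(perturbed)
        if lo <= z < hi:
            counts[min(bins - 1, int((z - lo) / width))] += 1
    total = sum(counts)
    # Normalize so that the histogram integrates to 1 over z_range.
    return [c / (total * width) if total else 0.0 for c in counts]
```

    At 100 instances for each of about 133 million galaxies, such a scheme evaluates the predictor roughly 13.3 billion times, consistent with the catalog size quoted in the abstract.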

  19. A high-throughput system for high-quality tomographic reconstruction of large datasets at Diamond Light Source

    PubMed Central

    Atwood, Robert C.; Bodey, Andrew J.; Price, Stephen W. T.; Basham, Mark; Drakopoulos, Michael

    2015-01-01

    Tomographic datasets collected at synchrotrons are becoming very large and complex, and, therefore, need to be managed efficiently. Raw images may have high pixel counts, and each pixel can be multidimensional and associated with additional data such as those derived from spectroscopy. In time-resolved studies, hundreds of tomographic datasets can be collected in sequence, yielding terabytes of data. Users of tomographic beamlines are drawn from various scientific disciplines, and many are keen to use tomographic reconstruction software that does not require a deep understanding of reconstruction principles. We have developed Savu, a reconstruction pipeline that enables users to rapidly reconstruct data to consistently create high-quality results. Savu is designed to work in an ‘orthogonal’ fashion, meaning that data can be converted between projection and sinogram space throughout the processing workflow as required. The Savu pipeline is modular and allows processing strategies to be optimized for users' purposes. In addition to the reconstruction algorithms themselves, it can include modules for identification of experimental problems, artefact correction, general image processing and data quality assessment. Savu is open source, open licensed and ‘facility-independent’: it can run on standard cluster infrastructure at any institution. PMID:25939626

  20. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.

    PubMed

    Tsatsaronis, George; Balikas, Georgios; Malakasiotis, Prodromos; Partalas, Ioannis; Zschunke, Matthias; Alvers, Michael R; Weissenborn, Dirk; Krithara, Anastasia; Petridis, Sergios; Polychronopoulos, Dimitris; Almirantis, Yannis; Pavlopoulos, John; Baskiotis, Nicolas; Gallinari, Patrick; Artiéres, Thierry; Ngomo, Axel-Cyrille Ngonga; Heino, Norman; Gaussier, Eric; Barrio-Alvers, Liliana; Schroeder, Michael; Androutsopoulos, Ion; Paliouras, Georgios

    2015-04-30

    This article provides an overview of the first BIOASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BIOASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. The 2013 BIOASQ competition comprised two tasks, Task 1a and Task 1b. In Task 1a participants were asked to automatically annotate new PUBMED documents with MESH headings. Twelve teams participated in Task 1a, with a total of 46 system runs submitted, and one of the teams performing consistently better than the MTI indexer used by NLM to suggest MESH headings to curators. Task 1b used benchmark datasets containing 29 development and 282 test English questions, along with gold standard (reference) answers, prepared by a team of biomedical experts from around Europe, and participants had to automatically produce answers. Three teams participated in Task 1b, with 11 system runs. The BIOASQ infrastructure, including benchmark datasets, evaluation mechanisms, and the results of the participants and baseline methods, is publicly available. A publicly available evaluation infrastructure for biomedical semantic indexing and QA has been developed, which includes benchmark datasets, and can be used to evaluate systems that: assign MESH headings to published articles or to English questions; retrieve relevant RDF triples from ontologies, relevant articles and snippets from PUBMED Central; produce "exact" and paragraph-sized "ideal" answers (summaries). The results of the systems that participated in the 2013 BIOASQ competition are promising. In Task 1a one of the systems performed consistently better than the NLM's MTI indexer. In Task 1b the systems received high scores in the manual evaluation of the "ideal

  1. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  2. Management of large-scale technology

    NASA Technical Reports Server (NTRS)

    Levine, A.

    1985-01-01

    Two major themes are addressed in this assessment of the management of large-scale NASA programs: (1) how a high-technology agency was managed through a decade marked by a rapid expansion of funds and manpower in the first half and an almost equally rapid contraction in the second; and (2) how NASA combined central planning and control with decentralized project execution.

  3. LSD: Large Survey Database framework

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2012-09-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures.

  4. Food security through large scale investments in agriculture

    NASA Astrophysics Data System (ADS)

    Rulli, M.; D'Odorico, P.

    2013-12-01

    Most of the human appropriation of freshwater resources is for food production. There is some concern that in the near future the finite freshwater resources available on Earth might not be sufficient to meet the increasing human demand for agricultural products. In the late 1700s Malthus argued that in the long run humanity would not have enough resources to feed itself. Malthus' analysis, however, did not account for the emergence of technological innovations that could increase the rate of food production. Modern and contemporary history has seen at least three major technological advances that have increased humans' access to food, namely, the industrial revolution, the green revolution, and the intensification of global trade. Here we argue that a fourth revolution has just started to happen. It involves foreign direct investments in agriculture, which intensify the crop yields of potentially highly productive agricultural lands by introducing the use of more modern technologies. The increasing demand for agricultural products and the uncertainty of international food markets have recently drawn the attention of governments and agribusiness firms toward investments in productive agricultural land, mostly in the developing world. The targeted countries are typically located in regions that have remained only marginally utilized because of lack of modern technology. It is expected that in the long run large scale land acquisitions for commercial farming will bring the technology required to close the existing yield gaps. While the extent of the acquired land and the associated appropriation of freshwater resources have been investigated in detail, the amount of food this land can produce and the number of people it could feed still need to be quantified. Here we use a unique dataset of verified land deals to provide a global quantitative assessment of the rates of crop and food appropriation potentially associated with large scale land acquisitions. We

  5. In search of a footprint: an investigation about the potentiality of large datasets and territorial analysis in disaster and resilience research.

    NASA Astrophysics Data System (ADS)

    Pregnolato, Marco; Petitta, Marcello; Schneiderbauer, Stefan; Pedoth, Lydia; Iasio, Christian; Kaveckis, Giedrius

    2014-05-01

    The present investigation aims to contribute to a better understanding of whether and how coarse scale data can prove useful in a study on resilience of communities towards natural hazards. The main goal of the work is the exploitation of large datasets in search of indicators and information valuable for resilience research; in particular, of marks in the statistical distribution of events as well as in the physical signs on a territory, to be possibly defined as disaster footprints. The approach developed required starting from theoretical considerations about some key concepts, such as footprint and resilience, and the possible influence of different types of adverse events on a territory. In particular, the research focuses on statistical signals that can be identified within datasets, concerning the effects of hazardous events against the background of resilience, defined as the "ability of a system and its component parts to anticipate, absorb, accommodate, or recover" from a disaster. The hypothesis for this work was that a disaster footprint could be shown using maps of land features and their changes. The questions linked to this hypothesis were: is it possible to recognize a multi-dimensional footprint on the land? Is it possible to do this using land cover/land use data? In order to answer these questions this work proposes a synthetic index, named for convenience the Hazard-Territory Index, created to categorize classes of Land Use/Land Cover from the CORINE Land Cover maps, by means of different approaches, according to the type of hazard. Through the use and elaboration of CORINE Land Cover data this work investigates whether the land and its use (in a way the relationship between a territory and the community living on it) and its changes over time can reveal some information and results relevant for the analysis of resilience. The investigation, set up in order to analyse these "signs on a map", led to implicate the notion of footprint as a multi

  6. US National Large-scale City Orthoimage Standard Initiative

    USGS Publications Warehouse

    Zhou, G.; Song, C.; Benjamin, S.; Schickler, W.

    2003-01-01

    The early procedures and algorithms for National digital orthophoto generation in National Digital Orthophoto Program (NDOP) were based on earlier USGS mapping operations, such as field control, aerotriangulation (derived in the early 1920's), the quarter-quadrangle-centered (3.75 minutes of longitude and latitude in geographic extent), 1:40,000 aerial photographs, and 2.5 D digital elevation models. However, large-scale city orthophotos using early procedures have disclosed many shortcomings, e.g., ghost image, occlusion, shadow. Thus, to provide the technical base (algorithms, procedure) and experience needed for city large-scale digital orthophoto creation is essential for the near future national large-scale digital orthophoto deployment and the revision of the Standards for National Large-scale City Digital Orthophoto in National Digital Orthophoto Program (NDOP). This paper will report our initial research results as follows: (1) High-precision 3D city DSM generation through LIDAR data processing, (2) Spatial objects/features extraction through surface material information and high-accuracy 3D DSM data, (3) 3D city model development, (4) Algorithm development for generation of DTM-based orthophoto, and DBM-based orthophoto, (5) True orthophoto generation by merging DBM-based orthophoto and DTM-based orthophoto, and (6) Automatic mosaic by optimizing and combining imagery from many perspectives.

  7. Large-scale structure in superfluid Chaplygin gas cosmology

    NASA Astrophysics Data System (ADS)

    Yang, Rongjia

    2014-03-01

    We investigate the growth of the large-scale structure in the superfluid Chaplygin gas (SCG) model. Both linear and nonlinear growth, such as σ8 and the skewness S3, are discussed. We find the growth factor of SCG reduces to the Einstein-de Sitter case at early times while it differs from the cosmological constant model (ΛCDM) case in the large-a limit. We also find there will be more structure growth on large scales in the SCG scenario than in ΛCDM and the variations of σ8 and S3 between SCG and ΛCDM cannot be discriminated.

  8. A large dataset of synthetic SEM images of powder materials and their ground truth 3D structures.

    PubMed

    DeCost, Brian L; Holm, Elizabeth A

    2016-12-01

    This data article presents a data set comprised of 2048 synthetic scanning electron microscope (SEM) images of powder materials and descriptions of the corresponding 3D structures that they represent. These images were created using open source rendering software, and the generating scripts are included with the data set. Eight particle size distributions are represented with 256 independent images from each. The particle size distributions are relatively similar to each other, so that the dataset offers a useful benchmark to assess the fidelity of image analysis techniques. The characteristics of the PSDs and the resulting images are described and analyzed in more detail in the research article "Characterizing powder materials using keypoint-based computer vision methods" (B.L. DeCost, E.A. Holm, 2016) [1]. These data are freely available in a Mendeley Data archive "A large dataset of synthetic SEM images of powder materials and their ground truth 3D structures" (B.L. DeCost, E.A. Holm, 2016) located at http://dx.doi.org/10.17632/tj4syyj9mr.1[2] for any academic, educational, or research purposes.

  9. Geospatial Optimization of Siting Large-Scale Solar Projects

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Macknick, Jordan; Quinby, Ted; Caulfield, Emmet

    2014-03-01

    Recent policy and economic conditions have encouraged a renewed interest in developing large-scale solar projects in the U.S. Southwest. However, siting large-scale solar projects is complex. In addition to the quality of the solar resource, solar developers must take into consideration many environmental, social, and economic factors when evaluating a potential site. This report describes a proof-of-concept, Web-based Geographical Information Systems (GIS) tool that evaluates multiple user-defined criteria in an optimization algorithm to inform discussions and decisions regarding the locations of utility-scale solar projects. Existing siting recommendations for large-scale solar projects from governmental and non-governmental organizations are not consistent with each other, are often not transparent in methods, and do not take into consideration the differing priorities of stakeholders. The siting assistance GIS tool we have developed improves upon the existing siting guidelines by being user-driven, transparent, interactive, capable of incorporating multiple criteria, and flexible. This work provides the foundation for a dynamic siting assistance tool that can greatly facilitate siting decisions among multiple stakeholders.

  10. Critical Issues in Large-Scale Assessment: A Resource Guide.

    ERIC Educational Resources Information Center

    Redfield, Doris

    The purpose of this document is to provide practical guidance and support for the design, development, and implementation of large-scale assessment systems that are grounded in research and best practice. Information is included about existing large-scale testing efforts, including national testing programs, state testing programs, and…

  11. National Hydrography Dataset (NHD)

    USGS Publications Warehouse

    ,

    2001-01-01

    The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that make up the nation's surface water drainage system. NHD data was originally developed at 1:100,000 scale and exists at that scale for the whole country. High resolution NHD adds detail to the original 1:100,000-scale NHD. (Data for Alaska, Puerto Rico and the Virgin Islands was developed at high-resolution, not 1:100,000 scale.) Like the 1:100,000-scale NHD, high resolution NHD contains reach codes for networked features and isolated lakes, flow direction, names, stream level, and centerline representations for areal water bodies. Reaches are also defined to represent waterbodies and the approximate shorelines of the Great Lakes, the Atlantic and Pacific Oceans and the Gulf of Mexico. The NHD also incorporates the National Spatial Data Infrastructure framework criteria set out by the Federal Geographic Data Committee.

  12. State of the Art in Large-Scale Soil Moisture Monitoring

    NASA Technical Reports Server (NTRS)

    Ochsner, Tyson E.; Cosh, Michael Harold; Cuenca, Richard H.; Dorigo, Wouter; Draper, Clara S.; Hagimoto, Yutaka; Kerr, Yan H.; Larson, Kristine M.; Njoku, Eni Gerald; Small, Eric E.

    2013-01-01

    Soil moisture is an essential climate variable influencing land atmosphere interactions, an essential hydrologic variable impacting rainfall runoff processes, an essential ecological variable regulating net ecosystem exchange, and an essential agricultural variable constraining food security. Large-scale soil moisture monitoring has advanced in recent years creating opportunities to transform scientific understanding of soil moisture and related processes. These advances are being driven by researchers from a broad range of disciplines, but this complicates collaboration and communication. For some applications, the science required to utilize large-scale soil moisture data is poorly developed. In this review, we describe the state of the art in large-scale soil moisture monitoring and identify some critical needs for research to optimize the use of increasingly available soil moisture data. We review representative examples of 1) emerging in situ and proximal sensing techniques, 2) dedicated soil moisture remote sensing missions, 3) soil moisture monitoring networks, and 4) applications of large-scale soil moisture measurements. Significant near-term progress seems possible in the use of large-scale soil moisture data for drought monitoring. Assimilation of soil moisture data for meteorological or hydrologic forecasting also shows promise, but significant challenges related to model structures and model errors remain. Little progress has been made yet in the use of large-scale soil moisture observations within the context of ecological or agricultural modeling. Opportunities abound to advance the science and practice of large-scale soil moisture monitoring for the sake of improved Earth system monitoring, modeling, and forecasting.

  13. A reference human genome dataset of the BGISEQ-500 sequencer.

    PubMed

    Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian

    2017-05-01

    BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
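
    For readers comparing the accuracy figures above, the following is a minimal sketch of how sensitivity and false positive rate are commonly computed from call counts. It assumes the textbook definitions (sensitivity = TP / (TP + FN), FPR = FP / (FP + TN)); the record does not state the exact definitions the authors used, and the counts below are hypothetical values chosen only to illustrate the arithmetic.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variants that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of non-variant sites wrongly called as variants: FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical counts, NOT taken from the BGISEQ-500 dataset.
sens = sensitivity(tp=962, fn=38)
fpr = false_positive_rate(fp=2, tn=999_998)

print(f"sensitivity = {sens:.2%}")  # sensitivity = 96.20%
print(f"FPR = {fpr:.5%}")           # FPR = 0.00020%
```

    Because non-variant sites vastly outnumber variant sites in a genome, the FPR is tiny even when the absolute number of false calls is not, which is why the two metrics are reported together.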

  14. Rapid Global Fitting of Large Fluorescence Lifetime Imaging Microscopy Datasets

    PubMed Central

    Warren, Sean C.; Margineanu, Anca; Alibhai, Dominic; Kelly, Douglas J.; Talbot, Clifford; Alexandrov, Yuriy; Munro, Ian; Katan, Matilda

    2013-01-01

    Fluorescence lifetime imaging (FLIM) is widely applied to obtain quantitative information from fluorescence signals, particularly using Förster Resonant Energy Transfer (FRET) measurements to map, for example, protein-protein interactions. Extracting FRET efficiencies or population fractions typically entails fitting data to complex fluorescence decay models but such experiments are frequently photon constrained, particularly for live cell or in vivo imaging, and this leads to unacceptable errors when analysing data on a pixel-wise basis. Lifetimes and population fractions may, however, be more robustly extracted using global analysis to simultaneously fit the fluorescence decay data of all pixels in an image or dataset to a multi-exponential model under the assumption that the lifetime components are invariant across the image (dataset). This approach is often considered to be prohibitively slow and/or computationally expensive but we present here a computationally efficient global analysis algorithm for the analysis of time-correlated single photon counting (TCSPC) or time-gated FLIM data based on variable projection. It makes efficient use of both computer processor and memory resources, requiring less than a minute to analyse time series and multiwell plate datasets with hundreds of FLIM images on standard personal computers. This lifetime analysis takes account of repetitive excitation, including fluorescence photons excited by earlier pulses contributing to the fit, and is able to accommodate time-varying backgrounds and instrument response functions. We demonstrate that this global approach allows us to readily fit time-resolved fluorescence data to complex models including a four-exponential model of a FRET system, for which the FRET efficiencies of the two species of a bi-exponential donor are linked, and polarisation-resolved lifetime data, where a fluorescence intensity and bi-exponential anisotropy decay model is applied to the analysis of live cell

  15. Nonlinear Generation of shear flows and large scale magnetic fields by small scale

    NASA Astrophysics Data System (ADS)

    Aburjania, G.

    2009-04-01

    EGU2009-233 Nonlinear Generation of shear flows and large scale magnetic fields by small scale turbulence in the ionosphere by G. Aburjania Contact: George Aburjania, g.aburjania@gmail.com,aburj@mymail.ge

  16. Large-Scale Coherent Vortex Formation in Two-Dimensional Turbulence

    NASA Astrophysics Data System (ADS)

    Orlov, A. V.; Brazhnikov, M. Yu.; Levchenko, A. A.

    2018-04-01

    The evolution of a vortex flow excited by an electromagnetic technique in a thin layer of a conducting liquid was studied experimentally. Small-scale vortices, excited at the pumping scale, merge with time due to the nonlinear interaction and produce large-scale structures—the inverse energy cascade is formed. The dependence of the energy spectrum in the developed inverse cascade is well described by the Kraichnan law k^(-5/3). At large scales, the inverse cascade is limited by cell sizes, and a large-scale coherent vortex flow is formed, which occupies almost the entire area of the experimental cell. The radial profile of the azimuthal velocity of the coherent vortex immediately after the pumping was switched off has been established for the first time. Inside the vortex core, the azimuthal velocity grows linearly along a radius and reaches a constant value outside the core, which agrees well with the theoretical prediction.
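
    A power-law energy spectrum such as the -5/3 law mentioned above is usually checked by fitting a straight line to the spectrum in log-log coordinates. The sketch below illustrates that check with an ordinary least-squares fit in pure Python. The wavenumbers and the prefactor 3.0 are synthetic, assumed values, not data from the experiment, and the function name `loglog_slope` is hypothetical.

```python
import math


def loglog_slope(k, energy):
    """Least-squares slope of log(energy) versus log(k)."""
    xs = [math.log(v) for v in k]
    ys = [math.log(v) for v in energy]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic spectrum E(k) = 3.0 * k^(-5/3); illustrative values only.
k = [1.0, 2.0, 4.0, 8.0, 16.0]
energy = [3.0 * v ** (-5.0 / 3.0) for v in k]

slope = loglog_slope(k, energy)
print(f"fitted exponent = {slope:.4f}")  # fitted exponent = -1.6667
```

    With measured data the fit would be restricted to the inertial range between the pumping scale and the cell size, since the power law is not expected to hold outside it.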

  17. Intensive agriculture erodes β-diversity at large scales.

    PubMed

    Karp, Daniel S; Rominger, Andrew J; Zook, Jim; Ranganathan, Jai; Ehrlich, Paul R; Daily, Gretchen C

    2012-09-01

    Biodiversity is declining from unprecedented land conversions that replace diverse, low-intensity agriculture with vast expanses under homogeneous, intensive production. Despite documented losses of species richness, consequences for β-diversity, changes in community composition between sites, are largely unknown, especially in the tropics. Using a 10-year data set on Costa Rican birds, we find that low-intensity agriculture sustained β-diversity across large scales on a par with forest. In high-intensity agriculture, low local (α) diversity inflated β-diversity as a statistical artefact. Therefore, at small spatial scales, intensive agriculture appeared to retain β-diversity. Unlike in forest or low-intensity systems, however, high-intensity agriculture also homogenised vegetation structure over large distances, thereby decoupling the fundamental ecological pattern of bird communities changing with geographical distance. This ~40% decline in species turnover indicates a significant decline in β-diversity at large spatial scales. These findings point the way towards multi-functional agricultural systems that maintain agricultural productivity while simultaneously conserving biodiversity. © 2012 Blackwell Publishing Ltd/CNRS.

  18. Large-scale spatio-temporal monitoring highlights hotspots of demersal fish diversity in the Mediterranean Sea

    NASA Astrophysics Data System (ADS)

    Granger, Victoria; Fromentin, Jean-Marc; Bez, Nicolas; Relini, Giulio; Meynard, Christine N.; Gaertner, Jean-Claude; Maiorano, Porzia; Garcia Ruiz, Cristina; Follesa, Cristina; Gristina, Michele; Peristeraki, Panagiota; Brind'Amour, Anik; Carbonara, Pierluigi; Charilaou, Charis; Esteban, Antonio; Jadaud, Angélique; Joksimovic, Aleksandar; Kallianiotis, Argyris; Kolitari, Jerina; Manfredi, Chiara; Massuti, Enric; Mifsud, Roberta; Quetglas, Antoni; Refes, Wahid; Sbrana, Mario; Vrgoc, Nedo; Spedicato, Maria Teresa; Mérigot, Bastien

    2015-01-01

    Increasing human pressures and global environmental change may severely affect the diversity of species assemblages and associated ecosystem services. Despite the recent interest in phylogenetic and functional diversity, our knowledge of large spatio-temporal patterns of demersal fish diversity sampled by trawling remains incomplete, notably in the Mediterranean Sea, one of the most threatened marine regions of the world. We investigated large spatio-temporal diversity patterns by analysing a dataset of 19,886 hauls from 10 to 800 m depth performed annually during the last two decades by standardised scientific bottom trawl field surveys across the Mediterranean Sea, within the MEDITS program. A multi-component (eight diversity indices) and multi-scale (local assemblages, biogeographic regions to basins) approach indicates that only the two most traditional components (species richness and evenness) were sufficient to reflect patterns in taxonomic, phylogenetic or functional richness and divergence. We also put into question the use of widely computed indices that allow comparing directly taxonomic, phylogenetic and functional diversity within a unique mathematical framework. In addition, demersal fish assemblages sampled by trawl do not follow continuous decreasing longitudinal/latitudinal diversity gradients (spatial effects explained up to 70.6% of deviance in regression tree and generalised linear models), for any of the indices and spatial scales analysed. Indeed, at both local and regional scales species richness was relatively high in the Iberian region, Malta, the Eastern Ionian and Aegean seas, whereas the Adriatic Sea and Cyprus showed a relatively low level. In contrast, evenness as well as taxonomic, phylogenetic and functional divergences did not show regional hotspots. All studied diversity components remained stable over the last two decades. Overall, our results highlight the need to use complementary diversity indices through different

  19. Stability of large-scale systems.

    NASA Technical Reports Server (NTRS)

    Siljak, D. D.

    1972-01-01

    The purpose of this paper is to present the results obtained in stability study of large-scale systems based upon the comparison principle and vector Liapunov functions. The exposition is essentially self-contained, with emphasis on recent innovations which utilize explicit information about the system structure. This provides a natural foundation for the stability theory of dynamic systems under structural perturbations.

  20. One tree to link them all: a phylogenetic dataset for the European tetrapoda.

    PubMed

    Roquet, Cristina; Lavergne, Sébastien; Thuiller, Wilfried

    2014-08-08

    With the ever-increasing availability of phylogenetically informative data, the last decade has seen an upsurge of ecological studies incorporating information on evolutionary relationships among species. However, detailed species-level phylogenies, which are necessary for comprehensive large-scale eco-phylogenetic analyses, are still lacking for many large groups and regions. Here, we provide a dataset of 100 dated phylogenetic trees for all European tetrapods based on a mixture of supermatrix and supertree approaches. Phylogenetic inference was performed separately for each of the main Tetrapoda groups of Europe except mammals (i.e. amphibians, birds, squamates and turtles) by means of maximum likelihood (ML) analyses of supermatrix applying a tree constraint at the family (amphibians and squamates) or order (birds and turtles) levels based on consensus knowledge. For each group, we inferred 100 ML trees to be able to provide a phylogenetic dataset that accounts for phylogenetic uncertainty, and assessed node support with bootstrap analyses. Each tree was dated using penalized-likelihood and fossil calibration. The trees obtained were well-supported by existing knowledge and previous phylogenetic studies. For mammals, we modified the most complete supertree dataset available in the literature to include a recent update of the Carnivora clade. As a final step, we merged the phylogenetic trees of all groups to obtain a set of 100 phylogenetic trees for all European Tetrapoda species for which data was available (91%). We provide this phylogenetic dataset (100 chronograms) for the purpose of comparative analyses, macro-ecological or community ecology studies aiming to incorporate phylogenetic information while accounting for phylogenetic uncertainty.

  1. Real-time simulation of large-scale floods

    NASA Astrophysics Data System (ADS)

    Liu, Q.; Qin, Y.; Li, G. D.; Liu, Z.; Cheng, D. J.; Zhao, Y. H.

    2016-08-01

    Given the complex, rapidly changing water situation, the real-time simulation of large-scale floods is very important for flood prevention practice. Model robustness and running efficiency are two critical factors in successful real-time flood simulation. This paper proposes a robust, two-dimensional, shallow water model based on the unstructured Godunov-type finite volume method. A robust wet/dry front method is used to enhance the numerical stability. An adaptive method is proposed to improve the running efficiency. The proposed model is used for large-scale flood simulation on real topography. Results compared to those of MIKE21 show the strong performance of the proposed model.

  2. [Spatial domain display for interference image dataset].

    PubMed

    Wang, Cai-Ling; Li, Yu-Shan; Liu, Xue-Bin; Hu, Bing-Liang; Jing, Juan-Juan; Wen, Jia

    2011-11-01

    Visualization of imaging interferometer data is urgently needed by users for image interpretation and information extraction. However, conventional research on visualization focuses only on spectral image datasets in the spectral domain. Hence, quick display of interference spectral image datasets is one of the key issues in interference image processing. Conventional visualization of an interference dataset applies a classical spectral image display method after Fourier transformation. The present paper addresses the problem of quick viewing of interferometer imager data in the image domain and proposes an algorithm that simplifies the task. The Fourier transformation is an obstacle because its computation time is very large, and the cost grows further as the dataset size increases. The proposed algorithm, named interference weighted envelopes, removes the need to transform the dataset. The authors choose three interference weighted envelopes, based respectively on the Fourier transformation, features of the interference data, and the human visual system. A comparison of the proposed and conventional methods shows a large difference in display time.

  3. Large scale anomalies in the microwave background: causation and correlation.

    PubMed

    Aslanyan, Grigor; Easther, Richard

    2013-12-27

    Most treatments of large scale anomalies in the microwave sky are a posteriori, with unquantified look-elsewhere effects. We contrast these with physical models of specific inhomogeneities in the early Universe which can generate these apparent anomalies. Physical models predict correlations between candidate anomalies and the corresponding signals in polarization and large scale structure, reducing the impact of cosmic variance. We compute the apparent spatial curvature associated with large-scale inhomogeneities and show that it is typically small, allowing for a self-consistent analysis. As an illustrative example we show that a single large plane wave inhomogeneity can contribute to low-l mode alignment and odd-even asymmetry in the power spectra and the best-fit model accounts for a significant part of the claimed odd-even asymmetry. We argue that this approach can be generalized to provide a more quantitative assessment of potential large scale anomalies in the Universe.

  4. Superwind Outflows in Seyfert Galaxies? : Large-Scale Radio Maps of an Edge-On Sample

    NASA Astrophysics Data System (ADS)

    Colbert, E.; Gallimore, J.; Baum, S.; O'Dea, C.

    1995-03-01

    Large-scale galactic winds (superwinds) are commonly found flowing out of the nuclear region of ultraluminous infrared and powerful starburst galaxies. Stellar winds and supernovae from the nuclear starburst provide the energy to drive these superwinds. The outflowing gas escapes along the rotation axis, sweeping up and shock-heating clouds in the halo, which produces optical line emission, radio synchrotron emission, and X-rays. These features can most easily be studied in edge-on systems, so that the wind emission is not confused by that from the disk. We have begun a systematic search for superwind outflows in Seyfert galaxies. In an earlier optical emission-line survey, we found extended minor axis emission and/or double-peaked emission line profiles in >~30% of the sample objects. We present here large-scale (6cm VLA C-config) radio maps of 11 edge-on Seyfert galaxies, selected (without bias) from a distance-limited sample of 23 edge-on Seyferts. These data have been used to estimate the frequency of occurrence of superwinds. Preliminary results indicate that four (36%) of the 11 objects observed and six (26%) of the 23 objects in the distance-limited sample have extended radio emission oriented perpendicular to the galaxy disk. This emission may be produced by a galactic wind blowing out of the disk. Two (NGC 2992 and NGC 5506) of the nine objects for which we have both radio and optical data show good evidence for a galactic wind in both datasets. We suggest that galactic winds occur in >~30% of all Seyferts. A goal of this work is to find a diagnostic that can be used to distinguish between large-scale outflows that are driven by starbursts and those that are driven by an AGN. The presence of starburst-driven superwinds in Seyferts, if established, would have important implications for the connection between starburst galaxies and AGN.

  5. Effects of biasing on the galaxy power spectrum at large scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beltran Jimenez, Jose; Departamento de Fisica Teorica, Universidad Complutense de Madrid, 28040, Madrid; Durrer, Ruth

    2011-05-15

    In this paper we study the effect of biasing on the power spectrum at large scales. We show that even though nonlinear biasing does introduce a white noise contribution on large scales, the P(k) ∝ k^n behavior of the matter power spectrum on large scales may still be visible and above the white noise for about one decade. We show that the Kaiser biasing scheme, which leads to linear bias of the correlation function on large scales, also generates a linear bias of the power spectrum on rather small scales. This is a consequence of the divergence on small scales of the pure Harrison-Zeldovich spectrum. However, biasing becomes k dependent if we damp the underlying power spectrum on small scales. We also discuss the effect of biasing on the baryon acoustic oscillations.

  6. FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web.

    PubMed

    Probst, Daniel; Reymond, Jean-Louis

    2018-04-15

    During the past decade, big data have become a major tool in scientific endeavors. Although statistical methods and algorithms are well-suited for analyzing and summarizing enormous amounts of data, the results do not allow for a visual inspection of the entire data. Current scientific software, including R packages and Python libraries such as ggplot2, matplotlib and plot.ly, do not support interactive visualizations of datasets exceeding 100 000 data points on the web. Other solutions enable the web-based visualization of big data only through data reduction or statistical representations. However, recent hardware developments, especially advancements in graphical processing units, allow for the rendering of millions of data points on a wide range of consumer hardware such as laptops, tablets and mobile phones. Similar to the challenges and opportunities brought to virtually every scientific field by big data, both the visualization of and interaction with copious amounts of data are both demanding and hold great promise. Here we present FUn, a framework consisting of a client (Faerun) and server (Underdark) module, facilitating the creation of web-based, interactive 3D visualizations of large datasets, enabling record level visual inspection. We also introduce a reference implementation providing access to SureChEMBL, a database containing patent information on more than 17 million chemical compounds. The source code and the most recent builds of Faerun and Underdark, Lore.js and the data preprocessing toolchain used in the reference implementation, are available on the project website (http://doc.gdb.tools/fun/). daniel.probst@dcb.unibe.ch or jean-louis.reymond@dcb.unibe.ch.

  7. Emory University: High-Throughput Protein-Protein Interaction Dataset for Lung Cancer-Associated Genes | Office of Cancer Genomics

    Cancer.gov

    To discover novel PPI signaling hubs for lung cancer, CTD2 Center at Emory utilized large-scale genomics datasets and literature to compile a set of lung cancer-associated genes. A library of expression vectors was generated for these genes and utilized for detecting pairwise PPIs with cell lysate-based TR-FRET assays in high-throughput screening format.

  8. Statistical Measures of Large-Scale Structure

    NASA Astrophysics Data System (ADS)

    Vogeley, Michael; Geller, Margaret; Huchra, John; Park, Changbom; Gott, J. Richard

    1993-12-01

    To quantify clustering in the large-scale distribution of galaxies and to test theories for the formation of structure in the universe, we apply statistical measures to the CfA Redshift Survey. This survey is complete to m_B(0) = 15.5 over two contiguous regions which cover one-quarter of the sky and include ~11,000 galaxies. The salient features of these data are voids with diameter 30-50 h^-1 Mpc and coherent dense structures with a scale ~100 h^-1 Mpc. Comparison with N-body simulations rules out the "standard" CDM model (Ω = 1, b = 1.5, σ8 = 1) at the 99% confidence level because this model has insufficient power on scales λ > 30 h^-1 Mpc. An unbiased open universe CDM model (Ωh = 0.2) and a biased CDM model with non-zero cosmological constant (Ωh = 0.24, λ0 = 0.6) match the observed power spectrum. The amplitude of the power spectrum depends on the luminosity of galaxies in the sample; bright (L > L*) galaxies are more strongly clustered than faint galaxies. The paucity of bright galaxies in low-density regions may explain this dependence. To measure the topology of large-scale structure, we compute the genus of isodensity surfaces of the smoothed density field. On scales in the "non-linear" regime, ≤ 10 h^-1 Mpc, the high- and low-density regions are multiply-connected over a broad range of density threshold, as in a filamentary net. On smoothing scales > 10 h^-1 Mpc, the topology is consistent with statistics of a Gaussian random field. Simulations of CDM models fail to produce the observed coherence of structure on non-linear scales (>95% confidence level). The underdensity probability (the frequency of regions with density contrast δρ/ρ̄ = -0.8) depends strongly on the luminosity of galaxies; underdense regions are significantly more common (>2σ) in bright (L > L*) galaxy samples than in samples which include fainter galaxies.

  9. Cosmic strings and the large-scale structure

    NASA Technical Reports Server (NTRS)

    Stebbins, Albert

    1988-01-01

    A possible problem for cosmic string models of galaxy formation is presented. If very large voids are common and if loop fragmentation is not much more efficient than presently believed, then it may be impossible for string scenarios to produce the observed large-scale structure with Omega sub 0 = 1 and without strong environmental biasing.

  10. Coupled binary embedding for large-scale image retrieval.

    PubMed

    Zheng, Liang; Wang, Shengjin; Tian, Qi

    2014-08-01

    Visual matching is a crucial step in image retrieval based on the bag-of-words (BoW) model. In the baseline method, two keypoints are considered as a matching pair if their SIFT descriptors are quantized to the same visual word. However, the SIFT visual word has two limitations. First, it loses most of its discriminative power during quantization. Second, SIFT only describes the local texture feature. Both drawbacks impair the discriminative power of the BoW model and lead to false positive matches. To tackle this problem, this paper proposes to embed multiple binary features at indexing level. To model correlation between features, a multi-IDF scheme is introduced, through which different binary features are coupled into the inverted file. We show that matching verification methods based on binary features, such as Hamming embedding, can be effectively incorporated in our framework. As an extension, we explore the fusion of a binary color feature into image retrieval. The joint integration of the SIFT visual word and binary features greatly enhances the precision of visual matching, reducing the impact of false positive matches. Our method is evaluated through extensive experiments on four benchmark datasets (Ukbench, Holidays, DupImage, and MIR Flickr 1M). We show that our method significantly improves the baseline approach. In addition, large-scale experiments indicate that the proposed method requires acceptable memory usage and query time compared with other approaches. Further, when a global color feature is integrated, our method yields performance competitive with the state of the art.
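
    The matching rule described in this abstract can be illustrated with a minimal sketch. It assumes each keypoint carries a visual-word id plus a short binary signature, and it accepts a pair only when the words agree and the signatures are close in Hamming distance. The function names, the 8-bit signatures, and the threshold `max_dist` are hypothetical choices for illustration; the paper's multi-IDF coupling is not reproduced here.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Count the positions where two equal-length 0/1 arrays differ."""
    return int(np.count_nonzero(a != b))

def is_match(word_a, sig_a, word_b, sig_b, max_dist=2):
    """Return True when two keypoints pass both matching checks."""
    # Baseline BoW rule: descriptors must quantize to the same visual word.
    if word_a != word_b:
        return False
    # Verification step: binary signatures must also be close in Hamming space.
    return hamming_distance(sig_a, sig_b) <= max_dist

# Hypothetical keypoints: (visual word id, binary signature).
query = (17, np.array([1, 0, 1, 1, 0, 0, 1, 0]))
close = (17, np.array([1, 0, 1, 0, 0, 0, 1, 0]))       # same word, 1 bit differs
far = (17, np.array([0, 1, 0, 0, 1, 1, 0, 1]))         # same word, all bits differ
other_word = (42, np.array([1, 0, 1, 1, 0, 0, 1, 0]))  # identical bits, different word

for name, candidate in [("close", close), ("far", far), ("other_word", other_word)]:
    print(name, is_match(*query, *candidate))
```

    The `far` keypoint would count as a match under the baseline same-word rule alone; the extra Hamming check is what rejects it, which is the false-positive reduction the abstract refers to.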

  11. Coronal hole evolution by sudden large scale changes

    NASA Technical Reports Server (NTRS)

    Nolte, J. T.; Gerassimenko, M.; Krieger, A. S.; Solodyna, C. V.

    1978-01-01

    Sudden shifts in coronal-hole boundaries observed by the S-054 X-ray telescope on Skylab between May and November, 1973, within 1 day of CMP of the holes, at latitudes not exceeding 40 deg, are compared with the long-term evolution of coronal-hole area. It is found that large-scale shifts in boundary locations can account for most if not all of the evolution of coronal holes. The temporal and spatial scales of these large-scale changes imply that they are the results of a physical process occurring in the corona. It is concluded that coronal holes evolve by magnetic-field lines' opening when the holes are growing, and by fields' closing as the holes shrink.

  12. New probes of Cosmic Microwave Background large-scale anomalies

    NASA Astrophysics Data System (ADS)

    Aiola, Simone

    Fifty years of Cosmic Microwave Background (CMB) data played a crucial role in constraining the parameters of the LambdaCDM model, where Dark Energy, Dark Matter, and Inflation are the three most important pillars not yet understood. Inflation prescribes an isotropic universe on large scales, and it generates spatially-correlated density fluctuations over the whole Hubble volume. CMB temperature fluctuations on scales bigger than a degree in the sky, affected by modes on super-horizon scale at the time of recombination, are a clean snapshot of the universe after inflation. In addition, the accelerated expansion of the universe, driven by Dark Energy, leaves a hardly detectable imprint in the large-scale temperature sky at late times. Such fundamental predictions have been tested with current CMB data and found to be in tension with what we expect from our simple LambdaCDM model. Is this tension just a random fluke or a fundamental issue with the present model? In this thesis, we present a new framework to probe the lack of large-scale correlations in the temperature sky using CMB polarization data. Our analysis shows that if a suppression in the CMB polarization correlations is detected, it will provide compelling evidence for new physics on super-horizon scale. To further analyze the statistical properties of the CMB temperature sky, we constrain the degree of statistical anisotropy of the CMB in the context of the observed large-scale dipole power asymmetry. We find evidence for a scale-dependent dipolar modulation at 2.5sigma. To isolate late-time signals from the primordial ones, we test the anomalously high Integrated Sachs-Wolfe effect signal generated by superstructures in the universe. We find that the detected signal is in tension with the expectations from LambdaCDM at the 2.5sigma level, which is somewhat smaller than what has been previously argued. To conclude, we describe the current status of CMB observations on small scales, highlighting the

  13. Exploring the large-scale structure of Taylor–Couette turbulence through Large-Eddy Simulations

    NASA Astrophysics Data System (ADS)

    Ostilla-Mónico, Rodolfo; Zhu, Xiaojue; Verzicco, Roberto

    2018-04-01

    Large eddy simulations (LES) of Taylor-Couette (TC) flow, the flow between two co-axial and independently rotating cylinders, are performed in an attempt to explore the large-scale axially-pinned structures seen in experiments and simulations. Both static and dynamic LES models are used. The Reynolds number is kept fixed at Re = 3.4 × 10^4, and the radius ratio η = r_i/r_o is set to η = 0.909, limiting the effects of curvature and resulting in frictional Reynolds numbers of around Re_τ ≈ 500. Four rotation ratios from Rot = -0.0909 to Rot = 0.3 are simulated. First, the LES of TC is benchmarked for different rotation ratios. Both the Smagorinsky model with a constant of c_s = 0.1 and the dynamic model are found to produce reasonable results for no mean rotation and cyclonic rotation, but deviations increase for increasing rotation. This is attributed to the increasing anisotropic character of the fluctuations. Second, “over-damped” LES, i.e. LES with a large Smagorinsky constant, is performed and is shown to reproduce some features of the large-scale structures, even when the near-wall region is not adequately modeled. This shows the potential for using over-damped LES for fast explorations of the parameter space where large-scale structures are found.
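
    To make the role of the Smagorinsky constant concrete, the sketch below evaluates the textbook Smagorinsky closure, nu_t = (c_s * Delta)^2 * |S| with |S| = sqrt(2 S_ij S_ij), for a single velocity-gradient tensor. The abstract does not state the exact form used by the authors, so this is an assumption. The filter width, the simple-shear gradient, and the "over-damped" value c_s = 0.3 are likewise illustrative and not taken from the paper; only c_s = 0.1 appears in the abstract.

```python
import numpy as np

def smagorinsky_viscosity(grad_u: np.ndarray, c_s: float, delta: float) -> float:
    """Eddy viscosity from the standard Smagorinsky closure (assumed form)."""
    # Resolved strain-rate tensor: symmetric part of the velocity gradient.
    strain = 0.5 * (grad_u + grad_u.T)
    # Strain-rate magnitude |S| = sqrt(2 * S_ij * S_ij).
    strain_mag = np.sqrt(2.0 * np.sum(strain * strain))
    return (c_s * delta) ** 2 * strain_mag

# Hypothetical simple shear du/dy = 1 with unit filter width.
grad_u = np.zeros((3, 3))
grad_u[0, 1] = 1.0

nu_standard = smagorinsky_viscosity(grad_u, c_s=0.1, delta=1.0)
nu_overdamped = smagorinsky_viscosity(grad_u, c_s=0.3, delta=1.0)  # assumed value
print(nu_standard, nu_overdamped, nu_overdamped / nu_standard)
```

    Because the eddy viscosity grows with the square of c_s, even a modest increase in the constant damps the resolved small scales strongly, which is the sense in which such a run is over-damped.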

  14. Large-scale microwave anisotropy from gravitating seeds

    NASA Technical Reports Server (NTRS)

    Veeraraghavan, Shoba; Stebbins, Albert

    1992-01-01

    Topological defects could have seeded primordial inhomogeneities in cosmological matter. We examine the horizon-scale matter and geometry perturbations generated by such seeds in an expanding homogeneous and isotropic universe. Evolving particle horizons generally lead to perturbations around motionless seeds, even when there are compensating initial underdensities in the matter. We describe the pattern of the resulting large angular scale microwave anisotropy.

  15. First global WCRP shortwave surface radiation budget dataset

    NASA Technical Reports Server (NTRS)

    Whitlock, C. H.; Charlock, T. P.; Staylor, W. F.; Pinker, R. T.; Laszlo, I.; Ohmura, A.; Gilgen, H.; Konzelman, T.; DiPasquale, R. C.; Moats, C. D.

    1995-01-01

    Shortwave radiative fluxes that reach the Earth's surface are key factors that influence atmospheric and oceanic circulations as well as surface climate. Yet, information on these fluxes is meager. Surface site data are generally available from only a limited number of observing stations over land. Much less is known about the large-scale variability of the shortwave radiative fluxes over the oceans, which cover most of the globe. Recognizing the need to produce global-scale fields of such fluxes for use in climate research, the World Climate Research Program has initiated activities that led to the establishment of the Surface Radiation Budget Climatology Project with the ultimate goal to determine various components of the surface radiation budget from satellite data. In this paper, the first global products that resulted from this activity are described. Monthly and daily data on a 280-km grid scale are available. Samples of climate parameters obtainable from the dataset are presented. Emphasis is given to validation and limitations of the results. For most of the globe, satellite estimates have bias values between +/- 20 W/sq m and rms values are around 25 W/sq m. There are specific regions with much larger uncertainties however.

  16. Preventing Large-Scale Controlled Substance Diversion From Within the Pharmacy

    PubMed Central

    Martin, Emory S.; Dzierba, Steven H.; Jones, David M.

    2013-01-01

    Large-scale diversion of controlled substances (CS) from within a hospital or health system pharmacy is a rare but growing problem. It is the responsibility of pharmacy leadership to scrutinize control processes to expose weaknesses. This article reviews examples of large-scale diversion incidents and diversion techniques and provides practical strategies to stimulate enhanced CS security within the pharmacy staff. Large-scale diversion from within a pharmacy department can be averted by a pharmacist-in-charge who is informed and proactive in taking effective countermeasures. PMID:24421497

  17. Large-scale flow experiments for managing river systems

    USGS Publications Warehouse

    Konrad, Christopher P.; Olden, Julian D.; Lytle, David A.; Melis, Theodore S.; Schmidt, John C.; Bray, Erin N.; Freeman, Mary C.; Gido, Keith B.; Hemphill, Nina P.; Kennard, Mark J.; McMullen, Laura E.; Mims, Meryl C.; Pyron, Mark; Robinson, Christopher T.; Williams, John G.

    2011-01-01

    Experimental manipulations of streamflow have been used globally in recent decades to mitigate the impacts of dam operations on river systems. Rivers are challenging subjects for experimentation, because they are open systems that cannot be isolated from their social context. We identify principles to address the challenges of conducting effective large-scale flow experiments. Flow experiments have both scientific and social value when they help to resolve specific questions about the ecological action of flow with a clear nexus to water policies and decisions. Water managers must integrate new information into operating policies for large-scale experiments to be effective. Modeling and monitoring can be integrated with experiments to analyze long-term ecological responses. Experimental design should include spatially extensive observations and well-defined, repeated treatments. Large-scale flow manipulations are only a part of dam operations that affect river systems. Scientists can ensure that experimental manipulations continue to be a valuable approach for the scientifically based management of river systems.

  18. Scaling identity connects human mobility and social interactions.

    PubMed

    Deville, Pierre; Song, Chaoming; Eagle, Nathan; Blondel, Vincent D; Barabási, Albert-László; Wang, Dashun

    2016-06-28

    Massive datasets that capture human movements and social interactions have catalyzed rapid advances in our quantitative understanding of human behavior during the past years. One important aspect affecting both areas is the critical role space plays. Indeed, growing evidence suggests both our movements and communication patterns are associated with spatial costs that follow reproducible scaling laws, each characterized by its specific critical exponents. Although human mobility and social networks develop concomitantly as two prolific yet largely separated fields, we lack any known relationships between the critical exponents explored by them, despite the fact that they often study the same datasets. Here, by exploiting three different mobile phone datasets that capture simultaneously these two aspects, we discovered a new scaling relationship, mediated by a universal flux distribution, which links the critical exponents characterizing the spatial dependencies in human mobility and social networks. Therefore, the widely studied scaling laws uncovered in these two areas are not independent but connected through a deeper underlying reality.

  19. Scaling identity connects human mobility and social interactions

    PubMed Central

    Deville, Pierre; Song, Chaoming; Eagle, Nathan; Blondel, Vincent D.; Barabási, Albert-László; Wang, Dashun

    2016-01-01

    Massive datasets that capture human movements and social interactions have catalyzed rapid advances in our quantitative understanding of human behavior during the past years. One important aspect affecting both areas is the critical role space plays. Indeed, growing evidence suggests both our movements and communication patterns are associated with spatial costs that follow reproducible scaling laws, each characterized by its specific critical exponents. Although human mobility and social networks develop concomitantly as two prolific yet largely separated fields, we lack any known relationships between the critical exponents explored by them, despite the fact that they often study the same datasets. Here, by exploiting three different mobile phone datasets that capture simultaneously these two aspects, we discovered a new scaling relationship, mediated by a universal flux distribution, which links the critical exponents characterizing the spatial dependencies in human mobility and social networks. Therefore, the widely studied scaling laws uncovered in these two areas are not independent but connected through a deeper underlying reality. PMID:27274050

  20. Experimental Simulations of Large-Scale Collisions

    NASA Technical Reports Server (NTRS)

    Housen, Kevin R.

    2002-01-01

    This report summarizes research on the effects of target porosity on the mechanics of impact cratering. Impact experiments conducted on a centrifuge provide direct simulations of large-scale cratering on porous asteroids. The experiments show that large craters in porous materials form mostly by compaction, with essentially no deposition of material into the ejecta blanket that is a signature of cratering in less-porous materials. The ratio of ejecta mass to crater mass is shown to decrease with increasing crater size or target porosity. These results are consistent with the observation that large closely-packed craters on asteroid Mathilde appear to have formed without degradation to earlier craters.

  1. Large Scale Underground Detectors in Europe

    NASA Astrophysics Data System (ADS)

    Katsanevas, S. K.

    2006-07-01

    The physics potential and the complementarity of the large scale underground European detectors: Water Cherenkov (MEMPHYS), Liquid Argon TPC (GLACIER) and Liquid Scintillator (LENA) is presented with emphasis on the major physics opportunities, namely proton decay, supernova detection and neutrino parameter determination using accelerator beams.

  2. Large-scale retrieval for medical image analytics: A comprehensive review.

    PubMed

    Li, Zhongyu; Zhang, Xiaofan; Müller, Henning; Zhang, Shaoting

    2018-01-01

    Over the past decades, medical image analytics was greatly facilitated by the explosion of digital imaging techniques, where huge amounts of medical images were produced with ever-increasing quality and diversity. However, conventional methods for analyzing medical images have achieved limited success, as they are not capable of tackling the huge amount of image data. In this paper, we review state-of-the-art approaches for large-scale medical image analysis, which are mainly based on recent advances in computer vision, machine learning and information retrieval. Specifically, we first present the general pipeline of large-scale retrieval and summarize the challenges/opportunities of medical image analytics at large scale. Then, we provide a comprehensive review of algorithms and techniques relevant to major processes in the pipeline, including feature representation, feature indexing, searching, etc. On the basis of existing work, we introduce the evaluation protocols and multiple applications of large-scale medical image retrieval, with a variety of exploratory and diagnostic scenarios. Finally, we discuss future directions of large-scale retrieval, which can further improve the performance of medical image analysis. Copyright © 2017 Elsevier B.V. All rights reserved.

  3. Analysis of a large dataset of mycorrhiza inoculation field trials on potato shows highly significant increases in yield.

    PubMed

    Hijri, Mohamed

    2016-04-01

    An increasing human population requires more food production in nutrient-efficient systems in order to simultaneously meet global food needs while reducing the environmental footprint of agriculture. Arbuscular mycorrhizal fungi (AMF) have the potential to enhance crop yield, but their efficiency has yet to be demonstrated in large-scale crop production systems. This study reports an analysis of a dataset consisting of 231 field trials in which the same AMF inoculant (Rhizophagus irregularis DAOM 197198) was applied to potato over a 4-year period in North America and Europe under authentic field conditions. The inoculation was performed using a liquid suspension of AMF spores that was sprayed onto potato seed pieces, yielding a calculated 71 spores per seed piece. Statistical analysis showed a highly significant increase in marketable potato yield (ANOVA, P < 0.0001) for inoculated fields (42.2 tons/ha) compared with non-inoculated controls (38.3 tons/ha), irrespective of trial year. The average yield increase was 3.9 tons/ha, representing 9.5 % of total crop yield. Inoculation was profitable with a 0.67-tons/ha increase in yield, a threshold reached in almost 79 % of all trials. This finding clearly demonstrates the benefits of mycorrhizal-based inoculation on crop yield, using potato as a case study. Further improvements of these beneficial inoculants will help compensate for crop production deficits, both now and in the future.

  4. Analyzing remotely sensed datasets for improved characterization of field-scale interventions for food security

    NASA Astrophysics Data System (ADS)

    Limaye, A. S.; Ellenburg, W. L., II; Coffee, K.; Ashmall, W.; Stanton, K.; Burks, J.; Irwin, D.

    2017-12-01

    Agriculture interventions such as irrigation, improved fertilization, and advanced cultivars have the potential to increase food security and ensure climate resilience. However, in order to broaden the support of activities like these, environmental managers must be able to assess their impact. Often field data are difficult to obtain and decisions are made with limited information. Satellite products can provide relevant information at field and village-wide scales that can assist in this process. SERVIR aims to connect space-based products to village-scale interventions and improve their efficacy through two web-based tools, ClimateSERV and AgriSERV. ClimateSERV has been active since 2014 and has expanded its data holdings and access points. Currently, ClimateSERV enables users to create geographic regions of their choosing and to compute key statistics for those regions. Available data include rainfall (GPM IMERG, CHIRPS), vegetation indices (eMODIS Normalized Difference Vegetation Index - NDVI; Evaporative Stress Index), and North American Multi-model Ensemble-based seasonal climate forecasts of rainfall and temperature. ClimateSERV can also query the Google Earth Engine holdings for datasets; currently, it provides access to the daytime MODIS Land Surface Temperature (LST). Our first such derived product is a monthly rainfall analysis feature which combines CHIRPS historic rainfall with seasonal forecast models. AgriSERV is a derived web-based tool based on the ClimateSERV data holdings. It is designed to provide easy-to-interpret analysis based on NDVI and rainfall. This tool allows users to draw two areas of interest, one control with no intervention and another that has experienced intervention. An on-demand comparative analysis is performed and the user is presented with side-by-side charts and summary data that highlight the differences of the two areas in terms of vegetation health, derived growing season lengths and rainfall. The

  5. Large-scale magnetic fields at high Reynolds numbers in magnetohydrodynamic simulations.

    PubMed

    Hotta, H; Rempel, M; Yokoyama, T

    2016-03-25

    The 11-year solar magnetic cycle shows a high degree of coherence in spite of the turbulent nature of the solar convection zone. It has been found in recent high-resolution magnetohydrodynamics simulations that the maintenance of a large-scale coherent magnetic field is difficult with small viscosity and magnetic diffusivity (≲10^12 square centimeters per second). We reproduced previous findings that indicate a reduction of the energy in the large-scale magnetic field for lower diffusivities and demonstrate the recovery of the global-scale magnetic field using unprecedentedly high resolution. We found an efficient small-scale dynamo that suppresses small-scale flows, which mimics the properties of large diffusivity. As a result, the global-scale magnetic field is maintained even in the regime of small diffusivities, that is, large Reynolds numbers. Copyright © 2016, American Association for the Advancement of Science.

  6. Prototype Vector Machine for Large Scale Semi-Supervised Learning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Kai; Kwok, James T.; Parvin, Bahram

    2009-04-29

    Practical data mining rarely falls exactly into the supervised learning scenario. Rather, the growing amount of unlabeled data poses a big challenge to large-scale semi-supervised learning (SSL). We note that the computational intensiveness of graph-based SSL arises largely from the manifold or graph regularization, which in turn leads to large models that are difficult to handle. To alleviate this, we proposed the prototype vector machine (PVM), a highly scalable, graph-based algorithm for large-scale SSL. Our key innovation is the use of "prototype vectors" for efficient approximation on both the graph-based regularizer and model representation. The choice of prototypes is grounded upon two important criteria: they not only perform effective low-rank approximation of the kernel matrix, but also span a model suffering the minimum information loss compared with the complete model. We demonstrate encouraging performance and appealing scaling properties of the PVM on a number of machine learning benchmark data sets.
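
    The idea of using a small set of prototypes for a low-rank approximation of the kernel matrix can be sketched with a generic Nyström-style construction, K ≈ K_nm (K_mm)^-1 K_mn. This is an illustrative stand-in and not the PVM algorithm itself: the RBF kernel, the random choice of prototypes, the sizes, and the small ridge term `jitter` are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def nystrom_approx(x, prototypes, gamma=0.5, jitter=1e-8):
    """Low-rank kernel approximation built from m prototype vectors."""
    k_nm = rbf_kernel(x, prototypes, gamma)           # n x m cross-kernel
    k_mm = rbf_kernel(prototypes, prototypes, gamma)  # m x m prototype kernel
    # Small ridge term keeps the m x m solve numerically stable.
    k_mm = k_mm + jitter * np.eye(len(prototypes))
    return k_nm @ np.linalg.solve(k_mm, k_nm.T)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))                        # hypothetical data
prototypes = x[rng.choice(len(x), size=20, replace=False)]

k_full = rbf_kernel(x, x)
k_approx = nystrom_approx(x, prototypes)
rel_err = np.linalg.norm(k_full - k_approx) / np.linalg.norm(k_full)
print(f"relative Frobenius error with 20 prototypes: {rel_err:.3e}")
```

    Only the n x m and m x m blocks are ever formed by the approximation itself; the full n x n matrix is computed here solely to measure the error, which is the step that would be avoided at large scale.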

  7. Large-Scale Test of Dynamic Correlation Processors: Implications for Correlation-Based Seismic Pipelines

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dodge, D. A.; Harris, D. B.

    Correlation detectors are of considerable interest to the seismic monitoring communities because they offer reduced detection thresholds and combine detection, location and identification functions into a single operation. They appear to be ideal for applications requiring screening of frequent repeating events. However, questions remain about how broadly empirical correlation methods are applicable. We describe the effectiveness of banks of correlation detectors in a system that combines traditional power detectors with correlation detectors in terms of efficiency, which we define to be the fraction of events detected by the correlators. This paper elaborates and extends the concept of a dynamic correlation detection framework – a system which autonomously creates correlation detectors from event waveforms detected by power detectors; and reports observed performance on a network of arrays in terms of efficiency. We performed a large scale test of dynamic correlation processors on an 11 terabyte global dataset using 25 arrays in the single frequency band 1-3 Hz. The system found over 3.2 million unique signals and produced 459,747 screened detections. A very satisfying result is that, on average, efficiency grows with time and, after nearly 16 years of operation, exceeds 47% for events observed over all distance ranges and approaches 70% for near regional and 90% for local events. This observation suggests that future pipeline architectures should make extensive use of correlation detectors, principally for decluttering observations of local and near-regional events. Our results also suggest that future operations based on correlation detection will require commodity large-scale computing infr