Non-negative Tensor Factorization for Robust Exploratory Big-Data Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Alexandrov, Boian; Vesselinov, Velimir Valentinov; Djidjev, Hristo Nikolov
Currently, large multidimensional datasets are being accumulated in almost every field. Data are: (1) collected by distributed sensor networks in real time all over the globe, (2) produced by large-scale experimental measurements or engineering activities, (3) generated by high-performance simulations, and (4) gathered by electronic communications and social-network activities. Simultaneous analysis of these ultra-large heterogeneous multidimensional datasets is often critical for scientific discoveries, decision-making, emergency response, and national and global security. The importance of such analyses mandates the development of the next generation of robust machine learning (ML) methods and tools for big-data exploratory analysis.
Accessing Multi-Dimensional Images and Data Cubes in the Virtual Observatory
NASA Astrophysics Data System (ADS)
Tody, Douglas; Plante, R. L.; Berriman, G. B.; Cresitello-Dittmar, M.; Good, J.; Graham, M.; Greene, G.; Hanisch, R. J.; Jenness, T.; Lazio, J.; Norris, P.; Pevunova, O.; Rots, A. H.
2014-01-01
Telescopes across the spectrum are routinely producing multi-dimensional images and datasets, such as Doppler velocity cubes, polarization datasets, and time-resolved “movies.” Examples of current telescopes producing such multi-dimensional images include the JVLA, ALMA, and the IFU instruments on large optical and near-infrared wavelength telescopes. In the near future, both the LSST and JWST will also produce such multi-dimensional images routinely. High-energy instruments such as Chandra produce event datasets that are also a form of multi-dimensional data, in effect being a very sparse multi-dimensional image. Ensuring that the data sets produced by these telescopes can be both discovered and accessed by the community is essential and is part of the mission of the Virtual Observatory (VO). The Virtual Astronomical Observatory (VAO, http://www.usvao.org/), in conjunction with its international partners in the International Virtual Observatory Alliance (IVOA), has developed a protocol and an initial demonstration service designed for the publication, discovery, and access of arbitrarily large multi-dimensional images. The protocol describing multi-dimensional images is the Simple Image Access Protocol, version 2, which provides the minimal set of metadata required to characterize a multi-dimensional image for its discovery and access. A companion Image Data Model formally defines the semantics and structure of multi-dimensional images independently of how they are serialized, while providing capabilities such as support for sparse data that are essential to deal effectively with large cubes. A prototype data access service has been deployed and tested, using a suite of multi-dimensional images from a variety of telescopes. The prototype has demonstrated the capability to discover and remotely access multi-dimensional data via standard VO protocols. 
The prototype informs the specification of a protocol that will be submitted to the IVOA for approval, with an operational data cube service to be delivered in mid-2014. An associated user-installable VO data service framework will provide the capabilities required to publish VO-compatible multi-dimensional images or data cubes.
Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge
May, Jody C.; McLean, John A.
2017-01-01
Hybrid analytical instruments constructed around mass spectrometry (MS) are becoming the preferred techniques for addressing many grand challenges in science and medicine. From the omics sciences to drug discovery and synthetic biology, multidimensional separations based on MS provide the high peak capacity and high measurement throughput necessary to obtain the large-scale measurements used to infer systems-level information. In this review, we describe multidimensional MS configurations as technologies that are big data drivers and discuss some new and emerging strategies for mining information from large-scale datasets. A discussion is included on the information content that can be obtained from individual dimensions, as well as the unique information that can be derived by comparing different levels of data. Finally, we discuss some emerging data visualization strategies that seek to make highly dimensional datasets both accessible and comprehensible. PMID:27306312
NASA Astrophysics Data System (ADS)
Shrestha, S. R.; Collow, T. W.; Rose, B.
2016-12-01
Scientific datasets are generated from various sources and platforms, but they are typically produced either by earth observation systems or by modelling systems. These are widely used for monitoring, simulating, or analyzing measurements associated with physical, chemical, and biological phenomena over the ocean, atmosphere, or land. A significant subset of scientific datasets stores values directly as rasters, or in a form that can be rasterized, where a value exists at every cell in a regular grid spanning the spatial extent of the dataset. Government agencies like NOAA, NASA, EPA, and USGS produce large volumes of near real-time, forecast, and historical data that drive climatological and meteorological studies and underpin operations ranging from weather prediction to monitoring sea ice loss. Modern science is computationally intensive because of the availability of an enormous amount of scientific data, the adoption of data-driven analysis, and the need to share these datasets and research results with the public. The ArcGIS platform is sophisticated and capable of handling such complex domains. We'll discuss constructs and capabilities applicable to multidimensional gridded data that can be conceptualized as a multivariate space-time cube. Building on the concept of a two-dimensional raster, a typical multidimensional raster dataset contains several "slices" within the same spatial extent. We will share a case using the NOAA Climate Forecast System Reanalysis (CFSR) multidimensional data as an example of how large collections of rasters can be efficiently organized and managed through a geodatabase data model called the "mosaic dataset" and dynamically transformed and analyzed using raster functions. A raster function is a lightweight, raster-valued transformation defined over a mixed set of raster and scalar inputs; like any tool, a raster function accepts input parameters.
It enables dynamic processing of only the data being displayed on screen or requested by an application. We will present the dynamic processing and analysis of CFSR data using chains of raster functions and share the result as a dynamic multidimensional image service. This workflow and these capabilities can be easily applied to any scientific data format supported by the mosaic dataset.
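The deferred-evaluation idea behind raster function chains can be sketched in a few lines of Python. The class below is a hypothetical illustration of the concept (it is not the ArcGIS API): each function wraps a source, and computation happens only when a display window is actually requested.

```python
import numpy as np

class RasterFunction:
    """Deferred raster transformation: composes with other functions and
    evaluates only over a requested window (conceptual sketch only)."""
    def __init__(self, func, source):
        self.func = func        # element-wise transformation
        self.source = source    # ndarray or another RasterFunction

    def read(self, rows, cols):
        # Pull only the requested window from the source, then transform it.
        if isinstance(self.source, RasterFunction):
            window = self.source.read(rows, cols)
        else:
            window = self.source[rows, cols]
        return self.func(window)

# A chain: scale raw values, then threshold -- nothing is computed yet.
raw = np.arange(100.0).reshape(10, 10)
scaled = RasterFunction(lambda a: a * 0.1, raw)
masked = RasterFunction(lambda a: np.where(a > 0.5, a, 0.0), scaled)

# Only the 2x2 window actually displayed is processed.
window = masked.read(slice(0, 2), slice(0, 2))
```

Because evaluation starts from the requested window, a long chain defined over a continental mosaic still touches only the pixels on screen.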
Statistical Projections for Multi-resolution, Multi-dimensional Visual Data Exploration and Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoa T. Nguyen; Stone, Daithi; E. Wes Bethel
2016-01-01
An ongoing challenge in visual exploration and analysis of large, multi-dimensional datasets is how to present useful, concise information to a user for specific visualization tasks. Typical approaches to this problem rely on reduced-resolution versions of data, projections of data, or both. These approaches still have limitations, such as high computational cost or loss of accuracy. In this work, we explore the use of a statistical metric as the basis for both projections and reduced-resolution versions of data, with a particular focus on preserving one key trait in data, namely variation. We use two case studies to explore this idea: one that uses a synthetic dataset, and another that uses a large ensemble collection produced by an atmospheric modeling code to study long-term changes in global precipitation. The primary finding of our work is that, in terms of preserving the variation signal inherent in data, a statistical measure preserves this key characteristic more faithfully across both multi-dimensional projections and multi-resolution representations than a methodology based on averaging.
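The core idea, summarizing each block of cells with a statistic chosen to preserve variation rather than averaging it away, can be sketched with NumPy. This is an illustrative reduction, not the authors' exact metric:

```python
import numpy as np

def reduce_resolution(data, block, stat=np.std):
    """Reduce a 2D array by a factor of `block` per axis, summarizing each
    block with `stat` (std preserves variation; mean smooths it away)."""
    h, w = data.shape
    blocks = data[: h - h % block, : w - w % block]
    blocks = blocks.reshape(h // block, block, w // block, block)
    return stat(blocks, axis=(1, 3))

rng = np.random.default_rng(0)
field = rng.normal(size=(64, 64))
field[:32] *= 5.0                       # top half is far more variable

by_std = reduce_resolution(field, 8, np.std)
by_mean = reduce_resolution(field, 8, np.mean)
```

On this synthetic field, the block-wise standard deviation keeps the contrast between the variable and quiet halves that a block mean would smooth out.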
Scientific Visualization Tools for Enhancement of Undergraduate Research
NASA Astrophysics Data System (ADS)
Rodriguez, W. J.; Chaudhury, S. R.
2001-05-01
Undergraduate research projects that utilize remote sensing satellite instrument data to investigate atmospheric phenomena pose many challenges. A significant challenge is processing large amounts of multi-dimensional data. Remote sensing data initially requires mining; filtering of undesirable spectral, instrumental, or environmental features; and subsequent sorting and reformatting into files for easy and quick access. The data must then be transformed according to the needs of the investigation(s) and displayed for interpretation. These multidimensional datasets require views that can range from two-dimensional plots to multivariable-multidimensional scientific visualizations with animations. Science undergraduate students generally find these data processing tasks daunting. Researchers are typically required to fully understand the intricacies of the dataset and write computer programs, or rely on commercially available software, which may not be trivial to use. In the time that undergraduate researchers have available for their research projects, learning the data formats, programming languages, and/or visualization packages is impractical. When dealing with large multi-dimensional data sets, appropriate scientific visualization tools are imperative in allowing students to have a meaningful and pleasant research experience while producing valuable scientific research results. The BEST Lab at Norfolk State University has been creating tools for multivariable-multidimensional analysis of Earth Science data. EzSAGE and SAGE4D have been developed to sort, analyze, and visualize SAGE II (Stratospheric Aerosol and Gas Experiment) data with ease. Three- and four-dimensional visualizations in interactive environments can be produced. EzSAGE provides atmospheric slices in three dimensions, where the researcher can interactively change the scales in the three dimensions, the color tables, and the degree of smoothing to focus on particular phenomena.
SAGE4D provides a navigable four-dimensional interactive environment. These tools allow students to make higher order decisions based on large multidimensional sets of data while diminishing the level of frustration that results from dealing with the details of processing large data sets.
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. Data mining and knowledge discovery techniques that extract the information contained in multidimensional datasets play a significant role in exploiting their full benefit. The presence of a large number of features in high-dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection techniques are commonly used to build robust machine learning models by selecting a subset of relevant features that retains the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) with the K-Helly property of hypergraph representations, is designed to identify the optimal feature subset, or reduct, for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository prove the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy, and time complexity. The performance of RSKHT was validated using the WEKA tool, showing that RSKHT is computationally attractive and flexible over massive datasets.
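A minimal sketch of the rough-set side of this approach: the dependency degree measures how consistently a feature subset determines the decision attribute, and a greedy search grows a reduct until it matches the full feature set. This is a textbook RST heuristic shown for illustration, not the K-Helly hypergraph algorithm of RSKHT:

```python
def dependency(rows, features, decision):
    """Rough-set dependency degree: fraction of objects whose feature-value
    equivalence class is consistent with a single decision value."""
    classes = {}
    for row in rows:
        key = tuple(row[f] for f in features)
        classes.setdefault(key, set()).add(row[decision])
    consistent = sum(
        sum(1 for row in rows if tuple(row[f] for f in features) == key)
        for key, decisions in classes.items() if len(decisions) == 1
    )
    return consistent / len(rows)

def greedy_reduct(rows, features, decision):
    """Greedily add the feature that raises dependency most, until the
    subset determines the decision as well as the full feature set."""
    target = dependency(rows, features, decision)
    reduct = []
    while dependency(rows, reduct, decision) < target:
        best = max((f for f in features if f not in reduct),
                   key=lambda f: dependency(rows, reduct + [f], decision))
        reduct.append(best)
    return reduct

# Tiny hypothetical decision table: 'temp' alone determines the diagnosis.
table = [
    {"temp": "high", "cough": "yes", "flu": "yes"},
    {"temp": "high", "cough": "no",  "flu": "yes"},
    {"temp": "low",  "cough": "yes", "flu": "no"},
    {"temp": "low",  "cough": "no",  "flu": "no"},
]
reduct = greedy_reduct(table, ["temp", "cough"], "flu")
```

On the toy table, `temp` alone determines the diagnosis, so the reduct discards `cough`.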
NASA Astrophysics Data System (ADS)
Appel, Marius; Lahn, Florian; Buytaert, Wouter; Pebesma, Edzer
2018-04-01
Earth observation (EO) datasets are commonly provided as collections of scenes, where individual scenes represent a temporal snapshot and cover a particular region on the Earth's surface. Using these data in complex spatiotemporal modeling becomes difficult as soon as data volumes exceed a certain capacity or analyses include many scenes, which may spatially overlap and may have been recorded at different dates. In order to facilitate analytics on large EO datasets, we combine and extend the geospatial data abstraction library (GDAL) and the array-based data management and analytics system SciDB. We present an approach to automatically convert collections of scenes to multidimensional arrays and use SciDB to scale computationally intensive analytics. We evaluate the approach in three case studies on national-scale land use change monitoring with Landsat imagery, global empirical orthogonal function analysis of daily precipitation, and combining historical climate model projections with satellite-based observations. Results indicate that the approach can be used to represent various EO datasets and that analyses in SciDB scale well with available computational resources. To simplify analyses of higher-dimensional datasets such as climate model output, however, a generalization of the GDAL data model might be needed. All parts of this work have been implemented as open-source software, and we discuss how this may facilitate open and reproducible EO analyses.
Comparing NetCDF and SciDB on managing and querying 5D hydrologic dataset
NASA Astrophysics Data System (ADS)
Liu, Haicheng; Xiao, Xiao
2016-11-01
Efficiently extracting information from high-dimensional hydro-meteorological modelling datasets requires smart solutions. Traditional methods are mostly file-based, which makes the data easy to edit and access, but their contiguous storage structure limits efficiency. Databases have been proposed as an alternative, offering advantages such as native functionality for manipulating multidimensional (MD) arrays, smart caching strategies, and scalability. In this research, NetCDF file-based solutions and the multidimensional array database management system (DBMS) SciDB, which applies a chunked storage structure, are benchmarked to determine the best solution for storing and querying a large 5D hydrologic modelling dataset. The effect of data storage configurations, including chunk size, dimension order, and compression, on query performance is explored. Results indicate that the dimension order used to organize storage of 5D data has a significant influence on query performance if the chunk size is very large, but the effect becomes insignificant when the chunk size is properly set. Compression in SciDB mostly has a negative influence on query performance. Caching is an advantage but may be influenced by the execution of different query processes. On the whole, the NetCDF solution without compression is in general more efficient than the SciDB DBMS.
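The chunking effect the study measures is easy to see with a little arithmetic: a query must read every chunk it intersects, so the chunk shape, not the dimension order alone, governs cost. A sketch, assuming chunk-aligned queries on a hypothetical 5D (time, member, level, lat, lon) grid:

```python
from math import ceil, prod

def chunks_touched(query, chunk):
    """Number of chunks a hyperslab query intersects: each dimension
    contributes the ceiling of its extent over the chunk length
    (assumes the query is aligned to chunk boundaries)."""
    return prod(ceil(q / c) for q, c in zip(query, chunk))

# Read one full 1000-step time series at a single point,
# for one ensemble member and one vertical level.
query = (1000, 1, 1, 1, 1)

# Chunked along time: the series lives in a handful of chunks.
time_major = chunks_touched(query, chunk=(100, 1, 1, 180, 360))
# Chunked per time step (contiguous, file-like layout): every step is a read.
per_step = chunks_touched(query, chunk=(1, 1, 1, 180, 360))
```

With per-time-step chunks the point series forces 1000 reads; chunking 100 steps together cuts that to 10, which is why a well-chosen chunk size makes the storage dimension order nearly irrelevant.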
Igloo-Plot: a tool for visualization of multidimensional datasets.
Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S
2014-01-01
Advances in science and technology have resulted in an exponential growth of multivariate (or multi-dimensional) datasets generated across various research areas, especially in the domain of biological sciences. Visualization and analysis of such data, with the objective of uncovering the hidden patterns therein, is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout not only facilitates easy identification of clusters of data points having similar feature compositions, but also of the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well-studied multi-dimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/.
Identifying genetic alterations that prime a cancer cell to respond to a particular therapeutic agent can facilitate the development of precision cancer medicines. Cancer cell-line (CCL) profiling of small-molecule sensitivity has emerged as an unbiased method to assess the relationships between genetic or cellular features of CCLs and small-molecule response. Here, we developed annotated cluster multidimensional enrichment analysis to explore the associations between groups of small molecules and groups of CCLs in a new, quantitative sensitivity dataset.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van Benthem, Mark H.
2016-05-04
This software is employed for 3D visualization of X-ray diffraction (XRD) data, with functionality for slicing, reorienting, isolating, and plotting 2D color contour maps and 3D renderings of large datasets. The program makes use of the multidimensionality of textured XRD data, where diffracted intensity is not constant over a given set of angular positions (as dictated by the three defined dimensional angles of phi, chi, and two-theta). Datasets are rendered in 3D with intensity as a scalar, represented with a rainbow color scale. A GUI interface and scrolling tools, along with interactive functions via the mouse, allow for fast manipulation of these large datasets so as to perform detailed analysis of diffraction results with the full dimensionality of the diffraction space.
Optimizing tertiary storage organization and access for spatio-temporal datasets
NASA Technical Reports Server (NTRS)
Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith
1994-01-01
We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access of subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over the physical placement of data on storage devices. We also discuss in some detail aspects of the interface between the application programs and the mass storage system, as well as a workbench to help scientists design the best reorganization of a dataset for anticipated access patterns.
DICON: interactive visual analysis of multidimensional clusters.
Cao, Nan; Gotz, David; Sun, Jimeng; Qu, Huamin
2011-12-01
Clustering as a fundamental data analysis technique has been widely used in many analytic applications. However, it is often difficult for users to understand and evaluate multidimensional clustering results, especially the quality of clusters and their semantics. For large and complex data, high-level statistical information about the clusters is often needed for users to evaluate cluster quality while a detailed display of multidimensional attributes of the data is necessary to understand the meaning of clusters. In this paper, we introduce DICON, an icon-based cluster visualization that embeds statistical information into a multi-attribute display to facilitate cluster interpretation, evaluation, and comparison. We design a treemap-like icon to represent a multidimensional cluster, and the quality of the cluster can be conveniently evaluated with the embedded statistical information. We further develop a novel layout algorithm which can generate similar icons for similar clusters, making comparisons of clusters easier. User interaction and clutter reduction are integrated into the system to help users more effectively analyze and refine clustering results for large datasets. We demonstrate the power of DICON through a user study and a case study in the healthcare domain. Our evaluation shows the benefits of the technique, especially in support of complex multidimensional cluster analysis.
Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions
Li, Haoran; Xiong, Li; Jiang, Xiaoqian
2014-01-01
Differential privacy has recently emerged in private statistical data release as one of the strongest privacy guarantees. Most of the existing techniques that generate differentially private histograms or synthetic data only work well for single dimensional or low-dimensional histograms. They become problematic for high dimensional and large domain data due to increased perturbation error and computation complexity. In this paper, we propose DPCopula, a differentially private data synthesization technique using Copula functions for multi-dimensional data. The core of our method is to compute a differentially private copula function from which we can sample synthetic data. Copula functions are used to describe the dependence between multivariate random vectors and allow us to build the multivariate joint distribution using one-dimensional marginal distributions. We present two methods for estimating the parameters of the copula functions with differential privacy: maximum likelihood estimation and Kendall’s τ estimation. We present formal proofs for the privacy guarantee as well as the convergence property of our methods. Extensive experiments using both real datasets and synthetic datasets demonstrate that DPCopula generates highly accurate synthetic multi-dimensional data with significantly better utility than state-of-the-art techniques. PMID:25405241
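The copula machinery can be illustrated without the privacy layer: estimate pairwise Kendall's τ, map it to a Gaussian correlation via ρ = sin(πτ/2), sample correlated normals, and push them through the empirical marginals. The sketch below is NON-private and simplified; DPCopula additionally perturbs the τ estimates and marginal distributions to satisfy differential privacy.

```python
import numpy as np
from math import erf, sqrt, pi, sin

def kendall_tau(x, y):
    """Kendall's tau by pair counting (O(n^2); fine for a sketch)."""
    n = len(x)
    s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

def copula_synthesize(data, m, rng):
    """Gaussian-copula synthesis from one-dimensional marginals plus a
    pairwise dependence estimate (illustrative, NOT differentially private)."""
    n, d = data.shape
    rho = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            # Kendall's tau -> Pearson correlation of the Gaussian copula.
            rho[i, j] = rho[j, i] = sin(pi / 2 * kendall_tau(data[:, i], data[:, j]))
    z = rng.multivariate_normal(np.zeros(d), rho, size=m)
    u = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))   # standard normal CDF
    # Invert each empirical marginal: map uniforms to data quantiles.
    return np.column_stack([np.quantile(data[:, k], u[:, k]) for k in range(d)])
```

The synthetic rows share the originals' marginal distributions and pairwise dependence, which is exactly the structure the copula factorization separates out.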
Large-Scale Astrophysical Visualization on Smartphones
NASA Astrophysics Data System (ADS)
Becciani, U.; Massimino, P.; Costa, A.; Gheller, C.; Grillo, A.; Krokos, M.; Petta, C.
2011-07-01
Nowadays, digital sky surveys and long-duration, high-resolution numerical simulations using high performance computing and grid systems produce multidimensional astrophysical datasets on the order of several petabytes. Sharing visualizations of such datasets within communities and collaborating research groups is of paramount importance for disseminating results and advancing astrophysical research. Moreover, educational and public outreach programs can benefit greatly from novel ways of presenting these datasets by promoting understanding of complex astrophysical processes, e.g., the formation of stars and galaxies. We have previously developed VisIVO Server, a grid-enabled platform for high-performance large-scale astrophysical visualization. This article reviews the latest developments on VisIVO Web, a custom designed web portal wrapped around VisIVO Server, then introduces VisIVO Smartphone, a gateway connecting VisIVO Web and data repositories for mobile astrophysical visualization. We discuss current work and summarize future developments.
Using Browser Notebooks to Analyse Big Atmospheric Data-sets in the Cloud
NASA Astrophysics Data System (ADS)
Robinson, N.; Tomlinson, J.; Arribas, A.; Prudden, R.
2016-12-01
We present an account of our experience building an ecosystem for the analysis of big atmospheric datasets. Using modern technologies, we have developed a prototype platform which is scalable and capable of analysing very large atmospheric datasets. We tested different big-data ecosystems, such as Hadoop MapReduce, Spark, and Dask, to find the one best suited for the analysis of multidimensional binary data such as NetCDF. We make extensive use of infrastructure-as-code and containerisation to provide a platform which is reusable and which can scale to accommodate changes in demand. We make this platform readily accessible through browser-based notebooks. As a result, analysts with minimal technology experience can, in tens of lines of Python, make interactive data-visualisation web pages which analyse very large amounts of data using cutting-edge big-data technology.
OMERO and Bio-Formats 5: flexible access to large bioimaging datasets at scale
NASA Astrophysics Data System (ADS)
Moore, Josh; Linkert, Melissa; Blackburn, Colin; Carroll, Mark; Ferguson, Richard K.; Flynn, Helen; Gillen, Kenneth; Leigh, Roger; Li, Simon; Lindner, Dominik; Moore, William J.; Patterson, Andrew J.; Pindelski, Blazej; Ramalingam, Balaji; Rozbicki, Emil; Tarkowska, Aleksandra; Walczysko, Petr; Allan, Chris; Burel, Jean-Marie; Swedlow, Jason
2015-03-01
The Open Microscopy Environment (OME) has built and released, under open source licenses, Bio-Formats, a Java-based tool for converting proprietary file formats, and OMERO, an enterprise data management platform. In this report, we describe new versions of Bio-Formats and OMERO that are specifically designed to support the large, multi-gigabyte or terabyte scale datasets that are routinely collected across most domains of biological and biomedical research. Bio-Formats reads image data directly from native proprietary formats, bypassing the need for conversion into a standard format. It implements the concept of a file set, a container that defines the contents of multi-dimensional data comprised of many files. OMERO uses Bio-Formats to read files natively and provides a flexible access mechanism that supports several different storage and access strategies. These new capabilities of OMERO and Bio-Formats make them especially useful in imaging applications like digital pathology, high content screening, and light sheet microscopy that routinely create large datasets which must be managed and analyzed.
SHARE: system design and case studies for statistical health information release
Gardner, James; Xiong, Li; Xiao, Yonghui; Gao, Jingjing; Post, Andrew R; Jiang, Xiaoqian; Ohno-Machado, Lucila
2013-01-01
Objectives We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data. Materials and Methods SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the surveillance, epidemiology and end results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE. Results Experimental results indicate that SHARE can deal with heterogeneous data present in medical data, and that the released statistics are useful. The Kullback–Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data. Conclusions SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses. PMID:23059729
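The basic mechanism underlying this kind of release is worth a sketch: a count histogram has sensitivity 1 (one patient changes one bin count by one), so adding Laplace(1/ε) noise per bin satisfies ε-differential privacy. This is the textbook one-dimensional case, not SHARE's multidimensional or longitudinal methods:

```python
import numpy as np

def dp_histogram(values, bins, epsilon, rng):
    """epsilon-DP histogram via the Laplace mechanism. Sensitivity of a
    count histogram is 1, so each bin gets Laplace(1/epsilon) noise."""
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Post-processing (rounding, clipping at zero) cannot weaken the guarantee.
    return np.clip(np.round(noisy), 0, None), edges

rng = np.random.default_rng(0)
ages = rng.integers(0, 90, size=10_000)   # hypothetical patient ages
private_counts, edges = dp_histogram(ages, bins=9, epsilon=1.0, rng=rng)
```

Utility degrades as dimensionality grows, which is the difficulty the abstract notes: a seven-dimensional cube spreads the same per-cell noise over vastly more, mostly tiny, cells.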
Near Real-time Scientific Data Analysis and Visualization with the ArcGIS Platform
NASA Astrophysics Data System (ADS)
Shrestha, S. R.; Viswambharan, V.; Doshi, A.
2017-12-01
Scientific multidimensional data are generated from a variety of sources and platforms, mostly earth observation and modeling systems. Agencies like NASA, NOAA, USGS, and ESA produce large volumes of near real-time observation, forecast, and historical data that drive fundamental research and its applications, from everyday decision making to disaster response. A common big-data challenge for organizations working with multidimensional scientific data and imagery collections is the time and resources required to manage and process such large volumes and varieties of data. The challenge of adopting data-driven real-time visualization and analysis, along with the need to share these large datasets, workflows, and information products with wider and more diverse communities, creates an opportunity to use the ArcGIS platform to handle such demand. In recent years, a significant effort has been put into expanding the capabilities of ArcGIS to support multidimensional scientific data across the platform. New capabilities in ArcGIS for scientific data management, processing, and analysis, as well as for creating information products from large volumes of data using image server technology, are becoming widely used in earth science and other domains. We will discuss and share the challenges associated with big data in the geospatial science community and how we have addressed them in the ArcGIS platform. We will share use cases, such as NOAA High-Resolution Rapid Refresh (HRRR) data, that demonstrate how we access large collections of near real-time data (stored on premises or in the cloud), disseminate them dynamically, process and analyze them on the fly, and serve them to a variety of geospatial applications.
We will also share how on-the-fly processing with raster function capabilities can be extended to create persisted data and information products using raster analytics capabilities that exploit distributed computing in an enterprise environment.
SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data
NASA Astrophysics Data System (ADS)
Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends Apache Spark for scaling scientific computations. Apache Spark improves on the map-reduce implementation in Apache Hadoop for parallel computing on a cluster by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show the usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries, and we evaluate the performance of various matrix libraries, such as Nd4j and Breeze, in distributed pipelines. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, and parallel ingest and partitioning (sharding) of A-Train satellite observations and model grids.
These solutions are encompassed in SciSpark, an open-source software framework for distributed computing on scientific data.
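The time-based partitioning behind the sRDD can be illustrated with a minimal, framework-free sketch (SciSpark itself is Scala on Spark; the shard sizes, toy grids, and function names below are invented for illustration):

```python
# Conceptual sketch (not SciSpark): partition a multi-temporal grid
# collection by time and apply a per-partition computation, the pattern
# the sRDD uses to distribute NetCDF/HDF variables across nodes.
def partition_by_time(timesteps, n_partitions):
    """Split a list of time indices into near-equal contiguous shards."""
    k, r = divmod(len(timesteps), n_partitions)
    shards, start = [], 0
    for i in range(n_partitions):
        size = k + (1 if i < r else 0)
        shards.append(timesteps[start:start + size])
        start += size
    return shards

def grid_mean(grid):
    """Stand-in per-partition computation: mean of a 2-D grid."""
    flat = [v for row in grid for v in row]
    return sum(flat) / len(flat)

# 8 hourly snapshots of a tiny 2x2 grid, sharded across 3 "nodes"
snapshots = {t: [[t, t + 1], [t + 2, t + 3]] for t in range(8)}
shards = partition_by_time(list(snapshots), 3)
per_shard_means = [[grid_mean(snapshots[t]) for t in shard] for shard in shards]
```

Each shard can then be processed independently, with results combined in a final reduce step.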
Nuclear Forensic Inferences Using Iterative Multidimensional Statistics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Robel, M; Kristo, M J; Heller, M A
2009-06-09
Nuclear forensics involves the analysis of interdicted nuclear material for specific material characteristics (referred to as 'signatures') that imply specific geographical locations, production processes, culprit intentions, etc. Predictive signatures rely on expert knowledge of physics, chemistry, and engineering to develop inferences from these material characteristics. Comparative signatures, on the other hand, rely on comparison of the material characteristics of the interdicted sample (the 'questioned sample' in FBI parlance) with those of a set of known samples. In the ideal case, the set of known samples would be a comprehensive nuclear forensics database, a database which does not currently exist. In fact, our ability to analyze interdicted samples and produce an extensive list of precise materials characteristics far exceeds our ability to interpret the results. Therefore, as we seek to develop the extensive databases necessary for nuclear forensics, we must also develop the methods necessary to produce the necessary inferences from comparison of our analytical results with these large, multidimensional sets of data. In the work reported here, we used a large, multidimensional dataset of results from quality control analyses of uranium ore concentrate (UOC, sometimes called 'yellowcake'). We have found that traditional multidimensional techniques, such as principal components analysis (PCA), are especially useful for understanding such datasets and drawing relevant conclusions. In particular, we have developed an iterative partial least squares-discriminant analysis (PLS-DA) procedure that has proven especially adept at identifying the production location of unknown UOC samples. By removing classes which fell far outside the initial decision boundary, and then rebuilding the PLS-DA model, we have consistently produced better and more definitive attributions than with a single-pass classification approach.
Performance of the iterative PLS-DA method compared favorably to that of classification and regression tree (CART) and k-nearest neighbor (KNN) algorithms, with the best combination of accuracy and robustness, as tested by classifying samples measured independently in our laboratories against the vendor-QC-based reference set.
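The "prune distant classes, then refit" loop described above can be sketched without the actual PLS-DA machinery; the nearest-centroid classifier below is only a stand-in for the paper's model, and the sites, features, and threshold are invented:

```python
# Illustrative sketch of iterative pruning-and-refitting, with a
# nearest-centroid stand-in for PLS-DA. All data here are invented.
def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def iterative_classify(train, query, drop_beyond):
    """train: {class_name: [feature tuples]}. Repeatedly drop classes whose
    centroid lies farther than drop_beyond from the query, then refit."""
    classes = dict(train)
    while True:
        cents = {c: centroid(pts) for c, pts in classes.items()}
        far = [c for c, ct in cents.items() if dist2(ct, query) > drop_beyond ** 2]
        if not far or len(classes) - len(far) < 1:
            break
        for c in far:            # prune, then rebuild on the next pass
            del classes[c]
    cents = {c: centroid(pts) for c, pts in classes.items()}
    return min(cents, key=lambda c: dist2(cents[c], query))

mines = {
    "site_A": [(0.1, 0.2), (0.2, 0.1)],
    "site_B": [(5.0, 5.1), (5.2, 4.9)],
    "site_C": [(0.4, 0.3), (0.3, 0.4)],
}
best = iterative_classify(mines, query=(0.2, 0.2), drop_beyond=2.0)
```

The pruning step mirrors the paper's removal of classes far outside the decision boundary before the model is rebuilt.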
Next-Generation Machine Learning for Biological Networks.
Camacho, Diogo M; Collins, Katherine M; Powers, Rani K; Costello, James C; Collins, James J
2018-06-14
Machine learning, a collection of data-analytical techniques aimed at building predictive models from multi-dimensional datasets, is becoming integral to modern biological research. By enabling one to generate models that learn from large datasets and make predictions on likely outcomes, machine learning can be used to study complex cellular systems such as biological networks. Here, we provide a primer on machine learning for life scientists, including an introduction to deep learning. We discuss opportunities and challenges at the intersection of machine learning and network biology, which could impact disease biology, drug discovery, microbiome research, and synthetic biology.
Collaboration tools and techniques for large model datasets
Signell, R.P.; Carniel, S.; Chiggiato, J.; Janekovic, I.; Pullen, J.; Sherwood, C.R.
2008-01-01
In MREA and many other marine applications, it is common to have multiple models running with different grids, run by different institutions. Techniques and tools are described for low-bandwidth delivery of data from large multidimensional datasets, such as those from meteorological and oceanographic models, directly into generic analysis and visualization tools. Output is stored using the NetCDF CF Metadata Conventions and then delivered to collaborators over the web via OPeNDAP. OPeNDAP datasets served by different institutions are then organized via THREDDS catalogs. These tools and procedures enable scientists to explore data on the original model grids using tools they are familiar with. The approach is also low-bandwidth, enabling users to extract just the data they require, an important feature for access from a ship or remote areas. The entire implementation is simple enough to be handled by modelers working with their webmasters: no advanced programming support is necessary.
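The low-bandwidth subsetting that OPeNDAP provides can be sketched as building a constraint-expression URL that requests only the needed hyperslab; the endpoint, variable name, and index ranges below are invented for illustration:

```python
# Hypothetical helper: build an OPeNDAP constraint-expression URL that
# subsets one variable by index ranges, so only the requested window
# crosses the network. Endpoint and variable are invented.
def opendap_subset_url(base, var, ranges):
    """ranges: list of (start, stop) inclusive index pairs, one per dimension."""
    hyperslab = "".join(f"[{a}:{b}]" for a, b in ranges)
    return f"{base}.dods?{var}{hyperslab}"

url = opendap_subset_url(
    "http://example.org/thredds/dodsC/roms/output.nc",  # invented endpoint
    "temp",
    [(0, 0), (0, 19), (100, 140), (200, 260)],          # time, depth, lat, lon
)
# the request pulls one time step and a small lat/lon window, not the file
```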
Spear, Timothy T; Nishimura, Michael I; Simms, Patricia E
2017-08-01
Advancement in flow cytometry reagents and instrumentation has allowed for simultaneous analysis of large numbers of lineage/functional immune cell markers. Highly complex datasets generated by polychromatic flow cytometry require proper analytical software to answer investigators' questions. A problem among many investigators and flow cytometry Shared Resource Laboratories (SRLs), including our own, is a lack of access to a flow cytometry-knowledgeable bioinformatics team, making it difficult to learn and choose appropriate analysis tool(s). Here, we comparatively assess various multidimensional flow cytometry software packages for their ability to answer a specific biologic question and provide graphical representation output suitable for publication, as well as their ease of use and cost. We assessed polyfunctional potential of TCR-transduced T cells, serving as a model evaluation, using multidimensional flow cytometry to analyze 6 intracellular cytokines and degranulation on a per-cell basis. Analysis of 7 parameters resulted in 128 possible combinations of positivity/negativity, far too complex for basic flow cytometry software to analyze fully. Various software packages were used; the analysis methods applied in each are described, and representative output is displayed. Of the tools investigated, automated classification of cellular expression by nonlinear stochastic embedding (ACCENSE) and coupled analysis in Pestle/simplified presentation of incredibly complex evaluations (SPICE) provided the most user-friendly manipulations and readable output, evaluating effects of altered antigen-specific stimulation on T cell polyfunctionality. This detailed approach may serve as a model for other investigators/SRLs in selecting the most appropriate software to analyze complex flow cytometry datasets. Further development and awareness of available tools will help guide proper data analysis to answer difficult biologic questions arising from incredibly complex datasets.
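Why 7 binary readouts yield 128 phenotypes can be shown by encoding each cell's positivity pattern as a 7-bit mask and tallying combinations; the marker names and toy cells below are invented for illustration:

```python
# Sketch: 7 binary markers -> 2**7 = 128 possible positivity patterns.
# Encode each cell as a bitmask and count phenotype frequencies.
from collections import Counter

MARKERS = ["IFNg", "TNFa", "IL2", "IL4", "IL10", "IL17", "CD107a"]  # invented panel

def phenotype_mask(cell):
    """cell: dict marker -> bool. Returns an integer bitmask 0..127."""
    return sum(1 << i for i, m in enumerate(MARKERS) if cell.get(m, False))

cells = [
    {"IFNg": True, "TNFa": True, "CD107a": True},
    {"IFNg": True, "TNFa": True, "CD107a": True},
    {"IL2": True},
    {},
]
counts = Counter(phenotype_mask(c) for c in cells)
n_possible = 2 ** len(MARKERS)   # 128 combinations of positivity/negativity
```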
Utilizing Multidimensional Measures of Race in Education Research: The Case of Teacher Perceptions
Irizarry, Yasmiyn
2015-01-01
Education scholarship on race using quantitative data analysis consists largely of studies on the black-white dichotomy, and more recently, on the experiences of students within conventional racial/ethnic categories (white, Hispanic/Latina/o, Asian, black). Despite substantial shifts in the racial and ethnic composition of American children, studies continue to overlook the diverse racialized experiences of students of Asian and Latina/o descent, the racialization of immigration status, and the educational experiences of Native American students. This study provides one possible strategy for developing multidimensional measures of race using large-scale datasets and demonstrates the utility of multidimensional measures for examining educational inequality, using teacher perceptions of student behavior as a case in point. With data from the first grade wave of the Early Childhood Longitudinal Study, Kindergarten Cohort of 1998–1999, I examine differences in teacher ratings of Externalizing Problem Behaviors and Approaches to Learning across fourteen racialized subgroups at the intersections of race, ethnicity, and immigrant status. Results show substantial subgroup variation in teacher perceptions of problem and learning behaviors, while also highlighting key points of divergence and convergence within conventional racial/ethnic categories. PMID:26413559
A Visual Analytics Approach for Station-Based Air Quality Data
Du, Yi; Ma, Cuixia; Wu, Chao; Xu, Xiaowei; Guo, Yike; Zhou, Yuanchun; Li, Jianhui
2016-01-01
With the deployment of multi-modality and large-scale sensor networks for monitoring air quality, we are now able to collect large and multi-dimensional spatio-temporal datasets. For these sensed data, we present a comprehensive visual analysis approach for air quality analysis. This approach integrates several visual methods, such as map-based views, calendar views, and trends views, to assist the analysis. Among those visual methods, map-based visual methods are used to display the locations of interest, and the calendar and trends views are used to discover linear and periodic patterns. The system also provides various interaction tools to combine the map-based visualization, trends view, calendar view and multi-dimensional view. In addition, we propose a self-adaptive calendar-based controller that can flexibly adapt to changes of data size and granularity in the trends view. Such a visual analytics system would facilitate big-data analysis in real applications, especially for decision-making support. PMID:28029117
High performance computing environment for multidimensional image analysis
Rao, A Ravishankar; Cecchi, Guillermo A; Magnasco, Marcelo
2007-01-01
Background: The processing of images acquired through microscopy is a challenging task due to the large size of datasets (several gigabytes) and the fast turnaround time required. If the throughput of the image processing stage is significantly increased, it can have a major impact in microscopy applications. Results: We present a high performance computing (HPC) solution to this problem. This involves decomposing the spatial 3D image into segments that are assigned to unique processors, and matched to the 3D torus architecture of the IBM Blue Gene/L machine. Communication between segments is restricted to the nearest neighbors. When running on a 2 GHz Intel CPU, the task of 3D median filtering on a typical 256 megabyte dataset takes two and a half hours, whereas by using 1024 nodes of Blue Gene, this task can be performed in 18.8 seconds, a 478× speedup. Conclusion: Our parallel solution dramatically improves the performance of image processing, feature extraction and 3D reconstruction tasks. This increased throughput permits biologists to conduct unprecedented large scale experiments with massive datasets. PMID:17634099
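The reported figures and the torus-style decomposition can be sketched back-of-envelope; the 8×16×8 processor grid and the block-to-rank mapping below are invented stand-ins for Blue Gene/L's actual topology mapping:

```python
# Back-of-envelope check of the reported speedup, plus a toy mapping of
# 3-D image blocks onto a 3-D processor grid with wrap-around
# nearest-neighbour links. Grid shape is an invented example.
serial_s = 2.5 * 3600      # two and a half hours, in seconds
parallel_s = 18.8
speedup = serial_s / parallel_s   # ~478x, matching the abstract

def block_owner(ix, iy, iz, grid=(8, 16, 8)):
    """Map a 3-D block index to a rank on an 8x16x8 grid (1024 nodes)."""
    gx, gy, gz = grid
    return (ix % gx) * gy * gz + (iy % gy) * gz + (iz % gz)

def neighbours(ix, iy, iz, grid=(8, 16, 8)):
    """Ranks of the six face-adjacent blocks (torus wrap-around)."""
    return [block_owner(ix + dx, iy + dy, iz + dz, grid)
            for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
```

Restricting halo exchange to these six neighbours is what keeps communication local on the torus.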
NASA Astrophysics Data System (ADS)
Appel, Marius; Lahn, Florian; Pebesma, Edzer; Buytaert, Wouter; Moulds, Simon
2016-04-01
Today's amount of freely available data requires scientists to spend large parts of their work on data management. This is especially true in environmental sciences when working with large remote sensing datasets, such as those obtained from earth-observation satellites like the Sentinel fleet. Many frameworks like SpatialHadoop or Apache Spark address the scalability but target programmers rather than data analysts, and are not dedicated to imagery or array data. In this work, we use the open-source data management and analytics system SciDB to bring large earth-observation datasets closer to analysts. Its underlying data representation as multidimensional arrays fits naturally to earth-observation datasets, distributes storage and computational load over multiple instances by multidimensional chunking, and also enables efficient time-series based analyses, which are usually difficult using file- or tile-based approaches. Existing interfaces to R and Python furthermore allow for scalable analytics with relatively little learning effort. However, interfacing SciDB and file-based earth-observation datasets that come as tiled temporal snapshots requires a lot of manual bookkeeping during ingestion, and SciDB natively only supports loading data from CSV-like and custom binary formatted files, which currently limits its practical use in earth-observation analytics. To make it easier to work with large multi-temporal datasets in SciDB, we developed software tools that enrich SciDB with earth observation metadata and allow working with commonly used file formats: (i) the SciDB extension library scidb4geo simplifies working with spatiotemporal arrays by adding relevant metadata to the database and (ii) the Geospatial Data Abstraction Library (GDAL) driver implementation scidb4gdal allows ingesting and exporting remote sensing imagery from and to a large number of file formats.
Using added metadata on temporal resolution and coverage, the GDAL driver supports time-based ingestion of imagery to existing multi-temporal SciDB arrays. While our SciDB plugin works directly in the database, the GDAL driver has been specifically developed using a minimum amount of external dependencies (i.e., cURL). Source code for both tools is available from GitHub [1]. We present these tools in a case study that demonstrates the ingestion of multi-temporal tiled earth-observation data to SciDB, followed by a time-series analysis using R and SciDBR. Through the exclusive use of open-source software, our approach supports reproducibility in scalable large-scale earth-observation analytics. In the future, these tools can be used in an automated way to let scientists work only on ready-to-use SciDB arrays, significantly reducing the data management workload for domain scientists. [1] https://github.com/mappl/scidb4geo and https://github.com/mappl/scidb4gdal
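The multidimensional chunking that spreads SciDB's storage and compute load can be sketched as a pure index mapping; the chunk sizes and round-robin placement below are invented (SciDB's real chunking and distribution are configurable):

```python
# Minimal sketch of multidimensional chunking: a (lat, lon, time) cell
# maps to integer chunk coordinates, and chunks are spread over worker
# instances. Chunk shape and placement scheme are invented toy values.
def chunk_of(lat_i, lon_i, t_i, chunk=(256, 256, 32)):
    """Integer chunk coordinates for a cell of a 3-D spacetime array."""
    return (lat_i // chunk[0], lon_i // chunk[1], t_i // chunk[2])

def instance_for(chunk_coords, n_instances=4):
    """Toy round-robin placement of a chunk onto one of n workers."""
    ci, cj, ck = chunk_coords
    return (ci + cj + ck) % n_instances

# a cell at lat index 300, lon index 10, time index 40:
cc = chunk_of(300, 10, 40)       # lands in chunk (1, 0, 1)
worker = instance_for(cc)        # one of 4 workers
```

Because a time series for a fixed (lat, lon) touches few chunks, time-series analyses stay efficient, unlike tile-per-file layouts.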
A Graphics Design Framework to Visualize Multi-Dimensional Economic Datasets
ERIC Educational Resources Information Center
Chandramouli, Magesh; Narayanan, Badri; Bertoline, Gary R.
2013-01-01
This study implements a prototype graphics visualization framework to visualize multidimensional data. This graphics design framework serves as a "visual analytical database" for visualization and simulation of economic models. One of the primary goals of any kind of visualization is to extract useful information from colossal volumes of…
Visualising nursing data using correspondence analysis.
Kokol, Peter; Blažun Vošner, Helena; Železnik, Danica
2016-09-01
Digitally stored, large healthcare datasets enable nurses to use 'big data' techniques and tools in nursing research. Big data is complex and multi-dimensional, so visualisation may be a preferable approach to analyse and understand it. The aim of this article is to demonstrate the use of visualisation of big data in a technique called correspondence analysis. In the authors' study, relations among data in a nursing dataset were shown visually in graphs using correspondence analysis. The case presented demonstrates that correspondence analysis is easy to use, shows relations between data visually in a form that is simple to interpret, and can reveal hidden associations between data. Correspondence analysis supports the discovery of new knowledge. Implications for practice: Knowledge obtained using correspondence analysis can be transferred immediately into practice or used to foster further research.
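Correspondence analysis itself is a short computation: an SVD of the standardized residuals of a contingency table. The sketch below uses an invented wards-by-care-categories table; the construction follows the standard textbook formulation, not any code from the article:

```python
# Correspondence analysis on a toy contingency table (rows: wards,
# columns: documented care categories; counts invented).
import numpy as np

N = np.array([[30., 10.,  5.],
              [10., 25., 15.],
              [ 5., 15., 35.]])
P = N / N.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row / column masses
# standardized residuals, then SVD
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]   # principal row coordinates
inertia = (sv ** 2).sum()                     # total inertia = chi-square / n
```

Plotting the first two columns of `row_coords` (and the analogous column coordinates) gives the graphs the article describes.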
Extracting Unidimensional Chains from Multidimensional Datasets: A Graph Theory Approach.
ERIC Educational Resources Information Center
Yamomoto, Yoneo; Wise, Steven L.
This paper outlines an order-analysis procedure that uses graph theory to efficiently extract nonredundant, unidimensional chains of items from multidimensional datasets, with chain consistency as the criterion for chain membership. The procedure is intended as an alternative to the Reynolds (1976) procedure, which is described as being…
Chalkley, Robert J; Baker, Peter R; Hansen, Kirk C; Medzihradszky, Katalin F; Allen, Nadia P; Rexach, Michael; Burlingame, Alma L
2005-08-01
An in-depth analysis of a multidimensional chromatography-mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight (QqTOF) geometry instrument was carried out. A total of 3269 CID spectra were acquired. Through manual verification of database search results and de novo interpretation of spectra, 2368 spectra could be confidently assigned to predicted tryptic peptides. A detailed analysis of the non-matching spectra was also carried out, highlighting what the non-matching spectra in a database search are typically composed of. The results of this comprehensive dataset study demonstrate that QqTOF instruments produce information-rich data, a high percentage of which is readily interpretable.
Xray: N-dimensional, labeled arrays for analyzing physical datasets in Python
NASA Astrophysics Data System (ADS)
Hoyer, S.
2015-12-01
Efficient analysis of geophysical datasets requires tools that both preserve and utilize metadata, and that transparently scale to process large datasets. Xray is such a tool, in the form of an open source Python library for analyzing the labeled, multi-dimensional array (tensor) datasets that are ubiquitous in the Earth sciences. Xray's approach pairs Python data structures based on the data model of the netCDF file format with the proven design and user interface of pandas, the popular Python data analysis library for labeled tabular data. On top of the NumPy array, xray adds labeled dimensions (e.g., "time") and coordinate values (e.g., "2015-04-10"), which it uses to enable a host of operations powered by these labels: selection, aggregation, alignment, broadcasting, split-apply-combine, interoperability with pandas and serialization to netCDF/HDF5. Many of these operations are enabled by xray's tight integration with pandas. Finally, to allow for easy parallelism and to enable its labeled data operations to scale to datasets that do not fit into memory, xray integrates with the parallel processing library dask.
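The label-based selection that xray popularized can be illustrated from scratch; the class below is a toy sketch of the idea (dimension name plus coordinate label resolves to a positional index), not the xray/xarray API, and the data are invented:

```python
# Toy sketch of label-based selection on a 2-D array with named
# dimensions and coordinate labels. Not the real xray/xarray API.
class LabeledArray:
    def __init__(self, data, dims, coords):
        self.data, self.dims, self.coords = data, dims, coords

    def sel(self, **labels):
        """Label-based lookup, e.g. t.sel(time='2015-04-11', site='SGP')."""
        i = self.coords[self.dims[0]].index(labels[self.dims[0]])
        j = self.coords[self.dims[1]].index(labels[self.dims[1]])
        return self.data[i][j]

temps = LabeledArray(
    [[271.3, 290.1], [272.0, 291.4]],
    dims=("time", "site"),
    coords={"time": ["2015-04-10", "2015-04-11"], "site": ["NSA", "SGP"]},
)
value = temps.sel(time="2015-04-11", site="SGP")
```

The labels, rather than positional indices, are what make alignment and aggregation across datasets safe.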
Atwood, Robert C.; Bodey, Andrew J.; Price, Stephen W. T.; Basham, Mark; Drakopoulos, Michael
2015-01-01
Tomographic datasets collected at synchrotrons are becoming very large and complex, and, therefore, need to be managed efficiently. Raw images may have high pixel counts, and each pixel can be multidimensional and associated with additional data such as those derived from spectroscopy. In time-resolved studies, hundreds of tomographic datasets can be collected in sequence, yielding terabytes of data. Users of tomographic beamlines are drawn from various scientific disciplines, and many are keen to use tomographic reconstruction software that does not require a deep understanding of reconstruction principles. We have developed Savu, a reconstruction pipeline that enables users to rapidly reconstruct data to consistently create high-quality results. Savu is designed to work in an ‘orthogonal’ fashion, meaning that data can be converted between projection and sinogram space throughout the processing workflow as required. The Savu pipeline is modular and allows processing strategies to be optimized for users' purposes. In addition to the reconstruction algorithms themselves, it can include modules for identification of experimental problems, artefact correction, general image processing and data quality assessment. Savu is open source, open licensed and ‘facility-independent’: it can run on standard cluster infrastructure at any institution. PMID:25939626
A Generic multi-dimensional feature extraction method using multiobjective genetic programming.
Zhang, Yang; Rockett, Peter I
2009-01-01
In this paper, we present a generic feature extraction method for pattern classification using multiobjective genetic programming. This not only evolves the (near-)optimal set of mappings from a pattern space to a multi-dimensional decision space, but also simultaneously optimizes the dimensionality of that decision space. The presented framework evolves vector-to-vector feature extractors that maximize class separability. We demonstrate the efficacy of our approach by making statistically-founded comparisons with a wide variety of established classifier paradigms over a range of datasets and find that for most of the pairwise comparisons, our evolutionary method delivers statistically smaller misclassification errors. At very worst, our method displays no statistical difference in a few pairwise comparisons with established classifier/dataset combinations; crucially, none of the misclassification results produced by our method is worse than any comparator classifier. Although principally focused on feature extraction, feature selection is also performed as an implicit side effect; we show that both feature extraction and selection are important to the success of our technique. The presented method has the practical consequence of obviating the need to exhaustively evaluate a large family of conventional classifiers when faced with a new pattern recognition problem in order to attain a good classification accuracy.
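The multiobjective selection step, keeping only non-dominated candidates when jointly minimizing misclassification error and decision-space dimensionality, can be sketched with a plain Pareto-front filter; the candidate scores below are invented:

```python
# Sketch of Pareto (non-dominated) selection over two minimized
# objectives: (misclassification error, decision-space dimensionality).
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (error, n_dimensions) for five candidate feature extractors (invented)
candidates = [(0.10, 5), (0.12, 3), (0.10, 4), (0.20, 2), (0.25, 6)]
front = pareto_front(candidates)
```

In the paper's framework, genetic programming evolves the extractors and this kind of dominance test drives survivor selection.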
Development and Validation of the Minnesota Borderline Personality Disorder Scale (MBPD)
Bornovalova, Marina A.; Hicks, Brian M.; Patrick, Christopher J.; Iacono, William G.; McGue, Matt
2011-01-01
While large epidemiological datasets can inform research on the etiology and development of borderline personality disorder (BPD), they rarely include BPD measures. In some cases, however, proxy measures can be constructed using instruments already in these datasets. In this study we developed and validated a self-report measure of BPD from the Multidimensional Personality Questionnaire (MPQ). Items for the new instrument—the Minnesota BPD scale (MBPD)—were identified and refined using three large samples: undergraduates, community adolescent twins, and urban substance users. We determined the construct validity of the MBPD by examining its association with (1) diagnosed BPD, (2) questionnaire reported BPD symptoms, and (3) clinical variables associated with BPD: suicidality, trauma, disinhibition, internalizing distress, and substance use. We also tested the MBPD in two prison inmate samples. Across samples, the MBPD correlated with BPD indices and external criteria, and showed incremental validity above measures of negative affect, thus supporting its construct validity as a measure of BPD. PMID:21467094
Sonification Prototype for Space Physics
NASA Astrophysics Data System (ADS)
Candey, R. M.; Schertenleib, A. M.; Diaz Merced, W. L.
2005-12-01
As an alternative and adjunct to visual displays, auditory exploration of data via sonification (data controlled sound) and audification (audible playback of data samples) is promising for complex or rapidly/temporally changing visualizations, for data exploration of large datasets (particularly multi-dimensional datasets), and for exploring datasets in frequency rather than spatial dimensions (see also International Conferences on Auditory Display
Wu, Zhaohua; Feng, Jiaxin; Qiao, Fangli; Tan, Zhe-Min
2016-04-13
In this big data era, it is more urgent than ever to solve two major issues: (i) fast data transmission methods that can facilitate access to data from non-local sources and (ii) fast and efficient data analysis methods that can reveal the key information from the available data for particular purposes. Although approaches in different fields to address these two questions may differ significantly, the common part must involve data compression techniques and a fast algorithm. This paper introduces the recently developed adaptive and spatio-temporally local analysis method, namely the fast multidimensional ensemble empirical mode decomposition (MEEMD), for the analysis of a large spatio-temporal dataset. The original MEEMD uses ensemble empirical mode decomposition to decompose time series at each spatial grid and then pieces together the temporal-spatial evolution of climate variability and change on naturally separated timescales, which is computationally expensive. By taking advantage of the high efficiency of the expression using principal component analysis/empirical orthogonal function analysis for spatio-temporally coherent data, we design a lossy compression method for climate data to facilitate its non-local transmission. We also explain the basic principles behind the fast MEEMD through decomposing principal components instead of original grid-wise time series to speed up computation of MEEMD. Using a typical climate dataset as an example, we demonstrate that our newly designed methods can (i) compress data with a compression rate of one to two orders of magnitude; and (ii) speed up the MEEMD algorithm by one to two orders of magnitude.
Nanocubes for real-time exploration of spatiotemporal datasets.
Lins, Lauro; Klosowski, James T; Scheidegger, Carlos
2013-12-01
Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop's main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
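The "precompute every possible aggregate" idea behind data cubes can be shown naively with a dictionary keyed by partial attribute assignments; real nanocubes share structure across keys to fit in memory, whereas this toy cube is exponential in the number of attributes, and the events below are invented:

```python
# Naive data cube: count events for every subset of attribute
# assignments, so any aggregate query becomes a dictionary lookup.
from itertools import combinations
from collections import Counter

events = [
    {"city": "NYC", "hour": 9,  "device": "ios"},
    {"city": "NYC", "hour": 9,  "device": "android"},
    {"city": "LA",  "hour": 22, "device": "ios"},
]
ATTRS = ["city", "hour", "device"]

cube = Counter()
for e in events:
    for r in range(len(ATTRS) + 1):
        for keys in combinations(ATTRS, r):
            cube[tuple((k, e[k]) for k in keys)] += 1

total = cube[()]                                  # all events
nyc_9am = cube[(("city", "NYC"), ("hour", 9))]    # one drill-down query
```

A heatmap or histogram then reads off precomputed counts instead of scanning the events.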
An Evaluation of Database Solutions to Spatial Object Association
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumar, V S; Kurc, T; Saltz, J
2008-06-24
Object association is a common problem encountered in many applications. Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two datasets based on their positions in a common spatial coordinate system--one of the datasets may correspond to a catalog of objects observed over time in a multi-dimensional domain; the other dataset may consist of objects observed in a snapshot of the domain at a time point. The use of database management systems to solve the object association problem provides portability across different platforms and also greater flexibility. Increasing dataset sizes in today's applications, however, have made object association a data/compute-intensive problem that requires targeted optimizations for efficient execution. In this work, we investigate how database-based crossmatch algorithms can be deployed on different database system architectures and evaluate the deployments to understand the impact of architectural choices on crossmatch performance and associated trade-offs. We investigate the execution of two crossmatch algorithms on (1) a parallel database system with active disk style processing capabilities, (2) a high-throughput network database (MySQL Cluster), and (3) shared-nothing databases with replication. We have conducted our study in the context of a large-scale astronomy application with real use-case scenarios.
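Outside a database, the core of a crossmatch can be sketched with spatial hashing: bin one catalog on a grid of cell size equal to the match tolerance, then probe only the nine surrounding cells for each object of the other catalog. The coordinates and tolerance below are invented toy values:

```python
# Sketch of a grid-hash crossmatch between two small catalogs.
from collections import defaultdict
from math import hypot, floor

def crossmatch(cat_a, cat_b, tol):
    """Return (i, j) index pairs of points closer than tol."""
    grid = defaultdict(list)
    for j, (x, y) in enumerate(cat_b):
        grid[(floor(x / tol), floor(y / tol))].append(j)
    matches = []
    for i, (x, y) in enumerate(cat_a):
        cx, cy = floor(x / tol), floor(y / tol)
        for dx in (-1, 0, 1):            # probe the 9 neighbouring cells
            for dy in (-1, 0, 1):
                for j in grid[(cx + dx, cy + dy)]:
                    bx, by = cat_b[j]
                    if hypot(x - bx, y - by) < tol:
                        matches.append((i, j))
    return matches

cat_a = [(10.001, 20.002), (55.0, 60.0)]
cat_b = [(10.000, 20.000), (90.0, 90.0)]
pairs = crossmatch(cat_a, cat_b, tol=0.01)
```

A database realizes the same join with a zone or HTM index; the grid here plays the role of that index.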
Quantification and Visualization of Variation in Anatomical Trees
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amenta, Nina; Datar, Manasi; Dirksen, Asger
This paper presents two approaches to quantifying and visualizing variation in datasets of trees. The first approach localizes subtrees in which significant population differences are found through hypothesis testing and sparse classifiers on subtree features. The second approach visualizes the global metric structure of datasets through low-distortion embedding into hyperbolic planes in the style of multidimensional scaling. A case study is made on a dataset of airway trees in relation to Chronic Obstructive Pulmonary Disease.
2D Radiative Processes Near Cloud Edges
NASA Technical Reports Server (NTRS)
Varnai, T.
2012-01-01
Because of the importance and complexity of dynamical, microphysical, and radiative processes taking place near cloud edges, the transition zone between clouds and cloud free air has been the subject of intense research both in the ASR program and in the wider community. One challenge in this research is that the one-dimensional (1D) radiative models widely used in both remote sensing and dynamical simulations become less accurate near cloud edges: The large horizontal gradients in particle concentrations imply that accurate radiative calculations need to consider multi-dimensional radiative interactions among areas that have widely different optical properties. This study examines the way the importance of multidimensional shortwave radiative interactions changes as we approach cloud edges. For this, the study relies on radiative simulations performed for a multiyear dataset of clouds observed over the NSA, SGP, and TWP sites. This dataset is based on Microbase cloud profiles as well as wind measurements and ARM cloud classification products. The study analyzes the way the difference between 1D and 2D simulation results increases near cloud edges. It considers both monochromatic radiances and broadband radiative heating, and it also examines the influence of factors such as cloud type and height, and solar elevation. The results provide insights into the workings of radiative processes and may help better interpret radiance measurements and better estimate the radiative impacts of this critical region.
Multidimensional Poverty and Health Status as a Predictor of Chronic Income Poverty.
Callander, Emily J; Schofield, Deborah J
2015-12-01
Longitudinal analysis of Waves 5 to 10 of the nationally representative Household, Income and Labour Dynamics in Australia dataset was undertaken to assess whether multidimensional poverty status can predict chronic income poverty. Of those who were multidimensionally poor (low income plus poor health, or poor health and insufficient education attainment) in 2007, and those who were in income poverty only (no other forms of disadvantage) in 2007, a greater proportion of those in multidimensional poverty continued to be in income poverty for the subsequent 5 years through to 2012. People who were multidimensionally poor in 2007 had 2.17 times the odds of being in income poverty each year through to 2012 compared with those who were in income poverty only in 2005 (95% CI: 1.23-3.83). Multidimensional poverty measures are a useful tool for policymakers to identify target populations for policies aiming to improve equity and reduce chronic disadvantage. Copyright © 2014 John Wiley & Sons, Ltd.
CELL5M: A geospatial database of agricultural indicators for Africa South of the Sahara.
Koo, Jawoo; Cox, Cindy M; Bacou, Melanie; Azzarri, Carlo; Guo, Zhe; Wood-Sichra, Ulrike; Gong, Queenie; You, Liangzhi
2016-01-01
Recent progress in large-scale georeferenced data collection is widening opportunities for combining multi-disciplinary datasets from biophysical to socioeconomic domains, advancing our analytical and modeling capacity. Granular spatial datasets provide critical information necessary for decision makers to identify target areas, assess baseline conditions, prioritize investment options, set goals and targets and monitor impacts. However, key challenges in reconciling data across themes, scales and borders restrict our capacity to produce global and regional maps and time series. This paper provides the overview, structure and coverage of CELL5M, an open-access database of geospatial indicators at 5 arc-minute grid resolution, and introduces a range of analytical applications and use cases. CELL5M covers a wide set of agriculture-relevant domains for all countries in Africa South of the Sahara and supports our understanding of multi-dimensional spatial variability inherent in farming landscapes throughout the region.
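Mapping a coordinate to its 5 arc-minute (1/12 degree) cell is a one-liner. The row/column origin convention below (rows counted from 90N, columns from 180W) is an assumption for illustration, since the abstract does not specify CELL5M's internal indexing.

```python
def cell5m_index(lat, lon):
    """Row/column of the 5 arc-minute grid cell containing a point.
    Convention (assumed): rows count southward from 90N, columns eastward from 180W."""
    cells_per_degree = 12  # 60 arc-minutes per degree / 5 arc-minutes per cell
    row = int((90.0 - lat) * cells_per_degree)
    col = int((lon + 180.0) * cells_per_degree)
    return row, col

# Nairobi, roughly (-1.29, 36.82)
row_col = cell5m_index(-1.29, 36.82)
print(row_col)  # → (1095, 2601)
```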
Users' Manual and Installation Guide for the EverVIEW Slice and Dice Tool (Version 1.0 Beta)
Roszell, Dustin; Conzelmann, Craig; Chimmula, Sumani; Chandrasekaran, Anuradha; Hunnicut, Christina
2009-01-01
Network Common Data Form (NetCDF) is a self-describing, machine-independent file format for storing array-oriented scientific data. Over the past few years, there has been a growing movement within the community of natural resource managers in The Everglades, Fla., to use NetCDF as the standard data container for datasets based on multidimensional arrays. As a consequence, a need arose for additional tools to view and manipulate NetCDF datasets, specifically to create subsets of large NetCDF files. To address this need, we created the EverVIEW Slice and Dice Tool to allow users to create subsets of grid-based NetCDF files. The major functions of this tool are (1) to subset NetCDF files both spatially and temporally; (2) to view the NetCDF data in table form; and (3) to export filtered data to a comma-separated value file format.
DataPflex: a MATLAB-based tool for the manipulation and visualization of multidimensional datasets.
Hendriks, Bart S; Espelin, Christopher W
2010-02-01
DataPflex is a MATLAB-based application that facilitates the manipulation and visualization of multidimensional datasets. The strength of DataPflex lies in the intuitive graphical user interface for the efficient incorporation, manipulation and visualization of high-dimensional data that can be generated by multiplexed protein measurement platforms including, but not limited to, Luminex or Meso-Scale Discovery. Such data can generally be represented in the form of multidimensional datasets [for example (time x stimulation x inhibitor x inhibitor concentration x cell type x measurement)]. For cases where measurements are made in a combinatorial fashion across multiple dimensions, there is a need for a tool to efficiently manipulate and reorganize such data for visualization. DataPflex accepts data consisting of up to five arbitrary dimensions in addition to a measurement dimension. Data are imported from a simple .xls format and can be exported to MATLAB or .xls. Data dimensions can be reordered, subdivided, merged, normalized and visualized in the form of collections of line graphs, bar graphs, surface plots, heatmaps, IC50s and other custom plots. Open source implementation in MATLAB enables easy extension for custom plotting routines and integration with more sophisticated analysis tools. DataPflex is distributed under the GPL license (http://www.gnu.org/licenses/) together with documentation, source code and sample data files at: http://code.google.com/p/datapflex. Supplementary data available at Bioinformatics online.
Big Data Clustering via Community Detection and Hyperbolic Network Embedding in IoT Applications.
Karyotis, Vasileios; Tsitseklis, Konstantinos; Sotiropoulos, Konstantinos; Papavassiliou, Symeon
2018-04-15
In this paper, we present a novel data clustering framework for big sensory data produced by IoT applications. Based on a network representation of the relations among multi-dimensional data, data clustering is mapped to node clustering over the produced data graphs. To address the potential very large scale of such datasets/graphs that test the limits of state-of-the-art approaches, we map the problem of data clustering to a community detection one over the corresponding data graphs. Specifically, we propose a novel computational approach for enhancing the traditional Girvan-Newman (GN) community detection algorithm via hyperbolic network embedding. The data dependency graph is embedded in the hyperbolic space via Rigel embedding, allowing more efficient computation of edge-betweenness centrality needed in the GN algorithm. This allows for more efficient clustering of the nodes of the data graph in terms of modularity, without sacrificing considerable accuracy. In order to study the operation of our approach with respect to enhancing GN community detection, we employ various representative types of artificial complex networks, such as scale-free, small-world and random geometric topologies, and frequently-employed benchmark datasets for demonstrating its efficacy in terms of data clustering via community detection. Furthermore, we provide a proof-of-concept evaluation by applying the proposed framework over multi-dimensional datasets obtained from an operational smart-city/building IoT infrastructure provided by the Federated Interoperable Semantic IoT/cloud Testbeds and Applications (FIESTA-IoT) testbed federation. It is shown that the proposed framework can be indeed used for community detection/data clustering and exploited in various other IoT applications, such as performing more energy-efficient smart-city/building sensing.
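The expensive ingredient of Girvan-Newman that the hyperbolic embedding approximates is edge-betweenness centrality. A minimal exact version for unweighted graphs (BFS-based Brandes-style accumulation, without the Rigel embedding speedup) might look like this; the example graph is invented:

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge in an unweighted graph."""
    bet = defaultdict(float)
    for source in adj:
        # BFS from source, counting shortest paths (sigma) and predecessors
        dist = {source: 0}
        sigma = defaultdict(float)
        sigma[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate path dependencies from the BFS frontier back to the source
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return bet

# two triangles joined by the bridge 2-3: the bridge should score highest
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
bet = edge_betweenness(adj)
bridge = max(bet, key=bet.get)
print(sorted(bridge))  # → [2, 3]
```

Girvan-Newman repeatedly removes the highest-scoring edge and recomputes; since every removal triggers a full recomputation, replacing this exact calculation with an embedding-based approximation is where the paper's speedup comes from.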
Visualizing Big Data Outliers through Distributed Aggregation.
Wilkinson, Leland
2017-08-29
Visualizing outliers in massive datasets requires statistical pre-processing in order to reduce the scale of the problem to a size amenable to rendering systems like D3, Plotly or analytic systems like R or SAS. This paper presents a new algorithm, called hdoutliers, for detecting multidimensional outliers. It is unique for a) dealing with a mixture of categorical and continuous variables, b) dealing with big-p (many columns of data), c) dealing with big-n (many rows of data), d) dealing with outliers that mask other outliers, and e) dealing consistently with unidimensional and multidimensional datasets. Unlike ad hoc methods found in many machine learning papers, hdoutliers is based on a distributional model that allows outliers to be tagged with a probability. This critical feature reduces the likelihood of false discoveries.
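A much-simplified sketch of the idea behind hdoutliers: cluster with a one-pass Leader algorithm, then flag exemplars of tiny, isolated clusters. The fixed `2 * radius` cutoff and `min_size` threshold below are invented stand-ins for the distributional (exponential) model the paper actually fits, which is what lets it attach probabilities to outliers.

```python
from math import dist

def leader_clusters(points, radius):
    """One-pass Leader algorithm: each point joins the first exemplar
    within `radius`, otherwise it becomes a new exemplar."""
    exemplars, members = [], []
    for p in points:
        for i, e in enumerate(exemplars):
            if dist(p, e) <= radius:
                members[i].append(p)
                break
        else:
            exemplars.append(p)
            members.append([p])
    return exemplars, members

def flag_outliers(points, radius, min_size=2):
    """Exemplars of tiny clusters far from every other exemplar are candidates.
    (hdoutliers instead models the distribution of nearest-exemplar gaps.)"""
    exemplars, members = leader_clusters(points, radius)
    flagged = []
    for e, group in zip(exemplars, members):
        if len(group) < min_size:
            nearest = min(dist(e, o) for o in exemplars if o is not e)
            if nearest > 2 * radius:
                flagged.append(e)
    return flagged

pts = [(0, 0), (0.1, 0), (0.2, 0.1), (0.1, 0.2), (5, 5)]
flagged = flag_outliers(pts, radius=0.5)
print(flagged)  # → [(5, 5)]
```

The Leader pass is what makes the approach big-n friendly: it touches each row once instead of computing all pairwise distances.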
EdgeMaps: visualizing explicit and implicit relations
NASA Astrophysics Data System (ADS)
Dörk, Marian; Carpendale, Sheelagh; Williamson, Carey
2011-01-01
In this work, we introduce EdgeMaps as a new method for integrating the visualization of explicit and implicit data relations. Explicit relations are specific connections between entities already present in a given dataset, while implicit relations are derived from multidimensional data based on shared properties and similarity measures. Many datasets include both types of relations, which are often difficult to represent together in information visualizations. Node-link diagrams typically focus on explicit data connections, while not incorporating implicit similarities between entities. Multi-dimensional scaling considers similarities between items; however, explicit links between nodes are not displayed. In contrast, EdgeMaps visualizes both implicit and explicit relations by combining and complementing spatialization and graph drawing techniques. As a case study for this approach we chose a dataset of philosophers, their interests, influences, and birthdates. By introducing the limitation of activating only one node at a time, interesting visual patterns emerge that resemble the aesthetics of fireworks and waves. We argue that the interactive exploration of these patterns may allow the viewer to grasp the structure of a graph better than complex node-link visualizations.
Blazing Signature Filter: a library for fast pairwise similarity comparisons
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Joon-Yong; Fujimoto, Grant M.; Wilson, Ryan
Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts as it allows scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to efficiently identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize the computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
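The bit-operator trick can be illustrated in a few lines. The thresholding rule used to binarize a vector is an assumption (the BSF's actual transformation is not specified in the abstract); the point is that one AND plus a popcount replaces a full floating-point similarity calculation.

```python
def signature(values, threshold=0.0):
    """Pack a numeric vector into an integer bitmask: bit i is set
    when value i exceeds the (assumed) threshold."""
    bits = 0
    for i, v in enumerate(values):
        if v > threshold:
            bits |= 1 << i
    return bits

def coarse_similarity(sig_a, sig_b):
    """Count of shared 'on' bits: a single AND plus a popcount."""
    return bin(sig_a & sig_b).count("1")

a = signature([2.1, -0.3, 1.7, 0.0, 4.2])   # bits 0, 2, 4
b = signature([1.9,  0.8, -2.0, 0.1, 3.3])  # bits 0, 1, 3, 4
sim = coarse_similarity(a, b)
print(sim)  # → 2
```

Pairs whose coarse similarity falls below a cutoff are discarded before any expensive correlation or distance computation, which is the filtering step the abstract describes.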
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balke, Nina; Kalinin, Sergei V.; Jesse, Stephen
2016-08-12
Kelvin probe force microscopy (KPFM) has provided deep insights into the role local electronic, ionic and electrochemical processes play in the global functionality of materials and devices, even down to the atomic scale. Conventional KPFM utilizes heterodyne detection and bias feedback to measure the contact potential difference (CPD) between tip and sample. This measurement paradigm, however, permits only partial recovery of the information encoded in bias- and time-dependent electrostatic interactions between the tip and sample and effectively down-samples the cantilever response to a single measurement of CPD per pixel. This level of detail is insufficient for electroactive materials, devices, or solid-liquid interfaces, where non-linear dielectrics are present or spurious electrostatic events are possible. Here, we simulate and experimentally validate a novel approach for spatially resolved KPFM capable of a full information transfer of the dynamic electric processes occurring between tip and sample. General acquisition mode, or G-Mode, adopts a big data approach utilising high speed detection, compression, and storage of the raw cantilever deflection signal in its entirety at high sampling rates (> 4 MHz), providing a permanent record of the tip trajectory. We develop a range of methodologies for analysing the resultant large multidimensional datasets involving classical, physics-based and information-based approaches. Physics-based analysis of G-Mode KPFM data recovers the parabolic bias dependence of the electrostatic force for each cycle of the excitation voltage, leading to a multidimensional dataset containing spatial and temporal dependence of the CPD and capacitance channels. We use multivariate statistical methods to reduce data volume and separate the complex multidimensional data sets into statistically significant components that can then be mapped onto separate physical mechanisms.
Overall, G-Mode KPFM offers a new paradigm for studying dynamic electric phenomena at electroactive interfaces, as well as a promising approach for extending KPFM to solid-liquid interfaces.
Parsimony and goodness-of-fit in multi-dimensional NMR inversion
NASA Astrophysics Data System (ADS)
Babak, Petro; Kryuchkov, Sergey; Kantzas, Apostolos
2017-01-01
Multi-dimensional nuclear magnetic resonance (NMR) experiments are often used for study of molecular structure and dynamics of matter in core analysis and reservoir evaluation. Industrial applications of multi-dimensional NMR involve a high-dimensional measurement dataset with complicated correlation structure and require rapid and stable inversion algorithms from the time domain to the relaxation rate and/or diffusion domains. In practice, applying existing inverse algorithms with a large number of parameter values leads to an infinite number of solutions with a reasonable fit to the NMR data. The interpretation of such variability of multiple solutions and selection of the most appropriate solution could be a very complex problem. In most cases the characteristics of materials have sparse signatures, and investigators would like to distinguish the most significant relaxation and diffusion values of the materials. To produce an easy to interpret and unique NMR distribution with a finite number of principal parameter values, we introduce a new method for NMR inversion. The method is constructed based on the trade-off between the conventional goodness-of-fit approach to multivariate data and the principle of parsimony guaranteeing inversion with the least number of parameter values. We suggest performing the inversion of NMR data using the forward stepwise regression selection algorithm. To account for the trade-off between goodness-of-fit and parsimony, the objective function is selected based on the Akaike Information Criterion (AIC). The performance of the developed multi-dimensional NMR inversion method and its comparison with conventional methods are illustrated using real data for samples with bitumen, water and clay.
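A simplified, matching-pursuit-style sketch of the parsimony/goodness-of-fit trade-off: greedily add exponential decay components exp(-t/T2) only while AIC keeps improving. The candidate T2 grid, the one-component-at-a-time amplitude fit, and the component cap are all simplifications of the authors' stepwise inversion, invented for illustration.

```python
from math import exp, log

def forward_aic_fit(times, signal, candidate_T2s, max_components=8):
    """Forward selection of exponential components: at each step, fit the
    amplitude of every candidate against the current residual, keep the one
    that lowers AIC = n*ln(RSS/n) + 2k the most, and stop when none does."""
    n = len(signal)
    residual = list(signal)
    selected = []
    rss = sum(r * r for r in residual)
    aic = n * log(rss / n)
    while len(selected) < max_components:
        best = None
        for T2 in candidate_T2s:
            basis = [exp(-t / T2) for t in times]
            amp = (sum(r * b for r, b in zip(residual, basis))
                   / sum(b * b for b in basis))
            new_res = [r - amp * b for r, b in zip(residual, basis)]
            new_rss = max(sum(r * r for r in new_res), 1e-30)  # guard log(0)
            new_aic = n * log(new_rss / n) + 2 * (len(selected) + 1)
            if best is None or new_aic < best[0]:
                best = (new_aic, T2, amp, new_res)
        if best[0] >= aic:  # parsimony: stop once AIC no longer improves
            break
        aic, T2, amp, residual = best
        selected.append((T2, amp))
    return selected

# synthetic two-component decay; the fit should recover a sparse model
times = [i * 0.01 for i in range(200)]
signal = [3.0 * exp(-t / 0.05) + 1.0 * exp(-t / 0.5) for t in times]
components = forward_aic_fit(times, signal, [0.02, 0.05, 0.1, 0.5, 1.0])
```

The 2k penalty term is what keeps the recovered distribution sparse: adding a component must reduce the residual enough to pay for the extra parameter, rather than merely improving the fit.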
Brislin, Sarah J.; Drislane, Laura E.; Smith, Shannon Toney; Edens, John F.; Patrick, Christopher J.
2015-01-01
Psychopathy is conceptualized by the triarchic model as encompassing three distinct phenotypic constructs: boldness, meanness, and disinhibition. In the current study, the Multidimensional Personality Questionnaire (MPQ), a normal-range personality measure, was evaluated for representation of these three constructs. Consensus ratings were used to identify MPQ items most related to each triarchic (Tri) construct. Scale measures were developed from items indicative of each construct, and scores for these scales were evaluated for convergent and discriminant validity in community (N = 176) and incarcerated samples (N = 240). Across the two samples, MPQ-Tri scale scores demonstrated good internal consistencies and relationships with criterion measures of various types consistent with predictions based on the triarchic model. Findings are discussed in terms of their implications for further investigation of the triarchic model constructs in preexisting datasets that include the MPQ, in particular longitudinal and genetically informative datasets. PMID:25642934
ViA: a perceptual visualization assistant
NASA Astrophysics Data System (ADS)
Healey, Chris G.; St. Amant, Robert; Elhaddad, Mahmoud S.
2000-05-01
This paper describes an automated visualization assistant called ViA. ViA is designed to help users construct perceptually optimal visualizations to represent, explore, and analyze large, complex, multidimensional datasets. We have approached this problem by studying what is known about the control of human visual attention. By harnessing the low-level human visual system, we can support our dual goals of rapid and accurate visualization. Perceptual guidelines that we have built using psychophysical experiments form the basis for ViA. ViA uses modified mixed-initiative planning algorithms from artificial intelligence to search for perceptually optimal mappings from data attributes to visual features. Our perceptual guidelines are integrated into evaluation engines that provide evaluation weights for a given data-feature mapping, and hints on how that mapping might be improved. ViA begins by asking users a set of simple questions about their dataset and the analysis tasks they want to perform. Answers to these questions are used in combination with the evaluation engines to identify and intelligently pursue promising data-feature mappings. The result is an automatically-generated set of mappings that are perceptually salient, but that also respect the context of the dataset and users' preferences about how they want to visualize their data.
Shingrani, Rahul; Krenz, Gary; Molthen, Robert
2010-01-01
With advances in medical imaging scanners, it has become commonplace to generate large multidimensional datasets. These datasets require tools for a rapid, thorough analysis. To address this need, we have developed an automated algorithm for morphometric analysis incorporating A Visualization Workshop computational and image processing libraries for three-dimensional segmentation, vascular tree generation and structural hierarchical ordering with a two-stage numeric optimization procedure for estimating vessel diameters. We combine this new technique with our mathematical models of pulmonary vascular morphology to quantify structural and functional attributes of lung arterial trees. Our physiological studies require repeated measurements of vascular structure to determine differences in vessel biomechanical properties between animal models of pulmonary disease. Automation provides many advantages including significantly improved speed and minimized operator interaction and biasing. The results are validated by comparison with previously published rat pulmonary arterial micro-CT data analysis techniques, in which vessels were manually mapped and measured using intense operator intervention. Published by Elsevier Ireland Ltd.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Crowell, Kevin L.; Slysz, Gordon W.; Baker, Erin Shammel
2013-09-05
We introduce a command line software application LC-IMS-MS Feature Finder that searches for molecular ion signatures in multidimensional liquid chromatography-ion mobility spectrometry-mass spectrometry (LC-IMS-MS) data by clustering deisotoped peaks with similar monoisotopic mass, charge state, LC elution time, and ion mobility drift time values. The software application includes an algorithm for detecting and quantifying co-eluting chemical species, including species that exist in multiple conformations that may have been separated in the IMS dimension.
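The clustering step described can be sketched as a greedy single-linkage pass over mass-sorted peaks; all tolerance values and field names below are invented for illustration, not the tool's actual defaults.

```python
def cluster_features(peaks, mass_tol=0.01, lc_tol=0.5, drift_tol=1.0):
    """Group deisotoped peaks into features when monoisotopic mass, charge
    state, LC elution time and IMS drift time all agree within tolerance."""
    features = []  # each entry: (representative peak, member list)
    for peak in sorted(peaks, key=lambda p: p["mass"]):
        for rep, members in features:
            if (peak["charge"] == rep["charge"]
                    and abs(peak["mass"] - rep["mass"]) <= mass_tol
                    and abs(peak["lc_time"] - rep["lc_time"]) <= lc_tol
                    and abs(peak["drift"] - rep["drift"]) <= drift_tol):
                members.append(peak)
                break
        else:
            features.append((peak, [peak]))
    return features

peaks = [
    {"mass": 500.001, "charge": 2, "lc_time": 10.1, "drift": 20.0},
    {"mass": 500.004, "charge": 2, "lc_time": 10.3, "drift": 20.4},
    {"mass": 500.003, "charge": 3, "lc_time": 10.2, "drift": 20.1},  # other charge
]
features = cluster_features(peaks)
print(len(features))  # → 2
```

Co-eluting conformers separated in the IMS dimension would show up here as two features sharing mass, charge and LC time but differing in drift time.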
Optimizing Earth Data Search Ranking using Deep Learning and Real-time User Behaviour
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C. P.; Armstrong, E. M.; Huang, T.; Moroni, D. F.; McGibbney, L. J.; Greguska, F. R., III
2017-12-01
Finding Earth science data has been a challenging problem given both the quantity of data available and the heterogeneity of the data across a wide variety of domains. Current search engines in most geospatial data portals tend to induce end users to focus on one single data characteristic dimension (e.g., term frequency-inverse document frequency (TF-IDF) score, popularity, release date, etc.). This approach largely fails to take account of users' multidimensional preferences for geospatial data, and hence may likely result in a less than optimal user experience in discovering the most applicable dataset out of a vast range of available datasets. With users interacting with search engines, sufficient information is already hidden in the log files. Compared with explicit feedback data, information that can be derived/extracted from log files is virtually free and substantially more timely. In this dissertation, I propose an online deep learning framework that can quickly update the learning function based on real-time user clickstream data. The contributions of this framework include 1) a log processor that can ingest, process and create training data from web logs in a real-time manner; 2) a query understanding module to better interpret users' search intent using web log processing results and metadata; 3) a feature extractor that identifies ranking features representing users' multidimensional interests of geospatial data; and 4) a deep learning based ranking algorithm that can be trained incrementally using user behavior data. The search ranking results will be evaluated using precision at K and normalized discounted cumulative gain (NDCG).
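The two evaluation metrics named at the end are standard and easy to state exactly; the graded relevance labels below are invented.

```python
from math import log2

def precision_at_k(relevances, k):
    """Fraction of the top-k ranked results that are relevant (relevance > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k

def ndcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k, normalized by the ideal ordering."""
    def dcg(rels):
        return sum(r / log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# graded relevance of the results a ranker returned, in rank order
ranked = [3, 0, 2, 0, 1]
print(round(precision_at_k(ranked, 3), 3))  # → 0.667
print(round(ndcg_at_k(ranked, 5), 3))       # → 0.921
```

NDCG rewards placing highly relevant datasets near the top, which is why it suits evaluating a ranker trained on clickstream behaviour rather than binary relevance judgments.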
Noormohammadpour, Pardis; Tavana, Bahareh; Mansournia, Mohammad Ali; Zeinalizadeh, Mehdi; Mirzashahi, Babak; Rostami, Mohsen; Kordi, Ramin
2018-05-01
Translation and cultural adaptation of the National Institutes of Health (NIH) Task Force's minimal dataset. The purpose of this study was to evaluate validity and reliability of the Farsi version of the NIH Task Force's recommended multidimensional minimal dataset for research on chronic low back pain (CLBP). Considering the high treatment cost of CLBP and its increasing prevalence, the NIH Pain Consortium developed research standards (including recommendations for definitions, a minimum dataset, and outcomes' report) for studies regarding CLBP. Application of these recommendations could standardize research and improve comparability of different studies in CLBP. This study had three phases: translation of the dataset into Farsi and its cultural adaptation, assessment of the pre-final version's comprehensibility via a pilot study, and investigation of the reliability and validity of the final version of the translated dataset. Subjects were 250 patients with CLBP. Test-retest reliability, content validity, and convergent validity (correlations among different dimensions of the dataset and Farsi versions of the Oswestry Disability Index, Roland-Morris Disability Questionnaire, Fear-Avoidance Beliefs Questionnaire, and Beck Depression Inventory-II) were assessed. The Farsi version demonstrated good/excellent convergent validity (the correlation coefficient between impact dimension and ODI was r = 0.75 [P < 0.001], between impact dimension and Roland-Morris Disability Questionnaire was r = 0.80 [P < 0.001], and between psychological dimension and BDI was r = 0.62 [P < 0.001]). The test-retest reliability was also strong (intraclass correlation coefficient values ranged between 0.70 and 0.95) and the internal consistency was good/excellent (Cronbach's alpha coefficients for the two main dimensions, impact and psychological, were 0.91 and 0.82 [P < 0.001], respectively). In addition, its face validity and content validity were acceptable.
The Farsi version of the minimal dataset for research on CLBP is a reliable and valid instrument for data gathering in patients with CLBP. This minimum dataset can be a step toward standardization of research regarding CLBP. Level of Evidence: 3.
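Cronbach's alpha, used above as the internal-consistency measure, is straightforward to compute from item-level scores; the questionnaire data below are invented.

```python
def cronbach_alpha(items):
    """Cronbach's alpha for `items`: one list of scores per item, all over the
    same respondents. alpha = k/(k-1) * (1 - sum(var_item) / var_total)."""
    k = len(items)
    n = len(items[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[j] for item in items) for j in range(n)]
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# three questionnaire items scored by five respondents
items = [
    [4, 3, 3, 5, 2],
    [4, 2, 3, 5, 1],
    [5, 3, 4, 5, 2],
]
print(round(cronbach_alpha(items), 2))  # → 0.97
```

Values above roughly 0.8 are conventionally read as good internal consistency, which is the benchmark the 0.91 and 0.82 figures above are being compared against.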
NASA Technical Reports Server (NTRS)
Chelton, Dudley B.; Schlax, Michael G.
1994-01-01
A formalism is presented for determining the wavenumber-frequency transfer function associated with an irregularly sampled multidimensional dataset. This transfer function reveals the filtering characteristics and aliasing patterns inherent in the sample design. In combination with information about the spectral characteristics of the signal, the transfer function can be used to quantify the spatial and temporal resolution capability of the dataset. Application of the method to idealized Geosat altimeter data (i.e., neglecting measurement errors and data dropouts) concludes that the Geosat orbit configuration is capable of resolving scales of about 3 deg in latitude and longitude by about 30 days.
Vaccarino, Anthony L; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M; Stuss, Donald T; Theriault, Elizabeth; Evans, Kenneth R
2018-01-01
Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute's "Brain-CODE" is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care.
NASA Astrophysics Data System (ADS)
Haslauer, Claus; Bohling, Geoff
2013-04-01
Hydraulic conductivity (K) is a fundamental parameter that influences groundwater flow and solute transport. Measurements of K are limited and uncertain. Moreover, the spatial structure of K, which impacts the groundwater velocity field and hence directly influences the advective spreading of a solute migrating in the subsurface, is commonly described by approaches using second-order moments. Spatial copulas have recently been applied successfully to model the spatial dependence structure of heterogeneous subsurface datasets. At the MADE site, hydraulic conductivity (K) has been measured in exceptional detail. Two independently collected datasets were used for this study: (1) ~2000 flowmeter-based K measurements, and (2) ~20,000 direct-push-based K measurements. These datasets exhibit a very heterogeneous (Var[ln(K)]>2) spatially distributed K field. A copula analysis reveals that the spatial dependence structures of the flowmeter and direct-push datasets are essentially the same. A spatial copula analysis factors out the influence of the marginal distribution of the property under investigation. This independence from the marginal distributions allows the copula analysis to reveal the underlying similarity between the spatial dependence structures of the flowmeter and direct-push datasets despite two complicating factors: (1) an overall offset between the datasets, with direct-push K values being, on average, roughly a factor of five lower than flowmeter K values, due at least in part to opposite biases between the two measurement techniques, and (2) the presence of some anomalously high K values in the direct-push dataset due to a lower limit on accurately measurable pressure responses in high-K zones. In addition, the vertical resolution of the direct-push dataset is ten times finer than that of the flowmeter dataset. Upscaling the direct-push data to compensate for this difference resulted in little change to the spatial structure.
The objective of the presented work is to use multidimensional spatial copulas to describe and model the spatial dependence of the spatial structure of K at the heterogeneous MADE site, and evaluate the effects of this multidimensional description on solute transport.
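The key property exploited above, that a copula analysis factors out the marginal distribution, can be shown with a one-dimensional rank transform (helper name `to_uniform_scores` and the toy data are ours, not the study's code): a systematic factor-of-five offset between two measurement methods leaves the copula margins identical.

```python
import numpy as np

def to_uniform_scores(x):
    """Rank-transform a sample to (0, 1): the empirical copula margin."""
    ranks = np.argsort(np.argsort(x))
    return (ranks + 0.5) / len(x)

rng = np.random.default_rng(1)
k_flowmeter = rng.lognormal(mean=0.0, sigma=1.5, size=500)  # synthetic K values
k_directpush = 0.2 * k_flowmeter  # a systematic factor-of-five offset

u1 = to_uniform_scores(k_flowmeter)
u2 = to_uniform_scores(k_directpush)
# The monotone offset is factored out entirely: the margins coincide.
print(np.array_equal(u1, u2))  # True
```

In the spatial setting the same transform is applied to K at each location before fitting the dependence model, which is why the two MADE datasets can share a dependence structure despite their offset margins.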
The Ophidia framework: toward cloud-based data analytics for climate change
NASA Astrophysics Data System (ADS)
Fiore, Sandro; D'Anca, Alessandro; Elia, Donatello; Mancini, Marco; Mariello, Andrea; Mirto, Maria; Palazzo, Cosimo; Aloisio, Giovanni
2015-04-01
The Ophidia project is a research effort on big data analytics facing scientific data analysis challenges in the climate change domain. It provides parallel (server-side) data analysis, an internal storage model, and a hierarchical data organization to manage large amounts of multidimensional scientific data. The Ophidia analytics platform provides several MPI-based parallel operators to manipulate large datasets (data cubes) and array-based primitives to perform data analysis on large arrays of scientific data. The most relevant data analytics use cases implemented in national and international projects target fire danger prevention (OFIDIA), interactions between climate change and biodiversity (EUBrazilCC), climate indicators and remote data analysis (CLIP-C), sea situational awareness (TESSA), and large-scale data analytics on CMIP5 data in NetCDF format, compliant with the Climate and Forecast (CF) convention (ExArch). Two use cases, from the EU FP7 EUBrazil Cloud Connect and the INTERREG OFIDIA projects, will be presented during the talk. In the former (EUBrazilCC) the Ophidia framework is being extended to integrate scalable VM-based solutions for the management of large volumes of scientific data (both climate and satellite data) in a cloud-based environment to study how climate change affects biodiversity. In the latter (OFIDIA) the data analytics framework is being exploited to provide operational support for processing chains devoted to fire danger prevention. To tackle the project challenges, data analytics workflows consisting of about 130 operators perform, among other tasks, parallel data analysis, metadata management, virtual file system tasks, map generation, rolling of datasets, and import/export of datasets in NetCDF format. Finally, the entire Ophidia software stack has been deployed at CMCC on 24 nodes (16 cores/node) of the Athena HPC cluster.
Moreover, a cloud-based release tested with OpenNebula is also available and running in the private cloud infrastructure of the CMCC Supercomputing Centre.
Multidimensional Compressed Sensing MRI Using Tensor Decomposition-Based Sparsifying Transform
Yu, Yeyang; Jin, Jin; Liu, Feng; Crozier, Stuart
2014-01-01
Compressed Sensing (CS) has been applied in dynamic Magnetic Resonance Imaging (MRI) to accelerate the data acquisition without noticeably degrading the spatial-temporal resolution. A suitable sparsity basis is one of the key components to successful CS applications. Conventionally, a multidimensional dataset in dynamic MRI is treated as a series of two-dimensional matrices, and then various matrix/vector transforms are used to explore the image sparsity. Traditional methods typically sparsify the spatial and temporal information independently. In this work, we propose a novel concept of tensor sparsity for the application of CS in dynamic MRI, and present the Higher-order Singular Value Decomposition (HOSVD) as a practical example. Applications presented in the three- and four-dimensional MRI data demonstrate that HOSVD simultaneously exploited the correlations within spatial and temporal dimensions. Validations based on cardiac datasets indicate that the proposed method achieved comparable reconstruction accuracy with the low-rank matrix recovery methods and, outperformed the conventional sparse recovery methods. PMID:24901331
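The HOSVD itself is compact to sketch in NumPy: unfold the tensor along each mode, take the left singular vectors as that mode's factor matrix, and contract the tensor with the transposed factors to obtain the core. This is a generic illustration, not the paper's reconstruction pipeline; a CS application would additionally threshold the core coefficients to enforce sparsity.

```python
import numpy as np

def hosvd(tensor):
    """Higher-order SVD: per-mode factor matrices plus the core tensor."""
    factors = []
    for mode in range(tensor.ndim):
        # Mode-n unfolding: mode axis first, everything else flattened.
        unfolding = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        # Mode-n product with u.T projects onto the mode's singular vectors.
        core = np.moveaxis(np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

x = np.random.default_rng(2).standard_normal((4, 5, 6))
core, factors = hosvd(x)

# Reconstruct by multiplying the core back along each mode.
recon = core
for mode, u in enumerate(factors):
    recon = np.moveaxis(np.tensordot(u, np.moveaxis(recon, mode, 0), axes=1), 0, mode)
print(np.allclose(recon, x))  # True: with full orthogonal factors, HOSVD is exact
```

The sparsifying effect comes from the core: for correlated spatio-temporal data most of its energy concentrates in a small corner, so discarding small core coefficients gives a compact joint representation of all dimensions at once.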
Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset.
Seashore-Ludlow, Brinton; Rees, Matthew G; Cheah, Jaime H; Cokol, Murat; Price, Edmund V; Coletti, Matthew E; Jones, Victor; Bodycombe, Nicole E; Soule, Christian K; Gould, Joshua; Alexander, Benjamin; Li, Ava; Montgomery, Philip; Wawer, Mathias J; Kuru, Nurdan; Kotz, Joanne D; Hon, C Suk-Yee; Munoz, Benito; Liefeld, Ted; Dančík, Vlado; Bittker, Joshua A; Palmer, Michelle; Bradner, James E; Shamji, Alykhan F; Clemons, Paul A; Schreiber, Stuart L
2015-11-01
Identifying genetic alterations that prime a cancer cell to respond to a particular therapeutic agent can facilitate the development of precision cancer medicines. Cancer cell-line (CCL) profiling of small-molecule sensitivity has emerged as an unbiased method to assess the relationships between genetic or cellular features of CCLs and small-molecule response. Here, we developed annotated cluster multidimensional enrichment analysis to explore the associations between groups of small molecules and groups of CCLs in a new, quantitative sensitivity dataset. This analysis reveals insights into small-molecule mechanisms of action, and genomic features that associate with CCL response to small-molecule treatment. We are able to recapitulate known relationships between FDA-approved therapies and cancer dependencies and to uncover new relationships, including for KRAS-mutant cancers and neuroblastoma. To enable the cancer community to explore these data, and to generate novel hypotheses, we created an updated version of the Cancer Therapeutic Response Portal (CTRP v2). We present the largest CCL sensitivity dataset yet available, and an analysis method integrating information from multiple CCLs and multiple small molecules to identify CCL response predictors robustly. We updated the CTRP to enable the cancer research community to leverage these data and analyses. ©2015 American Association for Cancer Research.
Bioimage informatics for experimental biology
Swedlow, Jason R.; Goldberg, Ilya G.; Eliceiri, Kevin W.
2012-01-01
Over the last twenty years there have been great advances in light microscopy, with the result that multi-dimensional imaging has driven a revolution in modern biology. New approaches to data acquisition are reported frequently, and yet the significant data management and analysis challenges presented by these new complex datasets remain largely unsolved. As in the well-developed field of genome bioinformatics, central repositories are and will be key resources, but there is a critical need for informatics tools in individual laboratories to help manage, share, visualize, and analyze image data. In this article we present the recent efforts by the bioimage informatics community to tackle these challenges and discuss our own vision for the future development of bioimage informatics solutions. PMID:19416072
DOE Office of Scientific and Technical Information (OSTI.GOV)
Suiter, Christopher L.; Paramasivam, Sivakumar; Hou, Guangjin; Sun, Shangjin; Rice, David; Hoch, Jeffrey C.; Rovnyak, David
2014-01-01
Recently, we have demonstrated that considerable inherent sensitivity gains are attained in MAS NMR spectra acquired by nonuniform sampling (NUS) and introduced maximum entropy interpolation (MINT) processing that assures the linearity of transformation between the time and frequency domains. In this report, we examine the utility of the NUS/MINT approach in multidimensional datasets possessing high dynamic range, such as homonuclear 13C–13C correlation spectra. We demonstrate on model compounds and on 1–73-(U-13C, 15N)/74–108-(U-15N) E. coli thioredoxin reassembly, that with appropriately constructed 50 % NUS schedules inherent sensitivity gains of 1.7–2.1-fold are readily reached in such datasets. We show that both linearity and line width are retained under these experimental conditions throughout the entire dynamic range of the signals. Furthermore, we demonstrate that the reproducibility of the peak intensities is excellent in the NUS/MINT approach when experiments are repeated multiple times and identical experimental and processing conditions are employed. Finally, we discuss the principles for design and implementation of random exponentially biased NUS sampling schedules for homonuclear 13C–13C MAS correlation experiments that yield high-quality artifact-free datasets. PMID:24752819
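One simple way to realize a random exponentially biased NUS schedule is to weight early evolution-time increments (where the NMR signal is strongest) more heavily when drawing the retained points. This is a sketch under our own parameterization (`bias`, `fraction` are assumed names), not the authors' published schedule generator.

```python
import numpy as np

def nus_schedule(n_points, fraction=0.5, bias=2.0, seed=0):
    """Random exponentially biased NUS schedule: sample early increments
    densely and late (decayed-signal) increments sparsely."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    weights = np.exp(-bias * t / n_points)  # exponential sampling density
    weights /= weights.sum()
    n_keep = int(round(fraction * n_points))
    return np.sort(rng.choice(t, size=n_keep, replace=False, p=weights))

sched = nus_schedule(128, fraction=0.5)
print(len(sched))  # 64: a 50% schedule over 128 increments
```

With `bias=2.0` roughly three quarters of the retained increments fall in the first half of the evolution period, matching the intuition that sensitivity gains come from concentrating measurements where signal remains.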
Arneson, Douglas; Bhattacharya, Anindya; Shu, Le; Mäkinen, Ville-Petteri; Yang, Xia
2016-09-09
Human diseases are commonly the result of multidimensional changes at molecular, cellular, and systemic levels. Recent advances in genomic technologies have enabled an outpour of omics datasets that capture these changes. However, separate analyses of these various data only provide fragmented understanding and do not capture the holistic view of disease mechanisms. To meet the urgent need for tools that effectively integrate multiple types of omics data to derive biological insights, we have developed Mergeomics, a computational pipeline that integrates multidimensional disease association data with functional genomics and molecular networks to retrieve biological pathways, gene networks, and central regulators critical for disease development. To make the Mergeomics pipeline available to a wider research community, we have implemented an online, user-friendly web server (http://mergeomics.idre.ucla.edu/). The web server features a modular implementation of the Mergeomics pipeline with detailed tutorials. Additionally, it provides curated genomic resources including tissue-specific expression quantitative trait loci, ENCODE functional annotations, biological pathways, and molecular networks, and offers interactive visualization of analytical results. Multiple computational tools including Marker Dependency Filtering (MDF), Marker Set Enrichment Analysis (MSEA), Meta-MSEA, and Weighted Key Driver Analysis (wKDA) can be used separately or in flexible combinations. User-defined summary-level genomic association datasets (e.g., genetic, transcriptomic, epigenomic) related to a particular disease or phenotype can be uploaded and computed in real time to yield biologically interpretable results, which can be viewed online and downloaded for later use.
Our Mergeomics web server offers researchers flexible and user-friendly tools to facilitate integration of multidimensional data into holistic views of disease mechanisms in the form of tissue-specific key regulators, biological pathways, and gene networks.
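Enrichment pipelines of this kind ultimately rest on overlap statistics between a marker set and a pathway. As a stdlib-only illustration (not necessarily the exact MSEA statistic, which involves marker-level corrections and permutation), here is the hypergeometric tail test for pathway overlap:

```python
from math import comb

def hypergeom_enrichment_p(n_genome, n_pathway, n_hits, n_overlap):
    """P(overlap >= observed) when n_hits genes are drawn at random from a
    genome of n_genome genes, n_pathway of which belong to the pathway."""
    total = comb(n_genome, n_hits)
    p = 0.0
    for k in range(n_overlap, min(n_pathway, n_hits) + 1):
        p += comb(n_pathway, k) * comb(n_genome - n_pathway, n_hits - k) / total
    return p

# Hypothetical numbers: 10 of 20 disease-associated genes fall in a
# 50-gene pathway, out of a 1000-gene background.
p = hypergeom_enrichment_p(1000, 50, 20, 10)
print(p < 1e-6)  # True: far more overlap than the ~1 gene expected by chance
```

The expected overlap by chance here is 20 * 50 / 1000 = 1 gene, so observing 10 is decisive; real pipelines then correct such p-values across thousands of tested pathways.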
Spatial contexts for temporal variability in alpine vegetation under ongoing climate change
Fagre, Daniel B.; Malanson, George P.
2013-01-01
A framework to monitor mountain summit vegetation (The Global Observation Research Initiative in Alpine Environments, GLORIA) was initiated in 1997. GLORIA results should be taken within a regional context of the spatial variability of alpine tundra. Changes observed at GLORIA sites in Glacier National Park, Montana, USA are quantified within the context of the range of variability observed in alpine tundra across much of western North America. Dissimilarity is calculated and used in nonmetric multidimensional scaling for repeated measures of vascular species cover at 14 GLORIA sites with 525 nearby sites and with 436 sites in western North America. The lengths of the trajectories of the GLORIA sites in ordination space are compared to the dimensions of the space created by the larger datasets. The absolute amount of change on the GLORIA summits over 5 years is high, but the degree of change is small relative to the geographical context. The GLORIA sites are on the margin of the ordination volumes with the large datasets. The GLORIA summit vegetation appears to be specialized, arguing for the intrinsic value of early observed change in limited niche space.
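The dissimilarities that feed a nonmetric multidimensional scaling ordination of species-cover data are commonly Bray-Curtis; the abstract does not name the index, so the following NumPy sketch is an assumption for illustration:

```python
import numpy as np

def bray_curtis(cover):
    """Pairwise Bray-Curtis dissimilarity for a (sites x species) cover matrix."""
    n = cover.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            denom = cover[i].sum() + cover[j].sum()
            d[i, j] = np.abs(cover[i] - cover[j]).sum() / denom if denom else 0.0
    return d

# Three hypothetical sites, columns are species cover values.
cover = np.array([[5.0, 0.0, 3.0],
                  [5.0, 0.0, 3.0],
                  [0.0, 8.0, 0.0]])
d = bray_curtis(cover)
print(d[0, 1])  # 0.0: identical composition
print(d[0, 2])  # 1.0: no shared species
```

An NMDS then embeds sites so that the rank order of these dissimilarities is preserved, which is what allows trajectory lengths of repeated measures to be compared against the spread of the regional datasets.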
ERIC Educational Resources Information Center
Li, Ying; Jiao, Hong; Lissitz, Robert W.
2012-01-01
This study investigated the application of multidimensional item response theory (IRT) models to validate test structure and dimensionality. Multiple content areas or domains within a single subject often exist in large-scale achievement tests. Such areas or domains may cause multidimensionality or local item dependence, which both violate the…
A multidimensional representation model of geographic features
Usery, E. Lynn; Timson, George; Coletti, Mark
2016-01-28
A multidimensional model of geographic features has been developed and implemented with data from The National Map of the U.S. Geological Survey. The model, programmed in C++ and implemented as a feature library, was tested with data from the National Hydrography Dataset demonstrating the capability to handle changes in feature attributes, such as increases in chlorine concentration in a stream, and feature geometry, such as the changing shoreline of barrier islands over time. Data can be entered directly, from a comma separated file, or features with attributes and relationships can be automatically populated in the model from data in the Spatial Data Transfer Standard format.
NASA Astrophysics Data System (ADS)
de Santis, A.; de Franceschi, G.; Perrone, L.
1997-06-01
The Istituto Nazionale di Geofisica under the P.N.R.A. (National Program of Research in Antarctica) has the responsibility of acquiring geophysical observations at the Italian Antarctic Base of Terra Nova Bay. Among others, geomagnetic and riometric data can provide some new insights into local and global activity of the magnetosphere-ionosphere coupling. This article investigates some properties of these kinds of data by means of spectral and fractal analyses. In addition, a multidimensional index is derived from this single-point dataset to represent not only the local but also the global state of the magnetospheric activity.
Chen Peng; Ao Li
2017-01-01
The emergence of multi-dimensional data offers opportunities for more comprehensive analysis of the molecular characteristics of human diseases, thereby improving diagnosis, treatment, and prevention. In this study, we proposed a heterogeneous network based method integrating multi-dimensional data (HNMD) to identify GBM-related genes. The novelty of the method lies in that the multi-dimensional GBM data from the TCGA dataset, which provide comprehensive information about genes, are combined with protein-protein interactions to construct a weighted heterogeneous network that reflects both the general and disease-specific relationships between genes. In addition, a propagation algorithm with resistance is introduced to precisely score and rank GBM-related genes. A comprehensive performance evaluation shows that the proposed method significantly outperforms network based methods with single-dimensional data and other existing approaches. Subsequent analysis of the top-ranked genes suggests they may be functionally implicated in GBM, which further corroborates the superiority of the proposed method. The source code and the results of HNMD can be downloaded from the following URL: http://bioinformatics.ustc.edu.cn/hnmd/.
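The abstract does not spell out the "propagation algorithm with resistance", but a close and widely used relative, random walk with restart, conveys how scores spread from seed genes over a weighted network. A minimal sketch (toy graph and parameter values are ours):

```python
import numpy as np

def propagate(adj, seeds, restart=0.3, tol=1e-9):
    """Network propagation by random walk with restart: at each step a walker
    follows edges with probability 1-restart or jumps back to the seeds."""
    w = adj / adj.sum(axis=0, keepdims=True)  # column-normalized transition matrix
    p0 = seeds / seeds.sum()
    p = p0.copy()
    while True:
        p_next = (1 - restart) * (w @ p) + restart * p0
        if np.abs(p_next - p).max() < tol:
            return p_next
        p = p_next

# Toy 4-node path graph 0-1-2-3, seeded at node 0.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
scores = propagate(adj, seeds=np.array([1.0, 0.0, 0.0, 0.0]))
print(scores.argmax())  # 0: the seed keeps the highest score, decaying with distance
```

In a disease-gene setting the seeds are known associated genes, the edge weights encode multi-omics evidence, and the stationary scores rank candidate genes by network proximity to the seeds.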
NASA Astrophysics Data System (ADS)
Dungan, J. L.; Wang, W.; Hashimoto, H.; Michaelis, A.; Milesi, C.; Ichii, K.; Nemani, R. R.
2009-12-01
In support of NACP, we are conducting an ensemble modeling exercise using the Terrestrial Observation and Prediction System (TOPS) to evaluate uncertainties among ecosystem models, satellite datasets, and in-situ measurements. The models used in the experiment include public-domain versions of Biome-BGC, LPJ, TOPS-BGC, and CASA, driven by a consistent set of climate fields for North America at 8km resolution and daily/monthly time steps over the period of 1982-2006. The reference datasets include MODIS Gross Primary Production (GPP) and Net Primary Production (NPP) products, Fluxnet measurements, and other observational data. The simulation results and the reference datasets are consistently processed and systematically compared in the climate (temperature-precipitation) space; in particular, an alternative to the Taylor diagram is developed to facilitate model-data intercomparisons in multi-dimensional space. The key findings of this study indicate that: the simulated GPP/NPP fluxes are in general agreement with observations over forests, but are biased low (underestimated) over non-forest types; large uncertainties of biomass and soil carbon stocks are found among the models (and reference datasets), often induced by seemingly “small” differences in model parameters and implementation details; the simulated Net Ecosystem Production (NEP) mainly responds to non-respiratory disturbances (e.g. fire) in the models and therefore is difficult to compare with flux data; and the seasonality and interannual variability of NEP varies significantly among models and reference datasets. These findings highlight the problem inherent in relying on only one modeling approach to map surface carbon fluxes and emphasize the pressing necessity of expanded and enhanced monitoring systems to narrow critical structural and parametrical uncertainties among ecosystem models.
Multidimensional quantum entanglement with large-scale integrated optics.
Wang, Jianwei; Paesani, Stefano; Ding, Yunhong; Santagati, Raffaele; Skrzypczyk, Paul; Salavrakos, Alexia; Tura, Jordi; Augusiak, Remigiusz; Mančinska, Laura; Bacco, Davide; Bonneau, Damien; Silverstone, Joshua W; Gong, Qihuang; Acín, Antonio; Rottwitt, Karsten; Oxenløwe, Leif K; O'Brien, Jeremy L; Laing, Anthony; Thompson, Mark G
2018-04-20
The ability to control multidimensional quantum systems is central to the development of advanced quantum technologies. We demonstrate a multidimensional integrated quantum photonic platform able to generate, control, and analyze high-dimensional entanglement. A programmable bipartite entangled system is realized with dimensions up to 15 × 15 on a large-scale silicon photonics quantum circuit. The device integrates more than 550 photonic components on a single chip, including 16 identical photon-pair sources. We verify the high precision, generality, and controllability of our multidimensional technology, and further exploit these abilities to demonstrate previously unexplored quantum applications, such as quantum randomness expansion and self-testing on multidimensional states. Our work provides an experimental platform for the development of multidimensional quantum technologies. Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
Using Graph Indices for the Analysis and Comparison of Chemical Datasets.
Fourches, Denis; Tropsha, Alexander
2013-10-01
In cheminformatics, compounds are represented as points in multidimensional space of chemical descriptors. When all pairs of points found within certain distance threshold in the original high dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as Dataset Graph (DG). We show that, similarly to the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or Randic connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and utilized for different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
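The Dataset Graph construction is easy to sketch: threshold pairwise distances in descriptor space to obtain edges, then evaluate a graph index such as the Randic connectivity index (helper names are ours; ADDAGRA itself is not reproduced here).

```python
import numpy as np
from itertools import combinations

def dataset_graph_edges(points, threshold):
    """Dataset Graph: connect descriptor-space points within the distance threshold."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        if np.linalg.norm(points[i] - points[j]) <= threshold:
            edges.append((i, j))
    return edges

def randic_index(n_vertices, edges):
    """Randic connectivity index: sum of 1/sqrt(deg(u)*deg(v)) over edges."""
    deg = np.zeros(n_vertices)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(1.0 / np.sqrt(deg[u] * deg[v]) for u, v in edges)

# Four toy compounds in a 2-D descriptor space; one is an outlier.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
edges = dataset_graph_edges(pts, threshold=1.5)
print(len(edges))                              # 3: a triangle among the first three
print(round(randic_index(len(pts), edges), 3)) # 1.5
```

Comparing such indices across training, test, and external sets gives a single-number check on whether the sets occupy similarly dense regions of chemistry space.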
Clearing your Desk! Software and Data Services for Collaborative Web Based GIS Analysis
NASA Astrophysics Data System (ADS)
Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Gichamo, T.; Yildirim, A. A.; Liu, Y.
2015-12-01
Can your desktop computer crunch the large GIS datasets that are becoming increasingly common across the geosciences? Do you have access to or the know-how to take advantage of advanced high performance computing (HPC) capability? Web based cyberinfrastructure takes work off your desk or laptop computer and onto infrastructure or "cloud" based data and processing servers. This talk will describe the HydroShare collaborative environment and web based services being developed to support the sharing and processing of hydrologic data and models. HydroShare supports the upload, storage, and sharing of a broad class of hydrologic data including time series, geographic features and raster datasets, multidimensional space-time data, and other structured collections of data. Web service tools and a Python client library provide researchers with access to HPC resources without requiring them to become HPC experts. This reduces the time and effort spent in finding and organizing the data required to prepare the inputs for hydrologic models and facilitates the management of online data and execution of models on HPC systems. This presentation will illustrate the use of web based data and computation services from both the browser and desktop client software. These web-based services implement the Terrain Analysis Using Digital Elevation Model (TauDEM) tools for watershed delineation, generation of hydrology-based terrain information, and preparation of hydrologic model inputs. They allow users to develop scripts on their desktop computer that call analytical functions that are executed completely in the cloud, on HPC resources using input datasets stored in the cloud, without installing specialized software, learning how to use HPC, or transferring large datasets back to the user's desktop. These cases serve as examples for how this approach can be extended to other models to enhance the use of web and data services in the geosciences.
magHD: a new approach to multi-dimensional data storage, analysis, display and exploitation
NASA Astrophysics Data System (ADS)
Angleraud, Christophe
2014-06-01
The ever-increasing amount of data and processing capability - following the well-known Moore's law - is challenging the way scientists and engineers currently exploit large datasets. Scientific visualization tools, although quite powerful, are often too generic and provide abstract views of phenomena, thus preventing cross-discipline fertilization. On the other hand, Geographic Information Systems allow nice and visually appealing maps to be built, but they often become confusing as more layers are added. Moreover, the introduction of time as a fourth analysis dimension, allowing analysis of time-dependent phenomena such as meteorological or climate models, is encouraging real-time data exploration techniques in which spatio-temporal points of interest are detected through the human brain's integration of moving images. Magellium has been involved in high-performance image processing chains for satellite image processing as well as scientific signal analysis and geographic information management since its creation (2003). We believe that recent work on big data, GPU, and peer-to-peer collaborative processing can open a new breakthrough in data analysis and display that will serve many new applications in collaborative scientific computing, environment mapping, and understanding. The magHD (for Magellium Hyper-Dimension) project aims at developing software solutions that bring highly interactive tools for complex dataset analysis and exploration to commodity hardware, targeting small to medium-scale clusters with expansion capabilities to large cloud-based clusters.
Fluid Lensing based Machine Learning for Augmenting Earth Science Coral Datasets
NASA Astrophysics Data System (ADS)
Li, A.; Instrella, R.; Chirayath, V.
2016-12-01
Recently, there has been increased interest in monitoring the effects of climate change upon the world's marine ecosystems, particularly coral reefs. These delicate ecosystems are especially threatened due to their sensitivity to ocean warming and acidification, which has led to unprecedented levels of coral bleaching and die-off in recent years. However, current global aquatic remote sensing datasets are unable to quantify changes in marine ecosystems at spatial and temporal scales relevant to their growth. In this project, we employ various supervised and unsupervised machine learning algorithms to augment existing datasets from NASA's Earth Observing System (EOS), using high-resolution airborne imagery. This method utilizes NASA's ongoing airborne campaigns as well as its spaceborne assets to collect remote sensing data over these afflicted regions, and employs Fluid Lensing algorithms to resolve optical distortions caused by the fluid surface, producing cm-scale-resolution imagery of these diverse ecosystems from airborne platforms. Support Vector Machines (SVMs) and K-means clustering methods were applied to satellite imagery at 0.5 m resolution, producing segmented maps classifying coral based on percent cover and morphology. Compared to a previous study using multidimensional maximum a posteriori (MAP) estimation to separate these features in high-resolution airborne datasets, SVMs are able to achieve above 75% accuracy when augmented with existing MAP estimates, while unsupervised methods such as K-means achieve roughly 68% accuracy, verified against manually segmented reference data provided by a marine biologist. This effort thus has broad applications for coastal remote sensing, helping marine biologists quantify behavioral trends spanning large areas and longer timescales and assess the health of coral reefs worldwide.
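For readers unfamiliar with the unsupervised step, a minimal K-means pass (Lloyd's algorithm) over per-pixel spectral features looks like the following sketch. The data here are synthetic stand-ins for bright "coral" versus darker "sand" pixels, not the study's imagery, and production work would use a tuned library implementation:

```python
import numpy as np

def kmeans(pixels, k, iters=50):
    """Plain Lloyd's algorithm over per-pixel feature vectors."""
    # Naive deterministic init: evenly spaced samples from the data.
    centers = pixels[np.linspace(0, len(pixels) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each pixel to its nearest centre.
        labels = np.argmin(((pixels[:, None] - centers) ** 2).sum(axis=2), axis=1)
        # Move each centre to the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
coral = rng.normal(0.8, 0.05, size=(100, 3))   # bright 3-band "coral" pixels
sand = rng.normal(0.2, 0.05, size=(100, 3))    # darker "sand" pixels
labels = kmeans(np.vstack([coral, sand]), k=2)
```

Each pixel row is then mapped back to its image location to produce a segmented class map like those described above.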
Riffle, Michael; Merrihew, Gennifer E; Jaschob, Daniel; Sharma, Vagisha; Davis, Trisha N; Noble, William S; MacCoss, Michael J
2015-11-01
Regulation of protein abundance is a critical aspect of cellular function, organism development, and aging. Alternative splicing may give rise to multiple possible proteoforms of gene products where the abundance of each proteoform is independently regulated. Understanding how the abundances of these distinct gene products change is essential to understanding the underlying mechanisms of many biological processes. Bottom-up proteomics mass spectrometry techniques may be used to estimate protein abundance indirectly by sequencing and quantifying peptides that are later mapped to proteins based on sequence. However, quantifying the abundance of distinct gene products is routinely confounded by peptides that map to multiple possible proteoforms. In this work, we describe a technique that may be used to help mitigate the effects of confounding ambiguous peptides and multiple proteoforms when quantifying proteins. We have applied this technique to visualize the distribution of distinct gene products for the whole proteome across 11 developmental stages of the model organism Caenorhabditis elegans. The result is a large multidimensional dataset for which web-based tools were developed for visualizing how translated gene products change during development and identifying possible proteoforms. The underlying instrument raw files and tandem mass spectra may also be downloaded. The data resource is freely available on the web at http://www.yeastrc.org/wormpes/.
Calculating p-values and their significances with the Energy Test for large datasets
NASA Astrophysics Data System (ADS)
Barter, W.; Burr, C.; Parkes, C.
2018-04-01
The energy test method is a multi-dimensional test of whether two samples are consistent with arising from the same underlying population, through the calculation of a single test statistic (called the T-value). The method has recently been used in particle physics to search for samples that differ due to CP violation. The generalised extreme value function has previously been used to describe the distribution of T-values under the null hypothesis that the two samples are drawn from the same underlying population. We show that, in a simple test case, the distribution is not sufficiently well described by the generalised extreme value function. We present a new method, where the distribution of T-values under the null hypothesis when comparing two large samples can be found by scaling the distribution found when comparing small samples drawn from the same population. This method can then be used to quickly calculate the p-values associated with the results of the test.
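The T-value itself is straightforward to compute. A minimal sketch with a Gaussian distance weight (one common choice in the particle-physics literature; the normalization here is illustrative) and a single permutation draw from the null hypothesis might look like:

```python
import numpy as np

def energy_T(sample1, sample2, sigma=1.0):
    """Energy-test statistic with Gaussian weight psi(d) = exp(-d^2 / (2 sigma^2))."""
    def psi(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n1, n2 = len(sample1), len(sample2)
    t11 = np.triu(psi(sample1, sample1), k=1).sum() / n1 ** 2   # within sample 1
    t22 = np.triu(psi(sample2, sample2), k=1).sum() / n2 ** 2   # within sample 2
    t12 = psi(sample1, sample2).sum() / (n1 * n2)               # between samples
    return t11 + t22 - t12

def permutation_T(pooled, n1, rng):
    """One draw from the null distribution: reshuffle the pooled events and recompute T."""
    perm = rng.permutation(len(pooled))
    return energy_T(pooled[perm[:n1]], pooled[perm[n1:]])

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 2))
b = rng.standard_normal((50, 2)) + 5.0        # clearly shifted second sample
t_obs = energy_T(a, b)
t_perm = permutation_T(np.vstack([a, b]), len(a), rng)
```

Repeating the permutation draw many times yields the null distribution of T; the scaling result in the paper lets the expensive permutation loop be done at small sample size and rescaled to large samples.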
A peek into the future of radiology using big data applications.
Kharat, Amit T; Singhal, Shubham
2017-01-01
Big data refers to the extremely large amounts of data available in the radiology department. Big data is characterized by four Vs: Volume, Velocity, Variety, and Veracity. By applying algorithmic tools that convert raw data to transformed data in such large datasets, there is a possibility of understanding and using radiology data to gain new knowledge and insights. Big data analytics consists of 6Cs: Connection, Cloud, Cyber, Content, Community, and Customization. The global technological prowess and per-capita capacity to save digital information has roughly doubled every 40 months since the 1980s. By using big data, the planning and implementation of radiological procedures in radiology departments can be given a great boost. Potential future applications of big data include scheduling of scans, creating patient-specific personalized scanning protocols, radiologist decision support, emergency reporting, and virtual quality assurance for the radiologist. Targeted use of big data applications can be made for images by supporting the analytic process. Screening software tools designed on big data can be used to highlight a region of interest, such as subtle changes in parenchymal density, a solitary pulmonary nodule, or focal hepatic lesions, by plotting its multidimensional anatomy. Following this, we can run more complex applications such as three-dimensional multiplanar reconstruction (MPR), volumetric rendering (VR), and curved planar reconstruction, which consume higher system resources, on targeted data subsets rather than querying the complete cross-sectional imaging dataset. This pre-emptive selection of datasets can substantially reduce system requirements such as memory and server load and provide prompt results. However, a word of caution: big data should not become "dump data" due to inadequate and poor analysis and non-structured, improperly stored data.
In the near future, big data can ring in the era of personalized and individualized healthcare.
The Ophidia Stack: Toward Large Scale, Big Data Analytics Experiments for Climate Change
NASA Astrophysics Data System (ADS)
Fiore, S.; Williams, D. N.; D'Anca, A.; Nassisi, P.; Aloisio, G.
2015-12-01
The Ophidia project is a research effort on big data analytics addressing scientific data analysis challenges in multiple domains (e.g. climate change). It provides a "datacube-oriented" framework responsible for atomically processing and manipulating scientific datasets, by providing a common way to run distributive tasks on large sets of data fragments (chunks). Ophidia provides declarative, server-side, and parallel data analysis, jointly with an internal storage model able to efficiently deal with multidimensional data and a hierarchical data organization to manage large data volumes. The project relies on a strong background in high-performance database management and On-Line Analytical Processing (OLAP) systems to manage large scientific datasets. The Ophidia analytics platform provides several data operators to manipulate datacubes (about 50), and array-based primitives (more than 100) to perform data analysis on large scientific data arrays. To address interoperability, Ophidia provides multiple server interfaces (e.g. OGC-WPS). From a client standpoint, a Python interface enables the exploitation of the framework in Python-based ecosystems/applications (e.g. IPython) and the straightforward adoption of a strong set of related libraries (e.g. SciPy, NumPy). The talk will highlight a key feature of the Ophidia framework stack: the Analytics Workflow Management System (AWfMS). The Ophidia AWfMS coordinates, orchestrates, optimises and monitors the execution of multiple scientific data analytics and visualization tasks, thus supporting "complex analytics experiments". Some real use cases related to the CMIP5 experiment will be discussed. In particular, with regard to the "Climate models intercomparison data analysis" case study proposed in the EU H2020 INDIGO-DataCloud project, workflows related to (i) anomalies, (ii) trend, and (iii) climate change signal analysis will be presented.
Such workflows will be distributed across multiple sites - according to the datasets distribution - and will include intercomparison, ensemble, and outlier analysis. The two-level workflow solution envisioned in INDIGO (coarse grain for distributed tasks orchestration, and fine grain, at the level of a single data analytics cluster instance) will be presented and discussed.
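Since Ophidia exposes an OGC-WPS server interface, a client can in principle submit work with a standard WPS Execute request over HTTP. The sketch below only constructs the request URL following the WPS 1.0.0 key-value-pair encoding; the endpoint, operator identifier, and inputs are hypothetical placeholders, as real deployments publish their own process identifiers:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and operator; real Ophidia deployments define their own.
endpoint = "https://ophidia.example.org/wps"
params = {
    "service": "WPS",
    "version": "1.0.0",
    "request": "Execute",
    "identifier": "oph_reduce",                 # hypothetical datacube operator
    "datainputs": "cube=work1;operation=avg",   # hypothetical key=value inputs
}
execute_url = endpoint + "?" + urlencode(params)
```

The server would respond with an XML ExecuteResponse document describing the status and outputs of the submitted task.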
NASA Astrophysics Data System (ADS)
Wickersham, Andrew Joseph
There are two critical research needs for the study of hydrocarbon combustion in high speed flows: 1) combustion diagnostics with adequate temporal and spatial resolution, and 2) mathematical techniques that can extract key information from large datasets. The goal of this work is to address these needs, respectively, by the use of high speed and multi-perspective chemiluminescence and advanced mathematical algorithms. To obtain the measurements, this work explored the application of high speed chemiluminescence diagnostics and the use of fiber-based endoscopes (FBEs) for non-intrusive and multi-perspective chemiluminescence imaging up to 20 kHz. Non-intrusive and full-field imaging measurements provide a wealth of information for model validation and design optimization of propulsion systems. However, it is challenging to obtain such measurements due to various implementation difficulties such as optical access, thermal management, and equipment cost. This work therefore explores the application of FBEs for non-intrusive imaging to supersonic propulsion systems. The FBEs used in this work are demonstrated to overcome many of the aforementioned difficulties and provided datasets from multiple angular positions up to 20 kHz in a supersonic combustor. The combustor operated on ethylene fuel at Mach 2 with an inlet stagnation temperature and pressure of approximately 640 degrees Fahrenheit and 70 psia, respectively. The imaging measurements were obtained from eight perspectives simultaneously, providing full-field datasets under such flow conditions for the first time, allowing the possibility of inferring multi-dimensional measurements. Due to the high speed and multi-perspective nature, such new diagnostic capability generates a large volume of data and calls for analysis algorithms that can process the data and extract key physics effectively. 
To extract the key combustion dynamics from the measurements, three mathematical methods were investigated in this work: Fourier analysis, proper orthogonal decomposition (POD), and wavelet analysis (WA). These algorithms were first demonstrated and tested on imaging measurements obtained from one perspective in a subsonic combustor (up to Mach 0.2). The results show that these algorithms are effective in extracting the key physics from large datasets, including the characteristic frequencies of flow-flame interactions, especially during transient processes such as lean blow-off and ignition. After these relatively simple tests and demonstrations, the algorithms were applied to process the measurements obtained from multiple perspectives in the supersonic combustor. Compared to past analyses (which have been limited to data obtained from one perspective only), the availability of data at multiple perspectives provides further insights into the flame and flow structures in high speed flows. In summary, this work shows that high speed chemiluminescence is a simple yet powerful combustion diagnostic. Especially when combined with FBEs and the analysis algorithms described in this work, such diagnostics provide full-field imaging at high repetition rates in challenging flows. Based on such measurements, a wealth of information can be obtained from proper analysis algorithms, including characteristic frequencies, dominant flame modes, and even multi-dimensional flame and flow structures.
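Of the three algorithms, POD is the most compact to illustrate: stacking flattened image frames as columns and taking an SVD yields spatial modes ranked by their share of the fluctuation energy. A minimal sketch on synthetic data (not the thesis datasets):

```python
import numpy as np

def pod_modes(snapshots):
    """Snapshot POD: columns of `snapshots` are flattened frames at successive times."""
    mean = snapshots.mean(axis=1, keepdims=True)
    fluct = snapshots - mean                        # remove the mean field
    U, s, Vt = np.linalg.svd(fluct, full_matrices=False)
    energy = s ** 2 / (s ** 2).sum()                # fraction of variance per mode
    return U, energy                                # spatial modes, mode energies

# Synthetic data: one dominant spatial mode oscillating in time, plus weak noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 64)                   # 64 "pixels"
t = np.linspace(0, 1, 40)                           # 40 frames
field = np.outer(np.sin(x), np.cos(2 * np.pi * 5 * t)) + 0.01 * rng.standard_normal((64, 40))
modes, energy = pod_modes(field)
```

For real chemiluminescence sequences, the leading modes' temporal coefficients (the rows of Vt scaled by s) carry the characteristic frequencies of flow-flame interaction.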
VisIVO: A Library and Integrated Tools for Large Astrophysical Dataset Exploration
NASA Astrophysics Data System (ADS)
Becciani, U.; Costa, A.; Ersotelos, N.; Krokos, M.; Massimino, P.; Petta, C.; Vitello, F.
2012-09-01
VisIVO provides an integrated suite of tools and services that can be used in many scientific fields. VisIVO development started within the Virtual Observatory framework. VisIVO allows users to meaningfully visualize highly complex, large-scale datasets and create movies of these visualizations based on distributed infrastructures. VisIVO supports high-performance, multi-dimensional visualization of large-scale astrophysical datasets. Users can rapidly obtain meaningful visualizations while preserving full and intuitive control of the relevant parameters. VisIVO consists of VisIVO Desktop - a stand-alone application for interactive visualization on standard PCs, VisIVO Server - a platform for high performance visualization, VisIVO Web - a custom designed web portal, VisIVO Smartphone - an application to exploit the VisIVO Server functionality, and the latest VisIVO feature: VisIVO Library, which allows a job running on a computational system (grid, HPC, etc.) to produce movies directly from the code's internal data arrays without the need to produce intermediate files. This is particularly important when running on large computational facilities, where the user wants to look at the results during the data production phase. For example, in grid computing facilities, images can be produced directly in the grid catalogue while the user code is running on a system that cannot be directly accessed by the user (a worker node). The deployment of VisIVO on the DG and gLite is carried out with the support of the EDGI and EGI-Inspire projects. Depending on the structure and size of the datasets under consideration, the data exploration process can take several hours of CPU time for creating customized views, and the production of movies can potentially last several days. For this reason an MPI parallel version of VisIVO could play a fundamental role in increasing performance, e.g. it could be automatically deployed on nodes that are MPI aware.
A central concept in our development is thus to produce unified code that can run either on serial nodes or in parallel using HPC-oriented grid nodes. Another important aspect for obtaining the highest possible performance is the integration of VisIVO processes with grid nodes where GPUs are available. We have selected CUDA for implementing a range of computationally heavy modules. VisIVO is supported by the EGI-Inspire, EDGI and SCI-BUS projects.
Classification of user interfaces for graph-based online analytical processing
NASA Astrophysics Data System (ADS)
Michaelis, James R.
2016-05-01
In the domain of business intelligence, user-oriented software for conducting multidimensional analysis via Online Analytical Processing (OLAP) is now commonplace. In this setting, datasets commonly have well-defined sets of dimensions and measures around which analysis tasks can be conducted. However, many forms of data used in intelligence operations - deriving from social networks, online communications, and text corpora - consist of graphs with varying forms of potential dimensional structure. Hence, enabling OLAP over such data collections requires explicit definition and extraction of supporting dimensions and measures. Further, as Graph OLAP remains an emerging technique, limited research has been done on its user interface requirements, namely on effectively pairing interface designs to different types of graph-derived dimensions and measures. This paper presents a novel technique for pairing user interface designs to Graph OLAP datasets, rooted in Analytic Hierarchy Process (AHP) driven comparisons. Attributes of the classification strategy are encoded through an AHP ontology, developed in our related work and extended to support pairwise comparison of interfaces according to their ability, as perceived by Subject Matter Experts (SMEs), to support dimensions and measures corresponding to Graph OLAP dataset attributes. To frame this discussion, a survey is provided of existing variations of Graph OLAP as well as existing interface designs previously applied in multidimensional analysis settings. Following this, a review of our AHP ontology is provided, along with a listing of corresponding dataset and interface attributes applicable toward SME recommendation structuring. A walkthrough of AHP-based recommendation encoding via the ontology-based approach is then provided. The paper concludes with a short summary of proposed future directions seen as essential for this research area.
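The AHP machinery underlying such pairwise comparisons is standard: a reciprocal comparison matrix is reduced to a priority vector via its principal eigenvector, and a consistency ratio guards against incoherent judgments. A minimal sketch with a hypothetical three-interface comparison on one criterion:

```python
import numpy as np

# Hypothetical pairwise-comparison matrix for three interface designs:
# A[i, j] > 1 means design i is preferred over design j (Saaty's 1-9 scale),
# with reciprocal entries below the diagonal.
A = np.array([[1.0,   3.0,   5.0],
              [1/3.0, 1.0,   3.0],
              [1/5.0, 1/3.0, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
principal = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, principal].real)
weights /= weights.sum()                      # AHP priority vector

# Consistency check: CI = (lambda_max - n) / (n - 1), compared to Saaty's random index.
n = A.shape[0]
CI = (eigvals.real[principal] - n) / (n - 1)
CR = CI / 0.58                                # random index for n = 3
```

A CR below about 0.1 is conventionally taken to mean the SME's judgments are consistent enough to use.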
Collaborative visual analytics of radio surveys in the Big Data era
NASA Astrophysics Data System (ADS)
Vohl, Dany; Fluke, Christopher J.; Hassan, Amr H.; Barnes, David G.; Kilborn, Virginia A.
2017-06-01
Radio survey datasets comprise an increasing number of individual observations stored as sets of multidimensional data. In large survey projects, astronomers commonly face limitations regarding: 1) interactive visual analytics of sufficiently large subsets of data; 2) synchronous and asynchronous collaboration; and 3) documentation of the discovery workflow. To support collaborative data inquiry, we present encube, a large-scale comparative visual analytics framework. encube can utilise advanced visualization environments such as the CAVE2 (a hybrid 2D and 3D virtual reality environment powered with a 100 Tflop/s GPU-based supercomputer and 84 million pixels) for collaborative analysis of large subsets of data from radio surveys. It can also run on standard desktops, providing a capable visual analytics experience across the display ecology. encube is composed of four primary units enabling compute-intensive processing, advanced visualisation, dynamic interaction, parallel data query, along with data management. Its modularity will make it simple to incorporate astronomical analysis packages and Virtual Observatory capabilities developed within our community. We discuss how encube builds a bridge between high-end display systems (such as CAVE2) and the classical desktop, preserving all traces of the work completed on either platform - allowing the research process to continue wherever you are.
Geovisualization to support the exploration of large health and demographic survey data
Koua, Etien L; Kraak, Menno-Jan
2004-01-01
Background: Survey data are increasingly abundant from many international projects and national statistics. They are generally comprehensive and cover local, regional, and national levels in many domains including health, demography, human development, and economy. These surveys result in several hundred indicators. Geographical analysis of such a large amount of data is often a difficult task, and searching for patterns is a particularly difficult challenge. Geovisualization research is increasingly dealing with the exploration of patterns and relationships in such large datasets for understanding underlying geographical processes. One of the attempts has been to use Artificial Neural Networks, a technology especially useful in situations where the numbers are vast and the relationships are often unclear or even hidden. Results: We investigate ways to integrate computational analysis based on a Self-Organizing Map neural network with visual representations of derived structures and patterns in a framework for exploratory visualization to support visual data mining and knowledge discovery. The framework suggests ways to explore the general structure of the dataset in its multidimensional space in order to provide clues for further exploration of correlations and relationships. Conclusion: In this paper, the proposed framework is used to explore demographic and health survey data. Several graphical representations (information spaces) are used to depict the general structure and clustering of the data and to gain insight into the relationships among the different variables. Detailed exploration of correlations and relationships among the attributes is provided. Results of the analysis are also presented in maps and other graphics. PMID:15180898
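A Self-Organizing Map of the kind used in such frameworks can be sketched compactly. This is a minimal educational implementation on synthetic data, not the authors' configuration:

```python
import numpy as np

def train_som(data, grid=(5, 5), iters=500, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM: returns a (rows*cols, dim) array of trained unit weights."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    units = rng.random((rows * cols, data.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for step in range(iters):
        x = data[rng.integers(len(data))]                    # one random sample
        bmu = np.argmin(((units - x) ** 2).sum(axis=1))      # best-matching unit
        frac = step / iters
        lr = lr0 * (1 - frac)                                # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.5                    # shrinking neighbourhood
        dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2 * sigma ** 2))                # neighbourhood function
        units += lr * h[:, None] * (x - units)
    return units

def bmu_of(units, x):
    return np.argmin(((units - x) ** 2).sum(axis=1))

# Two well-separated clusters standing in for multidimensional survey indicators.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.1, 0.05, size=(200, 3)),
                  rng.normal(0.9, 0.05, size=(200, 3))])
som = train_som(data)
```

After training, each data record's best-matching unit gives it a position on the 2D map grid, which is what the visual information spaces then display and cluster.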
DataFed: A Federated Data System for Visualization and Analysis of Spatio-Temporal Air Quality Data
NASA Astrophysics Data System (ADS)
Husar, R. B.; Hoijarvi, K.
2017-12-01
DataFed is a distributed web-services-based computing environment for accessing, processing, and visualizing atmospheric data in support of air quality science and management. The flexible, adaptive environment facilitates the access and flow of atmospheric data from providers to users by enabling the creation of user-driven data processing/visualization applications. DataFed 'wrapper' components non-intrusively wrap heterogeneous, distributed datasets for access by standards-based GIS web services. The mediator components (also web services) map the heterogeneous data into a spatio-temporal data model. Chained web services provide homogeneous data views (e.g., geospatial, time views) using a global multi-dimensional data model. In addition to data access and rendering, the data processing component services can be programmed for filtering, aggregation, and fusion of multidimensional data. Complete applications are written in a custom data-flow language. Currently, the federated data pool consists of over 50 datasets originating from globally distributed data providers delivering surface-based air quality measurements, satellite observations, and emissions data, as well as regional and global-scale air quality models. The web browser-based user interface allows point-and-click navigation and browsing of the XYZT multi-dimensional data space. The key applications of DataFed are exploring spatial patterns of pollutants and seasonal, weekly, and diurnal cycles and frequency distributions for exploratory air quality research. Since 2008, DataFed has been used to support EPA in the implementation of the Exceptional Event Rule. The data system is also used at universities in the US, Europe and Asia.
Jia, Peilin; Chen, Xiangning; Xie, Wei; Kendler, Kenneth S; Zhao, Zhongming
2018-06-20
Numerous high-throughput omics studies have been conducted in schizophrenia, providing an accumulated catalog of susceptible variants and genes. The results from these studies, however, are highly heterogeneous. The variants and genes nominated by different omics studies often have limited overlap with each other. There is thus a pressing need for integrative analysis to unify the different types of data and provide a convergent view of schizophrenia candidate genes (SZgenes). In this study, we collected a comprehensive, multidimensional dataset, including 7819 brain-expressed genes. The data include genome-wide association evidence from genetics (e.g., genotyping data, copy number variations, de novo mutations), epigenetics, transcriptomics, and literature mining. We developed a method named mega-analysis of odds ratio (MegaOR) to prioritize SZgenes. Application of MegaOR to the multidimensional data resulted in consensus sets of SZgenes (up to 530), each enriched with dense, multidimensional evidence. We showed that these SZgenes had highly tissue-specific expression in brain and nerve and had intensive interactions that were significantly stronger than chance expectation. Furthermore, we found these SZgenes were involved in human brain development, showing strong spatiotemporal expression patterns; these characteristics were replicated in independent brain expression datasets. Finally, we found the SZgenes were enriched in critical functional gene sets involved in neuronal activities, ligand-gated ion signaling, and fragile X mental retardation protein targets. In summary, MegaOR analysis reported consensus sets of SZgenes with enriched evidence of association with schizophrenia, providing insights into the pathophysiology underlying schizophrenia.
Statistical segmentation of multidimensional brain datasets
NASA Astrophysics Data System (ADS)
Desco, Manuel; Gispert, Juan D.; Reig, Santiago; Santos, Andres; Pascau, Javier; Malpica, Norberto; Garcia-Barreno, Pedro
2001-07-01
This paper presents an automatic segmentation procedure for MRI neuroimages that overcomes part of the problems involved in multidimensional clustering techniques, such as partial volume effects (PVE), processing speed, and the difficulty of incorporating a priori knowledge. The method is a three-stage procedure: 1) Exclusion of background and skull voxels using threshold-based region growing techniques with fully automated seed selection. 2) Expectation Maximization algorithms are used to estimate the probability density function (PDF) of the remaining voxels, which are assumed to be mixtures of Gaussians. These voxels can then be classified into cerebrospinal fluid (CSF), white matter and grey matter. Using this procedure, our method takes advantage of using the full covariance matrix (instead of the diagonal) for the joint PDF estimation. On the other hand, logistic discrimination techniques are more robust against violation of multi-Gaussian assumptions. 3) A priori knowledge is added using Markov Random Field techniques. The algorithm has been tested with a dataset of 30 brain MRI studies (co-registered T1 and T2 MRI). Our method was compared with clustering techniques and with template-based statistical segmentation, using manual segmentation as a gold standard. Our results were more robust and closer to the gold standard.
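Stage 2 of such a pipeline, EM fitting of a Gaussian mixture with full covariance matrices over joint (T1, T2) intensities, can be sketched with scikit-learn. The class means and covariances below are invented for illustration; real studies would fit masked brain voxels:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for co-registered (T1, T2) intensities of three tissue classes.
rng = np.random.default_rng(0)
csf   = rng.multivariate_normal([20, 90], [[4, 1], [1, 4]], 500)
grey  = rng.multivariate_normal([60, 60], [[4, 1], [1, 4]], 500)
white = rng.multivariate_normal([90, 30], [[4, 1], [1, 4]], 500)
voxels = np.vstack([csf, grey, white])

# EM fit of a 3-component mixture with full covariance matrices, then hard labels.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(voxels)
labels = gmm.predict(voxels)
```

Using `covariance_type="full"` mirrors the paper's point about exploiting the full covariance matrix of the joint PDF rather than only its diagonal.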
DocCube: Multi-Dimensional Visualization and Exploration of Large Document Sets.
ERIC Educational Resources Information Center
Mothe, Josiane; Chrisment, Claude; Dousset, Bernard; Alaux, Joel
2003-01-01
Describes a user interface that provides global visualizations of large document sets to help users formulate the query that corresponds to their information needs. Highlights include concept hierarchies that users can browse to specify and refine information needs; knowledge discovery in databases and texts; and multidimensional modeling.…
HC StratoMineR: A Web-Based Tool for the Rapid Analysis of High-Content Datasets.
Omta, Wienand A; van Heesbeen, Roy G; Pagliero, Romina J; van der Velden, Lieke M; Lelieveld, Daphne; Nellen, Mehdi; Kramer, Maik; Yeong, Marley; Saeidi, Amir M; Medema, Rene H; Spruit, Marco; Brinkkemper, Sjaak; Klumperman, Judith; Egan, David A
2016-10-01
High-content screening (HCS) can generate large multidimensional datasets and, when aligned with the appropriate data mining tools, it can yield valuable insights into the mechanism of action of bioactive molecules. However, easy-to-use data mining tools are not widely available, with the result that these datasets are frequently underutilized. Here, we present HC StratoMineR, a web-based tool for high-content data analysis. It is a decision-supportive platform that guides even non-expert users through a high-content data analysis workflow. HC StratoMineR is built using MySQL (My Structured Query Language) for storage and querying, PHP: Hypertext Preprocessor as the main programming language, and jQuery for additional user interface functionality. R is used for statistical calculations, logic and data visualizations. Furthermore, C++ and graphics processing unit (GPU) power are embedded in R by using the Rcpp and rpud libraries for operations that are computationally highly intensive. We show that we can use HC StratoMineR for the analysis of multivariate data from a high-content siRNA knock-down screen and a small-molecule screen. It can be used to rapidly filter out undesirable data; to select relevant data; and to perform quality control, data reduction, data exploration, morphological hit picking, and data clustering. Our results demonstrate that HC StratoMineR can be used to functionally categorize HCS hits and, thus, provide valuable information for hit prioritization.
Generating and Visualizing Climate Indices using Google Earth Engine
NASA Astrophysics Data System (ADS)
Erickson, T. A.; Guentchev, G.; Rood, R. B.
2017-12-01
Climate change is expected to have its largest impacts at regional and local scales. Relevant and credible climate information is needed to support planning and adaptation efforts in our communities. The volume of climate projections of temperature and precipitation is steadily increasing, as datasets are being generated on finer spatial and temporal grids with an increasing number of ensemble members to characterize uncertainty. Despite advancements in tools for querying and retrieving subsets of these large, multi-dimensional datasets, ease of access remains a barrier for many existing and potential users who want to derive useful information from these data, particularly those outside of the climate modelling research community. Climate indices that can be derived from daily temperature and precipitation data, such as the annual number of frost days or the growing season length, can provide useful information to practitioners and stakeholders. For this work, the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset was loaded into Google Earth Engine, a cloud-based geospatial processing platform. Algorithms that use the Earth Engine API to generate several climate indices were written. The indices were chosen from the set developed by the joint CCl/CLIVAR/JCOMM Expert Team on Climate Change Detection and Indices (ETCCDI). Simple user interfaces were created that allow users to query, produce maps and graphs of the indices, and download results for additional analyses. These browser-based interfaces could allow users in low-bandwidth environments to access climate information. This research shows that calculating climate indices from global downscaled climate projection datasets and sharing them widely using cloud computing technologies is feasible.
Further development will focus on exposing the climate indices to existing applications via the Earth Engine API, and building custom user interfaces for presenting climate indices to a diverse set of user groups.
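One of the ETCCDI indices mentioned above, annual frost days (FD), is simply the count of days in a year with daily minimum temperature below 0 °C. A single-grid-cell sketch (the study computes such indices over the full NEX-GDDP grid via the Earth Engine API, which is not reproduced here):

```python
import numpy as np

def frost_days(tmin_daily_c):
    """ETCCDI 'FD' index: annual count of days with daily minimum
    temperature below 0 degrees Celsius. Input is one year of daily
    Tmin values in Celsius for a single location."""
    tmin = np.asarray(tmin_daily_c)
    return int((tmin < 0.0).sum())

# a toy year: 300 mild days and 65 freezing days
year = np.concatenate([np.full(300, 5.0), np.full(65, -2.0)])
fd = frost_days(year)  # 65
```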
ERIC Educational Resources Information Center
de Jong, Martijn G.; Steenkamp, Jan-Benedict E. M.
2010-01-01
We present a class of finite mixture multilevel multidimensional ordinal IRT models for large scale cross-cultural research. Our model is proposed for confirmatory research settings. Our prior for item parameters is a mixture distribution to accommodate situations where different groups of countries have different measurement operations, while…
OpenClimateGIS - A Web Service Providing Climate Model Data in Commonly Used Geospatial Formats
NASA Astrophysics Data System (ADS)
Erickson, T. A.; Koziol, B. W.; Rood, R. B.
2011-12-01
The goal of the OpenClimateGIS project is to make climate model datasets readily available in commonly used, modern geospatial formats supported by GIS software, browser-based mapping tools, and virtual globes. The climate modeling community typically stores climate data in multidimensional gridded formats capable of efficiently storing large volumes of data (such as netCDF and GRIB), while the geospatial community typically uses flexible vector and raster formats that are capable of storing small volumes of data (relative to the multidimensional gridded formats). OpenClimateGIS seeks to address this difference in data formats by clipping climate data to user-specified vector geometries (i.e., areas of interest) and translating the gridded data on-the-fly into multiple vector formats. The OpenClimateGIS system does not store climate data archives locally, but rather works in conjunction with external climate archives that expose climate data via the OPeNDAP protocol. OpenClimateGIS provides a RESTful web service API for accessing climate data resources via HTTP, allowing a wide range of applications to access the climate data. The OpenClimateGIS system has been developed using open source development practices, and the source code is publicly available. The project integrates libraries from several other open source projects (including Django, PostGIS, numpy, Shapely, and netcdf4-python). OpenClimateGIS development is supported by a grant from NOAA's Climate Program Office.
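The core clipping operation described above — masking grid cells against a user-specified vector geometry — can be sketched in pure Python. This is a loose conceptual stand-in, not the OpenClimateGIS implementation (which uses Shapely and netcdf4-python); the ray-casting helper and all names are illustrative.

```python
import numpy as np

def point_in_polygon(x, y, poly):
    """Ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def clip_grid(values, lons, lats, aoi):
    """Mask grid cells whose centers fall outside the AOI polygon,
    mimicking (very loosely) clipping gridded climate data to a
    user-specified vector geometry."""
    mask = np.array([[point_in_polygon(lon, lat, aoi) for lon in lons]
                     for lat in lats])
    return np.where(mask, values, np.nan)

lons = np.array([0.5, 1.5, 2.5])
lats = np.array([0.5, 1.5, 2.5])
grid = np.arange(9.0).reshape(3, 3)
triangle = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]  # AOI covering the lower-left
clipped = clip_grid(grid, lons, lats, triangle)
```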
ERIC Educational Resources Information Center
Makiya, George K.
2012-01-01
This dissertation reports on a multi-dimensional longitudinal investigation of the factors that influence Enterprise Architecture (EA) diffusion and assimilation within the U.S. federal government. The study uses publicly available datasets of 123 U.S. federal departments and agencies, as well as interview data among CIOs and EA managers within…
BioStar models of clinical and genomic data for biomedical data warehouse design
Wang, Liangjiang; Ramanathan, Murali
2008-01-01
Biomedical research is now generating large amounts of data, ranging from clinical test results to microarray gene expression profiles. The scale and complexity of these datasets give rise to substantial challenges in data management and analysis. It is highly desirable that data warehousing and online analytical processing technologies be applied to biomedical data integration and mining. The major difficulty probably lies in the task of capturing and modelling diverse biological objects and their complex relationships. This paper describes multidimensional data modelling for biomedical data warehouse design. Since conventional models such as the star schema appear to be insufficient for modelling clinical and genomic data, we develop a new model called the BioStar schema. The new model can capture the rich semantics of biomedical data and provide greater extensibility for the fast evolution of biological research methodologies. PMID:18048122
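For context, the conventional star schema that the paper argues is insufficient looks like this: one fact table keyed to surrounding dimension tables. The BioStar schema itself is not detailed in the abstract, so this sketch shows only the baseline; all table and column names are illustrative.

```python
import sqlite3

# A conventional star schema: a central fact table referencing
# dimension tables. Table/column names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_patient (patient_id INTEGER PRIMARY KEY, sex TEXT, birth_year INTEGER);
CREATE TABLE dim_gene    (gene_id    INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE fact_expression (
    patient_id INTEGER REFERENCES dim_patient(patient_id),
    gene_id    INTEGER REFERENCES dim_gene(gene_id),
    expression REAL
);
""")
cur.execute("INSERT INTO dim_patient VALUES (1, 'F', 1970)")
cur.execute("INSERT INTO dim_gene VALUES (10, 'TP53')")
cur.execute("INSERT INTO fact_expression VALUES (1, 10, 2.5)")
row = cur.execute("""
    SELECT p.sex, g.symbol, f.expression
    FROM fact_expression f
    JOIN dim_patient p USING (patient_id)
    JOIN dim_gene g USING (gene_id)
""").fetchone()
```

The rigidity of this layout — fixed dimensions, flat fact rows — is precisely what makes evolving clinical and genomic objects hard to model, motivating the richer BioStar design.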
NASA Astrophysics Data System (ADS)
Aksenov, A. G.; Chechetkin, V. M.
2018-04-01
Most of the energy released in the gravitational collapse of the cores of massive stars is carried away by neutrinos, which play a pivotal role in explaining core-collapse supernovae. Currently, mathematical models of the gravitational collapse are based on multi-dimensional gas dynamics and thermonuclear reactions, while neutrino transport is treated in a simplified way. Here, multidimensional gas dynamics is used with neutrino transport in the flux-limited diffusion approximation to study the role of multi-dimensional effects. The possibility of large-scale convection is discussed, which is interesting both for explaining SN II and for setting up observations to register possible high-energy (≳10 MeV) neutrinos from the supernova. A new multi-dimensional, multi-temperature gas dynamics method with neutrino transport is presented.
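The flux-limited diffusion approximation mentioned above is commonly written in the following form (one standard closure; the paper's exact limiter is not given in the abstract):

```latex
% Flux-limited diffusion closure for the radiation/neutrino energy
% density E, with opacity \kappa (a common choice of limiter; the
% authors' specific scheme may differ):
F = -\frac{c\,\lambda(R)}{\kappa}\,\nabla E, \qquad
R = \frac{|\nabla E|}{\kappa E}, \qquad
\lambda(R) = \frac{2+R}{6+3R+R^2},
% which recovers ordinary diffusion (\lambda \to 1/3) as R \to 0 and
% the causal free-streaming limit |F| \to cE as R \to \infty.
```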
Xarray: multi-dimensional data analysis in Python
NASA Astrophysics Data System (ADS)
Hoyer, Stephan; Hamman, Joe; Maussion, Fabien
2017-04-01
xarray (http://xarray.pydata.org) is an open source project and Python package that provides a toolkit and data structures for N-dimensional labeled arrays, which are the bread and butter of modern geoscientific data analysis. Key features of the package include label-based indexing and arithmetic, interoperability with the core scientific Python packages (e.g., pandas, NumPy, Matplotlib, Cartopy), out-of-core computation on datasets that don't fit into memory, a wide range of input/output options, and advanced multi-dimensional data manipulation tools such as group-by and resampling. In this contribution we will present the key features of the library and demonstrate its great potential for a wide range of applications, from (big-)data processing on supercomputers to data exploration in front of a classroom.
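Two of the features listed above — label-based indexing and group-by on a datetime coordinate — look like this in practice (variable names and the toy data are invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A labeled 2-D array: one year of daily values at three stations.
times = pd.date_range("2000-01-01", periods=365, freq="D")
temps = xr.DataArray(
    np.linspace(0.0, 10.0, 365 * 3).reshape(365, 3),
    dims=("time", "station"),
    coords={"time": times, "station": ["a", "b", "c"]},
    name="temperature",
)
one_station = temps.sel(station="a")          # index by label, not position
monthly = temps.groupby("time.month").mean()  # group-by on a datetime coord
```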
NASA Astrophysics Data System (ADS)
Bauer, J. R.; Rose, K.; Romeo, L.; Barkhurst, A.; Nelson, J.; Duran-Sesin, R.; Vielma, J.
2016-12-01
Efforts to prepare for and reduce the risk of hazards, from both natural and anthropogenic sources, that threaten our oceans and coasts require an understanding of the dynamics and interactions between the physical, ecological, and socio-economic systems. Understanding these coupled dynamics is essential as offshore oil & gas exploration and production continues to push into harsher, more extreme environments where risks and uncertainty increase. However, working with these large, complex datasets from various sources and scales to assess risks and potential impacts associated with offshore energy exploration and production poses several challenges to researchers. In order to address these challenges, an integrated assessment model (IAM) was developed at the Department of Energy's (DOE) National Energy Technology Laboratory (NETL) that combines spatial data infrastructure and an online research platform to manage, process, analyze, and share these large, multidimensional datasets, research products, and the tools and models used to evaluate risk and reduce uncertainty for the entire offshore system, from the subsurface, through the water column, to coastal ecosystems and communities. Here, we will discuss the spatial data infrastructure and online research platform, NETL's Energy Data eXchange (EDX), that underpin the offshore IAM, providing information on how the framework combines multidimensional spatial data and spatio-temporal tools to evaluate risks to the complex matrix of potential environmental, social, and economic impacts stemming from modeled offshore hazard scenarios, such as oil spills or hurricanes. In addition, we will discuss the online analytics, tools, and visualization methods integrated into this framework that support availability and access to data, as well as allow for the rapid analysis and effective communication of analytical results to aid a range of decision-making needs.
Extracting Unidimensional Chains from Multidimensional Datasets: A Graph Theory Approach.
1980-02-01
Choi, BongKyoo; Kawakami, Norito; Chang, SeiJin; Koh, SangBaek; Bjorner, Jakob; Punnett, Laura; Karasek, Robert
2008-01-01
The five-item psychological demands scale of the Job Content Questionnaire (JCQ) has been assumed to be one-dimensional in practice. This study examined whether the scale has sufficient internal consistency and external validity to be treated as a single scale, using cross-national JCQ datasets from the United States, Korea, and Japan. Methods included exploratory factor analyses with 22 JCQ items, confirmatory factor analyses with the five psychological demands items, and correlation analyses with mental health indexes. Generally, exploratory factor analyses displayed the predicted demand/control/support structure with three and four factors extracted. However, at more detailed levels of exploratory and confirmatory factor analysis, the demands scale showed clear evidence of a multi-factor structure. The correlations of items and subscales of the demands scale with mental health indexes were similar to those of the full scale in the Korean and Japanese datasets, but not in the U.S. data. In 4 out of 16 sub-samples of the U.S. data, several significant correlations of the components of the demands scale with job dissatisfaction and life dissatisfaction were obscured by the full scale. The multidimensionality of the psychological demands scale should be considered in psychometric analysis and interpretation, occupational epidemiologic studies, and future scale extension.
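The internal-consistency check discussed above is typically quantified with Cronbach's alpha. A minimal sketch of the standard formula (not the study's own code; the toy score matrix is invented):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# perfectly parallel items give alpha = 1.0
scores = np.array([[1, 1], [2, 2], [3, 3], [4, 4]], dtype=float)
alpha = cronbach_alpha(scores)  # 1.0
```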
Dong, Ni; Huang, Helai; Zheng, Liang
2015-09-01
In zone-level crash prediction, accounting for spatial dependence has become an extensively studied topic. This study proposes a Support Vector Machine (SVM) model to address complex, large, and multi-dimensional spatial data in crash prediction. A Correlation-based Feature Selector (CFS) was applied to evaluate candidate factors possibly related to zonal crash frequency when handling high-dimensional spatial data. To demonstrate the proposed approaches and to compare them with the Bayesian spatial model with conditional autoregressive prior (i.e., CAR), a dataset from Hillsborough County, Florida was employed. The results showed that SVM models accounting for spatial proximity outperform the non-spatial model in terms of model fitting and predictive performance, which indicates the reasonableness of considering cross-zonal spatial correlations. The best predictive capability, relatively, is associated with the model that considers proximity by centroid distance, uses the RBF kernel, and sets aside 10% of the whole dataset as testing data, which further exhibits SVM models' capacity for addressing comparatively complex spatial data in regional crash prediction modeling. Moreover, SVM models exhibit better goodness-of-fit compared with CAR models when utilizing the whole dataset as the samples. A sensitivity analysis of the centroid-distance-based spatial SVM models was conducted to capture the impacts of explanatory variables on the mean predicted probabilities for crash occurrence. The results conform to the coefficient estimation in the CAR models, which supports the employment of the SVM model as an alternative in regional safety modeling. Copyright © 2015 Elsevier Ltd. All rights reserved.
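The modeling setup described above — an RBF-kernel SVM with a 10% held-out test set — can be sketched generically with scikit-learn. The features, target, and hyperparameters below are synthetic stand-ins, not the paper's Hillsborough County data or its tuned model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic zonal features and a crash-frequency-like target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                  # e.g. zonal exposure variables
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)  # invented response

# Hold out 10% for testing, as in the paper's best-performing setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1)

model = SVR(kernel="rbf").fit(X_train, y_train)  # RBF-kernel SVM regression
preds = model.predict(X_test)
```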
A peek into the future of radiology using big data applications
Kharat, Amit T.; Singhal, Shubham
2017-01-01
Big data refers to the extremely large volumes of data that are available in the radiology department. Big data is identified by four Vs – Volume, Velocity, Variety, and Veracity. By applying different algorithmic tools and converting raw data to transformed data in such large datasets, there is a possibility of understanding and using radiology data for gaining new knowledge and insights. Big data analytics consists of 6Cs – Connection, Cloud, Cyber, Content, Community, and Customization. The global technological prowess and per-capita capacity to save digital information has roughly doubled every 40 months since the 1980s. By using big data, the planning and implementation of radiological procedures in radiology departments can be given a great boost. Potential applications of big data in the future are scheduling of scans, creating patient-specific personalized scanning protocols, radiologist decision support, emergency reporting, virtual quality assurance for the radiologist, etc. Targeted use of big data applications can be done for images by supporting the analytic process. Screening software tools designed on big data can be used to highlight a region of interest, such as subtle changes in parenchymal density, a solitary pulmonary nodule, or focal hepatic lesions, by plotting its multidimensional anatomy. Following this, we can run more complex applications such as three-dimensional multiplanar reconstructions (MPR), volumetric rendering (VR), and curved planar reconstruction, which consume higher system resources, on targeted data subsets rather than querying the complete cross-sectional imaging dataset. This pre-emptive selection of datasets can substantially reduce system requirements such as memory and server load and provide prompt results. However, a word of caution: big data should not become "dump data" due to inadequate and poor analysis and non-structured, improperly stored data.
In the near future, big data can ring in the era of personalized and individualized healthcare. PMID:28744087
Utilization of the Discrete Differential Evolution for Optimization in Multidimensional Point Clouds
Uher, Vojtěch; Gajdoš, Petr; Radecký, Michal; Snášel, Václav
2016-01-01
The Differential Evolution (DE) is a widely used bioinspired optimization algorithm developed by Storn and Price. It is popular for its simplicity and robustness. This algorithm was primarily designed for real-valued problems and continuous functions, but several modified versions optimizing both integer and discrete-valued problems have been developed. The discrete-coded DE has been mostly used for combinatorial problems in a set of enumerative variants. However, the DE has great potential in spatial data analysis and pattern recognition. This paper formulates the problem as a search for a combination of distinct vertices that meet the specified conditions. It proposes a novel approach called the Multidimensional Discrete Differential Evolution (MDDE), applying the principle of the discrete-coded DE in discrete point clouds (PCs). The paper examines the local searching abilities of the MDDE and its convergence to the global optimum in the PCs. The multidimensional discrete vertices cannot be simply ordered to get a convenient course of the discrete data, which is crucial for good convergence of a population. A novel mutation operator utilizing linear ordering of spatial data based on space filling curves is introduced. The algorithm is tested on several spatial datasets and optimization problems. The experiments show that the MDDE is an efficient and fast method for discrete optimizations in multidimensional point clouds. PMID:27974884
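The linear ordering of spatial data via a space-filling curve, the idea behind the mutation operator described above, can be sketched with a Z-order (Morton) key that interleaves coordinate bits. The authors' exact curve and encoding are not specified in the abstract, so this is only an illustration of the technique:

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of non-negative integer coordinates (x, y)
    to produce a Z-order (Morton) key. Sorting points by this key gives
    a linear ordering that keeps spatially close points close in the
    sequence (for most neighbors)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bit -> even position
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bit -> odd position
    return key

points = [(3, 1), (0, 0), (1, 1), (2, 2)]
ordered = sorted(points, key=lambda p: morton_key(*p))
```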
Enabling Web-Based Analysis of CUAHSI HIS Hydrologic Data Using R and Web Processing Services
NASA Astrophysics Data System (ADS)
Ames, D. P.; Kadlec, J.; Bayles, M.; Seul, M.; Hooper, R. P.; Cummings, B.
2015-12-01
The CUAHSI Hydrologic Information System (CUAHSI HIS) provides open access to a large collection of observed and modeled hydrological time series data from many parts of the world. Several software tools have been designed to simplify searching and access to the CUAHSI HIS datasets. These software tools include desktop client software (HydroDesktop, HydroExcel), developer libraries (WaterML R Package, OWSLib, ulmo), and the new interactive search website, http://data.cuahsi.org. An issue with using the time series data from CUAHSI HIS for further analysis by hydrologists (for example, for verification of hydrological and snowpack models) is the large heterogeneity of the time series data. The time series may be regular or irregular, contain missing data, have different time support, and be recorded in different units. R is a widely used computational environment for statistical analysis of time series and spatio-temporal data that can be used to assess fitness and perform scientific analyses on observation data. R includes the ability to record a data analysis in the form of a reusable script. The R script together with the input time series dataset can be shared with other users, making the analysis more reproducible. The major goal of this study is to examine the use of R as a Web Processing Service for transforming time series data from the CUAHSI HIS and sharing the results on the Internet within HydroShare. HydroShare is an online data repository and social network for sharing large hydrological data sets such as time series, raster datasets, and multi-dimensional data. It can be used as a permanent cloud storage space for saving time series analysis results. We examine the issues associated with running R scripts online, including code validation, saving of outputs, reporting progress, and provenance management. An explicit goal is that a script run locally should produce exactly the same results as the script run on the Internet.
Our design can be used as a model for other studies that need to run R scripts on the web.
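The heterogeneity problems named above — irregular timestamps, gaps, differing units — are the kind of transformation such scripts perform. A minimal pandas sketch of the same idea (the study itself uses R and the WaterML R package; the series and units here are invented):

```python
import pandas as pd

# An irregularly sampled observation series with a gap.
irregular = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2015-01-01 00:05",
                          "2015-01-01 01:10",
                          "2015-01-01 03:00"]),
)

# Regularize to an hourly grid and fill the gap by linear interpolation.
hourly = irregular.resample("1h").mean().interpolate()

# Harmonize units, e.g. cubic metres to litres (illustrative factor).
in_litres = hourly * 1000.0
```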
Privacy preserving data publishing of categorical data through k-anonymity and feature selection.
Aristodimou, Aristos; Antoniades, Athos; Pattichis, Constantinos S
2016-03-01
In healthcare, there is a vast amount of patient data, which can lead to important discoveries if combined. Due to legal and ethical issues, such data cannot be shared, and hence such information is underused. A new area of research has emerged, called privacy preserving data publishing (PPDP), which aims to share data in a way that preserves privacy while keeping information loss to a minimum. In this Letter, a new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS). The algorithm uses feature selection for reducing the data dimensionality and then combines attribute and record suppression for obtaining k-anonymity. Five datasets from different areas of the life sciences [RETINOPATHY, Single Photon Emission Computed Tomography imaging, gene sequencing and drug discovery (two datasets)] were anonymised with kPB-MS. The produced anonymised datasets were evaluated using four different classifiers, and in 74% of the test cases they produced similar or better accuracies than using the full datasets.
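The k-anonymity property targeted above means every published record shares its quasi-identifier values with at least k-1 others. A much simpler baseline than the paper's kPB-MS algorithm is plain record suppression, sketched here with invented records:

```python
from collections import Counter

def suppress_to_k_anonymity(records, quasi_ids, k):
    """Drop records whose quasi-identifier combination occurs fewer
    than k times, so every remaining record is indistinguishable from
    at least k-1 others on those attributes. A naive baseline only;
    kPB-MS combines attribute and record suppression more cleverly."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return [r for r in records
            if counts[tuple(r[q] for q in quasi_ids)] >= k]

rows = [
    {"age": 30, "zip": "100", "dx": "flu"},
    {"age": 30, "zip": "100", "dx": "cold"},
    {"age": 45, "zip": "200", "dx": "flu"},   # unique combination, suppressed
]
safe = suppress_to_k_anonymity(rows, ["age", "zip"], k=2)
```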
On new physics searches with multidimensional differential shapes
NASA Astrophysics Data System (ADS)
Ferreira, Felipe; Fichet, Sylvain; Sanz, Veronica
2018-03-01
In the context of upcoming new physics searches at the LHC, we investigate the impact of multidimensional differential rates in typical LHC analyses. We discuss the properties of shape information, and argue that multidimensional rates bring limited information in the scope of a discovery, but can have a large impact on model discrimination. We also point out subtleties regarding cancellations of systematic uncertainties and the Cauchy-Schwarz bound on interference terms.
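The Cauchy-Schwarz bound mentioned above states, schematically, that an interference term is bounded by the geometric mean of the pure terms (notation is illustrative; the paper's precise statement may differ):

```latex
% For a rate built from interfering SM and new-physics amplitudes,
% \sigma \propto \int |\mathcal{A}_{\mathrm{SM}} + \mathcal{A}_{\mathrm{NP}}|^2,
% the Cauchy-Schwarz inequality bounds the interference contribution:
\left| \sigma_{\mathrm{int}} \right|
  = \left| 2\,\mathrm{Re}\!\int \mathcal{A}_{\mathrm{SM}}\,
      \mathcal{A}_{\mathrm{NP}}^{*} \right|
  \le 2\sqrt{\sigma_{\mathrm{SM}}\,\sigma_{\mathrm{NP}}},
% with \sigma_{\mathrm{SM}} = \int |\mathcal{A}_{\mathrm{SM}}|^2 and
% \sigma_{\mathrm{NP}} = \int |\mathcal{A}_{\mathrm{NP}}|^2.
```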
Sims, Mario; Wyatt, Sharon B.; Gutierrez, Mary Lou; Taylor, Herman A.; Williams, David R.
2009-01-01
Objective Assessing the discrimination-health disparities hypothesis requires psychometrically sound, multidimensional measures of discrimination. Among the available discrimination measures, few are multidimensional and none have adequate psychometric testing in a large African American sample. We report the development and psychometric testing of the multidimensional Jackson Heart Study Discrimination (JHSDIS) Instrument. Methods A multidimensional measure assessing the occurrence, frequency, attribution, and coping responses to perceived everyday and lifetime discrimination; lifetime burden of discrimination; and effect of skin color was developed and tested in the 5302-member cohort of the Jackson Heart Study. Internal consistency was calculated by using the Cronbach α coefficient. Confirmatory factor analysis established the dimensions, and intercorrelation coefficients assessed the discriminant validity of the instrument. Setting Tri-county area of the Jackson, MS metropolitan statistical area. Results The JHSDIS was psychometrically sound (α = .78 overall; .84 and .77 for the everyday and lifetime subscales, respectively). Confirmatory factor analysis yielded 11 factors, which confirmed the a priori dimensions represented. Conclusions The JHSDIS combined three scales into a single multidimensional instrument with good psychometric properties in a large sample of African Americans. This analysis lays the foundation for using this instrument in research that will examine the association between perceived discrimination and CVD among African Americans. PMID:19341164
A rapid local singularity analysis algorithm with applications
NASA Astrophysics Data System (ADS)
Chen, Zhijun; Cheng, Qiuming; Agterberg, Frits
2015-04-01
The local singularity model developed by Cheng is fast gaining popularity for characterizing mineralization and detecting anomalies in geochemical, geophysical, and remote sensing data. However, one of the conventional algorithms, which involves computing moving-average values at different scales, is time-consuming, especially when analyzing a large dataset. The summed area table (SAT), also called an integral image, is a fast algorithm used within the Viola-Jones object detection framework in computer vision. Historically, the principle of the SAT is well-known in the study of multi-dimensional probability distribution functions, namely in computing 2D (or ND) probabilities (area under the probability distribution) from the respective cumulative distribution functions. In this study we introduce the SAT and its variant, the Rotated Summed Area Table, for isotropic, anisotropic, or directional local singularity mapping. Once the SAT is computed, any rectangular sum can be obtained at any scale or location in constant time. The sum for any rectangular region in the image can be computed using only 4 array accesses, independently of the size of the region, effectively reducing the time complexity from O(n) to O(1). New programs using Python, Julia, MATLAB, and C++ are implemented respectively to satisfy different applications, especially big data analysis. Several large geochemical and remote sensing datasets are tested. A wide variety of scale changes (linear spacing or log spacing) for non-iterative or iterative approaches are adopted to calculate the singularity index values and compare the results. The results indicate that local singularity analysis with the SAT is more robust than and superior to the traditional approach in identifying anomalies.
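The SAT trick described above — any rectangular sum from four table accesses — is straightforward to sketch with NumPy (a generic integral-image implementation, not the authors' programs):

```python
import numpy as np

def summed_area_table(img):
    """2-D summed-area table (integral image): sat[i, j] holds the sum
    of img[:i+1, :j+1]."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def rect_sum(sat, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from at most four table accesses,
    O(1) regardless of rectangle size (inclusion-exclusion)."""
    total = sat[r1, c1]
    if r0 > 0:
        total -= sat[r0 - 1, c1]
    if c0 > 0:
        total -= sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total

img = np.arange(16).reshape(4, 4)
sat = summed_area_table(img)
s = rect_sum(sat, 1, 1, 2, 2)  # sum of the central 2x2 block: 5+6+9+10
```

Building the table costs one pass over the image; after that, moving-average windows at any scale are constant-time, which is exactly the speed-up the study exploits.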
NASA Astrophysics Data System (ADS)
Kobor, J. S.; O'Connor, M. D.; Sherwood, M. N.
2013-12-01
Effective floodplain management and restoration requires a detailed understanding of floodplain processes not readily achieved using standard one-dimensional hydraulic modeling approaches. The application of more advanced numerical models is, however, often limited by the relatively high costs of acquiring the high-resolution topographic data needed for model development using traditional surveying methods. The increasing availability of LiDAR data has the potential to significantly reduce these costs and thus facilitate application of multi-dimensional hydraulic models where budget constraints would have otherwise prohibited their use. The accuracy and suitability of LiDAR data for supporting model development can vary widely depending on the resolution of channel and floodplain features, the data collection density, and the degree of vegetation canopy interference among other factors. More work is needed to develop guidelines for evaluating LiDAR accuracy and determining when and how best the data can be used to support numerical modeling activities. Here we present two recent case studies where LiDAR datasets were used to support floodplain and sediment transport modeling efforts. One LiDAR dataset was collected with a relatively low point density and used to study a small stream channel in coastal Marin County and a second dataset was collected with a higher point density and applied to a larger stream channel in western Sonoma County. Traditional topographic surveying was performed at both sites which provided a quantitative means of evaluating the LiDAR accuracy. We found that with the lower point density dataset, the accuracy of the LiDAR varied significantly between the active stream channel and floodplain whereas the accuracy across the channel/floodplain interface was more uniform with the higher density dataset. Accuracy also varied widely as a function of the density of the riparian vegetation canopy. 
We found that coupled 1- and 2-dimensional hydraulic models, whereby the active channel is simulated in one dimension and the floodplain in two dimensions, provided the best means of utilizing the LiDAR data to evaluate existing conditions and develop alternative flood hazard mitigation and habitat restoration strategies. Such an approach recognizes the limitations of the LiDAR data within active channel areas with dense riparian cover and is cost-effective in that it allows field survey efforts to focus primarily on characterizing active stream channel areas. The multi-dimensional modeling approach also conforms well to the physical realities of the stream system, whereby in-channel flows can generally be well-described as a one-dimensional flow problem and floodplain flows are often characterized by multiple and often poorly understood flowpaths. The multi-dimensional modeling approach has the additional advantages of allowing for accurate simulation of the effects of hydraulic structures using well-tested one-dimensional formulae and minimizing the computational burden of the models by not requiring the small spatial resolutions necessary to resolve the geometries of small stream channels in two dimensions.
Array Processing in the Cloud: the rasdaman Approach
NASA Astrophysics Data System (ADS)
Merticariu, Vlad; Dumitru, Alex
2015-04-01
The multi-dimensional array data model is gaining more and more attention when dealing with Big Data challenges in a variety of domains such as climate simulations, geographic information systems, medical imaging, or astronomical observations. Solutions provided by classical Big Data tools such as key-value stores and MapReduce, as well as traditional relational databases, have proved to be limited in domains associated with multi-dimensional data. This problem has been addressed by the field of array databases, in which systems provide database services for raster data without imposing limitations on the number of dimensions that a dataset can have. Examples of datasets commonly handled by array databases include 1-dimensional sensor data, 2-D satellite imagery, 3-D x/y/t image time series as well as x/y/z geophysical voxel data, and 4-D x/y/z/t weather data; in astrophysics, such datasets can grow as large as simulations of the whole universe. rasdaman is a well-established array database, which implements many optimizations for dealing with large data volumes and operation complexity. Among those, the latest one is intra-query parallelization support: a network of machines collaborates to answer a single array database query by dividing it into independent sub-queries sent to different servers. This enables massive processing speed-ups, which promise solutions to research challenges on multi-Petabyte data cubes. Several correlated factors influence the speedup that intra-query parallelisation brings: the number of servers, the capabilities of each server, the quality of the network, and the availability of the data to the server that needs it in order to compute the result, among others.
In the effort of adapting the engine to cloud processing patterns, two main components have been identified: one that handles communication and gathers information about the arrays sitting on every server, and a processing unit responsible for dividing work among available nodes and executing operations on local data. The federation daemon collects and stores statistics from the other network nodes and provides real-time updates about local changes. Information exchanged includes available datasets, CPU load, and memory usage per host. The processing component is represented by the rasdaman server. Using information from the federation daemon, it breaks queries into subqueries to be executed on peer nodes, ships them, and assembles the intermediate results. Thus, we define a rasdaman network node as a pair of a federation daemon and a rasdaman server. Any node can receive a query and will subsequently act as that query's dispatcher, so all peers are at the same level and there is no single point of failure. Should a node become inaccessible, its peers will recognize this and no longer consider it for distribution. Conversely, a peer can join the network at any time. To assess the feasibility of our approach, we deployed a rasdaman network in the Amazon Elastic Cloud environment on 1001 nodes, and observed that this feature can greatly increase the performance and scalability of the system, offering a large throughput of processed data.
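The dispatch-and-assemble pattern described above can be reduced to a toy sketch: split one array query into independent sub-queries, evaluate each "on a node," and combine the partial results. This is purely conceptual; real rasdaman ships sub-queries over a network via the federation daemon, and the function names here are invented.

```python
import numpy as np

def run_subquery(chunk):
    """Stand-in for a peer node evaluating its part of an array query
    (here, an aggregate sum over its local slice)."""
    return chunk.sum()

def parallel_query(array, n_nodes):
    """Conceptual intra-query parallelisation: the dispatcher divides
    one query into independent sub-queries, 'ships' each to a node,
    and assembles the intermediate results."""
    chunks = np.array_split(array, n_nodes)   # partition the array range
    partials = [run_subquery(c) for c in chunks]
    return sum(partials)

data = np.arange(1000.0)
total = parallel_query(data, n_nodes=4)  # same answer as data.sum()
```

Because the sub-queries are independent, the speed-up scales with the number of nodes up to the limits set by the network and data placement, the factors the abstract lists.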
Panepinto, Julie A; Torres, Sylvia; Bendo, Cristiane B; McCavit, Timothy L; Dinu, Bogdan; Sherman-Bien, Sandra; Bemrich-Stolz, Christy; Varni, James W
2014-01-01
Sickle cell disease (SCD) is an inherited blood disorder characterized by a chronic hemolytic anemia that can contribute to fatigue and global cognitive impairment in patients. The study objective was to report on the feasibility, reliability, and validity of the PedsQL™ Multidimensional Fatigue Scale in SCD for pediatric patient self-report ages 5-18 years and parent proxy-report for ages 2-18 years. This was a cross-sectional multi-site study whereby 240 pediatric patients with SCD and 303 parents completed the 18-item PedsQL™ Multidimensional Fatigue Scale. Participants also completed the PedsQL™ 4.0 Generic Core Scales. The PedsQL™ Multidimensional Fatigue Scale evidenced excellent feasibility, excellent reliability for the Total Scale Scores (patient self-report α = 0.90; parent proxy-report α = 0.95), and acceptable reliability for the three individual scales (patient self-report α = 0.77-0.84; parent proxy-report α = 0.90-0.97). Intercorrelations of the PedsQL™ Multidimensional Fatigue Scale with the PedsQL™ Generic Core Scales were predominantly in the large (≥0.50) range, supporting construct validity. PedsQL™ Multidimensional Fatigue Scale Scores were significantly worse with large effects sizes (≥0.80) for patients with SCD than for a comparison sample of healthy children, supporting known-groups discriminant validity. Confirmatory factor analysis demonstrated an acceptable to excellent model fit in SCD. The PedsQL™ Multidimensional Fatigue Scale demonstrated acceptable to excellent measurement properties in SCD. The results demonstrate the relative severity of fatigue symptoms in pediatric patients with SCD, indicating the potential clinical utility of multidimensional assessment of fatigue in patients with SCD in clinical research and practice. © 2013 Wiley Periodicals, Inc.
PedsQL™ Multidimensional Fatigue Scale in Sickle Cell Disease: Feasibility, Reliability and Validity
Panepinto, Julie A.; Torres, Sylvia; Bendo, Cristiane B.; McCavit, Timothy L.; Dinu, Bogdan; Sherman-Bien, Sandra; Bemrich-Stolz, Christy; Varni, James W.
2013-01-01
Background Sickle cell disease (SCD) is an inherited blood disorder characterized by a chronic hemolytic anemia that can contribute to fatigue and global cognitive impairment in patients. The study objective was to report on the feasibility, reliability, and validity of the PedsQL™ Multidimensional Fatigue Scale in SCD for pediatric patient self-report ages 5–18 years and parent proxy-report for ages 2–18 years. Procedure This was a cross-sectional multi-site study whereby 240 pediatric patients with SCD and 303 parents completed the 18-item PedsQL™ Multidimensional Fatigue Scale. Participants also completed the PedsQL™ 4.0 Generic Core Scales. Results The PedsQL™ Multidimensional Fatigue Scale evidenced excellent feasibility, excellent reliability for the Total Scale Scores (patient self-report α = 0.90; parent proxy-report α = 0.95), and acceptable reliability for the three individual scales (patient self-report α = 0.77–0.84; parent proxy-report α = 0.90–0.97). Intercorrelations of the PedsQL™ Multidimensional Fatigue Scale with the PedsQL™ Generic Core Scales were predominantly in the large (≥ 0.50) range, supporting construct validity. PedsQL™ Multidimensional Fatigue Scale Scores were significantly worse with large effects sizes (≥0.80) for patients with SCD than for a comparison sample of healthy children, supporting known-groups discriminant validity. Confirmatory factor analysis demonstrated an acceptable to excellent model fit in SCD. Conclusions The PedsQL™ Multidimensional Fatigue Scale demonstrated acceptable to excellent measurement properties in SCD. The results demonstrate the relative severity of fatigue symptoms in pediatric patients with SCD, indicating the potential clinical utility of multidimensional assessment of fatigue in patients with SCD in clinical research and practice. PMID:24038960
Coastal Seabed Mapping with Hyperspectral and Lidar data
NASA Astrophysics Data System (ADS)
Taramelli, A.; Valentini, E.; Filipponi, F.; Cappucci, S.
2017-12-01
A synoptic view of the coastal seascape and its dynamics requires the quantitative ability to dissect the different components of a complex seafloor, where a mixture of geo-biological facies determines geomorphological features and their coverage. The present study uses an analytical approach that takes advantage of a multidimensional model to integrate different data sources, from airborne hyperspectral and LiDAR remote sensing and from in situ measurements, to detect anthropogenic features and ecological `tipping points' in coastal seafloors. The proposed approach has the ability to generate coastal seabed maps using: 1) a multidimensional dataset to account for radiometric and morphological properties of waters and the seafloor; 2) a field spectral library to assimilate the high environmental variability into the multidimensional model; and 3) a final classification scheme to represent the spatial gradients in the seafloor. The spatial pattern of the response to anthropogenic forcing may be indistinguishable from patterns of natural variability. It is argued that this novel approach to defining tipping points following anthropogenic impacts could be most valuable in the management of natural resources and the economic development of coastal areas worldwide. Examples are reported from different sites of the Mediterranean Sea, from both Marine Protected Areas and unprotected areas.
PCA feature extraction for change detection in multidimensional unlabeled data.
Kuncheva, Ludmila I; Faithfull, William J
2014-01-01
When classifiers are deployed in real-world applications, it is assumed that the distribution of the incoming data matches the distribution of the data used to train the classifier. This assumption is often incorrect, which necessitates some form of change detection or adaptive classification. While there has been a lot of work on change detection based on the classification error monitored over the course of the operation of the classifier, finding changes in multidimensional unlabeled data is still a challenge. Here, we propose to apply principal component analysis (PCA) for feature extraction prior to the change detection. Supported by a theoretical example, we argue that the components with the lowest variance should be retained as the extracted features because they are more likely to be affected by a change. We chose a recently proposed semiparametric log-likelihood change detection criterion that is sensitive to changes in both mean and variance of the multidimensional distribution. An experiment with 35 datasets and an illustration with a simple video segmentation demonstrate the advantage of using extracted features compared to raw data. Further analysis shows that feature extraction through PCA is beneficial, specifically for data with multiple balanced classes.
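A minimal numpy sketch of the core idea above: retain the lowest-variance principal components of the reference data and monitor incoming windows in that subspace. For brevity it scores a simple mean shift in the retained subspace rather than the semiparametric log-likelihood criterion used in the paper; all names and thresholds are illustrative.

```python
# Sketch (assumed details, not the paper's exact pipeline): fit PCA on
# reference data, keep the LOWEST-variance components, and score a window
# of new data by its mean shift in that subspace.

import numpy as np

rng = np.random.default_rng(0)

def fit_low_variance_pca(X, k):
    """Return (mean, components) for the k components with smallest variance."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[-k:]  # rows of vt are ordered by decreasing variance

def mean_shift_score(X_ref, X_win, k=2):
    """Distance between window and reference means in the low-variance subspace."""
    mu, W = fit_low_variance_pca(X_ref, k)
    z_ref = (X_ref - mu) @ W.T
    z_win = (X_win - mu) @ W.T
    return float(np.linalg.norm(z_win.mean(axis=0) - z_ref.mean(axis=0)))

X_ref = rng.normal(size=(500, 5))          # training-time distribution
X_same = rng.normal(size=(200, 5))         # window from the same distribution
X_shift = rng.normal(size=(200, 5)) + 0.5  # window after a mean change

score_same = mean_shift_score(X_ref, X_same)
score_shift = mean_shift_score(X_ref, X_shift)
```

A change detector would compare such a score against a threshold calibrated on reference windows; the shifted window scores visibly higher than the unshifted one.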
NASA Astrophysics Data System (ADS)
Mukherjee, Sayak; Stewart, David; Stewart, William; Lanier, Lewis L.; Das, Jayajit
2017-08-01
Single-cell responses are shaped by the geometry of signalling kinetic trajectories carved in a multidimensional space spanned by signalling protein abundances. It is, however, challenging to assay a large number (more than 3) of signalling species in live-cell imaging, which makes it difficult to probe single-cell signalling kinetic trajectories in large dimensions. Flow and mass cytometry techniques can measure a large number (4 to more than 40) of signalling species but are unable to track single cells. Thus, cytometry experiments provide detailed time-stamped snapshots of single-cell signalling kinetics. Is it possible to use the time-stamped cytometry data to reconstruct single-cell signalling trajectories? Borrowing the concepts of conserved and slow variables from non-equilibrium statistical physics, we develop an approach to reconstruct signalling trajectories from snapshot data by creating new variables that remain invariant or vary slowly during the signalling kinetics. We apply this approach to reconstruct trajectories using snapshot data obtained from in silico simulations, live-cell imaging measurements, and synthetic flow cytometry datasets. The application of invariants and slow variables to reconstruct trajectories provides a radically different way to track objects using snapshot data. The approach is likely to have implications for solving matching problems in a wide range of disciplines.
A benchmark for comparison of cell tracking algorithms
Maška, Martin; Ulman, Vladimír; Svoboda, David; Matula, Pavel; Matula, Petr; Ederra, Cristina; Urbiola, Ainhoa; España, Tomás; Venkatesan, Subramanian; Balak, Deepak M.W.; Karas, Pavel; Bolcková, Tereza; Štreitová, Markéta; Carthel, Craig; Coraluppi, Stefano; Harder, Nathalie; Rohr, Karl; Magnusson, Klas E. G.; Jaldén, Joakim; Blau, Helen M.; Dzyubachyk, Oleh; Křížek, Pavel; Hagen, Guy M.; Pastor-Escuredo, David; Jimenez-Carretero, Daniel; Ledesma-Carbayo, Maria J.; Muñoz-Barrutia, Arrate; Meijering, Erik; Kozubek, Michal; Ortiz-de-Solorzano, Carlos
2014-01-01
Motivation: Automatic tracking of cells in multidimensional time-lapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this article, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately. Availability and implementation: The challenge Web site (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge. Contact: codesolorzano@unav.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24526711
Emotion Recognition from EEG Signals Using Multidimensional Information in EMD Domain.
Zhuang, Ning; Zeng, Ying; Tong, Li; Zhang, Chi; Zhang, Hanming; Yan, Bin
2017-01-01
This paper introduces a method for feature extraction and emotion recognition based on empirical mode decomposition (EMD). By using EMD, EEG signals are decomposed into Intrinsic Mode Functions (IMFs) automatically. Multidimensional information from the IMFs is utilized as features: the first difference of the time series, the first difference of the phase, and the normalized energy. The performance of the proposed method is verified on a publicly available emotional database. The results show that the three features are effective for emotion recognition. The role of each IMF is investigated, and we find that the high-frequency component IMF1 has a significant effect on detecting different emotional states. The informative electrodes based on the EMD strategy are analyzed. In addition, the classification accuracy of the proposed method is compared with several classical techniques, including fractal dimension (FD), sample entropy, differential entropy, and discrete wavelet transform (DWT). Experimental results on the DEAP dataset demonstrate that our method can improve emotion recognition performance.
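The three IMF features named above can be sketched in numpy. The EMD step itself is omitted (pure sinusoids stand in for IMFs), and the phase is obtained from an FFT-based analytic signal; this illustrates the feature definitions only, not the authors' implementation.

```python
# Sketch of the three per-IMF features: first difference of the time series,
# first difference of the (unwrapped) instantaneous phase, and energy,
# later normalized across IMFs. Sinusoids stand in for real EMD output.

import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal (numpy-only stand-in for a Hilbert transform)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h)

def imf_features(imf):
    """(mean |first difference|, mean |phase difference|, energy) of one IMF."""
    d1 = float(np.mean(np.abs(np.diff(imf))))
    phase = np.unwrap(np.angle(analytic_signal(imf)))
    dphi = float(np.mean(np.abs(np.diff(phase))))
    energy = float(np.sum(imf ** 2))
    return d1, dphi, energy

t = np.linspace(0.0, 1.0, 512, endpoint=False)
imfs = [np.sin(2 * np.pi * 20 * t),   # fast oscillation, stand-in for IMF1
        np.sin(2 * np.pi * 4 * t)]    # slower component
feats = [imf_features(m) for m in imfs]
total_energy = sum(f[2] for f in feats)
norm_energy = [f[2] / total_energy for f in feats]
```

As expected, the higher-frequency stand-in for IMF1 yields a larger mean phase difference, consistent with the paper's observation that IMF1 carries discriminative high-frequency content.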
Horn’s Curve Estimation Through Multi-Dimensional Interpolation
2013-03-01
…complex nature of human behavior has not yet been broached. This is not to say analysts play favorites in reaching conclusions, only that varied… Chapter III, Section 3.7. For now, it is sufficient to say underdetermined data presents technical challenges and all such datasets will be excluded from… database lookup table and then use the method of linear interpolation to instantaneously estimate the unknown points on an as-needed basis (say, from a user…
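The lookup-table-plus-linear-interpolation scheme the report alludes to can be illustrated with a small bilinear interpolator over a precomputed 2-D table (a hypothetical stand-in; the report's tables are multi-dimensional and its variable names are unknown):

```python
# Illustrative bilinear interpolation over a precomputed lookup table:
# locate the enclosing grid cell, then blend its four corner values.

import numpy as np

def bilinear(xg, yg, table, x, y):
    """Bilinear interpolation of `table` (defined on grids xg, yg) at (x, y)."""
    i = np.clip(np.searchsorted(xg, x) - 1, 0, len(xg) - 2)
    j = np.clip(np.searchsorted(yg, y) - 1, 0, len(yg) - 2)
    tx = (x - xg[i]) / (xg[i + 1] - xg[i])
    ty = (y - yg[j]) / (yg[j + 1] - yg[j])
    return ((1 - tx) * (1 - ty) * table[i, j] + tx * (1 - ty) * table[i + 1, j]
            + (1 - tx) * ty * table[i, j + 1] + tx * ty * table[i + 1, j + 1])

xg = np.array([0.0, 1.0, 2.0])
yg = np.array([0.0, 1.0])
table = np.array([[0.0, 1.0],
                  [2.0, 3.0],
                  [4.0, 5.0]])  # table[i, j] = 2*x + y sampled on the grid
val = bilinear(xg, yg, table, 0.5, 0.5)  # exact for a linear function
```

Multilinear interpolation in higher dimensions applies the same corner-blending cell by cell, which is why a dense precomputed table allows near-instantaneous estimates at query time.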
Feng, Li; Axel, Leon; Chandarana, Hersh; Block, Kai Tobias; Sodickson, Daniel K; Otazo, Ricardo
2016-02-01
To develop a novel framework for free-breathing MRI called XD-GRASP, which sorts dynamic data into extra motion-state dimensions using the self-navigation properties of radial imaging and reconstructs the multidimensional dataset using compressed sensing. Radial k-space data are continuously acquired using the golden-angle sampling scheme and sorted into multiple motion states based on respiratory and/or cardiac motion signals derived directly from the data. The resulting undersampled multidimensional dataset is reconstructed using a compressed sensing approach that exploits sparsity along the new dynamic dimensions. The performance of XD-GRASP is demonstrated for free-breathing three-dimensional (3D) abdominal imaging, two-dimensional (2D) cardiac cine imaging, and 3D dynamic contrast-enhanced (DCE) MRI of the liver, comparing against reconstructions without motion sorting in both healthy volunteers and patients. XD-GRASP separates respiratory motion from cardiac motion in cardiac imaging, and respiratory motion from contrast enhancement in liver DCE-MRI, which improves image quality and reduces motion-blurring artifacts. XD-GRASP represents a new use of sparsity for motion compensation and a novel way to handle motion in the context of a continuous acquisition paradigm. Instead of removing or correcting motion, extra motion-state dimensions are reconstructed, which improves image quality and also offers new physiological information of potential clinical value. © 2015 Wiley Periodicals, Inc.
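The sorting of continuously acquired spokes into motion states can be sketched as follows. The respiratory signal here is synthetic, and quantile-based amplitude binning is one plausible reading of the sorting step, not necessarily the authors' exact procedure.

```python
# Sketch: assign each radial spoke to a respiratory motion state by binning
# a per-spoke motion signal into equally populated amplitude ranges.
# Synthetic signal; in practice it is derived from the k-space data itself.

import numpy as np

rng = np.random.default_rng(1)

n_spokes, n_states = 1200, 4
# Surrogate respiratory signal, one sample per acquired spoke.
resp = (np.sin(2 * np.pi * np.arange(n_spokes) / 200)
        + 0.1 * rng.normal(size=n_spokes))

# Quantile edges give (approximately) equally populated motion states,
# so each state is similarly undersampled before reconstruction.
edges = np.quantile(resp, np.linspace(0, 1, n_states + 1))
state = np.clip(np.digitize(resp, edges[1:-1]), 0, n_states - 1)

bins = [np.flatnonzero(state == s) for s in range(n_states)]
```

Each index list in `bins` would select the spokes reconstructed jointly for one motion state, with compressed sensing exploiting sparsity along the new state dimension.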
Martin, Colin R; Redshaw, Maggie
2018-06-01
The 10-item Edinburgh Postnatal Depression Scale (EPDS) is an established screening tool for postnatal depression. Inconsistent findings in factor structure and replication difficulties have limited the scope of development of the measure as a multi-dimensional tool. The current investigation sought to robustly determine the underlying factor structure of the EPDS and the replicability and stability of the most plausible model identified. A between-subjects design was used. EPDS data were collected postpartum from two independent cohorts using identical data capture methods. Datasets were examined with confirmatory factor analysis, model invariance testing and systematic evaluation of relational and internal aspects of the measure. Participants were two samples of postpartum women in England assessed at three months (n = 245) and six months (n = 217). The findings showed a three-factor seven-item model of the EPDS offered an excellent fit to the data, and was observed to be replicable in both datasets and invariant as a function of time point of assessment. Some EPDS sub-scale scores were significantly higher at six months. The EPDS is multi-dimensional and a robust measurement model comprises three factors that are replicable. The potential utility of the sub-scale components identified requires further research to identify a role in contemporary screening practice. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
Neural Network Machine Learning and Dimension Reduction for Data Visualization
NASA Technical Reports Server (NTRS)
Liles, Charles A.
2014-01-01
Neural network machine learning in computer science is a continuously developing field of study. Although neural network models have been developed which can accurately predict a numeric value or nominal classification, a general-purpose method for constructing neural network architecture has yet to be developed. Computer scientists are often forced to rely on a trial-and-error process of developing and improving accurate neural network models. In many cases, models are constructed from a large number of input parameters. Which input parameters have the greatest impact on the prediction of the model is often difficult to surmise, especially when the number of input variables is very high. This challenge is often labeled the "curse of dimensionality" in scientific fields. However, techniques exist for reducing the dimensionality of problems to just two dimensions. Once a problem has been mapped to two dimensions, it can be easily plotted and understood by humans. The ability to visualize a multi-dimensional dataset can provide a means of identifying which input variables have the highest effect on determining a nominal or numeric output. Identifying these variables can provide a better means of training neural network models; models can be more easily and quickly trained using only the input variables which appear to affect the outcome variable. The purpose of this project is to explore varying means of training neural networks and to utilize dimensionality reduction for visualizing and understanding complex datasets.
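Projecting a dataset to two dimensions and reading off per-variable loadings can be sketched with PCA via SVD, one of several possible reduction techniques; the synthetic data and the `influence` heuristic below are illustrative, not taken from the text.

```python
# Sketch: reduce a 6-D dataset to 2-D for plotting, then inspect the
# component loadings to see which input variables drive the projection.

import numpy as np

rng = np.random.default_rng(2)

# Toy dataset: 3 informative inputs driven by one latent factor, 3 noise inputs.
n = 400
latent = rng.normal(size=(n, 1))
X = np.hstack([latent * [3.0, 2.0, 1.5] + 0.1 * rng.normal(size=(n, 3)),
               0.1 * rng.normal(size=(n, 3))])

Xc = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
coords2d = Xc @ vt[:2].T   # 2-D map, ready for a scatter plot
loadings = vt[:2]          # how strongly each input drives each plotted axis

influence = np.abs(loadings).sum(axis=0)  # crude per-variable influence score
```

Variables with large loadings on the plotted components are candidates for the reduced training set the abstract describes; here the first informative input dominates the first axis by construction.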
Filtering NetCDF Files by Using the EverVIEW Slice and Dice Tool
Conzelmann, Craig; Romañach, Stephanie S.
2010-01-01
Network Common Data Form (NetCDF) is a self-describing, machine-independent file format for storing array-oriented scientific data. It was created to provide a common interface between applications and real-time meteorological and other scientific data. Over the past few years, there has been a growing movement within the community of natural resource managers in The Everglades, Fla., to use NetCDF as the standard data container for datasets based on multidimensional arrays. As a consequence, a need surfaced for additional tools to view and manipulate NetCDF datasets, specifically to filter the files by creating subsets of large NetCDF files. The U.S. Geological Survey (USGS) and the Joint Ecosystem Modeling (JEM) group are working to address these needs with applications like the EverVIEW Slice and Dice Tool, which allows users to filter grid-based NetCDF files, thus targeting those data most important to them. The major functions of this tool are as follows: (1) to create subsets of NetCDF files temporally, spatially, and by data value; (2) to view the NetCDF data in table form; and (3) to export the filtered data to a comma-separated value (CSV) file format. The USGS and JEM will continue to work with scientists and natural resource managers across The Everglades to solve complex restoration problems through technological advances.
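The tool's three functions, subsetting by time/space/value, tabular viewing, and CSV export, can be mimicked on a plain numpy array standing in for a NetCDF variable. The real tool operates on NetCDF files; the grids and variable here are hypothetical.

```python
# Sketch of "slice and dice" filtering on a gridded variable:
# 1) temporal subset, 2) spatial subset, 3) filter by data value, then CSV.
# A numpy array stands in for a NetCDF (time, lat, lon) variable.

import csv
import io
import numpy as np

times = np.array([0, 6, 12, 18])               # hours (hypothetical axis)
lats = np.linspace(25.0, 26.0, 3)
lons = np.linspace(-81.0, -80.0, 3)
depth = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)

t_sel = times >= 12          # temporal subset
la_sel = lats >= 25.5        # spatial subset (latitude)
lo_sel = lons <= -80.5       # spatial subset (longitude)
sub = depth[np.ix_(t_sel, la_sel, lo_sel)]

# Value filter, flattened into table rows for viewing/export.
rows = [(float(t), float(la), float(lo), float(v))
        for ti, t in enumerate(times[t_sel])
        for li, la in enumerate(lats[la_sel])
        for oi, lo in enumerate(lons[lo_sel])
        for v in [sub[ti, li, oi]]
        if v > 20.0]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["time", "lat", "lon", "depth"])
writer.writerows(rows)
csv_text = buf.getvalue()
```

With a real file, the arrays would come from a NetCDF reader's coordinate and data variables; the subsetting and export logic is otherwise the same.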
NASA Astrophysics Data System (ADS)
Pregnolato, Marco; Petitta, Marcello; Schneiderbauer, Stefan; Pedoth, Lydia; Iasio, Christian; Kaveckis, Giedrius
2014-05-01
The present investigation aims to contribute to a better understanding of whether and how coarse-scale data can prove useful in a study on the resilience of communities towards natural hazards. The main goal of the work is the exploitation of large datasets in search of indicators and information valuable for resilience research; in particular, of marks in the statistical distribution of events, as well as physical signs on a territory, to be possibly defined as disaster footprints. The approach developed required starting from theoretical considerations about some key concepts, such as footprint and resilience, and the possible influence of different types of adverse events on a territory. In particular, the research focuses on statistical signals that can be identified within datasets concerning the effects of hazardous events against the background of resilience, defined as the "ability of a system and its component parts to anticipate, absorb, accommodate, or recover" from a disaster. The hypothesis for this work was that a disaster footprint could be shown using land features and change maps. The questions linked to this hypothesis were: is it possible to recognize a multi-dimensional footprint on the land, and is it possible to do so using land cover/land use data? In order to answer these questions, this work proposes a synthetic index, named for convenience the Hazard-Territory Index, created to categorize classes of land use/land cover from the CORINE Land Cover maps by means of different approaches, according to the type of hazard. Through the use and elaboration of CORINE Land Cover data, this work investigates whether the land, its use (in a way, the relationship between a territory and the community living on it), and its changes over time can reveal information and results relevant to the analysis of resilience.
The investigation, set up in order to analyse these "signs on a map", led to treating the notion of footprint as a multi-dimensional concept dealing with different temporal scales and dimensions of resilience, and it therefore proposes a definition of disaster footprint as a multi-parametrical and complex impact indicator (or rather an indicator family). The mutual influence between the land, the hazard and the system on the territory presents different aspects that we tried to synthesize into the same index, analyzed differently according to the dimension of the disaster footprint considered, namely: probability of occurrence, susceptibility to harm, and long-term impacts and modifications. The index visualizes the information at national and supra-national scale on maps. Although presenting important theoretical limitations (mainly in the spatial and temporal resolution of the data and in the definition of proxies for physical parameters), the application of this methodology at a supra-national scale has proved useful in the attempt to define the domains of investigation for community resilience studies at a local scale.
Visual Analysis of Cloud Computing Performance Using Behavioral Lines.
Muelder, Chris; Zhu, Biao; Chen, Wei; Zhang, Hongxin; Ma, Kwan-Liu
2016-02-29
Cloud computing is an essential technology for Big Data analytics and services. A cloud computing system is often comprised of a large number of parallel computing and storage devices. Monitoring the usage and performance of such a system is important for efficient operations, maintenance, and security. Tracing every application on a large cloud system is untenable due to scale and privacy issues, but profile data can be collected relatively efficiently by regularly sampling the state of the system, including properties such as CPU load, memory usage, and network usage, creating a set of multivariate time series for each system. Adequate tools for studying such large-scale, multidimensional data are lacking. In this paper, we present a visual analysis approach to understanding and analyzing the performance and behavior of cloud computing systems. Our design is based on similarity measures and a layout method to portray the behavior of each compute node over time. When a large number of behavioral lines are visualized together, distinct patterns often appear, suggesting particular types of performance bottlenecks. The resulting system provides multiple linked views, which allow the user to interactively explore the data by examining the full data or a selected subset at different levels of detail. Our case studies, which use datasets collected from two different cloud systems, show that this visual analysis approach is effective in identifying trends and anomalies of the systems.
Clustering and Network Analysis of Reverse Phase Protein Array Data.
Byron, Adam
2017-01-01
Molecular profiling of proteins and phosphoproteins using a reverse phase protein array (RPPA) platform, with a panel of target-specific antibodies, enables the parallel, quantitative proteomic analysis of many biological samples in a microarray format. Hence, RPPA analysis can generate a high volume of multidimensional data that must be effectively interrogated and interpreted. A range of computational techniques for data mining can be applied to detect and explore data structure and to form functional predictions from large datasets. Here, two approaches for the computational analysis of RPPA data are detailed: the identification of similar patterns of protein expression by hierarchical cluster analysis and the modeling of protein interactions and signaling relationships by network analysis. The protocols use freely available, cross-platform software, are easy to implement, and do not require any programming expertise. Serving as data-driven starting points for further in-depth analysis, validation, and biological experimentation, these and related bioinformatic approaches can accelerate the functional interpretation of RPPA data.
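Hierarchical cluster analysis of an RPPA-like expression matrix can be sketched with a naive average-linkage agglomeration in numpy. Real analyses would use dedicated clustering packages; the data here are synthetic and the function names are illustrative.

```python
# Sketch: agglomerative (average-linkage) clustering of a small
# samples x proteins matrix, as in hierarchical clustering of RPPA data.
# Naive O(n^3) merging -- fine for illustration, not for production.

import numpy as np

def average_linkage_clusters(X, n_clusters):
    """Merge the closest pair of clusters (mean pairwise distance) until
    n_clusters remain; returns lists of row indices."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy "RPPA" matrix: 6 samples x 4 proteins, two clear expression groups.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.1, size=(3, 4)),
               rng.normal(2.0, 0.1, size=(3, 4))])

groups = average_linkage_clusters(X, 2)
labels = sorted(sorted(g) for g in groups)
```

The recovered groups correspond to the two simulated expression patterns, the kind of structure a dendrogram over RPPA samples would expose.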
NASA Astrophysics Data System (ADS)
Besic, Nikola; Ventura, Jordi Figueras i.; Grazioli, Jacopo; Gabella, Marco; Germann, Urs; Berne, Alexis
2016-09-01
Polarimetric radar-based hydrometeor classification is the procedure of identifying different types of hydrometeors by exploiting polarimetric radar observations. The main drawback of the existing supervised classification methods, mostly based on fuzzy logic, is a significant dependency on the presumed electromagnetic behaviour of different hydrometeor types: the results of the classification largely rely upon the quality of scattering simulations. The unsupervised approach, in turn, lacks the constraints related to hydrometeor microphysics. The idea of the proposed method is to compensate for these drawbacks by combining the two approaches in a way that microphysical hypotheses can, to a degree, adjust the content of the classes obtained statistically from the observations. This is done by means of an iterative approach, performed offline, which, in a statistical framework, examines clustered representative polarimetric observations by comparing them to the presumed polarimetric properties of each hydrometeor class. Aside from comparing, a routine alters the content of clusters by encouraging further statistical clustering in case of non-identification. By merging all identified clusters, the multi-dimensional polarimetric signatures of various hydrometeor types are obtained for each of the studied representative datasets, i.e. for each radar system of interest. These are depicted by sets of centroids, which are then employed in operational labelling of different hydrometeors. The method has been applied to three C-band datasets, each acquired by a different operational radar of the MeteoSwiss Rad4Alp network, as well as to two X-band datasets acquired by two research mobile radars. The results are discussed through a comparative analysis that includes corresponding supervised and unsupervised approaches, emphasising the operational potential of the proposed method.
Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications
NASA Astrophysics Data System (ADS)
Maskey, M.; Ramachandran, R.; Miller, J.
2017-12-01
Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.
ERIC Educational Resources Information Center
Longley, Susan L.; Watson, David; Noyes, Russell, Jr.
2005-01-01
Although hypochondriasis is associated with the costly use of unnecessary medical resources, this mental health problem remains largely neglected. A lack of clear conceptual models and valid measures has impeded accurate assessment and hindered progress. The Multidimensional Inventory of Hypochondriacal Traits (MIHT) addresses these deficiencies…
Tensor Train Neighborhood Preserving Embedding
NASA Astrophysics Data System (ADS)
Wang, Wenqi; Aggarwal, Vaneet; Aeron, Shuchin
2018-05-01
In this paper, we propose a Tensor Train Neighborhood Preserving Embedding (TTNPE) to embed multi-dimensional tensor data into a low-dimensional tensor subspace. Novel approaches to solve the optimization problem in TTNPE are proposed. For this embedding, we evaluate the trade-off among classification, computation, and dimensionality reduction (storage) for supervised learning. It is shown that, compared to state-of-the-art tensor embedding methods, TTNPE achieves a superior trade-off among classification, computation, and dimensionality reduction on the MNIST handwritten digit and Weizmann face datasets.
Density Large Deviations for Multidimensional Stochastic Hyperbolic Conservation Laws
NASA Astrophysics Data System (ADS)
Barré, J.; Bernardin, C.; Chetrite, R.
2018-02-01
We investigate the density large deviation function for a multidimensional conservation law in the vanishing viscosity limit, when the probability concentrates on weak solutions of a hyperbolic conservation law. When the mobility and diffusivity matrices are proportional, i.e. an Einstein-like relation is satisfied, the problem has been solved in Bellettini and Mariani (Bull Greek Math Soc 57:31-45, 2010). When this proportionality does not hold, we compute explicitly the large deviation function for a step-like density profile, and we show that the associated optimal current has a non-trivial structure. We also derive a lower bound for the large deviation function, valid for a more general weak solution, and leave the general large deviation function upper bound as a conjecture.
Systems and precision medicine approaches to diabetes heterogeneity: a Big Data perspective.
Capobianco, Enrico
2017-12-01
Big Data, and in particular Electronic Health Records, provide the medical community with a great opportunity to analyze multiple pathological conditions at an unprecedented depth for many complex diseases, including diabetes. How can we draw inferences about diabetes from large heterogeneous datasets? A possible solution is provided by invoking next-generation computational methods and data analytics tools within systems medicine approaches. By deciphering the multi-faceted complexity of biological systems, the potential of emerging diagnostic tools and therapeutic functions can ultimately be revealed. In diabetes, a multidimensional approach to data analysis is needed to better understand the disease conditions, trajectories, and associated comorbidities. Elucidation of multidimensionality comes from the analysis of factors such as disease phenotypes, marker types, and biological motifs, while seeking to make use of multiple levels of information, including genetics, omics, clinical data, and environmental and lifestyle factors. Examining the synergy between multiple dimensions represents a challenge. In this regard, Big Data fuels the rise of Precision Medicine by allowing an increasing number of descriptions to be captured from individuals. Thus, data curation and analyses should be designed to deliver highly accurate predicted risk profiles and treatment recommendations. It is important to establish linkages between systems and precision medicine in order to translate their principles into clinical practice. Equally, to realize their full potential, the multiple dimensions involved must be able to process information ensuring interexchange, reducing ambiguities and redundancies, and ultimately improving health care solutions by introducing clinical decision support systems focused on reclassified phenotypes (or digital biomarkers) and community-driven patient stratifications.
NASA Astrophysics Data System (ADS)
Casamayou-Boucau, Yannick; Ryder, Alan G.
2017-09-01
Anisotropy resolved multidimensional emission spectroscopy (ARMES) provides valuable insights into multi-fluorophore proteins (Groza et al 2015 Anal. Chim. Acta 886 133-42). Fluorescence anisotropy adds to the multidimensional fluorescence dataset information about the physical size of the fluorophores and/or the rigidity of the surrounding micro-environment. The first ARMES studies used standard thin film polarizers (TFP) that had negligible transmission between 250 and 290 nm, preventing accurate measurement of intrinsic protein fluorescence from tyrosine and tryptophan. Replacing TFP with pairs of broadband wire grid polarizers enabled standard fluorescence spectrometers to accurately measure anisotropies between 250 and 300 nm, which was validated with solutions of perylene in the UV and Erythrosin B and Phloxine B in the visible. In all cases, anisotropies were accurate to better than ±1% when compared to literature measurements made with Glan Thompson or TFP polarizers. Better dual wire grid polarizer UV transmittance and the use of excitation-emission matrix measurements for ARMES required complete Rayleigh scatter elimination. This was achieved by chemometric modelling rather than classical interpolation, which enabled the acquisition of pure anisotropy patterns over wider spectral ranges. In combination, these three improvements permit the accurate implementation of ARMES for studying intrinsic protein fluorescence.
The Cyclic Nature of Problem Solving: An Emergent Multidimensional Problem-Solving Framework
ERIC Educational Resources Information Center
Carlson, Marilyn P.; Bloom, Irene
2005-01-01
This paper describes the problem-solving behaviors of 12 mathematicians as they completed four mathematical tasks. The emergent problem-solving framework draws on the large body of research, as grounded by and modified in response to our close observations of these mathematicians. The resulting "Multidimensional Problem-Solving Framework" has four…
Tensor-driven extraction of developmental features from varying paediatric EEG datasets.
Kinney-Lang, Eli; Spyrou, Loukianos; Ebied, Ahmed; Chin, Richard Fm; Escudero, Javier
2018-05-21
Constant changes in developing children's brains can pose a challenge in EEG-dependent technologies. Advancing signal processing methods to identify developmental differences in paediatric populations could help improve function and usability of such technologies. Taking advantage of the multi-dimensional structure of EEG data through tensor analysis may offer a framework for extracting relevant developmental features of paediatric datasets. A proof of concept is demonstrated through identifying latent developmental features in resting-state EEG. Approach. Three paediatric datasets (n = 50, 17, 44) were analyzed using a two-step constrained parallel factor (PARAFAC) tensor decomposition. Subject age was used as a proxy measure of development. Classification used support vector machines (SVM) to test if PARAFAC identified features could predict subject age. The results were cross-validated within each dataset. Classification analysis was complemented by visualization of the high-dimensional feature structures using t-distributed Stochastic Neighbour Embedding (t-SNE) maps. Main Results. Development-related features were successfully identified for the developmental conditions of each dataset. SVM classification showed the identified features could accurately predict subject age at a level significantly above chance for both healthy and impaired populations. t-SNE maps revealed suitable tensor factorization was key in extracting the developmental features. Significance. The described methods are a promising tool for identifying latent developmental features occurring throughout childhood EEG. © 2018 IOP Publishing Ltd.
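The PARAFAC (canonical polyadic) decomposition at the heart of this pipeline can be sketched with a minimal alternating-least-squares implementation in NumPy. This is an illustrative stand-in, not the two-step constrained variant used in the paper, and the tensor dimensions here are hypothetical:

```python
import numpy as np

def unfold(T, mode):
    """Matricize a 3-way tensor T along the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of two factor matrices."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(T, rank, n_iter=200, seed=0):
    """Rank-`rank` PARAFAC/CP decomposition of a 3-way tensor by ALS."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((d, rank)) for d in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            A, B = [factors[m] for m in range(3) if m != mode]
            V = (A.T @ A) * (B.T @ B)  # Gram matrix of the Khatri-Rao product
            factors[mode] = unfold(T, mode) @ khatri_rao(A, B) @ np.linalg.pinv(V)
    return factors

def cp_reconstruct(factors):
    """Rebuild the tensor from its CP factors."""
    return np.einsum('ir,jr,kr->ijk', *factors)
```

In a setup like the paper's, the subject-mode factor matrix would then serve as the feature matrix handed to an SVM classifier.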
Lesot, Philippe; Kazimierczuk, Krzysztof; Trébosc, Julien; Amoureux, Jean-Paul; Lafon, Olivier
2015-11-01
Unique information about the atom-level structure and dynamics of solids and mesophases can be obtained by the use of multidimensional nuclear magnetic resonance (NMR) experiments. Nevertheless, the acquisition of these experiments often requires long acquisition times. We review here alternative sampling methods, which have been proposed to circumvent this issue in the case of solids and mesophases. Compared to the spectra of solutions, those of solids and mesophases present some specificities because they usually display lower signal-to-noise ratios, non-Lorentzian line shapes, lower spectral resolutions and wider spectral widths. We highlight herein the advantages and limitations of these alternative sampling methods. A first route to accelerate the acquisition time of multidimensional NMR spectra consists in the use of sparse sampling schemes, such as truncated, radial or random sampling ones. These sparsely sampled datasets are generally processed by reconstruction methods differing from the Discrete Fourier Transform (DFT). A host of non-DFT methods have been applied for solids and mesophases, including the G-matrix Fourier transform, the linear least-square procedures, the covariance transform, the maximum entropy and the compressed sensing. A second class of alternative sampling consists in departing from the Jeener paradigm for multidimensional NMR experiments. These non-Jeener methods include Hadamard spectroscopy as well as spatial or orientational encoding of the evolution frequencies. The increasing number of high field NMR magnets and the development of techniques to enhance NMR sensitivity will contribute to widen the use of these alternative sampling methods for the study of solids and mesophases in the coming years. Copyright © 2015 John Wiley & Sons, Ltd.
Generalizing DTW to the multi-dimensional case requires an adaptive approach
Hu, Bing; Jin, Hongxia; Wang, Jun; Keogh, Eamonn
2017-01-01
In recent years Dynamic Time Warping (DTW) has emerged as the distance measure of choice for virtually all time series data mining applications. For example, virtually all applications that process data from wearable devices use DTW as a core sub-routine. This is the result of significant progress in improving DTW’s efficiency, together with multiple empirical studies showing that DTW-based classifiers at least equal (and generally surpass) the accuracy of all their rivals across dozens of datasets. Thus far, most of the research has considered only the one-dimensional case, with practitioners generalizing to the multi-dimensional case in one of two ways, dependent or independent warping. In general, it appears the community believes either that the two ways are equivalent, or that the choice is irrelevant. In this work, we show that this is not the case. The two most commonly used multi-dimensional DTW methods can produce different classifications, and neither one dominates over the other. This seems to suggest that one should learn the best method for a particular application. However, we will show that this is not necessary; a simple, principled rule can be used on a case-by-case basis to predict which of the two methods we should trust at the time of classification. Our method allows us to ensure that classification results are at least as accurate as the better of the two rival methods, and, in many cases, our method is significantly more accurate. We demonstrate our ideas with the most extensive set of multi-dimensional time series classification experiments ever attempted. PMID:29104448
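The dependent-versus-independent distinction described above can be made concrete with a short NumPy sketch (illustrative only; production DTW code adds lower bounding and early abandoning for speed). Dependent warping aligns all dimensions with one shared warping path; independent warping computes DTW per dimension and sums:

```python
import numpy as np

def dtw(x, y, dist):
    """Classic dynamic-programming DTW under an arbitrary point-wise distance."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist(x[i - 1], y[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_dependent(X, Y):
    """DTW_D: one shared warping path over all dimensions (X, Y shape (T, d))."""
    return dtw(X, Y, lambda a, b: np.sum((a - b) ** 2))

def dtw_independent(X, Y):
    """DTW_I: warp each dimension separately, then sum the distances."""
    return sum(dtw(X[:, k], Y[:, k], lambda a, b: (a - b) ** 2)
               for k in range(X.shape[1]))
```

Because independent warping relaxes the shared-path constraint, DTW_I never exceeds DTW_D, yet the two can rank neighbors differently, which is exactly why the paper's adaptive rule for choosing between them matters.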
ArcGIS Framework for Scientific Data Analysis and Serving
NASA Astrophysics Data System (ADS)
Xu, H.; Ju, W.; Zhang, J.
2015-12-01
ArcGIS is a platform for managing, visualizing, analyzing, and serving geospatial data. Scientific data as part of the geospatial data features multiple dimensions (X, Y, time, and depth) and large volume. The multidimensional mosaic dataset (MDMD), a newly enhanced data model in ArcGIS, models multidimensional gridded data (e.g. raster or image) as a hypercube and enables ArcGIS's capabilities to handle large-volume and near-real-time scientific data. Built on top of the geodatabase, the MDMD stores the dimension values and the variables (2D arrays) in a geodatabase table, which allows accessing a slice or slices of the hypercube through a simple query and supports animating changes along the time or vertical dimension using ArcGIS desktop or web clients. Through raster types, MDMD can manage not only netCDF, GRIB, and HDF formats but also many other formats or satellite data. It is scalable and can handle large data volume. The parallel geo-processing engine makes data ingestion fast and easy. A raster function, the definition of a raster processing algorithm, is a very important component of the ArcGIS platform for on-demand raster processing and analysis. Scientific data analytics is achieved through the MDMD and raster function templates which perform on-demand scientific computation with variables ingested in the MDMD. For example, aggregating monthly averages from daily data; computing total rainfall of a year; calculating heat index for forecast data; and identifying fishing habitat zones, etc. Additionally, MDMD with the associated raster function templates can be served through ArcGIS server as image services which provide a framework for on-demand server-side computation and analysis, and the published services can be accessed by multiple clients such as ArcMap, ArcGIS Online, JavaScript, REST, WCS, and WMS. This presentation will focus on the MDMD model and raster processing templates. 
In addition, MODIS land cover, NDFD weather service, and HYCOM ocean model data will be used to illustrate how the ArcGIS platform and MDMD model can facilitate scientific data visualization and analytics, and how the analysis results can be shared with a wider audience through ArcGIS Online and Portal.
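The slice-by-query access pattern that the MDMD's dimension-value table enables can be illustrated with plain NumPy (the dates, grid size, and variable here are hypothetical; the actual geodatabase storage and query layer are not shown):

```python
import numpy as np

# hypothetical daily hypercube: (time, lat, lon)
times = np.arange('2015-01-01', '2015-01-11', dtype='datetime64[D]')  # dimension values
cube = np.arange(times.size * 3 * 4, dtype=float).reshape(times.size, 3, 4)

def slice_at(cube, times, when):
    """Return the 2-D grid slice of the hypercube at a given time value."""
    i = int(np.searchsorted(times, np.datetime64(when)))  # look up the dimension value
    return cube[i]

grid = slice_at(cube, times, '2015-01-03')  # the third daily slice
```

Animating along the time dimension, as the abstract describes, amounts to requesting such slices in sequence.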
Ten years of maintaining and expanding a microbial genome and metagenome analysis system.
Markowitz, Victor M; Chen, I-Min A; Chu, Ken; Pati, Amrita; Ivanova, Natalia N; Kyrpides, Nikos C
2015-11-01
Launched in March 2005, the Integrated Microbial Genomes (IMG) system is a comprehensive data management system that supports multidimensional comparative analysis of genomic data. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets sequenced at the Joint Genome Institute or provided by scientific users, as well as public genome datasets available at the National Center for Biotechnology Information GenBank sequence data archive. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are integrated into the data warehouse using IMG's data integration toolkits. Microbial genome and metagenome application-specific data marts and user interfaces provide access to different subsets of IMG's data and analysis toolkits. This review article revisits IMG's original aims, highlights key milestones reached by the system during the past 10 years, and discusses the main challenges faced by a rapidly expanding system, in particular the complexity of maintaining such a system in an academic setting with limited budgets and computing and data management infrastructure. Copyright © 2015 Elsevier Ltd. All rights reserved.
ERIC Educational Resources Information Center
Meulman, Jacqueline J.; Verboon, Peter
1993-01-01
Points of view analysis, as a way to deal with individual differences in multidimensional scaling, was largely supplanted by the weighted Euclidean model. It is argued that the approach deserves new attention, especially as a technique to analyze group differences. A streamlined and integrated process is proposed. (SLD)
A Method for Generating Reduced-Order Linear Models of Multidimensional Supersonic Inlets
NASA Technical Reports Server (NTRS)
Chicatelli, Amy; Hartley, Tom T.
1998-01-01
Simulation of high speed propulsion systems may be divided into two categories, nonlinear and linear. The nonlinear simulations are usually based on multidimensional computational fluid dynamics (CFD) methodologies and tend to provide high resolution results that show the fine detail of the flow. Consequently, these simulations are large, numerically intensive, and run much slower than real-time. The linear simulations are usually based on large lumping techniques that are linearized about a steady-state operating condition. These simplistic models often run at or near real-time but do not always capture the detailed dynamics of the plant. Under a grant sponsored by the NASA Lewis Research Center, Cleveland, Ohio, a new method has been developed that can be used to generate improved linear models for control design from multidimensional steady-state CFD results. This CFD-based linear modeling technique provides a small perturbation model that can be used for control applications and real-time simulations. It is important to note the utility of the modeling procedure; all that is needed to obtain a linear model of the propulsion system is the geometry and steady-state operating conditions from a multidimensional CFD simulation or experiment. This research represents a beginning step in establishing a bridge between the controls discipline and the CFD discipline so that the control engineer is able to effectively use multidimensional CFD results in control system design and analysis.
A hybrid heuristic for the multiple choice multidimensional knapsack problem
NASA Astrophysics Data System (ADS)
Mansi, Raïd; Alves, Cláudio; Valério de Carvalho, J. M.; Hanafi, Saïd
2013-08-01
In this article, a new solution approach for the multiple choice multidimensional knapsack problem is described. The problem is a variant of the multidimensional knapsack problem where items are divided into classes, and exactly one item per class has to be chosen. Both problems are NP-hard. However, the multiple choice multidimensional knapsack problem appears to be more difficult to solve in part because of its choice constraints. Many real applications lead to very large scale multiple choice multidimensional knapsack problems that can hardly be addressed using exact algorithms. A new hybrid heuristic is proposed that embeds several new procedures for this problem. The approach is based on the resolution of linear programming relaxations of the problem and reduced problems that are obtained by fixing some variables of the problem. The solutions of these problems are used to update the global lower and upper bounds for the optimal solution value. A new strategy for defining the reduced problems is explored, together with a new family of cuts and a reformulation procedure that is used at each iteration to improve the performance of the heuristic. An extensive set of computational experiments is reported for benchmark instances from the literature and for a large set of hard instances generated randomly. The results show that the approach outperforms other state-of-the-art methods described so far, providing the best known solution for a significant number of benchmark instances.
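For concreteness, a brute-force reference solver for tiny multiple choice multidimensional knapsack instances (exactly one item per class, several capacity constraints) can be written as follows. This enumeration is only viable for toy instances; the paper's hybrid heuristic instead solves LP relaxations and reduced problems obtained by fixing variables:

```python
from itertools import product

def mcmkp_exact(classes, capacities):
    """classes[g] is a list of (value, weights) items; choose one item per class.

    Returns (best_value, best_pick) over all feasible selections, or (None, None)
    if no selection satisfies every capacity constraint.
    """
    best_val, best_pick = None, None
    for pick in product(*[range(len(c)) for c in classes]):
        # resource usage of this selection, per knapsack dimension
        used = [sum(classes[g][i][1][d] for g, i in enumerate(pick))
                for d in range(len(capacities))]
        if all(u <= cap for u, cap in zip(used, capacities)):
            val = sum(classes[g][i][0] for g, i in enumerate(pick))
            if best_val is None or val > best_val:
                best_val, best_pick = val, pick
    return best_val, best_pick
```

On a two-class, one-dimension example, classes = [[(5, [3]), (4, [1])], [(6, [4]), (3, [2])]] with capacity [5], the optimum is value 10 by taking the second item of class 1 and the first item of class 2, which a quick hand check confirms.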
Topic modeling for cluster analysis of large biological and medical datasets.
Zhao, Weizhong; Zou, Wen; Chen, James J
2014-01-01
The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. 
Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
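The first of the three methods, highest probable topic assignment, can be sketched with scikit-learn's LDA implementation: fit a topic model to a count matrix, then assign each sample to its most probable topic. The count matrix below is random, purely for illustration; the paper's datasets and preprocessing are not reproduced:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical sample-by-feature count matrix
# (e.g. binarized PFGE band presence counts)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 12))

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X)      # per-sample topic distributions (rows sum to 1)
clusters = doc_topic.argmax(axis=1)   # highest probable topic assignment
```

The other two methods reuse the same fitted model differently: feature selection keeps topic-informative features before conventional clustering, while feature extraction clusters the low-dimensional topic distributions themselves.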
Machine Detection of Enhanced Electromechanical Energy Conversion in PbZr 0.2Ti 0.8O 3 Thin Films
Agar, Joshua C.; Cao, Ye; Naul, Brett; ...
2018-05-28
Many energy conversion, sensing, and microelectronic applications based on ferroic materials are determined by the domain structure evolution under applied stimuli. New hyperspectral, multidimensional spectroscopic techniques now probe dynamic responses at relevant length and time scales to provide an understanding of how these nanoscale domain structures impact macroscopic properties. Such approaches, however, remain limited in use because of the difficulties that exist in extracting and visualizing scientific insights from these complex datasets. Using multidimensional band-excitation scanning probe spectroscopy and adapting tools from both computer vision and machine learning, an automated workflow is developed to featurize, detect, and classify signatures of ferroelectric/ferroelastic switching processes in complex ferroelectric domain structures. This approach enables the identification and nanoscale visualization of varied modes of response and a pathway to statistically meaningful quantification of the differences between those modes. Lastly, among other things, the importance of domain geometry is spatially visualized for enhancing nanoscale electromechanical energy conversion.
Big data analytics workflow management for eScience
NASA Astrophysics Data System (ADS)
Fiore, Sandro; D'Anca, Alessandro; Palazzo, Cosimo; Elia, Donatello; Mariello, Andrea; Nassisi, Paola; Aloisio, Giovanni
2015-04-01
In many domains such as climate and astrophysics, scientific data is often n-dimensional and requires tools that support specialized data types and primitives if it is to be properly stored, accessed, analysed and visualized. Currently, scientific data analytics relies on domain-specific software and libraries providing a huge set of operators and functionalities. However, most of these software tools fail at large scale since they: (i) are desktop based, rely on local computing capabilities and need the data locally; (ii) cannot benefit from available multicore/parallel machines since they are based on sequential codes; (iii) do not provide declarative languages to express scientific data analysis tasks, and (iv) do not provide newer or more scalable storage models to better support the data multidimensionality. Additionally, most of them: (v) are domain-specific, which also means they support a limited set of data formats, and (vi) do not provide workflow support to enable the construction, execution and monitoring of more complex "experiments". The Ophidia project aims at facing most of the challenges highlighted above by providing a big data analytics framework for eScience. Ophidia provides several parallel operators to manipulate large datasets. Some relevant examples include: (i) data sub-setting (slicing and dicing), (ii) data aggregation, (iii) array-based primitives (the same operator applies to all the implemented UDF extensions), (iv) data cube duplication, (v) data cube pivoting, (vi) NetCDF import and export. Metadata operators are available too. Additionally, the Ophidia framework provides array-based primitives to perform data sub-setting, data aggregation (i.e. max, min, avg), array concatenation, algebraic expressions and predicate evaluation on large arrays of scientific data. Bit-oriented plugins have also been implemented to manage binary data cubes. 
Defining processing chains and workflows with tens, hundreds of data analytics operators is the real challenge in many practical scientific use cases. This talk will specifically address the main needs, requirements and challenges regarding data analytics workflow management applied to large scientific datasets. Three real use cases concerning analytics workflows for sea situational awareness, fire danger prevention, climate change and biodiversity will be discussed in detail.
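Primitives like sub-setting (slicing and dicing) and time aggregation reduce to simple array operations over a data cube; the NumPy sketch below illustrates the idea on a hypothetical 360-day (time, lat, lon) cube. This is not Ophidia's actual operator API, just the array semantics its parallel operators implement at scale:

```python
import numpy as np

# hypothetical cube: 360 daily fields on a 4 x 5 grid
cube = np.arange(360 * 4 * 5, dtype=float).reshape(360, 4, 5)

# slicing: one grid cell's full time series
series = cube[:, 0, 0]

# dicing: a spatio-temporal sub-cube
subcube = cube[30:60, 1:3, 2:4]

# aggregation: 12 monthly means from idealized 30-day "months"
monthly = cube.reshape(12, 30, 4, 5).mean(axis=1)
```

A workflow in the abstract's sense chains such operators, e.g. NetCDF import, sub-setting, then aggregation, with the framework handling distribution and monitoring.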
PROC IRT: A SAS Procedure for Item Response Theory
Matlock Cole, Ki; Paek, Insu
2017-01-01
This article reviews the item response theory procedure (PROC IRT) in SAS/STAT 14.1 for conducting item response theory (IRT) analyses of dichotomous and polytomous datasets that are unidimensional or multidimensional. The review provides an overview of available features, including models, estimation procedures, interfacing, input, and output files. A small-scale simulation study evaluates the IRT model parameter recovery of the PROC IRT procedure. The use of the IRT procedure in Statistical Analysis Software (SAS) may be useful for researchers who frequently utilize SAS for analyses, research, and teaching.
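As a reminder of what such a procedure estimates, the two-parameter logistic (2PL) model, one of the standard dichotomous IRT models fit by tools like PROC IRT, gives the probability of a correct response as a function of ability theta, item discrimination a, and item difficulty b. A minimal sketch of the response function (in Python rather than SAS, for illustration):

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: P(correct response | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Estimation (e.g. marginal maximum likelihood, as such procedures typically use) fits a and b per item from observed response patterns; at theta equal to the item difficulty b, the predicted probability is exactly one half.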
The IRI/LDEO Climate Data Library: Helping People use Climate Data
NASA Astrophysics Data System (ADS)
Blumenthal, M. B.; Grover-Kopec, E.; Bell, M.; del Corral, J.
2005-12-01
The IRI Climate Data Library (http://iridl.ldeo.columbia.edu/) is a library of datasets. By library we mean a collection of things, collected from both near and far, designed to make them more accessible for the library's users. Our datasets come from many different sources, many different "data cultures", many different formats. By dataset we mean a collection of data organized as multidimensional dependent variables, independent variables, and sub-datasets, along with the metadata (particularly use-metadata) that makes it possible to interpret the data in a meaningful manner. Ingrid, which provides the infrastructure for the Data Library, is an environment that lets one work with datasets: read, write, request, serve, view, select, calculate, transform, ... . It hides an extraordinary amount of technical detail from the user, letting the user think in terms of manipulations to datasets rather than manipulations of files of numbers. Among other things, this hidden technical detail could be accessing data on servers in other places, doing only the small needed portion of an enormous calculation, or translating to and from a variety of formats and between "data cultures". These operations are presented as a collection of virtual directories and documents on a web server, so that an ordinary web client can instantiate a calculation simply by requesting the resulting document or image. Building on this infrastructure, we (and others) have created collections of dynamically-updated images to facilitate monitoring aspects of the climate system, as well as linking these images to the underlying data. We have also created specialized interfaces to address the particular needs of user groups that IRI needs to support.
Anagnostou, Paolo; Dominici, Valentina; Battaggia, Cinzia; Pagani, Luca; Vilar, Miguel; Wells, R. Spencer; Pettener, Davide; Sarno, Stefania; Boattini, Alessio; Francalacci, Paolo; Colonna, Vincenza; Vona, Giuseppe; Calò, Carla; Destro Bisol, Giovanni; Tofanelli, Sergio
2017-01-01
Human populations are often dichotomized into “isolated” and “open” categories using cultural and/or geographical barriers to gene flow as differential criteria. Although widespread, the use of these alternative categories could obscure further heterogeneity due to inter-population differences in effective size, growth rate, and timing or amount of gene flow. We compared intra and inter-population variation measures combining novel and literature data relative to 87,818 autosomal SNPs in 14 open populations and 10 geographic and/or linguistic European isolates. Patterns of intra-population diversity were found to vary considerably more among isolates, probably due to differential levels of drift and inbreeding. The relatively large effective size estimated for some population isolates challenges the generalized view that they originate from small founding groups. Principal component scores based on measures of intra-population variation of isolated and open populations were found to be distributed along a continuum, with an area of intersection between the two groups. Patterns of inter-population diversity were even closer, as we were able to detect some differences between population groups only for a few multidimensional scaling dimensions. Therefore, different lines of evidence suggest that dichotomizing human populations into open and isolated groups fails to capture the actual relations among their genomic features. PMID:28145502
Clustervision: Visual Supervision of Unsupervised Clustering.
Kwon, Bum Chul; Eysenbach, Ben; Verma, Janu; Ng, Kenney; De Filippi, Christopher; Stewart, Walter F; Perer, Adam
2018-01-01
Clustering, the process of grouping together similar items into distinct partitions, is a common type of unsupervised machine learning that can be useful for summarizing and aggregating complex multi-dimensional data. However, data can be clustered in many ways, and there exists a large body of algorithms designed to reveal different patterns. While having access to a wide variety of algorithms is helpful, in practice, it is quite difficult for data scientists to choose and parameterize algorithms to get the clustering results relevant for their dataset and analytical tasks. To alleviate this problem, we built Clustervision, a visual analytics tool that helps ensure data scientists find the right clustering among the large number of techniques and parameters available. Our system clusters data using a variety of clustering techniques and parameters and then ranks clustering results utilizing five quality metrics. In addition, users can guide the system to produce more relevant results by providing task-relevant constraints on the data. Our visual user interface allows users to find high quality clustering results, explore the clusters using several coordinated visualization techniques, and select the cluster result that best suits their task. We demonstrate this novel approach using a case study with a team of researchers in the medical domain and showcase that our system empowers users to choose an effective representation of their complex data.
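The core ranking idea, run many clusterings, score each with quality metrics, and surface the best, can be sketched with scikit-learn. This sketch uses only the silhouette metric and synthetic two-blob data; Clustervision itself combines five metrics with coordinated visualizations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# hypothetical data: two well-separated groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (25, 2)),
               rng.normal(5.0, 0.3, (25, 2))])

# sweep cluster counts, score each result, rank by quality
results = []
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results.append((silhouette_score(X, labels), k, labels))

results.sort(reverse=True, key=lambda r: r[0])   # best-scoring clustering first
best_score, best_k, best_labels = results[0]
```

On data this cleanly separated, the ranking surfaces k = 2; on real data the top few results, not just the single best, are what a user would want to inspect visually.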
Manifold Learning by Preserving Distance Orders.
Ataer-Cansizoglu, Esra; Akcakaya, Murat; Orhan, Umut; Erdogmus, Deniz
2014-03-01
Nonlinear dimensionality reduction is essential for the analysis and the interpretation of high dimensional data sets. In this manuscript, we propose a distance order preserving manifold learning algorithm that extends the basic mean-squared error cost function used mainly in multidimensional scaling (MDS)-based methods. We develop a constrained optimization problem by assuming explicit constraints on the order of distances in the low-dimensional space. In this optimization problem, as a generalization of MDS, instead of forcing a linear relationship between the distances in the high-dimensional original and low-dimensional projection space, we learn a non-decreasing relation approximated by radial basis functions. We compare the proposed method with existing manifold learning algorithms using synthetic datasets based on the commonly used residual variance and proposed percentage of violated distance orders metrics. We also perform experiments on a retinal image dataset used in Retinopathy of Prematurity (ROP) diagnosis.
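The MDS baseline that the proposed cost function extends can be sketched as classical (Torgerson) MDS, which double-centers the squared distance matrix and embeds via its top eigenpairs. This is an illustrative baseline, not the authors' distance-order-preserving algorithm; when the input distances are Euclidean it recovers them exactly:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed an n x n Euclidean distance matrix into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]              # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

The paper's method replaces the implicit linear distance relationship here with a learned non-decreasing mapping, so only the *order* of distances is preserved in the low-dimensional space.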
J. McKean; D. Tonina; C. Bohn; C. W. Wright
2014-01-01
New remote sensing technologies and improved computer performance now allow numerical flow modeling over large stream domains. However, there has been limited testing of whether channel topography can be remotely mapped with the accuracy necessary for such modeling. We assessed the ability of the Experimental Advanced Airborne Research Lidar to support a multi-dimensional...
New modes of electron microscopy for materials science enabled by fast direct electron detectors
NASA Astrophysics Data System (ADS)
Minor, Andrew
There is an ongoing revolution in the development of electron detector technology that has enabled modes of electron microscopy imaging that had previously only been theorized. The age of electron microscopy as a tool for imaging is quickly giving way to a new frontier of multidimensional datasets to be mined. These improvements in electron detection have enabled cryo-electron microscopy to resolve the three-dimensional structures of non-crystallized proteins, revolutionizing structural biology. In the physical sciences, direct electron detectors have enabled four-dimensional reciprocal-space maps of materials at atomic resolution, providing all the structural information about nanoscale materials in one experiment. This talk will highlight the impact of direct electron detectors for materials science, including a new method of scanning nanobeam diffraction. With faster detectors we can take a series of 2D diffraction patterns at each position in a 2D STEM raster scan, resulting in a four-dimensional data set. For thin-film analysis, direct electron detectors hold the potential to enable strain, polarization, composition, and electrical-field mapping over relatively large fields of view, all from a single experiment.
NASA Astrophysics Data System (ADS)
Leka, K. D.; Barnes, Graham; Wagner, Eric
2018-04-01
A classification infrastructure built upon Discriminant Analysis (DA) has been developed at NorthWest Research Associates for examining the statistical differences between samples of two known populations. Originally developed to examine the physical differences between flare-quiet and flare-imminent solar active regions, the infrastructure is described herein in some detail, including: parametrization of large datasets, schemes for handling "null" and "bad" data in multi-parameter analysis, application of non-parametric multi-dimensional DA, an extension through Bayes' theorem to probabilistic classification, and methods invoked for evaluating classifier success. The classifier infrastructure is applicable to a wide range of scientific questions in solar physics. We demonstrate its application to the question of distinguishing flare-imminent from flare-quiet solar active regions, updating results from the original publications that were based on different data and much smaller sample sizes. Finally, as a demonstration of "Research to Operations" efforts in the space-weather forecasting context, we present the Discriminant Analysis Flare Forecasting System (DAFFS), a near-real-time, operationally running solar flare forecasting tool that was developed from the research-directed infrastructure.
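The Bayes'-theorem extension mentioned above turns class-conditional densities into a membership probability. A minimal sketch for the parametric, one-dimensional Gaussian case (the infrastructure itself uses non-parametric multi-dimensional density estimates; `posterior` and its arguments are illustrative names):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def posterior(x, mean1, var1, prior1, mean0, var0, prior0):
    """P(class 1 | x) via Bayes' theorem from class-conditional densities."""
    p1 = gaussian_pdf(x, mean1, var1) * prior1
    p0 = gaussian_pdf(x, mean0, var0) * prior0
    return p1 / (p1 + p0)
```

The same posterior formula applies unchanged when the two densities come from a non-parametric estimator, which is the probabilistic-classification step the abstract describes.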
Ptitsyn, Andrey; Hulver, Matthew; Cefalu, William; York, David; Smith, Steven R
2006-12-19
Classification of the large volumes of data produced in a microarray experiment allows for the extraction of important clues as to the nature of a disease. Using the multi-dimensional unsupervised FOREL (FORmal ELement) algorithm, we have re-analyzed three public datasets of skeletal muscle gene expression in connection with insulin resistance and type 2 diabetes (DM2). Our analysis revealed the major line of variation between expression profiles of normal, insulin-resistant, and diabetic skeletal muscle. A cluster of the most "metabolically sound" samples occupied one end of this line. The distance along this line coincided with the classic markers of diabetes risk, namely obesity and insulin resistance, but did not follow the accepted clinical diagnosis of DM2 as defined by the presence or absence of hyperglycemia. Genes implicated in this expression pattern are those controlling skeletal muscle fiber type and glycolytic metabolism. Additionally, myoglobin and hemoglobin were upregulated and ribosomal genes were deregulated in insulin-resistant patients. Our findings are concordant with the changes seen in skeletal muscle with altitude hypoxia. This suggests that hypoxia and a shift to glycolytic metabolism may also drive insulin resistance.
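As background, the core FOREL iteration can be sketched in a few lines: float a sphere of fixed radius to the centroid of the points it covers, peel those points off as a cluster, and repeat. This is a generic sketch of the algorithm family, not the authors' exact implementation:

```python
from math import dist

def forel(points, radius):
    """FOREL clustering: repeatedly move a sphere of fixed radius to the local
    centroid until it stabilizes, then remove the enclosed points as one cluster."""
    remaining = list(points)
    clusters = []
    while remaining:
        center = remaining[0]
        while True:
            inside = [p for p in remaining if dist(p, center) <= radius]
            # Move the sphere to the centroid of the points it currently covers.
            new_center = tuple(sum(c) / len(inside) for c in zip(*inside))
            if dist(new_center, center) < 1e-9:
                break
            center = new_center
        clusters.append(inside)
        remaining = [p for p in remaining if p not in inside]
    return clusters
```

The radius is the single tuning parameter; smaller radii yield more, tighter clusters.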
Compositional variability in Mediterranean archaeofaunas from Upper Paleolithic Southwest Europe
NASA Astrophysics Data System (ADS)
Jones, Emily Lena
2018-03-01
Recent meta-analyses of Upper Paleolithic Southwestern European archaeofaunas (Jones, 2015, 2016) have identified a consistent "Mediterranean" cluster from the Last Glacial Maximum through the early Holocene, suggesting similarities in environment and/or consistency in hunting strategy across this region through time despite radical changes in climate. However, while the archaeofaunas in this cluster all derive from sites located within today's Mediterranean bioclimatic region, many are from locations far from the Mediterranean Sea - Atlantic Portugal, the Spanish Meseta - which today differ significantly from each other in biotic composition. In this paper, I explore clustering (through cluster analysis and non-metric multidimensional scaling) within the Mediterranean archaeofaunal group. I test for the influence of sample size as well as the geographic variables of site elevation, latitude, and longitude on variability in the large-mammal portions of archaeofaunal assemblages. ANOVA shows no relationship between cluster-defined groups and site elevation or longitude; instead, site latitude appears to be a primary contributor to patterning. However, the overall compositional similarity of the Mediterranean archaeofaunas in this dataset suggests more consistency than variability in Upper Paleolithic hunting strategy in this region.
EDA-gram: designing electrodermal activity fingerprints for visualization and feature extraction.
Chaspari, Theodora; Tsiartas, Andreas; Stein Duker, Leah I; Cermak, Sharon A; Narayanan, Shrikanth S
2016-08-01
Wearable technology permeates every aspect of our daily life, increasing the need for reliable and interpretable models for processing the large amounts of biomedical data. We propose the EDA-Gram, a multidimensional fingerprint of the electrodermal activity (EDA) signal, inspired by the widely used notion of the spectrogram. The EDA-Gram is based on the sparse decomposition of EDA over a knowledge-driven set of dictionary atoms. The time axis reflects the analysis frames, the spectral dimension depicts the width of the selected dictionary atoms, and the intensity values are computed from the atom coefficients. In this way, the EDA-Gram incorporates the amplitude and shape of Skin Conductance Responses (SCR), which comprise an essential part of the signal. The EDA-Gram is further used as a foundation for signal-specific feature design. Our results indicate that the proposed representation can accentuate fine-grain signal fluctuations, which might not always be apparent through simple visual inspection. Statistical analysis and classification/regression experiments further suggest that the derived features can differentiate between multiple arousal levels and stress-eliciting environments across two datasets.
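The sparse-decomposition step can be illustrated with plain matching pursuit over a toy orthonormal dictionary; the actual EDA-Gram dictionary contains knowledge-driven SCR-shaped atoms and the abstract does not specify the solver, so this is only an assumed, simplified stand-in:

```python
def matching_pursuit(signal, atoms, n_iter=3):
    """Greedy sparse decomposition: repeatedly pick the (assumed unit-norm)
    atom most correlated with the residual and subtract its projection."""
    residual = list(signal)
    chosen = []  # (atom index, coefficient) pairs, most energetic first
    for _ in range(n_iter):
        best_i, best_c = max(
            ((i, sum(r * a for r, a in zip(residual, atom)))
             for i, atom in enumerate(atoms)),
            key=lambda ic: abs(ic[1]))
        chosen.append((best_i, best_c))
        residual = [r - best_c * a for r, a in zip(residual, atoms[best_i])]
    return chosen, residual
```

In the EDA-Gram picture, the selected atom indices supply the "spectral" axis (atom width) and the coefficients supply the intensity values.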
NASA Astrophysics Data System (ADS)
Shiklomanov, A. I.; Okladnikov, I.; Gordov, E. P.; Proussevitch, A. A.; Titov, A. G.
2016-12-01
Presented is a collaborative project carried out by a joint team of researchers from the Institute of Monitoring of Climatic and Ecological Systems, Russia and the Earth Systems Research Center, University of New Hampshire, USA. Its main objective is the development of a hardware and software prototype of a Distributed Research Center (DRC) for monitoring and projecting regional climatic changes and their impacts on the environment over the Northern extratropical areas. In the framework of the project, new approaches to "cloud" processing and analysis of large geospatial datasets (big geospatial data) are being developed. It will be deployed on technical platforms of both institutions and applied in research on climate change and its consequences. Datasets available at NCEI and IMCES include multidimensional arrays of climatic, environmental, demographic, and socio-economic characteristics. The project is aimed at solving several major research and engineering tasks: 1) structure analysis of huge heterogeneous climate and environmental geospatial datasets used in the project, their preprocessing and unification; 2) development of a new distributed storage and processing model based on a "shared nothing" paradigm; 3) development of a dedicated database of metadata describing the geospatial datasets used in the project; 4) development of a dedicated geoportal and a high-end graphical frontend providing an intuitive user interface, internet-accessible online tools for analysis of geospatial data, and web services for interoperability with other geoprocessing software packages. The DRC will operate as a single access point to distributed archives of spatial data and online tools for their processing. A flexible modular computational engine running verified data processing routines will provide solid results of geospatial data analysis. The "cloud" data analysis and visualization approach will guarantee access to the DRC online tools and data from all over the world.
Additionally, data processing results will be exported through WMS and WFS services to ensure interoperability. Financial support of this activity by the RF Ministry of Education and Science under Agreement 14.613.21.0037 (RFMEFI61315X0037) and by the Iola Hubbard Climate Change Endowment is acknowledged.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jasper, Ahren
2015-04-14
The appropriateness of treating crossing seams of electronic states of different spins as nonadiabatic transition states in statistical calculations of spin-forbidden reaction rates is considered. We show that the spin-forbidden reaction coordinate, the nuclear coordinate perpendicular to the crossing seam, is coupled to the remaining nuclear degrees of freedom. We found that this coupling gives rise to multidimensional effects that are not typically included in statistical treatments of spin-forbidden kinetics. Three qualitative categories of multidimensional effects may be identified: static multidimensional effects due to the geometry-dependence of the local shape of the crossing seam and of the spin–orbit coupling, dynamical multidimensional effects due to energy exchange with the reaction coordinate during the seam crossing, and nonlocal (history-dependent) multidimensional effects due to interference of the electronic variables at second, third, and later seam crossings. Nonlocal multidimensional effects are intimately related to electronic decoherence, where electronic dephasing acts to erase the history of the system. A semiclassical model based on short-time full-dimensional trajectories that includes all three multidimensional effects as well as a model for electronic decoherence is presented. The results of this multidimensional nonadiabatic statistical theory (MNST) for the ³O + CO → CO₂ reaction are compared with the results of statistical theories employing one-dimensional (Landau–Zener and weak coupling) models for the transition probability and with those calculated previously using multistate trajectories. The MNST method is shown to accurately reproduce the multistate decay-of-mixing trajectory results, so long as consistent thresholds are used. Furthermore, the MNST approach has several advantages over multistate trajectory approaches and is more suitable in chemical kinetics calculations at low temperatures and for complex systems.
The error in statistical calculations that neglect multidimensional effects is shown to be as large as a factor of 2 for this system, with static multidimensional effects identified as the largest source of error.
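For context, the one-dimensional Landau–Zener model referenced above gives the single-passage probability in its standard textbook form (quoted here as general background, not taken from this source):

```latex
P_{\mathrm{LZ}} = \exp\!\left(-\frac{2\pi H_{12}^{2}}{\hbar\, v\, |\Delta F|}\right)
```

where $H_{12}$ is the coupling matrix element between the two states at the crossing (here, the spin–orbit coupling), $v$ the velocity along the reaction coordinate, and $|\Delta F|$ the magnitude of the difference in slopes of the crossing diabatic potentials. $P_{\mathrm{LZ}}$ is the probability of staying on the initial diabatic (spin) state in a single passage; the spin-change probability is its complement, $1 - P_{\mathrm{LZ}}$.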
Multidimensional poverty and child survival in India.
Mohanty, Sanjay K
2011-01-01
Though the concept of multidimensional poverty has been acknowledged across disciplines (among economists, public health professionals, development thinkers, social scientists, policy makers, and international organizations) and included in the development agenda, its measurement and application are still limited. Using unit data from the National Family and Health Survey 3, India, this paper measures poverty in multidimensional space and examines the linkages of multidimensional poverty with child survival. Multidimensional poverty is measured in the dimensions of knowledge, health, and wealth, and child survival is measured with respect to infant mortality and under-five mortality. Descriptive statistics, principal component analyses, and life table methods are used in the analyses. The estimates of multidimensional poverty are robust and the inter-state differentials are large. While the infant mortality rate and under-five mortality rate are disproportionately higher among the abject poor compared to the non-poor, there are no significant differences in child survival among the educationally, economically, and health poor at the national level. State patterns in child survival among the education-, economic-, and health-poor are mixed. Use of multidimensional poverty measures helps to identify the abject poor, who are unlikely to come out of the poverty trap. Child survival is significantly lower among the abject poor compared to the moderate poor and non-poor. We urge the popularization of the concept of multiple deprivations in research and programs so as to reduce poverty and inequality in the population.
Exploring the Dominant Modes of Shoreline Change Along the Central Florida Atlantic Coast
NASA Astrophysics Data System (ADS)
Conlin, M. P.; Adams, P. N.; Jaeger, J. M.; MacKenzie, R.
2017-12-01
Geomorphic change within the littoral zone can place communities, ecosystems, and critical infrastructure at risk as the coastal environment responds to changes in sea level, sediment supply, and wave climate. At NASA's Kennedy Space Center near Cape Canaveral, Florida, chronic shoreline retreat currently threatens critical launch infrastructure, but the spatial (alongshore) pattern of this hazard has not been well documented. During a 5-year monitoring campaign (2009-2014), 86 monthly and rapid-response RTK GPS surveys were completed along an 11 km-long coastal reach in order to monitor and characterize shoreline change and identify links between ocean forcing and beach morphology. Results indicate that the study area can be divided into four behaviorally-distinct alongshore regions based on seasonal variability in shoreline change, mediated by the complex offshore bathymetry of the Cape Canaveral shoals. In addition, seasonal erosion/accretion cycles are regularly interrupted by large erosive storm events, especially during the anomalous wave climates produced during winter Nor'Easter storms. An effective tool for analyzing multidimensional datasets like this one is Empirical Orthogonal Function (EOF) analysis, a technique to determine the dominant spatial and temporal signals within a dataset. Using this approach, it is possible to identify the main time and space scales (modes) along which coastal changes are occurring. Through correlation of these changes with oceanographic forcing mechanisms, we can infer the principal drivers of shoreline change at this site. Here, we document the results of EOF analysis applied to the Cape Canaveral shoreline change dataset, and further correlate the results of this analysis with oceanographic forcings in order to reveal the dominant modes as well as drivers of coastal variability along the central Atlantic coast of Florida.
This EOF-based analysis, which is the first such analysis in the region, is shedding light on the hazards that most affect Florida's coastal communities and the scales at which coastal planners and stakeholders should focus protection efforts.
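As context for readers unfamiliar with EOF analysis: the leading EOF is the first eigenvector of the spatial covariance matrix of the time-by-space anomaly matrix, and its principal component is the projection of each survey onto that spatial pattern. A minimal sketch using power iteration (illustrative only, not the authors' code):

```python
def leading_eof(data, n_iter=200):
    """Leading EOF of a time-by-space data matrix via power iteration
    on the spatial covariance matrix of the anomalies."""
    nt, ns = len(data), len(data[0])
    # Remove the time mean at each location so we work with anomalies.
    means = [sum(row[j] for row in data) / nt for j in range(ns)]
    anom = [[row[j] - means[j] for j in range(ns)] for row in data]
    # Spatial covariance matrix C = A^T A / nt.
    cov = [[sum(anom[t][i] * anom[t][j] for t in range(nt)) / nt
            for j in range(ns)] for i in range(ns)]
    # Power iteration converges to the eigenvector with largest eigenvalue.
    v = [1.0] * ns
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(ns)) for i in range(ns)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Principal component: projection of each time slice onto the EOF.
    pc = [sum(anom[t][j] * v[j] for j in range(ns)) for t in range(nt)]
    return v, pc
```

Higher-order modes follow by deflating the covariance matrix and iterating again; in practice an SVD of the anomaly matrix is the standard route.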
Integrative Exploratory Analysis of Two or More Genomic Datasets.
Meng, Chen; Culhane, Aedin
2016-01-01
Exploratory analysis is an essential step in the analysis of high-throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high-dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower-dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from induced pluripotent stem (iPS) and embryonic stem (ES) cell lines.
Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets
NASA Astrophysics Data System (ADS)
Day-Lewis, F. D.; Slater, L. D.; Johnson, T.
2012-12-01
Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
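Of the approaches reviewed, lagged cross-correlation is the simplest: it scores how well a geophysical time series matches a hydrologic forcing shifted by each candidate lag. A minimal sketch (`xcorr_at_lag` and `best_lag` are illustrative names; it assumes the overlapping segments are not constant):

```python
from statistics import mean, pstdev

def xcorr_at_lag(x, y, lag):
    """Normalized cross-correlation of y against x, with y delayed by `lag` samples."""
    if lag >= 0:
        xs, ys = x[:len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:len(y) + lag]
    mx, my = mean(xs), mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = len(xs) * pstdev(xs) * pstdev(ys)
    return num / den

def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] that maximizes the cross-correlation."""
    return max(range(-max_lag, max_lag + 1), key=lambda l: xcorr_at_lag(x, y, l))
```

The recovered lag is the kind of diagnostic (e.g., travel time of an infiltration front past two sensors) that such time-series analyses extract from monitoring data.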
Artificial Neural Networks: an overview and their use in the analysis of the AMPHORA-3 dataset.
Buscema, Paolo Massimo; Massini, Giulia; Maurelli, Guido
2014-10-01
The Artificial Adaptive Systems (AAS) are theories with which generative algebras are able to create artificial models simulating natural phenomena. Artificial Neural Networks (ANNs) are the most diffused and best-known learning-system models in the AAS. This article provides an overview of ANNs, noting their advantages and limitations for analyzing dynamic, complex, non-linear, multidimensional processes. An example of a specific ANN application to alcohol consumption in Spain during 1961-2006, as part of the EU AMPHORA-3 project, is presented. The study's limitations are noted and future research using ANN methodologies is suggested.
Amira: Multi-Dimensional Scientific Visualization for the GeoSciences in the 21st Century
NASA Astrophysics Data System (ADS)
Bartsch, H.; Erlebacher, G.
2003-12-01
amira (www.amiravis.com) is a general purpose framework for 3D scientific visualization that meets the needs of the non-programmer, the script writer, and the advanced programmer alike. Provided modules may be visually assembled in an interactive manner to create complex visual displays. These modules and their associated user interfaces are controlled either through a mouse, or via an interactive scripting mechanism based on Tcl. We provide interactive demonstrations of the various features of Amira and explain how these may be used to enhance the comprehension of datasets in use in the Earth Sciences community. Its features will be illustrated on scalar and vector fields on grid types ranging from Cartesian to fully unstructured. Specialized extension modules developed by some of our collaborators will be illustrated [1]. These include a module to automatically choose values for salient isosurface identification and extraction, and color maps suitable for volume rendering. During the session, we will present several demonstrations of remote networking, processing of very large spatio-temporal datasets, and various other projects that are underway. In particular, we will demonstrate WEB-IS, a java-applet interface to Amira that allows script editing via the web, and selected data analysis [2]. [1] G. Erlebacher, D. A. Yuen, F. Dubuffet, "Case Study: Visualization and Analysis of High Rayleigh Number -- 3D Convection in the Earth's Mantle", Proceedings of Visualization 2002, pp. 529--532. [2] Y. Wang, G. Erlebacher, Z. A. Garbow, D. A. Yuen, "Web-Based Service of a Visualization Package 'amira' for the Geosciences", Visual Geosciences, 2003.
Crowell, Kevin L; Slysz, Gordon W; Baker, Erin S; LaMarche, Brian L; Monroe, Matthew E; Ibrahim, Yehia M; Payne, Samuel H; Anderson, Gordon A; Smith, Richard D
2013-11-01
The addition of ion mobility spectrometry to liquid chromatography-mass spectrometry experiments requires new, or updated, software tools to facilitate data processing. We introduce a command-line software application, LC-IMS-MS Feature Finder, that searches for molecular ion signatures in multidimensional liquid chromatography-ion mobility spectrometry-mass spectrometry (LC-IMS-MS) data by clustering deisotoped peaks with similar monoisotopic mass, charge state, LC elution time, and ion mobility drift time values. The software application includes an algorithm for detecting and quantifying co-eluting chemical species, including species that exist in multiple conformations that may have been separated in the IMS dimension. LC-IMS-MS Feature Finder is available as a command-line tool for download at http://omics.pnl.gov/software/LC-IMS-MS_Feature_Finder.php. The Microsoft .NET Framework 4.0 is required to run the software. All other dependencies are included with the software package. Usage of this software is limited to non-profit research (see README). rds@pnnl.gov. Supplementary data are available at Bioinformatics online.
Rehman, Zia Ur; Idris, Adnan; Khan, Asifullah
2018-06-01
Protein-Protein Interactions (PPI) play a vital role in cellular processes and are formed through thousands of interactions among proteins. Advancements in proteomics technologies have resulted in huge PPI datasets that need to be systematically analyzed. Protein complexes are the locally dense regions in PPI networks, which play an important role in metabolic pathways and gene regulation. In this work, a novel two-phase protein complex detection and grouping mechanism is proposed. In the first phase, topological and biological features are extracted for each complex, and prediction performance is investigated using a Bagging-based Ensemble classifier (PCD-BEns). Performance evaluation through cross-validation shows improvement in comparison to the CDIP, MCode, CFinder, and PLSMC methods. The second phase employs Multi-Dimensional Scaling (MDS) for the grouping of known complexes by exploring inter-complex relations. It is experimentally observed that the combination of topological and biological features in the proposed approach greatly enhances prediction performance for protein complex detection, which may help to understand various biological processes, whereas the application of MDS-based exploration may assist in grouping potentially similar complexes. Copyright © 2018 Elsevier Ltd. All rights reserved.
Lee, Eugene K; Tran, David D; Keung, Wendy; Chan, Patrick; Wong, Gabriel; Chan, Camie W; Costa, Kevin D; Li, Ronald A; Khine, Michelle
2017-11-14
Accurately predicting cardioactive effects of new molecular entities for therapeutics remains a daunting challenge. Immense research effort has been focused toward creating new screening platforms that utilize human pluripotent stem cell (hPSC)-derived cardiomyocytes and three-dimensional engineered cardiac tissue constructs to better recapitulate human heart function and drug responses. As these new platforms become increasingly sophisticated and high throughput, the drug screens result in larger multidimensional datasets. Improved automated analysis methods must therefore be developed in parallel to fully comprehend the cellular response across a multidimensional parameter space. Here, we describe the use of machine learning to comprehensively analyze 17 functional parameters derived from force readouts of hPSC-derived ventricular cardiac tissue strips (hvCTS) electrically paced at a range of frequencies and exposed to a library of compounds. A generated metric is effective for then determining the cardioactivity of a given drug. Furthermore, we demonstrate a classification model that can automatically predict the mechanistic action of an unknown cardioactive drug. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Multi-Dimensional Pattern Discovery of Trajectories Using Contextual Information
NASA Astrophysics Data System (ADS)
Sharif, M.; Alesheikh, A. A.
2017-10-01
Movement of point objects is highly sensitive to the underlying situations and conditions during the movement, which are known as contexts. Analyzing movement patterns while accounting for contextual information helps to better understand how point objects behave in various contexts and how contexts affect their trajectories. One potential solution for discovering moving-object patterns is analyzing the similarities of their trajectories. This article, therefore, contextualizes the similarity measure of trajectories by not only their spatial footprints but also a notion of internal and external contexts. The dynamic time warping (DTW) method is employed to assess the multi-dimensional similarities of trajectories. Then, the results of similarity searches are utilized in discovering the relative movement patterns of the moving point objects. Several experiments are conducted on real datasets obtained from commercial airplanes and the weather information during the flights. The results demonstrated the robustness of the DTW method in quantifying the commonalities of trajectories and discovering movement patterns with 80% accuracy. Moreover, the results revealed the importance of exploiting contextual information, because it can both enhance and restrict movements.
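The DTW distance underlying the similarity search, in its basic dynamic-programming form (the paper's multi-dimensional, context-aware variant builds on this; the local distance `d` here is a simple absolute difference):

```python
def dtw(a, b, d=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost aligning a[:i] with b[:j].
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = d(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j],      # stretch b against a
                cost[i][j - 1],      # stretch a against b
                cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

For trajectories, `d` would instead compare multi-dimensional samples (position plus context attributes), which is where the article's contextualization enters.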
Knudsen, Anders Dahl; Bennike, Tue; Kjeldal, Henrik; Birkelund, Svend; Otzen, Daniel Erik; Stensballe, Allan
2014-05-30
We describe Condenser, a freely available, comprehensive open-source tool for merging multidimensional quantitative proteomics data from the Matrix Science Mascot Distiller Quantitation Toolbox into a common format ready for subsequent bioinformatic analysis. A number of different relative quantitation technologies, such as metabolic (15)N and amino acid stable isotope incorporation, label-free and chemical-label quantitation are supported. The program features multiple options for curative filtering of the quantified peptides, allowing the user to choose data quality thresholds appropriate for the current dataset, and ensure the quality of the calculated relative protein abundances. Condenser also features optional global normalization, peptide outlier removal, multiple testing and calculation of t-test statistics for highlighting and evaluating proteins with significantly altered relative protein abundances. Condenser provides an attractive addition to the gold-standard quantitative workflow of Mascot Distiller, allowing easy handling of larger multi-dimensional experiments. Source code, binaries, test data set and documentation are available at http://condenser.googlecode.com/. Copyright © 2014 Elsevier B.V. All rights reserved.
Data-driven probability concentration and sampling on manifold
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu
2016-09-15
A new methodology is proposed for generating realizations of a random vector with values in a finite-dimensional Euclidean space that are statistically consistent with a dataset of observations of this vector. The probability distribution of this random vector, while a priori not known, is presumed to be concentrated on an unknown subset of the Euclidean space. A random matrix is introduced whose columns are independent copies of the random vector and for which the number of columns is the number of data points in the dataset. The approach is based on the use of (i) the multidimensional kernel-density estimation method for estimating the probability distribution of the random matrix, (ii) a MCMC method for generating realizations for the random matrix, (iii) the diffusion-maps approach for discovering and characterizing the geometry and the structure of the dataset, and (iv) a reduced-order representation of the random matrix, which is constructed using the diffusion-maps vectors associated with the first eigenvalues of the transition matrix relative to the given dataset. The convergence aspects of the proposed methodology are analyzed and a numerical validation is explored through three applications of increasing complexity. The proposed method is found to be robust to noise levels and data complexity as well as to the intrinsic dimension of data and the size of experimental datasets. Both the methodology and the underlying mathematical framework presented in this paper contribute new capabilities and perspectives at the interface of uncertainty quantification, statistical data analysis, stochastic modeling and associated statistical inverse problems.
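Step (i), multidimensional kernel-density estimation, can be sketched with an isotropic Gaussian kernel; the paper's estimator is more elaborate (it operates on the random matrix and feeds an MCMC sampler), so this is background only:

```python
from math import exp, pi, sqrt

def gaussian_kde(data, h):
    """Multidimensional KDE with an isotropic Gaussian kernel of bandwidth h.
    Returns a function evaluating the estimated density at a point x."""
    n, dim = len(data), len(data[0])
    norm = n * (h * sqrt(2 * pi)) ** dim
    def density(x):
        total = 0.0
        for p in data:
            sq = sum((a - b) ** 2 for a, b in zip(x, p))
            total += exp(-sq / (2 * h * h))
        return total / norm
    return density
```

The bandwidth h controls the concentration of the estimate around the observed dataset, which is the property the methodology exploits when sampling new realizations near the data manifold.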
Mohanty, Sanjay K; Agrawal, Nand Kishor; Mahapatra, Bidhubhusan; Choudhury, Dhrupad; Tuladhar, Sabarnee; Holmgren, E Valdemar
2017-01-18
The economic burden on households due to out-of-pocket expenditure (OOPE) is large in many Asian countries. Though studies suggest increasing household poverty due to high OOPE in developing countries, studies on the association between multidimensional poverty and household health spending are limited. This paper tests the hypothesis that the multidimensionally poor are more likely to incur catastrophic health spending, cutting across countries. Data from the Poverty and Vulnerability Assessment (PVA) Survey carried out by the International Centre for Integrated Mountain Development (ICIMOD) have been used in the analyses. The PVA survey was a comprehensive household survey that covered the mountainous regions of India, Nepal and Myanmar. A total of 2647 households in India, 2310 households in Nepal and 4290 households in Myanmar were covered under the PVA survey. Poverty is measured in a multidimensional framework, including the dimensions of education, income, energy, and water and sanitation, using the Alkire and Foster method. Health shock is measured using the frequency of illness, family sickness and death of any family member in a reference period of one year. Catastrophic health expenditure is defined as health spending exceeding 40% of the household's capacity to pay. Results suggest that about three-fifths of the population in Myanmar, two-fifths of the population in Nepal and one-third of the population in India are multidimensionally poor. About 47% of the multidimensionally poor in India had incurred catastrophic health spending compared to 35% of the multidimensionally non-poor, and the pattern was similar in both Nepal and Myanmar. The odds of incurring catastrophic health spending were 56% higher among the multidimensionally poor than among the multidimensionally non-poor [95% CI: 1.35-1.76].
While health shocks to households are consistently significant predictors of catastrophic health spending cutting across country of residence, the educational attainment of the head of the household is not significant. The multidimensionally poor in the poorer regions are more likely to face health shocks and are less likely to afford professional health services. Increasing government spending on health and increasing households' access to health insurance can reduce catastrophic health spending and multidimensional poverty.
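The Alkire-Foster identification step used above can be sketched as follows. This is a toy illustration: the dimensions, equal weights, and cutoff `k` are hypothetical, not the study's actual specification.

```python
import numpy as np

def alkire_foster_poor(deprivations, weights, k):
    """Flag multidimensionally poor households via the Alkire-Foster method.

    deprivations: (H, D) 0/1 matrix, 1 = household deprived in that dimension.
    weights: length-D dimension weights summing to 1.
    k: poverty cutoff on the weighted deprivation score.
    """
    score = deprivations @ np.asarray(weights)   # weighted deprivation score
    return score >= k                            # identified as multidimensionally poor

# Toy example: 4 households, 3 dimensions (e.g., education, income, water/sanitation)
dep = np.array([[1, 1, 0],
                [0, 0, 1],
                [1, 1, 1],
                [0, 0, 0]])
poor = alkire_foster_poor(dep, weights=[1/3, 1/3, 1/3], k=1/3)
```

With equal weights and `k = 1/3`, any household deprived in at least one of the three dimensions is flagged; raising `k` restricts the count to households with multiple overlapping deprivations.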
Druka, Arnis; Druka, Ilze; Centeno, Arthur G; Li, Hongqiang; Sun, Zhaohui; Thomas, William T B; Bonar, Nicola; Steffenson, Brian J; Ullrich, Steven E; Kleinhofs, Andris; Wise, Roger P; Close, Timothy J; Potokina, Elena; Luo, Zewei; Wagner, Carola; Schweizer, Günther F; Marshall, David F; Kearsey, Michael J; Williams, Robert W; Waugh, Robbie
2008-11-18
A typical genetical genomics experiment results in four separate data sets: genotype, gene expression, higher-order phenotypic data, and metadata that describe the protocols, processing and the array platform. Used in concert, these data sets provide the opportunity to perform genetic analysis at a systems level. Their predictive power is largely determined by the gene expression dataset, where tens of millions of data points can be generated using currently available mRNA profiling technologies. Such large, multidimensional data sets often have value beyond that extracted during their initial analysis and interpretation, particularly if conducted on widely distributed reference genetic materials. Besides quality and scale, access to the data is of primary importance, as accessibility potentially allows the extraction of considerable added value from the same primary dataset by the wider research community. Although the number of genetical genomics experiments in different plant species is rapidly increasing, none to date has been presented in a form that allows quick and efficient on-line testing for possible associations between genes, loci and traits of interest by an entire research community. Using a reference population of 150 recombinant doubled haploid barley lines, we generated novel phenotypic, mRNA abundance and SNP-based genotyping data sets, added them to a considerable volume of legacy trait data and entered them into GeneNetwork (http://www.genenetwork.org). GeneNetwork is a unified on-line analytical environment that enables the user to test genetic hypotheses about how component traits, such as mRNA abundance, may interact to condition more complex biological phenotypes (higher-order traits). Here we describe these barley data sets and demonstrate some of the functionalities GeneNetwork provides as an easily accessible and integrated analytical environment for exploring them.
By integrating barley genotypic, phenotypic and mRNA abundance data sets directly within GeneNetwork's analytical environment we provide simple web access to the data for the research community. In this environment, a combination of correlation analysis and linkage mapping provides the potential to identify and substantiate gene targets for saturation mapping and positional cloning. By integrating datasets from an unsequenced crop plant (barley) in a database that has been designed for an animal model species (mouse) with a well established genome sequence, we prove the importance of the concept and practice of modular development and interoperability of software engineering for biological data sets.
An Effective Methodology for Processing and Analyzing Large, Complex Spacecraft Data Streams
ERIC Educational Resources Information Center
Teymourlouei, Haydar
2013-01-01
The emerging large datasets have made efficient data processing a much more difficult task for traditional methodologies. Invariably, datasets continue to increase rapidly in size with time. The purpose of this research is to give an overview of some of the tools and techniques that can be utilized to manage and analyze large datasets. We…
A patch-based convolutional neural network for remote sensing image classification.
Sharma, Atharva; Liu, Xiuwen; Yang, Xiaojun; Shi, Di
2017-11-01
Availability of accurate land cover information over large areas is essential to global environmental sustainability; digital classification using medium-resolution remote sensing data would provide an effective method to generate the required land cover information. However, the low accuracy of existing per-pixel-based classification methods for medium-resolution data is a fundamental limiting factor. While convolutional neural networks (CNNs) with deep layers have achieved unprecedented improvements in object recognition applications that rely on fine image structures, they cannot be applied directly to medium-resolution data due to the lack of such fine structures. In this paper, considering the spatial relation of a pixel to its neighborhood, we propose a new deep patch-based CNN system tailored for medium-resolution remote sensing data. The system is designed by incorporating distinctive characteristics of medium-resolution data; in particular, the system computes patch-based samples from multidimensional top of atmosphere reflectance data. With a test site from the Florida Everglades area (with a size of 771 square kilometers), the proposed new system has outperformed pixel-based neural network, pixel-based CNN and patch-based neural network by 24.36%, 24.23% and 11.52%, respectively, in overall classification accuracy. By combining the proposed deep CNN and the huge collection of medium-resolution remote sensing data, we believe that much more accurate land cover datasets can be produced over large areas. Copyright © 2017 Elsevier Ltd. All rights reserved.
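The patch-based sampling idea, classifying each pixel from its spatial neighborhood rather than from its single spectral vector, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the array sizes and band count are toy values, and the patch size is assumed odd.

```python
import numpy as np

def extract_patches(image, patch_size):
    """Build patch-based samples from a multiband reflectance image.

    image: (H, W, B) top-of-atmosphere reflectance array. Each output sample
    is the (patch_size x patch_size x B) neighborhood of an interior pixel,
    so a classifier sees the pixel together with its spatial context.
    """
    r = patch_size // 2                       # neighborhood radius
    H, W, B = image.shape
    patches = [image[i - r:i + r + 1, j - r:j + r + 1, :]
               for i in range(r, H - r)       # interior pixels only
               for j in range(r, W - r)]
    return np.stack(patches)

img = np.random.rand(10, 10, 6)               # toy 6-band image
samples = extract_patches(img, patch_size=5)  # one 5x5x6 sample per interior pixel
```

Each sample would then be paired with the land-cover label of its center pixel for CNN training.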
Near-lossless multichannel EEG compression based on matrix and tensor decompositions.
Dauwels, Justin; Srinivasan, K; Reddy, M Ramasubba; Cichocki, Andrzej
2013-05-01
A novel near-lossless compression algorithm for multichannel electroencephalogram (MC-EEG) is proposed based on matrix/tensor decomposition models. MC-EEG is represented in suitable multiway (multidimensional) forms to efficiently exploit temporal and spatial correlations simultaneously. Several matrix/tensor decomposition models are analyzed in view of efficient decorrelation of the multiway forms of MC-EEG. A compression algorithm is built based on the principle of “lossy plus residual coding,” consisting of a matrix/tensor decomposition-based coder in the lossy layer followed by arithmetic coding in the residual layer. This approach guarantees a specifiable maximum absolute error between original and reconstructed signals. The compression algorithm is applied to three different scalp EEG datasets and an intracranial EEG dataset, each with different sampling rate and resolution. The proposed algorithm achieves attractive compression ratios compared to compressing individual channels separately. For similar compression ratios, the proposed algorithm achieves nearly fivefold lower average error compared to a similar wavelet-based volumetric MC-EEG compression algorithm.
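The "lossy plus residual coding" principle, which guarantees a specifiable maximum absolute error, can be sketched as follows. This is a simplified illustration using a truncated SVD as the lossy matrix-decomposition layer and uniform residual quantization; the arithmetic-coding stage of the residual layer is omitted.

```python
import numpy as np

def near_lossless_encode(X, rank, max_abs_err):
    """Encode X as a low-rank lossy layer plus quantized residual symbols.

    Quantizing the residual with step 2*max_abs_err bounds the per-sample
    reconstruction error by max_abs_err.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    lossy = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]        # lossy layer
    q = np.round((X - lossy) / (2 * max_abs_err)).astype(int)  # residual symbols
    return lossy, q

def near_lossless_decode(lossy, q, max_abs_err):
    return lossy + q * (2 * max_abs_err)

X = np.random.default_rng(0).normal(size=(8, 200))         # toy 8-channel EEG segment
lossy, q = near_lossless_encode(X, rank=3, max_abs_err=0.01)
Xhat = near_lossless_decode(lossy, q, max_abs_err=0.01)
```

In the actual algorithm the residual symbols `q` would be entropy-coded (arithmetic coding), and the lossy layer would use the matrix/tensor decomposition models analyzed in the paper.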
Automated Spatiotemporal Analysis of Fibrils and Coronal Rain Using the Rolling Hough Transform
NASA Astrophysics Data System (ADS)
Schad, Thomas
2017-09-01
A technique is presented that automates the direction characterization of curvilinear features in multidimensional solar imaging datasets. It is an extension of the Rolling Hough Transform (RHT) technique presented by Clark, Peek, and Putman ( Astrophys. J. 789, 82, 2014), and it excels at rapid quantification of spatial and spatiotemporal feature orientation even for applications with a low signal-to-noise ratio. It operates on a pixel-by-pixel basis within a dataset and reliably quantifies orientation even for locations not centered on a feature ridge, which is used here to derive a quasi-continuous map of the chromospheric fine-structure projection angle. For time-series analysis, a procedure is developed that uses a hierarchical application of the RHT to automatically derive the apparent motion of coronal rain observed off-limb. Essential to the success of this technique is the formulation presented in this article for the RHT error analysis as it provides a means to properly filter results.
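The core RHT idea, scoring trial orientations at each pixel by the fraction of feature pixels along a rolling line segment, can be sketched as follows. This is a heavily simplified illustration, not Schad's or Clark et al.'s implementation; the unsharp-mask preprocessing, circular window, and error analysis are omitted.

```python
import numpy as np

def rht_orientation(mask, radius, n_theta=36):
    """Per-pixel orientation from a binary feature mask (minimal RHT sketch).

    For each feature pixel, sample the mask along line segments of the given
    radius at n_theta trial angles and return the angle maximizing the
    fraction of 'on' pixels. Non-feature pixels are left as NaN.
    """
    H, W = mask.shape
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    steps = np.arange(-radius, radius + 1)
    out = np.full((H, W), np.nan)
    for i in range(radius, H - radius):
        for j in range(radius, W - radius):
            if not mask[i, j]:
                continue
            best, best_t = -1.0, 0.0
            for t in thetas:
                # Pixel coordinates along a line through (i, j) at angle t
                ii = np.clip(np.round(i + steps * np.sin(t)).astype(int), 0, H - 1)
                jj = np.clip(np.round(j + steps * np.cos(t)).astype(int), 0, W - 1)
                frac = mask[ii, jj].mean()
                if frac > best:
                    best, best_t = frac, t
            out[i, j] = best_t
    return out

mask = np.zeros((21, 21), dtype=bool)
mask[10, :] = True                              # a horizontal ridge
ang = rht_orientation(mask, radius=5)           # angle 0 corresponds to horizontal
```

The full RHT instead accumulates the intensity along each angle into a distribution, from which both the dominant orientation and its uncertainty can be derived.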
Use of multidimensional, multimodal imaging and PACS to support neurological diagnoses
NASA Astrophysics Data System (ADS)
Wong, Stephen T. C.; Knowlton, Robert C.; Hoo, Kent S.; Huang, H. K.
1995-05-01
Technological advances in brain imaging have revolutionized diagnosis in neurology and neurological surgery. Major imaging techniques include magnetic resonance imaging (MRI) to visualize structural anatomy, positron emission tomography (PET) to image metabolic function and cerebral blood flow, magnetoencephalography (MEG) to visualize the location of physiologic current sources, and magnetic resonance spectroscopy (MRS) to measure specific biochemicals. Each of these techniques studies different biomedical aspects of the brain, but an effective means to quantify and correlate the disparate imaging datasets, so as to improve clinical decision-making processes, has been lacking. This paper describes several techniques developed in a UNIX-based neurodiagnostic workstation to aid the noninvasive presurgical evaluation of epilepsy patients. These techniques include online access to the picture archiving and communication systems (PACS) multimedia archive, coregistration of multimodality image datasets, and correlation and quantitation of structural and functional information contained in the registered images. For illustration, we describe the use of these techniques in a patient case of nonlesional neocortical epilepsy. We also present our future work based on preliminary studies.
NASA Astrophysics Data System (ADS)
Li, Hongsong; Lyu, Hang; Liao, Ningfang; Wu, Wenmin
2016-12-01
The bidirectional reflectance distribution function (BRDF) data in the ultraviolet (UV) band are valuable for many applications including cultural heritage, material analysis, surface characterization, and trace detection. We present a BRDF measurement instrument working in the near- and middle-UV spectral range. The instrument includes a collimated UV light source, a rotation stage, a UV imaging spectrometer, and a control computer. The data captured by the proposed instrument describe spatial, spectral, and angular variations of the light scattering from a sample surface. Such a multidimensional dataset for an example sample is captured by the proposed instrument and analyzed by a k-means clustering algorithm to separate surface regions with the same material but different surface roughness. The clustering results show that the angular dimension of the dataset can be exploited for surface roughness characterization. The two clustered BRDFs are fitted to a theoretical BRDF model. The fitting results show good agreement between the measurement data and the theoretical model.
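The clustering step, grouping surface pixels by their angular scattering signatures, can be sketched with a minimal k-means as follows. This is illustrative only; real BRDF features and the instrument's processing chain are not reproduced, and initial centroids are passed explicitly here for determinism.

```python
import numpy as np

def kmeans(X, k, init_idx, iters=20):
    """Minimal k-means for separating pixels by angular BRDF signature.

    X: (N, F) feature matrix, e.g. one row per surface pixel whose columns
    are reflectance values over the measured angles. init_idx: indices of
    the rows used as initial centroids. Returns labels and centroids.
    """
    centers = X[np.asarray(init_idx)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)              # assign each pixel to nearest centroid
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)  # update centroid
    return labels, centers

# Two synthetic "roughness" populations in a 10-angle BRDF feature space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.01, (30, 10)),
               rng.normal(0.8, 0.01, (30, 10))])
labels, _ = kmeans(X, k=2, init_idx=[0, 59])
```

With well-separated angular signatures, the two populations fall cleanly into two clusters, which is the behavior the roughness-separation result above relies on.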
Multidimensional stability of traveling fronts in combustion and non-KPP monostable equations
NASA Astrophysics Data System (ADS)
Bu, Zhen-Hui; Wang, Zhi-Cheng
2018-02-01
This paper is concerned with the multidimensional stability of traveling fronts for the combustion and non-KPP monostable equations. Our study contains two parts: in the first part, we first show that the two-dimensional V-shaped traveling fronts are asymptotically stable in R^{n+2} with n≥1 under any (possibly large) initial perturbations that decay at space infinity, and then, we prove that there exists a solution that oscillates permanently between two V-shaped traveling fronts, which implies that even very small perturbations to the V-shaped traveling front can lead to permanent oscillation. In the second part, we establish the multidimensional stability of planar traveling front in R^{n+1} with n≥1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kwon, Tae-Soon; Yun, Byong-Jo; Euh, Dong-Jin
Multidimensional thermal-hydraulic behavior in the downcomer annulus of a pressurized water reactor (PWR) vessel with a direct vessel injection mode is presented based on the experimental observation in the MIDAS (multidimensional investigation in downcomer annulus simulation) steam-water test facility. From the steady-state test results simulating the late reflood phase of a large-break loss-of-coolant accident (LBLOCA), isothermal lines show very well the multidimensional phenomena of a phasic interaction between steam and water in the downcomer annulus. MIDAS is a steam-water separate-effect test facility, linearly scaled down (1/4.93) from a 1400-MW(electric) PWR-type nuclear reactor and focused on understanding multidimensional thermal-hydraulic phenomena in a downcomer annulus with various types of safety injection during the refill or reflood phase of an LBLOCA. The initial and boundary conditions are scaled from the pretest analysis based on the preliminary calculation using the TRAC code. The superheated steam, with a superheating degree of 80 K at a given downcomer pressure of 180 kPa, is injected equally through three intact cold legs into the downcomer.
Evolution of large amplitude Alfven waves in solar wind plasmas: Kinetic-fluid models
NASA Astrophysics Data System (ADS)
Nariyuki, Y.
2014-12-01
Large amplitude Alfven waves are ubiquitously observed in solar wind plasmas. Mjolhus (JPP, 1976) and Mio et al. (JPSJ, 1976) found that nonlinear evolution of the uni-directional, parallel-propagating Alfven waves can be described by the derivative nonlinear Schrödinger equation (DNLS). Later, the multi-dimensional extension (Mjolhus and Wyller, JPP, 1988; Passot and Sulem, POP, 1993; Gazol et al., POP, 1999) and ion kinetic modification (Mjolhus and Wyller, JPP, 1988; Spangler, POP, 1989; Medvedev and Diamond, POP, 1996; Nariyuki et al., POP, 2013) of DNLS have been reported. Recently, Nariyuki derived a multi-dimensional DNLS from an expanding box model of the Hall-MHD system (Nariyuki, submitted). The set of equations including the nonlinear evolution of compressional wave modes (TDNLS) was derived by Hada (GRL, 1993). DNLS can be derived from TDNLS by rescaling of the variables (Mjolhus, Phys. Scr., 2006). Nariyuki and Hada (JPSJ, 2007) derived a kinetically modified TDNLS by using a simple Landau closure (Hammett and Perkins, PRL, 1990; Medvedev and Diamond, POP, 1996). In the present study, we revisit the ion kinetic modification of the multi-dimensional TDNLS through more rigorous derivations, consistent with the past kinetic modification of DNLS. Although the original TDNLS was derived in the multi-dimensional form, the evolution of waves with finite propagation angles in TDNLS has not received much attention. Applicability of the resultant models to solar wind turbulence is discussed.
ComVisMD - compact visualization of multidimensional data: experimenting with cricket players data
NASA Astrophysics Data System (ADS)
Dandin, Shridhar B.; Ducassé, Mireille
2018-03-01
Database information is multidimensional and often displayed in tabular format (row/column display). Presented in aggregated form, multidimensional data can be used to analyze the records or objects. Online Analytical Processing (OLAP) proposes mechanisms to display multidimensional data in aggregated forms. A choropleth map is a thematic map in which areas are colored in proportion to the measurement of a statistical variable being displayed, such as population density; choropleth maps are used mostly for compact graphical representation of geographical information. We propose a system, ComVisMD, inspired by choropleth maps and the OLAP cube, to visualize multidimensional data in a compact way. Like an OLAP cube, ComVisMD maps attribute a (first dimension, e.g., year started playing cricket) to the vertical direction, colors objects by attribute b (second dimension, e.g., batting average), sizes circles by attribute c (third dimension, e.g., highest score), and prints numbers for attribute d (fourth dimension, e.g., matches played). We illustrate our approach on cricket players' data, namely on two tables, Country and Player, which have a large number of rows and columns: 246 rows and 17 columns for players of one country. ComVisMD's visualization reduces the size of the tabular display by a factor of about 4, allowing users to grasp more information at a time than the bare table display.
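The four-channel encoding described above can be sketched as a simple attribute-to-visual-channel mapping. The field names and scale factors below are hypothetical, chosen only to illustrate the OLAP-cube-style assignment of one attribute per channel; they are not ComVisMD's actual parameters.

```python
def comvismd_mapping(player):
    """Map a player record's four attributes to ComVisMD-style visual channels.

    Channel assignment follows the scheme described above: vertical position,
    color, circle size, and printed number. Scales are illustrative.
    """
    return {
        "row": player["year_started"],                  # vertical position (dim a)
        "color": min(player["batting_avg"] / 60, 1.0),  # 0-1 color intensity (dim b)
        "radius": 3 + player["highest_score"] / 50,     # circle size in px (dim c)
        "label": str(player["matches"]),                # printed number (dim d)
    }

mark = comvismd_mapping({"year_started": 2005, "batting_avg": 53.8,
                         "highest_score": 248, "matches": 90})
```

A renderer would then draw one colored, sized, labeled circle per record, replacing a four-column table row with a single glyph.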
Varni, James W; Limbers, Christine A
2008-02-01
The PedsQL (Pediatric Quality of Life Inventory) is a modular instrument designed to measure health-related quality of life (HRQOL) and disease-specific symptoms in children and adolescents ages 2-18. The PedsQL Multidimensional Fatigue Scale was designed as a generic symptom-specific instrument to measure fatigue in pediatric patients ages 2-18. Since a sizeable number of pediatric patients prefer to remain with their pediatric providers after age 18, the objective of the present study was to determine the feasibility, reliability, and validity of the PedsQL Multidimensional Fatigue Scale in young adults. The 18-item PedsQL Multidimensional Fatigue Scale (General Fatigue, Sleep/Rest Fatigue, and Cognitive Fatigue domains), the PedsQL 4.0 Generic Core Scales Young Adult Version, and the SF-8 Health Survey were completed by 423 university students ages 18-25. The PedsQL Multidimensional Fatigue Scale evidenced minimal missing responses, achieved excellent reliability for the Total Scale Score (alpha = 0.90), distinguished between healthy young adults and young adults with chronic health conditions, was significantly correlated with the relevant PedsQL 4.0 Generic Core Scales and the SF-8 standardized scores, and demonstrated a factor-derived structure largely consistent with the a priori conceptual model. The results demonstrate the measurement properties of the PedsQL Multidimensional Fatigue Scale in a convenience sample of young adult university students. The findings suggest that the PedsQL Multidimensional Fatigue Scale may be utilized in the evaluation of fatigue for a broad age range.
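The reliability statistic reported above (Cronbach's alpha = 0.90 for the Total Scale Score) is computed from item variances and total-score variance; a sketch with synthetic item scores (not the study's data) follows.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (N respondents x K items) score matrix.

    alpha = K/(K-1) * (1 - sum of item variances / variance of total scores).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of summed scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Synthetic 5 respondents x 3 items on a 1-5 scale (highly consistent responses)
scores = np.array([[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4]])
alpha = cronbach_alpha(scores)
```

Values near 1 indicate high internal consistency; the 0.90 reported above for the 18-item scale comfortably exceeds the conventional 0.70 threshold for group comparisons.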
Really big data: Processing and analysis of large datasets
USDA-ARS?s Scientific Manuscript database
Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...
The solution of large multi-dimensional Poisson problems
NASA Technical Reports Server (NTRS)
Stone, H. S.
1974-01-01
The Buneman algorithm for solving Poisson problems can be adapted to solve large Poisson problems on computers with a rotating drum memory so that the computation is done with very little time lost due to rotational latency of the drum.
Predicting clinical outcome of neuroblastoma patients using an integrative network-based approach.
Tranchevent, Léon-Charles; Nazarov, Petr V; Kaoma, Tony; Schmartz, Georges P; Muller, Arnaud; Kim, Sang-Yoon; Rajapakse, Jagath C; Azuaje, Francisco
2018-06-07
One of the main current challenges in computational biology is to make sense of the huge amounts of multidimensional experimental data that are being produced. For instance, large cohorts of patients are often screened using different high-throughput technologies, effectively producing multiple patient-specific molecular profiles for hundreds or thousands of patients. We propose and implement a network-based method that integrates such patient omics data into Patient Similarity Networks. Topological features derived from these networks were then used to predict relevant clinical features. As part of the 2017 CAMDA challenge, we have successfully applied this strategy to a neuroblastoma dataset, consisting of genomic and transcriptomic data. In particular, we observe that models built on our network-based approach perform at least as well as state-of-the-art models. We furthermore explore the effectiveness of various topological features and observe, for instance, that redundant centrality metrics can be combined to build more powerful models. We demonstrate that the networks inferred from omics data contain clinically relevant information and that patient clinical outcomes can be predicted using only network topological data. This article was reviewed by Yang-Yu Liu, Tomislav Smuc and Isabel Nepomuceno.
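The first step, turning patient omics profiles into a Patient Similarity Network and extracting topological features from it, can be sketched as follows. This is a minimal illustration with random data; the paper's actual similarity measures and centrality metrics are richer, and the threshold here is arbitrary.

```python
import numpy as np

def patient_similarity_features(omics, threshold):
    """Topological features from a Patient Similarity Network (sketch).

    omics: (P patients x G molecular features) matrix. Patients are linked
    when the correlation of their profiles exceeds a threshold; each
    patient's degree centrality is returned as a simple topological feature.
    """
    sim = np.corrcoef(omics)                       # patient-by-patient similarity
    adj = np.triu(sim > threshold, k=1)            # threshold upper triangle
    adj = adj | adj.T                              # symmetric adjacency, no self-loops
    degree = adj.sum(axis=1)                       # degree centrality per patient
    return adj, degree

rng = np.random.default_rng(2)
omics = rng.normal(size=(20, 100))                 # toy cohort: 20 patients, 100 features
adj, degree = patient_similarity_features(omics, threshold=0.2)
```

The degree vector (and richer centralities computed on `adj`) would then serve as input features for the clinical-outcome classifier.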
Klijn, Marieke E; Hubbuch, Jürgen
2018-04-27
Protein phase diagrams are a tool to investigate cause and consequence of solution conditions on protein phase behavior. The effects are scored according to aggregation morphologies such as crystals or amorphous precipitates. Solution conditions affect morphological features, such as crystal size, as well as kinetic features, such as crystal growth time. Commonly used data visualization techniques include individual line graphs or symbol-based phase diagrams. These techniques have limitations in terms of handling large datasets, comprehensiveness or completeness. To eliminate these limitations, morphological and kinetic features obtained from crystallization images generated with high-throughput microbatch experiments have been visualized with radar charts in combination with the empirical phase diagram (EPD) method. Morphological features (crystal size, shape, and number, as well as precipitate size) and kinetic features (crystal and precipitate onset and growth time) are extracted for 768 solutions with varying chicken egg white lysozyme concentration, salt type, ionic strength and pH. Image-based aggregation morphology and kinetic features were compiled into a single and easily interpretable figure, thereby showing that the EPD method can support high-throughput crystallization experiments in terms of both data volume and data complexity. Copyright © 2018. Published by Elsevier Inc.
Efficient multidimensional regularization for Volterra series estimation
NASA Astrophysics Data System (ADS)
Birpoutsoukis, Georgios; Csurcsia, Péter Zoltán; Schoukens, Johan
2018-05-01
This paper presents an efficient nonparametric time domain nonlinear system identification method. It is shown how truncated Volterra series models can be efficiently estimated without the need of long, transient-free measurements. The method is a novel extension of the regularization methods that have been developed for impulse response estimates of linear time invariant systems. To avoid the excessive memory needs in case of long measurements or large number of estimated parameters, a practical gradient-based estimation method is also provided, leading to the same numerical results as the proposed Volterra estimation method. Moreover, the transient effects in the simulated output are removed by a special regularization method based on the novel ideas of transient removal for Linear Time-Varying (LTV) systems. Combining the proposed methodologies, the nonparametric Volterra models of the cascaded water tanks benchmark are presented in this paper. The results for different scenarios varying from a simple Finite Impulse Response (FIR) model to a 3rd degree Volterra series with and without transient removal are compared and studied. It is clear that the obtained models capture the system dynamics when tested on a validation dataset, and their performance is comparable with the white-box (physical) models.
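The underlying idea, ridge-style regularization of an impulse-response estimate with a prior encoding smoothness and exponential decay, can be sketched for the first-order (FIR) case as follows. This is an illustrative sketch of the general regularization framework, not the paper's multidimensional Volterra estimator; the "TC-like" prior and all constants are assumptions.

```python
import numpy as np

def regularized_fir(u, y, n_taps, lam, alpha):
    """Regularized FIR (first-order Volterra kernel) estimate, minimal sketch.

    Builds the delayed-input regressor matrix and solves a ridge problem
    whose prior covariance P (a TC-like kernel with decay alpha) encodes
    the expected decay of the impulse response.
    """
    N = len(u)
    Phi = np.zeros((N, n_taps))
    for k in range(n_taps):
        Phi[k:, k] = u[:N - k]                 # column k = input delayed by k samples
    i = np.arange(n_taps)
    P = alpha ** np.maximum.outer(i, i)        # TC-kernel-style prior covariance
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.linalg.inv(P), Phi.T @ y)
    return theta

rng = np.random.default_rng(3)
u = rng.normal(size=400)                       # white-noise excitation
g_true = 0.8 ** np.arange(10)                  # exponentially decaying impulse response
y = np.convolve(u, g_true)[:400] + 0.01 * rng.normal(size=400)
g_hat = regularized_fir(u, y, n_taps=10, lam=0.1, alpha=0.9)
```

The paper's contribution is extending exactly this kind of kernel-based regularization to higher-degree (multidimensional) Volterra kernels, plus a gradient-based solver to avoid the memory cost of forming these matrices for long records.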
Finding Spatio-Temporal Patterns in Large Sensor Datasets
ERIC Educational Resources Information Center
McGuire, Michael Patrick
2010-01-01
Spatial or temporal data mining tasks are performed in the context of the relevant space, defined by a spatial neighborhood, and the relevant time period, defined by a specific time interval. Furthermore, when mining large spatio-temporal datasets, interesting patterns typically emerge where the dataset is most dynamic. This dissertation is…
Collaborative Sharing of Multidimensional Space-time Data Using HydroShare
NASA Astrophysics Data System (ADS)
Gan, T.; Tarboton, D. G.; Horsburgh, J. S.; Dash, P. K.; Idaszak, R.; Yi, H.; Blanton, B.
2015-12-01
HydroShare is a collaborative environment being developed for sharing hydrological data and models. It includes capability to upload data in many formats as resources that can be shared. The HydroShare data model for resources uses a specific format for the representation of each type of data and specifies metadata common to all resource types as well as metadata unique to specific resource types. The Network Common Data Form (NetCDF) was chosen as the format for multidimensional space-time data in HydroShare. NetCDF is widely used in hydrological and other geoscience modeling because it contains self-describing metadata and supports the creation of array-oriented datasets that may include three spatial dimensions, a time dimension and other user defined dimensions. For example, NetCDF may be used to represent precipitation or surface air temperature fields that have two dimensions in space and one dimension in time. This presentation will illustrate how NetCDF files are used in HydroShare. When a NetCDF file is loaded into HydroShare, header information is extracted using the "ncdump" utility. Python functions developed for the Django web framework on which HydroShare is based, extract science metadata present in the NetCDF file, saving the user from having to enter it. Where the file follows Climate Forecast (CF) convention and Attribute Convention for Dataset Discovery (ACDD) standards, metadata is thus automatically populated. Users also have the ability to add metadata to the resource that may not have been present in the original NetCDF file. HydroShare's metadata editing functionality then writes this science metadata back into the NetCDF file to maintain consistency between the science metadata in HydroShare and the metadata in the NetCDF file. This further helps researchers easily add metadata information following the CF and ACDD conventions. 
Additional data inspection and subsetting functions were developed, taking advantage of Python and command line libraries for working with NetCDF files. We describe the design and implementation of these features and illustrate how NetCDF files from a modeling application may be curated in HydroShare and thus enhance reproducibility of the associated research. We also discuss future development planned for multidimensional space-time data in HydroShare.
Parallel Index and Query for Large Scale Data Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chou, Jerry; Wu, Kesheng; Ruebel, Oliver
2011-07-18
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to the processing of a massive 50TB dataset generated by a large-scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.
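The bitmap-index idea behind FastBit-style querying, answering range queries by OR-ing precomputed per-bin bit-vectors instead of scanning all records, can be sketched as follows. This is a toy illustration, not the FastQuery/FastBit API; bin edges and the query are made up.

```python
import numpy as np

def build_bitmap_index(values, bins):
    """Equality-encoded bitmap index over binned values (FastBit-style sketch).

    One boolean bit-vector per bin; a range query then reduces to OR-ing a
    few bit-vectors instead of scanning the full dataset.
    """
    ids = np.digitize(values, bins)            # bin id per record
    return {b: ids == b for b in np.unique(ids)}

def query(index, bin_ids):
    """Return record indices falling in any requested bin (bitwise OR of bitmaps)."""
    hit = np.zeros_like(next(iter(index.values())))
    for b in bin_ids:
        if b in index:
            hit |= index[b]
    return np.flatnonzero(hit)

# Toy "particle energy" attribute; bin 5 covers 8 <= E < 10
energies = np.array([0.1, 5.2, 9.7, 3.3, 8.8, 0.4])
idx = build_bitmap_index(energies, bins=[0, 2, 4, 6, 8, 10])
hits = query(idx, bin_ids=[5])
```

At scale, the bit-vectors are compressed (FastBit uses word-aligned hybrid compression), which is what makes the index both compact and fast to combine across many cores.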
SBRML: a markup language for associating systems biology data with models.
Dada, Joseph O; Spasić, Irena; Paton, Norman W; Mendes, Pedro
2010-04-01
Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results. The XML Schema file for SBRML is available at http://www.comp-sys-bio.org/SBRML under the Academic Free License (AFL) v3.0.
Oil Extraction and Indigenous Livelihoods in the Northern Ecuadorian Amazon
Bozigar, Matthew; Gray, Clark L.; Bilsborrow, Richard E.
2015-01-01
Globally, the extraction of minerals and fossil fuels is increasingly penetrating into isolated regions inhabited by indigenous peoples, potentially undermining their livelihoods and well-being. To provide new insight to this issue, we draw on a unique longitudinal dataset collected in the Ecuadorian Amazon over an 11-year period from 484 indigenous households with varying degrees of exposure to oil extraction. Fixed and random effects regression models of the consequences of oil activities for livelihood outcomes reveal mixed and multidimensional effects. These results challenge common assumptions about these processes and are only partly consistent with hypotheses drawn from the Dutch disease literature. PMID:26543302
Otis-Green, Shirley; Sidhu, Rupinder K.; Ferraro, Catherine Del; Ferrell, Betty
2014-01-01
Lung cancer patients and their family caregivers face a wide range of potentially distressing symptoms across the four domains of quality of life. A multi-dimensional approach to addressing these complex concerns with early integration of palliative care has proven beneficial. This article highlights opportunities to integrate social work using a comprehensive quality of life model and a composite patient scenario drawn from a large, National Cancer Institute-funded lung cancer educational intervention (program project grant). PMID:24797998
Felyx : A Free Open Software Solution for the Analysis of Large Earth Observation Datasets
NASA Astrophysics Data System (ADS)
Piolle, Jean-Francois; Shutler, Jamie; Poulter, David; Guidetti, Veronica; Donlon, Craig
2014-05-01
The GHRSST project, by assembling large collections of earth observation data from various sources and agencies, has also raised the need to provide the user community with tools to inter-compare these data and to assess and monitor their quality. The ESA Medspiration project, which implemented the first operating node of the GHRSST system for Europe, also paved the way toward such generic analysis tools by developing the High Resolution Diagnostic Dataset System (HR-DDS) and satellite-to-in-situ multi-sensor match-up databases. Building on this heritage, ESA is now funding the development by IFREMER, PML and Pelamis of felyx, a web tool merging the two capabilities into a single free, open-source solution, written in Python and JavaScript, whose aim is to provide Earth Observation data producers and users with a flexible and reusable tool for easily monitoring and studying the quality and performance of data streams (satellite, in situ and model). The primary concept of felyx is to work as an extraction tool, subsetting source data over predefined target areas (which can be static or moving); these subsets, and associated metrics, can then be accessed by users or client applications as raw files, as automatic alerts and periodic reports, or through a flexible web interface enabling statistical analysis and visualization. Felyx thus presents itself as an open-source suite of tools enabling:
* subsetting large local or remote collections of Earth Observation data over predefined sites (geographical boxes) or moving targets (ship, buoy, hurricane), and storing the extracted data locally (referred to as miniProds). These miniProds constitute a much smaller, representative subset of the original collection on which any kind of processing or assessment can be performed without having to cope with heavy data volumes.
* computing statistical metrics over these miniProds, using for instance a set of usual statistical operators (mean, median, rms, ...), fully extensible and applicable to any variable of a dataset. These metrics are stored in a fast search engine, queryable by humans and automated applications.
* reporting or alerting, based on user-defined inference rules, through various media (emails, twitter feeds, ...) and devices (phones, tablets).
* analysing miniProds and metrics through a web interface that allows users to dig into this base of information and extract useful knowledge through multidimensional interactive display functions (time series, scatterplots, histograms, maps).
The services provided by felyx will be generic, deployable at users' own premises and adaptable enough to integrate any kind of parameter. Users will be able to operate their own felyx instance at any location, on datasets and parameters of their own interest, and the various instances will be able to interact with each other, creating a web of felyx systems enabling aggregation and cross-comparison of miniProds and metrics from multiple sources. Initially, two instances will be operated simultaneously during a six-month demonstration phase: at IFREMER, on sea surface temperature (for the GHRSST community) and ocean wave datasets, and at PML, on ocean colour. We will present results from the felyx project, demonstrate how the GHRSST community can exploit felyx, and show how the wider community can make use of GHRSST data within felyx.
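The statistical operators named in the abstract (mean, median, rms) can be sketched in a few lines of standard-library Python; this is an illustration of the kind of metric felyx computes over a miniProd, not the project's own code:

```python
import math
import statistics

def metrics(values):
    """Summary statistics of the kind felyx applies to a miniProd variable:
    mean, median and root-mean-square (operator names follow the abstract;
    the function itself is an illustrative sketch)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
    }

# e.g. sea surface temperature anomalies extracted over a fixed site
m = metrics([0.2, -0.1, 0.4, 0.1])
```

In felyx these values would then be pushed to the search engine so that users and automated applications can query them.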
Reeves, Anthony P; Xie, Yiting; Liu, Shuang
2017-04-01
With the advent of fully automated image analysis and modern machine learning methods, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. This paper presents a method and implementation for facilitating such datasets that addresses the critical issue of size scaling for algorithm validation and evaluation; current evaluation methods that are usually used in academic studies do not scale to large datasets. This method includes protocols for the documentation of many regions in very large image datasets; the documentation may be incrementally updated by new image data and by improved algorithm outcomes. This method has been used for 5 years in the context of chest health biomarkers from low-dose chest CT images that are now being used with increasing frequency in lung cancer screening practice. The lung scans are segmented into over 100 different anatomical regions, and the method has been applied to a dataset of over 20,000 chest CT images. Using this framework, the computer algorithms have been developed to achieve over 90% acceptable image segmentation on the complete dataset.
NASA Astrophysics Data System (ADS)
McGuire, M. P.; Welty, C.; Gangopadhyay, A.; Karabatis, G.; Chen, Z.
2006-05-01
The urban environment is formed by complex interactions between natural and human dominated systems, the study of which requires the collection and analysis of very large datasets that span many disciplines. Recent advances in sensor technology and automated data collection have improved the ability to monitor urban environmental systems and are making the idea of an urban environmental observatory a reality. This in turn has created a number of potential challenges in data management and analysis. We present the design of an end-to-end system to store, analyze, and visualize data from a prototype urban environmental observatory based at the Baltimore Ecosystem Study, a National Science Foundation Long Term Ecological Research site (BES LTER). We first present an object-relational design of an operational database to store high resolution spatial datasets as well as data from sensor networks, archived data from the BES LTER, data from external sources such as USGS NWIS and EPA Storet, and metadata. The second component of the system design is a spatiotemporal data warehouse consisting of a data staging plan and a multidimensional data model designed for the spatiotemporal analysis of monitoring data. The system design also includes applications for multi-resolution exploratory data analysis, multi-resolution data mining, and spatiotemporal visualization based on the spatiotemporal data warehouse. The design further includes interfaces with water quality models such as HSPF, SWMM, and SWAT, as well as applications for real-time sensor network visualization, data discovery, data download, QA/QC, and backup and recovery, all of which are based on the operational database. The system design includes both internet and workstation-based interfaces. Finally we present the design of a laboratory for spatiotemporal analysis and visualization as well as real-time monitoring of the sensor network.
Scaling Up Scientific Discovery in Sleep Medicine: The National Sleep Research Resource.
Dean, Dennis A; Goldberger, Ary L; Mueller, Remo; Kim, Matthew; Rueschman, Michael; Mobley, Daniel; Sahoo, Satya S; Jayapandian, Catherine P; Cui, Licong; Morrical, Michael G; Surovec, Susan; Zhang, Guo-Qiang; Redline, Susan
2016-05-01
Professional sleep societies have identified a need for strategic research in multiple areas that may benefit from access to and aggregation of large, multidimensional datasets. Technological advances provide opportunities to extract and analyze physiological signals and other biomedical information from datasets of unprecedented size, heterogeneity, and complexity. The National Institutes of Health has implemented a Big Data to Knowledge (BD2K) initiative that aims to develop and disseminate state-of-the-art big data access tools and analytical methods. The National Sleep Research Resource (NSRR) is a new National Heart, Lung, and Blood Institute resource designed to provide big data resources to the sleep research community. The NSRR is a web-based data portal that aggregates, harmonizes, and organizes sleep and clinical data from thousands of individuals studied as part of cohort studies or clinical trials and provides the user a suite of tools to facilitate data exploration and data visualization. Each deidentified study record minimally includes the summary results of an overnight sleep study; annotation files with scored events; the raw physiological signals from the sleep record; and available clinical and physiological data. NSRR is designed to be interoperable with other public data resources such as the Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) and to be analyzed with methods provided by the Research Resource for Complex Physiological Signals (PhysioNet). This article reviews the key objectives, challenges and operational solutions to addressing big data opportunities for sleep research in the context of the national sleep research agenda. It provides information to facilitate further interactions of the user community with NSRR, a community resource. © 2016 Associated Professional Sleep Societies, LLC.
NeatMap--non-clustering heat map alternatives in R.
Rajaram, Satwik; Oono, Yoshi
2010-01-22
The clustered heat map is the most popular means of visualizing genomic data. It compactly displays a large amount of data in an intuitive format that facilitates the detection of hidden structures and relations in the data. However, it is hampered by its use of cluster analysis which does not always respect the intrinsic relations in the data, often requiring non-standardized reordering of rows/columns to be performed post-clustering. This sometimes leads to uninformative and/or misleading conclusions. Often it is more informative to use dimension-reduction algorithms (such as Principal Component Analysis and Multi-Dimensional Scaling) which respect the topology inherent in the data. Yet, despite their proven utility in the analysis of biological data, they are not as widely used. This is at least partially due to the lack of user-friendly visualization methods with the visceral impact of the heat map. NeatMap is an R package designed to meet this need. NeatMap offers a variety of novel plots (in 2 and 3 dimensions) to be used in conjunction with these dimension-reduction techniques. Like the heat map, but unlike traditional displays of such results, it allows the entire dataset to be displayed while visualizing relations between elements. It also allows superimposition of cluster analysis results for mutual validation. NeatMap is shown to be more informative than the traditional heat map with the help of two well-known microarray datasets. NeatMap thus preserves many of the strengths of the clustered heat map while addressing some of its deficiencies. It is hoped that NeatMap will spur the adoption of non-clustering dimension-reduction algorithms.
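The dimension-reduction step that NeatMap-style displays build on can be illustrated with classical multidimensional scaling (Torgerson's method). The sketch below is generic numpy, not the package's R code: it embeds points so that their pairwise Euclidean distances approximate a given dissimilarity matrix.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical multidimensional scaling (Torgerson): double-center the
    squared dissimilarity matrix D and take the top-k eigenvectors of the
    resulting Gram matrix as coordinates (illustrative sketch)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                     # eigenvalues ascending
    idx = np.argsort(w)[::-1][:k]                # keep top-k components
    scale = np.sqrt(np.clip(w[idx], 0, None))
    return V[:, idx] * scale

# three collinear points at positions 0, 1, 3 on a line
D = np.array([[0., 1., 3.], [1., 0., 2.], [3., 2., 0.]])
X = classical_mds(D, k=1)
```

Because this configuration is exactly one-dimensional, the recovered coordinates reproduce the input distances; NeatMap's contribution is then the visualization layer drawn on top of such embeddings.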
Zhou, Xiaolu; Li, Dongying
2018-05-09
Advancements in location-aware technologies and in information and communication technology over the past decades have furthered our knowledge of the interaction between human activities and the built environment. An increasing number of studies have collected data regarding individual activities to better understand how the environment shapes human behavior. Despite this growing interest, challenges remain in collecting and processing individuals' activity data, e.g., capturing people's precise environmental contexts and analyzing data at multiple spatial scales. In this study, we propose and implement an innovative system that integrates smartphone-based step tracking via an app with sequential tile scan techniques to collect and process activity data. We apply the OpenStreetMap tile system to aggregate positioning points at various scales, and we propose duration, step and probability surfaces to quantify the multi-dimensional attributes of activities. Results show that, by running the app in the background, smartphones can measure multi-dimensional attributes of human activities, including space, duration, step count, and location uncertainty at various spatial scales. By coordinating the Global Positioning System (GPS) sensor with the accelerometer, the app conserves battery power that continuous GPS use would otherwise drain quickly. Based on a test dataset, we were able to detect the recreational center and sports center as the spaces where the user was most active, among other places visited. The methods provide techniques to address key issues in analyzing human activity data. The system can support future studies on the behavioral and health consequences of individuals' environmental exposure.
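The tile-based aggregation described above rests on the standard OpenStreetMap "slippy map" tile scheme, which maps a latitude/longitude fix to a tile index at a chosen zoom level. A minimal sketch of aggregating GPS points by tile:

```python
import math
from collections import Counter

def deg2tile(lat, lon, zoom):
    """Standard OSM slippy-map formula: map a GPS fix to the (x, y) tile
    index at the given zoom level. Coarser zooms give coarser aggregation,
    which is the multi-scale idea used in the study."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_r = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_r) + 1.0 / math.cos(lat_r)) / math.pi) / 2.0 * n)
    return x, y

# aggregate a point stream into per-tile visit counts at zoom 15
points = [(51.5074, -0.1278), (51.5075, -0.1279)]
counts = Counter(deg2tile(lat, lon, 15) for lat, lon in points)
```

Summing visit durations or step counts per tile instead of raw counts yields the duration and step surfaces the abstract mentions.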
Annual Review of Research Under the Joint Service Electronics Program.
1979-10-01
Contents: Quadratic Optimization Problems; Nonlinear Control; Nonlinear Fault Analysis; Qualitative Analysis of Large Scale Systems; Multidimensional System Theory; Optical Noise; and Pattern Recognition.
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
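BDBags build on the BagIt packaging format (RFC 8493), whose central device is a checksum manifest that specifies every payload file explicitly. A minimal stdlib sketch of that manifest idea:

```python
import hashlib
import os

def sha256_manifest(data_dir):
    """Build a BagIt-style manifest (the mechanism BDBags use, RFC 8493):
    one 'checksum  path' line per payload file, so a dataset's contents are
    explicitly and unambiguously specified without moving the data.
    Illustrative sketch, not the bdbag tool itself."""
    lines = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            rel = os.path.relpath(path, data_dir).replace(os.sep, "/")
            lines.append(f"{digest}  data/{rel}")
    return "\n".join(sorted(lines))
```

Because the manifest is tiny compared to the data, it can be exchanged (and referenced by a persistent identifier such as a Minid) while the payload stays in place, avoiding the costly marshaling the abstract describes.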
Analysis of the IJCNN 2011 UTL Challenge
2012-01-13
The IJCNN 2011 Unsupervised and Transfer Learning (UTL) challenge (http://clopinet.com/ul) made available large datasets from various application domains: handwriting recognition, image recognition, video processing, text processing, and ecology. The evaluation sets consist of 4096 examples each. Example datasets include AVICENNA (handwriting: 120 features, 0% sparsity, 150205 development and 50000 transfer examples) and HARRY (video: 5000 features, 98.1% sparsity).
Quantitative Missense Variant Effect Prediction Using Large-Scale Mutagenesis Data.
Gray, Vanessa E; Hause, Ronald J; Luebeck, Jens; Shendure, Jay; Fowler, Douglas M
2018-01-24
Large datasets describing the quantitative effects of mutations on protein function are becoming increasingly available. Here, we leverage these datasets to develop Envision, which predicts the magnitude of a missense variant's molecular effect. Envision combines 21,026 variant effect measurements from nine large-scale experimental mutagenesis datasets, a hitherto untapped training resource, with a supervised, stochastic gradient boosting learning algorithm. Envision outperforms other missense variant effect predictors both on large-scale mutagenesis data and on an independent test dataset comprising 2,312 TP53 variants whose effects were measured using a low-throughput approach. This dataset was never used for hyperparameter tuning or model training and thus serves as an independent validation set. Envision prediction accuracy is also more consistent across amino acids than other predictors. Finally, we demonstrate that Envision's performance improves as more large-scale mutagenesis data are incorporated. We precompute Envision predictions for every possible single amino acid variant in human, mouse, frog, zebrafish, fruit fly, worm, and yeast proteomes (https://envision.gs.washington.edu/). Copyright © 2017 Elsevier Inc. All rights reserved.
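The learning machinery behind Envision — supervised gradient boosting — can be sketched with a minimal stump-based booster for squared loss. This is an illustration of the learner family only; Envision's actual model, features, and hyperparameters are described in the paper, not here:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split on one feature, minimizing squared error."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: repeatedly fit a stump to the
    current residuals and add a damped copy of it (illustrative sketch)."""
    pred = np.full(len(y), y.mean())
    stumps = []
    for _ in range(rounds):
        t, lv, rv = fit_stump(x, y - pred)
        pred = pred + lr * np.where(x <= t, lv, rv)
        stumps.append((t, lv, rv))
    return (y.mean(), lr, stumps)

def predict(model, x):
    base, lr, stumps = model
    out = np.full(len(x), base)
    for t, lv, rv in stumps:
        out = out + lr * np.where(x <= t, lv, rv)
    return out
```

With many features and second-order refinements, this additive-residual loop is the core of the stochastic gradient boosting algorithms used in practice.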
Remote visual analysis of large turbulence databases at multiple scales
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pulido, Jesus; Livescu, Daniel; Kanov, Kalin
The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
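The wavelet decomposition underlying such compression can be illustrated with the orthonormal Haar transform: repeatedly split a signal into coarse averages and detail coefficients, then threshold small details. This is a stand-in sketch, not the framework's production codec:

```python
import numpy as np

def haar_levels(signal, levels):
    """Multi-level 1D orthonormal Haar decomposition: each level splits the
    signal into coarse averages (approx) and detail coefficients. Length
    must be divisible by 2**levels (illustrative sketch)."""
    approx, details = np.asarray(signal, dtype=float), []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    return approx, details

def haar_reconstruct(approx, details):
    """Exact inverse of haar_levels; zeroing small detail coefficients
    before reconstruction is the compression step."""
    for d in reversed(details):
        even = (approx + d) / np.sqrt(2)
        odd = (approx - d) / np.sqrt(2)
        approx = np.empty(even.size + odd.size)
        approx[0::2], approx[1::2] = even, odd
    return approx
```

Serving only the coarse levels first, and streaming details on demand, is what makes wavelet representations natural for remote multi-resolution visualization.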
Szűcs, D
2016-01-01
A large body of research suggests that mathematical learning disability (MLD) is related to working memory impairment. Here, I organize part of this literature through a meta-analysis of 36 studies with 665 MLD and 1049 control participants. I demonstrate that one subtype of MLD is associated with reading problems and weak verbal short-term and working memory. Another subtype of MLD does not have associated reading problems and is linked to weak visuospatial short-term and working memory. In order to better understand MLD we need to precisely define potentially modality-specific memory subprocesses and supporting executive functions, relevant for mathematical learning. This can be achieved by taking a multidimensional parametric approach systematically probing an extended network of cognitive functions. Rather than creating arbitrary subgroups and/or focusing on a single factor, highly powered studies need to position individuals in a multidimensional parametric space. This will allow us to understand the multidimensional structure of cognitive functions and their relationship to mathematical performance. © 2016 Elsevier B.V. All rights reserved.
A nonlocal electron conduction model for multidimensional radiation hydrodynamics codes
NASA Astrophysics Data System (ADS)
Schurtz, G. P.; Nicolaï, Ph. D.; Busquet, M.
2000-10-01
Numerical simulation of laser-driven Inertial Confinement Fusion (ICF) experiments requires the use of large multidimensional hydro codes. Though these codes include detailed physics for numerous phenomena, they deal poorly with electron conduction, which is the leading energy transport mechanism in these systems. Electron heat flow has been known, since the work of Luciani, Mora, and Virmont (LMV) [Phys. Rev. Lett. 51, 1664 (1983)], to be a nonlocal process, which the local Spitzer-Harm theory, even flux limited, is unable to account for. The present work aims at extending the original formula of LMV to two or three dimensions of space. This multidimensional extension leads to an equivalent transport equation suitable for easy implementation in a two-dimensional radiation-hydrodynamics code. Simulations are presented and compared to Fokker-Planck simulations in one and two dimensions of space.
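The one-dimensional LMV flux referred to above is, schematically, a convolution of the local Spitzer-Harm flux with a delocalization kernel. The exact kernel and the definition of the effective mean free path $\lambda$ follow the original paper; the form below is a schematic reminder of the structure rather than a quoted equation:

```latex
q_{\mathrm{NL}}(x) \;=\; \int W(x,x')\, q_{\mathrm{SH}}(x')\, \mathrm{d}x',
\qquad
W(x,x') \;=\; \frac{1}{2\lambda(x')}\,
\exp\!\left(-\left|\int_{x'}^{x} \frac{\mathrm{d}x''}{\lambda(x'')}\right|\right)
```

The multidimensional extension discussed in the abstract replaces this explicit convolution with an equivalent transport equation, which is far cheaper to evaluate on 2D hydro grids than a multidimensional kernel integral.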
NASA Astrophysics Data System (ADS)
Boscheri, Walter; Dumbser, Michael; Loubère, Raphaël; Maire, Pierre-Henri
2018-04-01
In this paper we develop a conservative cell-centered Lagrangian finite volume scheme for the solution of the hydrodynamics equations on unstructured multidimensional grids. The method is derived from the Eucclhyd scheme discussed in [47,43,45]. It is second-order accurate in space and is combined with the a posteriori Multidimensional Optimal Order Detection (MOOD) limiting strategy to ensure robustness and stability at shock waves. Second-order of accuracy in time is achieved via the ADER (Arbitrary high order schemes using DERivatives) approach. A large set of numerical test cases is proposed to assess the ability of the method to achieve effective second order of accuracy on smooth flows, maintaining an essentially non-oscillatory behavior on discontinuous profiles, general robustness ensuring physical admissibility of the numerical solution, and precision where appropriate.
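The a posteriori MOOD idea can be illustrated in a much simpler setting than the paper's Lagrangian scheme: 1D linear advection on a fixed periodic grid. The sketch below tries an unlimited second-order update, detects cells violating a discrete maximum principle, and recomputes with robust first-order fluxes at the offending faces; the detection criterion and fallback are simplified assumptions, not the paper's cascade:

```python
import numpy as np

def mood_advect(u, c=0.5):
    """One MOOD-style step for 1D linear advection (speed > 0, periodic
    grid, CFL number c): candidate Lax-Wendroff update, discrete-maximum-
    principle detection, first-order upwind fallback at troubled faces.
    Toy sketch of a posteriori limiting, not the paper's scheme."""
    um1 = np.roll(u, 1)                              # u_{i-1}
    # flux stored at index i lives on the face between cells i-1 and i
    f_lw = c * (um1 + 0.5 * (1 - c) * (u - um1))     # Lax-Wendroff flux
    f_up = c * um1                                   # upwind flux

    def update(f):
        return u - (np.roll(f, -1) - f)              # u_i - (F_{i+1/2} - F_{i-1/2})

    cand = update(f_lw)                              # unlimited candidate
    lo = np.minimum(np.minimum(u, um1), np.roll(u, -1))
    hi = np.maximum(np.maximum(u, um1), np.roll(u, -1))
    bad = (cand < lo - 1e-12) | (cand > hi + 1e-12)  # DMP violation
    bad_face = bad | np.roll(bad, 1)                 # face touches a bad cell
    return update(np.where(bad_face, f_up, f_lw))
```

Because the fallback is applied to fluxes rather than cell values, the corrected step stays conservative, which is the property the production scheme must also preserve.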
Functional connectomics from a "big data" perspective.
Xia, Mingrui; He, Yong
2017-10-15
In the last decade, explosive growth regarding functional connectome studies has been observed. Accumulating knowledge has significantly contributed to our understanding of the brain's functional network architectures in health and disease. With the development of innovative neuroimaging techniques, the establishment of large brain datasets and the increasing accumulation of published findings, functional connectomic research has begun to move into the era of "big data", which generates unprecedented opportunities for discovery in brain science and simultaneously encounters various challenging issues, such as data acquisition, management and analyses. Big data on the functional connectome exhibits several critical features: high spatial and/or temporal precision, large sample sizes, long-term recording of brain activity, multidimensional biological variables (e.g., imaging, genetic, demographic, cognitive and clinical) and/or vast quantities of existing findings. We review studies regarding functional connectomics from a big data perspective, with a focus on recent methodological advances in state-of-the-art image acquisition (e.g., multiband imaging), analysis approaches and statistical strategies (e.g., graph theoretical analysis, dynamic network analysis, independent component analysis, multivariate pattern analysis and machine learning), as well as reliability and reproducibility validations. We highlight the novel findings in the application of functional connectomic big data to the exploration of the biological mechanisms of cognitive functions, normal development and aging and of neurological and psychiatric disorders. We advocate the urgent need to expand efforts directed at the methodological challenges and discuss the direction of applications in this field. Copyright © 2017 Elsevier Inc. All rights reserved.
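The entry point for the graph-theoretical analyses mentioned above is simple: correlate regional time series, threshold the correlation matrix into an adjacency matrix, and compute node-level graph measures such as degree. A minimal numpy sketch (illustrative; real pipelines add preprocessing, statistical thresholding and richer metrics):

```python
import numpy as np

def connectivity_degree(ts, threshold=0.5):
    """Build a functional network from regional time series (regions x
    timepoints): Pearson correlation between every pair of regions,
    thresholded by absolute value, then each node's degree."""
    r = np.corrcoef(ts)              # regions x regions correlation matrix
    np.fill_diagonal(r, 0.0)         # ignore self-connections
    adj = np.abs(r) >= threshold
    return adj.sum(axis=1)

# synthetic example: regions 0 and 1 share a common signal, region 2 is independent
rng = np.random.default_rng(0)
base = rng.standard_normal(200)
ts = np.stack([base + 0.1 * rng.standard_normal(200),
               base + 0.1 * rng.standard_normal(200),
               rng.standard_normal(200)])
deg = connectivity_degree(ts)
```

Dynamic network analysis repeats this construction over sliding windows; the "big data" challenge is doing it reliably over thousands of subjects and high-resolution parcellations.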
Schure, Mark R; Davis, Joe M
2017-11-10
Orthogonality metrics (OMs) for three and higher dimensional separations are proposed as extensions of previously developed OMs, which were used to evaluate the zone utilization of two-dimensional (2D) separations. These OMs include correlation coefficients, dimensionality, information theory metrics and convex-hull metrics. In a number of these cases, lower dimensional subspace metrics exist and can be readily calculated. The metrics are used to interpret previously generated experimental data. The experimental datasets are derived from Gilar's peptide data, now modified to be three dimensional (3D), and a comprehensive 3D chromatogram from Moore and Jorgenson. The Moore and Jorgenson chromatogram, which has 25 identifiable 3D volume elements or peaks, displayed good orthogonality values over all dimensions. However, OMs based on discretization of the 3D space changed substantially with changes in binning parameters. This example highlights the importance in higher dimensions of having an abundant number of retention times as data points, especially for methods that use discretization. The Gilar data, which in a previous study produced 21 2D datasets by the pairing of 7 one-dimensional separations, was reinterpreted to produce 35 3D datasets. These datasets show a number of interesting properties, one of which is that geometric and harmonic means of lower dimensional subspace (i.e., 2D) OMs correlate well with the higher dimensional (i.e., 3D) OMs. The space utilization of the Gilar 3D datasets was ranked using OMs, with the retention times of the datasets having the largest and smallest OMs presented as graphs. A discussion concerning the orthogonality of higher dimensional techniques is given with emphasis on molecular diversity in chromatographic separations. In the information theory work, an inconsistency is found in previous studies of orthogonality using the 2D metric often identified as %O. 
A new choice of metric is proposed, extended to higher dimensions, characterized by mixes of ordered and random retention times, and applied to the experimental datasets. In 2D, the new metric always equals or exceeds the original one. However, results from both the original and new methods are given. Copyright © 2017 Elsevier B.V. All rights reserved.
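Two of the simpler zone-utilization measures discussed above can be sketched directly: the correlation of retention times (low magnitude indicating more orthogonal dimensions) and the fraction of occupied bins after discretizing the normalized retention space. The sketch below is illustrative of the bin-coverage style of metric whose binning sensitivity the study highlights; it is not the paper's exact %O definition or the newly proposed metric:

```python
import numpy as np

def orthogonality_2d(t1, t2, bins=10):
    """Simple 2D zone-utilization metrics: Pearson correlation of the two
    retention-time dimensions, and fraction of occupied bins on a bins x
    bins discretization of the normalized retention space (sketch only)."""
    r = np.corrcoef(t1, t2)[0, 1]

    def norm(t):
        t = np.asarray(t, dtype=float)
        return (t - t.min()) / (t.max() - t.min() + 1e-12)

    i = (norm(t1) * bins).astype(int).clip(0, bins - 1)
    j = (norm(t2) * bins).astype(int).clip(0, bins - 1)
    coverage = len(set(zip(i.tolist(), j.tolist()))) / bins ** 2
    return r, coverage
```

For perfectly correlated dimensions only the diagonal bins are occupied (coverage 1/bins), and changing `bins` changes the coverage value, which is exactly the discretization sensitivity the abstract warns about in higher dimensions.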
Querying Large Biological Network Datasets
ERIC Educational Resources Information Center
Gulsoy, Gunhan
2013-01-01
New experimental methods have resulted in increasing amounts of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large-scale analysis methods. We therefore address the problem of querying large biological network datasets.…
NASA Astrophysics Data System (ADS)
Tisdale, M.
2016-12-01
NASA's Atmospheric Science Data Center (ASDC) is operationally using the Esri ArcGIS Platform to improve data discoverability, accessibility and interoperability to meet diversifying requirements driven by government, private, public and academic communities. The ASDC is actively working to provide their mission essential datasets as ArcGIS Image Services, Open Geospatial Consortium (OGC) Web Mapping Services (WMS) and OGC Web Coverage Services (WCS), leveraging the ArcGIS multidimensional mosaic dataset structure. Science teams and the ASDC are utilizing these services, developing applications using the Web AppBuilder for ArcGIS and the ArcGIS API for Javascript, and evaluating the restructuring of their data production and access scripts within the ArcGIS Python Toolbox framework and Geoprocessing service environment. These capabilities yield greater usage and exposure of ASDC data holdings and provide improved geospatial analytical tools for a mission critical understanding in the areas of the earth's radiation budget, clouds, aerosols, and tropospheric chemistry.
NASA Astrophysics Data System (ADS)
Tisdale, M.
2017-12-01
NASA's Atmospheric Science Data Center (ASDC) is operationally using the Esri ArcGIS Platform to improve data discoverability, accessibility and interoperability to meet the diversifying user requirements from government, private, public and academic communities. The ASDC is actively working to provide their mission essential datasets as ArcGIS Image Services, Open Geospatial Consortium (OGC) Web Mapping Services (WMS), and OGC Web Coverage Services (WCS) while leveraging the ArcGIS multidimensional mosaic dataset structure. Science teams at ASDC are utilizing these services through the development of applications using the Web AppBuilder for ArcGIS and the ArcGIS API for Javascript. These services provide greater exposure of ASDC data holdings to the GIS community and allow for broader sharing and distribution to various end users. These capabilities provide interactive visualization tools and improved geospatial analytical tools for a mission critical understanding in the areas of the earth's radiation budget, clouds, aerosols, and tropospheric chemistry. The presentation will cover how the ASDC is developing geospatial web services and applications to improve data discoverability, accessibility, and interoperability.
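The OGC WMS services mentioned above are driven by standardized query parameters, so a client can compose requests with nothing but a URL. The sketch below builds a WMS 1.3.0 GetMap request; the endpoint and layer name are placeholders, while the parameter names come from the WMS specification (note that EPSG:4326 in WMS 1.3.0 uses latitude-first axis order in BBOX):

```python
from urllib.parse import urlencode

def wms_getmap_url(base, layer, bbox, width=512, height=512):
    """Compose an OGC WMS 1.3.0 GetMap request URL. bbox is
    (min_lat, min_lon, max_lat, max_lon) per the 1.3.0 EPSG:4326 axis
    order. Endpoint and layer below are illustrative placeholders."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base + "?" + urlencode(params)

url = wms_getmap_url("https://example.gov/wms", "cloud_fraction", (-90, -180, 90, 180))
```

Because the interface is standardized, the same request pattern works against any compliant server, which is what gives ASDC's holdings the interoperability described in the abstract.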
Use of multidimensional, multimodal imaging and PACS to support neurological diagnoses
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wong, S.T.C.; Knowlton, R.; Hoo, K.S.
1995-12-31
Technological advances in brain imaging have revolutionized diagnosis in neurology and neurological surgery. Major imaging techniques include magnetic resonance imaging (MRI) to visualize structural anatomy, positron emission tomography (PET) to image metabolic function and cerebral blood flow, magnetoencephalography (MEG) to visualize the location of physiologic current sources, and magnetic resonance spectroscopy (MRS) to measure specific biochemicals. Each of these techniques studies different biomedical aspects of the brain, but an effective means to quantify and correlate the disparate imaging datasets, needed to improve clinical decision-making processes, has been lacking. This paper describes several techniques developed in a UNIX-based neurodiagnostic workstation to aid the non-invasive presurgical evaluation of epilepsy patients. These techniques include on-line access to the picture archiving and communication systems (PACS) multimedia archive, coregistration of multimodality image datasets, and correlation and quantification of structural and functional information contained in the registered images. For illustration, the authors describe the use of these techniques in a patient case of non-lesional neocortical epilepsy. They also present future work based on preliminary studies.
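Coregistration of modalities such as MRI and PET is commonly driven by maximizing mutual information between the two images, computed from their joint intensity histogram. The sketch below shows that similarity measure in numpy; it is a generic illustration, not the workstation's implementation:

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Mutual information of two images from their joint intensity
    histogram: MI = sum p(a,b) * log(p(a,b) / (p(a) p(b))). Higher values
    indicate better alignment, which is why MI is a standard objective for
    multimodality coregistration (illustrative sketch)."""
    hist, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p = hist / hist.sum()                     # joint probability
    px = p.sum(axis=1, keepdims=True)         # marginal of img_a
    py = p.sum(axis=0, keepdims=True)         # marginal of img_b
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())
```

A registration loop would apply candidate rigid transforms to one image and keep the transform that maximizes this score; MI works across modalities because it rewards any consistent intensity mapping, not identical intensities.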
Adams, Helen; Adger, W Neil; Ahmad, Sate; Ahmed, Ali; Begum, Dilruba; Lázár, Attila N; Matthews, Zoe; Rahman, Mohammed Mofizur; Streatfield, Peter Kim
2016-11-08
Populations in resource dependent economies gain well-being from the natural environment, in highly spatially and temporally variable patterns. To collect information on this, we designed and implemented a 1586-household quantitative survey in the southwest coastal zone of Bangladesh. Data were collected on material, subjective and health dimensions of well-being in the context of natural resource use, particularly agriculture, aquaculture, mangroves and fisheries. The questionnaire included questions on factors that mediate poverty outcomes: mobility and remittances; loans and micro-credit; environmental perceptions; shocks; and women's empowerment. The data are stratified by social-ecological system to take into account spatial dynamics and the survey was repeated with the same respondents three times within a year to incorporate seasonal dynamics. The dataset includes blood pressure measurements and height and weight of men, women and children. In addition, the household listing includes basic data on livelihoods and income for approximately 10,000 households. The dataset facilitates interdisciplinary research on spatial and temporal dynamics of well-being in the context of natural resource dependence in low income countries.
Lichtenstein, James L. L.; Wright, Colin M; McEwen, Brendan; Pinter-Wollman, Noa; Pruitt, Jonathan N.
2018-01-01
Individual animals differ consistently in their behaviour, thus impacting a wide variety of ecological outcomes. Recent advances in animal personality research have established the ecological importance of the multidimensional behavioural volume occupied by individuals and by multispecies communities. Here, we examine the degree to which the multidimensional behavioural volume of a group predicts the outcome of both intra- and interspecific interactions. In particular, we test the hypothesis that a population of conspecifics will experience low intraspecific competition when the population occupies a large volume in behavioural space. We further hypothesize that populations of interacting species will exhibit greater interspecific competition when one or both species occupy large volumes in behavioural space. We evaluate these hypotheses by studying groups of katydids (Scudderia nymphs) and froghoppers (Philaenus spumarius), which compete for food and space on their shared host plant, Solidago canadensis. We found that individuals in single-species groups of katydids positioned themselves closer to one another, suggesting reduced competition, when groups occupied a large behavioural volume. When both species were placed together, we found that the survival of froghoppers was greatest when both froghoppers and katydids occupied a small volume in behavioural space, particularly at high froghopper densities. These results suggest that groups that occupy large behavioural volumes can have low intraspecific competition but high interspecific competition. Thus, behavioural hypervolumes appear to have ecological consequences at both the level of the population and the community and may help to predict the intensity of competition both within and across species. PMID:29681647
Interactive 4D Visualization of Sediment Transport Models
NASA Astrophysics Data System (ADS)
Butkiewicz, T.; Englert, C. M.
2013-12-01
Coastal sediment transport models simulate the effects that waves, currents, and tides have on near-shore bathymetry and features such as beaches and barrier islands. Understanding these dynamic processes is integral to the study of coastline stability, beach erosion, and environmental contamination. Furthermore, analyzing the results of these simulations is a critical task in the design, placement, and engineering of coastal structures such as seawalls, jetties, support pilings for wind turbines, etc. Despite the importance of these models, there is a lack of available visualization software that allows users to explore and perform analysis on these datasets in an intuitive and effective manner. Existing visualization interfaces for these datasets often present only one variable at a time, using two dimensional plan or cross-sectional views. These visual restrictions limit the ability to observe the contents in the proper overall context, both in spatial and multi-dimensional terms. To improve upon these limitations, we use 3D rendering and particle system based illustration techniques to show water column/flow data across all depths simultaneously. We can also encode multiple variables across different perceptual channels (color, texture, motion, etc.) to enrich surfaces with multi-dimensional information. Interactive tools are provided, which can be used to explore the dataset and find regions-of-interest for further investigation. Our visualization package provides an intuitive 4D (3D, time-varying) visualization of sediment transport model output. In addition, we are also integrating real world observations with the simulated data to support analysis of the impact from major sediment transport events. In particular, we have been focusing on the effects of Superstorm Sandy on the Redbird Artificial Reef Site, offshore of Delaware Bay. 
Based on our pre- and post-storm high-resolution sonar surveys, there has been significant scour and bedform migration around the sunken subway cars and other vessels present at the Redbird site. Due to the extensive surveying and historical data availability in the area, the site is highly attractive for comparing hindcasted sediment transport simulations to our observations of actual changes. This work has the potential to strengthen the accuracy of sediment transport modeling, as well as help predict and prepare for future changes due to similar extreme sediment transport events. [Figure: our visualization showing a simple sediment transport model with tidal flows causing significant erosion (red) and deposition (blue).]
Druka, Arnis; Druka, Ilze; Centeno, Arthur G; Li, Hongqiang; Sun, Zhaohui; Thomas, William TB; Bonar, Nicola; Steffenson, Brian J; Ullrich, Steven E; Kleinhofs, Andris; Wise, Roger P; Close, Timothy J; Potokina, Elena; Luo, Zewei; Wagner, Carola; Schweizer, Günther F; Marshall, David F; Kearsey, Michael J; Williams, Robert W; Waugh, Robbie
2008-01-01
Background A typical genetical genomics experiment results in four separate data sets: genotype, gene expression, higher-order phenotypic data, and metadata that describe the protocols, processing and the array platform. Used in concert, these data sets provide the opportunity to perform genetic analysis at a systems level. Their predictive power is largely determined by the gene expression dataset, where tens of millions of data points can be generated using currently available mRNA profiling technologies. Such large, multidimensional data sets often have value beyond that extracted during their initial analysis and interpretation, particularly if conducted on widely distributed reference genetic materials. Besides quality and scale, access to the data is of primary importance, as accessibility potentially allows the extraction of considerable added value from the same primary dataset by the wider research community. Although the number of genetical genomics experiments in different plant species is rapidly increasing, none to date has been presented in a form that allows quick and efficient on-line testing for possible associations between genes, loci and traits of interest by an entire research community. Description Using a reference population of 150 recombinant doubled haploid barley lines we generated novel phenotypic, mRNA abundance and SNP-based genotyping data sets, added them to a considerable volume of legacy trait data and entered them into GeneNetwork. GeneNetwork is a unified on-line analytical environment that enables the user to test genetic hypotheses about how component traits, such as mRNA abundance, may interact to condition more complex biological phenotypes (higher-order traits). Here we describe these barley data sets and demonstrate some of the functionalities GeneNetwork provides as an easily accessible and integrated analytical environment for exploring them.
Conclusion By integrating barley genotypic, phenotypic and mRNA abundance data sets directly within GeneNetwork's analytical environment we provide simple web access to the data for the research community. In this environment, a combination of correlation analysis and linkage mapping provides the potential to identify and substantiate gene targets for saturation mapping and positional cloning. By integrating datasets from an unsequenced crop plant (barley) in a database that has been designed for an animal model species (mouse) with a well established genome sequence, we prove the importance of the concept and practice of modular development and interoperability of software engineering for biological data sets. PMID:19017390
The observed clustering of damaging extra-tropical cyclones in Europe
NASA Astrophysics Data System (ADS)
Cusack, S.
2015-12-01
The clustering of severe European windstorms on annual timescales has substantial impacts on the re/insurance industry. Management of the risk is impaired by large uncertainties in estimates of clustering from historical storm datasets typically covering the past few decades. The uncertainties are unusually large because clustering depends on the variance of storm counts. Eight storm datasets are gathered for analysis in this study in order to reduce these uncertainties. Six of the datasets contain more than 100 years of severe storm information to reduce sampling errors, and the diversity of information sources and analysis methods between datasets samples observational errors. All storm severity measures used in this study reflect damage, to suit re/insurance applications. It is found that the shortest storm dataset, 42 years in length, provides estimates of clustering with very large sampling and observational errors. The dataset does provide some useful information: indications of stronger clustering for more severe storms, particularly for southern countries off the main storm track. However, substantially different results are produced by removal of one stormy season, 1989/1990, which illustrates the large uncertainties from a 42-year dataset. The extended storm records place 1989/1990 into a much longer historical context and produce more robust estimates of clustering. All the extended storm datasets show a greater degree of clustering with increasing storm severity and suggest that clustering of severe storms is much more material than that of weaker storms. Further, they contain signs of stronger clustering in areas off the main storm track, and weaker clustering for smaller-sized areas, though these signals are smaller than the uncertainties in the actual values. Both the improvement of existing storm records and the development of new historical storm datasets would help to improve management of this risk.
Application Perspective of 2D+SCALE Dimension
NASA Astrophysics Data System (ADS)
Karim, H.; Rahman, A. Abdul
2016-09-01
Different applications or users need different abstractions of spatial models, dimensionalities, and dataset specifications due to variations in the required analysis and output. Various approaches, data models, and data structures are now available to support most current application models in Geographic Information Systems (GIS). One current trend in the GIS multi-dimensional research community is the implementation of a scale dimension for spatial datasets to suit various scale-dependent application needs. In this paper, 2D spatial datasets that have been scaled up along a third dimension are referred to as 2D+scale (or 3D-scale) datasets. Various data structures, data models, approaches, schemas, and formats have been proposed as the best ways to support the variety of applications and dimensionalities in 3D topology. However, only a few of them consider the element of scale as their targeted dimension. Where the scale dimension is concerned, the implementation approach can be either multi-scale or vario-scale (with any available data structure and format), depending on the application requirements (topology, semantics, and function). This paper discusses current and potential new applications that could be built upon the 3D-scale dimension approach. The previous and current work on the scale dimension, the requirements to be preserved for any given application, implementation issues, and potential future applications form the major discussion of this paper.
Application of stochastic weighted algorithms to a multidimensional silica particle model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Menz, William J.; Patterson, Robert I.A.; Wagner, Wolfgang
2013-09-01
Highlights: •Stochastic weighted algorithms (SWAs) are developed for a detailed silica model. •An implementation of SWAs with the transition kernel is presented. •The SWAs' solutions converge to the direct simulation algorithm's (DSA) solution. •The efficiency of SWAs is evaluated for this multidimensional particle model. •It is shown that SWAs can be used for coagulation problems in industrial systems. Abstract: This paper presents a detailed study of the numerical behaviour of stochastic weighted algorithms (SWAs) using the transition regime coagulation kernel and a multidimensional silica particle model. The implementation in the SWAs of the transition regime coagulation kernel and associated majorant rates is described. The silica particle model of Shekar et al. [S. Shekar, A.J. Smith, W.J. Menz, M. Sander, M. Kraft, A multidimensional population balance model to describe the aerosol synthesis of silica nanoparticles, Journal of Aerosol Science 44 (2012) 83–98] was used in conjunction with this coagulation kernel to study the convergence properties of SWAs with a multidimensional particle model. High-precision solutions were calculated with two SWAs and also with the established direct simulation algorithm. These solutions, which were generated using a large number of computational particles, showed close agreement. It was thus demonstrated that SWAs can be successfully used with complex coagulation kernels and high-dimensional particle models to simulate real-world systems.
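The direct simulation algorithm referenced above treats coagulation as a sequence of random pair-merge events over a population of computational particles. The following is a deliberately minimal sketch with a constant (uniform) kernel; the paper's actual method uses the transition regime kernel, majorant rates, and particle weights, none of which are reproduced here:

```python
import random

def direct_simulation_coagulation(masses, n_events, seed=0):
    """Minimal direct-simulation coagulation with a constant kernel:
    each event picks a uniformly random pair of particles and merges
    them into one. (A real DSA/SWA selects pairs with probability
    proportional to the coagulation kernel, via majorant rates.)"""
    rng = random.Random(seed)
    particles = list(masses)
    for _ in range(n_events):
        if len(particles) < 2:
            break
        i, j = rng.sample(range(len(particles)), 2)
        merged = particles[i] + particles[j]
        # Remove the higher index first so the lower index stays valid.
        for k in sorted((i, j), reverse=True):
            particles.pop(k)
        particles.append(merged)
    return particles

out = direct_simulation_coagulation([1.0] * 10, 4)
```

Each event reduces the particle count by exactly one while conserving total mass, which is the invariant convergence studies of such algorithms rely on.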
Earth Science Data Analytics: Preparing for Extracting Knowledge from Information
NASA Technical Reports Server (NTRS)
Kempler, Steven; Barbieri, Lindsay
2016-01-01
Data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Data analytics is a broad term that includes data analysis, as well as an understanding of the cognitive processes an analyst uses to understand problems and explore data in meaningful ways. Analytics also includes data extraction, transformation, and reduction, utilizing specific tools, techniques, and methods. Turning to data science, definitions of data science sound very similar to those of data analytics (which leads to much of the confusion between the two). But the skills needed for both, co-analyzing large amounts of heterogeneous data, understanding and utilizing relevant tools and techniques, and subject matter expertise, although similar, serve different purposes. Data analytics takes a practitioner's approach, applying expertise and skills to solve issues and gain subject knowledge. Data science is more theoretical (research in itself) in nature, providing strategic actionable insights and new innovative methodologies. Earth Science Data Analytics (ESDA) is the process of examining, preparing, reducing, and analyzing large amounts of spatial (multi-dimensional), temporal, or spectral data using a variety of data types to uncover patterns, correlations, and other information, to better understand our Earth. The large variety of datasets (temporal and spatial differences, data types, formats, etc.) creates the need for data analytics skills that combine an understanding of the science domain with data preparation, reduction, and analysis techniques, from a practitioner's point of view. The application of these skills to ESDA is the focus of this presentation. The Earth Science Information Partners (ESIP) Federation Earth Science Data Analytics (ESDA) Cluster was created in recognition of the practical need to facilitate the co-analysis of large amounts of data and information for Earth science.
Thus, from the point of view of advancing science: on the continuum of ever-evolving data management systems, we need to understand and develop ways that allow the variety of data relationships to be examined, and information to be manipulated, such that knowledge can be enhanced to facilitate science. Recognizing the importance and potential impact of the unlimited ways to co-analyze heterogeneous datasets, now and especially in the future, one of the objectives of the ESDA cluster is to facilitate the preparation of individuals to understand and apply the skills needed for Earth science data analytics. Pinpointing and communicating the needed skills and expertise is new, and not easy. Information technology is just beginning to provide the tools for advancing the analysis of heterogeneous datasets in a big way, thus providing the opportunity to discover unobvious scientific relationships previously invisible to the science eye. And it is not easy: it takes individuals, or teams of individuals, with just the right combination of skills to understand the data and develop the methods to glean knowledge out of data and information. In addition, whereas definitions of data science and big data are (more or less) available (summarized in Reference 5), Earth science data analytics is virtually ignored in the literature, barring a few excellent sources.
Bring NASA Scientific Data into GIS
NASA Astrophysics Data System (ADS)
Xu, H.
2016-12-01
NASA's Earth Observation System (EOS) and many other missions produce huge volumes of near-real-time data that drive research on, and understanding of, climate change. Geographic Information System (GIS) technology is used for the management, visualization, and analysis of spatial data. Since its inception in the 1960s, GIS has been applied to many fields at the city, state, national, and world scales. People continue to use it today to analyze and visualize trends, patterns, and relationships in massive scientific datasets. There is great interest in both the scientific and GIS communities in improving technologies that can bring scientific data into a GIS environment, where scientific research and analysis can be shared through the GIS platform with the public. Most NASA scientific data are delivered in the Hierarchical Data Format (HDF), a format that is both flexible and powerful. However, this flexibility creates challenges for GIS software support: data stored in HDF formats lack a unified standard and convention across products. The presentation introduces an information model that enables ArcGIS software to ingest NASA scientific data and create a multidimensional raster, comprising univariate and multivariate hypercubes, for scientific visualization and analysis. We will present the framework by which ArcGIS leverages the open-source GDAL (Geospatial Data Abstraction Library) to support raster data access, discuss how we overcame the limitations of the GDAL drivers in handling scientific products stored in HDF4 and HDF5 formats, and describe how we improved the modeling of multidimensionality with GDAL. In addition, we will discuss the direction of ArcGIS support for NASA products and demonstrate how the multidimensional information model can help scientists work with data products such as MODIS, MOPITT, and SMAP, as well as many other data products, in a GIS environment.
NASA Astrophysics Data System (ADS)
Kruithof, Maarten C.; Bouma, Henri; Fischer, Noëlle M.; Schutte, Klamer
2016-10-01
Object recognition is important to understand the content of video and allow flexible querying in a large number of cameras, especially for security applications. Recent benchmarks show that deep convolutional neural networks are excellent approaches for object recognition. This paper describes an approach of domain transfer, where features learned from a large annotated dataset are transferred to a target domain where less annotated examples are available as is typical for the security and defense domain. Many of these networks trained on natural images appear to learn features similar to Gabor filters and color blobs in the first layer. These first-layer features appear to be generic for many datasets and tasks while the last layer is specific. In this paper, we study the effect of copying all layers and fine-tuning a variable number. We performed an experiment with a Caffe-based network on 1000 ImageNet classes that are randomly divided in two equal subgroups for the transfer from one to the other. We copy all layers and vary the number of layers that is fine-tuned and the size of the target dataset. We performed additional experiments with the Keras platform on CIFAR-10 dataset to validate general applicability. We show with both platforms and both datasets that the accuracy on the target dataset improves when more target data is used. When the target dataset is large, it is beneficial to freeze only a few layers. For a large target dataset, the network without transfer learning performs better than the transfer network, especially if many layers are frozen. When the target dataset is small, it is beneficial to transfer (and freeze) many layers. For a small target dataset, the transfer network boosts generalization and it performs much better than the network without transfer learning. Learning time can be reduced by freezing many layers in a network.
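The freezing policy studied above (freeze many layers for a small target dataset, few for a large one) can be illustrated without any deep learning framework. The following toy sketch, which is an illustration of the policy rather than the authors' Caffe/Keras code, represents a network as a per-layer trainable mask:

```python
def freeze_layers(num_layers, num_frozen):
    """Return a per-layer trainable mask for transfer learning: the first
    `num_frozen` layers (the generic, Gabor-like early features) are
    frozen (False), and the remaining layers are fine-tuned (True)."""
    if not 0 <= num_frozen <= num_layers:
        raise ValueError("num_frozen must be within [0, num_layers]")
    return [i >= num_frozen for i in range(num_layers)]

# Small target dataset: transfer and freeze many layers.
mask_small_target = freeze_layers(8, 6)   # only the last 2 layers train
# Large target dataset: freeze only a few layers.
mask_large_target = freeze_layers(8, 2)   # the last 6 layers train
```

In a real framework the mask would be applied by setting each layer's trainable flag (e.g. `layer.trainable = False` in Keras) before recompiling and fine-tuning on the target data.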
The experience of linking Victorian emergency medical service trauma data
Boyle, Malcolm J
2008-01-01
Background The linking of a large Emergency Medical Service (EMS) dataset with the Victorian Department of Human Services (DHS) hospital datasets and the Victorian State Trauma Outcome Registry and Monitoring (VSTORM) dataset to determine patient outcomes has not previously been undertaken in Victoria. The objective of this study was to identify the linkage rate of a large EMS trauma dataset with the DHS hospital datasets and the VSTORM dataset. Methods The linking of an EMS trauma dataset to the hospital datasets utilised deterministic and probabilistic matching. The linking of three EMS trauma datasets to the VSTORM dataset utilised deterministic, probabilistic and manual matching. Results 66.7% of patients from the EMS dataset were located in the VEMD. 96% of patients defined in the VEMD as being admitted to hospital were located in the VAED. 3.7% of patients located in the VAED could not be found in the VEMD due to hospitals not reporting to the VEMD. For the EMS datasets, manual matching produced a 146% increase in successful links to VSTORM with the trauma profile dataset, a 221% increase with the mechanism-of-injury-only dataset, and a 46% increase with the sudden deterioration dataset, compared to deterministic matching. Conclusion This study has demonstrated that EMS data can be successfully linked to other health-related datasets using deterministic and probabilistic matching with varying levels of success. The quality of EMS data needs to be improved to ensure better linkage success rates with other health-related datasets. PMID:19014622
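The deterministic and probabilistic matching contrasted above can be sketched in a few lines. This is a generic illustration, not the study's linkage software: the field names, agreement weights, and acceptance threshold below are all hypothetical, and the probabilistic score is a simplified Fellegi-Sunter-style sum of weights for agreeing fields:

```python
def deterministic_match(a, b, keys):
    """Deterministic linkage: exact agreement on every key field."""
    return all(a.get(k) == b.get(k) for k in keys)

def probabilistic_score(a, b, weights):
    """Simplified probabilistic linkage: sum each field's agreement
    weight when the two records agree on that field."""
    return sum(w for k, w in weights.items() if a.get(k) == b.get(k))

# Hypothetical EMS and hospital records that differ only by a surname typo.
ems = {"dob": "1970-01-02", "sex": "M", "postcode": "3000", "surname": "Smyth"}
hosp = {"dob": "1970-01-02", "sex": "M", "postcode": "3000", "surname": "Smith"}

exact = deterministic_match(ems, hosp, ["dob", "sex", "surname"])
score = probabilistic_score(ems, hosp, {"dob": 4.0, "sex": 1.0,
                                        "postcode": 2.0, "surname": 3.0})
linked = score >= 6.0  # acceptance threshold is an assumption
```

The example shows why probabilistic matching recovers pairs that deterministic matching misses: a single discrepant field fails the exact match but only reduces the score.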
Wide-Open: Accelerating public data release by automating detection of overdue datasets
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-01-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
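The text-mining step described above hinges on recognizing dataset accessions in article text. A minimal sketch of that step is shown below; the accession prefixes (GSE for GEO series, SRP/SRR/SRA/SRX for SRA records) are real NCBI conventions, but the specific accession strings in the example and the function itself are illustrative assumptions, not Wide-Open's implementation:

```python
import re

# GEO series accessions (GSE...) and SRA accessions (SRP/SRR/SRA/SRX...).
ACCESSION_RE = re.compile(r"\b(GSE\d+|SR[PRAX]\d+)\b")

def find_dataset_references(article_text):
    """Return the unique dataset accessions mentioned in an article."""
    return sorted(set(ACCESSION_RE.findall(article_text)))

# Hypothetical article sentence with made-up accession numbers.
text = ("Raw reads were deposited in the SRA under SRP456789 and processed "
        "expression matrices in GEO under accession GSE12345.")
refs = find_dataset_references(text)
```

In the full pipeline, each extracted accession would then be queried against the repository to check whether the dataset is still private past its expected release date.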
Modeling Individual Cyclic Variation in Human Behavior.
Pierson, Emma; Althoff, Tim; Leskovec, Jure
2018-04-01
Cycles are fundamental to human health and behavior. Examples include mood cycles, circadian rhythms, and the menstrual cycle. However, modeling cycles in time series data is challenging because in most cases the cycles are not labeled or directly observed and need to be inferred from multidimensional measurements taken over time. Here, we present Cyclic Hidden Markov Models (CyHMMs) for detecting and modeling cycles in a collection of multidimensional heterogeneous time series data. In contrast to previous cycle modeling methods, CyHMMs deal with a number of challenges encountered in modeling real-world cycles: they can model multivariate data with both discrete and continuous dimensions; they explicitly model and are robust to missing data; and they can share information across individuals to accommodate variation both within and between individual time series. Experiments on synthetic and real-world health-tracking data demonstrate that CyHMMs infer cycle lengths more accurately than existing methods, with 58% lower error on simulated data and 63% lower error on real-world data compared to the best-performing baseline. CyHMMs can also perform functions which baselines cannot: they can model the progression of individual features/symptoms over the course of the cycle, identify the most variable features, and cluster individual time series into groups with distinct characteristics. Applying CyHMMs to two real-world health-tracking datasets (human menstrual cycle symptoms and physical activity tracking data) yields important insights, including which symptoms to expect at each point during the cycle. We also find that people fall into several groups with distinct cycle patterns, and that these groups differ along dimensions not provided to the model. For example, by modeling missing data in the menstrual cycles dataset, we are able to discover a medically relevant group of birth control users even though information on birth control is not given to the model.
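The core structural idea of a cyclic hidden Markov model is a chain of hidden phases arranged in a ring, where each phase either self-loops or advances to the next. Under that structure (a simplifying assumption here, not the paper's full model, which also learns emission distributions and handles missing data), the expected cycle length has a simple closed form, which a simulation can sanity-check:

```python
import random

def expected_cycle_length(num_states, advance_prob):
    """For a ring of hidden states where each state advances to the next
    with probability p (else self-loops), the dwell time per state is
    geometric with mean 1/p, so the expected cycle length is K / p."""
    return num_states / advance_prob

def simulate_cycle_length(num_states, advance_prob, seed=0):
    """Count how many time steps one full traversal of the ring takes."""
    rng = random.Random(seed)
    steps, advanced = 0, 0
    while advanced < num_states:
        steps += 1
        if rng.random() < advance_prob:
            advanced += 1
    return steps

# e.g. 4 hidden phases, each advancing with probability 0.5 per day
length_days = expected_cycle_length(4, 0.5)  # 8.0 days
```

Fitting such a model to data would additionally require learning the advance probabilities and per-state emission parameters, e.g. via the Baum-Welch algorithm.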
Uncertainty propagation for statistical impact prediction of space debris
NASA Astrophysics Data System (ADS)
Hoogendoorn, R.; Mooij, E.; Geul, J.
2018-01-01
Predictions of the impact time and location of space debris in a decaying trajectory are highly influenced by uncertainties. The traditional Monte Carlo (MC) method can be used to perform accurate statistical impact predictions, but requires a large computational effort. A method is investigated that directly propagates a Probability Density Function (PDF) in time, which has the potential to obtain more accurate results with less computational effort. The decaying trajectory of Delta-K rocket stages was used to test the methods using a six degrees-of-freedom state model. The PDF of the state of the body was propagated in time to obtain impact-time distributions. This Direct PDF Propagation (DPP) method results in a multi-dimensional scattered dataset of the PDF of the state, which is highly challenging to process. No accurate results could be obtained, because of the structure of the DPP data and the high dimensionality. Therefore, the DPP method is less suitable for practical uncontrolled entry problems and the traditional MC method remains superior. Additionally, the MC method was used with two improved uncertainty models to obtain impact-time distributions, which were validated using observations of true impacts. For one of the two uncertainty models, statistically more valid impact-time distributions were obtained than in previous research.
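The traditional Monte Carlo approach found superior above amounts to sampling the uncertain inputs and propagating each sample through the dynamics to build an impact-time distribution. The sketch below replaces the paper's six-degrees-of-freedom entry model with a deliberately crude constant-decay-rate stand-in, so only the MC statistics, not the dynamics, are illustrative; all parameter values are assumptions:

```python
import random
import statistics

def mc_impact_times(n_samples, h0_km=120.0, h0_sigma=5.0,
                    decay_km_per_hr=10.0, decay_sigma=2.0, seed=42):
    """Toy Monte Carlo impact-time prediction: sample an uncertain
    initial altitude and an uncertain (constant) decay rate, and record
    the impact time t = altitude / rate for each sample."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_samples):
        h0 = rng.gauss(h0_km, h0_sigma)
        rate = max(rng.gauss(decay_km_per_hr, decay_sigma), 1e-6)
        times.append(h0 / rate)
    return times

times = mc_impact_times(2000)
mean_t = statistics.mean(times)
```

The resulting list of impact times is the empirical distribution from which percentile impact windows would be read off; the computational cost the paper highlights comes from replacing the one-line propagation here with a full trajectory integration per sample.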
CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets
Nowicka, Malgorzata; Krieg, Carsten; Weber, Lukas M.; Hartmann, Felix J.; Guglietta, Silvia; Becher, Burkhard; Levesque, Mitchell P.; Robinson, Mark D.
2017-01-01
High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell counts or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals). PMID:28663787
Identification of cancer genes that are independent of dominant proliferation and lineage programs
Selfors, Laura M.; Stover, Daniel G.; Harris, Isaac S.; Brugge, Joan S.; Coloff, Jonathan L.
2017-01-01
Large, multidimensional cancer datasets provide a resource that can be mined to identify candidate therapeutic targets for specific subgroups of tumors. Here, we analyzed human breast cancer data to identify transcriptional programs associated with tumors bearing specific genetic driver alterations. Using an unbiased approach, we identified thousands of genes whose expression was enriched in tumors with specific genetic alterations. However, expression of the vast majority of these genes was not enriched if associations were analyzed within individual breast tumor molecular subtypes, across multiple tumor types, or after gene expression was normalized to account for differences in proliferation or tumor lineage. Together with linear modeling results, these findings suggest that most transcriptional programs associated with specific genetic alterations in oncogenes and tumor suppressors are highly context-dependent and are predominantly linked to differences in proliferation programs between distinct breast cancer subtypes. We demonstrate that such proliferation-dependent gene expression dominates tumor transcriptional programs relative to matched normal tissues. However, we also identified a relatively small group of cancer-associated genes that are both proliferation- and lineage-independent. A subset of these genes are attractive candidate targets for combination therapy because they are essential in breast cancer cell lines, druggable, enriched in stem-like breast cancer cells, and resistant to chemotherapy-induced down-regulation. PMID:29229826
NASA Astrophysics Data System (ADS)
Lateh, Masitah Abdul; Kamilah Muda, Azah; Yusof, Zeratul Izzah Mohd; Azilah Muda, Noor; Sanusi Azmi, Mohd
2017-09-01
The emerging era of big data over the past few years has produced large, complex datasets that demand faster and better decision making. However, small-dataset problems still arise in certain areas, making analysis and decision making difficult. To build a prediction model, a large sample is required for training; a small dataset is insufficient to produce an accurate prediction model. This paper reviews artificial data generation approaches as one solution to the small dataset problem.
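One common family of approaches in this space generates synthetic observations from the small sample itself. A minimal sketch of one such scheme, bootstrap resampling plus feature-scaled Gaussian jitter (the data, noise fraction, and function names are hypothetical, not taken from the review):

```python
import random

def augment(samples, n_new, noise_frac=0.05, seed=0):
    """Generate synthetic observations by resampling the small dataset
    and jittering each feature with Gaussian noise scaled to its spread."""
    rng = random.Random(seed)
    dims = list(zip(*samples))
    spreads = [max(d) - min(d) or 1.0 for d in dims]
    out = []
    for _ in range(n_new):
        base = rng.choice(samples)
        out.append(tuple(v + rng.gauss(0, noise_frac * s)
                         for v, s in zip(base, spreads)))
    return out

small = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (1.1, 10.4)]
synthetic = augment(small, n_new=50)
print(len(synthetic), synthetic[0])
```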
geoknife: Reproducible web-processing of large gridded datasets
Read, Jordan S.; Walker, Jordan I.; Appling, Alison P.; Blodgett, David L.; Read, Emily K.; Winslow, Luke A.
2016-01-01
Geoprocessing of large gridded data according to overlap with irregular landscape features is common to many large-scale ecological analyses. The geoknife R package was created to facilitate reproducible analyses of gridded datasets found on the U.S. Geological Survey Geo Data Portal web application or elsewhere, using a web-enabled workflow that eliminates the need to download and store large datasets that are reliably hosted on the Internet. The package provides access to several data subset and summarization algorithms that are available on remote web processing servers. Outputs from geoknife include spatial and temporal data subsets, spatially-averaged time series values filtered by user-specified areas of interest, and categorical coverage fractions for various land-use types.
A high-resolution European dataset for hydrologic modeling
NASA Astrophysics Data System (ADS)
Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta
2013-04-01
There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging, for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan-European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, as well as evapotranspiration (potential, bare-soil and open-water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated in recent years.
The dataset variables are used as inputs to the hydrological calibration and validation of EFAS as well as for establishing long-term discharge "proxy" climatologies which can then in turn be used for statistical analysis to derive return periods or other time series derivatives. In addition, this dataset will be used to assess climatological trends in Europe. Unfortunately, to date no baseline dataset at the European scale exists to test the quality of the data presented here. Hence, a comparison against other existing datasets can only be an indication of data quality. Owing to data availability, the comparison was made for precipitation and temperature only, arguably the most important meteorological drivers for hydrologic models. A variety of analyses was undertaken at country scale against data reported to EUROSTAT and E-OBS datasets. The comparison revealed that while the datasets showed overall similar temporal and spatial patterns, there were some differences in magnitudes, especially for precipitation. It is not straightforward to define the specific cause for these differences. However, in most cases the comparatively low observation station density appears to be the principal reason for the differences in magnitude.
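The potential evapotranspiration calculation mentioned above can be sketched with the standard FAO-56 form of the Penman-Monteith equation. The abstract does not specify which variant EFAS-Meteo uses, so treat this as an illustrative assumption; the input values below are hypothetical.

```python
import math

def svp(t_c):
    """Saturation vapour pressure [kPa] at air temperature t_c [deg C]."""
    return 0.6108 * math.exp(17.27 * t_c / (t_c + 237.3))

def penman_monteith_fao56(t_mean, rn, g, u2, ea, pressure=101.3):
    """FAO-56 reference evapotranspiration ET0 [mm/day].
    t_mean: mean air temperature [C]; rn: net radiation [MJ/m2/day];
    g: soil heat flux [MJ/m2/day]; u2: wind speed at 2 m [m/s];
    ea: actual vapour pressure [kPa]; pressure: air pressure [kPa]."""
    es = svp(t_mean)
    delta = 4098 * es / (t_mean + 237.3) ** 2   # slope of the SVP curve
    gamma = 0.000665 * pressure                 # psychrometric constant
    num = (0.408 * delta * (rn - g)
           + gamma * (900 / (t_mean + 273)) * u2 * (es - ea))
    return num / (delta + gamma * (1 + 0.34 * u2))

et0 = penman_monteith_fao56(t_mean=16.9, rn=13.28, g=0.14, u2=2.078, ea=1.409)
print(f"ET0 = {et0:.2f} mm/day")
```

Note this sketch uses es(Tmean) rather than the FAO-recommended average of es(Tmax) and es(Tmin), a simplification that shifts the result slightly.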
Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments.
Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian
2014-07-01
We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.
Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.
Ernst, Jason; Kellis, Manolis
2015-04-01
With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
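ChromImpute's actual ensemble is more elaborate, but the core idea of predicting an unobserved signal track from correlated observed tracks with an ensemble of regression trees can be sketched with bagged regression stumps. This is a deliberately minimal stand-in, not the published method, and the signal values are invented.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression stump on one 1-D feature."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        thr = (xs[order[k - 1]] + xs[order[k]]) / 2
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x < thr else mr

def bagged_stumps(xs, ys, n_trees=25, seed=1):
    """Average the predictions of stumps fit on bootstrap resamples."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / n_trees

# Hypothetical: predict an unobserved mark from one correlated observed mark.
observed = [0.1, 0.3, 0.2, 0.9, 1.1, 1.0, 0.15, 0.95]
target   = [0.2, 0.4, 0.3, 1.8, 2.1, 2.0, 0.25, 1.9]
model = bagged_stumps(observed, target)
print(round(model(1.05), 2), round(model(0.12), 2))
```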
An interactive web application for the dissemination of human systems immunology data.
Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien
2015-06-19
Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State of the art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets were loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples are displayed dynamically; if desired the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.
Multidimensional indexing structure for use with linear optimization queries
NASA Technical Reports Server (NTRS)
Bergman, Lawrence David (Inventor); Castelli, Vittorio (Inventor); Chang, Yuan-Chi (Inventor); Li, Chung-Sheng (Inventor); Smith, John Richard (Inventor)
2002-01-01
Linear optimization queries, which usually arise in various decision support and resource planning applications, are queries that retrieve top N data records (where N is an integer greater than zero) which satisfy a specific optimization criterion. The optimization criterion is to either maximize or minimize a linear equation. The coefficients of the linear equation are given at query time. Methods and apparatus are disclosed for constructing, maintaining and utilizing a multidimensional indexing structure of database records to improve the execution speed of linear optimization queries. Database records with numerical attributes are organized into a number of layers and each layer represents a geometric structure called convex hull. Such linear optimization queries are processed by searching from the outer-most layer of this multi-layer indexing structure inwards. At least one record per layer will satisfy the query criterion and the number of layers needed to be searched depends on the spatial distribution of records, the query-issued linear coefficients, and N, the number of records to be returned. When N is small compared to the total size of the database, answering the query typically requires searching only a small fraction of all relevant records, resulting in a tremendous speedup as compared to linearly scanning the entire dataset.
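The layered convex-hull ("onion peeling") index described above can be sketched in 2-D: build nested hulls with Andrew's monotone chain, then answer a top-1 linear maximization by scanning only the outermost layer, where the optimum of any linear objective must lie. This is an illustrative simplification of the patented multi-layer structure, with invented points.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices of a 2-D point set."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and ((h[-1][0] - h[-2][0]) * (p[1] - h[-2][1]) -
                                   (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0])) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    return half(pts) + half(pts[::-1])

def convex_layers(points):
    """Peel the point set into nested convex hulls (the index)."""
    remaining, layers = list(points), []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        hull_set = set(hull)
        remaining = [p for p in remaining if p not in hull_set]
    return layers

def top1_linear(layers, a, b):
    """Maximize a*x + b*y: the optimum always lies on the outermost layer."""
    return max(layers[0], key=lambda p: a * p[0] + b * p[1])

pts = [(1, 1), (2, 5), (5, 2), (4, 4), (3, 3), (0, 4), (5, 5)]
layers = convex_layers(pts)
print(len(layers), top1_linear(layers, 1.0, 1.0))
```

For top-N with N > 1, the search would proceed inward through successive layers, which is why the number of layers visited grows with N, as the abstract notes.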
NASA Astrophysics Data System (ADS)
Tang, Jun; Yuan, Yunbin
2017-10-01
Ionospheric anomalies possibly associated with large earthquakes, particularly coseismic ionospheric disturbances, have been detected by the global positioning system (GPS). A large Nepal earthquake with magnitude Mw7.8 occurred on April 25, 2015. In this paper, we investigate the multi-dimensional distribution of near-field coseismic ionospheric disturbances (CIDs) using total electron content (TEC) and computerized ionospheric tomography (CIT) from regional GPS observational data. The results show significant ionospheric TEC disturbances and interesting multi-dimensional structures around the main shock. Regarding the TEC changes, coseismic ionospheric disturbances occur approximately 10-20 min after the earthquake, to the northeast and northwest of the epicentre. The maximum ridge-to-trough amplitude of CIDs is up to approximately 0.90 TECU/min. Propagation velocities of the TEC disturbances are 1.27 ± 0.06 km/s and 1.91 ± 0.38 km/s. It is believed that the ionospheric disturbances are triggered by acoustic and Rayleigh waves. Tomographic results show that the three-dimensional distribution of ionospheric disturbances increases markedly at an altitude of 300 km in the region surrounding the epicentre, predominantly between 200 km and 400 km. Significant ionospheric disturbances appear at 06:30 UT in the tomographic images. This study reveals characteristics of an ionospheric anomaly caused by the Nepal earthquake.
Advancing Collaboration through Hydrologic Data and Model Sharing
NASA Astrophysics Data System (ADS)
Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Castronova, A. M.; Miles, B.; Li, Z.; Morsy, M. M.
2015-12-01
HydroShare is an online, collaborative system for open sharing of hydrologic data, analytical tools, and models. It supports the sharing of and collaboration around "resources" which are defined primarily by standardized metadata, content data models for each resource type, and an overarching resource data model based on the Open Archives Initiative's Object Reuse and Exchange (OAI-ORE) standard and a hierarchical file packaging system called "BagIt". HydroShare expands the data sharing capability of the CUAHSI Hydrologic Information System by broadening the classes of data accommodated to include geospatial and multidimensional space-time datasets commonly used in hydrology. HydroShare also includes new capability for sharing models, model components, and analytical tools and will take advantage of emerging social media functionality to enhance information about and collaboration around hydrologic data and models. It also supports web services and server/cloud based computation operating on resources for the execution of hydrologic models and analysis and visualization of hydrologic data. HydroShare uses iRODS as a network file system for underlying storage of datasets and models. Collaboration is enabled by casting datasets and models as "social objects". Social functions include both private and public sharing, formation of collaborative groups of users, and value-added annotation of shared datasets and models. The HydroShare web interface and social media functions were developed using the Django web application framework coupled to iRODS. Data visualization and analysis is supported through the Tethys Platform web GIS software stack. Links to external systems are supported by RESTful web service interfaces to HydroShare's content. This presentation will introduce the HydroShare functionality developed to date and describe ongoing development of functionality to support collaboration and integration of data and models.
Kasmel, Anu; Tanggaard, Pernille
2011-01-01
This study assessed changes in community members’ ratings of the dimensions of individual community related empowerment (ICRE) before and two years after the implementation of an empowerment expansion framework in three community health promotion initiatives within the Estonian context. We employed a self-administered questionnaire, the adapted mobilisation scale–individual. As the first step, we investigated the multidimensional nature of the ICRE construct and explored the validity and reliability (internal consistency) of the ICRE scale. Two datasets were used. The first dataset comprised a cross-sectional random sample of 1,000 inhabitants of Rapla County selected in 2003 from the National Population Register, which was used to confirm the composition of the dimensions of the scale and to examine the reliability of the dimensions. The second dataset comprised two waves of data: 120 participants from three health promotion programs in 2003 (pre-test) and 115 participants in 2005 (post-test), and the dataset was used to compare participants’ pre-test and post-test ratings of their levels of empowerment. The content validity ratio, determined using Lawshe’s formula, was high (0.98). Five dimensions of ICRE, self-efficacy, intention, participation, motivation and critical awareness, emerged from the factor analysis. The internal consistency (α) of the total empowerment scale was 0.86 (subscales self-efficacy α = 0.88, intention α = 0.83, participation α = 0.81 and motivation α = 0.69; critical awareness comprised only one item). The levels of ICRE dimensions measured after the application of the empowerment expansion framework were significantly more favourable for the dimensions self-efficacy, participation, intention and motivation to participate. We conclude that for Rapla community workgroups and networks, their ICRE was rendered more favourable after the implementation of the empowerment expansion framework. PMID:21776201
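The internal-consistency statistic reported above, Cronbach's alpha, is straightforward to compute from item scores: alpha = k/(k-1) * (1 - sum of item variances / variance of respondent totals). A sketch with invented Likert responses (not the study's data):

```python
def cronbach_alpha(items):
    """items: list of item-score lists, one list per item (same respondents)."""
    k = len(items)
    n = len(items[0])
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[j] for item in items) for j in range(n)]
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Hypothetical: 4 Likert items answered by 5 respondents.
items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 1],
    [3, 4, 3, 4, 2],
]
print(round(cronbach_alpha(items), 2))
```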
The use of large scale datasets for understanding traffic network state.
DOT National Transportation Integrated Search
2013-09-01
The goal of this proposal is to develop novel modeling techniques to infer individual activity patterns from large-scale cell phone datasets and taxi data from NYC. As such this research offers a paradigm shift from traditional transportation m...
The Renewed Primary School in Belgium: Analysis of the Local Innovation Policy.
ERIC Educational Resources Information Center
Vandenberghe, Roland
The Renewed Primary School project in Belgium is analyzed in this paper in terms of organizational response to a large-scale innovation, which is characterized by its multidimensionality, by the large number of participating schools, and by a complex support structure. Section 2 of the report presents an elaborated description of these…
NASA Astrophysics Data System (ADS)
Williamson, A.; Newman, A. V.
2017-12-01
Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes when data are available. The mixture of geodetic datasets such as Global Navigational Satellite Systems (GNSS) and InSAR, seismic waveforms, and, when applicable, tsunami waveforms from Deep-Ocean Assessment and Reporting of Tsunami (DART) gauges, provides slightly different observations that, when incorporated together, lead to a more robust model of fault slip distribution. The merging of different datasets is of particular importance along subduction zones, where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and the dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained individually from geodetic and tsunami datasets as well as in a combined dataset. We constrain the importance of distance between estimated parameters and observed data and how that varies between land-based and open-ocean datasets. Analysis focuses on accurately scaled subduction zone synthetic models as well as analysis of the relationship between slip and data in recent large subduction zone earthquakes. This study shows that seafloor-deformation-sensitive datasets, like open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large and particularly tsunamigenic megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.
NASA Astrophysics Data System (ADS)
Lary, D. J.
2013-12-01
A BigData case study is described in which multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster with an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce the development time and enhance the functionality, a high-level language capable of parallel processing (MATLAB) has been used. Key considerations for the system are high-speed access due to the large data volume, persistence of the large data volumes, and a precise process time scheduling capability.
Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data.
Marco-Ramell, Anna; Palau-Rodriguez, Magali; Alay, Ania; Tulipani, Sara; Urpi-Sarda, Mireia; Sanchez-Pla, Alex; Andres-Lacueva, Cristina
2018-01-02
Bioinformatic tools for the enrichment of 'omics' datasets facilitate interpretation and understanding of data. To date few are suitable for metabolomics datasets. The main objective of this work is to give a critical overview, for the first time, of the performance of these tools. To that aim, datasets from metabolomic repositories were selected and enriched data were created. Both types of data were analysed with these tools and outputs were thoroughly examined. An exploratory multivariate analysis of the most used tools for the enrichment of metabolite sets, based on a non-metric multidimensional scaling (NMDS) of Jaccard's distances, was performed and mirrored their diversity. Codes (identifiers) of the metabolites of the datasets were searched in different metabolite databases (HMDB, KEGG, PubChem, ChEBI, BioCyc/HumanCyc, LipidMAPS, ChemSpider, METLIN and Recon2). The databases that presented more identifiers of the metabolites of the dataset were PubChem, followed by METLIN and ChEBI. However, these databases had duplicated entries and might present false positives. The performance of over-representation analysis (ORA) tools, including BioCyc/HumanCyc, ConsensusPathDB, IMPaLA, MBRole, MetaboAnalyst, Metabox, MetExplore, MPEA, PathVisio and Reactome and the mapping tool KEGGREST, was examined. Results were mostly consistent among tools and between real and enriched data despite the variability of the tools. Nevertheless, a few controversial results such as differences in the total number of metabolites were also found. Disease-based enrichment analyses were also assessed, but they were not found to be accurate probably due to the fact that metabolite disease sets are not up-to-date and the difficulty of predicting diseases from a list of metabolites. We have extensively reviewed the state-of-the-art of the available range of tools for metabolomic datasets, the completeness of metabolite databases, the performance of ORA methods and disease-based analyses. 
Despite the variability of the tools, they provided consistent results independent of their analytic approach. However, more work on the completeness of metabolite and pathway databases is required, which strongly affects the accuracy of enrichment analyses. Improvements will be translated into more accurate and global insights of the metabolome.
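The exploratory comparison of tools described above rests on Jaccard distances between the metabolite or pathway sets each tool returns, which are then ordinated by NMDS. A sketch of that distance computation, with invented tool outputs (the NMDS step itself is omitted):

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| between two identifier sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical enriched-pathway sets returned by three ORA tools.
tools = {
    "tool_A": {"glycolysis", "tca_cycle", "urea_cycle"},
    "tool_B": {"glycolysis", "tca_cycle", "pentose_phosphate"},
    "tool_C": {"bile_acids", "urea_cycle"},
}
names = sorted(tools)
dist = [[round(jaccard_distance(tools[x], tools[y]), 2) for y in names]
        for x in names]
for name, row in zip(names, dist):
    print(name, row)
```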
Distributed File System Utilities to Manage Large Datasets, Version 0.5
DOE Office of Scientific and Technical Information (OSTI.GOV)
2014-05-21
FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/O calls. The current suite consists of tools to copy, compare, remove, and list. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.
Statistical analysis of large simulated yield datasets for studying climate effects
USDA-ARS?s Scientific Manuscript database
Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...
Reconciling long-term cultural diversity and short-term collective social behavior.
Valori, Luca; Picciolo, Francesco; Allansdottir, Agnes; Garlaschelli, Diego
2012-01-24
An outstanding open problem is whether collective social phenomena occurring over short timescales can systematically reduce cultural heterogeneity in the long run, and whether offline and online human interactions contribute differently to the process. Theoretical models suggest that short-term collective behavior and long-term cultural diversity are mutually excluding, since they require very different levels of social influence. The latter jointly depends on two factors: the topology of the underlying social network and the overlap between individuals in multidimensional cultural space. However, while the empirical properties of social networks are intensively studied, little is known about the large-scale organization of real societies in cultural space, so that random input specifications are necessarily used in models. Here we use a large dataset to perform a high-dimensional analysis of the scientific beliefs of thousands of Europeans. We find that interopinion correlations determine a nontrivial ultrametric hierarchy of individuals in cultural space. When empirical data are used as inputs in models, ultrametricity has strong and counterintuitive effects. On short timescales, it facilitates a symmetry-breaking phase transition triggering coordinated social behavior. On long timescales, it suppresses cultural convergence by restricting it within disjoint groups. Moreover, ultrametricity implies that these results are surprisingly robust to modifications of the dynamical rules considered. Thus the empirical distribution of individuals in cultural space appears to systematically optimize the coexistence of short-term collective behavior and long-term cultural diversity, which can be realized simultaneously for the same moderate level of mutual influence in a diverse range of online and offline settings.
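Ultrametricity, the property the authors find in cultural space, means every triple of points satisfies the strong triangle inequality d(x,z) <= max(d(x,y), d(y,z)), as cophenetic distances from a dendrogram do by construction. A toy check on invented matrices (not the study's data):

```python
from itertools import permutations

def is_ultrametric(d):
    """Check the strong triangle inequality for a symmetric distance matrix."""
    n = len(d)
    return all(d[i][k] <= max(d[i][j], d[j][k]) + 1e-12
               for i, j, k in permutations(range(n), 3))

# Cophenetic distances from a toy dendrogram:
# {a,b} merge at height 1, {c,d} at 2, the two groups at 5.
coph = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
euclid_like = [  # ordinary distances, generally not ultrametric
    [0, 1, 4, 5],
    [1, 0, 3, 4],
    [4, 3, 0, 1],
    [5, 4, 1, 0],
]
print(is_ultrametric(coph), is_ultrametric(euclid_like))
```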
Neighbourhood typology based on virtual audit of environmental obesogenic characteristics.
Feuillet, T; Charreire, H; Roda, C; Ben Rebah, M; Mackenbach, J D; Compernolle, S; Glonti, K; Bárdos, H; Rutter, H; De Bourdeaudhuij, I; McKee, M; Brug, J; Lakerveld, J; Oppert, J-M
2016-01-01
Virtual audit (using tools such as Google Street View) can help assess multiple characteristics of the physical environment. This exposure assessment can then be associated with health outcomes such as obesity. Strengths of virtual audit include the collection of large amounts of data, from various geographical contexts, following standard protocols. Using data from a virtual audit of obesity-related features carried out in five urban European regions, the current study aimed to (i) describe this international virtual audit dataset and (ii) identify neighbourhood patterns that can synthesize the complexity of such data and compare patterns across regions. Data were obtained from 4,486 street segments across urban regions in Belgium, France, Hungary, the Netherlands and the UK. We used multiple factor analysis and hierarchical clustering on principal components to build a typology of neighbourhoods and to identify similar/dissimilar neighbourhoods, regardless of region. Four neighbourhood clusters emerged, which differed in terms of food environment, recreational facilities and active mobility features, i.e. the three indicators derived from factor analysis. Clusters were unequally distributed across urban regions. Neighbourhoods mostly characterized by a high level of outdoor recreational facilities were predominantly located in Greater London, whereas neighbourhoods characterized by high urban density and large amounts of food outlets were mostly located in Paris. Neighbourhoods in the Randstad conurbation, Ghent and Budapest appeared to be very similar, characterized by relatively lower residential densities, greener areas and a very low percentage of streets offering food and recreational facility items. These results provide multidimensional constructs of obesogenic characteristics that may help target at-risk neighbourhoods more effectively than isolated features. © 2016 World Obesity.
Extraction of drainage networks from large terrain datasets using high throughput computing
NASA Astrophysics Data System (ADS)
Gong, Jianya; Xie, Jibo
2009-02-01
Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage networks extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. A HTC environment is employed to test the proposed methods with real datasets.
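A first step in drainage-network extraction from a DEM is assigning each cell a steepest-descent (D8) flow direction; the paper's watershed-based decomposition builds on such primitives. A minimal sketch on a tiny invented DEM (not the authors' high-throughput implementation):

```python
import math

def d8_directions(dem):
    """Steepest-descent (D8) flow direction for each cell of a small DEM.
    Returns (dr, dc) offsets to the downslope neighbor, or None for
    pits and flats with no lower neighbor."""
    rows, cols = len(dem), len(dem[0])
    dirs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best, best_drop = None, 0.0
            for dr, dc in dirs:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    # Elevation drop per unit distance (diagonals are longer).
                    drop = (dem[r][c] - dem[rr][cc]) / math.hypot(dr, dc)
                    if drop > best_drop:
                        best, best_drop = (dr, dc), drop
            out[r][c] = best
    return out

dem = [
    [9, 8, 7],
    [8, 5, 6],
    [7, 4, 2],
]
flow = d8_directions(dem)
print(flow[0][0], flow[2][2])
```

Tracing these per-cell directions downstream yields the drainage network; the decomposition question in the paper is how to split this work across machines along natural watershed boundaries.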
Node, Node-Link, and Node-Link-Group Diagrams: An Evaluation.
Saket, Bahador; Simonetto, Paolo; Kobourov, Stephen; Börner, Katy
2014-12-01
Effectively showing the relationships between objects in a dataset is one of the main tasks in information visualization. Typically there is a well-defined notion of distance between pairs of objects, and traditional approaches such as principal component analysis or multi-dimensional scaling are used to place the objects as points in 2D space, so that similar objects are close to each other. In another typical setting, the dataset is visualized as a network graph, where related nodes are connected by links. More recently, datasets are also visualized as maps, where in addition to nodes and links, there is an explicit representation of groups and clusters. We consider these three techniques, characterized by a progressive increase in the amount of encoded information: node diagrams, node-link diagrams and node-link-group diagrams. We assess these three types of diagrams with a controlled experiment that covers nine different tasks falling broadly into three categories: node-based tasks, network-based tasks and group-based tasks. Our findings indicate that adding links, or links and group representations, does not negatively impact the performance (time and accuracy) of node-based tasks. Similarly, adding group representations does not negatively impact the performance of network-based tasks. Node-link-group diagrams outperform the others on group-based tasks. These conclusions contradict results in other studies in similar but subtly different settings. Taken together, however, such results can have significant implications for the design of standard and domain-specific visualization tools.
Electrochemical force microscopy
Kalinin, Sergei V.; Jesse, Stephen; Collins, Liam F.; Rodriguez, Brian J.
2017-01-10
A system and method for electrochemical force microscopy are provided. The system and method are based on a multidimensional detection scheme that is sensitive to the forces experienced by a biased electrode in a solution. The multidimensional approach allows separation of fast processes, such as double-layer charging and charge relaxation, from slow processes, such as diffusion and faradaic reactions, as well as capturing the bias dependence of the response. The time-resolved and bias measurements can also allow probing of both linear (small bias range) and non-linear (large bias range) electrochemical regimes and potentially the de-convolution of charge dynamics and diffusion processes from steric effects and electrochemical reactivity.
Method of multi-dimensional moment analysis for the characterization of signal peaks
Pfeifer, Kent B; Yelton, William G; Kerr, Dayle R; Bouchier, Francis A
2012-10-23
A method of multi-dimensional moment analysis for the characterization of signal peaks can be used to optimize the operation of an analytical system. With a two-dimensional Peclet analysis, the quality and signal fidelity of peaks in a two-dimensional experimental space can be analyzed and scored. This method is particularly useful in determining optimum operational parameters for an analytical system which requires the automated analysis of large numbers of analyte data peaks. For example, the method can be used to optimize analytical systems including an ion mobility spectrometer that uses a temperature stepped desorption technique for the detection of explosive mixtures.
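The two-dimensional Peclet analysis itself is not reproduced here, but the underlying per-peak moment computation can be sketched as follows; the centroid²/variance figure of merit is an illustrative assumption, not the patented scoring formula.

```python
# Compute the statistical moments of a sampled signal peak: area (zeroth
# moment), centroid (first moment), and variance (second central moment).
# The Peclet-style score centroid**2 / variance is an assumption used here
# for illustration only.

def peak_moments(t, y):
    m0 = sum(y)                                       # area
    m1 = sum(ti * yi for ti, yi in zip(t, y)) / m0    # centroid
    m2 = sum((ti - m1) ** 2 * yi
             for ti, yi in zip(t, y)) / m0            # variance
    return m0, m1, m2

t = list(range(11))
y = [0, 0, 1, 4, 8, 10, 8, 4, 1, 0, 0]   # symmetric peak centred at t = 5
m0, m1, m2 = peak_moments(t, y)
peclet_score = m1 ** 2 / m2   # large for sharp, well-located peaks
```

In an automated system, peaks whose score falls below a threshold would be flagged as low-fidelity and excluded from quantitation.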
AlMenhali, Entesar Ali; Khalid, Khalizani; Iyanna, Shilpa
2018-01-01
The Environmental Attitudes Inventory (EAI) was developed to evaluate the multidimensional nature of environmental attitudes; however, it is based on a dataset from outside the Arab context. This study reinvestigated the construct validity of the EAI with a new dataset and confirmed the feasibility of applying it in the Arab context. One hundred and forty-eight subjects in Study 1 and 130 in Study 2 provided valid responses. An exploratory factor analysis (EFA) was used to extract a new factor structure in Study 1, and confirmatory factor analysis (CFA) was performed in Study 2. Both studies generated a seven-factor model, and the model fit was discussed for both the studies. Study 2 exhibited satisfactory model fit indices compared to Study 1. Factor loading values of a few items in Study 1 affected the reliability values and average variance extracted values, which demonstrated low discriminant validity. Based on the results of the EFA and CFA, this study showed sufficient model fit and suggested the feasibility of applying the EAI in the Arab context with a good construct validity and internal consistency. PMID:29758021
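The internal-consistency figure mentioned above is conventionally Cronbach's alpha; a minimal computation on hypothetical item scores (not the EAI data) looks like this:

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of
# total scores). The item scores below are hypothetical.

def cronbach_alpha(items):
    # items: list of per-item score lists, all of equal length (respondents)
    k = len(items)
    n = len(items[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = sum(var(it) for it in items)
    totals = [sum(it[i] for it in items) for i in range(n)]
    return k / (k - 1) * (1 - item_vars / var(totals))

items = [[3, 4, 5, 4, 3],
         [3, 5, 5, 4, 2],
         [2, 4, 4, 5, 3]]
alpha = cronbach_alpha(items)   # values above ~0.7 indicate consistency
```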
Do pre-trained deep learning models improve computer-aided classification of digital mammograms?
NASA Astrophysics Data System (ADS)
Aboutalib, Sarah S.; Mohamed, Aly A.; Zuley, Margarita L.; Berg, Wendie A.; Luo, Yahong; Wu, Shandong
2018-02-01
Digital mammography screening is an important exam for the early detection of breast cancer and reduction in mortality. False positives leading to high recall rates, however, result in unnecessary negative consequences for patients and health care systems. To better aid radiologists, computer-aided tools can be utilized to improve the distinction between image classes and thus potentially reduce false recalls. The emergence of deep learning has shown promising results in the area of biomedical imaging data analysis. This study aimed to investigate deep learning and transfer learning methods that can improve digital mammography classification performance. In particular, we evaluated the effect of pre-training deep learning models with other imaging datasets in order to boost classification performance on a digital mammography dataset. Two types of datasets were used for pre-training: (1) a digitized film mammography dataset, and (2) a very large non-medical imaging dataset. By using either of these datasets to pre-train the network initially, and then fine-tuning with the digital mammography dataset, we found an increase in overall classification performance in comparison to a model without pre-training, with the very large non-medical dataset performing the best in improving the classification accuracy.
Rethinking language in autism.
Sterponi, Laura; de Kirby, Kenton; Shankey, Jennifer
2015-07-01
In this article, we invite a rethinking of traditional perspectives of language in autism. We advocate a theoretical reappraisal that offers a corrective to the dominant and largely tacitly held view that language, in its essence, is a referential system and a reflection of the individual's cognition. Drawing on scholarship in Conversation Analysis and linguistic anthropology, we present a multidimensional view of language, showing how it also functions as interactional accomplishment, social action, and mode of experience. From such a multidimensional perspective, we revisit data presented by other researchers that include instances of prototypical features of autistic speech, giving them a somewhat different-at times complementary, at times alternative-interpretation. In doing so, we demonstrate that there is much at stake in the view of language that we as researchers bring to our analysis of autistic speech. Ultimately, we argue that adopting a multidimensional view of language has wide ranging implications, deepening our understanding of autism's core features and developmental trajectory. © The Author(s) 2014.
Secondary analysis of national survey datasets.
Boo, Sunjoo; Froelicher, Erika Sivarajan
2013-06-01
This paper describes the methodological issues associated with secondary analysis of large national survey datasets. Issues about survey sampling, data collection, and non-response and missing data in terms of methodological validity and reliability are discussed. Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis. Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses. © 2012 The Authors. Japan Journal of Nursing Science © 2012 Japan Academy of Nursing Science.
NASA Astrophysics Data System (ADS)
Hahn, T.
2016-10-01
The parallel version of the multidimensional numerical integration package Cuba is presented and achievable speed-ups are discussed. The parallelization is based on the fork/wait POSIX functions, needs no extra software installed, imposes almost no constraints on the integrand function, and works largely automatically.
Shi, Yingzhong; Chung, Fu-Lai; Wang, Shitong
2015-09-01
Recently, a time-adaptive support vector machine (TA-SVM) was proposed for handling nonstationary datasets. Although attractive performance has been reported, and the classifier is distinctive in simultaneously solving several SVM subclassifiers locally and globally through an elegant SVM formulation in an alternative kernel space, the coupling of subclassifiers requires matrix inversion and thus imposes a high computational burden in large nonstationary dataset applications. To overcome this shortcoming, an improved TA-SVM (ITA-SVM) is proposed using a common vector shared by all the SVM subclassifiers involved. ITA-SVM not only keeps an SVM formulation but also avoids the computation of matrix inversion. Thus, we can realize its fast version, the improved time-adaptive core vector machine (ITA-CVM), for large nonstationary datasets by using the CVM technique. ITA-CVM has the merit of asymptotic linear time complexity for large nonstationary datasets and inherits the advantages of TA-SVM. The effectiveness of the proposed classifiers ITA-SVM and ITA-CVM is also experimentally confirmed.
Boubela, Roland N.; Kalcher, Klaudius; Huf, Wolfgang; Našel, Christian; Moser, Ewald
2016-01-01
Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datasets, in particular providing distributed file input for 4D NIfTI fMRI datasets in Scala for use in an Apache Spark environment. Examples for using this Big Data platform in graph analysis of fMRI datasets are shown to illustrate how processing pipelines employing it can be developed. With more tools for the convenient integration of neuroimaging file formats and typical processing steps, big data technologies could find wider endorsement in the community, leading to a range of potentially useful applications especially in view of the current collaborative creation of a wealth of large data repositories including thousands of individual fMRI datasets. PMID:26778951
Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.
Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher
2008-01-01
With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established, however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.
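One reason a unified format helps with efficient rendering is the brick-plus-metadata layout it enables: split the volume into fixed-size bricks and keep per-brick min/max so a renderer can skip bricks entirely outside the range of interest. A minimal sketch of that idea (illustrative layout only, not the UVF specification):

```python
# Split a 3-D volume into bricks and record per-brick (min, max) so that
# bricks outside an isovalue range can be culled without reading their data.

def brick_volume(vol, bs):
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    bricks = {}
    for z in range(0, nz, bs):
        for y in range(0, ny, bs):
            for x in range(0, nx, bs):
                vals = [vol[zz][yy][xx]
                        for zz in range(z, min(z + bs, nz))
                        for yy in range(y, min(y + bs, ny))
                        for xx in range(x, min(x + bs, nx))]
                bricks[(z, y, x)] = (min(vals), max(vals))
    return bricks

# 4x4x4 toy volume where value = z; bricks at z >= 2 hold the high values.
vol = [[[z for x in range(4)] for y in range(4)] for z in range(4)]
index = brick_volume(vol, 2)
visible = [k for k, (lo, hi) in index.items() if hi >= 3]  # cull below iso=3
```

With this index, an isosurface pass at value 3 touches only half the bricks; on a petascale dataset the same principle avoids reading most of the file.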
The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
The aim is to illustrate, with the example of a secondary data analysis study, the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which must be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostic procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiply imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiply imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
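The core idea, filling each missing value several times with a prediction plus random noise and then pooling the analyses, can be sketched minimally as below; this is a single-variable illustration with made-up numbers, not the chained-equation setup of the study.

```python
import random

# Multiple-imputation sketch: fill each missing y with a regression
# prediction plus a randomly drawn observed residual, repeat m times, and
# pool the m analysis estimates (here, the fitted slope).

def fit_line(xs, ys):
    n = len(xs); mx = sum(xs) / n; my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b          # intercept, slope

def impute_once(rng, data):
    obs = [(x, y) for x, y in data if y is not None]
    a, b = fit_line([x for x, _ in obs], [y for _, y in obs])
    resid = [y - (a + b * x) for x, y in obs]
    return [(x, y if y is not None else a + b * x + rng.choice(resid))
            for x, y in data]

rng = random.Random(0)
data = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1), (5, None), (6, None)]
m = 5
pooled_slope = sum(fit_line([x for x, _ in d], [y for _, y in d])[1]
                   for d in (impute_once(rng, data) for _ in range(m))) / m
```

Drawing a fresh residual in each of the m rounds is what preserves the sampling variability that single (deterministic) imputation would understate.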
Deep learning-based fine-grained car make/model classification for visual surveillance
NASA Astrophysics Data System (ADS)
Gundogdu, Erhan; Parıldı, Enes Sinan; Solmaz, Berkan; Yücesoy, Veysel; Koç, Aykut
2017-10-01
Fine-grained object recognition is a challenging computer vision problem that has recently been addressed by utilizing deep Convolutional Neural Networks (CNNs). Nevertheless, the main disadvantage of classification methods relying on deep CNN models is the need for a considerably large amount of data. In addition, there exists relatively little annotated data for a real-world application such as the recognition of car models in a traffic surveillance system. To this end, we concentrate on the classification of fine-grained car makes and/or models for visual scenarios with the help of two different domains. First, a large-scale dataset including approximately 900K images is constructed from a website which includes fine-grained car models. Using these labels, a state-of-the-art CNN model is trained on the constructed dataset. The second domain is the set of images collected from a camera integrated into a traffic surveillance system. These images, numbering over 260K, are gathered by a special license plate detection method on top of a motion detection algorithm. An appropriately sized image region is cropped from the region of interest provided by the detected license plate location. These sets of images and their provided labels for more than 30 classes are employed to fine-tune the CNN model already trained on the large-scale dataset described above. To fine-tune the network, the last two fully-connected layers are randomly initialized and the remaining layers are fine-tuned on the second dataset. In this work, the transfer of a model learned on a large dataset to a smaller one has been successfully performed by utilizing both the limited annotated data of the traffic field and a large-scale dataset with available annotations. Our experimental results both on the validation dataset and in the real field show that the proposed methodology performs favorably against training the CNN model from scratch.
ITQ-54: a multi-dimensional extra-large pore zeolite with 20 × 14 × 12-ring channels
Jiang, Jiuxing; Yun, Yifeng; Zou, Xiaodong; ...
2015-01-01
A multi-dimensional extra-large pore silicogermanate zeolite, named ITQ-54, has been synthesised by in situ decomposition of the N,N-dicyclohexylisoindolinium cation into the N-cyclohexylisoindolinium cation. Its structure was solved by 3D rotation electron diffraction (RED) from crystals of ca. 1 μm in size. The structure of ITQ-54 contains straight intersecting 20 × 14 × 12-ring channels along the three crystallographic axes and it is one of the few zeolites with extra-large channels in more than one direction. ITQ-54 has a framework density of 11.1 T atoms per 1000 Å³, which is one of the lowest among the known zeolites. ITQ-54 was obtained together with GeO2 as an impurity. A heavy liquid separation method was developed and successfully applied to remove this impurity from the zeolite. ITQ-54 is stable up to 600 °C and exhibits permanent porosity. The structure was further refined using powder X-ray diffraction (PXRD) data for both as-made and calcined samples.
Local Prediction Models on Mid-Atlantic Ridge MORB by Principal Component Regression
NASA Astrophysics Data System (ADS)
Ling, X.; Snow, J. E.; Chin, W.
2017-12-01
The isotopic compositions of the daughter isotopes of long-lived radioactive systems (Sr, Nd, Hf and Pb) can be used to map the scale and history of mantle heterogeneities beneath mid-ocean ridges. Our goal is to relate the multidimensional structure in the existing isotopic dataset to an underlying physical reality of mantle sources. The numerical technique of Principal Component Analysis is useful to reduce the linear dependence of the data to a minimal set of orthogonal eigenvectors encapsulating the information contained (cf. Agranier et al. 2005). The dataset used for this study covers almost all the MORBs along the mid-Atlantic Ridge (MAR), from 54°S to 77°N and 8.8°W to -46.7°W, replicating the published dataset of Agranier et al. (2005) plus 53 basalt samples dredged and analyzed since then (data from PetDB). The principal components PC1 and PC2 account for 61.56% and 29.21%, respectively, of the total isotope-ratio variability. Samples with compositions similar to HIMU, EM and DM were identified to better interpret the PCs. PC1 and PC2 account for HIMU and EM, whereas PC2 has limited control over the DM source. PC3 is more strongly controlled by the depleted mantle source than PC2. This means that all three principal components are significantly related to the established mantle sources. We also tested the relationship between mantle heterogeneity and sample locality. The K-means clustering algorithm is a type of unsupervised learning that finds groups in the data based on feature similarity. The PC factor scores of each sample are clustered into three groups. Clusters one and three alternate along the northern and southern MAR. Cluster two appears from 45.18°N to 0.79°N and -27.9°W to -30.40°W, alternating with cluster one. The ridge has been preliminarily divided into 16 sections considering both the clusters and the ridge segments. The principal component regression models the sections based on the 6 isotope ratios and the PCs.
The prediction residual is about 1-2 km. This means that the combined isotope ratios are a strong predictor of geographic location along the ridge, a slightly surprising result. PCR is a robust and powerful method for both visualizing and manipulating the multidimensional representation of isotope data.
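The PCA step described above can be sketched with a leading-eigenvector computation by power iteration on the covariance matrix; the 2-D toy values below stand in for the six isotope-ratio dimensions and are not the PetDB data.

```python
# Centre the data, power-iterate the covariance matrix to get the first
# principal component, and report the fraction of variance it explains.
# Toy 2-D data (hypothetical), standing in for the isotope-ratio space.

def pca_first_component(data, iters=200):
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    X = [[row[i] - means[i] for i in range(d)] for row in data]
    cov = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):                      # power iteration
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
              for i in range(d))                # Rayleigh quotient
    total = sum(cov[i][i] for i in range(d))    # total variance (trace)
    return v, lam / total

# Strongly correlated toy "ratios": PC1 should capture most of the variance.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.9)]
v1, frac = pca_first_component(data)
```

The per-sample scores (projections onto v1, v2, ...) are what the abstract then feeds into K-means and principal component regression.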
Large-scale machine learning and evaluation platform for real-time traffic surveillance
NASA Astrophysics Data System (ADS)
Eichel, Justin A.; Mishra, Akshaya; Miller, Nicholas; Jankovic, Nicholas; Thomas, Mohan A.; Abbott, Tyler; Swanson, Douglas; Keller, Joel
2016-09-01
In traffic engineering, vehicle detectors are trained on limited datasets, resulting in poor accuracy when deployed in real-world surveillance applications. Annotating large-scale high-quality datasets is challenging. Typically, these datasets have limited diversity; they do not reflect the real-world operating environment. There is a need for a large-scale, cloud-based positive and negative mining process and a large-scale learning and evaluation system for the application of automatic traffic measurements and classification. The proposed positive and negative mining process addresses the quality of crowdsourced ground-truth data through machine learning review and human feedback mechanisms. The proposed learning and evaluation system uses a distributed cloud computing framework to handle the data-scaling issues associated with large numbers of samples and a high-dimensional feature space. The system is trained using AdaBoost on 1,000,000 Haar-like features extracted from 70,000 annotated video frames. The trained real-time vehicle detector achieves at least 95% accuracy half of the time and about 78% accuracy 19/20 of the time when tested on ~7,500,000 video frames. At the end of 2016, the dataset is expected to have over 1 billion annotated video frames.
Transforming the Geocomputational Battlespace Framework with HDF5
2010-08-01
layout level, dataset arrays can be stored in chunks or tiles, enabling fast subsetting of large datasets, including compressed datasets. HDF software... Image Base (CIB) image of the AOI: an orthophoto made from rectified grayscale aerial images; b. an IKONOS satellite image made up of 3 spectral
Segmentation of Unstructured Datasets
NASA Technical Reports Server (NTRS)
Bhat, Smitha
1996-01-01
Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.
Hanley, Terry; Ujhelyi, Katalin
2017-01-01
Background The Internet has the potential to help young people by reducing the stigma associated with mental health and enabling young people to access services and professionals which they may not otherwise access. Online support can empower young people, help them develop new online friendships, share personal experiences, communicate with others who understand, provide information and emotional support, and most importantly help them feel less alone and normalize their experiences in the world. Objective The aim of the research was to gain an understanding of how young people use an online forum for emotional and mental health issues. Specifically, the project examined what young people discuss and how they seek support on the forum (objective 1). Furthermore, it looked at how the young service users responded to posts to gain an understanding of how young people provided each other with peer-to-peer support (objective 2). Methods Kooth is an online counseling service for young people aged 11-25 years and experiencing emotional and mental health problems. It is based in the United Kingdom and provides support that is anonymous, confidential, and free at the point of delivery. Kooth provided the researchers with all the online forum posts over a 2-year period, which resulted in a dataset of 622 initial posts and 3657 initial posts with responses. Thematic analysis was employed to elicit key themes from the dataset. Results The findings support the literature that online forums provide young people with both informational and emotional support around a wide array of topics. The findings from this large dataset also reveal that this informational or emotional support can be viewed as directive or nondirective. The nondirective approach refers to when young people provide others with support by sharing their own experiences. These posts do not include explicit advice to act in a particular way, but the sharing process is hoped to be of use to the poster. 
The directive approach, in contrast, involves individuals making an explicit suggestion of what they believe the poster should do. Conclusions This study adds to the research exploring what young people discuss within online forums and provides insights into how these communications take place. Furthermore, it highlights the challenge that organizations may encounter in mediating support that is multidimensional in nature (informational-emotional, directive-nondirective). PMID:28768607
Keuleers, Emmanuel; Balota, David A
2015-01-01
This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.
Sleep stages identification in patients with sleep disorder using k-means clustering
NASA Astrophysics Data System (ADS)
Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.
2018-05-01
Data mining is a computational intelligence discipline in which a large dataset is processed using a certain method to look for patterns within it. These patterns are then used for real-time applications or to develop certain knowledge. It is a valuable tool for solving complex problems, discovering new knowledge, data analysis, and decision making. To get at the patterns that lie inside a large dataset, clustering is used. Clustering is basically grouping data that look similar so that a certain pattern can be seen in the large dataset. Clustering itself has several algorithms to group the data into the corresponding clusters. This research used data from patients who suffer from sleep disorders and aims to help people in the medical world reduce the time required to classify the sleep stages of a patient who suffers from sleep disorders. This study used the K-Means algorithm and silhouette evaluation and found that 3 clusters is the optimal number for this dataset, meaning the data can be divided into 3 sleep stages.
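The K-Means-plus-silhouette procedure described above can be sketched minimally as follows; the 1-D feature values are hypothetical, not the sleep-recording data.

```python
# Minimal K-Means with a mean-silhouette check. Silhouette for a point is
# (b - a) / max(a, b), where a is the mean distance to its own cluster and
# b the mean distance to the nearest other cluster; values near 1 mean the
# chosen k fits well. Toy 1-D data for illustration.

def kmeans(xs, k, iters=50):
    xs = sorted(xs)
    # deterministic seeding: spread initial centers across the sorted data
    centers = [xs[i * (len(xs) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

def mean_silhouette(groups):
    scores = []
    for gi, g in enumerate(groups):
        for x in g:
            a = sum(abs(x - y) for y in g) / max(len(g) - 1, 1)
            b = min(sum(abs(x - y) for y in h) / len(h)
                    for hi, h in enumerate(groups) if hi != gi and h)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 9.0, 9.1, 9.2]
centers, groups = kmeans(xs, 3)
score = mean_silhouette(groups)   # near 1 for well-separated clusters
```

Running this for k = 2, 3, 4, ... and keeping the k with the highest mean silhouette mirrors how the study settled on 3 clusters.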
Image segmentation evaluation for very-large datasets
NASA Astrophysics Data System (ADS)
Reeves, Anthony P.; Liu, Shuang; Xie, Yiting
2016-03-01
With the advent of modern machine learning methods and fully automated image analysis there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes are achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.
Numericware i: Identical by State Matrix Calculator
Kim, Bongsong; Beavis, William D
2017-01-01
We introduce software, Numericware i, to compute an identical by state (IBS) matrix based on genotypic data. Calculating an IBS matrix with a large dataset requires large computer memory and lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. Multithreading allows computational routines to run concurrently on multiple central processing unit (CPU) processors. Forward chopping addresses the memory limitation by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values, including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10,000,000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375
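An IBS coefficient in the 0-to-2 range mentioned above follows naturally from 0/1/2-coded genotypes: at each locus two individuals share 2, 1, or 0 alleles, i.e. 2 - |gi - gj|, averaged over loci. A minimal sketch, with the chunked summation mimicking the forward-chopping idea (the genotypes are made up, and this is not Numericware i's implementation):

```python
# IBS coefficient between two individuals on 0/1/2-coded genotypes,
# summing over fixed-size SNP chunks as a nod to "forward chopping".

def ibs(gi, gj, chunk=4):
    n = len(gi)
    total = 0
    for start in range(0, n, chunk):        # process the SNPs chunk by chunk
        total += sum(2 - abs(a - b)
                     for a, b in zip(gi[start:start + chunk],
                                     gj[start:start + chunk]))
    return total / n

a = [0, 1, 2, 2, 0, 1, 2, 0, 1, 2]
b = [0, 1, 2, 2, 0, 1, 2, 0, 1, 2]   # identical to a
c = [2, 1, 0, 0, 2, 1, 0, 2, 1, 0]   # mostly opposite homozygotes
same = ibs(a, b)
```

Identical genotypes give 2.0 and fully opposite homozygotes give 0.0, which is why the coefficient stays in [0, 2] rather than going negative like covariance-based relationship estimates.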
Large-Scale Pattern Discovery in Music
NASA Astrophysics Data System (ADS)
Bertin-Mahieux, Thierry
This work focuses on extracting patterns in musical data from very large collections. The problem is split into two parts. First, we build such a large collection, the Million Song Dataset, to provide researchers access to commercial-size datasets. Second, we use this collection to study cover song recognition, which involves finding harmonic patterns from audio features. Regarding the Million Song Dataset, we detail how we built the original collection from an online API, and how we encouraged other organizations to participate in the project. The result is the largest research dataset with heterogeneous sources of data available to music technology researchers. We demonstrate some of its potential and discuss the impact it already has on the field. On cover song recognition, we must revisit the existing literature since there are no publicly available results on a dataset of more than a few thousand entries. We present two solutions to tackle the problem, one using a hashing method, and one using a higher-level feature computed from the chromagram (dubbed the 2DFTM). We further investigate the 2DFTM since it has potential to be a relevant representation for any task involving audio harmonic content. Finally, we discuss the future of the dataset and the hope of seeing more work making use of the different sources of data that are linked in the Million Song Dataset. Regarding cover songs, we explain how this might be a first step towards defining a harmonic manifold of music, a space where harmonic similarities between songs would be more apparent.
Yu, Qiang; Wei, Dingbang; Huo, Hongwei
2018-06-18
Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches; it is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms can efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time-consuming for large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q leads to a longer computation time. Based on this observation, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. Experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that qPMS algorithms executed on D' find implanted or real motifs in significantly less time than when executed on D. We thus improve the ability of existing qPMS algorithms to process large DNA datasets by selecting high-quality sample sequence sets, so that motifs can be found quickly in the selected sample set D' rather than through an infeasibly long search of the original sequence set D. Our motif discovery method is an approximate algorithm.
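The qPMS definition above can be made concrete with a brute-force checker. This is only an illustration of the problem statement for tiny inputs, not SamSelect or any of the efficient qPMS algorithms the paper discusses (enumeration over all 4^l candidates is exactly the exponential cost those algorithms avoid):

```python
from itertools import product

def hamming(a, b):
    # mismatch count between equal-length strings
    return sum(x != y for x, y in zip(a, b))

def occurs(seq, motif, d):
    # does `motif` occur in `seq` with at most d mismatches?
    l = len(motif)
    return any(hamming(seq[i:i + l], motif) <= d
               for i in range(len(seq) - l + 1))

def qpms(sequences, l, d, q):
    """Brute-force quorum planted motif search: enumerate all 4^l
    candidates and keep those occurring in at least q*t sequences."""
    t = len(sequences)
    need = q * t
    return ["".join(cand) for cand in product("ACGT", repeat=l)
            if sum(occurs(s, "".join(cand), d) for s in sequences) >= need]
```

With q = 1.0 this reduces to the classic PMS problem (every sequence must contain an occurrence); lowering q relaxes the quorum, which is what makes the sampled set D' with a large q a faithful stand-in for D.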
Light at Night Markup Language (LANML): XML Technology for Light at Night Monitoring Data
NASA Astrophysics Data System (ADS)
Craine, B. L.; Craine, E. R.; Craine, E. M.; Crawford, D. L.
2013-05-01
Light at Night Markup Language (LANML) is an XML-based standard useful for acquiring, validating, transporting, archiving, and analyzing multi-dimensional light-at-night (LAN) datasets of any size. The LANML standard can accommodate a variety of measurement scenarios, including single spot measures, static time series, web-based monitoring networks, mobile measurements, and airborne measurements. LANML is human-readable and machine-readable, and does not require a dedicated parser. In addition, LANML is flexible, ensuring that future extensions of the format remain backward compatible with analysis software. XML technology is at the heart of communication over the internet and can be equally useful at the desktop level, making this standard particularly attractive for web-based applications, educational outreach, and efficient collaboration between research groups.
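Because LANML is plain XML, any stock XML library can read it without a dedicated parser, as the abstract notes. The snippet below shows this with Python's standard library; the element and attribute names (`lanml`, `site`, `measurement`, `sqm`) are hypothetical stand-ins, since the actual LANML schema is not reproduced here:

```python
import xml.etree.ElementTree as ET

# Hypothetical LANML-like document; real tag names come from the LANML schema.
doc = """<lanml>
  <site lat="32.2" lon="-110.9"/>
  <measurement utc="2013-05-01T04:30:00" sqm="21.35" unit="mag/arcsec2"/>
  <measurement utc="2013-05-01T05:00:00" sqm="21.41" unit="mag/arcsec2"/>
</lanml>"""

root = ET.fromstring(doc)
# extract the sky-brightness readings as floats
readings = [float(m.get("sqm")) for m in root.findall("measurement")]
```

The same document is equally consumable by a browser, a desktop script, or a server-side validator, which is the interoperability argument the abstract makes for choosing XML.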
Inverse and Predictive Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Syracuse, Ellen Marie
The LANL Seismo-Acoustic team has a strong capability in developing data-driven models that accurately predict a variety of observations. These models range from the simple – one-dimensional models that are constrained by a single dataset and can be used for quick and efficient predictions – to the complex – multidimensional models that are constrained by several types of data and yield more accurate predictions. While team members typically build models of geophysical characteristics of Earth and source distributions at scales of 1 to 1000s of km, the techniques used are applicable to other types of physical characteristics at an even greater range of scales. The following cases provide a snapshot of some of the modeling work done by the Seismo-Acoustic team at LANL.
Brabets, Timothy P.; Conaway, Jeffrey S.
2009-01-01
The Copper River Basin, the sixth largest watershed in Alaska, drains an area of 24,200 square miles. This large, glacier-fed river flows across a wide alluvial fan before it enters the Gulf of Alaska. Bridges along the Copper River Highway, which traverses the alluvial fan, have been impacted by channel migration. Due to a major channel change in 2001, Bridge 339 at Mile 36 of the highway has undergone excessive scour, resulting in damage to its abutments and approaches. During the snow- and ice-melt runoff season, which typically extends from mid-May to September, the design discharge for the bridge often is exceeded. The approach channel shifts continuously, and during our study it shifted back and forth from the left bank to a course along the right bank nearly parallel to the road. Maintenance at Bridge 339 has been costly and will continue to be so if no action is taken. Possible solutions to the scour and erosion problem include (1) constructing a guide bank to redirect flow, (2) dredging approximately 1,000 feet of channel above the bridge to align flow perpendicular to the bridge, and (3) extending the bridge. The USGS Multi-Dimensional Surface Water Modeling System (MD_SWMS) was used to assess these possible solutions. The major limitation of modeling these scenarios was the inability to predict ongoing channel migration. We used a hybrid dataset of surveyed and synthetic bathymetry in the approach channel, which provided the best approximation of this dynamic system. Under existing conditions and at the highest measured discharge and stage of 32,500 ft3/s and 51.08 ft, respectively, the velocities and shear stresses simulated by MD_SWMS indicate scour and erosion will continue. Construction of a 250-foot-long guide bank would not improve conditions because it is not long enough.
Dredging a channel upstream of Bridge 339 would help align the flow perpendicular to Bridge 339, but because of the mobility of the channel bed, the dredged channel would likely fill in during high flows. Extending Bridge 339 would accommodate higher discharges and re-align flow to the bridge.
Making Sense of 'Big Data' in Provenance Studies
NASA Astrophysics Data System (ADS)
Vermeesch, P.
2014-12-01
Huge online databases can be 'mined' to reveal previously hidden trends and relationships in society. One could argue that sedimentary geology has entered a similar era of 'Big Data', as modern provenance studies routinely apply multiple proxies to dozens of samples. Just like the Internet, sedimentary geology now requires specialised statistical tools to interpret such large datasets. These can be organised on three levels of progressively higher order. A single sample: the most effective way to reveal the provenance information contained in a representative sample of detrital zircon U-Pb ages is probability density estimators such as histograms and kernel density estimates; the widely popular 'probability density plots' implemented in IsoPlot and AgeDisplay compound analytical uncertainty with geological scatter and are therefore invalid. Several samples: multi-panel diagrams comprising many detrital age distributions or compositional pie charts quickly become unwieldy and uninterpretable. For example, if there are N samples in a study, then the number of pairwise comparisons between samples grows quadratically as N(N-1)/2. This is simply too much information for the human eye to process. To solve this problem, it is necessary to (a) express the 'distance' between two samples as a simple scalar and (b) combine all N(N-1)/2 such values in a single two-dimensional 'map', grouping similar samples and pulling apart dissimilar ones. This can be achieved using simple statistics-based dissimilarity measures and a standard statistical method called Multidimensional Scaling (MDS). Several methods: suppose that we use four provenance proxies: bulk petrography, chemistry, heavy minerals and detrital geochronology. This will result in four MDS maps, each of which will likely show slightly different trends and patterns. To deal with such cases, it may be useful to use a related technique called 'three-way multidimensional scaling'.
This results in two graphical outputs: an MDS map, and a map of 'weights' showing the extent to which the different provenance proxies influence the horizontal and vertical axes of the MDS map. Thus, detrital data can inform the user not only about the provenance of sediments, but also about the causal relationships between mineralogy, geochronology and chemistry.
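The core step described above, turning an N x N dissimilarity matrix into a two-dimensional 'map', can be sketched with classical (Torgerson) multidimensional scaling in plain NumPy. This is a generic textbook construction, not the specific dissimilarity measures or three-way MDS variant the abstract refers to:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed an (N, N) symmetric dissimilarity matrix D
    into k dimensions so that pairwise distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]         # keep the k largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

Samples that plot close together on the resulting map have similar detrital signatures; with four proxies one would build four such maps (or fuse them with three-way MDS, as the abstract suggests).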
Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige
2017-01-01
The collection of large-scale datasets available in public repositories is rapidly growing, providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in the NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616
Evaluating Cellular Polyfunctionality with a Novel Polyfunctionality Index
Larsen, Martin; Sauce, Delphine; Arnaud, Laurent; Fastenackels, Solène; Appay, Victor; Gorochov, Guy
2012-01-01
Functional evaluation of naturally occurring or vaccination-induced T cell responses in mice, men and monkeys has in recent years advanced from single-parameter measurements (e.g. IFN-γ secretion) to much more complex multidimensional measurements. Co-secretion of multiple functional molecules (such as cytokines and chemokines) at the single-cell level is now measurable, due primarily to major advances in multiparametric flow cytometry. The very extensive and complex datasets generated by this technology raise the demand for proper analytical tools that enable the analysis of the combinatorial functional properties of T cells, i.e. polyfunctionality. Presently, multidimensional functional measures are analysed either by evaluating all combinations of parameters individually or by summing the frequencies of combinations that include the same number of simultaneous functions. Often these evaluations are visualized as pie charts. Whereas pie charts effectively represent and compare average polyfunctionality profiles of particular T cell subsets or patient groups, they do not document the degree or variation of polyfunctionality within a group, nor do they allow more sophisticated statistical analysis. Here we propose a novel polyfunctionality index that numerically evaluates the degree and variation of polyfunctionality and enables comparative and correlative parametric and non-parametric statistical tests. Moreover, it allows the use of more advanced statistical approaches, such as cluster analysis. We believe that the polyfunctionality index will render polyfunctionality an appropriate end-point measure in future studies of T cell responsiveness. PMID:22860124
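The general idea of collapsing a combinatorial functional profile into a single number can be illustrated as follows. This sketch is not the published polyfunctionality index (the exact definition is given in the paper); it merely shows the shape of such an index: cells are weighted more heavily the more functions they co-express, so the result is a scalar amenable to ordinary statistical tests:

```python
def polyfunctionality_index(fractions, q=1.0):
    """Illustrative polyfunctionality-style summary statistic.

    fractions[i] = fraction of cells exhibiting exactly i simultaneous
    functions, for i = 0..n (must sum to 1). The exponent q tunes how
    strongly highly polyfunctional cells are up-weighted.
    """
    n = len(fractions) - 1  # maximum number of simultaneous functions
    return sum(f * (i / n) ** q for i, f in enumerate(fractions))
```

A population in which every cell expresses all n functions scores 1, a population of non-responding or monofunctional cells scores near 0, and intermediate profiles fall in between, so groups of subjects can be compared with standard parametric or non-parametric tests.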
Atlas-guided cluster analysis of large tractography datasets.
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information from a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects for automatic, anatomically correct, and reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets within a few minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.
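The basic operation of grouping similar fiber tracts can be sketched with a deliberately crude stand-in for the framework above: tracts represented as 3-D polylines are merged whenever their centroids fall within a distance threshold (union-find). The real method is hierarchical, atlas-guided, and uses proper tract-to-tract distances; this sketch only illustrates the clustering step, with all names and the centroid-distance proxy being assumptions:

```python
def cluster_tracts(tracts, threshold):
    """Group 3-D polyline tracts whose centroids lie within `threshold`.

    tracts: list of polylines, each a list of (x, y, z) points.
    Returns a cluster label (0..k-1) per tract.
    """
    def centroid(t):
        n = len(t)
        return tuple(sum(p[i] for p in t) / n for i in range(3))

    cents = [centroid(t) for t in tracts]
    parent = list(range(len(tracts)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(tracts)):
        for j in range(i + 1, len(tracts)):
            d = sum((a - b) ** 2 for a, b in zip(cents[i], cents[j])) ** 0.5
            if d < threshold:
                parent[find(i)] = find(j)

    labels = [find(i) for i in range(len(tracts))]
    remap = {}
    return [remap.setdefault(l, len(remap)) for l in labels]
```

In the paper's framework the atlas supplies cluster seeds and anatomical labels, so bundles matching atlas classes are identified directly while unmatched clusters surface as candidate bundles absent from the atlas.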
McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr
2016-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010
Defining pyromes and global syndromes of fire regimes.
Archibald, Sally; Lehmann, Caroline E R; Gómez-Dans, Jose L; Bradstock, Ross A
2013-04-16
Fire is a ubiquitous component of the Earth system that is poorly understood. To date, a global-scale understanding of fire is largely limited to the annual extent of burning as detected by satellites. This is problematic because fire is multidimensional, and focus on a single metric belies its complexity and importance within the Earth system. To address this, we identified five key characteristics of fire regimes--size, frequency, intensity, season, and extent--and combined new and existing global datasets to represent each. We assessed how these global fire regime characteristics are related to patterns of climate, vegetation (biomes), and human activity. Cross-correlations demonstrate that only certain combinations of fire characteristics are possible, reflecting fundamental constraints in the types of fire regimes that can exist. A Bayesian clustering algorithm identified five global syndromes of fire regimes, or pyromes. Four pyromes represent distinctions between crown, litter, and grass-fueled fires, and the relationship of these to biomes and climate are not deterministic. Pyromes were partially discriminated on the basis of available moisture and rainfall seasonality. Human impacts also affected pyromes and are globally apparent as the driver of a fifth and unique pyrome that represents human-engineered modifications to fire characteristics. Differing biomes and climates may be represented within the same pyrome, implying that pathways of change in future fire regimes in response to changes in climate and human activity may be difficult to predict.
High-Mach number, laser-driven magnetized collisionless shocks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schaeffer, Derek B.; Fox, W.; Haberberger, D.
Collisionless shocks are ubiquitous in space and astrophysical systems, and the class of supercritical shocks is of particular importance due to their role in accelerating particles to high energies. While these shocks have traditionally been studied by spacecraft and remote sensing observations, laboratory experiments can provide reproducible and multi-dimensional datasets that offer complementary understanding of the underlying microphysics. We present experiments undertaken on the OMEGA and OMEGA EP laser facilities that show the formation and evolution of high-Mach number collisionless shocks created through the interaction of a laser-driven magnetic piston and a magnetized ambient plasma. Through time-resolved, 2-D imaging we observe large density and magnetic compressions that propagate at super-Alfvénic speeds and occur over ion kinetic length scales. Electron density and temperature of the initial ambient plasma are characterized using optical Thomson scattering. Measurements of the piston laser-plasma are modeled with 2-D radiation-hydrodynamic simulations, which are used to initialize 2-D particle-in-cell simulations of the interaction between the piston and ambient plasmas. The numerical results show the formation of collisionless shocks, including the separate dynamics of the carbon and hydrogen ions that constitute the ambient plasma and their effect on the shock structure. Furthermore, the simulations also show the shock separating from the piston, which we observe in the data at late experimental times.
High-Mach number, laser-driven magnetized collisionless shocks
NASA Astrophysics Data System (ADS)
Schaeffer, D. B.; Fox, W.; Haberberger, D.; Fiksel, G.; Bhattacharjee, A.; Barnak, D. H.; Hu, S. X.; Germaschewski, K.; Follett, R. K.
2017-12-01
Collisionless shocks are ubiquitous in space and astrophysical systems, and the class of supercritical shocks is of particular importance due to their role in accelerating particles to high energies. While these shocks have been traditionally studied by spacecraft and remote sensing observations, laboratory experiments can provide reproducible and multi-dimensional datasets that provide a complementary understanding of the underlying microphysics. We present experiments undertaken on the OMEGA and OMEGA EP laser facilities that show the formation and evolution of high-Mach number collisionless shocks created through the interaction of a laser-driven magnetic piston and a magnetized ambient plasma. Through time-resolved, 2-D imaging, we observe large density and magnetic compressions that propagate at super-Alfvénic speeds and that occur over ion kinetic length scales. The electron density and temperature of the initial ambient plasma are characterized using optical Thomson scattering. Measurements of the piston laser-plasma are modeled with 2-D radiation-hydrodynamic simulations, which are used to initialize 2-D particle-in-cell simulations of the interaction between the piston and ambient plasmas. The numerical results show the formation of collisionless shocks, including the separate dynamics of the carbon and hydrogen ions that constitute the ambient plasma and their effect on the shock structure. The simulations also show the shock separating from the piston, which we observe in the data at late experimental times.
Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset
NASA Astrophysics Data System (ADS)
Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.
2017-12-01
Here we present the first national-scale flood risk analyses using high-resolution Facebook Connectivity Lab population data together with data from a hyper-resolution flood hazard model. In recent years the field of large-scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms, and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodelled territories and at continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data-poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that, for robust flood risk analysis, both hazard and exposure data should sufficiently resolve local-scale features. Global flood frameworks now enable flood hazard data to be produced at 90 m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5 m resolution, a resolution increase over previous countrywide datasets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
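The integration step described above (modelled water depths combined with a gridded exposure dataset) reduces, in its simplest form, to masking a population grid by a depth grid. The sketch below assumes the two grids have already been resampled onto a common grid, which is exactly the hazard/exposure resolution mismatch the abstract highlights; function and parameter names are illustrative:

```python
import numpy as np

def exposed_population(depth, population, threshold=0.0):
    """Sum the population in cells where modelled flood depth exceeds
    a threshold. `depth` and `population` must be co-registered grids
    of identical shape and resolution."""
    return float(population[depth > threshold].sum())
```

Repeating this for depth grids at several return periods yields the exposure-versus-frequency curve that underpins a national-scale risk estimate; a coarse population grid systematically blurs this sum, which is why the 5 m exposure data matter.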
Systematic review of the multidimensional fatigue symptom inventory-short form.
Donovan, Kristine A; Stein, Kevin D; Lee, Morgan; Leach, Corinne R; Ilozumba, Onaedo; Jacobsen, Paul B
2015-01-01
Fatigue is a subjective complaint that is believed to be multifactorial in its etiology and multidimensional in its expression. Fatigue may be experienced by individuals in different dimensions as physical, mental, and emotional tiredness. The purposes of this study were to review and characterize the use of the 30-item Multidimensional Fatigue Symptom Inventory-Short Form (MFSI-SF) in published studies and to evaluate the available evidence for its psychometric properties. A systematic review was conducted to identify published articles reporting results for the MFSI-SF. Data were analyzed to characterize the internal consistency reliability of the multi-item MFSI-SF scales and their test-retest reliability. Correlation coefficients were summarized to characterize concurrent, convergent, and divergent validity. Standardized effect sizes were calculated to characterize the discriminative validity of the MFSI-SF and its sensitivity to change. Seventy articles were identified. Sample sizes ranged from 10 to 529, and nearly half of the samples consisted exclusively of females. More than half the samples were composed of cancer patients; of those, 59% were breast cancer patients. Mean alpha coefficients for the MFSI-SF fatigue subscales ranged from 0.84 for physical fatigue to 0.93 for general fatigue. The MFSI-SF demonstrated moderate test-retest reliability in a small number of studies. Correlations with other fatigue and vitality measures were moderate to large in size and in the expected direction. The MFSI-SF fatigue subscales were positively correlated with measures of distress and of depressive and anxious symptoms. Effect sizes for discriminative validity ranged from medium to large, while effect sizes for sensitivity to change ranged from small to large. Findings demonstrate the positive psychometric properties of the MFSI-SF, provide evidence for its usefulness in medically ill and nonmedically ill individuals, and support its use in future studies.
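The internal-consistency statistic summarized above (the alpha coefficients of 0.84 to 0.93) is Cronbach's alpha, which can be computed directly from item-level scores. A minimal sketch of the standard formula (this is the textbook definition, not code from the reviewed studies):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)
```

Values approach 1 when items move together (high internal consistency), so subscale alphas of 0.84 to 0.93, as reported for the MFSI-SF, indicate that items within each fatigue subscale measure a coherent construct.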
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
Primary Datasets for Case Studies of River-Water Quality
ERIC Educational Resources Information Center
Goulder, Raymond
2008-01-01
Level 6 (final-year BSc) students undertook case studies on between-site and temporal variation in river-water quality. They used professionally-collected datasets supplied by the Environment Agency. The exercise gave students the experience of working with large, real-world datasets and led to their understanding how the quality of river water is…
Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.
2014-01-01
The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).
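The vertical-accuracy assessment described above boils down to comparing DEM elevations against higher-accuracy reference elevations at checkpoint locations. A minimal sketch of the standard summary statistics follows; the 1.96 multiplier for the ~95% linear error assumes normally distributed, unbiased errors (the NSSDA convention), and the function name is an assumption:

```python
import math

def vertical_accuracy(dem_elev, ref_elev):
    """Summary error statistics for DEM checkpoints.

    dem_elev: DEM elevations sampled at checkpoint locations.
    ref_elev: higher-accuracy reference elevations (e.g., GPS benchmarks).
    """
    errs = [d - r for d, r in zip(dem_elev, ref_elev)]
    n = len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    mean_err = sum(errs) / n        # systematic bias
    le95 = 1.96 * rmse              # ~95% linear error, normal-error assumption
    return {"rmse": rmse, "mean_error": mean_err, "le95": le95}
```

Running the same computation against checkpoints for the NED, SRTM, and ASTER grids is what makes the cross-dataset comparison in the report possible: identical checkpoints, identical statistics, directly comparable RMSE values.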
Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L
2015-02-01
Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used for both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
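The pattern described above, trial data loaded into SQL tables and queried in real time for a cohort of similar patients, can be sketched with Python's sqlite3. The table layout, field names, and rows are invented for illustration; the real NLST schema differs:

```python
import sqlite3

# Toy stand-in for a trial-participant table (NOT the actual NLST schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE participants (
    age INTEGER, pack_years REAL, nodule_mm REAL, cancer INTEGER)""")
conn.executemany("INSERT INTO participants VALUES (?,?,?,?)", [
    (62, 35.0, 8.0, 0), (66, 40.0, 12.0, 1),
    (59, 30.0, 6.0, 0), (64, 45.0, 11.0, 1), (63, 38.0, 9.0, 0),
])

def cohort_cancer_rate(age, nodule_mm, age_tol=5, size_tol=3.0):
    """Return (n_matches, cancer_rate) for participants whose demographics
    and nodule size fall within a tolerance of the query patient."""
    return conn.execute(
        """SELECT COUNT(*), AVG(cancer) FROM participants
           WHERE ABS(age - ?) <= ? AND ABS(nodule_mm - ?) <= ?""",
        (age, age_tol, nodule_mm, size_tol)).fetchone()

n, rate = cohort_cancer_rate(63, 10.0)
```

A web front end would issue the same parameterized query server-side and render the returned cohort statistics.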
ICM: a web server for integrated clustering of multi-dimensional biomedical data.
He, Song; He, Haochen; Xu, Wenjian; Huang, Xin; Jiang, Shuai; Li, Fei; He, Fuchu; Bo, Xiaochen
2016-07-08
Large-scale efforts for parallel acquisition of multi-omics profiling continue to generate extensive amounts of multi-dimensional biomedical data. Thus, integrated clustering of multiple types of omics data is essential for developing individual-based treatments and precision medicine. However, while rapid progress has been made, existing methods for integrated clustering lack an intuitive web interface that would make them accessible to biomedical researchers without programming skills. Here, we present a web tool, named Integrated Clustering of Multi-dimensional biomedical data (ICM), that provides an interface from which to fuse, cluster and visualize multi-dimensional biomedical data and knowledge. With ICM, users can explore the heterogeneity of a disease or a biological process by identifying subgroups of patients. The results obtained can then be interactively modified through an intuitive user interface. Researchers can also share results from ICM with collaborators via a web link containing a Project ID number that directly pulls up the analysis results being shared. ICM also supports incremental clustering, which allows users to add new sample data to the data of a previous study and obtain an updated clustering result. Currently, the ICM web server is available with no login requirement and at no cost at http://biotech.bmi.ac.cn/icm/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Improved p-type conductivity in Al-rich AlGaN using multidimensional Mg-doped superlattices
Zheng, T. C.; Lin, W.; Liu, R.; Cai, D. J.; Li, J. C.; Li, S. P.; Kang, J. Y.
2016-01-01
A novel multidimensional Mg-doped superlattice (SL) is proposed to enhance vertical hole conductivity in conventional Mg-doped AlGaN SLs, which generally suffer from a large potential barrier for holes. Electronic structure calculations within the first-principles theoretical framework indicate that the densities of states (DOS) of the valence band near the Fermi level are more delocalized along the c-axis than in the conventional SL, and the potential barrier significantly decreases. Hole concentration is greatly enhanced in the barrier of the multidimensional SL. Detailed comparisons of partial charges and decomposed DOS reveal that the improvement of vertical conductance may be ascribed to the stronger pz hybridization between Mg and N. Based on the theoretical analysis, highly conductive p-type multidimensional Al0.63Ga0.37N/Al0.51Ga0.49N SLs are grown with identified steps via metalorganic vapor-phase epitaxy. The hole concentration reaches up to 3.5 × 10¹⁸ cm⁻³, while the corresponding resistivity reduces to 0.7 Ω cm at room temperature, a tens-fold improvement in conductivity over conventional SLs. High hole concentration can be maintained even at 100 K. High p-type conductivity in Al-rich structural material is an important step for the future design of superior AlGaN-based deep-ultraviolet devices. PMID:26906334
A Conceptual Modeling Approach for OLAP Personalization
NASA Astrophysics Data System (ADS)
Garrigós, Irene; Pardillo, Jesús; Mazón, Jose-Norberto; Trujillo, Juan
Data warehouses rely on multidimensional models in order to provide decision makers with appropriate structures to intuitively analyze data with OLAP technologies. However, data warehouses may be potentially large, and their multidimensional structures become increasingly complex to understand at a glance. Even if a departmental data warehouse (also known as a data mart) is used, these structures can still be too complex. As a consequence, acquiring the required information is more costly than expected, and decision makers using OLAP tools may become frustrated. In this context, current approaches for data warehouse design focus on deriving a unique OLAP schema for all analysts from their previously stated information requirements, which is not enough to ease the complexity of the decision-making process. To overcome this drawback, we argue for personalizing multidimensional models for OLAP technologies according to continuously changing user characteristics, context, requirements and behaviour. In this paper, we present a novel approach to personalizing OLAP systems at the conceptual level, based on the underlying multidimensional model of the data warehouse, a user model and a set of personalization rules. The great advantage of our approach is that a personalized OLAP schema is provided for each decision maker, helping to better satisfy their specific analysis needs. Finally, we show the applicability of our approach through a sample scenario based on our CASE tool for data warehouse development.
An R package for analyzing and modeling ranking data
2013-01-01
Background In medical informatics, psychology, market research and many other fields, researchers often need to analyze and model ranking data. However, there is no statistical software that provides tools for the comprehensive analysis of ranking data. Here, we present pmr, an R package for analyzing and modeling ranking data with a bundle of tools. The pmr package enables descriptive statistics (mean rank, pairwise frequencies, and marginal matrix), Analytic Hierarchy Process models (with Saaty's and Koczkodaj's inconsistencies), probability models (Luce model, distance-based model, and rank-ordered logit model), and the visualization of ranking data with multidimensional preference analysis. Results Examples of the use of package pmr are given using a real ranking dataset from medical informatics, in which 566 Hong Kong physicians ranked the top five of seven incentives (1: competitive pressures; 2: increased savings; 3: government regulation; 4: improved efficiency; 5: improved quality care; 6: patient demand; 7: financial incentives) to the computerization of clinical practice. The mean rank showed that item 4 was the most preferred item and item 3 the least preferred, and a significant difference was found between physicians' preferences with respect to their monthly income. A multidimensional preference analysis identified two dimensions that explain 42% of the total variance. The first dimension can be interpreted as the overall preference for the seven items (labeled "internal/external"), and the second as their overall variance (labeled "push/pull factors"). Various statistical models were fitted, and the best were found to be weighted distance-based models with Spearman's footrule distance. Conclusions In this paper, we presented the R package pmr, the first package for analyzing and modeling ranking data. The package provides insight to users through descriptive statistics of ranking data.
Users can also visualize ranking data through multidimensional preference analysis. Various probability models for ranking data are also included, allowing users to choose the one most suitable to their specific situation. PMID:23672645
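Two of the descriptive statistics that pmr computes, mean rank and pairwise preference frequencies, are simple enough to illustrate outside R. A toy Python sketch with three made-up rankings of three items (not the physician data):

```python
# Each row gives the rank assigned to items A, B, C (1 = most preferred).
rankings = [
    [1, 2, 3],
    [2, 1, 3],
    [1, 3, 2],
]

def mean_ranks(rankings):
    """Average rank per item; lower mean rank = more preferred."""
    n = len(rankings)
    k = len(rankings[0])
    return [sum(r[j] for r in rankings) / n for j in range(k)]

def pairwise_freq(rankings, i, j):
    """How many respondents ranked item i above (preferred to) item j."""
    return sum(1 for r in rankings if r[i] < r[j])

means = mean_ranks(rankings)
```

With these toy rankings, item A has the lowest mean rank, mirroring how the abstract identifies the most and least preferred incentives.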
Scalable Visual Analytics of Massive Textual Datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.
2007-04-01
This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte data sets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
A Critical Review of Automated Photogrammetric Processing of Large Datasets
NASA Astrophysics Data System (ADS)
Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.
2017-08-01
The paper reports comparisons between commercial software packages able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a different number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of photogrammetric terms is also given, in order to provide rigorous terms of reference for comparisons and critical analyses.
Use of Patient Registries and Administrative Datasets for the Study of Pediatric Cancer
Rice, Henry E.; Englum, Brian R.; Gulack, Brian C.; Adibe, Obinna O.; Tracy, Elizabeth T.; Kreissman, Susan G.; Routh, Jonathan C.
2015-01-01
Analysis of data from large administrative databases and patient registries is increasingly being used to study childhood cancer care, although the value of these data sources remains unclear to many clinicians. Interpretation of large databases requires a thorough understanding of how the dataset was designed, how data were collected, and how to assess data quality. This review will detail the role of administrative databases and registry databases for the study of childhood cancer, tools to maximize information from these datasets, and recommendations to improve the use of these databases for the study of pediatric oncology. PMID:25807938
Relational Messages Associated with Nonverbal Behaviors.
ERIC Educational Resources Information Center
Burgoon, Judee K.; And Others
Based on the assumptions that relational messages are multidimensional and that they are largely communicated by nonverbal cues, this experiment manipulated five nonverbal cues--eye contact, proximity, body lean, smiling, and touch--to determine what meanings they convey along four relational message dimensions: emotionality/arousal/composure,…
Kang, Jin Soo; Choi, Hyelim; Kim, Jin; Park, Hyeji; Kim, Jae-Yup; Choi, Jung-Woo; Yu, Seung-Ho; Lee, Kyung Jae; Kang, Yun Sik; Park, Sun Ha; Cho, Yong-Hun; Yum, Jun-Ho; Dunand, David C; Choe, Heeman; Sung, Yung-Eun
2017-09-01
Mesoscopic solar cells based on nanostructured oxide semiconductors are considered promising candidates to replace conventional photovoltaics employing costly materials. However, their overall performance remains below the level required for practical use. Herein, this study proposes an anodized Ti foam (ATF) with a multidimensional and hierarchical architecture as a highly efficient photoelectrode for the generation of a large photocurrent. ATF photoelectrodes prepared by electrochemical anodization of freeze-cast Ti foams have three favorable characteristics: (i) a large surface area for enhanced light harvesting, (ii) a 1D semiconductor structure for facilitated charge collection, and (iii) a 3D highly conductive metallic current collector that enables exclusion of the transparent conducting oxide substrate. Based on these advantages, when ATF is utilized in dye-sensitized solar cells, a short-circuit photocurrent density of up to 22.0 mA cm⁻² is achieved in the conventional N719 dye-I₃⁻/I⁻ redox electrolyte system, even with an intrinsically inferior quasi-solid electrolyte. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Visual analytics of large multidimensional data using variable binned scatter plots
NASA Astrophysics Data System (ADS)
Hao, Ming C.; Dayal, Umeshwar; Sharma, Ratnesh K.; Keim, Daniel A.; Janetzko, Halldór
2010-01-01
The scatter plot is a well-known method of visualizing pairs of two-dimensional continuous variables, and multidimensional data can be depicted in a scatter plot matrix. Scatter plots are intuitive and easy to use, but often have a high degree of overlap that may occlude a significant portion of the data. In this paper, we propose variable binned scatter plots to allow the visualization of large amounts of data without overlapping. The basic idea is to use a non-uniform (variable) binning of the x and y dimensions and to plot all the data points that fall within each bin into corresponding squares. Further, we map a third attribute to color for visualizing clusters. Analysts are able to interact with individual data points for record-level information. We have applied these techniques to solve real-world problems in credit card fraud and data center energy consumption, visualizing the data distribution and cause-effect relations among multiple attributes. A comparison of our methods with two recent well-known variants of scatter plots is included.
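The core of the method is the non-uniform binning step: bin edges follow the data distribution, so dense regions get narrower bins. A minimal Python sketch of quantile-based variable binning and per-bin counting; the function names and the exact choice of quantile edges are illustrative, and the paper's binning rule may differ:

```python
def quantile_edges(values, nbins):
    """Bin edges taken from data quantiles, so bins adapt to density."""
    s = sorted(values)
    n = len(s)
    return [s[min(int(i * n / nbins), n - 1)] for i in range(nbins)] + [s[-1]]

def bin_index(edges, v):
    """Index of the bin whose left edge is the largest edge <= v."""
    for i in range(len(edges) - 2, -1, -1):
        if v >= edges[i]:
            return i
    return 0

def binned_counts(xs, ys, nbins):
    """Count points per (x-bin, y-bin) cell; counts could drive cell color."""
    ex, ey = quantile_edges(xs, nbins), quantile_edges(ys, nbins)
    counts = {}
    for x, y in zip(xs, ys):
        key = (bin_index(ex, x), bin_index(ey, y))
        counts[key] = counts.get(key, 0) + 1
    return counts

cells = binned_counts([1, 2, 3, 4], [1, 2, 3, 4], 2)
```

Each cell's count (or a third attribute aggregated per cell) is then what gets mapped to color in the rendered plot.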
On the visualization of water-related big data: extracting insights from drought proxies' datasets
NASA Astrophysics Data System (ADS)
Diaz, Vitali; Corzo, Gerald; van Lanen, Henny A. J.; Solomatine, Dimitri
2017-04-01
Big data is a growing area of science from which hydroinformatics can benefit greatly. There have been a number of important developments in data science aimed at the analysis of large datasets. Water-related datasets of this kind include measurements, simulations, reanalyses, scenario analyses and proxies. By convention, information contained in these databases refers to a specific time and location (i.e., longitude/latitude). This work is motivated by the need to extract insights from large water-related datasets, i.e., to transform large amounts of data into useful information that helps us better understand water-related phenomena, particularly drought. In this context, data visualization, a part of data science, involves techniques to create and communicate data by encoding it as visual graphical objects; these may help to better understand data and detect trends. Based on existing methods of data analysis and visualization, this work aims to develop tools for visualizing large water-related datasets. These tools were developed by taking advantage of existing visualization libraries to produce a group of graphs that includes both polar area diagrams (PADs) and radar charts (RDs). In both graphs, time steps are represented by the polar angles and the percentages of area in drought by the radii. For illustration, three large datasets of drought proxies are chosen to identify trends, prone areas and the spatio-temporal variability of drought in a set of case studies. The datasets are (1) SPI-TS2p1 (1901-2002, 11.7 GB), (2) SPI-PRECL0p5 (1948-2016, 7.91 GB) and (3) SPEI-baseV2.3 (1901-2013, 15.3 GB). All of them are on a monthly basis and have a spatial resolution of 0.5 degrees. The first two were retrieved from the repository of the International Research Institute for Climate and Society (IRI) and are included in the Analyses Standardized Precipitation Index (SPI) project (iridl.ldeo.columbia.edu/SOURCES/.IRI/.Analyses/.SPI/).
The third dataset was recovered from the Standardized Precipitation Evaporation Index (SPEI) Monitor (digital.csic.es/handle/10261/128892). PADs were found suitable to identify the spatio-temporal variability and prone areas of drought. Drought trends were visually detected by using both PADs and RDs. A similar approach can be followed to include other types of graphs to deal with the analysis of water-related big data. Key words: Big data, data visualization, drought, SPI, SPEI
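The polar mapping used by the PADs and radar charts described above is straightforward: each monthly time step becomes an angular position and the percentage of area in drought becomes the radius. A minimal sketch of that coordinate mapping, with invented monthly values rather than SPI/SPEI data:

```python
import math

def polar_coordinates(percent_area_in_drought):
    """Map a monthly series to (angle_in_radians, radius) pairs: the i-th
    time step sits at angle i * 2*pi/n and its drought-area percentage is
    the radius, as in the polar area diagrams described above."""
    n = len(percent_area_in_drought)
    step = 2 * math.pi / n
    return [(i * step, p) for i, p in enumerate(percent_area_in_drought)]

coords = polar_coordinates([10.0, 25.0, 40.0, 15.0])
```

Feeding these pairs to any polar plotting routine (e.g., a matplotlib polar axes) reproduces the basic chart; the actual tools add sector widths, colors and legends on top of this mapping.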
Measures for a multidimensional multiverse
NASA Astrophysics Data System (ADS)
Chung, Hyeyoun
2015-04-01
We explore the phenomenological implications of generalizing the causal patch and fat geodesic measures to a multidimensional multiverse, where the vacua can have differing numbers of large dimensions. We consider a simple model in which the vacua are nucleated from a D-dimensional parent spacetime through dynamical compactification of the extra dimensions, and compute the geometric contribution to the probability distribution of observations within the multiverse for each measure. We then study how the shape of this probability distribution depends on the time scales for the existence of observers, for vacuum domination, and for curvature domination (t_obs, t_Λ, and t_c, respectively). In this work we restrict ourselves to bubbles with positive cosmological constant, Λ. We find that in the case of the causal patch cutoff, when the bubble universes have p + 1 large spatial dimensions with p ≥ 2, the shape of the probability distribution is such that we obtain the coincidence of time scales t_obs ~ t_Λ ~ t_c. Moreover, the size of the cosmological constant is related to the size of the landscape. However, the exact shape of the probability distribution is different in the case p = 2 compared to p ≥ 3. In the case of the fat geodesic measure, the result is even more robust: the shape of the probability distribution is the same for all p ≥ 2, and we once again obtain the coincidence t_obs ~ t_Λ ~ t_c. These results require only very mild conditions on the prior probability of the distribution of vacua in the landscape. Our work shows that the observed double coincidence of time scales is a robust prediction even when the multiverse is generalized to be multidimensional; that this coincidence is not a consequence of our particular Universe being (3+1)-dimensional; and that this observable cannot be used to preferentially select one measure over another in a multidimensional multiverse.
The MATISSE analysis of large spectral datasets from the ESO Archive
NASA Astrophysics Data System (ADS)
Worley, C.; de Laverny, P.; Recio-Blanco, A.; Hill, V.; Vernisse, Y.; Ordenovic, C.; Bijaoui, A.
2010-12-01
The automated stellar classification algorithm MATISSE has been developed at the Observatoire de la Côte d'Azur (OCA) in order to determine stellar temperatures, gravities and chemical abundances for large datasets of stellar spectra. The Gaia Data Processing and Analysis Consortium (DPAC) has selected MATISSE as one of the key programmes to be used in the analysis of the Gaia Radial Velocity Spectrometer (RVS) spectra. MATISSE is currently being used to analyse large datasets of spectra from the ESO archive, with the primary goal of producing advanced data products to be made available in the ESO database via the Virtual Observatory. This is also an invaluable opportunity to identify and address issues that can be encountered in the analysis of large samples of real spectra prior to the launch of Gaia in 2012. The analysis of the archived spectra of the FEROS spectrograph is currently underway, and preliminary results are presented.
2014-06-30
Identifying a 'guilty' user (of steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors … floating point operations (1 TFLOPs) for a 1 megapixel image. We designed a new implementation using Compute Unified Device Architecture (CUDA) on NVIDIA
The role of metadata in managing large environmental science datasets. Proceedings
DOE Office of Scientific and Technical Information (OSTI.GOV)
Melton, R.B.; DeVaney, D.M.; French, J. C.
1995-06-01
The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.
Multidimensional biochemical information processing of dynamical patterns
NASA Astrophysics Data System (ADS)
Hasegawa, Yoshihiko
2018-02-01
Cells receive signaling molecules by receptors and relay information via sensory networks so that they can respond properly depending on the type of signal. Recent studies have shown that cells can extract multidimensional information from dynamical concentration patterns of signaling molecules. We herein study how biochemical systems can process multidimensional information embedded in dynamical patterns. We model the decoding networks by linear response functions, and optimize the functions with the calculus of variations to maximize the mutual information between patterns and output. We find that, when the noise intensity is lower, decoders with different linear response functions, i.e., distinct decoders, can extract much information. However, when the noise intensity is higher, distinct decoders do not provide the maximum amount of information. This indicates that, when transmitting information by dynamical patterns, embedding information in multiple patterns is not optimal when the noise intensity is very large. Furthermore, we explore the biochemical implementations of these decoders using control theory and demonstrate that these decoders can be implemented biochemically through the modification of cascade-type networks, which are prevalent in actual signaling pathways.
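The objective being maximized above is the mutual information between input patterns and decoder output. As a toy illustration of that quantity (the joint probability table below is made up, not part of the paper's model), mutual information can be computed directly from a discrete joint distribution:

```python
import math

def mutual_information(joint):
    """I(S;Y) in bits from a joint table joint[s][y] = P(S=s, Y=y)."""
    ps = [sum(row) for row in joint]              # marginal P(S)
    py = [sum(col) for col in zip(*joint)]        # marginal P(Y)
    mi = 0.0
    for s, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (ps[s] * py[y]))
    return mi

# A noiseless channel distinguishing two equiprobable patterns carries 1 bit;
# noise would spread probability off the diagonal and reduce I(S;Y).
mi = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

The paper's setting is the continuous analogue: response functions are tuned so the decoder outputs preserve as much of this information as the noise allows.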
Testlet-Based Multidimensional Adaptive Testing.
Frey, Andreas; Seitz, Nicki-Nils; Brandt, Steffen
2016-01-01
Multidimensional adaptive testing (MAT) is a highly efficient method for the simultaneous measurement of several latent traits. Currently, no psychometrically sound approach is available for the use of MAT in testlet-based tests. Testlets are sets of items sharing a common stimulus such as a graph or a text. They are frequently used in large operational testing programs like TOEFL, PISA, PIRLS, or NAEP. To make MAT accessible for such testing programs, we present a novel combination of MAT with a multidimensional generalization of the random effects testlet model (MAT-MTIRT). In a simulation study considering three ability dimensions with a simple loading structure, MAT-MTIRT is compared to non-adaptive testing for several combinations of testlet effect variances (0.0, 0.5, 1.0, and 1.5) and testlet sizes (3, 6, and 9 items). MAT-MTIRT outperformed non-adaptive testing regarding the measurement precision of the ability estimates. Further, the measurement precision decreased when testlet effect variances and testlet sizes increased. The suggested combination with the MTIRT model therefore provides a solution to the substantial problems of testlet-based tests while keeping the length of the test within an acceptable range.
Rosset, Antoine; Spadola, Luca; Pysher, Lance; Ratib, Osman
2006-01-01
The display and interpretation of images obtained by combining three-dimensional data acquired with two different modalities (eg, positron emission tomography and computed tomography) in the same subject require complex software tools that allow the user to adjust the image parameters. With the current fast imaging systems, it is possible to acquire dynamic images of the beating heart, which add a fourth dimension of visual information: the temporal dimension. Moreover, images acquired at different points during the transit of a contrast agent or during different functional phases add a fifth dimension: functional data. To facilitate real-time image navigation in the resultant large multidimensional image data sets, the authors developed a Digital Imaging and Communications in Medicine-compliant software program. The open-source software, called OsiriX, allows the user to navigate through multidimensional image series while adjusting the blending of images from different modalities, image contrast and intensity, and the rate of cine display of dynamic images. The software is available for free download at http://homepage.mac.com/rossetantoine/osirix. (c) RSNA, 2006.
Membership determination of open clusters based on a spectral clustering method
NASA Astrophysics Data System (ADS)
Gao, Xin-Hua
2018-06-01
We present a spectral clustering (SC) method aimed at segregating reliable members of open clusters in multi-dimensional space. The SC method is a non-parametric clustering technique that performs cluster division using eigenvectors of the similarity matrix; no prior knowledge of the clusters is required. This method is more flexible in dealing with multi-dimensional data compared to other methods of membership determination. We use this method to segregate the cluster members of five open clusters (Hyades, Coma Ber, Pleiades, Praesepe, and NGC 188) in five-dimensional space; fairly clean cluster members are obtained. We find that the SC method can capture a small number of cluster members (weak signal) from a large number of field stars (heavy noise). Based on these cluster members, we compute the mean proper motions and distances for the Hyades, Coma Ber, Pleiades, and Praesepe clusters, and our results are in general quite consistent with the results derived by other authors. The test results indicate that the SC method is highly suitable for segregating cluster members of open clusters based on high-precision multi-dimensional astrometric data such as Gaia data.
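The spectral clustering step described above (similarity matrix, then Laplacian eigenvectors, then partition) can be sketched compactly. A minimal two-cluster version in NumPy with synthetic 2D points rather than astrometric data; it splits on the sign of the Fiedler vector instead of the paper's full multi-cluster procedure:

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    """Split points into two groups: build a Gaussian similarity matrix,
    form the (unnormalised) graph Laplacian, and threshold the sign of
    its second-smallest eigenvector (the Fiedler vector)."""
    X = np.asarray(points, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    W = np.exp(-d2 / (2 * sigma ** 2))                   # similarity matrix
    L = np.diag(W.sum(1)) - W                            # graph Laplacian
    vals, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    fiedler = vecs[:, 1]                                 # 2nd-smallest eigenvector
    return (fiedler > 0).astype(int)

# Two well-separated synthetic groups (not real cluster/field stars).
labels = spectral_bipartition(
    [(0, 0), (0.1, 0.2), (0.2, 0), (5, 5), (5.1, 5.2), (5.2, 5)])
```

For membership determination the same machinery runs in five-dimensional astrometric space, where the tight cluster members form one strongly connected block of the similarity matrix and field stars the other.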
Thermalnet: a Deep Convolutional Network for Synthetic Thermal Image Generation
NASA Astrophysics Data System (ADS)
Kniaz, V. V.; Gorbatsevich, V. S.; Mizginov, V. A.
2017-05-01
Deep convolutional neural networks have dramatically changed the landscape of modern computer vision, and methods based on deep neural networks now show the best performance among image recognition and object detection algorithms. While the polishing of network architectures has received a lot of scholarly attention, from the practical point of view the preparation of a large image dataset for successful training of a neural network has become one of the major challenges. This challenge is particularly profound for image recognition in wavelengths lying outside the visible spectrum: for example, no infrared or radar image datasets large enough for successful training of a deep neural network are available to date in the public domain. Recent advances show that deep neural networks are also capable of arbitrary image transformations such as super-resolution image generation, grayscale image colorisation and imitation of the style of a given artist. Thus a natural question arises: how can deep neural networks be used to augment existing large image datasets? This paper is focused on the development of the Thermalnet deep convolutional neural network for augmentation of existing large visible image datasets with synthetic thermal images. The Thermalnet network architecture is inspired by colorisation deep neural networks.
A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video
2011-06-01
orders of magnitude larger than existing datasets such as CAVIAR [7]. The TRECVID 2008 airport dataset [16] contains 100 hours of video, but it provides only … the entire human figure (e.g., above shoulder) … and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16], shown in Fig. 3
Hu, Ming-Hsia; Yeh, Chih-Jun; Chen, Tou-Rong; Wang, Ching-Yi
2014-01-01
A valid, time-efficient and easy-to-use instrument is important for busy clinical settings, large-scale surveys, and community screening. The purpose of this study was to validate the mobility hierarchical disability categorization model (an abbreviated model) by investigating its concurrent validity with the multidimensional hierarchical disability categorization model (a comprehensive model) and triangulating both models with physical performance measures in older adults. 604 community-dwelling older adults aged at least 60 years volunteered to participate. Self-reported function in the mobility, instrumental activities of daily living (IADL) and activities of daily living (ADL) domains was recorded, and disability status was then determined based on both the multidimensional and the mobility hierarchical categorization models. The physical performance measures, consisting of grip strength and usual and fastest gait speeds (UGS, FGS), were collected on the same day. The two categorization models showed high correlation (γs = 0.92, p < 0.001) and agreement (kappa = 0.61, p < 0.0001). Physical performance measures demonstrated significantly different group means among the disability subgroups under both categorization models. Multiple regression analysis indicated that the two models individually explained a similar amount of variance in all physical performance measures, after adjustment for age, sex, and number of comorbidities. Our results indicate that the mobility hierarchical disability categorization model is a valid and time-efficient tool for large-scale survey or screening use.
NASA Astrophysics Data System (ADS)
Ladd, Matthew; Viau, Andre
2013-04-01
Paleoclimate reconstructions rely on the accuracy of modern climate datasets for calibration of fossil records, under the assumption of climate normality through time, i.e., that the modern climate operates in a similar manner as over the past 2,000 years. In this study, we show how the choice of modern climate dataset affects a pollen-based reconstruction of the mean temperature of the warmest month (MTWA) during the past 2,000 years for North America. The modern climate datasets used to explore this question include the Whitmore et al. (2005) modern climate dataset; the North American Regional Reanalysis (NARR); the National Center for Environmental Prediction (NCEP) reanalysis; the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-40 reanalysis; WorldClim; the Global Historical Climate Network (GHCN); and New et al., which is derived from the CRU dataset. Results show that some caution is advised in using reanalysis data for large-scale reconstructions. Reanalysis data appear to dampen the variability seen in reconstructions produced using station-based datasets. The reanalysis and model-based datasets are not recommended for large-scale North American paleoclimate reconstructions, as they appear to lack some of the dynamics observed in station datasets (CRU), which resulted in warm-biased reconstructions compared to the station-based ones. The Whitmore et al. (2005) modern climate dataset appears to be a compromise between CRU-based and model-based datasets, except for the ERA-40. In addition, an ultra-high-resolution gridded climate dataset such as WorldClim may only be useful if the pollen calibration sites in North America have at least the same spatial precision. We reconstruct the MTWA to within +/-0.01°C by averaging the curves derived from the different modern climate datasets, demonstrating the robustness of the procedure.
Using an average of different modern datasets may reduce the impact of uncertainty on paleoclimate reconstructions, but this remains to be determined with certainty. Future evaluations should, for example, test the newly developed Berkeley Earth surface temperature dataset against the paleoclimate record.
Atlas-Guided Cluster Analysis of Large Tractography Datasets
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to group fiber tracts time-efficiently. Structural information from a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects for automatic, anatomically correct, and reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a few minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
PNNL - WRF-LES - Convective - TTU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
ANL - WRF-LES - Convective - TTU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
LLNL - WRF-LES - Neutral - TTU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
Kosovic, Branko
2018-06-20
This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
LANL - WRF-LES - Neutral - TTU
Kosovic, Branko
2018-06-20
This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
LANL - WRF-LES - Convective - TTU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kosovic, Branko
This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.
A Computational Approach to Qualitative Analysis in Large Textual Datasets
Evans, Michael S.
2014-01-01
In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on broader discourse, how to validate substantive inferences from small samples of textual data, and how to determine if identified cases are part of a consistent temporal pattern. PMID:24498398
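The study applies probabilistic topic modeling; as a rough illustration of how latent "subjects of discussion" can be extracted from a document-term matrix, here is a minimal non-negative matrix factorization sketch. NMF is a related but different technique, chosen only because it fits in a few lines; the toy matrix and all names below are invented for illustration, not taken from the paper.

```python
import numpy as np

def nmf_topics(X, k, iters=200, seed=0):
    """Factor a documents-by-terms count matrix X into W (doc-topic) and
    H (topic-term) using Lee & Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy corpus: documents 0-1 and 2-3 use largely disjoint vocabularies.
X = np.array([
    [3, 2, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 5, 2],
    [1, 0, 3, 4],
], dtype=float)
W, H = nmf_topics(X, k=2)
err = np.linalg.norm(X - W @ H)
```

Each row of H can then be read as a topic's term weights and each row of W as a document's topic mixture, which is the shape of output an analyst would inspect qualitatively.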
Evolving Deep Networks Using HPC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Young, Steven R.; Rose, Derek C.; Johnston, Travis
While a large number of deep learning networks that produce outstanding results on natural image datasets have been studied and published, these datasets make up only a fraction of those to which deep learning can be applied. Other datasets include text data, audio data, and arrays of sensors that have very different characteristics than natural images. As these "best" networks for natural images have largely been discovered through experimentation and cannot be proven optimal on theoretical grounds, there is no reason to believe that they are the optimal networks for these drastically different datasets. Hyperparameter search is thus often a very important process when applying deep learning to a new problem. In this work we present an evolutionary approach to searching the possible space of network hyperparameters and construction that can scale to 18,000 nodes. This approach is applied to datasets of varying types and characteristics, where we demonstrate the ability to rapidly find good hyperparameters, enabling practitioners to quickly iterate between idea and result.
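As a sketch of the general idea (not the authors' HPC implementation, which evolves network construction across thousands of nodes), a toy evolutionary search over a discrete hyperparameter space might look like the following; the search space and the fitness function are invented stand-ins for an actual training-and-evaluation run.

```python
import random

def evolve(fitness, space, pop_size=12, generations=15, seed=1):
    """Minimal evolutionary hyperparameter search: keep the fitter half of
    the population, refill with single-gene mutations of survivors."""
    rng = random.Random(seed)
    sample = lambda: {k: rng.choice(v) for k, v in space.items()}
    def mutate(cfg):
        child = dict(cfg)
        key = rng.choice(list(space))
        child[key] = rng.choice(space[key])
        return child
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(rng.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)

# Stand-in objective that prefers depth=4 and lr=0.01 (no real training).
space = {"lr": [0.1, 0.01, 0.001], "depth": [2, 4, 8], "units": [32, 64, 128]}
fitness = lambda c: -(abs(c["depth"] - 4) + abs(c["lr"] - 0.01) * 10)
best = evolve(fitness, space)
```

In the real setting, `fitness` would train a network with the candidate hyperparameters and return a validation score, which is what makes the distributed scale-out worthwhile.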
Mendenhall, Jeffrey; Meiler, Jens
2016-02-01
Dropout is an Artificial Neural Network (ANN) training technique that has been shown to improve ANN performance across canonical machine learning (ML) datasets. Quantitative Structure Activity Relationship (QSAR) datasets used to relate chemical structure to biological activity in Ligand-Based Computer-Aided Drug Discovery pose unique challenges for ML techniques, such as heavily biased dataset composition and a large number of descriptors relative to the number of actives. To test the hypothesis that dropout also improves QSAR ANNs, we conduct a benchmark on nine large QSAR datasets. Use of dropout improved both enrichment false positive rate and log-scaled area under the receiver-operating characteristic curve (logAUC) by 22-46% over conventional ANN implementations. Optimal dropout rates are found to be a function of the signal-to-noise ratio of the descriptor set, and relatively independent of the dataset. Dropout ANNs with 2D and 3D autocorrelation descriptors outperform conventional ANNs as well as optimized fingerprint similarity search methods.
Mendenhall, Jeffrey; Meiler, Jens
2016-01-01
Dropout is an Artificial Neural Network (ANN) training technique that has been shown to improve ANN performance across canonical machine learning (ML) datasets. Quantitative Structure Activity Relationship (QSAR) datasets used to relate chemical structure to biological activity in Ligand-Based Computer-Aided Drug Discovery (LB-CADD) pose unique challenges for ML techniques, such as heavily biased dataset composition and a large number of descriptors relative to the number of actives. To test the hypothesis that dropout also improves QSAR ANNs, we conduct a benchmark on nine large QSAR datasets. Use of dropout improved both Enrichment false positive rate (FPR) and log-scaled area under the receiver-operating characteristic curve (logAUC) by 22–46% over conventional ANN implementations. Optimal dropout rates are found to be a function of the signal-to-noise ratio of the descriptor set, and relatively independent of the dataset. Dropout ANNs with 2D and 3D autocorrelation descriptors outperform conventional ANNs as well as optimized fingerprint similarity search methods. PMID:26830599
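The dropout mechanism benchmarked above can be sketched in a few lines. This is the standard "inverted" formulation, assumed here for illustration rather than taken from the paper's own implementation: during training a fraction of activations is zeroed and the survivors are rescaled so the expected activation is unchanged at test time.

```python
import numpy as np

def dropout(x, rate, rng, train=True):
    """Inverted dropout: zero a `rate` fraction of activations during
    training and rescale the rest to preserve the expected value."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate   # keep each unit with prob 1-rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((10000,))
y = dropout(x, 0.25, rng)   # mean(y) stays close to mean(x) == 1.0
```

The rescaling by 1/(1-rate) is what lets the same weights be used unchanged at inference, with dropout simply switched off.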
Pentexonomy: A Multi-Dimensional Taxonomy of Educational Online Technologies
ERIC Educational Resources Information Center
Tuapawa, Kimberley; Sher, William; Gu, Ning
2014-01-01
Educational online technologies (EOTs) have revolutionised the delivery of online education, making a large contribution towards the global increase in demand for higher learning. Educationalists have striven to adapt through knowledge development and application of online tools, but making educationally sound choices about technology has proved…
Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei
2017-02-01
Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.
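Pathway analyses of the kind performed in stages 1-3 commonly score the overlap between a gene list and a pathway with a one-sided hypergeometric test. A minimal sketch follows; the test choice and every number below are illustrative assumptions, not values from the study.

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(overlap >= k) when drawing n genes from a universe of N genes
    that contains K pathway genes (one-sided enrichment test)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Hypothetical example: 40 of 100 dysregulated genes fall in a
# 200-gene pathway, out of a 20,000-gene universe.
p = hypergeom_pvalue(40, 200, 100, 20000)
```

An expected overlap here would be 100*200/20000 = 1 gene, so an observed overlap of 40 yields a vanishingly small p-value, i.e., strong enrichment.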
McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr
2017-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements of structural biology datasets. In this paper, we describe one such extension: functionality supporting preservation of file-system structure within Dataverse, which is essential both for in-place computation and for supporting non-HTTP data transfers. © 2016 New York Academy of Sciences.
Goekoop, Rutger; Goekoop, Jaap G.; Scholte, H. Steven
2012-01-01
Introduction: Human personality is described preferentially in terms of factors (dimensions) found using factor analysis. An alternative and highly related method is network analysis, which may have several advantages over factor analytic methods. Aim: To directly compare the ability of network community detection (NCD) and principal component factor analysis (PCA) to examine modularity in multidimensional datasets such as the neuroticism-extraversion-openness personality inventory revised (NEO-PI-R). Methods: 434 healthy subjects were tested on the NEO-PI-R. PCA was performed to extract factor structures (FS) of the current dataset using both item scores and facet scores. Correlational network graphs were constructed from univariate correlation matrices of interactions between both items and facets. These networks were pruned in a link-by-link fashion while calculating the network community structure (NCS) of each resulting network using the Wakita-Tsurumi clustering algorithm. NCSs were matched against FS and networks of best matches were kept for further analysis. Results: At facet level, NCS showed a best match (96.2%) with a ‘confirmatory’ 5-FS. At item level, NCS showed a best match (80%) with the standard 5-FS and involved a total of 6 network clusters. Lesser matches were found with ‘confirmatory’ 5-FS and ‘exploratory’ 6-FS of the current dataset. Network analysis did not identify facets as a separate level of organization in between items and clusters. A small-world network structure was found in both item- and facet-level networks. Conclusion: We present the first optimized network graph of personality traits according to the NEO-PI-R: a ‘Personality Web’. Such a web may represent the possible routes that subjects can take during personality development. NCD outperforms PCA by producing plausible modularity at item level in non-standard datasets, and can identify the key roles of individual items and clusters in the network. PMID:23284713
Goekoop, Rutger; Goekoop, Jaap G; Scholte, H Steven
2012-01-01
Human personality is described preferentially in terms of factors (dimensions) found using factor analysis. An alternative and highly related method is network analysis, which may have several advantages over factor analytic methods. The aim of this study was to directly compare the ability of network community detection (NCD) and principal component factor analysis (PCA) to examine modularity in multidimensional datasets such as the neuroticism-extraversion-openness personality inventory revised (NEO-PI-R). 434 healthy subjects were tested on the NEO-PI-R. PCA was performed to extract factor structures (FS) of the current dataset using both item scores and facet scores. Correlational network graphs were constructed from univariate correlation matrices of interactions between both items and facets. These networks were pruned in a link-by-link fashion while calculating the network community structure (NCS) of each resulting network using the Wakita-Tsurumi clustering algorithm. NCSs were matched against FS and networks of best matches were kept for further analysis. At facet level, NCS showed a best match (96.2%) with a 'confirmatory' 5-FS. At item level, NCS showed a best match (80%) with the standard 5-FS and involved a total of 6 network clusters. Lesser matches were found with 'confirmatory' 5-FS and 'exploratory' 6-FS of the current dataset. Network analysis did not identify facets as a separate level of organization in between items and clusters. A small-world network structure was found in both item- and facet-level networks. We present the first optimized network graph of personality traits according to the NEO-PI-R: a 'Personality Web'. Such a web may represent the possible routes that subjects can take during personality development. NCD outperforms PCA by producing plausible modularity at item level in non-standard datasets, and can identify the key roles of individual items and clusters in the network.
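As a rough illustration of the correlational-network idea, one can threshold a correlation matrix into a graph and group variables by connectivity. The sketch below uses simple connected components as a crude stand-in for the Wakita-Tsurumi clustering used in the study, and simulated data in place of NEO-PI-R scores; the threshold value is an arbitrary assumption.

```python
import numpy as np

def correlation_communities(data, threshold=0.5):
    """Build a graph with an edge wherever |r| >= threshold between
    columns of `data`, then return its connected components."""
    r = np.corrcoef(data, rowvar=False)
    n = r.shape[0]
    adj = {i: {j for j in range(n) if j != i and abs(r[i, j]) >= threshold}
           for i in range(n)}
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first traversal
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Two latent traits drive items 0-2 and 3-5 respectively.
rng = np.random.default_rng(0)
t1, t2 = rng.normal(size=(2, 500))
data = np.column_stack([t1, t1, t1, t2, t2, t2]) \
       + rng.normal(scale=0.4, size=(500, 6))
comps = correlation_communities(data)
```

A real modularity algorithm would additionally score within- versus between-community edge density rather than rely on a single hard threshold.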
Fast Detection of Material Deformation through Structural Dissimilarity
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ushizima, Daniela; Perciano, Talita; Parkinson, Dilworth
2015-10-29
Designing materials that are resistant to extreme temperatures and brittleness relies on assessing the structural dynamics of samples. Algorithms are critically important for characterizing material deformation under stress conditions. Here, we report on our design of coarse-grained parallel algorithms for image quality assessment based on structural information and for crack detection in gigabyte-scale experimental datasets. We show how key steps can be decomposed into distinct processing flows, one based on the structural similarity (SSIM) quality measure, and another on spectral content. These algorithms act upon image blocks that fit into memory, and can execute independently. We discuss the scientific relevance of the problem, key developments, and the decomposition of complementary tasks into separate executions. We show how to apply SSIM to detect material degradation, and illustrate how this metric can be allied to spectral analysis for structure probing, while using tiled multi-resolution pyramids stored in HDF5 chunked multi-dimensional arrays. Results show that the proposed experimental data representation supports an average compression rate of 10X, and that data compression scales linearly with data size. We also illustrate how to correlate SSIM with crack formation, and how to use our numerical schemes to enable fast detection of deformation in 3D datasets evolving in time.
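The SSIM index at the heart of the quality-assessment flow combines luminance, contrast, and structure terms. Below is a minimal single-window sketch of the index itself; production implementations (and presumably the authors' block-wise pipeline) compute it over local windows and aggregate, which this global version deliberately omits.

```python
import numpy as np

def ssim_global(x, y, L=1.0):
    """Single-window SSIM between two images with dynamic range L.
    C1, C2 are the standard small stabilizing constants."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + rng.normal(scale=0.2, size=img.shape), 0.0, 1.0)
# ssim_global(img, img) is 1.0; degradation pushes the index below 1.
```

For deformation detection, the relevant signal is the drop in SSIM between a reference block and the same block at a later time step.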
NASA Astrophysics Data System (ADS)
Mote, P.; Foster, J. G.; Daley-Laursen, S. B.
2014-12-01
The Northwest has the nation's strongest geographic, institutional, and scientific alignment between the NOAA RISA, the DOI Climate Science Center, the USDA Climate Hub, and participating universities. Considering each institution's distinct mission, funding structures, governance, stakeholder engagement, methods of priority-setting, and deliverables, it is a challenge to find areas of common interest and ways for these institutions to work together. In view of the rich history of stakeholder engagement and the deep base of previous research on climate change in the region, these institutions are cooperating to develop a regional capacity to mine the vast available data in ways that are mutually beneficial, synergistic, and regionally relevant. Fundamentally, data mining means exploring connections across and within multiple datasets using advanced statistical techniques, development of multidimensional indices, machine learning, and more. The challenge is not just what we do with big datasets, but how we integrate the wide variety and types of data coming out of scenario analyses to create knowledge and inform decision-making. Federal agencies and their partners need to learn to integrate big data on climate change and to develop useful tools that help important stakeholders anticipate the main stresses of climate change on their own resources and prepare to abate those stresses.
A cross-diffusion system derived from a Fokker-Planck equation with partial averaging
NASA Astrophysics Data System (ADS)
Jüngel, Ansgar; Zamponi, Nicola
2017-02-01
A cross-diffusion system for two components with a Laplacian structure is analyzed on the multi-dimensional torus. This system, which was recently suggested by P.-L. Lions, is formally derived from a Fokker-Planck equation for the probability density associated with a multi-dimensional Itō process, assuming that the diffusion coefficients depend on partial averages of the probability density with exponential weights. A main feature is that the diffusion matrix of the limiting cross-diffusion system is generally neither symmetric nor positive definite, but its structure allows for the use of entropy methods. The global-in-time existence of positive weak solutions is proved and, under a simplifying assumption, the large-time asymptotics is investigated.
Hird, Sarah; Kubatko, Laura; Carstens, Bryan
2010-11-01
We describe a method for estimating species trees that relies on replicated subsampling of large data matrices. One application of this method is phylogeographic research, which has long depended on large datasets that sample intensively from the geographic range of the focal species; these datasets allow systematicists to identify cryptic diversity and understand how contemporary and historical landscape forces influence genetic diversity. However, analyzing any large dataset can be computationally difficult, particularly when newly developed methods for species tree estimation are used. Here we explore the use of replicated subsampling, a potential solution to the problem posed by large datasets, with both a simulation study and an empirical analysis. In the simulations, we sample different numbers of alleles and loci, estimate species trees using STEM, and compare the estimated to the actual species tree. Our results indicate that subsampling three alleles per species for eight loci nearly always results in an accurate species tree topology, even in cases where the species tree was characterized by extremely rapid divergence. Even more modest subsampling effort, for example one allele per species and two loci, was more likely than not (>50%) to identify the correct species tree topology, indicating that in nearly all cases, computing the majority-rule consensus tree from replicated subsampling provides a good estimate of topology. These results were supported by estimating the correct species tree topology and reasonable branch lengths for an empirical 10-locus great ape dataset. Copyright © 2010 Elsevier Inc. All rights reserved.
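The replicated-subsampling procedure described above reduces to: draw a subsample of alleles and loci, estimate a species tree per replicate, and take the most frequent topology as the consensus. The sketch below mocks the per-replicate tree estimation (STEM is an external program), so the topology strings, the data layout, and the function names are illustrative assumptions only.

```python
import random
from collections import Counter

def subsample(data, n_alleles, n_loci, rng):
    """Pick n_loci loci and n_alleles alleles per species from a
    data[locus][species] -> list-of-alleles mapping."""
    loci = rng.sample(list(data), n_loci)
    return {locus: {sp: rng.sample(data[locus][sp], n_alleles)
                    for sp in data[locus]}
            for locus in loci}

def majority_rule(topologies):
    """Most frequent topology among replicate estimates, with its support."""
    (top, n), = Counter(topologies).most_common(1)
    return top, n / len(topologies)

# Toy alignment stand-in: two loci, two species, four alleles each.
data = {"locus1": {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]},
        "locus2": {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}}

# Stand-in replicate estimates (in practice, one STEM run per subsample).
estimates = ["((A,B),C)"] * 7 + ["(A,(B,C))"] * 3
top, support = majority_rule(estimates)
```

In the real workflow each `subsample` output would be fed to the species-tree estimator, and `majority_rule` would be applied to the resulting topologies.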
Oscar, Nels; Fox, Pamela A; Croucher, Racheal; Wernick, Riana; Keune, Jessica; Hooker, Karen
2017-09-01
Social scientists need practical methods for harnessing large, publicly available datasets that inform the social context of aging. We describe our development of a semi-automated text coding method and use a content analysis of Alzheimer's disease (AD) and dementia portrayal on Twitter to demonstrate its use. The approach improves feasibility of examining large publicly available datasets. Machine learning techniques modeled stigmatization expressed in 31,150 AD-related tweets collected via Twitter's search API based on 9 AD-related keywords. Two researchers manually coded 311 random tweets on 6 dimensions. This input from 1% of the dataset was used to train a classifier against the tweet text and code the remaining 99% of the dataset. Our automated process identified that 21.13% of the AD-related tweets used AD-related keywords to perpetuate public stigma, which could impact stereotypes and negative expectations for individuals with the disease and increase "excess disability". This technique could be applied to questions in social gerontology related to how social media outlets reflect and shape attitudes bearing on other developmental outcomes. Recommendations for the collection and analysis of large Twitter datasets are discussed. © The Author 2017. Published by Oxford University Press on behalf of The Gerontological Society of America. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Parallel task processing of very large datasets
NASA Astrophysics Data System (ADS)
Romig, Phillip Richardson, III
This research concerns the use of distributed computing technologies for the analysis and management of very large datasets. Improvements in sensor technology, an emphasis on global change research, and greater access to data warehouses all increase the number of non-traditional users of remotely sensed data. We present a framework for distributed solutions to the challenges of datasets that exceed the online storage capacity of individual workstations. This framework, called parallel task processing (PTP), incorporates both the task- and data-level parallelism exemplified by many image processing operations. An implementation based on the principles of PTP, called Tricky, is also presented. Additionally, we describe the challenges and practical issues in modeling the performance of parallel task processing with large datasets. We present a mechanism for estimating the running time of each unit of work within a system and an algorithm that uses these estimates to simulate the execution environment and produce estimated runtimes. Finally, we describe and discuss experimental results which validate the design. Specifically, the system (a) is able to perform computation on datasets which exceed the capacity of any one disk, (b) provides a reduction in overall computation time as a result of task distribution, even with the additional cost of data transfer and management, and (c) in simulation mode accurately predicts the performance of the real execution environment.
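The chunk-and-distribute pattern that PTP describes can be sketched with Python's standard library. This stand-in uses threads on a single machine, whereas the actual system distributes chunks across workstations and must also account for data placement and transfer costs.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in per-chunk task (e.g., a per-tile image operation)."""
    return sum(v * v for v in chunk)

def parallel_map_reduce(data, chunk_size=1000, workers=4):
    """Split `data` into chunks, process them concurrently, and reduce
    the partial results. Each chunk fits in memory independently."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

data = list(range(10000))
total = parallel_map_reduce(data)   # equals the serial sum of squares
```

Because chunks are independent, the same structure extends naturally to process pools or remote workers; the reduction step is the only synchronization point.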
Aghaeepour, Nima; Chattopadhyay, Pratip; Chikina, Maria; Dhaene, Tom; Van Gassen, Sofie; Kursa, Miron; Lambrecht, Bart N; Malek, Mehrnoush; McLachlan, G J; Qian, Yu; Qiu, Peng; Saeys, Yvan; Stanton, Rick; Tong, Dong; Vens, Celine; Walkowiak, Sławomir; Wang, Kui; Finak, Greg; Gottardo, Raphael; Mosmann, Tim; Nolan, Garry P; Scheuermann, Richard H; Brinkman, Ryan R
2016-01-01
The Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) challenges were established to compare the performance of computational methods for identifying cell populations in multidimensional flow cytometry data. Here we report the results of FlowCAP-IV where algorithms from seven different research groups predicted the time to progression to AIDS among a cohort of 384 HIV+ subjects, using antigen-stimulated peripheral blood mononuclear cell (PBMC) samples analyzed with a 14-color staining panel. Two approaches (FlowReMi.1 and flowDensity-flowType-RchyOptimyx) provided statistically significant predictive value in the blinded test set. Manual validation of submitted results indicated that unbiased analysis of single cell phenotypes could reveal unexpected cell types that correlated with outcomes of interest in high dimensional flow cytometry datasets. © 2015 International Society for Advancement of Cytometry.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Doughty, Benjamin; Simpson, Mary Jane; Yang, Bin
Our work aims to simplify multi-dimensional femtosecond transient absorption microscopy (TAM) data into decay-associated amplitude maps that describe the spatial distributions of dynamical processes occurring on various characteristic timescales. Application of this method to TAM data obtained from a model methyl-ammonium lead iodide (CH3NH3PbI3) perovskite thin film allows us to reduce a dataset consisting of 68 time-resolved images to 4 decay-associated amplitude maps. Furthermore, these maps provide a simple means to visualize the complex electronic excited-state dynamics in this system by separating distinct dynamical processes evolving on characteristic timescales into individual spatial images. This approach provides new insight into subtle aspects of ultrafast relaxation dynamics associated with excitons and charge carriers in the perovskite thin film, which have recently been found to coexist at spatially distinct locations.
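If the characteristic timescales are fixed in advance, the decay-associated amplitude maps follow from per-pixel linear least squares: each pixel's time trace is fit as a sum of exponentials with known time constants, and the fitted coefficients form one amplitude map per timescale. The sketch below makes that simplifying assumption (the authors' fitting procedure may differ) on a synthetic stack.

```python
import numpy as np

def decay_amplitude_maps(stack, times, taus):
    """Fit I(t, x, y) = sum_k A_k(x, y) * exp(-t / tau_k) by linear
    least squares; returns one amplitude map per timescale tau_k."""
    T, H, W = stack.shape
    basis = np.exp(-times[:, None] / np.asarray(taus)[None, :])  # (T, K)
    flat = stack.reshape(T, H * W)                               # (T, H*W)
    amps, *_ = np.linalg.lstsq(basis, flat, rcond=None)
    return amps.reshape(len(taus), H, W)

# Synthetic 68-frame stack: fast decay on the left half, slow on the right.
times = np.linspace(0.0, 5.0, 68)
stack = np.zeros((68, 8, 8))
stack[:, :, :4] = np.exp(-times / 0.5)[:, None, None]
stack[:, :, 4:] = np.exp(-times / 4.0)[:, None, None]
maps = decay_amplitude_maps(stack, times, taus=[0.5, 4.0])
```

Each amplitude map then shows where in the image a given timescale contributes, which is exactly the kind of spatial separation of processes the abstract describes.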
Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun
2014-01-01
As an abstract mapping of the gene regulations in the cell, the gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while providing robustness to large errors or outliers. To solve the optimization problem involved in the proposed method, an efficient algorithm that combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both the area under the receiver operating characteristic curve and the area under the precision-recall curve. The convergence analysis theoretically shows that the sequence generated by the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
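The robustness property exploited above can be illustrated with a minimal sketch of the Huber loss itself (this is not the authors' full Huber group LASSO estimator; the delta threshold and toy residuals are illustrative choices):

```python
def huber(r, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones,
    so a single outlier contributes far less than under squared loss."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

residuals = [0.1, -0.5, 10.0]            # 10.0 plays the role of an outlier
sq = [0.5 * r * r for r in residuals]    # squared loss: [0.005, 0.125, 50.0]
hu = [huber(r) for r in residuals]       # Huber loss:   [0.005, 0.125, 9.5]
```

For the outlier, squared loss contributes 50.0 while the Huber loss contributes only 9.5, which is why a few contaminated expression measurements no longer dominate the fitted network.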
Imbalanced class learning in epigenetics.
Haque, M Muksitul; Skinner, Michael K; Holder, Lawrence B
2014-07-01
In machine learning, one of the important criteria for higher classification accuracy is a balanced dataset. Datasets with a large ratio between the minority and majority classes hinder learning for any classifier. Datasets with an order-of-magnitude difference in the number of instances between classes of the target concept exhibit an imbalanced class distribution. Such datasets can come from biological data, sensor data, medical diagnostics, or any other domain where labeling instances of the minority class is time-consuming or costly, or where the data are not easily available. The current study investigates a number of imbalanced class algorithms for solving the imbalanced class distribution present in epigenetic datasets. Epigenetic (DNA methylation) datasets inherently come with few differentially DNA methylated regions (DMRs) and a much higher number of non-DMR sites. For this class imbalance problem, a number of algorithms are compared, including the TAN+AdaBoost algorithm. Experiments performed on four epigenetic datasets and several known datasets show that an imbalanced dataset can achieve accuracy similar to that of a regular learner on a balanced dataset.
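The study above compares dedicated imbalanced-class algorithms; as a simpler baseline for intuition, the sketch below shows naive random oversampling of the minority class (this is not one of the methods evaluated in the abstract, and the "DMR"/"non" labels and feature values are made up):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Naively rebalance a dataset by resampling minority instances
    with replacement until the two classes are equal in size."""
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
    majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = minority + majority + extra
    rng.shuffle(combined)
    return [x for x, _ in combined], [l for _, l in combined]

# Toy epigenetic-style data: 2 minority ("DMR") vs 4 majority ("non") sites.
X = [[0.1], [0.2], [0.9], [1.0], [1.1], [1.2]]
y = ["DMR", "DMR", "non", "non", "non", "non"]
Xb, yb = random_oversample(X, y, "DMR")
# Both classes now appear 4 times each.
```

Oversampling duplicates information rather than adding it, which is one reason algorithm-level approaches such as boosting variants are often preferred for severe imbalance.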
An Aggregate IRT Procedure for Exploratory Factor Analysis
ERIC Educational Resources Information Center
Camilli, Gregory; Fox, Jean-Paul
2015-01-01
An aggregation strategy is proposed to potentially address practical limitations related to computing resources for two-level multidimensional item response theory (MIRT) models with large data sets. The aggregate model is derived by integration of the normal ogive model, and an adaptation of the stochastic approximation expectation maximization…
ERIC Educational Resources Information Center
Hoogeveen, Lianne; van Hell, Janet G.; Verhoeven, Ludo
2012-01-01
Background: In the studies of acceleration conducted so far a multidimensional perspective has largely been neglected. No attempt has been made to relate social-emotional characteristics of accelerated versus non-accelerated students in perspective of environmental factors. Aims: In this study, social-emotional characteristics of accelerated…
Evolving Approaches to the Study of Childhood Poverty and Education
ERIC Educational Resources Information Center
Hannum, Emily; Liu, Ran; Alvarado-Urbina, Andrea
2017-01-01
Social scientists have conceptualised poverty in multiple ways, with measurement approaches that seek to identify absolute, relative, subjective, and multi-dimensional poverty. The concept of poverty is central in the comparative education field, but has been empirically elusive in many large, international educational surveys: these studies have…
NASA Astrophysics Data System (ADS)
Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram
2010-05-01
Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for management and processing of such datasets using binary large object implementations (BLOBs) in database systems versus implementation in Hadoop files using the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as bandwidth and response time performance. This requires partitioning larger files into a set of smaller files, with the concomitant requirement of managing large numbers of files. Storing these sub-files as BLOBs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these BLOBs and would run in parallel across the set of nodes in the cluster. On the other hand, there are both storage overheads and constraints, and software licensing dependencies created by such an implementation. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available.
Another consideration is the strategy used for partitioning large data collections, and large datasets within collections, using round-robin, hash, or range partitioning methods. Each has different characteristics in terms of spatial locality of data and the resultant degree of declustering of the computations on the data. Furthermore, we have observed that, in practice, there can be large variations in the frequency of access to different parts of a large data collection and/or dataset, thereby creating "hotspots" in the data. We will evaluate how effectively the different approaches deal with such hotspots, along with alternative strategies for mitigating them.
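The three partitioning strategies named above can be sketched as follows (a toy illustration over integer keys, not the GEON/OpenTopography implementation):

```python
def round_robin(keys, n):
    """Round-robin: perfectly even spread, but no spatial locality."""
    parts = [[] for _ in range(n)]
    for i, k in enumerate(keys):
        parts[i % n].append(k)
    return parts

def hash_partition(keys, n):
    """Hash: even in expectation, destroys locality, good for point lookups."""
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[hash(k) % n].append(k)
    return parts

def range_partition(keys, boundaries):
    """Range: keeps nearby keys together (good for spatial subsetting),
    but uneven access patterns can create hotspot partitions."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        for i, b in enumerate(boundaries):
            if k < b:
                parts[i].append(k)
                break
        else:
            parts[-1].append(k)
    return parts

keys = list(range(10))
rr = round_robin(keys, 3)            # [[0,3,6,9], [1,4,7], [2,5,8]]
hp = hash_partition(keys, 3)
rp = range_partition(keys, [4, 8])   # [[0,1,2,3], [4,5,6,7], [8,9]]
```

The trade-off in the talk is visible even in this toy: range partitioning preserves the locality a spatial subsetting query wants, while round-robin and hash partitioning spread any single query's work (and any hotspot's load) across all nodes.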
Toward Computational Cumulative Biology by Combining Models of Biological Datasets
Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel
2014-01-01
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176
Klein, Max; Sharma, Rati; Bohrer, Chris H; Avelis, Cameron M; Roberts, Elijah
2017-01-15
Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology. Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark. Contact: eroberts@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
A semiparametric graphical modelling approach for large-scale equity selection.
Liu, Han; Mulvey, John; Zhao, Tianqi
2016-01-01
We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption.
Astephen, J L; Deluzio, K J
2005-02-01
Osteoarthritis of the knee is related to many correlated mechanical factors that can be measured with gait analysis. Gait analysis results in large data sets. The analysis of these data is difficult due to the correlated, multidimensional nature of the measures. A multidimensional model that uses two multivariate statistical techniques, principal component analysis and discriminant analysis, was used to discriminate between the gait patterns of the normal subject group and the osteoarthritis subject group. Nine time varying gait measures and eight discrete measures were included in the analysis. All interrelationships between and within the measures were retained in the analysis. The multidimensional analysis technique successfully separated the gait patterns of normal and knee osteoarthritis subjects with a misclassification error rate of <6%. The most discriminatory feature described a static and dynamic alignment factor. The second most discriminatory feature described a gait pattern change during the loading response phase of the gait cycle. The interrelationships between gait measures and between the time instants of the gait cycle can provide insight into the mechanical mechanisms of pathologies such as knee osteoarthritis. These results suggest that changes in frontal plane loading and alignment and the loading response phase of the gait cycle are characteristic of severe knee osteoarthritis gait patterns. Subsequent investigations earlier in the disease process may suggest the importance of these factors to the progression of knee osteoarthritis.
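The first stage of the multidimensional model above, principal component analysis, can be sketched via power iteration on the sample covariance matrix (a generic illustration with toy two-dimensional data, not the authors' gait model, which also applies discriminant analysis to the resulting features):

```python
def first_principal_component(data, iters=200):
    """Power iteration on the sample covariance matrix to find the
    direction of maximum variance (the first principal component)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy "gait measures": the second column is roughly twice the first,
# so almost all variance lies along a single direction.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc1 = first_principal_component(data)
# pc1 points approximately along (1, 2), the shared variation.
```

This is exactly how correlated gait measures collapse onto a few interpretable features: the component captures the shared variation (here, the near-exact 1:2 relationship) in a single direction.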
NASA Astrophysics Data System (ADS)
Griffiths, Thomas; Habler, Gerlinde; Schantl, Philip; Abart, Rainer
2017-04-01
Crystallographic orientation relationships (CORs) between crystalline inclusions and their hosts are commonly used to support particular inclusion origins, but often interpretations are based on a small fraction of all inclusions in a system. The electron backscatter diffraction (EBSD) method allows collection of large COR datasets more quickly than other methods while maintaining high spatial resolution. Large datasets allow analysis of the relative frequencies of different CORs, and identification of 'statistical CORs', where certain limited degrees of freedom exist in the orientation relationship between two neighbour crystals (Griffiths et al. 2016). Statistical CORs exist in addition to completely fixed 'specific' CORs (previously the only type of COR considered). We present a comparison of three EBSD single point datasets (all N > 200 inclusions) of rutile inclusions in garnet hosts, covering three rock systems, each with a different geological history: 1) magmatic garnet in pegmatite from the Koralpe complex, Eastern Alps, formed at temperatures > 600°C and low pressures; 2) granulite facies garnet rims on ultra-high-pressure garnets from the Kimi complex, Rhodope Massif; and 3) a Moldanubian granulite from the southeastern Bohemian Massif, equilibrated at peak conditions of 1050°C and 1.6 GPa. The present study is unique because all datasets have been analysed using the same catalogue of potential CORs, therefore relative frequencies and other COR properties can be meaningfully compared. In every dataset > 94% of the inclusions analysed exhibit one of the CORs tested for. Certain CORs are consistently among the most common in all datasets. However, the relative abundances of these common CORs show large variations between datasets (varying from 8 to 42 % relative abundance in one case). Other CORs are consistently uncommon but nonetheless present in every dataset. Lastly, there are some CORs that are common in one of the datasets and rare in the remainder. 
These patterns suggest competing influences on relative COR frequencies. Certain CORs seem consistently favourable, perhaps pointing to very stable low energy configurations, whereas some CORs are favoured in only one system, perhaps due to particulars of the formation mechanism, kinetics or conditions. Variations in COR frequencies between datasets seem to correlate with the conditions of host-inclusion system evolution. The two datasets from granulite-facies metamorphic samples show more similarities to each other than to the pegmatite dataset, and the sample inferred to have experienced the highest temperatures (Moldanubian granulite) shows the lowest diversity of CORs, low frequencies of statistical CORs and the highest frequency of specific CORs. These results provide evidence that petrological information is being encoded in COR distributions. They make a strong case for further studies of the factors influencing COR development and for measurements of COR distributions in other systems and between different phases. Griffiths, T.A., Habler, G., Abart, R. (2016): Crystallographic orientation relationships in host-inclusion systems: New insights from large EBSD data sets. Amer. Miner., 101, 690-705.
Multiresolution persistent homology for excessively large biomolecular datasets
NASA Astrophysics Data System (ADS)
Xia, Kelin; Zhao, Zhixiong; Wei, Guo-Wei
2015-10-01
Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273,780 atoms is successfully analyzed, which would otherwise be inaccessible to the normal point cloud method and unreliable when using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to protein domain classification, which is, to our knowledge, the first time that persistent homology has been used for practical protein domain analysis. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.
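As a minimal illustration of persistent homology (0-dimensional persistence over a plain distance filtration, far simpler than the multiresolution rigidity-density filtration described above), a union-find pass over edges sorted by length yields the H0 merge distances; all components are born at threshold 0, and a long-lived bar signals coarse-scale structure:

```python
import itertools, math

def h0_persistence(points):
    """0-dimensional persistent homology of a point cloud: each point is
    born a component at filtration value 0; when the growing distance
    threshold merges two components, one dies. Returns the merge (death)
    distances in filtration order."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies at threshold d
    return deaths

# Two well-separated clusters: two short bars and one long one.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
bars = h0_persistence(pts)
# bars == [1.0, 1.0, 10.0]: the long bar reflects the two-cluster structure.
```

The resolution-matching idea in the abstract amounts to choosing the filtration so that bars at uninteresting fine scales never appear in the first place.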
NASA Astrophysics Data System (ADS)
Jiménez-Ruano, Adrián; Rodrigues Mimbrero, Marcos; de la Riva Fernández, Juan
2017-04-01
Understanding fire regime is a crucial step towards achieving a better knowledge of the wildfire phenomenon. This study proposes a method for the analysis of fire regime based on multidimensional scatterplots (MDS). MDS are a visual approach that allows direct comparison among several variables and fire regime features so that we are able to unravel spatial patterns and relationships within the region of analysis. Our analysis is conducted in Spain, one of the most fire-affected areas within the Mediterranean region. Specifically, the Spanish territory has been split into three regions - Northwest, Hinterland and Mediterranean - considered as representative fire regime zones according to MAGRAMA (Spanish Ministry of Agriculture, Environment and Food). The main goal is to identify key relationships between fire frequency and burnt area, two of the most common fire regime features, and socioeconomic activity and climate. In this way we will be able to better characterize fire activity within each fire region. Fire data for the period 1974-2010 were retrieved from the General Statistics Forest Fires database (EGIF). Specifically, fire frequency and burnt area size were examined for each region and fire season (summer and winter). Socioeconomic activity was defined in terms of human pressure on wildlands, i.e. the presence and intensity of anthropogenic activity near wildland or forest areas. Human pressure was built from GIS spatial information about land use (wildland-agriculture and wildland-urban interface) and demographic potential. Climate variables (average maximum temperature and annual precipitation) were extracted from the MOTEDAS (Monthly Temperature Dataset of Spain) and MOPREDAS (Monthly Precipitation Dataset of Spain) datasets and later reclassified into ten categories. All these data were resampled to fit the 10×10 km grid used as the spatial reference for fire data.
Climate and socioeconomic variables were then explored by means of MDS to find the extent to which fire frequency and burnt areas are controlled by environmental factors, human factors, or both. Results reveal a noticeable link between fire frequency and human activity, especially in the Northwest area during winter. On the other hand, in the Hinterland and Mediterranean regions, human and climate factors 'work' together in terms of their relationship with fire activity, with the concurrence of high human pressure and favourable climate conditions being the main driver. In turn, burned area shows a similar behaviour except in the Hinterland region, where fire-affected area depends mostly on climate factors. Overall, we can conclude that the visual analysis of multidimensional scatterplots has proved to be a powerful tool that facilitates the characterization and investigation of fire regimes.
NASA Astrophysics Data System (ADS)
Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.
2017-12-01
We will present an implementation and study of several use-cases of utilizing Virtual Reality (VR) for immersive display, interaction, and analysis of large and complex 3D datasets. These datasets have been acquired by instruments across several Earth, planetary, and solar space robotics missions. First, we will describe the architecture of the common application framework that was developed to ingest data, interface with VR display devices, and program input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and the advantages of each highlighted. We will then present experimental immersive analytics visual constructs that enable augmentation of 3D datasets with 2D ones such as images and statistical and abstract data. We will conclude by presenting a comparative analysis with traditional visualization applications and share the feedback provided by our users: scientists and engineers.
Decision tree methods: applications for classification and prediction.
Song, Yan-Yan; Lu, Ying
2015-04-25
Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model, and the validation dataset is used to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms for developing decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
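The train/validation procedure described above can be sketched with a tiny misclassification-based tree learner (an illustrative toy, not CART, C4.5, CHAID, or QUEST; the XOR-style data are chosen so that depth 1 underfits while depth 2 suffices):

```python
def majority(labels):
    return max(set(labels), key=labels.count)

def build_tree(X, y, max_depth, depth=0):
    """Grow a binary tree by choosing, at each node, the (feature, threshold)
    split that minimizes misclassification under majority-vote leaves."""
    if depth == max_depth or len(set(y)) <= 1:
        return majority(y)                      # leaf node
    best = None                                 # (error, feature, threshold)
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [l for row, l in zip(X, y) if row[f] < t]
            right = [l for row, l in zip(X, y) if row[f] >= t]
            if not left or not right:
                continue
            err = (sum(l != majority(left) for l in left)
                   + sum(l != majority(right) for l in right))
            if best is None or err < best[0]:
                best = (err, f, t)
    if best is None:
        return majority(y)
    _, f, t = best
    L = [(row, l) for row, l in zip(X, y) if row[f] < t]
    R = [(row, l) for row, l in zip(X, y) if row[f] >= t]
    return (f, t,
            build_tree([r for r, _ in L], [l for _, l in L], max_depth, depth + 1),
            build_tree([r for r, _ in R], [l for _, l in R], max_depth, depth + 1))

def predict(node, row):
    while isinstance(node, tuple):              # internal nodes are tuples
        f, t, lo, hi = node
        node = hi if row[f] >= t else lo
    return node

def accuracy(tree, X, y):
    return sum(predict(tree, r) == l for r, l in zip(X, y)) / len(y)

# XOR-like data: no single split separates it, two levels of splits do.
train_X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 3
train_y = [0, 1, 1, 0] * 3
val_X, val_y = [[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0]

# Use the validation set to choose the tree size, as the paper describes.
best_depth, best_acc = None, -1.0
for d in (1, 2, 3):
    acc = accuracy(build_tree(train_X, train_y, d), val_X, val_y)
    if acc > best_acc:
        best_depth, best_acc = d, acc
```

Here validation selects depth 2: depth 1 cannot express the interaction, and depth 3 offers no further gain, which is precisely the pruning intuition behind choosing "the appropriate tree size".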
Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.
Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O; Gelfand, Alan E
2016-01-01
Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations become large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online.
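The key construction behind the NNGP's sparsity, conditioning each location on only a small set of nearest preceding neighbors rather than on all previous locations, can be sketched as follows (an illustration of the neighbor-set idea only, with made-up coordinates, not the hierarchical model or its MCMC machinery):

```python
import math

def neighbor_sets(locations, m):
    """For an ordered list of spatial locations, return for each location
    the indices of (at most) its m nearest *preceding* locations. In an
    NNGP these small conditioning sets replace full GP conditioning,
    yielding sparse precision matrices and per-iteration cost that is
    linear in the number of locations."""
    sets = []
    for i, p in enumerate(locations):
        nearest = sorted(range(i), key=lambda j: math.dist(p, locations[j]))
        sets.append(nearest[:m])
    return sets

locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (1.1, 0.1)]
ns = neighbor_sets(locs, 2)
# ns[4] holds the two closest earlier points to (1.1, 0.1): indices 1 and 0.
```

Because every conditioning set has at most m entries, each location contributes a fixed amount of work regardless of the total number of locations, which is what makes the approach scale to massive datasets like the Forest Inventory example.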
geneLAB: Expanding the Impact of NASA's Biological Research in Space
NASA Technical Reports Server (NTRS)
Rayl, Nicole; Smith, Jeffrey D.
2014-01-01
The geneLAB project is designed to leverage the value of large 'omics' datasets from molecular biology projects conducted on the ISS by making these datasets available, citable, discoverable, interpretable, reusable, and reproducible. geneLAB will create a collaboration space with an integrated set of tools for depositing, accessing, analyzing, and modeling these diverse datasets from spaceflight and related terrestrial studies.
Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila
2018-01-01
The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
Multidimensional poverty, household environment and short-term morbidity in India.
Dehury, Bidyadhar; Mohanty, Sanjay K
2017-01-01
Using the unit data from the second round of the Indian Human Development Survey (IHDS-II), 2011-2012, which covered 42,152 households, this paper examines the association between multidimensional poverty, household environmental deprivation and short-term morbidities (fever, cough and diarrhoea) in India. Poverty is measured in a multidimensional framework that includes the dimensions of education, health and income, while household environmental deprivation is defined as lack of access to improved sanitation, drinking water and cooking fuel. A composite index combining multidimensional poverty and household environmental deprivation has been computed, and households are classified as follows: multidimensional poor and living in a poor household environment, multidimensional non-poor and living in a poor household environment, multidimensional poor and living in a good household environment and multidimensional non-poor and living in a good household environment. Results suggest that about 23% of the population belonging to multidimensional poor households and living in a poor household environment had experienced short-term morbidities in a reference period of 30 days compared to 20% of the population belonging to multidimensional non-poor households and living in a poor household environment, 19% of the population belonging to multidimensional poor households and living in a good household environment and 15% of the population belonging to multidimensional non-poor households and living in a good household environment. 
Controlling for socioeconomic covariates, the odds of short-term morbidity was 1.47 [CI 1.40-1.53] among the multidimensional poor and living in a poor household environment, 1.28 [CI 1.21-1.37] among the multidimensional non-poor and living in a poor household environment and 1.21 [CI 1.64-1.28] among the multidimensional poor and living in a good household environment compared to the multidimensional non-poor and living in a good household environment. Results are robust across states and hold good for each of the three morbidities: fever, cough and diarrhoea. This establishes that along with poverty, household environmental conditions have a significant bearing on short-term morbidities in India. Public investment in sanitation, drinking water and cooking fuel can reduce the morbidity and improve the health of the population.
Hu, Wenjun; Chung, Fu-Lai; Wang, Shitong
2012-03-01
Although pattern classification has been extensively studied in past decades, how to train classifiers efficiently on large datasets is a problem that still requires particular attention. Many kernelized classification methods, such as SVM and SVDD, can be formulated as quadratic programming (QP) problems, but computing the associated kernel matrices requires O(n²) (or even up to O(n³)) computational complexity, where n is the number of training patterns, which heavily limits the applicability of these methods to large datasets. In this paper, a new classification method called the maximum vector-angular margin classifier (MAMC) is first proposed based on the vector-angular margin to find an optimal vector c in the pattern feature space, such that all testing patterns can be classified in terms of the maximum vector-angular margin ρ between the vector c and the training data points. It is then proved that the kernelized MAMC can be equivalently formulated as the kernelized Minimum Enclosing Ball (MEB) problem, which leads to a distinctive merit of MAMC: like ν-SVC, it has the flexibility of controlling the sum of support vectors, and it may be extended to a maximum vector-angular margin core vector machine (MAMCVM) by connecting the core vector machine (CVM) method with MAMC, so that fast training on large datasets can be effectively achieved. Experimental results on artificial and real datasets validate the power of the proposed methods.
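The O(n²) kernel-matrix cost that motivates this line of work is easy to make concrete. A small NumPy sketch (RBF kernel assumed as the example) showing that the Gram matrix alone needs n² entries:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Dense RBF Gram matrix: n^2 entries, the memory/compute bottleneck
    for kernel QP solvers such as SVM and SVDD on large training sets."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
K = rbf_kernel_matrix(X)
# 1,000 points -> 8 MB of doubles; 10^6 points would need ~8 TB
print(K.shape, K.nbytes / 1e6, "MB")
```

The quadratic blow-up in `K.nbytes` is exactly why core-vector-machine-style approximations avoid materializing the full kernel matrix.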
Training Scalable Restricted Boltzmann Machines Using a Quantum Annealer
NASA Astrophysics Data System (ADS)
Kumar, V.; Bass, G.; Dulny, J., III
2016-12-01
Machine learning, and the optimization involved therein, is of critical importance for commercial and military applications. Due to the computational complexity of many-variable optimization, the conventional approach is to employ meta-heuristic techniques to find suboptimal solutions. Quantum Annealing (QA) hardware offers a completely novel approach, with the potential to obtain significantly better solutions with large speed-ups compared to traditional computing. In this presentation, we describe our development of new machine learning algorithms tailored for QA hardware. We are training restricted Boltzmann machines (RBMs) using QA hardware on large, high-dimensional commercial datasets. Traditional optimization heuristics such as contrastive divergence and other closely related techniques are slow to converge, especially on large datasets. Recent studies have indicated that QA hardware, when used as a sampler, provides better training performance compared to conventional approaches. Most of these studies have been limited to moderately sized datasets due to the hardware restrictions imposed by existing QA devices, which make it difficult to solve real-world problems at scale. In this work we develop novel strategies to circumvent this issue. We discuss scale-up techniques such as enhanced embedding and partitioned RBMs, which allow large commercial datasets to be learned using QA hardware. We present our initial results obtained by training an RBM as an autoencoder on an image dataset. The results obtained so far indicate that convergence rates can be improved significantly by increasing RBM network connectivity. These ideas can be readily applied to generalized Boltzmann machines, and we are currently investigating this in an ongoing project.
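For reference, the contrastive-divergence baseline that a QA sampler is meant to replace can be sketched in a few lines. This is a minimal CD-1 update for a binary RBM; biases are omitted for brevity and all sizes are illustrative, not taken from the study:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, lr=0.05):
    """One contrastive-divergence (CD-1) weight update for a binary RBM.
    QA-based training replaces the Gibbs sampling below with samples
    drawn from annealing hardware."""
    h_prob0 = sigmoid(v0 @ W)                              # positive phase
    h0 = (rng.random(h_prob0.shape) < h_prob0).astype(float)
    v_prob1 = sigmoid(h0 @ W.T)                            # reconstruction
    h_prob1 = sigmoid(v_prob1 @ W)                         # negative phase
    return W + lr * (v0.T @ h_prob0 - v_prob1.T @ h_prob1) / len(v0)

W = rng.normal(scale=0.01, size=(6, 3))                    # 6 visible, 3 hidden
batch = (rng.random((8, 6)) < 0.5).astype(float)           # toy binary data
W = cd1_step(W, batch)
print(W.shape)
```

The slow mixing of this Gibbs chain on large datasets is the convergence problem the abstract refers to.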
A dataset of human decision-making in teamwork management.
Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang
2017-01-17
Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.
GODIVA2: interactive visualization of environmental data on the Web.
Blower, J D; Haines, K; Santokhee, A; Liu, C L
2009-03-13
GODIVA2 is a dynamic website that provides visual access to several terabytes of physically distributed, four-dimensional environmental data. It allows users to explore large datasets interactively without the need to install new software or download and understand complex data. Through the use of open international standards, GODIVA2 maintains a high level of interoperability with third-party systems, allowing diverse datasets to be mutually compared. Scientists can use the system to search for features in large datasets and to diagnose the output from numerical simulations and data processing algorithms. Data providers around Europe have adopted GODIVA2 as an INSPIRE-compliant dynamic quick-view system for providing visual access to their data.
Reblin, Maija; Clayton, Margaret F; John, Kevin K; Ellington, Lee
2015-01-01
In this paper, we present strategies for collecting and coding a large longitudinal communication dataset collected across multiple sites, consisting of over 2000 hours of digital audio recordings from approximately 300 families. We describe our methods within the context of implementing a large-scale study of communication during cancer home hospice nurse visits, but this procedure could be adapted to communication datasets across a wide variety of settings. This research is the first study designed to capture home hospice nurse-caregiver communication, a highly understudied location and type of communication event. We present a detailed example protocol encompassing data collection in the home environment, large-scale, multi-site secure data management, the development of theoretically-based communication coding, and strategies for preventing coder drift and ensuring reliability of analyses. Although each of these challenges has the potential to undermine the utility of the data, reliability between coders is often the only issue consistently reported and addressed in the literature. Overall, our approach demonstrates rigor and provides a “how-to” example for managing large, digitally-recorded data sets from collection through analysis. These strategies can inform other large-scale health communication research. PMID:26580414
Han, Qing; Bradshaw, Elizabeth M; Nilsson, Björn; Hafler, David A; Love, J Christopher
2010-06-07
The large diversity of cells that comprise the human immune system requires methods that can resolve the individual contributions of specific subsets to an immunological response. Microengraving is a process that uses a dense, elastomeric array of microwells to generate microarrays of proteins secreted from large numbers of individual live cells (approximately 10⁴-10⁵ cells/assay). In this paper, we describe an approach based on this technology to quantify the rates of secretion from single immune cells. Numerical simulations of the microengraving process indicated an operating regime between 30 min and 4 h that permits quantitative analysis of the rates of secretion. Through experimental validation, we demonstrate that microengraving can provide quantitative measurements of both the frequencies and the distribution in rates of secretion for up to four cytokines simultaneously released from individual viable primary immune cells. The experimental limits of detection ranged from 0.5 to 4 molecules/s for IL-6, IL-17, IFN-γ, IL-2, and TNF-α. These multidimensional measures resolve the number and intensities of responses by cells exposed to stimuli with greater sensitivity than single-parameter assays for cytokine release. We show that cells from different donors exhibit distinct responses based on both the frequency and magnitude of cytokine secretion when stimulated under different activating conditions. Primary T cells with specific profiles of secretion can also be recovered after microengraving for subsequent expansion in vitro. These examples demonstrate the utility of quantitative, multidimensional profiles of single cells for analyzing the diversity and dynamics of immune responses in vitro and for identifying rare cells from clinical samples.
OsiriX: an open-source software for navigating in multidimensional DICOM images.
Rosset, Antoine; Spadola, Luca; Ratib, Osman
2004-09-01
A multidimensional image navigation and display software was designed for display and interpretation of large sets of multidimensional and multimodality images such as combined PET-CT studies. The software is developed in Objective-C on a Macintosh platform under the MacOS X operating system using the GNUstep development environment. It also benefits from the extremely fast and optimized 3D graphic capabilities of the OpenGL graphics standard, widely used in computer games and optimized to take advantage of any available hardware graphics accelerator boards. In the design of the software, special attention was given to adapting the user interface to the specific and complex tasks of navigating through large sets of image data. An interactive jog-wheel device, widely used in the video and movie industry, was implemented to allow users to navigate in the different dimensions of an image set much faster than with a traditional mouse or on-screen cursors and sliders. The program can easily be adapted for very specific tasks that require a limited number of functions, by adding and removing tools from the program's toolbar and avoiding an overwhelming number of unnecessary tools and functions. The processing and image rendering tools of the software are based on the open-source libraries ITK and VTK. This ensures that all new developments in image processing that could emerge from other academic institutions using these libraries can be directly ported to the OsiriX program. OsiriX is provided free of charge under the GNU open-source licensing agreement at http://homepage.mac.com/rossetantoine/osirix.
Verdin, Kristine L.; Godt, Jonathan W.; Funk, Christopher C.; Pedreros, Diego; Worstell, Bruce; Verdin, James
2007-01-01
Landslides resulting from earthquakes can cause widespread loss of life and damage to critical infrastructure. The U.S. Geological Survey (USGS) has developed an alarm system, PAGER (Prompt Assessment of Global Earthquakes for Response), that aims to provide timely information to emergency relief organizations on the impact of earthquakes. Landslides are responsible for many of the damaging effects following large earthquakes in mountainous regions, and thus data defining the topographic relief and slope are critical to the PAGER system. A new global topographic dataset was developed to aid in rapidly estimating landslide potential following large earthquakes. We used the remotely-sensed elevation data collected as part of the Shuttle Radar Topography Mission (SRTM) to generate a slope dataset with nearly global coverage. Slopes from the SRTM data, computed at 3-arc-second resolution, were summarized at 30-arc-second resolution, along with statistics developed to describe the distribution of slope within each 30-arc-second pixel. Because there are many small areas lacking SRTM data and the northern limit of the SRTM mission was lat 60°N., statistical methods referencing other elevation data were used to fill the voids within the dataset and to extrapolate the data north of 60°N. The dataset will be used in the PAGER system to rapidly assess the susceptibility of areas to landsliding following large earthquakes.
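The derive-then-summarize step described above can be sketched with NumPy on synthetic terrain. The 10×10 block factor mirrors aggregating 3-arc-second slope cells into 30-arc-second pixels; the cell size and the choice of mean/max statistics are illustrative assumptions, not taken from the USGS processing chain:

```python
import numpy as np

def slope_deg(dem, cell):
    """Slope in degrees from a DEM via finite-difference gradients."""
    dzdy, dzdx = np.gradient(dem, cell)
    return np.degrees(np.arctan(np.hypot(dzdx, dzdy)))

def block_stats(arr, k):
    """Summarize k x k blocks of fine-resolution cells (e.g. 10 x 10:
    3-arc-second -> 30-arc-second), returning per-block mean and max."""
    h, w = arr.shape
    b = arr[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k)
    return b.mean(axis=(1, 3)), b.max(axis=(1, 3))

rng = np.random.default_rng(2)
dem = np.cumsum(rng.normal(size=(100, 100)), axis=0)   # toy terrain, ~90 m cells
mean_slope, max_slope = block_stats(slope_deg(dem, 90.0), k=10)
print(mean_slope.shape)
```

Additional per-block statistics (median, percentiles) drop into `block_stats` the same way.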
Summerfield, Taryn L.; Yu, Lianbo; Gulati, Parul; Zhang, Jie; Huang, Kun; Romero, Roberto; Kniss, Douglas A.
2011-01-01
A majority of the studies examining the molecular regulation of human labor have been conducted using single gene approaches. While the technology to produce multi-dimensional datasets is readily available, the means for facile analysis of such data are limited. The objective of this study was to develop a systems approach to infer regulatory mechanisms governing global gene expression in cytokine-challenged cells in vitro, and to apply these methods to predict gene regulatory networks (GRNs) in intrauterine tissues during term parturition. To this end, microarray analysis was applied to human amnion mesenchymal cells (AMCs) stimulated with interleukin-1β, and differentially expressed transcripts were subjected to hierarchical clustering, temporal expression profiling, and motif enrichment analysis, from which a GRN was constructed. These methods were then applied to fetal membrane specimens collected in the absence or presence of spontaneous term labor. Analysis of cytokine-responsive genes in AMCs revealed a sterile immune response signature, with promoters enriched in response elements for several inflammation-associated transcription factors. In comparison to the fetal membrane dataset, there were 34 genes commonly upregulated, many of which were part of an acute inflammation gene expression signature. Binding motifs for nuclear factor-κB were prominent in the gene interaction and regulatory networks for both datasets; however, we found little evidence to support the utilization of pathogen-associated molecular pattern (PAMP) signaling. The tissue specimens were also enriched for transcripts governed by hypoxia-inducible factor. The approach presented here provides an uncomplicated means to infer global relationships among gene clusters involved in cellular responses to labor-associated signals. PMID:21655103
Bhaskar, Anand; Javanmard, Adel; Courtade, Thomas A; Tse, David
2017-03-15
Genetic variation in human populations is influenced by geographic ancestry due to spatial locality in historical mating and migration patterns. Spatial population structure in genetic datasets has been traditionally analyzed using either model-free algorithms, such as principal components analysis (PCA) and multidimensional scaling, or using explicit spatial probabilistic models of allele frequency evolution. We develop a general probabilistic model and an associated inference algorithm that unify the model-based and data-driven approaches to visualizing and inferring population structure. Our spatial inference algorithm can also be effectively applied to the problem of population stratification in genome-wide association studies (GWAS), where hidden population structure can create fictitious associations when population ancestry is correlated with both the genotype and the trait. Our algorithm Geographic Ancestry Positioning (GAP) relates local genetic distances between samples to their spatial distances, and can be used for visually discerning population structure as well as accurately inferring the spatial origin of individuals on a two-dimensional continuum. On both simulated and several real datasets from diverse human populations, GAP exhibits substantially lower error in reconstructing spatial ancestry coordinates compared to PCA. We also develop an association test that uses the ancestry coordinates inferred by GAP to accurately account for ancestry-induced correlations in GWAS. Based on simulations and analysis of a dataset of 10 metabolic traits measured in a Northern Finland cohort, which is known to exhibit significant population structure, we find that our method has superior power to current approaches. Our software is available at https://github.com/anand-bhaskar/gap.
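The model-free half of this picture, recovering a 2-D embedding from pairwise distances, can be sketched with classical multidimensional scaling. This illustrates the generic technique named in the abstract, not the GAP algorithm itself:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS: embed samples in `dim` dimensions from a matrix of
    pairwise distances via double-centering and an eigendecomposition."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # Gram matrix of centered points
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:dim]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

rng = np.random.default_rng(6)
pts = rng.normal(size=(30, 2))             # "true" 2-D coordinates
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D)                       # recovered up to rotation/reflection
print(X.shape)
```

With exact Euclidean distances from 2-D points, the embedding reproduces the input distances; real genetic distances are noisy, which is where a probabilistic model like GAP's earns its keep.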
Flocks, James
2006-01-01
Scientific knowledge from the past century is commonly represented by two-dimensional figures and graphs, as presented in manuscripts and maps. Using today's computer technology, this information can be extracted and projected into three- and four-dimensional perspectives. Computer models can be applied to datasets to provide additional insight into complex spatial and temporal systems. This process can be demonstrated by applying digitizing and modeling techniques to valuable information within widely used publications. The seminal paper by D. Frazier, published in 1967, identified 16 separate delta lobes formed by the Mississippi River during the past 6,000 yrs. The paper includes stratigraphic descriptions through geologic cross-sections, and provides distribution and chronologies of the delta lobes. The data from Frazier's publication are extensively referenced in the literature. Additional information can be extracted from the data through computer modeling. Digitizing and geo-rectifying Frazier's geologic cross-sections produce a three-dimensional perspective of the delta lobes. Adding the chronological data included in the report provides the fourth-dimension of the delta cycles, which can be visualized through computer-generated animation. Supplemental information can be added to the model, such as post-abandonment subsidence of the delta-lobe surface. Analyzing the regional, net surface-elevation balance between delta progradations and land subsidence is computationally intensive. By visualizing this process during the past 4,500 yrs through multi-dimensional animation, the importance of sediment compaction in influencing both the shape and direction of subsequent delta progradations becomes apparent. Visualization enhances a classic dataset, and can be further refined using additional data, as well as provide a guide for identifying future areas of study.
Integrated Strategy Improves the Prediction Accuracy of miRNA in Large Dataset
Lipps, David; Devineni, Sree
2016-01-01
MiRNAs are short non-coding RNAs of about 22 nucleotides, which play critical roles in gene expression regulation. The biogenesis of miRNAs is largely determined by the sequence and structural features of their parental RNA molecules. Based on these features, multiple computational tools have been developed to predict whether RNA transcripts contain miRNAs. Although very successful, these predictors have started to face multiple challenges in recent years. Many predictors were optimized using datasets of hundreds of miRNA samples. The sizes of these datasets are much smaller than the number of known miRNAs. Consequently, the prediction accuracy of these predictors on large datasets becomes unknown and needs to be re-tested. In addition, many predictors were optimized for either high sensitivity or high specificity; these optimization strategies can seriously limit their applications. Moreover, to meet continually rising expectations of these computational tools, improving the prediction accuracy becomes extremely important. In this study, a meta-predictor, mirMeta, was developed by integrating a set of non-linear transformations with a meta-strategy. More specifically, the outputs of five individual predictors were first preprocessed using non-linear transformations, and then fed into an artificial neural network to make the meta-prediction. The prediction accuracy of the meta-predictor was validated using both multi-fold cross-validation and an independent dataset. The final accuracy of the meta-predictor on a newly designed large dataset improved by 7%, to 93%. The meta-predictor also proved to be less dependent on datasets, and to have a refined balance between sensitivity and specificity. This study is important in two respects: first, it shows that the combination of non-linear transformations and artificial neural networks improves the prediction accuracy of individual predictors.
Second, a new miRNA predictor with significantly improved prediction accuracy is developed for the community for identifying novel miRNAs and the complete set of miRNAs. Source code is available at: https://github.com/xueLab/mirMeta PMID:28002428
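The meta-strategy can be illustrated schematically. The logit transform and the fixed weights below are hypothetical stand-ins: mirMeta learns its combination with an artificial neural network, and its actual non-linear transformations are not reproduced here:

```python
import numpy as np

def logit(p, eps=1e-6):
    """An illustrative non-linear transformation applied to each base
    predictor's probability score before meta-combination."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def meta_predict(scores, w, b):
    """One-layer meta-predictor: transformed scores of the base predictors
    -> a single miRNA probability. Weights here are hypothetical; in
    practice they would be learned by the neural network."""
    return 1.0 / (1.0 + np.exp(-(logit(scores) @ w + b)))

scores = np.array([0.9, 0.8, 0.7, 0.95, 0.6])   # five base predictors
p = meta_predict(scores, w=np.full(5, 0.3), b=0.0)
print(round(float(p), 3))
```

Transforming scores to the logit scale before combining them keeps confident base predictions from saturating the meta-layer's input.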
Effects of Aromatherapy on Test Anxiety and Performance in College Students
ERIC Educational Resources Information Center
Dunnigan, Jocelyn Marie
2013-01-01
Test anxiety is a complex, multidimensional construct composed of cognitive, affective, and behavioral components that have been shown to negatively affect test performance. Furthermore, test anxiety is a pervasive problem in modern society largely related to the evaluative nature of educational programs, therefore meriting study of its nature,…
Reliability and Perceived Pedagogical Utility of a Weighted Music Performance Assessment Rubric
ERIC Educational Resources Information Center
Latimer, Marvin E., Jr.; Bergee, Martin J.; Cohen, Mary L.
2010-01-01
The purpose of this study was to investigate the reliability and perceived pedagogical utility of a multidimensional weighted performance assessment rubric used in Kansas state high school large-group festivals. Data were adjudicator rubrics (N = 2,016) and adjudicator and director questionnaires (N = 515). Rubric internal consistency was…
The Multi-Dimensional Nature of Emergency Communications Management
ERIC Educational Resources Information Center
Staman, E. Michael; Katsouros, Mark; Hach, Richard
2009-01-01
Within an incredibly short period--perhaps less than twenty-four months--the need for emergency preparedness has risen to a higher level of urgency than at any other time in the history of academe. Large or small, public or private, higher education institutions are seriously considering the dual problems of notification and communications…
ERIC Educational Resources Information Center
Goh, Ailsa E.
2010-01-01
A large majority of adults with intellectual disabilities are unemployed. Unemployment of adults with intellectual disabilities is a complex multidimensional issue. Some barriers to employment of individuals with intellectual disabilities are the lack of job experience and skills training. In recent years, video-based interventions, such as video…
Attributions for School Achievement of Anglo and Native American Community College Students.
ERIC Educational Resources Information Center
Powers, Stephen; Rossman, Mark H.
Attributions for school success and failure were examined among 211 community college students (112 Native Americans and 99 Anglos) enrolled in remedial reading classes at a large, urban multi-campus community college system in the Southwest. The Multidimensional-Multiattributional Causality Scale (MMCS) was administered to the students in their…
Bi-Factor MIRT Observed-Score Equating for Mixed-Format Tests
ERIC Educational Resources Information Center
Lee, Guemin; Lee, Won-Chan
2016-01-01
The main purposes of this study were to develop bi-factor multidimensional item response theory (BF-MIRT) observed-score equating procedures for mixed-format tests and to investigate relative appropriateness of the proposed procedures. Using data from a large-scale testing program, three types of pseudo data sets were formulated: matched samples,…
ERIC Educational Resources Information Center
Kroopnick, Marc Howard
2010-01-01
When Item Response Theory (IRT) is operationally applied for large scale assessments, unidimensionality is typically assumed. This assumption requires that the test measures a single latent trait. Furthermore, when tests are vertically scaled using IRT, the assumption of unidimensionality would require that the battery of tests across grades…
Computers as an Instrument for Data Analysis. Technical Report No. 11.
ERIC Educational Resources Information Center
Muller, Mervin E.
A review of statistical data analysis involving computers as a multi-dimensional problem provides the perspective for consideration of the use of computers in statistical analysis and the problems associated with large data files. An overall description of STATJOB, a particular system for doing statistical data analysis on a digital computer,…
Preparation for Old Age in Different Life Domains: Dimensions and Age Differences
ERIC Educational Resources Information Center
Kornadt, Anna E.; Rothermund, Klaus
2014-01-01
We investigated preparation for age-related changes from a multidimensional, life span perspective and administered a newly developed questionnaire to a large sample aged 30-80 years. Preparing for age-related changes was organized by life domains, with domain-specific types of preparation addressing obstacles and opportunities in the respective…
ERIC Educational Resources Information Center
Lockwood, J. R.; Castellano, Katherine E.
2017-01-01
Student Growth Percentiles (SGPs) increasingly are being used in the United States for inferences about student achievement growth and educator effectiveness. Emerging research has indicated that SGPs estimated from observed test scores have large measurement errors. As such, little is known about "true" SGPs, which are defined in terms…
Being Online Peer Supported: Experiences from a Work-Based Learning Programme
ERIC Educational Resources Information Center
Altinay Aksal, Fahriye; Altinay, Zehra; De Rossi, Gazivalerio; Isman, Aytekin
2012-01-01
Problem Statement: Work-based learning programmes have become an increasingly popular way of fulfilling the desire for life-long learning; multi-dimensional work-based learning modes have recently played a large role in both personal and institutional development. The peculiarity of this innovative way of learning derives from the fact that…
Across several EPA Program Offices (e.g., OPPTS, OW, OAR), there is a clear need to develop strategies and methods to screen large numbers of chemicals for potential toxicity, and to use the resulting information to prioritize the use of testing resources towards those entities a...
ERIC Educational Resources Information Center
Kaspar, Roman; Hartig, Johannes
2016-01-01
The care of older people was described as involving substantial emotion-related affordances. Scholars in vocational training and nursing disagree whether emotion-related skills could be conceptualized and assessed as a professional competence. Studies on emotion work and empathy regularly neglect the multidimensionality of these phenomena and…
Intensification and Structure Change of Super Typhoon Flo as Related to the Large-Scale Environment.
1998-06-01
Visualizing such a large dataset is a challenge; Schiavone and Papathomas (1990) summarize the methods currently available for visualizing scientific datasets.
Multidimensional NMR inversion without Kronecker products: Multilinear inversion
NASA Astrophysics Data System (ADS)
Medellín, David; Ravi, Vivek R.; Torres-Verdín, Carlos
2016-08-01
Multidimensional NMR inversion using Kronecker products poses several challenges. First, kernel compression is only possible when the kernel matrices are separable, and in recent years, there has been an increasing interest in NMR sequences with non-separable kernels. Second, in three or more dimensions, the singular value decomposition is not unique; therefore kernel compression is not well-defined for higher dimensions. Without kernel compression, the Kronecker product yields matrices that require large amounts of memory, making the inversion intractable for personal computers. Finally, incorporating arbitrary regularization terms is not possible using the Lawson-Hanson (LH) or the Butler-Reeds-Dawson (BRD) algorithms. We develop a minimization-based inversion method that circumvents the above problems by using multilinear forms to perform multidimensional NMR inversion without using kernel compression or Kronecker products. The new method is memory efficient, requiring less than 0.1% of the memory required by the LH or BRD methods. It can also be extended to arbitrary dimensions and adapted to include non-separable kernels, linear constraints, and arbitrary regularization terms. Additionally, it is easy to implement because only a cost function and its first derivative are required to perform the inversion.
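The memory argument is easy to demonstrate for the separable 2-D case, where the Kronecker product never needs to be formed. This sketch shows only the core identity (K1 ⊗ K2) vec(F) = vec(K2 F K1ᵀ), with vec meaning column-major stacking; the paper's multilinear method goes further, also handling non-separable kernels and arbitrary regularization:

```python
import numpy as np

def apply_separable(K1, K2, F):
    """Apply the separable 2-D kernel (K1 kron K2) to model F without
    forming the Kronecker product, via (K1 kron K2) vec(F) = vec(K2 F K1^T)."""
    return K2 @ F @ K1.T

rng = np.random.default_rng(3)
K1 = rng.normal(size=(50, 40))       # kernel along dimension 1
K2 = rng.normal(size=(60, 30))       # kernel along dimension 2
F = rng.normal(size=(30, 40))        # model defined on a 30 x 40 grid

fast = apply_separable(K1, K2, F).ravel(order="F")
direct = np.kron(K1, K2) @ F.ravel(order="F")   # materializes a 3000 x 1200 matrix
print(np.allclose(fast, direct))
```

The Kronecker matrix grows as the product of all kernel sizes, while the factored form only ever stores the individual kernels, which is the source of the quoted memory savings.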
Manycore Performance-Portability: Kokkos Multidimensional Array Library
Edwards, H. Carter; Sunderland, Daniel; Porter, Vicki; ...
2012-01-01
Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel kernels, and (3) multidimensional arrays. Kernel execution performance, especially on NVIDIA® devices, is extremely dependent on data access patterns. The optimal data access pattern can differ across manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
Politi, Liran; Codish, Shlomi; Sagy, Iftach; Fink, Lior
2014-12-01
Insights about patterns of system use are often gained through the analysis of system log files, which record the actual behavior of users. In a clinical context, however, few attempts have been made to typify system use through log file analysis. The present study offers a framework for identifying, describing, and discerning among patterns of use of a clinical information retrieval system. We use the session attributes of volume, diversity, granularity, duration, and content to define a multidimensional space in which each specific session can be positioned. We also describe an analytical method for identifying the common archetypes of system use in this multidimensional space. We demonstrate the value of the proposed framework with a log file of the use of a health information exchange (HIE) system by physicians in an emergency department (ED) of a large Israeli hospital. The analysis reveals five distinct patterns of system use, which have yet to be described in the relevant literature. The results of this study have the potential to inform the design of HIE systems for efficient and effective use, thus increasing their contribution to the clinical decision-making process. Copyright © 2014 Elsevier Inc. All rights reserved.
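A minimal sketch of the archetype-identification step, assuming k-means clustering in the five-attribute session space (the paper's exact analytical method and data are not reproduced here; the synthetic sessions, k = 3, and the cluster interpretations in the comments are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical session vectors: (volume, diversity, granularity, duration, content).
# The attribute names follow the paper; the data are synthetic.
sessions = np.vstack([
    rng.normal([2, 1, 1, 3, 0], 0.3, (40, 5)),   # e.g. brief, focused look-ups
    rng.normal([8, 4, 2, 10, 1], 0.5, (40, 5)),  # e.g. broad chart reviews
    rng.normal([5, 2, 4, 6, 2], 0.4, (40, 5)),   # e.g. deep single-topic sessions
])

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: each cluster centroid is a candidate use archetype."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, archetypes = kmeans(sessions, k=3)
print(archetypes.round(1))
```

Each centroid summarizes one pattern of system use; positioning a new session in the same space and reporting its nearest centroid gives a simple way to type incoming log-file sessions.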
Testlet-Based Multidimensional Adaptive Testing
Frey, Andreas; Seitz, Nicki-Nils; Brandt, Steffen
2016-01-01
Multidimensional adaptive testing (MAT) is a highly efficient method for the simultaneous measurement of several latent traits. Currently, no psychometrically sound approach is available for the use of MAT in testlet-based tests. Testlets are sets of items sharing a common stimulus such as a graph or a text. They are frequently used in large operational testing programs like TOEFL, PISA, PIRLS, or NAEP. To make MAT accessible for such testing programs, we present a novel combination of MAT with a multidimensional generalization of the random effects testlet model (MAT-MTIRT). MAT-MTIRT compared to non-adaptive testing is examined for several combinations of testlet effect variances (0.0, 0.5, 1.0, and 1.5) and testlet sizes (3, 6, and 9 items) with a simulation study considering three ability dimensions with simple loading structure. MAT-MTIRT outperformed non-adaptive testing regarding the measurement precision of the ability estimates. Further, the measurement precision decreased when testlet effect variances and testlet sizes increased. The suggested combination of the MTIRT model therefore provides a solution to the substantial problems of testlet-based tests while keeping the length of the test within an acceptable range. PMID:27917132
Design and analysis issues in quantitative proteomics studies.
Karp, Natasha A; Lilley, Kathryn S
2007-09-01
Quantitative proteomics is the comparison of distinct proteomes which enables the identification of protein species which exhibit changes in expression or post-translational state in response to a given stimulus. Many different quantitative techniques are being utilized and generate large datasets. Independent of the technique used, these large datasets need robust data analysis to ensure valid conclusions are drawn from such studies. Approaches to address the problems that arise with large datasets are discussed to give insight into the types of statistical analyses of data appropriate for the various experimental strategies that can be employed by quantitative proteomic studies. This review also highlights the importance of employing a robust experimental design and highlights various issues surrounding the design of experiments. The concepts and examples discussed within will show how robust design and analysis will lead to confident results that will ensure quantitative proteomics delivers.
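Robust analysis of large proteomics datasets typically involves controlling the false discovery rate across thousands of protein-level tests. As an illustration of the kind of statistical safeguard discussed in such reviews (not a procedure taken from this one), a minimal Benjamini-Hochberg sketch with made-up p-values:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Compare the i-th smallest p-value against q * i / m.
    thresh = q * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# Illustrative p-values from hypothetical per-protein tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))
```

Without such a correction, testing hundreds of proteins at an uncorrected 0.05 threshold would be expected to produce many spurious "differentially expressed" hits, which is one of the dataset-size problems the review warns about.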
A semiparametric graphical modelling approach for large-scale equity selection
Liu, Han; Mulvey, John; Zhao, Tianqi
2016-01-01
We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption. PMID:28316507
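The rank-based estimation step can be illustrated on a toy pair of assets: Kendall's tau is unchanged by monotone marginal transforms under an elliptical-copula model, and the transform sin(πτ/2) recovers the latent Gaussian correlation. The simulated data, sample size, and marginal transform below are illustrative assumptions, not the paper's equity data:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(2)

# Simulate two series with a known latent Gaussian correlation.
n, rho = 2000, 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x = np.exp(z)  # monotone marginal transform: ranks (and tau) are preserved

tau, _ = kendalltau(x[:, 0], x[:, 1])
rho_hat = np.sin(np.pi * tau / 2)  # rank-based estimate of the latent correlation
print(round(rho_hat, 2))
```

In the full method, a matrix of such rank-based correlations feeds a graphical-model estimator whose sparsity pattern identifies stocks that are (conditionally) as independent as possible; this toy only shows why rank statistics survive the unknown marginals.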
Multidimensional chromatography in food analysis.
Herrero, Miguel; Ibáñez, Elena; Cifuentes, Alejandro; Bernal, Jose
2009-10-23
In this work, the main developments and applications of multidimensional chromatographic techniques in food analysis are reviewed. Different aspects related to the existing couplings involving chromatographic techniques are examined. These couplings include multidimensional GC, multidimensional LC, multidimensional SFC as well as all their possible combinations. Main advantages and drawbacks of each coupling are critically discussed and their key applications in food analysis described.
NASA Astrophysics Data System (ADS)
Balsara, Dinshaw S.; Nkonga, Boniface
2017-10-01
Just as the quality of a one-dimensional approximate Riemann solver is improved by the inclusion of internal sub-structure, the quality of a multidimensional Riemann solver is also similarly improved. Such multidimensional Riemann problems arise when multiple states come together at the vertex of a mesh. The interaction of the resulting one-dimensional Riemann problems gives rise to a strongly-interacting state. We wish to endow this strongly-interacting state with physically-motivated sub-structure. The fastest way of endowing such sub-structure consists of making a multidimensional extension of the HLLI Riemann solver for hyperbolic conservation laws. Presenting such a multidimensional analogue of the HLLI Riemann solver with linear sub-structure for use on structured meshes is the goal of this work. The multidimensional MuSIC Riemann solver documented here is universal in the sense that it can be applied to any hyperbolic conservation law. The multidimensional Riemann solver is made to be consistent with constraints that emerge naturally from the Galerkin projection of the self-similar states within the wave model. When the full eigenstructure in both directions is used in the present Riemann solver, it becomes a complete Riemann solver in a multidimensional sense. That is, all the intermediate waves are represented in the multidimensional wave model. The work also presents, for the very first time, an important analysis of the dissipation characteristics of multidimensional Riemann solvers. The present Riemann solver results in the most efficient implementation of a multidimensional Riemann solver with sub-structure. Because it preserves stationary linearly degenerate waves, it might also help with well-balancing. Implementation-related details are presented in pointwise fashion for the one-dimensional HLLI Riemann solver as well as the multidimensional MuSIC Riemann solver.
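For context, the plain two-wave HLL flux that HLLI refines with internal sub-structure can be sketched in one dimension. The scalar Burgers test problem, wave-speed estimates, and grid below are illustrative assumptions; this is the simpler ancestor of the scheme, not the MuSIC solver itself:

```python
import numpy as np

def hll_flux(uL, uR, f, sL, sR):
    """Two-wave HLL flux for a scalar conservation law u_t + f(u)_x = 0."""
    if sL >= 0.0:
        return f(uL)
    if sR <= 0.0:
        return f(uR)
    # Single constant intermediate state between the two waves.
    return (sR * f(uL) - sL * f(uR) + sL * sR * (uR - uL)) / (sR - sL)

# Burgers' equation f(u) = u^2/2 with simple wave-speed estimates.
f = lambda u: 0.5 * u * u
x = np.linspace(0.0, 1.0, 200)
u = np.where(x < 0.5, 1.0, 0.0)   # right-moving shock, speed 1/2
dx, dt = x[1] - x[0], 0.002
for _ in range(100):
    F = np.array([hll_flux(u[i], u[i + 1], f,
                           min(u[i], u[i + 1]), max(u[i], u[i + 1]))
                  for i in range(len(u) - 1)])
    u[1:-1] -= dt / dx * (F[1:] - F[:-1])
print(float(u[np.searchsorted(x, 0.52)]))
```

HLLI (and its multidimensional MuSIC analogue) replaces the single constant intermediate state above with a linearly varying sub-structure, which is what restores the intermediate waves and reduces dissipation on linearly degenerate fields.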
Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil
2009-07-01
Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata) gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.
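The need to account for clustering can be illustrated with a numpy-only sketch of cluster-robust (sandwich) standard errors for a continuous outcome, one of the simpler adjustments in the family compared here. The simulated cluster sizes, effect sizes, and covariate are assumptions for illustration, not the study's infant data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated cohort: 300 births, clusters of size 1-3 (singletons, twins,
# triplets), a shared within-cluster effect, and one covariate.
sizes = rng.choice([1, 2, 3], size=300, p=[0.82, 0.15, 0.03])
groups = np.repeat(np.arange(300), sizes)
n = len(groups)
x = rng.standard_normal(n)
cluster_eff = rng.standard_normal(300)[groups]   # non-independence source
y = 1.0 + 0.5 * x + cluster_eff + 0.5 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# Cluster-robust "sandwich": sum score outer-products cluster by cluster,
# so correlated residuals within a cluster inflate the variance estimate.
meat = np.zeros((2, 2))
for g in np.unique(groups):
    Xg, rg = X[groups == g], resid[groups == g]
    s = Xg.T @ rg
    meat += np.outer(s, s)
cov = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov))
se_naive = np.sqrt(np.diag(XtX_inv) * resid.var(ddof=2))
print(beta.round(2), se_robust.round(3), se_naive.round(3))
```

Ignoring the clustering corresponds to reporting `se_naive`; with a meaningful share of twins and triplets the robust and naive standard errors diverge, which is the incorrect-inference risk the study quantifies.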
3D variational brain tumor segmentation on a clustered feature set
NASA Astrophysics Data System (ADS)
Popuri, Karteek; Cobzas, Dana; Jagersand, Martin; Shah, Sirish L.; Murtha, Albert
2009-02-01
Tumor segmentation from MRI data is a particularly challenging and time consuming task. Tumors have a large diversity in shape and appearance with intensities overlapping the normal brain tissues. In addition, an expanding tumor can also deflect and deform nearby tissue. Our work addresses these last two difficult problems. We use the available MRI modalities (T1, T1c, T2) and their texture characteristics to construct a multi-dimensional feature set. Further, we extract clusters which provide a compact representation of the essential information in these features. The main idea in this paper is to incorporate these clustered features into the 3D variational segmentation framework. In contrast to the previous variational approaches, we propose a segmentation method that evolves the contour in a supervised fashion. The segmentation boundary is driven by the learned inside and outside region voxel probabilities in the cluster space. We incorporate prior knowledge about the normal brain tissue appearance, during the estimation of these region statistics. In particular, we use a Dirichlet prior that discourages the clusters in the ventricles to be in the tumor and hence better disambiguate the tumor from brain tissue. We show the performance of our method on real MRI scans. The experimental dataset includes MRI scans, from patients with difficult instances, with tumors that are inhomogeneous in appearance, small in size and in proximity to the major structures in the brain. Our method shows good results on these test cases.
Systematic pan-cancer analysis reveals immune cell interactions in the tumor microenvironment
Varn, Frederick S.; Wang, Yue; Mullins, David W.; Fiering, Steven; Cheng, Chao
2017-01-01
With the recent advent of immunotherapy, there is a critical need to understand immune cell interactions in the tumor microenvironment in both pan-cancer and tissue-specific contexts. Multi-dimensional datasets have enabled systematic approaches to dissect these interactions in large numbers of patients, furthering our understanding of the patient immune response to solid tumors. Using an integrated approach, we inferred the infiltration levels of distinct immune cell subsets in 23 tumor types from The Cancer Genome Atlas. From these quantities, we constructed a co-infiltration network, revealing interactions between cytolytic cells and myeloid cells in the tumor microenvironment. By integrating patient mutation data, we found that while mutation burden was associated with immune infiltration differences between distinct tumor types, additional factors likely explained differences between tumors originating from the same tissue. We concluded this analysis by examining the prognostic value of individual immune cell subsets as well as how co-infiltration of functionally discordant cell types associated with patient survival. In multiple tumor types, we found that the protective effect of CD8+ T cell infiltration was heavily modulated by co-infiltration of macrophages and other myeloid cell types, suggesting the involvement of myeloid-derived suppressor cells in tumor development. Our findings illustrate complex interactions between different immune cell types in the tumor microenvironment and indicate these interactions play meaningful roles in patient survival. These results demonstrate the importance of personalized immune response profiles when studying the factors underlying tumor immunogenicity and immunotherapy response. PMID:28126714
Component separation of an isotropic Gravitational Wave Background
DOE Office of Scientific and Technical Information (OSTI.GOV)
Parida, Abhishek; Jhingan, Sanjay; Mitra, Sanjit, E-mail: abhishek@jmi.ac.in, E-mail: sanjit@iucaa.in, E-mail: sjhingan@jmi.ac.in
2016-04-01
A Gravitational Wave Background (GWB) is expected in the universe from the superposition of a large number of unresolved astrophysical sources and phenomena in the early universe. Each component of the background (e.g., from primordial metric perturbations, binary neutron stars, milli-second pulsars, etc.) has its own spectral shape. Many ongoing experiments aim to probe the GWB in a variety of frequency bands. In the last two decades, using data from ground-based laser interferometric gravitational wave (GW) observatories, upper limits on the GWB were placed in the frequency range of ∼50-100 Hz, considering one spectral shape at a time. However, one strong component can significantly enhance the estimated strength of another component. Hence, estimation of the amplitudes of the components with different spectral shapes should be done jointly. Here we propose a method for 'component separation' of a statistically isotropic background that can, for the first time, jointly estimate the amplitudes of many components and place upper limits. The method is rather straightforward and needs a negligible amount of computation. It utilises the linear relationship between the measurements and the amplitudes of the actual components, alleviating the need for a sampling-based method, e.g., Markov Chain Monte Carlo (MCMC) or matched filtering, which are computationally intensive and cumbersome in a multi-dimensional parameter space. Using this formalism we could also study how many independent components can be separated using a given dataset from a network of current and upcoming ground-based interferometric detectors.
de Oliveira, Bruno Menezes; Matsumura, Cintia Y.; Fontes-Oliveira, Cibely C.; Gawlik, Kinga I.; Acosta, Helena; Wernhoff, Patrik; Durbeej, Madeleine
2014-01-01
Congenital muscular dystrophy with laminin α2 chain deficiency (MDC1A) is one of the most severe forms of muscular disease and is characterized by severe muscle weakness and delayed motor milestones. The genetic basis of MDC1A is well known, yet the secondary mechanisms ultimately leading to muscle degeneration and subsequent connective tissue infiltration are not fully understood. In order to obtain new insights into the molecular mechanisms underlying MDC1A, we performed a comparative proteomic analysis of affected muscles (diaphragm and gastrocnemius) from laminin α2 chain–deficient dy3K/dy3K mice, using multidimensional protein identification technology combined with tandem mass tags. Out of the approximately 700 identified proteins, 113 and 101 proteins, respectively, were differentially expressed in the diseased gastrocnemius and diaphragm muscles compared with normal muscles. A large portion of these proteins are involved in different metabolic processes, bind calcium, or are expressed in the extracellular matrix. Our findings suggest that metabolic alterations and calcium dysregulation could be novel mechanisms that underlie MDC1A and might be targets that should be explored for therapy. Also, detailed knowledge of the composition of fibrotic tissue, rich in extracellular matrix proteins, in laminin α2 chain–deficient muscle might help in the design of future anti-fibrotic treatments. All MS data have been deposited in the ProteomeXchange with identifier PXD000978 (http://proteomecentral.proteomexchange.org/dataset/PXD000978). PMID:24994560
Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data
Wong, Raymond K.; Mohammed, Sabah; Fiaidhi, Jinan; Sung, Yunsick
2017-01-01
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced samples in class distributions. In this paper, we aim to formulate effective methods to rebalance a binary imbalanced dataset, where the positive samples make up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to empower the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets where the former approach fails. We also find them more consistent with the characteristics of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times compared to the brute-force method. PMID:28753613
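The SMOTE pre-processing step that both swarm algorithms tune can be sketched as follows. The neighbour count k and the synthetic minority data are illustrative assumptions; the paper optimizes such parameters via PSO and the bat algorithm, which are not reproduced here:

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: synthesize points on segments joining each
    minority sample to one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None] - X_min[None], axis=2)
    np.fill_diagonal(d, np.inf)               # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]         # k nearest minority neighbours
    base = rng.integers(0, len(X_min), n_new)
    neigh = nn[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))              # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(5)
X_min = rng.normal(0.0, 1.0, (20, 3))   # hypothetical minority class, 20 samples
X_new = smote(X_min, n_new=80)          # rebalance toward 100 minority samples
print(X_new.shape)
```

The two parameters the paper optimizes correspond here to the oversampling amount (`n_new`) and the neighbourhood size (`k`); the segment-wise Adaptive Swarm variant would apply this routine one data segment at a time instead of to the whole dataset.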
Prescott, Julie; Hanley, Terry; Ujhelyi, Katalin
2017-08-02
The Internet has the potential to help young people by reducing the stigma associated with mental health and enabling young people to access services and professionals which they may not otherwise access. Online support can empower young people, help them develop new online friendships, share personal experiences, communicate with others who understand, provide information and emotional support, and most importantly help them feel less alone and normalize their experiences in the world. The aim of the research was to gain an understanding of how young people use an online forum for emotional and mental health issues. Specifically, the project examined what young people discuss and how they seek support on the forum (objective 1). Furthermore, it looked at how the young service users responded to posts to gain an understanding of how young people provided each other with peer-to-peer support (objective 2). Kooth is an online counseling service for young people aged 11-25 years and experiencing emotional and mental health problems. It is based in the United Kingdom and provides support that is anonymous, confidential, and free at the point of delivery. Kooth provided the researchers with all the online forum posts between a 2-year period, which resulted in a dataset of 622 initial posts and 3657 initial posts with responses. Thematic analysis was employed to elicit key themes from the dataset. The findings support the literature that online forums provide young people with both informational and emotional support around a wide array of topics. The findings from this large dataset also reveal that this informational or emotional support can be viewed as directive or nondirective. The nondirective approach refers to when young people provide others with support by sharing their own experiences. These posts do not include explicit advice to act in a particular way, but the sharing process is hoped to be of use to the poster. 
The directive approach, in contrast, involves individuals making an explicit suggestion of what they believe the poster should do. This study adds to the research exploring what young people discuss within online forums and provides insights into how these communications take place. Furthermore, it highlights the challenge that organizations may encounter in mediating support that is multidimensional in nature (informational-emotional, directive-nondirective). ©Julie Prescott, Terry Hanley, Katalin Ujhelyi. Originally published in JMIR Mental Health (http://mental.jmir.org), 02.08.2017.
Large scale validation of the M5L lung CAD on heterogeneous CT datasets.
Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P
2015-04-01
M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover the nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based on a small number of features (13), so as to minimize the risk of poor generalization given the large difference in size between the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which has not yet been reported in the literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground glass opacity (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGO detection could further improve it, as could an iterative optimization of the training procedure. The main aim of the present study was accomplished: M5L results do not deteriorate when increasing the dataset size, making it a candidate for supporting radiologists in large-scale screenings and clinical programs.
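The reported operating point, sensitivity at eight false positives per scan, comes from a FROC-style analysis that can be sketched as follows. The finding counts, scores, and score distributions are illustrative assumptions, not M5L outputs:

```python
import numpy as np

def sensitivity_at_fp(scores, is_tp, n_nodules, n_scans, fp_per_scan=8.0):
    """Operating point on a FROC curve: fraction of nodules detected while
    the false-positive rate stays below fp_per_scan."""
    order = np.argsort(scores)[::-1]           # sweep threshold downward
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    ok = fp / n_scans <= fp_per_scan
    return tp[ok].max() / n_nodules if ok.any() else 0.0

rng = np.random.default_rng(6)
# Hypothetical CAD output: 100 scans, 150 annotated nodules, with true-positive
# findings scoring higher on average than false-positive candidates.
tp_scores = rng.normal(0.7, 0.15, 120)   # 120 of 150 nodules found at some score
fp_scores = rng.normal(0.3, 0.15, 900)
scores = np.concatenate([tp_scores, fp_scores])
is_tp = np.concatenate([np.ones(120, bool), np.zeros(900, bool)])
print(round(sensitivity_at_fp(scores, is_tp, n_nodules=150, n_scans=100), 2))
```

Combining two independent subsystems, as M5L does, amounts to merging their finding lists (with some score fusion rule) before this sweep, which is why the combined system can sit higher on the FROC curve than either standalone subsystem.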
rasdaman Array Database: current status
NASA Astrophysics Data System (ADS)
Merticariu, George; Toader, Alexandru
2015-04-01
rasdaman (Raster Data Manager) is a Free Open Source Array Database Management System which provides functionality for storing and processing massive amounts of raster data in the form of multidimensional arrays. The user can access, process and delete the data using SQL. The key features of rasdaman are: flexibility (datasets of any dimensionality can be processed with the help of SQL queries), scalability (rasdaman's distributed architecture enables it to seamlessly run on cloud infrastructures while offering an increase in performance with the increase of computation resources), performance (real-time access, processing, mixing and filtering of arrays of any dimensionality) and reliability (legacy communication protocol replaced with a new one based on cutting-edge technology - Google Protocol Buffers and ZeroMQ). The data handled by the system include 1D time series, 2D remote sensing imagery, 3D image time series, 3D geophysical data, and 4D atmospheric and climate data. Most of these representations cannot be stored only in the form of raw arrays, as the location information of the contents is also needed for correct geopositioning on Earth. This is defined by ISO 19123 as coverage data. rasdaman provides coverage data support through the Petascope service. Extensions were added on top of rasdaman in order to provide support for the Geoscience community. The following OGC standards are currently supported: Web Map Service (WMS), Web Coverage Service (WCS), and Web Coverage Processing Service (WCPS). The Web Map Service is an extension which provides zoom and pan navigation over images provided by a map server. Starting with version 9.1, rasdaman supports WMS version 1.3. The Web Coverage Service provides capabilities for downloading multi-dimensional coverage data.
Support is also provided for several extensions of this service: Subsetting Extension, Scaling Extension, and, starting with version 9.1, Transaction Extension, which defines request types for inserting, updating and deleting coverages. A web client, designed for both novice and experienced users, is also available for the service and its extensions. The client offers an intuitive interface that allows users to work with multi-dimensional coverages by abstracting the specifics of the standard definitions of the requests. The Web Coverage Processing Service defines a language for on-the-fly processing and filtering multi-dimensional raster coverages. rasdaman exposes this service through the WCS processing extension. Demonstrations are provided online via the Earthlook website (earthlook.org) which presents use-cases from a wide variety of application domains, using the rasdaman system as processing engine.
Statistical Downscaling in Multi-dimensional Wave Climate Forecast
NASA Astrophysics Data System (ADS)
Camus, P.; Méndez, F. J.; Medina, R.; Losada, I. J.; Cofiño, A. S.; Gutiérrez, J. M.
2009-04-01
Wave climate at a particular site is defined by the statistical distribution of sea state parameters, such as significant wave height, mean wave period, mean wave direction, wind velocity, wind direction and storm surge. Nowadays, long-term time series of these parameters are available from reanalysis databases obtained by numerical models. The Self-Organizing Map (SOM) technique is applied to characterize multi-dimensional wave climate, obtaining the relevant "wave types" spanning the historical variability. This technique summarizes the multi-dimensional wave climate in terms of a set of clusters projected onto a low-dimensional lattice with a spatial organization, providing Probability Density Functions (PDFs) on the lattice. On the other hand, wind and storm surge depend on the instantaneous local large-scale sea level pressure (SLP) fields, while waves depend on the recent history of these fields (say, 1 to 5 days). Thus, these variables are associated with large-scale atmospheric circulation patterns. In this work, a nearest-neighbors analog method is used to predict monthly multi-dimensional wave climate. This method establishes relationships between large-scale atmospheric circulation patterns from numerical models (SLP fields as predictors) and local wave databases of observations (monthly wave climate SOM PDFs as predictands) to set up statistical models. A wave reanalysis database, developed by Puertos del Estado (Ministerio de Fomento), is considered as the historical time series of local variables. The simultaneous SLP fields calculated by the NCEP atmospheric reanalysis are used as predictors. Several applications with different sizes of the sea level pressure grid and with different temporal domain resolutions are compared to obtain the optimal statistical model that best represents the monthly wave climate at a particular site.
In this work we examine the potential skill of this downscaling approach considering perfect-model conditions, but we will also analyze the suitability of this methodology to be used for seasonal forecast and for long-term climate change scenario projection of wave climate.
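The nearest-neighbors analog step can be sketched as follows, assuming a flattened SLP anomaly field as predictor and a single local wave parameter as predictand. The synthetic link between them, the tiny grid size, and k are illustrative assumptions; the paper predicts monthly SOM PDFs rather than a scalar:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative analog downscaling: each month has a large-scale predictor
# (a flattened SLP anomaly field, here only 5 grid points to keep the toy
# tractable) and a local predictand (e.g. monthly mean significant wave height).
n_months, n_grid = 240, 5
slp = rng.standard_normal((n_months, n_grid))
hs = slp.mean(axis=1) + 0.1 * rng.standard_normal(n_months)  # synthetic link

def analog_forecast(slp_hist, hs_hist, slp_new, k=10):
    """Average the predictand over the k nearest historical SLP analogs."""
    d = np.linalg.norm(slp_hist - slp_new, axis=1)
    return hs_hist[np.argsort(d)[:k]].mean()

# Leave-one-out evaluation of the analog method on the synthetic record,
# mimicking the perfect-model skill assessment described above.
errs = [abs(analog_forecast(np.delete(slp, i, 0), np.delete(hs, i), slp[i]) - hs[i])
        for i in range(n_months)]
print(round(float(np.mean(errs)), 3))
```

Varying the size of the predictor grid and the temporal resolution, as the study does, corresponds here to changing `n_grid` and how the SLP fields are averaged before the nearest-neighbor search.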
Coletta, Alain; Molter, Colin; Duqué, Robin; Steenhoff, David; Taminau, Jonatan; de Schaetzen, Virginie; Meganck, Stijn; Lazar, Cosmin; Venet, David; Detours, Vincent; Nowé, Ann; Bersini, Hugues; Weiss Solís, David Y
2012-11-18
Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.
NASA Astrophysics Data System (ADS)
Ozturk, D.; Chaudhary, A.; Votava, P.; Kotfila, C.
2016-12-01
Jointly developed by Kitware and NASA Ames, GeoNotebook is an open source tool designed to give the maximum amount of flexibility to analysts, while dramatically simplifying the process of exploring geospatially indexed datasets. Packages like Fiona (backed by GDAL), Shapely, Descartes, Geopandas, and PySAL provide a stack of technologies for reading, transforming, and analyzing geospatial data. Combined with the Jupyter notebook and libraries like matplotlib/Basemap, it is possible to generate detailed geospatial visualizations. Unfortunately, the visualizations generated are either static or do not perform well for very large datasets. This setup also requires a great deal of boilerplate code to create and maintain. Other extensions exist to remedy these problems, but they provide a separate map for each input cell and do not support map interactions that feed back into the Python environment. To support interactive data exploration and visualization on large datasets, we have developed an extension to the Jupyter notebook that provides a single dynamic map that can be managed from the Python environment and that can communicate with a server which can perform operations like data subsetting on a cloud-based cluster.
NASA Astrophysics Data System (ADS)
Skok, Gregor; Žagar, Nedjeljka; Honzak, Luka; Žabkar, Rahela; Rakovec, Jože; Ceglar, Andrej
2016-01-01
The study presents a precipitation intercomparison based on two satellite-derived datasets (TRMM 3B42, CMORPH), four raingauge-based datasets (GPCC, E-OBS, Willmott & Matsuura, CRU), ERA Interim reanalysis (ERAInt), and a single climate simulation using the WRF model. The comparison was performed for a domain encompassing parts of Europe and the North Atlantic over the 11-year period of 2000-2010. The four raingauge-based datasets are similar to the TRMM dataset, with biases over Europe ranging from -7 % to +4 %. The spread among the raingauge-based datasets is relatively small over most of Europe, although areas with greater uncertainty (more than 30 %) exist, especially near the Alps and other mountainous regions. There are distinct differences between the datasets over the European land area and the Atlantic Ocean in comparison to the TRMM dataset. ERAInt has a small dry bias over the land; the WRF simulation has a large wet bias (+30 %), whereas CMORPH is characterized by a large and spatially consistent dry bias (-21 %). Over the ocean, both ERAInt and CMORPH have a small wet bias (+8 %) while the wet bias in WRF is significantly larger (+47 %). ERAInt has the highest frequency of low-intensity precipitation, while its frequency of high-intensity precipitation is the lowest due to its lower native resolution. Both satellite-derived datasets have more low-intensity precipitation over the ocean than over the land, while the frequency of higher-intensity precipitation is similar or larger over the land. This result is likely related to orography, which triggers more intense convective precipitation, while the Atlantic Ocean is characterized by more homogeneous large-scale precipitation systems, which are associated with larger areas of lower-intensity precipitation. However, this is not observed in ERAInt and WRF, indicating the insufficient representation of convective processes in the models. 
Finally, the Fraction Skill Score confirmed that both models perform better over the Atlantic Ocean, with ERAInt outperforming WRF at low thresholds and WRF outperforming ERAInt at higher thresholds. The diurnal cycle is simulated better in the WRF simulation than in ERAInt, although WRF did not reproduce the amplitude of the diurnal cycle well. While the evaluation of the WRF model confirms earlier findings related to the model's wet bias over European land, the applied satellite-derived precipitation datasets revealed differences between the land and ocean areas along with uncertainties in the observation datasets.
Zanni, Martin Thomas; Damrauer, Niels H.
2010-07-20
A multidimensional spectrometer for the infrared, visible, and ultraviolet regions of the electromagnetic spectrum, and a method for making multidimensional spectroscopic measurements in the infrared, visible, and ultraviolet regions of the electromagnetic spectrum. The multidimensional spectrometer facilitates measurements of inter- and intra-molecular interactions.
Descriptive Characteristics of Surface Water Quality in Hong Kong by a Self-Organising Map
An, Yan; Zou, Zhihong; Li, Ranran
2016-01-01
In this study, principal component analysis (PCA) and a self-organising map (SOM) were used to analyse a complex dataset obtained from the river water monitoring stations in the Tolo Harbor and Channel Water Control Zone (Hong Kong), covering the period of 2009–2011. PCA was initially applied to identify the principal components (PCs) among the nonlinear and complex surface water quality parameters. SOM followed PCA, and was implemented to analyze the complex relationships and behaviors of the parameters. The results reveal that PCA reduced the multidimensional parameters to four significant PCs which are combinations of the original ones. The positive and inverse relationships of the parameters were shown explicitly by pattern analysis in the component planes. It was found that PCA and SOM are efficient tools to capture and analyze the behavior of multivariable, complex, and nonlinear related surface water quality data. PMID:26761018
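The dimensionality-reduction step described above can be sketched with a numpy-only PCA. The data below are synthetic stand-ins for a monitoring matrix (samples by water-quality parameters), not the Hong Kong records, and the 90% variance threshold is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a monitoring matrix: 120 samples x 8
# water-quality parameters, with one induced correlation so that a
# few PCs capture most of the variance.
X = rng.normal(size=(120, 8))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]

# Centre, then PCA via SVD; squared singular values give the
# variance explained by each principal component.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Number of PCs needed to explain 90% of the variance, and the
# sample scores on those components.
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
scores = Xc @ Vt[:k].T
print(k, scores.shape)
```

The resulting score matrix is what a SOM would then be trained on to analyze relationships among the reduced components.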
Concepts and applications for influenza antigenic cartography
Cai, Zhipeng; Zhang, Tong; Wan, Xiu-Feng
2011-01-01
Influenza antigenic cartography projects influenza antigens into a two- or three-dimensional map based on immunological datasets, such as hemagglutination inhibition and microneutralization assays. Robust antigenic cartography can facilitate influenza vaccine strain selection, since an intuitive antigenic map simplifies data interpretation. However, antigenic cartography construction is not trivial due to challenging features embedded in the immunological data, such as data incompleteness, high noise, and low reactors. To overcome these challenges, we developed a computational method, temporal Matrix Completion-Multidimensional Scaling (MC-MDS), by adapting the low-rank MC concept from the Netflix movie recommendation system and the MDS method from geographic cartography construction. The application to H3N2 and 2009 pandemic H1N1 influenza A viruses demonstrates that temporal MC-MDS is effective and efficient in constructing influenza antigenic cartography. The web server is available at http://sysbio.cvm.msstate.edu/AntigenMap. PMID:21761589
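The MDS half of the method can be illustrated with classical (Torgerson) scaling, which recovers a low-dimensional configuration from a distance matrix. The four "antigens" and their distances below are invented for illustration; the paper's temporal MC-MDS additionally completes missing assay entries before scaling.

```python
import numpy as np

# Classical (Torgerson) MDS: embed points in 2-D from a distance
# matrix, as a stand-in for the map-construction step.
def classical_mds(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    B = -0.5 * J @ (D**2) @ J               # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]         # largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Pairwise distances among 4 hypothetical antigens (a unit square).
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

X = classical_mds(D)
# The recovered configuration preserves pairwise distances up to
# rotation/reflection, so the distance matrix is reproduced exactly.
D2 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.allclose(D, D2, atol=1e-8))
```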
Fast Acquisition and Reconstruction of Optical Coherence Tomography Images via Sparse Representation
Li, Shutao; McNabb, Ryan P.; Nie, Qing; Kuo, Anthony N.; Toth, Cynthia A.; Izatt, Joseph A.; Farsiu, Sina
2014-01-01
In this paper, we present a novel technique, based on compressive sensing principles, for reconstruction and enhancement of multi-dimensional image data. Our method is a major improvement and generalization of the multi-scale sparsity based tomographic denoising (MSBTD) algorithm we recently introduced for reducing speckle noise. Our new technique exhibits several advantages over MSBTD, including its capability to simultaneously reduce noise and interpolate missing data. Unlike MSBTD, our new method does not require an a priori high-quality image from the target imaging subject and thus offers the potential to shorten clinical imaging sessions. This novel image restoration method, which we termed sparsity based simultaneous denoising and interpolation (SBSDI), utilizes sparse representation dictionaries constructed from previously collected datasets. We tested the SBSDI algorithm on retinal spectral domain optical coherence tomography images captured in the clinic. Experiments showed that the SBSDI algorithm qualitatively and quantitatively outperforms other state-of-the-art methods. PMID:23846467
Categorical dimensions of human odor descriptor space revealed by non-negative matrix factorization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chennubhotla, Chakra; Castro, Jason
2013-01-01
In contrast to most other sensory modalities, the basic perceptual dimensions of olfaction remain unclear. Here, we use non-negative matrix factorization (NMF) - a dimensionality reduction technique - to uncover structure in a panel of odor profiles, with each odor defined as a point in multi-dimensional descriptor space. The properties of NMF are favorable for the analysis of such lexical and perceptual data, and lead to a high-dimensional account of odor space. We further provide evidence that odor dimensions apply categorically. That is, odor space is not occupied homogeneously, but rather in a discrete and intrinsically clustered manner. We discuss the potential implications of these results for the neural coding of odors, as well as for developing classifiers on larger datasets that may be useful for predicting perceptual qualities from chemical structures.
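As a sketch of the factorization itself (not of the odor-profile analysis), the standard Lee-Seung multiplicative updates for NMF fit in a few lines of numpy. The matrix sizes, rank, and iteration count below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic non-negative "odor x descriptor" matrix built from 3
# latent parts, so an exact rank-3 factorization exists.
W_true = rng.random((30, 3))
H_true = rng.random((3, 12))
V = W_true @ H_true

# Lee-Seung multiplicative updates for V ~ W H under Frobenius loss;
# updates keep W and H non-negative by construction.
k, eps = 3, 1e-9
W = rng.random((30, k))
H = rng.random((k, 12))
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 4))  # relative reconstruction error
```

The non-negativity constraint is what yields the parts-based, additive description that makes NMF attractive for perceptual data.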
OSIRIX: open source multimodality image navigation software
NASA Astrophysics Data System (ADS)
Rosset, Antoine; Pysher, Lance; Spadola, Luca; Ratib, Osman
2005-04-01
The goal of our project is to develop a completely new software platform that will allow users to efficiently and conveniently navigate through large sets of multidimensional data without the need for high-end expensive hardware or software. We also elected to develop our system on new open-source software libraries, allowing other institutions and developers to contribute to this project. OsiriX is a free and open-source imaging software package designed to manipulate and visualize large sets of medical images: http://homepage.mac.com/rossetantoine/osirix/
Perspectives in astrophysical databases
NASA Astrophysics Data System (ADS)
Frailis, Marco; de Angelis, Alessandro; Roberto, Vito
2004-07-01
Astrophysics has become a domain extremely rich in scientific data. Data mining tools are needed for information extraction from such large datasets. This calls for an approach to data management that emphasizes the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods, and simplicity is achieved by properly handling metadata. Moreover, clustering and classification techniques on large datasets pose additional requirements in terms of computation and memory scalability and interpretability of results. In this study we review some possible solutions.
Partial Information Community Detection in a Multilayer Network
2016-06-01
Network was taken from the CORE Lab at the Naval Postgraduate School [27]. Facebook dataset: We will use a subgraph of the Facebook network to build a... larger synthetic multilayer network. We want to use this Facebook data as a way to introduce a real-world example of a network into our synthetic network... This data is provided by the Stanford Large Network Dataset Collection [28]. This is a large anonymous subgraph of Facebook. It contains over 4,000
cellVIEW: a Tool for Illustrative and Multi-Scale Rendering of Large Biomolecular Datasets
Le Muzic, Mathieu; Autin, Ludovic; Parulek, Julius; Viola, Ivan
2017-01-01
In this article we introduce cellVIEW, a new system to interactively visualize large biomolecular datasets at the atomic level. Our tool is unique and has been specifically designed to match the ambitions of our domain experts to model and interactively visualize structures comprising several billion atoms. The cellVIEW system integrates acceleration techniques to allow real-time graphics performance at a 60 Hz display rate on datasets representing large viruses and bacterial organisms. Inspired by the work of scientific illustrators, we propose a level-of-detail scheme whose purpose is twofold: accelerating the rendering and reducing visual clutter. The main part of our datasets consists of macromolecules, but they also comprise nucleic acid strands, which are stored as sets of control points. For that specific case, we extend our rendering method to support the dynamic generation of DNA strands directly on the GPU. It is noteworthy that our tool has been implemented directly inside a game engine. We chose to rely on a third-party engine to reduce the software development workload and to make bleeding-edge graphics techniques more accessible to end-users. To our knowledge, cellVIEW is the only suitable solution for interactive visualization of large biomolecular landscapes at the atomic level, and it is freely available to use and extend. PMID:29291131
Remote Sensing Data Analytics for Planetary Science with PlanetServer/EarthServer
NASA Astrophysics Data System (ADS)
Rossi, Angelo Pio; Figuera, Ramiro Marco; Flahaut, Jessica; Martinot, Melissa; Misev, Dimitar; Baumann, Peter; Pham Huu, Bang; Besse, Sebastien
2016-04-01
Planetary Science datasets, despite the shift over the last two decades from physical volumes to internet-accessible archives, still face the problem of large-scale processing and analytics (e.g. Rossi et al., 2014; Gaddis and Hare, 2015). PlanetServer, the Planetary Science Data Service of the EC-funded EarthServer-2 project (#654367), tackles the planetary Big Data analytics problem with an array database approach (Baumann et al., 2014). It is developed to serve a large amount of calibrated, map-projected planetary data online, mainly through the Open Geospatial Consortium (OGC) Web Coverage Processing Service (WCPS) (e.g. Rossi et al., 2014; Oosthoek et al., 2013; Cantini et al., 2014). The focus of the H2020 evolution of PlanetServer is still on complex multidimensional data, particularly hyperspectral imaging and topographic cubes and imagery. In addition to hyperspectral and topographic data from Mars (Rossi et al., 2014), WCPS is applied to diverse datasets of the Moon and Mercury; other Solar System bodies will be added progressively. Derived parameters such as summary products and indices can be produced through WCPS queries, as can derived colour-combination imagery products, dynamically generated and accessed also through the OGC Web Coverage Service (WCS). Scientific questions translated into queries can be posed to a large number of individual coverages (data products), locally, regionally or globally. The new PlanetServer system uses the open-source NASA WorldWind virtual globe (e.g. Hogan, 2011) as its visualisation engine and the array database Rasdaman Community Edition as its core server component. Analytical tools and client components of relevance for multiple communities and disciplines are shared across services such as the Earth Observation and Marine Data Services of EarthServer. The Planetary Science Data Service of EarthServer is accessible at http://planetserver.eu. 
The entire code base will be available on GitHub at https://github.com/planetserver. References: Baumann, P., et al. (2015) Big Data Analytics for Earth Sciences: the EarthServer approach, International Journal of Digital Earth, doi: 10.1080/17538947.2014.1003106. Cantini, F., et al. (2014) Geophys. Res. Abs., Vol. 16, #EGU2014-3784. Gaddis, L., and T. Hare (2015) Status of tools and data for planetary research, Eos, 96, doi: 10.1029/2015EO041125. Hogan, P. (2011) NASA World Wind: Infrastructure for Spatial Data. Technical report. Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications, ACM. Oosthoek, J.H.P., et al. (2013) Advances in Space Research, doi: 10.1016/j.asr.2013.07.002. Rossi, A. P., et al. (2014) PlanetServer/EarthServer: Big Data analytics in Planetary Science. Geophysical Research Abstracts, Vol. 16, #EGU2014-5149.
Mapping and spatiotemporal analysis tool for hydrological data: Spellmap
USDA-ARS?s Scientific Manuscript database
Lack of data management and analysis tools is one of the major limitations to effectively evaluating and using large datasets of high-resolution atmospheric, surface, and subsurface observations. High spatial and temporal resolution datasets better represent the spatiotemporal variability of hydrologica...
Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Crockett, Thomas W.
1999-01-01
This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.
Walters, Stephen J
2004-05-25
We describe and compare four different methods for estimating sample size and power, when the primary outcome of the study is a Health Related Quality of Life (HRQoL) measure. These methods are: 1. assuming a Normal distribution and comparing two means; 2. using a non-parametric method; 3. Whitehead's method based on the proportional odds model; 4. the bootstrap. We illustrate the various methods, using data from the SF-36. For simplicity this paper deals with studies designed to compare the effectiveness (or superiority) of a new treatment compared to a standard treatment at a single point in time. The results show that if the HRQoL outcome has a limited number of discrete values (< 7) and/or the expected proportion of cases at the boundaries is high (scoring 0 or 100), then we would recommend using Whitehead's method (Method 3). Alternatively, if the HRQoL outcome has a large number of distinct values and the proportion at the boundaries is low, then we would recommend using Method 1. If a pilot or historical dataset is readily available (to estimate the shape of the distribution) then bootstrap simulation (Method 4) based on this data will provide a more accurate and reliable sample size estimate than conventional methods (Methods 1, 2, or 3). In the absence of a reliable pilot set, bootstrapping is not appropriate and conventional methods of sample size estimation or simulation will need to be used. Fortunately, with the increasing use of HRQoL outcomes in research, historical datasets are becoming more readily available. Strictly speaking, our results and conclusions only apply to the SF-36 outcome measure. Further empirical work is required to see whether these results hold true for other HRQoL outcomes. 
However, the SF-36 has many features in common with other HRQoL outcomes: multi-dimensional, ordinal or discrete response categories with upper and lower bounds, and skewed distributions. We therefore believe that these results and conclusions based on the SF-36 will be appropriate for other HRQoL measures.
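Method 4 above can be sketched as a resampling simulation: draw bootstrap samples from a pilot dataset, shift one arm by the target effect, and count how often the two-arm comparison is significant. The pilot scores below are synthetic stand-ins for one SF-36 dimension, and the simple z-test on means is an illustrative choice of analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pilot HRQoL scores, bounded 0-100 and skewed by
# clipping; the real method resamples a genuine pilot or historical
# dataset to preserve its distributional shape.
pilot = np.clip(rng.normal(70, 25, size=80), 0, 100)

def bootstrap_power(pilot, n_per_arm, effect, n_sim=500):
    """Estimate power of a two-arm mean comparison by resampling the
    pilot data and shifting one arm by the target effect."""
    z_crit = 1.96  # two-sided test at alpha = 0.05
    hits = 0
    for _ in range(n_sim):
        a = rng.choice(pilot, n_per_arm, replace=True)
        b = rng.choice(pilot, n_per_arm, replace=True) + effect
        se = np.sqrt(a.var(ddof=1) / n_per_arm + b.var(ddof=1) / n_per_arm)
        if abs(b.mean() - a.mean()) / se > z_crit:
            hits += 1
    return hits / n_sim

# Estimated power for a 10-point difference with 100 patients per arm;
# increasing n_per_arm until the estimate reaches the target power
# (e.g. 0.80) yields the required sample size.
power = bootstrap_power(pilot, n_per_arm=100, effect=10.0)
print(power)
```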
Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis
2014-01-01
Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. 
Species distribution dataset mergers, such as the one exemplified here, can serve as a baseline towards comprehensive species distribution datasets.
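The cell-level agreement measure used above is the Jaccard similarity index, which for a single species reduces to a set operation over occupied grid cells. The cell identifiers below are hypothetical 50-km grid references, not values from the atlases.

```python
# Jaccard similarity between two sets of occupied grid cells:
# |intersection| / |union|, with two empty sets treated as identical.
def jaccard(cells_a, cells_b):
    a, b = set(cells_a), set(cells_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

atlas1 = {"35VNE", "35VNF", "35VPE", "35VPF"}   # occurrence atlas
atlas2 = {"35VNE", "35VPE", "35VPF", "35VQF"}   # range atlas
print(jaccard(atlas1, atlas2))  # 3 shared / 5 in union = 0.6
```

Mapping this index per grid cell across all co-occurring species gives the kind of agreement map described in the abstract.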
NASA Astrophysics Data System (ADS)
Li, Z.; Clark, E. P.
2017-12-01
Large-scale and fine-resolution riverine bathymetry data is critical for flood inundation modeling but not available over the continental United States (CONUS). Previously we implemented bankfull hydraulic geometry based approaches to simulate bathymetry for individual rivers using NHDPlus v2.1 data and the 10 m National Elevation Dataset (NED). USGS has recently developed High Resolution NHD data (NHDPlus HR Beta) (USGS, 2017), and this enhanced dataset has a significantly improved spatial correspondence with the 10 m DEM. In this study, we used this high-resolution data, specifically NHDFlowline and NHDArea, to create bathymetry/terrain for CONUS river channels and floodplains. A software package, NHDPlus Inundation Modeler v5.0 Beta, was developed for this project as an Esri ArcGIS hydrological analysis extension. With the updated tools, the raw 10 m DEM was first hydrologically treated to remove artificial blockages (e.g., overpasses, bridges, and even roadways) using low-pass moving window filters. Cross sections were then automatically constructed along each flowline to extract elevation from the hydrologically treated DEM. In this study, river channel shapes were approximated using quadratic curves to reduce uncertainties from the commonly used trapezoids. We calculated underwater channel elevation at each cross-section sampling point using bankfull channel dimensions that were estimated from physiographic province/division based regression equations (Bieger et al. 2015). These elevation points were then interpolated to generate a bathymetry raster. The simulated bathymetry raster was integrated with the USGS NED and the Coastal National Elevation Database (CoNED) (wherever available) to make a seamless terrain-bathymetry dataset. Channel bathymetry was also integrated into the HAND (Height Above Nearest Drainage) dataset to improve large-scale inundation modeling. The generated terrain-bathymetry was processed at the Watershed Boundary Dataset Hydrologic Unit 4 (WBDHU4) level.
Medical imaging informatics based solutions for human performance analytics
NASA Astrophysics Data System (ADS)
Verma, Sneha; McNitt-Gray, Jill; Liu, Brent J.
2018-03-01
For human performance analysis, extensive experimental trials are often conducted to identify the underlying cause or long-term consequences of certain pathologies and to improve motor function by examining the movement patterns of affected individuals. Data collected for human performance analysis include high-speed video, surveys, spreadsheets, force data recordings from instrumented surfaces, etc. These datasets are recorded by various standalone sources and are therefore captured in different folder structures and in varying formats depending on the hardware configuration. Data integration and synchronization therefore present a major challenge when handling these multimedia datasets, especially at large scale. Another challenge faced by researchers is querying large quantities of unstructured data and designing feedback/reporting tools for users who need the datasets at various levels. In the past, database server storage solutions have been introduced to securely store these datasets. However, to automate the process of uploading raw files, various file manipulation steps are required. In the current workflow, this file manipulation and structuring is done manually, which is not feasible for large amounts of data. By attaching metadata files and data dictionaries to the raw datasets, however, they can provide the information and structure needed for automated server upload. We introduce one such system for metadata creation for unstructured multimedia data based on the DICOM data model design. We discuss the design and implementation of this system and evaluate it with a dataset collected for a movement analysis study. The broader aim of this paper is to present a solution space, achievable with medical imaging informatics designs and methods, for improving the workflow of human performance analysis in a biomechanics research lab.
Zhao, Shanrong; Prenger, Kurt; Smith, Lance
2013-01-01
RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, open-source tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and used out of the box to process Illumina RNA-Seq datasets. PMID:25937948
Computer-Based Tools for Inquiry in Undergraduate Classrooms: Results from the VGEE
NASA Astrophysics Data System (ADS)
Pandya, R. E.; Bramer, D. J.; Elliott, D.; Hay, K. E.; Mallaiahgari, L.; Marlino, M. R.; Middleton, D.; Ramamurhty, M. K.; Scheitlin, T.; Weingroff, M.; Wilhelmson, R.; Yoder, J.
2002-05-01
The Visual Geophysical Exploration Environment (VGEE) is a suite of computer-based tools designed to help learners connect observable, large-scale geophysical phenomena to underlying physical principles. Technologically, this connection is mediated by Java-based interactive tools: a multi-dimensional visualization environment, authentic scientific datasets, concept models that illustrate fundamental physical principles, and an interactive web-based work management system for archiving and evaluating learners' progress. Our preliminary investigations showed, however, that the tools alone are not sufficient to empower undergraduate learners; learners have trouble organizing inquiry and using the visualization tools effectively. To address these issues, the VGEE includes an inquiry strategy and scaffolding activities similar to strategies used successfully in K-12 classrooms. The strategy is organized around four steps: identify, relate, explain, and integrate. In the first step, students construct visualizations from data to identify salient features of a particular phenomenon. They compare their previous conceptions of the phenomenon to the data, both to examine their current knowledge and to motivate investigation. Next, students use the multivariable functionality of the visualization environment to relate the different features they identified. Explain moves the learner temporarily outside the visualization to the concept models, where they explore fundamental physical principles. Finally, in integrate, learners apply these fundamental principles within the visualization environment by literally placing a concept model inside the visualization as a probe and watching it respond to larger-scale patterns. This capability, unique to the VGEE, addresses the disconnect that novice learners often experience between fundamental physics and observable phenomena. 
It also allows learners the opportunity to reflect on and refine their knowledge as well as anchor it within a context for long-term retention. We are implementing the VGEE in one of two otherwise identical entry-level atmospheric courses. In addition to comparing student learning and attitudes in the two courses, we are analyzing student participation with the VGEE to evaluate the effectiveness and usability of the VGEE. In particular, we seek to identify the scaffolding students need to construct physically meaningful multi-dimensional visualizations, and evaluate the effectiveness of the visualization-embedded concept-models in addressing inert knowledge. We will also examine the utility of the inquiry strategy in developing content knowledge, process-of-science knowledge, and discipline-specific investigatory skills. Our presentation will include video examples of student use to illustrate our findings.
Dazard, Jean-Eudes; Rao, J. Sunil
2010-01-01
The search for structures in real datasets, e.g. in the form of bumps, components, classes, or clusters, is important, as these often reveal underlying phenomena leading to scientific discoveries. One of these tasks, known as bump hunting, is to locate domains of a multidimensional input space where the target function assumes local maxima without pre-specifying their total number. A number of related methods already exist, yet are challenged in the context of high-dimensional data. We introduce a novel supervised and multivariate bump hunting strategy for exploring modes or classes of a target function of many continuous variables. This addresses the issues of correlation, interpretability, and high-dimensionality (p ≫ n case), while making minimal assumptions. The method is based upon a divide and conquer strategy, combining a tree-based method, a dimension reduction technique, and the Patient Rule Induction Method (PRIM). Important to this task, we show how to estimate the PRIM meta-parameters. Using accuracy evaluation procedures such as cross-validation and ROC analysis, we show empirically how the method outperforms a naive PRIM as well as competitive non-parametric supervised and unsupervised methods in the problem of class discovery. The method has practical application especially in the case of noisy high-throughput data. It is applied to a class discovery problem in a colon cancer micro-array dataset aimed at identifying tumor subtypes in the metastatic stage. Supplemental Materials are available online. PMID:22399839
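The core PRIM step referenced above is top-down peeling: repeatedly trim a small quantile slice from one edge of the current box so as to maximize the mean of the target inside what remains. The sketch below plants a bump in two dimensions and peels toward it; it omits the paper's tree-based partitioning, dimension reduction, and meta-parameter estimation, and the data, peeling fraction, and support floor are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Planted bump: target is high only in the corner x0 > 0.6, x1 > 0.6.
X = rng.uniform(0, 1, size=(400, 2))
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.6)).astype(float)

box = np.array([[0.0, 1.0], [0.0, 1.0]])  # [low, high] per variable
alpha, min_support = 0.1, 0.05            # peel fraction, support floor
while True:
    inside = np.all((X >= box[:, 0]) & (X <= box[:, 1]), axis=1)
    if inside.mean() <= min_support:
        break
    best, best_mean = None, y[inside].mean()
    # Try peeling an alpha-quantile slice off each edge of each variable,
    # keeping the single peel that most raises the in-box target mean.
    for j in range(X.shape[1]):
        for side in (0, 1):
            trial = box.copy()
            q = np.quantile(X[inside, j], alpha if side == 0 else 1 - alpha)
            trial[j, side] = q
            keep = np.all((X >= trial[:, 0]) & (X <= trial[:, 1]), axis=1)
            if keep.any() and y[keep].mean() > best_mean:
                best, best_mean = trial, y[keep].mean()
    if best is None:   # no peel improves the mean: stop
        break
    box = best

print(box.round(2))  # the box should close in on the high-mean corner
```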
Factor structure and dimensionality of the two depression scales in STAR*D using level 1 datasets.
Bech, P; Fava, M; Trivedi, M H; Wisniewski, S R; Rush, A J
2011-08-01
The factor structure and dimensionality of the HAM-D(17) and the IDS-C(30) are as yet uncertain, because psychometric analyses of these scales have been performed without a clear separation between factor structure profile and dimensionality (total scores being a sufficient statistic). The first treatment step (Level 1) in the STAR*D study provided a dataset of 4041 outpatients with DSM-IV nonpsychotic major depression. The HAM-D(17) and IDS-C(30) were evaluated by principal component analysis (PCA) without rotation. Mokken analysis tested the unidimensionality of the IDS-C(6), which corresponds to the unidimensional HAM-D(6). For both the HAM-D(17) and IDS-C(30), PCA identified a bi-directional factor contrasting the depressive symptoms versus the neurovegetative symptoms. The HAM-D(6) and the corresponding IDS-C(6) symptoms all emerged in the depression factor. Both the HAM-D(6) and IDS-C(6) were found to be unidimensional scales, i.e., their total scores are each a sufficient statistic for the measurement of depressive states. STAR*D used only one medication in Level 1. The unidimensional HAM-D(6) and IDS-C(6) should be used when evaluating the pure clinical effect of antidepressive treatment, whereas the multidimensional HAM-D(17) and IDS-C(30) should be considered when selecting antidepressant treatment. Copyright © 2011 Elsevier B.V. All rights reserved.
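PCA without rotation, as used above, amounts to an eigendecomposition of the item correlation matrix; a dominant first component with same-signed loadings is the usual evidence for unidimensionality. A minimal sketch on synthetic item scores (not STAR*D data; the function name is illustrative):

```python
import numpy as np

def pca_loadings(scores):
    """Unrotated principal components of an item-score matrix (subjects x items)."""
    Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # standardize items
    corr = np.cov(Z, rowvar=False)                # ~ item correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # sort components descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)         # component loadings
    explained = eigvals / eigvals.sum()           # proportion of variance
    return loadings, explained
```

On data generated from a single latent factor, the first component dominates and all first-component loadings share a sign, mirroring the unidimensionality argument for the six-item scales.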
MacLeod, Melissa A; Tremblay, Paul F; Graham, Kathryn; Bernards, Sharon; Rehm, Jürgen; Wells, Samantha
2016-12-01
The 12-item World Health Organization Disability Assessment Schedule 2.0 (WHODAS 2.0) is a brief measurement tool used cross-culturally to capture the multi-dimensional nature of disablement through six domains, including: understanding and interacting with the world; moving and getting around; self-care; getting on with people; life activities; and participation in society. Previous psychometric research supports that the WHODAS 2.0 functions as a general factor of disablement. In a pooled dataset from community samples of adults (N = 447) we used confirmatory factor analysis to confirm a one-factor structure. Latent class analysis was used to identify subgroups of individuals based on their patterns of responses. We identified four distinct classes, or patterns of disablement: (1) pervasive disability; (2) physical disability; (3) emotional, cognitive, or interpersonal disability; (4) no/low disability. Convergent validity of the latent class subgroups was found with respect to socio-demographic characteristics, number of days affected by disabilities, stress, mental health, and substance use. These classes offer a simple and meaningful way to classify people with disabilities based on the 12-item WHODAS 2.0. Focusing on individuals with a high probability of being in the first three classes may help guide interventions. Copyright © 2016 John Wiley & Sons, Ltd.
Feature combinations and the divergence criterion
NASA Technical Reports Server (NTRS)
Decell, H. P., Jr.; Mayekar, S. M.
1976-01-01
Classifying large quantities of multidimensional remotely sensed agricultural data requires efficient and effective classification techniques and the construction of certain transformations of a dimension reducing, information preserving nature. The construction of transformations that minimally degrade information (i.e., class separability) is described. Linear dimension reducing transformations for multivariate normal populations are presented. Information content is measured by divergence.
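Divergence between two multivariate normal populations has a closed form, which is what makes it a convenient class-separability measure when constructing dimension-reducing transformations. A sketch of the symmetric divergence J (the function name is illustrative):

```python
import numpy as np

def divergence(mu1, S1, mu2, S2):
    """Symmetric divergence J = KL(1||2) + KL(2||1) for multivariate normals."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    d = mu1 - mu2
    p = len(mu1)
    term_cov = 0.5 * np.trace(S1i @ S2 + S2i @ S1) - p     # covariance mismatch
    term_mean = 0.5 * d @ (S1i + S2i) @ d                  # mean separation
    return term_cov + term_mean
```

A linear reduction B can then be scored by the divergence between the projected populations N(B mu_i, B S_i B^T); the transformation preserves information exactly when this matches the full-space divergence.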
An Investigation on Computer-Adaptive Multistage Testing Panels for Multidimensional Assessment
ERIC Educational Resources Information Center
Wang, Xinrui
2013-01-01
The computer-adaptive multistage testing (ca-MST) has been developed as an alternative to computerized adaptive testing (CAT), and been increasingly adopted in large-scale assessments. Current research and practice only focus on ca-MST panels for credentialing purposes. The ca-MST test mode, therefore, is designed to gauge a single scale. The…
Decomposing the Education Wage Gap: Everything but the Kitchen Sink. Working Paper 2010-12
ERIC Educational Resources Information Center
Hotchkiss, Julie L.; Shiferaw, Menbere
2010-01-01
This paper contributes to a large literature concerned with identifying the source of the widening wage gap between high school and college graduates by providing a comprehensive, multidimensional decomposition of wages across both time and educational status. Data from a multitude of sources are brought to bear on the question of the relative…
ERIC Educational Resources Information Center
Badilescu-Buga, Emil
2012-01-01
Learning Activity Management System (LAMS) has been trialled and used by users from many countries around the globe, but despite the positive attitude towards its potential benefits to pedagogical processes its adoption in practice has been uneven, reflecting how difficult it is to make a new technology based concept an integral part of the…
ERIC Educational Resources Information Center
Muijselaar, Marloes M. L.; Swart, Nicole M.; Steenbeek-Planting, Esther G.; Droop, Mienke; Verhoeven, Ludo; de Jong, Peter F.
2017-01-01
Many recent studies have aimed to demonstrate that specific types of reading comprehension depend on different underlying cognitive abilities. In these studies, it is often implicitly assumed that reading comprehension is a multidimensional construct. The general aim of this study was to examine the dimensionality of a large pool of reading…
NASA Technical Reports Server (NTRS)
Brodsky, Alexander; Segal, Victor E.
1999-01-01
The EOSCUBE constraint database system is designed to be a software productivity tool for high-level specification and efficient generation of EOSDIS and other scientific products. These products are typically derived from large volumes of multidimensional data which are collected via a range of scientific instruments.
ERIC Educational Resources Information Center
Shaw, Lynn; Polatajko, Helene
2002-01-01
A 20-year review of literature on return to work outcomes for ill or injured persons found that research is largely atheoretical and the knowledge base fragmented. The Occupational Competence Model can fill this gap by reflecting the multidimensional nature of work disability (personal, environmental, and occupational dimensions) and factors…
ERIC Educational Resources Information Center
Feryok, Anne
2013-01-01
This exploratory study focuses on four non-native English speaking secondary content teachers in a short-term immersion program aimed at introducing them to language teaching methods for secondary school content instruction through the medium of English. Such programs have been found to have largely mixed results for language performance. This may…
Fast and Accurate Support Vector Machines on Large Scale Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vishnu, Abhinav; Narasimhan, Jayenthi; Holder, Larry
Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary (also known as a hyperplane) which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminate the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively, potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm, the de facto sequential SVM software, which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as the UCI HIGGS Boson dataset and the Offending URL dataset.
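The elimination idea can be illustrated on a linear SVM trained by sub-gradient descent: partway through training, samples whose functional margin is already comfortably above 1 are dropped, since they are unlikely to end up as support vectors. This toy single-process sketch shows the aggressive one-shot variant; the parameter names are hypothetical, and the paper's algorithm is distributed and synchronizes shrinking decisions across ranks:

```python
import numpy as np

def svm_with_shrinking(X, y, lam=0.01, lr=0.1, epochs=100, shrink_at=20, keep_margin=1.5):
    """Linear SVM via sub-gradient descent on the hinge loss; after `shrink_at`
    epochs, samples with functional margin > keep_margin are eliminated."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    active = np.ones(n, dtype=bool)
    for epoch in range(epochs):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break                                # everything shrunk; boundary settled
        margins = y[idx] * (X[idx] @ w + b)
        viol = idx[margins < 1]                  # hinge-loss violators
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / idx.size
        grad_b = -y[viol].sum() / idx.size
        w -= lr * grad_w
        b -= lr * grad_b
        if epoch == shrink_at:                   # aggressive one-shot shrink
            active = (y * (X @ w + b)) < keep_margin
    return w, b, active
```

Later epochs then touch only the retained near-boundary samples, which is where the time savings come from; a conservative variant would shrink later, or repeatedly with re-checks.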
Scalable persistent identifier systems for dynamic datasets
NASA Astrophysics Data System (ADS)
Golodoniuc, P.; Cox, S. J. D.; Klump, J. F.
2016-12-01
Reliable and persistent identification of objects, whether tangible or not, is essential in information management. Many Internet-based systems have been developed to identify digital data objects, e.g., PURL, LSID, Handle, ARK. These were largely designed for identification of static digital objects. The amount of data made available online has grown exponentially over the last two decades and fine-grained identification of dynamically generated data objects within large datasets using conventional systems (e.g., PURL) has become impractical. We have compared capabilities of various technological solutions to enable resolvability of data objects in dynamic datasets, and developed a dataset-centric approach to resolution of identifiers. This is particularly important in Semantic Linked Data environments where dynamic frequently changing data is delivered live via web services, so registration of individual data objects to obtain identifiers is impractical. We use identifier patterns and pattern hierarchies for identification of data objects, which allows relationships between identifiers to be expressed, and also provides means for resolving a single identifier into multiple forms (i.e. views or representations of an object). The latter can be implemented through (a) HTTP content negotiation, or (b) use of URI querystring parameters. The pattern and hierarchy approach has been implemented in the Linked Data API supporting the United Nations Spatial Data Infrastructure (UNSDI) initiative and later in the implementation of geoscientific data delivery for the Capricorn Distal Footprints project using International Geo Sample Numbers (IGSN). This enables flexible resolution of multi-view persistent identifiers and provides a scalable solution for large heterogeneous datasets.
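The pattern-based approach registers identifier patterns rather than individual objects, and resolves one identifier into multiple representations via content negotiation or a suffix. A toy resolver; the IGSN-style paths, view names, and registry structure are hypothetical illustrations, not the deployed API:

```python
import re

# Hypothetical pattern registry: most specific pattern is listed (and tried) first.
PATTERNS = [
    (re.compile(r"^igsn/(?P<sample>[A-Z0-9]+)\.json$"), "json"),
    (re.compile(r"^igsn/(?P<sample>[A-Z0-9]+)$"), "html"),
]

def resolve(identifier, accept="text/html"):
    """Resolve a persistent identifier without per-object registration:
    the pattern, not each data object, is what gets registered."""
    for pattern, default_view in PATTERNS:
        m = pattern.match(identifier)
        if m:
            # HTTP content negotiation can override the pattern's default view.
            view = "json" if accept == "application/json" else default_view
            return {"sample": m.group("sample"), "view": view}
    return None  # no registered pattern covers this identifier
```

Because dynamically generated objects match an existing pattern, new data served live via web services is resolvable immediately, with no registration step per object.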
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks
Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H
2015-01-01
Objective: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials: We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results: There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance the adoption of NLP in practice. PMID:25336595
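The dictionary-based side of this trade-off can be as simple as greedy longest-match term recognition over tokenized note text, which is what makes it fast enough for millions of notes. A minimal sketch (not the NCBO Annotator's actual algorithm; function names and the toy dictionary are illustrative):

```python
def build_index(terms):
    """Index dictionary terms by their first token for fast scanning."""
    index = {}
    for term in terms:
        toks = term.lower().split()
        index.setdefault(toks[0], []).append(toks)
    return index

def annotate(text, index):
    """Greedy longest-match dictionary term recognition over a note."""
    toks = text.lower().split()
    hits, i = [], 0
    while i < len(toks):
        best = None
        for cand in index.get(toks[i], []):
            if toks[i:i + len(cand)] == cand and (best is None or len(cand) > len(best)):
                best = cand                      # prefer the longest matching term
        if best:
            hits.append(" ".join(best))
            i += len(best)                       # skip past the matched span
        else:
            i += 1
    return hits
```

Real annotators add normalization, ontology mapping, and negation handling, but the core scan remains linear in the note length, which is why it scales to corpora like STRIDE.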
Plant databases and data analysis tools
USDA-ARS?s Scientific Manuscript database
It is anticipated that the coming years will see the generation of large datasets including diagnostic markers in several plant species with emphasis on crop plants. To use these datasets effectively in any plant breeding program, it is essential to have the information available via public database...
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system
Jensen, Tue V.; Pinson, Pierre
2017-01-01
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation over a period of 3 years. These may be scaled according to the envisioned degree of renewable penetration in a future European energy system. The spatial coverage, completeness, and resolution of this dataset open the door to the evaluation, scaling analysis, and replicability checking of a wealth of proposals in, e.g., market design, network actor coordination, and forecasting of renewable power generation. PMID:29182600
Application of multivariate statistical techniques in microbial ecology
Paliy, O.; Shankar, V.
2016-01-01
Recent advances in high-throughput methods of molecular analyses have led to an explosion of studies generating large-scale ecological datasets. An especially noticeable effect has been attained in the field of microbial ecology, where new experimental approaches provided in-depth assessments of the composition, functions, and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces large amounts of data, powerful statistical techniques of multivariate analysis are well suited to analyzing and interpreting these datasets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular dataset. In this review we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive, and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and dataset structure. PMID:26786791
Shah, Sohil Atul
2017-01-01
Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838
Document similarity measures and document browsing
NASA Astrophysics Data System (ADS)
Ahmadullin, Ildus; Fan, Jian; Damera-Venkata, Niranjan; Lim, Suk Hwan; Lin, Qian; Liu, Jerry; Liu, Sam; O'Brien-Strain, Eamonn; Allebach, Jan
2011-03-01
Managing large document databases is an important task today. Being able to automatically compare document layouts and classify and search documents with respect to their visual appearance proves to be desirable in many applications. We measure single-page documents' similarity with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution; and distances between different documents' components are calculated as probabilistic similarities between corresponding distributions. The similarity measure between documents is represented as a weighted sum of the components' distances. Using this document similarity measure, we propose a browsing mechanism operating on a document dataset. For these purposes, we use a hierarchical browsing environment which we call the document similarity pyramid. It allows the user to browse a large document dataset and to search for documents in the dataset that are similar to the query. The user can browse the dataset on different levels of the pyramid, and zoom into the documents that are of interest.
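With each document component summarized by a Gaussian over layout features, a probabilistic similarity such as the Bhattacharyya distance has a closed form, and the overall measure is the weighted sum the abstract describes. A sketch using single Gaussians per component (the component names and weights are illustrative stand-ins, and the paper uses full Gaussian mixtures):

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Bhattacharyya distance between two Gaussian feature distributions."""
    S = 0.5 * (S1 + S2)
    d = mu1 - mu2
    term_mean = 0.125 * d @ np.linalg.inv(S) @ d
    term_cov = 0.5 * np.log(np.linalg.det(S) /
                            np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_mean + term_cov

def document_distance(doc_a, doc_b, weights=None):
    """Weighted sum of per-component distances between two page layouts.
    Each doc maps component name -> (mean, covariance) of layout features."""
    weights = weights or {"background": 0.2, "text": 0.5, "saliency": 0.3}
    return sum(w * bhattacharyya(*doc_a[c], *doc_b[c]) for c, w in weights.items())
```

Identical layouts score zero and the distance grows with component mismatch, which is the property the similarity pyramid needs for grouping visually similar pages.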
Classification of foods by transferring knowledge from ImageNet dataset
NASA Astrophysics Data System (ADS)
Heravi, Elnaz J.; Aghdam, Hamed H.; Puig, Domenec
2017-03-01
Automatic classification of foods is a way to control food intake and tackle obesity. However, it is a challenging problem since foods are highly deformable and complex objects. Results on the ImageNet dataset have revealed that Convolutional Neural Networks (ConvNets) have great expressive power for modeling natural objects. Nonetheless, it is not trivial to train a ConvNet from scratch for food classification, because ConvNets require large datasets and, to our knowledge, no large public food dataset exists for this purpose. An alternative solution is to transfer knowledge from trained ConvNets to the domain of foods. In this work, we study how transferable state-of-the-art ConvNets are to the task of food classification. We also propose a method for transferring knowledge from a bigger ConvNet to a smaller ConvNet while keeping its accuracy similar to that of the bigger ConvNet. Our experiments on the UECFood256 dataset show that GoogLeNet, VGG, and residual networks produce comparable results if we start transferring knowledge from the appropriate layer. In addition, we show that our method is able to effectively transfer knowledge to the smaller ConvNet using unlabeled samples.
Numeric invariants from multidimensional persistence
DOE Office of Scientific and Technical Information (OSTI.GOV)
Skryzalin, Jacek; Carlsson, Gunnar
2017-05-19
In this paper, we analyze the space of multidimensional persistence modules from the perspective of algebraic geometry. We first build a moduli space of a certain subclass of easily analyzed multidimensional persistence modules, which we construct specifically to capture much of the information which can be gained by using multidimensional persistence over one-dimensional persistence. We argue that the global sections of this space provide interesting numeric invariants when evaluated against our subclass of multidimensional persistence modules. Lastly, we extend these global sections to the space of all multidimensional persistence modules and discuss how the resulting numeric invariants might be used to study data.
NASA Astrophysics Data System (ADS)
Bouchet, L.; Amestoy, P.; Buttari, A.; Rouet, F.-H.; Chauvin, M.
2013-02-01
Nowadays, analyzing and reducing the ever-larger astronomical datasets is becoming a crucial challenge, especially for long accumulated observation times. The INTEGRAL/SPI X/γ-ray spectrometer is an instrument for which it is essential to process many exposures at the same time in order to increase the low signal-to-noise ratio of the weakest sources. In this context, the conventional methods for data reduction are inefficient and sometimes not feasible at all. Processing several years of data simultaneously requires computing not only the solution of a large system of equations, but also the associated uncertainties. We aim to reduce the computation time and the memory usage. Since the SPI transfer function is sparse, we have used some popular methods for the solution of large sparse linear systems; we briefly review these methods. We use the Multifrontal Massively Parallel Solver (MUMPS) to compute the solution of the system of equations. We also need to compute the variance of the solution, which amounts to computing selected entries of the inverse of the sparse matrix corresponding to our linear system. This can be achieved through one of the latest features of the MUMPS software that has been partly motivated by this work. In this paper we provide a brief presentation of this feature and evaluate its effectiveness on astrophysical problems requiring the processing of large datasets simultaneously, such as the study of the entire emission of the Galaxy. We used these algorithms to solve the large sparse systems arising from SPI data processing and to obtain both their solutions and the associated variances. In conclusion, thanks to these newly developed tools, processing large datasets arising from SPI is now feasible with both a reasonable execution time and a low memory usage.
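The two computations described above, the solution of the linear system and the variance of each estimate, correspond to a solve plus selected diagonal entries of the matrix inverse, which is the MUMPS feature discussed. A small dense numpy analogue of the least-squares case (the function and variable names are illustrative; the real SPI system is sparse and is never inverted densely):

```python
import numpy as np

def solve_with_variance(H, y, sigma=1.0):
    """Solve the normal equations (H^T H) x = H^T y and return, per unknown,
    its variance sigma^2 * diag((H^T H)^{-1}).  A sparse solver like MUMPS
    computes exactly these selected inverse entries without forming the
    full (huge) inverse."""
    A = H.T @ H                                  # sparse in the real system
    x = np.linalg.solve(A, H.T @ y)              # the solution
    var = sigma**2 * np.diag(np.linalg.inv(A))   # selected entries of A^{-1}
    return x, var
```

For a noiseless consistent system the solve recovers the true parameters exactly, while the variance column reports how well each parameter is constrained by the transfer function.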
A Robust Absorbing Boundary Condition for Compressible Flows
NASA Technical Reports Server (NTRS)
Loh, Ching Y.; Jorgenson, Philip C. E.
2005-01-01
An absorbing non-reflecting boundary condition (NRBC) for practical computations in fluid dynamics and aeroacoustics is presented with theoretical proof. This paper is a continuation and improvement of a previous paper by the author. The absorbing NRBC technique is based on a first principle of non-reflection, which contains the essential physics that a plane-wave solution of the Euler equations remains intact across the boundary. The technique is theoretically shown to work for a large class of finite-volume approaches. When combined with the hyperbolic conservation laws, the NRBC is simple, robust, and truly multi-dimensional; no additional implementation is needed except the prescribed physical boundary conditions. Several numerical examples in multi-dimensional spaces using two different finite-volume schemes are illustrated to demonstrate its robustness in practical computations. Limitations and remedies of the technique are also discussed.
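The "plane wave passes through intact" principle is easiest to see in one dimension: for upwind advection, the outflow boundary needs no special treatment at all, and an outgoing pulse exits without reflection. A toy sketch, not the paper's multi-dimensional finite-volume formulation; the parameter names are illustrative:

```python
import numpy as np

def advect_with_nrbc(u, c=1.0, dx=1.0, dt=0.5, steps=400):
    """First-order upwind scheme for u_t + c u_x = 0 (c > 0).
    Inflow (left) is held quiescent; at the outflow (right) the upwind
    stencil only looks backward, so an outgoing wave crosses the
    boundary unchanged -- the non-reflecting behavior."""
    u = u.copy()
    nu = c * dt / dx                       # CFL number (must be <= 1)
    for _ in range(steps):
        ghost_left = 0.0                   # quiescent inflow state
        u_left = np.concatenate(([ghost_left], u[:-1]))
        u = u - nu * (u - u_left)          # upwind update; no outflow fix needed
    return u
```

After enough steps a pulse started inside the domain has left entirely, with essentially nothing reflected back, which is the behavior the NRBC guarantees for plane waves in the multi-dimensional setting.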
Multidimensional Multiphysics Simulation of TRISO Particle Fuel
DOE Office of Scientific and Technical Information (OSTI.GOV)
J. D. Hales; R. L. Williamson; S. R. Novascone
2013-11-01
Multidimensional multiphysics analysis of TRISO-coated particle fuel using the BISON finite-element based nuclear fuels code is described. The governing equations and material models applicable to particle fuel and implemented in BISON are outlined. Code verification based on a recent IAEA benchmarking exercise is described, and excellent comparisons are reported. Multiple TRISO-coated particles of increasing geometric complexity are considered. It is shown that the code's ability to perform large-scale parallel computations permits application to complex 3D phenomena, while very efficient solutions for either 1D spherically symmetric or 2D axisymmetric geometries are straightforward. Additionally, the flexibility to easily include new physical and material models and the uncomplicated ability to couple to lower-length-scale simulations make BISON a powerful tool for the simulation of coated-particle fuel. Future code development activities and potential applications are identified.
Data publication, documentation and user friendly landing pages - improving data discovery and reuse
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2016-04-01
Research data are the basis for scientific research and are often irreplaceable (e.g., observational data). Storage of such data in appropriate, theme-specific or institutional repositories is an essential part of ensuring their long-term preservation and access. The free and open access to research data for reuse and scrutiny has been identified as a key issue by the scientific community as well as by research agencies and the public. To ensure that datasets are intelligible and usable for others, they must be accompanied by a comprehensive data description and standardized metadata for data discovery, and ideally should be published using digital object identifiers (DOIs). These make datasets citable, ensure their long-term accessibility, and are accepted in the reference lists of journal articles (http://www.copdess.org/statement-of-commitment/). The GFZ German Research Centre for Geosciences is the national laboratory for Geosciences in Germany and part of the Helmholtz Association, Germany's largest scientific organization. The development and maintenance of data systems is a key component of 'GFZ Data Services' to support state-of-the-art research. The datasets, archived in and published by the GFZ Data Repository, cover all geoscientific disciplines and range from large dynamic datasets deriving from global monitoring seismic or geodetic networks with real-time data acquisition, to remotely sensed satellite products, to automatically generated data publications from a database for data from micro-meteorological stations, to various model results, to geochemical and rock-mechanical analyses from various labs, and field observations. The user-friendly presentation of published datasets via a DOI landing page is as important for reuse as the storage itself, and the required information is highly specific for each scientific discipline.
If dataset descriptions are too general, or a dataset must be downloaded before its suitability can be judged, researchers often decide not to reuse a published dataset. In contrast to large data repositories without thematic specification, theme-specific data repositories have extensive expertise in data discovery and the opportunity to develop usable, discipline-specific formats and layouts for specific datasets, including consultation on different formats for the data description (e.g., via a Data Report or an article in a Data Journal) with full consideration of international metadata standards.
ERIC Educational Resources Information Center
Chen, Ping
2017-01-01
Calibration of new items online has been an important topic in item replenishment for multidimensional computerized adaptive testing (MCAT). Several online calibration methods have been proposed for MCAT, such as multidimensional "one expectation-maximization (EM) cycle" (M-OEM) and multidimensional "multiple EM cycles"…
Best Design for Multidimensional Computerized Adaptive Testing with the Bifactor Model
ERIC Educational Resources Information Center
Seo, Dong Gi; Weiss, David J.
2015-01-01
Most computerized adaptive tests (CATs) have been studied using the framework of unidimensional item response theory. However, many psychological variables are multidimensional and might benefit from using a multidimensional approach to CATs. This study investigated the accuracy, fidelity, and efficiency of a fully multidimensional CAT algorithm…
Multidimensional Measurement of Poverty among Women in Sub-Saharan Africa
ERIC Educational Resources Information Center
Batana, Yele Maweki
2013-01-01
Since the seminal work of Sen, poverty has been recognized as a multidimensional phenomenon. The recent availability of relevant databases renewed the interest in this approach. This paper estimates multidimensional poverty among women in fourteen Sub-Saharan African countries using the Alkire and Foster multidimensional poverty measures, whose…
The Efficacy of Multidimensional Constraint Keys in Database Query Performance
ERIC Educational Resources Information Center
Cardwell, Leslie K.
2012-01-01
This work is intended to introduce a database design method to resolve the two-dimensional complexities inherent in the relational data model and its resulting performance challenges through abstract multidimensional constructs. A multidimensional constraint is derived and utilized to implement an indexed Multidimensional Key (MK) to abstract a…
GeNets: a unified web platform for network-based genomic analyses.
Li, Taibo; Kim, April; Rosenbluh, Joseph; Horn, Heiko; Greenfeld, Liraz; An, David; Zimmer, Andrew; Liberzon, Arthur; Bistline, Jon; Natoli, Ted; Li, Yang; Tsherniak, Aviad; Narayan, Rajiv; Subramanian, Aravind; Liefeld, Ted; Wong, Bang; Thompson, Dawn; Calvo, Sarah; Carr, Steve; Boehm, Jesse; Jaffe, Jake; Mesirov, Jill; Hacohen, Nir; Regev, Aviv; Lage, Kasper
2018-06-18
Functional genomics networks are widely used to identify unexpected pathway relationships in large genomic datasets. However, it is challenging to compare the signal-to-noise ratios of different networks and to identify the optimal network with which to interpret a particular genetic dataset. We present GeNets, a platform in which users can train a machine-learning model (Quack) to carry out these comparisons and execute, store, and share analyses of genetic and RNA-sequencing datasets.
Efficient genotype compression and analysis of large genetic variation datasets
Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.
2015-01-01
Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and performance relative to existing methods improves with cohort size. We show substantial (up to 443-fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772
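The genotype-oriented indexing idea can be illustrated with a toy bitmap index. This sketch is only in the spirit of GQT (it is not GQT's actual compressed on-disk format): one bit-vector per genotype class per variant lets sample-level queries reduce to fast bitwise operations instead of re-parsing the VCF.

```python
# Toy genotype bitmap index: genotype codes are 0 = hom-ref, 1 = het,
# 2 = hom-alt. Function and variable names are illustrative only.

def build_index(genotypes):
    """genotypes: one row per variant; each row lists per-sample codes."""
    index = []
    for row in genotypes:
        masks = {0: 0, 1: 0, 2: 0}
        for sample_idx, gt in enumerate(row):
            masks[gt] |= 1 << sample_idx   # set this sample's bit
        index.append(masks)
    return index

def samples_het_at_all(index, variant_ids, n_samples=64):
    """Samples heterozygous at every listed variant: AND the het bitmaps."""
    acc = ~0
    for v in variant_ids:
        acc &= index[v][1]
    return [i for i in range(n_samples) if (acc >> i) & 1]

# Toy data: 4 samples x 3 variants
rows = [[0, 1, 1, 2],
        [1, 1, 0, 2],
        [1, 1, 1, 0]]
idx = build_index(rows)
```

For example, `samples_het_at_all(idx, [0, 1, 2])` intersects the three het bitmaps and returns only sample 1, the single sample heterozygous at all three variants.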
Scientific Visualization and Simulation for Multi-dimensional Marine Environment Data
NASA Astrophysics Data System (ADS)
Su, T.; Liu, H.; Wang, W.; Song, Z.; Jia, Z.
2017-12-01
With growing attention on the ocean and the rapid development of marine sensing, there is increasing demand for realistic simulation and real-time interactive visualization of the marine environment. Based on technologies such as GPU rendering, CUDA parallel computing, and a fast grid-oriented strategy, this paper proposes a series of efficient, high-quality visualization methods that can handle large-scale, multi-dimensional marine data under different environmental circumstances. Firstly, a high-quality seawater simulation is realized with an FFT algorithm, bump mapping, and texture-animation technology. Secondly, large-scale multi-dimensional marine hydrological data are visualized with 3D interactive technologies and volume-rendering techniques. Thirdly, seabed terrain data are simulated with an improved Delaunay algorithm, surface-reconstruction and dynamic LOD algorithms, and GPU programming techniques. Fourthly, seamless real-time modelling of both ocean and land on a digital globe is achieved with WebGL to meet the requirements of web-based applications. The experiments suggest that these methods not only produce a satisfying simulation of the marine environment but also meet the rendering requirements of global multi-dimensional marine data. Additionally, a simulation system for underwater oil spills is built on the OSG 3D rendering engine. Integrated with the marine visualization methods mentioned above, it dynamically and simultaneously shows, in multiple dimensions, the movement processes, physical parameters, and current velocity and direction for different types of deep-water oil-spill particles (oil particles, hydrate particles, gas particles, etc.). Such an application provides valuable reference and decision-making information for understanding the progress of an oil spill in deep water, which is helpful for ocean disaster forecasting, warning, and emergency response.
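As a toy illustration of the FFT-based seawater simulation mentioned first, the following sketch synthesizes a wave-height profile by inverse-FFTing a random-phase spectrum. It is a 1-D simplification with an invented power-law spectrum, not the paper's implementation (ocean renderers typically use a 2-D Phillips-type spectrum).

```python
import numpy as np

def wave_heights(n=256, seed=0):
    """Synthesize a 1-D wave-height profile from a random-phase spectrum."""
    rng = np.random.default_rng(seed)
    k = np.fft.rfftfreq(n)                    # spatial frequencies
    amp = np.zeros_like(k)
    amp[1:] = k[1:] ** -1.5                   # toy spectrum: long waves dominate
    phase = rng.uniform(0, 2 * np.pi, len(k))
    spectrum = amp * np.exp(1j * phase)       # random phases decorrelate waves
    return np.fft.irfft(spectrum, n)          # height field in real space

h = wave_heights()
```

Animating the phases over time, as GPU ocean renderers do per frame, turns this static profile into moving waves.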
Linguistic Extensions of Topic Models
ERIC Educational Resources Information Center
Boyd-Graber, Jordan
2010-01-01
Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it has been most widely used to model datasets where documents are modeled as exchangeable…
A Comparison of Latent Heat Fluxes over Global Oceans for Four Flux Products
NASA Technical Reports Server (NTRS)
Chou, Shu-Hsien; Nelkin, Eric; Ardizzone, Joe; Atlas, Robert M.
2003-01-01
To improve our understanding of global energy and water cycle variability, and to improve model simulations of climate variations, it is vital to have accurate latent heat fluxes (LHF) over the global oceans. Monthly LHF, 10-m wind speed (U10m), 10-m specific humidity (Q10m), and sea-air humidity difference (Qs-Q10m) of GSSTF2 (version 2 Goddard Satellite-based Surface Turbulent Fluxes) over the global oceans during 1992-93 are compared with those of HOAPS (Hamburg Ocean Atmosphere Parameters and Fluxes from Satellite Data), NCEP (NCEP/NCAR reanalysis), and da Silva (a ship-based surface marine dataset). The mean differences, standard deviations of differences, and temporal correlations of these monthly variables over the global oceans during 1992-93 between GSSTF2 and each of the three datasets are analyzed. The large-scale patterns of the 2-yr mean fields for these variables are similar among the four datasets, but significant quantitative differences are found. The temporal correlation is higher in the northern extratropics than in the southern for all variables, with the contrast being especially large for da Silva as a result of more missing ship data in the south. The da Silva dataset has extremely low temporal correlations and large differences with GSSTF2 for all variables in the southern extratropics, indicating that it hardly produces realistic variability in these variables. The NCEP has an extremely low temporal correlation (0.27) and large spatial variations of differences with GSSTF2 for Qs-Q10m in the tropics, which causes the low correlation for LHF. Over the tropics, the HOAPS LHF is significantly smaller than GSSTF2, by approximately 31% (37 W/sq m), whereas the other two datasets are comparable to GSSTF2. This is because HOAPS has systematically smaller LHF than GSSTF2 in space, while the other two datasets have very large spatial variations of large positive and negative LHF differences with GSSTF2 that cancel to produce smaller regional-mean differences.
Our analyses suggest that the GSSTF2 latent heat flux, surface air humidity, and winds are likely to be more realistic than the other three flux datasets examined, although those of GSSTF2 are still subject to regional biases.
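The comparison statistics used throughout this abstract (mean difference, standard deviation of differences, and temporal correlation) can be sketched for two monthly time series at one grid point. The numbers below are synthetic, not GSSTF2 or HOAPS data.

```python
import numpy as np

def compare(a, b):
    """Mean difference, sample std of differences, temporal correlation."""
    d = a - b
    r = np.corrcoef(a, b)[0, 1]      # Pearson correlation over time
    return d.mean(), d.std(ddof=1), r

# Synthetic monthly LHF (W/m^2) from two hypothetical products
product_a = np.array([110., 120., 130., 125., 115., 105.])
product_b = np.array([100., 108., 122., 118., 109.,  99.])
bias, sd, corr = compare(product_a, product_b)
```

A large mean difference with a high correlation, as here, indicates a systematic offset; a low correlation instead points to unrealistic temporal variability, the diagnostic applied above to da Silva and NCEP.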
CANFAR + Skytree: Mining Massive Datasets as an Essential Part of the Future of Astronomy
NASA Astrophysics Data System (ADS)
Ball, Nicholas M.
2013-01-01
The future study of large astronomical datasets, consisting of hundreds of millions to billions of objects, will be dominated by large computing resources and by analysis tools of the necessary scalability and sophistication to extract useful information. Significant effort will be required to fulfil their potential as a provider of the next generation of science results. To date, computing systems have allowed either sophisticated analysis of small datasets (e.g., most astronomy software) or simple analysis of large datasets (e.g., database queries). At the Canadian Astronomy Data Centre, we have combined our cloud computing system, the Canadian Advanced Network for Astronomical Research (CANFAR), with the world's most advanced machine learning software, Skytree, to create the world's first cloud computing system for data mining in astronomy. This allows the full sophistication of the huge fields of data mining and machine learning to be applied to the hundreds of millions of objects that make up current large datasets. CANFAR works by utilizing virtual machines, which appear to the user as equivalent to a desktop. Each machine is replicated as desired to perform large-scale parallel processing. Such an arrangement carries far more flexibility than other cloud systems, because it enables the user to immediately install and run the same code that they already use for science on their desktop. We demonstrate the utility of the CANFAR + Skytree system by showing science results obtained, including assigning photometric redshifts with full probability density functions (PDFs) to a catalog of approximately 133 million galaxies from the MegaPipe reductions of the Canada-France-Hawaii Telescope Legacy Wide and Deep surveys. Each PDF is produced nonparametrically from 100 instances of the photometric parameters for each galaxy, generated by perturbing within the errors on the measurements.
Hence, we produce, store, and assign redshifts to a catalog of over 13 billion object instances. This catalog is comparable in size to those expected from next-generation surveys such as the Large Synoptic Survey Telescope. The CANFAR + Skytree system is open for use by any interested member of the astronomical community.
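The PDF-generation step described above can be sketched as a Monte Carlo perturbation: each galaxy's photometry is re-drawn within its errors, a redshift is estimated for each instance, and the instances are histogrammed. The linear "estimator" below is a hypothetical stand-in for the actual machine-learning regressor.

```python
import numpy as np

def photz_pdf(mags, mag_errs, estimator, n_instances=100, seed=42):
    """Nonparametric photo-z PDF from perturbed photometry instances."""
    rng = np.random.default_rng(seed)
    zs = []
    for _ in range(n_instances):
        perturbed = rng.normal(mags, mag_errs)   # draw within the errors
        zs.append(estimator(perturbed))
    hist, edges = np.histogram(zs, bins=20, density=True)
    return np.array(zs), hist, edges

# Hypothetical estimator: redshift as a linear function of two colors
toy = lambda m: max(0.0, 0.1 * (m[0] - m[1]) + 0.05 * (m[1] - m[2]))
zs, pdf, edges = photz_pdf(np.array([22.0, 21.5, 21.1]),
                           np.array([0.05, 0.04, 0.06]), toy)
```

With 100 instances per galaxy, a 133-million-galaxy catalog expands to the roughly 13 billion object instances quoted above.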
REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations
NASA Astrophysics Data System (ADS)
Moulik, P.; Lekic, V.; Romanowicz, B. A.
2017-12-01
A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. 
By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high-frequency (~1 Hz) and long-period (10-20 s) body-wave measurements into the REM-3D reference dataset.
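The summary-ray construction listed as step (3) can be sketched as follows: residuals for rays with nearly identical source-receiver paths are binned and replaced by one representative measurement per bin. Binning by rounding coordinates to a fixed grid, and using the median, are assumptions for illustration, not the working group's exact procedure.

```python
from collections import defaultdict
from statistics import median

def summary_rays(measurements, cell_deg=5.0):
    """measurements: (src_lat, src_lon, sta_lat, sta_lon, residual_s).
    Returns one (bin_key, median_residual, count) per path bin."""
    bins = defaultdict(list)
    for slat, slon, rlat, rlon, res in measurements:
        # rays whose endpoints fall in the same cells share a bin
        key = tuple(round(x / cell_deg) for x in (slat, slon, rlat, rlon))
        bins[key].append(res)
    # the count is retained as a quality/weighting attribute
    return [(key, median(v), len(v)) for key, v in bins.items()]

rays = [(10.2, 40.1, -33.0, 151.2, 1.4),
        (10.9, 41.0, -33.4, 151.0, 1.8),   # same 5-degree cells as above
        (55.0, 12.0, -33.0, 151.2, -0.3)]  # distinct source region
summary = summary_rays(rays)
```

Besides homogenizing coverage, the spread of residuals within each bin gives an empirical handle on the measurement errors quantified above.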
Data Bookkeeping Service 3 - Providing Event Metadata in CMS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Giffels, Manuel; Guo, Y.; Riley, Daniel
The Data Bookkeeping Service 3 provides a catalog of event metadata for Monte Carlo and recorded data of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) at CERN, Geneva. It comprises all information necessary for tracking datasets, their processing history, and the associations between runs, files, and datasets, at a scale of about 200,000 datasets and more than 40 million files, which adds up to around 700 GB of metadata. The DBS is an essential part of the CMS Data Management and Workload Management (DMWM) systems [1]; all kinds of data processing, such as Monte Carlo production, processing of recorded event data, and physics analysis done by users, rely heavily on the information stored in DBS.
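The kind of catalog DBS provides, datasets and files linked to the runs they contain, can be sketched as a small relational schema. The table and column names below are invented for illustration and are not DBS's actual schema.

```python
import sqlite3

# Toy event-metadata catalog: datasets own files, files span runs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE files (id INTEGER PRIMARY KEY,
                    dataset_id INTEGER REFERENCES datasets(id),
                    lfn TEXT, n_events INTEGER);
CREATE TABLE file_runs (file_id INTEGER REFERENCES files(id),
                        run_number INTEGER);
""")
con.execute("INSERT INTO datasets VALUES (1, '/MinBias/Sim/RAW')")
con.executemany("INSERT INTO files VALUES (?, 1, ?, ?)",
                [(1, 'f1.root', 5000), (2, 'f2.root', 7000)])
con.executemany("INSERT INTO file_runs VALUES (?, ?)",
                [(1, 180001), (2, 180001), (2, 180002)])

# A typical bookkeeping question: total events recorded for one run
total, = con.execute("""
    SELECT SUM(f.n_events) FROM files f
    JOIN file_runs r ON r.file_id = f.id
    WHERE r.run_number = 180001""").fetchone()
```

At DBS scale the same associations span hundreds of thousands of datasets and tens of millions of files, which is why the run-file-dataset linkage is central to the design.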
sbtools: A package connecting R to cloud-based data for collaborative online research
Winslow, Luke; Chamberlain, Scott; Appling, Alison P.; Read, Jordan S.
2016-01-01
The adoption of high-quality tools for collaboration and reproducible research such as R and GitHub is becoming more common in many research fields. While GitHub and other version management systems are excellent resources, they were originally designed to handle code and scale poorly to large text-based or binary datasets. A number of scientific data repositories are coming online and are often focused on dataset archival and publication. To handle collaborative workflows using large scientific datasets, there is increasing need to connect cloud-based online data storage to R. In this article, we describe how the new R package sbtools enables direct access to the advanced online data functionality provided by ScienceBase, the U.S. Geological Survey’s online scientific data storage platform.
A Metadata-Based Approach for Analyzing UAV Datasets for Photogrammetric Applications
NASA Astrophysics Data System (ADS)
Dhanda, A.; Remondino, F.; Santana Quintero, M.
2018-05-01
This paper proposes a methodology for pre-processing and analysing Unmanned Aerial Vehicle (UAV) datasets before photogrammetric processing. In cases where images are gathered without a detailed flight plan and at regular acquisition intervals, the datasets can be quite large and time-consuming to process. This paper proposes a method to calculate the image overlap and filter out images in order to reduce large block sizes and speed up photogrammetric processing. The Python-based algorithm that implements this methodology leverages the metadata in each image to determine the end and side overlap of grid-based UAV flights. Using user input, the algorithm filters out images that are not needed for photogrammetric processing. The result is an algorithm that can speed up photogrammetric processing and provide valuable information to the user about the flight path.
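The overlap-and-filter idea can be sketched as follows: estimate each image's ground footprint from the flying height and camera geometry recorded in the metadata, then drop images whose removal still leaves consecutive exposures above a user-chosen overlap threshold. The pinhole flat-terrain model and the specific camera parameters are assumptions for illustration, not the paper's algorithm.

```python
def footprint_length(height_m, focal_mm, sensor_len_mm):
    """Ground coverage along track (pinhole camera over flat terrain)."""
    return height_m * sensor_len_mm / focal_mm

def end_overlap(spacing_m, footprint_m):
    """Fractional forward overlap between two exposures."""
    return max(0.0, 1.0 - spacing_m / footprint_m)

def filter_images(positions_m, height_m, focal_mm, sensor_len_mm,
                  min_overlap=0.7):
    """Greedily drop interior images while the overlap between the
    remaining neighbours stays above min_overlap; keep the endpoints."""
    fp = footprint_length(height_m, focal_mm, sensor_len_mm)
    keep = [positions_m[0]]
    for i in range(1, len(positions_m) - 1):
        gap = positions_m[i + 1] - keep[-1]
        if end_overlap(gap, fp) < min_overlap:
            keep.append(positions_m[i])   # needed to maintain overlap
    keep.append(positions_m[-1])
    return keep

# 100 m flight, 24 mm lens on a 36 mm sensor: 150 m footprint,
# so exposures every 10 m are far denser than a 70% overlap requires
kept = filter_images([0, 10, 20, 30, 40, 50], 100, 24.0, 36.0)
```

Here only three of the six exposures survive, illustrating how metadata alone can shrink a block before any photogrammetric processing begins.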
Content-level deduplication on mobile internet datasets
NASA Astrophysics Data System (ADS)
Hou, Ziyu; Chen, Xunxun; Wang, Yang
2017-06-01
Various systems and applications involve a large volume of duplicate items. Given the high data redundancy in real-world datasets, data deduplication can reduce storage capacity requirements and improve the utilization of network bandwidth. However, because the chunks used by existing deduplication systems range in size from 4 KB to over 16 KB, those systems are not applicable to datasets consisting of short records. In this paper, we propose a new framework called SF-Dedup that is able to deduplicate a large set of Mobile Internet records whose size can be smaller than 100 B, or even smaller than 10 B. SF-Dedup is a short-fingerprint, in-line, hash-collision-resolving deduplication. Experimental applications illustrate that SF-Dedup is able to reduce storage capacity and shorten query time on a relational database.
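Short-fingerprint deduplication with collision resolution can be sketched as follows: a truncated hash serves as the index key, and a full-record comparison resolves collisions so that distinct records are never merged. Details such as the fingerprint width and hash function are assumptions for illustration, not the SF-Dedup design.

```python
import hashlib

class ShortFingerprintDedup:
    """Toy in-line dedup: short hash as key, full compare on collision."""

    def __init__(self, fp_bytes=2):
        self.fp_bytes = fp_bytes
        self.store = {}           # short fingerprint -> list of records

    def _fp(self, record: bytes) -> bytes:
        # truncated hash: cheap to index, but collisions are possible
        return hashlib.sha1(record).digest()[: self.fp_bytes]

    def add(self, record: bytes) -> bool:
        """Return True if the record was new, False if a duplicate."""
        bucket = self.store.setdefault(self._fp(record), [])
        if record in bucket:      # full comparison resolves collisions
            return False
        bucket.append(record)
        return True

dedup = ShortFingerprintDedup()
new_flags = [dedup.add(r) for r in [b"user:42", b"user:43", b"user:42"]]
```

The short key keeps the index small for billions of sub-100 B records, while the in-bucket comparison guarantees that a hash collision can never cause data loss.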
Assembling Large, Multi-Sensor Climate Datasets Using the SciFlo Grid Workflow System
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Xing, Z.; Fetzer, E.
2008-12-01
NASA's Earth Observing System (EOS) is the world's most ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the A-Train platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the cloud scenes from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover and access multiple datasets from remote sites, find the space/time matchups between instrument swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, and assemble merged datasets for further scientific and statistical analysis. To meet these large-scale challenges, we are utilizing a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data query, access, subsetting, co-registration, mining, fusion, and advanced statistical analysis. SciFlo is a semantically-enabled ("smart") Grid Workflow system that ties together a peer-to-peer network of computers into an efficient engine for distributed computation. The SciFlo workflow engine enables scientists to do multi-instrument Earth Science by assembling remotely-invokable Web Services (SOAP or http GET URLs), native executables, command-line scripts, and Python codes into a distributed computing flow.
A scientist visually authors the graph of operations in the VizFlow GUI, or uses a text editor to modify the simple XML workflow documents. The SciFlo client and server engines optimize the execution of such distributed workflows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources. The engine transparently moves data to the operators, and moves operators to the data (on the dozen trusted SciFlo nodes). SciFlo also deploys a variety of Data Grid services to: query datasets in space and time, locate and retrieve on-line data granules, provide on-the-fly variable and spatial subsetting, and perform pairwise instrument matchups for A-Train datasets. These services are combined into efficient workflows to assemble the desired large-scale, merged climate datasets. SciFlo is currently being applied in several large climate studies: comparisons of aerosol optical depth between MODIS, MISR, the AERONET ground network, and U. Michigan's IMPACT aerosol transport model; characterization of long-term biases in microwave and infrared instruments (AIRS, MLS) by comparisons to GPS temperature retrievals accurate to 0.1 K; and construction of a decade-long, multi-sensor water vapor climatology stratified by classified cloud scene, bringing together datasets from AIRS/AMSU, AMSR-E, MLS, MODIS, and CloudSat (NASA MEASUREs grant, Fetzer PI). The presentation will discuss the SciFlo technologies, their application in these distributed workflows, and the many challenges encountered in assembling and analyzing these massive datasets.
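The pairwise instrument matchup service mentioned above can be sketched as a space/time window search: for each observation from one instrument, find observations from another within a distance and time tolerance. The brute-force search and the specific thresholds are illustrative; a production service would use spatial indexing over swath geometry.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def matchups(obs_a, obs_b, max_km=50.0, max_hours=1.0):
    """obs: tuples (id, lat, lon, time_hours). Returns matched id pairs."""
    pairs = []
    for ida, lata, lona, ta in obs_a:
        for idb, latb, lonb, tb in obs_b:
            if (abs(ta - tb) <= max_hours
                    and haversine_km(lata, lona, latb, lonb) <= max_km):
                pairs.append((ida, idb))
    return pairs

# Hypothetical AIRS and MODIS footprints (ids and coordinates invented)
airs = [("a1", 10.0, 20.0, 0.0), ("a2", -45.0, 100.0, 5.0)]
modis = [("m1", 10.1, 20.1, 0.5), ("m2", 60.0, 20.0, 0.5)]
pairs = matchups(airs, modis)
```

Only the nearly coincident pair survives the window, which is exactly the filtering a matchup service performs before variables from the two instruments are merged and compared.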