LLMapReduce: Multi-Lingual Map-Reduce for Supercomputing Environments
2015-11-20
The map-reduce parallel programming model has become extremely popular in the big data community. Many big data ... to big data users running on a supercomputer. LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel programming ...
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Lin, Jimmy
2013-03-01
Hadoop is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails" in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This article espouses a very different position: that MapReduce is "good enough," and that instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce. The simple solution is to find alternative, noniterative algorithms that solve the same problem. This article captures my personal experiences as an academic researcher as well as a software engineer in a "real-world" production analytics environment. From this combined perspective, I reflect on the current state and future of "big data" research.
MaMR: High-performance MapReduce programming model for material cloud applications
NASA Astrophysics Data System (ADS)
Jing, Weipeng; Tong, Danyu; Wang, Yangang; Wang, Jingyuan; Liu, Yaqiu; Zhao, Peng
2017-02-01
With the increasing data size in materials science, existing programming models no longer satisfy the application requirements. MapReduce is a programming model that enables the easy development of scalable parallel applications to process big data on cloud computing systems. However, this model does not directly support the processing of multiple related data, and the processing performance does not reflect the advantages of cloud computing. To enhance the capability of workflow applications in material data processing, we defined a programming model for material cloud applications, called MaMR, that supports multiple different Map and Reduce functions running concurrently based on hybrid shared-memory BSP. An optimized data sharing strategy to supply the shared data to the different Map and Reduce stages was also designed. We added a new merge phase to MapReduce that can efficiently merge data from the map and reduce modules. Experiments showed that the model and framework present effective performance improvements compared to previous work.
Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds
2012-01-01
Background The MapReduce framework enables scalable processing and analysis of large datasets by distributing the computational load on connected computer nodes, referred to as a cluster. In Bioinformatics, MapReduce has already been adopted to various case scenarios such as mapping next generation sequencing data to a reference genome, finding SNPs from short read data or matching strings in genotype files. Nevertheless, tasks like installing and maintaining MapReduce on a cluster system, importing data into its distributed file system or executing MapReduce programs require advanced knowledge in computer science and could thus prevent scientists from using currently available and useful software solutions. Results Here we present Cloudgene, a freely available platform to improve the usability of MapReduce programs in Bioinformatics by providing a graphical user interface for the execution, the import and export of data and the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to build a standardized graphical execution environment for currently available and future MapReduce programs, which can all be integrated by using its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all times and data transfer times are therefore minimized. Conclusions Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs. This platform gives developers the opportunity to focus on the actual implementation task and provides scientists a platform with the aim to hide the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g. Cloud BioLinux, RStudio) in public clouds. Currently, five different bioinformatic programs using MapReduce and two systems are integrated and have been successfully deployed. Cloudgene is freely available at http://cloudgene.uibk.ac.at. PMID:22888776
High-Performance Signal Detection for Adverse Drug Events using MapReduce Paradigm.
Fan, Kai; Sun, Xingzhi; Tao, Ying; Xu, Linhao; Wang, Chen; Mao, Xianling; Peng, Bo; Pan, Yue
2010-11-13
Post-marketing pharmacovigilance is important for public health, as many Adverse Drug Events (ADEs) are unknown when those drugs were approved for marketing. However, due to the large number of reported drugs and drug combinations, detecting ADE signals by mining these reports is becoming a challenging task in terms of computational complexity. Recently, a parallel programming model, MapReduce, has been introduced by Google to support large-scale data intensive applications. In this study, we proposed a MapReduce-based algorithm for a common ADE detection approach, the Proportional Reporting Ratio (PRR), and tested it in mining spontaneous ADE reports from the FDA. The purpose is to investigate the possibility of using the MapReduce principle to speed up biomedical data mining tasks, using this pharmacovigilance case as one specific example. The results demonstrated that the MapReduce programming model could improve the performance of a common signal detection algorithm for pharmacovigilance in a distributed computation environment at approximately linear speedup rates.
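To make the division of labor concrete, here is a minimal pure-Python emulation of a MapReduce-style PRR computation; the report data, key layout, and in-process shuffle are invented for illustration and this is not the authors' implementation. The mapper emits the pair and marginal counts PRR needs, the reducer sums them, and PRR for a (drug, event) pair is (a/(a+b)) / (c/(c+d)) over the resulting contingency counts.

```python
from collections import defaultdict

# Toy spontaneous reports: (drugs, adverse events). Data layout is invented.
reports = [
    ({"drugA"}, {"nausea"}),
    ({"drugA"}, {"nausea", "rash"}),
    ({"drugB"}, {"nausea"}),
    ({"drugB"}, {"headache"}),
    ({"drugB"}, {"rash"}),
]

def map_phase(report):
    """Map: one report -> pair counts plus the marginals PRR needs."""
    drugs, events = report
    for d in drugs:
        yield ("drug", d), 1                 # reports mentioning drug d
        for e in events:
            yield ("pair", d, e), 1          # reports mentioning both d and e
    for e in events:
        yield ("event", e), 1                # reports mentioning event e
    yield ("total",), 1                      # all reports

def reduce_phase(pairs):
    """Reduce: sum the 1s emitted per key (the shuffle would group them)."""
    counts = defaultdict(int)
    for key, val in pairs:
        counts[key] += val
    return counts

counts = reduce_phase(kv for r in reports for kv in map_phase(r))

def prr(drug, event):
    a = counts[("pair", drug, event)]        # drug with event
    b = counts[("drug", drug)] - a           # drug without event
    c = counts[("event", event)] - a         # event without drug
    d = counts[("total",)] - a - b - c       # neither
    return (a / (a + b)) / (c / (c + d)) if a + b and c and c + d else float("nan")

print(prr("drugA", "nausea"))                # 3.0 on the toy data
```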
MARIANE: MApReduce Implementation Adapted for HPC Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fadika, Zacharia; Dede, Elif; Govindaraju, Madhusudhan
MapReduce is increasingly becoming a popular framework, and a potent programming model. The most popular open source implementation of MapReduce, Hadoop, is based on the Hadoop Distributed File System (HDFS). However, as HDFS is not POSIX compliant, it cannot be fully leveraged by applications running on a majority of existing HPC environments such as Teragrid and NERSC. These HPC environments typically support globally shared file systems such as NFS and GPFS. On such resourceful HPC infrastructures, the use of Hadoop not only creates compatibility issues, but also affects overall performance due to the added overhead of the HDFS. This paper not only presents a MapReduce implementation directly suitable for HPC environments, but also exposes the design choices for better performance gains in those settings. By leveraging inherent distributed file systems' functions, and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) not only allows for the use of the model in an expanding number of HPC environments, but also allows for better performance in such settings. This paper shows the applicability and high performance of the MapReduce paradigm through MARIANE, an implementation designed for clustered and shared-disk file systems and as such not dedicated to a specific MapReduce solution. The paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach in distributed environments over Apache Hadoop in a data intensive setting, on the Magellan testbed at the National Energy Research Scientific Computing Center (NERSC).
Evaluating SPLASH-2 Applications Using MapReduce
NASA Astrophysics Data System (ADS)
Zhu, Shengkai; Xiao, Zhiwei; Chen, Haibo; Chen, Rong; Zhang, Weihua; Zang, Binyu
MapReduce has been prevalent for running data-parallel applications. By hiding non-functional concerns such as parallelism, fault tolerance and load balancing from programmers, MapReduce significantly simplifies the programming of large clusters. Due to these features, researchers have also explored the use of MapReduce in other application domains, such as machine learning, textual retrieval and statistical translation, among others.
Handling Data Skew in MapReduce Cluster by Using Partition Tuning
Gao, Yufei; Zhou, Yanjie; Zhou, Bing; Shi, Lei; Zhang, Jiacai
2017-01-01
The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In comparison with the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy and the partition tuning method to disperse key-value pairs in virtual partitions and recombines each partition in case of data skew. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm can handle data skew in MapReduce efficiently and improve the performance of MapReduce jobs in comparison with the native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that the time needed for rule extraction can be reduced significantly by adopting the PTSH algorithm, since it is more suitable for association rule mining (ARM) on healthcare data.
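The two-stage idea behind partition tuning can be sketched in a few lines: hash keys into more virtual partitions than reducers, then recombine the virtual partitions onto reducers by observed load. A greedy, single-process sketch under those assumptions, not the published PTSH code:

```python
import heapq
from collections import Counter

def tune_partitions(key_counts, num_reducers, virtual_factor=4):
    """Two-stage skew-handling sketch: hash keys into virtual partitions,
    then greedily assign the heaviest virtual partitions to the least
    loaded reducer. Illustrates partition tuning, not PTSH itself."""
    num_virtual = num_reducers * virtual_factor
    # Stage 1: conventional hash partitioning into virtual partitions.
    virtual_load = Counter()
    for key, count in key_counts.items():
        virtual_load[hash(key) % num_virtual] += count
    # Stage 2: recombine virtual partitions, heaviest first, onto reducers.
    heap = [(0, r) for r in range(num_reducers)]    # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for vp, load in virtual_load.most_common():
        current, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (current + load, reducer))
    return assignment

# Skewed toy workload: one hot key dominates.
keys = Counter({"hot": 1000, "a": 10, "b": 12, "c": 8, "d": 11})
print(tune_partitions(keys, num_reducers=2))
```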
Bringing MapReduce Closer To Data With Active Drives
NASA Astrophysics Data System (ADS)
Golpayegani, N.; Prathapan, S.; Warmka, R.; Wyatt, B.; Halem, M.; Trantham, J. D.; Markey, C. A.
2017-12-01
Moving computation closer to the data location has been a much theorized improvement to computation for decades. The increase in processor performance and the decrease in processor size and power requirements, combined with the increase in data intensive computing, have created a push to move computation as close to data as possible. We will show the next logical step in this evolution in computing: moving computation directly to storage. Hypothetical systems, known as Active Drives, were proposed as early as 1998. These Active Drives would have a general-purpose CPU on each disk allowing for computations to be performed on them without the need to transfer the data to the computer over the system bus or via a network. We will utilize Seagate's Active Drives to perform general purpose parallel computing using the MapReduce programming model directly on each drive. We will detail how the MapReduce programming model can be adapted to the Active Drive compute model to perform general purpose computing with results comparable to traditional MapReduce computations performed via Hadoop. We will show how an Active Drive based approach significantly reduces the amount of data leaving the drive when performing two common algorithms: subsetting and gridding. We will show that an Active Drive based design significantly improves data transfer speeds into and out of drives compared to Hadoop's HDFS while maintaining compute speeds comparable to Hadoop.
A strategy to load balancing for non-connectivity MapReduce job
NASA Astrophysics Data System (ADS)
Zhou, Huaping; Liu, Guangzong; Gui, Haixia
2017-09-01
MapReduce has been widely used on large-scale and complex datasets as a distributed programming model. The original hash partitioning function in MapReduce often results in data skew when the data distribution is uneven. To address imbalanced data partitioning, we propose a strategy that changes the remaining partitioning index when data is skewed. In the Map phase, we count the amount of data that will be distributed to each reducer; the JobTracker then monitors the global partitioning information and dynamically modifies the original partitioning function according to the data skew model, so the Partitioner can redirect partitions that would cause data skew to reducers with less load in the next partitioning process, eventually balancing the load of each node. Finally, we experimentally compare our method with existing methods on both synthetic and real datasets; the results show that our strategy solves the data skew problem with better stability and efficiency than the hash method and the sampling method for non-connectivity MapReduce tasks.
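A single-process sketch of the dynamic idea: track per-partition counts as tuples are routed and redirect an overloaded partition index to the lightest reducer. The threshold, the single-hop redirect, and the in-process bookkeeping are simplifications of the paper's JobTracker-coordinated strategy.

```python
from collections import Counter

class SkewAwarePartitioner:
    """Sketch of a skew-aware partitioner: per-partition tuple counts are
    tracked during the map phase, and a partition that grows past a skew
    threshold is redirected to the currently lightest reducer. Single-hop
    redirects keep the sketch simple."""

    def __init__(self, num_reducers, skew_ratio=2.0):
        self.num_reducers = num_reducers
        self.skew_ratio = skew_ratio
        self.load = Counter()        # tuples routed to each reducer so far
        self.redirect = {}           # original partition index -> new reducer

    def partition(self, key):
        idx = hash(key) % self.num_reducers
        idx = self.redirect.get(idx, idx)
        self.load[idx] += 1
        self._maybe_redirect(idx)
        return idx

    def _maybe_redirect(self, idx):
        avg = sum(self.load.values()) / self.num_reducers
        if self.load[idx] > self.skew_ratio * max(avg, 1.0):
            lightest = min(range(self.num_reducers), key=lambda r: self.load[r])
            if lightest != idx:
                self.redirect[idx] = lightest

p = SkewAwarePartitioner(num_reducers=3)
for key in ["hot"] * 60 + ["a", "b", "c", "d"] * 5:
    p.partition(key)
print(dict(p.load), p.redirect)
```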
NASA Astrophysics Data System (ADS)
Ramirez, Andres; Rahnemoonfar, Maryam
2017-04-01
A hyperspectral image provides a multidimensional picture rich in data, consisting of hundreds of spectral bands. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms results in high computational time. To overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that helps analyze a hyperspectral image through the use of parallel hardware and a parallel programming model, which is simpler to handle than other low-level parallel programming models. Additionally, Hadoop was used as an open-source implementation of the MapReduce parallel programming model. This research compared classification accuracy and timing results between the Hadoop and GPU systems and tested them against the following cases: a combined CPU and GPU test case, a CPU-only test case, and a test case where no dimensionality reduction was applied.
Fault-Tolerant and Elastic Streaming MapReduce with Decentralized Coordination
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumbhare, Alok; Frincu, Marc; Simmhan, Yogesh
2015-06-29
The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses challenges: first, the runtime variations in data characteristics such as data-rates and key-distribution cause resource overload, which in turn leads to fluctuations in the Quality of Service (QoS); and second, the stateful reducers, whose state depends on the complete tuple history, necessitate efficient fault-recovery mechanisms to maintain the desired QoS in the presence of resource failures. We propose an integrated streaming MapReduce architecture leveraging the concept of consistent hashing to support runtime elasticity along with locality-aware data and state replication to provide efficient load-balancing with low-overhead fault-tolerance and parallel fault-recovery from multiple simultaneous failures. Our evaluation on a private cloud shows up to 2.8x improvement in peak throughput compared to the Apache Storm SPS, and a low recovery latency of 700-1500 ms from multiple failures.
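Consistent hashing is what makes this elasticity cheap: adding a worker remaps only the keys falling in its arc of the ring. A minimal ring with virtual nodes, illustrative only and not the paper's implementation:

```python
import bisect
import hashlib

def _h(s):
    # Stable hash onto the ring (md5 used only for repeatability).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes; adding or removing
    a worker only remaps the keys in its arc, which is what makes runtime
    elasticity cheap in a streaming MapReduce setting."""
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_h(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        # First virtual node clockwise from the key's position, with wraparound.
        i = bisect.bisect(self._ring, (_h(key), chr(0x10FFFF)))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing(["worker-1", "worker-2"])
before = {k: ring.lookup(k) for k in map(str, range(20))}
ring.add("worker-3")                         # scale out at runtime
after = {k: ring.lookup(k) for k in map(str, range(20))}
print(sum(before[k] != after[k] for k in before), "of 20 keys remapped")
```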
Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends
Mohammed, Emad A; Far, Behrouz H; Naugler, Christopher
2014-01-01
The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so-called "big data" challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation. In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. This paper is concluded by summarizing the potential usage of the MapReduce programming framework and Hadoop platform to process huge volumes of clinical data in medical health informatics related fields.
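The Map/Reduce division of labor the review describes is easiest to see in canonical word count. A pure-Python emulation of the map, shuffle/sort, and reduce steps that Hadoop would run over distributed splits:

```python
from itertools import groupby

documents = ["big data needs new tools", "map and reduce are functional tools"]

def mapper(doc):
    for word in doc.split():
        yield word, 1                 # Map: emit an intermediate key-value pair

def reducer(word, counts):
    return word, sum(counts)          # Reduce: aggregate values sharing a key

# Shuffle/sort: group intermediate pairs by key, as the framework would.
intermediate = sorted(kv for doc in documents for kv in mapper(doc))
result = dict(reducer(k, (v for _, v in group))
              for k, group in groupby(intermediate, key=lambda kv: kv[0]))
print(result)
```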
Optimizing Crawler4j using MapReduce Programming Model
NASA Astrophysics Data System (ADS)
Siddesh, G. M.; Suresh, Kavya; Madhuri, K. Y.; Nijagal, Madhushree; Rakshitha, B. R.; Srinivasa, K. G.
2017-06-01
The World Wide Web is a decentralized system consisting of a repository of information in the form of web pages. These web pages act as a source of information or data in the present analytics world. Web crawlers are used for extracting useful information from web pages for different purposes. First, they are used in web search engines, where web pages are indexed to form a corpus of information that allows users to query the web pages. Second, they are used for web archiving, where web pages are stored for later analysis phases. Third, they can be used for web mining, where web pages are monitored for copyright purposes. The amount of information processed by a web crawler needs to be improved by using the capabilities of modern parallel processing technologies. To address parallelism and crawling throughput, this work proposes to optimize Crawler4j using the Hadoop MapReduce programming model by parallelizing the processing of large input data. Crawler4j is a web crawler that retrieves useful information about the pages it visits. Crawler4j coupled with the data and computational parallelism of the Hadoop MapReduce programming model improves the throughput and accuracy of web crawling. The experimental results demonstrate that the proposed solution achieves significant improvements in performance and throughput. Hence the proposed approach intends to carve out a new methodology towards optimizing web crawling by achieving significant performance gains.
Algorithms for Large-Scale Astronomical Problems
2013-08-01
implemented as a succession of Hadoop MapReduce jobs and sequential programs written in Java. The sampling and splitting stages are implemented as one MapReduce job; the partitioning and clustering phases make up another job. The merging stage is implemented as a stand-alone sequential Java program that reads the files with the shell information, which were generated by ...
A Discrete Global Grid System Programming Language Using MapReduce
NASA Astrophysics Data System (ADS)
Peterson, P.; Shatz, I.
2016-12-01
A discrete global grid system (DGGS) is a powerful mechanism for storing and integrating geospatial information. As a "pixelization" of the Earth, many image processing techniques lend themselves to the transformation of data values referenced to the DGGS cells. It has been shown that image algebra, as an example, and advanced algebra, like the Fast Fourier Transform, can be used on the DGGS tiling structure for geoprocessing and spatial analysis. MapReduce has been shown to provide advantages for processing and generating large data sets within distributed and parallel computing. The DGGS structure is ideally suited for big distributed Earth data. We proposed that basic expressions could be created to form the atoms of a generalized DGGS language using the MapReduce programming model. We created three very efficient expressions: Selectors (aka filter) - a selection function that generates a set of cells, cell collections, or geometries; Calculators (aka map) - a computational function (including quantization of raw measurements and data sources) that generates values in a DGGS cell; and Aggregators (aka reduce) - a function that generates spatial statistics from cell values within a cell. We found that these three basic MapReduce operations, along with a fourth function, the Iterator, for horizontal and vertical traversal of any DGGS structure, provided simple building blocks resulting in very efficient operations and processes that could be used with any DGGS. We provide examples and a demonstration of their effectiveness using the ISEA3H DGGS on the PYXIS Studio.
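The three expressions line up with the functional trio filter/map/reduce. A toy sketch over an invented cell-to-value table; the cell IDs, values, and function names are assumptions, not PYXIS Studio code:

```python
from functools import reduce

# Toy DGGS layer: cell id -> measured value (hypothetical cells and values).
layer = {"cell-001": 3.2, "cell-002": 7.9, "cell-003": 5.1, "cell-004": 9.4}

def selector(layer, predicate):
    """Selector (filter): choose the cells satisfying a predicate."""
    return {c: v for c, v in layer.items() if predicate(c, v)}

def calculator(layer, fn):
    """Calculator (map): compute a new value in each cell."""
    return {c: fn(v) for c, v in layer.items()}

def aggregator(layer, fn, initial):
    """Aggregator (reduce): derive a spatial statistic from cell values."""
    return reduce(fn, layer.values(), initial)

hot = selector(layer, lambda c, v: v > 5.0)        # cells above a threshold
scaled = calculator(hot, lambda v: v * 10)         # rescale/quantize values
total = aggregator(scaled, lambda acc, v: acc + v, 0.0)
print(hot, scaled, total)
```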
A genetic algorithm-based job scheduling model for big data analytics.
Lu, Qinghua; Li, Shanshan; Zhang, Weishan; Zhang, Lei
Big data analytics (BDA) applications are a new category of software applications that process large amounts of data using scalable parallel processing infrastructure to obtain hidden value. Hadoop is the most mature open-source big data analytics framework, which implements the MapReduce programming model to process big data with MapReduce jobs. Big data analytics jobs are often continuous and not mutually separated. Existing work mainly focuses on executing jobs in sequence, which is often inefficient and consumes substantial energy. In this paper, we propose a genetic algorithm-based job scheduling model for big data analytics applications to improve the efficiency of big data analytics. To implement the job scheduling model, we leverage an estimation module to predict the performance of clusters when executing analytics jobs. We have evaluated the proposed job scheduling model in terms of feasibility and accuracy.
Hybrid cloud and cluster computing paradigms for life science applications.
Qiu, Judy; Ekanayake, Jaliya; Gunarathne, Thilina; Choi, Jong Youl; Bae, Seung-Hee; Li, Hui; Zhang, Bingjing; Wu, Tak-Lon; Ruan, Yang; Ekanayake, Saliya; Hughes, Adam; Fox, Geoffrey
2010-12-21
Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing, especially for parallel data intensive applications. However, they have limited applicability to some areas such as data mining because MapReduce has poor performance on problems with an iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI, leading to a hybrid cloud and cluster environment. This motivates the design and implementation of an open source Iterative MapReduce system Twister. Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability comparisons in several important non-iterative cases. These are linked to MPI applications for final stages of the data analysis. Further we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications. The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment while Twister promises a uniform programming environment for many Life Sciences applications. We used commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments.
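What distinguishes iterative MapReduce is a driver loop that reruns map and reduce over static data while carrying dynamic state between rounds. A k-means-flavored toy in plain Python illustrating the pattern; this is not Twister's API:

```python
import random

random.seed(0)
points = [random.uniform(0, 100) for _ in range(200)]   # static data, loaded once
centroids = [10.0, 50.0, 90.0]                          # dynamic per-iteration state

def mapper(point, centroids):
    # Map: assign the point to its nearest centroid.
    best = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return best, point

def reducer(assigned):
    # Reduce: recompute each centroid as the mean of its assigned points.
    sums, counts = {}, {}
    for cid, p in assigned:
        sums[cid] = sums.get(cid, 0.0) + p
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: sums[cid] / counts[cid] for cid in sums}

# Iterative driver: repeat map/reduce rounds until convergence, the loop a
# conventional one-shot MapReduce job cannot express without relaunching jobs.
for it in range(50):
    new = reducer(mapper(p, centroids) for p in points)
    updated = [new.get(i, centroids[i]) for i in range(len(centroids))]
    if max(abs(a - b) for a, b in zip(updated, centroids)) < 1e-6:
        break
    centroids = updated
print(it, centroids)
```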
A data colocation grid framework for big data medical image processing: backend design
NASA Astrophysics Data System (ADS)
Bao, Shunxing; Huo, Yuankai; Parvathaneni, Prasanna; Plassard, Andrew J.; Bermudez, Camilo; Yao, Yuang; Lyu, Ilwoo; Gokhale, Aniruddha; Landman, Bennett A.
2018-03-01
When processing large medical imaging studies, adopting high performance grid computing resources rapidly becomes important. We recently presented a "medical image processing-as-a-service" grid framework that offers promise in utilizing the Apache Hadoop ecosystem and HBase for data colocation by moving computation close to medical image storage. However, the framework has not yet proven to be easy to use in a heterogeneous hardware environment. Furthermore, the system has not yet been validated when considering a variety of multi-level analyses in medical imaging. Our target design criteria are (1) improving the framework's performance in a heterogeneous cluster, (2) performing population based summary statistics on large datasets, and (3) introducing a table design scheme for rapid NoSQL query. In this paper, we present a heuristic backend interface application program interface (API) design for Hadoop and HBase for Medical Image Processing (HadoopBase-MIP). The API includes: Upload, Retrieve, Remove, Load balancer (for heterogeneous cluster) and MapReduce templates. A dataset summary statistic model is discussed and implemented with the MapReduce paradigm. We introduce an HBase table scheme for fast data query to better utilize the MapReduce model. Briefly, 5153 T1 images were retrieved from a university secure, shared web database and used to empirically access an in-house grid with 224 heterogeneous CPU cores. Three empirical experiments are presented and discussed: (1) a load balancer wall-time improvement of 1.5-fold compared with a framework with a built-in data allocation strategy, (2) a summary statistic model empirically verified on the grid framework and compared with the cluster deployed with a standard Sun Grid Engine (SGE), reducing wall clock time 8-fold and resource time 14-fold, and (3) the proposed HBase table scheme improves MapReduce computation with a 7-fold reduction of wall time compared with a naïve scheme when datasets are relatively small. The source code and interfaces have been made publicly available.
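The described API surface can be pictured as an interface skeleton; the method names follow the paper's list, but every signature and type below is an assumption, not the released code:

```python
from abc import ABC, abstractmethod

class HadoopBaseMIPBackend(ABC):
    """Sketch of the backend API surface described in the paper: upload,
    retrieve, remove, a heterogeneity-aware load balancer, and MapReduce
    templates. Signatures are illustrative assumptions only."""

    @abstractmethod
    def upload(self, image_path: str, row_key: str) -> None:
        """Store an image in HBase under a row key designed for fast scans."""

    @abstractmethod
    def retrieve(self, row_key: str) -> bytes:
        """Fetch image bytes, colocated with the region server holding them."""

    @abstractmethod
    def remove(self, row_key: str) -> None:
        """Delete an image and its metadata."""

    @abstractmethod
    def balance(self, node_capacities: dict) -> dict:
        """Assign regions to nodes in proportion to heterogeneous CPU capacity."""

    @abstractmethod
    def run_mapreduce(self, mapper, reducer, row_key_prefix: str):
        """Template: run mapper/reducer over all rows sharing a key prefix,
        e.g. a population summary-statistics job."""
```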
Efficient and scalable graph similarity joins in MapReduce.
Chen, Yifan; Zhao, Xiang; Xiao, Chuan; Zhang, Weiming; Tang, Jiuyang
2014-01-01
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.
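A plain Bloom filter is enough to show the pruning idea: index one graph's signatures, probe with another's, and send only pairs with sufficient approximate overlap on to GED verification. MGSJoin's spectral Bloom filters additionally carry counts; this sketch omits that:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter (MGSJoin uses spectral Bloom filters, which add
    counters; this sketch only shows the membership-pruning idea)."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

# Index the signatures of one graph; probe with another graph's signatures.
sig_g1 = {"path:ABC", "path:BCD", "star:B3"}    # hypothetical graph signatures
sig_g2 = {"path:ABC", "star:C2"}
bf = BloomFilter()
for s in sig_g1:
    bf.add(s)
overlap = sum(bf.might_contain(s) for s in sig_g2)
# Pairs whose approximate signature overlap is too small are filtered out
# before the expensive GED verification step.
print("candidate" if overlap >= 1 else "pruned", overlap)
```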
BESIII Physical Analysis on Hadoop Platform
NASA Astrophysics Data System (ADS)
Huo, Jing; Zang, Dongsong; Lei, Xiaofeng; Li, Qiang; Sun, Gongxing
2014-06-01
In the past 20 years, computing clusters have been widely used for High Energy Physics data processing. Jobs running on a traditional cluster with a data-to-computing structure have to read large volumes of data via the network to the computing nodes for analysis, making I/O latency a bottleneck of the whole system. The new distributed computing technology based on the MapReduce programming model has many advantages, such as high concurrency, high scalability and high fault tolerance, and it can benefit us in dealing with Big Data. This paper introduces the idea of using the MapReduce model for BESIII physics analysis and presents a new data analysis system structure based on the Hadoop platform, which not only greatly improves the efficiency of data analysis but also reduces the cost of system building. Moreover, this paper establishes an event pre-selection system based on the event-level metadata (TAGs) database to optimize the data analysis procedure.
MapReduce algorithms for inferring gene regulatory networks from time-series microarray data using an information-theoretic approach
Abduallah, Yasser; Turki, Turki; Byron, Kevin; Du, Zongxuan; Cervantes-Cervantes, Miguel; Wang, Jason T L
2017-01-01
Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. To date, numerous algorithms have been developed to infer gene regulatory networks. However, as the number of identified genes increases and the complexity of their interactions is uncovered, networks and their regulatory mechanisms become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to expeditiously analyze copious amounts of experimental data resulting from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here, we propose new MapReduce algorithms for inferring gene regulatory networks on a Hadoop cluster in a cloud environment. These algorithms employ an information-theoretic approach to infer GRNs using time-series microarray data. Experimental results show that our MapReduce program is much faster than an existing tool while achieving slightly better prediction accuracy than the existing tool.
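A hedged sketch of how such an inference job decomposes: each map call scores one candidate regulator-target pair with an information-theoretic measure (here, a coarsely binned mutual information on lag-1 time series), and the reduce step keeps the strongest regulator per target. The data, binning, and lagged scoring are invented for illustration, not the authors' algorithm:

```python
import math
from collections import Counter, defaultdict

# Toy time-series expression profiles (gene -> values over time). Hypothetical.
expr = {
    "geneA": [0.1, 0.4, 0.9, 1.2, 1.6],
    "geneB": [0.2, 0.5, 1.0, 1.1, 1.7],
    "geneC": [1.5, 0.2, 1.4, 0.1, 1.3],
}

def binned(values, bins=2):
    # Discretize values into equal-width bins for a plug-in MI estimate.
    lo, hi = min(values), max(values)
    return [min(int((v - lo) / (hi - lo + 1e-9) * bins), bins - 1) for v in values]

def mutual_information(xs, ys):
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mapper(pair):
    # Map: score regulator at time t against target at time t+1.
    reg, tgt = pair
    score = mutual_information(binned(expr[reg][:-1]), binned(expr[tgt][1:]))
    return tgt, (reg, score)

def reducer(tgt, scored):
    # Reduce: keep the strongest candidate regulator for each target.
    return tgt, max(scored, key=lambda rs: rs[1])

pairs = [(r, t) for r in expr for t in expr if r != t]
grouped = defaultdict(list)
for tgt, rs in map(mapper, pairs):
    grouped[tgt].append(rs)
network = dict(reducer(t, rs) for t, rs in grouped.items())
print(network)
```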
Monte Carlo simulation of photon migration in a cloud computing environment with MapReduce
Pratx, Guillem; Xing, Lei
2011-01-01
Monte Carlo simulation is considered the most reliable method for modeling photon migration in heterogeneous media. However, its widespread use is hindered by the high computational cost. The purpose of this work is to report on our implementation of a simple MapReduce method for performing fault-tolerant Monte Carlo computations in a massively-parallel cloud computing environment. We ported the MC321 Monte Carlo package to Hadoop, an open-source MapReduce framework. In this implementation, Map tasks compute photon histories in parallel while a Reduce task scores photon absorption. The distributed implementation was evaluated on a commercial compute cloud. The simulation time was found to be linearly dependent on the number of photons and inversely proportional to the number of nodes. For a cluster size of 240 nodes, the simulation of 100 billion photon histories took 22 min, a 1258× speed-up compared to the single-threaded Monte Carlo program. The overall computational throughput was 85,178 photon histories per node per second, with a latency of 100 s. The distributed simulation produced the same output as the original implementation and was resilient to hardware failure: the correctness of the simulation was unaffected by the shutdown of 50% of the nodes. PMID:22191916
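The division of labor (Map tasks compute photon histories, a Reduce task scores absorption) in a self-contained toy: photons random-walk and deposit weight in depth bins of a 1-D medium. The physics here is deliberately simplistic and is not MC321:

```python
import random
from collections import Counter

random.seed(42)
MU_A, MU_S = 0.1, 0.9     # toy absorption/scattering coefficients (not MC321's)

def map_task(num_photons, bins=10, step=0.5):
    """Map: simulate a batch of photon histories; emit (depth bin, absorbed weight)."""
    out = Counter()
    for _ in range(num_photons):
        depth, weight = 0.0, 1.0
        while weight > 1e-3:
            depth = max(0.0, depth + step * random.choice([-1, 1]))
            absorbed = weight * MU_A / (MU_A + MU_S)
            out[min(int(depth), bins - 1)] += absorbed    # score absorption
            weight -= absorbed
    return out

def reduce_task(partials):
    """Reduce: sum per-bin absorption across all map tasks."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Each "map task" would run on a separate node; here we just call them in a loop.
partials = [map_task(1000) for _ in range(4)]
print(reduce_task(partials))
```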
Implementation of a parallel protein structure alignment service on cloud.
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.
Using Hadoop MapReduce for Parallel Genetic Algorithms: A Comparison of the Global, Grid and Island Models
Ferrucci, Filomena; Salza, Pasquale; Sarro, Federica
2017-06-29
The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store operations. For this reason, the island model is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
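The island model maps onto MapReduce as repeated rounds: each map task evolves one island independently, and the reduce/migration step exchanges the best individuals around a ring. A pure-Python toy on a one-dimensional objective, not the paper's Hadoop code:

```python
import random

random.seed(1)

def fitness(x):
    return -(x - 3.14) ** 2        # toy objective: maximize near 3.14

def evolve_island(pop, generations=20):
    """Map task: evolve one island independently (mutation + selection)."""
    for _ in range(generations):
        children = [x + random.gauss(0, 0.1) for x in pop]
        pop = sorted(pop + children, key=fitness, reverse=True)[:len(pop)]
    return pop

def migrate(islands, k=1):
    """Reduce step: ring migration of each island's best k individuals."""
    bests = [sorted(isl, key=fitness, reverse=True)[:k] for isl in islands]
    for i, isl in enumerate(islands):
        incoming = bests[(i - 1) % len(islands)]
        isl.sort(key=fitness)       # replace the worst with migrants
        isl[:k] = incoming
    return islands

islands = [[random.uniform(-10, 10) for _ in range(10)] for _ in range(4)]
for round_ in range(5):             # one MapReduce job per round
    islands = [evolve_island(isl) for isl in islands]   # map phase (parallel)
    islands = migrate(islands)                          # reduce/migration phase
print(max((x for isl in islands for x in isl), key=fitness))
```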
Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce
Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan
2016-01-01
A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data.
Mining on Big Data Using Hadoop MapReduce Model
NASA Astrophysics Data System (ADS)
Salman Ahmed, G.; Bhattacharya, Sweta
2017-11-01
Conventional parallel algorithms for mining frequent itemsets need to balance the load of similar data among nodes. This paper reviews that process by analyzing the critical performance drawback of common parallel frequent-itemset mining algorithms. Given a very large dataset, the data partitioning strategies in existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this drawback by developing a data partitioning approach on Hadoop using the MapReduce programming model, with the goal of speeding up parallel frequent-itemset mining on Hadoop clusters. Incorporating a similarity metric and the locality-sensitive hashing technique, the approach places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implemented the approach on a 34-node Hadoop cluster, driven by a range of datasets produced by the IBM Quest market-basket synthetic data generator. Experiments reveal that the approach reduces network and computing loads by eliminating redundant transactions on Hadoop nodes, and considerably outperforms the other models considered.
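The partitioning idea, placing similar transactions together via locality-sensitive hashing, can be sketched with MinHash signatures; the hash count, bucketing, and data below are invented simplifications:

```python
import hashlib

def minhash_signature(items, num_hashes=4):
    """MinHash: per hash function, keep the minimum item hash; similar
    transactions get similar signatures with high probability."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(int(hashlib.sha1(f"{i}:{it}".encode()).hexdigest(), 16)
                       for it in items))
    return tuple(sig)

def lsh_partition(transaction, num_partitions):
    # Bucket the signature, then map buckets onto partitions, so highly
    # similar transactions tend to land in the same data partition.
    return hash(minhash_signature(transaction)) % num_partitions

baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread", "eggs", "jam"},   # near-duplicate of the first basket
    {"bolts", "nuts", "screws"},
]
for b in baskets:
    print(sorted(b), "-> partition", lsh_partition(b, num_partitions=4))
```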
MapReduce SVM Game
Vineyard, Craig M.; Verzi, Stephen J.; James, Conrad D.; ...
2015-08-10
Despite technological advances making computing devices faster, smaller, and more prevalent in today's age, data generation and collection has outpaced data processing capabilities. Simply having more compute platforms does not provide a means of addressing challenging problems in the big data era. Rather, alternative processing approaches are needed and the application of machine learning to big data is hugely important. The MapReduce programming paradigm is an alternative to conventional supercomputing approaches, and requires less stringent data passing constrained problem decompositions. Rather, MapReduce relies upon defining a means of partitioning the desired problem so that subsets may be computed independently and recombined to yield the net desired result. However, not all machine learning algorithms are amenable to such an approach. Game-theoretic algorithms are often innately distributed, consisting of local interactions between players without requiring a central authority, and are iterative by nature rather than requiring extensive retraining. Effectively, a game-theoretic approach to machine learning is well suited for the MapReduce paradigm and provides a novel alternative perspective on addressing the big data problem. In this paper we present a variant of our Support Vector Machine (SVM) Game classifier which may be used in a distributed manner, and show an illustrative example of applying this algorithm.
The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce
NASA Astrophysics Data System (ADS)
Chen, Xi; Zhou, Liqing
2015-12-01
With the development of satellite remote sensing technology and the growth of remote sensing image data, traditional remote sensing image segmentation technology cannot meet the processing and storage requirements of massive remote sensing imagery. This article applies cloud computing and parallel computing technology to the remote sensing image segmentation process, building a cheap and efficient computer cluster system that parallelizes the mean shift segmentation algorithm using the MapReduce model. This not only preserves segmentation quality but also improves segmentation speed, better meeting real-time requirements. The MapReduce-based parallel mean shift segmentation algorithm thus shows clear significance and practical value.
Center for Technology for Advanced Scientific Componet Software (TASCS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Govindaraju, Madhusudhan
Advanced Scientific Computing Research Computer Science FY 2010 Report Center for Technology for Advanced Scientific Component Software: Distributed CCA State University of New York, Binghamton, NY, 13902 Summary The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB. We also worked on the design and implementation of modules that will be required for the emerging MapReduce model to be effective for scientific applications. Finally, we focused on optimizing the processing of scientific datasets on multi-core processors. Research Details We worked on the following research projects, which we are applying to CCA-based scientific applications. 1. Non-Hydrostatic Hydrodynamics: Non-hydrostatic hydrodynamics are significantly more accurate at modeling internal waves that may be important in lake ecosystems. Non-hydrostatic codes, however, are significantly more computationally expensive, often prohibitively so. We have worked with Chin Wu at the University of Wisconsin to parallelize non-hydrostatic code. We have obtained a maximum speedup of about 26 times. Although this is significant progress, we hope to improve the performance further, such that it becomes a practical alternative to hydrostatic codes. 2. Model-coupling for water-based ecosystems: To answer pressing questions about water resources requires that physical models (hydrodynamics) be coupled with biological and chemical models. Most hydrodynamics codes are written in Fortran, however, while most ecologists work in MATLAB. This disconnect creates a great barrier. To address this, we are working on a model coupling interface that will allow biogeochemical computations written in MATLAB to couple with Fortran codes. This will greatly improve the productivity of ecosystem scientists. 3. Low-Overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications: Since its inception, MapReduce has frequently been associated with Hadoop and large-scale datasets. Its deployment at Amazon in the cloud, and its applications at Yahoo! for large-scale distributed document indexing and database building, among other tasks, have thrust MapReduce to the forefront of the data processing application domain. The applicability of the paradigm, however, extends far beyond its use with data-intensive applications and disk-based systems, and can also be brought to bear in processing small but CPU-intensive distributed applications. MapReduce, however, carries its own burdens. Through experiments using Hadoop in the context of diverse applications, we uncovered latencies and delay conditions potentially inhibiting the expected performance of a parallel execution in CPU-intensive applications. Furthermore, as it currently stands, MapReduce is favored for data-centric applications, and as such tends to be applied solely to disk-based applications. The paradigm falls short in bringing its novelty to diskless systems dedicated to in-memory applications, and to compute-intensive programs processing much smaller data but requiring intensive computations.
In this project, we focused both on the performance of processing large-scale hierarchical data in distributed scientific applications and on the processing of smaller but demanding input sizes primarily used in diskless, memory-resident I/O systems. We designed LEMO-MR [1], a low-overhead, elastic MapReduce implementation with on-demand fault tolerance, configurable for in-memory applications and optimized for both on-disk and in-memory workloads. We conducted experiments to identify not only the necessary components of this model, but also the trade-offs and factors to be considered. We have initial results showing the efficacy of our implementation in terms of the potential speedup that can be achieved for representative data sets used by cloud applications. We have quantified the performance gains exhibited by our MapReduce implementation over Apache Hadoop in a compute-intensive environment. 4. Cache Performance Optimization for Processing XML and HDF-based Application Data on Multi-core Processors: It is important to design and develop scientific middleware libraries to harness the opportunities presented by emerging multi-core processors. Implementations of scientific middleware and applications that do not adapt to the programming paradigm when executing on emerging processors can severely impact the overall performance. In this project, we focused on the utilization of the L2 cache, which is a critical shared resource on chip multiprocessors (CMP). The access pattern of the shared L2 cache, which is dependent on how the application schedules and assigns processing work to each thread, can either enhance or hurt the ability to hide memory latency on a multi-core processor. Therefore, while processing scientific datasets such as HDF5, it is essential to conduct fine-grained analysis of cache utilization to inform scheduling decisions in multi-threaded programming. In this project, using the TAU toolkit for performance feedback from dual- and quad-core machines, we conducted performance analysis and made recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of scientific applications that requires processing of HDF5 data. In particular, we quantified the gains associated with the use of the adaptations we have made to the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time [2]. References: 1. Zacharia Fadika, Madhusudhan Govindaraju, "MapReduce Implementation for Memory-Based and Processing Intensive Applications", accepted in 2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, USA, Nov 30 - Dec 3, 2010. 2. Rajdeep Bhowmik, Madhusudhan Govindaraju, "Cache Performance Optimization for Processing XML-based Application Data on Multi-core Processors", in proceedings of The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 17-20, 2010, Melbourne, Victoria, Australia. Contact Information: Madhusudhan Govindaraju, Binghamton University, State University of New York (SUNY), mgovinda@cs.binghamton.edu, Phone: 607-777-4904
Reliable, Memory Speed Storage for Cluster Computing Frameworks
2014-06-16
specification API that can capture computations in many of today's popular data-parallel computing models, e.g., MapReduce and SQL. We also ported the Hadoop ... today's big data workloads: • Immutable data: Data is immutable once written, since dominant underlying storage systems, such as HDFS [3], only support ... network transfers, so reads can be data-local. • Program size vs. data size: In big data processing, the same operation is repeatedly applied on massive
Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun
2018-01-01
The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance.
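The Otsu step referred to here is a standard histogram method. The following self-contained sketch computes it, deriving the Canny low threshold as a fixed fraction of the high one (a common heuristic, not necessarily the paper's exact scheme):

    class OtsuThreshold {
        // Returns the gray level (0..255) that maximizes between-class variance.
        static int otsu(int[] hist) {
            long total = 0, sumAll = 0;
            for (int i = 0; i < 256; i++) { total += hist[i]; sumAll += (long) i * hist[i]; }
            long sumB = 0, wB = 0;
            double best = -1;
            int threshold = 0;
            for (int t = 0; t < 256; t++) {
                wB += hist[t];                        // background weight
                if (wB == 0) continue;
                long wF = total - wB;                 // foreground weight
                if (wF == 0) break;
                sumB += (long) t * hist[t];
                double mB = (double) sumB / wB;       // background mean
                double mF = (double) (sumAll - sumB) / wF; // foreground mean
                double between = (double) wB * wF * (mB - mF) * (mB - mF);
                if (between > best) { best = between; threshold = t; }
            }
            return threshold;
        }
    }

    // Usage: int high = OtsuThreshold.otsu(hist); int low = high / 2;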
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
2010-01-01
Background Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. Description An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date. Conclusions Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many data analysis algorithms. PMID:21210976
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.
Taylor, Ronald C
2010-12-21
Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date. Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many data analysis algorithms.
Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce.
Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan
2017-01-01
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
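Halvade-RNA's own source code is the definitive reference; purely to illustrate the general pattern of driving existing tools from map tasks, a map method might shell out to the aligner roughly as follows (the STAR invocation and scratch paths are illustrative placeholders):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class AlignChunkMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text chunkId, Text localFastqPath, Context ctx)
                throws IOException, InterruptedException {
            String prefix = "/tmp/" + chunkId + "_";   // placeholder scratch location
            // Run the external aligner on this task's local chunk of reads.
            Process aligner = new ProcessBuilder(
                    "STAR", "--readFilesIn", localFastqPath.toString(),
                    "--outFileNamePrefix", prefix)
                    .redirectErrorStream(true)
                    .start();
            if (aligner.waitFor() != 0)
                throw new IOException("aligner failed on chunk " + chunkId);
            // Emit the per-chunk alignment location for downstream stages.
            ctx.write(chunkId, new Text(prefix + "Aligned.out.sam"));
        }
    }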
Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan
2016-01-01
A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform. The PSO algorithm was used to optimize the BP neural network’s initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data. PMID:27304987
Remote sensing image segmentation based on Hadoop cloud platform
NASA Astrophysics Data System (ADS)
Li, Jie; Zhu, Lingling; Cao, Fubin
2018-01-01
To address the slow speed and poor real-time performance of remote sensing image segmentation, this paper studies remote sensing image segmentation on the Hadoop platform. Based on an analysis of the structural characteristics of the Hadoop cloud platform and its MapReduce programming component, it proposes an image segmentation method that combines OpenCV with the Hadoop cloud platform. First, the MapReduce image processing model for the Hadoop cloud platform is designed: the image input and output formats are customized, and the method for splitting the data file is rewritten. Then the Mean Shift image segmentation algorithm is implemented. Finally, a segmentation experiment is performed on remote sensing images and compared against a MATLAB implementation of the Mean Shift algorithm on the same images. The experimental results show that, while maintaining good segmentation quality, the segmentation speed of the Hadoop-based approach is greatly improved compared with single-machine MATLAB segmentation, and the overall effectiveness of image segmentation improves considerably.
Application-Level Interoperability Across Grids and Clouds
NASA Astrophysics Data System (ADS)
Jha, Shantenu; Luckow, Andre; Merzky, Andre; Erdely, Miklos; Sehgal, Saurabh
Application-level interoperability is defined as the ability of an application to utilize multiple distributed heterogeneous resources. Such interoperability is becoming increasingly important with increasing volumes of data, multiple sources of data, as well as resource types. The primary aim of this chapter is to understand different ways in which application-level interoperability can be provided across distributed infrastructure. We achieve this by (i) using the canonical wordcount application, based on an enhanced version of MapReduce that scales out across clusters, clouds, and HPC resources, (ii) establishing how SAGA enables the execution of the wordcount application using MapReduce and other programming models such as Sphere concurrently, and (iii) demonstrating the scale-out of ensemble-based biomolecular simulations across multiple resources. We show user-level control of the relative placement of compute and data and also provide simple performance measures and analysis of SAGA-MapReduce when using multiple, different, heterogeneous infrastructures concurrently for the same problem instance. Finally, we discuss Azure and some of the system-level abstractions that it provides and show how it is used to support ensemble-based biomolecular simulations.
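For readers unfamiliar with the canonical wordcount used throughout this chapter, its standard Hadoop Java form is sketched below; SAGA-MapReduce exposes its own API, so this is a generic illustration rather than the chapter's code.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : line.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE); // one (word, 1) pair per token
            }
        }
    }

    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get(); // add up the 1s per word
            ctx.write(word, new IntWritable(sum));
        }
    }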
Wang, Min; Tian, Yun
2018-01-01
The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance. PMID:29861711
MapReduce Based Parallel Neural Networks in Enabling Large Scale Machine Learning
Yang, Jie; Huang, Yuan; Xu, Lixiong; Li, Siguang; Qi, Man
2015-01-01
Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow in computation, especially when the size of the data is large. Nowadays, big data has gained momentum from both industry and academia. To fulfill the potential of ANNs for big data applications, the computation process must be sped up. For this purpose, this paper parallelizes neural networks based on MapReduce, which has become a major computing model for facilitating data-intensive applications. Three data-intensive scenarios are considered in the parallelization process in terms of the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated in an experimental MapReduce computer cluster from the aspects of accuracy in classification and efficiency in computation. PMID:26681933
MapReduce Based Parallel Neural Networks in Enabling Large Scale Machine Learning.
Liu, Yang; Yang, Jie; Huang, Yuan; Xu, Lixiong; Li, Siguang; Qi, Man
2015-01-01
Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow in computation, especially when the size of the data is large. Nowadays, big data has gained momentum from both industry and academia. To fulfill the potential of ANNs for big data applications, the computation process must be sped up. For this purpose, this paper parallelizes neural networks based on MapReduce, which has become a major computing model for facilitating data-intensive applications. Three data-intensive scenarios are considered in the parallelization process in terms of the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated in an experimental MapReduce computer cluster from the aspects of accuracy in classification and efficiency in computation.
Large-scale parallel genome assembler over cloud computing environment.
Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong
2017-06-01
The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance both require further research. In this paper, we present a de Bruijn graph-oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over a traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of a traditional HPC cluster.
MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce
2015-01-01
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity as real-time and streaming data in a variety of formats. These characteristics give rise to challenges in their modeling, computation, and processing. Hadoop MapReduce (MR) is a well-known data-intensive distributed processing framework that uses the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. It uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Analysis of complexity and qualitative results shows significant performance improvement. PMID:26305223
MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce.
Idris, Muhammad; Hussain, Shujaat; Siddiqi, Muhammad Hameed; Hassan, Waseem; Syed Muhammad Bilal, Hafiz; Lee, Sungyoung
2015-01-01
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity as real-time and streaming data in a variety of formats. These characteristics give rise to challenges in their modeling, computation, and processing. Hadoop MapReduce (MR) is a well-known data-intensive distributed processing framework that uses the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. It uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Analysis of complexity and qualitative results shows significant performance improvement.
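As a toy illustration of the multi-key idea (not MRPack's actual implementation), the sketch below multiplexes two algorithms, word counting and a token-length histogram, through a single shuffle by tagging each intermediate key with an algorithm prefix:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class MultiAlgoMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : line.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                ctx.write(new Text("WC#" + tok), ONE);           // algorithm 1: word count
                ctx.write(new Text("LEN#" + tok.length()), ONE); // algorithm 2: length histogram
            }
        }
    }

    class DispatchReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text taggedKey, Iterable<LongWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vals) sum += v.get();
            // Both toy algorithms happen to aggregate by summation; in general
            // the reducer would branch on the key's tag to run per-algorithm logic.
            ctx.write(taggedKey, new LongWritable(sum));
        }
    }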
Chung, Wei-Chun; Chen, Chien-Chih; Ho, Jan-Ming; Lin, Chung-Yen; Hsu, Wen-Lian; Wang, Yu-Chun; Lee, D T; Lai, Feipei; Huang, Chih-Wei; Chang, Yu-Jung
2014-01-01
Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce. We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards, i.e., Deploy, Operate, and Extend wizards. Deploy wizard is designed to aid the system administrator to deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and initiates the service automatically. Operate wizard allows the user to run a MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using Extend wizard. CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the source code of CloudDOE to further incorporate more MapReduce bioinformatics tools into CloudDOE and support next-generation big data open source tools, e.g., Hadoop BigTop and Spark. CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/.
Chung, Wei-Chun; Chen, Chien-Chih; Ho, Jan-Ming; Lin, Chung-Yen; Hsu, Wen-Lian; Wang, Yu-Chun; Lee, D. T.; Lai, Feipei; Huang, Chih-Wei; Chang, Yu-Jung
2014-01-01
Background Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce. Results We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards, i.e., Deploy, Operate, and Extend wizards. Deploy wizard is designed to aid the system administrator to deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and initiates the service automatically. Operate wizard allows the user to run a MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using Extend wizard. Conclusions CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the source code of CloudDOE to further incorporate more MapReduce bioinformatics tools into CloudDOE and support next-generation big data open source tools, e.g., Hadoop BigTop and Spark. Availability: CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/. PMID:24897343
A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing.
Li, Zhixin; Su, Dandan; Zhu, Haijiang; Li, Wei; Zhang, Fan; Li, Ruirui
2017-01-08
Synthetic Aperture Radar (SAR) raw data simulation is a fundamental problem in radar system design and imaging algorithm research. The growth of surveying swath and resolution results in a significant increase in data volume and simulation period, which can be considered to be a comprehensive data-intensive and computing-intensive issue. Although several high performance computing (HPC) methods have demonstrated their potential for accelerating simulation, the input/output (I/O) bottleneck of huge raw data has not been eased. In this paper, we propose a cloud computing based SAR raw data simulation algorithm, which employs the MapReduce model to accelerate the raw data computing and the Hadoop distributed file system (HDFS) for fast I/O access. The MapReduce model is designed for the irregular parallel accumulation of raw data simulation, which greatly reduces the parallel efficiency of graphics processing unit (GPU) based simulation methods. In addition, three kinds of optimization strategies are put forward from the aspects of programming model, HDFS configuration and scheduling. The experimental results show that the cloud computing based algorithm achieves 4× speedup over the baseline serial approach in an 8-node cloud environment, and each optimization strategy can improve performance by about 20%. This work proves that the proposed cloud algorithm is capable of solving the computing-intensive and data-intensive issues in SAR raw data simulation, and is easily extended to large scale computing to achieve higher acceleration.
Design and development of a medical big data processing system based on Hadoop.
Yao, Qin; Tian, Yu; Li, Peng-Fei; Tian, Li-Li; Qian, Yang-Ming; Li, Jing-Song
2015-03-01
Secondary use of medical big data is increasingly popular in healthcare services and clinical research. Understanding the logic behind medical big data demonstrates tendencies in hospital information technology and shows great significance for hospital information systems that are designing and expanding their services. Big data has four characteristics--Volume, Variety, Velocity and Value (the 4 Vs)--that make traditional systems incapable of processing these data using standalone machines. Apache Hadoop MapReduce is a promising software framework for developing applications that process vast amounts of data in parallel with large clusters of commodity hardware in a reliable, fault-tolerant manner. With the Hadoop framework and MapReduce application program interface (API), we can more easily develop our own MapReduce applications to run on a Hadoop framework that can scale up from a single node to thousands of machines. This paper investigates a practical case of a Hadoop-based medical big data processing system. We developed this system to intelligently process medical big data and uncover some features of hospital information system user behaviors. This paper studies user behaviors regarding various data produced by different hospital information systems for daily work. In this paper, we also built a five-node Hadoop cluster to execute distributed MapReduce algorithms. Our distributed algorithms show promise in facilitating efficient data processing with medical big data in healthcare services and clinical research compared with single nodes. Additionally, with medical big data analytics, we can design our hospital information systems to be much more intelligent and easier to use by making personalized recommendations.
Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment.
Meng, Bowen; Pratx, Guillem; Xing, Lei
2011-12-01
Four-dimensional CT (4DCT) and cone beam CT (CBCT) are widely used in radiation therapy for accurate tumor target definition and localization. However, high-resolution and dynamic image reconstruction is computationally demanding because of the large amount of data processed. Efficient use of these imaging techniques in the clinic requires high-performance computing. The purpose of this work is to develop a novel ultrafast, scalable and reliable image reconstruction technique for 4D CBCT/CT using a parallel computing framework called MapReduce. We show the utility of MapReduce for solving large-scale medical physics problems in a cloud computing environment. In this work, we accelerated the Feldkamp-Davis-Kress (FDK) algorithm by porting it to Hadoop, an open-source MapReduce implementation. Gated phases from a 4DCT scan were reconstructed independently. Following the MapReduce formalism, Map functions were used to filter and backproject subsets of projections, and a Reduce function to aggregate those partial backprojections into the whole volume. MapReduce automatically parallelized the reconstruction process on a large cluster of computer nodes. As a validation, reconstruction of a digital phantom and an acquired CatPhan 600 phantom was performed on a commercial cloud computing environment using the proposed 4D CBCT/CT reconstruction algorithm. Speedup of reconstruction time is found to be roughly linear with the number of nodes employed. For instance, greater than 10 times speedup was achieved using 200 nodes for all cases, compared to the same code executed on a single machine. Without modifying the code, faster reconstruction is readily achievable by allocating more nodes in the cloud computing environment. Root mean square error between the images obtained using MapReduce and a single-threaded reference implementation was on the order of 10^-7. Our study also proved that cloud computing with MapReduce is fault tolerant: the reconstruction completed successfully with identical results even when half of the nodes were manually terminated in the middle of the process. An ultrafast, reliable and scalable 4D CBCT/CT reconstruction method was developed using the MapReduce framework. Unlike other parallel computing approaches, the parallelization and speedup required little modification of the original reconstruction code. MapReduce provides an efficient and fault tolerant means of solving large-scale computing problems in a cloud computing environment.
Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment
Meng, Bowen; Pratx, Guillem; Xing, Lei
2011-01-01
Purpose: Four-dimensional CT (4DCT) and cone beam CT (CBCT) are widely used in radiation therapy for accurate tumor target definition and localization. However, high-resolution and dynamic image reconstruction is computationally demanding because of the large amount of data processed. Efficient use of these imaging techniques in the clinic requires high-performance computing. The purpose of this work is to develop a novel ultrafast, scalable and reliable image reconstruction technique for 4D CBCT/CT using a parallel computing framework called MapReduce. We show the utility of MapReduce for solving large-scale medical physics problems in a cloud computing environment. Methods: In this work, we accelerated the Feldkamp–Davis–Kress (FDK) algorithm by porting it to Hadoop, an open-source MapReduce implementation. Gated phases from a 4DCT scan were reconstructed independently. Following the MapReduce formalism, Map functions were used to filter and backproject subsets of projections, and a Reduce function to aggregate those partial backprojections into the whole volume. MapReduce automatically parallelized the reconstruction process on a large cluster of computer nodes. As a validation, reconstruction of a digital phantom and an acquired CatPhan 600 phantom was performed on a commercial cloud computing environment using the proposed 4D CBCT/CT reconstruction algorithm. Results: Speedup of reconstruction time is found to be roughly linear with the number of nodes employed. For instance, greater than 10 times speedup was achieved using 200 nodes for all cases, compared to the same code executed on a single machine. Without modifying the code, faster reconstruction is readily achievable by allocating more nodes in the cloud computing environment. Root mean square error between the images obtained using MapReduce and a single-threaded reference implementation was on the order of 10^-7. Our study also proved that cloud computing with MapReduce is fault tolerant: the reconstruction completed successfully with identical results even when half of the nodes were manually terminated in the middle of the process. Conclusions: An ultrafast, reliable and scalable 4D CBCT/CT reconstruction method was developed using the MapReduce framework. Unlike other parallel computing approaches, the parallelization and speedup required little modification of the original reconstruction code. MapReduce provides an efficient and fault tolerant means of solving large-scale computing problems in a cloud computing environment. PMID:22149842
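The Map/Reduce split described in both versions of this abstract can be pictured with a small sketch. Assuming mappers emit partial backprojections as serialized float blocks keyed by sub-volume index (an assumed data layout for illustration, not the paper's actual code), the reduce side is a voxel-wise sum:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.FloatBuffer;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    class BackprojectionReducer
            extends Reducer<IntWritable, BytesWritable, IntWritable, BytesWritable> {
        @Override
        protected void reduce(IntWritable blockId, Iterable<BytesWritable> partials, Context ctx)
                throws IOException, InterruptedException {
            float[] sum = null;
            for (BytesWritable partial : partials) {
                FloatBuffer fb = ByteBuffer.wrap(partial.copyBytes()).asFloatBuffer();
                if (sum == null) sum = new float[fb.remaining()];
                for (int i = 0; i < sum.length; i++)
                    sum[i] += fb.get(i); // voxel-wise accumulation of partial volumes
            }
            ByteBuffer out = ByteBuffer.allocate(sum.length * Float.BYTES);
            out.asFloatBuffer().put(sum);
            ctx.write(blockId, new BytesWritable(out.array()));
        }
    }

Because addition is associative, the same class can serve as a combiner, shrinking shuffle traffic before the final aggregation.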
MapReduce Based Parallel Bayesian Network for Manufacturing Quality Control
NASA Astrophysics Data System (ADS)
Zheng, Mao-Kuan; Ming, Xin-Guo; Zhang, Xian-Yu; Li, Guo-Ming
2017-09-01
Increasing complexity of industrial products and manufacturing processes has challenged conventional statistics-based quality management approaches under dynamic production conditions. An integrated Bayesian network and big data analytics approach for manufacturing process quality analysis and control is proposed. Based on the Hadoop distributed architecture and the MapReduce parallel computing model, the large volume and variety of quality-related data generated during the manufacturing process can be handled. Artificial intelligence algorithms, including Bayesian network learning, classification and reasoning, are embedded into the Reduce process. Relying on the Bayesian network's ability to deal with dynamic and uncertain problems and the parallel computing power of MapReduce, a Bayesian network of factors impacting quality is built from the prior probability distribution and modified with the posterior probability distribution. A case study on hull segment manufacturing precision management for ship and offshore platform building shows that computing speed accelerates almost in direct proportion to the number of computing nodes. It is also shown that the proposed model is feasible for locating and reasoning about root causes, forecasting manufacturing outcomes, and making intelligent decisions for precision problem solving. The integration of big data analytics and the BN method offers a whole new perspective on manufacturing quality control.
Interactive Query Processing in Big Data Systems: A Cross Industry Study of MapReduce Workloads
2012-04-02
invite cluster operators and the broader data management community to share additional knowledge about their MapReduce workloads. 9. ACKNOWLEDGMENTS...against real-life production MapReduce workloads. Knowledge of such workloads is currently limited to a handful of technology companies [19, 8, 48, 41...database management insights would benefit from checking workload assumptions against empirical measurements. The broad spectrum of workloads analyzed allows
Design Insights for MapReduce from Diverse Production Workloads
2012-01-25
different industries [5]. Consequently, there is a need to develop systematic knowledge of MapReduce behavior at both established users within technol...relevant to MapReduce-like systems that combine data movements and computation. 5.2 Task granularity Many MapReduce workload management mechanisms make...executes the jobs given, versus what the jobs actually are. MapReduce workload managers currently optimize execution scheduling and placement
MROrchestrator: A Fine-Grained Resource Orchestration Framework for MapReduce Clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sharma, Bikash; Prabhakar, Ramya; Kandemir, Mahmut
2012-01-01
Efficient resource management in data centers and clouds running large distributed data processing frameworks like MapReduce is crucial for enhancing the performance of hosted applications and boosting resource utilization. However, existing resource scheduling schemes in Hadoop MapReduce allocate resources at the granularity of fixed-size, static portions of nodes, called slots. In this work, we show that MapReduce jobs have widely varying demands for multiple resources, making the static and fixed-size slot-level resource allocation a poor choice both from the performance and resource utilization standpoints. Furthermore, lack of co-ordination in the management of multiple resources across nodes prevents dynamic slot reconfiguration, and leads to resource contention. Motivated by this, we propose MROrchestrator, a MapReduce resource Orchestrator framework, which can dynamically identify resource bottlenecks, and resolve them through fine-grained, co-ordinated, and on-demand resource allocations. We have implemented MROrchestrator on two 24-node native and virtualized Hadoop clusters. Experimental results with a suite of representative MapReduce benchmarks demonstrate up to 38% reduction in job completion times, and up to 25% increase in resource utilization. We further show how popular resource managers like NGM and Mesos, when augmented with MROrchestrator, can boost their performance.
Large-scale seismic waveform quality metric calculation using Hadoop
NASA Astrophysics Data System (ADS)
Magana-Zook, S.; Gaylord, J. M.; Knapp, D. R.; Dodge, D. A.; Ruppert, S. D.
2016-09-01
In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of 0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. These experiments were conducted multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing
Li, Zhixin; Su, Dandan; Zhu, Haijiang; Li, Wei; Zhang, Fan; Li, Ruirui
2017-01-01
Synthetic Aperture Radar (SAR) raw data simulation is a fundamental problem in radar system design and imaging algorithm research. The growth of surveying swath and resolution results in a significant increase in data volume and simulation period, which can be considered to be a comprehensive data-intensive and computing-intensive issue. Although several high performance computing (HPC) methods have demonstrated their potential for accelerating simulation, the input/output (I/O) bottleneck of huge raw data has not been eased. In this paper, we propose a cloud computing based SAR raw data simulation algorithm, which employs the MapReduce model to accelerate the raw data computing and the Hadoop distributed file system (HDFS) for fast I/O access. The MapReduce model is designed for the irregular parallel accumulation of raw data simulation, which greatly reduces the parallel efficiency of graphics processing unit (GPU) based simulation methods. In addition, three kinds of optimization strategies are put forward from the aspects of programming model, HDFS configuration and scheduling. The experimental results show that the cloud computing based algorithm achieves 4× speedup over the baseline serial approach in an 8-node cloud environment, and each optimization strategy can improve performance by about 20%. This work proves that the proposed cloud algorithm is capable of solving the computing-intensive and data-intensive issues in SAR raw data simulation, and is easily extended to large scale computing to achieve higher acceleration. PMID:28075343
Preliminary Evaluation of MapReduce for High-Performance Climate Data Analysis
NASA Technical Reports Server (NTRS)
Duffy, Daniel Q.; Schnase, John L.; Thompson, John H.; Freeman, Shawn M.; Clune, Thomas L.
2012-01-01
MapReduce is an approach to high-performance analytics that may be useful to data-intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. We are particularly interested in the potential of MapReduce to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we are prototyping a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. Our initial focus has been on averaging operations over arbitrary spatial and temporal extents within Modern-Era Retrospective Analysis for Research and Applications (MERRA) data. Preliminary results suggest this approach can improve efficiencies within data-intensive analytic workflows.
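The averaging operations mentioned here rest on a standard MapReduce idiom worth spelling out: emit (sum, count) pairs rather than per-split averages so that partial results combine associatively. The sketch below assumes a simple cellId,value text layout for illustration and is not the MERRA prototype's code:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class MeanMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text row, Context ctx)
                throws IOException, InterruptedException {
            String[] f = row.toString().split(",");           // assumed layout: cellId,value
            ctx.write(new Text(f[0]), new Text(f[1] + ",1")); // each value contributes (sum, count=1)
        }
    }

    class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text cell, Iterable<Text> parts, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (Text p : parts) {
                String[] sc = p.toString().split(",");
                sum += Double.parseDouble(sc[0]);
                count += Long.parseLong(sc[1]);
            }
            ctx.write(cell, new DoubleWritable(sum / count)); // divide once, at the end
        }
    }

A combiner may pre-aggregate the (sum, count) pairs, but must not emit intermediate averages, which would weight splits incorrectly.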
Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che
2014-01-16
To avoid the tedious task of reconstructing gene networks by experimentally testing the possible interactions between genes, it has become a trend to adopt automated reverse-engineering procedures instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution; the most popular is the MapReduce programming model, a fault-tolerant framework for implementing parallel algorithms that infer large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use the well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high-quality solutions can be obtained within a relatively short time. This integrated approach is a promising way for inferring large networks.
2014-01-01
Background To avoid the tedious task of reconstructing gene networks by experimentally testing the possible interactions between genes, it has become a trend to adopt automated reverse-engineering procedures instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution; the most popular is the MapReduce programming model, a fault-tolerant framework for implementing parallel algorithms that infer large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use the well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high-quality solutions can be obtained within a relatively short time. This integrated approach is a promising way for inferring large networks. PMID:24428926
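The MapReduce fit for such population-based methods comes from the independence of fitness evaluations. A minimal sketch of that map phase follows, where evaluate() is a hypothetical stand-in for simulating a candidate network against the gene expression profiles:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class FitnessMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text candidate, Context ctx)
                throws IOException, InterruptedException {
            // One serialized candidate network per input line; scoring each is
            // independent, which makes the evaluation embarrassingly parallel.
            ctx.write(new Text(candidate.toString()),
                      new DoubleWritable(evaluate(candidate.toString())));
        }

        private double evaluate(String encodedNetwork) {
            return 0.0; // hypothetical: simulate the network and score its fit
        }
    }

The driver would then collect the scored population, apply selection and the GA-PSO update steps, and launch the next generation as a new job.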
A case study of tuning MapReduce for efficient Bioinformatics in the cloud
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shi, Lizhen; Wang, Zhong; Yu, Weikuan
The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide ease of access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. More importantly, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. In this paper, we aim to minimize the cloud computing cost spent on bioinformatics data analysis by optimizing the extracted significant Hadoop parameters. When using MapReduce-based bioinformatics tools in the cloud, the default settings often lead to resource underutilization and wasteful expenses. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our study case. Experimental results show that, with the fine-tuned parameters, we achieve a total of 4× speedup compared with the original performance (using the default settings). Finally, this paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud, and documents the key parameters that could lead to significant performance benefits.
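To give a flavor of the knobs such tuning involves, the driver fragment below sets several standard Hadoop 2.x parameters. The parameter names are real, but the values are placeholders rather than the paper's tuned settings, and the job wiring (mapper, reducer, paths) is elided:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    class TunedKmerDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.task.io.sort.mb", 512);          // larger map-side sort buffer
            conf.setInt("mapreduce.map.memory.mb", 4096);           // container memory per map task
            conf.setBoolean("mapreduce.map.output.compress", true); // compress shuffle traffic
            conf.setInt("mapreduce.job.reduces", 32);               // size reducer count to the cluster
            Job job = Job.getInstance(conf, "kmer-count");          // hypothetical job name
            // job.setMapperClass(...), job.setReducerClass(...), and
            // input/output paths would be configured here.
        }
    }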
Large-scale seismic waveform quality metric calculation using Hadoop
Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; ...
2016-05-27
In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
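For comparison with the MapReduce formulation, a simple per-waveform metric is a short pipeline in Spark's Java API. The sketch below computes peak absolute amplitude per waveform under an assumed waveformId,sample text layout; it is illustrative, not the pipeline used in this work:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    class WaveformMetrics {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("waveform-metrics");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // Assumed input layout for illustration: one "waveformId,sample" pair per line.
            JavaPairRDD<String, Double> peak = sc.textFile("hdfs:///waveforms/*.csv")
                .mapToPair(line -> {
                    String[] f = line.split(",");
                    return new Tuple2<>(f[0], Math.abs(Double.parseDouble(f[1])));
                })
                .reduceByKey(Math::max); // peak absolute amplitude per waveform
            peak.saveAsTextFile("hdfs:///metrics/peak_amplitude");
            sc.stop();
        }
    }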
When cloud computing meets bioinformatics: a review.
Zhou, Shuigeng; Liao, Ruiqi; Guan, Jihong
2013-10-01
In the past decades, with the rapid development of high-throughput technologies, biology research has generated an unprecedented amount of data. In order to store and process such a great amount of data, cloud computing and MapReduce were applied to many fields of bioinformatics. In this paper, we first introduce the basic concepts of cloud computing and MapReduce, and their applications in bioinformatics. We then highlight some problems challenging the applications of cloud computing and MapReduce to bioinformatics. Finally, we give a brief guideline for using cloud computing in biology research.
Leverage hadoop framework for large scale clinical informatics applications.
Dong, Xiao; Bahroos, Neil; Sadhu, Eugene; Jackson, Tommie; Chukhman, Morris; Johnson, Robert; Boyd, Andrew; Hynes, Denise
2013-01-01
In this manuscript, we present our experiences using the Apache Hadoop framework for high data volume and computationally intensive applications, and discuss some best practice guidelines in a clinical informatics setting. There are three main aspects in our approach: (a) process and integrate diverse, heterogeneous data sources using standard Hadoop programming tools and customized MapReduce programs; (b) after fine-grained aggregate results are obtained, perform data analysis using the Mahout data mining library; (c) leverage the column oriented features in HBase for patient centric modeling and complex temporal reasoning. This framework provides a scalable solution to meet the rapidly increasing, imperative "Big Data" needs of clinical and translational research. The intrinsic advantage of fault tolerance, high availability and scalability of Hadoop platform makes these applications readily deployable at the enterprise level cluster environment.
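As a sketch of point (c), the snippet below shows patient-centric row-key modeling against HBase using the happybase Python client. The table name, column family, and row-key layout are invented for illustration; the abstract does not specify the actual schema.

    # A minimal sketch of patient-centric HBase modeling with the
    # happybase client. Table name, column family, and event fields
    # are hypothetical.
    import happybase

    connection = happybase.Connection('hbase-thrift-host')  # Thrift gateway, assumed
    table = connection.table('patient_events')

    # Row key: patient id + event timestamp, so one patient's history
    # is stored contiguously and can be range-scanned cheaply.
    row_key = b'patient42#20130115T0930'
    table.put(row_key, {
        b'obs:code': b'ICD9:250.00',   # diagnosis code (hypothetical)
        b'obs:source': b'encounter',   # originating data source
    })

    # Temporal reasoning then becomes a prefix scan over one patient.
    for key, data in table.scan(row_prefix=b'patient42#'):
        print(key, data)

Because HBase stores rows sorted by key, prefixing the key with the patient identifier keeps each patient's timeline contiguous, which is what makes such temporal scans inexpensive.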
Environmental models are products of the computer architecture and software tools available at the time of development. Scientifically sound algorithms may persist in their original state even as system architectures and software development approaches evolve and progress. Dating...
Sequential data access with Oracle and Hadoop: a performance comparison
NASA Astrophysics Data System (ADS)
Baranowski, Zbigniew; Canali, Luca; Grancher, Eric
2014-06-01
The Hadoop framework has proven to be an effective and popular approach for dealing with "Big Data" and, thanks to its scaling ability and optimised storage access, Hadoop Distributed File System-based projects such as MapReduce or HBase are seen as candidates to replace traditional relational database management systems whenever scalable speed of data processing is a priority. But do these projects deliver in practice? Does migrating to Hadoop's "shared nothing" architecture really improve data access throughput? And, if so, at what cost? The authors answer these questions, addressing cost/performance as well as raw performance, based on a performance comparison between an Oracle-based relational database and Hadoop's distributed solutions like MapReduce or HBase for sequential data access. A key feature of our approach is the use of an unbiased data model, as certain data models can significantly favour one of the technologies tested.
Long Read Alignment with Parallel MapReduce Cloud Platform
Al-Absi, Ahmed Abdulhakim; Kang, Dae-Ki
2015-01-01
Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing gene sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. Therefore, in this paper, we introduce Burrows-Wheeler Aligner's Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. The proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of the Smith-Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 for BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms. PMID:26839887
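For readers unfamiliar with the kernel being optimized here, the following is a minimal textbook Smith-Waterman local alignment scorer in Python: a plain O(nm) dynamic program with linear gap penalties, not the optimized variant the paper describes.

    # Compact Smith-Waterman local alignment score (sketch).
    # Scoring parameters are illustrative defaults.
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
                best = max(best, H[i][j])
        return best  # score of the best local alignment

    print(smith_waterman("ACACACTA", "AGCACACA"))  # prints the optimal local score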
A scalable neuroinformatics data flow for electrophysiological signals using MapReduce.
Jayapandian, Catherine; Wei, Annan; Ramesh, Priya; Zonjy, Bilal; Lhatoo, Samden D; Loparo, Kenneth; Zhang, Guo-Qiang; Sahoo, Satya S
2015-01-01
Data-driven neuroscience research is providing new insights in progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow that uses new data partitioning techniques to store and analyze electrophysiological signals in a distributed computing infrastructure. The Cloudwave data flow uses the MapReduce parallel programming model to implement an integrated signal data processing pipeline that scales with the large volumes of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy-focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volumes of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications.
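The partitioning idea can be sketched as a streaming mapper that slices each channel's samples into fixed-length windows keyed by (channel, window index), so downstream reducers can process windows in parallel. The record fields, epoch length, and JSON layout below are assumptions, not the actual CSF specification.

    # Streaming-mapper sketch of signal partitioning (assumptions:
    # one JSON record per line with "channel" and "samples" fields).
    import sys, json

    EPOCH_SAMPLES = 2560  # e.g. 10 s at 256 Hz, assumed

    for line in sys.stdin:
        record = json.loads(line)          # one channel's samples per line
        channel = record["channel"]
        samples = record["samples"]
        for w in range(0, len(samples), EPOCH_SAMPLES):
            window = samples[w:w + EPOCH_SAMPLES]
            key = "%s:%d" % (channel, w // EPOCH_SAMPLES)
            print(key + "\t" + json.dumps(window))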
Using Clouds for MapReduce Measurement Assignments
ERIC Educational Resources Information Center
Rabkin, Ariel; Reiss, Charles; Katz, Randy; Patterson, David
2013-01-01
We describe our experiences teaching MapReduce in a large undergraduate lecture course using public cloud services and the standard Hadoop API. Using the standard API, students directly experienced the quality of industrial big-data tools. Using the cloud, every student could carry out scalability benchmarking assignments on realistic hardware,…
Low Latency Workflow Scheduling and an Application of Hyperspectral Brightness Temperatures
NASA Astrophysics Data System (ADS)
Nguyen, P. T.; Chapman, D. R.; Halem, M.
2012-12-01
New system analytics for Big Data computing holds the promise of major scientific breakthroughs and discoveries from the exploration and mining of the massive data sets becoming available to the science community. However, such data intensive scientific applications face severe challenges in accessing, managing and analyzing petabytes of data. While the Hadoop MapReduce environment has been successfully applied to data intensive problems arising in business, there are still many scientific problem domains where limitations in the functionality of MapReduce systems prevent its wide adoption by those communities. This is mainly because MapReduce does not readily support the unique science discipline needs such as special science data formats, graphic and computational data analysis tools, maintaining high degrees of computational accuracy, and interfacing with an application's existing components across heterogeneous computing processors. We address some of these limitations by exploiting the MapReduce programming model for satellite data intensive scientific problems and address scalability, reliability, scheduling, and data management issues when dealing with climate data records and their complex observational challenges. In addition, we will present techniques to support the unique Earth science discipline needs such as dealing with special science data formats (HDF and NetCDF). We have developed a Hadoop task scheduling algorithm that improves latency by 2x for a scientific workflow including the gridding of the EOS AIRS hyperspectral Brightness Temperatures (BT). This workflow processing algorithm has been tested at the Multicore Computing Center private Hadoop based Intel Nehalem cluster, as well as in a virtual mode under the Open Source Eucalyptus cloud. The 55TB AIRS hyperspectral L1b Brightness Temperature record has been gridded at the resolution of 0.5x1.0 degrees, and we have computed a 0.9 annual anti-correlation to the El Niño Southern Oscillation in the Niño 4 region, as well as a 1.9 Kelvin decadal Arctic warming in the 4 μm and 12 μm spectral regions. Additionally, we will present the frequency of extreme global warming events by the use of a normalized maximum BT in a grid cell relative to its local standard deviation. A low-latency Hadoop scheduling environment maintains data integrity and fault tolerance in a MapReduce data intensive Cloud environment while improving the "time to solution" metric by 35% when compared to a more traditional parallel processing system for the same dataset. Our next step will be to improve the usability of our Hadoop task scheduling system, to enable rapid prototyping of data intensive experiments by means of processing "kernels". We will report on the performance and experience of implementing these experiments on the NEX testbed, and propose the use of a graphical directed acyclic graph (DAG) interface to help us develop on-demand scientific experiments. Our workflow system works within Hadoop infrastructure as a replacement for the FIFO or FairScheduler, thus the use of Apache Pig Latin or other Apache tools may also be worth investigating on the NEX system to improve the usability of our workflow scheduling infrastructure for rapid experimentation.
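The extreme-event statistic mentioned above, a grid cell's maximum brightness temperature normalized by its local standard deviation, reduces to a few lines of NumPy; the array shape and the 3-sigma threshold below are assumptions for illustration.

    # Normalized maximum BT per grid cell (sketch). The random array
    # stands in for a gridded (time x lat x lon) BT record.
    import numpy as np

    bt = np.random.rand(120, 180, 360).astype(np.float32)

    mean = bt.mean(axis=0)
    std = bt.std(axis=0)
    z = (bt.max(axis=0) - mean) / std   # normalized maximum BT per cell

    extreme = z > 3.0                   # flag unusually warm extremes
    print(extreme.sum(), "grid cells exceed 3 sigma")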
Evaluation of Apache Hadoop for parallel data analysis with ROOT
NASA Astrophysics Data System (ADS)
Lehrack, S.; Duckeck, G.; Ebke, J.
2014-06-01
The Apache Hadoop software is a Java-based framework for distributed processing of large data sets across clusters of computers, using the Hadoop file system (HDFS) for data storage and backup and MapReduce as a processing platform. Hadoop is primarily designed for processing large textual data sets which can be processed in arbitrary chunks, and must be adapted to the use case of processing binary data files which cannot be split automatically. However, Hadoop offers attractive features in terms of fault tolerance, task supervision and control, multi-user functionality and job management. For this reason, we evaluated Apache Hadoop as an alternative approach to PROOF for ROOT data analysis. Two alternatives in distributing analysis data were discussed: either the data was stored in HDFS and processed with MapReduce, or the data was accessed via a standard Grid storage system (dCache Tier-2) and MapReduce was used only as execution back-end. The focus of the measurements was, on the one hand, to safely store analysis data on HDFS with reasonable data rates and, on the other hand, to process data quickly and reliably with MapReduce. In the evaluation of the HDFS, read/write data rates from the local Hadoop cluster have been measured and compared to standard data rates from the local NFS installation. In the evaluation of MapReduce, realistic ROOT analyses have been used and event rates have been compared to PROOF.
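One widely used workaround for non-splittable binary inputs, consistent with the adaptation described above, is to feed the job a text file of paths and let each map task process whole files. The sketch below shows the shape of such a streaming mapper, with a placeholder in place of real ROOT I/O (which would go through PyROOT or a subprocess).

    # Whole-file processing pattern for non-splittable binary data
    # (sketch): the job input is a list of file paths, one per line.
    import sys

    def process_root_file(path):
        # placeholder: stream through the binary file and return its
        # size in bytes; a real job would compute physics quantities.
        with open(path, "rb") as f:
            return sum(len(chunk) for chunk in iter(lambda: f.read(1 << 20), b""))

    for line in sys.stdin:
        path = line.strip()
        if path:
            print(path + "\t" + str(process_root_file(path)))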
Sundvall, Erik; Wei-Kleiner, Fang; Freire, Sergio M; Lambrix, Patrick
2017-01-01
Archetype-based Electronic Health Record (EHR) systems using generic reference models from e.g. openEHR, ISO 13606 or CIMI should be easy to update and reconfigure with new types (or versions) of data models or entries, ideally with very limited programming or manual database tweaking. Exploratory research (e.g. epidemiology) leading to ad-hoc querying on a population-wide scale can be a challenge in such environments. This publication describes implementation and test of an archetype-aware Dewey encoding optimization that can be used to produce such systems in environments supporting relational operations, e.g. RDBMSs and distributed map-reduce frameworks like Hadoop. Initial testing was done using a nine-node 2.2 GHz quad-core Hadoop cluster querying a dataset consisting of targeted extracts from 4+ million real patient EHRs; query results with sub-minute response times were obtained.
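The optimization rests on Dewey-style path codes, where each tree node's code extends its parent's code by one ordinal, so subtree retrieval becomes a string prefix match that relational engines and map-reduce joins handle efficiently. The toy below illustrates the principle with invented codes and content.

    # Dewey-style path encoding of a tree (sketch, invented data):
    # descendants of a node share its dotted code as a prefix.
    entries = {
        "1":     "composition",
        "1.1":   "observation blood_pressure",
        "1.1.1": "systolic=142",
        "1.1.2": "diastolic=95",
        "1.2":   "observation body_weight",
    }

    def subtree(code):
        return {k: v for k, v in entries.items()
                if k == code or k.startswith(code + ".")}

    print(subtree("1.1"))  # the whole blood-pressure entry in one scan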
Survey of MapReduce frame operation in bioinformatics.
Zou, Quan; Li, Xu-Bin; Jiang, Wen-Rui; Lin, Zi-Yu; Li, Gui-Lin; Chen, Ke
2014-07-01
Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as future work on parallel computing in bioinformatics.
Distributed Kernelized Locality-Sensitive Hashing for Faster Image Based Navigation
2015-03-26
Facebook, Google, and Yahoo!. Current methods for image retrieval become problematic when implemented on image datasets that can easily reach billions of...correlations. Tech industry leaders like Facebook, Google, and Yahoo! sort and index even larger volumes of "big data" daily. When attempting to process...open source implementation of Google's MapReduce programming paradigm [13] which has been used for many different things. Using Apache Hadoop, Yahoo
NASA Technical Reports Server (NTRS)
Li, Zhenlong; Hu, Fei; Schnase, John L.; Duffy, Daniel Q.; Lee, Tsengdar; Bowen, Michael K.; Yang, Chaowei
2016-01-01
Climate observations and model simulations are producing vast amounts of array-based spatiotemporal data. Efficient processing of these data is essential for assessing global challenges such as climate change, natural disasters, and diseases. This is challenging not only because of the large data volume, but also because of the intrinsic high-dimensional nature of geoscience data. To tackle this challenge, we propose a spatiotemporal indexing approach to efficiently manage and process big climate data with MapReduce in a highly scalable environment. Using this approach, big climate data are directly stored in a Hadoop Distributed File System in its original, native file format. A spatiotemporal index is built to bridge the logical array-based data model and the physical data layout, which enables fast data retrieval when performing spatiotemporal queries. Based on the index, a data-partitioning algorithm is applied to enable MapReduce to achieve high data locality, as well as balancing the workload. The proposed indexing approach is evaluated using the National Aeronautics and Space Administration (NASA) Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. The experimental results show that the index can significantly accelerate querying and processing (a 10× speedup compared to the baseline test using the same computing cluster), while keeping the index-to-data ratio small (0.0328). The applicability of the indexing approach is demonstrated by a climate anomaly detection application deployed on a NASA Hadoop cluster. This approach is also able to support efficient processing of general array-based spatiotemporal data in various geoscience domains without special configuration on a Hadoop cluster.
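Schematically, such an index maps a logical query key (variable, time, spatial block) to the physical byte range inside the native file, plus a locality hint for scheduling. The entry layout below is an invented illustration, not the paper's actual format.

    # Schematic spatiotemporal index (sketch, invented layout):
    # logical key -> physical chunk location inside native files.
    from typing import NamedTuple

    class IndexEntry(NamedTuple):
        file_path: str   # native file kept as-is in HDFS
        offset: int      # byte offset of the array chunk
        length: int      # chunk length in bytes
        node_hint: str   # datanode holding the block, for locality

    index = {
        ("T2M", "2005-07-01", (30, 40, -100, -90)): IndexEntry(
            "/merra/MERRA100.prod.19790101.hdf", 8_388_608, 1_048_576, "node07"),
    }

    def lookup(variable, date, bbox):
        return index.get((variable, date, bbox))

    print(lookup("T2M", "2005-07-01", (30, 40, -100, -90)))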
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce.
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2013-11-01
The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes of data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS - a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing.
A Novel Multi-Class Ensemble Model for Classifying Imbalanced Biomedical Datasets
NASA Astrophysics Data System (ADS)
Bikku, Thulasi; Sambasiva Rao, N., Dr; Rao, Akepogu Ananda, Dr
2017-08-01
This paper mainly focuses on developing a Hadoop-based framework for feature selection and classification models to classify high-dimensionality data in heterogeneous biomedical databases. Extensive research has been performed in the fields of machine learning, big data and data mining for identifying patterns. The main challenge is extracting useful features generated from diverse biological systems. The proposed model can be used for predicting diseases in various applications and identifying the features relevant to particular diseases. Given the exponential growth of biomedical repositories such as PubMed and Medline, an accurate predictive model is essential for knowledge discovery in a Hadoop environment. Extracting key features from unstructured documents often leads to uncertain results due to outliers and missing values. In this paper, we propose a two-phase map-reduce framework with a text preprocessor and a classification model. In the first phase, a mapper-based preprocessing method was designed to eliminate irrelevant features, missing values and outliers from the biomedical data. In the second phase, a Map-Reduce based multi-class ensemble decision tree model was designed and applied to the preprocessed mapper data to improve the true positive rate and computational time. The experimental results on complex biomedical datasets show that the performance of our proposed Hadoop-based multi-class ensemble model significantly outperforms state-of-the-art baselines.
LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis
2016-05-23
LLMapReduce works with several schedulers such as SLURM, Grid Engine and LSF. Keywords: LLMapReduce; map-reduce; performance; scheduler; Grid Engine ... SLURM; LSF. I. INTRODUCTION Large scale computing is currently dominated by four ecosystems: supercomputing, database, enterprise, and big data [1 ... interconnects [6]), high-performance math libraries (e.g., BLAS [7, 8], LAPACK [9], ScaLAPACK [10]) designed to exploit special processing hardware, High
Large Scale Hierarchical K-Means Based Image Retrieval With MapReduce
2014-03-27
Hadoop distributed file system: Architecture and design, 2007. [10] G. Bradski. Dr. Dobb's Journal of Software Tools, 2000. [11] Terry Costlow. Big data ... million images running on 20 virtual machines are shown. Subject terms: Image Retrieval, MapReduce, Hierarchical K-Means, Big Data, Hadoop.
Towards a Cross-Domain MapReduce Framework
2013-11-01
These Big Data applications typically run as a set of MapReduce jobs to take advantage of Hadoop's ease of service deployment and large-scale ... parallelism. Yet, Hadoop has not been adapted for multilevel secure (MLS) environments where data of different security classifications co-exist. To solve ... multilevel security. I. INTRODUCTION The US Department of Defense (DoD) and US Intelligence Community (IC) recognize they have a Big Data problem
Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce.
Yehdego, Daniel T; Zhang, Boyu; Kodimala, Vikram K R; Johnson, Kyle L; Taufer, Michela; Leung, Ming-Ying
2013-05-01
Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structures of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered and the optimized methods. Each step of searching for inversions, chunking, and predictions can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database, whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are possible computationally. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
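An inversion in this sense is a stretch of nucleotides followed, within a bounded gap, by its reverse complement. The sketch below searches for such candidate cutting regions directly, with stem length and gap size as the tunable parameters the authors explore at scale with Hadoop.

    # Inversion search over an RNA string (sketch, RNA alphabet assumed).
    COMPLEMENT = str.maketrans("ACGU", "UGCA")

    def reverse_complement(s):
        return s.translate(COMPLEMENT)[::-1]

    def find_inversions(seq, stem=6, max_gap=40):
        hits = []
        for i in range(len(seq) - stem + 1):
            probe = reverse_complement(seq[i:i + stem])
            lo, hi = i + stem, min(len(seq), i + stem + max_gap)
            j = seq.find(probe, lo, hi)
            if j != -1:
                hits.append((i, j))  # candidate cutting region between stems
        return hits

    print(find_inversions("GGGAAACUUUCCCAUGGGAUAUCCC"))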
Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search
2009-11-01
i.e., index construction may involve multiple flushes to local disk and on-disk merge sorts outside of MapReduce). Once the local indexes have been ... contained 198 cores, which, with current dual-processor quad-core configurations, could fit into 25 machines, a far more modest cluster with today's ... significant impact on effectiveness. Our simple pruning technique was performed at query time and hence could be adapted to query-dependent
Peterson, Kevin J.; Pathak, Jyotishman
2014-01-01
Automated execution of electronic Clinical Quality Measures (eCQMs) from electronic health records (EHRs) on large patient populations remains a significant challenge, and the testability, interoperability, and scalability of measure execution are critical. The High Throughput Phenotyping (HTP; http://phenotypeportal.org) project aligns with these goals by using the standards-based HL7 Health Quality Measures Format (HQMF) and Quality Data Model (QDM) for measure specification, as well as Common Terminology Services 2 (CTS2) for semantic interpretation. The HQMF/QDM representation is automatically transformed into a JBoss® Drools workflow, enabling horizontal scalability via clustering and MapReduce algorithms. Using Project Cypress, automated verification metrics can then be produced. Our results show linear scalability for nine executed 2014 Center for Medicare and Medicaid Services (CMS) eCQMs for eligible professionals and hospitals for >1,000,000 patients, and verified execution correctness of 96.4% based on Project Cypress test data of 58 eCQMs. PMID:25954459
Estimation Accuracy on Execution Time of Run-Time Tasks in a Heterogeneous Distributed Environment.
Liu, Qi; Cai, Weidong; Jin, Dandan; Shen, Jian; Fu, Zhangjie; Liu, Xiaodong; Linge, Nigel
2016-08-30
Distributed Computing has achieved tremendous development since cloud computing was proposed in 2006, and has played a vital role in promoting the rapid growth of data collection and analysis models, e.g., Internet of Things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of the core components, MapReduce facilitates allocating, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurate estimation of the execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data for each task were collected and analyzed in a detailed report. According to the results, the prediction accuracy of concurrent tasks' execution time can be improved, in particular for some regular jobs.
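The essence of a two-phase regression can be sketched with NumPy: fit separate lines to the early and late portions of a task's progress samples and extrapolate the finishing time from the current phase. The breakpoint and data below are invented; the paper's estimator is more elaborate.

    # Two-phase regression on task progress (sketch, invented data).
    import numpy as np

    t = np.array([0, 10, 20, 30, 40, 50, 60, 70], dtype=float)      # seconds
    progress = np.array([0.0, .05, .10, .15, .30, .45, .60, .75])   # fraction done

    split = 3  # assumed phase boundary (e.g. map-heavy vs reduce-heavy)
    a1, b1 = np.polyfit(t[:split + 1], progress[:split + 1], 1)
    a2, b2 = np.polyfit(t[split:], progress[split:], 1)

    # estimated finish: when the second-phase line reaches progress = 1.0
    finish = (1.0 - b2) / a2
    print("predicted completion at t = %.1f s" % finish)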
Towards Building a High Performance Spatial Query System for Large Scale Medical Imaging Data.
Aji, Ablimit; Wang, Fusheng; Saltz, Joel H
2012-11-06
Support of high performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven by not only geospatial problems in numerous fields, but also emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, where examination of high resolution images of human tissue specimens enables more effective diagnosis, prediction and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging provides high potential to support image based computer aided diagnosis. One major requirement for this is effective querying of such enormous amount of data with fast response, which is faced with two major challenges: the "big data" challenge and the high computation complexity. In this paper, we present our work towards building a high performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits nicely with the computing model of MapReduce. We demonstrate our framework on supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for microanatomic objects. To reduce query response time, we propose cost based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce.
Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lin, Jian; Hamidouche, Khaled; Zheng, Jie
2015-08-05
Machine Learning algorithms are benefiting from the continuous improvement of programming models, including MPI, MapReduce and PGAS. k-Nearest Neighbors (k-NN) algorithm is a widely used machine learning algorithm, applied to supervised learning tasks such as classification. Several parallel implementations of k-NN have been proposed in the literature and practice. However, on high-performance computing systems with high-speed interconnects, it is important to further accelerate existing designs of the k-NN algorithm through taking advantage of scalable programming models. To improve the performance of k-NN on large-scale environment with InfiniBand network, this paper proposes several alternative hybrid MPI+OpenSHMEM designs and performs a systemic evaluation and analysis on typical workloads. The hybrid designs leverage the one-sided memory access to better overlap communication with computation than the existing pure MPI design, and propose better schemes for efficient buffer management. The implementation based on k-NN program from MaTEx with MVAPICH2-X (Unified MPI+PGAS Communication Runtime over InfiniBand) shows up to 9.0% time reduction for training KDD Cup 2010 workload over 512 cores, and 27.6% time reduction for small workload with balanced communication and computation. Experiments of running with varied number of cores show that our design can maintain good scalability.
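The paper's designs are C-level MPI plus OpenSHMEM; as a rough Python analogue, the mpi4py sketch below shows the underlying parallel k-NN pattern they build on: each rank scores its shard of the training data, then the partial top-k lists are gathered and merged. The one-sided-communication and buffer-management refinements that are the paper's actual contribution are not represented here.

    # Parallel k-NN pattern with mpi4py (sketch, random data).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    k = 5

    rng = np.random.default_rng(seed=rank)
    shard = rng.normal(size=(10_000, 16))     # this rank's training shard
    query = np.ones(16)                       # same query on every rank

    d = np.linalg.norm(shard - query, axis=1) # distances to local shard
    local_top = np.sort(d)[:k]                # local k best distances

    all_top = comm.allgather(local_top)       # k candidates from each rank
    global_top = np.sort(np.concatenate(all_top))[:k]
    if rank == 0:
        print("global k-NN distances:", global_top)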
NASA Astrophysics Data System (ADS)
Rizki, Permata Nur Miftahur; Lee, Heezin; Lee, Minsu; Oh, Sangyoon
2017-01-01
With the rapid advance of remote sensing technology, the amount of three-dimensional point-cloud data has increased extraordinarily, requiring faster processing in the construction of digital elevation models. There have been several attempts to accelerate the computation using parallel methods; however, little attention has been given to investigating different approaches for selecting the most suited parallel programming model for a given computing environment. We present our findings and insights identified by implementing three popular high-performance parallel approaches (message passing interface, MapReduce, and GPGPU) on time-demanding but accurate kriging interpolation. The performance of the approaches is compared by varying the size of the grid and input data. In our empirical experiment, we demonstrate the significant acceleration by all three approaches compared to a C-implemented sequential-processing method. In addition, we also discuss the pros and cons of each method in terms of usability, complexity, infrastructure, and platform limitations to give readers a better understanding of utilizing those parallel approaches for gridding purposes.
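For reference, ordinary kriging at a single target point reduces to one small linear solve once a covariance model is chosen. The sketch below uses an assumed exponential model and invented sample points, omitting the variogram fitting and neighborhood search that real gridding requires.

    # Bare-bones ordinary kriging at one target point (sketch).
    import numpy as np

    def cov(h, range_=500.0):
        return np.exp(-h / range_)       # exponential covariance, assumed

    pts = np.array([[0., 0.], [100., 0.], [0., 100.], [80., 90.]])
    vals = np.array([1.2, 0.8, 1.0, 1.4])
    target = np.array([50., 50.])

    n = len(pts)
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1)); A[:n, :n] = cov(D); A[n, n] = 0.0
    b = np.ones(n + 1); b[:n] = cov(np.linalg.norm(pts - target, axis=1))

    w = np.linalg.solve(A, b)            # kriging weights + Lagrange multiplier
    print("estimate:", w[:n] @ vals)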
A hadoop-based method to predict potential effective drug combination.
Sun, Yifan; Xiong, Yi; Xu, Qian; Wei, Dongqing
2014-01-01
Combination drugs that impact multiple targets simultaneously are promising candidates for combating complex diseases due to their improved efficacy and reduced side effects. However, exhaustive screening of all possible drug combinations is extremely time-consuming and impractical. Here, we present a novel Hadoop-based approach to predict drug combinations by taking advantage of the MapReduce programming model, which leads to an improvement of scalability of the prediction algorithm. By integrating the gene expression data of multiple drugs, we constructed the data preprocessing step and support vector machine and naïve Bayesian classifiers on Hadoop for prediction of drug combinations. The experimental results suggest that our Hadoop-based model achieves much higher efficiency in the big data processing steps with satisfactory performance. We believe that our proposed approach can help accelerate the prediction of potentially effective drug combinations as the number of combinations increases exponentially in the future. The source code and datasets are available upon request.
Indoor air quality analysis based on Hadoop
NASA Astrophysics Data System (ADS)
Tuo, Wang; Yunhua, Sun; Song, Tian; Liang, Yu; Weihong, Cui
2014-03-01
The air of an office environment is our research object. Data on temperature, humidity, and concentrations of carbon dioxide, carbon monoxide and ammonia are collected every one to eight seconds by the sensor monitoring system. All the data are stored in the HBase database of the Hadoop platform. With the help of HBase's column-oriented storage and versioning (which automatically adds a time column), time-series data sets are built based on the primary Row-key and timestamp. The parallel computing programming model MapReduce is used to process the millions of data points collected by the sensors. By analysing the changing trends of parameter values at different times of the same day and at the same time on various dates, the impact of human and other factors on the room microenvironment is derived according to the movement of the office staff. Moreover, an effective way to improve indoor air quality is proposed at the end of this paper.
Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data.
Luna, Jose Maria; Padillo, Francisco; Pechenizkiy, Mykola; Ventura, Sebastian
2017-09-27
Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new efficient pattern mining algorithms to work in big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms [Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed, which extract any existing item-set in data. Second, two algorithms (space pruning AprioriMR and top AprioriMR) that prune the search space by means of the well-known anti-monotone property are proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed representations of frequent patterns. To test the performance of the proposed algorithms, a varied collection of big data datasets has been considered, comprising up to 3 · 10¹⁸ transactions and more than 5 million distinct single items. The experimental stage includes comparisons against highly efficient and well-known pattern mining algorithms. Results reveal the value of applying the MapReduce versions when complex problems are considered, and also the unsuitability of this paradigm when dealing with small data.
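One counting pass of an Apriori-style MapReduce algorithm has the same shape as word count: the mapper emits every candidate itemset of the current size found in a transaction, and a summing reducer accumulates support. The Hadoop Streaming sketch below shows such a mapper; candidate generation and anti-monotone pruning between passes are omitted.

    # Streaming-mapper sketch of one Apriori counting pass
    # (assumption: one whitespace-separated transaction per line).
    import sys
    from itertools import combinations

    SIZE = 2  # itemset size for this pass

    for line in sys.stdin:
        transaction = sorted(set(line.split()))
        for itemset in combinations(transaction, SIZE):
            print(",".join(itemset) + "\t1")

A standard summing reducer completes the pass; itemsets whose totals fall below the minimum support are pruned before candidates of the next size are generated, which is the anti-monotone step the space-pruning variants exploit.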
NASA Astrophysics Data System (ADS)
Machalek, P.; Kim, S. M.; Berry, R. D.; Liang, A.; Small, T.; Brevdo, E.; Kuznetsova, A.
2012-12-01
We describe how The Climate Corporation uses Python and Clojure, a language implemented on top of Java, to generate climatological forecasts for precipitation based on the Advanced Hydrologic Prediction Service (AHPS) radar-based daily precipitation measurements. A 2-year-long forecast is generated on each of the ~650,000 CONUS land-based 4-km AHPS grids by constructing 10,000 ensembles sampled from a 30-year reconstructed AHPS history for each grid. The spatial and temporal correlations between neighboring AHPS grids and the sampling of the analogues are handled by Python. The parallelization across all 650,000 CONUS grids is achieved by utilizing the MapReduce framework (http://code.google.com/edu/parallel/mapreduce-tutorial.html). Each full-scale computational run requires hundreds of nodes with up to 8 processors each on the Amazon Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/) distributed computing service, resulting in 3-terabyte datasets. We further describe how we have productionized a monthly run of the simulation process at the full scale of the 4-km AHPS grids and how the resulting terabyte-sized datasets are handled.
Efficient feature extraction from wide-area motion imagery by MapReduce in Hadoop
NASA Astrophysics Data System (ADS)
Cheng, Erkang; Ma, Liya; Blaisse, Adam; Blasch, Erik; Sheaff, Carolyn; Chen, Genshe; Wu, Jie; Ling, Haibin
2014-06-01
Wide-Area Motion Imagery (WAMI) feature extraction is important for applications such as target tracking, traffic management and accident discovery. With the increasing amount of WAMI collections and feature extraction from the data, a scalable framework is needed to handle the large amount of information. Cloud computing is one of the approaches recently applied to large-scale big data processing. In this paper, MapReduce in Hadoop is investigated for large scale feature extraction tasks for WAMI. Specifically, a large dataset of WAMI images is divided into several splits. Each split has a small subset of WAMI images. The feature extractions of WAMI images in each split are distributed to slave nodes in the Hadoop system. Feature extraction of each image is performed individually in the assigned slave node. Finally, the feature extraction results are sent to the Hadoop Distributed File System (HDFS) to aggregate the feature information over the collected imagery. Experiments of feature extraction with and without MapReduce are conducted to illustrate the effectiveness of our proposed Cloud-Enabled WAMI Exploitation (CAWE) approach.
Raster Data Partitioning for Supporting Distributed GIS Processing
NASA Astrophysics Data System (ADS)
Nguyen Thai, B.; Olasz, A.
2015-08-01
In the geospatial sector, the big data concept has already had an impact. Several studies apply techniques originally from computer science to GIS processing of huge amounts of geospatial data. In other research studies, geospatial data is considered to have always been big data (Lee and Kang, 2015). Nevertheless, data acquisition methods have improved substantially, increasing not only the amount of raw data but also its spectral, spatial and temporal resolution. A significant portion of big data is geospatial data, and the size of such data is growing rapidly by at least 20% every year (Dasgupta, 2013). From the increasing volume of raw data produced in different formats and representations and for different purposes, the wealth of information derived from these data sets is what yields valuable results. However, computing capability and processing speed face limitations, even when semi-automatic or automatic procedures are applied to complex geospatial data (Kristóf et al., 2014). Lately, distributed computing has reached many interdisciplinary areas of computer science, including remote sensing and geographic information processing approaches. Cloud computing all the more requires appropriate processing algorithms to distribute and handle geospatial big data. The Map-Reduce programming model and distributed file systems have proven their capabilities to process non-GIS big data. But it is sometimes inconvenient or inefficient to rewrite existing algorithms to the Map-Reduce programming model, and GIS data cannot be partitioned like text-based data by lines or by bytes. Hence, we would like to find an alternative solution for data partitioning, data distribution and execution of existing algorithms without rewriting or with only minor modifications. This paper focuses on a technical overview of currently available distributed computing environments, as well as GIS (raster) data partitioning, distribution and distributed processing of GIS algorithms. A proof-of-concept implementation has been made for raster data partitioning, distribution and processing. The first results on performance have been compared against the commercial software ERDAS IMAGINE 2011 and 2014. Partitioning methods heavily depend on application areas, therefore we may consider data partitioning as a preprocessing step before applying processing services to the data. As a proof of concept we have implemented a simple tile-based partitioning method splitting an image into smaller grids (NxM tiles) and comparing the processing time to existing methods by NDVI calculation. The concept is demonstrated using our own open-source processing framework.
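The tile-based partitioning in the proof of concept can be sketched in a few lines of NumPy: split the bands into an N x M grid of tiles, compute NDVI per tile (each tile is independent work that could go to a different node), and write results back in place. The band arrays here are random stand-ins; real code would read them with a raster library such as GDAL.

    # N x M tile partitioning with per-tile NDVI (sketch).
    import numpy as np

    red = np.random.rand(3000, 3000).astype(np.float32)  # stand-in bands
    nir = np.random.rand(3000, 3000).astype(np.float32)

    N, M = 4, 4  # grid of tiles; each tile could go to a different worker

    def tile_slices(shape, n, m):
        h, w = shape
        for i in range(n):
            for j in range(m):
                yield (slice(i*h//n, (i+1)*h//n), slice(j*w//m, (j+1)*w//m))

    ndvi = np.empty_like(red)
    for s in tile_slices(red.shape, N, M):
        r, nr = red[s], nir[s]
        ndvi[s] = (nr - r) / (nr + r + 1e-9)  # per-tile NDVI, independent work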
Massive Signal Analysis with Hadoop (Invited)
NASA Astrophysics Data System (ADS)
Addair, T.
2013-12-01
The Geophysical Monitoring Program (GMP) at Lawrence Livermore National Laboratory is in the process of transitioning from a primarily human-driven analysis pipeline to a more automated and exploratory system. Waveform correlation represents a significant part of this effort, and the results that come out of this processing could lead to the development of more sophisticated event detection and analysis systems that require less human interaction, and address fundamental shortcomings in existing systems. Furthermore, use of distributed IO systems fundamentally addresses a scalability concern for the GMP as our data holdings continue to grow rapidly. As the data volume increases, it becomes less reasonable to rely upon human analysts to sift through all the information. Not only is more automation essential to keeping up with the ingestion rate, but so too do we require faster and more sophisticated tools for visualizing and interacting with the data. These issues of scalability are not unique to GMP or the seismic domain. All across the lab, and throughout industry, we hear about the promise of 'big data' to address the need of quickly analyzing vast amounts of data in fundamentally new ways. Our waveform correlation system finds and correlates nearby seismic events across the entire Earth. In our original implementation of the system, we processed some 50 TB of data on an in-house traditional HPC cluster (44 cores, 1 filesystem) over the span of 42 days. Having determined the primary bottleneck in the performance to be reading waveforms off a single BlueArc file server, we began investigating distributed IO solutions like Hadoop. As a test case, we took a 1 TB subset of our data and ported it to Livermore Computing's development Hadoop cluster. Through a pilot project sponsored by Livermore Computing (LC), the GMP successfully implemented the waveform correlation system in the Hadoop distributed MapReduce computing framework. Hadoop is an open source implementation of the MapReduce distributed programming framework. We used the Hadoop scripting framework known as Pig for putting together the multi-job MapReduce pipeline used to extract as much parallelism as possible from the algorithms. We also made use of the Sqoop data ingestion tool to pull metadata tables from our Oracle database into HDFS (the Hadoop Distributed Filesystem). Running on our in-house HPC cluster, processing this test dataset took 58 hours to complete. In contrast, running our Hadoop implementation on LC's 10 node (160 core) cluster, we were able to cross-correlate the 1 TB of nearby seismic events in just under 3 hours, more than a factor of 19 improvement over our existing implementation. This project is one of the first major data mining and analysis efforts, at the lab or anywhere else, to correlate the entire Earth's seismicity. Through the success of this project, we believe we've shown that a MapReduce solution can be appropriate for many large-scale Earth science data analysis and exploration problems. Given Hadoop's position as the dominant data analytics solution in industry, we believe Hadoop can be applied to many previously intractable Earth science problems.
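The kernel of such a pipeline is the normalized cross-correlation of two waveform windows, whose peak value and lag indicate whether two events are near-repeats. The NumPy sketch below shows that kernel on synthetic data; production processing adds filtering, windowing, and quality control.

    # Normalized cross-correlation of two waveform windows (sketch).
    import numpy as np

    def normalized_xcorr(x, y):
        x = (x - x.mean()) / (x.std() * len(x))
        y = (y - y.mean()) / y.std()
        return np.correlate(x, y, mode="full")  # values roughly in [-1, 1]

    rng = np.random.default_rng(0)
    a = rng.normal(size=1024)
    b = np.roll(a, 37) + 0.1 * rng.normal(size=1024)  # shifted, noisy copy

    cc = normalized_xcorr(a, b)
    lag = cc.argmax() - (len(a) - 1)   # sample offset between the windows
    print("peak correlation %.2f at lag %d" % (cc.max(), lag))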
An Approach for Stitching Satellite Images in a Bigdata Mapreduce Framework
NASA Astrophysics Data System (ADS)
Sarı, H.; Eken, S.; Sayar, A.
2017-11-01
In this study we present a two-step map/reduce framework to stitch satellite mosaic images. The proposed system enables the recognition and extraction of objects whose parts fall in separate satellite mosaic images. However, this is a time- and resource-consuming process. The major aim of the study is to improve the performance of the image stitching process by utilizing a big data framework. To realize this, we first convert the images into bitmaps (first mapper) and then into string form as sequences of 255s and 0s (second mapper), and finally find the best possible matching position of the images with a reduce function.
MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services.
Pratt, Brian; Howbert, J Jeffry; Tasman, Natalie I; Nilsson, Erik J
2012-01-01
MR-Tandem adapts the popular X!Tandem peptide search engine to work with Hadoop MapReduce for reliable parallel execution of large searches. MR-Tandem runs on any Hadoop cluster but offers special support for Amazon Web Services for creating inexpensive on-demand Hadoop clusters, enabling search volumes that might not otherwise be feasible with the compute resources a researcher has at hand. MR-Tandem is designed to drop in wherever X!Tandem is already in use and requires no modification to existing X!Tandem parameter files, and only minimal modification to X!Tandem-based workflows. MR-Tandem is implemented as a lightly modified X!Tandem C++ executable and a Python script that drives Hadoop clusters including Amazon Web Services (AWS) Elastic Map Reduce (EMR), using the modified X!Tandem program as a Hadoop Streaming mapper and reducer. The modified X!Tandem C++ source code is Artistic licensed, supports pluggable scoring, and is available as part of the Sashimi project at http://sashimi.svn.sourceforge.net/viewvc/sashimi/trunk/trans_proteomic_pipeline/extern/xtandem/. The MR-Tandem Python script is Apache licensed and available as part of the Insilicos Cloud Army project at http://ica.svn.sourceforge.net/viewvc/ica/trunk/mr-tandem/. Full documentation and a windows installer that configures MR-Tandem, Python and all necessary packages are available at this same URL. brian.pratt@insilicos.com
Horiguchi, Hiromasa; Yasunaga, Hideo; Hashimoto, Hideki; Ohe, Kazuhiko
2012-12-22
Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large scale data into wide table format, where each subject is represented by one row, for use in health services and clinical research. Since the original specification of Pig provides very few functions for column field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to deal with processing that considers date conditions. Having prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, we used the Elastic Compute Cloud environment provided by Amazon Inc. to execute processing speed and scaling benchmarks. In the speed benchmark test, the response time was significantly reduced and a linear relationship was observed between the quantity of data and processing time in both a small and a very large dataset. The scaling benchmark test showed clear scalability. In our system, doubling the number of nodes resulted in a 47% decrease in processing time. Our newly developed system is widely accessible as an open resource. This system is very simple and easy to use for researchers who are accustomed to using declarative command syntax for commercial statistical software and Structured Query Language. Although our system needs further sophistication to allow more flexibility in scripts and to improve efficiency in data processing, it shows promise in facilitating the application of MapReduce technology to efficient data processing with large scale administrative data in health services and clinical research.
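GroupFilterFormat itself is built on Pig, but the underlying long-to-wide pivot is easy to picture as a plain streaming job. The hypothetical Python sketch below assumes a "patient_id<TAB>field<TAB>value" record layout and a fixed column list, neither of which is the authors' actual schema.

    #!/usr/bin/env python
    # Streaming-style sketch of a long-to-wide transform: map keys each
    # event by patient id; reduce pivots the grouped events into one row
    # per patient. Reducer input arrives sorted by key under Hadoop
    # Streaming. Input layout and FIELDS are illustrative assumptions.
    import sys
    from itertools import groupby

    FIELDS = ["age", "sex", "diagnosis", "length_of_stay"]

    def mapper(lines):
        for line in lines:
            patient, field, value = line.rstrip("\n").split("\t")
            print("%s\t%s\t%s" % (patient, field, value))

    def reducer(lines):
        rows = (l.rstrip("\n").split("\t") for l in lines)
        for patient, group in groupby(rows, key=lambda r: r[0]):
            wide = dict((f, v) for _, f, v in group)
            print("\t".join([patient] + [wide.get(f, "") for f in FIELDS]))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)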
A Parallel Adaboost-Backpropagation Neural Network for Massive Image Dataset Classification
NASA Astrophysics Data System (ADS)
Cao, Jianfang; Chen, Lichao; Wang, Min; Shi, Hao; Tian, Yun
2016-12-01
Image classification uses computers to simulate human understanding and cognition of images by automatically categorizing images. This study proposes a faster image classification approach that parallelizes the traditional Adaboost-Backpropagation (BP) neural network using the MapReduce parallel programming model. First, we construct a strong classifier by assembling the outputs of 15 BP neural networks (which are individually regarded as weak classifiers) based on the Adaboost algorithm. Second, we design Map and Reduce tasks for both the parallel Adaboost-BP neural network and the feature extraction algorithm. Finally, we establish an automated classification model by building a Hadoop cluster. We use the Pascal VOC2007 and Caltech256 datasets to train and test the classification model. The results are superior to those obtained using traditional Adaboost-BP neural network or parallel BP neural network approaches. Our approach increased the average classification accuracy rate by approximately 14.5% and 26.0% compared to the traditional Adaboost-BP neural network and parallel BP neural network, respectively. Furthermore, the proposed approach requires less computation time and scales very well as evaluated by speedup, sizeup and scaleup. The proposed approach may provide a foundation for automated large-scale image classification and demonstrates practical value.
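A toy stand-in for this design is sketched below: multiprocessing plays the role of the Hadoop map tasks, scikit-learn's MLPClassifier plays the role of the BP networks, and the weak learners are combined with standard AdaBoost-style log-odds weights. The bootstrap sharding and the synthetic dataset are simplifying assumptions, not the paper's exact scheme.

    # Each "map task" trains one small BP network on a bootstrap sample;
    # the "reduce" step combines the 15 weak classifiers with AdaBoost-
    # style weights. multiprocessing substitutes for Hadoop.
    import numpy as np
    from multiprocessing import Pool
    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    def train_weak(seed):                       # one map task
        rng = np.random.RandomState(seed)
        idx = rng.randint(0, len(X), len(X))    # bootstrap shard
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                            random_state=seed).fit(X[idx], y[idx])
        err = max(1e-6, 1.0 - clf.score(X, y))
        alpha = 0.5 * np.log((1.0 - err) / err) # weak-learner weight
        return clf, alpha

    if __name__ == "__main__":
        with Pool(5) as pool:
            ensemble = pool.map(train_weak, range(15))
        votes = sum(a * (2 * clf.predict(X) - 1) for clf, a in ensemble)
        print("ensemble accuracy:", np.mean((votes > 0) == y))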
2013-11-01
big data with R is relatively new. RHadoop is a mature product from Revolution Analytics that uses R with Hadoop Streaming [15] and provides...agnostic all-data summaries or computations, in which case we use MapReduce directly. 2.3 D&R Software Environment In this work, we use the Hadoop ...job scheduling and tracking, data distribution, system architecture, heterogeneity, and fault-tolerance. Hadoop also provides a distributed key-value
Mahjani, Behrang; Toor, Salman; Nettelblad, Carl; Holmgren, Sverker
2017-01-01
In quantitative trait locus (QTL) mapping, the significance of putative QTL is often determined using permutation testing. The computational needs to calculate the significance level are immense: 10^4 up to 10^8 or even more permutations can be needed. We have previously introduced the PruneDIRECT algorithm for multiple QTL scans with epistatic interactions. This algorithm has specific strengths for permutation testing. Here, we present a flexible, parallel computing framework for identifying multiple interacting QTL using the PruneDIRECT algorithm, which uses the map-reduce model as implemented in Hadoop. The framework is implemented in R, a widely used software tool among geneticists. This enables users to rearrange algorithmic steps to adapt genetic models, search algorithms, and parallelization steps to their needs in a flexible way. Our work underlines the maturity of accessing distributed parallel computing for computationally demanding bioinformatics applications through building workflows within existing scientific environments. We investigate the PruneDIRECT algorithm, comparing its performance to exhaustive search and the DIRECT algorithm using our framework on a public cloud resource. We find that PruneDIRECT is vastly superior for permutation testing, performing 2×10^5 permutations for a 2D QTL problem in 15 hours using 100 cloud processes. We show that our framework scales out almost linearly for a 3D QTL search.
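The map-reduce pattern at work here is easy to isolate: each map task runs a batch of phenotype permutations and reports the genome-wide maximum statistic, and the reduce step pools the maxima into an empirical null distribution. The sketch below illustrates this with a plain correlation scan standing in for PruneDIRECT and multiprocessing standing in for the Hadoop layer; the data are synthetic.

    # Map tasks score batches of permutations; the reduce step pools the
    # per-permutation maxima into an empirical null for thresholding.
    import numpy as np
    from multiprocessing import Pool

    rng = np.random.default_rng(0)
    genotypes = rng.integers(0, 3, size=(500, 1000)).astype(float)
    phenotype = genotypes[:, 42] * 0.5 + rng.normal(size=500)

    def scan(pheno):
        g = (genotypes - genotypes.mean(0)) / genotypes.std(0)
        p = (pheno - pheno.mean()) / pheno.std()
        return np.abs(g.T @ p / len(p)).max()     # max |corr| over markers

    def map_task(args):                           # one map task
        seed, n_perm = args
        r = np.random.default_rng(seed)
        return [scan(r.permutation(phenotype)) for _ in range(n_perm)]

    if __name__ == "__main__":
        with Pool(4) as pool:
            null = np.concatenate(pool.map(map_task, [(s, 25) for s in range(4)]))
        print("observed:", scan(phenotype))
        print("5% significance threshold:", np.quantile(null, 0.95))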
MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud.
Expósito, Roberto R; Veiga, Jorge; González-Domínguez, Jorge; Touriño, Juan
2017-09-01
This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool. Source code in Java and Hadoop as well as a user's guide are freely available under the GNU GPLv3 license at http://mardre.des.udc.es . rreye@udc.es. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
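A minimal streaming-style sketch of the idea, under our simplifying assumption (not necessarily MarDRe's exact scheme) that reads are grouped by a fixed-length prefix so duplicates collapse within each reduce group:

    #!/usr/bin/env python
    # Map keys each read by its first K bases so duplicates (and many
    # near-duplicates) land in the same reduce group; reduce keeps one
    # representative per distinct sequence. Input is simplified to
    # "read_id<TAB>sequence" rather than FASTQ/FASTA.
    import sys
    from itertools import groupby

    K = 20  # grouping-prefix length (an assumption)

    def mapper(lines):
        for line in lines:
            read_id, seq = line.rstrip("\n").split("\t")
            print("%s\t%s\t%s" % (seq[:K], read_id, seq))

    def reducer(lines):
        rows = (l.rstrip("\n").split("\t") for l in lines)
        for _, group in groupby(rows, key=lambda r: r[0]):
            seen = set()
            for _, read_id, seq in group:
                if seq not in seen:       # exact-duplicate filter per group
                    seen.add(seq)
                    print("%s\t%s" % (read_id, seq))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)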
Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets
NASA Astrophysics Data System (ADS)
Juric, Mario
2011-01-01
The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ("column groups"), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 gigarows of Pan-STARRS+SDSS data (220 GB) in less than 15 minutes on a dual-CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbit/s (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters.
Huang, Hailiang; Tata, Sandeep; Prill, Robert J
2013-01-01
Computational workloads for genome-wide association studies (GWAS) are growing in scale and complexity outpacing the capabilities of single-threaded software designed for personal computers. The BlueSNP R package implements GWAS statistical tests in the R programming language and executes the calculations across computer clusters configured with Apache Hadoop, a de facto standard framework for distributed data processing using the MapReduce formalism. BlueSNP makes computationally intensive analyses, such as estimating empirical p-values via data permutation, and searching for expression quantitative trait loci over thousands of genes, feasible for large genotype-phenotype datasets. http://github.com/ibm-bioinformatics/bluesnp
Processing Solutions for Big Data in Astronomy
NASA Astrophysics Data System (ADS)
Fillatre, L.; Lepiller, D.
2016-09-01
This paper gives a simple introduction to processing solutions applied to massive amounts of data. It proposes a general presentation of the Big Data paradigm. The Hadoop framework, which is considered as the pioneering processing solution for Big Data, is described together with YARN, the integrated Hadoop tool for resource allocation. This paper also presents the main tools for the management of both the storage (NoSQL solutions) and computing capacities (MapReduce parallel processing schema) of a cluster of machines. Finally, more recent processing solutions like Spark are discussed. Big Data frameworks are now able to run complex applications while keeping the programming simple and greatly improving the computing speed.
Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing
NASA Astrophysics Data System (ADS)
Srivastava, Somesh; Rao, A. Ravishankar; Sheinin, Vadim
2011-03-01
Statistical image reconstruction algorithms potentially offer many advantages to x-ray computed tomography (CT), e.g. lower radiation dose. But, their adoption in practical CT scanners requires extra computation power, which is traditionally provided by incorporating additional computing hardware (e.g. CPU-clusters, GPUs, FPGAs etc.) into a scanner. An alternative solution is to access the required computation power over the internet from a cloud computing service, which is orders-of-magnitude more cost-effective. This is because users only pay a small pay-as-you-go fee for the computation resources used (i.e. CPU time, storage etc.), and completely avoid purchase, maintenance and upgrade costs. In this paper, we investigate the benefits and shortcomings of using cloud computing for statistical image reconstruction. We parallelized the most time-consuming parts of our application, the forward and back projectors, using MapReduce, the standard parallelization library on clouds. From preliminary investigations, we found that a large speedup is possible at a very low cost. But, communication overheads inside MapReduce can limit the maximum speedup, and a better MapReduce implementation might become necessary in the future. All the experiments for this paper, including development and testing, were completed on the Amazon Elastic Compute Cloud (EC2) for less than $20.
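The parallelization pattern itself is compact: map the forward projection over view angles, with each worker producing one sinogram row, and reduce by stacking. The toy below uses a crude nearest-neighbour parallel-beam projector, not the authors' fan-beam code, with multiprocessing standing in for the MapReduce workers.

    # Map phase: one worker computes the line integrals for one view
    # angle. Reduce phase: stack the rows into the sinogram.
    import numpy as np
    from multiprocessing import Pool

    N = 128
    image = np.zeros((N, N)); image[40:90, 50:80] = 1.0   # toy phantom

    def project_view(theta):                 # one map task
        c, s = np.cos(theta), np.sin(theta)
        t = np.arange(N) - N / 2.0           # detector bins
        r = np.arange(N) - N / 2.0           # steps along each ray
        x = (c * t[:, None] - s * r[None, :] + N / 2.0).astype(int)
        y = (s * t[:, None] + c * r[None, :] + N / 2.0).astype(int)
        ok = (x >= 0) & (x < N) & (y >= 0) & (y < N)
        vals = np.where(ok, image[y.clip(0, N - 1), x.clip(0, N - 1)], 0.0)
        return vals.sum(axis=1)              # one sinogram row

    if __name__ == "__main__":
        angles = np.linspace(0, np.pi, 180, endpoint=False)
        with Pool(4) as pool:                # stand-in for MapReduce workers
            sinogram = np.vstack(pool.map(project_view, angles))
        print(sinogram.shape)                # (views, detector bins)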
MaPLE: A MapReduce Pipeline for Lattice-based Evaluation and Its Application to SNOMED CT
Zhang, Guo-Qiang; Zhu, Wei; Sun, Mengmeng; Tao, Shiqiang; Bodenreider, Olivier; Cui, Licong
2015-01-01
Non-lattice fragments are often indicative of structural anomalies in ontological systems and, as such, represent possible areas of focus for subsequent quality assurance work. However, extracting the non-lattice fragments in large ontological systems is computationally expensive if not prohibitive, using a traditional sequential approach. In this paper we present a general MapReduce pipeline, called MaPLE (MapReduce Pipeline for Lattice-based Evaluation), for extracting non-lattice fragments in large partially ordered sets and demonstrate its applicability in ontology quality assurance. Using MaPLE in a 30-node Hadoop local cloud, we systematically extracted non-lattice fragments in 8 SNOMED CT versions from 2009 to 2014 (each containing over 300k concepts), with an average total computing time of less than 3 hours per version. With dramatically reduced time, MaPLE makes it feasible not only to perform exhaustive structural analysis of large ontological hierarchies, but also to systematically track structural changes between versions. Our change analysis showed that the average change rates on the non-lattice pairs are up to 38.6 times higher than the change rates of the background structure (concept nodes). This demonstrates that fragments around non-lattice pairs exhibit significantly higher rates of change in the process of ontological evolution. PMID:25705725
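The test that MaPLE distributes can be stated compactly: a pair of concepts is a non-lattice pair when its set of common ancestors has more than one most-specific element, i.e., no unique least common ancestor. The sketch below is our reading of that definition on a toy is-a hierarchy; the per-pair check is what the map tasks would evaluate at scale.

    # Per-pair lattice test on a small DAG given as child -> parents.
    from itertools import combinations

    PARENTS = {"a": [], "b": ["a"], "c": ["a"],
               "d": ["b", "c"], "e": ["b", "c"], "f": ["d"]}

    def ancestors(c, seen=None):
        seen = set() if seen is None else seen
        for p in PARENTS[c]:
            if p not in seen:
                seen.add(p); ancestors(p, seen)
        return seen

    def is_non_lattice(x, y):
        common = (ancestors(x) | {x}) & (ancestors(y) | {y})
        # most-specific elements: members with no child also in the set
        specific = [c for c in common
                    if not any(c in PARENTS[d] for d in common)]
        return len(specific) > 1

    pairs = [p for p in combinations(PARENTS, 2) if is_non_lattice(*p)]
    print(pairs)   # e.g. ('d', 'e'): two most-specific ancestors, b and c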
Astronomy In The Cloud: Using Mapreduce For Image Coaddition
NASA Astrophysics Data System (ADS)
Wiley, Keith; Connolly, A.; Gardner, J.; Krughoff, S.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.
2011-01-01
In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computational challenges such as anomaly detection, classification, and moving object tracking. Since such studies require the highest quality data, methods such as image coaddition, i.e., registration, stacking, and mosaicing, will be critical to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources, e.g., asteroids, or transient objects, e.g., supernovae, these datastreams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this paper we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data is partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources, i.e., platforms where Hadoop is offered as a service. We report on our experience implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image coaddition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance. This work is funded by the NSF and by NASA.
Astronomy in the Cloud: Using MapReduce for Image Co-Addition
NASA Astrophysics Data System (ADS)
Wiley, K.; Connolly, A.; Gardner, J.; Krughoff, S.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.
2011-03-01
In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computation challenges such as anomaly detection and classification and moving-object tracking. Since such studies benefit from the highest-quality data, methods such as image co-addition, i.e., astrometric registration followed by per-pixel summation, will be a critical preprocessing step prior to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this article we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data are partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources: i.e., platforms where Hadoop is offered as a service. We report on our experience of implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multiterabyte imaging data set provides a good testbed for algorithm development, since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image co-addition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance.
Kreuzthaler, Markus; Miñarro-Giménez, Jose Antonio; Schulz, Stefan
2016-01-01
Big data resources are difficult to process without a scaled hardware environment that is specifically adapted to the problem. The emergence of flexible cloud-based virtualization techniques promises solutions to this problem. This paper demonstrates how a billion lines can be processed in a reasonable amount of time in a cloud-based environment. Our use case addresses the accumulation of concept co-occurrence data in MEDLINE annotations as a series of MapReduce jobs, which can be scaled and executed in the cloud. Besides showing an efficient way of solving this problem, we generated an additional resource for the scientific community to be used for advanced text mining approaches.
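The accumulation step is a textbook "pairs" MapReduce job. The streaming-style sketch below assumes one line of comma-separated concept identifiers per annotated citation, a simplification of the real MEDLINE annotation format:

    #!/usr/bin/env python
    # Map emits every unordered concept pair found in one citation;
    # reduce sums the counts. Reducer input arrives sorted by key under
    # Hadoop Streaming.
    import sys
    from itertools import combinations, groupby

    def mapper(lines):
        for line in lines:
            concepts = sorted(set(line.strip().split(",")))
            for a, b in combinations(concepts, 2):
                print("%s,%s\t1" % (a, b))

    def reducer(lines):
        rows = (l.rstrip("\n").split("\t") for l in lines)
        for pair, group in groupby(rows, key=lambda r: r[0]):
            print("%s\t%d" % (pair, sum(int(n) for _, n in group)))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)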
Genotyping in the cloud with Crossbow.
Gurtowski, James; Schatz, Michael C; Langmead, Ben
2012-09-01
Crossbow is a scalable, portable, and automatic cloud computing tool for identifying SNPs from high-coverage, short-read resequencing data. It is built on Apache Hadoop, an implementation of the MapReduce software framework. Hadoop allows Crossbow to distribute read alignment and SNP calling subtasks over a cluster of commodity computers. Two robust tools, Bowtie and SOAPsnp, implement the fundamental alignment and variant calling operations respectively, and have demonstrated capabilities within Crossbow of analyzing approximately one billion short reads per hour on a commodity Hadoop cluster with 320 cores. Through protocol examples, this unit will demonstrate the use of Crossbow for identifying variations in three different operating modes: on a Hadoop cluster, on a single computer, and on the Amazon Elastic MapReduce cloud computing service.
Intelligent inversion method for pre-stack seismic big data based on MapReduce
NASA Astrophysics Data System (ADS)
Yan, Xuesong; Zhu, Zhixin; Wu, Qinghua
2018-01-01
Seismic exploration is a method of oil exploration that uses seismic information; that is, through the inversion of seismic information, useful information about the reservoir parameters can be obtained to carry out exploration effectively. Pre-stack data are characterised by a large volume and abundant information, and their inversion yields rich information about the reservoir parameters. Owing to the large amount of pre-stack seismic data, existing single-machine environments can no longer meet the computational needs, so an efficient and fast method for solving the inversion problem of pre-stack seismic data is urgently needed. Optimising the elastic parameters with a genetic algorithm easily falls into a local optimum, which weakens the inversion effect, especially for the density. Therefore, an intelligent optimisation algorithm is proposed in this paper and used for the elastic parameter inversion of pre-stack seismic data. The algorithm improves the population initialisation strategy by using the Gardner formula and improves the genetic operations of the algorithm; the improved algorithm obtains better inversion results in a model test with logging data. All of the elastic parameters obtained by inversion fit the logging curves of the theoretical model well, which effectively improves the inversion precision of the density. The algorithm was implemented with a MapReduce model to solve the seismic big data inversion problem. The experimental results show that the parallel model can effectively reduce the running time of the algorithm.
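The natural place for MapReduce in such a loop is fitness evaluation, which is embarrassingly parallel across the population. The sketch below is a generic illustration with a toy forward model and multiprocessing in place of Hadoop; it does not include the paper's Gardner-formula initialisation or its specific genetic operators.

    # Map phase scores candidate parameter sets against the observed
    # trace; the driver performs selection and mutation (the reduce-side
    # role). The linear "forward model" is a stand-in for real pre-stack
    # forward modelling.
    import numpy as np
    from multiprocessing import Pool

    rng = np.random.default_rng(1)
    true_params = np.array([2.5, 1.2, 2.2])       # e.g. vp, vs, density

    def forward(p):                               # toy forward model
        t = np.linspace(0, 1, 200)
        return p[0] * np.sin(8 * t) + p[1] * np.cos(3 * t) + p[2] * t

    observed = forward(true_params) + rng.normal(0, 0.05, 200)

    def fitness(candidate):                       # one map record
        return -np.mean((forward(candidate) - observed) ** 2)

    if __name__ == "__main__":
        pop = rng.uniform(0, 4, size=(64, 3))
        with Pool(4) as pool:
            for _ in range(50):
                scores = np.array(pool.map(fitness, pop))   # map phase
                parents = pop[np.argsort(scores)[-16:]]     # select
                pop = (parents[rng.integers(0, 16, size=64)]
                       + rng.normal(0, 0.05, size=(64, 3))) # mutate
            best = pop[np.argmax(pool.map(fitness, pop))]
        print("recovered parameters:", np.round(best, 2))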
2012-01-01
Computational approaches to generate hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, it still remains a challenge to automatically discover novel, cross-silo biomedical hypotheses from large-scale literature repositories. In order to address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypothesis generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and a concept-author map on a cluster using the Map-Reduce framework. We extract a set of heterogeneous features such as random-walk-based features, neighborhood features, and common-author features. The potential number of links to consider for link discovery is large in our concept network, and to address this scalability problem, the features are extracted from the concept network on a cluster with the Map-Reduce framework. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time periods. A set of heterogeneous features, covering both topological and semantic features derived from the concept network, has been studied with respect to its impact on the accuracy of the proposed supervised link discovery process. A case study of hypothesis generation based on the proposed method is presented in the paper. PMID:22759614
Hadoop for High-Performance Climate Analytics: Use Cases and Lessons Learned
NASA Technical Reports Server (NTRS)
Tamkin, Glenn
2013-01-01
Scientific data services are a critical aspect of the NASA Center for Climate Simulation's (NCCS) mission. Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving to be useful for data-intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern Era Retrospective-Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data-intensive analytic workflows, we invested in building a cyberinfrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent in the preparation of reanalysis data used in data-model intercomparison, a long-sought goal of the climate community. This paper summarizes the related use cases and lessons learned.
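The canonical averaging operation reduces to a grouping key plus a mean. The streaming-style sketch below assumes a flattened "variable<TAB>YYYYMM<TAB>lat<TAB>lon<TAB>value" text layout; real MERRA granules are HDF/netCDF and would need a custom Hadoop input format.

    #!/usr/bin/env python
    # Map keys each grid value by (variable, month, cell); reduce
    # averages each group. Reducer input arrives sorted by key.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            var, yyyymm, lat, lon, value = line.rstrip("\n").split("\t")
            print("%s|%s|%s|%s\t%s" % (var, yyyymm[4:6], lat, lon, value))

    def reducer(lines):
        rows = (l.rstrip("\n").split("\t") for l in lines)
        for key, group in groupby(rows, key=lambda r: r[0]):
            vals = [float(v) for _, v in group]
            print("%s\t%.3f" % (key, sum(vals) / len(vals)))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)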
Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services.
Chrimes, Dillon; Zamani, Hamid
2017-01-01
Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective, to establish an interactive BDA platform with simulated patient data using open-source software technologies, was achieved by constructing a platform framework with the Hadoop Distributed File System (HDFS) using HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files showed sustained availability over hundreds of iterations; however, completing the MapReduce load to HBase required a week (for 10 TB) and a month for three billion (30 TB) indexed patient records. Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. Basing the hospital system on patient-centric data proved challenging with HBase, whereby not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.
Large-scale virtual screening on public cloud resources with Apache Spark.
Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola
2017-01-01
Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the message passing interface, relying on low-failure-rate hardware and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level. Open-source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, docking a publicly available target receptor against approximately 2.2 M compounds. The performance experiments show a good parallel efficiency (87%) when running in a public cloud environment. Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries. Our implementation is named Spark-VS and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).
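In Spark terms the pattern is a map over the ligand library followed by a top-k action. The PySpark sketch below uses a random scoring stub in place of a real docking engine, whose binary name and flags would be site-specific; only the structure of the job mirrors the paper.

    # Partition the ligand library, score each ligand, keep the best
    # poses. `dock_one` is a stub: a real implementation would shell out
    # to the docking engine (e.g. via subprocess).
    from pyspark import SparkContext
    import random

    def dock_one(smiles):
        random.seed(hash(smiles))                      # deterministic stub
        return (smiles, random.uniform(-12.0, -2.0))   # kcal/mol-like score

    if __name__ == "__main__":
        sc = SparkContext(appName="vs-sketch")
        ligands = sc.textFile("ligands.smi")           # one SMILES per line
        top = (ligands.map(dock_one)
                      .takeOrdered(10, key=lambda kv: kv[1]))  # lowest = best
        for smiles, score in top:
            print(smiles, score)
        sc.stop()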
Data Intensive Computing on Amazon Web Services
DOE Office of Scientific and Technical Information (OSTI.GOV)
Magana-Zook, S. A.
The Geophysical Monitoring Program (GMP) has spent the past few years building up the capability to perform data intensive computing using what have been referred to as “big data” tools. These big data tools would be used against massive archives of seismic signals (>300 TB) to conduct research not previously possible. Examples of such tools include Hadoop (HDFS, MapReduce), HBase, Hive, Storm, Spark, Solr, and many more by the day. These tools are useful for performing data analytics on datasets that exceed the resources of traditional analytic approaches. To this end, a research big data cluster (“Cluster A”) was set up as a collaboration between GMP and Livermore Computing (LC).
Distributed and parallel approach for handle and perform huge datasets
NASA Astrophysics Data System (ADS)
Konopko, Joanna
2015-12-01
Big Data refers to the dynamic, large, and disparate volumes of data that come from many different sources (tools, machines, sensors, mobile devices), uncorrelated with each other. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data. A proper architecture for a system that processes huge data sets is needed. In this paper, a comparison of distributed and parallel system architectures is presented using the example of the MapReduce (MR) Hadoop platform and a parallel database platform (DBMS). This paper also analyzes the problem of extracting valuable information from petabytes of data. Both paradigms, MapReduce and parallel DBMS, are described and compared. A hybrid architecture approach is also proposed that could be used to solve the analyzed problem of storing and processing Big Data.
Near Real-Time Processing of Proteomics Data Using Hadoop.
Hillman, Chris; Ahmad, Yasmeen; Whitehorn, Mark; Cobley, Andy
2014-03-01
This article presents a near real-time processing solution using MapReduce and Hadoop. The solution is aimed at some of the data management and processing challenges facing the life sciences community. Research into genes and their product proteins generates huge volumes of data that must be extensively preprocessed before any biological insight can be gained. In order to carry out this processing in a timely manner, we have investigated the use of techniques from the big data field. These are applied specifically to process data resulting from mass spectrometers in the course of proteomic experiments. Here we present methods of handling the raw data in Hadoop, and then we investigate a process for preprocessing the data using Java code and the MapReduce framework to identify 2D and 3D peaks.
HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.
O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D
2015-04-01
The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.
Behavior Life Style Analysis for Mobile Sensory Data in Cloud Computing through MapReduce
Hussain, Shujaat; Bang, Jae Hun; Han, Manhyung; Ahmed, Muhammad Idris; Amin, Muhammad Bilal; Lee, Sungyoung; Nugent, Chris; McClean, Sally; Scotney, Bryan; Parr, Gerard
2014-01-01
Cloud computing has revolutionized healthcare in today's world, as it can be seamlessly integrated into mobile applications and sensor devices. Sensory data are then transferred from these devices to public and private clouds. In this paper, a hybrid and distributed environment is built which is capable of collecting data from a mobile phone application and storing it in the cloud. We developed an activity recognition application and transferred the data to the cloud for further processing. The big data technology Hadoop MapReduce is employed to analyze the data and create a timeline of the user's activities. These activities are visualized to find useful health analytics and trends. In this paper a big data solution is proposed to analyze the sensory data and give insights into user behavior and lifestyle trends. PMID:25420151
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
NASA Astrophysics Data System (ADS)
Farhan Husain, Mohammad; Doshi, Pankil; Khan, Latifur; Thuraisingham, Bhavani
Handling huge amounts of data scalably has long been a matter of concern, and the same is true for semantic web data. Current semantic web frameworks lack this ability. In this paper, we describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples. We describe our schema for storing RDF data in the Hadoop Distributed File System. We also present our algorithms for answering a SPARQL query. We make use of Hadoop's MapReduce framework to actually answer the queries. Our results reveal that we can store huge amounts of semantic web data in Hadoop clusters built mostly from cheap commodity-class hardware and still answer queries fast enough. We conclude that ours is a scalable framework, able to handle large amounts of RDF data efficiently.
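A single basic graph pattern with one join variable maps naturally onto a reduce-side join. The sketch below is our illustration rather than the paper's algorithm: it matches two triple patterns and joins them on the shared subject, with triples flattened to tab-separated lines.

    #!/usr/bin/env python
    # For the query SELECT ?s WHERE { ?s :type :Person . ?s :worksAt :X }
    # the map emits each matching triple keyed by the join variable ?s;
    # the reduce outputs subjects seen with both patterns. Triples arrive
    # as "subject<TAB>predicate<TAB>object" lines (a simplification).
    import sys
    from itertools import groupby

    PATTERNS = {(":type", ":Person"): "p1", (":worksAt", ":X"): "p2"}

    def mapper(lines):
        for line in lines:
            s, p, o = line.rstrip("\n").split("\t")
            tag = PATTERNS.get((p, o))
            if tag:
                print("%s\t%s" % (s, tag))

    def reducer(lines):
        rows = (l.rstrip("\n").split("\t") for l in lines)
        for subject, group in groupby(rows, key=lambda r: r[0]):
            if {tag for _, tag in group} == {"p1", "p2"}:
                print(subject)

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)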
Kalyanaraman, Ananth; Cannon, William R; Latt, Benjamin; Baxter, Douglas J
2011-11-01
A MapReduce-based implementation called MR-MSPolygraph for parallelizing peptide identification from mass spectrometry data is presented. The underlying serial method, MSPolygraph, uses a novel hybrid approach to match an experimental spectrum against a combination of a protein sequence database and a spectral library. Our MapReduce implementation can run on any Hadoop cluster environment. Experimental results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours, for processing tens of thousands of experimental spectra. Speedup and other related performance studies are also reported on a 400-core Hadoop cluster using spectral datasets from environmental microbial communities as inputs. The source code along with user documentation are available on http://compbio.eecs.wsu.edu/MR-MSPolygraph. ananth@eecs.wsu.edu; william.cannon@pnnl.gov. Supplementary data are available at Bioinformatics online.
Cloud-Enabled Climate Analytics-as-a-Service using Reanalysis data: A case study.
NASA Astrophysics Data System (ADS)
Nadeau, D.; Duffy, D.; Schnase, J. L.; McInerney, M.; Tamkin, G.; Potter, G. L.; Thompson, J. H.
2014-12-01
The NASA Center for Climate Simulation (NCCS) maintains advanced data capabilities and facilities that allow researchers to access the enormous volume of data generated by weather and climate models. The NASA Climate Model Data Service (CDS) and the NCCS are merging their efforts to provide Climate Analytics-as-a-Service for the comparative study of the major reanalysis projects: ECMWF ERA-Interim, NASA/GMAO MERRA, NOAA/NCEP CFSR, NOAA/ESRL 20CR, JMA JRA25, and JRA55. These reanalyses have been repackaged into the netCDF4 file format following the CMIP5 Climate and Forecast (CF) metadata convention prior to being sequenced into the Hadoop Distributed File System (HDFS). A small set of operations that represent a common starting point in many analysis workflows was then created: min, max, sum, count, variance, and average. In this example, reanalysis data exploration was performed with the use of Hadoop MapReduce, and accessibility was achieved using the Climate Data Service (CDS) application programming interface (API) created at NCCS. This API provides a uniform treatment of large amounts of data. In this case study, we limited our exploration to two variables, temperature and precipitation, using three operations (min, max, and avg) and 30 years of reanalysis data for three regions of the world: global, polar, and subtropical.
YARNsim: Simulating Hadoop YARN
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Ning; Yang, Xi; Sun, Xian-He
Despite the popularity of the Apache Hadoop system, its success has been limited by issues such as single points of failure, centralized job/task management, and lack of support for programming models other than MapReduce. The next generation of Hadoop, Apache Hadoop YARN, is designed to address these issues. In this paper, we propose YARNsim, a simulation system for Hadoop YARN. YARNsim is based on parallel discrete event simulation and provides protocol-level accuracy in simulating key components of YARN. YARNsim provides a virtual platform on which system architects can evaluate the design and implementation of Hadoop YARN systems. Also, application developers can tune job performance and understand the tradeoffs between different configurations, and Hadoop YARN system vendors can evaluate system efficiency under limited budgets. To demonstrate the validity of YARNsim, we use it to model two real systems and compare the experimental results from YARNsim and the real systems. The experiments include standard Hadoop benchmarks, synthetic workloads, and a bioinformatics application. The results show that the error rate is within 10% for the majority of test cases. The experiments prove that YARNsim can provide what-if analysis for system designers in a timely manner and at minimal cost compared with testing and evaluating on a real system.
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce.
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-08-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, the customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have shown that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of libraries for processing spatial queries, and as an integrated software package in Hive.
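The partition-then-join idea, including the boundary-object handling mentioned above, can be sketched in a few lines: map each object to every grid tile its bounding box touches, join locally per tile, and de-duplicate pairs that meet in more than one tile. Axis-aligned boxes stand in here for real geometries and the RESQUE engine.

    # Grid partitioning with boundary-object duplication, a local join
    # per tile, and pair de-duplication.
    from collections import defaultdict
    from itertools import product

    TILE = 10.0

    def tiles(box):
        minx, miny, maxx, maxy = box
        xs = range(int(minx // TILE), int(maxx // TILE) + 1)
        ys = range(int(miny // TILE), int(maxy // TILE) + 1)
        return product(xs, ys)

    def intersects(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    layer_a = {"a1": (1, 1, 12, 4), "a2": (20, 20, 25, 25)}
    layer_b = {"b1": (9, 2, 15, 6), "b2": (40, 1, 44, 3)}

    groups = defaultdict(lambda: ([], []))        # map phase: tile -> objects
    for name, box in layer_a.items():
        for t in tiles(box):
            groups[t][0].append((name, box))
    for name, box in layer_b.items():
        for t in tiles(box):
            groups[t][1].append((name, box))

    results = set()                               # reduce phase per tile
    for side_a, side_b in groups.values():
        for (na, ba), (nb, bb) in product(side_a, side_b):
            if intersects(ba, bb):
                results.add((na, nb))             # dedup boundary repeats
    print(sorted(results))                        # [('a1', 'b1')]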
Monitoring of services with non-relational databases and map-reduce framework
NASA Astrophysics Data System (ADS)
Babik, M.; Souto, F.
2012-12-01
Service Availability Monitoring (SAM) is a well-established monitoring framework that performs regular measurements of the core site services and reports the corresponding availability and reliability of the Worldwide LHC Computing Grid (WLCG) infrastructure. One of the existing extensions of SAM is Site Wide Area Testing (SWAT), which gathers monitoring information from the worker nodes via instrumented jobs. This generates quite a lot of monitoring data to process, as there are several data points for every job and several million jobs are executed every day. The recent uptake of non-relational databases opens a new paradigm in the large-scale storage and distributed processing of systems with heavy read-write workloads. For SAM this brings new possibilities to improve its model, from performing aggregation of measurements to storing raw data and subsequent re-processing. Both SAM and SWAT are currently tuned to run at top performance, reaching some of the limits in storage and processing power of their existing Oracle relational database. We investigated the usability and performance of non-relational storage together with its distributed data processing capabilities. For this, several popular systems have been compared. In this contribution we describe our investigation of the existing non-relational databases suited for monitoring systems covering Cassandra, HBase and MongoDB. Further, we present our experiences in data modeling and prototyping map-reduce algorithms focusing on the extension of the already existing availability and reliability computations. Finally, possible future directions in this area are discussed, analyzing the current deficiencies of the existing Grid monitoring systems and proposing solutions to leverage the benefits of the non-relational databases to get more scalable and flexible frameworks.
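As a concrete picture of the kind of re-processing proposed, the streaming-style sketch below recomputes a daily per-service availability ratio from raw probe results; the "timestamp<TAB>service<TAB>status" record layout is an illustrative stand-in for the actual SAM/SWAT schema.

    #!/usr/bin/env python
    # Map keys each probe result by (service, day); reduce derives
    # availability as the fraction of OK results per group.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            ts, service, status = line.rstrip("\n").split("\t")
            day = ts[:10]          # e.g. "2012-07-14T09:00" -> "2012-07-14"
            print("%s|%s\t%d" % (service, day, 1 if status == "OK" else 0))

    def reducer(lines):
        rows = (l.rstrip("\n").split("\t") for l in lines)
        for key, group in groupby(rows, key=lambda r: r[0]):
            flags = [int(f) for _, f in group]
            print("%s\t%.3f" % (key, sum(flags) / float(len(flags))))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)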
NASA Astrophysics Data System (ADS)
Schnase, J. L.; Duffy, D. Q.; McInerney, M. A.; Tamkin, G. S.; Thompson, J. H.; Gill, R.; Grieg, C. M.
2012-12-01
MERRA Analytic Services (MERRA/AS) is a cyberinfrastructure resource for developing and evaluating a new generation of climate data analysis capabilities. MERRA/AS supports OBS4MIP activities by reducing the time spent in the preparation of Modern Era Retrospective-Analysis for Research and Applications (MERRA) data used in data-model intercomparison. It also provides a testbed for experimental development of high-performance analytics. MERRA/AS is a cloud-based service built around the Virtual Climate Data Server (vCDS) technology that is currently used by the NASA Center for Climate Simulation (NCCS) to deliver Intergovernmental Panel on Climate Change (IPCC) data to the Earth System Grid Federation (ESGF). Crucial to its effectiveness, MERRA/AS's servers will use a workflow-generated realizable object capability to perform analyses over the MERRA data using the MapReduce approach to parallel storage-based computation. The results produced by these operations will be stored by the vCDS, which will also be able to host code sets for those who wish to explore the use of MapReduce for more advanced analytics. While the work described here will focus on the MERRA collection, these technologies can be used to publish other reanalysis, observational, and ancillary OBS4MIP data to ESGF and, importantly, offer an architectural approach to climate data services that can be generalized to applications and customers beyond the traditional climate research community. In this presentation, we describe our approach, experiences, lessons learned, and plans for the future. [Figures: (A) MERRA/AS software stack. (B) Example MERRA/AS interfaces.]
Efficient Retrieval of Massive Ocean Remote Sensing Images via a Cloud-Based Mean-Shift Algorithm.
Yang, Mengzhao; Song, Wei; Mei, Haibin
2017-07-23
The rapid development of remote sensing (RS) technology has resulted in the proliferation of high-resolution images. There are challenges involved in not only storing large volumes of RS images but also in rapidly retrieving the images for ocean disaster analysis such as for storm surges and typhoon warnings. In this paper, we present an efficient retrieval of massive ocean RS images via a Cloud-based mean-shift algorithm. A distributed construction method based on the pyramid model and the maximum hierarchical layer algorithm is proposed and used to realize an efficient storage structure for RS images on the Cloud platform. We achieve high-performance processing of massive RS images in the Hadoop system. Based on the pyramid Hadoop distributed file system (HDFS) storage method, an improved mean-shift algorithm for RS image retrieval is presented by fusing it with the canopy algorithm via Hadoop MapReduce programming. The results show that the new method can achieve better performance for data storage than HDFS alone and WebGIS-based HDFS. Speedup and scaleup are very close to linear with an increase in the number of RS images, which proves that image retrieval using our method is efficient.
Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce.
Nellore, Abhinav; Wilks, Christopher; Hansen, Kasper D; Leek, Jeffrey T; Langmead, Ben
2016-08-15
Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyse dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data. We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyse dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise. Rail-RNA is available from http://rail.bio. Technical details on the Rail-dbGaP protocol as well as an implementation walkthrough are available at https://github.com/nellore/rail-dbgap. Detailed instructions on running Rail-RNA on dbGaP-protected data using Amazon Web Services are available at http://docs.rail.bio/dbgap/. Contact: anellore@gmail.com or langmea@cs.jhu.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Estrada, T; Zhang, B; Cicotti, P; Armen, R S; Taufer, M
2012-07-01
We present a scalable and accurate method for classifying protein-ligand binding geometries in molecular docking. Our method is a three-step process: the first step encodes the geometry of a three-dimensional (3D) ligand conformation into a single 3D point in the space; the second step builds an octree by assigning an octant identifier to every single point in the space under consideration; and the third step performs an octree-based clustering on the reduced conformation space and identifies the most dense octant. We adapt our method for MapReduce and implement it in Hadoop. The load-balancing, fault-tolerance, and scalability in MapReduce allow screening of very large conformation spaces not approachable with traditional clustering methods. We analyze results for docking trials for 23 protein-ligand complexes for HIV protease, 21 protein-ligand complexes for Trypsin, and 12 protein-ligand complexes for P38alpha kinase. We also analyze cross docking trials for 24 ligands, each docking into 24 protein conformations of the HIV protease, and receptor ensemble docking trials for 24 ligands, each docking in a pool of HIV protease receptors. Our method demonstrates significant improvement over energy-only scoring for the accurate identification of native ligand geometries in all these docking assessments. The advantages of our clustering approach make it attractive for complex applications in real-world drug design efforts. We demonstrate that our method is particularly useful for clustering docking results using a minimal ensemble of representative protein conformational states (receptor ensemble docking), which is now a common strategy to address protein flexibility in molecular docking. Copyright © 2012 Elsevier Ltd. All rights reserved.
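The octant-assignment step lends itself to a compact illustration. The sketch below, under assumed unit-cube bounds and toy points, shows how a 3D point (one encoded ligand conformation) can be mapped to an octree path and how a reduce-style count finds the densest octant; it illustrates the technique, not the authors' implementation.

```python
from collections import Counter

def octant_id(point, bounds, depth):
    """Assign an octree octant identifier to a 3D point by recursively
    halving the bounding box `depth` times. Returns a tuple of octant
    indices (0-7), one per level. Illustrative, not the authors' code."""
    (xmin, ymin, zmin), (xmax, ymax, zmax) = bounds
    x, y, z = point
    path = []
    for _ in range(depth):
        cx, cy, cz = (xmin + xmax) / 2, (ymin + ymax) / 2, (zmin + zmax) / 2
        idx = (x >= cx) | ((y >= cy) << 1) | ((z >= cz) << 2)
        path.append(idx)
        xmin, xmax = (cx, xmax) if x >= cx else (xmin, cx)
        ymin, ymax = (cy, ymax) if y >= cy else (ymin, cy)
        zmin, zmax = (cz, zmax) if z >= cz else (zmin, cz)
    return tuple(path)

# Map: each ligand conformation (already encoded as one 3D point) emits
# its octant path; Reduce: count points per octant and keep the densest.
points = [(0.1, 0.2, 0.3), (0.12, 0.22, 0.31), (0.9, 0.8, 0.7)]
bounds = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
counts = Counter(octant_id(p, bounds, depth=3) for p in points)
densest, n = counts.most_common(1)[0]
print(densest, n)  # the most densely populated octant and its count
```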
A Hadoop-Based Distributed Framework for Efficient Managing and Processing Big Remote Sensing Images
NASA Astrophysics Data System (ADS)
Wang, C.; Hu, F.; Hu, X.; Zhao, S.; Wen, W.; Yang, C.
2015-07-01
Various sensors from airborne and satellite platforms are producing large volumes of remote sensing images for mapping, environmental monitoring, disaster management, military intelligence, and others. However, it is challenging to efficiently store, query, and process such big data due to data- and computing-intensive issues. In this paper, a Hadoop-based framework is proposed to manage and process the big remote sensing data in a distributed and parallel manner. In particular, remote sensing data can be directly fetched from other data platforms into the Hadoop Distributed File System (HDFS). The Orfeo toolbox, a ready-to-use tool for large image processing, is integrated into MapReduce to provide a rich set of image processing operations. With the integration of HDFS, the Orfeo toolbox and MapReduce, these remote sensing images can be directly processed in parallel in a scalable computing environment. The experiment results show that the proposed framework can efficiently manage and process such big remote sensing data.
Cloud-based large-scale air traffic flow optimization
NASA Astrophysics Data System (ADS)
Cao, Yi
The ever-increasing traffic demand makes the efficient use of airspace an imperative mission, and this paper presents an effort in response to this call. Firstly, a new aggregate model, called the Link Transmission Model (LTM), is proposed, which models the nationwide traffic as a network of flight routes identified by origin-destination pairs. The traversal time of a flight route is assumed to be the mode of the distribution of historical flight records, and the mode is estimated by using Kernel Density Estimation. As this simplification abstracts away physical trajectory details, the complexity of modeling is drastically decreased, resulting in efficient traffic forecasting. The predictive capability of LTM is validated against recorded traffic data. Secondly, a nationwide traffic flow optimization problem with airport and en route capacity constraints is formulated based on LTM. The optimization problem aims at alleviating traffic congestion with minimal global delays. This problem is intractable due to millions of variables. A dual decomposition method is applied to decompose the large-scale problem such that the subproblems are solvable. However, the whole problem is still computationally expensive to solve, since each subproblem is a smaller integer programming problem that pursues integer solutions. Solving an integer programming problem is known to be far more time-consuming than solving its linear relaxation. In addition, sequential execution on a standalone computer leads to a linear runtime increase when the problem size increases. To address the computational efficiency problem, a parallel computing framework is designed which accommodates concurrent executions via multithreaded programming. The multithreaded version is compared with its monolithic version to show decreased runtime. Finally, an open-source cloud computing framework, Hadoop MapReduce, is employed for better scalability and reliability. This framework is an "off-the-shelf" parallel computing model that can be used for both offline historical traffic data analysis and online traffic flow optimization. It provides an efficient and robust platform for easy deployment and implementation. A small cloud consisting of five workstations was configured and used to demonstrate the advantages of cloud computing in dealing with large-scale parallelizable traffic problems.
Simplified Parallel Domain Traversal
DOE Office of Scientific and Technical Information (OSTI.GOV)
Erickson III, David J
2011-01-01
Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep by performing teleconnection analysis across ensemble runs of terascale atmospheric CO2 and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.
Hodor, Paul; Chawla, Amandeep; Clark, Andrew; Neal, Lauren
2016-01-15
One of the solutions proposed for addressing the challenge of the overwhelming abundance of genomic sequence and other biological data is the use of the Hadoop computing framework. Appropriate tools are needed to set up computational environments that facilitate research of novel bioinformatics methodology using Hadoop. Here, we present cl-dash, a complete starter kit for setting up such an environment. Configuring and deploying new Hadoop clusters can be done in minutes. Use of Amazon Web Services ensures no initial investment and minimal operation costs. Two sample bioinformatics applications help the researcher understand and learn the principles of implementing an algorithm using the MapReduce programming pattern. Source code is available at https://bitbucket.org/booz-allen-sci-comp-team/cl-dash.git. Contact: hodor_paul@bah.com. © The Author 2015. Published by Oxford University Press.
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Taylor, Ronald C.
Bioinformatics researchers are increasingly confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employ Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date.
A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset.
Kamal, Sarwar; Ripon, Shamim Hasnat; Dey, Nilanjan; Ashour, Amira S; Santhi, V
2016-07-01
In the age of the information superhighway, big data play a significant role in information processing, extraction, retrieval, and management. In computational biology, the continuous challenge is to manage the biological data. Data mining techniques sometimes fall short of new space and time requirements. Thus, it is critical to process massive amounts of data to retrieve knowledge. The existing software and automated tools to handle big data sets are not sufficient. As a result, a scalable mining technique that exploits the large storage and processing capability of distributed or parallel processing platforms is essential. In this analysis, a contemporary distributed clustering methodology for imbalanced data reduction using a k-nearest neighbor (K-NN) classification approach has been introduced. The pivotal objective of this work is to represent real training data sets with a reduced number of elements or instances. These reduced data sets ensure faster data classification and standard storage management with less sensitivity. However, general data reduction methods cannot manage very big data sets. To minimize these difficulties, a MapReduce-oriented framework is designed using various clusters of automated contents, comprising multiple algorithmic approaches. To test the proposed approach, a real DNA (deoxyribonucleic acid) dataset that consists of 90 million pairs has been used. The proposed model reduces imbalanced data sets drawn from large-scale data without loss of accuracy. The obtained results show that the MapReduce-based K-NN classifier provided accurate results for big DNA data. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
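As a rough illustration of per-partition instance reduction in a map-reduce setting, the sketch below partitions labelled instances (map), then applies a simplified condensed-nearest-neighbour pass within each partition (reduce). The partitioning rule and the condensation heuristic are stand-ins, not the authors' exact algorithm.

```python
import math
from collections import defaultdict

def condense(instances):
    """Simplified condensed-nearest-neighbour reduction: keep an instance
    only if the currently kept set would misclassify it. A stand-in for
    the paper's K-NN-based reduction, not the exact algorithm."""
    kept = [instances[0]]
    for x, label in instances[1:]:
        nearest = min(kept, key=lambda kv: math.dist(kv[0], x))
        if nearest[1] != label:          # misclassified -> keep it
            kept.append((x, label))
    return kept

# Map: route each (features, label) instance to a partition key;
# Reduce: condense each partition independently, as MapReduce would.
data = [((0.0, 0.0), "A"), ((0.1, 0.1), "A"),
        ((5.0, 5.0), "B"), ((5.1, 5.2), "B")]
parts = defaultdict(list)
for x, label in data:
    parts[hash(x) % 2].append((x, label))

reduced = [kv for part in parts.values() for kv in condense(part)]
print(len(data), "->", len(reduced))  # fewer instances, same classes
```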
A framework for building hypercubes using MapReduce
NASA Astrophysics Data System (ADS)
Tapiador, D.; O'Mullane, W.; Brown, A. G. A.; Luri, X.; Huedo, E.; Osuna, P.
2014-05-01
The European Space Agency's Gaia mission will create the largest and most precise three dimensional chart of our galaxy (the Milky Way), by providing unprecedented position, parallax, proper motion, and radial velocity measurements for about one billion stars. The resulting catalog will be made available to the scientific community and will be analyzed in many different ways, including the production of a variety of statistics. The latter will often entail the generation of multidimensional histograms and hypercubes as part of the precomputed statistics for each data release, or for scientific analysis involving either the final data products or the raw data coming from the satellite instruments. In this paper we present and analyze a generic framework that allows the hypercube generation to be easily done within a MapReduce infrastructure, providing all the advantages of the new Big Data analysis paradigm but without dealing with any specific interface to the lower level distributed system implementation (Hadoop). Furthermore, we show how executing the framework for different data storage model configurations (i.e. row or column oriented) and compression techniques can considerably improve the response time of this type of workload for the currently available simulated data of the mission. In addition, we put forward the advantages and shortcomings of the deployment of the framework on a public cloud provider, benchmark against other popular solutions available (that are not always the best for such ad-hoc applications), and describe some user experiences with the framework, which was employed for a number of dedicated astronomical data analysis techniques workshops.
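A minimal sketch of the map-reduce hypercube idea: each record is mapped to a tuple of bin indices (its hypercube cell), and the reduce step counts records per cell. The field names and bin widths are invented for illustration and are unrelated to the actual Gaia data model.

```python
from collections import Counter

# Hypothetical star records; the binning scheme is illustrative, not the
# actual Gaia pipeline. Map: record -> hypercube cell key; Reduce: count.
stars = [
    {"mag": 12.3, "parallax": 4.1}, {"mag": 12.7, "parallax": 4.4},
    {"mag": 15.2, "parallax": 0.9},
]

def map_to_cell(star, mag_bin=1.0, plx_bin=1.0):
    # The key is the tuple of bin indices along each dimension of the cube.
    return (int(star["mag"] // mag_bin), int(star["parallax"] // plx_bin))

# Shuffle + reduce collapses to a per-cell count (the histogram).
hypercube = Counter(map_to_cell(s) for s in stars)
print(hypercube)  # Counter({(12, 4): 2, (15, 0): 1})
```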
A Hadoop-Based Algorithm of Generating DEM Grid from Point Cloud Data
NASA Astrophysics Data System (ADS)
Jian, X.; Xiao, X.; Chengfang, H.; Zhizhong, Z.; Zhaohui, W.; Dengzhong, Z.
2015-04-01
Airborne LiDAR technology has proven to be one of the most powerful tools for obtaining high-density, high-accuracy, and significantly detailed surface information about terrain and surface objects within a short time, from which a high-quality Digital Elevation Model (DEM) can be extracted. Point cloud data generated from the pre-processed data must be classified by segmentation algorithms to distinguish terrain points from other points, followed by interpolation of the selected points to turn them into DEM data. Because of the high point density, the whole procedure takes a long time and heavy computing resources, a problem that has attracted considerable research attention. Hadoop is a distributed system infrastructure developed by the Apache Foundation, which contains a highly fault-tolerant distributed file system (HDFS) with a high transmission rate and a parallel programming model (Map/Reduce). Such a framework is appropriate for DEM generation algorithms to improve efficiency. Point cloud data of Dongting Lake acquired by a Riegl LMS-Q680i laser scanner were used as the original data to generate a DEM with a Hadoop-based algorithm implemented on Linux, followed by a traditional procedure programmed in C++ as the comparative experiment. Then the algorithm's efficiency, coding complexity, and performance-cost ratio were discussed for the comparison. The results demonstrate that the algorithm's speed depends on the size of the point set and the density of the DEM grid; the non-Hadoop implementation achieves high performance when memory is large enough, but the multi-node Hadoop implementation achieves a higher performance-cost ratio when the point set is very large.
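The core map-reduce step, keying each LiDAR return by the DEM cell it falls in and aggregating per cell, can be sketched in a few lines. The cell size, the points, and simple per-cell averaging (standing in for a real interpolator such as IDW) are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical LiDAR ground points (x, y, z); cell size is illustrative.
points = [(10.2, 4.1, 101.5), (10.7, 4.3, 101.9), (55.0, 80.2, 130.4)]
CELL = 1.0  # DEM grid resolution, in the same units as x/y

def map_phase(point):
    x, y, z = point
    # Key each point by the DEM cell it falls in.
    yield (int(x // CELL), int(y // CELL)), z

def reduce_phase(cell, elevations):
    # Simple per-cell averaging; a real pipeline would interpolate (e.g. IDW).
    return cell, sum(elevations) / len(elevations)

cells = defaultdict(list)           # shuffle: group elevations by cell
for p in points:
    for key, z in map_phase(p):
        cells[key].append(z)

dem = dict(reduce_phase(c, zs) for c, zs in cells.items())
print(dem)  # {(10, 4): 101.7, (55, 80): 130.4}
```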
NoSQL Based 3D City Model Management System
NASA Astrophysics Data System (ADS)
Mao, B.; Harrie, L.; Cao, J.; Wu, Z.; Shen, J.
2014-04-01
To manage increasingly complicated 3D city models, a framework based on a NoSQL database is proposed in this paper. The framework supports import and export of 3D city models according to international standards such as CityGML, KML/COLLADA and X3D. We also suggest and implement 3D model analysis and visualization in the framework. For city model analysis, 3D geometry data and semantic information (such as name, height, area, price and so on) are stored and processed separately. We use a Map-Reduce method to deal with the 3D geometry data since it is more complex, while the semantic analysis is mainly based on database query operations. For visualization, a multi-representation 3D city structure, CityTree, is implemented within the framework to support dynamic LODs based on the user viewpoint. Also, the proposed framework is easily extensible and supports geo-indexes to speed up querying. Our experimental results show that the proposed 3D city management system can efficiently fulfil the analysis and visualization requirements.
ASSURED CLOUD COMPUTING UNIVERSITY CENTER OFEXCELLENCE (ACC UCOE)
2018-01-18
[Report documentation fragment; only scattered text is recoverable. Stated research topics include cloud infrastructure security, design of algorithms and techniques for real-time assuredness in cloud computing, and map-reduce task assignment with data locality.]
Secure and Resilient Cloud Computing for the Department of Defense
2015-11-16
[Abstract fragment; only a flattened table of cloud service models survives. Recoverable entries: platform as a service (PaaS), exposing application programming interfaces (API) and services, with examples Amazon Elastic MapReduce, MathWorks Cloud and Red Hat OpenShift; and software as a service (SaaS), providing full-fledged applications such as Google Gmail. The models target system administrators, developers, and end-users respectively.]
Parallel processing optimization strategy based on MapReduce model in cloud storage environment
NASA Astrophysics Data System (ADS)
Cui, Jianming; Liu, Jiayi; Li, Qiuyan
2017-05-01
Currently, many cloud storage pipelines package files only after all packets have been received. During transfer from the local transmitter to the server, packing and unpacking consume considerable time, so transmission efficiency is low. A new parallel processing algorithm is proposed to optimize the transmission mode. Following the MapReduce model, MPI technology is used to execute the Mapper and Reducer mechanisms in parallel. Simulation experiments on a Hadoop cloud computing platform show that this algorithm not only accelerates the file transfer rate but also shortens the waiting time of the Reducer mechanism. It breaks through the constraints of traditional sequential transmission and reduces storage coupling to improve transmission efficiency.
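A minimal mpi4py sketch of the Mapper/Reducer-style parallelism described above: the root rank scatters chunks of work, every rank maps its chunk concurrently, and a gather plays the Reducer's collecting role. The per-packet workload shown is a placeholder, not the paper's transfer pipeline.

```python
# Run with: mpiexec -n 4 python this_script.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Root splits the workload (here: stand-in packet values) into chunks.
chunks = None
if rank == 0:
    data = list(range(100))
    chunks = [data[i::size] for i in range(size)]

# Scatter plays the Mapper dispatch: each rank gets one chunk and
# processes it concurrently instead of waiting for a complete package.
chunk = comm.scatter(chunks, root=0)
mapped = [x * x for x in chunk]          # stand-in per-packet work

# Gather plays the Reducer collecting partial results on the root.
partials = comm.gather(sum(mapped), root=0)
if rank == 0:
    print("total:", sum(partials))
```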
ASSET Queries: A Set-Oriented and Column-Wise Approach to Modern OLAP
NASA Astrophysics Data System (ADS)
Chatziantoniou, Damianos; Sotiropoulos, Yannis
Modern data analysis has given birth to numerous grouping constructs and programming paradigms, way beyond the traditional group by. Applications such as data warehousing, web log analysis, streams monitoring and social networks understanding necessitated the use of data cubes, grouping variables, windows and MapReduce. In this paper we review the associated set (ASSET) concept and discuss its applicability in both continuous and traditional data settings. Given a set of values B, an associated set over B is just a collection of annotated data multisets, one for each b ∈ B. The goal is to efficiently compute aggregates over these data sets. An ASSET query consists of repeated definitions of associated sets and aggregates of these, possibly correlated, resembling a spreadsheet document. We review systems implementing ASSET queries both in continuous and persistent contexts and argue for associated sets' analytical abilities and optimization opportunities.
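In plain terms: for each b ∈ B we accumulate an annotated multiset and then aggregate it. A tiny sketch under an invented schema (users and amounts; this is not ASSET syntax):

```python
from collections import defaultdict

# Sketch of an associated set over B: for each b in B we keep a multiset
# of matching values and then aggregate it, spreadsheet-style. The schema
# and predicate are illustrative, not the ASSET system's syntax.
B = ["alice", "bob"]                     # the grouping values
rows = [("alice", 3), ("alice", 5), ("bob", 2), ("carol", 9)]

assoc = defaultdict(list)                # one multiset per b in B
for user, amount in rows:
    if user in B:                        # the correlating predicate
        assoc[user].append(amount)

aggregates = {b: sum(vals) for b, vals in assoc.items()}
print(aggregates)  # {'alice': 8, 'bob': 2}
```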
Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele
2018-06-01
Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands resorting to parallel and distributed computing. Indeed, algorithms of this type have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, of resorting to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. The software, including instructions for running it over Amazon AWS, as well as the datasets, are available at http://www.di-srv.unisa.it/KCH. Contact: umberto.ferraro@uniroma1.it. Supplementary data are available at Bioinformatics online.
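k-mer counting is the canonical word-count-style MapReduce job: the map step emits every k-mer in a sequence and the reduce step sums counts per k-mer. A minimal single-process sketch, with Counter standing in for the shuffle/reduce:

```python
from collections import Counter

def map_kmers(sequence, k):
    # Map: emit every k-mer occurring in the sequence.
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

# Reduce: sum counts per k-mer across all sequences. On Hadoop the shuffle
# would group identical k-mers onto the same reducer; Counter stands in.
genomes = ["ACGTACGT", "CGTACG"]
counts = Counter(kmer for g in genomes for kmer in map_kmers(g, k=3))
print(counts.most_common(3))  # e.g. [('ACG', 3), ('CGT', 3), ('GTA', 2)]
```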
BESIII Physics Data Storing and Processing on HBase and MapReduce
NASA Astrophysics Data System (ADS)
LEI, Xiaofeng; Li, Qiang; Kan, Bowen; Sun, Gongxing; Sun, Zhenyu
2015-12-01
In the past years, we have successfully applied Hadoop to high-energy physics analysis. Although it has not only improved the efficiency of data analysis but also reduced the cost of cluster building so far, there is still room for optimization, such as inflexible pre-selection, inefficient random data reading, and the I/O bottleneck caused by FUSE, which is used to access HDFS. In order to change this situation, this paper presents a new analysis platform for high-energy physics data storing and analysing. The data structure is changed from DST tree-like files to HBase according to the features of the data itself and the analysis processes, since HBase is more suitable for random data reading than DST files and enables HDFS to be accessed directly. A few optimization measures are taken to achieve good performance. A customized protocol is defined for data serialization and deserialization to decrease the storage space used in HBase. To make full use of the locality of data storage in HBase, a new MapReduce model and a new split policy for HBase regions are proposed in the paper. In addition, a dynamic, pluggable, easy-to-use TAG (event metadata) based pre-selection subsystem is established. It can assist physicists in filtering out up to 99.9% of uninteresting data, if the conditions are set properly. This means that a lot of I/O resources can be saved, the CPU usage can be improved, and the time consumed by data analysis can be reduced. Finally, several use cases are designed; the test results show that the new platform performs excellently, running 3.4 times faster with pre-selection and 20% faster without pre-selection, and that the new platform is stable and scalable as well.
Innovating Big Data Computing Geoprocessing for Analysis of Engineered-Natural Systems
NASA Astrophysics Data System (ADS)
Rose, K.; Baker, V.; Bauer, J. R.; Vasylkivska, V.
2016-12-01
Big data computing and analytical techniques offer opportunities to improve predictions about subsurface systems while quantifying and characterizing associated uncertainties from these analyses. Spatial analysis, big data and otherwise, of subsurface natural and engineered systems are based on variable resolution, discontinuous, and often point-driven data to represent continuous phenomena. We will present examples from two spatio-temporal methods that have been adapted for use with big datasets and big data geo-processing capabilities. The first approach uses regional earthquake data to evaluate spatio-temporal trends associated with natural and induced seismicity. The second algorithm, the Variable Grid Method (VGM), is a flexible approach that presents spatial trends and patterns, such as those resulting from interpolation methods, while simultaneously visualizing and quantifying uncertainty in the underlying spatial datasets. In this presentation we will show how we are utilizing Hadoop to store and perform spatial analyses to efficiently consume and utilize large geospatial data in these custom analytical algorithms through the development of custom Spark and MapReduce applications that incorporate ESRI Hadoop libraries. The team will present custom 'Big Data' geospatial applications that run on the Hadoop cluster and integrate with ESRI ArcMap with the team's probabilistic VGM approach. The VGM-Hadoop tool has been specially built as a multi-step MapReduce application running on the Hadoop cluster for the purpose of data reduction. This reduction is accomplished by generating multi-resolution, non-overlapping, attributed topology that is then further processed using ESRI's geostatistical analyst to convey a probabilistic model of a chosen study region. Finally, we will share our approach for implementation of data reduction and topology generation via custom multi-step Hadoop applications, performance benchmarking comparisons, and Hadoop-centric opportunities for greater parallelization of geospatial operations.
Adding Big Data Analytics to GCSS-MC
2014-09-30
[Report front-matter fragment; only scattered headings survive. Subject terms: Big Data, Hadoop, MapReduce, GCSS-MC. Recoverable chapter headings include "Hadoop", "The Experiment Design", "Why Add a Big Data Element", "Adding a Big Data Element to GCSS-MC", and "Building a Hadoop Cluster".]
Scalability and Validation of Big Data Bioinformatics Software.
Yang, Andrian; Troup, Michael; Ho, Joshua W K
2017-01-01
This review examines two important aspects that are central to modern big data bioinformatics analysis - software scalability and validity. We argue that not only are the issues of scalability and validation common to all big data bioinformatics analyses, they can be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability of a program to scale with workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless, the surge in the volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to effectively implement divide-and-conquer in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult due to the large input space and complex algorithms involved. We discuss how state-of-the-art software testing techniques that are based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
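Metamorphic testing checks relations between the outputs of multiple executions rather than comparing against a known "correct" answer. A toy sketch, with an invented pipeline stage and the relation "permuting the input must not change the result":

```python
import random

def variant_frequency(genotypes):
    """Toy pipeline stage under test: fraction of non-reference calls."""
    return sum(1 for g in genotypes if g != "0/0") / len(genotypes)

# Metamorphic relation: permuting the input order must not change the
# output. This checks consistency without knowing the "correct" answer,
# which is the point of metamorphic testing for big-data pipelines.
genotypes = ["0/0", "0/1", "1/1", "0/0", "0/1"]
shuffled = genotypes[:]
random.shuffle(shuffled)
assert variant_frequency(genotypes) == variant_frequency(shuffled)
print("metamorphic relation holds")
```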
Scaling Bulk Data Analysis with Mapreduce
2017-09-01
Aphinyanaphongs, Yin; Fu, Lawrence D; Aliferis, Constantin F
2013-01-01
Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach for dealing with the dissemination of false and dangerous information to vulnerable health consumers. Aside from the obvious requirement of accuracy, two issues are of practical importance in deploying these models in real world applications. (a) Generalizability: The models must generalize to all treatments (not just the ones used in the training of the models). (b) Scalability: The models can be applied efficiently to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture and high dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model to identify unproven treatments on the web; (b) unproven treatments use distinct language to market their claims and this language is learnable; (c) through distributed parallelization and state of the art feature selection, it is possible to prepare the corpora and build and apply models with large scalability.
BioPig: a Hadoop-based analytic toolkit for large-scale sequence data.
Nordberg, Henrik; Bhatia, Karan; Wang, Kai; Wang, Zhong
2013-12-01
The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this 'data deluge', here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. We built BioPig on the Apache's Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb sequences demonstrates that it scales automatically with size of data; and finally, BioPig can be ported without modification on many Hadoop infrastructures, as tested with Magellan system at National Energy Research Scientific Computing Center and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis.
Cloud-based Predictive Modeling System and its Application to Asthma Readmission Prediction
Chen, Robert; Su, Hang; Khalilia, Mohammed; Lin, Sizhe; Peng, Yue; Davis, Tod; Hirsh, Daniel A; Searles, Elizabeth; Tejedor-Sojo, Javier; Thompson, Michael; Sun, Jimeng
2015-01-01
The predictive modeling process is time consuming and requires clinical researchers to handle complex electronic health record (EHR) data in restricted computational environments. To address this problem, we implemented a cloud-based predictive modeling system via a hybrid setup combining a secure private server with the Amazon Web Services (AWS) Elastic MapReduce platform. EHR data is preprocessed on a private server and the resulting de-identified event sequences are hosted on AWS. Based on user-specified modeling configurations, an on-demand web service launches a cluster of Elastic Compute Cloud (EC2) instances on AWS to perform feature selection and classification algorithms in a distributed fashion. Afterwards, the secure private server aggregates results and displays them via interactive visualization. We tested the system on a pediatric asthma readmission task on a de-identified EHR dataset of 2,967 patients. We conduct a larger scale experiment on the CMS Linkable 2008–2010 Medicare Data Entrepreneurs' Synthetic Public Use File dataset of 2 million patients, which achieves over 25-fold speedup compared to sequential execution. PMID:26958172
Is searching full text more effective than searching abstracts?
Lin, Jimmy
2009-01-01
Background With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE® abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: bm25 and the ranking algorithm implemented in the open-source Lucene search engine. Results Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraphs-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that highest overall effectiveness may be achieved by combining evidence from spans and full articles. Conclusion Users searching full text are more likely to find relevant articles than searching only abstracts. This finding affirms the value of full text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly-growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations. PMID:19192280
CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping.
Nguyen, Tung; Shi, Weisong; Ruden, Douglas
2011-06-06
Research in genetics has developed rapidly recently due to the aid of next generation sequencing (NGS). However, massively parallel NGS produces enormous amounts of data, which leads to storage, compatibility, scalability, and performance issues. The Cloud Computing and MapReduce framework, which utilizes hundreds or thousands of shared computers to map sequencing reads quickly and efficiently to reference genome sequences, appears to be a very promising solution for these issues. Consequently, it has been adopted by many organizations recently, and the initial results are very promising. However, since these are only initial steps toward this trend, the developed software does not yet provide adequate primary functions, like bisulfite or paired-end mapping, that are found in on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by the most recent second-generation and third-generation NGS instruments and, therefore, are inefficient. Last, it is difficult for a majority of biologists untrained in programming to use these tools because most were developed on Linux with a command line interface. To encourage the trend of using Cloud technologies in genomics and prepare for advances in second- and third-generation DNA sequencing, we have built a Hadoop MapReduce-based application, CloudAligner, which achieves higher performance, covers most primary features, is more accurate, and has a user-friendly interface. It was also designed to be able to deal with long sequences. The performance gain of CloudAligner over Cloud-based counterparts (35 to 80%) mainly comes from the omission of the reduce phase. In comparison to local-based approaches, the performance gain of CloudAligner comes from the partitioning and parallel processing of the huge reference genome as well as the reads. The source code of CloudAligner is available at http://cloudaligner.sourceforge.net/ and its web version is at http://mine.cs.wayne.edu:8080/CloudAligner/. Our results show that CloudAligner is faster than CloudBurst, provides more accurate results than RMAP, and supports various input as well as output formats. In addition, with the web-based interface, it is easier to use than its counterparts.
VariantSpark: population scale clustering of genotype information.
O'Brien, Aidan R; Saunders, Neil F W; Guo, Yi; Buske, Fabian A; Scott, Rodney J; Bauer, Denis C
2015-12-10
Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed SPARK engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VARIANTSPARK, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. To demonstrate the capabilities of VARIANTSPARK, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VARIANTSPARK is 80% faster than the SPARK-based genome clustering approach ADAM, than the comparable implementation using Hadoop/Mahout, and than ADMIXTURE, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. The benefits of speed, resource consumption and scalability enable VARIANTSPARK to open up the usage of advanced, efficient machine learning algorithms to genomic data.
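For flavour, here is a minimal PySpark sketch in the spirit of MLlib-based genotype clustering: allele-count vectors per individual, clustered with k-means. This is not VariantSpark's interface; the data, the choice of k, and the vector encoding are invented, and it needs a local Spark installation to run.

```python
# Requires a Spark installation; sketches the MLlib-style clustering the
# paper builds on, not VariantSpark's actual interface.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="genotype-clustering")

# Hypothetical per-individual feature vectors: allele counts (0/1/2) at
# a sample of variant sites, one dense vector per individual.
individuals = sc.parallelize([
    [0, 1, 2, 0], [0, 1, 2, 1], [2, 0, 0, 2], [2, 1, 0, 2],
])

model = KMeans.train(individuals, k=2, maxIterations=10)
print([model.predict(v) for v in [[0, 1, 2, 0], [2, 0, 0, 2]]])
sc.stop()
```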
Scalable Regression Tree Learning on Hadoop using OpenPlanet
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yin, Wei; Simmhan, Yogesh; Prasanna, Viktor
As scientific and engineering domains attempt to effectively analyze the deluge of data arriving from sensors and instruments, machine learning is becoming a key data mining tool to build prediction models. Regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables based on a set of input features. MapReduce is well suited for addressing such data-intensive learning applications, and a proprietary regression tree algorithm, PLANET, using MapReduce has been proposed earlier. In this paper, we describe an open source implementation of this algorithm, OpenPlanet, on the Hadoop framework using a hybrid approach. Further, we evaluate the performance of OpenPlanet using real-world datasets from the Smart Power Grid domain to perform energy use forecasting, and propose tuning strategies of Hadoop parameters to improve the performance of the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
Cao, Jianfang; Li, Yanfei; Tian, Yun
2018-01-01
The development of network technology and the popularization of image capturing devices have led to a rapid increase in the number of digital images available, and it is becoming increasingly difficult to identify a desired image from among the massive number of possible images. Images usually contain rich semantic information, and people usually understand images at a high semantic level. Therefore, achieving the ability to use advanced technology to identify the emotional semantics contained in images to enable emotional semantic image classification remains an urgent issue in various industries. To this end, this study proposes an improved OCC emotion model that integrates personality and mood factors for emotional modelling to describe the emotional semantic information contained in an image. The proposed classification system integrates the k-Nearest Neighbour (KNN) algorithm with the Support Vector Machine (SVM) algorithm. The MapReduce parallel programming model was used to adapt the KNN-SVM algorithm for parallel implementation in the Hadoop cluster environment, thereby achieving emotional semantic understanding for the classification of a massive collection of images. For training and testing, 70,000 scene images were randomly selected from the SUN Database. The experimental results indicate that users with different personalities show overall consistency in their emotional understanding of the same image. For a training sample size of 50,000, the classification accuracies for different emotional categories targeted at users with different personalities were approximately 95%, and the training time was only 1/5 of that required for the corresponding algorithm with a single-node architecture. Furthermore, the speedup of the system also showed a linearly increasing tendency. Thus, the experiments achieved a good classification effect and can lay a foundation for classification in terms of additional types of emotional image semantics, thereby demonstrating the practical significance of the proposed model.
Portable Map-Reduce Utility for MIT SuperCloud Environment
2015-09-17
[Report fragment; only scattered text survives. It cites "Driving Big Data With Big Compute" (IEEE HPEC, Sep 10-12, 2012, Waltham, MA) and the Apache Hadoop 1.2.1 HDFS documentation, and describes a big data architecture designed to address these challenges, made up of computing resources, a scheduler, a central storage file system, databases, analytics software and web interfaces; these components are common to many big data and supercomputing systems.]
QMachine: commodity supercomputing in web browsers.
Wilkinson, Sean R; Almeida, Jonas S
2014-06-09
Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics' "Big Data" from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine. QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running "download and install" software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months. QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
NASA Astrophysics Data System (ADS)
Ghaemi, Z.; Farnaghi, M.; Alimohammadi, A.
2015-12-01
The critical impact of air pollution on human health and the environment on the one hand, and the complexity of pollutant concentration behavior on the other, lead scientists to look for advanced techniques for monitoring and predicting urban air quality. Additionally, recent developments in data measurement techniques have led to the collection of various types of data about air quality. Such data is extremely voluminous and, to be useful, it must be processed at high velocity. Due to the complexity of big data analysis, especially for dynamic applications, online forecasting of pollutant concentration trends within a reasonable processing time is still an open problem. The purpose of this paper is to present an online forecasting approach based on Support Vector Machine (SVM) to predict the air quality one day in advance. In order to overcome the computational requirements for large-scale data analysis, distributed computing based on the Hadoop platform has been employed to leverage the processing power of multiple processing units. The MapReduce programming model is adopted for massive parallel processing in this study. Based on the online algorithm and the Hadoop framework, an online forecasting system is designed to predict the air pollution of Tehran for the next 24 hours. The results have been assessed on the basis of processing time and efficiency. Quite accurate predictions of air pollutant indicator levels within an acceptable processing time prove that the presented approach is very suitable for tackling large-scale air pollution prediction problems.
Hyrax: Cloud Computing on Mobile Devices using MapReduce
2009-09-01
MERRA/AS: The MERRA Analytic Services Project Interim Report
NASA Technical Reports Server (NTRS)
Schnase, John; Duffy, Dan; Tamkin, Glenn; Nadeau, Denis; Thompson, Hoot; Grieg, Cristina; Luczak, Ed; McInerney, Mark
2013-01-01
MERRA/AS is a cyberinfrastructure resource that will combine iRODS-based Climate Data Server (CDS) capabilities with Cloudera MapReduce to serve MERRA analytic products, store the MERRA reanalysis data collection in an HDFS to enable parallel, high-performance, storage-side data reductions, manage storage-side driver, mapper, and reducer code sets and realized objects for users, and provide a library of commonly used spatiotemporal operations that can be composed to enable higher-order analyses.
Data Mining of Extremely Large Ad-Hoc Data Sets to Produce Reverse Web-Link Graphs
2017-03-01
[Abstract fragments; the full text is garbled. Recoverable content: data mining can be a valuable tool, particularly in the acquisition of military intelligence; as the second study within a larger Naval effort, this research uses the open web crawler data set Common Crawl and, like previous studies, employs MapReduce (MR) for sorting and categorizing output values; the studies also found that computing-optimized instances should be chosen for serialized/compressed input data in most MR cases.]
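Producing a reverse web-link graph is one of the textbook MapReduce examples and matches this report's title: map emits a (target, source) pair for every hyperlink, and reduce gathers, per target page, all pages linking to it. A minimal single-process sketch with invented page names:

```python
from collections import defaultdict

# Classic MapReduce reverse web-link graph: Map emits (target, source) for
# every hyperlink; Reduce collects, per target page, the list of pages
# linking to it. Page names are illustrative.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

def map_phase(page, outlinks):
    for target in outlinks:
        yield target, page

reverse_graph = defaultdict(list)         # shuffle + reduce
for page, outlinks in pages.items():
    for target, source in map_phase(page, outlinks):
        reverse_graph[target].append(source)

print(dict(reverse_graph))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```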
Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele
2017-05-15
MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need to manage efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences, while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much needed addition to Hadoop-BAM. The software and the datasets are available at http://www.di.unisa.it/FASTdoop/. Contact: umberto.ferraro@uniroma1.it. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Use of Mechanical Turk as a MapReduce Framework for Macular OCT Segmentation.
Lee, Aaron Y; Lee, Cecilia S; Keane, Pearse A; Tufail, Adnan
2016-01-01
Purpose. To evaluate the feasibility of using Mechanical Turk as a massively parallel platform to perform manual segmentations of macular spectral domain optical coherence tomography (SD-OCT) images using a MapReduce framework. Methods. A macular SD-OCT volume of 61 slice images was map-distributed to Amazon Mechanical Turk. Each Human Intelligence Task was set to $0.01 and required the user to draw five lines to outline the sublayers of the retinal OCT image after being shown example images. Each image was submitted twice for segmentation, and interrater reliability was calculated. The interface was created using custom HTML5 and JavaScript code, and data analysis was performed using R. An automated pipeline was developed to handle the map and reduce steps of the framework. Results. More than 93,500 data points were collected using this framework for the 61 images submitted. Pearson's correlation of interrater reliability was 0.995 (p < 0.0001) and coefficient of determination was 0.991. The cost of segmenting the macular volume was $1.21. A total of 22 individual Mechanical Turk users provided segmentations, each completing an average of 5.5 HITs. Each HIT was completed in an average of 4.43 minutes. Conclusions. Amazon Mechanical Turk provides a cost-effective, scalable, high-availability infrastructure for manual segmentation of OCT images.
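The reduce side of this crowdsourcing framework, merging duplicate segmentations per slice and checking interrater agreement, can be sketched briefly. The slice values below are invented, not study data, and statistics.correlation requires Python 3.10+.

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Reduce step of the crowd-segmentation framework: each OCT slice was
# submitted twice, so group the two raters' boundary y-positions per slice
# and measure agreement. Values are illustrative, not study data.
submissions = {  # slice_id -> [rater1_ys, rater2_ys]
    0: [[10.0, 22.5, 30.1], [10.2, 22.4, 30.0]],
    1: [[11.0, 23.0, 31.0], [11.1, 23.2, 30.8]],
}

rater1 = [y for ys in submissions.values() for y in ys[0]]
rater2 = [y for ys in submissions.values() for y in ys[1]]
print("interrater r =", round(correlation(rater1, rater2), 3))

# Final segmentation per slice: average the duplicate annotations.
merged = {s: [(a + b) / 2 for a, b in zip(*pair)]
          for s, pair in submissions.items()}
print(merged[0])
```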
Large-scale seismic signal analysis with Hadoop
Addair, T. G.; Dodge, D. A.; Walter, W. R.; ...
2014-02-11
In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. This was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.
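The per-pair work in such a pipeline is a normalized cross-correlation whose peak value flags correlated events. A minimal map-step sketch, assuming synthetic traces in place of real waveforms pulled from HDFS and an illustrative 0.6 detection threshold:

```python
import numpy as np

def normalized_xcorr_peak(a: np.ndarray, b: np.ndarray) -> float:
    """Peak of the normalized cross-correlation between two traces."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.max(np.correlate(a, b, mode="full")))

# Map step over candidate pairs; a real job would read waveform pairs
# grouped by region from HDFS rather than generating them here.
rng = np.random.default_rng(0)
pairs = [(rng.standard_normal(1000), rng.standard_normal(1000)) for _ in range(4)]
detections = [(i, c) for i, (a, b) in enumerate(pairs)
              if (c := normalized_xcorr_peak(a, b)) > 0.6]
print(detections)  # random noise pairs rarely correlate, so usually []
```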
Pandey, Vaibhav; Saini, Poonam
2018-06-01
MapReduce (MR) computing paradigm and its open source implementation Hadoop have become a de facto standard to process big data in a distributed environment. Initially, the Hadoop system was homogeneous in three significant aspects, namely, user, workload, and cluster (hardware). However, with a growing variety of MR jobs and the inclusion of differently configured nodes in existing clusters, heterogeneity has become an essential part of Hadoop systems. The heterogeneity factors adversely affect the performance of a Hadoop scheduler and limit the overall throughput of the system. To overcome this problem, various heterogeneous Hadoop schedulers have been proposed in the literature. Existing survey works in this area mostly cover homogeneous schedulers and classify them on the basis of the quality of service parameters they optimize. Hence, there is a need to study the heterogeneous Hadoop schedulers on the basis of the various heterogeneity factors they consider. In this survey article, we first discuss different heterogeneity factors that typically exist in a Hadoop system and then explore various challenges that arise while designing schedulers in the presence of such heterogeneity. Afterward, we present a comparative study of the heterogeneous scheduling algorithms available in the literature and classify them by the aforementioned heterogeneity factors. Lastly, we investigate the different methods and environments used for the evaluation of the discussed Hadoop schedulers.
Kong, Bingxin; Liu, Siqi; Yin, Jie; Li, Shengru; Zhu, Zuqing
2018-05-28
Nowadays, it is common for service providers (SPs) to leverage hybrid clouds to improve the quality-of-service (QoS) of their Big Data applications. However, for achieving guaranteed latency and/or bandwidth in its hybrid cloud, an SP might desire to have a virtual datacenter (vDC) network, in which it can manage and manipulate the network connections freely. To address this requirement, we design and implement a network slicing and orchestration (NSO) system that can create and expand vDCs across optical/packet domains on-demand. Considering Hadoop MapReduce (M/R) as the use-case, we describe the proposed architectures of the system's data, control and management planes, and present the operation procedures for creating, expanding, monitoring and managing a vDC for M/R optimization. The proposed NSO system is then realized in a small-scale network testbed that includes four optical/packet domains, and we conduct experiments in it to demonstrate the whole operations of the data, control and management planes. Our experimental results verify that application-driven on-demand vDC expansion across optical/packet domains can be achieved for M/R optimization, and after being provisioned with a vDC, the SP using the NSO system can fully control the vDC network and further optimize the M/R jobs in it with network orchestration.
A privacy-preserving parallel and homomorphic encryption scheme
NASA Astrophysics Data System (ADS)
Min, Zhaoe; Yang, Geng; Shi, Jingqi
2017-04-01
In order to protect data privacy whilst allowing efficient access to data in multi-node cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm in which plaintext is divided into several blocks and the blocks are encrypted in parallel. Experimental results demonstrate that the encryption algorithm achieves a speed-up ratio of about 7.1 in a MapReduce environment with 16 cores and 4 nodes.
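The block-splitting idea rests on Paillier's additive homomorphism: encrypting blocks independently (and thus in parallel) loses nothing, because sums can still be computed on ciphertexts. A minimal sketch using the open-source python-paillier (`phe`) package, with the per-block encryption shown serially where the paper's scheme would hand each block to a map task:

```python
from phe import paillier  # python-paillier: additively homomorphic encryption

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Split the plaintext into blocks; in the paper's scheme each block would
# be encrypted by a separate map task, so the per-block work below is the
# body of the mapper (shown serially here for clarity).
plaintext_blocks = [42, 17, 99, 8]
ciphertext_blocks = [public_key.encrypt(m) for m in plaintext_blocks]

# Additive homomorphism: the encrypted sum decrypts to the plaintext sum,
# so splitting for parallelism does not change the result.
encrypted_total = sum(ciphertext_blocks[1:], ciphertext_blocks[0])
assert private_key.decrypt(encrypted_total) == sum(plaintext_blocks)
```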
Survey of gene splicing algorithms based on reads.
Si, Xiuhua; Wang, Qian; Zhang, Lei; Wu, Ruo; Ma, Jiquan
2017-11-02
Gene splicing is the process of assembling a large number of unordered short sequence fragments into the original genome sequence as accurately as possible. Several popular read-based splicing algorithms are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (greedy extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made.
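De Bruijn graph construction parallelizes naturally in MapReduce: the map step cuts reads into k-mers and emits prefix-to-suffix edges, and the reduce step groups edges by source node. A minimal in-memory sketch of the two steps (k = 5 and the two short reads are arbitrary choices):

```python
from collections import defaultdict

def map_kmers(read: str, k: int = 5):
    """Map step: each k-mer yields a (prefix-node, suffix-node) edge."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        yield kmer[:-1], kmer[1:]

def reduce_edges(mapped):
    """Reduce step: group edges by source node to assemble the graph."""
    graph = defaultdict(list)
    for src, dst in mapped:
        graph[src].append(dst)
    return graph

reads = ["ATGGCGT", "GGCGTGA"]
edges = (edge for read in reads for edge in map_kmers(read))
print(dict(reduce_edges(edges)))
```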
Exploiting analytics techniques in CMS computing monitoring
NASA Astrophysics Data System (ADS)
Bonacorsi, D.; Kuznetsov, V.; Magini, N.; Repečka, A.; Vaandering, E.
2017-10-01
The CMS experiment has collected an enormous volume of metadata about its computing operations in its monitoring systems, describing its experience in operating all of the CMS workflows on all of the Worldwide LHC Computing Grid Tiers. This information has rarely been mined, but such efforts are of crucial importance for a better understanding of how CMS conducted successful operations and for reaching an adequate and adaptive modelling of CMS operations, in order to allow detailed optimizations and eventually a prediction of system behaviours. These data are now streamed into the CERN Hadoop data cluster for further analysis. Specific sets of information (e.g. data on how many replicas of datasets CMS wrote on disks at WLCG Tiers, data on which datasets were primarily requested for analysis, etc.) were collected on Hadoop and processed with MapReduce applications profiting from the parallelization on the Hadoop cluster. We present the implementation of new monitoring applications on Hadoop, and discuss the new possibilities in CMS computing monitoring introduced with the ability to quickly process big data sets from multiple sources, looking forward to a predictive modeling of the system.
SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data
NASA Astrophysics Data System (ADS)
Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends Apache Spark for scaling scientific computations. Apache Spark improves the map-reduce implementation in Apache Hadoop for parallel computing on a cluster by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, and not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate performance of the various matrix libraries in distributed pipelines, such as Nd4j and Breeze. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, and parallel ingest and partitioning (sharding) of A-Train satellite observations and model grids. These solutions are encompassed in SciSpark, an open-source software framework for distributed computing on scientific data.
Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.
Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene
2011-01-01
To address the monumental challenge of assigning function to millions of sequenced proteins, we completed first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length-normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
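The LNBS-based grouping maps cleanly onto MapReduce: the map step scores each BLAST hit, and the reduce step groups pairs that clear a similarity threshold. A minimal sketch, assuming LNBS is computed as bit score over alignment length and using an invented 1.0 threshold and toy records:

```python
from collections import defaultdict

def map_lnbs(blast_record):
    """Map step: turn one filtered BLAST hit into a scored protein pair."""
    query, subject, bit_score, aln_len = blast_record
    lnbs = bit_score / aln_len          # length-normalized bit score (assumed form)
    yield (query, subject), lnbs

def reduce_groups(pairs, threshold=1.0):
    """Reduce step: keep pairs above threshold as same-group evidence."""
    groups = defaultdict(set)
    for (query, subject), lnbs in pairs:
        if lnbs >= threshold:
            groups[query].add(subject)
    return groups

hits = [("P1", "P2", 310.0, 250), ("P1", "P3", 90.0, 240)]
scored = (kv for h in hits for kv in map_lnbs(h))
print(dict(reduce_groups(scored)))  # {'P1': {'P2'}}
```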
NASA Astrophysics Data System (ADS)
Rose, K.; Bauer, J. R.; Baker, D. V.
2015-12-01
As big data computing capabilities are increasingly paired with spatial analytical tools and approaches, there is a need to ensure that uncertainty associated with the datasets used in these analyses is adequately incorporated and portrayed in results. Often the products of spatial analyses, big data and otherwise, are developed using discontinuous, sparse, and often point-driven data to represent continuous phenomena. Results from these analyses are generally presented without clear explanations of the uncertainty associated with the interpolated values. The Variable Grid Method (VGM) offers users a flexible approach designed for application to a variety of analyses where there is a need to study, evaluate, and analyze spatial trends and patterns while maintaining connection to and communicating the uncertainty in the underlying spatial datasets. The VGM outputs a simultaneous visualization representative of the spatial data analyses and a quantification of underlying uncertainties, which can be calculated using data related to sample density, sample variance, interpolation error, or uncertainty derived from multiple simulations. In this presentation we will show how we are utilizing Hadoop to store and perform spatial analysis through the development of custom Spark and MapReduce applications that incorporate ESRI Hadoop libraries. The team will present custom 'Big Data' geospatial applications that run on the Hadoop cluster and integrate ESRI ArcMap with the team's probabilistic VGM approach. The VGM-Hadoop tool has been specially built as a multi-step MapReduce application running on the Hadoop cluster for the purpose of data reduction. This reduction is accomplished by generating multi-resolution, non-overlapping, attributed topology that is then further processed using ESRI's geostatistical analyst to convey a probabilistic model of a chosen study region. Finally, we will share our approach for implementation of data reduction and topology generation via custom multi-step Hadoop applications, performance benchmarking comparisons, and Hadoop-centric opportunities for greater parallelization of geospatial operations. The presentation includes examples of the approach being applied to a range of subsurface, geospatial studies (e.g. induced seismicity risk).
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements.
De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan
2015-12-01
The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O. sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z. mays. BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller. Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. Supplementary data are available at Bioinformatics online.
QMachine: commodity supercomputing in web browsers
2014-01-01
Background Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics’ “Big Data” from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine. Results QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running “download and install” software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months. Conclusions QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments. PMID:24913605
Arctic Boreal Vulnerability Experiment (ABoVE) Science Cloud
NASA Astrophysics Data System (ADS)
Duffy, D.; Schnase, J. L.; McInerney, M.; Webster, W. P.; Sinno, S.; Thompson, J. H.; Griffith, P. C.; Hoy, E.; Carroll, M.
2014-12-01
The effects of climate change are being revealed at alarming rates in the Arctic and Boreal regions of the planet. NASA's Terrestrial Ecology Program has launched a major field campaign to study these effects over the next 5 to 8 years. The Arctic Boreal Vulnerability Experiment (ABoVE) will challenge scientists to take measurements in the field, study remote observations, and even run models to better understand the impacts of a rapidly changing climate for areas of Alaska and western Canada. The NASA Center for Climate Simulation (NCCS) at the Goddard Space Flight Center (GSFC) has partnered with the Terrestrial Ecology Program to create a science cloud designed for this field campaign - the ABoVE Science Cloud. The cloud combines traditional high performance computing with emerging technologies to create an environment specifically designed for large-scale climate analytics. The ABoVE Science Cloud utilizes (1) virtualized high-speed InfiniBand networks, (2) a combination of high-performance file systems and object storage, and (3) virtual system environments tailored for data intensive, science applications. At the center of the architecture is a large object storage environment, much like a traditional high-performance file system, that supports data proximal processing using technologies like MapReduce on a Hadoop Distributed File System (HDFS). Surrounding the storage is a cloud of high performance compute resources with many processing cores and large memory coupled to the storage through an InfiniBand network. Virtual systems can be tailored to a specific scientist and provisioned on the compute resources with extremely high-speed network connectivity to the storage and to other virtual systems. In this talk, we will present the architectural components of the science cloud and examples of how it is being used to meet the needs of the ABoVE campaign. In our experience, the science cloud approach significantly lowers the barriers and risks to organizations that require high performance computing solutions and provides the NCCS with the agility required to meet our customers' rapidly increasing and evolving requirements.
Free Factories: Unified Infrastructure for Data Intensive Web Services
Zaranek, Alexander Wait; Clegg, Tom; Vandewege, Ward; Church, George M.
2010-01-01
We introduce the Free Factory, a platform for deploying data-intensive web services using small clusters of commodity hardware and free software. Independently administered virtual machines called Freegols give application developers the flexibility of a general purpose web server, along with access to distributed batch processing, cache and storage services. Each cluster exploits idle RAM and disk space for cache, and reserves disks in each node for high bandwidth storage. The batch processing service uses a variation of the MapReduce model. Virtualization allows every CPU in the cluster to participate in batch jobs. Each 48-node cluster can achieve 4-8 gigabytes per second of disk I/O. Our intent is to use multiple clusters to process hundreds of simultaneous requests on multi-hundred terabyte data sets. Currently, our applications achieve 1 gigabyte per second of I/O with 123 disks by scheduling batch jobs on two clusters, one of which is located in a remote data center. PMID:20514356
Climate Analytics as a Service
NASA Technical Reports Server (NTRS)
Schnase, John L.; Duffy, Daniel Q.; McInerney, Mark A.; Webster, W. Phillip; Lee, Tsengdar J.
2014-01-01
Climate science is a big data domain that is experiencing unprecedented growth. In our efforts to address the big data challenges of climate science, we are moving toward a notion of Climate Analytics-as-a-Service (CAaaS). CAaaS combines high-performance computing and data-proximal analytics with scalable data management, cloud computing virtualization, the notion of adaptive analytics, and a domain-harmonized API to improve the accessibility and usability of large collections of climate data. MERRA Analytic Services (MERRA/AS) provides an example of CAaaS. MERRA/AS enables MapReduce analytics over NASA's Modern-Era Retrospective Analysis for Research and Applications (MERRA) data collection. The MERRA reanalysis integrates observational data with numerical models to produce a global temporally and spatially consistent synthesis of key climate variables. The effectiveness of MERRA/AS has been demonstrated in several applications. In our experience, CAaaS is providing the agility required to meet our customers' increasing and changing data management and data analysis needs.
Graphing trillions of triangles.
Burkhardt, Paul
2017-07-01
The increasing size of Big Data is often heralded but how data are transformed and represented is also profoundly important to knowledge discovery, and this is exemplified in Big Graph analytics. Much attention has been placed on the scale of the input graph but the product of a graph algorithm can be many times larger than the input. This is true for many graph problems, such as listing all triangles in a graph. Enabling scalable graph exploration for Big Graphs requires new approaches to algorithms, architectures, and visual analytics. A brief tutorial is given to aid the argument for thoughtful representation of data in the context of graph analysis. Then a new algebraic method to reduce the arithmetic operations in counting and listing triangles in graphs is introduced. Additionally, a scalable triangle listing algorithm in the MapReduce model will be presented followed by a description of the experiments with that algorithm that led to the current largest and fastest triangle listing benchmarks to date. Finally, a method for identifying triangles in new visual graph exploration technologies is proposed.
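A two-round MapReduce formulation of triangle listing is standard: round one has each vertex emit its open wedges (pairs of neighbors), and round two joins the wedge keys against the edge set so that every matched wedge closes a triangle. A minimal in-memory sketch on a toy graph; production versions order vertices by degree so each triangle is emitted once rather than deduplicated, as done with the set here:

```python
from itertools import combinations

# Toy edge list; each undirected edge appears once as a sorted tuple.
edges = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Round 1 (map): each vertex emits its "open wedges" -- neighbor pairs
# that would close a triangle if they were also an edge.
wedges = ((tuple(sorted(pair)), v) for v in adj
          for pair in combinations(sorted(adj[v]), 2))

# Round 2 (reduce/join): a wedge keyed by an actual edge is a triangle.
triangles = {tuple(sorted((u, w, v))) for (u, w), v in wedges if (u, w) in edges}
print(triangles)  # {(1, 2, 3), (2, 3, 4)}
```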
Machine Learning for Knowledge Extraction from PHR Big Data.
Poulymenopoulou, Michaela; Malamateniou, Flora; Vassilacopoulos, George
2014-01-01
Cloud computing, Internet of Things (IoT) and NoSQL database technologies can support a new generation of cloud-based PHR services that contain heterogeneous (unstructured, semi-structured and structured) patient data (health, social and lifestyle) from various sources, including automatically transmitted data from Internet-connected devices in the patient's living space (e.g. medical devices connected to patients in home care). The patient data stored in such PHR systems constitute big data whose analysis with the use of appropriate machine learning algorithms is expected to improve diagnosis and treatment accuracy, to cut healthcare costs and, hence, to improve the overall quality and efficiency of healthcare provided. This paper describes a health data analytics engine which uses machine learning algorithms for analyzing cloud-based PHR big health data towards knowledge extraction to support better healthcare delivery as regards disease diagnosis and prognosis. This engine comprises the data preparation, model generation and data analysis modules and runs on the cloud, taking advantage of the map/reduce paradigm provided by Apache Hadoop.
GATE Monte Carlo simulation of dose distribution using MapReduce in a cloud computing environment.
Liu, Yangchuan; Tang, Yuguo; Gao, Xin
2017-12-01
The GATE Monte Carlo simulation platform has good application prospects for treatment planning and quality assurance. However, accurate dose calculation using GATE is time consuming. The purpose of this study is to implement a novel cloud computing method for accurate GATE Monte Carlo simulation of dose distribution using MapReduce. An Amazon Machine Image installed with Hadoop and GATE is created to set up Hadoop clusters on Amazon Elastic Compute Cloud (EC2). Macros, the input files for GATE, are split into a number of self-contained sub-macros. Through Hadoop Streaming, the sub-macros are executed by GATE in Map tasks and the sub-results are aggregated into final outputs in Reduce tasks. As an evaluation, GATE simulations were performed in a cubical water phantom for X-ray photons of 6 and 18 MeV. The parallel simulation on the cloud computing platform is as accurate as the single-threaded simulation on a local server, and the cloud-based simulation time is approximately inversely proportional to the number of worker nodes. For the simulation of 10 million photons on a cluster with 64 worker nodes, time decreases of 41× and 32× were achieved compared to the single-worker-node case and the single-threaded case, respectively. A test of Hadoop's fault tolerance showed that simulation correctness was not affected by the failure of some worker nodes. The results verify that the proposed method provides a feasible cloud computing solution for GATE.
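Hadoop Streaming makes the map step a plain executable fed lines on stdin, so the macro-splitting scheme reduces to launching one GATE sub-simulation per input line. A minimal mapper sketch; the "task_id n_photons" line format, the `Gate` executable name, the alias syntax and `main.mac` are illustrative assumptions rather than the paper's exact setup:

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch for the macro-splitting scheme.

Each input line is assumed to carry "task_id n_photons" for one
self-contained sub-macro; executable name and alias syntax are
illustrative assumptions, not the paper's exact configuration.
"""
import subprocess
import sys

for line in sys.stdin:
    task_id, n_photons = line.split()
    # Run one independent GATE sub-simulation per map task; the reduce
    # phase would aggregate the per-task dose outputs into final maps.
    subprocess.run(
        ["Gate", "-a", f"[seed,{task_id}][nphotons,{n_photons}]", "main.mac"],
        check=True,
    )
    print(f"{task_id}\tdone")
```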
Kumar, Mukesh; Rath, Nitish Kumar; Rath, Santanu Kumar
2016-04-01
Microarray-based gene expression profiling has emerged as an efficient technique for classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. Such datasets often contain a large number of expression values, but only a fraction corresponds to significantly expressed genes. The precise identification of genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process such as feature selection/extraction followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is also employed to classify microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis is done on these MapReduce-based models using microarray datasets of various dimensions. From the obtained results, it is observed that these models consume much less execution time than conventional models in processing big data.
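A distributed kNN of this kind works because the global k nearest neighbors are always among the union of each shard's local k nearest. A minimal sketch of the two phases, with random three-dimensional points and binary labels standing in for expression features:

```python
import heapq
import numpy as np

def map_local_knn(shard, query, k):
    """Map step: each shard returns its own k nearest training points."""
    dists = [(float(np.linalg.norm(x - query)), label) for x, label in shard]
    return heapq.nsmallest(k, dists)

def reduce_global_knn(partials, k):
    """Reduce step: merge per-shard candidates and vote on the label."""
    top_k = heapq.nsmallest(k, (d for part in partials for d in part))
    labels = [label for _, label in top_k]
    return max(set(labels), key=labels.count)

rng = np.random.default_rng(1)
shards = [[(rng.standard_normal(3), i % 2) for i in range(10)] for _ in range(4)]
query = np.zeros(3)
partials = [map_local_knn(s, query, k=5) for s in shards]
print(reduce_global_knn(partials, k=5))
```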
Elastic Cloud Computing Infrastructures in the Open Cirrus Testbed Implemented via Eucalyptus
NASA Astrophysics Data System (ADS)
Baun, Christian; Kunze, Marcel
Cloud computing realizes the advantages and overcomes some restrictions of the grid computing paradigm. Elastic infrastructures can easily be created and managed by cloud users. In order to accelerate the research on data center management and cloud services, the Open Cirrus research testbed has been started by HP, Intel and Yahoo!. Although commercial cloud offerings are proprietary, Open Source solutions exist in the field of IaaS with Eucalyptus, PaaS with AppScale and at the applications layer with Hadoop MapReduce. This paper examines the I/O performance of cloud computing infrastructures implemented with Eucalyptus in contrast to Amazon S3.
SeqHBase: a big data toolset for family based sequencing data analysis.
He, Min; Person, Thomas N; Hebbring, Scott J; Heinzen, Ethan; Ye, Zhan; Schrodi, Steven J; McPherson, Elizabeth W; Lin, Simon M; Peissig, Peggy L; Brilliant, Murray H; O'Rawe, Jason; Robison, Reid J; Lyon, Gholson J; Wang, Kai
2015-04-01
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyse WES familial data and approximately 1 minute to analyse WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
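The core trio test behind de novo detection is a per-site genotype comparison, which is what makes the scan embarrassingly parallel across variants. A minimal sketch, assuming simplified unphased VCF-style genotype strings and two invented sites; a real SeqHBase run would stream these out of HBase rows built from the family's VCF and coverage data:

```python
def is_candidate_de_novo(child_gt, father_gt, mother_gt):
    """Flag variants carried by the child but absent in both parents."""
    return child_gt != "0/0" and father_gt == "0/0" and mother_gt == "0/0"

# Hypothetical per-site genotypes keyed by (chrom, pos), ordered as
# (child, father, mother); values are invented for illustration.
trio_sites = {
    ("chr1", 15211): ("0/1", "0/0", "0/0"),
    ("chr2", 88422): ("0/1", "0/1", "0/0"),
}
de_novo = [site for site, gts in trio_sites.items() if is_candidate_de_novo(*gts)]
print(de_novo)  # [('chr1', 15211)]
```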
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Jun
Our group has been working with ANL collaborators on the topic of bridging the gap between parallel file systems and local file systems during the course of this project period. We visited Argonne National Lab (Dr. Robert Ross's group) for one week in summer 2007. We reviewed our current project progress and planned the activities for the coming years 2008-09. The PI met Dr. Robert Ross several times, such as at the HEC FSIO workshop 08, SC08 and SC10. We explored the opportunities to develop a production system by leveraging our current prototype (SOGP+PVFS) into a new PVFS version. We delivered SOGP+PVFS codes to the ANL PVFS2 group in 2008. We also talked about exploring a potential project on developing new parallel programming models and runtime systems for data-intensive scalable computing (DISC). The methodology is to evolve MPI towards DISC by incorporating some functions of the Google MapReduce parallel programming model. More recently, we have been exploring together how to leverage existing works to perform (1) coordination/aggregation of local I/O operations prior to movement over the WAN, (2) efficient bulk data movement over the WAN, and (3) latency hiding techniques for latency-intensive operations. Since 2009, we have been applying Hadoop/MapReduce to some HEC applications with LANL scientists John Bent and Salman Habib. Another ongoing work is to improve checkpoint performance at the I/O forwarding layer for the Road Runner supercomputer with James Nunez and Gary Grider at LANL. Two senior undergraduates from our research group did summer internships on high-performance file and storage system projects at LANL for three consecutive years starting in 2008. Both of them are now pursuing Ph.D. degrees in our group, will be in the 4th year of the PhD program in Fall 2011, and will go to LANL to advance the two above-mentioned works during this winter break. Since 2009, we have been collaborating with several computer scientists (Gary Grider, John Bent, Parks Fields, James Nunez, Hsing-Bung Chen, etc.) from HPC5 and with James Ahrens from the Advanced Computing Laboratory at Los Alamos National Laboratory. We hold a weekly conference and/or video meeting on advancing work on two fronts: the hardware/software infrastructure for building large-scale data intensive clusters, and research publications. Our group members assist in constructing several onsite LANL data intensive clusters. The two parties have been developing software codes and research papers together using both sides' resources.
A Computing Infrastructure for Supporting Climate Studies
NASA Astrophysics Data System (ADS)
Yang, C.; Bambacus, M.; Freeman, S. M.; Huang, Q.; Li, J.; Sun, M.; Xu, C.; Wojcik, G. S.; Cahalan, R. F.; NASA Climate @ Home Project Team
2011-12-01
Climate change is one of the major challenges facing us on planet Earth in the 21st century. Scientists build many models to simulate the past and predict climate change for the next decades or century. Most models run at low resolution, with some targeting high resolution in linkage to practical climate change preparedness. To calibrate and validate the models, millions of model runs are needed to find the best simulation and configuration. This paper introduces the NASA effort on the Climate@Home project to build a supercomputer based on advanced computing technologies, such as cloud computing, grid computing, and others. The Climate@Home computing infrastructure includes several aspects: 1) a cloud computing platform is utilized to manage potential spikes in access to the centralized components, such as the grid computing server for dispatching and collecting model run results; 2) a grid computing engine is developed based on MapReduce to dispatch models and model configurations, and to collect simulation results and contribution statistics; 3) a portal serves as the entry point for the project to provide management, sharing, and data exploration for end users; 4) scientists can access customized tools to configure model runs and visualize model results; 5) the public can follow Twitter and Facebook to get the latest news about the project. This paper will introduce the latest progress of the project and demonstrate the operational system during the AGU fall meeting. It will also discuss how this technology can become a trailblazer for other climate studies and relevant sciences, and share how the challenges in computation and software integration were solved.
SciSpark: In-Memory Map-Reduce for Earth Science Algorithms
NASA Astrophysics Data System (ADS)
Ramirez, P.; Wilson, B. D.; Whitehall, K. D.; Palamuttam, R. S.; Mattmann, C. A.; Shah, S.; Goodman, A.; Burke, W.
2016-12-01
We are developing a lightning fast Big Data technology called SciSpark based on Apache Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache Hadoop by 100x in memory and by 10x on disk. SciSpark extends Spark to support Earth Science use in three ways: efficient ingest of N-dimensional geo-located arrays (physical variables) from netCDF3/4, HDF4/5, and/or OPeNDAP URLs; array operations for dense arrays in Scala and Java using the ND4S/ND4J or Breeze libraries; and operations to "split" datasets across a Spark cluster by time or space or both. For example, a decade-long time-series of geo-variables can be split across time to enable parallel "speedups" of analysis by day, month, or season. Similarly, very high-resolution climate grids can be partitioned into spatial tiles for parallel operations across rows, columns, or blocks. In addition, using Spark's gateway into Python, PySpark, one can utilize the entire ecosystem of numpy, scipy, etc. Finally, SciSpark Notebooks provide a modern eNotebook technology in which Scala, Python, or Spark SQL code is entered into cells in the Notebook and executed on the cluster, with results, plots, or graph visualizations displayed in "live widgets". We have exercised SciSpark by implementing three complex Use Cases: discovery and evolution of Mesoscale Convective Complexes (MCCs) in storms, yielding a graph of connected components; PDF clustering of atmospheric state using parallel K-Means; and statistical "rollups" of geo-variables or model-to-obs differences (i.e. mean, stddev, skewness, and kurtosis) by day, month, season, year, and multi-year. Geo-variables are ingested and split across the cluster using methods on the sciSparkContext object, including netCDFVariables() for spatial decomposition and wholeNetCDFVariables() for time-series. The presentation will cover the architecture of SciSpark, the design of the scientific RDD (sRDD) data structures for N-dim arrays, results from the three science Use Cases, example Notebooks, lessons learned from the algorithm implementations, and parallel performance metrics.
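The "split by time" pattern is easy to mimic in stock PySpark: parallelize the time axis, load one grid per element, and roll up statistics per slice. A minimal sketch with random arrays standing in for the netCDF/HDF reads that SciSpark's own sciSparkContext would perform:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="scispark-style-rollup")

# Stand-in for a decade of monthly grids; SciSpark proper would load
# these from netCDF/HDF files, which is not shown here.
months = list(range(120))

def load_grid(month):
    rng = np.random.default_rng(month)
    return month, rng.standard_normal((90, 180))   # lat x lon grid

def monthly_stats(pair):
    month, grid = pair
    return month, (float(grid.mean()), float(grid.std()))

# Split by time across the cluster, then compute per-slice rollups.
stats = sc.parallelize(months, numSlices=12).map(load_grid).map(monthly_stats)
print(stats.take(3))
sc.stop()
```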
OceanRoute: Vessel Mobility Data Processing and Analyzing Model Based on MapReduce
NASA Astrophysics Data System (ADS)
Liu, Chao; Liu, Yingjian; Guo, Zhongwen; Jing, Wei
2018-06-01
Network coverage is a big problem in ocean communication, and no low-cost solution is available in the short term. Based on the ideas of Mobile Delay Tolerant Networks (MDTN), the mobility of vessels can create opportunities for end-to-end communication, and the mobility pattern of vessels is one of the key inputs for an ocean MDTN network. Because of the high cost, few experiments have so far focused on vessel mobility patterns. In this paper, we study the traces of more than 4000 fishing and freight vessels. Firstly, to solve the data noise and sparsity problems, we design two algorithms to filter the noise and complete the missing data based on each vessel's turning features. Secondly, after studying the traces, we observe that they are confined by invisible boundaries. Thirdly, by defining a distance between traces, we design the MR-Similarity algorithm to find the mobility patterns of vessels. Finally, we implement our algorithm on a cluster and evaluate its performance and accuracy. Our results can provide guidelines for the design of data routing protocols on ocean MDTN networks.
Li, Zhenlong; Yang, Chaowei; Jin, Baoxuan; Yu, Manzhu; Liu, Kai; Sun, Min; Zhan, Matthew
2015-01-01
Geoscience observations and model simulations are generating vast amounts of multi-dimensional data. Effectively analyzing these data is essential for geoscience studies. However, the tasks are challenging for geoscientists because processing the massive amount of data is both computing and data intensive, in that data analytics requires complex procedures and multiple tools. To tackle these challenges, a scientific workflow framework is proposed for big geoscience data analytics. In this framework, techniques are proposed that leverage cloud computing, MapReduce, and Service Oriented Architecture (SOA). Specifically, HBase is adopted for storing and managing big geoscience data across distributed computers. A MapReduce-based algorithm framework is developed to support parallel processing of geoscience data. And a service-oriented workflow architecture is built for supporting on-demand complex data analytics in the cloud environment. A proof-of-concept prototype tests the performance of the framework. Results show that this innovative framework significantly improves the efficiency of big geoscience data analytics by reducing the data processing time as well as simplifying data analytical procedures for geoscientists. PMID:25742012
[Traditional Chinese Medicine data management policy in big data environment].
Liang, Yang; Ding, Chang-Song; Huang, Xin-di; Deng, Le
2018-02-01
As traditional data management models cannot effectively manage the massive data of traditional Chinese medicine (TCM), due to the uncertainty of data object attributes as well as the diversity and abstraction of data representation, a management strategy for TCM data based on big data technology is proposed. Grounded in the true characteristics of TCM data, this strategy solves the problems of uncertain data object attributes and non-uniform data representation by exploiting the schema-less storage of objects in big data technology. A hybrid indexing mode is also used to resolve the conflicts brought by different storage modes during indexing, with powerful query processing of massive data through an efficient parallel MapReduce process. The theoretical analysis provides the management framework and its key technology, while its performance was tested on Hadoop using several common traditional Chinese medicines and prescriptions from a practical TCM data source. Results showed that this strategy can effectively solve the storage problem of TCM information, with good performance in query efficiency, completeness and robustness.
Text Content Pushing Technology Research Based on Location and Topic
NASA Astrophysics Data System (ADS)
Wei, Dongqi; Wei, Jianxin; Wumuti, Naheman; Jiang, Baode
2016-11-01
In the field, geological workers usually want to obtain relevant geological background information for the working area quickly and accurately. This information is buried in massive geological data, a large proportion of which is text described in natural language. This paper studies a method for extracting location information from mass text data, proposes a geographic-location-to-geological-content correlation algorithm based on Spark and MapReduce2, classifies the content using KNN, and builds a content pushing system based on location and topic. The system runs in the geological survey cloud and achieved good results in tests using real geological data.
Paradigms for Realizing Machine Learning Algorithms.
Agneeswaran, Vijay Srinivas; Tonpay, Pranay; Tiwary, Jayati
2013-12-01
The article explains the three generations of machine learning algorithms, with all three trying to operate on big data. The first generation tools are SAS, SPSS, etc., while second generation realizations include Mahout and RapidMiner (which work over Hadoop), and the third generation paradigms include Spark and GraphLab, among others. The essence of the article is that for a number of machine learning algorithms, it is important to look beyond Hadoop's Map-Reduce paradigm in order to make them work on big data. A number of promising contenders have emerged in the third generation that can be exploited to realize deep analytics on big data.
An Improved Clustering Algorithm of Tunnel Monitoring Data for Cloud Computing
Zhong, Luo; Tang, KunHao; Li, Lin; Yang, Guang; Ye, JingJing
2014-01-01
With the rapid development of urban construction, the number of urban tunnels is increasing and the data they produce become more and more complex. As a result, traditional clustering algorithms cannot handle the mass of tunnel monitoring data. To solve this problem, an improved parallel clustering algorithm based on k-means has been proposed. It processes the data with MapReduce within cloud computing. It not only has the advantage of handling mass data but is also more efficient. Moreover, it is able to compute the average dissimilarity degree of each cluster in order to clean out abnormal data. PMID:24982971
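A MapReduce k-means iteration follows the classic pattern: the map step keys each point by its nearest centroid and the reduce step averages each cluster's members to produce the next centroids. A minimal in-memory sketch (random 2-D points, k = 3, and a fixed 10 iterations are arbitrary choices):

```python
import numpy as np

def map_assign(points, centroids):
    """Map step: key each point by the index of its nearest centroid."""
    for p in points:
        yield int(np.argmin(np.linalg.norm(centroids - p, axis=1))), p

def reduce_recompute(assignments, k):
    """Reduce step: average the members of each cluster."""
    sums, counts = np.zeros((k, 2)), np.zeros(k)
    for idx, p in assignments:
        sums[idx] += p
        counts[idx] += 1
    # Empty clusters collapse to the origin in this toy version.
    return sums / np.maximum(counts, 1)[:, None]

rng = np.random.default_rng(0)
points = rng.standard_normal((200, 2))
centroids = points[:3].copy()
for _ in range(10):  # one MapReduce job per iteration in the paper's setting
    centroids = reduce_recompute(map_assign(points, centroids), k=3)
print(centroids)
```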
Forget the hype or reality. Big data presents new opportunities in Earth Science.
NASA Astrophysics Data System (ADS)
Lee, T. J.
2015-12-01
Earth science is arguably one of the most mature science disciplines, one which constantly acquires, curates, and utilizes a large volume of data of diverse variety. We dealt with big data before there was big data. For example, while developing the EOS program in the 1980s, the EOS Data and Information System (EOSDIS) was developed to manage the vast amount of data acquired by the EOS fleet of satellites. EOSDIS has continued to be a shining example of modern science data systems over the past two decades. With the explosion of the internet, the usage of social media, and the provision of sensors everywhere, the big data era has brought new challenges. First, Google developed its search algorithm and a distributed data management system. The open source communities quickly followed up and developed the Hadoop file system to facilitate MapReduce workloads. The internet continues to generate tens of petabytes of data every day. There is a significant shortage of algorithms and knowledgeable manpower to mine the data. In response, the federal government developed big data programs that fund research and development projects and training programs to tackle these new challenges. Meanwhile, compared to the internet data explosion, the Earth science big data problem looks quite small. Nevertheless, the big data era presents an opportunity for Earth science to evolve. We have learned about MapReduce algorithms, in-memory data mining, machine learning, graph analysis, and semantic web technologies. How do we apply these new technologies to our discipline and bring the hype down to Earth? In this talk, I will discuss how we might apply some of the big data technologies to our discipline and solve many of our challenging problems. More importantly, I will propose a new Earth science data system architecture to enable new types of scientific inquiry.
Galpert, Deborah; del Río, Sara; Herrera, Francisco; Ancede-Gallardo, Evys; Antunes, Agostinho; Agüero-Chapin, Guillermin
2015-01-01
Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) are combined in a supervised pairwise ortholog detection approach to improve effectiveness considering low ortholog ratios in relation to the possible pairwise comparison between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with RBH, RSD, and OMA algorithms by using the following yeast genome pairs: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification. PMID:26605337
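The winning combination reported here, random oversampling of the minority (ortholog) class followed by a distributed SVM, can be expressed in a few lines of stock PySpark MLlib. A minimal sketch with four invented two-feature gene pairs and an arbitrary 3x replication factor in place of the paper's actual imbalance ratio:

```python
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="oversample-svm-sketch")

# Toy imbalanced gene-pair features: label 1 = ortholog (rare class).
data = [LabeledPoint(1.0, [0.9, 0.8]), LabeledPoint(0.0, [0.1, 0.2]),
        LabeledPoint(0.0, [0.2, 0.1]), LabeledPoint(0.0, [0.15, 0.3])]
rdd = sc.parallelize(data)

# Random oversampling of the minority class, done as a distributed union;
# replicating it twice more gives 3x, a stand-in for the paper's ratio.
minority = rdd.filter(lambda lp: lp.label == 1.0)
balanced = rdd.union(sc.union([minority] * 2))

model = SVMWithSGD.train(balanced, iterations=50)
print(model.predict([0.85, 0.75]))
sc.stop()
```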
Machine learning patterns for neuroimaging-genetic studies in the cloud.
Da Mota, Benoit; Tudoran, Radu; Costan, Alexandru; Varoquaux, Gaël; Brasche, Goetz; Conrod, Patricia; Lemaitre, Herve; Paus, Tomas; Rietschel, Marcella; Frouin, Vincent; Poline, Jean-Baptiste; Antoniu, Gabriel; Thirion, Bertrand
2014-01-01
Brain imaging is a natural intermediate phenotype to understand the link between genetic information and behavior or brain pathologies risk factors. Massive efforts have been made in the last few years to acquire high-dimensional neuroimaging and genetic data on large cohorts of subjects. The statistical analysis of such data is carried out with increasingly sophisticated techniques and represents a great computational challenge. Fortunately, increasing computational power in distributed architectures can be harnessed, if new neuroinformatics infrastructures are designed and training to use these new tools is provided. Combining a MapReduce framework (TomusBLOB) with machine learning algorithms (Scikit-learn library), we design a scalable analysis tool that can deal with non-parametric statistics on high-dimensional data. End-users describe the statistical procedure to perform and can then test the model on their own computers before running the very same code in the cloud at a larger scale. We illustrate the potential of our approach on real data with an experiment showing how the functional signal in subcortical brain regions can be significantly fit with genome-wide genotypes. This experiment demonstrates the scalability and the reliability of our framework in the cloud with a 2 weeks deployment on hundreds of virtual machines.
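The workflow, non-parametric statistics whose independent units are fanned out map-style, can be approximated locally. A toy sketch under stated assumptions: ridge regression stands in for the actual statistical models, the data are synthetic, and `ProcessPoolExecutor` plays the role of the MapReduce layer.

```python
# Toy permutation test distributed map-style over processes.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # genotypes (synthetic)
y = 0.5 * X[:, 0] + rng.normal(size=200)   # imaging signal (synthetic)

def null_score(seed):
    r = np.random.default_rng(seed)
    y_perm = r.permutation(y)              # break the genotype-signal link
    return Ridge().fit(X, y_perm).score(X, y_perm)

if __name__ == "__main__":
    observed = Ridge().fit(X, y).score(X, y)
    with ProcessPoolExecutor() as pool:                 # "map" phase
        null = list(pool.map(null_score, range(200)))
    p = (1 + sum(s >= observed for s in null)) / (1 + len(null))
    print(f"permutation p-value: {p:.3f}")
```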
A quantitative model of application slow-down in multi-resource shared systems
Lim, Seung-Hwan; Kim, Youngjae
2016-12-26
Scheduling multiple jobs onto a platform enhances system utilization by sharing resources. The benefits from higher resource utilization include reduced cost to construct, operate, and maintain a system, which often includes energy consumption. Maximizing these benefits comes at a price: resource contention among jobs increases job completion time. In this study, we analyze the slow-downs of jobs due to contention for multiple resources in a system, referred to as the dilation factor. We observe that multiple-resource contention creates non-linear dilation factors of jobs. From this observation, we establish a general quantitative model for dilation factors of jobs in multi-resource systems. A job is characterized by vector-valued loading statistics, and the dilation factors of a job set are given by a quadratic function of their loading vectors. We demonstrate how to systematically characterize a job, maintain the data structure to calculate the dilation factor (the loading matrix), and calculate the dilation factor of each job. We validate the accuracy of the model with multiple processes running on a native Linux server, on virtualized servers, and with multiple MapReduce workloads co-scheduled in a cluster. Evaluation with measured data shows that the D-factor model has an error margin of less than 16%. We extended the D-factor model to capture the slow-down of applications when multiple identical resources exist, such as multi-core and multi-disk environments. Finally, validation results of the extended D-factor model with HPC checkpoint applications on parallel file systems show that the D-factor accurately captures the slow-down of concurrent applications in such environments.
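The claim that dilation factors are a quadratic function of loading vectors is easy to illustrate. The following toy numpy sketch assumes a simple bilinear form; the interaction matrix and the numbers are invented for illustration, not the paper's fitted model.

```python
import numpy as np

# Loading vectors: fraction of CPU, disk and network each job uses alone.
l_a = np.array([0.7, 0.2, 0.1])   # CPU-bound job
l_b = np.array([0.1, 0.8, 0.3])   # I/O-bound job

# Assumed interaction (loading) matrix: diagonal entries model contention on
# the same resource, off-diagonal entries model cross-resource interference.
Q = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.5, 0.3],
              [0.1, 0.3, 0.8]])

# Dilation of job a when co-scheduled with job b; 1.0 means no slow-down.
d_a = 1.0 + l_a @ Q @ l_b
print(f"predicted dilation factor for job a: {d_a:.2f}")
```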
A Parallel and Incremental Approach for Data-Intensive Learning of Bayesian Networks.
Yue, Kun; Fang, Qiyu; Wang, Xiaoling; Li, Jin; Liu, Weiyi
2015-12-01
Bayesian network (BN) has been adopted as the underlying model for representing and inferring uncertain knowledge. As the basis of realistic applications centered on probabilistic inferences, learning a BN from data is a critical subject of machine learning, artificial intelligence, and big data paradigms. Currently, it is necessary to extend the classical methods for learning BNs with respect to data-intensive computing or in cloud environments. In this paper, we propose a parallel and incremental approach for data-intensive learning of BNs from massive, distributed, and dynamically changing data by extending the classical scoring and search algorithm and using MapReduce. First, we adopt the minimum description length as the scoring metric and give the two-pass MapReduce-based algorithms for computing the required marginal probabilities and scoring the candidate graphical model from sample data. Then, we give the corresponding strategy for extending the classical hill-climbing algorithm to obtain the optimal structure, as well as that for storing a BN by
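The first of the two MapReduce passes described above amounts to counting (child value, parent configuration) frequencies, from which the marginal probabilities in the MDL score follow. A single-process sketch, with assumed variable names:

```python
from collections import Counter
from itertools import chain

samples = [{"smoker": 1, "cancer": 1}, {"smoker": 0, "cancer": 0},
           {"smoker": 1, "cancer": 0}]

def map_phase(record, child, parents):
    # Emit one key per (child value, parent configuration) per record.
    yield (record[child], tuple(record[p] for p in parents)), 1

def reduce_phase(pairs):
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

counts = reduce_phase(chain.from_iterable(
    map_phase(s, "cancer", ["smoker"]) for s in samples))
print(counts)  # frequencies feeding the MDL scoring pass
```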
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
NASA Astrophysics Data System (ADS)
Xu, Meng; Gao, Dan; Deng, Chao; Luo, Zhiguo; Sun, Shaoling
Business Intelligence has become an attractive topic in today's data-intensive applications, especially in the telecommunication industry. Meanwhile, Cloud Computing, which provides a supporting IT infrastructure with excellent scalability, large-scale storage, and high performance, has become an effective way to implement parallel data processing and data mining algorithms. BC-PDM (Big Cloud based Parallel Data Miner) is a new MapReduce-based parallel data mining platform developed by CMRI (China Mobile Research Institute) to meet the urgent requirements of business intelligence in the telecommunication industry. In this paper, the architecture, functionality and performance of BC-PDM are presented, together with an experimental evaluation and case studies of its applications. The evaluation results demonstrate both the usability and the cost-effectiveness of a Cloud Computing based Business Intelligence system in applications of the telecommunication industry.
Ng, Kenney; Ghoting, Amol; Steinhubl, Steven R.; Stewart, Walter F.; Malin, Bradley; Sun, Jimeng
2014-01-01
Objective: Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: 1) cohort construction, 2) feature construction, 3) cross-validation, 4) feature selection, and 5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data. Methods: To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which 1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, 2) schedules the tasks in a topological ordering of the graph, and 3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported. Results: We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency against a standard approach. In particular, PARAMO can build 800 different models on a 300,000 patient data set in 3 hours in parallel compared to 9 days if running sequentially. Conclusion: This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed-up the research workflow and reuse of health information. This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers. PMID:24370496
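The scheduling core of PARAMO (build a dependency graph, order it topologically, run independent tasks concurrently) maps directly onto Python's standard library. A minimal sketch with invented task names; the real platform dispatches Map-Reduce jobs rather than local functions.

```python
from graphlib import TopologicalSorter
from concurrent.futures import ThreadPoolExecutor

# Task -> set of prerequisite tasks (illustrative pipeline stages).
graph = {"cohort": set(), "features": {"cohort"}, "cv_split": {"features"},
         "select": {"cv_split"}, "classify": {"select"}}

def run(task):
    print(f"running {task}")  # placeholder for a real pipeline stage

ts = TopologicalSorter(graph)
ts.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while ts.is_active():
        ready = list(ts.get_ready())      # these tasks are mutually independent
        for task, _ in zip(ready, pool.map(run, ready)):
            ts.done(task)
```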
On the Modeling and Management of Cloud Data Analytics
NASA Astrophysics Data System (ADS)
Castillo, Claris; Tantawi, Asser; Steinder, Malgorzata; Pacifici, Giovanni
A new era is dawning where vast amount of data is subjected to intensive analysis in a cloud computing environment. Over the years, data about a myriad of things, ranging from user clicks to galaxies, have been accumulated, and continue to be collected, on storage media. The increasing availability of such data, along with the abundant supply of compute power and the urge to create useful knowledge, gave rise to a new data analytics paradigm in which data is subjected to intensive analysis, and additional data is created in the process. Meanwhile, a new cloud computing environment has emerged where seemingly limitless compute and storage resources are being provided to host computation and data for multiple users through virtualization technologies. Such a cloud environment is becoming the home for data analytics. Consequently, providing good performance at run-time to data analytics workload is an important issue for cloud management. In this paper, we provide an overview of the data analytics and cloud environment landscapes, and investigate the performance management issues related to running data analytics in the cloud. In particular, we focus on topics such as workload characterization, profiling analytics applications and their pattern of data usage, cloud resource allocation, placement of computation and data and their dynamic migration in the cloud, and performance prediction. In solving such management problems one relies on various run-time analytic models. We discuss approaches for modeling and optimizing the dynamic data analytics workload in the cloud environment. All along, we use the Map-Reduce paradigm as an illustration of data analytics.
Using Ontology Fingerprints to disambiguate gene name entities in the biomedical literature
Chen, Guocai; Zhao, Jieyi; Cohen, Trevor; Tao, Cui; Sun, Jingchun; Xu, Hua; Bernstam, Elmer V.; Lawson, Andrew; Zeng, Jia; Johnson, Amber M.; Holla, Vijaykumar; Bailey, Ann M.; Lara-Guerra, Humberto; Litzenburger, Beate; Meric-Bernstam, Funda; Jim Zheng, W.
2015-01-01
Ambiguous gene names in the biomedical literature are a barrier to accurate information extraction. To overcome this hurdle, we generated Ontology Fingerprints for selected genes that are relevant for personalized cancer therapy. These Ontology Fingerprints were used to evaluate the association between genes and biomedical literature to disambiguate gene names. We obtained 93.6% precision for the test gene set and 80.4% for the area under a receiver-operating characteristics curve for gene and article association. The core algorithm was implemented using a graphics processing unit-based MapReduce framework to handle big data and to improve performance. We conclude that Ontology Fingerprints can help disambiguate gene names mentioned in text and analyse the association between genes and articles. Database URL: http://www.ontologyfingerprint.org PMID:25858285
Big data mining analysis method based on cloud computing
NASA Astrophysics Data System (ADS)
Cai, Qing Qiu; Cui, Hong Gang; Tang, Hao
2017-08-01
In the era of information explosion, the super-large scale, discrete, and semi- or un-structured nature of big data has gone far beyond what traditional data management approaches can handle. With the arrival of the cloud computing era, cloud computing provides a new technical way to analyze massive data and can effectively solve the problem that traditional data mining methods cannot adapt to massive data. This paper introduces the meaning and characteristics of cloud computing, analyzes the advantages of using cloud computing technology for data mining, designs an association rule mining algorithm based on the MapReduce parallel processing architecture, and carries out experimental verification. The parallel association rule mining algorithm based on the cloud computing platform can greatly improve the execution speed of data mining.
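The map and reduce roles in parallel association-rule mining are simple: mappers emit candidate itemsets per transaction, and reducers sum support counts and apply a minimum-support threshold. A hedged PySpark sketch; the paper does not specify its exact platform, and the transactions here are toy data.

```python
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="assoc-rules")
transactions = sc.parallelize([{"milk", "bread"}, {"milk", "eggs"},
                               {"milk", "bread", "eggs"}])

MIN_SUPPORT = 2
frequent = (transactions
            .flatMap(lambda t: [(c, 1) for k in (1, 2)
                                for c in combinations(sorted(t), k)])  # map
            .reduceByKey(lambda a, b: a + b)                           # reduce
            .filter(lambda kv: kv[1] >= MIN_SUPPORT))
print(frequent.collect())  # frequent 1- and 2-itemsets with their support
```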
Sadiq, Abderrahmane; El Fazziki, Abdelaziz; Ouarzazi, Jamal; Sadgal, Mohamed
2016-01-01
This paper presents an integrated and adaptive problem-solving approach to control on-road air quality by modeling the road infrastructure, managing traffic based on pollution level, and generating recommendations for road users. The aim is to reduce vehicle emissions in the most polluted road segments and optimize pollution levels. To this end, we propose using historical and real-time pollution records and contextual data to calculate the air quality index on road networks and to generate recommendations for reassigning traffic flow in order to improve on-road air quality. The resulting air quality indexes are used in the system's traffic network generation, whose cartography is represented by a weighted graph. The weights evolve according to the pollution indexes and path properties, so the graph is dynamic. Furthermore, the system uses the available pollution data and meteorological records to predict on-road pollutant levels with an artificial neural network based prediction model. The proposed approach combines the benefits of multi-agent systems, Big Data technology, machine learning tools and the available data sources. For shortest-path searching in the road network, we use the Dijkstra algorithm over the Hadoop MapReduce framework. The use of the Hadoop framework in the data retrieval and analysis process has significantly improved the performance of the proposed system. In addition, the agent technology allows for a solution that is suitably robust and agile.
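Dijkstra's algorithm does not parallelize directly, so MapReduce formulations typically iterate a relaxation step: each mapper relaxes the outgoing edges of a node, and the reducer keeps the minimum tentative distance per node. A single-process sketch, with invented weights standing in for the pollution-weighted graph:

```python
INF = float("inf")
graph = {"a": {"b": 2.0, "c": 5.0}, "b": {"c": 1.0}, "c": {}}
dist = {"a": 0.0, "b": INF, "c": INF}

def map_phase(node, d):
    yield node, d                      # pass through the node's own distance
    for neighbor, w in graph[node].items():
        if d < INF:
            yield neighbor, d + w      # relax every outgoing edge

def reduce_phase(pairs):
    best = {}
    for node, d in pairs:
        best[node] = min(best.get(node, INF), d)
    return best

for _ in range(len(graph) - 1):        # iterate until distances stabilize
    dist = reduce_phase(p for n, d in dist.items() for p in map_phase(n, d))
print(dist)  # {'a': 0.0, 'b': 2.0, 'c': 3.0}
```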
Analyzing large scale genomic data on the cloud with Sparkhit
Huang, Liren; Krüger, Jan
2018-01-01
Motivation: The increasing amount of next-generation sequencing data poses a fundamental challenge on large scale genomic analytics. Existing tools use different distributed computational platforms to scale-out bioinformatics workloads. However, the scalability of these tools is not efficient. Moreover, they have heavy run time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. Results: Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92–157 times faster than MetaSpark on metagenomic fragment recruitment and 18–32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, which includes the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 h, presenting an approach to easily associate large amounts of public datasets with reference data. Availability and implementation: Sparkhit is freely available at https://rhinempi.github.io/sparkhit/. Contact: asczyrba@cebitec.uni-bielefeld.de. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:29253074
PLAStiCC: Predictive Look-Ahead Scheduling for Continuous dataflows on Clouds
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumbhare, Alok; Simmhan, Yogesh; Prasanna, Viktor K.
2014-05-27
Scalable stream processing and continuous dataflow systems are gaining traction with the rise of big data due to the need for processing high velocity data in near real time. Unlike batch processing systems such as MapReduce and workflows, static scheduling strategies fall short for continuous dataflows due to the variations in the input data rates and the need for sustained throughput. The elastic resource provisioning of cloud infrastructure is valuable to meet the changing resource needs of such continuous applications. However, multi-tenant cloud resources introduce yet another dimension of performance variability that impacts the application's throughput. In this paper we propose PLAStiCC, an adaptive scheduling algorithm that balances resource cost and application throughput using a prediction-based look-ahead approach. It not only addresses variations in the input data rates but also the underlying cloud infrastructure. In addition, we also propose several simpler static scheduling heuristics that operate in the absence of an accurate performance prediction model. These static and adaptive heuristics are evaluated through extensive simulations using performance traces obtained from public and private IaaS clouds. Our results show an improvement of up to 20% in the overall profit as compared to the reactive adaptation algorithm.
Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF
NASA Astrophysics Data System (ADS)
Li, Z.; Yang, C. P.; Schnase, J. L.; Duffy, D.; Lee, T. J.
2014-12-01
Climate observations and model simulations are collecting and generating vast amounts of climate data, and these data are ever-increasing and accumulating at a rapid pace. Effectively managing and analyzing these data is essential for climate change studies. Hadoop, a distributed storage and processing framework for large data sets, has attracted increasing attention for dealing with the Big Data challenge. The maturity of the Infrastructure as a Service (IaaS) model of cloud computing further accelerates the adoption of Hadoop in solving Big Data problems. However, Hadoop is designed to process unstructured data such as texts, documents and web pages, and cannot effectively handle scientific data formats such as array-based NetCDF files and other binary formats. In this paper, we propose to build a Hadoop-based middleware for transparently handling big NetCDF data by 1) designing a distributed climate data storage mechanism based on a POSIX-enabled parallel file system to enable parallel big data processing with MapReduce, as well as to support data access by other systems; 2) modifying the Hadoop framework to transparently process NetCDF data in parallel without sequencing the data, converting it into other file formats, or loading it into HDFS; and 3) seamlessly integrating Hadoop, cloud computing and climate data in a highly scalable and fault-tolerant framework.
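The essential point, mapping tasks over the native NetCDF arrays in place with no format conversion or HDFS load, can be imitated with the netCDF4 library and a process-parallel map over time slices. The file name and variable are assumptions, and the stride-based split is only illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
from netCDF4 import Dataset
import numpy as np

PATH, VAR, NWORKERS = "merra_t2m.nc4", "T2M", 4

def chunk_mean(time_slice):
    # Each worker opens the file on the shared POSIX file system and reads
    # only its own time steps, mimicking a map task over an input split.
    with Dataset(PATH) as nc:
        return float(np.asarray(nc[VAR][time_slice]).mean())

if __name__ == "__main__":
    with Dataset(PATH) as nc:
        nsteps = nc[VAR].shape[0]
    splits = [slice(i, nsteps, NWORKERS) for i in range(NWORKERS)]
    with ProcessPoolExecutor(NWORKERS) as pool:
        partials = list(pool.map(chunk_mean, splits))     # map phase
    print(sum(partials) / len(partials))  # reduce (approximate if splits differ)
```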
Mudunuri, Uma S; Khouja, Mohamad; Repetski, Stephen; Venkataraman, Girish; Che, Anney; Luke, Brian T; Girard, F Pascal; Stephens, Robert M
2013-01-01
As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.
Meeker, Daniella; Jiang, Xiaoqian; Matheny, Michael E; Farcas, Claudiu; D'Arcy, Michel; Pearlman, Laura; Nookala, Lavanya; Day, Michele E; Kim, Katherine K; Kim, Hyeoneui; Boxwala, Aziz; El-Kareh, Robert; Kuo, Grace M; Resnic, Frederic S; Kesselman, Carl; Ohno-Machado, Lucila
2015-11-01
Centralized and federated models for sharing data in research networks currently exist. To build multivariate data analysis for centralized networks, transfer of patient-level data to a central computation resource is necessary. The authors implemented distributed multivariate models for federated networks in which patient-level data is kept at each site and data exchange policies are managed in a study-centric manner. The objective was to implement infrastructure that supports the functionality of some existing research networks (e.g., cohort discovery, workflow management, and estimation of multivariate analytic models on centralized data) while adding additional important new features, such as algorithms for distributed iterative multivariate models, a graphical interface for multivariate model specification, synchronous and asynchronous response to network queries, investigator-initiated studies, and study-based control of staff, protocols, and data sharing policies. Based on the requirements gathered from statisticians, administrators, and investigators from multiple institutions, the authors developed infrastructure and tools to support multisite comparative effectiveness studies using web services for multivariate statistical estimation in the SCANNER federated network. The authors implemented massively parallel (map-reduce) computation methods and a new policy management system to enable each study initiated by network participants to define the ways in which data may be processed, managed, queried, and shared. The authors illustrated the use of these systems among institutions with highly different policies and operating under different state laws. Federated research networks need not limit distributed query functionality to count queries, cohort discovery, or independently estimated analytic models. Multivariate analyses can be efficiently and securely conducted without patient-level data transport, allowing institutions with strict local data storage requirements to participate in sophisticated analyses based on federated research networks. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
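The statistical heart of such a federated design is that only aggregate summaries cross site boundaries. An illustrative sketch (not SCANNER's actual protocol): each site computes the local gradient of a logistic-regression loss, and the coordinating node sees gradients only, never patient-level rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three "sites", each holding its own private (X, y); data are synthetic.
sites = [(rng.normal(size=(50, 3)), rng.integers(0, 2, 50)) for _ in range(3)]

def local_gradient(X, y, beta):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y) / len(y)          # summary statistic, not raw data

beta = np.zeros(3)
for _ in range(200):                       # synchronous iterative estimation
    grads = [local_gradient(X, y, beta) for X, y in sites]  # runs at each site
    beta -= 0.5 * np.mean(grads, axis=0)   # central node aggregates only
print(beta)                                # shared multivariate estimate
```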
NASA Astrophysics Data System (ADS)
Zhou, S.; Tao, W. K.; Li, X.; Matsui, T.; Sun, X. H.; Yang, X.
2015-12-01
A cloud-resolving model (CRM) is an atmospheric numerical model that can numerically resolve clouds and cloud systems at 0.25~5km horizontal grid spacings. The main advantage of the CRM is that it can allow explicit interactive processes between microphysics, radiation, turbulence, surface, and aerosols without subgrid cloud fraction, overlapping and convective parameterization. Because of their fine resolution and complex physical processes, it is challenging for the CRM community to i) visualize/inter-compare CRM simulations, ii) diagnose key processes for cloud-precipitation formation and intensity, and iii) evaluate against NASA's field campaign data and L1/L2 satellite data products due to large data volume (~10TB) and complexity of CRM's physical processes. We have been building the Super Cloud Library (SCL) upon a Hadoop framework, capable of CRM database management, distribution, visualization, subsetting, and evaluation in a scalable way. The current SCL capability includes (1) an SCL data model that enables various CRM simulation outputs in NetCDF, including the NASA-Unified Weather Research and Forecasting (NU-WRF) and Goddard Cumulus Ensemble (GCE) model, to be accessed and processed by Hadoop, (2) a parallel NetCDF-to-CSV converter that supports NU-WRF and GCE model outputs, (3) a technique that visualizes Hadoop-resident data with IDL, (4) a technique that subsets Hadoop-resident data, compliant to the SCL data model, with HIVE or Impala via HUE's Web interface, (5) a prototype that enables a Hadoop MapReduce application to dynamically access and process data residing in a parallel file system, PVFS2 or CephFS, where high performance computing (HPC) simulation outputs such as NU-WRF's and GCE's are located. We are testing Apache Spark to speed up SCL data processing and analysis. With the SCL capabilities, SCL users can conduct large-domain on-demand tasks without downloading voluminous CRM datasets and various observations from NASA Field Campaigns and Satellite data to a local computer, and inter-compare CRM output and data with GCE and NU-WRF.
Design of material management system of mining group based on Hadoop
NASA Astrophysics Data System (ADS)
Xia, Zhiyuan; Tan, Zhuoying; Qi, Kuan; Li, Wen
2018-01-01
Against the background of a persistent slowdown in the mining market, improving the management level of a mining group has become key to improving the economic benefit of a mine. Based on the practical material management needs of a mining group, three core components of Hadoop are applied: the distributed file system HDFS, the distributed computing framework Map/Reduce, and the distributed database HBase. A material management system for the mining group is constructed with these three core components and the SSH framework technology. The system strengthens collaboration between the mining group and its affiliated companies, solves problems of traditional mining material-management systems such as inefficient management, server pressure, and hardware performance deficiencies, optimizes the group's materials management, reduces management cost, and increases enterprise profit.
Cost-effective cloud computing: a case study using the comparative genomics tool, roundup.
Kudtarkar, Parul; Deluca, Todd F; Fusaro, Vincent A; Tonellato, Peter J; Wall, Dennis P
2010-12-22
Comparative genomics resources, such as ortholog detection tools and repositories are rapidly increasing in scale and complexity. Cloud computing is an emerging technological paradigm that enables researchers to dynamically build a dedicated virtual cluster and may represent a valuable alternative for large computational tools in bioinformatics. In the present manuscript, we optimize the computation of a large-scale comparative genomics resource-Roundup-using cloud computing, describe the proper operating principles required to achieve computational efficiency on the cloud, and detail important procedures for improving cost-effectiveness to ensure maximal computation at minimal costs. Utilizing the comparative genomics tool, Roundup, as a case study, we computed orthologs among 902 fully sequenced genomes on Amazon's Elastic Compute Cloud. For managing the ortholog processes, we designed a strategy to deploy the web service, Elastic MapReduce, and maximize the use of the cloud while simultaneously minimizing costs. Specifically, we created a model to estimate cloud runtime based on the size and complexity of the genomes being compared that determines in advance the optimal order of the jobs to be submitted. We computed orthologous relationships for 245,323 genome-to-genome comparisons on Amazon's computing cloud, a computation that required just over 200 hours and cost $8,000 USD, at least 40% less than expected under a strategy in which genome comparisons were submitted to the cloud randomly with respect to runtime. Our cost savings projections were based on a model that not only demonstrates the optimal strategy for deploying RSD to the cloud, but also finds the optimal cluster size to minimize waste and maximize usage. Our cost-reduction model is readily adaptable for other comparative genomics tools and potentially of significant benefit to labs seeking to take advantage of the cloud as an alternative to local computing infrastructure.
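The payoff of the cost model above, ordering jobs by predicted runtime before submission, is essentially longest-processing-time-first scheduling. A toy sketch with invented runtimes showing why ordering matters on a fixed-size cluster:

```python
import heapq

def makespan(jobs, workers=3):
    # Greedy assignment of each job to the currently least-loaded worker.
    loads = [0.0] * workers
    heapq.heapify(loads)
    for j in jobs:
        heapq.heappush(loads, heapq.heappop(loads) + j)
    return max(loads)

est_hours = [9, 7, 5, 4, 3, 3, 2, 1]              # predicted per-job runtimes
print(makespan(sorted(est_hours, reverse=True)))  # longest first -> 12 hours
print(makespan(sorted(est_hours)))                # shortest first -> 15 hours
```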
Bigdata Driven Cloud Security: A Survey
NASA Astrophysics Data System (ADS)
Raja, K.; Hanifa, Sabibullah Mohamed
2017-08-01
Cloud Computing (CC) is a fast-growing technology for performing massive-scale and complex computing. It eliminates the need to maintain expensive computing hardware, dedicated space, and software. Recently, massive growth has been observed in the scale of data, or big data, generated through cloud computing. CC consists of a front end, which includes the users' computers and the software required to access the cloud network, and a back end, which consists of the various computers, servers and database systems that create the cloud. The leading cloud ecosystem delivers services through SaaS (Software as a Service, where end users utilize outsourced software), PaaS (Platform as a Service, where the platform is provided), IaaS (Infrastructure as a Service, where the physical environment is outsourced), and DaaS (Database as a Service, where data can be housed within a cloud), and has become a powerful and popular architecture. Many challenges and issues concern security and threats, the most vital barrier for the cloud computing environment. The main barrier to the adoption of CC in health care relates to data security: when placing and transmitting data over public networks, cyber attacks in any form are to be anticipated. Hence, cloud service users need to understand the risk of data breaches and the choice of service delivery model during deployment. This survey covers CC security issues in depth (including data security in health care) so that researchers can develop robust security application models using Big Data (BD) on CC, which can then be created and deployed easily. BD evaluation is driven by fast-growing cloud-based applications developed using virtualized technologies. In this purview, MapReduce [12] is a good example of big data processing in a cloud environment, and a model for cloud providers.
MzJava: An open source library for mass spectrometry data processing.
Horlacher, Oliver; Nikitin, Frederic; Alocci, Davide; Mariethoz, Julien; Müller, Markus; Lisacek, Frederique
2015-11-03
Mass spectrometry (MS) is a widely used and evolving technique for the high-throughput identification of molecules in biological samples. The need for sharing and reuse of code among bioinformaticians working with MS data prompted the design and implementation of MzJava, an open-source Java Application Programming Interface (API) for MS related data processing. MzJava provides data structures and algorithms for representing and processing mass spectra and their associated biological molecules, such as metabolites, glycans and peptides. MzJava includes functionality to perform mass calculation, peak processing (e.g. centroiding, filtering, transforming), spectrum alignment and clustering, protein digestion, fragmentation of peptides and glycans as well as scoring functions for spectrum-spectrum and peptide/glycan-spectrum matches. For data import and export MzJava implements readers and writers for commonly used data formats. For many classes, support for the Hadoop MapReduce (hadoop.apache.org) and Apache Spark (spark.apache.org) cluster-computing frameworks was implemented. The library has been developed applying best practices of software engineering. To ensure that MzJava contains code that is correct and easy to use, the library's API was carefully designed and thoroughly tested. MzJava is an open-source project distributed under the AGPL v3.0 licence. MzJava requires Java 1.7 or higher. Binaries, source code and documentation can be downloaded from http://mzjava.expasy.org and https://bitbucket.org/sib-pig/mzjava. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015 Elsevier B.V. All rights reserved.
Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes
Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni
2013-01-01
Motivation: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. Results: We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. Availability: The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana. PMID:24564704
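The parallelization pattern behind PaRFR rests on random forests being embarrassingly parallel: each map task grows part of the ensemble on bootstrap samples, and the reduce step pools the trees. A conceptual sketch with scikit-learn standing in for the Hadoop/Java implementation, on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X, y = rng.normal(size=(500, 20)), rng.normal(size=500)  # synthetic data

def map_task(seed, trees_per_task=25):
    # Each "mapper" grows its own slice of the forest on bootstrap samples.
    return RandomForestRegressor(n_estimators=trees_per_task,
                                 random_state=seed).fit(X, y)

def reduce_task(forests):
    merged = forests[0]
    for rf in forests[1:]:
        merged.estimators_ += rf.estimators_      # pool the trees
    merged.n_estimators = len(merged.estimators_)
    return merged

forest = reduce_task([map_task(seed) for seed in range(4)])  # 100 trees total
print(forest.predict(X[:3]))
```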
Emerging Cyber Infrastructure for NASA's Large-Scale Climate Data Analytics
NASA Astrophysics Data System (ADS)
Duffy, D.; Spear, C.; Bowen, M. K.; Thompson, J. H.; Hu, F.; Yang, C. P.; Pierce, D.
2016-12-01
The resolution of NASA climate and weather simulations has grown dramatically over the past few years, with the highest-fidelity models reaching down to 1.5 km global resolution. With each doubling of the resolution, the resulting data sets grow by a factor of eight in size. As the climate and weather models push the envelope even further, a new infrastructure to store data and provide large-scale data analytics is necessary. The NASA Center for Climate Simulation (NCCS) has deployed the Data Analytics Storage Service (DASS), which combines scalable storage with the ability to perform in-situ analytics. Within this system, large, commonly used data sets are stored in a POSIX file system (write once/read many); examples of stored data include Landsat, MERRA2, observing system simulation experiments, and high-resolution downscaled reanalysis. The total size of this repository is on the order of 15 petabytes. In addition to the POSIX file system, the NCCS has deployed file system connectors to enable emerging analytics built on top of the Hadoop File System (HDFS) to run on the same storage servers within the DASS. Coupled with a custom spatiotemporal indexing approach, users can now run emerging analytical operations built on MapReduce and Spark on the same data files stored within the POSIX file system without having to make additional copies. This presentation will discuss the architecture of this system and present benchmark performance measurements ranging from the traditional TeraSort and Wordcount to large-scale climate analytical operations on NetCDF data.
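Wordcount, one of the two benchmarks named above, is the canonical MapReduce example and compresses to a few lines when run in a single process:

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to analyze is to reduce"]

def map_phase(line):
    for word in line.split():
        yield word, 1          # mapper emits (word, 1) pairs

def reduce_phase(pairs):
    counts = Counter()
    for word, n in pairs:      # reducer sums counts per key
        counts[word] += n
    return counts

print(reduce_phase(chain.from_iterable(map_phase(l) for l in lines)))
```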
NASA Astrophysics Data System (ADS)
Jiang, Guodong; Fan, Ming; Li, Lihua
2016-03-01
Mammography is the gold standard for breast cancer screening, reducing mortality by about 30%. The application of a computer-aided detection (CAD) system to assist a single radiologist is important to further improve mammographic sensitivity for breast cancer detection. In this study, the design and realization of a prototype remote diagnosis system for mammography based on a cloud platform is proposed. To build this system, technologies including medical image information construction, cloud infrastructure and a human-machine diagnosis model were utilized. Specifically, on one hand, the web platform for remote diagnosis was established with J2EE web technology, and the back end was realized through the Hadoop open-source framework. On the other hand, the storage system was built with the Hadoop distributed file system (HDFS), which enables users to easily develop and run applications on massive data and gives full play to the advantages of cloud computing, characterized by high efficiency, scalability and low cost. In addition, the CAD system was realized through the MapReduce framework. The diagnosis module of this system implements algorithms that fuse machine and human intelligence: diagnoses drawn from doctors' experience and from traditional CAD are combined using a man-machine intelligent fusion model based on Alpha-Integration and a multi-agent algorithm. Finally, applications of this system at different levels of the platform are also discussed. This diagnosis system will be of great importance for balancing health resources, lowering medical expenses and improving diagnostic accuracy in primary medical institutions.
Network selection, Information filtering and Scalable computation
NASA Astrophysics Data System (ADS)
Ye, Changqing
This dissertation explores two application scenarios of the sparsity pursuit method on large scale data sets. The first scenario is classification and regression in analyzing high dimensional structured data, where predictors correspond to nodes of a given directed graph. This arises in, for instance, the identification of disease genes for Parkinson's disease from a network of candidate genes. In such a situation, the directed graph describes dependencies among the genes, where directions of edges represent certain causal effects. Key to high-dimensional structured classification and regression is how to utilize dependencies among predictors as specified by directions of the graph. In this dissertation, we develop a novel method that fully takes into account such dependencies formulated through certain nonlinear constraints. We apply the proposed method to two applications, feature selection in large margin binary classification and in linear regression. We implement the proposed method through difference convex programming for the cost function and constraints. Finally, theoretical and numerical analyses suggest that the proposed method achieves the desired objectives. An application to disease gene identification is presented. The second application scenario is personalized information filtering, which extracts the information specifically relevant to a user, predicting his/her preference over a large number of items based on the opinions of users who think alike or on content. This problem is cast into the framework of regression and classification, where we introduce novel partial latent models to integrate additional user-specific and content-specific predictors for higher predictive accuracy. In particular, we factorize a user-over-item preference matrix into a product of two matrices, each representing a user's preference and an item preference by users. Then we propose a likelihood method to seek the sparsest latent factorization, from a class of over-complete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a "decomposition and combination" strategy, to break large-scale optimization into many small subproblems to solve in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for MapReduce computation. For example, our methods are scalable to a dataset consisting of three billion observations on a single machine with sufficient memory, with good timings. Both theoretical and numerical investigations show that the proposed methods exhibit significant improvement in accuracy over state-of-the-art scalable methods.
mtDNA-Server: next-generation sequencing data analysis of human mitochondrial DNA in the cloud.
Weissensteiner, Hansi; Forer, Lukas; Fuchsberger, Christian; Schöpf, Bernd; Kloss-Brandstätter, Anita; Specht, Günther; Kronenberg, Florian; Schönherr, Sebastian
2016-07-08
Next generation sequencing (NGS) allows investigating mitochondrial DNA (mtDNA) characteristics such as heteroplasmy (i.e. intra-individual sequence variation) to a higher level of detail. While several pipelines for analyzing heteroplasmies exist, issues in usability, accuracy of results and interpreting final data limit their usage. Here we present mtDNA-Server, a scalable web server for the analysis of mtDNA studies of any size with a special focus on usability as well as reliable identification and quantification of heteroplasmic variants. The mtDNA-Server workflow includes parallel read alignment, heteroplasmy detection, artefact or contamination identification, variant annotation as well as several quality control metrics, often neglected in current mtDNA NGS studies. All computational steps are parallelized with Hadoop MapReduce and executed graphically with Cloudgene. We validated the underlying heteroplasmy and contamination detection model by generating four artificial sample mix-ups on two different NGS devices. Our evaluation data shows that mtDNA-Server detects heteroplasmies and artificial recombinations down to the 1% level with perfect specificity and outperforms existing approaches regarding sensitivity. mtDNA-Server is currently able to analyze the 1000G Phase 3 data (n = 2,504) in less than 5 h and is freely accessible at https://mtdna-server.uibk.ac.at. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Cross-domain question classification in community question answering via kernel mapping
NASA Astrophysics Data System (ADS)
Su, Lei; Hu, Zuoliang; Yang, Bin; Li, Yiyang; Chen, Jun
2015-10-01
An increasingly popular method for retrieving information is via community question answering (CQA) systems such as Yahoo! Answers and Baidu Knows. In CQA, question classification plays an important role in finding the answers. However, labeled training examples for a statistical question classifier are fairly expensive to obtain, as they require experienced human effort. Meanwhile, unlabeled data are readily available. This paper employs the method of domain adaptation via kernel mapping to solve this problem. In detail, the kernel approach is utilized to map the target-domain data and the source-domain data into a common space, where the question classifiers are trained under closer conditional probabilities. The kernel mapping function is constructed from domain knowledge. Therefore, domain knowledge can be transferred from the labeled examples in the source domain to the unlabeled ones in the target domain. The statistical training model can be improved by using a large number of unlabeled data. Meanwhile, the Hadoop platform is used to construct the mapping mechanism to reduce the time complexity: Map/Reduce enables kernel mapping for domain adaptation to run in parallel on the Hadoop platform. Experimental results show that the accuracy of question classification can be improved by the method of kernel mapping. Furthermore, the parallel method on the Hadoop platform can effectively schedule the computing resources to reduce the running time.
A Columnar Storage Strategy with Spatiotemporal Index for Big Climate Data
NASA Astrophysics Data System (ADS)
Hu, F.; Bowen, M. K.; Li, Z.; Schnase, J. L.; Duffy, D.; Lee, T. J.; Yang, C. P.
2015-12-01
Large collections of observational, reanalysis, and climate model output data may grow to as large as 100 PB in the coming years, so climate datasets are firmly in the Big Data domain, and various distributed computing frameworks have been utilized to address the challenges of big climate data analysis. However, due to the binary data formats (NetCDF, HDF) with high spatial and temporal dimensionality, the computing frameworks in the Apache Hadoop ecosystem are not originally suited for big climate data. In order to make the computing frameworks in the Hadoop ecosystem directly support big climate data, we propose a columnar storage format with a spatiotemporal index to store climate data, which will support any project in the Apache Hadoop ecosystem (e.g. MapReduce, Spark, Hive, Impala). With this approach, the climate data are transferred into the binary Parquet data format, a columnar storage format, and a spatial and temporal index is built and attached to the end of the Parquet files to enable real-time data query. Such climate data in Parquet format then become available to any computing framework in the Hadoop ecosystem. The proposed approach is evaluated using the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. Experimental results show that this approach can efficiently bridge the gap between big climate data and the distributed computing frameworks, and that the spatiotemporal index can significantly accelerate data querying and processing.
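Parquet already stores per-row-group min/max statistics, which is what lets a coarse spatiotemporal index piggyback on the format. A hedged sketch using pandas/pyarrow on synthetic data; the variable name and layout are assumptions, not the paper's exact design (the paper attaches its own index at the file end):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time": np.repeat(pd.date_range("2015-07-01", periods=4, freq="h"), 100),
    "lat":  np.tile(np.linspace(-90, 90, 100), 4),
    "t2m":  rng.normal(288, 10, 400),
})

# One row group per time step, so row-group statistics act as a time index.
pq.write_table(pa.Table.from_pandas(df), "t2m.parquet", row_group_size=100)

# Readers can prune row groups using the stored min/max statistics.
hot = pq.read_table("t2m.parquet", filters=[("lat", ">", 60)]).to_pandas()
print(len(hot))
```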
Hadoop distributed batch processing for Gaia: a success story
NASA Astrophysics Data System (ADS)
Riello, Marco
2015-12-01
The DPAC Cambridge Data Processing Centre (DPCI) is responsible for the photometric calibration of the Gaia data, including the low resolution spectra. The large data volume produced by Gaia (~26 billion transits/year), the complexity of its data stream and the self-calibrating approach pose unique challenges for the scalability, reliability and robustness of both the software pipelines and the operations infrastructure. DPCI was the first centre in DPAC to realise the potential of Hadoop and MapReduce and to adopt them as the core technologies for its infrastructure. This has proven to be a winning choice, giving DPCI processing throughput and reliability unmatched within DPAC, to the point that other DPCs have started following in our footsteps. In this talk we present the software infrastructure developed to build the distributed and scalable batch data processing system that is currently used in production at DPCI, together with the excellent performance results of the system.
NASA Astrophysics Data System (ADS)
Tokareva, Victoria
2018-04-01
New generation medicine demands a better quality of analysis, increasing the amount of data collected during checkups while simultaneously decreasing the invasiveness of procedures. It thus becomes urgent not only to develop advanced modern hardware, but also to implement the special software infrastructure needed to use it in everyday clinical practice, so-called Picture Archiving and Communication Systems (PACS). Developing distributed PACS is a challenging task in present-day medical informatics. The paper discusses the architecture of a distributed PACS server for processing large high-quality medical images, with respect to the technical specifications of modern medical imaging hardware as well as international standards for medical imaging software. The MapReduce paradigm is proposed for image reconstruction by the server, and the details of utilizing the Hadoop framework for this task are discussed in order to make the design of the distributed PACS as ergonomic and adapted to the needs of end users as possible.
Efficient LIDAR Point Cloud Data Managing and Processing in a Hadoop-Based Distributed Framework
NASA Astrophysics Data System (ADS)
Wang, C.; Hu, F.; Sha, D.; Han, X.
2017-10-01
Light Detection and Ranging (LiDAR) is one of the most promising technologies in surveying and mapping, city management, forestry, object recognition, computer vision engineering and other fields. However, it is challenging to efficiently store, query and analyze high-resolution 3D LiDAR data due to its volume and complexity. In order to improve the productivity of LiDAR data processing, this study proposes a Hadoop-based framework to efficiently manage and process LiDAR data in a distributed and parallel manner, taking advantage of Hadoop's storage and computing ability. At the same time, the Point Cloud Library (PCL), an open-source project for 2D/3D image and point cloud processing, is integrated with HDFS and MapReduce to run the LiDAR data analysis algorithms provided by PCL in a parallel fashion. The experimental results show that the proposed framework can efficiently manage and process big LiDAR data.
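The map/reduce decomposition behind such a framework can be pictured in a few lines: key each point by spatial tile (map), then aggregate per-tile statistics (reduce). This is a pure-Python stand-in with an assumed tile size; the actual system delegates per-tile analysis to PCL inside Hadoop tasks.

```python
# Map/reduce sketch for point clouds: points are keyed by spatial tile (map),
# then per-tile statistics are aggregated (reduce).
from collections import defaultdict
import random

TILE = 100.0  # tile edge length in metres (illustrative)

def map_point(x, y, z):
    """Emit a (tile_key, point) pair, mimicking a Hadoop mapper."""
    return (int(x // TILE), int(y // TILE)), (x, y, z)

def reduce_tile(points):
    """Per-tile reducer: here just point count and mean elevation."""
    zs = [p[2] for p in points]
    return {"count": len(zs), "mean_z": sum(zs) / len(zs)}

cloud = [(random.uniform(0, 500), random.uniform(0, 500), random.uniform(0, 50))
         for _ in range(10_000)]
shuffled = defaultdict(list)
for p in cloud:                                   # map + shuffle
    key, value = map_point(*p)
    shuffled[key].append(value)
results = {k: reduce_tile(v) for k, v in shuffled.items()}  # reduce
print(len(results), "tiles processed")
```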
Using Browser Notebooks to Analyse Big Atmospheric Data-sets in the Cloud
NASA Astrophysics Data System (ADS)
Robinson, N.; Tomlinson, J.; Arribas, A.; Prudden, R.
2016-12-01
We present an account of our experience building an ecosystem for the analysis of big atmospheric datasets. Using modern technologies we have developed a prototype platform which is scalable and capable of analysing very large atmospheric datasets. We tested different big-data ecosystems, such as Hadoop MapReduce, Spark and Dask, in order to find the one best suited to the analysis of multidimensional binary data such as NetCDF. We make extensive use of infrastructure-as-code and containerisation to provide a platform which is reusable and which can scale to accommodate changes in demand. We make this platform readily accessible through browser-based notebooks. As a result, analysts with minimal technology experience can, in tens of lines of Python, build interactive data-visualisation web pages which analyse very large amounts of data using cutting-edge big-data technology.
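For flavor, a notebook cell in this style might look like the sketch below, which lazily opens a chunked NetCDF file with xarray and lets Dask parallelize a monthly climatology; the file name and variable name are placeholders, not taken from the abstract.

```python
# Notebook-style sketch: chunked, lazy NetCDF analysis parallelized by Dask.
import xarray as xr

ds = xr.open_dataset("big_dataset.nc", chunks={"time": 365})  # lazy, chunked
clim = ds["air_temperature"].groupby("time.month").mean("time")
result = clim.compute()       # triggers the parallel Dask computation
result.sel(month=1).plot()    # renders inline in a browser notebook
```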
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Loikith, P.; Lee, H.; McGibbney, L. J.; Whitehall, K. D.
2014-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning-fast Big Data technology called SciSpark based on Apache™ Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed; it therefore outperforms the disk-based Apache™ Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesoscale Convective Complexes. The goals of SciSpark are to: (1) decrease the time to compute comparison statistics and plots from minutes to seconds; (2) allow for interactive exploration of time-series properties over seasons and years; (3) decrease the time for satellite data ingestion into RCMES to hours; (4) allow for Level-2 comparisons with higher-order statistics or PDFs in minutes to hours; and (5) move RCMES into a near real time decision-making platform. We will report on: the architecture and design of SciSpark, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDFs, and first metrics quantifying parallel speedups and memory & disk usage.
Sahoo, Satya S; Jayapandian, Catherine; Garg, Gaurav; Kaffashi, Farhad; Chung, Stephanie; Bozorgi, Alireza; Chen, Chien-Hun; Loparo, Kenneth; Lhatoo, Samden D; Zhang, Guo-Qiang
2014-01-01
Objective: The rapidly growing volume of multimodal electrophysiological signal data is playing a critical role in patient care and clinical research across multiple disease domains, such as epilepsy and sleep medicine. To facilitate secondary use of these data, there is an urgent need to develop novel algorithms and informatics approaches using new cloud computing technologies as well as ontologies for collaborative multicenter studies. Materials and methods: We present the Cloudwave platform, which (a) defines parallelized algorithms for computing cardiac measures using the MapReduce parallel programming framework, (b) supports real-time interaction with large volumes of electrophysiological signals, and (c) features signal visualization and querying functionalities using an ontology-driven web-based interface. Cloudwave is currently used in the multicenter National Institute of Neurological Diseases and Stroke (NINDS)-funded Prevention and Risk Identification of SUDEP (sudden unexplained death in epilepsy) Mortality (PRISM) project to identify risk factors for sudden death in epilepsy. Results: Comparative evaluations of Cloudwave with traditional desktop approaches to compute cardiac measures (eg, QRS complexes, RR intervals, and instantaneous heart rate) on epilepsy patient data show one order of magnitude improvement for single-channel ECG data and 20 times improvement for four-channel ECG data. This enables Cloudwave to support real-time user interaction with signal data, which is semantically annotated with a novel epilepsy and seizure ontology. Discussion: Data privacy is a critical issue in using cloud infrastructure, and cloud platforms, such as Amazon Web Services, offer features to support Health Insurance Portability and Accountability Act standards. Conclusion: The Cloudwave platform is a new approach to leveraging large-scale electrophysiological data for advancing multicenter clinical research. PMID:24326538
Implementation of a Big Data Accessing and Processing Platform for Medical Records in Cloud.
Yang, Chao-Tung; Liu, Jung-Chun; Chen, Shuo-Tsung; Lu, Hsin-Wen
2017-08-18
Big Data analysis has become a key factor in being innovative and competitive. Along with worldwide population growth and the aging trend in developed countries, the rate of national medical care usage has been increasing. Because individual medical data are usually scattered across different institutions in varied data formats, integrating these ever-growing data is challenging. To give such data platforms scalable load capacity, they must be built on a sound architecture. Several issues must be considered when using cloud computing to quickly integrate big medical data into a database for easy analysis, searching, and filtering of big data to obtain valuable information. This work builds a cloud storage system with HBase on Hadoop for storing and analyzing big data of medical records, and improves the performance of importing data into the database. The medical records are stored in the HBase database platform for big data analysis. The system performs distributed computing on medical record data through Hadoop MapReduce programming, and provides functions including keyword search, data filtering, and basic statistics over the HBase database. The system uses the Put operation with a single-threaded method and the CompleteBulkload mechanism to import medical data. From the experimental results, we find that when the file size is less than 300 MB, the single-threaded Put method performs best, while for files larger than 300 MB the CompleteBulkload mechanism improves the performance of data import into the database. The system provides a web interface that allows users to search data, filter out meaningful information through the web, and analyze and convert data into forms that will be helpful for medical staff and institutions.
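A minimal sketch of that size-based import policy follows, with the happybase client standing in for the Put path and a completebulkload invocation for the bulk path; the host, table and column names are assumptions, and the bulk-load branch presumes HFiles were already generated by a MapReduce job.

```python
# Sketch of the size-based import policy: small files via single-threaded Puts,
# large files via HBase's bulk-load path. Names and paths are illustrative.
import os
import subprocess
import happybase

THRESHOLD = 300 * 1024 * 1024  # 300 MB, the crossover reported in the paper

def import_records(path, table_name="medical_records"):
    if os.path.getsize(path) < THRESHOLD:
        # Small file: write rows directly with Put operations.
        conn = happybase.Connection("hbase-master")  # placeholder hostname
        table = conn.table(table_name)
        with open(path) as fh:
            for line in fh:
                row_key, value = line.rstrip("\n").split("\t", 1)
                table.put(row_key, {b"record:data": value.encode()})
        conn.close()
    else:
        # Large file: hand pre-generated HFiles to HBase atomically.
        subprocess.run(["hbase", "completebulkload",
                        "/hfiles/" + os.path.basename(path), table_name],
                       check=True)
```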
A Distributed Fuzzy Associative Classifier for Big Data.
Segatori, Armando; Bechini, Alessio; Ducange, Pietro; Marcelloni, Francesco
2017-09-19
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real-world applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types of pruning. We implemented our approach on the popular Hadoop framework, which distributes the storage and processing of very large data sets on computer clusters built from commodity hardware. We have performed extensive experimentation and a detailed analysis of the results using six very large datasets with up to 11,000,000 instances, and have also experimented with different types of reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs recently proposed in the literature. We highlight that, although the accuracies are comparable, the complexity of the classifiers generated by the fuzzy distributed approach, evaluated in terms of number of rules, is lower than that of the nonfuzzy classifiers.
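The paper's fuzzy FP-Growth variant is not publicly available, but the role it plays can be sketched with the standard distributed FP-Growth that ships with Spark; the app name, toy transactions and thresholds below are illustrative.

```python
# Stand-in for the rule-mining stage: Spark's built-in distributed FP-Growth.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fp-growth-sketch").getOrCreate()
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"]), (3, ["b", "c"])],
    ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)
model.freqItemsets.show()      # frequent itemsets mined in parallel
model.associationRules.show()  # candidate rules before classifier pruning
spark.stop()
```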
Cost-Effective Cloud Computing: A Case Study Using the Comparative Genomics Tool, Roundup
Kudtarkar, Parul; DeLuca, Todd F.; Fusaro, Vincent A.; Tonellato, Peter J.; Wall, Dennis P.
2010-01-01
Background Comparative genomics resources, such as ortholog detection tools and repositories are rapidly increasing in scale and complexity. Cloud computing is an emerging technological paradigm that enables researchers to dynamically build a dedicated virtual cluster and may represent a valuable alternative for large computational tools in bioinformatics. In the present manuscript, we optimize the computation of a large-scale comparative genomics resource—Roundup—using cloud computing, describe the proper operating principles required to achieve computational efficiency on the cloud, and detail important procedures for improving cost-effectiveness to ensure maximal computation at minimal costs. Methods Utilizing the comparative genomics tool, Roundup, as a case study, we computed orthologs among 902 fully sequenced genomes on Amazon’s Elastic Compute Cloud. For managing the ortholog processes, we designed a strategy to deploy the web service, Elastic MapReduce, and maximize the use of the cloud while simultaneously minimizing costs. Specifically, we created a model to estimate cloud runtime based on the size and complexity of the genomes being compared that determines in advance the optimal order of the jobs to be submitted. Results We computed orthologous relationships for 245,323 genome-to-genome comparisons on Amazon’s computing cloud, a computation that required just over 200 hours and cost $8,000 USD, at least 40% less than expected under a strategy in which genome comparisons were submitted to the cloud randomly with respect to runtime. Our cost savings projections were based on a model that not only demonstrates the optimal strategy for deploying RSD to the cloud, but also finds the optimal cluster size to minimize waste and maximize usage. Our cost-reduction model is readily adaptable for other comparative genomics tools and potentially of significant benefit to labs seeking to take advantage of the cloud as an alternative to local computing infrastructure. PMID:21258651
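The scheduling idea, predicting each job's runtime and submitting the longest jobs first to the least-loaded node so that paid-for instances stay busy, can be sketched as a classic longest-processing-time assignment; the cost model and genome sizes below are made up for illustration.

```python
# LPT-style sketch: order jobs by predicted runtime, assign to least-loaded node.
import heapq

def predicted_hours(genome_a_mb, genome_b_mb):
    return 0.02 * (genome_a_mb + genome_b_mb)  # toy linear cost model

jobs = [("human_vs_mouse", predicted_hours(3100, 2700)),
        ("ecoli_vs_yeast", predicted_hours(5, 12)),
        ("human_vs_chimp", predicted_hours(3100, 3300))]

nodes = [(0.0, i) for i in range(2)]  # (accumulated hours, node id)
heapq.heapify(nodes)
for name, hours in sorted(jobs, key=lambda j: -j[1]):  # longest first
    load, node = heapq.heappop(nodes)
    print(f"{name}: {hours:.1f} h -> node {node}")
    heapq.heappush(nodes, (load + hours, node))
```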
GrigoraSNPs: Optimized Analysis of SNPs for DNA Forensics.
Ricke, Darrell O; Shcherbina, Anna; Michaleas, Adam; Fremont-Smith, Philip
2018-04-16
High-throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) enables additional DNA forensic capabilities not attainable using traditional STR panels. However, the inclusion of sets of loci selected for mixture analysis, extended kinship, phenotype, biogeographic ancestry prediction, etc., can result in large panel sizes that are difficult to analyze in a rapid fashion. GrigoraSNPs was developed to address the allele-calling bottleneck encountered when analyzing SNP panels with more than 5000 loci using HTS. GrigoraSNPs uses MapReduce parallel data processing on multiple computational threads plus a novel locus-identification hashing strategy leveraging target sequence tags. This tool optimizes the SNP calling module of the DNA analysis pipeline, with runtimes that scale linearly with the number of HTS reads. Results are compared with SNP analysis pipelines implemented with SAMtools and GATK. GrigoraSNPs removes a computational bottleneck for processing forensic samples with large HTS SNP panels. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.
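The hashing strategy can be pictured as a constant-time lookup from a short target sequence tag to its locus, so that reads are assigned without per-locus alignment; the tags, loci, offsets and read in this sketch are all invented for illustration.

```python
# Sketch of tag-hash locus identification: a read is assigned to a locus by
# hashing the sequence tag it contains, then the allele base is read off.
TAG_LEN = 12

# locus table: tag -> (locus name, offset of the SNP position after the tag)
loci = {"ACGTACGTACGT": ("rs0001", 3),
        "TTGACCTTGACC": ("rs0002", 5)}

def call_allele(read):
    """Scan a read for any known tag; return (locus, allele) or None."""
    for i in range(len(read) - TAG_LEN + 1):
        hit = loci.get(read[i:i + TAG_LEN])   # O(1) hash lookup
        if hit:
            name, offset = hit
            pos = i + TAG_LEN + offset
            if pos < len(read):
                return name, read[pos]
    return None

print(call_allele("GGACGTACGTACGTTTAG"))  # -> ('rs0001', 'G') for this toy read
```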
Design and Verification of Remote Sensing Image Data Center Storage Architecture Based on Hadoop
NASA Astrophysics Data System (ADS)
Tang, D.; Zhou, X.; Jing, Y.; Cong, W.; Li, C.
2018-04-01
The data center is a new concept of data processing and application proposed in recent years. It is a new processing method based on data, parallel computing, and compatibility with different hardware clusters. While optimizing the data storage management structure, it fully utilizes cluster computing nodes and improves the efficiency of parallel data applications. This paper used mature Hadoop technology to build a large-scale distributed image management architecture for remote sensing imagery. Using MapReduce parallel processing technology, it calls many computing nodes to process image storage blocks and pyramids in the background to improve the efficiency of image reading and application, solving the need for concurrent multi-user high-speed access to remotely sensed data. The rationality, reliability and superiority of the system design were verified by testing the storage efficiency for different image data and multiple users, and by analyzing how the distributed storage architecture improves the application efficiency of remote sensing images in an actual Hadoop service system.
A Technical Survey on Optimization of Processing Geo Distributed Data
NASA Astrophysics Data System (ADS)
Naga Malleswari, T. Y. J.; Ushasukhanya, S.; Nithyakalyani, A.; Girija, S.
2018-04-01
With growing cloud services and technology, a number of geographically distributed data centers have emerged to store large amounts of data. Analysis of geo-distributed data is required in various services for data processing, storage of essential information, etc.; processing this geo-distributed data and performing analytics on it is a challenging task. Distributed data processing is accompanied by issues in storage, computation and communication, the key ones being time efficiency, cost minimization and utility maximization. This paper describes various optimization methods like end-to-end multiphase, G-MR, etc., using techniques like MapReduce, CDS (Community Detection based Scheduling), ROUT, Workload-Aware Scheduling, SAGE and AMP (Ant Colony Optimization) to handle these issues. The various optimization methods and techniques used are analyzed. It has been observed that end-to-end multiphase achieves time efficiency; that cost minimization concentrates on achieving Quality of Service while reducing computation and communication cost; and that SAGE achieves performance improvement in processing geo-distributed data sets.
Risk intelligence: making profit from uncertainty in data processing system.
Zheng, Si; Liao, Xiangke; Liu, Xiaodong
2014-01-01
In extreme-scale data processing systems, fault tolerance is an essential and indispensable part. Proactive fault tolerance schemes (such as speculative execution in the MapReduce framework) are introduced to dramatically improve the response time of job executions when failure becomes a norm rather than an exception. Efficient proactive fault tolerance schemes require precise knowledge of task executions, which has been an open challenge for decades. To address this issue, in this paper we design and implement RiskI, a profile-based prediction algorithm in conjunction with a risk-aware task assignment algorithm, to accelerate task executions, taking the uncertain nature of tasks into account. Our design demonstrates that this inherent uncertainty brings not only great challenges but also new opportunities: with a careful design, we can benefit from such uncertainties. We implemented the idea in Hadoop 0.21.0, and the experimental results show that, compared with the traditional LATE algorithm, the response time can be improved by 46% with the same system throughput.
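A toy version of the risk-aware idea: estimate a task's runtime distribution from similar historical runs on each node, then assign it where the variance-penalized expected finish time is smallest. The node names, runtimes and penalty weight below are illustrative.

```python
# Risk-aware assignment sketch: penalize runtime variance, not just the mean.
import statistics

history = {"nodeA": [8.0, 8.2, 8.1, 20.0],     # one straggler observed
           "nodeB": [12.0, 12.3, 11.8, 12.1]}

RISK_WEIGHT = 1.0  # how heavily variance is penalized (tuning knob)

def risk_adjusted_cost(samples):
    return statistics.mean(samples) + RISK_WEIGHT * statistics.pstdev(samples)

best = min(history, key=lambda n: risk_adjusted_cost(history[n]))
for node, runs in history.items():
    print(node, round(risk_adjusted_cost(runs), 2))
print("assign task to:", best)  # nodeB wins despite nodeA's lower mean
```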
Novel and efficient tag SNPs selection algorithms.
Chen, Wen-Pei; Hung, Che-Lun; Tsai, Suh-Jen Jane; Lin, Yaw-Ling
2014-01-01
SNPs are the most abundant form of genetic variation among species; association studies between complex diseases and SNPs or haplotypes have received great attention. However, these studies are restricted by the cost of genotyping all SNPs; thus, it is necessary to find smaller subsets, or tag SNPs, representing the rest of the SNPs. In fact, the existing tag SNP selection algorithms are notoriously time-consuming. Here, an efficient algorithm for tag SNP selection is presented and applied to analyze the HapMap YRI data. The experimental results show that the proposed algorithm achieves better performance than the existing tag SNP selection algorithms; in most cases it is at least ten times faster than existing methods, and in many cases, when the redundancy ratio of the block is high, it can even be thousands of times faster. Tools and web services for haplotype block analysis, integrated via the Hadoop MapReduce framework and using the proposed algorithm as computation kernel, have also been developed.
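Tag SNP selection is often cast as a set-cover problem; a greedy sketch of that generic formulation (not the paper's algorithm, whose details are not given in the abstract) looks like this, with toy linkage sets.

```python
# Greedy set-cover sketch: repeatedly pick the SNP that represents the most
# still-uncovered SNPs. Real inputs come from haplotype-block linkage data.
def greedy_tag_snps(covers):
    """covers: dict snp -> set of SNPs it represents (including itself)."""
    uncovered = set().union(*covers.values())
    tags = []
    while uncovered:
        best = max(covers, key=lambda s: len(covers[s] & uncovered))
        if not covers[best] & uncovered:
            break  # remaining SNPs cannot be covered
        tags.append(best)
        uncovered -= covers[best]
    return tags

covers = {"snp1": {"snp1", "snp2", "snp3"},
          "snp2": {"snp2", "snp1"},
          "snp4": {"snp4", "snp5"},
          "snp5": {"snp5", "snp4"}}
print(greedy_tag_snps(covers))  # -> ['snp1', 'snp4']
```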
Cloud computing for protein-ligand binding site comparison.
Hung, Che-Lun; Hua, Guan-Jie
2013-01-01
The proteome-wide analysis of protein-ligand binding sites and their interactions with ligands is important in structure-based drug design and in understanding ligand cross reactivity and toxicity. The well-known and commonly used software, SMAP, has been designed for 3D ligand binding site comparison and similarity searching of a structural proteome. SMAP can also predict drug side effects and reassign existing drugs to new indications. However, the computing scale of SMAP is limited. We have developed a high availability, high performance system that expands the comparison scale of SMAP. This cloud computing service, called Cloud-PLBS, combines the SMAP and Hadoop frameworks and is deployed on a virtual cloud computing platform. To handle the vast amount of experimental data on protein-ligand binding site pairs, Cloud-PLBS exploits the MapReduce paradigm as a management and parallelizing tool. Cloud-PLBS provides a web portal and scalability through which biologists can address a wide range of computer-intensive questions in biology and drug discovery.
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.
Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John
2012-12-05
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
Computational Intelligence for Medical Imaging Simulations.
Chang, Victor
2017-11-25
This paper describes how medical imaging can be simulated by computational intelligence to explore areas that cannot easily be reached in traditional ways, including gene and protein simulations related to cancer development and immunity. The paper presents simulations and virtual inspections of BIRC3, BIRC6, CCL4, KLKB1 and CYP2A6 with their outputs and explanations, as well as brain segment intensity due to dancing. Our proposed MapReduce framework with the fusion algorithm can simulate medical imaging. The concept is very similar to digital surface theories, simulating how biological units come together to form bigger units, up to the formation of the entire biological subject. The M-Fusion and M-Update functions of the fusion algorithm achieve good performance, processing and visualizing up to 40 GB of data within 600 s. We conclude that computational intelligence can provide effective and efficient healthcare research through simulation and visualization.
Classification Algorithms for Big Data Analysis, a Map Reduce Approach
NASA Astrophysics Data System (ADS)
Ayma, V. A.; Ferreira, R. S.; Happ, P.; Oliveira, D.; Feitosa, R.; Costa, G.; Plaza, A.; Gamba, P.
2015-03-01
For many years, the scientific community has been concerned with increasing the accuracy of different classification methods, and major achievements have been made so far. Besides this issue, the increasing amount of data generated every day by remote sensors raises more challenges to be overcome. In this work, a tool within the scope of InterIMAGE Cloud Platform (ICP), an open-source, distributed framework for automatic image interpretation, is presented. The tool, named ICP: Data Mining Package, is able to perform supervised classification procedures on huge amounts of data, usually referred to as big data, on a distributed infrastructure using Hadoop MapReduce. The tool has four classification algorithms implemented, taken from WEKA's machine learning library, namely: Decision Trees, Naïve Bayes, Random Forest and Support Vector Machines (SVM). The results of an experimental analysis using an SVM classifier on data sets of different sizes for different cluster configurations demonstrate the potential of the tool, as well as aspects that affect its performance.
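As a sketch of how such per-record classification parallelizes under MapReduce, the following Hadoop Streaming mapper loads a pickled scikit-learn model once and labels each tab-separated input record; the model path and feature layout are assumptions, and the actual package uses WEKA classifiers on the JVM rather than scikit-learn.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: load a trained model once, then classify
# every record arriving on stdin. Ship model.pkl to nodes e.g. via -files.
import pickle
import sys

with open("model.pkl", "rb") as fh:
    model = pickle.load(fh)

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    rec_id, features = fields[0], [float(x) for x in fields[1:]]
    label = model.predict([features])[0]  # classify one pixel/segment record
    print(f"{rec_id}\t{label}")
```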
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning-fast Big Data technology called SciSpark based on Apache™ Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache™ Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesoscale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URLs (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDFs, and first metrics quantifying parallel speedups and memory & disk usage.
Cloud computing-based TagSNP selection algorithm for human genome data.
Hung, Che-Lun; Chen, Wen-Pei; Hua, Guan-Jie; Zheng, Huiru; Tsai, Suh-Jen Jane; Lin, Yaw-Ling
2015-01-05
Single nucleotide polymorphisms (SNPs) play a fundamental role in human genetic variation and are used in medical diagnostics, phylogeny construction, and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. Haplotypes are regions of linked genetic variants that are closely spaced on the genome and tend to be inherited together. Genetics research has revealed SNPs within certain haplotype blocks that introduce few distinct common haplotypes into most of the population. Haplotype block structures are used in association-based methods to map disease genes. In this paper, we propose an efficient algorithm for identifying haplotype blocks in the genome. In chromosomal haplotype data retrieved from the HapMap project website, the proposed algorithm identified longer haplotype blocks than an existing algorithm. To enhance its performance, we extended the proposed algorithm into a parallel algorithm that copies data in parallel via the Hadoop MapReduce framework. The proposed MapReduce-paralleled combinatorial algorithm performed well on real-world data obtained from the HapMap dataset; the improvement in computational efficiency was proportional to the number of processors used.
Towards Large-scale Twitter Mining for Drug-related Adverse Events.
Bian, Jiang; Topaloglu, Umit; Yu, Fan
2012-10-29
Drug-related adverse events pose substantial risks to patients who consume post-market or investigational drugs. Early detection of adverse events benefits not only the drug regulators, but also the manufacturers for pharmacovigilance. Existing methods rely on patients' "spontaneous" self-reports that attest to problems. The increasing popularity of social media platforms like Twitter presents a new information source for finding potential adverse events. Given the high frequency of user updates, mining Twitter messages can lead us to real-time pharmacovigilance. In this paper, we describe an approach to find drug users and potential adverse events by analyzing the content of Twitter messages utilizing Natural Language Processing (NLP) and by building Support Vector Machine (SVM) classifiers. Due to the size of the dataset (i.e., 2 billion tweets), the experiments were conducted on a High Performance Computing (HPC) platform using MapReduce, reflecting the trend of big data analytics. The results suggest that daily-life social networking data could help early detection of important patient safety issues.
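The classification step can be sketched with a standard TF-IDF plus linear-SVM pipeline; the training tweets and labels below are fabricated stand-ins, and the paper's real features additionally come from NLP annotation.

```python
# Toy TF-IDF + linear SVM pipeline for flagging possible adverse-event tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["this new med gives me horrible headaches",
          "started drugX today, fingers crossed",
          "drugX made my rash so much worse",
          "lovely weather for a run"]
labels = [1, 0, 1, 0]  # 1 = potential adverse-event mention

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["drugX headache all day"]))  # expect [1] on this toy model
```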
Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick
2016-01-01
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest. PMID:26958859
Monte Carlo verification of radiotherapy treatments with CloudMC.
Miras, Hector; Jiménez, Rubén; Perales, Álvaro; Terrón, José Antonio; Bertolet, Alejandro; Ortiz, Antonio; Macías, José
2018-06-27
A new implementation has been made in CloudMC, a cloud-based platform presented in a previous work, in order to provide services for radiotherapy treatment verification by means of Monte Carlo in a fast, easy and economical way. A description of the architecture of the application and the new developments implemented is presented, together with the results of the tests carried out to validate its performance. CloudMC has been developed on the Microsoft Azure cloud. It is based on a map/reduce implementation that distributes Monte Carlo calculations over a dynamic cluster of virtual machines in order to reduce calculation time. CloudMC has been updated with new methods to read and process the information related to radiotherapy treatment verification: CT image set, treatment plan, structures and dose distribution files in DICOM format. Tests were designed to determine, for the different tasks, the most suitable type of virtual machine from those available in Azure. Finally, the performance of Monte Carlo verification in CloudMC is studied through three real cases that involve different treatment techniques, linac models and Monte Carlo codes. Considering computational and economic factors, D1_v2 and G1 virtual machines were selected as the default types for the Worker Roles and the Reducer Role, respectively. Calculation times of up to 33 min and costs of 16 € were achieved for the verification cases presented when a statistical uncertainty below 2% (2σ) was required; the costs were reduced to 3-6 € when the uncertainty requirement was relaxed to 4%. Advantages like high computational power, scalability, easy access and a pay-per-usage model make Monte Carlo cloud-based solutions, like the one presented in this work, an important step forward towards truly introducing Monte Carlo algorithms into the daily routine of the radiotherapy planning process.
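The map/reduce pattern used for the Monte Carlo distribution is easy to picture: independent workers simulate disjoint batches of histories under different seeds (map), and the tallies are pooled into a dose estimate with its statistical uncertainty (reduce). The per-history "transport" below is a trivial stand-in for a real dose engine.

```python
# Map/reduce Monte Carlo sketch: distribute histories, pool the tallies.
import random
from multiprocessing import Pool

def simulate(args):
    seed, n_histories = args
    rng = random.Random(seed)
    s = s2 = 0.0
    for _ in range(n_histories):
        d = rng.expovariate(1.0)  # toy per-history "dose" sample
        s += d
        s2 += d * d
    return n_histories, s, s2

if __name__ == "__main__":
    batches = [(seed, 250_000) for seed in range(8)]  # map: independent seeds
    with Pool() as pool:
        parts = pool.map(simulate, batches)
    n = sum(p[0] for p in parts)                      # reduce: pool tallies
    s = sum(p[1] for p in parts)
    s2 = sum(p[2] for p in parts)
    mean = s / n
    sigma = ((s2 / n - mean ** 2) / n) ** 0.5         # std error of the mean
    print(f"dose = {mean:.4f} +/- {2 * sigma:.4f} (2 sigma)")
```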
Astronomical Image Processing with Hadoop
NASA Astrophysics Data System (ADS)
Wiley, K.; Connolly, A.; Krughoff, S.; Gardner, J.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.
2011-07-01
In the coming decade astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. With a requirement that these images be analyzed in real time to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. In the commercial world, new techniques that utilize cloud computing have been developed to handle massive data streams. In this paper we describe how cloud computing, and in particular the map-reduce paradigm, can be used in astronomical data processing. We will focus on our experience implementing a scalable image-processing pipeline for the SDSS database using Hadoop (http://hadoop.apache.org). This multi-terabyte imaging dataset approximates future surveys such as those which will be conducted with the LSST. Our pipeline performs image coaddition in which multiple partially overlapping images are registered, integrated and stitched into a single overarching image. We will first present our initial implementation, then describe several critical optimizations that have enabled us to achieve high performance, and finally describe how we are incorporating a large in-house existing image processing library into our Hadoop system. The optimizations involve prefiltering of the input to remove irrelevant images from consideration, grouping individual FITS files into larger, more efficient indexed files, and a hybrid system in which a relational database is used to determine the input images relevant to the task. The incorporation of an existing image processing library, written in C++, presented difficult challenges since Hadoop is programmed primarily in Java. We will describe how we achieved this integration and the sophisticated image processing routines that were made feasible as a result. We will end by briefly describing the longer term goals of our work, namely detection and classification of transient objects and automated object classification.
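The core of coaddition is a weighted reduction over registered frames; the sketch below accumulates overlapping tiles and normalizes by overlap depth, with registration and outlier rejection omitted. Array sizes and pointings are invented.

```python
# Coaddition-as-reduction sketch: accumulate registered frames and normalize
# by how many frames covered each pixel.
import numpy as np

sky = np.zeros((100, 100))
weight = np.zeros((100, 100))

def coadd(frame, x0, y0):
    """Accumulate one registered frame whose corner lands at (y0, x0)."""
    h, w = frame.shape
    sky[y0:y0 + h, x0:x0 + w] += frame
    weight[y0:y0 + h, x0:x0 + w] += 1.0

rng = np.random.default_rng(0)
for x0, y0 in [(0, 0), (20, 30), (50, 50)]:  # overlapping pointings
    coadd(rng.normal(100.0, 10.0, size=(50, 50)), x0, y0)

with np.errstate(divide="ignore", invalid="ignore"):
    stacked = np.where(weight > 0, sky / weight, np.nan)
print(np.nanmean(stacked))
```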
NDSI products system based on Hadoop platform
NASA Astrophysics Data System (ADS)
Zhou, Yan; Jiang, He; Yang, Xiaoxia; Geng, Erhui
2015-12-01
Snow is the solid state of water resources on earth and plays an important role in human life. Satellite remote sensing is significant for snow extraction, with the advantages of periodicity, macro coverage, comprehensiveness, objectivity and timeliness. With the continuous development of remote sensing technology, remote sensing data are acquired from multiple platforms, multiple sensors and multiple perspectives, and the demand for compute-intensive processing of these data grows accordingly. However, current production systems for remote sensing products run in a serial mode, are mostly used by professional remote sensing researchers, and rarely support automatic or semi-automatic production. Facing massive remote sensing data, such low-efficiency serial production systems can hardly meet the requirement of timely and efficient processing. In order to effectively improve the production efficiency of NDSI (Normalized Difference Snow Index) products and meet the demand for timely, efficient large-scale remote sensing data processing, this paper builds an NDSI product production system based on the Hadoop platform, consisting mainly of a remote sensing image management module, an NDSI production module, and a system service module. The main research contents and results are: (1) The remote sensing image management module, comprising image import and image metadata management. It imports mass basis IRS images and NDSI product images (the output of production tasks) into the HDFS file system; at the same time, it reads the corresponding orbit row/column number, maximum/minimum longitude and latitude, product date, HDFS storage path, Hadoop task ID (for NDSI products) and other metadata, creates thumbnails, assigns a unique ID to each record, and imports the records into the basis/product image metadata database. (2) The NDSI production module, comprising index calculation plus production task submission and monitoring. It reads the HDF images related to a production task as byte streams and uses the Beam library to parse them into Product form; it uses the MapReduce distributed framework to execute production tasks while monitoring task status; when a production task completes, it calls the image management module to store the NDSI products. (3) The system service module, providing image search and NDSI product download. Given image metadata attributes described in JSON format, it returns the sequence IDs of matching images in the HDFS file system; for a given MapReduce task ID, it packages the task's output NDSI products into a ZIP file and returns a download link. (4) System evaluation: massive remote sensing data were downloaded and processed with the system to produce NDSI products, and the results show that the system has high extensibility, strong fault tolerance and fast production speed, and yields image processing results with high accuracy.
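The index computation at the heart of the production module is itself one line per pixel: NDSI = (green - SWIR) / (green + SWIR), with values above roughly 0.4 conventionally mapped to snow. The band arrays in this sketch are random stand-ins for reflectance tiles.

```python
# Per-pixel NDSI sketch on toy reflectance tiles.
import numpy as np

green = np.random.rand(512, 512).astype(np.float32)  # green-band reflectance
swir = np.random.rand(512, 512).astype(np.float32)   # shortwave-infrared band

with np.errstate(divide="ignore", invalid="ignore"):
    ndsi = (green - swir) / (green + swir)
snow_mask = ndsi > 0.4  # a commonly used snow threshold
print(f"snow fraction: {snow_mask.mean():.2%}")
```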
NASA Astrophysics Data System (ADS)
Tamkin, G.; Schnase, J. L.; Duffy, D.; Li, J.; Strong, S.; Thompson, J. H.
2016-12-01
We are extending climate analytics-as-a-service, including: (1) a high-performance Virtual Real-Time Analytics Testbed supporting six major reanalysis data sets using advanced technologies like Cloudera Impala-based SQL and Hadoop-based MapReduce analytics over native NetCDF files; (2) a Reanalysis Ensemble Service (RES) that offers a basic set of commonly used operations over the reanalysis collections, accessible through NASA's climate data analytics Web services and our client-side Climate Data Services Python library, CDSlib; and (3) an Open Geospatial Consortium (OGC) WPS-compliant Web service interface to CDSlib to accommodate ESGF's Web service endpoints. This presentation will report on the overall progress of this effort, with special attention to recent enhancements to the Reanalysis Ensemble Service, including the following:
- A CDSlib Python library that supports full temporal, spatial, and grid-based resolution services
- A new reanalysis collections reference model to enable operator design and implementation
- An enhanced library of sample queries to demonstrate and develop use case scenarios
- Extended operators that enable single- and multiple-reanalysis area average, vertical average, re-gridding, and trend, climatology, and anomaly computations
- Full support for the MERRA-2 reanalysis and the initial integration of two additional reanalyses
- A prototype Jupyter notebook-based distribution mechanism that combines CDSlib documentation with interactive use case scenarios and personalized project management
- Prototyped uncertainty quantification services that combine ensemble products with comparative observational products
- Convenient, one-stop shopping for commonly used data products from multiple reanalyses, including basic subsetting and arithmetic operations over the data and extraction of trends, climatologies, and anomalies
- The ability to compute and visualize multiple-reanalysis intercomparisons
Big data and biomedical informatics: a challenging opportunity.
Bellazzi, R
2014-05-22
Big data are receiving increasing attention in biomedicine and healthcare. It is therefore important to understand why big data are assuming a crucial role for the biomedical informatics community. The capability of handling big data is becoming an enabler to carry out unprecedented research studies and to implement new models of healthcare delivery. Therefore, it is first necessary to deeply understand the four elements that constitute big data, namely Volume, Variety, Velocity, and Veracity, and their meaning in practice. Then, it is mandatory to understand where big data are present, and where they can be beneficially collected. There are research fields, such as translational bioinformatics, which need to rely on big data technologies to withstand the shock wave of data that is generated every day. Other areas, ranging from epidemiology to clinical care, can benefit from the exploitation of the large amounts of data that are nowadays available, from personal monitoring to primary care. However, building big data-enabled systems carries relevant implications in terms of reproducibility of research studies and management of privacy and data access; proper actions should be taken to deal with these issues. An interesting consequence of the big data scenario is the availability of new software, methods, and tools, such as map-reduce, cloud computing, and concept drift machine learning algorithms, which will not only contribute to big data research, but may be beneficial in many biomedical informatics applications. The way forward with the big data opportunity will require properly applied engineering principles to design studies and applications, to avoid preconceptions or over-enthusiasm, to fully exploit the available technologies, and to improve data processing and data management regulations.
NASA Astrophysics Data System (ADS)
Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram
2010-05-01
Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for management and processing of such datasets using binary large object implementations (BLOBs) in database systems versus implementation in Hadoop files using the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as bandwidth and response time performance. This requires partitioning larger files into a set of smaller files, and is accompanied by the concomitant requirement for managing large numbers of files. Storing these sub-files as blobs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these blobs and would run in parallel across the set of nodes in the cluster. On the other hand, there are both storage overheads and constraints, and software licensing dependencies created by such an implementation. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available. Another consideration is the strategy used for partitioning large data collections, and large datasets within collections, using round-robin vs hash partitioning vs range partitioning methods. Each has different characteristics in terms of spatial locality of data and resultant degree of declustering of the computations on the data. Furthermore, we have observed that, in practice, there can be large variations in the frequency of access to different parts of a large data collection and/or dataset, thereby creating "hotspots" in the data. We will evaluate the ability of the different approaches to deal effectively with such hotspots, along with alternative strategies for handling them.
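The three partitioning strategies compared above can be shown in miniature: round-robin spreads load evenly but scatters neighbors, hashing is even and key-addressable, and range partitioning preserves spatial locality at the risk of hotspots. The keys and band boundaries below are illustrative.

```python
# Miniature versions of round-robin, hash, and range partitioning.
def round_robin(i, n_nodes):
    return i % n_nodes

def hash_part(key, n_nodes):
    return hash(key) % n_nodes  # note: str hashes vary across Python runs

def range_part(lat, boundaries):
    """boundaries: sorted upper edges of latitude bands, one per node."""
    for node, upper in enumerate(boundaries):
        if lat <= upper:
            return node
    return len(boundaries)

tiles = [("tile_%03d" % i, lat) for i, lat in enumerate(range(-90, 91, 10))]
for i, (name, lat) in enumerate(tiles[:5]):
    print(name, round_robin(i, 4), hash_part(name, 4),
          range_part(lat, [-45, 0, 45]))
```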
Parallel processing considerations for image recognition tasks
NASA Astrophysics Data System (ADS)
Simske, Steven J.
2011-01-01
Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images; from this standpoint, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows, as diverse as optical character recognition (OCR), document classification and barcode reading, to parallel pipelines. This can substantially decrease time to completion for the document tasks. In this approach, each parallel pipeline generally performs a different task. Parallel processing by image region allows a larger imaging task to be subdivided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach; examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously, which may result in improved accuracy.
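To make the second category concrete, here is a minimal map-reduce style sketch of per-region processing, assuming a synthetic grayscale image and a stand-in analysis function (the real per-region tasks would be skew detection or face detection):

```python
# "Parallel processing by image region": split the image into strips,
# map a per-region analysis over them, then reduce the partial results.
from multiprocessing import Pool
from collections import Counter
import random

def analyze_strip(strip):
    """Map step: per-region analysis (here, a coarse brightness histogram)."""
    return Counter(pixel // 64 for row in strip for pixel in row)

def merge(counters):
    """Reduce step: combine the per-strip partial results."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    image = [[random.randrange(256) for _ in range(640)] for _ in range(480)]
    strips = [image[i:i + 120] for i in range(0, 480, 120)]  # 4 regions
    with Pool(4) as pool:
        print(merge(pool.map(analyze_strip, strips)))
```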
Leveraging human oversight and intervention in large-scale parallel processing of open-source data
NASA Astrophysics Data System (ADS)
Casini, Enrico; Suri, Niranjan; Bradshaw, Jeffrey M.
2015-05-01
The popularity of cloud computing, along with the increased availability of cheap storage, has led to the need to elaborate and transform large volumes of open-source data in parallel. One way to handle such extensive volumes of information properly is to take advantage of distributed computing frameworks like Map-Reduce. Unfortunately, an entirely automated approach that excludes human intervention is often unpredictable and error prone. Highly accurate data processing and decision-making can be achieved by supporting an automatic process through human collaboration, in a variety of environments such as warfare, cyber security and threat monitoring. Although this mutual participation seems easily exploitable, human-machine collaboration in the field of data analysis presents several challenges. First, due to the asynchronous nature of human intervention, it is necessary to verify that once a correction is made, all necessary downstream reprocessing is carried out. Second, it is often necessary to minimize the amount of reprocessing in order to optimize the usage of limited resources. To address these strict requirements, this paper introduces improvements to an innovative approach for human-machine collaboration in the parallel processing of large amounts of open-source data.
Streaming data analytics via message passing with application to graph algorithms
Plimpton, Steven J.; Shead, Tim
2014-05-06
The need to process streaming data, which arrives continuously at high volume in real time, arises in a variety of contexts, including data produced by experiments, collections of environmental or network sensors, and running simulations. Streaming data can also be formulated as queries or transactions which operate on a large dynamic data store, e.g. a distributed database. We describe a lightweight, portable framework named PHISH which enables a set of independent processes to compute on a stream of data in a distributed-memory parallel manner. Datums are routed between processes in patterns defined by the application. PHISH can run on top of either message passing via MPI or sockets via ZMQ. The former means streaming computations can be run on any parallel machine which supports MPI; the latter allows them to run on a heterogeneous, geographically dispersed network of machines. We illustrate how PHISH can support streaming MapReduce operations, and describe streaming versions of three algorithms for large, sparse graph analytics: triangle enumeration, subgraph isomorphism matching, and connected component finding. Lastly, we provide benchmark timings comparing MPI and socket performance for several kernel operations useful in streaming algorithms.
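The dataflow pattern described here can be suggested with a toy example. The sketch below is not the PHISH API; it uses Python multiprocessing queues to mimic independent processes with datums routed by key, in the style of a streaming MapReduce word count:

```python
# Toy streaming word count: a mapper process routes words to reducer
# processes by key; each stage runs independently on the live stream.
from multiprocessing import Process, Queue

def mapper(inq, outqs):
    while (line := inq.get()) is not None:
        for word in line.split():
            outqs[hash(word) % len(outqs)].put(word)  # route datum by key
    for q in outqs:
        q.put(None)  # propagate end-of-stream downstream

def reducer(inq):
    counts = {}
    while (word := inq.get()) is not None:
        counts[word] = counts.get(word, 0) + 1
    print(counts)

if __name__ == "__main__":
    src, shards = Queue(), [Queue(), Queue()]
    procs = [Process(target=reducer, args=(q,)) for q in shards]
    procs.append(Process(target=mapper, args=(src, shards)))
    for p in procs:
        p.start()
    for line in ["a b a", "c b", "a c c"]:
        src.put(line)
    src.put(None)
    for p in procs:
        p.join()
```

In PHISH proper, the same topology would be expressed over MPI ranks or ZMQ sockets, which is what lets the stages live on different machines.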
High Resolution Nature Runs and the Big Data Challenge
NASA Technical Reports Server (NTRS)
Webster, W. Phillip; Duffy, Daniel Q.
2015-01-01
NASA's Global Modeling and Assimilation Office at Goddard Space Flight Center is undertaking a series of very computationally intensive Nature Runs and a downscaled reanalysis. The nature runs use GEOS-5 as an Atmospheric General Circulation Model (AGCM), while the reanalysis uses GEOS-5 in data assimilation mode. This paper presents the computational challenges from three runs, two of which are AGCM runs while one is a downscaled reanalysis using the full DAS. The nature runs are being completed at two surface grid resolutions, 7 and 3 kilometers, with 72 vertical levels. The 7 km run spanned 2 years (2005-2006) and produced 4 PB of data, while the 3 km run will span one year and generate another 4 PB. The downscaled reanalysis (MERRA-II, the Modern-Era Retrospective analysis for Research and Applications) will cover 15 years and generate 1 PB of data. In our efforts to address the big data challenges of climate science, we are moving toward a notion of Climate Analytics-as-a-Service (CAaaS), a specialization of the concept of business process-as-a-service that is an evolving extension of IaaS, PaaS, and SaaS enabled by cloud computing. In this presentation, we will describe two projects that demonstrate this shift. MERRA Analytic Services (MERRA/AS) is an example of cloud-enabled CAaaS: it enables MapReduce analytics over the MERRA reanalysis data collection by bringing together high-performance computing, scalable data management, and a domain-specific climate data services API. NASA's High-Performance Science Cloud (HPSC) is an example of the compute-storage fabric required to support CAaaS, comprising a high-speed InfiniBand network, high-performance file systems and object storage, and virtual system environments tailored for data-intensive science applications. These technologies provide a new tier in the data and analytic services stack that helps connect earthbound, enterprise-level data and computational resources to new customers and to new mobility-driven applications and modes of work. In our experience, CAaaS lowers the barriers and risk to organizational change, fosters innovation and experimentation, and provides the agility required to meet our customers' increasing and changing needs.
Monitoring and controlling ATLAS data management: The Rucio web user interface
NASA Astrophysics Data System (ADS)
Lassnig, M.; Beermann, T.; Vigne, R.; Barisits, M.; Garonne, V.; Serfon, C.
2015-12-01
The monitoring and controlling interfaces of the previous data management system, DQ2, followed the evolving requirements and needs of the ATLAS collaboration. The new data management system, Rucio, has put in place a redesigned web-based interface based on the lessons learnt from DQ2 and on the increased volume of managed information. This interface encompasses both a monitoring and a controlling component, and allows easy integration of user-generated views. The interface follows three design principles. First, the collection and storage of data from internal and external systems is asynchronous to reduce latency; this includes the use of technologies like ActiveMQ or Nagios. Second, the analysis of the data into information is massively parallel due to its volume, using a combined approach with an Oracle database and Hadoop MapReduce. Third, sharing of the information does not distinguish between human and programmatic access, making it easy to access selected parts of the information both in constrained frontends like web browsers and from remote services. This contribution details the reasons for these principles and the design choices taken. Additionally, the implementation, the interactions with external systems, and an evaluation of the system in production, from both a technological and a user perspective, conclude this contribution.
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.
2013-01-01
The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software, both to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. To balance load across threads, the MTTM augments a basic map-reduce strategy in which threads draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTM architecture wraps data access with thread management, moving threads to the home processor of the data so that computation follows the data, in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to the processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and the number of processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer and provides an interface that makes it easier to adapt existing software to a multiprocessor environment.
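As a rough illustration of the load-balancing half of this design (the cache-homing half has no portable Python analogue), the sketch below has worker threads draw jobs from a single global queue, so faster threads naturally take on more work; everything here is hypothetical, not MTTM code:

```python
# Global-queue load balancing: workers pull tasks until the queue is empty.
import queue, threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        # In MTTM, execution would first migrate to the processor where
        # this item's memory is homed; here we just compute.
        with lock:
            results.append(item * item)

for i in range(100):
    tasks.put(i)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))
```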
A survey and taxonomy on energy efficient resource allocation techniques for cloud computing systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hameed, Abdul; Khoshkbarforoushha, Alireza; Ranjan, Rajiv
In a cloud computing paradigm, energy-efficient allocation of different virtualized ICT resources (servers, storage disks, networks, and the like) is a complex problem due to the presence of heterogeneous application workloads (e.g., content delivery networks, MapReduce, web applications) with contentious allocation requirements in terms of ICT resource capacities (e.g., network bandwidth, processing speed, response time). Several recent papers have tried to address the issue of improving energy efficiency in allocating cloud resources to applications, with varying degrees of success. However, to the best of our knowledge there is no published literature on this subject that clearly articulates the research problem and provides a research taxonomy for succinct classification of existing techniques. Hence, the main aim of this paper is to identify the open challenges associated with energy-efficient resource allocation. The study first outlines the problem and the existing hardware- and software-based techniques available for this purpose. Available techniques from the literature are then summarized using an energy-efficiency research dimension taxonomy. The advantages and disadvantages of the existing techniques are comprehensively analyzed against the proposed research dimensions, namely: resource adaptation policy, objective function, allocation method, allocation operation, and interoperability.
Wiewiórka, Marek S; Messina, Antonio; Pacholewska, Alicja; Maffioletti, Sergio; Gawrysiak, Piotr; Okoniewski, Michał J
2014-09-15
Many time-consuming analyses of next-generation sequencing data can be addressed with modern cloud computing. Apache Hadoop-based solutions have become popular in genomics because of their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying. The SparkSeq software has been created to take advantage of a new MapReduce framework, Apache Spark, for next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them in an interactive way. SparkSeq opens up the possibility of customized ad hoc secondary analyses and iterative machine learning algorithms. This article demonstrates its scalability and overall fast performance by running analyses of sequencing datasets. Tests of SparkSeq also show that cache usage and HDFS block size can be tuned for optimal performance on multiple worker nodes. Available under the open-source Apache 2.0 license: https://bitbucket.org/mwiewiorka/sparkseq/.
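SparkSeq itself is a Scala library, so the following PySpark fragment is only a sketch of the programming style it enables: expressing a sequencing analysis as chained Spark transformations over distributed records. The input path and tab-separated record layout are assumptions:

```python
# Count well-mapped reads per chromosome from a hypothetical tab-separated
# alignment dump (read_id, chromosome, mapping quality).
from pyspark import SparkContext

sc = SparkContext(appName="read-counts")
counts = (sc.textFile("hdfs:///data/reads.txt")
            .map(lambda line: line.split("\t"))
            .filter(lambda f: int(f[2]) >= 30)   # keep confident alignments
            .map(lambda f: (f[1], 1))            # key by chromosome
            .reduceByKey(lambda a, b: a + b))    # count per chromosome
for chrom, n in counts.collect():
    print(chrom, n)
sc.stop()
```

Because Spark keeps intermediate results in memory, the same lineage of transformations can be re-queried interactively, which is the batch-versus-interactive distinction the abstract draws.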
NASA Astrophysics Data System (ADS)
Mohan, C.
In this paper, I briefly survey some recent and emerging trends in hardware and software features which impact high-performance transaction processing and data analytics applications. These features include multicore processor chips, ultra-large main memories, flash storage, storage-class memories, database appliances, field-programmable gate arrays, transactional memory, key-value stores, and cloud computing. While some applications, e.g., Web 2.0 ones, were initially built without traditional transaction processing functionality in mind, system architects and designers are slowly beginning to address such previously ignored issues. The availability, analytics and response-time requirements of these applications were initially given more importance than ACID transaction semantics and resource consumption characteristics. A project at IBM Almaden is studying the implications of phase change memory on transaction processing, in the context of a key-value store. Bitemporal data management has also become an important requirement, especially for financial applications. Power consumption and heat dissipation properties are also major considerations in the emergence of modern software and hardware architectural features. Considerations relating to ease of configuration, installation, maintenance and monitoring, and improvement of the total cost of ownership have made database appliances very popular. The MapReduce paradigm is now quite popular for large-scale data analysis, in spite of the major inefficiencies associated with it.
Application and Prospect of Big Data in Water Resources
NASA Astrophysics Data System (ADS)
Xi, Danchi; Xu, Xinyi
2017-04-01
Because of developed information technology and affordable data storage, we have entered the era of data explosion. The term "Big Data" and the technologies related to it have been coined and are commonly applied in many fields. However, academic attention to Big Data applications in water resources is recent, and water-resource Big Data technology has therefore not been fully developed. This paper introduces the concept of Big Data and its key technologies, including the Hadoop system and MapReduce. In addition, this paper focuses on the significance of applying big data in water resources and summarizes prior research by others. Most studies in this field only set up theoretical frameworks, but we define "Water Big Data" and explain its three-dimensional properties: the time dimension, the spatial dimension and the intelligent dimension. Based on HBase, a classification system for Water Big Data is introduced: hydrology data, ecology data and socio-economic data. After analyzing the challenges in water resources management, a series of solutions using Big Data technologies, such as data mining and web crawling, are proposed. Finally, the prospects of applying big data in water resources are discussed; it can be predicted that as Big Data technology keeps developing, "3D" (Data-Driven Decision) approaches will be used more in water resources management in the future.
The Convergence of High Performance Computing and Large Scale Data Analytics
NASA Astrophysics Data System (ADS)
Duffy, D.; Bowen, M. K.; Thompson, J. H.; Yang, C. P.; Hu, F.; Wills, B.
2015-12-01
As the combinations of remote sensing observations and model outputs have grown, scientists are increasingly burdened with both the necessity and the complexity of large-scale data analysis, and are increasingly applying traditional high performance computing (HPC) solutions to solve their "Big Data" problems. While this approach has the benefit of limiting data movement, the HPC system is not optimized to run analytics, which can create problems that permeate throughout the HPC environment. To solve these issues and to alleviate some of the strain on the HPC environment, the NASA Center for Climate Simulation (NCCS) has created the Advanced Data Analytics Platform (ADAPT), which combines both HPC and cloud technologies to create an agile system designed for analytics. Large, commonly used data sets, such as Landsat, MODIS, MERRA, and NGA, are stored in this system in a write-once/read-many file system. High-performance virtual machines are deployed and scaled according to each scientist's requirements, specifically for data analysis. On the software side, the NCCS and GMU are working with emerging commercial technologies and applying them to structured, binary scientific data in order to expose the data in new ways. Native NetCDF data is stored within a Hadoop Distributed File System (HDFS), enabling storage-proximal processing through MapReduce while continuing to provide accessibility of the data to traditional applications. Once the data is stored within HDFS, an additional indexing scheme is built on top of the data and placed into a relational database. This spatiotemporal index enables extremely fast mapping of queries to data locations, dramatically speeding up analytics. These are some of the first steps toward a single unified platform that optimizes for both HPC and large-scale data analysis, and this presentation will elucidate the resulting and necessary exascale architectures required for future systems.
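The spatiotemporal index idea can be sketched with a toy relational table that maps time and bounding-box extents to HDFS paths; the schema and paths below are hypothetical, not the NCCS implementation:

```python
# A tiny index that routes a space-time query straight to data locations.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE chunk_index (
    path TEXT, t0 TEXT, t1 TEXT,
    lat0 REAL, lat1 REAL, lon0 REAL, lon1 REAL)""")
db.execute("INSERT INTO chunk_index VALUES "
           "('hdfs:///merra/1980_01.nc', '1980-01-01', '1980-01-31', "
           "-90, 90, -180, 180)")

# Find every chunk overlapping a query time window and bounding box.
rows = db.execute("""SELECT path FROM chunk_index
    WHERE t1 >= ? AND t0 <= ?
      AND lat1 >= ? AND lat0 <= ?
      AND lon1 >= ? AND lon0 <= ?""",
    ("1980-01-10", "1980-01-20", 30.0, 60.0, -10.0, 40.0)).fetchall()
print([r[0] for r in rows])   # only these chunks need to be read
```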
NASA Astrophysics Data System (ADS)
Stumpf, André; Malet, Jean-Philippe
2016-04-01
For more than 20 years, "Earth Observation" (EO) satellites developed or operated by ESA have provided a wealth of data. In the coming years, the Sentinel missions, along with the Copernicus Contributing Missions as well as Earth Explorers and other Third Party missions, will provide routine monitoring of our environment at the global scale, thereby delivering an unprecedented amount of data. While the availability of this growing volume of environmental data from space represents a unique opportunity for science, R&D, and applications, it also poses major challenges for fully exploiting the potential of archived and daily incoming datasets. Those challenges comprise not only the discovery, access, processing, and visualization of large data volumes but also an increasing diversity of data sources and end users from different fields (e.g. EO, in-situ monitoring, and modeling). In this context, the GTEP (Geohazards Thematic Exploitation Platform) initiative aims to build an operational distributed processing platform to maximize the exploitation of EO data from past and future satellite missions for the detection and monitoring of natural hazards. This presentation focuses on the "Optical Image Correlation" Pilot Project (funded by ESA within the GTEP platform), whose objectives are to develop an easy-to-use, flexible and distributed processing chain for: 1) the automated reconstruction of surface Digital Elevation Models from stereo (and tri-stereo) pairs of Spot 6/7 and Pléiades satellite imagery; 2) the creation of ortho-images (panchromatic and multi-spectral) of Landsat 8, Sentinel-2, Spot 6/7 and Pléiades scenes; and 3) the calculation of horizontal (E-N) displacement vectors based on sub-pixel image correlation. The processing chain is being implemented on the GEP cloud-based (Hadoop, MapReduce) environment and designed for the analysis of surface displacements at local to regional scales (10-1000 km2), targeting in particular co-seismic displacement and slow-moving landslides. The processing targets both the analysis of time series of archived data (Pléiades, Landsat 8) and the current satellite missions Spot 6/7 and Sentinel-2. The possibility of rapid calculation in near-real time is an important aspect of the design of the processing chain. Archived datasets will be processed for several demonstrator test sites in order to develop and test the implemented workflows.
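The core of step 3, image correlation for displacement, can be illustrated with a compact phase-correlation routine. This recovers integer-pixel shifts only, whereas the project's chain performs sub-pixel correlation (typically by refining the correlation peak), so treat this as a simplified sketch:

```python
# Integer-pixel displacement between two image patches via phase correlation.
import numpy as np

def phase_correlation(a, b):
    """Return the (dy, dx) shift that best aligns patch b to patch a."""
    F = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    F /= np.abs(F) + 1e-12                 # normalized cross-power spectrum
    corr = np.fft.ifft2(F).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map the peak's wrapped coordinates to signed shifts.
    if dy > a.shape[0] // 2:
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return dy, dx

rng = np.random.default_rng(0)
img = rng.random((128, 128))
shifted = np.roll(img, (3, -5), axis=(0, 1))   # known displacement
print(phase_correlation(shifted, img))          # expect (3, -5)
```

Run over a grid of patch pairs from two co-registered acquisitions, this yields exactly the kind of E-N displacement field described above, and each patch pair is an independent task, which is why the computation maps naturally onto Hadoop/MapReduce.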
NASA Astrophysics Data System (ADS)
Irving, D. H.; Rasheed, M.; Hillman, C.; O'Doherty, N.
2012-12-01
Oilfield management is moving to a more operational footing, with near-realtime seismic and sensor monitoring governing drilling, fluid injection and hydrocarbon extraction workflows within safety, productivity and profitability constraints. To date, the geoscientific analytical architectures employed are configured for large data volumes, computational power or analytical latency, and compromises in system design must be made to achieve all three. These challenges are encapsulated by the phrase 'Big Data', which has been employed for over a decade in the IT industry to describe the challenges presented by data sets that are too large, volatile and diverse for existing computational architectures and paradigms. We present a data-centric architecture developed to support a geoscientific and geotechnical workflow whereby scientific insight is continuously applied to fresh data; insights and derived information are incorporated into engineering and operational decisions; and data governance and provenance are routine within a broader data management framework. Strategic decision support systems in large infrastructure projects such as oilfields are typically relational data environments; data modelling is pervasive across analytical functions. However, subsurface data and models are typically non-relational (i.e. file-based), taking the form of large volumes of seismic imaging data or rapid streams of sensor feeds, and are analysed and interpreted using niche applications. The key architectural challenge is to move data and insight from a non-relational to a relational, or structured, data environment for faster and more integrated analytics. We describe how a blend of MapReduce and relational database technologies can be applied in geoscientific decision support, and the strengths and weaknesses of each in such an analytical ecosystem. In addition we discuss hybrid technologies that use aspects of both, and translational technologies for moving data and analytics across these platforms. Moving to a data-centric architecture requires data management methodologies to be overhauled by default; we show how end-to-end data provenance and dependency management are implicit in such an environment and how they benefit system administration as well as the user community. Whilst the architectural experiences are drawn from the oil industry, we believe they are more broadly applicable in academic and government settings where large volumes of data are added to incrementally and must be revisited with low analytical latency, and we suggest applications to earthquake monitoring and remote sensing networks.
Benchmarking undedicated cloud computing providers for analysis of genomic datasets.
Yazar, Seyhan; Gooden, George E C; Mackey, David A; Hewitt, Alex W
2014-01-01
A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E.coli and 53.5% (95% CI: 34.4-72.6) for human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for E.coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mahadevan, Sankaran; Agarwal, Vivek; Neal, Kyle
Assessment and management of aging concrete structures in nuclear power plants require a more systematic approach than simple reliance on existing code margins of safety. Structural health monitoring of concrete structures aims to understand the current health condition of a structure based on heterogeneous measurements, to produce high-confidence actionable information regarding structural integrity that supports operational and maintenance decisions. This ongoing research project is seeking to develop a probabilistic framework for health diagnosis and prognosis of aging concrete structures in a nuclear power plant subjected to physical, chemical, environmental, and mechanical degradation. The proposed framework consists of four elements: monitoring, data analytics, uncertainty quantification, and prognosis. This report focuses on degradation caused by alkali-silica reaction (ASR). Controlled specimens were prepared to develop accelerated ASR degradation. Several monitoring techniques (thermography, digital image correlation (DIC), mechanical deformation measurements, nonlinear impact resonance acoustic spectroscopy (NIRAS), and vibro-acoustic modulation (VAM)) were used to detect the damage caused by ASR. Heterogeneous data from the multiple techniques were used for damage diagnosis and prognosis, and for quantification of the associated uncertainty using a Bayesian network approach. Additionally, the MapReduce technique was demonstrated with synthetic data; it can be used in the future to handle the large amounts of observation data obtained from online monitoring of realistic structures.
NASA Astrophysics Data System (ADS)
Sangaline, E.; Lauret, J.
2014-06-01
The quantity of information produced in Nuclear and Particle Physics (NPP) experiments necessitates the transmission and storage of data across diverse collections of computing resources. Robust solutions such as XRootD have been used in NPP, but as the usage of cloud resources grows, the difficulties in the dynamic configuration of these systems become a concern. The Hadoop Distributed File System (HDFS) is a possible cloud storage solution with a proven track record in dynamic environments. Though currently not extensively used in NPP, HDFS is an attractive solution offering both elastic storage and rapid deployment. We present the performance of HDFS in both canonical I/O tests and for a typical data analysis pattern within the RHIC/STAR experimental framework. These tests explore the scaling with different levels of redundancy and numbers of clients. Additionally, the performance of FUSE and NFS interfaces to HDFS was evaluated as a way to allow existing software to function without modification. Unfortunately, the complicated data structures in NPP are non-trivial to integrate with Hadoop, and so many of the benefits of the MapReduce paradigm could not be directly realized. Despite this, our results indicate that using HDFS as a distributed filesystem offers reasonable performance and scalability, and that it excels in ease of configuration and deployment in a cloud environment.
Visual Semantic Based 3D Video Retrieval System Using HDFS.
Kumar, C Ranjith; Suguna, S
2016-08-01
This paper presents a new framework for visual-semantic 3D video search and retrieval. Existing 3D retrieval applications focus on shape analysis (object matching, classification and retrieval) rather than full video retrieval. In this context, we explore the concept of 3D content-based video retrieval (3D-CBVR) for the first time, combining bag-of-visual-words (BOVW) and MapReduce in a 3D framework. Instead of conventional shape-based local descriptors, we fuse shape, color and texture for feature extraction, using a combination of geometric and topological features for shape and a 3D co-occurrence matrix for color and texture. After extraction of the local descriptors, the Threshold-Based Predictive Clustering Tree (TB-PCT) algorithm is used to generate the visual codebook, and a histogram is produced. Matching is then performed using a soft weighting scheme with the L2 distance function. Finally, the retrieved results are ranked according to their index value and returned to the user as feedback. To handle the prodigious amount of data and make retrieval efficient, HDFS is incorporated into the design. Using a 3D video dataset, we evaluate the performance of the proposed system and show that it gives accurate results while reducing time complexity.
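The paper's exact 3D co-occurrence construction is its own; purely as a generic illustration of the idea, the sketch below counts how often quantized voxel intensities co-occur along a chosen displacement vector in a volume:

```python
# Generic 3D co-occurrence matrix over a quantized voxel volume: count how
# often gray level i neighbours gray level j along the given offset.
import numpy as np

def cooccurrence_3d(volume, offset=(0, 0, 1), levels=8):
    """`volume` holds values in [0, 1); returns a levels x levels count matrix."""
    q = np.clip((volume * levels).astype(int), 0, levels - 1)  # quantize
    dz, dy, dx = offset
    a = q[: q.shape[0] - dz, : q.shape[1] - dy, : q.shape[2] - dx]
    b = q[dz:, dy:, dx:]
    m = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(m, (a.ravel(), b.ravel()), 1)   # accumulate (i, j) pairs
    return m

vol = np.random.default_rng(1).random((16, 16, 16))
print(cooccurrence_3d(vol))
```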
Kim, Hyerin; Kang, NaNa; An, KyuHyeon; Koo, JaeHyung; Kim, Min-Soo
2016-01-01
Design of high-quality primers for multiple target sequences is essential for qPCR experiments, but is challenging due to the need to consider both homology tests on off-target sequences and stringent filtering constraints on the primers. Existing web servers for primer design have major drawbacks, including requiring the use of BLAST-like tools for homology tests and lacking support for primer ranking, TaqMan probes, and simultaneous design of primers against multiple targets. Due to the large-scale computational overhead, the few web servers supporting homology tests use heuristic approaches or perform homology tests within a limited scope. Here, we describe MRPrimerW, which performs complete homology testing, supports batch design of primers for multi-target qPCR experiments, supports design of TaqMan probes, and ranks the resulting primers to return the top-1 best primers to the user. To ensure high accuracy, we adopted the core algorithm of a previously reported MapReduce-based method, MRPrimer, but completely redesigned it to allow users to receive query results quickly in a web interface, without requiring a MapReduce cluster or a long computation. MRPrimerW provides primer design services and a complete set of 341 963 135 in silico validated primers covering 99% of human and mouse genes. Free access: http://MRPrimerW.com. PMID:27154272
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre
Statistical analysis is typically used to reduce the dimensionality of, and infer meaning from, data. A key challenge for any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner amenable to a map-reduce style implementation. In this paper we focus on contingency tables, from which numerous derived statistics, such as joint and marginal probability, point-wise mutual information, information entropy, and χ² independence statistics, can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup, and is the main difference from moment-based statistics (which we discussed in [1]), where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open-source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
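The contrast drawn here with moment-based statistics rests on the fact that moments merge with constant-size messages regardless of data size. A minimal sketch of that merge, using Welford's online update per partition and the standard pairwise combination formula:

```python
# Map: each worker summarizes its partition as (count, mean, M2).
# Reduce: partial summaries combine pairwise with constant-size messages.
from functools import reduce
import random

def summarize(xs):
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:                 # Welford's online update
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return n, mean, m2

def merge(a, b):                 # pairwise combination of two summaries
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    d = mb - ma
    return n, ma + d * nb / n, m2a + m2b + d * d * na * nb / n

data = [random.gauss(0, 1) for _ in range(10_000)]
parts = [summarize(data[i::4]) for i in range(4)]   # "map" on 4 partitions
n, mean, m2 = reduce(merge, parts)                  # "reduce"
print(n, mean, m2 / (n - 1))                        # count, mean, variance
```

A contingency table, by contrast, grows with the number of distinct category pairs, so the reduce step's message size grows with the data, which is exactly the communication problem the paper addresses.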
An integrated system for land resources supervision based on the IoT and cloud computing
NASA Astrophysics Data System (ADS)
Fang, Shifeng; Zhu, Yunqiang; Xu, Lida; Zhang, Jinqu; Zhou, Peiji; Luo, Kan; Yang, Jie
2017-01-01
Integrated information systems are important safeguards for the utilisation and development of land resources. Information technologies, including the Internet of Things (IoT) and cloud computing, are inevitable requirements for the quality and efficiency of land resources supervision tasks. In this study, an economical and highly efficient supervision system for land resources has been established based on IoT and cloud computing technologies; a novel online and offline integrated system with synchronised internal and field data that includes the entire process of 'discovering breaches, analysing problems, verifying fieldwork and investigating cases' was constructed. The system integrates key technologies, such as the automatic extraction of high-precision information based on remote sensing, semantic ontology-based technology to excavate and discriminate public sentiment on the Internet that is related to illegal incidents, high-performance parallel computing based on MapReduce, uniform storing and compressing (bitwise) technology, global positioning system data communication and data synchronisation mode, intelligent recognition and four-level ('device, transfer, system and data') safety control technology. The integrated system based on a 'One Map' platform has been officially implemented by the Department of Land and Resources of Guizhou Province, China, and was found to significantly increase the efficiency and level of land resources supervision. The system promoted the overall development of informatisation in fields related to land resource management.
A hashtag recommendation system for twitter data streams.
Otsuka, Eriko; Wallace, Scott A; Chiu, David
2016-01-01
Twitter has evolved into a powerful communication and information sharing tool used by millions of people around the world to post what is happening now. A hashtag, a keyword prefixed with a hash symbol (#), is a Twitter feature that organizes tweets and facilitates effective search among a massive volume of data. In this paper, we propose an automatic hashtag recommendation system that helps users find new hashtags related to their interests on demand. For hashtag ranking, we propose the Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU) ranking scheme, a variation of the well-known TF-IDF that considers hashtag relevancy as well as data sparseness, one of the key challenges in analyzing microblog data. Our system is built on top of Hadoop, a leading platform for distributed computing, to provide scalable performance using Map-Reduce. Experiments on a large Twitter data set demonstrate that our method successfully yields relevant hashtags for users' interests and that the recommendations are more stable and reliable than ranking tags based on tweet content similarity. Our results show that HF-IHU can achieve over 30% hashtag recall when asked to identify the top 10 relevant hashtags for a particular tweet. Furthermore, our method outperforms kNN, k-popularity, and Naïve Bayes by 69%, 54%, and 17%, respectively, on recall of the top 200 hashtags.
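The precise HF-IHU formula is defined in the paper; purely as an illustration of its TF-IDF-like shape, the sketch below scores candidate hashtags by their frequency in a candidate tweet set, discounted by their ubiquity across the whole corpus (the scoring function here is an assumption, not the published formula):

```python
# TF-IDF-shaped hashtag scoring: frequent in the candidate set, rare overall.
import math
from collections import Counter

tweets = [["#ml", "#ai"], ["#ai"], ["#ai", "#nba"], ["#nba", "#ml"]]
N = len(tweets)
ubiquity = Counter(tag for t in tweets for tag in set(t))  # tweets containing tag

def score(candidate_tweets):
    """Rank hashtags seen in tweets similar to the user's interests."""
    tf = Counter(tag for t in candidate_tweets for tag in t)
    return sorted(((tag, f * math.log(N / ubiquity[tag]))
                   for tag, f in tf.items()),
                  key=lambda kv: -kv[1])

print(score(tweets[:2]))   # hypothetical candidate set for one user
```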
A Demonstration of Big Data Technology for Data Intensive Earth Science (Invited)
NASA Astrophysics Data System (ADS)
Kuo, K.; Clune, T.; Ramachandran, R.; Rushing, J.; Fekete, G.; Lin, A.; Doan, K.; Oloso, A. O.; Duffy, D.
2013-12-01
Big Data technologies exhibit great potential to change the way we conduct scientific investigations, especially the analysis of voluminous and diverse data sets. Obviously, not all Big Data technologies are applicable to all aspects of scientific data analysis. Our NASA Earth Science Technology Office (ESTO) Advanced Information Systems Technology (AIST) project, the Automated Event Service (AES), pioneers the exploration of Big Data technologies for data-intensive Earth science. Since Earth science data are largely stored and manipulated in the form of multidimensional arrays, the project first evaluates the array performance of several candidate Big Data technologies, including MapReduce (Hadoop), SciDB, and a custom-built Polaris system, which have one important feature in common: a shared-nothing architecture. The evaluation finds SciDB to be the most promising. In this presentation, we demonstrate SciDB using two use cases, each operating on a distinct data set in the regular latitude-longitude grid. The first use case is the discovery and identification of blizzards using NASA's Modern-Era Retrospective analysis for Research and Applications (MERRA) data sets. The other finds diurnal signals in the same 8-year period using SSMI data from three different instruments with different equator crossing times, by correlating their retrieved parameters. In addition, the AES project is developing a collaborative component to enable the sharing of event queries and results. Preliminary capabilities will be presented as well.
Considerations for Future Climate Data Stewardship
NASA Astrophysics Data System (ADS)
Halem, M.; Nguyen, P. T.; Chapman, D. R.
2009-12-01
In this talk, we will describe the lessons learned from processing and generating a decade of gridded AIRS and MODIS IR sounding data. We describe the challenges faced in accessing and sharing very large data sets, maintaining data provenance under evolving technologies, obtaining access to legacy calibration data, and permanently preserving Earth science data records for on-demand services. These lessons suggest that a new approach to data stewardship will be required for the next decade of hyperspectral instruments combined with cloud-resolving models. It will not be sufficient for stewards of future data centers to just provide the public with access to archived data; our experience indicates that data need to reside close to computers with ultra-large disk farms and tens of thousands of processors, to deliver complex services on demand over very high speed networks, much like the offerings of search engines today. Over the first decade of the 21st century, petabyte data records were acquired from the AIRS instrument on Aqua and the MODIS instruments on Aqua and Terra. NOAA data centers also maintain petabytes of operational IR sounder data collected over the past four decades. The UMBC Multicore Computational Center (MC2) developed a Service Oriented Atmospheric Radiance gridding system (SOAR) to allow users to select IR sounding instruments from multiple archives and choose space-time-spectral periods of Level 1B data to download, grid, visualize and analyze on demand. Providing this service requires high-bandwidth access to the online disks at Goddard. After 10 years, cost-effective disk storage technology finally caught up with the MODIS data volume, making it possible for Level 1B MODIS data to be available online. However, 10 GbE fiber-optic networks for accessing large volumes of data are still not available from GSFC to serve the broader community; data transfer rates remain well below 10 MB/s, limiting their usefulness for climate studies. During this decade, processor performance hit a power wall, leading computer vendors to design multicore processor chips, and high-performance computer systems obtained petaflop performance by clustering tens of thousands of multicore chips. Thus, power consumption and autonomic recovery from processor and disk failures have become major cost and technical considerations for future data archives. To address these new architecture requirements, a transparent parallel programming paradigm, the Hadoop MapReduce cloud computing system, became available as open-source software; its Hadoop Distributed File System manages the distribution of data to the processors and backs up the processing in the event of any processor or disk failure. To employ this paradigm, however, the data need to be stored on the computing system itself. We conclude this talk with a climate data preservation approach that addresses the scalability crisis of exabyte data requirements for the next decade, based on projections of processor, disk-density, and bandwidth doubling rates.
Use of program logic models in the Southern Rural Access Program evaluation.
Pathman, Donald; Thaker, Samruddhi; Ricketts, Thomas C; Albright, Jennifer B
2003-01-01
The Southern Rural Access Program (SRAP) evaluation team used program logic models to clarify grantees' activities, objectives, and timelines. This information was used to benchmark data from grantees' progress reports to assess the program's successes. This article presents a brief background on the use of program logic models (essentially charts or diagrams specifying a program's planned activities, objectives, and goals) for evaluating and managing a program. It discusses the structure of the logic models chosen for the SRAP and how the model concept was introduced to the grantees to promote acceptance and use of the models. The article describes how the models helped clarify the program's objectives and helped lead agencies plan and manage the many program initiatives and subcontractors in their states. Models also provided a framework for grantees to report their progress to the National Program Office and the evaluators, and promoted the evaluators' visibility and acceptance by the grantees. Program logic models, however, increased grantees' reporting requirements and demanded substantial time from the evaluators. On balance, program logic models proved their merit in the SRAP through their contributions to its management and evaluation and by providing a better understanding of the program's initiatives, successes, and potential impact.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shipman, Galen M.
These are the slides for a presentation on programming models in HPC, at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (the Intel Xeon Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definitions of a programming model, programming languages, and runtime systems; programming models and environments; MPI (Message Passing Interface); OpenMP; Kokkos (a performance-portable thread-parallel programming model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; an overview of the Legion programming model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; and workflow, i.e., the integration of external resources into the programming model.
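Of the models listed, MPI's single-program-multiple-data style is the easiest to show in a few lines. A minimal mpi4py sketch (assuming mpi4py is installed; run with e.g. `mpiexec -n 4 python spmd_sum.py`):

```python
# SPMD: every rank runs the same program on its own slice of the data,
# then a reduction combines the partial results on rank 0.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = sum(range(rank, 1000, size))            # this rank's slice of 0..999
total = comm.reduce(local, op=MPI.SUM, root=0)  # combine partial sums
if rank == 0:
    print("total =", total)                     # 499500, for any process count
```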
Logic models as a tool for sexual violence prevention program development.
Hawkins, Stephanie R; Clinton-Sherrod, A Monique; Irvin, Neil; Hart, Laurie; Russell, Sarah Jane
2009-01-01
Sexual violence is a growing public health problem, and there is an urgent need to develop sexual violence prevention programs. Logic models have emerged as a vital tool in program development. The Centers for Disease Control and Prevention funded an empowerment evaluation designed to work with programs focused on the prevention of first-time male perpetration of sexual violence, and it included, as one of its goals, the development of program logic models. Two case studies are presented that describe how significant positive changes can be made to programs as a result of developing logic models that accurately describe desired outcomes. The first case study describes how the logic model development process made an organization aware of the importance of a program's environmental context for program success; the second demonstrates how developing a program logic model can elucidate gaps in organizational programming and suggest ways to close those gaps.
New algorithms for processing time-series big EEG data within mobile health monitoring systems.
Serhani, Mohamed Adel; Menshawy, Mohamed El; Benharref, Abdelghani; Harous, Saad; Navaz, Alramzana Nujum
2017-10-01
Recent advances in miniature biomedical sensors, mobile smartphones, wireless communications, and distributed computing technologies provide promising techniques for developing mobile health systems. Such systems are capable of reliably monitoring epileptic seizures, which are classified as chronic diseases. Three challenging issues arise in this context with regard to the transformation, compression, storage, and visualization of the big data that result from continuous recording of epileptic seizures using mobile devices. In this paper, we address these challenges by developing three new algorithms to process and analyze big electroencephalography data in a rigorous and efficient manner. The first algorithm transforms the standard European Data Format (EDF) into the standard JavaScript Object Notation (JSON) and compresses the transformed JSON data, decreasing the size and transfer time and increasing the network transfer rate. The second algorithm collects and stores the compressed files generated by the transformation and compression algorithm; the collection process is performed on the fly after decompressing the files. The third algorithm provides relevant real-time interaction with the signal data by prospective users. It particularly features the following capabilities: visualization of single or multiple signal channels on a smartphone device, and querying of data segments. We tested and evaluated the effectiveness of our approach through a software architecture model implementing a mobile health system to monitor epileptic seizures. The experimental findings from 45 experiments are promising and efficiently satisfy the approach's objectives at the price of linearity. Moreover, the size of compressed JSON files and the transfer times are reduced by 10% and 20%, respectively, while the average total time is reduced by a remarkable 67% across all performed experiments. Our approach successfully develops efficient algorithms in terms of processing time, memory usage, and energy consumption, while maintaining high scalability. It efficiently supports data partitioning and parallelism relying on the MapReduce platform, which can help in the monitoring and automatic detection of epileptic seizures.
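The output stage of the first algorithm (EDF to JSON, then compression) can be sketched with the Python standard library alone; parsing real EDF headers would require an EDF reader such as pyedflib, so the signal segment below is synthetic:

```python
# Package a signal segment as JSON and gzip it for transfer.
import gzip, json

segment = {
    "channel": "EEG Fp1",
    "sample_rate_hz": 256,
    "start": "2017-01-01T00:00:00",
    "samples": [round(0.01 * i, 4) for i in range(2560)],  # 10 s, synthetic
}

raw = json.dumps(segment).encode("utf-8")
packed = gzip.compress(raw)
print(f"json: {len(raw)} bytes, gzipped: {len(packed)} bytes "
      f"({100 * len(packed) / len(raw):.1f}%)")

# The receiver reverses the steps on the fly before storage.
restored = json.loads(gzip.decompress(packed))
assert restored["channel"] == "EEG Fp1"
```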
Leveraging the Cloud for Robust and Efficient Lunar Image Processing
NASA Technical Reports Server (NTRS)
Chang, George; Malhotra, Shan; Wolgast, Paul
2011-01-01
The Lunar Mapping and Modeling Project (LMMP) is tasked to aggregate lunar data, from the Apollo era to the latest instruments on the LRO spacecraft, into a central repository accessible by scientists and the general public. A critical function of this task is to provide users with the best solution for browsing the vast amounts of imagery available. The image files LMMP manages range from a few gigabytes to hundreds of gigabytes in size, with new data arriving every day. Despite this ever-increasing amount of data, LMMP must make the data readily available in a timely manner for users to view and analyze. This is accomplished by tiling large images into smaller images using Hadoop, a distributed computing platform implementing the MapReduce framework, running on a small local cluster of machines. Additionally, the software is implemented to use Amazon's Elastic Compute Cloud (EC2) facility. We also developed a hybrid solution to serve images to users by leveraging cloud storage, using Amazon's Simple Storage Service (S3) for public data while keeping private information on our own data servers. By using cloud computing, we improve upon our local solution by reducing the need to manage our own hardware and computing infrastructure, thereby reducing costs. Further, by using a hybrid of local and cloud storage, we are able to provide data to our users more efficiently and securely. This paper examines the use of a distributed approach with Hadoop to tile images, an approach that provides significant improvements in image processing time, from hours to minutes. It describes the constraints imposed on the solution and the resulting techniques developed for the hybrid solution of a customized Hadoop infrastructure over local and cloud resources in managing this ever-growing data set. It examines the performance trade-offs of using the more plentiful resources of the cloud, such as those provided by S3, against the bandwidth limitations such use encounters with remote resources. As part of this discussion, the paper will outline some of the technologies employed, the reasons for their selection, the resulting performance metrics, and the direction the project is headed based upon the capabilities demonstrated thus far.
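The tiling step parallelizes well precisely because each tile is an independent unit of work. A small sketch of the tiling arithmetic (the tile size and pyramid depth are hypothetical, not LMMP's actual parameters):

```python
# Enumerate the independent tile tasks for an image pyramid; each
# (level, row, col) job could be dispatched as one MapReduce task.
TILE = 256

def tile_jobs(width, height, levels):
    """Yield one job per tile per zoom level (level 0 = full resolution)."""
    for level in range(levels):
        w, h = width >> level, height >> level
        cols = -(-w // TILE)   # ceiling division
        rows = -(-h // TILE)
        for r in range(rows):
            for c in range(cols):
                yield level, r, c

jobs = list(tile_jobs(100_000, 50_000, levels=4))
print(len(jobs), "independent tile tasks")
```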
ERIC Educational Resources Information Center
Guthrie, Steven P.
In two articles on outdoor programming models, Watters distinguished four models on a continuum ranging from the common adventure model, with minimal organizational structure and leadership control, to the guide service model, in which leaders are autocratic and trips are highly structured. Club programs and instructional programs were in between,…
LDA-Based Unified Topic Modeling for Similar TV User Grouping and TV Program Recommendation.
Pyo, Shinjee; Kim, Eunhui; Kim, Munchurl
2015-08-01
Social TV is a social media service via TV and social networks through which TV users exchange their experiences about the TV programs they are viewing. For social TV service, two technical aspects are envisioned: grouping similar TV users to create social TV communities, and recommending TV programs based on group and personal interests to personalize TV. In this paper, we propose a unified topic model for grouping similar TV users and recommending TV programs as a social TV service. The proposed unified topic model employs two latent Dirichlet allocation (LDA) models: one is a topic model of TV users, and the other is a topic model of the description words of viewed TV programs. The two LDA models are integrated via a topic proportion parameter for TV programs, which enforces the grouping of similar TV users and of the associated description words for watched TV programs at the same time, in a unified topic modeling framework. The unified model identifies the semantic relation between TV user groups and TV program description word groups, so that more meaningful TV program recommendations can be made. The unified topic model also overcomes the item ramp-up problem, so that new TV programs can be reliably recommended to TV users. Furthermore, from the topic model of TV users, users with similar tastes can be grouped as topics, which can then be recommended as social TV communities. To verify the proposed method of unified topic-model-based TV user grouping and TV program recommendation, we experimented with real TV viewing history data and electronic program guide data from a seven-month period, collected by a TV poll agency. The experimental results show that the proposed unified topic model yields an average precision of 81.4% for 50 topics in TV program recommendation, an average of 6.5% higher than the topic model of TV users alone. For TV user prediction with new TV programs, the average prediction precision was 79.6%. We also show the superiority of the proposed model, in terms of both topic modeling and recommendation performance, over two related topic models: the polylingual topic model and the bilingual topic model.
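The paper's unified model couples two LDA models through a shared topic proportion; as a smaller, self-contained illustration, the fragment below trains a single LDA over toy viewing histories with gensim, treating each user's watched programs as a document:

```python
# One LDA over viewing histories: topics become interest groups of users.
from gensim import corpora, models

histories = [["news9", "drama_a", "news9"],
             ["sports1", "sports2", "news9"],
             ["drama_a", "drama_b"]]
dictionary = corpora.Dictionary(histories)
corpus = [dictionary.doc2bow(h) for h in histories]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=7)
for topic_id, terms in lda.print_topics():
    print(topic_id, terms)      # program mixtures defining each group
print(lda[corpus[0]])           # a user's topic (group) proportions
```

The unified model goes further by tying such user topics to a second LDA over program description words, which is what lets it place a brand-new program (with no viewing history) into the right interest group.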
Addressing Dynamic Issues of Program Model Checking
NASA Technical Reports Server (NTRS)
Lerda, Flavio; Visser, Willem
2001-01-01
Model checking real programs has recently become an active research area. Programs however exhibit two characteristics that make model checking difficult: the complexity of their state and the dynamic nature of many programs. Here we address both these issues within the context of the Java PathFinder (JPF) model checker. Firstly, we will show how the state of a Java program can be encoded efficiently and how this encoding can be exploited to improve model checking. Next we show how to use symmetry reductions to alleviate some of the problems introduced by the dynamic nature of Java programs. Lastly, we show how distributed model checking of a dynamic program can be achieved, and furthermore, how dynamic partitions of the state space can improve model checking. We support all our findings with results from applying these techniques within the JPF model checker.
Kim, Hyerin; Kang, NaNa; An, KyuHyeon; Koo, JaeHyung; Kim, Min-Soo
2016-07-08
Design of high-quality primers for multiple target sequences is essential for qPCR experiments, but is challenging due to the need to consider both homology tests on off-target sequences and the same stringent filtering constraints on the primers. Existing web servers for primer design have major drawbacks, including requiring the use of BLAST-like tools for homology tests and lacking support for ranking of primers, for TaqMan probes, and for simultaneous design of primers against multiple targets. Due to the large-scale computational overhead, the few web servers supporting homology tests use heuristic approaches or perform homology tests within a limited scope. Here, we describe MRPrimerW, which performs complete homology testing, supports batch design of primers for multi-target qPCR experiments, supports design of TaqMan probes, and ranks the resulting primers to return the top-ranked primers to the user. To ensure high accuracy, we adopted the core algorithm of a previously reported MapReduce-based method, MRPrimer, but completely redesigned it to allow users to receive query results quickly in a web interface, without requiring a MapReduce cluster or a long computation. MRPrimerW provides primer design services and a complete set of 341 963 135 in silico validated primers covering 99% of human and mouse genes. Free access: http://MRPrimerW.com. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
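To make the "stringent filtering constraints" concrete, the sketch below shows the kind of single-primer filters such pipelines apply (length, GC content, a Wallace-rule melting temperature estimate). The thresholds, function names, and sequence are illustrative assumptions, not MRPrimerW's actual constraints, and the ranking and homology-test stages are omitted.

```python
# Sketch of a single-primer filtering stage only (illustrative thresholds;
# MRPrimerW's actual constraints, ranking, and homology tests are richer).
def gc_content(p):
    return (p.count("G") + p.count("C")) / len(p)

def wallace_tm(p):
    # Wallace rule: Tm = 2(A+T) + 4(G+C), a rough estimate for short primers.
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

def passes_filters(p, len_range=(18, 25), gc_range=(0.4, 0.6), tm_range=(50, 65)):
    return (len_range[0] <= len(p) <= len_range[1]
            and gc_range[0] <= gc_content(p) <= gc_range[1]
            and tm_range[0] <= wallace_tm(p) <= tm_range[1])

seq = "ATGGCGTACGTTAGCCGATCGATCGGCTAAGCTT"
candidates = {seq[i:i + k] for k in range(18, 26) for i in range(len(seq) - k + 1)}
print(sorted(p for p in candidates if passes_filters(p)))
```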
HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing
Karimi, Ramin; Hajdu, Andras
2016-01-01
Comprehensive efforts toward low-cost sequencing in the past few years have led to the growth of complete genome databases. In parallel with this effort, fast and cost-effective methods and applications are strongly needed to accelerate sequence analysis. Identification is the very first step of this task. Due to the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly desirable. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding k-mer generator, GkmerG (genome k-mers generator). Using this pipeline, we determine the frequency of k-mers from the available complete genome databases for the detection of extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. Hadoop and MapReduce are used in this pipeline as parallel and distributed computing tools on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genomes. The considerable number of unique and common DNA signatures detected in the target database creates opportunities to improve the identification process not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis. PMID:26884678
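The map/reduce core of k-mer counting is small enough to show in full; the sketch below simulates the two phases in one process (the real pipeline runs them on a Hadoop cluster), with invented sequences.

```python
# The map/reduce core of k-mer counting, simulated in one process (an
# HTSFinder-style pipeline runs the same logic on Hadoop). Made-up sequences.
from itertools import groupby

def mapper(seq, k):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1                 # emit (k-mer, 1)

def reducer(pairs):
    for kmer, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield kmer, sum(c for _, c in group)  # sum counts per k-mer

genomes = ["ATGCGATGCA", "GATGCATGCG"]
pairs = [kv for seq in genomes for kv in mapper(seq, k=4)]
counts = dict(reducer(pairs))

# k-mers occurring exactly once are candidate unique signatures.
print([kmer for kmer, n in counts.items() if n == 1])
```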
Sequence analysis by iterated maps, a review.
Almeida, Jonas S
2014-05-01
Among alignment-free methods, Iterated Maps (IMs) are on a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time-series-analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better-established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal two years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets, as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of Big Data and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage now appears to be set for a recasting of IMs with a central role in processing next-generation sequencing results.
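The Chaos Game Representation mentioned above is a short iterated map, sketched below: each base pulls the current point halfway toward its corner of the unit square (the particular corner assignment is one common convention, not mandated by the method).

```python
# The classic Chaos Game Representation iterated map: each base pulls the
# current point halfway toward its corner of the unit square. (The corner
# assignment below is one common convention.)
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr(sequence, start=(0.5, 0.5)):
    x, y = start
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2   # midpoint iteration
        points.append((x, y))
    return points

for p in cgr("ATGCGATT")[:4]:
    print(round(p[0], 4), round(p[1], 4))
```

Plotting the resulting points reveals the fractal structure that k-mer statistics leave in the unit square, which is what makes the representation useful for alignment-free comparison.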
The Czech National Grid Infrastructure
NASA Astrophysics Data System (ADS)
Chudoba, J.; Křenková, I.; Mulač, M.; Ruda, M.; Sitera, J.
2017-10-01
The Czech National Grid Infrastructure is operated by MetaCentrum, a CESNET department responsible for coordinating and managing activities related to distributed computing. CESNET, as the Czech National Research and Education Network (NREN), provides many e-infrastructure services, which are used by 94% of the scientific and research community in the Czech Republic. Computing and storage resources owned by different organizations are connected by a network fast enough to provide transparent access to all resources. We describe in more detail the computing infrastructure, which is based on several different technologies and covers grid, cloud, and map-reduce environments. While the largest share of CPUs is still accessible via distributed TORQUE servers, providing an environment for long batch jobs, part of the infrastructure is available via standard EGI tools, a subset of NGI resources is provided into the EGI FedCloud environment with a cloud interface, and there is also a Hadoop cluster provided by the same e-infrastructure. A broad spectrum of computing servers is offered; users can choose from standard 2-CPU servers to large SMP machines with up to 6 TB of RAM or servers with GPU cards. Different groups have different priorities on various resources, and resource owners can even have exclusive access. The software is distributed via AFS. Storage servers offering up to tens of terabytes of disk space to individual users are connected via NFS4 on top of GPFS, and access to long-term HSM storage with petabyte capacity is also provided. An overview of available resources and recent statistics of usage will be given.
Streamlining CASTOR to manage the LHC data torrent
NASA Astrophysics Data System (ADS)
Lo Presti, G.; Espinal Curull, X.; Cano, E.; Fiorini, B.; Ieri, A.; Murray, S.; Ponce, S.; Sindrilaru, E.
2014-06-01
This contribution describes the evolution of the main CERN storage system, CASTOR, as it manages the bulk data stream of the LHC and other CERN experiments, achieving over 90 PB of stored data by the end of LHC Run 1. This evolution was marked by the introduction of policies to optimize the tape sub-system throughput, moving towards a cold storage system where data placement is managed by the experiments' production managers. More efficient tape migrations and recalls have been implemented and deployed, where bulk meta-data operations greatly reduce the overhead due to small files. A repack facility is now integrated in the system, and it has been enhanced in order to automate the repacking of several tens of petabytes, required in 2014 to prepare for the next LHC run. Finally, the scheduling system has been evolved to integrate the internal monitoring. To efficiently manage the service, a solid monitoring infrastructure is required, capable of analyzing the logs produced by the different components (about 1 kHz of log messages). A new system has been developed and deployed, which uses a transport messaging layer provided by the CERN-IT Agile Infrastructure and exploits technologies including Hadoop and HBase. This enables efficient data mining through MapReduce techniques, as well as real-time data aggregation and visualization. The outlook for the future is also presented; directions and possible evolution are discussed in view of the restart of data-taking activities.
Extending Climate Analytics-as-a-Service to the Earth System Grid Federation
NASA Astrophysics Data System (ADS)
Tamkin, G.; Schnase, J. L.; Duffy, D.; McInerney, M.; Nadeau, D.; Li, J.; Strong, S.; Thompson, J. H.
2015-12-01
We are building three extensions to prior-funded work on climate analytics-as-a-service that will benefit the Earth System Grid Federation (ESGF) as it addresses the Big Data challenges of future climate research: (1) We are creating a cloud-based, high-performance Virtual Real-Time Analytics Testbed supporting a select set of climate variables from six major reanalysis data sets. This near real-time capability will enable advanced technologies like the Cloudera Impala-based Structured Query Language (SQL) query capabilities and Hadoop-based MapReduce analytics over native NetCDF files while providing a platform for community experimentation with emerging analytic technologies. (2) We are building a full-featured Reanalysis Ensemble Service comprising monthly means data from six reanalysis data sets. The service will provide a basic set of commonly used operations over the reanalysis collections. The operations will be made accessible through NASA's climate data analytics Web services and our client-side Climate Data Services (CDS) API. (3) We are establishing an Open Geospatial Consortium (OGC) WPS-compliant Web service interface to our climate data analytics service that will enable greater interoperability with next-generation ESGF capabilities. The CDS API will be extended to accommodate the new WPS Web service endpoints as well as ESGF's Web service endpoints. These activities address some of the most important technical challenges for server-side analytics and support the research community's requirements for improved interoperability and improved access to reanalysis data.
A simulation model for wind energy storage systems. Volume 3: Program descriptions
NASA Technical Reports Server (NTRS)
Warren, A. W.; Edsinger, R. W.; Burroughs, J. D.
1977-01-01
Program descriptions, flow charts, and program listings for the SIMWEST model generation program, the simulation program, the file maintenance program, and the printer plotter program are given. For Vol 2, see .
Model-based control strategies for systems with constraints of the program type
NASA Astrophysics Data System (ADS)
Jarzębowska, Elżbieta
2006-08-01
The paper presents a model-based tracking control strategy for constrained mechanical systems. The constraints we consider can be material or non-material; the latter are referred to as program constraints. The program constraint equations represent tasks imposed on system motions; they can be differential equations of order higher than one or two, and they can be non-integrable. The tracking control strategy relies upon two dynamic models: a reference model, which is a dynamic model of a system with arbitrary-order differential constraints, and a dynamic control model. The reference model serves as a motion planner, which generates inputs to the dynamic control model. It is based upon the generalized program motion equations (GPME) method. The method enables material and program constraints to be combined and merged into the motion equations. Lagrange's equations with multipliers are a special case of the GPME, since they apply to systems with first-order constraints. Our tracking strategy, referred to as a model reference program motion tracking control strategy, enables tracking of any program motion predefined by the program constraints. It extends "trajectory tracking" to "program motion tracking". We also demonstrate that our tracking strategy can be extended to hybrid program motion/force tracking.
[Documenting a rehabilitation program using a logic model: an advantage to the assessment process].
Poncet, Frédérique; Swaine, Bonnie; Pradat-Diehl, Pascale
2017-03-06
Cognitive and behavioral disorders after brain injury can result in severe limitations of activities and restrictions of participation. An interdisciplinary rehabilitation program was developed in physical medicine and rehabilitation at the Pitié-Salpêtrière Hospital, Paris, France. Clinicians believe this program decreases activity limitations and improves participation in patients, but the program's effectiveness had never been assessed. To do this, we first had to define and describe the program. Rehabilitation programs, however, are holistic and thus complex, making them difficult to describe. Therefore, to facilitate the evaluation of complex programs, including those for rehabilitation, we illustrate the use of a theoretical logic model, as proposed by Champagne, through the process of documenting a specific complex and interdisciplinary rehabilitation program. Through participatory/collaborative research, the rehabilitation program was analyzed using three "submodels" of the logic model of intervention: the causal model, the intervention model, and the program theory model. This should facilitate the evaluation of programs, including those for rehabilitation.
Unified Program Design: Organizing Existing Programming Models, Delivery Options, and Curriculum
ERIC Educational Resources Information Center
Rubenstein, Lisa DaVia; Ridgley, Lisa M.
2017-01-01
A persistent problem in the field of gifted education has been the lack of categorization and delineation of gifted programming options. To address this issue, we propose Unified Program Design as a structural framework for gifted program models. This framework defines gifted programs as the combination of delivery methods and curriculum models.…
Van Metre, P.C.
1990-01-01
A computer-program interface between a geographic-information system and a groundwater flow model links two unrelated software systems for use in developing the flow models. The interface program allows the modeler to compile and manage geographic components of a groundwater model within the geographic information system. A significant savings of time and effort is realized in developing, calibrating, and displaying the groundwater flow model. Four major guidelines were followed in developing the interface program: (1) no changes to the groundwater flow model code were to be made; (2) a data structure was to be designed within the geographic information system that follows the same basic data structure as the groundwater flow model; (3) the interface program was to be flexible enough to support all basic data options available within the model; and (4) the interface program was to be as efficient as possible in terms of computer time used and online-storage space needed. Because some programs in the interface are written in control-program language, the interface will run only on a computer with the PRIMOS operating system. (USGS)
Program management model study
NASA Technical Reports Server (NTRS)
Connelly, J. J.; Russell, J. E.; Seline, J. R.; Sumner, N. R., Jr.
1972-01-01
Two models, a system performance model and a program assessment model, have been developed to assist NASA management in the evaluation of development alternatives for the Earth Observations Program. Two computer models were developed and demonstrated on the Goddard Space Flight Center Computer Facility. Procedures have been outlined to guide the user of the models through specific evaluation processes, and the preparation of inputs describing earth observation needs and earth observation technology. These models are intended to assist NASA in increasing the effectiveness of the overall Earth Observation Program by providing a broader view of system and program development alternatives.
ERIC Educational Resources Information Center
Matzke, Orville R.
The purpose of this study was to formulate a linear programming model to simulate a foundation type support program and to apply this model to a state support program for the public elementary and secondary school districts in the State of Iowa. The model was successful in producing optimal solutions to five objective functions proposed for…
Program listing for the REEDM (Rocket Exhaust Effluent Diffusion Model) computer program
NASA Technical Reports Server (NTRS)
Bjorklund, J. R.; Dumbauld, R. K.; Cheney, C. S.; Geary, H. V.
1982-01-01
The program listing for the REEDM Computer Program is provided. A mathematical description of the atmospheric dispersion models, cloud-rise models, and other formulas used in the REEDM model; vehicle and source parameters, other pertinent physical properties of the rocket exhaust cloud and meteorological layering techniques; user's instructions for the REEDM computer program; and worked example problems are contained in NASA CR-3646.
ERIC Educational Resources Information Center
King, Gillian; Currie, Melissa; Smith, Linda; Servais, Michelle; McDougall, Janette
2008-01-01
A framework of operating models for interdisciplinary research programs in clinical service organizations is presented, consisting of a "clinician-researcher" skill development model, a program evaluation model, a researcher-led knowledge generation model, and a knowledge conduit model. Together, these models comprise a tailored, collaborative…
Using the OIG model compliance programs to fight fraud.
Lovitky, Jeffrey A; Ahern, Jack
2002-03-01
Many healthcare organizations already have implemented compliance programs for their facilities. However, in light of recent fines and continued scrutiny of such programs by the HHS Office of Inspector General (OIG), healthcare organizations should consider reviewing their current programs against the OIG's relevant model compliance program. Although healthcare organizations are not required to adhere strictly to OIG's model programs, they would benefit from ensuring that their programs meet all the OIG's requirements. The common, minimum elements suggested by the OIG model programs include development and distribution of written compliance policies, the designation of a chief compliance officer to manage the program, the development of a corrective action and enforcement system, and the use of audits to monitor compliance. Using these models as guides, healthcare organizations should be better able to avoid the possibility of fraud and abuse within their organizations.
Devil is in the details: Using logic models to investigate program process.
Peyton, David J; Scicchitano, Michael
2017-12-01
Theory-based logic models are commonly developed as part of requirements for grant funding. As a tool for communicating complex social programs, theory-based logic models are an effective visual communication device. However, after initial development, theory-based logic models are often abandoned and remain in their initial form despite changes in the program process. This paper examines the potential benefits of committing time and resources to revising the initial theory-driven logic model and developing detailed logic models that describe key activities, to accurately reflect the program and assist in effective program management. The authors use a funded special education teacher preparation program to exemplify the utility of drill-down logic models. The paper concludes with lessons learned from the iterative revision process and suggests how the process can lead to more flexible and calibrated program management. Copyright © 2017 Elsevier Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi
2016-06-01
Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in an OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port the Gyrokinetic Toroidal Code - Princeton (GTC-P), a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming, while the performance is comparable to that of the MPI version. Because the global-view programming model is suitable for expressing the data parallelism of a field of grid-space data, we implement a hybrid-view version that uses the global-view programming model to compute the field and the local-view programming model to compute the movement of particles. The performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid-space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
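To illustrate the local-view style that coarrays simplify, here is a Python sketch of a message-passing halo exchange on a periodic ring, assuming mpi4py is available; a PGAS/coarray model such as XMP would express each exchange as a single remote assignment. This is a structural sketch, not GTC-P.

```python
# Local-view halo exchange with mpi4py (run with: mpiexec -n 4 python halo.py).
# This is the MPI-style code that a PGAS/coarray model such as XMP replaces
# with a simple remote assignment; a sketch, not GTC-P itself.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns a chunk of a 1-D field plus one ghost cell on each side.
local = np.full(10, float(rank))
left_ghost = np.empty(1)
right_ghost = np.empty(1)
left = (rank - 1) % size
right = (rank + 1) % size

# Paired send/receive avoids deadlock on the periodic ring.
comm.Sendrecv(local[:1], dest=left, recvbuf=right_ghost, source=right)
comm.Sendrecv(local[-1:], dest=right, recvbuf=left_ghost, source=left)
print(f"rank {rank}: ghosts {left_ghost[0]}, {right_ghost[0]}")
```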
From Petascale to Exascale: Eight Focus Areas of R&D Challenges for HPC Simulation Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Springmeyer, R; Still, C; Schulz, M
2011-03-17
Programming models bridge the gap between the underlying hardware architecture and the supporting layers of software available to applications. Programming models are different from both programming languages and application programming interfaces (APIs). Specifically, a programming model is an abstraction of the underlying computer system that allows for the expression of both algorithms and data structures. In comparison, languages and APIs provide implementations of these abstractions and allow the algorithms and data structures to be put into practice - a programming model exists independently of the choice of both the programming language and the supporting APIs. Programming models are typically focused on achieving increased developer productivity, performance, and portability to other system designs. The rapidly changing nature of processor architectures and the complexity of designing an exascale platform provide significant challenges for these goals. Several other factors are likely to impact the design of future programming models. In particular, the representation and management of increasing levels of parallelism, concurrency and memory hierarchies, combined with the ability to maintain a progressive level of interoperability with today's applications are of significant concern. Overall the design of a programming model is inherently tied not only to the underlying hardware architecture, but also to the requirements of applications and libraries including data analysis, visualization, and uncertainty quantification. Furthermore, the successful implementation of a programming model is dependent on exposed features of the runtime software layers and features of the operating system. Successful use of a programming model also requires effective presentation to the software developer within the context of traditional and new software development tools. Consideration must also be given to the impact of programming models on both languages and the associated compiler infrastructure. Exascale programming models must reflect several, often competing, design goals. These design goals include desirable features such as abstraction and separation of concerns. However, some aspects are unique to large-scale computing. For example, interoperability and composability with existing implementations will prove critical. In particular, performance is the essential underlying goal for large-scale systems. A key evaluation metric for exascale models will be the extent to which they support these goals rather than merely enable them.
Testing of a Program Evaluation Model: Final Report.
ERIC Educational Resources Information Center
Nagler, Phyllis J.; Marson, Arthur A.
A program evaluation model developed by Moraine Park Technical Institute (MPTI) is described in this report. Following background material, the four main evaluation criteria employed in the model are identified as program quality, program relevance to community needs, program impact on MPTI, and the transition and growth of MPTI graduates in the…
More-Realistic Digital Modeling of a Human Body
NASA Technical Reports Server (NTRS)
Rogge, Renee
2010-01-01
A MATLAB computer program has been written to enable improved modeling of a human body (relative to an older program) for purposes of designing space suits and other hardware with which an astronaut must interact. The older program implements a kinematic model based on traditional anthropometric measurements, which provide important volume and surface information. The present program generates a three-dimensional (3D) whole-body model from 3D body-scan data. The program utilizes thin-plate spline theory to reposition the model without the need for additional scans.
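The thin-plate spline idea can be sketched compactly: learn a smooth warp from a few landmark correspondences and apply it to the remaining scan points. The sketch below uses scipy's RBF interpolator in 2-D with invented landmarks; it illustrates the technique, not the NASA program itself.

```python
# Sketch of thin-plate-spline repositioning: learn a smooth 2-D warp from a
# few landmark correspondences, then apply it to other scan points. (Uses
# scipy's RBF interpolator; this is not the NASA program itself.)
import numpy as np
from scipy.interpolate import RBFInterpolator

landmarks_before = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
landmarks_after = np.array([[0.0, 0.0], [1.0, 0.1], [0.1, 1.0], [1.2, 1.2]])

# One thin-plate-spline map from old landmark positions to new ones.
warp = RBFInterpolator(landmarks_before, landmarks_after,
                       kernel="thin_plate_spline")

# Apply the warp to arbitrary body-scan points.
scan_points = np.array([[0.5, 0.5], [0.25, 0.75]])
print(warp(scan_points))
```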
Advanced very high resolution radiometer
NASA Technical Reports Server (NTRS)
1976-01-01
The advanced very high resolution radiometer development program is considered. The program covered the design, construction, and test of a breadboard model, engineering model, protoflight model, mechanical structural model, and a life test model. Special bench test and calibration equipment was also developed for use on the program.
Testing the Structure of Hydrological Models using Genetic Programming
NASA Astrophysics Data System (ADS)
Selle, B.; Muttil, N.
2009-04-01
Genetic Programming is able to systematically explore many alternative model structures of different complexity from available input and response data. We hypothesised that Genetic Programming can be used to test the structure of hydrological models and to identify dominant processes in hydrological systems. To test this, Genetic Programming was used to analyse a data set from a lysimeter experiment in southeastern Australia. The lysimeter experiment was conducted to quantify the deep percolation response under surface-irrigated pasture to different soil types, water table depths, and water ponding times during surface irrigation. Using Genetic Programming, a simple model of deep percolation was consistently evolved in multiple model runs. This simple and interpretable model confirmed the dominant process contributing to deep percolation represented in a conceptual model that was published earlier. Thus, this study shows that Genetic Programming can be used to evaluate the structure of hydrological models and to gain insight into the dominant processes in hydrological systems.
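A minimal sketch of the approach, assuming the gplearn library is available: symbolic regression evolves an interpretable expression from input-response data. The data below is synthetic, not the lysimeter set, and the settings are illustrative.

```python
# Sketch of structure discovery by Genetic Programming using gplearn (an
# assumed-available library; data below is synthetic, not the lysimeter set).
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 2))            # e.g. ponding time, water table depth
y = 2.0 * X[:, 0] + X[:, 0] * X[:, 1]      # hidden "true" process, noise-free

gp = SymbolicRegressor(population_size=500, generations=10,
                       function_set=("add", "mul"), random_state=0)
gp.fit(X, y)
print(gp._program)                          # evolved, interpretable expression
```

If the same simple expression is evolved across repeated runs, as in the study, that consistency is the evidence for a dominant process.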
An OpenACC-Based Unified Programming Model for Multi-accelerator Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S
2015-01-01
This paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
User's manual for LINEAR, a FORTRAN program to derive linear aircraft models
NASA Technical Reports Server (NTRS)
Duke, Eugene L.; Patterson, Brian P.; Antoniewicz, Robert F.
1987-01-01
This report documents a FORTRAN program that provides a powerful and flexible tool for the linearization of aircraft models. The program LINEAR numerically determines a linear system model using nonlinear equations of motion and a user-supplied nonlinear aerodynamic model. The system model determined by LINEAR consists of matrices for both state and observation equations. The program has been designed to allow easy selection and definition of the state, control, and observation variables to be used in a particular model.
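The numerical core of such a tool can be sketched in a few lines: central-difference Jacobians of the nonlinear dynamics about a trim point yield the state-space A and B matrices. The dynamics below is a toy pendulum, not an aircraft model, and the function names are illustrative.

```python
# Sketch of what a linearization tool like LINEAR does numerically: central-
# difference Jacobians of a nonlinear model f(x, u) about a trim point,
# giving the state-space A and B matrices. (Toy dynamics, not aircraft.)
import numpy as np

def linearize(f, x0, u0, eps=1e-6):
    n, m = len(x0), len(u0)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x0 + dx, u0) - f(x0 - dx, u0)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x0, u0 + du) - f(x0, u0 - du)) / (2 * eps)
    return A, B

# Toy nonlinear model: damped pendulum with torque input.
f = lambda x, u: np.array([x[1], -9.81 * np.sin(x[0]) - 0.1 * x[1] + u[0]])
A, B = linearize(f, x0=np.array([0.0, 0.0]), u0=np.array([0.0]))
print(A); print(B)
```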
The Spiral-Interactive Program Evaluation Model.
ERIC Educational Resources Information Center
Khaleel, Ibrahim Adamu
1988-01-01
Describes the spiral interactive program evaluation model, which is designed to evaluate vocational-technical education programs in secondary schools in Nigeria. Program evaluation is defined; utility oriented and process oriented models for evaluation are described; and internal and external evaluative factors and variables that define each…
Puerto Rico water resources planning model program description
Moody, D.W.; Maddock, Thomas; Karlinger, M.R.; Lloyd, J.J.
1973-01-01
Because the use of the Mathematical Programming System-Extended (MPSX) to solve large linear and mixed-integer programs requires the preparation of many input data cards, a matrix generator program that produces the MPSX input data from a much more limited set of data may expedite the use of the mixed-integer programming optimization technique. The Model Definition and Control Program (MODCOP) is intended to assist a planner in preparing MPSX input data for the Puerto Rico Water Resources Planning Model. The model utilizes a mixed-integer mathematical program to identify a minimum-present-cost set of water resources projects (diversions, reservoirs, ground-water fields, desalinization plants, water treatment plants, and inter-basin transfers of water) that will meet a set of future water demands, and to determine their sequence of construction. While MODCOP was specifically written to generate MPSX input data for the planning model described in this report, the program can easily be modified to reflect changes in the model's mathematical structure.
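For readers unfamiliar with the model class, here is a toy mixed-integer program in the same spirit, with scipy standing in for MPSX; all numbers are invented and the real planning model is far larger.

```python
# Toy mixed-integer program in the spirit of the planning model: pick a
# minimum-cost subset of water projects whose combined yield meets demand.
# (scipy stands in for MPSX; the numbers are invented.)
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

cost = np.array([40.0, 65.0, 30.0, 90.0])    # present cost per project
yields = np.array([10.0, 20.0, 8.0, 30.0])   # water supplied per project
demand = 33.0

# yields @ x >= demand, with x binary (build / don't build).
supply = LinearConstraint(yields[np.newaxis, :], lb=demand, ub=np.inf)
res = milp(c=cost, constraints=supply,
           integrality=np.ones(4),           # all variables integer...
           bounds=Bounds(0, 1))              # ...and binary
print(res.x, res.fun)                        # selected projects, total cost
```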
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bergen, Benjamin Karl
This is the PDF of a PowerPoint presentation from a teleconference on Los Alamos programming models. It starts by listing the assumptions behind the programming models and then details a hierarchical programming model at the System Level and Node Level. It then details how this maps to their internal nomenclature. Finally, a list is given of what they are currently doing in this regard.
Snyder, Kimberly; Rieker, Patricia P.
2014-01-01
Functioning program infrastructure is necessary for achieving public health outcomes. It is what supports program capacity, implementation, and sustainability. The public health program infrastructure model presented in this article is grounded in data from a broader evaluation of 18 state tobacco control programs and previous work. The newly developed Component Model of Infrastructure (CMI) addresses the limitations of a previous model and contains 5 core components (multilevel leadership, managed resources, engaged data, responsive plans and planning, networked partnerships) and 3 supporting components (strategic understanding, operations, contextual influences). The CMI is a practical, implementation-focused model applicable across public health programs, enabling linkages to capacity, sustainability, and outcome measurement. PMID:24922125
Testing the structure of a hydrological model using Genetic Programming
NASA Astrophysics Data System (ADS)
Selle, Benny; Muttil, Nitin
2011-01-01
Genetic Programming is able to systematically explore many alternative model structures of different complexity from available input and response data. We hypothesised that Genetic Programming can be used to test the structure of hydrological models and to identify dominant processes in hydrological systems. To test this, Genetic Programming was used to analyse a data set from a lysimeter experiment in southeastern Australia. The lysimeter experiment was conducted to quantify the deep percolation response under surface-irrigated pasture to different soil types, water table depths and water ponding times during surface irrigation. Using Genetic Programming, a simple model of deep percolation was recurrently evolved in multiple Genetic Programming runs. This simple and interpretable model supported the dominant process contributing to deep percolation represented in a conceptual model that was published earlier. Thus, this study shows that Genetic Programming can be used to evaluate the structure of hydrological models and to gain insight about the dominant processes in hydrological systems.
Model Checking Abstract PLEXIL Programs with SMART
NASA Technical Reports Server (NTRS)
Siminiceanu, Radu I.
2007-01-01
We describe a method to automatically generate discrete-state models of abstract Plan Execution Interchange Language (PLEXIL) programs that can be analyzed using model checking tools. Starting from a high-level description of a PLEXIL program or a family of programs with common characteristics, the generator lays the framework that models the principles of program execution. The concrete parts of the program are not automatically generated, but must be introduced by the modeler by hand. As a case study, we generate models to verify properties of the PLEXIL macro constructs that are introduced as shorthand notation. After an exhaustive analysis, we conclude that the macro definitions obey the intended semantics and behave as expected, contingent on a few specific requirements on the timing semantics of micro-steps in the concrete executive implementation.
A Field-Based Curriculum Model for Earth Science Teacher-Preparation Programs.
ERIC Educational Resources Information Center
Dubois, David D.
1979-01-01
This study proposed a model set of cognitive-behavioral objectives for field-based teacher education programs for earth science teachers. It describes field experience integration into teacher education programs. The model is also applicable for evaluation of earth science teacher education programs. (RE)
Model Entrepreneurship Programs. Special Publication Series No. 53.
ERIC Educational Resources Information Center
Ross, Novella; Kurth, Paula
This collection contains descriptions of model entrepreneurship education programs submitted by 10 states. The first section deals with the Colorado Network of Small Business Programs and the Colorado Entrepreneurship Education Infusion Project. Described next is a model teacher education program in entrepreneurship education that was implemented…
User's guide to the western spruce budworm modeling system
Nicholas L. Crookston; J. J. Colbert; Paul W. Thomas; Katharine A. Sheehan; William P. Kemp
1990-01-01
The Budworm Modeling System is a set of four computer programs: The Budworm Dynamics Model, the Prognosis-Budworm Dynamics Model, the Prognosis-Budworm Damage Model, and the Parallel Processing-Budworm Dynamics Model. Input to the first three programs and the output produced are described in this guide. A guide to the fourth program will be published separately....
EvoBuild: A Quickstart Toolkit for Programming Agent-Based Models of Evolutionary Processes
NASA Astrophysics Data System (ADS)
Wagh, Aditi; Wilensky, Uri
2018-04-01
Extensive research has shown that one of the benefits of programming to learn about scientific phenomena is that it facilitates learning about the mechanisms underlying the phenomenon. However, using programming activities in classrooms is associated with costs, such as requiring additional time to learn to program or students needing prior experience with programming. This paper presents a class of programming environments that we call quickstart: environments with a negligible threshold for entry into programming and a modest ceiling. We posit that such environments can provide the benefits of programming for learning without incurring the associated costs for novice programmers. To support this claim, we present a design-based research study conducted to compare programming models of evolutionary processes with a quickstart toolkit against exploring pre-built models of the same processes. The study was conducted in six seventh-grade science classes in two schools. Students in the programming condition used EvoBuild, a quickstart toolkit for programming agent-based models of evolutionary processes, to build their NetLogo models. Students in the exploration condition used pre-built NetLogo models. We demonstrate that although students came from a range of academic backgrounds without prior programming experience, and all students spent the same number of class periods on the activities, including the time taken to learn programming in this environment, EvoBuild students showed greater learning about evolutionary mechanisms. We discuss the implications of this work for design research on programming environments in K-12 science education.
Development of a program to fit data to a new logistic model for microbial growth.
Fujikawa, Hiroshi; Kano, Yoshihiro
2009-06-01
Recently, we developed a mathematical model for microbial growth in food. The model successfully predicted microbial growth under various temperature patterns. In this study, we developed a program to fit data to the model with a spreadsheet program, Microsoft Excel. Users can instantly obtain fitted curves by inputting growth data and choosing the slope portion of a curve. The program can also estimate growth parameters, including the rate constant of growth and the lag period. This program would be a useful tool for analyzing growth data and further predicting microbial growth.
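The same fitting step can be sketched outside Excel; the example below fits the standard logistic curve with scipy and reports the growth rate constant. Note the paper's model is an extended logistic, so this is illustrative only, with synthetic data.

```python
# Sketch of the fitting step in Python rather than Excel, using the standard
# logistic curve (the paper's model is an extended logistic; this is only
# illustrative, with synthetic data).
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, n0, nmax, r):
    return nmax / (1 + (nmax / n0 - 1) * np.exp(-r * t))

t = np.array([0, 2, 4, 6, 8, 10, 12], dtype=float)          # hours
logN = np.array([3.0, 3.1, 3.6, 4.8, 6.0, 6.7, 6.9])        # log CFU/g

params, _ = curve_fit(logistic, t, logN, p0=(3.0, 7.0, 1.0))
n0, nmax, r = params
print(f"rate constant r = {r:.3f} /h, initial = {n0:.2f}, max = {nmax:.2f}")
```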
Enhancements to the Branched Lagrangian Transport Modeling System
Jobson, Harvey E.
1997-01-01
The Branched Lagrangian Transport Model (BLTM) has received wide use within the U.S. Geological Survey over the past 10 years. This report documents the enhancements and modifications that have been made to this modeling system since it was first introduced. The programs in the modeling system are arranged into five levels: programs to generate time series of meteorological data (EQULTMP, SOLAR), programs to process time-series data (INTRP, MRG), programs to build input files for the transport model (BBLTM, BQUAL2E), the model with defined reaction kinetics (BLTM, QUAL2E), and post-processor plotting programs (CTPLT, CXPLT). An example application is presented to illustrate how the modeling system can be used to simulate 10 water-quality constituents in the Chattahoochee River below Atlanta, Georgia.
Jaegers, Lisa; Dale, Ann Marie; Weaver, Nancy; Buchholz, Bryan; Welch, Laura; Evanoff, Bradley
2014-03-01
Intervention studies in participatory ergonomics (PE) are often difficult to interpret due to limited descriptions of program planning and evaluation. In an ongoing PE program with floor layers, we developed a logic model to describe our program plan, and process and summative evaluations designed to describe the efficacy of the program. The logic model was a useful tool for describing the program elements and subsequent modifications. The process evaluation measured how well the program was delivered as intended, and revealed the need for program modifications. The summative evaluation provided early measures of the efficacy of the program as delivered. Inadequate information on program delivery may lead to erroneous conclusions about intervention efficacy due to Type III error. A logic model guided the delivery and evaluation of our intervention and provides useful information to aid interpretation of results. © 2013 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Kusumawati, Rosita; Subekti, Retno
2017-04-01
The fuzzy bi-objective linear programming (FBOLP) model is a bi-objective linear programming model over fuzzy numbers, in which the coefficients of the equations are fuzzy numbers. This model is proposed to solve the portfolio selection problem of generating an asset portfolio with the lowest risk and the highest expected return. The FBOLP model with normal fuzzy numbers for the risk and expected return of stocks is transformed into a linear programming (LP) model using a magnitude ranking function.
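Once a ranking function has replaced the fuzzy coefficients with crisp values, what remains is an ordinary LP; the sketch below solves a crisp return-maximization toy with scipy, with invented numbers and the risk objective omitted.

```python
# Sketch of the crisp LP left after a ranking function defuzzifies the
# coefficients: maximize expected return subject to a budget and per-asset
# caps. (Invented numbers; the FBOLP risk objective is omitted here.)
import numpy as np
from scipy.optimize import linprog

expected_return = np.array([0.08, 0.12, 0.05])     # defuzzified returns

res = linprog(
    c=-expected_return,                  # linprog minimizes, so negate
    A_eq=[[1.0, 1.0, 1.0]], b_eq=[1.0],  # weights spend the full budget
    bounds=[(0.0, 0.5)] * 3,             # at most 50% in any single asset
)
print(res.x, -res.fun)                   # optimal weights, expected return
```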
Two Models for Implementing Senior Mentor Programs in Academic Medical Settings
ERIC Educational Resources Information Center
Corwin, Sara J.; Bates, Tovah; Cohan, Mary; Bragg, Dawn S.; Roberts, Ellen
2007-01-01
This paper compares two models of undergraduate geriatric medical education utilizing senior mentoring programs. Descriptive, comparative multiple-case study was employed analyzing program documents, archival records, and focus group data. Themes were compared for similarities and differences between the two program models. Findings indicate that…
Programming While Construction of Engineering 3D Models of Complex Geometry
NASA Astrophysics Data System (ADS)
Kheyfets, A. L.
2017-11-01
The capabilities of constructing geometrically accurate computational 3D models through programming are presented. The construction of models of an architectural arch and a globoid worm gear is considered as an example. The models are designed in the AutoCAD package. Three construction programs are given. The first program designs a multi-section architectural arch; control of the arch's geometry through its main parameters is shown. The second program designs and studies the working surface of a globoid gear's worm; the article shows how to animate the formation of this surface. The third program forms the surface of a worm gear cavity, and the dynamics of cavity formation is studied. The programs are written in the AutoLisp programming language, and the program texts are provided.
Shaw, Tim; Barnet, Stewart; Mcgregor, Deborah; Avery, Jennifer
2015-01-01
Online learning is a primary delivery method for continuing health education programs. It is critical that programs have curriculum objectives linked to educational models that support learning. Using a proven educational modelling process ensures that curriculum objectives are met and that a solid basis for learning and assessment is achieved. Our objective was to develop an educational design model that produces an educationally sound program development plan for use by anyone involved in online course development. We describe the development of a generic educational model designed for continuing health education programs. The Knowledge, Process, Practice (KPP) model is founded on recognised educational theory and online education practice. This paper presents a step-by-step guide to using this model for program development that supports reliable learning and evaluation. The model follows a three-step approach, KPP, based on learning outcomes and supporting appropriate assessment activities. It provides a program structure for online or blended learning that is explicit, educationally defensible, and supports multiple assessment points for health professionals. The KPP model is based on best-practice educational design, using a structure that can be adapted for a variety of online or flexibly delivered postgraduate medical education programs.
Community Modeling Program for Space Weather: A CCMC Perspective
NASA Technical Reports Server (NTRS)
Hesse, Michael
2009-01-01
A community modeling program, which provides a forum for exchange and integration between modelers, has excellent potential for furthering our Space Weather modeling and forecasting capabilities. The design of such a program is of great importance to its success. In this presentation, we will argue that the most effective community modeling program should be focused on Space Weather-related objectives, and that it should be open and inclusive. The tremendous successes of prior community research activities further suggest that the most effective implementation of a new community modeling program should be based on community leadership, rather than on domination by individual institutions or centers. This presentation will provide an experience-based justification for these conclusions.
AutoCAD-To-NASTRAN Translator Program
NASA Technical Reports Server (NTRS)
Jones, A.
1989-01-01
Program facilitates creation of finite-element mathematical models from geometric entities. AutoCAD to NASTRAN translator (ACTON) computer program developed to facilitate quick generation of small finite-element mathematical models for use with NASTRAN finite-element modeling program. Reads geometric data of drawing from Data Exchange File (DXF) used in AutoCAD and other PC-based drafting programs. Written in Microsoft Quick-Basic (Version 2.0).
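A toy version of the translation idea can be sketched briefly: read LINE entities from a (heavily simplified) DXF tag/value stream and emit NASTRAN free-field GRID and CROD cards. Real DXF files carry far more structure, so this only illustrates the geometry-to-finite-element mapping, not ACTON itself.

```python
# Toy version of the ACTON idea: read LINE entities from a (much simplified)
# DXF tag/value stream and emit NASTRAN free-field GRID and CROD cards.
# A real DXF file has far more structure; this only sketches the translation.
def read_lines(dxf_pairs):
    """dxf_pairs: iterable of (group_code, value) tuples for LINE entities."""
    lines, current = [], {}
    for code, value in dxf_pairs:
        if code == 0:                       # group code 0 starts a new entity
            if current:
                lines.append(current)
                current = {}
        else:
            current[code] = float(value)
    if current:
        lines.append(current)
    # Group codes 10/20 = start x/y, 11/21 = end x/y.
    return [((e[10], e[20]), (e[11], e[21])) for e in lines]

pairs = [(0, "LINE"), (10, "0"), (20, "0"), (11, "5"), (21, "0"),
         (0, "LINE"), (10, "5"), (20, "0"), (11, "5"), (21, "3")]

grids, cards = {}, []
for eid, (p1, p2) in enumerate(read_lines(pairs), start=1):
    g1 = grids.setdefault(p1, len(grids) + 1)   # one GRID per unique point
    g2 = grids.setdefault(p2, len(grids) + 1)
    cards.append(f"CROD,{eid},1,{g1},{g2}")     # rod element between grids
for (x, y), gid in grids.items():
    cards.append(f"GRID,{gid},,{x},{y},0.0")
print("\n".join(cards))
```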
Ratcliffe, Michelle M
2012-08-01
Farm to School programs hold promise to address childhood obesity. These programs may increase students’ access to healthier foods, increase students’ knowledge of and desire to eat these foods, and increase their consumption of them. Implementing Farm to School programs requires the involvement of multiple people, including nutrition services, educators, and food producers. Because these groups have not traditionally worked together and each has different goals, it is important to demonstrate how Farm to School programs that are designed to decrease childhood obesity may also address others’ objectives, such as academic achievement and economic development. A logic model is an effective tool to help articulate a shared vision for how Farm to School programs may work to accomplish multiple goals. Furthermore, there is evidence that programs based on theory are more likely to be effective at changing individuals’ behaviors. Logic models based on theory may help to explain how a program works, aid in efficient and sustained implementation, and support the development of a coherent evaluation plan. This article presents a sample theory-based logic model for Farm to School programs. The presented logic model is informed by the polytheoretical model for food and garden-based education in school settings (PMFGBE). The logic model has been applied to multiple settings, including Farm to School program development and evaluation in urban and rural school districts. This article also includes a brief discussion on the development of the PMFGBE, a detailed explanation of how Farm to School programs may enhance the curricular, physical, and social learning environments of schools, and suggestions for the applicability of the logic model for practitioners, researchers, and policy makers.
Pope, Bernard J; Fitch, Blake G; Pitman, Michael C; Rice, John J; Reumann, Matthias
2011-01-01
Future multiscale and multiphysics models must use the power of high performance computing (HPC) systems to enable research into human disease, translational medical science, and treatment. Previously we showed that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message passing processes (e.g. the message passing interface (MPI)) with multithreading (e.g. OpenMP, POSIX pthreads). The objective of this work is to compare the performance of such hybrid programming models when applied to the simulation of a lightweight multiscale cardiac model. Our results show that the hybrid models do not perform favourably when compared to an implementation using only MPI which is in contrast to our results using complex physiological models. Thus, with regards to lightweight multiscale cardiac models, the user may not need to increase programming complexity by using a hybrid programming approach. However, considering that model complexity will increase as well as the HPC system size in both node count and number of cores per node, it is still foreseeable that we will achieve faster than real time multiscale cardiac simulations on these systems using hybrid programming models.
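The hybrid structure being benchmarked can be sketched in Python, assuming mpi4py: message-passing processes that each run a thread pool over their local data. In C one would pair MPI with OpenMP; Python's GIL limits the threading speedup here, so this shows only the structure, not the performance trade-off the paper measures.

```python
# Structural sketch of a hybrid programming model: message-passing processes
# that each run a pool of threads over their local data. (Run with mpiexec;
# in C this would be MPI + OpenMP. Python's GIL limits threading speedup, so
# this illustrates structure, not performance.)
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Distributed level: each MPI rank owns one chunk of the cells.
local_cells = np.arange(rank * 1000, (rank + 1) * 1000, dtype=float)

def update(chunk):
    return np.sum(np.sin(chunk))            # stand-in for a cell-model step

# Shared-memory level: threads work on slices of the rank-local chunk.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(update, np.array_split(local_cells, 4)))

# Combine across threads, then across processes.
total = comm.allreduce(sum(partials), op=MPI.SUM)
if rank == 0:
    print("global sum:", total)
```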
Elementary Teacher Training Models.
ERIC Educational Resources Information Center
Blewett, Evelyn J., Ed.
This collection of articles contains descriptions of nine elementary teacher training program models conducted at universities throughout the United States. The articles include the following: (a) "The University of Toledo Model Program," by George E. Dickson; (b) "The Florida State University Model Program," by G. Wesley Sowards; (c) "The…
Life prediction and constitutive models for engine hot section anisotropic materials program
NASA Technical Reports Server (NTRS)
Nissley, D. M.; Meyer, T. G.
1992-01-01
This report presents the results from a 35-month period of a program designed to develop generic constitutive and life prediction approaches and models for nickel-based single crystal gas turbine airfoils. The program is composed of a base program and an optional program. The base program addresses the high temperature coated single crystal regime above the airfoil root platform. The optional program investigates the low temperature uncoated single crystal regime below the airfoil root platform, including the notched conditions of the airfoil attachment. Both base and option programs involve experimental and analytical efforts. Results from uniaxial constitutive and fatigue life experiments of coated and uncoated PWA 1480 single crystal material form the basis for the analytical modeling effort. Four single crystal primary orientations were used in the experiments: (001), (011), (111), and (213). Specific secondary orientations were also selected for the notched experiments in the optional program. Constitutive models for an overlay coating and PWA 1480 single crystal material were developed based on isothermal hysteresis loop data and verified using thermomechanical (TMF) hysteresis loop data. A fatigue life approach and life models were selected for TMF crack initiation of coated PWA 1480. An initial life model used to correlate smooth and notched fatigue data obtained in the option program shows promise. Computer software incorporating the overlay coating and PWA 1480 constitutive models was developed.
Evans, Richard I
2003-01-01
For the past several years the author and his colleagues have explored how social psychological constructs and theoretical models can be applied to the prevention of health-threatening behaviors in adolescents. In examining the need for the development of gambling prevention programs for adolescents, it might be of value to consider applying such constructs and theoretical models as a foundation for developing prevention programs for this emerging problem behavior among adolescents. To provide perspective to the reader, the present paper reviews the history of various psychosocial models and constructs generic to programs directed at prevention of substance abuse in adolescents. A brief history of some of these models, possibly the most applicable to gambling prevention programs, is presented. Social inoculation, reasoned action, planned behavior, and problem behavior theory are among those discussed. Some deficits of these models are also articulated. How such models may have relevance to developing programs for prevention of problem gambling in adolescents is also discussed. However, the inherent differences between gambling and more directly health-threatening behaviors such as substance abuse must, of course, be seriously considered in utilizing such models. Most current gambling prevention programs have seldom been guided by theoretical models. Developers of gambling prevention programs should consider theoretical foundations, particularly since such foundations not only provide a guide for programs but may become critical tools in evaluating their effectiveness.
Trained simulated ultrasound patients: medical students as models, learners, and teachers.
Blickendorf, J Matthew; Adkins, Eric J; Boulger, Creagh; Bahner, David P
2014-01-01
Medical educators must develop ultrasound education programs to ensure that future physicians are prepared to face the changing demands of clinical practice. It can be challenging to find human models for hands-on scanning sessions. This article outlines an educational model from a large university medical center that uses medical students to fulfill the need for human models. During the 2011-2012 academic year, medical students from The Ohio State University College of Medicine served as trained simulated ultrasound patients (TSUP) for hands-on scanning sessions held by the college and many residency programs. The extracurricular program is voluntary and coordinated by medical students with faculty supervision. Students receive a longitudinal didactic and hands-on ultrasound education program as an incentive for serving as a TSUP. The College of Medicine and 7 residency programs used the program, which included 47 second-year and 7 first-year student volunteers. Participation has increased annually because of the program's ease, reliability, and cost savings in providing normal anatomic models for ultrasound education programs. A key success of this program is its inherent reproducibility, as a new class of eager students constitutes the volunteer pool each year. The TSUP program is a feasible and sustainable method of fulfilling the need for normal anatomic ultrasound models while serving as a valuable extracurricular ultrasound education program for medical students. The program facilitates the coordination of ultrasound education programs by educators at the undergraduate and graduate levels.
The FITS model office ergonomics program: a model for best practice.
Chim, Justine M Y
2014-01-01
An effective office ergonomics program can predict positive results in reducing musculoskeletal injury rates, enhancing productivity, and improving staff well-being and job satisfaction. Its objective is to provide a systematic solution to manage the potential risk of musculoskeletal disorders among computer users in an office setting. The FITS Model Office Ergonomics Program was developed, drawing on the legislative requirements for promoting the health and safety of workers using computers for extended periods as well as previous research findings. The Model is built on practical industrial knowledge in ergonomics, occupational health and safety management, and human resources management in Hong Kong and overseas. This paper proposes a comprehensive office ergonomics program, the FITS Model, which considers (1) Furniture Evaluation and Selection; (2) Individual Workstation Assessment; (3) Training and Education; and (4) Stretching Exercises and Rest Breaks as elements of an effective program. An experienced ergonomics practitioner should be included in the program design and implementation. Through the FITS Model Office Ergonomics Program, the risk of musculoskeletal disorders among computer users can be eliminated or minimized, and workplace health and safety and employees' wellness enhanced.
ERIC Educational Resources Information Center
Beckman, Brenda Marshall; Ventura-Merkel, Catherine
In an effort to more effectively disseminate information about community college programs for older adults, this directory was developed for three purposes: to make guidelines available for establishing, expanding, or revising programs; to offer a selection of successful programming models; and to provide a compendium of existing programs. Part I…
Practical Application of Model-based Programming and State-based Architecture to Space Missions
NASA Technical Reports Server (NTRS)
Horvath, Gregory; Ingham, Michel; Chung, Seung; Martin, Oliver; Williams, Brian
2006-01-01
A viewgraph presentation to develop models from systems engineers that accomplish mission objectives and manage the health of the system is shown. The topics include: 1) Overview; 2) Motivation; 3) Objective/Vision; 4) Approach; 5) Background: The Mission Data System; 6) Background: State-based Control Architecture System; 7) Background: State Analysis; 8) Overview of State Analysis; 9) Background: MDS Software Frameworks; 10) Background: Model-based Programming; 11) Background: Titan Model-based Executive; 12) Model-based Execution Architecture; 13) Compatibility Analysis of MDS and Titan Architectures; 14) Integrating Model-based Programming and Execution into the Architecture; 15) State Analysis and Modeling; 16) IMU Subsystem State Effects Diagram; 17) Titan Subsystem Model: IMU Health; 18) Integrating Model-based Programming and Execution into the Software IMU; 19) Testing Program; 20) Computationally Tractable State Estimation & Fault Diagnosis; 21) Diagnostic Algorithm Performance; 22) Integration and Test Issues; 23) Demonstrated Benefits; and 24) Next Steps
Development of a sustainable community-based dental education program.
Piskorowski, Wilhelm A; Fitzgerald, Mark; Mastey, Jerry; Krell, Rachel E
2011-08-01
Increasing the use of community-based programs is an important trend in improving dental education to meet the needs of students and the public. To support this trend, understanding the history of programs that have established successful models for community-based education is valuable for the creation and development of new programs. The community-based education model of the University of Michigan School of Dentistry (UMSOD) offers a useful guide for understanding the essential steps and challenges involved in developing a successful program. Initial steps in program development were as follows: raising funds, selecting an outreach clinical model, and recruiting clinics to become partners. As the program developed, the challenges of creating a sustainable financial model with the highest educational value required the inclusion of new clinical settings and the creation of a unique revenue-sharing model. Since the beginning of the community-based program at UMSOD in 2000, the number of community partners has increased to twenty-seven clinics, and students have treated thousands of patients in need. Fourth-year students now spend a minimum of ten weeks in community-based clinical education. The community-based program at UMSOD demonstrates the value of service-based education and offers a sustainable model for the development of future programs.
ModelMate - A graphical user interface for model analysis
Banta, Edward R.
2011-01-01
ModelMate is a graphical user interface designed to facilitate use of model-analysis programs with models. This initial version of ModelMate supports one model-analysis program, UCODE_2005, and one model software program, MODFLOW-2005. ModelMate can be used to prepare input files for UCODE_2005, run UCODE_2005, and display analysis results. A link to the GW_Chart graphing program facilitates visual interpretation of results. ModelMate includes capabilities for organizing directories used with the parallel-processing capabilities of UCODE_2005 and for maintaining files in those directories to be identical to a set of files in a master directory. ModelMate can be used on its own or in conjunction with ModelMuse, a graphical user interface for MODFLOW-2005 and PHAST.
Dung Tuan Nguyen
2012-01-01
Forest harvest scheduling has been modeled using deterministic and stochastic programming models. Past models seldom address explicit spatial forest management concerns under the influence of natural disturbances. In this research study, we employ multistage full recourse stochastic programming models to explore the challenges and advantages of building spatial...
Designing a Model of Vocational Training Programs for Disables through ODL
ERIC Educational Resources Information Center
Majid, Shaista; Razzak, Adeela
2015-01-01
This study was conducted to design a model of vocational training programs for disables. For this purpose a desk review was carried out and the vocational training models/programs of Israel, U.K., Vietnam, Japan and Thailand were analyzed to form a conceptual framework of the model. Keeping in view the local conditions/requirements, a model of…
ERIC Educational Resources Information Center
Martin, Ian; Carey, John
2014-01-01
A logic model was developed based on an analysis of the 2012 American School Counselor Association (ASCA) National Model in order to provide direction for program evaluation initiatives. The logic model identified three outcomes (increased student achievement/gap reduction, increased school counseling program resources, and systemic change and…
Emile: Software-Realized Scaffolding for Science Learners Programming in Mixed Media
NASA Astrophysics Data System (ADS)
Guzdial, Mark Joseph
Emile is a computer program that facilitates students using programming to create models of kinematics (physics of motion without forces) and executing these models as simulations. Emile facilitates student programming and model-building with software-realized scaffolding (SRS). Emile integrates a range of SRS and provides mechanisms to fade (or diminish) most scaffolding. By fading Emile's SRS, students can adapt support to their individual needs. Programming in Emile involves graphic and text elements (as compared with more traditional text-based programming). For example, students create graphical objects which can be dragged on the screen and, when dropped, fall as if in a gravitational field. Emile supports a simplified, HyperCard-like mixed media programming framework. Scaffolding is defined as support which enables student performance (called the immediate benefit of scaffolding) and which facilitates student learning (called the lasting benefit of scaffolding). Scaffolding provides this support through three methods: modeling, coaching, and eliciting articulation. For example, Emile has tools that structure the programming task (modeling), menus that identify the next step in the programming and model-building process (coaching), and prompts for student plans and predictions (eliciting articulation). Five students used Emile in a summer workshop (45 hours total) focusing on creating kinematics simulations and multimedia demonstrations. Evaluation of Emile's scaffolding addressed use of scaffolding and the expected immediate and lasting benefits. Emile created records of student interactions (log files) which were analyzed to determine how students used Emile's SRS and how they faded that scaffolding. Student projects and articulations about those projects were analyzed to assess the success of students' model-building and programming activities. Clinical interviews were conducted before and after the workshop to determine students' conceptualizations of kinematics and programming and how they changed. The results indicate that students were successful at model-building and programming, learned physics and programming, and used and faded Emile's scaffolding over time. These results are from a small sample of self-selected, highly motivated students. Nonetheless, this study provides a theory and operationalization for SRS, an example of a successful model-building environment, and a description of student use of mixed media programming.
Energy efficiency in nonprofit agencies: Creating effective program models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, M.A.; Prindle, B.; Scherr, M.I.
Nonprofit agencies are a critical component of the health and human services system in the US. Programs that offer energy efficiency services to nonprofits have clearly demonstrated that, with minimal investment, these agencies can reduce their energy consumption by ten to thirty percent. This energy conservation potential motivated the Department of Energy and Oak Ridge National Laboratory to conceive a project to help states develop energy efficiency programs for nonprofits. The purpose of the project was two-fold: (1) to analyze existing programs to determine which design and delivery mechanisms are particularly effective, and (2) to create model programs for states to follow in tailoring their own plans for helping nonprofits with energy efficiency programs. Twelve existing programs were reviewed, and three model programs were devised and put into operation. The model programs provide various forms of financial assistance to nonprofits and serve as a source of information on energy efficiency as well. After examining the results from the model programs (which are still ongoing) and from the existing programs, several "replicability factors" were developed for use in the implementation of programs by other states. These factors -- some concrete and practical, others more generalized -- serve as guidelines for states devising programs based on their own particular needs and resources.
A Comparison of Three Programming Models for Adaptive Applications
NASA Technical Reports Server (NTRS)
Shan, Hong-Zhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak; Kwak, Dochan (Technical Monitor)
2000-01-01
We study the performance and programming effort for two major classes of adaptive applications under three leading parallel programming models. We find that all three models can achieve scalable performance on state-of-the-art multiprocessor machines. The basic parallel algorithms needed for different programming models to deliver their best performance are similar, but the implementations differ greatly, far beyond the use of explicit messages versus implicit loads/stores. Compared with MPI and SHMEM, CC-SAS (cache-coherent shared address space) provides substantial ease of programming at the conceptual and program orchestration levels, which often leads to performance gains. However, it may also suffer from the poor spatial locality of physically distributed shared data on large numbers of processors. Our CC-SAS implementation of the PARMETIS partitioner itself runs faster than in the other two programming models, and generates a more balanced result for our application.
A Parallel Multiclassification Algorithm for Big Data Using an Extreme Learning Machine.
Duan, Mingxing; Li, Kenli; Liao, Xiangke; Li, Keqin
2018-06-01
As data sets become larger and more complicated, an extreme learning machine (ELM) that runs in a traditional serial environment cannot realize its ability to be fast and effective. Although a parallel ELM (PELM) based on MapReduce to process large-scale data shows more efficient learning speed than identical ELM algorithms in a serial environment, some operations, such as intermediate results stored on disks and multiple copies for each task, are indispensable, and these operations create a large amount of extra overhead and degrade the learning speed and efficiency of the PELMs. In this paper, an efficient ELM based on the Spark framework (SELM), which includes three parallel subalgorithms, is proposed for big data classification. By partitioning the corresponding data sets reasonably, the three parallel subalgorithms, including the hidden layer output matrix calculation and matrix decomposition algorithms, perform most of the computations locally. At the same time, they retain the intermediate results in distributed memory and cache the diagonal matrix as broadcast variables instead of several copies for each task to reduce a large amount of the costs, and these actions strengthen the learning ability of the SELM. Finally, we implement our SELM algorithm to classify large data sets. Extensive experiments have been conducted to validate the effectiveness of the proposed algorithms. As shown, our SELM achieves progressively larger speedups on clusters of ten, 15, 20, 25, 30, and 35 nodes.
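To make the underlying computation concrete, the following is a minimal serial sketch of the core ELM training step that SELM distributes: building the hidden layer output matrix H from random input weights and solving a regularized least-squares problem for the output weights. This is an illustrative Python/numpy sketch, not the paper's code; all names and parameter values are assumptions, and the distributed partitioning, caching, and broadcast logic described in the abstract is omitted.

import numpy as np

def elm_train(X, T, n_hidden=64, C=1.0, rng=np.random.default_rng(0)):
    """Serial ELM training: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))   # random input weights
    b = rng.standard_normal(n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                            # hidden layer output matrix
    # Regularized least squares: beta = (H^T H + I/C)^{-1} H^T T
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage with synthetic data
X = np.random.default_rng(1).random((200, 5))
T = (X.sum(axis=1, keepdims=True) > 2.5).astype(float)
W, b, beta = elm_train(X, T)

In a Spark-style formulation, computing H and the Gram matrix H^T H over data partitions is what gets parallelized across workers.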
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Tianzhen; Chen, Yixing; Belafi, Zsofia; ...
2017-07-27
Occupant behavior (OB) in buildings is a leading factor influencing energy use in buildings. Quantifying this influence requires the integration of OB models with building performance simulation (BPS). This study reviews approaches to representing and implementing OB models in today's popular BPS programs, and discusses weaknesses and strengths of these approaches and key issues in integrating OB models with BPS programs. Two of the key findings are: (1) a common data model is needed to standardize the representation of OB models, enabling their flexibility and exchange among BPS programs and user applications; the data model can be implemented using a standard syntax (e.g., in the form of an XML schema), and (2) a modular software implementation of OB models, such as functional mock-up units for co-simulation, adopting the common data model, has advantages in providing a robust and interoperable integration with multiple BPS programs. Such common OB model representation and implementation approaches help standardize the input structures of OB models, enable collaborative development of a shared library of OB models, and allow for rapid and widespread integration of OB models with BPS programs to improve the simulation of occupant behavior and quantification of their impact on building performance.
An Instructional Approach to Modeling in Microevolution.
ERIC Educational Resources Information Center
Thompson, Steven R.
1988-01-01
Describes an approach to teaching population genetics and evolution and some of the ways models can be used to enhance understanding of the processes being studied. Discusses the instructional plan, and the use of models including utility programs and analysis with models. Provided are a basic program and sample program outputs. (CW)
Development and validation of a general purpose linearization program for rigid aircraft models
NASA Technical Reports Server (NTRS)
Duke, E. L.; Antoniewicz, R. F.
1985-01-01
A FORTRAN program that provides the user with a powerful and flexible tool for the linearization of aircraft models is discussed. The program LINEAR numerically determines a linear systems model using nonlinear equations of motion and a user-supplied, nonlinear aerodynamic model. The system model determined by LINEAR consists of matrices for both the state and observation equations. The program has been designed to allow easy selection and definition of the state, control, and observation variables to be used in a particular model. Also included in the report is a comparison of linear and nonlinear models for a high performance aircraft.
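The abstract does not reproduce LINEAR's algorithm, but numerical linearization of this kind is typically done with finite differences about an operating point. The sketch below is a generic central-difference version in Python, not the FORTRAN implementation; the function name, step size, and the toy pendulum dynamics are illustrative assumptions.

import numpy as np

def linearize(f, x0, u0, eps=1e-6):
    """Central-difference linearization of xdot = f(x, u) about (x0, u0).
    Returns the state matrix A = df/dx and the control matrix B = df/du."""
    n, m = len(x0), len(u0)
    A, B = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x0 + dx, u0) - f(x0 - dx, u0)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x0, u0 + du) - f(x0, u0 - du)) / (2 * eps)
    return A, B

# Toy nonlinear dynamics: a damped pendulum with torque input.
f = lambda x, u: np.array([x[1], -9.81 * np.sin(x[0]) - 0.1 * x[1] + u[0]])
A, B = linearize(f, np.array([0.0, 0.0]), np.array([0.0]))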
Genetic Programming as Alternative for Predicting Development Effort of Individual Software Projects
Chavoya, Arturo; Lopez-Martin, Cuauhtemoc; Andalon-Garcia, Irma R.; Meda-Campaña, M. E.
2012-01-01
Statistical and genetic programming techniques have been used to predict the software development effort of large software projects. In this paper, a genetic programming model was used for predicting the effort required in individually developed projects. Accuracy obtained from a genetic programming model was compared against one generated from the application of a statistical regression model. A sample of 219 projects developed by 71 practitioners was used for generating the two models, whereas another sample of 130 projects developed by 38 practitioners was used for validating them. The models used two kinds of lines of code as well as programming language experience as independent variables. Accuracy results from the model obtained with genetic programming suggest that it could be used to predict the software development effort of individual projects when these projects have been developed in a disciplined manner within a development-controlled environment. PMID:23226305
Nagy, Christopher J; Fitzgerald, Brian M; Kraus, Gregory P
2014-01-01
Anesthesiology residency programs will be expected to have Milestones-based evaluation systems in place by July 2014 as part of the Next Accreditation System. The San Antonio Uniformed Services Health Education Consortium (SAUSHEC) anesthesiology residency program developed and implemented a Milestones-based feedback and evaluation system a year ahead of schedule. It has been named the Milestone-specific, Observed Data points for Evaluating Levels of performance (MODEL) assessment strategy. The "MODEL Menu" and the "MODEL Blueprint" are tools that other anesthesiology residency programs can use in developing their own Milestones-based feedback and evaluation systems prior to ACGME-required implementation. Data from our early experience with the streamlined MODEL blueprint assessment strategy showed substantially improved faculty compliance with reporting requirements. The MODEL assessment strategy provides programs with a workable assessment method for residents, and important Milestones data points to programs for ACGME reporting.
SMP: A solid modeling program version 2.0
NASA Technical Reports Server (NTRS)
Randall, D. P.; Jones, K. H.; Vonofenheim, W. H.; Gates, R. L.; Matthews, C. G.
1986-01-01
The Solid Modeling Program (SMP) provides the capability to model complex solid objects through the composition of primitive geometric entities. In addition to the construction of solid models, SMP has extensive facilities for model editing, display, and analysis. The geometric model produced by the software system can be output in a format compatible with existing analysis programs such as PATRAN-G. The present version of the SMP software supports six primitives: boxes, cones, spheres, paraboloids, tori, and trusses. The details for creating each of the major primitive types are presented. The analysis capabilities of SMP, including interfaces to existing analysis programs, are discussed.
Orzol, Leonard L.; McGrath, Timothy S.
1992-01-01
This report documents modifications to the U.S. Geological Survey modular, three-dimensional, finite-difference, ground-water flow model, commonly called MODFLOW, so that it can read and write files used by a geographic information system (GIS). The modified model program is called MODFLOWARC. Simulation programs such as MODFLOW generally require large amounts of input data and produce large amounts of output data. Viewing data graphically, generating head contours, and creating or editing model data arrays such as hydraulic conductivity are examples of tasks that currently are performed either by the use of independent software packages or by tedious manual editing, manipulating, and transferring of data. GIS programs are commonly used to facilitate preparation of the model input data and analysis of the model output data; however, auxiliary programs are frequently required to translate data between programs. Data translations are required when different programs use different data formats. Thus, the user might use GIS techniques to create model input data, run a translation program to convert input data into a format compatible with the ground-water flow model, run the model, run a translation program to convert the model output into the correct format for GIS, and use GIS to display and analyze this output. MODFLOWARC avoids the two translation steps and transfers data directly to and from the ground-water flow model. This report documents the design and use of MODFLOWARC and includes instructions for data input/output of the Basic, Block-centered flow, River, Recharge, Well, Drain, Evapotranspiration, General-head boundary, and Streamflow-routing packages. The modification to MODFLOW and the Streamflow-Routing package was minimized. Flow charts and computer-program code describe the modifications to the original computer codes for each of these packages. Appendix A contains a discussion of the operation of MODFLOWARC using a sample problem.
Solid rocket booster performance evaluation model. Volume 2: Users manual
NASA Technical Reports Server (NTRS)
1974-01-01
This users manual for the solid rocket booster performance evaluation model (SRB-II) contains descriptions of the model, the program options, the required program inputs, the program output format and the program error messages. SRB-II is written in FORTRAN and is operational on both the IBM 370/155 and the MSFC UNIVAC 1108 computers.
ERIC Educational Resources Information Center
Reider, David; Knestis, Kirk; Malyn-Smith, Joyce
2016-01-01
This article proposes a STEM workforce education logic model, tailored to the particular context of the National Science Foundation's Innovative Technology Experiences for Students and Teachers (ITEST) program. This model aims to help program designers and researchers address challenges particular to designing, implementing, and studying education…
ERIC Educational Resources Information Center
Rothwell, William J.; Cookson, Peter S.
This book introduces key issues in program planning as practiced in business and educational settings. Two chapters in part 1 introduce two foundational models--Lifelong Education Program Planning (LEPP) model and Contingency-Based Program Planning--and provide background on models designed by Houle, Knowles, Boyle, and Nadler. Parts 2-5 focus on…
ERIC Educational Resources Information Center
Steinke, Theodore R.
This paper traces the historical development of cartography graduate programs, establishes an evolutionary model, and evaluates the model to determine if it has some utility today for the development of programs capable of producing highly skilled cartographers. Cartography is defined to include traditional cartography, computer cartography,…
Flexible language constructs for large parallel programs
NASA Technical Reports Server (NTRS)
Rosing, Matthew; Schnabel, Robert
1993-01-01
The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and discussion of some of the critical implementation details is given.
Life prediction and constitutive models for engine hot section anisotropic materials program
NASA Technical Reports Server (NTRS)
Nissley, D. M.; Meyer, T. G.; Walker, K. P.
1992-01-01
This report presents a summary of results from a 7-year program designed to develop generic constitutive and life prediction approaches and models for nickel-based single crystal gas turbine airfoils. The program was composed of a base program and an optional program. The base program addressed the high temperature coated single crystal regime above the airfoil root platform. The optional program investigated the low temperature uncoated single crystal regime below the airfoil root platform, including the notched conditions of the airfoil attachment. Both base and option programs involved experimental and analytical efforts. Results from uniaxial constitutive and fatigue life experiments of coated and uncoated PWA 1480 single crystal material formed the basis for the analytical modeling effort. Four single crystal primary orientations were used in the experiments: <001>, <011>, <111>, and <213>. Specific secondary orientations were also selected for the notched experiments in the optional program. Constitutive models for an overlay coating and PWA 1480 single crystal materials were developed based on isothermal hysteresis loop data and verified using thermomechanical fatigue (TMF) hysteresis loop data. A fatigue life approach and life models were developed for TMF crack initiation of coated PWA 1480. A life model was developed for smooth and notched fatigue in the option program. Finally, computer software incorporating the overlay coating and PWA 1480 constitutive and life models was developed.
Johnson, Victoria A; Ronan, Kevin R; Johnston, David M; Peace, Robin
2016-11-01
A main weakness in the evaluation of disaster education programs for children is evaluators' propensity to judge program effectiveness based on changes in children's knowledge. Few studies have articulated an explicit program theory of how children's education would achieve desired outcomes and impacts related to disaster risk reduction in households and communities. This article describes the advantages of constructing program theory models for the purpose of evaluating disaster education programs for children. Following a review of some potential frameworks for program theory development, including the logic model, the program theory matrix, and the stage step model, the article provides working examples of these frameworks. The first example is the development of a program theory matrix used in an evaluation of ShakeOut, an earthquake drill practiced in two Washington State school districts. The model illustrates a theory of action; specifically, the effectiveness of school earthquake drills in preventing injuries and deaths during disasters. The second example is the development of a stage step model used for a process evaluation of What's the Plan Stan?, a voluntary teaching resource distributed to all New Zealand primary schools for curricular integration of disaster education. The model illustrates a theory of use; specifically, expanding the reach of disaster education for children through increased promotion of the resource. The process of developing the program theory models for the purpose of evaluation planning is discussed, as well as the advantages and shortcomings of the theory-based approaches. © 2015 Society for Risk Analysis.
Nested ocean models: Work in progress
NASA Technical Reports Server (NTRS)
Perkins, A. Louise
1991-01-01
The ongoing work of combining three existing software programs into a nested grid oceanography model is detailed. The HYPER domain decomposition program, the SPEM ocean modeling program, and a quasi-geostrophic model written in England are being combined into a general ocean modeling facility. This facility will be used to test the viability and the capability of two-way nested grids in the North Atlantic.
AVHRR/1-FM Advanced Very High Resolution Radiometer
NASA Technical Reports Server (NTRS)
1979-01-01
The advanced very high resolution radiometer is discussed. The program covers design, construction, and test of a breadboard model, engineering model, protoflight model, mechanical/structural model, and a life test model. Special bench test and calibration equipment was developed for use on the program. The flight model program objectives were to fabricate, assemble and test four of the advanced very high resolution radiometers along with a bench cooler and collimator.
Semi-Markov adjunction to the Computer-Aided Markov Evaluator (CAME)
NASA Technical Reports Server (NTRS)
Rosch, Gene; Hutchins, Monica A.; Leong, Frank J.; Babcock, Philip S., IV
1988-01-01
The rule-based Computer-Aided Markov Evaluator (CAME) program was expanded in its ability to incorporate the effect of fault-handling processes into the construction of a reliability model. The fault-handling processes are modeled as semi-Markov events, and CAME constructs an appropriate semi-Markov model. To solve the model, the program outputs it in a form which can be directly solved with the Semi-Markov Unreliability Range Evaluator (SURE) program. As a means of evaluating the alterations made to the CAME program, the program is used to model the reliability of portions of the Integrated Airframe/Propulsion Control System Architecture (IAPSA 2) reference configuration. The reliability predictions are compared with a previous analysis. The results bear out the feasibility of utilizing CAME to generate appropriate semi-Markov models of fault-handling processes.
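For readers unfamiliar with this model class, the toy example below shows the kind of continuous-time Markov reliability model that tools like CAME construct and SURE solves, here evaluated directly with a matrix exponential in Python. The states, rates, and mission time are invented for illustration, and a true semi-Markov model (non-exponential holding times, as used for fault handling) would require a more general solver.

import numpy as np
from scipy.linalg import expm

# Toy 3-state Markov reliability model: 0 = both units up, 1 = one up, 2 = failed.
lam, mu = 1e-4, 1e-2                     # hypothetical failure and repair rates (per hour)
Q = np.array([[-2*lam, 2*lam,     0.0],
              [ mu,   -(mu+lam),  lam],
              [ 0.0,   0.0,       0.0]])  # state 2 is absorbing
p0 = np.array([1.0, 0.0, 0.0])           # start with both units operational
t = 1000.0                               # mission time in hours
p_t = p0 @ expm(Q * t)                   # state probabilities at time t
print("unreliability at t =", p_t[2])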
Meekers, Dominique; Rahaim, Stephen
2005-01-01
Background Over the past two decades, social marketing programs have become an important element of the national family planning and HIV prevention strategy in several developing countries. As yet, there has not been any comprehensive empirical assessment to determine which of several social marketing models is most effective for a given socio-economic context. Such an assessment is urgently needed to inform the design of future social marketing programs, and to avoid designing programs around an ineffective model. Methods This study addresses this issue using a database of annual statistics about reproductive health oriented social marketing programs in over 70 countries. In total, the database covers 555 years of program experience with social marketing programs that distribute and promote the use of oral contraceptives and condoms. Specifically, our analysis assesses to what extent the model used by different reproductive health social marketing programs has varied across different socio-economic contexts. We then use random effects regression to test in which socio-economic context each of the models is most successful at increasing use of socially marketed oral contraceptives and condoms. Results The results show that there has been a tendency to design reproductive health social marketing programs with a management structure that matches the local context. However, the evidence also shows that this has not always been the case. While socio-economic context clearly influences the effectiveness of some of the social marketing models, program maturity and the size of the target population appear equally important. Conclusions To maximize the effectiveness of future social marketing programs, it is essential that more effort is devoted to ensuring that such programs are designed using the model or approach that is most suitable for the local context. PMID:15676068
New reflective symmetry design capability in the JPL-IDEAS Structure Optimization Program
NASA Technical Reports Server (NTRS)
Strain, D.; Levy, R.
1986-01-01
The JPL-IDEAS antenna structure analysis and design optimization computer program was modified to process half structure models of symmetric structures subjected to arbitrary external static loads, synthesize the performance, and optimize the design of the full structure. Significant savings in computation time and cost (more than 50%) were achieved compared to the cost of full model computer runs. The addition of the new reflective symmetry analysis design capabilities to the IDEAS program allows processing of structure models whose size would otherwise prevent automated design optimization. The new program produced synthesized full model iterative design results identical to those of actual full model program executions at substantially reduced cost, time, and computer storage.
Miller, Robin L; Shinn, Marybeth
2005-06-01
The model of prevention science advocated by the Institute of Medicine (P. J. Mrazek & R. J. Haggerty, 1994) has not led to widespread adoption of prevention and promotion programs for four reasons. The model of dissemination of programs to communities fails to consider community and organizational capacity to implement programs, ignores the need for congruence in values between programs and host sites, displays a pro-innovation bias that undervalues indigenous practices, and assumes a simplistic model of how community organizations adopt innovations. To address these faults, researchers should locate, study, and help disseminate successful indigenous programs that fit community capacity and values. In addition, they should build on theoretical models of how locally developed programs work to make existing programs and policies more effective.
A thermal vacuum test optimization procedure
NASA Technical Reports Server (NTRS)
Kruger, R.; Norris, H. P.
1979-01-01
An analytical model was developed that can be used to establish certain parameters of a thermal vacuum environmental test program based on an optimization of program costs. This model takes the form of a computer program that interacts with the user to obtain certain input parameters. The program provides the user a list of pertinent information regarding an optimized test program and graphs of some of the parameters. The model is a first attempt in this area and includes numerous simplifications. The model appears useful as a general guide and provides a way of extrapolating past performance to future missions.
Cinelli, Renata L; Peralta, Louisa R
2015-01-01
There is growing support for the prosocial value of role modelling in programs for adolescents and the potentially positive impact role models can have on health and health behaviours in remote communities. Despite known benefits for remote outreach program recipients, there is limited literature on the outcomes of participation for role models. Twenty-four role models participated in a remote outreach program across four remote Aboriginal communities in the Northern Territory, Australia (100% recruitment). Role models participated in semi-structured one-on-one interviews. Transcripts were coded and underwent thematic analysis by both authors. Cultural training, Indigenous heritage and prior experience contributed to general feelings of preparedness, yet some role models experienced a level of culture shock, being confronted by how different the communities were from their home communities. Benefits of participation included exposure to and experience with remote Aboriginal peoples and communities, increased cultural knowledge, personal learning, forming and building relationships, and skill development. Effective role model programs designed for remote Indigenous youth can have positive outcomes for both role models and the program recipients. Cultural safety training is an important factor for preparing role models and for building their cultural competency for implementing health and education programs in remote Indigenous communities in Australia. This will maximise the opportunities for participants to achieve outcomes and minimise their culture shock.
ERIC Educational Resources Information Center
Campbell, Robert E.; And Others
This handbook presents management techniques, program ideas, and student activities for building comprehensive secondary career guidance programs. Part 1 (chapter 1) traces the history of guidance to set the stage for the current emphasis on comprehensive programs, summarizes four representative models for designing comprehensive programs, and…
Cullen, Patricia; Clapham, Kathleen; Byrne, Jake; Hunter, Kate; Senserrick, Teresa; Keay, Lisa; Ivers, Rebecca
2016-08-01
Evidence indicates that Aboriginal people are underrepresented among driver licence holders in New South Wales, which has been attributed to licensing barriers for Aboriginal people. The Driving Change program was developed to provide culturally responsive licensing services that engage Aboriginal communities and build local capacity. This paper outlines the formative evaluation of the program, including logic model construction and exploration of contextual factors. Purposive sampling was used to identify key informants (n=12) from a consultative committee of key stakeholders and program staff. Semi-structured interviews were transcribed and thematically analysed. Data from interviews informed development of the logic model. Participants demonstrated high level of support for the program and reported that it filled an important gap. The program context revealed systemic barriers to licensing that were correspondingly targeted by specific program outputs in the logic model. Addressing underlying assumptions of the program involved managing local capacity and support to strengthen implementation. This formative evaluation highlights the importance of exploring program context as a crucial first step in logic model construction. The consultation process assisted in clarifying program goals and ensuring that the program was responding to underlying systemic factors that contribute to inequitable licensing access for Aboriginal people. Copyright © 2016 Elsevier Ltd. All rights reserved.
A Component-based Programming Model for Composite, Distributed Applications
NASA Technical Reports Server (NTRS)
Eidson, Thomas M.; Bushnell, Dennis M. (Technical Monitor)
2001-01-01
The nature of scientific programming is evolving to larger, composite applications that are composed of smaller element applications. These composite applications are more frequently being targeted for distributed, heterogeneous networks of computers. They are most likely programmed by a group of developers. Software component technology and computational frameworks are being proposed and developed to meet the programming requirements of these new applications. Historically, programming systems have had a hard time being accepted by the scientific programming community. In this paper, a programming model is outlined that attempts to organize the software component concepts and fundamental programming entities into programming abstractions that will be better understood by the application developers. The programming model is designed to support computational frameworks that manage many of the tedious programming details, but also that allow sufficient programmer control to design an accurate, high-performance application.
A constructive Indian country response to the evidence-based program mandate.
Walker, R Dale; Bigelow, Douglas A
2011-01-01
Over the last 20 years, governmental mandates for preferentially funding evidence-based "model" practices and programs have become doctrine in some legislative bodies, federal agencies, and state agencies. It was assumed that what works in small-sample, controlled settings would work in all community settings, substantially improving safety, effectiveness, and value-for-money. The evidence-based "model" programs mandate has imposed immutable "core components," fidelity testing, alien programming and program developers, loss of familiar programs, and resource capacity requirements upon tribes, while infringing upon their tribal sovereignty and consultation rights. Tribal response in one state (Oregon) went through three phases: shock and rejection; proposing an alternative approach using criteria of cultural appropriateness, aspiring to evaluability; and adopting logic modeling. The state heard and accepted the argument that the tribal way of knowing is different and valid. Currently, a state-authorized tribal logic model and a review panel process are used to approve tribal best practices for state funding. This constructive response to the evidence-based program mandate elevates tribal practices in the funding and regulatory world, and facilitates continuing quality improvement and evaluation, while ensuring that practices and programs remain based on local community context and culture. This article provides details of a model that could well serve tribes facing evidence-based model program mandates throughout the country.
Ufnar, J. A.; Kuner, Susan; Shepherd, V. L.
2012-01-01
The National Science Foundation GK–12 program has made more than 300 awards to universities, supported thousands of graduate student trainees, and impacted thousands of K–12 students and teachers. The goals of the current study were to determine the number of sustained GK–12 programs that follow the original GK–12 structure of placing graduate students into classrooms and to propose models for universities with current funding or universities interested in starting a program. Results from surveys, literature reviews, and Internet searches of programs funded between 1999 and 2008 indicated that 19 of 188 funded sites had sustained in-classroom programs. Three distinct models emerged from an analysis of these programs: a full-stipend model, in which graduate fellows worked with partner teachers in a K–12 classroom for 2 d/wk; a supplemental stipend model in which fellows worked with teachers for 1 d/wk; and a service-learning model, in which in-classroom activity was integrated into university academic coursework. Based on these results, potential models for sustainability and replication are suggested, including establishment of formal collaborations between sustained GK–12 programs and universities interested in starting in-classroom programs; development of a new Teaching Experience for Fellows program; and integration of supplemental fellow stipends into grant broader-impact sections. PMID:22949421
Program Model Checking: A Practitioner's Guide
NASA Technical Reports Server (NTRS)
Pressburger, Thomas T.; Mansouri-Samani, Masoud; Mehlitz, Peter C.; Pasareanu, Corina S.; Markosian, Lawrence Z.; Penix, John J.; Brat, Guillaume P.; Visser, Willem C.
2008-01-01
Program model checking is a verification technology that uses state-space exploration to evaluate large numbers of potential program executions. Program model checking provides improved coverage over testing by systematically evaluating all possible test inputs and all possible interleavings of threads in a multithreaded system. Model-checking algorithms use several classes of optimizations to reduce the time and memory requirements for analysis, as well as heuristics for meaningful analysis of partial areas of the state space. Our goal in this guidebook is to assemble, distill, and demonstrate emerging best practices for applying program model checking. We offer it as a starting point and introduction for those who want to apply model checking to software verification and validation. The guidebook will not discuss any specific tool in great detail, but we provide references for specific tools.
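At its core, the state-space exploration the guidebook describes is a reachability search over program states. The following is a minimal didactic sketch in Python of explicit-state invariant checking by breadth-first search; it is not the algorithm of any particular model checker, and the example transition system is invented.

from collections import deque

def check_invariant(initial, successors, invariant):
    """Explicit-state reachability: explore all states, report a violation if any."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            return s                      # counterexample state
        for nxt in successors(s):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None                           # invariant holds on all reachable states

# Toy example: a counter two "threads" may bump by 1 or 2; invariant: counter <= 4.
succ = lambda s: [s + 1, s + 2] if s < 5 else []
print(check_invariant(0, succ, lambda s: s <= 4))   # prints 5, a violating state

Real model checkers add the optimizations the guidebook mentions (partial-order reduction, state hashing, symmetry) on top of exactly this search skeleton.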
DOT National Transportation Integrated Search
2003-05-01
This research project studied the feasibility as well as the scientific validity and utility of performing functional capacity screening with older drivers. A Model Program was described encompassing procedures to detect functionally impaired drivers...
FEASIBILITY STUDY ON EXECUTIVE PROGRAM DEVELOPMENT FOR BASIN ECOSYSTEMS MODELING
The project was undertaken to provide a feasibility study on developing and implementing a complete executive program to automatically interface various basin-wide water quality models for use by relatively inexperienced modelers. This executive program should ultimately...
Marketing for a Web-Based Master's Degree Program in Light of Marketing Mix Model
ERIC Educational Resources Information Center
Pan, Cheng-Chang
2012-01-01
The marketing mix model was applied with a focus on Web media to re-strategize a Web-based Master's program in a southern state university in U.S. The program's existing marketing strategy was examined using the four components of the model: product, price, place, and promotion, in hopes to repackage the program (product) to prospective students…
ERIC Educational Resources Information Center
American Institutes for Research in the Behavioral Sciences, Palo Alto, CA.
Prepared for a White House Conference on Children (December 1970), this report describes a program in which first- through third-graders in three schools in Dayton, Ohio, participate in a model of a Follow Through program sponsored by Siegfried Engelmann and Wesley Becker of the University of Oregon at Eugene. All teachers chose to participate,…
Code of Federal Regulations, 2017 CFR
2017-10-01
42 CFR § 510.320 (Medicare Comprehensive Care for Joint Replacement (CJR) Model; Pricing and Payment): Treatment of incentive programs or add-on payments under existing Medicare payment systems.
Code of Federal Regulations, 2016 CFR
2016-10-01
42 CFR § 510.320 (Medicare Comprehensive Care for Joint Replacement (CJR) Model; Pricing and Payment): Treatment of incentive programs or add-on payments under existing Medicare payment systems.
Software reliability: Application of a reliability model to requirements error analysis
NASA Technical Reports Server (NTRS)
Logan, J.
1980-01-01
The application of a software reliability model, with a well-defined correspondence to computer program properties, to requirements error analysis is described. Requirements error categories which can be related to program structural elements are identified and their effect on program execution considered. The model is applied to a hypothetical B-5 requirement specification for a program module.
The Use of Molecular Modeling Programs in Medicinal Chemistry Instruction.
ERIC Educational Resources Information Center
Harrold, Marc W.
1992-01-01
This paper describes and evaluates the use of a molecular modeling computer program (Alchemy II) in a pharmaceutical education program. Provided are the hardware requirements and basic program features as well as several examples of how this program and its features have been applied in the classroom. (GLR)
NASA Technical Reports Server (NTRS)
Walker, K. P.
1981-01-01
Results of a 20-month research and development program for nonlinear structural modeling with advanced time-temperature constitutive relationships are reported. The program included: (1) the evaluation of a number of viscoplastic constitutive models in the published literature; (2) incorporation of three of the most appropriate constitutive models into the MARC nonlinear finite element program; (3) calibration of the three constitutive models against experimental data using Hastelloy-X material; and (4) application of the most appropriate constitutive model to a three dimensional finite element analysis of a cylindrical combustor liner louver test specimen to establish the capability of the viscoplastic model to predict component structural response.
NASA Technical Reports Server (NTRS)
Shooman, Martin L.
1991-01-01
Many of the most challenging reliability problems of our present decade involve complex distributed systems such as interconnected telephone switching computers, air traffic control centers, aircraft and space vehicles, and local area and wide area computer networks. In addition to the challenge of complexity, modern fault-tolerant computer systems require very high levels of reliability, e.g., avionic computers with MTTF goals of one billion hours. Most analysts find that it is too difficult to model such complex systems without computer-aided design programs. In response to this need, NASA has developed a suite of computer-aided reliability modeling programs beginning with CARE 3 and including a group of new programs such as HARP, HARP-PC, the Reliability Analysts Workbench (a combination of the model solvers SURE, STEM, and PAWS with the common front-end model ASSIST), and the Fault Tree Compiler. The HARP program is studied, and how well the user can model systems using this program is investigated. One of the important objectives is to study how user friendly this program is, e.g., how easy it is to model the system, provide the input information, and interpret the results. The experiences of the author and his graduate students who used HARP in two graduate courses are described. Some brief comparisons were made with the ARIES program, which the students also used. Theoretical studies of the modeling techniques used in HARP are also included. Of course, no answer can be any more accurate than the fidelity of the model; thus, an Appendix is included which discusses modeling accuracy. A broad viewpoint is taken, and all problems which occurred in the use of HARP are discussed. Such problems include computer system problems, installation manual problems, user manual problems, program inconsistencies, program limitations, confusing notation, long run times, accuracy problems, etc.
Optimization Research of Generation Investment Based on Linear Programming Model
NASA Astrophysics Data System (ADS)
Wu, Juan; Ge, Xueqian
Linear programming is an important branch of operations research and a mathematical method that supports scientific management. GAMS is an advanced simulation and optimization modeling language that combines complex mathematical programming formulations, such as linear programming (LP), nonlinear programming (NLP), and mixed-integer programming (MIP), with system simulation. In this paper, generation investment decisions are simulated and analyzed using a linear programming model. Finally, the optimal installed capacity of power plants and the total cost are obtained, providing a rational basis for optimized investment decisions.
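As a concrete illustration of the kind of model the abstract describes, here is a minimal linear program for a generation investment decision, written in Python with scipy rather than GAMS. The plant types, costs, demand, and site limit are all invented for the example; only the LP structure (linear objective, linear constraints) is the point.

from scipy.optimize import linprog

# Decision variables: x1, x2 = capacity (MW) to build of two plant types.
# Minimize capital cost subject to meeting peak demand and a site limit.
cost = [800.0, 1200.0]            # illustrative unit capital costs per MW
A_ub = [[-1.0, -1.0],             # -x1 - x2 <= -1000  (total capacity >= demand)
        [ 1.0,  0.0]]             #  x1       <= 600   (site limit on type 1)
b_ub = [-1000.0, 600.0]
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, res.fun)             # optimal build [600, 400] MW, cost 960000

The same structure scales to many plant types, time periods, and operating constraints, which is where a modeling language like GAMS becomes valuable.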
Representing and reasoning about program in situation calculus
NASA Astrophysics Data System (ADS)
Yang, Bo; Zhang, Ming-yi; Wu, Mao-nian; Xie, Gang
2011-12-01
Situation calculus is an expressive tool for modeling dynamical systems in artificial intelligence; changes in a dynamical world are represented naturally by the notions of action, situation and fluent in situation calculus. A program can be viewed as a discrete dynamical system, so it is possible to model programs with situation calculus. To model programs written in a small core programming language, CL, the notion of fluent is expanded to represent the value of an expression. Together with some functions returning the relevant objects from expressions, a basic action theory of CL programming is constructed. Under such a theory, properties of a program, such as correctness and termination, can be reasoned about.
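The flavor of this encoding can be sketched in a few lines of Python: situations are histories of actions, and a fluent's value in a situation is defined recursively from the last action, in the style of a successor-state axiom. The mini-language below, with "assign" and "add" actions on integer variables, is an invented stand-in for CL, which the abstract does not specify in detail.

# Situations as tuples of actions applied to the initial situation S0;
# a fluent giving a variable's value is computed by regressing over actions.
S0 = ()

def do(action, s):
    return s + (action,)            # the situation resulting from performing an action

def value(var, s):
    """Fluent: the current value of `var` in situation s (successor-state style)."""
    if not s:
        return 0                    # initial situation: all variables start at 0
    *rest, last = s
    op, target, amount = last
    prev = value(var, tuple(rest))
    if op == "assign" and target == var:
        return amount
    if op == "add" and target == var:
        return prev + amount
    return prev                     # the action does not affect this fluent

s = do(("add", "x", 2), do(("assign", "x", 5), S0))
print(value("x", s))                # -> 7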
Jackson, M E; Gnadt, J W
1999-03-01
The object-oriented graphical programming language LabView was used to implement the numerical solution to a computational model of saccade generation in primates. The computational model simulates the activity and connectivity of anatomical structures known to be involved in saccadic eye movements. The LabView program provides a graphical user interface to the model that makes it easy to observe and modify the behavior of each element of the model. Essential elements of the source code of the LabView program are presented and explained. A copy of the model is available for download from the internet.
Modeling small-scale dairy farms in central Mexico using multi-criteria programming.
Val-Arreola, D; Kebreab, E; France, J
2006-05-01
Milk supply from Mexican dairy farms does not meet demand and small-scale farms can contribute toward closing the gap. Two multi-criteria programming techniques, goal programming and compromise programming, were used in a study of small-scale dairy farms in central Mexico. To build the goal and compromise programming models, 4 ordinary linear programming models were also developed, which had objective functions to maximize metabolizable energy for milk production, to maximize margin of income over feed costs, to maximize metabolizable protein for milk production, and to minimize purchased feedstuffs. Neither multi-criteria approach was significantly better than the other; however, by applying both models it was possible to perform a more comprehensive analysis of these small-scale dairy systems. The multi-criteria programming models affirm findings from previous work and suggest that a forage strategy based on alfalfa, ryegrass, and corn silage would meet nutrient requirements of the herd. Both models suggested that there is an economic advantage in rescheduling the calving season to the second and third calendar quarters to better synchronize higher demand for nutrients with the period of high forage availability.
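To show what a goal programming formulation of this kind looks like mechanically, here is a minimal two-goal example in Python using scipy's linprog, with under- and over-achievement deviation variables attached to each goal. The feeds, coefficients, targets, and bounds are invented for illustration and are far simpler than the four-criteria dairy models described above.

from scipy.optimize import linprog

# Variables: x1, x2 = kg of two feeds; d1m, d1p, d2m, d2p = under/over deviations.
# Goals (illustrative): energy  10*x1 + 6*x2 + d1m - d1p = 120
#                       protein  2*x1 + 4*x2 + d2m - d2p =  40
# Objective: minimize under-achievement of both goals (d1m + d2m).
c    = [0, 0, 1, 0, 1, 0]                      # order: x1 x2 d1m d1p d2m d2p
A_eq = [[10, 6, 1, -1, 0,  0],
        [ 2, 4, 0,  0, 1, -1]]
b_eq = [120, 40]
bounds = [(0, 8), (0, 8)] + [(0, None)] * 4    # feed availability limits
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:2], res.fun)                      # feed mix and total shortfall

Compromise programming would instead minimize a distance between the criteria vector and an ideal point; for linear criteria, both approaches reduce to ordinary LPs.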
Evaluation and Strategic Planning for the GLOBE Program
NASA Astrophysics Data System (ADS)
Geary, E. E.; Williams, V. L.
2010-12-01
The Global Learning and Observations to Benefit the Environment (GLOBE) Program is an international environmental education program. It unites educators, students and scientists worldwide to collaborate on inquiry based investigations of the environment and Earth system science. Evaluation of the GLOBE program has been challenging because of its broad reach, diffuse models of implementation, and multiple stakeholders. In an effort to guide current evaluation efforts, a logic model was developed that provides a visual display of how the GLOBE program operates. Using standard elements of inputs, activities, outputs, customers and outcomes, this model describes how the program operates to achieve its goals. The template used to develop this particular logic model aligns the GLOBE program operations with its program strategy, thus ensuring that what the program is doing supports the achievement of long-term, intermediate and annual goals. It also provides a foundation for the development of key programmatic metrics that can be used to gauge progress toward the achievement of strategic goals.
Airport Performance Model : Volume 2 - User's Manual and Program Documentation
DOT National Transportation Integrated Search
1978-10-01
Volume II contains a User's manual and program documentation for the Airport Performance Model. This computer-based model is written in FORTRAN IV for the DEC-10. The user's manual describes the user inputs to the interactive program and gives sample...
The Bridges SOI Model School Program at Palo Verde School, Palo Verde, Arizona.
ERIC Educational Resources Information Center
Stock, William A.; DiSalvo, Pamela M.
The Bridges SOI Model School Program is an educational service based upon the SOI (Structure of Intellect) Model School curriculum. For the middle seven months of the academic year, all students in the program complete brief daily exercises that develop specific cognitive skills delineated in the SOI model. Additionally, intensive individual…
NASA Technical Reports Server (NTRS)
Peterson, John B., Jr.
1991-01-01
Two programs were developed to calculate the pitch and roll position of the conventional sting drive and the pitch of a high angle articulated sting to position a wind tunnel model at the desired angle of attack and sideslip and position the model as near as possible to the centerline of the tunnel. These programs account for the effects of sting offset angles, sting bending angles, and wind-tunnel stream flow angles. In addition, the second program incorporates inputs from on-board accelerometers that measure model pitch and roll with respect to gravity. The programs are presented and a description of the numerical operation of the programs with a definition of the variables used in the programs is given.
NASA Technical Reports Server (NTRS)
Kvaternik, Raymond G.
1992-01-01
An overview is presented of government contributions to the program called Design Analysis Methods for Vibrations (DAMV), which attempted to develop finite-element-based analyses of rotorcraft vibrations. NASA initiated the program with a finite-element modeling program for the CH-47D tandem-rotor helicopter. The DAMV program emphasized four areas: airframe finite-element modeling, difficult-components studies, coupled rotor-airframe vibrations, and airframe structural optimization. Key accomplishments of the program include industrywide standards for modeling metal and composite airframes, improved industrial designs for vibrations, and the identification of critical structural contributors to airframe vibratory responses. The program also demonstrated the value of incorporating secondary modeling details in improving correlation, and the findings provide the basis for an improved finite-element-based dynamics design-analysis capability.
Greenfield, David; Hinchcliff, Reece; Hogden, Anne; Mumford, Virginia; Debono, Deborah; Pawsey, Marjorie; Westbrook, Johanna; Braithwaite, Jeffrey
2016-07-01
The study aim was to investigate the understandings and concerns of stakeholders regarding the evolution of health service accreditation programs in Australia. Stakeholder representatives from programs in the primary, acute and aged care sectors participated in semi-structured interviews. Across 2011-12 there were 47 group and individual interviews involving 258 participants. Interviews lasted, on average, 1 h, and were digitally recorded and transcribed. Transcriptions were analysed using textual referencing software. Four significant issues were considered to have directed the evolution of accreditation programs: altering underlying program philosophies; shifting of program content focus and details; differing surveying expectations and experiences; and the influence of external contextual factors upon accreditation programs. Three accreditation program models were noted by participants: regulatory compliance; continuous quality improvement; and a hybrid model incorporating elements of these two. Respondents noted the compatibility or incommensurability of the first two models. Participation in a program was reportedly experienced as ranging on a survey continuum from "malicious compliance" to "performance audits" to "quality improvement journeys". Wider contextual factors, in particular political and community expectations and associated media reporting, were considered significant influences on the operation and evolution of programs. A hybrid accreditation model was noted to have evolved. The hybrid model promotes minimum standards and continuous quality improvement by examining the structure and processes of organisations and the outcomes of care. The hybrid model appears to be directing organisational and professional attention to enhancing their safety cultures. Copyright © 2015 John Wiley & Sons, Ltd.
Strategies for Energy Efficient Resource Management of Hybrid Programming Models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Dong; Supinski, Bronis de; Schulz, Martin
2013-01-01
Many scientific applications are programmed using hybrid programming models that use both message-passing and shared-memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared-memory or message-passing, in isolation. The potential solution space, thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoption of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74% on average and up to 13.8%) with some performance gain (up to 7.5%) or negligible performance loss.
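The selection step can be pictured with a minimal Python sketch (hypothetical timing and power predictions, not the paper's statistical models): each candidate (thread count, frequency) configuration carries a predicted time and power, and the scheme picks the lowest-energy configuration within an allowed slowdown.

    # Minimal DCT/DVFS configuration-selection sketch (hypothetical predictions).
    predictions = {
        # (threads, GHz): (predicted_time_s, predicted_power_W)
        (16, 2.4): (10.0, 180.0),
        (16, 1.8): (11.5, 140.0),
        (8, 2.4): (12.0, 120.0),
        (8, 1.8): (14.0, 95.0),
    }
    baseline_time = predictions[(16, 2.4)][0]
    allowed_slowdown = 1.2                  # tolerate up to 20% longer runtime

    def energy(cfg):
        t, p = predictions[cfg]
        return t * p                        # joules = seconds * watts

    feasible = [c for c in predictions
                if predictions[c][0] <= allowed_slowdown * baseline_time]
    best = min(feasible, key=energy)
    print(best, energy(best))               # -> (8, 2.4) with 1440 J here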
A computer program for predicting nonlinear uniaxial material responses using viscoplastic models
NASA Technical Reports Server (NTRS)
Chang, T. Y.; Thompson, R. L.
1984-01-01
A computer program was developed for predicting nonlinear uniaxial material responses using viscoplastic constitutive models. Four specific models, i.e., those due to Miller, Walker, Krieg-Swearengen-Rhode, and Robinson, are included. Any other unified model is easily implemented into the program in the form of subroutines. Analysis features include stress-strain cycling, creep response, stress relaxation, thermomechanical fatigue loop, or any combination of these responses. An outline is given on the theoretical background of uniaxial constitutive models, analysis procedure, and numerical integration methods for solving the nonlinear constitutive equations. In addition, a discussion on the computer program implementation is also given. Finally, seven numerical examples are included to demonstrate the versatility of the computer program developed.
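As a flavor of the numerical integration such a program performs, here is a minimal Python sketch integrating a generic Norton power-law creep model under constant stress with explicit Euler steps (illustrative constants; none of the program's four models is reproduced here).

    # Minimal uniaxial viscoplastic integration sketch (generic Norton law,
    # illustrative constants; not the program's Miller/Walker/etc. models).
    A, n, E = 1.0e-14, 4.0, 200e3    # creep constant, exponent, modulus (MPa units)
    stress = 150.0                   # constant applied stress, MPa
    dt, steps = 1.0, 1000            # time step (s) and number of steps

    eps_creep = 0.0
    history = []
    for k in range(steps):
        rate = A * stress**n         # inelastic strain rate
        eps_creep += rate * dt       # explicit Euler update
        history.append(eps_creep + stress / E)  # total = elastic + creep
    print(history[-1])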
Formally verifying Ada programs which use real number types
NASA Technical Reports Server (NTRS)
Sutherland, David
1986-01-01
Formal verification is applied to programs which use real number arithmetic operations (mathematical programs). Formal verification of a program P consists of creating a mathematical model of P, stating the desired properties of P in a formal logical language, and proving that the mathematical model has the desired properties using a formal proof calculus. The development and verification of the mathematical model are discussed.
Code of Federal Regulations, 2010 CFR
2017-10-01
42 CFR § 512.320, Episode Payment Model, Pricing and Payment: Treatment of incentive programs or add-on payments. Payments under such models are independent of, and do not affect, any incentive programs or add-on payments under ...
ERIC Educational Resources Information Center
Anderson, Stephen C.; And Others
Module 2 of a seven module package for child protective service workers explores various types of parent aide programs for abused and neglected children and their families. Four training activities address models of parent aide programs, organization analysis, and selection of the appropriate program model. Included are directions for using the…
Streamlining DOD Acquisitions: Balancing Schedule with Complexity
2006-09-01
from them has a distinct industrial flavor: streamlined processes, benchmarking, and business models. The requirements generation community led by ... model), and the Department of the Navy assumed program lead. [Stable Program Inputs (-)] By 1984, the program goals included delivery of 913 V-22 ... they subsequently specified a crew of two. [Stable Program Input (-)] The contractor team won in a "fly-off" solely via modeling and simulation
NASA Astrophysics Data System (ADS)
Louca, Loucas
This is a descriptive case study investigating the use of two computer-based programming environments (CPEs), MicroWorlds(TM) (MW) and Stagecast Creator(TM) (SC), as modeling tools for collaborative fifth grade science learning. In this study I investigated how CPEs might support fifth grade student work and inquiry in science. There is a longstanding awareness of the need to help students learn about models and modeling in science, and CPEs are promising tools for this. A computer program can be a model of a physical system, and modeling through programming may make the process more tangible: programming involves making decisions and assumptions; the code is used to express ideas; running the program shows the implications of those ideas. In this study I have analyzed and compared students' activities and conversations in two after-school clubs, one working with MW and the other with SC. The findings confirm the promise of CPEs as tools for teaching practices of modeling and science, and they suggest advantages and disadvantages of particular aspects of CPE designs for that purpose. MW is an open-ended, textual CPE that uses procedural programming. MW students focused on breaking down phenomena into small programmable pieces, which is useful for scientific modeling. Developing their programs, the students focused on writing, testing and debugging code, which are also useful for scientific modeling. SC is a non-linear, object-oriented CPE that uses a visual programming language. SC students saw their work as creating games. They focused on the overall story, which they then translated into SC rules, an approach in conflict with SC's object-oriented interface. However, telling the story of individual causal agents was useful for scientific modeling. Programming in SC was easier, whereas reading code in MW was more tangible. The latter helped MW students to use the code as the representation of the phenomenon rather than merely as a tool for creating a simulation. The analyses also pointed to three emerging "frames" that describe students' work focus, based on their goals, strategies, and criteria for success: the programming frame, the visualization frame, and the modeling frame. One way to understand the respective advantages and disadvantages of the two CPEs is with respect to which frames they engendered in students.
A Leadership Model for University Geology Department Teacher Inservice Programs.
ERIC Educational Resources Information Center
Sheldon, Daniel S.; And Others
1983-01-01
Provides geology departments and science educators with a leadership model for developing earth science inservice programs. Model emphasizes cooperation/coordination among departments, science educators, and curriculum specialists at local/intermediate/state levels. Includes rationale for inservice programs and geology department involvement in…
ESPVI 4.0 ELECTROSTATIC PRECIPITATOR VI AND PERFORMANCE MODEL: USER'S MANUAL
The manual is the companion document for the microcomputer program ESPVI 4.0, Electrostatic Precipitation VI and Performance Model. The program was developed to provide a user-friendly interface to an advanced model of electrostatic precipitation (ESP) performance. The program i...
Implementing a citizen's DWI reporting program using the Extra Eyes model
DOT National Transportation Integrated Search
2008-09-01
This manual is a guide for law enforcement agencies and community organizations in creating and implementing a citizens' DWI reporting program in their communities, modeled on the Operation Extra Eyes program. Extra Eyes is a program that engages volu...
ERIC Educational Resources Information Center
Truckenmiller, James L.
The former HEW (Health, Education, and Welfare) National Strategy for Youth Development Model proposed a community-based program to promote positive youth development and to prevent delinquency through a sequence of youth needs assessments, needs-targeted programs, and program impact evaluation. HEW Community Program Impact Scales data obtained…
ERIC Educational Resources Information Center
Crites, John O.
Evaluating the effectiveness of career guidance programs is a complex process, and few comprehensive models for evaluating such programs exist. Evaluation of career guidance programs has been hampered by the myth that program outcomes are uniform and monolithic. Findings from studies of attribute treatment interactions have revealed only a few…
A Growth Model for Academic Program Life Cycle (APLC): A Theoretical and Empirical Analysis
ERIC Educational Resources Information Center
Acquah, Edward H. K.
2010-01-01
Academic program life cycle concept states each program's life flows through several stages: introduction, growth, maturity, and decline. A mixed-influence diffusion growth model is fitted to enrolment data on academic programs to analyze the factors determining progress of academic programs through their life cycles. The regression analysis yield…
The Origin and Evolution of the Behavior Analysis Program at the University of Nevada, Reno.
Hayes, Linda J; Houmanfar, Ramona A; Ghezzi, Patrick M; Williams, W Larry; Locey, Matthew; Hayes, Steven C
2016-05-01
The origins of the Behavior Analysis program at the University of Nevada, Reno, by way of a self-capitalized model, and its transition to a more typical graduate program are described. Details of the original proposal to establish the program and of the funding model are given. Some of the unusual features of a program executed in this way are discussed, along with problems engendered by the model, as is the diversification of faculty interests over time. The status of the program now, after 25 years of operation, is presented.
DIVWAG Model Documentation. Volume II. Programmer/Analyst Manual. Part 4.
1976-07-01
Table-of-contents excerpt: Movement Model constant data deck structure and program descriptions (Appendix IV-13); Airmobile Model constant data deck structure and program descriptions (Appendix IV-15), including the deck structure required by the Airmobile Model constant data load program.
ERIC Educational Resources Information Center
Hamilton, Jenny; Bronte-Tinkew, Jacinta
2007-01-01
A logic model, also called a conceptual model and theory-of-change model, is a visual representation of how a program is expected to "work." It relates resources, activities, and the intended changes or impacts that a program is expected to create. Typically, logic models are diagrams or flow charts with illustrations, text, and arrows that…
Gunn, Christine M; Clark, Jack A; Battaglia, Tracy A; Freund, Karen M; Parker, Victoria A
2014-10-01
To determine how closely a published model of navigation reflects the practice of navigation in breast cancer patient navigation programs. Observational field notes describing patient navigator activities collected from 10 purposefully sampled, foundation-funded breast cancer navigation programs in 2008-2009. An exploratory study evaluated a model framework for patient navigation published by Harold Freeman by using an a priori coding scheme based on model domains. Field notes were compiled and coded. Inductive codes were added during analysis to characterize activities not included in the original model. Programs were consistent with individual-level principles representing tasks focused on individual patients. There was variation with respect to program-level principles that related to program organization and structure. Program characteristics such as the use of volunteer or clinical navigators were identified as contributors to patterns of model concordance. This research provides a framework for defining the navigator role as focused on eliminating barriers through the provision of individual-level interventions. The diversity observed at the program level in these programs was a reflection of implementation according to target population. Further guidance may be required to assist patient navigation programs to define and tailor goals and measurement to community needs. © Health Research and Educational Trust.
Satellite broadcasting system study
NASA Technical Reports Server (NTRS)
1972-01-01
The study to develop a system model and computer program representative of broadcasting satellite systems employing community-type receiving terminals is reported. The program provides a user-oriented tool for evaluating performance/cost tradeoffs, synthesizing minimum-cost systems for a given set of system requirements, and performing sensitivity analyses to identify critical parameters and technology. The performance/costing philosophy and what is meant by a minimum-cost system are shown graphically. Topics discussed include: main line control program, ground segment model, space segment model, cost models, and launch vehicle selection. Several examples of minimum-cost systems resulting from the computer program are presented. A listing of the computer program is also included.
Pandey, Daya Shankar; Pan, Indranil; Das, Saptarshi; Leahy, James J; Kwapinski, Witold
2015-03-01
A multi-gene genetic programming technique is proposed as a new method to predict syngas yield production and the lower heating value for municipal solid waste gasification in a fluidized bed gasifier. The study shows that the predicted outputs of the municipal solid waste gasification process are in good agreement with the experimental dataset and also generalise well to validation (untrained) data. Published experimental datasets are used for model training and validation purposes. The results show the effectiveness of the genetic programming technique for solving complex nonlinear regression problems. The multi-gene genetic programming model is also compared with a single-gene genetic programming model to show the relative merits and demerits of the technique. This study demonstrates that the genetic programming based data-driven modelling strategy can be a good candidate for developing models for other types of fuels as well. Copyright © 2014 Elsevier Ltd. All rights reserved.
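The defining multi-gene idea, a model output formed as a least-squares-weighted sum of several evolved expression trees, can be sketched in a few lines of Python; here two hand-written genes stand in for evolved trees, and all data are synthetic.

    # Minimal multi-gene sketch (not the paper's implementation): the model
    # is a linear combination of "gene" expressions with weights fitted by
    # least squares. Real multi-gene GP evolves the gene trees.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0.5, 2.0, size=(50, 2))          # two inputs (hypothetical)
    y = 1.5 * X[:, 0] ** 2 + 0.8 * np.log(X[:, 1])   # synthetic target

    genes = [
        lambda X: X[:, 0] ** 2,          # gene 1: an evolved subtree would go here
        lambda X: np.log(X[:, 1]),       # gene 2
    ]
    G = np.column_stack([g(X) for g in genes] + [np.ones(len(X))])
    w, *_ = np.linalg.lstsq(G, y, rcond=None)        # gene weights + bias
    pred = G @ w
    print(w, np.sqrt(np.mean((pred - y) ** 2)))      # weights and RMSE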
McDonald, Paige L; Harwood, Kenneth J; Butler, Joan T; Schlumpf, Karen S; Eschmann, Carson W; Drago, Daniela
2018-12-01
Intensive courses (ICs), or accelerated courses, are gaining popularity in medical and health professions education, particularly as programs adopt e-learning models to negotiate challenges of flexibility, space, cost, and time. In 2014, the Department of Clinical Research and Leadership (CRL) at the George Washington University School of Medicine and Health Sciences began the process of transitioning two online 15-week graduate programs to an IC model. Within a year, a third program also transitioned to this model. A literature review yielded little guidance on the process of transitioning from 15-week, traditional models of delivery to IC models, particularly in online learning environments. Correspondingly, this paper describes the process by which CRL transitioned three online graduate programs to an IC model and details best practices for course design and facilitation resulting from our iterative redesign process. Finally, we present lessons-learned for the benefit of other medical and health professions' programs contemplating similar transitions. CRL: Department of Clinical Research and Leadership; HSCI: Health Sciences; IC: Intensive course; PD: Program director; QM: Quality Matters.
Flexible Language Constructs for Large Parallel Programs
Rosing, Matt; Schnabel, Robert
1994-01-01
The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.
Predoctoral Program Models in Dentistry for the Handicapped: The University of Tennessee Program.
ERIC Educational Resources Information Center
Wessels, Kenneth E.
1980-01-01
A model program in dentistry for the handicapped offered at the University of Tennessee and supported by a Robert Wood Johnson Foundation grant is described. The program includes didactic instruction, observation and seminar, and clinical practice. (JMF)
ERIC Educational Resources Information Center
Leff, H. Stephen; Turner, Ralph R.
This report focuses on the use of linear programming models to address the issues of how vocational rehabilitation (VR) resources should be allocated in order to maximize program efficiency within given resource constraints. A general introduction to linear programming models is first presented that describes the major types of models available,…
A Comprehensive Model for Developing and Evaluating Study Abroad Programs in Counselor Education
ERIC Educational Resources Information Center
Santos, Syntia Dinora
2014-01-01
This paper introduces a model to guide the process of designing and evaluating study abroad programs, addressing particular stages and influential factors. The main purpose of the model is to serve as a basic structure for those who want to develop their own program or evaluate previous cultural immersion experiences. The model is based on the…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xie, Fei; Huang, Yongxi
Here, we develop a multistage, stochastic mixed-integer model to support biofuel supply chain expansion under evolving uncertainties. By utilizing the block-separable recourse property, we reformulate the multistage program in an equivalent two-stage program and solve it using an enhanced nested decomposition method with maximal non-dominated cuts. We conduct extensive numerical experiments and demonstrate the application of the model and algorithm in a case study based on the South Carolina settings. The value of multistage stochastic programming method is also explored by comparing the model solution with the counterparts of an expected value based deterministic model and a two-stage stochastic model.
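The two-stage reformulation idea can be illustrated with a deterministic-equivalent LP in Python (three demand scenarios and hypothetical costs, far simpler than the paper's multistage biofuel model): first-stage capacity is built now, and scenario-dependent recourse purchases cover any shortfall.

    # Minimal two-stage stochastic LP sketch (deterministic equivalent;
    # all numbers hypothetical): build capacity x now, buy shortfall y_s
    # at a premium in each scenario s.
    from scipy.optimize import linprog

    build_cost, buy_cost = 2.0, 5.0
    scenarios = [(0.3, 80.0), (0.5, 100.0), (0.2, 130.0)]  # (probability, demand)

    c = [build_cost] + [p * buy_cost for p, _ in scenarios]
    A_ub = [[-1.0] + [-1.0 if j == i else 0.0 for j in range(3)]
            for i in range(3)]                   # -(x + y_s) <= -d_s
    b_ub = [-d for _, d in scenarios]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)
    print(res.x)   # capacity, then per-scenario recourse purchases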
User's manual for interactive LINEAR: A FORTRAN program to derive linear aircraft models
NASA Technical Reports Server (NTRS)
Antoniewicz, Robert F.; Duke, Eugene L.; Patterson, Brian P.
1988-01-01
An interactive FORTRAN program that provides the user with a powerful and flexible tool for the linearization of aircraft aerodynamic models is documented in this report. The program LINEAR numerically determines a linear system model using nonlinear equations of motion and a user-supplied linear or nonlinear aerodynamic model. The nonlinear equations of motion used are six-degree-of-freedom equations with stationary atmosphere and flat, nonrotating earth assumptions. The system model determined by LINEAR consists of matrices for both the state and observation equations. The program has been designed to allow easy selection and definition of the state, control, and observation variables to be used in a particular model.
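The core numerical step of such a linearization can be sketched as finite-difference Jacobians about a trim point; the one-state, one-control dynamics below are purely illustrative, not an aircraft model, and the sketch uses Python in place of FORTRAN.

    # Minimal numerical linearization sketch: finite-difference Jacobians of
    # nonlinear dynamics x' = f(x, u), yielding state-space matrices A and B.
    import numpy as np

    def f(x, u):
        return np.array([-0.5 * x[0] ** 2 + 2.0 * u[0]])  # toy dynamics

    def jacobians(f, x0, u0, eps=1e-6):
        n, m = len(x0), len(u0)
        fx0 = f(x0, u0)
        A = np.zeros((n, n)); B = np.zeros((n, m))
        for i in range(n):
            dx = np.zeros(n); dx[i] = eps
            A[:, i] = (f(x0 + dx, u0) - fx0) / eps
        for j in range(m):
            du = np.zeros(m); du[j] = eps
            B[:, j] = (f(x0, u0 + du) - fx0) / eps
        return A, B

    A, B = jacobians(f, np.array([2.0]), np.array([1.0]))
    print(A, B)   # A is approximately [[-2.0]], B approximately [[2.0]]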
SCI model structure determination program (OSR) user's guide. [optimal subset regression
NASA Technical Reports Server (NTRS)
1979-01-01
The computer program OSR (Optimal Subset Regression), which estimates models for rotorcraft body and rotor force and moment coefficients, is described. The technique used is based on the subset regression algorithm. Given time histories of aerodynamic coefficients, aerodynamic variables, and control inputs, the program computes correlations between the various time histories. The model structure determination is based on these correlations. Inputs and outputs of the program are given.
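A minimal Python sketch of forward subset regression in this spirit (illustrative, not OSR's exact algorithm): at each step, add the candidate regressor most correlated with the current residual, then refit.

    # Minimal forward subset regression sketch (synthetic data).
    import numpy as np

    def forward_subset(X, y, k):
        chosen, residual, beta = [], y.copy(), None
        for _ in range(k):
            corrs = [abs(np.corrcoef(X[:, j], residual)[0, 1])
                     if j not in chosen else 0.0 for j in range(X.shape[1])]
            chosen.append(int(np.argmax(corrs)))
            Xs = X[:, chosen]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # refit on subset
            residual = y - Xs @ beta
        return chosen, beta

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 6))
    y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)
    print(forward_subset(X, y, 2))   # should pick columns 2 and 4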
Optimal Repair And Replacement Policy For A System With Multiple Components
2016-06-17
Numerical Demonstration: to implement the linear program, we use the Python programming language (PSF 2016) with the Pyomo optimization modeling language (Hart, Watson, and Woodruff 2011, Mathematical Programming Computation 3(3); Hart, Laird, Watson, and Woodruff 2012, Pyomo: Optimization Modeling in Python, vol. 67, Springer Science & Business Media).
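Since the excerpt cites Pyomo, here is a minimal Pyomo LP sketch in the same spirit (a generic repair-versus-replace cost split with hypothetical data, not the thesis's formulation); it assumes a solver such as glpk is installed.

    # Minimal Pyomo LP sketch (hypothetical costs; requires the glpk solver).
    from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                               NonNegativeReals, SolverFactory, minimize)

    m = ConcreteModel()
    m.repair = Var(domain=NonNegativeReals)    # fraction of components repaired
    m.replace = Var(domain=NonNegativeReals)   # fraction of components replaced
    m.cover = Constraint(expr=m.repair + m.replace >= 1.0)   # every component handled
    m.budget = Constraint(expr=40 * m.repair + 100 * m.replace <= 80)
    m.cost = Objective(expr=60 * m.repair + 100 * m.replace, sense=minimize)

    SolverFactory("glpk").solve(m)
    print(m.repair(), m.replace())             # optimal repair/replace split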
Program Planners’ Perspectives of Promotora Roles, Recruitment, and Selection
Koskan, Alexis; Hilfinger Messias, DeAnne K.; Friedman, Daniela B.; Brandt, Heather M.; Walsemann, Katrina M.
2013-01-01
Objective Program planners work with promotoras (the Spanish term for female community health workers) to reduce health disparities among underserved populations. Based on the Role-Outcomes Linkage Evaluation Model for Community Health Workers (ROLES) conceptual model, we explored how program planners conceptualized the promotora role and the approaches and strategies they used to recruit, select, and sustain promotoras. Design We conducted semi-structured, in-depth interviews with a purposive convenience sample of 24 program planners, program coordinators, promotora recruiters, research principal investigators, and other individuals who worked closely with promotoras on United States-based health programs for Hispanic women (ages 18 and older). Results Planners conceptualized the promotora role based on their personal experiences and their understanding of the underlying philosophical tenets of the promotora approach. Recruitment and selection methods reflected planners’ conceptualizations and experiences of promotoras as paid staff or volunteers. Participants described a variety of program planning and implementation methods. They focused on sustainability of the programs, the intended health behavior changes or activities, and the individual promotoras. Conclusion To strengthen health programs employing the promotora delivery model, job descriptions should delineate role expectations and boundaries and better guide promotora evaluations. We suggest including additional components such as information on funding sources, program type and delivery, and sustainability outcomes to enhance the ROLES conceptual model. The expanded model can be used to guide program planners in the planning, implementing, and evaluating of promotora health programs. PMID:23039847
Model Checking JAVA Programs Using Java Pathfinder
NASA Technical Reports Server (NTRS)
Havelund, Klaus; Pressburger, Thomas
2000-01-01
This paper describes a translator called JAVA PATHFINDER from JAVA to PROMELA, the "programming language" of the SPIN model checker. The purpose is to establish a framework for verification and debugging of JAVA programs based on model checking. This work should be seen as part of a broader attempt to make formal methods applicable "in the loop" of programming within NASA's areas such as space, aviation, and robotics. Our main goal is to create automated formal methods such that programmers themselves can apply these in their daily work (in the loop) without the need for specialists to manually reformulate a program into a different notation in order to analyze the program. This work is a continuation of an effort to formally verify, using SPIN, a multi-threaded operating system programmed in Lisp for the Deep-Space 1 spacecraft, and of previous work in applying existing model checkers and theorem provers to real applications.
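The underlying explicit-state model checking idea can be conveyed with a tiny Python sketch (conceptual only, unrelated to JPF's actual Java-to-PROMELA translation): enumerate reachable states breadth-first and report the first state violating a safety property.

    # Minimal explicit-state model checking sketch: BFS over a toy state space,
    # checking a safety property in every reachable state.
    from collections import deque

    def successors(state):
        x, y = state                     # toy system: two counters mod 4
        return [((x + 1) % 4, y), (x, (y + 1) % 4)]

    def safe(state):
        return state != (3, 3)           # the "bad" state we hope is unreachable

    def check(initial):
        seen, frontier = {initial}, deque([initial])
        while frontier:
            s = frontier.popleft()
            if not safe(s):
                return False, s          # counterexample state found
            for t in successors(s):
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
        return True, None

    print(check((0, 0)))                 # -> (False, (3, 3)): property violated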
Execution models for mapping programs onto distributed memory parallel computers
NASA Technical Reports Server (NTRS)
Sussman, Alan
1992-01-01
The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.
Computer programs for forward and inverse modeling of acoustic and electromagnetic data
Ellefsen, Karl J.
2011-01-01
A suite of computer programs was developed by U.S. Geological Survey personnel for forward and inverse modeling of acoustic and electromagnetic data. This report describes the computer resources that are needed to execute the programs, the installation of the programs, the program designs, some tests of their accuracy, and some suggested improvements.
A Value-Added Approach to Selecting the Best Master of Business Administration (MBA) Program
ERIC Educational Resources Information Center
Fisher, Dorothy M.; Kiang, Melody; Fisher, Steven A.
2007-01-01
Although numerous studies rank master of business administration (MBA) programs, prospective students' selection of the best MBA program is a formidable task. In this study, the authors used a linear-programming-based model called data envelopment analysis (DEA) to evaluate MBA programs. The DEA model connects costs to benefits to evaluate the…
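For concreteness, here is a minimal input-oriented CCR DEA sketch in Python (hypothetical MBA-program inputs and outputs, not the study's data): each program's efficiency is the optimal value of a small LP in the multiplier form.

    # Minimal DEA (CCR, input-oriented, multiplier form) sketch with
    # hypothetical data: efficiency of program o is the best achievable
    # ratio of weighted outputs to weighted inputs.
    import numpy as np
    from scipy.optimize import linprog

    X = np.array([[50.0], [70.0], [60.0]])                     # input: tuition (k$)
    Y = np.array([[90.0, 80.0], [95.0, 85.0], [85.0, 70.0]])   # outputs: salary, placement

    def efficiency(o):
        n_u, n_v = Y.shape[1], X.shape[1]
        c = np.concatenate([-Y[o], np.zeros(n_v)])        # maximize u . y_o
        A_eq = [np.concatenate([np.zeros(n_u), X[o]])]    # v . x_o = 1 (normalization)
        A_ub = np.hstack([Y, -X])                         # u . y_j - v . x_j <= 0
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(X)),
                      A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (n_u + n_v))
        return -res.fun

    for o in range(len(X)):
        print(o, round(efficiency(o), 3))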
Pedagogy and Processes for a Computer Programming Outreach Workshop--The Bridge to College Model
ERIC Educational Resources Information Center
Tangney, Brendan; Oldham, Elizabeth; Conneely, Claire; Barrett, Stephen; Lawlor, John
2010-01-01
This paper describes a model for computer programming outreach workshops aimed at second-level students (ages 15-16). Participants engage in a series of programming activities based on the Scratch visual programming language, and a very strong group-based pedagogy is followed. Participants are not required to have any prior programming experience.…
ERIC Educational Resources Information Center
Fox, Danielle Polizzi; Gottfredson, Denise C.; Kumpfer, Karol K.; Beatty, Penny D.
2004-01-01
This article discusses the challenges faced when a popular model program, the Strengthening Families Program, which in the past has been implemented on a smaller scale in single organizations, moves to a larger, multiorganization endeavor. On the basis of 42 interviews conducted with program staff, the results highlight two main themes that…
Compare and Contrast Program Planning Models
ERIC Educational Resources Information Center
Baskas, Richard S.
2011-01-01
This paper will examine the differences and similarities between two program planning models, Tyler and Caffarella, to reveal their strengths and weaknesses. When adults are involved in training sessions, there are various program planning models that can be used, depending on the goal of the training session. Researchers developed these models…
Reverse engineering of legacy agricultural phenology modeling system
USDA-ARS?s Scientific Manuscript database
A program which implements predictive phenology modeling is a valuable tool for growers and scientists. Such a program was created in the late 1980s by the creators of general phenology modeling as proof of their techniques. However, this first program could not continue to meet the needs of the fi...
A Conceptual Framework for Institutional Research in Community Colleges.
ERIC Educational Resources Information Center
Alfred, Richard L.; Ivens, Stephen H.
This paper defines a conceptual model for institutional research in the community college and identifies sources of information, programs, and services that provide data necessary for implementation of the model. The model contains four specific subsystems: goal setting, program development, program review, and cost effectiveness. Each subsystem…
ERIC Educational Resources Information Center
Cleaveland, Bonnie L.
1994-01-01
Current state-of-the-art substance abuse prevention programs are mostly social cognitive theory based. However, there are few publications which review specifically how modeling is applied to adolescent substance abuse prevention programs. This article reviews theoretical considerations for implementing modeling for this purpose. (Author/LKS)
A Model for Evaluating Development Programs. Miscellaneous Report.
ERIC Educational Resources Information Center
Burton, John E., Jr.; Rogers, David L.
Taking the position that the Classical Experimental Evaluation (CEE) Model does not do justice to the process of acquiring information necessary for decision making regarding planning, programming, implementing, and recycling program activities, this paper presents the Inductive, System-Process (ISP) evaluation model as an alternative to be used in…
Field modeling of heat transfer in atrium
NASA Astrophysics Data System (ADS)
Nedryshkin, Oleg; Gravit, Marina; Bushuev, Nikolay
2017-10-01
The results of calculating fire risk are an important element in modern fire safety assessment. The article reviews work on the mathematical modeling of fire in a room. Different calculation models used in fire risk assessment and fire modeling programs are compared. The results of full-scale fire tests and of fire modeling in the FDS program are presented. Empirical and theoretical data on fire modeling are analyzed, and a conclusion is drawn about the modeling accuracy of the FDS program.
NASA Technical Reports Server (NTRS)
Justus, C. G.; Alyea, F. N.; Cunnold, D. M.; Jeffries, W. R., III; Johnson, D. L.
1991-01-01
A new (1990) version of the NASA/MSFC Global Reference Atmospheric Model (GRAM-90) was completed, and the program and key database listings are presented. GRAM-90 incorporates extensive new data, mostly collected under the Middle Atmosphere Program, to produce a completely revised middle-atmosphere model (20 to 120 km). At altitudes greater than 120 km, GRAM-90 uses the NASA Marshall Engineering Thermosphere model. Complete listings of all programs and major databases are presented, and a test case is included.
A Structural Evaluation of a Large-Scale Quasi-Experimental Microfinance Initiative.
Kaboski, Joseph P; Townsend, Robert M
2011-09-01
This paper uses a structural model to understand, predict, and evaluate the impact of an exogenous microcredit intervention program, the Thai Million Baht Village Fund program. We model household decisions in the face of borrowing constraints, income uncertainty, and high-yield indivisible investment opportunities. After estimation of parameters using pre-program data, we evaluate the model's ability to predict and interpret the impact of the village fund intervention. Simulations from the model mirror the data in yielding a greater increase in consumption than credit, which is interpreted as evidence of credit constraints. A cost-benefit analysis using the model indicates that some households value the program much more than its per household cost, but overall the program costs 20 percent more than the sum of these benefits.
Core Professionalism Education in Surgery: A Systematic Review.
Sarıoğlu Büke, Akile; Karabilgin Öztürkçü, Özlem Sürel; Yılmaz, Yusuf; Sayek, İskender
2018-03-15
Professionalism education is one of the major elements of surgical residency education. The aim was to evaluate studies of core professionalism education programs in surgical residency education. The study design was a systematic review. This systematic literature review analyzed core professionalism programs for surgical residency education published in English with at least three of the following features: program developmental model/instructional design method, aims and competencies, methods of teaching, methods of assessment, and program evaluation model or method. A total of 27,083 articles were retrieved using EBSCOhost, PubMed, Science Direct, Web of Science, and manual search. Eight articles met the selection criteria. The instructional design method was presented in only one article, which described the Analysis, Design, Development, Implementation, and Evaluation model. Six articles were based on the Accreditation Council for Graduate Medical Education criteria, although there was significant variability in content. The most common teaching method was role modeling with scenario- and case-based learning. A wide range of assessment methods for evaluating professionalism education were reported. The Kirkpatrick model was reported in one article as a method for program evaluation. It is suggested that for a core surgical professionalism education program, the developmental/instructional design model, aims and competencies, content, teaching methods, assessment methods, and program evaluation methods/models should be well defined, and the content should be comparable.
Flightdeck Automation Problems (FLAP) Model for Safety Technology Portfolio Assessment
NASA Technical Reports Server (NTRS)
Ancel, Ersin; Shih, Ann T.
2014-01-01
NASA's Aviation Safety Program (AvSP) develops and advances methodologies and technologies to improve air transportation safety. The Safety Analysis and Integration Team (SAIT) conducts a safety technology portfolio assessment (PA) to analyze the program content, to examine the benefits and risks of products with respect to program goals, and to support programmatic decision making. The PA process includes systematic identification of current and future safety risks as well as tracking several quantitative and qualitative metrics to ensure the program goals are addressing prominent safety risks accurately and effectively. One of the metrics within the PA process involves using quantitative aviation safety models to gauge the impact of the safety products. This paper demonstrates the role of aviation safety modeling by providing model outputs and evaluating a sample of portfolio elements using the Flightdeck Automation Problems (FLAP) model. The model enables not only ranking of the quantitative relative risk reduction impact of all portfolio elements, but also highlighting the areas with high potential impact via sensitivity and gap analyses in support of the program office. Although the model outputs are preliminary and products are notional, the process shown in this paper is essential to a comprehensive PA of NASA's safety products in the current program and future programs/projects.
GUI to Facilitate Research on Biological Damage from Radiation
NASA Technical Reports Server (NTRS)
Cucinotta, Frances A.; Ponomarev, Artem Lvovich
2010-01-01
A graphical-user-interface (GUI) computer program has been developed to facilitate research on the damage caused by highly energetic particles and photons impinging on living organisms. The program brings together, into one computational workspace, computer codes that have been developed over the years, plus codes that will be developed during the foreseeable future, to address diverse aspects of radiation damage. These include codes that implement radiation-track models, codes for biophysical models of breakage of deoxyribonucleic acid (DNA) by radiation, pattern-recognition programs for extracting quantitative information from biological assays, and image-processing programs that aid visualization of DNA breaks. The radiation-track models are based on transport models of interactions of radiation with matter and solution of the Boltzmann transport equation by use of both theoretical and numerical models. The biophysical models of breakage of DNA by radiation include biopolymer coarse-grained and atomistic models of DNA, stochastic- process models of deposition of energy, and Markov-based probabilistic models of placement of double-strand breaks in DNA. The program is designed for use in the NT, 95, 98, 2000, ME, and XP variants of the Windows operating system.
Assessment of Evidence-based Management Training Program: Application of a Logic Model.
Guo, Ruiling; Farnsworth, Tracy J; Hermanson, Patrick M
2016-06-01
The purposes of this study were to apply a logic model to plan and implement an evidence-based management (EBMgt) educational training program for healthcare administrators and to examine whether a logic model is a useful tool for evaluating the outcomes of the educational program. The logic model was used as a conceptual framework to guide the investigators in developing an EBMgt educational training program and evaluating the outcomes of the program. The major components of the logic model were constructed as inputs, outputs, and outcomes/impacts. The investigators delineated the logic model based on the results of the needs assessment survey. Two 3-hour training workshops were delivered to 30 participants. To assess the outcomes of the EBMgt educational program, pre- and post-tests and self-reflection surveys were conducted. The data were collected and analyzed descriptively and inferentially using the IBM Statistical Package for the Social Sciences (SPSS) 22.0. A paired-sample t-test was performed to compare the differences in participants' EBMgt knowledge and skills prior to and after the training. The assessment results showed a statistically significant difference in participants' EBMgt knowledge and information-searching skills before and after the training (p < 0.001). Participants' confidence in using the EBMgt approach for decision-making was significantly increased after the training workshops (p < 0.001). Eighty-three percent of participants indicated that the knowledge and skills they gained through the training program could be used for future management decision-making in their healthcare organizations. The overall evaluation results of the program were positive. It is suggested that the logic model is a useful tool for program planning, implementation, and evaluation, and that it also improves the outcomes of the educational program.
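The pre/post comparison described above amounts to a paired-sample t-test; a minimal Python equivalent (synthetic scores, with scipy standing in for the study's SPSS) looks like this.

    # Minimal paired-sample t-test sketch (synthetic pre/post scores).
    import numpy as np
    from scipy.stats import ttest_rel

    pre = np.array([55, 60, 48, 62, 70, 58, 65, 52])     # pre-training scores
    post = np.array([68, 72, 60, 75, 78, 66, 80, 63])    # post-training scores
    t, p = ttest_rel(post, pre)
    print(f"t = {t:.2f}, p = {p:.4f}")                   # significant if p < 0.05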
Booker, Kathy; Hilgenberg, Cheryl
2010-01-01
Nursing is often considered expensive in the cost analysis of academic programs. Yet nursing programs have the power to attract many students, and the national nursing shortage has resulted in a high demand for nurses. Methods to systematically assess programs across an entire university academic division are often dissimilar in technique and outcome. At a small, private, Midwestern university, a model for comprehensive program assessment, titled the Quality, Potential and Cost (QPC) model, was developed and applied to each major offered at the university through the collaborative effort of directors, chairs, deans, and the vice president for academic affairs. The QPC model provides a means of equalizing data so that single measures (such as cost) are not viewed in isolation. It also provides a common language to ensure that all academic leaders at an institution apply consistent methods for assessment of individual programs. The application of the QPC model allowed for consistent, fair assessments and the ability to allocate resources to programs according to strategic direction. In this article, the application of the QPC model to School of Nursing majors and other selected university majors will be illustrated. Copyright 2010 Elsevier Inc. All rights reserved.
Combining Thermal And Structural Analyses
NASA Technical Reports Server (NTRS)
Winegar, Steven R.
1990-01-01
Computer code makes programs compatible so stresses and deformations can be calculated. This paper describes a computer code combining thermal analysis with structural analysis. Called SNIP (for SINDA-NASTRAN Interfacing Program), the code provides an interface between a finite-difference thermal model of a system and a finite-element structural model when there is no node-to-element correlation between the models. It eliminates much of the manual work in converting temperature results of the SINDA (Systems Improved Numerical Differencing Analyzer) program into thermal loads for the NASTRAN (NASA Structural Analysis) program. SNIP was used to analyze concentrating reflectors for solar generation of electric power. Large thermal and structural models are needed to predict distortion of surface shapes, and SNIP saves considerable time and effort in combining the models.
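The interfacing problem can be pictured with a minimal nearest-node mapping sketch in Python (illustrative geometry; SNIP's actual conversion is more involved): each structural element centroid takes the temperature of the nearest thermal node.

    # Minimal thermal-to-structural mapping sketch (hypothetical coordinates):
    # assign each element the temperature of the nearest thermal node.
    import numpy as np

    thermal_xyz = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # node coords
    thermal_T = np.array([320.0, 410.0, 300.0])                   # node temps (K)
    element_centroids = np.array([[0.2, 0.1], [0.8, 0.2], [0.1, 0.9]])

    def element_temps(centroids, nodes, temps):
        d = np.linalg.norm(centroids[:, None, :] - nodes[None, :, :], axis=2)
        return temps[np.argmin(d, axis=1)]      # nearest-node lookup

    print(element_temps(element_centroids, thermal_xyz, thermal_T))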
Planning Models for Tuberculosis Control Programs
Chorba, Ronald W.; Sanders, J. L.
1971-01-01
A discrete-state, discrete-time simulation model of tuberculosis is presented, with submodels of preventive interventions. The model allows prediction of the prevalence of the disease over the simulation period. Preventive and control programs and their optimal budgets may be planned by using the model for cost-benefit analysis: costs are assigned to the program components and disease outcomes to determine the ratio of program expenditures to future savings on medical and socioeconomic costs of tuberculosis. Optimization is achieved by allocating funds in successive increments to alternative program components in simulation and identifying those components that lead to the greatest reduction in prevalence for the given level of expenditure. The method is applied to four hypothetical disease prevalence situations. PMID:4999448
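The incremental allocation idea can be sketched in Python (hypothetical rates, not the paper's submodels): simulate prevalence under a funding split, then repeatedly give the next budget increment to whichever component reduces prevalence most.

    # Minimal incremental budget-allocation sketch over a toy discrete-time
    # prevalence model (all rates hypothetical).
    def simulate(screening, treatment, years=10):
        s, i = 0.99, 0.01                       # susceptible, infected fractions
        for _ in range(years):
            new = 0.3 * s * i                   # new infections per year
            cured = (0.1 + 0.5 * treatment) * i # cure rate rises with funding
            found = 0.2 * screening * i         # screening finds and removes cases
            s, i = s - new + cured + found, i + new - cured - found
        return i                                # final prevalence

    budget, step = 1.0, 0.1
    alloc = {"screening": 0.0, "treatment": 0.0}
    while sum(alloc.values()) < budget - 1e-9:
        best = min(alloc, key=lambda k: simulate(**{**alloc, k: alloc[k] + step}))
        alloc[best] += step                     # fund the most effective component
    print(alloc, simulate(**alloc))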
Program Development and External Assessment.
ERIC Educational Resources Information Center
Minnis, D. L.
Although the development of new models is essential to the improvement of teacher preparation programs, the California system of accreditation of teacher education programs seems to hinder innovative program development and evaluation. External assessment in California is based on a discrepancy model which works best when applied to static…
Unifying Model-Based and Reactive Programming within a Model-Based Executive
NASA Technical Reports Server (NTRS)
Williams, Brian C.; Gupta, Vineet; Norvig, Peter (Technical Monitor)
1999-01-01
Real-time, model-based deduction has recently emerged as a vital component in AI's tool box for developing highly autonomous reactive systems. Yet one of the current hurdles towards developing model-based reactive systems is the number of methods simultaneously employed, and their corresponding melange of programming and modeling languages. This paper offers an important step towards unification. We introduce RMPL, a rich modeling language that combines probabilistic, constraint-based modeling with reactive programming constructs, while offering a simple semantics in terms of hidden-state Markov processes. We introduce probabilistic, hierarchical constraint automata (PHCA), which allow Markov processes to be expressed in a compact representation that preserves the modularity of RMPL programs. Finally, a model-based executive, called Reactive Burton, is described that exploits this compact encoding to perform efficient simulation, belief state update, and control sequence generation.
ERIC Educational Resources Information Center
Acquah, Edward H. K.
2012-01-01
The academic program life cycle (APLC) concept states each program's life flows through several stages: introduction, growth, maturity, and decline. A mixed-influence diffusion growth model is fitted to annual enrollment data on academic programs to analyze the factors determining progress of academic programs through their life cycles. The…
NASA Technical Reports Server (NTRS)
Peterson, John B., Jr.
1988-01-01
Two programs have been developed to calculate the pitch and roll angles of a wind-tunnel sting drive system that will position a model at the desired angle of attack and angle of sideslip in the wind tunnel. These programs account for the effects of sting offset angles, sting bending angles, and wind-tunnel stream flow angles. In addition, the second program incorporates inputs from on-board accelerometers that measure model pitch and roll with respect to gravity. The programs are presented in the report and a description of the numerical operation of the programs with a definition of the variables used in the programs is given.
Harbaugh, Arien W.
2011-01-01
The MFI2005 data-input (entry) program was developed for use with the U.S. Geological Survey modular three-dimensional finite-difference groundwater model, MODFLOW-2005. MFI2005 runs on personal computers and is designed to be easy to use; data are entered interactively through a series of display screens. MFI2005 supports parameter estimation with the UCODE_2005 program. Data for MODPATH, a particle-tracking program for use with MODFLOW-2005, also can be entered using MFI2005. MFI2005 can be used in conjunction with other data-input programs so that the different parts of a model dataset can be entered by using the most suitable program.