Parallel matrix transpose algorithms on distributed memory concurrent computers
Choi, J.; Walker, D.W.; Dongarra, J.J.
1993-10-01
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A·B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T·B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
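The role of the GCD and LCM in the abstract above can be illustrated with a small sketch. It assumes the usual block-cyclic rule, in which block (i, j) lives on processor (i mod P, j mod Q); the function names `owner` and `transpose_destinations` are hypothetical and are not taken from the paper or from PUMMA. The sketch only counts which processors must exchange blocks; it does not reproduce the paper's communication schedule.

```python
from math import gcd

def owner(i, j, P, Q):
    """Processor (row, col) holding block (i, j) under an assumed block-cyclic layout."""
    return (i % P, j % Q)

def transpose_destinations(P, Q):
    """Map each processor to the set of processors it sends blocks to in a transpose."""
    lcm = P * Q // gcd(P, Q)
    dests = {(p, q): set() for p in range(P) for q in range(Q)}
    # The ownership pattern repeats every LCM blocks in each direction,
    # so one LCM x LCM tile of blocks exposes every (source, destination) pair.
    for i in range(lcm):
        for j in range(lcm):
            src = owner(i, j, P, Q)
            dst = owner(j, i, P, Q)  # block (i, j) of A becomes block (j, i) of A^T
            dests[src].add(dst)
    return dests

if __name__ == "__main__":
    for P, Q in [(2, 3), (4, 6), (4, 4)]:
        counts = {len(s) for s in transpose_destinations(P, Q).values()}
        print(P, Q, "destinations per processor:", counts)
```

Under this assumed layout, relatively prime P and Q make every processor a destination of every other (a complete exchange), while in general each processor has (P/GCD)·(Q/GCD) = LCM/GCD destinations, which is consistent with the step count quoted in the abstract.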
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database
Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; Shephard, Mark S.
2013-01-01
Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm of creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in a 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n number of ghost layers up to a point where the whole partitioned mesh can be ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving's meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia's "tiling" dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levels of parallelism on multiple instruction multiple data machines is indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations based on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
Beard, R.A.
1990-03-01
The purpose of this thesis is to explore the methods used to parallelize NP-complete problems and the degree of improvement that can be realized using a distributed parallel processor to solve these combinatoric problems. Common NP-complete problem characteristics such as a priori reductions, use of partial-state information, and inhomogeneous searches are identified and studied. The set covering problem (SCP) is implemented for this research because many applications such as information retrieval, task scheduling, and VLSI expression simplification can be structured as an SCP. In addition, its generic NP-complete common characteristics are well documented and a parallel implementation has not been reported. Parallel programming design techniques involve decomposing the problem and developing the parallel algorithms. The major components of a parallel solution are developed in a four phase process. First, a meta-level design is accomplished using an appropriate design language such as UNITY. Then, the UNITY design is transformed into an algorithm and implementation specific to a distributed architecture. Finally, a complexity analysis of the algorithm is performed. The a priori reductions are divide-and-conquer algorithms, whereas the search for the optimal set cover is accomplished with a branch-and-bound algorithm. The search utilizes a global best cost maintained at a central location for distribution to all processors. Three methods of load balancing are implemented and studied: coarse grain with static allocation of the search space, fine grain with dynamic allocation, and dynamic load balancing.
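A branch-and-bound search that prunes against a best known cost, as the abstract above describes, might look like the following serial sketch. It assumes unit cost per subset, and a single shared variable stands in for the centrally maintained global best cost; `min_set_cover` and its helper are hypothetical names, and the thesis's parallel design, a priori reductions and load balancing are not reproduced.

```python
def min_set_cover(universe, subsets):
    """Return the minimum number of subsets covering 'universe', or None if impossible."""
    # Stand-in for the globally shared best cost (assumed unit cost per subset).
    best = [len(subsets) + 1]

    def search(uncovered, start, used):
        if not uncovered:
            best[0] = min(best[0], used)
            return
        if used + 1 >= best[0]:
            return  # bound: at least one more subset is needed, so no improvement is possible
        for k in range(start, len(subsets)):
            if subsets[k] & uncovered:  # branch only on subsets that cover something new
                search(uncovered - subsets[k], k + 1, used + 1)

    search(frozenset(universe), 0, 0)
    return best[0] if best[0] <= len(subsets) else None

if __name__ == "__main__":
    sets = [frozenset({1, 2, 3}), frozenset({2, 4}), frozenset({3, 4}), frozenset({4, 5})]
    print(min_set_cover({1, 2, 3, 4, 5}, sets))
```

In a distributed setting each worker would explore a different part of the search tree and read or update the shared bound, which is where the choice between static and dynamic allocation of the search space matters.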
Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.
Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias
2011-01-01
The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time. PMID:22254462
Integrating Parallel and Distributed Data Mining Algorithms into the NASA Earth Exchange (NEX)
NASA Astrophysics Data System (ADS)
Oza, N.; Kumar, V.; Nemani, R. R.; Boriah, S.; Das, K.; Khandelwal, A.; Matthews, B.; Michaelis, A.; Mithal, V.; Nayak, G.; Votava, P.
2014-12-01
There is an urgent need in global climate change science for efficient model and/or data analysis algorithms that can be deployed in distributed and parallel environments because of the proliferation of large and heterogeneous data sets. Members of our team from NASA Ames Research Center and the University of Minnesota have been developing new distributed data mining algorithms and developing distributed versions of algorithms originally developed to run on a single machine. We are integrating these algorithms together with the Terrestrial Observation and Prediction System (TOPS), an ecological nowcasting and forecasting system, on the NASA Earth Exchange (NEX). We are also developing a framework under which data mining algorithm developers can make their algorithms available for use by scientists in our system, model developers can set up their models to run within our system and make their results available, and data source providers can make their data available, all with as little effort as possible. We demonstrate the substantial time savings and new results that can be derived through this framework by demonstrating an improvement to the Burned Area (BA) data product on a global scale. Our improvement was derived through development and implementation on NEX of a novel spatiotemporal time series change detection algorithm which will also be presented.
A data distributed parallel algorithm for ray-traced volume rendering
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Painter, James S.; Hansen, Charles D.; Krogh, Michael F.
1993-01-01
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5, and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolume concurrently. No communication between processing units is needed during this locally ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on both the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
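The final compositing step can be sketched for a single pixel. The abstract does not name the blending rule, so this example assumes the standard "over" operator on premultiplied (color, alpha) pairs, with subimages ordered front to back; `over` and `composite` are hypothetical helper names.

```python
def over(front, back):
    """Composite a premultiplied (color, alpha) sample 'front' over 'back'."""
    cf, af = front
    cb, ab = back
    return (cf + (1.0 - af) * cb, af + (1.0 - af) * ab)

def composite(samples):
    """Fold per-pixel samples that are already sorted front to back."""
    result = (0.0, 0.0)  # fully transparent starting value
    for sample in samples:
        result = over(result, sample)
    return result

if __name__ == "__main__":
    # One pixel seen through three subvolumes, nearest first.
    print(composite([(0.2, 0.4), (0.3, 0.5), (0.1, 0.9)]))
```

Because "over" is associative, neighboring subimages can be merged pairwise in any grouping as long as the front-to-back order is preserved, which is what makes a parallel compositing stage possible.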
Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim
2014-07-01
The surface line integral convolution (LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
NASA Astrophysics Data System (ADS)
Loring, B.; Karimabadi, H.; Rortershteyn, V.
2015-10-01
The surface line integral convolution (LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
NASA Astrophysics Data System (ADS)
Zheng, Yan
2015-03-01
The Internet of things (IoT), which focuses on providing users with information exchange and intelligent control, has attracted a lot of attention from researchers all over the world since the beginning of this century. IoT consists of a large number of sensor nodes and data processing units, and its most important features are energy constraints, efficient communication and high redundancy. As the number of sensor nodes grows, communication efficiency and the available communication bandwidth become bottlenecks. Much research assumes that the number of joins is small, which does not suit the growing number of multi-join queries across the whole Internet of things. To improve the communication efficiency between parallel units in the distributed sensor network, this paper proposes a parallel query optimization algorithm based on a distribution attributes cost graph. The algorithm considers the storage information relations and the network communication cost, and establishes an optimized information changing rule. The experimental results show that the algorithm performs well and makes effective use of the resources of each node in the distributed sensor network. Therefore, the execution efficiency of multi-join queries between different nodes can be improved.
Choi, Jaeyoung; Walker, D.W.; Dongarra, J.J.
1993-08-01
This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A·B, but also transposed multiplication routines C = A^T·B, C = A·B^T, and C = A^T·B^T, for a block scattered data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. The PUMMA routines together provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.
Totally parallel multilevel algorithms
NASA Technical Reports Server (NTRS)
Frederickson, Paul O.
1988-01-01
Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which is referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.
Dong, Yu-Shuang; Xu, Gao-Chao; Fu, Xiao-Dong
2014-01-01
The cloud platform provides various services to users. More and more cloud centers provide infrastructure as the main way of operating. To improve the utilization rate of the cloud center and to decrease the operating cost, the cloud center provides services according to requirements of users by sharding the resources with virtualization. Considering both QoS for users and cost saving for cloud computing providers, we try to maximize performance and minimize energy cost as well. In this paper, we propose a distributed parallel genetic algorithm (DPGA) of placement strategy for virtual machine deployment on a cloud platform. It executes the genetic algorithm in parallel and in a distributed manner on several selected physical hosts in the first stage. Then it continues to execute the genetic algorithm of the second stage with solutions obtained from the first stage as the initial population. The solution calculated by the genetic algorithm of the second stage is the optimal one of the proposed approach. The experimental results show that the proposed placement strategy of VM deployment can ensure QoS for users and it is more effective and more energy efficient than other placement strategies on the cloud platform. PMID:25097872
A data distributed, parallel algorithm for ray-traced volume rendering
Ma, Kwan-Liu; Painter, J.S.; Hansen, C.D.; Krogh, M.F.
1993-03-30
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and its implementation on networked workstations and a massively parallel computer, the Connection Machine CM-5. This algorithm distributes the data and the computational load to individual processing units to achieve fast, high-quality rendering of high-resolution data, even when only a modest amount of memory is available on each machine. The volume data, once distributed, is left intact. The processing nodes perform local ray-tracing of their subvolume concurrently. No communication between processing units is needed during this locally ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Implementations and tests on a group of networked workstations and on the Thinking Machines CM-5 demonstrate the practicality of our algorithm and expose different performance tuning issues for each platform. We use data sets from medical imaging and computational fluid dynamics simulations in the study of this algorithm.
Schatz, Martin D.; Kolda, Tamara G.; van de Geijn, Robert
2015-09-01
Large-scale datasets in computational chemistry typically require distributed-memory parallel methods to perform a special operation known as tensor contraction. Tensors are multidimensional arrays, and a tensor contraction is akin to matrix multiplication with special types of permutations. Creating an efficient algorithm and optimized implementation in this domain is complex, tedious, and error-prone. To address this, we develop a notation to express data distributions so that we can use automated methods to find optimized implementations for tensor contractions. We consider the spin-adapted coupled cluster singles and doubles method from computational chemistry and use our methodology to produce an efficient implementation. Experiments performed on the IBM Blue Gene/Q and Cray XC30 demonstrate both improved performance and reduced memory consumption.
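The remark that a tensor contraction is akin to matrix multiplication with permutations can be made concrete with a small serial sketch. The particular contraction C[i][j] = sum over k, l of A[i][k][l]·B[l][k][j], the nested-list storage, and both function names are assumptions chosen for illustration; nothing here reflects the authors' notation or their distributed implementation.

```python
from itertools import product

def contract(A, B, ni, nk, nl, nj):
    """Direct evaluation of C[i][j] = sum_{k,l} A[i][k][l] * B[l][k][j]."""
    return [[sum(A[i][k][l] * B[l][k][j]
                 for k, l in product(range(nk), range(nl)))
             for j in range(nj)] for i in range(ni)]

def contract_via_matmul(A, B, ni, nk, nl, nj):
    """Same contraction expressed as a permutation, a flattening, and a matrix product."""
    # Flatten the (k, l) modes of A into one column index: A2 is ni x (nk*nl).
    A2 = [[A[i][k][l] for k in range(nk) for l in range(nl)] for i in range(ni)]
    # Permute B from (l, k, j) order to (k, l, j), then flatten to (nk*nl) x nj.
    B2 = [[B[l][k][j] for j in range(nj)] for k in range(nk) for l in range(nl)]
    # Ordinary matrix multiplication on the flattened operands.
    return [[sum(A2[i][m] * B2[m][j] for m in range(nk * nl))
             for j in range(nj)] for i in range(ni)]

if __name__ == "__main__":
    ni, nk, nl, nj = 2, 3, 2, 2
    A = [[[i + 2 * k + 3 * l for l in range(nl)] for k in range(nk)] for i in range(ni)]
    B = [[[l - k + j for j in range(nj)] for k in range(nk)] for l in range(nl)]
    print(contract(A, B, ni, nk, nl, nj) == contract_via_matmul(A, B, ni, nk, nl, nj))
```

The permutation of B is the costly part in a distributed setting, since it implies data redistribution between processes, which is why the way data distributions are expressed matters for finding efficient implementations.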
Parallel algorithms for matrix computations
Plemmons, R.J.
1990-01-01
The present conference on parallel algorithms for matrix computations encompasses both shared-memory systems and distributed-memory systems, as well as combinations of the two, to provide an overall perspective on parallel algorithms for both dense and sparse matrix computations in solving systems of linear equations, dense or structured problems related to least-squares computations, eigenvalue computations, singular-value computations, and rapid elliptic solvers. Specific issues addressed include the influence of parallel and vector architectures on algorithm design, computations for distributed-memory architectures such as hypercubes, solutions for sparse symmetric positive definite linear systems, symbolic and numeric factorizations, and triangular solutions. Also addressed are reference sources for parallel and vector numerical algorithms, sources for machine architectures, and sources for programming languages.
Parallel algorithm development
Adams, T.F.
1996-06-01
Rapid changes in parallel computing technology are causing significant changes in the strategies being used for parallel algorithm development. One approach is simply to write computer code in a standard language like FORTRAN 77 or with the expectation that the compiler will produce executable code that will run in parallel. The alternatives are: (1) to build explicit message passing directly into the source code; or (2) to write source code without explicit reference to message passing or parallelism, but use a general communications library to provide efficient parallel execution. Application of these strategies is illustrated with examples of codes currently under development.
HEATR project: ATR algorithm parallelization
NASA Astrophysics Data System (ADS)
Deardorf, Catherine E.
1998-09-01
High Performance Computing (HPC) Embedded Application for Target Recognition (HEATR) is a project funded by the High Performance Computing Modernization Office through the Common HPC Software Support Initiative (CHSSI). The goal of CHSSI is to produce portable, parallel, multi-purpose, freely distributable, support software to exploit emerging parallel computing technologies and enable application of scalable HPC's for various critical DoD applications. Specifically, the CHSSI goal for HEATR is to provide portable, parallel versions of several existing ATR detection and classification algorithms to the ATR-user community to achieve near real-time capability. The HEATR project will create parallel versions of existing automatic target recognition (ATR) detection and classification algorithms and generate reusable code that will support porting and software development process for ATR HPC software. The HEATR Team has selected detection/classification algorithms from both the model-based and training-based (template-based) arena in order to consider the parallelization requirements for detection/classification algorithms across ATR technology. This would allow the Team to assess the impact that parallelization would have on detection/classification performance across ATR technology. A field demo is included in this project. Finally, any parallel tools produced to support the project will be refined and returned to the ATR user community along with the parallel ATR algorithms. This paper will review: (1) HPCMP structure as it relates to HEATR, (2) Overall structure of the HEATR project, (3) Preliminary results for the first algorithm Alpha Test, (4) CHSSI requirements for HEATR, and (5) Project management issues and lessons learned.
Parallel computers and parallel algorithms for CFD: An introduction
NASA Astrophysics Data System (ADS)
Roose, Dirk; Vandriessche, Rafael
1995-10-01
This text presents a tutorial on those aspects of parallel computing that are important for the development of efficient parallel algorithms and software for computational fluid dynamics. We first review the main architectural features of parallel computers and we briefly describe some parallel systems on the market today. We introduce some important concepts concerning the development and the performance evaluation of parallel algorithms. We discuss how work load imbalance and communication costs on distributed memory parallel computers can be minimized. We present performance results for some CFD test cases. We focus on applications using structured and block structured grids, but the concepts and techniques are also valid for unstructured grids.
Pronk, Sander; Pouya, Iman; Lundborg, Magnus; Rotskoff, Grant; Wesén, Björn; Kasson, Peter M; Lindahl, Erik
2015-06-01
Computational chemistry and other simulation fields are critically dependent on computing resources, but few problems scale efficiently to the hundreds of thousands of processors available in current supercomputers, particularly for molecular dynamics. This has turned into a bottleneck as new hardware generations primarily provide more processing units rather than making individual units much faster, which simulation applications are addressing by increasingly focusing on sampling with algorithms such as free-energy perturbation, Markov state modeling, metadynamics, or milestoning. All these rely on combining results from multiple simulations into a single observation. They are potentially powerful approaches that aim to predict experimental observables directly, but this comes at the expense of added complexity in selecting sampling strategies and keeping track of dozens to thousands of simulations and their dependencies. Here, we describe how the distributed execution framework Copernicus allows the expression of such algorithms in generic workflows: dataflow programs. Because dataflow algorithms explicitly state dependencies of each constituent part, algorithms only need to be described on a conceptual level, after which the execution is maximally parallel. The fully automated execution facilitates the optimization of these algorithms with adaptive sampling, where undersampled regions are automatically detected and targeted without user intervention. We show how several such algorithms can be formulated for computational chemistry problems, and how they are executed efficiently with many loosely coupled simulations using either distributed or parallel resources with Copernicus. PMID:26575558
On the Effects of Migration on the Fitness Distribution of Parallel Evolutionary Algorithms
Cantu-Paz, E.
2000-04-25
Migration of individuals between populations may increase the selection pressure. This has the desirable consequence of speeding up convergence, but it may result in an excessively rapid loss of variation that may cause the search to fail. This paper investigates the effects of migration on the distribution of fitness. It considers arbitrary migration rates and topologies with different number of neighbors, and it compares algorithms that are configured to have the same selection intensity. The results suggest that migration preserves more diversity as the number of neighbors of a deme increases.
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
Yoo, A; Chow, E; Henderson, K; McLendon, W; Hendrickson, B; Catalyurek, U
2005-07-19
Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.
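As a rough illustration of the level-synchronous structure such a distributed BFS typically has, the following Python sketch simulates several processors inside one process. It is not the paper's algorithm: it uses the simpler 1D (vertex) partitioning, the owner rule `v % n_procs` is an assumption made for the example, and the "messages" are plain lists rather than real communication.

```python
def distributed_bfs(adj, source, n_procs):
    """Simulate a level-synchronous BFS over a 1D vertex partition.

    adj maps each vertex to its neighbor list. Vertex v is owned by
    processor v % n_procs (a hypothetical owner rule for this sketch).
    Returns a dict of BFS levels for the reachable vertices.
    """
    def owner(v):
        return v % n_procs

    # Only the owner of a vertex ever reads or writes its level, so this
    # single dict stands in for n_procs separate local tables.
    level = {source: 0}
    frontiers = [[] for _ in range(n_procs)]
    frontiers[owner(source)].append(source)
    depth = 0
    while any(frontiers):
        # Expansion: each processor scans its local frontier and buckets
        # the discovered neighbors by destination owner.
        outboxes = [[[] for _ in range(n_procs)] for _ in range(n_procs)]
        for p in range(n_procs):
            for v in frontiers[p]:
                for w in adj[v]:
                    outboxes[p][owner(w)].append(w)
        depth += 1
        # Exchange: each owner receives candidates and keeps the unvisited ones.
        frontiers = [[] for _ in range(n_procs)]
        for q in range(n_procs):
            for p in range(n_procs):
                for w in outboxes[p][q]:
                    if w not in level:
                        level[w] = depth
                        frontiers[q].append(w)
    return level


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
levels = distributed_bfs(graph, 0, n_procs=3)
```

In this 1D layout every processor may send to every other processor in each level. The abstract states that the authors use 2D (edge) partitioning instead, to reduce communication overhead.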
On mesh rezoning algorithms for parallel platforms
Plaskacz, E.J.
1995-07-01
A mesh rezoning algorithm for finite element simulations in a parallel-distributed environment is described. The cornerstones of the algorithm are: the parallel computation of distortion norms on the element and subdomain level, the exchange of the individual subdomain norms to form a subdomain distortion vector, the classification of subdomains and the rezoning behavior prescribed within each subdomain as a response to its own classification and the classification of neighboring subdomains.
Acoustic simulation in architecture with parallel algorithm
NASA Astrophysics Data System (ADS)
Li, Xiaohong; Zhang, Xinrong; Li, Dan
2004-03-01
To address the complexity of architectural environments and the need for real-time simulation of architectural acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in a scene is solved with this method. The impulse responses between sources and receivers for each frequency segment, which are calculated by multiple processes, are then combined into the whole frequency response. The numerical experiment shows that parallel computation can improve the efficiency of acoustic simulation for complex scenes.
Parallel algorithms for message decomposition
Teng, S.H.; Wang, B.
1987-06-01
The authors consider the deterministic and random parallel complexity (time and processor) of message decoding: an essential problem in communications systems and translation systems. They present an optimal parallel algorithm to decompose prefix-coded messages and uniquely decipherable-coded messages in O(n/P) time, using O(P) processors (for all P with 1 ≤ P ≤ n/log n) deterministically as well as randomly on the weakest version of parallel random access machines in which concurrent read and concurrent write to a cell in the common memory are not allowed. This is done by reducing decoding to parallel finite-state automata simulation and the prefix sums.
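The reduction mentioned above can be sketched in Python for the prefix-code case. This is an illustration under stated assumptions, not the authors' PRAM algorithm: the three-symbol code, the function names, and the block size are hypothetical, the code is assumed to be a complete prefix code over bits, and the loops that would run in parallel are written serially.

```python
def build_automaton(code):
    """Build a trie decoder: trans[state][bit] = (next_state, symbol or None)."""
    trans = [{}]                                   # state 0 is the trie root
    for symbol, word in code.items():
        state = 0
        for i, bit in enumerate(word):
            if i == len(word) - 1:
                trans[state][bit] = (0, symbol)    # leaf: emit, restart at root
            else:
                if bit not in trans[state]:
                    trans.append({})
                    trans[state][bit] = (len(trans) - 1, None)
                state = trans[state][bit][0]
    return trans


def run(trans, state, bits):
    """Feed bits to the automaton; return (final_state, emitted_symbols)."""
    out = []
    for bit in bits:
        state, symbol = trans[state][bit]
        if symbol is not None:
            out.append(symbol)
    return state, out


def decode_in_blocks(trans, bits, block_size):
    blocks = [bits[i:i + block_size] for i in range(0, len(bits), block_size)]
    n_states = len(trans)
    # Step 1 (independent per block): the state-to-state map each block induces.
    maps = [[run(trans, s, blk)[0] for s in range(n_states)] for blk in blocks]
    # Step 2: start state of every block. Function composition is associative,
    # so a parallel prefix scan could replace this serial loop.
    starts, state = [], 0
    for block_map in maps:
        starts.append(state)
        state = block_map[state]
    # Step 3 (independent per block): decode from the now-known start state.
    pieces = [run(trans, s, blk)[1] for s, blk in zip(starts, blocks)]
    return "".join(sym for piece in pieces for sym in piece)


code = {"a": "0", "b": "10", "c": "11"}            # hypothetical prefix code
automaton = build_automaton(code)
decoded = decode_in_blocks(automaton, "01011010", block_size=2)
```

With a block size of 2, the codewords for "b" and "c" straddle block boundaries. The start states computed in step 2 are what let each block resume in the middle of a codeword.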
An efficient parallel termination detection algorithm
Baker, A. H.; Crivelli, S.; Jessup, E. R.
2004-05-27
Information local to any one processor is insufficient to monitor the overall progress of most distributed computations. Typically, a second distributed computation for detecting termination of the main computation is necessary. In order to be a useful computational tool, the termination detection routine must operate concurrently with the main computation, adding minimal overhead, and it must promptly and correctly detect termination when it occurs. In this paper, we present a new algorithm for detecting the termination of a parallel computation on distributed-memory MIMD computers that satisfies all of those criteria. A variety of termination detection algorithms have been devised. Of these, the algorithm presented by Sinha, Kale, and Ramkumar (henceforth, the SKR algorithm) is unique in its ability to adapt to the load conditions of the system on which it runs, thereby minimizing the impact of termination detection on performance. Because their algorithm also detects termination quickly, we consider it to be the most efficient practical algorithm presently available. The termination detection algorithm presented here was developed for use in the PMESC programming library for distributed-memory MIMD computers. Like the SKR algorithm, our algorithm adapts to system loads and imposes little overhead. Also like the SKR algorithm, ours is tree-based, and it does not depend on any assumptions about the physical interconnection topology of the processors or the specifics of the distributed computation. In addition, our algorithm is easier to implement and requires only half as many tree traverses as does the SKR algorithm. This paper is organized as follows. In section 2, we define our computational model. In section 3, we review the SKR algorithm. We introduce our new algorithm in section 4, and prove its correctness in section 5. We discuss its efficiency and present experimental results in section 6.
Parallel Implicit Algorithms for CFD
NASA Technical Reports Server (NTRS)
Keyes, David E.
1998-01-01
The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD). "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in the Message Passing Interface (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSC library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSC during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSC framework.
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.
1989-01-01
Several techniques to perform static and dynamic load balancing for vision systems are presented. These techniques are novel in the sense that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same, or have similar computational characteristics. These techniques are evaluated by applying them on a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.
NASA Astrophysics Data System (ADS)
Usamentiaga, Rubén; García, Daniel F.; Molleda, Julio; Sainz, Ignacio; Bulnes, Francisco G.
2011-01-01
Advances in the image processing field have brought new methods which are able to perform complex tasks robustly. However, in order to meet constraints on functionality and reliability, imaging application developers often design complex algorithms with many parameters which must be finely tuned for each particular environment. The best approach for tuning these algorithms is to use an automatic training method, but the computational cost of this kind of training method is prohibitive, making it unviable even on powerful machines. The same problem arises when designing testing procedures. This work presents methods to train and test complex image processing algorithms in parallel execution environments. The approach proposed in this work is to use existing resources in offices or laboratories, rather than expensive clusters. These resources are typically non-dedicated, heterogeneous and unreliable. The proposed methods have been designed to deal with all these issues. Two methods are proposed: intelligent training based on genetic algorithms and PVM, and a full factorial design based on grid computing which can be used for training or testing. These methods are capable of harnessing the available computational power resources, giving more work to more powerful machines, while taking their unreliable nature into account. Both methods have been tested using real applications.
A parallel algorithm for random searches
NASA Astrophysics Data System (ADS)
Wosniack, M. E.; Raposo, E. P.; Viswanathan, G. M.; da Luz, M. G. E.
2015-11-01
We discuss a parallelization procedure for a two-dimensional random search of a single individual, a typical sequential process. To assure the same features of the sequential random search in the parallel version, we analyze the former spatial patterns of the encountered targets for different search strategies and densities of homogeneously distributed targets. We identify a lognormal tendency for the distribution of distances between consecutively detected targets. Then, by assigning the distinct mean and standard deviation of this distribution for each corresponding configuration in the parallel simulations (constituted by parallel random walkers), we are able to recover important statistical properties, e.g., the target detection efficiency, of the original problem. The proposed parallel approach presents a speedup of nearly one order of magnitude compared with the sequential implementation. This algorithm can be easily adapted to different instances, such as searches in three dimensions. Its possible range of applicability covers problems in areas as diverse as automated computer searchers in high-capacity databases and animal foraging.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than line relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. The parallel implementation of a V-cycle multiple semi-coarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers is addressed. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. A mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited is described. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak
2016-05-01
Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive formulation mixed-integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve further as more functionality is added in the future.
Parallel algorithms for unconstrained optimizations by multisplitting
He, Qing
1994-12-31
In this paper a new parallel iterative algorithm for unconstrained optimization using the idea of multisplitting is proposed. This algorithm uses the existing sequential algorithms without any parallelization. Some convergence and numerical results for this algorithm are presented. The experiments are performed on an Intel iPSC/860 Hyper Cube with 64 nodes. It is interesting that the sequential implementation on one node shows that if the problem is split properly, the algorithm converges much faster than one without splitting.
A parallel algorithm for global routing
NASA Technical Reports Server (NTRS)
Brouwer, Randall J.; Banerjee, Prithviraj
1990-01-01
A Parallel Hierarchical algorithm for Global Routing (PHIGURE) is presented. The router is based on the work of Burstein and Pelavin, but has many extensions for general global routing and parallel execution. Main features of the algorithm include structured hierarchical decomposition into separate independent tasks which are suitable for parallel execution and adaptive simplex solution for adding feedthroughs and adjusting channel heights for row-based layout. Alternative decomposition methods and the various levels of parallelism available in the algorithm are examined closely. The algorithm is described and results are presented for a shared-memory multiprocessor implementation.
Linear Bregman algorithm implemented in parallel GPU
NASA Astrophysics Data System (ADS)
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms converge slowly and are therefore difficult to run on a PC. To deal with this issue, we use a parallel GPU to implement a widely used compressed sensing algorithm, the linear Bregman algorithm. The linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm involves only vector and matrix multiplication and thresholding operations, and is simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with a traditional CPU implementation of the Bregman algorithm. In addition, we compare the parallel Bregman algorithm with other CS reconstruction algorithms, such as the OMP and TwIST algorithms. The results show that the parallel Bregman algorithm needs less time than these two algorithms and is thus more convenient for real-time object reconstruction, which is important given people's fast-growing demand for information technology.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Array distribution in data-parallel programs
NASA Technical Reports Server (NTRS)
Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.
1994-01-01
We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.
Data-parallel algorithms for image computing
NASA Astrophysics Data System (ADS)
Carlotto, Mark J.
1990-11-01
Data-parallel algorithms for image computing on the Connection Machine are described. After a brief review of some basic programming concepts in *Lisp, a parallel extension of Common Lisp, data-parallel programming paradigms based on a local (diffusion-like) model of computation, the scan model of computation, a general interprocessor communications model, and a region-based model are introduced. Algorithms for connected component labeling, distance transformation, Voronoi diagrams, finding minimum cost paths, local means, shape-from-shading, hidden surface calculations, affine transformation, oblique parallel projection, and spatial operations over regions are presented. A new algorithm for interpolating irregularly spaced data via Voronoi diagrams is also described.
A parallel variable metric optimization algorithm
NASA Technical Reports Server (NTRS)
Straeter, T. A.
1973-01-01
An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
A parallel unmixing algorithm for hyperspectral images
NASA Astrophysics Data System (ADS)
Robila, Stefan A.; Maciak, Lukasz G.
2006-10-01
We present a new algorithm for feature extraction in hyperspectral images based on source separation and parallel computing. In source separation, given a linear mixture of sources, the goal is to recover the components by producing an unmixing matrix. In hyperspectral imagery, the mixing transform and the separated components can be associated with endmembers and their abundances. Source separation based methods have been employed for target detection and classification of hyperspectral images. However, these methods usually involve restrictive conditions on the nature of the results such as orthogonality (in Principal Component Analysis - PCA and Orthogonal Subspace Projection - OSP) of the endmembers or statistical independence (in Independent Component Analysis - ICA) of the abundances nor do they fully satisfy all the conditions included in the Linear Mixing Model. Compared to this, our approach is based on the Nonnegative Matrix Factorization (NMF), a less constraining unmixing method. NMF has the advantage of producing positively defined data, and, with several modifications that we introduce also ensures addition to one. The endmember vectors and the abundances are obtained through a gradient based optimization approach. The algorithm is further modified to run in a parallel environment. The parallel NMF (P-NMF) significantly reduces the time complexity and is shown to also easily port to a distributed environment. Experiments with in-house and Hydice data suggest that NMF outperforms ICA, PCA and OSP for unsupervised endmember extraction. Coupled with its parallel implementation, the new method provides an efficient way for unsupervised unmixing further supporting our efforts in the development of a real time hyperspectral sensing environment with applications to industry and life sciences.
A parallelization of the row-searching algorithm
NASA Astrophysics Data System (ADS)
Yaici, Malika; Khaled, Hayet; Khaled, Zakia; Bentahar, Athmane
2012-11-01
The problem dealt with in this paper concerns the parallelization of the row-searching algorithm, which searches for linearly dependent rows in a given matrix, and its implementation in an MPI (Message Passing Interface) environment. This algorithm is widely used in control theory and more specifically in solving the famous diophantine equation. An introduction to the diophantine equation is presented, then two parallelization approaches of the algorithm are detailed. The first distributes a set of rows on processes (processors) and the second makes a distribution per blocks. The sequential algorithm and its two parallel forms are implemented using MPI routines, then modelled using UML (Unified Modelling Language) and finally evaluated using algorithmic complexity.
MULTIOBJECTIVE PARALLEL GENETIC ALGORITHM FOR WASTE MINIMIZATION
In this research we have developed an efficient multiobjective parallel genetic algorithm (MOPGA) for waste minimization problems. This MOPGA integrates PGAPack (Levine, 1996) and NSGA-II (Deb, 2000) with novel modifications. PGAPack is a master-slave parallel implementation of a...
Parallel algorithms for dynamically partitioning unstructured grids
Diniz, P.; Plimpton, S.; Hendrickson, B.; Leland, R.
1994-10-01
Grid partitioning is the method of choice for decomposing a wide variety of computational problems into naturally parallel pieces. In problems where computational load on the grid or the grid itself changes as the simulation progresses, the ability to repartition dynamically and in parallel is attractive for achieving higher performance. We describe three algorithms suitable for parallel dynamic load-balancing which attempt to partition unstructured grids so that computational load is balanced and communication is minimized. The execution time of algorithms and the quality of the partitions they generate are compared to results from serial partitioners for two large grids. The integration of the algorithms into a parallel particle simulation is also briefly discussed.
Parallelized dilate algorithm for remote sensing image.
Zhang, Suli; Hu, Haoran; Pan, Xin
2014-01-01
The dilate algorithm is an important operation that can give a more connected view of a remote sensing image containing broken lines or objects. However, with the technological progress of satellite sensors, the resolution of remote sensing images has been increasing and their data volumes have become very large. This can slow the algorithm down or prevent it from producing a result within limited memory or time. To solve this problem, our research proposes a parallelized dilate algorithm for remote sensing images based on MPI and MP. Experiments show that our method runs faster than the traditional single-process algorithm.
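One common way to parallelize dilation is to split the image into row strips and give each strip a halo of neighboring rows, so that every strip can be processed independently. The Python sketch below illustrates that decomposition only. It is not the authors' implementation: the 3x3 square structuring element, the one-row halo, and the function names are assumptions for the example, and the strips are processed in a serial loop where a parallel version would assign one strip to each process.

```python
def dilate(image):
    """Binary dilation with a 3x3 square structuring element."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and image[rr][cc]:
                        out[r][c] = 1
    return out


def dilate_in_strips(image, n_strips):
    """Dilate row strips independently, each padded with one halo row per side."""
    rows = len(image)
    bounds = [rows * k // n_strips for k in range(n_strips + 1)]
    result = []
    for start, stop in zip(bounds, bounds[1:]):
        if start == stop:
            continue                          # more strips than rows: skip empty strip
        lo, hi = max(start - 1, 0), min(stop + 1, rows)
        piece = dilate(image[lo:hi])          # the work one process would do
        offset = start - lo                   # drop the halo rows again
        result.extend(piece[offset:offset + (stop - start)])
    return result


image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
same = dilate_in_strips(image, 3) == dilate(image)
```

The halo width equals the radius of the structuring element. A larger element would need correspondingly more halo rows per strip.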
Parallel processing of numerical transport algorithms
Wienke, B.R.; Hiromoto, R.E.
1984-01-01
The multigroup, discrete ordinates representation for the linear transport equation enjoys widespread computational use and popularity. Serial solution schemes and numerical algorithms developed over the years provide a timely framework for parallel extension. On the Denelcor HEP, we investigate the parallel structure and extension of a number of standard S_n approaches. Concurrent inner sweeps, coupled acceleration techniques, synchronized inner-outer loops, and chaotic iteration are described, and results of computations are contrasted. The multigroup representation and serial iteration methods are also detailed. The basic iterative S_n method lends itself to parallel tasking, portably affording an effective medium for performing transport calculations on future architectures. This analysis represents a first attempt to extend serial S_n algorithms to parallel environments and provides good baseline estimates on ease of parallel implementation, relative algorithm efficiency, comparative speedup, and some future directions. We find basic inner-outer and chaotic iteration strategies both easily support comparably high degrees of parallelism. Both accommodate parallel rebalance and diffusion acceleration and appear as robust and viable parallel techniques for S_n production work.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Parallel Genetic Algorithm for Alpha Spectra Fitting
NASA Astrophysics Data System (ADS)
García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio
2005-01-01
We present a performance study of alpha-particle spectra fitting using parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run parallel GA to find an initial solution for the second step, in which we use Levenberg-Marquardt (LM) method for a precise final fit. GA is a high resources-demanding method, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and processors number is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal processors number that must be used in a simulation.
Experiences with the PGAPack Parallel Genetic Algorithm library
Levine, D.; Hallstrom, P.; Noelle, D.; Walenz, B.
1997-07-01
PGAPack is the first widely distributed parallel genetic algorithm library. Since its release, several thousand copies have been distributed worldwide to interested users. In this paper we discuss the key components of the PGAPack design philosophy and present a number of application examples that use PGAPack.
Empirical study of parallel LRU simulation algorithms
NASA Technical Reports Server (NTRS)
Carr, Eric; Nicol, David M.
1994-01-01
This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithms are more complex, but have costs that are independent of the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithms implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
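For readers unfamiliar with stack distances, the following Python sketch computes them serially for a short reference trace. It is a plain textbook-style illustration, not one of the five parallel algorithms from the paper. The trace and function names are hypothetical, and a first reference to a tag is reported as `None` (an infinite distance).

```python
def stack_distances(trace):
    """Return the LRU stack distance of each reference (None on first touch)."""
    stack = []                             # most recently used tag at index 0
    distances = []
    for tag in trace:
        if tag in stack:
            depth = stack.index(tag)       # 0-based depth in the LRU stack
            distances.append(depth + 1)    # 1-based stack distance
            stack.pop(depth)
        else:
            distances.append(None)         # cold miss
        stack.insert(0, tag)               # referenced tag moves to the top
    return distances


def hits_for_cache_size(distances, size):
    """A reference hits in a fully associative LRU cache with `size` lines
    exactly when its stack distance is at most `size`."""
    return sum(1 for d in distances if d is not None and d <= size)


trace = ["a", "b", "c", "a", "b", "d", "a"]
dists = stack_distances(trace)
hits_small = hits_for_cache_size(dists, 2)
hits_large = hits_for_cache_size(dists, 3)
```

Because a single pass yields the hit count for every cache size at once, computing all stack distances is more useful than simulating one cache size at a time.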
Parallel algorithms for mapping pipelined and parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published $O(nm^3)$ time algorithm for mapping m modules onto n processors is reduced to an $O(nm \log m)$ time complexity, and its space requirements reduced from $O(nm^2)$ to $O(m)$. Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
Parallel, Distributed Scripting with Python
Miller, P J
2002-05-24
Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc. These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.
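The "distribute the information and co-ordinate the work" pattern for the dictionary check can be sketched as follows. This is an illustration under stated assumptions, not the tool described in the report: SHA-256 from `hashlib` stands in for the unspecified password hash, the word list is synthetic, and a thread pool from the standard library stands in for MPI. In CPython a thread pool mainly shows the structure; a process pool or MPI ranks would be the substitute when real speedup is the goal.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(word, salt):
    # Stand-in for the real password hash, which the source does not specify.
    return hashlib.sha256((salt + word).encode("utf-8")).hexdigest()


def search_chunk(chunk, salt, target):
    """Return the first word in `chunk` whose digest equals `target`, else None."""
    for word in chunk:
        if digest(word, salt) == target:
            return word
    return None


def parallel_check(words, salt, target, workers=4):
    # Distribute: deal the dictionary out into `workers` interleaved chunks.
    chunks = [words[i::workers] for i in range(workers)]
    # Co-ordinate: run the chunks concurrently and report the first match found.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for found in pool.map(lambda c: search_chunk(c, salt, target), chunks):
            if found is not None:
                return found
    return None


# Synthetic 25,000-word dictionary (hypothetical data for illustration).
words = ["word%05d" % i for i in range(25000)]
target = digest("word12345", "xy")
match = parallel_check(words, "xy", target)
```

Each worker needs only its chunk, the salt, and the target digest, which is why the task needs so little communication.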
A parallel adaptive mesh refinement algorithm
NASA Technical Reports Server (NTRS)
Quirk, James J.; Hanebutte, Ulf R.
1993-01-01
Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
Parallel algorithms for the spectral transform method
Foster, I.T.; Worley, P.H.
1994-04-01
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional FFTs and other parallel transforms.
Parallel algorithms for the spectral transform method
Foster, I.T.; Worley, P.H.
1997-05-01
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, the authors describe these different parallel algorithms and report on computational experiments that they have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. The authors focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but they also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional fast Fourier transforms (FFTs) and other parallel transforms.
Analysis of a parallel multigrid algorithm
NASA Technical Reports Server (NTRS)
Chan, Tony F.; Tuminaro, Ray S.
1989-01-01
The parallel multigrid algorithm of Frederickson and McBryan (1987) is considered. This algorithm uses multiple coarse-grid problems (instead of one problem) in the hope of accelerating convergence and is found to have a close relationship to traditional multigrid methods. Specifically, the parallel coarse-grid correction operator is identical to a traditional multigrid coarse-grid correction operator, except that the mixing of high and low frequencies caused by aliasing error is removed. Appropriate relaxation operators can be chosen to take advantage of this property. Comparisons between the standard multigrid and the new method are made.
The PRISM project: Infrastructure and algorithms for parallel eigensolvers
Bischof, C.; Sun, X.; Huss-Lederman, S.; Tsao, A.
1993-12-31
The goal of the PRISM project is the development of infrastructure and algorithms for the parallel solution of eigenvalue problems. We are currently investigating a complete eigensolver based on the Invariant Subspace Decomposition Algorithm for dense symmetric matrices (SYISDA). After briefly reviewing the SYISDA approach, we discuss the algorithmic highlights of a distributed-memory implementation of an eigensolver based on this approach. These include a fast matrix-matrix multiplication algorithm, a new approach to parallel band reduction and tridiagonalization, and a harness for coordinating the divide-and-conquer parallelism in the problem. We also present performance results of these kernels as well as the overall SYISDA implementation on the Intel Touchstone Delta prototype and the IBM SP/1.
Parallel Clustering Algorithms for Structured AMR
Gunney, B T; Wissink, A M; Hysom, D A
2005-10-26
We compare several different parallel implementation approaches for the clustering operations performed during adaptive gridding operations in patch-based structured adaptive mesh refinement (SAMR) applications. Specifically, we target the clustering algorithm of Berger and Rigoutsos (BR91), which is commonly used in many SAMR applications. The baseline for comparison is a simplistic parallel extension of the original algorithm that works well for up to $O(10^2)$ processors. Our goal is a clustering algorithm for machines of up to $O(10^5)$ processors, such as the 64K-processor IBM BlueGene/Light system. We first present an algorithm that avoids the unneeded communications of the simplistic approach to improve the clustering speed by up to an order of magnitude. We then present a new task-parallel implementation to further reduce communication wait time, adding another order of magnitude of improvement. The new algorithms also exhibit more favorable scaling behavior for our test problems. Performance is evaluated on a number of large scale parallel computer systems, including a 16K-processor BlueGene/Light system.
Parallel distributed computing using Python
NASA Astrophysics Data System (ADS)
Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro
2011-09-01
This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
NASA Astrophysics Data System (ADS)
Gladwin, D.; Stewart, P.; Stewart, J.
2011-02-01
This article addresses the problem of maintaining a stable rectified DC output from the three-phase AC generator in a series-hybrid vehicle powertrain. The series-hybrid prime power source generally comprises an internal combustion (IC) engine driving a three-phase permanent magnet generator whose output is rectified to DC. A recent development has been to control the engine/generator combination by an electronically actuated throttle. This system can be represented as a nonlinear system with significant time delay. Previously, voltage control of the generator output has been achieved by model predictive methods such as the Smith Predictor. These methods rely on the incorporation of an accurate system model and time delay into the control algorithm, with a consequent increase in computational complexity in the real-time controller, and of necessity rely to some extent on the accuracy of the models. Two complementary performance objectives exist for the control system. Firstly, to maintain the IC engine at its optimal operating point, and secondly, to supply a stable DC supply to the traction drive inverters. Achievement of these goals minimises the transient energy storage requirements at the DC link, with a consequent reduction in both weight and cost. These objectives imply constant velocity operation of the IC engine under external load disturbances and changes in both operating conditions and vehicle speed set-points. In order to achieve these objectives, and reduce the complexity of implementation, in this article a controller is designed by the use of Genetic Programming methods in the Simulink modelling environment, with the aim of obtaining a relatively simple controller for the time-delay system which does not rely on the implementation of real time system models or time delay approximations in the controller. A methodology is presented to utilise the myriad of existing control blocks in the Simulink libraries to automatically evolve optimal control
Parallelization of Edge Detection Algorithm using MPI on Beowulf Cluster
NASA Astrophysics Data System (ADS)
Haron, Nazleeni; Amir, Ruzaini; Aziz, Izzatdin A.; Jung, Low Tan; Shukri, Siti Rohkmah
In this paper, we present the design of a parallel Sobel edge detection algorithm using Foster's methodology. The parallel algorithm is implemented using the MPI message passing library and a master/slave algorithm. Every processor performs the same sequential algorithm but on a different part of the image. Experimental results obtained on a Beowulf cluster are presented to demonstrate the performance of the parallel algorithm.
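The idea that every processor runs the same sequential Sobel kernel on a different part of the image can be sketched in plain Python. This is an illustration under assumptions, not the MPI implementation of the paper: a standard-library thread pool plays the role of the workers, the image is split into contiguous row strips, the gradient magnitude is approximated as |gx| + |gy|, border pixels are set to 0, and all function names are hypothetical. Because the threads share the whole image, no halo rows need to be sent; in a message-passing version the master would send each strip together with one neighbouring row on each side.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_rows(image, row_start, row_stop):
    """Sequential kernel: |gx| + |gy| for rows [row_start, row_stop)."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(row_start, row_stop):
        row = [0] * width                      # border pixels stay 0
        if 0 < r < height - 1:
            for c in range(1, width - 1):
                gx = gy = 0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        pixel = image[r + dr][c + dc]
                        gx += GX[dr + 1][dc + 1] * pixel
                        gy += GY[dr + 1][dc + 1] * pixel
                row[c] = abs(gx) + abs(gy)
        out.append(row)
    return out


def sobel_parallel(image, workers=4):
    """Split the rows into contiguous strips and process them concurrently."""
    height = len(image)
    bounds = [(w * height // workers, (w + 1) * height // workers)
              for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        strips = pool.map(lambda b: sobel_rows(image, b[0], b[1]), bounds)
        # Gather the strips back in row order.
        return [row for strip in strips for row in strip]


# Small test image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
edges = sobel_parallel(image)
```

The strip result is identical to running the sequential kernel over the whole image, since each output row depends only on its own row and the two adjacent input rows.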
Design and implementation of parallel multigrid algorithms
NASA Technical Reports Server (NTRS)
Chan, Tony F.; Tuminaro, Ray S.
1988-01-01
Techniques for mapping multigrid algorithms to solve elliptic PDEs on hypercube parallel computers are described and demonstrated. The need for proper data mapping to minimize communication distances is stressed, and an execution-time model is developed to show how algorithm efficiency is affected by changes in the machine and algorithm parameters. Particular attention is then given to the case of coarse computational grids, which can lead to idle processors, load imbalances, and inefficient performance. It is shown that convergence can be improved by using idle processors to solve a new problem concurrently on the fine grid defined by a splitting.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research on parallel algorithm for sequential pattern mining
NASA Astrophysics Data System (ADS)
Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao
2008-03-01
Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field is no longer confined to business databases but has extended to new data sources such as the Web and advanced science fields such as DNA analysis. The data for sequential pattern mining have the following characteristics: massive volume and distributed storage. Most existing sequential pattern mining algorithms have not considered these characteristics together. Based on these traits and on parallel theory, this paper puts forward a new distributed parallel algorithm, SPP (Sequential Pattern Parallel). The algorithm abides by the principle of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn't generate candidate sequences, which reduces the access time and improves the mining efficiency. Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.
Mapping algorithms on regular parallel architectures
Lee, P.
1989-01-01
It is significant that many time-intensive scientific algorithms are formulated as nested loops, which are inherently regularly structured. In this dissertation the relations between the mathematical structure of nested loop algorithms and the architectural capabilities required for their parallel execution are studied. The architectural model considered in depth is that of an arbitrary dimensional systolic array. The mathematical structure of the algorithm is characterized by classifying its data-dependence vectors according to the new ZERO-ONE-INFINITE property introduced. Using this classification, the first complete set of necessary and sufficient conditions for correct transformation of a nested loop algorithm onto a given systolic array of an arbitrary dimension by means of linear mappings is derived. Practical methods to derive optimal or suboptimal systolic array implementations are also provided. The techniques developed are used constructively to develop families of implementations satisfying various optimization criteria and to design programmable arrays efficiently executing classes of algorithms. In addition, a Computer-Aided Design system running on SUN workstations has been implemented to help in the design. The methodology, which deals with general algorithms, is illustrated by synthesizing linear and planar systolic array algorithms for matrix multiplication, a reindexed Warshall-Floyd transitive closure algorithm, and the longest common subsequence algorithm.
Parallel asynchronous systems and image processing algorithms
NASA Technical Reports Server (NTRS)
Coon, D. D.; Perera, A. G. U.
1989-01-01
A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.
Parallelization of the Pipelined Thomas Algorithm
NASA Technical Reports Server (NTRS)
Povitsky, A.
1998-01-01
In this study the following questions are addressed. Is it possible to improve the parallelization efficiency of the Thomas algorithm? How should the Thomas algorithm be formulated in order to get solved lines that are used as data for other computational tasks while processors are idle? To answer these questions, two-step pipelined algorithms (PAs) are introduced formally. It is shown that the idle processor time is invariant with respect to the order of backward and forward steps in PAs starting from one outermost processor. The advantage of PAs starting from two outermost processors is small. Versions of the pipelined Thomas algorithms considered here fall into the category of PAs. These results show that the parallelization efficiency of the Thomas algorithm cannot be improved directly. However, the processor idle time can be used if some data has been computed by the time processors become idle. To achieve this goal the Immediate Backward pipelined Thomas Algorithm (IB-PTA) is developed in this article. The backward step is computed immediately after the forward step has been completed for the first portion of lines. This enables the completion of the Thomas algorithm for some of these lines before processors become idle. An algorithm for generating a static processor schedule recursively is developed. This schedule is used to switch between forward and backward computations and to control communications between processors. The advantage of the IB-PTA over the basic PTA is the presence of solved lines, which are available for other computations, by the time processors become idle.
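For reference, the forward and backward steps that the pipelined variants distribute across processors are those of the ordinary serial Thomas algorithm for a tridiagonal system. The sketch below shows only that serial baseline, not the PTA or IB-PTA schedules; the function name and argument layout are assumptions, and no pivoting is performed, so the matrix is assumed to be diagonally dominant or otherwise safe to eliminate in order.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by forward elimination and back substitution.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    upper[i] is the super-diagonal entry of row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n    # modified super-diagonal coefficients
    d = [0.0] * n    # modified right-hand side

    # Forward step: eliminate the sub-diagonal, row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Backward step: substitute from the last unknown upwards.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# 4x4 system with 2 on the diagonal and -1 off the diagonal,
# built so that the exact solution is [1, 2, 3, 4].
solution = thomas_solve([0.0, -1.0, -1.0, -1.0],
                        [2.0, 2.0, 2.0, 2.0],
                        [-1.0, -1.0, -1.0, 0.0],
                        [0.0, 0.0, 0.0, 5.0])
```

Each row of the forward step depends on the row before it, and each row of the backward step on the row after it, which is the dependency chain that forces processors to wait in a pipelined version.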
Parallel algorithms for computing linked list prefix
Han, Y.
1989-06-01
Given a linked list $\chi_1, \chi_2, \ldots, \chi_n$ with $\chi_i$ following $\chi_{i-1}$ in the list and an associative operation $\circ$, the linked list prefix problem is to compute all prefixes $\bigcirc_{i=1}^{j} \chi_i$, $j = 1, 2, \ldots, n$. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is $O(n/\rho + \rho \log \rho)$, where $n$ is the number of elements in the linked list and $\rho$ is the number of processors used. Therefore their algorithm is optimal when $n \ge \rho^2 \log \rho$. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity $O(n/\rho + \log \rho)$ with a small multiplicative constant. It is optimal when $n \ge \rho \log \rho$.
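To illustrate what the linked list prefix problem computes, the sketch below simulates the classic pointer-jumping technique, in which every node combines its value with its predecessor's and then doubles the length of its predecessor link in each synchronous round. This is a swapped-in textbook method, not the vector balancing algorithm or the PRAM algorithm of the paper: it finishes in about log2(n) rounds but performs O(n log n) total work. The array layout and the function name are assumptions.

```python
import operator


def list_prefix(values, succ, op=operator.add):
    """Prefix of every node of a linked list stored in arrays.

    values[i] is the element held by node i; succ[i] is the index of the
    node that follows node i, or None for the last node. `op` must be
    associative (it need not be commutative).
    """
    n = len(values)
    pred = [None] * n
    for i, s in enumerate(succ):
        if s is not None:
            pred[s] = i

    acc = list(values)
    while any(p is not None for p in pred):
        # One synchronous round: all nodes read the old arrays, then all
        # nodes are updated together, as parallel processors would do.
        new_acc = [acc[i] if pred[i] is None else op(acc[pred[i]], acc[i])
                   for i in range(n)]
        new_pred = [None if pred[i] is None else pred[pred[i]]
                    for i in range(n)]
        acc, pred = new_acc, new_pred
    return acc


# List order is node 2 -> node 0 -> node 3 -> node 1, holding "a", "b", "c", "d".
values = ["b", "d", "a", "c"]
succ = [3, None, 0, 1]
prefixes = list_prefix(values, succ)
```

String concatenation is used in the example because it is associative but not commutative, so the result also shows that list order is preserved.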
Parallel algorithms for boundary value problems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A general approach to solving boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step, where all the P available processors work in parallel, and the global step, where one processor solves a tridiagonal linear system of order P. The main advantages of this approach are twofold. First, the approach is very flexible, especially in the local step, and thus the algorithm can be used with any number of processors and with any SIMD or MIMD machine. Second, the communication complexity is very small, and thus the approach can be used just as easily with shared memory machines. Several examples of using this strategy are discussed.
Parallel algorithms for optical digital computers
Huang, A.
1983-01-01
Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.
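The flavour of parallel symbolic substitution for binary addition can be sketched as follows. The abstract does not give the actual substitution rules, so the rule table here is an assumption: each column holding a pair of bits is recognised and replaced, in all columns at once, by its sum bit (kept in place) and its carry bit (moved one column towards the high end), and the substitution is repeated until no carries remain. Function names are hypothetical.

```python
# Assumed rule table: (top bit, bottom bit) -> (sum bit, carry bit).
RULES = {(0, 0): (0, 0), (0, 1): (1, 0), (1, 0): (1, 0), (1, 1): (0, 1)}


def substitute_once(top, bottom):
    """Apply the rule table to every column simultaneously.

    Bits are stored least significant first. Sum bits stay in their column;
    carry bits shift one column up. A carry out of the last column is dropped.
    """
    pairs = [RULES[(a, b)] for a, b in zip(top, bottom)]
    new_top = [s for s, _ in pairs]
    new_bottom = [0] + [c for _, c in pairs[:-1]]
    return new_top, new_bottom


def add_by_substitution(x, y, width):
    """Add x and y modulo 2**width by repeated parallel substitution."""
    top = [(x >> i) & 1 for i in range(width)]
    bottom = [(y >> i) & 1 for i in range(width)]
    while any(bottom):                  # stop once every carry has been absorbed
        top, bottom = substitute_once(top, bottom)
    return sum(bit << i for i, bit in enumerate(top))


total = add_by_substitution(13, 11, 8)
incremented = add_by_substitution(7, 1, 8)
```

Binary incrementation is the special case in which the second operand is 1. The loop ends after at most `width` substitutions because the lowest remaining carry moves up by at least one column per step.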
Coupled cluster algorithms for networks of shared memory parallel processors
NASA Astrophysics Data System (ADS)
Bentz, Jonathan L.; Olson, Ryan M.; Gordon, Mark S.; Schmidt, Michael W.; Kendall, Ricky A.
2007-05-01
As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too increases the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques to parallelize a very important algorithm (often called the "gold standard") used in computational chemistry, the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS (General Atomic and Molecular Electronic Structure System) program suite [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347-1363] and the Distributed Data Interface (DDI) [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190]; however, the essential features of the algorithm (data distribution, load-balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.
Parallelized FVM algorithm for three-dimensional viscoelastic flows
NASA Astrophysics Data System (ADS)
Dou, H.-S.; Phan-Thien, N.
A parallel implementation of the finite volume method (FVM) for three-dimensional (3D) viscoelastic flows is developed on a distributed computing environment through Parallel Virtual Machine (PVM). The numerical procedure is based on the SIMPLEST algorithm using a staggered FVM discretization in Cartesian coordinates. The final discretized algebraic equations are solved with the TDMA method. The parallelisation of the program is implemented by a domain decomposition strategy, with a master/slave style programming paradigm, and message passing through PVM. A load balancing strategy is proposed to reduce the communications between processors. The three-dimensional viscoelastic flow in a rectangular duct is computed with this program. The modified Phan-Thien-Tanner (MPTT) constitutive model is employed for the equation system closure. Computing results are validated on the secondary flow problem due to non-zero second normal stress difference N2. Three sets of meshes are used, and the effect of domain decomposition strategies on the performance is discussed. It is found that parallel efficiency is strongly dependent on the grid size and the number of processors for a given block number. The convergence rate as well as the total efficiency of domain decomposition depends upon the flow problem and the boundary conditions. The parallel efficiency increases with increasing problem size for a given block number. Compared with two-dimensional flow problems, the 3D parallelized algorithm has a lower efficiency owing to largely overlapped block interfaces, but the parallel algorithm is indeed a powerful means for large scale flow simulations.
Parallelism of the SANDstorm hash algorithm.
Torgerson, Mark Dolan; Draelos, Timothy John; Schroeppel, Richard Crabtree
2009-09-01
Mainstream cryptographic hashing algorithms are not parallelizable. This limits their speed, and they are not able to take advantage of the current trend of being run on multi-core platforms. Being limited in speed limits their usefulness as an authentication mechanism in secure communications. Sandia researchers have created a new cryptographic hashing algorithm, SANDstorm, which was specifically designed to take advantage of multi-core processing and be parallelizable on a wide range of platforms. This report describes a late-start LDRD effort to verify the parallelizability claims of the SANDstorm designers. We have shown, with operating code and bench testing, that the SANDstorm algorithm may be trivially parallelized on a wide range of hardware platforms. Implementations using OpenMP demonstrate a linear speedup with multiple cores. We have also shown significant performance gains with optimized C code and the use of assembly instructions to exploit particular platform capabilities.
Embodied and Distributed Parallel DJing.
Cappelen, Birgitta; Andersson, Anders-Petter
2016-01-01
Everyone has a right to take part in cultural events and activities, such as music performances and music making. Enforcing that right, within Universal Design, is often limited to a focus on physical access to public areas, hearing aids etc., or groups of persons with special needs performing in traditional ways. The latter might be people with disabilities, being musicians playing traditional instruments, or actors playing theatre. In this paper we focus on the innovative potential of including people with special needs, when creating new cultural activities. In our project RHYME our goal was to create health promoting activities for children with severe disabilities, by developing new musical and multimedia technologies. Because of the users' extreme demands and rich contribution, we ended up creating both a new genre of musical instruments and a new art form. We call this new art form Embodied and Distributed Parallel DJing, and the new genre of instruments Empowering Multi-Sensorial Things. PMID:27534347
Analysis of the numerical effects of parallelism on a parallel genetic algorithm
Hart, W.E.; Belew, R.K.; Kohn, S.; Baden, S.
1995-09-18
This paper examines the effects of relaxed synchronization on both the numerical and parallel efficiency of parallel genetic algorithms (GAs). We describe a coarse-grain geographically structured parallel genetic algorithm. Our experiments show that asynchronous versions of these algorithms have a lower run time than synchronous GAs. Furthermore, we demonstrate that this improvement in performance is partly due to the fact that the numerical efficiency of the asynchronous genetic algorithm is better than that of the synchronous genetic algorithm. Our analysis includes a critique of the utility of traditional parallel performance measures for parallel GAs, and we evaluate the claims made by several researchers that parallel GAs can have superlinear speedup.
Predicting mining activity with parallel genetic algorithms
Talaie, S.; Leigh, R.; Louis, S.J.; Raines, G.L.; Beyer, H.G.; O'Reilly, U.M.; Banzhaf, Arnold D.; Blum, W.; Bonabeau, C.; Cantu-Paz, E.W.
2005-01-01
We explore several different techniques in our quest to improve the overall model performance of a genetic algorithm calibrated probabilistic cellular automata. We use the Kappa statistic to measure correlation between ground truth data and data predicted by the model. Within the genetic algorithm, we introduce a new evaluation function sensitive to spatial correctness and we explore the idea of evolving different rule parameters for different subregions of the land. We reduce the time required to run a simulation from 6 hours to 10 minutes by parallelizing the code and employing a 10-node cluster. Our empirical results suggest that using the spatially sensitive evaluation function does indeed improve the performance of the model and our preliminary results also show that evolving different rule parameters for different regions tends to improve overall model performance. Copyright 2005 ACM.
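The agreement measure mentioned in this abstract can be illustrated with a short sketch. It assumes the "Kappa statistic" is Cohen's kappa computed over paired per-cell labels; the abstract does not say which variant the authors used, and the function name and the sample data below are hypothetical.

```python
from collections import Counter


def cohen_kappa(truth, predicted):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(truth)
    # Observed agreement: fraction of cells where the two labels match.
    p_observed = sum(t == p for t, p in zip(truth, predicted)) / n
    # Chance agreement, estimated from the marginal label frequencies.
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    p_chance = sum(truth_counts[k] * pred_counts[k] for k in truth_counts) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both sequences carry one identical label everywhere
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ground truth and model output for four map cells (1 = mined).
ground_truth = [1, 1, 0, 0]
model_output = [1, 0, 0, 0]
print(cohen_kappa(ground_truth, model_output))
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which makes it more informative than raw accuracy when one class dominates the map.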
Algorithmic commonalities in the parallel environment
NASA Technical Reports Server (NTRS)
Mcanulty, Michael A.; Wainer, Michael S.
1987-01-01
The ultimate aim of this project was to analyze procedures from substantially different application areas to discover what is either common or peculiar in the process of conversion to the Massively Parallel Processor (MPP). Three areas were identified: molecular dynamic simulation, production systems (rule systems), and various graphics and vision algorithms. To date, only selected graphics procedures have been investigated. They are the most readily available, and produce the most visible results. These include simple polygon patch rendering, raycasting against a constructive solid geometric model, and stochastic or fractal based textured surface algorithms. Only the simplest of conversion strategies, mapping a major loop to the array, has been investigated so far. It is not entirely satisfactory.
An intelligent allocation algorithm for parallel processing
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Homaifar, Abdollah; Ananthram, Kishan G.
1988-01-01
The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of the interprocessor communication network, influence the allocation. To achieve realistic estimates of the execution durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication durations varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network.
A garbage collection algorithm for shared memory parallel processors
Crammond, J.
1988-12-01
This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.
Parallel algorithm strategies for circuit simulation.
Thornquist, Heidi K.; Schiek, Richard Louis; Keiter, Eric Richard
2010-01-01
Circuit simulation tools (e.g., SPICE) have become invaluable in the development and design of electronic circuits. However, they have been pushed to their performance limits in addressing circuit design challenges that come from the technology drivers of smaller feature scales and higher integration. Improving the performance of circuit simulation tools through exploiting new opportunities in widely-available multi-processor architectures is a logical next step. Unfortunately, not all traditional simulation applications are inherently parallel, and quickly adapting mature application codes (even codes designed to parallel applications) to new parallel paradigms can be prohibitively difficult. In general, performance is influenced by many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, the use of mini-applications, small self-contained proxies for real applications, is an excellent approach for rapidly exploring the parameter space of all these choices. In this report we present a multi-core performance study of Xyce, a transistor-level circuit simulation tool, and describe the future development of a mini-application for circuit simulation.
A Parallel Processing Algorithm for Gravity Inversion
NASA Astrophysics Data System (ADS)
Frasheri, Neki; Bushati, Salvatore; Frasheri, Alfred
2013-04-01
The paper presents results of using MPI parallel processing for the 3D inversion of gravity anomalies. The work is done under the FP7 project HP-SEE (http://www.hp-see.eu/). The inversion of geophysical anomalies remains a challenge, and the use of parallel processing can be a tool to achieve better results, "compensating" for the complexity of the ill-posed inversion problem with an increased volume of calculations. We considered gravity as the simplest case of a physical field and experimented with an algorithm based on the methodology known as CLEAN, developed by Högbom in 1974. The 3D geosection was discretized into finite cuboid elements and represented by a 3D array of nodes, while the ground surface where the anomaly is observed was represented as a 2D array of points. Starting from a geosection with zero mass density in all nodes, the algorithm iteratively identifies the 3D node whose anomaly shape best approximates the observed anomaly in the least squares sense; the mass density in that node is modified by a prefixed density step and the related effect is subtracted from the observed anomaly; the process continues until a stopping criterion is fulfilled. The theoretical complexity of the algorithm was evaluated on the basis of iterations and run-time for a geosection discretized at different scales. We considered the average number N of nodes along one edge of the 3D array. The number of iterations was evaluated as O(N^3), and the run-time as O(N^8). We used several different methods for identifying the 3D node whose effect offers the best least squares error in approximating the observed anomaly: unweighted least squares error for the whole 2D array of anomalous points; weighting the least squares error by the inverted value of the observed anomaly over each 3D node; and limiting the area of 2D anomalous points where least squares are calculated over shallow 3D nodes. By comparing results from the inversion of single body and two
A parallel genetic algorithm for the set partitioning problem
Levine, D.
1994-05-01
In this dissertation the author reports on his efforts to develop a parallel genetic algorithm and apply it to the solution of the set partitioning problem, a difficult combinatorial optimization problem used by many airlines as a mathematical model for flight crew scheduling. He developed a distributed steady-state genetic algorithm in conjunction with a specialized local search heuristic for solving the set partitioning problem. The genetic algorithm is based on an island model in which multiple independent subpopulations each run a steady-state genetic algorithm and fit strings occasionally migrate between the subpopulations. Tests on forty real-world set partitioning problems were carried out on up to 128 nodes of an IBM SP1 parallel computer. The author found that performance, as measured by the quality of the solution found and the iteration on which it was found, improved as additional subpopulations were added to the computation. With larger numbers of subpopulations the genetic algorithm was regularly able to find the optimal solution to problems having up to a few thousand integer variables. In two cases, high-quality integer feasible solutions were found for problems with 36,699 and 43,749 integer variables, respectively. A notable limitation he found was the difficulty of solving problems with many constraints.
Parallel Harmony Search Based Distributed Energy Resource Optimization
Ceylan, Oguzhan; Liu, Guodong; Tomsovic, Kevin
2015-01-01
This paper presents a harmony search based parallel optimization algorithm to minimize voltage deviations in three-phase unbalanced electrical distribution systems and to maximize the active power outputs of distributed energy resources (DR). The main contribution is to reduce the adverse impacts on the voltage profile as photovoltaic (PV) output or electric vehicle (EV) charging changes throughout the day. The IEEE 123-bus distribution test system is modified by adding DRs and EVs under different load profiles. The simulation results show that, by using parallel computing techniques, heuristic methods may be used as an alternative optimization tool in the operation of electrical power distribution systems.
Fast parallel algorithm for slicing STL based on pipeline
NASA Astrophysics Data System (ADS)
Ma, Xulong; Lin, Feng; Yao, Bo
2016-05-01
In the field of additive manufacturing, current research on data processing focuses mainly on slicing large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. The parallel algorithm is designed around a pipeline mode, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers both have a marked effect on the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and it also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm, based on data parallelism, is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of the multi-core CPU hardware and accelerates the slicing process; compared with the data parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
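The two scaling laws named in this abstract can be written down directly. The sketch below is only an illustration of the standard formulas; the parallel fraction of 0.9 and the thread counts are assumed values, not figures from the paper.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law: scaled speedup when the problem grows with the workers."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers


# Assumed parallel fraction; a real value would have to be measured.
fraction = 0.9
for threads in (1, 2, 4, 8):
    print(threads, amdahl_speedup(fraction, threads), gustafson_speedup(fraction, threads))
```

Under Amdahl's law the speedup saturates at 1 / (1 - fraction) as threads are added, while under Gustafson's law it keeps growing linearly because the parallel part of the work (here, loosely, the number of layers) grows too.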
Communication-efficient parallel-graph algorithms. Master's thesis
Maggs, B.M.
1986-06-01
Communication bandwidth is a resource ignored by most parallel random-access machine (PRAM) models. This thesis shows that many graph problems can be solved in parallel, not only with polylogarithmic performance, but with efficient communication at each step of the computation. The communication requirements of an algorithm are measured in a restricted PRAM model called the distributed random-access machine (DRAM), which can be viewed as an abstraction of volume-universal networks such as fat trees. In this model, communication cost is measured in terms of the congestion of memory accesses across cuts of the machine. It is demonstrated that the recursive doubling technique frequently used in PRAM algorithms is wasteful of communication resources, and that recursive pairing can be used to perform many of the same functions more efficiently. The prefix computation on linear lists is generalized to trees, and it is shown that these tree-fix computations, which can be performed in a communication-efficient fashion using a variant of the tree-contraction technique of Miller and Reif, simplify many parallel graph algorithms in the literature.
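For readers unfamiliar with recursive doubling, the following is a minimal sketch of a prefix-sum computation in that style, simulated serially on an array. It is a textbook formulation rather than code from the thesis, it does not model DRAM communication cost, and it does not show the recursive pairing alternative the thesis advocates; the function name is hypothetical.

```python
def prefix_sums_by_recursive_doubling(values):
    """Inclusive prefix sums computed in synchronous, PRAM-style rounds.

    Returns the prefix sums and the number of rounds used.
    """
    n = len(values)
    sums = list(values)
    distance = 1
    rounds = 0
    while distance < n:
        # Every position updates "at once": all reads come from the old array.
        previous = sums
        sums = [
            previous[i] + previous[i - distance] if i >= distance else previous[i]
            for i in range(n)
        ]
        distance *= 2  # the reach of each element doubles every round
        rounds += 1
    return sums, rounds


result, rounds = prefix_sums_by_recursive_doubling([1, 2, 3, 4, 5])
print(result, rounds)
```

The number of rounds grows logarithmically with the input length, but in every round nearly every element reads from a partner at an ever larger distance, which hints at why the technique can be costly once communication is accounted for.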
Parallel expectation-maximization algorithms for PET image reconstruction
NASA Astrophysics Data System (ADS)
Jeng, Wei-Min
1999-10-01
Image reconstruction using Positron Emission Tomography (PET) involves estimating an unknown number of photon pairs emitted from the radiopharmaceuticals within the tissues of the patient's body. The generation of the photons can be described as a Poisson process, and the difficulty of image reconstruction involves approximating the parameter of the tissue density distribution function. A significant amount of artifactual noise exists in the reconstructed image with the convolution back projection method. Using the Maximum Likelihood (ML) formulation, a better estimate can be made for the unknown image information. Despite the better quality of images, the Expectation Maximization (EM) iterative algorithm is not being used in practice due to the tremendous processing time. This research proposes new techniques in designing parallel algorithms in order to speed the reconstruction process. Using the EM algorithm as an example, several general parallel techniques were studied for both the distributed-memory architecture and the message-passing programming paradigm. Both intra- and inter-iteration latency-hiding schemes were designed to effectively reduce the communication time. Dependencies that exist in and between iterations were rearranged to overlap communication and computation using MPI's non-blocking collective reduction operation. A performance model was established to estimate the processing time of the algorithms and was found to agree with the experimental results. A second strategy, the sparse matrix compaction technique, was developed to reduce the computational time of the computation-bound EM algorithm with better use of PET system geometry. The proposed techniques are generally applicable to many scientific computation problems that involve sparse matrix operations as well as iterative types of algorithms.
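As a rough illustration of the iteration being parallelized, here is a serial sketch of the standard ML-EM update for emission tomography. It assumes the textbook multiplicative form of the update; the abstract does not give the exact formulation used in the thesis, and the tiny dense system matrix and all names below are hypothetical. The MPI overlap and sparse-matrix compaction described above are not shown.

```python
def mlem_step(system, counts, image):
    """One ML-EM update.

    system[i][j] is the probability that an emission in pixel j is
    recorded in detector bin i; counts[i] is the measured count in bin i.
    """
    n_bins, n_pixels = len(system), len(image)
    # Forward projection: expected counts in each detector bin.
    expected = [
        sum(system[i][j] * image[j] for j in range(n_pixels)) for i in range(n_bins)
    ]
    updated = []
    for j in range(n_pixels):
        sensitivity = sum(system[i][j] for i in range(n_bins))
        # Back-project the ratio of measured to expected counts.
        correction = sum(
            system[i][j] * counts[i] / expected[i]
            for i in range(n_bins)
            if expected[i] > 0.0
        )
        updated.append(image[j] * correction / sensitivity if sensitivity > 0.0 else 0.0)
    return updated


def mlem(system, counts, iterations=50):
    """Run ML-EM from a uniform starting image."""
    image = [1.0] * len(system[0])
    for _ in range(iterations):
        image = mlem_step(system, counts, image)
    return image


# Hypothetical 2-bin, 2-pixel system.
system = [[0.5, 0.5], [0.2, 0.8]]
print(mlem(system, counts=[4.0, 6.0]))
```

In a distributed-memory version, the forward-projection sums are the natural place for a reduction across processes, which is presumably where overlapping communication with computation pays off.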
A parallel algorithm for the non-symmetric eigenvalue problem
Dongarra, J.; Sidani, M.
1991-12-01
This paper describes a parallel algorithm for computing the eigenvalues and eigenvectors of a non-symmetric matrix. The algorithm is based on a divide-and-conquer procedure and uses an iterative refinement technique.
Parallel Transferable Uniform Multi-Round Algorithm for Minimizing Makespan
NASA Astrophysics Data System (ADS)
Yamamoto, Hiroshi; Tsuru, Masato; Yamazaki, Katsuyuki; Oie, Yuji
In parallel computing systems using the master/worker model for distributed grid computing, as the size of the data handled grows, the increase in the data transmission time degrades the performance. For divisible workload applications, therefore, multiple-round scheduling algorithms have been developed to mitigate the adverse effect of longer data transmission time by dividing the data into chunks to be sent out in multiple rounds, thus overlapping the times required for computation and transmission. However, a standard multiple-round scheduling algorithm, Uniform Multi-Round (UMR), adopts a sequential transmission model where the master communicates with one worker at a time, thus the transmission capacity of the link attached to the master cannot be fully utilized due to the limits of worker-side capacity. In the present study, a Parallel Transferable Uniform Multi-Round algorithm (PTUMR) is proposed. It efficiently utilizes the data transmission capacity of network links by allowing chunks to be transmitted in parallel to workers. This algorithm divides workers into groups in a way that fully uses the link bandwidth of the master under some constraints and considers each group of workers as one virtual worker. In particular, introducing a Grouping Threshold effectively deals with very heterogeneous workers in both data transmission and computation capacities. Then, the master schedules sequential data transmissions to the virtual workers in an optimal way, as in UMR. The performance evaluations show that the proposed algorithm achieves significantly shorter turnaround times (i.e., makespan) compared with UMR regardless of heterogeneity of workers, which are close to the theoretical lower limits.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform a high-level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
On the design, analysis, and implementation of efficient parallel algorithms
Sohn, S.M.
1989-01-01
There is considerable interest in developing algorithms for a variety of parallel computer architectures. This is not a trivial problem, although for certain models great progress has been made. Recently, general-purpose parallel machines have become available commercially. These machines possess widely varying interconnection topologies and data/instruction access schemes. It is important, therefore, to develop methodologies and design paradigms for not only synthesizing parallel algorithms from initial problem specifications, but also for mapping algorithms between different architectures. This work has considered both of these problems. A systolic array consists of a large collection of simple processors that are interconnected in a uniform pattern. The author has studied in detail the problem of mapping systolic algorithms onto more general-purpose parallel architectures such as the hypercube. The hypercube architecture is notable due to its symmetry and high connectivity, characteristics which are conducive to the efficient embedding of parallel algorithms. Although the parallel-to-parallel mapping techniques have yielded efficient target algorithms, it is not surprising that an algorithm designed directly for a particular parallel model would achieve superior performance. In this context, the author has developed hypercube algorithms for some important problems in speech and signal processing, text processing, language processing and artificial intelligence. These algorithms were implemented on a 64-node NCUBE/7 hypercube machine in order to evaluate their performance.
Towards Distributed Memory Parallel Program Analysis
Quinlan, D; Barany, G; Panas, T
2008-06-17
This paper presents a parallel attribute evaluation for distributed memory parallel computer architectures where previously only shared memory parallel support for this technique has been developed. Attribute evaluation is a part of how attribute grammars are used for program analysis within modern compilers. Within this work, we have extended ROSE, an open compiler infrastructure, with a distributed memory parallel attribute evaluation mechanism to support user defined global program analysis required for some forms of security analysis which cannot be addressed by a file by file view of large scale applications. As a result, user defined security analyses may now run in parallel without the user having to specify the way data is communicated between processors. The automation of communication enables an extensible open-source parallel program analysis infrastructure.
A parallel algorithm for implicit depletant simulations
NASA Astrophysics Data System (ADS)
Glaser, Jens; Karas, Andrew S.; Glotzer, Sharon C.
2015-11-01
We present an algorithm to simulate the many-body depletion interaction between anisotropic colloids in an implicit way, integrating out the degrees of freedom of the depletants, which we treat as an ideal gas. Because the depletant particles are statistically independent and the depletion interaction is short-ranged, depletants are randomly inserted in parallel into the excluded volume surrounding a single translated and/or rotated colloid. A configurational bias scheme is used to enhance the acceptance rate. The method is validated and benchmarked both on multi-core processors and graphics processing units for the case of hard spheres, hemispheres, and discoids. With depletants, we report novel cluster phases in which hemispheres first assemble into spheres, which then form ordered hcp/fcc lattices. The method is significantly faster than any method without cluster moves and that tracks depletants explicitly, for systems of colloid packing fraction ϕc < 0.50, and additionally enables simulation of the fluid-solid transition.
Differences Between Distributed and Parallel Systems
Brightwell, R.; Maccabe, A.B.; Rissen, R.
1998-10-01
Distributed systems have been studied for twenty years and are now coming into wider use as fast networks and powerful workstations become more readily available. In many respects a massively parallel computer resembles a network of workstations and it is tempting to port a distributed operating system to such a machine. However, there are significant differences between these two environments and a parallel operating system is needed to get the best performance out of a massively parallel system. This report characterizes the differences between distributed systems, networks of workstations, and massively parallel systems and analyzes the impact of these differences on operating system design. In the second part of the report, we introduce Puma, an operating system specifically developed for massively parallel systems. We describe Puma portals, the basic building blocks for message passing paradigms implemented on top of Puma, and show how the differences observed in the first part of the report have influenced the design and implementation of Puma.
Applications and accuracy of the parallel diagonal dominant algorithm
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1993-01-01
The Parallel Diagonal Dominant (PDD) algorithm is a highly efficient, ideally scalable tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric and anti-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the algorithm is a good candidate for the emerging massively parallel machines.
Parallel optimization algorithms and their implementation in VLSI design
NASA Technical Reports Server (NTRS)
Lee, G.; Feeley, J. J.
1991-01-01
Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
A Parallel Algorithm for Contact in a Finite Element Hydrocode
Pierce, T G
2003-06-01
A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies change as collision and sliding occur. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the same processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests is provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. This algorithm
An efficient parallel algorithm for accelerating computational protein design
Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang
2014-01-01
Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General Public License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/~compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991
Efficient graph algorithms for sequential and parallel computers. Doctoral thesis
Goldberg, A.V.
1987-02-01
This thesis studies graph algorithms, both in sequential and parallel contexts. In the outline of the thesis, algorithm complexities are stated in terms of the number of vertices n, the number of edges m, the largest absolute value of capacities U, and the largest absolute value of costs C. Chapter 1 introduces a new approach to the maximum flow problem that leads to better algorithms for the problem. Chapter 2 is devoted to the minimum cost flow problem, which is a generalization of the maximum flow problem. Chapter 3 addresses implementation of parallel algorithms through a case study of an implementation of a parallel maximum flow algorithm. Parallel prefix operations play an important role in the implementation. Experimental results achieved by the implementation are presented. Parallel symmetry-breaking techniques are the main topic of Chapter 4.
Algorithm Calculates Cumulative Poisson Distribution
NASA Technical Reports Server (NTRS)
Bowerman, Paul N.; Nolty, Robert C.; Scheuer, Ernest M.
1992-01-01
Algorithm calculates accurate values of cumulative Poisson distribution under conditions where other algorithms fail because numbers are so small (underflow) or so large (overflow) that computer cannot process them. Factors inserted temporarily to prevent underflow and overflow. Implemented in CUMPOIS computer program described in "Cumulative Poisson Distribution Program" (NPO-17714).
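The underflow and overflow problem described here can be illustrated with a short sketch. Note that it swaps in a different technique from the one in the abstract: instead of temporarily inserted scale factors, it sums the terms in log space using `math.lgamma` and a log-sum-exp step. It is not the CUMPOIS implementation, and the function name is hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), accumulated in log space."""
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    if k < 0:
        return 0.0
    if lam == 0.0:
        return 1.0
    # Log of each term lam**i * exp(-lam) / i!, so that neither the huge
    # power nor the tiny exponential is ever formed directly.
    log_terms = [i * math.log(lam) - lam - math.lgamma(i + 1) for i in range(k + 1)]
    peak = max(log_terms)
    # Log-sum-exp: factor out the largest term to keep exponentials in range.
    scaled_total = sum(math.exp(t - peak) for t in log_terms)
    return min(1.0, math.exp(peak + math.log(scaled_total)))


print(poisson_cdf(1, 1.0))
print(poisson_cdf(1000, 1000.0))
```

A naive evaluation of the second call would fail twice over: `math.exp(-1000.0)` underflows to zero and `1000.0 ** 1000` overflows a double, even though the probability itself is an ordinary number near one half.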
AN ALGORITHM FOR PARALLEL SN SWEEPS ON UNSTRUCTURED MESHES
S. D. PAUTZ
2000-12-01
We develop a new algorithm for performing parallel S_n sweeps on unstructured meshes. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned mesh. For typical problems and with "normal" mesh partitionings we have observed nearly linear speedups on up to 126 processors. This is an important and desirable result, since although analyses of structured meshes indicate that parallel sweeps will not scale with normal partitioning approaches, we do not observe any severe asymptotic degradation in the parallel efficiency with modest (≤100) levels of parallelism. This work is a fundamental step in the development of parallel S_n methods.
AN ALGORITHM FOR PARALLEL SN SWEEPS ON UNSTRUCTURED MESHES
S. D. PAUTZ
2001-03-01
We develop a new algorithm for performing parallel S_n sweeps on unstructured meshes. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned mesh. For typical problems and with "normal" mesh partitionings we have observed nearly linear speedups on up to 126 processors. This is an important and desirable result, since although analyses of structured meshes indicate that parallel sweeps will not scale with normal partitioning approaches, we do not observe any severe asymptotic degradation in the parallel efficiency with modest (≤100) levels of parallelism. This work is a fundamental step in the development of parallel S_n methods.
Parallel Breadth-First Search on Distributed Memory Systems
Computational Research Division; Buluc, Aydin; Madduri, Kamesh
2011-04-15
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
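The level-synchronous strategy mentioned above can be sketched serially. This is a minimal illustration, not the authors' tuned implementation: the adjacency-dictionary input and the name `bfs_levels` are assumptions. In a distributed version, each process would own a subset of the vertices and the frontier would be exchanged between processes at the end of every level.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: expand the whole current frontier before moving on.

    adj maps each vertex to an iterable of its neighbours.
    Returns a dict mapping every reachable vertex to its BFS level.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In a parallel setting this loop is split across workers,
        # followed by a synchronization point before the next level.
        for u in frontier:
            for v in adj[u]:
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

The per-level barrier is what makes the approach "level-synchronous": no vertex at depth d+1 is expanded until every vertex at depth d has been processed.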
A Parallel Algorithm for the Vehicle Routing Problem
Groer, Christopher S; Golden, Bruce; Edward, Wasil
2011-01-01
The vehicle routing problem (VRP) is a difficult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly find high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of different scenarios.
A parallel implementation of the Wang Landau algorithm
NASA Astrophysics Data System (ADS)
Zhan, Lixin
2008-09-01
The Wang-Landau algorithm is a flat-histogram Monte Carlo method that performs random walks in the configuration space of a system to obtain a close estimation of the density of states iteratively. It has been applied successfully to many research fields. In this paper, we propose a parallel implementation of the Wang-Landau algorithm on computers of shared memory architectures by utilizing the OpenMP API for distributed computing. This implementation is applied to Ising model systems with promising speedups. We also examine the effects on the running speed when using different strategies in accessing the shared memory space during the updating procedure. The allowance of data race is recommended in consideration of the simulation efficiency. Such treatment does not affect the accuracy of the final density of states obtained.
Parallel programming interface for distributed data
NASA Astrophysics Data System (ADS)
Wang, Manhui; May, Andrew J.; Knowles, Peter J.
2009-12-01
The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Program summary: Program title: PPIDD; Catalogue identifier: AEEF_v1_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html; Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland; Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html; No. of lines in distributed program, including test data, etc.: 17 698; No. of bytes in distributed program, including test data, etc.: 166 173; Distribution format: tar.gz; Programming language: Fortran, C; Computer: Many parallel systems; Operating system: Various; Has the code been vectorised or parallelized?: Yes. 2-256 processors used; RAM: 50 Mbytes; Classification: 6.5; External routines: Global Arrays or MPI-2; Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access. Solution method: The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Running time: Problem dependent. The test provided with
Parallel projected variable metric algorithms for unconstrained optimization
NASA Technical Reports Server (NTRS)
Freeman, T. L.
1989-01-01
The parallel variable metric optimization algorithms of Straeter (1973) and van Laarhoven (1985) are reviewed, and the possible drawbacks of the algorithms are noted. By including Davidon (1975) projections in the variable metric updating, researchers can generalize Straeter's algorithm to a family of parallel projected variable metric algorithms which do not suffer the above drawbacks and which retain quadratic termination. Finally researchers consider the numerical performance of one member of the family on several standard example problems and illustrate how the choice of the displacement vectors affects the performance of the algorithm.
A Parallel Prefix Algorithm for Almost Toeplitz Tridiagonal Systems
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Joslin, Ronald D.
1995-01-01
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study has been conducted to provide a simple truncation formula. Experimental results have been measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for symmetric, almost symmetric Toeplitz tridiagonal systems and for the compact scheme on high-performance computers.
An efficient parallel algorithm for matrix-vector multiplication
Hendrickson, B.; Leland, R.; Plimpton, S.
1993-03-01
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
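To make the n/√p term more concrete, here is a serial sketch of a generic two-dimensional block decomposition of y = Ax. It is an assumption-laden illustration, not the paper's hypercube algorithm: the function name `block_matvec`, the dense list-of-lists matrix, and the requirement that n be divisible by the grid side q (standing in for √p) are all choices made for this example.

```python
def block_matvec(A, x, q):
    """Compute y = A x with A viewed as a q-by-q grid of square blocks.

    Each (i, j) block plays the role of one processor: it needs only the
    n/q entries of x for its columns and produces n/q partial sums.
    Assumes len(A) is divisible by q.
    """
    n = len(A)
    b = n // q  # block side length, n / sqrt(p) when p = q * q
    y = [0.0] * n
    for i in range(q):
        for j in range(q):
            rows = range(i * b, (i + 1) * b)
            cols = range(j * b, (j + 1) * b)
            # Local work of "processor" (i, j): a b-by-b block times b entries of x.
            partial = [sum(A[r][c] * x[c] for c in cols) for r in rows]
            # Reduction along block row i: partial sums are added into y.
            for k, r in enumerate(rows):
                y[r] += partial[k]
    return y
```

In this layout each block touches only n/q vector entries on input and n/q on output, which is roughly where a per-processor data volume proportional to n/√p comes from.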
Parallelization and automatic data distribution for nuclear reactor simulations
Liebrock, L.M.
1997-07-01
Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1997-01-01
The multiblock reacting Navier-Stokes flow solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1994-01-01
The multiblock reacting Navier-Stokes flow-solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Distributed parallel messaging for multiprocessor systems
Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka
2013-06-04
A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit includes a switch interface for reading from the memory system when injecting packets into the network.
Distributed Parallel Particle Advection using Work Requesting
Muller, Cornelius; Camp, David; Hentschel, Bernd; Garth, Christoph
2013-09-30
Particle advection is an important vector field visualization technique that is difficult to apply to very large data sets in a distributed setting due to scalability limitations in existing algorithms. In this paper, we report on several experiments using work requesting dynamic scheduling which achieves balanced work distribution on arbitrary problems with minimal communication overhead. We present a corresponding prototype implementation, provide and analyze benchmark results, and compare our results to an existing algorithm.
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
NASA Technical Reports Server (NTRS)
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel efficiency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Parallel Algorithm Solves Coupled Differential Equations
NASA Technical Reports Server (NTRS)
Hayashi, A.
1987-01-01
Numerical methods adapted to concurrent processing. Algorithm solves set of coupled partial differential equations by numerical integration. Adapted to run on hypercube computer, algorithm separates problem into smaller problems solved concurrently. Increase in computing speed with concurrent processing over that achievable with conventional sequential processing appreciable, especially for large problems.
Exact parallel maximum clique algorithm for general and protein graphs.
Depolli, Matjaž; Konc, Janez; Rozman, Kati; Trobec, Roman; Janežič, Dušanka
2013-09-23
A new exact parallel maximum clique algorithm MaxCliquePara, which finds the maximum clique (the fully connected subgraph) in undirected general and protein graphs, is presented. First, a new branch and bound algorithm for finding a maximum clique on a single computer core, which builds on ideas presented in two published state of the art sequential algorithms is implemented. The new sequential MaxCliqueSeq algorithm is faster than the reference algorithms on both DIMACS benchmark graphs as well as on protein-derived product graphs used for protein structural comparisons. Next, the MaxCliqueSeq algorithm is parallelized by splitting the branch-and-bound search tree to multiple cores, resulting in MaxCliquePara algorithm. The ability to exploit all cores efficiently makes the new parallel MaxCliquePara algorithm markedly superior to other tested algorithms. On a 12-core computer, the parallelization provides up to 2 orders of magnitude faster execution on the large DIMACS benchmark graphs and up to an order of magnitude faster execution on protein product graphs. The algorithms are freely accessible on http://commsys.ijs.si/~matjaz/maxclique.
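The branch-and-bound idea behind exact maximum clique search can be sketched as follows. This is a minimal sequential illustration with the simplest size-based bound, not the MaxCliqueSeq or MaxCliquePara algorithms, whose bounding and ordering rules the abstract does not detail. The name `max_clique` and the dictionary-of-sets graph format are assumptions.

```python
def max_clique(adj):
    """Return one maximum clique of an undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]
        while candidates:
            # Bound: even taking every remaining candidate cannot beat best.
            if len(clique) + len(candidates) <= len(best):
                return
            v = candidates.pop()
            # Branch: keep only candidates adjacent to v, so the set stays a clique.
            expand(clique + [v], [u for u in candidates if u in adj[v]])

    expand([], list(adj))
    return best
```

A parallel variant in the spirit of the abstract would hand different top-level branches of this search tree to different cores while sharing the current best size, so that every core can prune against it.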
A join algorithm for combining AND parallel solutions in AND/OR parallel systems
Ramkumar, B.; Kale, L.V.
1992-02-01
When two or more literals in the body of a Prolog clause are solved in (AND) parallel, their solutions need to be joined to compute solutions for the clause. This is often a difficult problem in parallel Prolog systems that exploit OR and independent AND parallelism in Prolog programs. In several AND/OR parallel systems proposed recently, this problem is side-stepped at the cost of unexploited OR parallelism in the program, in part due to the complexity of the backtracking algorithm beneath AND parallel branches. In some cases, the data dependency graphs used by these systems cannot represent all the exploitable independent AND parallelism known at compile time. In this paper, we describe the compile time analysis for an optimized join algorithm for supporting independent AND parallelism in logic programs efficiently without leaving any OR parallelism unexploited. We then discuss how this analysis can be used to yield very efficient runtime behavior. We also discuss problems associated with a tree representation of the search space when arbitrarily complex data dependency graphs are permitted. We describe how these problems can be resolved by mapping the search space onto data dependency graphs themselves. The algorithm has been implemented in a compiler for parallel Prolog based on the reduce-OR process model. The algorithm is suitable for the implementation of AND/OR systems on both shared and nonshared memory machines. Performance on benchmark programs.
NASA Astrophysics Data System (ADS)
Chen, Yufeng; Wu, Zebin; Sun, Le; Wei, Zhihui; Li, Yonglong
2016-04-01
With the gradual increase in the spatial and spectral resolution of hyperspectral images, the size of image data becomes larger and larger, and the complexity of processing algorithms is growing, which poses a big challenge to efficient massive hyperspectral image processing. Cloud computing technologies distribute computing tasks to a large number of computing resources for handling large data sets without the limitation of memory and computing resource of a single machine. This paper proposes a parallel pixel purity index (PPI) algorithm for unmixing massive hyperspectral images based on a MapReduce programming model for the first time in the literature. According to the characteristics of hyperspectral images, we describe the design principle of the algorithm, illustrate the main cloud unmixing processes of PPI, and analyze the time complexity of serial and parallel algorithms. Experimental results demonstrate that the parallel implementation of the PPI algorithm on the cloud can effectively process big hyperspectral data and accelerate the algorithm.
Parallel algorithms and architectures for the manipulator inertia matrix
Amin-Javaheri, M.
1989-01-01
Several parallel algorithms and architectures to compute the manipulator inertia matrix in real time are proposed. An O(N) and an O(log_2 N) parallel algorithm based upon recursive computation of the inertial parameters of sets of composite rigid bodies are formulated. One- and two-dimensional systolic architectures are presented to implement the O(N) parallel algorithm. A cube architecture is employed to implement the diagonal element of the inertia matrix in O(log_2 N) time and the upper off-diagonal elements in O(N) time. The resulting K_1 O(N) + K_2 O(log_2 N) parallel algorithm is more efficient for a cube network implementation. All the architectural configurations are based upon a VLSI Robotics Processor exploiting fine-grain parallelism. In evaluating all the architectural configurations, significant performance parameters such as I/O time and idle time due to processor synchronization as well as CPU utilization and on-chip memory size are fully included. The O(N) and O(log_2 N) parallel algorithms adhere to the precedence relationships among the processors. In order to achieve a higher speedup factor, however, parallel algorithms in conjunction with Non-Strict Computational Models are devised to relax interprocess precedence, and as a result, to decrease the effective computational delays. The effectiveness of the Non-Strict Computational Algorithms is verified by computer simulations, based on a PUMA 560 robot manipulator. It is demonstrated that a combination of parallel algorithms and architectures results in a very effective approach to achieve real-time response for computing the manipulator inertia matrix.
Algorithmic support for commodity-based parallel computing systems.
Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann
2003-10-01
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in ℝ^d, find k points that minimize their average pairwise L_1 distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
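The clustering problem stated above (choose k of n points to minimize the average pairwise L1 distance) can be made concrete with an exhaustive baseline. This is only a brute-force sketch for small inputs, not one of the report's exact or approximate algorithms. The function names are hypothetical, and k is assumed to be at least 2.

```python
from itertools import combinations

def l1(p, q):
    """Manhattan (L1) distance, i.e. the hop count between two nodes of a grid."""
    return sum(abs(a - b) for a, b in zip(p, q))

def avg_pairwise_l1(points):
    """Average L1 distance over all unordered pairs of the given points."""
    pairs = list(combinations(points, 2))
    return sum(l1(p, q) for p, q in pairs) / len(pairs)

def best_k_subset(points, k):
    """Try every k-subset and keep the one with the smallest average distance.

    Cost grows as C(n, k), so this is only usable as a reference on tiny inputs.
    """
    return min(combinations(points, k), key=avg_pairwise_l1)
```

For example, with free processors at grid coordinates `[(0, 0), (0, 1), (1, 0), (5, 5), (9, 9)]` and a job needing three of them, the search selects the three mutually adjacent nodes near the origin.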
Ray tracing on distributed memory parallel systems
NASA Technical Reports Server (NTRS)
Jensen, David W.; Reed, Daniel A.
1990-01-01
Among the many techniques in computer graphics, ray tracing is prized because it can render realistic images, albeit at great computational expense. In this note, the performance of several approaches to ray tracing on a distributed memory parallel system is evaluated. A set of performance instrumentation tools and their associated visualization software are used to identify the underlying causes of performance differences.
Efficient parallel algorithm for statistical ion track simulations in crystalline materials
NASA Astrophysics Data System (ADS)
Jeon, Byoungseon; Grønbech-Jensen, Niels
2009-02-01
We present an efficient parallel algorithm for statistical Molecular Dynamics simulations of ion tracks in solids. The method is based on the Rare Event Enhanced Domain following Molecular Dynamics (REED-MD) algorithm, which has been successfully applied to studies of, e.g., ion implantation into crystalline semiconductor wafers. We discuss the strategies for parallelizing the method, and we settle on a host-client type polling scheme in which a multiple of asynchronous processors are continuously fed to the host, which, in turn, distributes the resulting feed-back information to the clients. This real-time feed-back consists of, e.g., cumulative damage information or statistics updates necessary for the cloning in the rare event algorithm. We finally demonstrate the algorithm for radiation effects in a nuclear oxide fuel, and we show the balanced parallel approach with high parallel efficiency in multiple processor configurations.
Parallel algorithms for graph cycle extraction using the cyclical conjunction operator.
Cerruela García, G; Luque Ruiz, I; Gómez-Nieto, M A
2002-01-01
With a view to reducing the computational cost of extracting all the cycles from complex graphs, the authors have examined the viability of parallel processing. Based on the cyclical conjunction operator, which uses an iterative process to extract every cycle from a graph, a study was performed of the factors intervening in the parallelization of this algorithm, namely the following: granularity of the parallel algorithm, requirements for synchronization points, and the spreading of the load across different processors. Tests were performed on two granularities and four different load distributions. Algorithm implementation is carried out using SGI MP and OpenMP libraries, and, in the light of the present findings, the authors propose a dynamically distributed fine-grain algorithm that allows all the cycles in a complex graph to be found in an acceptable computational time.
Hadoop neural network for parallel and distributed feature selection.
Hodge, Victoria J; O'Keefe, Simon; Austin, Jim
2016-06-01
In this paper, we introduce a theoretical basis for a Hadoop-based neural network for parallel and distributed feature selection in Big Data sets. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits with the Hadoop paradigm. There are many feature selectors described in the literature which all have various strengths and weaknesses. We present the implementation details of five feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop YARN. Hadoop allows parallel and distributed processing. Each feature selector can be divided into subtasks and the subtasks can then be processed in parallel. Multiple feature selectors can also be processed simultaneously (in parallel) allowing multiple feature selectors to be compared. We identify commonalities among the five feature selectors. All can be processed in the framework using a single representation and the overall processing can also be greatly reduced by only processing the common aspects of the feature selectors once and propagating these aspects across all five feature selectors as necessary. This allows the best feature selector and the actual features to select to be identified for large and high dimensional data sets through exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop. PMID:26403824
Efficient sequential and parallel algorithms for record linkage
Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar
2014-01-01
Background and objective: Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods: Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results: Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions: We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
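Two of the ideas named in the Methods (removing identical records up front, then linking similar records in a graph and taking connected components) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' algorithm: a plain sort on a set stands in for their radix sort, similarity is an edit-distance threshold over whole strings, all pairs are compared, and the names `edit_distance` and `link_records` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def link_records(records, threshold=1):
    """Group records whose similarity graph places them in one connected component."""
    unique = sorted(set(records))  # exact duplicates are eliminated first
    parent = list(range(len(unique)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if edit_distance(unique[i], unique[j]) <= threshold:
                parent[find(i)] = find(j)  # add an edge: merge the two components

    groups = {}
    for i, rec in enumerate(unique):
        groups.setdefault(find(i), []).append(rec)
    return sorted(groups.values())
```

Removing duplicates before the pairwise step matters because the comparison loop is quadratic in the number of distinct records.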
Cloud identification using genetic algorithms and massively parallel computation
NASA Technical Reports Server (NTRS)
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkably consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMs) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally, we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
A Task-parallel Clustering Algorithm for Structured AMR
Gunney, B N; Wissink, A M
2004-11-02
A new parallel algorithm, based on the Berger-Rigoutsos algorithm for clustering grid points into logically rectangular regions, is presented. The clustering operation is frequently performed in the dynamic gridding steps of structured adaptive mesh refinement (SAMR) calculations. A previous study revealed that although the cost of clustering is generally insignificant for smaller problems run on relatively few processors, the algorithm scaled inefficiently in parallel and its cost grows with problem size. Hence, it can become significant for large scale problems run on very large parallel machines, such as the new BlueGene system (which has O(10^4) processors). We propose a new task-parallel algorithm designed to reduce communication wait times. Performance was assessed using dynamic SAMR re-gridding operations on up to 16K processors of currently available computers at Lawrence Livermore National Laboratory. The new algorithm was shown to be up to an order of magnitude faster than the baseline algorithm and had better scaling trends.
A scalable parallel algorithm for multiple objective linear programs
NASA Technical Reports Server (NTRS)
Wiecek, Malgorzata M.; Zhang, Hong
1994-01-01
This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives especially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual-core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Singular value decomposition utilizing parallel algorithms on graphical processors
Kotas, Charlotte W; Barhen, Jacob
2011-01-01
transformations, and then diagonalizes the intermediate bidiagonal matrix through implicit QR shifts. This is similar to that implemented for real matrices by Lahabar and Narayanan ("Singular Value Decomposition on GPU using CUDA", IEEE International Parallel and Distributed Processing Symposium 2009). The implementation is done in a hybrid manner, with the bidiagonalization stage done using the GPU while the diagonalization stage is done using the CPU, with the GPU used to update the U and V matrices. The second algorithm is based on a one-sided Jacobi scheme utilizing a sequence of pair-wise column orthogonalizations such that A is replaced by AV until the resulting matrix is sufficiently orthogonal (that is, equal to UΣ). V is obtained from the sequence of orthogonalizations, while Σ can be found from the square root of the diagonal elements of A^H A and, once Σ is known, U can be found from column scaling the resulting matrix. These implementations utilize CUDA Fortran and NVIDIA's cuBLAS library. The primary goal of this study is to quantify the comparative performance of these two techniques against themselves and other standard implementations (for example, MATLAB). Considering that there is significant overhead associated with transferring data to the GPU and with synchronization between the GPU and the host CPU, it is also important to understand when it is worthwhile to use the GPU in terms of the matrix size and number of concurrent SVDs to be calculated.
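The one-sided Jacobi scheme described in this abstract can be sketched for small real matrices as follows. This is a serial, pure-Python illustration rather than the GPU implementation of the paper, which targets complex matrices; the function name `jacobi_svd`, the tolerance and the sweep limit are assumptions chosen for the example.

```python
import math


def jacobi_svd(a, tol=1e-12, max_sweeps=50):
    """One-sided Jacobi SVD of a real m x n matrix given as a list of rows."""
    m, n = len(a), len(a[0])
    w = [row[:] for row in a]  # working copy of A; converges to U * Sigma
    v = [[float(i == j) for j in range(n)] for i in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(w[i][p] ** 2 for i in range(m))
                beta = sum(w[i][q] ** 2 for i in range(m))
                gamma = sum(w[i][p] * w[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for mat in (w, v):  # same plane rotation on A*V and on V
                    for row in mat:
                        x, y = row[p], row[q]
                        row[p] = c * x - s * y
                        row[q] = s * x + c * y
        if not rotated:
            break

    # Singular values are the column norms; U follows from column scaling.
    sigma = [math.sqrt(sum(w[i][j] ** 2 for i in range(m))) for j in range(n)]
    u = [[w[i][j] / sigma[j] if sigma[j] > 0 else 0.0 for j in range(n)]
         for i in range(m)]
    return u, sigma, v
```

The singular values come back unsorted, in column order. Each rotation touches only two columns, so rotations on disjoint column pairs are independent of one another, and a parallel version can process such pairs concurrently.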
A parallel algorithm for the distance embedding problem
Sampath, G.
1993-12-31
There are several applications in which it is required to embed a given set of distances between pairs of points in m-dimensional space. In this report, a parallel version of a sequential algorithm for the distance embedding problem due to Sippl and Scheraga is discussed for an SIMD-SM CRCW machine. In contrast with another (sequential) algorithm for the problem due to Pulskamp, this one does not require the solution of a characteristic matrix equation for its eigenvalues and therefore leads to a more efficient parallel implementation.
Hash based parallel algorithms for mining association rules
Shintani, Takahiko; Kitsuregawa, Masaru
1996-12-31
In this paper, we propose four parallel algorithms (NPA, SPA, HPA and HPA-ELD) for mining association rules on shared-nothing parallel machines to improve mining performance. In NPA, candidate itemsets are just copied amongst all the processors, which can lead to memory overflow for large transaction databases. The remaining three algorithms partition the candidate itemsets over the processors. If they are partitioned simply (SPA), transaction data has to be broadcast to all processors. HPA partitions the candidate itemsets using a hash function to eliminate broadcasting, which also reduces the comparison workload significantly. HPA-ELD fully utilizes the available memory space by detecting the extremely large itemsets and copying them, which is also very effective at flattening the load over the processors. We implemented these algorithms in a shared-nothing environment. Performance evaluations show that the best algorithm, HPA-ELD, attains good linearity on speedup ratio and is effective for handling skew.
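The hash-partitioning idea behind HPA can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the authors' code: processors are modeled as list indices, the helper names `owner` and `hpa_count` are hypothetical, and CRC32 is used only because it is a deterministic hash available in the standard library.

```python
from collections import Counter
from itertools import combinations
import zlib


def owner(itemset, num_procs):
    """Map an itemset (a sorted tuple of strings) to a processor id."""
    key = ",".join(itemset).encode()
    return zlib.crc32(key) % num_procs


def hpa_count(local_transactions, candidates, k, num_procs):
    """Simulate one support-counting pass over candidate k-itemsets.

    local_transactions[p] is the list of transactions stored on processor p.
    Returns one Counter per processor holding only the candidates it owns.
    """
    # Each processor stores only the candidates that hash to it.
    owned = [set() for _ in range(num_procs)]
    for cand in candidates:
        owned[owner(cand, num_procs)].add(cand)
    tables = [Counter() for _ in range(num_procs)]

    for transactions in local_transactions:  # work done on each sender
        for t in transactions:
            for itemset in combinations(sorted(t), k):
                dest = owner(itemset, num_procs)  # one destination, no broadcast
                if itemset in owned[dest]:        # probe done on the receiver
                    tables[dest][itemset] += 1
    return tables
```

Because every k-itemset has exactly one owner, a sender forwards it to a single processor instead of broadcasting the whole transaction, and each processor compares incoming itemsets only against its own share of the candidates.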
Parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Amin-Javaheri, Masoud; Orin, David E.
1989-01-01
The development of an O(log₂ N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log₂ N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying size, and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log₂ N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
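Recursive doubling, which this abstract uses to reach a logarithmic number of computation levels, can be shown on a generic scalar linear recurrence x[i] = a[i] * x[i-1] + b[i]. The sketch below is not the inertia-matrix formulation of the paper; the function name and the scalar setting are assumptions for illustration. Each step is treated as an affine map, and maps are composed pairwise at doubling strides.

```python
def recursive_doubling(a, b, x0):
    """Solve x[i] = a[i] * x[i-1] + b[i] for i = 0..N-1, with x[-1] = x0.

    Returns (x, levels), where levels is the number of doubling rounds.
    """
    n = len(a)
    # pairs[i] = (A, B) represents the affine map x -> A * x + B.
    pairs = list(zip(a, b))
    stride, levels = 1, 0
    while stride < n:
        nxt = pairs[:]
        # Iterations of this loop are independent of each other, so a
        # parallel machine can execute the whole round as one level.
        for i in range(stride, n):
            a_hi, b_hi = pairs[i]
            a_lo, b_lo = pairs[i - stride]
            # Compose: apply the earlier map first, then the later one.
            nxt[i] = (a_hi * a_lo, a_hi * b_lo + b_hi)
        pairs = nxt
        stride *= 2
        levels += 1
    return [A * x0 + B for A, B in pairs], levels
```

For N = 8 the loop runs three rounds, matching the ceil(log₂ N) levels of computation, whereas the direct evaluation needs eight dependent steps.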
A biconjugate gradient type algorithm on massively parallel architectures
NASA Technical Reports Server (NTRS)
Freund, Roland W.; Hochbruck, Marlis
1991-01-01
The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
Parallel implementation of an algorithm for Delaunay triangulation
NASA Technical Reports Server (NTRS)
Merriam, Marshal L.
1992-01-01
The theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer, is described. Efficient implementation of Tanemura's algorithm on a conventional, vector processing supercomputer is problematic. It does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a parallel architecture is possible, however. Speeds in excess of 20 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.
Parallel implementation of an algorithm for Delaunay triangulation
NASA Technical Reports Server (NTRS)
Merriam, Marshall L.
1992-01-01
This work concerns the theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer. Tanemura's algorithm does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a conventional, vector processing supercomputer is problematic. Efficient implementation on a parallel architecture is possible, however. In this work, speeds in excess of 8 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer
NASA Technical Reports Server (NTRS)
Jones, Mark Howard
1987-01-01
A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Parallel processors and nonlinear structural dynamics algorithms and software
NASA Technical Reports Server (NTRS)
Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.
1989-01-01
The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
NASA Technical Reports Server (NTRS)
Lohn, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris; Norvig, Peter (Technical Monitor)
2000-01-01
We describe a parallel genetic algorithm (GA) that automatically generates circuit designs using evolutionary search. A circuit-construction programming language is introduced and we show how evolution can generate practical analog circuit designs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. We present experimental results as applied to analog filter and amplifier design tasks.
Parallel algorithm of VLBI software correlator under multiprocessor environment
NASA Astrophysics Data System (ADS)
Zheng, Weimin; Zhang, Dong
2007-11-01
The correlator is the key signal processing equipment of a Very Long Baseline Interferometry (VLBI) synthetic aperture telescope. It receives the mass data collected by the VLBI observatories and produces the visibility function of the target, which can be used for spacecraft positioning, baseline length measurement, synthesis imaging, and other scientific applications. VLBI data correlation is both data-intensive and computation-intensive. This paper presents the algorithms of two parallel software correlators under multiprocessor environments. A near real-time correlator for spacecraft tracking adopts pipelining and thread-parallel technology, and runs on SMP (Symmetric Multiprocessor) servers. Another high-speed prototype correlator using a mixed Pthreads and MPI (Message Passing Interface) parallel algorithm is realized on a small Beowulf cluster platform. Both correlators have a flexible structure, are scalable, and can correlate data from 10 stations.
Databases in parallel and distributed systems
Agrawal, R.; Bell, D.
1990-01-01
The existence of this Symposium, and the success of the First Symposium acknowledge the growing recognition of the importance of two related challenges for database systems as we enter the 1990s. One is to exploit the potential for flexibility of access to dispersed data which may preexist at nodes of a computer network, or are distributed for some other reason. The other is to meet the often stringent performance requirements for some database queries by invoking the support of parallel architectures based on recent hardware and software developments.
The delayed coupling method: An algorithm for solving banded diagonal matrix problems in parallel
Mattor, N.; Williams, T.J.; Hewett, D.W.; Dimits, A.M.
1997-09-01
We present a new algorithm for solving banded diagonal matrix problems efficiently on distributed-memory parallel computers, designed originally for use in dynamic alternating-direction implicit partial differential equation solvers. The algorithm optimizes efficiency with respect to the number of numerical operations and to the amount of interprocessor communication. This is called the "delayed coupling method" because the communication is deferred until needed. We focus here on tridiagonal and periodic tridiagonal systems.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine can achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
NavP: Structured and Multithreaded Distributed Parallel Programming
NASA Technical Reports Server (NTRS)
Pan, Lei
2007-01-01
We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the current prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and it does not change the code structure of an original algorithm. This is in sharp contrast to MP, as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. In contrast, approaches such as DSM or HPF have so far failed to deliver satisfactory performance, even though they are relatively easy to use compared to MP; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine- (multithreading on shared memory) and coarse- (pipelined tasks on distributed memory) grained parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.
Parallel algorithms for high-speed SAR processing
NASA Astrophysics Data System (ADS)
Mallorqui, Jordi J.; Bara, Marc; Broquetas, Antoni; Wis, Mariano; Martinez, Antonio; Nogueira, Leonardo; Moreno, Victoriano
1998-11-01
The mass production of SAR products and their use in monitoring emergency situations (oil spill detection, floods, etc.) require high-speed SAR processors. Two different parallel strategies for near real-time SAR processing based on a multiblock version of the Chirp Scaling Algorithm (CSA) have been studied. The first one is useful for small companies that would like to reduce computation times with no extra investment. It uses a cluster of heterogeneous UNIX workstations as a parallel computer. The second one is oriented to institutions that have to process large amounts of data in short times and can afford the cost of large parallel computers. In both cases, parallel programming has reduced computation times compared with the sequential versions.
A parallel stereo reconstruction algorithm with applications in entomology (APSRA)
NASA Astrophysics Data System (ADS)
Bhasin, Rajesh; Jang, Won Jun; Hart, John C.
2012-03-01
We propose a fast parallel algorithm for the reconstruction of three-dimensional point clouds of insects from binocular stereo image pairs using a hierarchical approach for disparity estimation. Entomologists study various features of insects to classify them, build their distribution maps, and discover genetic links between specimens, among various other essential tasks. This information is important to the pesticide and the pharmaceutical industries, among others. When considering the large collections of insects entomologists analyze, it becomes difficult to physically handle the entire collection and share the data with researchers across the world. With the method presented in our work, entomologists can create an image database for their collections and use the 3D models for studying the shape and structure of the insects, thus making the collections easier to maintain and share. Initial feedback shows that the reconstructed 3D models preserve the shape and size of the specimen. We further optimize our results to incorporate multiview stereo, which produces better overall structure of the insects. Our main contribution is applying stereoscopic vision techniques to entomology to solve the problems faced by entomologists.
Parallel global optimization with the particle swarm algorithm.
Schutte, J F; Reinbolt, J A; Fregly, B J; Haftka, R T; George, A D
2004-12-01
Present day engineering optimization problems often impose large computational demands, resulting in long solution times even on a modern high-end processor. To obtain enhanced computational throughput and global search capability, we detail the coarse-grained parallelization of an increasingly popular global search method, the particle swarm optimization (PSO) algorithm. Parallel PSO performance was evaluated using two categories of optimization problems possessing multiple local minima: large-scale analytical test problems with computationally cheap function evaluations and medium-scale biomechanical system identification problems with computationally expensive function evaluations. For load-balanced analytical test problems formulated using 128 design variables, speedup was close to ideal and parallel efficiency above 95% for up to 32 nodes on a Beowulf cluster. In contrast, for load-imbalanced biomechanical system identification problems with 12 design variables, speedup plateaued and parallel efficiency decreased almost linearly with increasing number of nodes. The primary factor affecting parallel performance was the synchronization requirement of the parallel algorithm, which dictated that each iteration must wait for completion of the slowest fitness evaluation. When the analytical problems were solved using a fixed number of swarm iterations, a single population of 128 particles produced a better convergence rate than did multiple independent runs performed using sub-populations (8 runs with 16 particles, 4 runs with 32 particles, or 2 runs with 64 particles). These results suggest that (1) parallel PSO exhibits excellent parallel performance under load-balanced conditions, (2) an asynchronous implementation would be valuable for real-life problems subject to load imbalance, and (3) larger population sizes should be considered when multiple processors are available.
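The synchronization requirement discussed in this abstract is easiest to see in code. The following is a minimal sketch of a synchronous, coarse-grained parallel PSO, not the authors' implementation: the coefficient values, the search bounds, the `sphere` test function and the use of a thread pool in place of cluster nodes are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sphere(x):
    """Cheap analytical test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def parallel_pso(f, dim, n_particles=20, iters=200, workers=4, seed=0,
                 w=0.729, c1=1.494, c2=1.494, lo=-5.0, hi=5.0):
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [float("inf")] * n_particles
    g_pos, g_val = pos[0][:], float("inf")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iters):
            # Synchronization point: the list is complete only after every
            # evaluation has finished, so the slowest one stalls the iteration.
            values = list(pool.map(f, pos))
            for i, val in enumerate(values):
                if val < best_val[i]:
                    best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
            # Velocity and position updates use personal and global bests.
            for i in range(n_particles):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (best_pos[i][d] - pos[i][d])
                                 + c2 * r2 * (g_pos[d] - pos[i][d]))
                    pos[i][d] += vel[i][d]
    return g_pos, g_val
```

A call such as `parallel_pso(sphere, dim=2)` returns the best position found and its function value. Threads are used here only to expose the barrier structure; for computationally expensive fitness functions a process pool or a message-passing layer would be the more realistic choice.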
A Data Parallel Algorithm for XML DOM Parsing
NASA Astrophysics Data System (ADS)
Shah, Bhavik; Rao, Praveen R.; Moon, Bongki; Rajagopalan, Mohan
The extensible markup language XML has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation performed on an XML document for it to be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores, and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing, that builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks and parsed in parallel. In the second phase, partial DOM node tree structures created during the first phase, are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme - each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports
Parallel global optimization with the particle swarm algorithm
Schutte, J. F.; Reinbolt, J. A.; Fregly, B. J.; Haftka, R. T.; George, A. D.
2007-01-01
SUMMARY Present day engineering optimization problems often impose large computational demands, resulting in long solution times even on a modern high-end processor. To obtain enhanced computational throughput and global search capability, we detail the coarse-grained parallelization of an increasingly popular global search method, the particle swarm optimization (PSO) algorithm. Parallel PSO performance was evaluated using two categories of optimization problems possessing multiple local minima—large-scale analytical test problems with computationally cheap function evaluations and medium-scale biomechanical system identification problems with computationally expensive function evaluations. For load-balanced analytical test problems formulated using 128 design variables, speedup was close to ideal and parallel efficiency above 95% for up to 32 nodes on a Beowulf cluster. In contrast, for load-imbalanced biomechanical system identification problems with 12 design variables, speedup plateaued and parallel efficiency decreased almost linearly with increasing number of nodes. The primary factor affecting parallel performance was the synchronization requirement of the parallel algorithm, which dictated that each iteration must wait for completion of the slowest fitness evaluation. When the analytical problems were solved using a fixed number of swarm iterations, a single population of 128 particles produced a better convergence rate than did multiple independent runs performed using sub-populations (8 runs with 16 particles, 4 runs with 32 particles, or 2 runs with 64 particles). These results suggest that (1) parallel PSO exhibits excellent parallel performance under load-balanced conditions, (2) an asynchronous implementation would be valuable for real-life problems subject to load imbalance, and (3) larger population sizes should be considered when multiple processors are available. PMID:17891226
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
A Distributed Polygon Retrieval Algorithm Using MapReduce
NASA Astrophysics Data System (ADS)
Guo, Q.; Palanisamy, B.; Karimi, H. A.
2015-07-01
The burst of large-scale spatial terrain data due to the proliferation of data acquisition devices like 3D laser scanners poses challenges to spatial data analysis and computation. Among many spatial analyses and computations, polygon retrieval is a fundamental operation which is often performed under real-time constraints. However, existing sequential algorithms fail to meet this demand for larger sizes of terrain data. Motivated by the MapReduce programming model, a well-adopted large-scale parallel data processing technique, we present a MapReduce-based polygon retrieval algorithm designed with the objective of reducing the IO and CPU loads of spatial data processing. By indexing the data based on a quad-tree approach, a significant amount of unneeded data is filtered in the filtering stage and it reduces the IO overhead. The indexed data also facilitates querying the relationship between the terrain data and query area in shorter time. The results of the experiments performed in our Hadoop cluster demonstrate that our algorithm performs significantly better than the existing distributed algorithms.
NASA Astrophysics Data System (ADS)
Mighell, Kenneth John
2010-10-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the
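The partitioning idea in this abstract, nearly equal rectangular subimages padded with extra pixels along shared edges so that processes never need to exchange data, can be sketched as follows. This is not the crblaster source: the function name, the explicit rows-by-columns grid and the halo width are assumptions for illustration.

```python
def partition_image(height, width, rows, cols, halo):
    """Split a height x width image into rows * cols tiles padded by `halo` pixels.

    Returns a list of (core, padded) pairs of half-open boxes (y0, y1, x0, x1).
    """
    def edges(length, parts):
        # Nearly equal cut positions: tile sizes differ by at most one pixel.
        return [length * k // parts for k in range(parts + 1)]

    ys, xs = edges(height, rows), edges(width, cols)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            core = (ys[r], ys[r + 1], xs[c], xs[c + 1])
            # Extend each side by the halo, clipped at the image border.
            padded = (max(0, core[0] - halo), min(height, core[1] + halo),
                      max(0, core[2] - halo), min(width, core[3] + halo))
            tiles.append((core, padded))
    return tiles
```

Each worker would process its padded box and write back only the core region, so the core boxes tile the image exactly once. The reported timings are consistent with the quoted speedup: 44 s divided by 0.875 s is about 50.3.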
Parallel Algorithms for Graph Optimization using Tree Decompositions
Weerapurage, Dinesh P; Sullivan, Blair D; Groer, Christopher S
2013-01-01
Although many NP-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of required dynamic programming tables and excessive running times of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree-decomposition based approach to solve maximum weighted independent set. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Parallel Algorithms for Graph Optimization using Tree Decompositions
Sullivan, Blair D; Weerapurage, Dinesh P; Groer, Christopher S
2012-06-01
Although many $\mathcal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
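To make the dynamic programming idea concrete, here is a minimal sequential sketch for the simplest bounded tree-width case, a tree (tree-width 1). It is not the paper's parallel algorithm and does not use tree decompositions of general graphs; the function name and the adjacency-dictionary input format are assumptions for illustration.

```python
def max_weight_independent_set_tree(adj, weight, root=0):
    """Return the maximum total weight of an independent set in a tree.

    adj maps each vertex to a list of its neighbors; weight maps each
    vertex to a non-negative number.
    """
    # Iterative depth-first search: record parents and a preorder.
    parent = {root: None}
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)

    take = {}  # best weight in the subtree of v when v is in the set
    skip = {}  # best weight in the subtree of v when v is left out
    for v in reversed(order):  # children are handled before their parent
        take[v] = weight[v]
        skip[v] = 0
        for u in adj[v]:
            if u != parent[v]:
                take[v] += skip[u]                 # children of a chosen vertex are excluded
                skip[v] += max(take[u], skip[u])   # otherwise each child is free to choose
    return max(take[root], skip[root])


# Path 0 - 1 - 2 - 3 with weights 3, 2, 2, 3: the best set is {0, 3}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
best = max_weight_independent_set_tree(path, {0: 3, 1: 2, 2: 2, 3: 3})
```

For graphs of larger tree-width the two-entry table per vertex becomes a table over all independent subsets of each bag, which is the source of the memory pressure the abstract mentions.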
Feed-forward volume rendering algorithm for moderately parallel MIMD machines
NASA Technical Reports Server (NTRS)
Yagel, Roni
1993-01-01
Algorithms for direct volume rendering on parallel and vector processors are investigated. Volumes are transformed efficiently on parallel processors by dividing the data into slices and beams of voxels. Equal-sized sets of slices along one axis are distributed to processors. Parallelism is achieved at two levels. Because each slice can be transformed independently of others, processors transform their assigned slices with no communication, thus providing maximum possible parallelism at the first level. Within each slice, consecutive beams are incrementally transformed using coherency in the transformation computation. Also, coherency across slices can be exploited to further enhance performance. This coherency yields the second level of parallelism through the use of vector processing or pipelining. Other ongoing efforts include investigations into image reconstruction techniques, load balancing strategies, and improving performance.
Fast, Parallel and Secure Cryptography Algorithm Using Lorenz's Attractor
NASA Astrophysics Data System (ADS)
Marco, Anderson Gonçalves; Martinez, Alexandre Souto; Bruno, Odemir Martinez
A novel cryptography method based on the Lorenz's attractor chaotic system is presented. The proposed algorithm is secure and fast, making it practical for general use. We introduce the chaotic operation mode, which provides an interaction among the password, message and a chaotic system. It ensures that the algorithm yields a secure codification, even if the nature of the chaotic system is known. The algorithm has been implemented in two versions: one sequential and slow and the other, parallel and fast. Our algorithm assures the integrity of the ciphertext (we know if it has been altered, which is not assured by traditional algorithms) and consequently its authenticity. Numerical experiments are presented, discussed and show the behavior of the method in terms of security and performance. The fast version of the algorithm has a performance comparable to AES, a popular cryptography program used commercially nowadays, but it is more secure, which makes it immediately suitable for general purpose cryptography applications. An internet page has been set up, which enables the readers to test the algorithm and also to try to break into the cipher.
A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix
NASA Technical Reports Server (NTRS)
Shroff, Gautam
1989-01-01
A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this problem typically display only linear convergence. Sequential norm-reducing algorithms also exist, and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm, and certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.
Parallel algorithms and architectures for very fast AI search
Gu, J.
1989-01-01
A wide range of problems in natural and artificial intelligence, computer vision, computer graphics, database engineering, operations research, symbolic logic, robot manipulation and hardware design automation are special cases of Consistent Labeling Problems (CLP). CLP has long been viewed as an efficient computational model based on a unit constraint relation containing 2N-tuples of units and labels which specifies which N-tuples of labels are compatible with which N-tuples of units. Due to high computation cost and design complexity, most currently best-known algorithms and computer architectures have usually proven infeasible for solving the consistent labeling problems. Efficiency in CLP computation during the last decade has only been improved a few times. This research presents several parallel algorithms and computer architectures for solving CLP within a parallel processing framework. For problems of practical interest, 4 to 10 orders of magnitude of efficiency improvement can be easily reached. Several simple wafer scale computer architectures are given which implement these parallel algorithms at a surprisingly low cost.
Parallel algorithm for computing 3-D reachable workspaces
NASA Astrophysics Data System (ADS)
Alameldin, Tarek K.; Sobh, Tarek M.
1992-03-01
The problem of computing the 3-D workspace for redundant articulated chains has applications in a variety of fields such as robotics, computer aided design, and computer graphics. The computational complexity of the workspace problem is at least NP-hard. The recent advent of parallel computers has made practical solutions for the workspace problem possible. Parallel algorithms for computing the 3-D workspace for redundant articulated chains with joint limits are presented. The first phase of these algorithms computes workspace points in parallel. The second phase uses workspace points that are computed in the first phase and fits a 3-D surface around the volume that encompasses the workspace points. The second phase also maps the 3-D points into slices, uses region filling to detect the holes and voids in the workspace, extracts the workspace boundary points by testing the neighboring cells, and tiles the consecutive contours with triangles. The proposed algorithms are efficient for computing the 3-D reachable workspace for articulated linkages, not only those with redundant degrees of freedom but also those with joint limits.
A parallel genetic algorithm for the set partitioning problem
Levine, D.
1996-12-31
This paper describes a parallel genetic algorithm developed for the solution of the set partitioning problem, a difficult combinatorial optimization problem used by many airlines as a mathematical model for flight crew scheduling. The genetic algorithm is based on an island model where multiple independent subpopulations each run a steady-state genetic algorithm on their own subpopulation and occasionally fit strings migrate between the subpopulations. Tests on forty real-world set partitioning problems were carried out on up to 128 nodes of an IBM SP1 parallel computer. We found that performance, as measured by the quality of the solution found and the iteration on which it was found, improved as additional subpopulations were added to the computation. With larger numbers of subpopulations the genetic algorithm was regularly able to find the optimal solution to problems having up to a few thousand integer variables. In two cases, high-quality integer feasible solutions were found for problems with 36,699 and 43,749 integer variables, respectively. A notable limitation we found was the difficulty solving problems with many constraints.
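The island model described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the islands run one after another in a single process rather than on separate nodes, the objective is the OneMax bit-counting problem instead of set partitioning, and the ring migration topology, tournament selection, and all parameter values are assumptions.

```python
import random


def fitness(bits):
    """Toy OneMax objective standing in for a real cost function."""
    return sum(bits)


def steady_state_step(pop, rng, mutation_rate=0.02):
    """Breed one child and let it replace the worst string if it is no worse."""
    a = max(rng.sample(pop, 2), key=fitness)   # binary tournament selection
    b = max(rng.sample(pop, 2), key=fitness)
    cut = rng.randrange(1, len(a))             # one-point crossover
    child = a[:cut] + b[cut:]
    child = [bit ^ (rng.random() < mutation_rate) for bit in child]
    worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
    if fitness(child) >= fitness(pop[worst]):
        pop[worst] = child


def island_ga(n_islands=4, pop_size=20, n_bits=32, steps=2000,
              migrate_every=100, seed=0):
    """Run several steady-state GAs with occasional ring migration."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n_bits)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for step in range(1, steps + 1):
        for pop in islands:
            steady_state_step(pop, rng)
        if step % migrate_every == 0:
            # Each island's best string replaces the worst string
            # on the next island in the ring.
            migrants = [max(pop, key=fitness) for pop in islands]
            for i, migrant in enumerate(migrants):
                dest = islands[(i + 1) % n_islands]
                worst = min(range(pop_size), key=lambda j: fitness(dest[j]))
                dest[worst] = list(migrant)
    return max((ind for pop in islands for ind in pop), key=fitness)
```

Calling `island_ga()` returns the fittest bit string found across all islands. In a distributed version each island would live on its own node and only the migrants would cross the network.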
A novel highly parallel algorithm for linearly unmixing hyperspectral images
NASA Astrophysics Data System (ADS)
Guerra, Raúl; López, Sebastián.; Callico, Gustavo M.; López, Jose F.; Sarmiento, Roberto
2014-10-01
Endmember extraction and abundance calculation are critical steps in linearly unmixing a hyperspectral image, for two main reasons. The first is the need to compute a set of accurate endmembers in order to obtain reliable abundance maps. The second is the huge number of operations involved in these time-consuming processes. This work proposes an algorithm that estimates the endmembers of a hyperspectral image under analysis and their abundances at the same time. The main advantages of this algorithm are its high degree of parallelization and the mathematical simplicity of the operations implemented. The algorithm estimates the endmembers as virtual pixels. In particular, the proposed algorithm applies the gradient descent method to iteratively refine the endmembers and the abundances, reducing the mean square error according to the linear unmixing model. Some mathematical restrictions must be added so that the method converges to a unique and realistic solution. Given the nature of the algorithm, these restrictions can be easily implemented. The results obtained with synthetic images demonstrate the good behavior of the proposed algorithm. Moreover, the results obtained with the well-known Cuprite dataset also corroborate the benefits of our proposal.
Potts-model grain growth simulations: Parallel algorithms and applications
Wright, S.A.; Plimpton, S.J.; Swiler, T.P.
1997-08-01
Microstructural morphology and grain boundary properties often control the service properties of engineered materials. This report uses the Potts-model to simulate the development of microstructures in realistic materials. Three areas of microstructural morphology simulations were studied. They include the development of massively parallel algorithms for Potts-model grain growth simulations, modeling of mass transport via diffusion in these simulated microstructures, and the development of a gradient-dependent Hamiltonian to simulate columnar grain growth. Potts grain growth models for massively parallel supercomputers were developed for the conventional Potts-model in both two and three dimensions. Simulations using these parallel codes showed self-similar grain growth and no finite size effects for previously unapproachable large scale problems. In addition, new enhancements to the conventional Metropolis algorithm used in the Potts-model were developed to accelerate the calculations. These techniques enable both the sequential and parallel algorithms to run faster and use essentially an infinite number of grain orientation values to avoid non-physical grain coalescence events. Mass transport phenomena in polycrystalline materials were studied in two dimensions using numerical diffusion techniques on microstructures generated using the Potts-model. The results of the mass transport modeling showed excellent quantitative agreement with one-dimensional diffusion problems; however, the results also suggest that transient multidimensional diffusion effects cannot be parameterized as the product of the grain boundary diffusion coefficient and the grain boundary width. Instead, both properties are required. Gradient-dependent grain growth mechanisms were included in the Potts-model by adding an extra term to the Hamiltonian. Under normal grain growth, the primary driving term is the curvature of the grain boundary, which is included in the standard Potts-model Hamiltonian.
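A minimal sequential sketch of a conventional Potts-model Metropolis sweep may help fix ideas. It is not the report's parallel code or its accelerated variant: the square lattice, 4-neighbor interaction, periodic boundaries, zero-temperature acceptance rule, and all function names are assumptions chosen for brevity. The energy is simply the number of neighboring site pairs with different grain orientations.

```python
import random


def unlike_neighbors(grid, i, j, value):
    """Count the 4-neighbors of site (i, j) whose orientation differs from value."""
    n = len(grid)
    nbrs = [((i - 1) % n, j), ((i + 1) % n, j),
            (i, (j - 1) % n), (i, (j + 1) % n)]
    return sum(grid[a][b] != value for a, b in nbrs)


def total_energy(grid):
    """Number of unlike nearest-neighbor bonds (each bond is seen from both ends)."""
    n = len(grid)
    return sum(unlike_neighbors(grid, i, j, grid[i][j])
               for i in range(n) for j in range(n)) // 2


def metropolis_sweep(grid, n_orientations, rng):
    """Attempt one orientation change per lattice site, on average."""
    n = len(grid)
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        new = rng.randrange(n_orientations)
        # Only the bonds touching site (i, j) change, so the energy
        # difference can be computed locally.
        delta = (unlike_neighbors(grid, i, j, new)
                 - unlike_neighbors(grid, i, j, grid[i][j]))
        if delta <= 0:  # zero-temperature rule: never accept an energy increase
            grid[i][j] = new
```

Because each update reads only nearest neighbors, sites that are far apart can be updated concurrently, which is the property a domain-decomposed parallel version relies on.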
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation
NASA Technical Reports Server (NTRS)
Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.
1996-01-01
We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.
BMI optimization by using parallel UNDX real-coded genetic algorithm with Beowulf cluster
NASA Astrophysics Data System (ADS)
Handa, Masaya; Kawanishi, Michihiro; Kanki, Hiroshi
2007-12-01
This paper deals with a global optimization algorithm for Bilinear Matrix Inequalities (BMIs) based on the Unimodal Normal Distribution Crossover (UNDX) GA. First, by analyzing the structure of the BMIs, the existence of typical difficult structures is confirmed. Then, in order to improve the performance of the algorithm, based on the results of the problem structure analysis and consideration of the characteristic properties of BMIs, we propose an algorithm using a primary search direction with relaxed Linear Matrix Inequality (LMI) convex estimation. Moreover, in these algorithms, we propose two types of evaluation methods for GA individuals based on LMI calculation that take the characteristic properties of BMIs further into account. In addition, in order to reduce computational time, we propose a parallelization of the RCGA algorithm using the Master-Worker paradigm with cluster computing techniques.
Parallel computer graphics algorithms for the Connection Machine
Richardson, J.F.
1990-01-01
Many of the classes of computer graphics algorithms and polygon storage schemes can be adapted for parallel execution on various parallel architectures. The Connection Machine is one such architecture that should be thought of as a multiprocessor grid that can be reconfigured into standard 2-dimensional mesh and n-dimensional hypercube architectures. The classes of algorithms considered in this paper are SPLINES; POLYGON STORAGE; TRIANGULARIZATION; and SYMBOLIC INPUT. The target Connection Machine (hereafter designated as CM) for the algorithms of this paper has 8192 physical processors. Each physical processor has 8 kilobytes of local memory plus an arithmetic-logic unit. All processors can communicate with any other processor through a router. Thus this CM has a shared memory of 64 megabytes when used as a standard multiprocessor (MIMD) architecture. In addition, the CM interconnection structure can simulate a 2-dimensional mesh and n-dimensional hypercube (SIMD) architecture with the mesh being the default architecture. The front end for the CM is a Symbolics and the high level language is LISP or FORTRAN.
A parallel algorithm for 3D dislocation dynamics
NASA Astrophysics Data System (ADS)
Wang, Zhiqiang; Ghoniem, Nasr; Swaminarayan, Sriram; LeSar, Richard
2006-12-01
Dislocation dynamics (DD), a discrete dynamic simulation method in which dislocations are the fundamental entities, is a powerful tool for investigation of plasticity, deformation and fracture of materials at the micron length scale. However, severe computational difficulties arising from complex, long-range interactions between these curvilinear line defects limit the application of DD in the study of large-scale plastic deformation. We present here the development of a parallel algorithm for accelerated computer simulations of DD. By representing dislocations as a 3D set of dislocation particles, we show here that the problem of an interacting ensemble of dislocations can be converted to a problem of a particle ensemble, interacting with a long-range force field. A grid using binary space partitioning is constructed to keep track of node connectivity across domains. We demonstrate the computational efficiency of the parallel micro-plasticity code and discuss how O(N) methods map naturally onto the parallel data structure. Finally, we present results from applications of the parallel code to deformation in single crystal fcc metals.
Adaptive Mesh Refinement Algorithms for Parallel Unstructured Finite Element Codes
Parsons, I D; Solberg, J M
2006-02-03
This project produced algorithms for and software implementations of adaptive mesh refinement (AMR) methods for solving practical solid and thermal mechanics problems on multiprocessor parallel computers using unstructured finite element meshes. The overall goal is to provide computational solutions that are accurate to some prescribed tolerance, and adaptivity is the correct path toward this goal. These new tools will enable analysts to conduct more reliable simulations at reduced cost, both in terms of analyst and computer time. Previous academic research in the field of adaptive mesh refinement has produced a voluminous literature focused on error estimators and demonstration problems; relatively little progress has been made on producing efficient implementations suitable for large-scale problem solving on state-of-the-art computer systems. Research issues that were considered include: effective error estimators for nonlinear structural mechanics; local meshing at irregular geometric boundaries; and constructing efficient software for parallel computing environments.
A Simple Physical Optics Algorithm Perfect for Parallel Computing
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1993-01-01
One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
A parallel algorithm for solving the 3d Schroedinger equation
Strickland, Michael; Yager-Elorriaga, David
2010-08-20
We describe a parallel algorithm for solving the time-independent 3d Schroedinger equation using the finite difference time domain (FDTD) method. We introduce an optimized parallelization scheme that reduces communication overhead between computational nodes. We demonstrate that the compute time, t, scales inversely with the number of computational nodes as $t \propto (N_{\mathrm{nodes}})^{-0.95 \pm 0.04}$. This makes it possible to solve the 3d Schroedinger equation on extremely large spatial lattices using a small computing cluster. In addition, we present a new method for precisely determining the energy eigenvalues and wavefunctions of quantum states based on a symmetry constraint on the FDTD initial condition. Finally, we discuss the usage of multi-resolution techniques in order to speed up convergence on extremely large lattices.
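The reported scaling exponent can be turned into a rough speedup and efficiency estimate. The sketch below assumes the power law holds all the way down to a single node, which the abstract does not state, and the function names are illustrative only.

```python
def predicted_speedup(n_nodes, exponent=0.95):
    """Speedup over one node if compute time scales as n_nodes ** (-exponent)."""
    return n_nodes ** exponent


def parallel_efficiency(n_nodes, exponent=0.95):
    """Fraction of ideal linear speedup retained at n_nodes."""
    return predicted_speedup(n_nodes, exponent) / n_nodes


# Print the estimate for the central exponent and the quoted uncertainty band.
for n in (2, 8, 64):
    low, mid, high = (parallel_efficiency(n, e) for e in (0.91, 0.95, 0.99))
    print(f"{n:3d} nodes: efficiency {mid:.2f} (range {low:.2f} to {high:.2f})")
```

With an exponent of exactly 1 the efficiency would stay at 100 percent; an exponent slightly below 1 means efficiency decays slowly as nodes are added.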
Crane, N K; Parsons, I D; Hjelmstad, K D
2002-03-21
Adaptive mesh refinement selectively subdivides the elements of a coarse user supplied mesh to produce a fine mesh with reduced discretization error. Effective use of adaptive mesh refinement coupled with an a posteriori error estimator can produce a mesh that solves a problem to a given discretization error using far fewer elements than uniform refinement. A geometric multigrid solver uses increasingly finer discretizations of the same geometry to produce a very fast and numerically scalable solution to a set of linear equations. Adaptive mesh refinement is a natural method for creating the different meshes required by the multigrid solver. This paper describes the implementation of a scalable adaptive multigrid method on a distributed memory parallel computer. Results are presented that demonstrate the parallel performance of the methodology by solving a linear elastic rocket fuel deformation problem on an SGI Origin 3000. Two challenges must be met when implementing adaptive multigrid algorithms on massively parallel computing platforms. First, although the fine mesh for which the solution is desired may be large and scaled to the number of processors, the multigrid algorithm must also operate on much smaller fixed-size data sets on the coarse levels. Second, the mesh must be repartitioned as it is adapted to maintain good load balancing. In an adaptive multigrid algorithm, separate mesh levels may require separate partitioning, further complicating the load balance problem. This paper shows that, when the proper optimizations are made, parallel adaptive multigrid algorithms perform well on machines with several hundreds of processors.
Serial Order: A Parallel Distributed Processing Approach.
ERIC Educational Resources Information Center
Jordan, Michael I.
Human behavior shows a variety of serially ordered action sequences. This paper presents a theory of serial order which describes how sequences of actions might be learned and performed. In this theory, parallel interactions across time (coarticulation) and parallel interactions across space (dual-task interference) are viewed as two aspects of a…
Algorithms for parallel flow solvers on message passing architectures
NASA Astrophysics Data System (ADS)
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
A parallel dynamic load balancing algorithm for 3-D adaptive unstructured grids
NASA Technical Reports Server (NTRS)
Vidwans, A.; Kallinderis, Y.; Venkatakrishnan, V.
1993-01-01
Adaptive local grid refinement and coarsening results in unequal distribution of workload among the processors of a parallel system. A novel method for balancing the load in cases of dynamically changing tetrahedral grids is developed. The approach employs local exchange of cells among processors in order to redistribute the load equally. An important part of the load balancing algorithm is the method employed by a processor to determine which cells within its subdomain are to be exchanged. Two such methods are presented and compared. The strategy for load balancing is based on the Divide-and-Conquer approach which leads to an efficient parallel algorithm. This method is implemented on a distributed-memory MIMD system.
Programming environment for parallel-vision algorithms. Annual report, February 1985-February 1986
Brown
1986-08-01
During the first year of the award period, three main lines of work were pursued: systems support algorithms, Butterfly programming environment, and vision applications. Today's multiprocessor computer architectures are not efficiently programmed or even conceptualized with standard computer languages, and their operating systems and debugging tools are also challengingly different. The University of Rochester is doing work in the area of tools for controlling large-grain parallelism, as one finds in a distributed multiprocessor application like the Autonomous Land Vehicle, or in tightly coupled processors like the Hypercube or the Butterfly Parallel Processor.
Parallel simulations of Grover's algorithm for closest match search in neutron monitor data
NASA Astrophysics Data System (ADS)
Kussainov, Arman; White, Yelena
We are studying the parallel implementations of Grover's closest match search algorithm for neutron monitor data analysis. This includes data formatting, and matching quantum parameters to a conventional structure of a chosen programming language and selected experimental data type. We have employed several workload distribution models based on acquired data and search parameters. As a result of these simulations, we have an understanding of potential problems that may arise during configuration of real quantum computational devices and the way they could run tasks in parallel. The work was supported by the Science Committee of the Ministry of Science and Education of the Republic of Kazakhstan Grant #2532/GF3.
Efficient parallel algorithms for string editing and related problems
NASA Technical Reports Server (NTRS)
Apostolico, Alberto; Atallah, Mikhail J.; Larmore, Lawrence; Mcfaddin, H. S.
1988-01-01
The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x, or the substitution of a symbol of x with another symbol. This problem has a well-known $O(|x||y|)$ time sequential solution (25). Efficient parallel random access machine (PRAM) algorithms for the string editing problem are given. If $m = \min(|x|,|y|)$ and $n = \max(|x|,|y|)$, then the CREW bound is $O(\log m \log n)$ time with $O(mn/\log m)$ processors. In all algorithms, space is $O(mn)$.
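For reference, the well-known sequential solution mentioned above is a dynamic program over prefixes of x and y. The sketch below is that textbook baseline, not one of the paper's parallel algorithms; it assumes a single fixed cost per operation type (the general problem allows symbol-dependent weights), and the function and parameter names are illustrative. It keeps only two table rows, so it returns the cost but not the edit script.

```python
def edit_cost(x, y, delete=1, insert=1, substitute=1):
    """Minimum total cost of transforming x into y.

    Runs in O(len(x) * len(y)) time. prev[j] holds the cost of turning
    the first i - 1 symbols of x into the first j symbols of y.
    """
    prev = [j * insert for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * delete] + [0] * len(y)
        for j in range(1, len(y) + 1):
            same = x[i - 1] == y[j - 1]
            curr[j] = min(prev[j] + delete,        # drop x[i-1]
                          curr[j - 1] + insert,    # add y[j-1]
                          prev[j - 1] + (0 if same else substitute))
        prev = curr
    return prev[len(y)]
```

For example, `edit_cost("kitten", "sitting")` evaluates to 3 with unit costs. Each table entry depends only on its left, upper, and upper-left neighbors, so the entries along one anti-diagonal are mutually independent, which is one simple source of parallelism in this recurrence.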
Users guide to the PGAPack parallel genetic algorithm library
Levine, D.
1996-01-01
PGAPack is a parallel genetic algorithm library that is intended to provide most capabilities desired in a genetic algorithm package, in an integrated, seamless, and portable manner. Key features of PGAPack are as follows: Ability to be called from Fortran or C. Executable on uniprocessors, multiprocessors, multicomputers, and workstation networks. Binary-, integer-, real-, and character-valued native data types. Object-oriented data structure neutral design. Parameterized population replacement. Multiple choices for selection, crossover, and mutation operators. Easy integration of hill-climbing heuristics. Easy-to-use interface for novice and application users. Multiple levels of access for expert users. Full extensibility to support custom operators and new data types. Extensive debugging facilities. Large set of example problems.
An intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-09-01
Poisson disk sampling has excellent spatial and spectral properties, and plays an important role in a variety of visual computing. Although many promising algorithms have been proposed for multidimensional sampling in Euclidean space, very few studies have been reported with regard to the problem of generating Poisson disks on surfaces due to the complicated nature of the surface. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. In sharp contrast to the conventional parallel approaches, our method neither partitions the given surface into small patches nor uses any spatial data structure to maintain the voids in the sampling domain. Instead, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. Our algorithm guarantees that the generated Poisson disks are uniformly and randomly distributed without bias. It is worth noting that our method is intrinsic and independent of the embedding space. This intrinsic feature allows us to generate Poisson disk patterns on arbitrary surfaces in $\mathbb{R}^n$. To our knowledge, this is the first intrinsic, parallel, and accurate algorithm for surface Poisson disk sampling. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
NASA Technical Reports Server (NTRS)
Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris
2000-01-01
Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office Ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and 0s, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency
A class of parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Multi-jagged: A scalable parallel spatial partitioning algorithm
Deveci, Mehmet; Rajamanickam, Sivasankaran; Devine, Karen D.; Catalyurek, Umit V.
2015-03-18
Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Lastly, our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.
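To make the contrast with recursive bisection concrete, here is a small serial sketch of recursive multi-section: each dimension is cut into a given number of equal-count slabs in one step rather than repeatedly halved. The function name `multisection` and the sort-based cut selection are illustrative assumptions; the paper's parallel algorithm computes cut lines concurrently and decides when to migrate data, none of which is modeled here.

```python
def multisection(points, parts_per_dim):
    """Split points into parts_per_dim[d] equal-count slabs along each dimension d in turn."""
    def split(pts, dim):
        if dim == len(parts_per_dim):
            return [pts]
        k = parts_per_dim[dim]
        pts = sorted(pts, key=lambda p: p[dim])   # serial stand-in for cut-line search
        n = len(pts)
        parts = []
        for i in range(k):
            slab = pts[i * n // k:(i + 1) * n // k]
            parts.extend(split(slab, dim + 1))    # recurse into the next dimension
        return parts
    return split(list(points), 0)

# 24 grid points split into 3 slabs along x, then 2 along y: 6 parts in total.
pts = [(x, y) for x in range(6) for y in range(4)]
parts = multisection(pts, (3, 2))
```

With `parts_per_dim = (3, 2)` the recursion depth is 2, whereas bisection would need three levels to reach a comparable number of parts (8); fewer levels is the intuition behind the reduced data movement.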
An Evaluation of Biosurveillance Grid—Dynamic Algorithm Distribution Across Multiple Computer Nodes
Tsai, Ming-Chi; Tsui, Fu-Chiang; Wagner, Michael M.
2007-01-01
Performing fast data analysis to detect disease outbreaks plays a critical role in real-time biosurveillance. In this paper, we describe and evaluate an Algorithm Distribution Manager Service (ADMS) based on grid technologies, which dynamically partitions and distributes detection algorithms across multiple computers. We compared the execution time to perform the analysis on a single computer and on a grid network (3 computing nodes) with and without dynamic algorithm distribution. We found that algorithms with long runtimes completed approximately three times earlier in the distributed environment than on a single computer, while algorithms with short runtimes performed worse in the distributed environment. The dynamic algorithm distribution approach also performed better than the static algorithm distribution approach. This pilot study shows great potential to reduce lengthy analysis time through dynamic algorithm partitioning and parallel processing, and provides the opportunity of distributing algorithms from a client to remote computers in a grid network. PMID:18693936
Parallel radiation transport algorithms and associated architectural requirements
Morel, J. E.; Baker, R. S.; Warsa, J. S.
2004-01-01
The radiation transport equation is a seven-dimensional equation that can be extremely expensive to solve. In general, transport can be expected to completely dominate the memory and CPU time requirements for the ASCI codes. Both traditional iterative transport solution methods and modern Krylov-subspace solution methods require the inversion of a large number of block lower-diagonal matrices. While such inversions are easily done in serial, a high level of sophistication is needed for implementations on massively parallel platforms. Rectangular-mesh methods are well-established and generally quite efficient but unstructured-mesh methods remain a research topic. Nonetheless, considerable progress has been made in unstructured-mesh methods over the last several years. In general, the efficiency of transport solution algorithms is quite sensitive to communication latencies and bandwidth, but there are other significant considerations as well. Some new parallel algorithms have recently been defined that may be significantly better than existing methods for time-dependent problems, but will be significantly less effective for steady-state problems in some circumstances. Transport methods would benefit from a machine architecture with low latencies, high bandwidth, and on the order of one thousand very fast, large-memory processors, as opposed to an architecture that consists of a very large number of slower processors with less memory. In addition, a lightweight operating system is highly desirable.
Massively parallel algorithms for trace-driven cache simulations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Greenberg, Albert G.; Lubachevsky, Boris D.
1991-01-01
Trace-driven cache simulation is central to computer design. A trace is a very long sequence of reference lines from main memory. At the t-th instant, reference x_t is hashed into a set of cache locations, the contents of which are then compared with x_t. If at the t-th instant x_t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x_t present for the (t+1)-st instant. The problem of parallel simulation of a subtrace of N references directed to a C-line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least Recently Used (LRU) policy, which regardless of the set size C runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. Timings are presented of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies is considered, which includes LRU as well as the Least Frequently Used and Random replacement policies. A simulation method is presented for any such policy that on any trace of length N directed to a C-line set runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
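For orientation, the serial problem being parallelized can be sketched in a few lines. The first function below is a plain sequential LRU simulation of a single C-line cache set; the second restates the same result through reuse distance (a reference hits exactly when fewer than C distinct lines were referenced since its previous occurrence), a standard LRU property that removes the step-by-step dependence on cache state. Both function names are illustrative, and neither is the paper's EREW algorithm.

```python
from collections import OrderedDict

def lru_misses(trace, C):
    """Sequentially simulate one C-line LRU cache set; return a miss flag per reference."""
    cache = OrderedDict()              # keys ordered from least to most recently used
    flags = []
    for x in trace:
        if x in cache:
            cache.move_to_end(x)       # hit: x becomes most recently used
            flags.append(False)
        else:
            if len(cache) == C:
                cache.popitem(last=False)   # evict the least recently used line
            cache[x] = True
            flags.append(True)
    return flags

def lru_misses_by_distance(trace, C):
    """Same flags, computed from the number of distinct lines since the previous use."""
    last_seen = {}
    flags = []
    for t, x in enumerate(trace):
        if x not in last_seen:
            flags.append(True)         # first reference is always a miss
        else:
            distinct = len(set(trace[last_seen[x] + 1:t]))
            flags.append(distinct >= C)
        last_seen[x] = t
    return flags

trace = ["a", "b", "a", "c", "b", "d", "a"]
flags = lru_misses(trace, 2)
```

The second formulation is quadratic as written and is included only to show that each reference's outcome can be decided from the trace alone.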
Parallel algorithm for computing points on a computation front hyperplane
NASA Astrophysics Data System (ADS)
Krasnov, M. M.
2015-01-01
A parallel algorithm for computing points on a computation front hyperplane is described. This task arises in the computation of a quantity defined on a multidimensional rectangular domain. Three-dimensional domains are usually discussed, but the material is given in the general form when the number of dimensions is at least two. When the values of a quantity at different points are internally independent (which is frequently the case), the corresponding computations are independent as well and can be performed in parallel. However, if there are internal dependences (as, for example, in the Gauss-Seidel method for systems of linear equations), then the order of scanning points of the domain is an important issue. A conventional approach in this case is to form a computation front hyperplane (a usual plane in the three-dimensional case and a line in the two-dimensional case) that moves linearly across the domain at a certain angle. At every step in the course of motion of this hyperplane, its intersection points with the domain can be treated independently and, hence, in parallel, but the steps themselves are executed sequentially. At different steps, the intersection of the hyperplane with the entire domain can have a rather complex geometry and the search for all points of the domain lying on the hyperplane at a given step is a nontrivial problem. This problem (i.e., the computation of the coordinates of points lying in the intersection of the domain with the hyperplane at a given step in the course of hyperplane motion) is addressed below. The computations over the points of the hyperplane can be executed in parallel.
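A simple way to picture the problem is to take the front to be the hyperplane on which the integer indices sum to the step number s. That particular orientation is an assumption made here for illustration (the abstract only says the plane moves "at a certain angle"), and the generator name `front_points` is hypothetical. The sketch enumerates, for a rectangular index domain of any dimension, exactly the points lying on the front at step s.

```python
def front_points(shape, s):
    """Yield all index tuples p with 0 <= p[d] < shape[d] and sum(p) == s."""
    def rec(dim, remaining, prefix):
        if dim == len(shape) - 1:
            # The last coordinate is forced; keep the point only if it is inside the domain.
            if 0 <= remaining < shape[dim]:
                yield prefix + (remaining,)
            return
        # The later dimensions can absorb at most this much of the remaining sum.
        max_rest = sum(n - 1 for n in shape[dim + 1:])
        lo = max(0, remaining - max_rest)
        hi = min(shape[dim] - 1, remaining)
        for i in range(lo, hi + 1):
            yield from rec(dim + 1, remaining - i, prefix + (i,))
    yield from rec(0, s, ())

shape = (3, 4, 2)
n_steps = sum(n - 1 for n in shape) + 1
# Steps run sequentially; the points inside one front are mutually independent.
fronts = [list(front_points(shape, s)) for s in range(n_steps)]
```

For a Gauss-Seidel style dependence on neighbors with smaller indices, every point in `fronts[s]` depends only on points in earlier fronts, so each inner list could be processed in parallel.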
A scalable parallel algorithm for large-scale reactive force-field molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Nomura, Ken-ichi; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
2008-01-01
A scalable parallel algorithm has been designed to perform multimillion-atom molecular dynamics (MD) simulations, in which first principles-based reactive force fields (ReaxFF) describe chemical reactions. Environment-dependent bond orders associated with atomic pairs and their derivatives are reused extensively with the aid of linked-list cells to minimize the computation associated with atomic n-tuple interactions ( n⩽4 explicitly and ⩽6 due to chain-rule differentiation). These n-tuple computations are made modular, so that they can be reconfigured effectively with a multiple time-step integrator to further reduce the computation time. Atomic charges are updated dynamically with an electronegativity equalization method, by iteratively minimizing the electrostatic energy with the charge-neutrality constraint. The ReaxFF-MD simulation algorithm has been implemented on parallel computers based on a spatial decomposition scheme combined with distributed n-tuple data structures. The measured parallel efficiency of the parallel ReaxFF-MD algorithm is 0.998 on 131,072 IBM BlueGene/L processors for a 1.01 billion-atom RDX system.
Parallel Information Processing.
ERIC Educational Resources Information Center
Rasmussen, Edie M.
1992-01-01
Examines parallel computer architecture and the use of parallel processors for text. Topics discussed include parallel algorithms; performance evaluation; parallel information processing; parallel access methods for text; parallel and distributed information retrieval systems; parallel hardware for text; and network models for information…
Katouda, Michio; Nakajima, Takahito
2013-12-10
A new algorithm for massively parallel calculations of electron correlation energy of large molecules based on the resolution of identity second-order Møller-Plesset perturbation (RI-MP2) technique is developed and implemented into the quantum chemistry software NTChem. In this algorithm, a Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) hybrid parallel programming model is applied to attain efficient parallel performance on massively parallel supercomputers. An in-core storage scheme of intermediate data of three-center electron repulsion integrals utilizing the distributed memory is developed to eliminate input/output (I/O) overhead. The parallel performance of the algorithm is tested on massively parallel supercomputers such as the K computer (using up to 45 992 central processing unit (CPU) cores) and a commodity Intel Xeon cluster (using up to 8192 CPU cores). The parallel RI-MP2/cc-pVTZ calculation of two-layer nanographene sheets (C150H30)2 (number of atomic orbitals is 9640) is performed using 8991 nodes and 71 288 CPU cores of the K computer. PMID:26592275
Parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D
Attaway, S.W.; Hendrickson, B.A.; Plimpton, S.J.
1996-09-01
An efficient, scalable, parallel algorithm for treating material surface contacts in solid mechanics finite element programs has been implemented in a modular way for MIMD parallel computers. The serial contact detection algorithm that was developed previously for the transient dynamics finite element code PRONTO3D has been extended for use in parallel computation by devising a dynamic (adaptive) processor load balancing scheme.
Adaptive link selection algorithms for distributed estimation
NASA Astrophysics Data System (ADS)
Xu, Songcen; de Lamare, Rodrigo C.; Poor, H. Vincent
2015-12-01
This paper presents adaptive link selection algorithms for distributed estimation and considers their application to wireless sensor networks and smart grids. In particular, exhaustive search-based least mean squares (LMS) / recursive least squares (RLS) link selection algorithms and sparsity-inspired LMS / RLS link selection algorithms that can exploit the topology of networks with poor-quality links are considered. The proposed link selection algorithms are then analyzed in terms of their stability, steady-state, and tracking performance and computational complexity. In comparison with the existing centralized or distributed estimation strategies, the key features of the proposed algorithms are as follows: (1) more accurate estimates and faster convergence speed can be obtained and (2) the network is equipped with the ability of link selection that can circumvent link failures and improve the estimation performance. The performance of the proposed algorithms for distributed estimation is illustrated via simulations in applications of wireless sensor networks and smart grids.
Unitary qubit extremely parallelized algorithms for coupled nonlinear Schrodinger equations
NASA Astrophysics Data System (ADS)
Oganesov, Armen; Flint, Chris; Vahala, George; Vahala, Linda; Yepez, Jeffrey; Soe, Min
2015-11-01
The nonlinear Schrodinger equation (NLS) is a ubiquitous equation occurring in plasma physics, nonlinear optics and in Bose Einstein condensates. Viewed from the BEC standpoint of phase transitions, the wave function is the order parameter and topological defects in that manifold are simply the vortices, which for a scalar NLS have quantized circulation. In multi-species NLS the topological nature of the vortices is radically different, with some classes of vortices no longer having quantized circulation, as in classical turbulence. Moreover, some of the vortex equivalence classes need no longer be Abelian. This strongly affects the permitted vortex reconnections. The effect of these structures on the spectral properties of the ensuing turbulence will be investigated. Our 3D algorithm is based on a novel unitary qubit lattice scheme that is ideally parallelized and has been tested on up to 780 000 cores on Mira. This scheme is mesoscopic (like lattice Boltzmann), but fully unitary (unlike LB). Supported by NSF, DoD.
Performance of a parallel algorithm for standard cell placement on the Intel Hypercube
NASA Technical Reports Server (NTRS)
Jones, Mark; Banerjee, Prithviraj
1987-01-01
A parallel simulated annealing algorithm for standard cell placement on the Intel Hypercube is presented. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than uniprocessor simulated annealing algorithms.
A parallel algorithm for transient solid dynamics simulations with contact detection
Attaway, S.; Hendrickson, B.; Plimpton, S.; Gardner, D.; Vaughan, C.; Heinstein, M.; Peery, J.
1996-06-01
Solid dynamics simulations with Lagrangian finite elements are used to model a wide variety of problems, such as the calculation of impact damage to shipping containers for nuclear waste and the analysis of vehicular crashes. Using parallel computers for these simulations has been hindered by the difficulty of searching efficiently for material surface contacts in parallel. A new parallel algorithm for calculation of arbitrary material contacts in finite element simulations has been developed and implemented in the PRONTO3D transient solid dynamics code. This paper will explore some of the issues involved in developing efficient, portable, parallel finite element models for nonlinear transient solid dynamics simulations. The contact-detection problem poses interesting challenges for efficient implementation of a solid dynamics simulation on a parallel computer. The finite element mesh is typically partitioned so that each processor owns a localized region of the finite element mesh. This mesh partitioning is optimal for the finite element portion of the calculation since each processor must communicate only with the few connected neighboring processors that share boundaries with the decomposed mesh. However, contacts can occur between surfaces that may be owned by any two arbitrary processors. Hence, a global search across all processors is required at every time step to search for these contacts. Load-imbalance can become a problem since the finite element decomposition divides the volumetric mesh evenly across processors but typically leaves the surface elements unevenly distributed. In practice, these complications have been limiting factors in the performance and scalability of transient solid dynamics on massively parallel computers. In this paper the authors present a new parallel algorithm for contact detection that overcomes many of these limitations.
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Brouwer, Randall Jay
1991-01-01
The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Parallel multiphysics algorithms and software for computational nuclear engineering
NASA Astrophysics Data System (ADS)
Gaston, D.; Hansen, G.; Kadioglu, S.; Knoll, D. A.; Newman, C.; Park, H.; Permann, C.; Taitano, W.
2009-07-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provides the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications will be presented: PRONGHORN, our multiphysics gas cooled reactor simulation tool, and BISON, our multiphysics, multiscale fuel performance simulation tool.
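The "Jacobian-free" part of JFNK rests on one identity: a Krylov solver only needs Jacobian-vector products, and these can be approximated by a finite difference of the residual, J(u)v ≈ (F(u + εv) − F(u)) / ε, without ever forming J. The sketch below illustrates only that product on a two-equation toy residual; the function names, the residual `F`, and the fixed step `eps` are assumptions for illustration and are unrelated to the MOOSE code base.

```python
def jv_approx(F, u, v, eps=1e-7):
    """Approximate the Jacobian-vector product J(u) v by a forward difference of F."""
    Fu = F(u)
    F_shifted = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(a - b) / eps for a, b in zip(F_shifted, Fu)]

def F(u):
    """Toy nonlinear residual with two coupled unknowns (hypothetical example)."""
    x, y = u
    return [x * x + y - 3.0, x + y * y - 5.0]

u = [1.0, 2.0]
v = [1.0, -1.0]
approx = jv_approx(F, u, v)
# Analytic Jacobian at u is [[2x, 1], [1, 2y]] = [[2, 1], [1, 4]], so J v = [1, -3].
exact = [2.0 * u[0] * v[0] + v[1], v[0] + 2.0 * u[1] * v[1]]
```

In a full solver this product would be called once per Krylov iteration inside each Newton step; the choice of ε in practice is scaled to the magnitudes of u and v rather than fixed.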
NASA Astrophysics Data System (ADS)
Qureshi, Kalim; Rashid, Haroon
In this paper we experimentally compare the performance of three block-based parallel multiplication algorithms with a simple fixed-size runtime task scheduling strategy on a homogeneous cluster of workstations. Parallel Virtual Machine (PVM) was used for this study.
Partitioning problems in parallel, pipelined and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, S.
1985-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
A conflict-free, path-level parallelization approach for sequential simulation algorithms
NASA Astrophysics Data System (ADS)
Rasera, Luiz Gustavo; Machado, Péricles Lopes; Costa, João Felipe C. L.
2015-07-01
Pixel-based simulation algorithms are the most widely used geostatistical technique for characterizing the spatial distribution of natural resources. However, sequential simulation does not scale well for stochastic simulation on very large grids, which are now commonly found in many petroleum, mining, and environmental studies. With the availability of multiple-processor computers, there is an opportunity to develop parallelization schemes for these algorithms to increase their performance and efficiency. Here we present a conflict-free, path-level parallelization strategy for sequential simulation. The method consists of partitioning the simulation grid into a set of groups of nodes and delegating all available processors for simulation of multiple groups of nodes concurrently. An automated classification procedure determines which groups are simulated in parallel according to their spatial arrangement in the simulation grid. The major advantage of this approach is that it does not require conflict resolution operations, and thus allows exact reproduction of results. Besides offering a large performance gain when compared to the traditional serial implementation, the method provides efficient use of computational resources and is generic enough to be adapted to several sequential algorithms.
A Distributed, Parallel Visualization and Analysis Tool
2007-12-01
VisIt is an interactive parallel visualization and graphical analysis tool for viewing scientific data on UNIX and PC platforms. Users can quickly generate visualizations from their data, animate them through time, manipulate them, and save the resulting images for presentations. VisIt contains a rich set of visualization features so that you can view your data in a variety of ways. It can be used to visualize scalar and vector fields defined on two- and three-dimensional (2D and 3D) structured and unstructured meshes. VisIt was designed to handle very large data set sizes in the terascale range and yet can also handle small data sets in the kilobyte range.
NASA Technical Reports Server (NTRS)
Eidson, T. M.; Erlebacher, G.
1994-01-01
While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solution is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm allow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
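The independence of the right-hand sides can be seen in a short sketch. The solver below is the standard Gaussian-elimination (Thomas) recurrence for a non-periodic tridiagonal system; the periodic corner terms treated in the report are deliberately omitted, and the function names and test matrix are illustrative assumptions. Each call in the final list comprehension touches only its own RHS, which is the work a transpose strategy would assign to a single processor.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-diagonal (a[0] unused), b = diagonal,
    c = super-diagonal (c[-1] unused), d = right-hand side."""
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def matvec(a, b, c, x):
    """Multiply the tridiagonal matrix by x (used only to check residuals)."""
    n = len(b)
    return [(a[i] * x[i - 1] if i > 0 else 0.0) + b[i] * x[i]
            + (c[i] * x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

a = [0.0, -1.0, -1.0, -1.0]
b = [2.0, 2.0, 2.0, 2.0]
c = [-1.0, -1.0, -1.0, 0.0]
rhs_list = [[0.0, 0.0, 0.0, 5.0], [1.0, 0.0, 0.0, 1.0]]
# Each RHS is solved independently; this loop is the parallelizable dimension.
solutions = [thomas(a, b, c, d) for d in rhs_list]
```

The recurrence within a single solve is inherently sequential in i, which is why parallelism is sought across right-hand sides rather than along the system.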
NASA Astrophysics Data System (ADS)
Slattery, Stuart R.
2016-02-01
In this paper we analyze and extend mesh-free algorithms for three-dimensional data transfer problems in partitioned multiphysics simulations. We first provide a direct comparison between a mesh-based weighted residual method using the common-refinement scheme and two mesh-free algorithms leveraging compactly supported radial basis functions: one using a spline interpolation and one using a moving least square reconstruction. Through the comparison we assess both the conservation and accuracy of the data transfer obtained from each of the methods. We do so for a varying set of geometries with and without curvature and sharp features and for functions with and without smoothness and with varying gradients. Our results show that the mesh-based and mesh-free algorithms are complementary with cases where each was demonstrated to perform better than the other. We then focus on the mesh-free methods by developing a set of algorithms to parallelize them based on sparse linear algebra techniques. This includes a discussion of fast parallel radius searching in point clouds and restructuring the interpolation algorithms to leverage data structures and linear algebra services designed for large distributed computing environments. The scalability of our new algorithms is demonstrated on a leadership class computing facility using a set of basic scaling studies. These scaling studies show that for problems with reasonable load balance, our new algorithms for both spline interpolation and moving least square reconstruction demonstrate both strong and weak scalability using more than 100,000 MPI processes with billions of degrees of freedom in the data transfer operation.
Slattery, Stuart R.
2015-12-02
In this study we analyze and extend mesh-free algorithms for three-dimensional data transfer problems in partitioned multiphysics simulations. We first provide a direct comparison between a mesh-based weighted residual method using the common-refinement scheme and two mesh-free algorithms leveraging compactly supported radial basis functions: one using a spline interpolation and one using a moving least square reconstruction. Through the comparison we assess both the conservation and accuracy of the data transfer obtained from each of the methods. We do so for a varying set of geometries with and without curvature and sharp features and for functions with and without smoothness and with varying gradients. Our results show that the mesh-based and mesh-free algorithms are complementary with cases where each was demonstrated to perform better than the other. We then focus on the mesh-free methods by developing a set of algorithms to parallelize them based on sparse linear algebra techniques. This includes a discussion of fast parallel radius searching in point clouds and restructuring the interpolation algorithms to leverage data structures and linear algebra services designed for large distributed computing environments. The scalability of our new algorithms is demonstrated on a leadership class computing facility using a set of basic scaling studies. Finally, these scaling studies show that for problems with reasonable load balance, our new algorithms for both spline interpolation and moving least square reconstruction demonstrate both strong and weak scalability using more than 100,000 MPI processes with billions of degrees of freedom in the data transfer operation.
Slattery, Stuart R.
2015-12-02
In this study we analyze and extend mesh-free algorithms for three-dimensional data transfer problems in partitioned multiphysics simulations. We first provide a direct comparison between a mesh-based weighted residual method using the common-refinement scheme and two mesh-free algorithms leveraging compactly supported radial basis functions: one using a spline interpolation and one using a moving least square reconstruction. Through the comparison we assess both the conservation and accuracy of the data transfer obtained from each of the methods. We do so for a varying set of geometries with and without curvature and sharp features and for functions with and without smoothness and with varying gradients. Our results show that the mesh-based and mesh-free algorithms are complementary with cases where each was demonstrated to perform better than the other. We then focus on the mesh-free methods by developing a set of algorithms to parallelize them based on sparse linear algebra techniques. This includes a discussion of fast parallel radius searching in point clouds and restructuring the interpolation algorithms to leverage data structures and linear algebra services designed for large distributed computing environments. The scalability of our new algorithms is demonstrated on a leadership class computing facility using a set of basic scaling studies. Finally, these scaling studies show that for problems with reasonable load balance, our new algorithms for both spline interpolation and moving least square reconstruction demonstrate both strong and weak scalability using more than 100,000 MPI processes with billions of degrees of freedom in the data transfer operation.
Performance of a parallel algorithm for standard cell placement on the Intel Hypercube
NASA Technical Reports Server (NTRS)
Jones, Mark; Banerjee, Prithviraj
1987-01-01
A parallel simulated annealing algorithm for standard cell placement that is targeted to run on the Intel Hypercube is presented. A tree broadcasting strategy that is used extensively in our algorithm for updating cell locations in the parallel environment is presented. Studies on the performance of our algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms.
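The abstract does not spell out the tree broadcasting strategy. As a rough illustration only, the sketch below simulates a standard binomial-tree broadcast on a hypercube, which reaches all 2^d nodes in d rounds. The function name `hypercube_broadcast_schedule` is hypothetical, and this is not necessarily the scheme the authors used.

```python
def hypercube_broadcast_schedule(dim, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs."""
    have = {root}          # nodes that already hold the broadcast data
    rounds = []
    for bit in range(dim):
        # In round `bit`, every node that holds the data sends it to its
        # neighbour across hypercube dimension `bit` (flip that address bit).
        pairs = [(node, node ^ (1 << bit)) for node in sorted(have)]
        have.update(dst for _, dst in pairs)
        rounds.append(pairs)
    return rounds


# Illustrative usage: an 8-node (3-dimensional) hypercube.
for step, pairs in enumerate(hypercube_broadcast_schedule(3)):
    print(step, pairs)
```

The number of informed nodes doubles every round, so updated cell locations would reach every processor in a logarithmic number of communication steps.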
Convergence measure and some parallel aspects of Markov-chain Monte Carlo algorithms
NASA Astrophysics Data System (ADS)
Malfait, Maurits J.; Roose, Dirk; Vandermeulen, Dirk
1993-10-01
We examine methods to assess the convergence of Markov chain Monte Carlo (MCMC) algorithms and to accelerate their execution via parallel computing. We propose a convergence measure based on the deviations between simultaneously running MCMC algorithms. We also examine the acceleration of MCMC algorithms when independent parallel samplers are used and report on some experiments with coupled samplers. As applications we use small Ising model simulations and a larger medical image processing algorithm.
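The exact convergence measure is not given in the abstract. A minimal sketch of the general idea, assuming a simple stand-in measure (the largest gap between the running means of several independent chains), might look like this. The target distribution, the function names, and the starting points are all illustrative assumptions.

```python
import math
import random


def metropolis_chain(n_steps, start, step_size, rng):
    """Random-walk Metropolis sampler for a standard normal target (assumed target)."""
    x = start
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)
        # Log acceptance ratio for a target density proportional to exp(-x^2 / 2).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples


def between_chain_deviation(chains, upto):
    """Largest gap between the running means of the chains after `upto` steps."""
    means = [sum(chain[:upto]) / upto for chain in chains]
    return max(means) - min(means)


rng = random.Random(0)
# Chains start far apart; the gap between their means should shrink as they mix.
chains = [metropolis_chain(5000, start, 1.0, rng) for start in (-10.0, 0.0, 10.0)]
for upto in (50, 500, 5000):
    print(upto, round(between_chain_deviation(chains, upto), 3))
```

Because the chains are independent, each could run on its own processor, with only the running means exchanged to evaluate the deviation.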
Para-GMRF: parallel algorithm for anomaly detection of hyperspectral image
NASA Astrophysics Data System (ADS)
Dong, Chao; Zhao, Huijie; Li, Na; Wang, Wei
2007-12-01
A hyperspectral imager can simultaneously collect hundreds of images of the observed area, each corresponding to a different wavelength channel, which makes it possible to discriminate man-made objects from the natural background. However, the price paid for this wealth of information is an enormous amount of data, usually hundreds of gigabytes per day. Turning this huge volume of data into useful information and knowledge in real time is critical for geoscientists. In this paper, the proposed parallel Gaussian-Markov random field (Para-GMRF) anomaly detection algorithm is an attempt to apply parallel computing technology to this problem. Based on the locality of the GMRF algorithm, we partition the 3-D hyperspectral image cube in the spatial domain and distribute data blocks to multiple computers for concurrent detection. Meanwhile, to achieve load balance, a work pool scheduler is designed for task assignment. The Para-GMRF algorithm is organized in a master-slave architecture, coded in the C programming language using the Message Passing Interface (MPI) library, and tested on a Beowulf cluster. Experimental results show that the Para-GMRF algorithm meets this challenge and can be used in time-sensitive areas such as environmental monitoring and battlefield reconnaissance.
Parallelization of the Wolff single-cluster algorithm.
Kaupuzs, J; Rimsāns, J; Melnik, R V N
2010-02-01
A parallel [open multiprocessing (OpenMP)] implementation of the Wolff single-cluster algorithm has been developed and tested for the three-dimensional (3D) Ising model. The developed procedure is generalizable to other lattice spin models and its effectiveness depends on the specific application at hand. The applicability of the developed methodology is discussed in the context of the applications, where a sophisticated shuffling scheme is used to generate pseudorandom numbers of high quality, and an iterative method is applied to find the critical temperature of the 3D Ising model with great accuracy. For the lattice with linear size L=1024, we have reached a speedup of about 1.79 times on two processors and about 2.67 times on four processors, as compared to the serial code. According to our estimation, a speedup of about three times on four processors is reachable for the O(n) models with n≥2. Furthermore, the application of the developed OpenMP code allows us to simulate larger lattices due to the greater operative (shared) memory available.
Parallelization of the Wolff single-cluster algorithm
NASA Astrophysics Data System (ADS)
Kaupužs, J.; Rimšāns, J.; Melnik, R. V. N.
2010-02-01
A parallel [open multiprocessing (OpenMP)] implementation of the Wolff single-cluster algorithm has been developed and tested for the three-dimensional (3D) Ising model. The developed procedure is generalizable to other lattice spin models and its effectiveness depends on the specific application at hand. The applicability of the developed methodology is discussed in the context of the applications, where a sophisticated shuffling scheme is used to generate pseudorandom numbers of high quality, and an iterative method is applied to find the critical temperature of the 3D Ising model with great accuracy. For the lattice with linear size L=1024, we have reached a speedup of about 1.79 times on two processors and about 2.67 times on four processors, as compared to the serial code. According to our estimation, a speedup of about three times on four processors is reachable for the O(n) models with n≥2. Furthermore, the application of the developed OpenMP code allows us to simulate larger lattices due to the greater operative (shared) memory available.
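To put the quoted speedups in context, the short sketch below converts them to parallel efficiency and to an implied serial fraction. The serial-fraction estimate assumes Amdahl's law (the Karp-Flatt metric), which is an assumption made here for illustration and is not part of the abstract.

```python
def parallel_efficiency(speedup, processors):
    """Efficiency = speedup divided by the number of processors."""
    return speedup / processors


def implied_serial_fraction(speedup, processors):
    """Solve Amdahl's law S = 1 / (f + (1 - f) / p) for the serial fraction f."""
    return (processors / speedup - 1.0) / (processors - 1.0)


# Speedups quoted in the abstract for the L=1024 lattice.
for processors, speedup in ((2, 1.79), (4, 2.67)):
    print(processors,
          round(parallel_efficiency(speedup, processors), 4),
          round(implied_serial_fraction(speedup, processors), 4))
```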
Mesh Algorithms for PDE with Sieve I: Mesh Distribution
Knepley, Matthew G.; Karpeev, Dmitry A.
2009-01-01
We have developed a new programming framework, called Sieve, to support parallel numerical partial differential equation(s) (PDE) algorithms operating over distributed meshes. We have also developed a reference implementation of Sieve in C++ as a library of generic algorithms operating on distributed containers conforming to the Sieve interface. Sieve makes instances of the incidence relation, or arrows, the conceptual first-class objects represented in the containers. Further, generic algorithms acting on this arrow container are systematically used to provide natural geometric operations on the topology and also, through duality, on the data. Finally, coverings and duality are used to encode not only individual meshes, but all types of hierarchies underlying PDE data structures, including multigrid and mesh partitions. In order to demonstrate the usefulness of the framework, we show how the mesh partition data can be represented and manipulated using the same fundamental mechanisms used to represent meshes. We present the complete description of an algorithm to encode a mesh partition and then distribute a mesh, which is independent of the mesh dimension, element shape, or embedding. Moreover, data associated with the mesh can be similarly distributed with exactly the same algorithm. The use of a high level of abstraction within the Sieve leads to several benefits in terms of code reuse, simplicity, and extensibility. We discuss these benefits and compare our approach to other existing mesh libraries.
A scalable parallel black oil simulator on distributed memory parallel computers
NASA Astrophysics Data System (ADS)
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
A compositional reservoir simulator on distributed memory parallel computers
Rame, M.; Delshad, M.
1995-12-31
This paper presents the application of distributed memory parallel computers to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose, highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes porting to new parallel platforms straightforward. Results of the distributed memory computing performance of the parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for the same problems on a vector supercomputer is also presented.
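The idea of extending each subdomain so that neighbours share data for stencil computation can be sketched in one dimension as follows. This is a minimal illustration under assumed names (`split_with_halo`, `local_three_point_average`) and an assumed 3-point stencil; it is not taken from UTCHEM.

```python
def split_with_halo(values, n_parts, halo=1):
    """Split a 1D grid into near-equal blocks, each extended by `halo` ghost cells."""
    n = len(values)
    base, extra = divmod(n, n_parts)
    blocks = []
    start = 0
    for rank in range(n_parts):
        size = base + (1 if rank < extra else 0)   # spread the remainder evenly
        stop = start + size
        lo = max(0, start - halo)                  # ghost cells on the left
        hi = min(n, stop + halo)                   # ghost cells on the right
        blocks.append({"owned": (start, stop), "data": values[lo:hi], "offset": lo})
        start = stop
    return blocks


def local_three_point_average(block):
    """Apply a 3-point average to owned points, reading ghost cells at block edges."""
    start, stop = block["owned"]
    data, offset = block["data"], block["offset"]
    out = {}
    for i in range(start, stop):
        j = i - offset
        if 0 < j < len(data) - 1:                  # skip the global boundary points
            out[i] = (data[j - 1] + data[j] + data[j + 1]) / 3.0
    return out


grid = [float(i * i) for i in range(10)]
result = {}
for block in split_with_halo(grid, 3):
    result.update(local_three_point_average(block))
print(sorted(result.items()))
```

Each block can be processed without touching another block's memory once the ghost cells are filled, which is what makes the decomposition suitable for distributed memory machines.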
NASA Astrophysics Data System (ADS)
Lai, Jim Z. C.; Chao, Ming
1992-06-01
A parallel algorithm for generating set points in Cartesian space for a manipulator with straight-line and circular motions is described. This algorithm is developed for parallel computation and does not have the problem of the wobbling approach vector that affects many techniques. When the scheme is executed serially, the computing time is about two-thirds that of the conventional technique.
The openGL visualization of the 2D parallel FDTD algorithm
NASA Astrophysics Data System (ADS)
Walendziuk, Wojciech
2005-02-01
This paper presents a way to visualize a two-dimensional version of a parallel algorithm of the FDTD method. The visualization module was created on the basis of the OpenGL graphics standard with the use of the GLUT interface. In addition, the work includes efficiency results for the parallel algorithm in the form of speedup charts.
Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges
NASA Technical Reports Server (NTRS)
Djomehri, Mohammad Jahed; Biswas, R.; VanderWijngaart, R.; Yarrow, M.
2000-01-01
This paper describes several results of parallel and distributed computing using a large scale production flow solver program. A coarse grained parallelization based on clustering of discretization grids combined with partitioning of large grids for load balancing is presented. An assessment is given of its performance on distributed and distributed-shared memory platforms using large scale scientific problems. An experiment with this solver, adapted to a Wide Area Network execution environment is presented. We also give a comparative performance assessment of computation and communication times on both the tightly and loosely-coupled machines.
NASA Technical Reports Server (NTRS)
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu
1995-01-01
As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
Stellar structure modeling using a parallel genetic algorithm for objective global optimization
NASA Astrophysics Data System (ADS)
Metcalfe, Travis S.; Charbonneau, Paul
2003-02-01
Genetic algorithms are a class of heuristic search techniques that apply basic evolutionary operators in a computational setting. We have designed a fully parallel and distributed hardware/software implementation of the generalized optimization subroutine PIKAIA, which utilizes a genetic algorithm to provide an objective determination of the globally optimal parameters for a given model against an observational data set. We have used this modeling tool in the context of white dwarf asteroseismology, i.e., the art and science of extracting physical and structural information about these stars from observations of their oscillation frequencies. The efficient, parallel exploration of parameter-space made possible by genetic-algorithm-based numerical optimization led us to a number of interesting physical results: (1) resolution of a hitherto puzzling discrepancy between stellar evolution models and prior asteroseismic inferences of the surface helium layer mass for a DBV white dwarf; (2) precise determination of the central oxygen mass fraction in a white dwarf star; and (3) a preliminary estimate of the astrophysically important but experimentally uncertain rate for the 12C(α,γ)16O nuclear reaction. These successes suggest that a broad class of computationally intensive modeling applications could also benefit from this approach.
Measurements of parallel electron velocity distributions using whistler wave absorption.
Thuecks, D J; Skiff, F; Kletzing, C A
2012-08-01
We describe a diagnostic to measure the parallel electron velocity distribution in a magnetized plasma that is overdense (ω_pe > ω_ce). This technique utilizes resonant absorption of whistler waves by electrons with velocities parallel to a background magnetic field. The whistler waves were launched and received by a pair of dipole antennas immersed in a cylindrical discharge plasma at two positions along an axial background magnetic field. The whistler wave frequency was swept from somewhat below and up to the electron cyclotron frequency ω_ce. As the frequency was swept, the wave was resonantly absorbed by the part of the electron phase space density which was Doppler shifted into resonance according to the relation ω - k_∥ v_∥ = ω_ce. The measured absorption is directly related to the reduced parallel electron distribution function integrated along the wave trajectory. The background theory and initial results from this diagnostic are presented here. Though this diagnostic is best suited to detect tail populations of the parallel electron distribution function, these first results show that this diagnostic is also rather successful in measuring the bulk plasma density and temperature both during the plasma discharge and into the afterglow.
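Solving the resonance relation quoted above for the parallel velocity shows which electrons are probed at each wave frequency. The rearrangement below is elementary algebra; note that the parallel wavenumber itself depends on frequency through the whistler dispersion relation, which the abstract does not give.

```latex
\omega - k_\parallel v_\parallel = \omega_{ce}
\quad\Longrightarrow\quad
v_\parallel = \frac{\omega - \omega_{ce}}{k_\parallel}
```

Since the sweep stays below the cyclotron frequency, the numerator is negative, so the resonant electrons travel opposite to the wave's parallel phase velocity. The magnitude of the numerator shrinks as the frequency approaches the cyclotron frequency, which is consistent with the sweep moving from tail electrons toward the bulk of the distribution.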
NASA Astrophysics Data System (ADS)
Boyko, Oleksiy; Zheleznyak, Mark
2015-04-01
The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI (Todini et al., 1996-2014) was developed and implemented in Ukraine. A parallel version of the code has recently been developed for use on multiprocessor systems: multicore/multiprocessor PCs and clusters. The algorithm is based on a binary-tree decomposition of the watershed that balances the amount of computation across all processors/cores. The Message Passing Interface (MPI) protocol is used as the parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated in case studies of flood prediction for mountain watersheds in the Ukrainian Carpathian region. The modeling results are compared with predictions based on lumped-parameter models.
A balanced decomposition algorithm for parallel solutions of very large sparse systems
Zecevic, A.I.; Siljak, D.D.
1995-12-01
In this paper we present an algorithm for balanced bordered block diagonal (BBD) decompositions of very large symmetric positive definite or diagonally dominant sparse matrices. The algorithm represents a generalization of the method described, and is primarily aimed at parallel solutions of very large sparse systems (> 20,000 equations). A variety of experimental results are provided to illustrate the performance of the algorithm and demonstrate its potential for computing on massively parallel architectures.
Plimpton, Steven J.; Hendrickson, Bruce; Burns, Shawn P.; McLendon, William III; Rauchwerger, Lawrence
2005-07-15
The method of discrete ordinates is commonly used to solve the Boltzmann transport equation. The solution in each ordinate direction is most efficiently computed by sweeping the radiation flux across the computational grid. For unstructured grids this poses many challenges, particularly when implemented on distributed-memory parallel machines where the grid geometry is spread across processors. We present several algorithms relevant to this approach: (a) an asynchronous message-passing algorithm that performs sweeps simultaneously in multiple ordinate directions, (b) a simple geometric heuristic to prioritize the computational tasks that a processor works on, (c) a partitioning algorithm that creates columnar-style decompositions for unstructured grids, and (d) an algorithm for detecting and eliminating cycles that sometimes exist in unstructured grids and can prevent sweeps from successfully completing. Algorithms (a) and (d) are fully parallel; algorithms (b) and (c) can be used in conjunction with (a) to achieve higher parallel efficiencies. We describe our message-passing implementations of these algorithms within a radiation transport package. Performance and scalability results are given for unstructured grids with up to 3 million elements (500 million unknowns) running on thousands of processors of Sandia National Laboratories' Intel Tflops machine and DEC-Alpha CPlant cluster.
High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn
2014-11-14
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.
A simple parallel prefix algorithm for compact finite-difference schemes
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Joslin, Ronald D.
1993-01-01
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experimental results were measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for the compact scheme on high-performance computers.
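The SPP algorithm itself is not described in enough detail here to reproduce. As a generic illustration of how a prefix communication pattern can solve a recurrence of the kind that arises in tridiagonal elimination, the sketch below solves a first-order linear recurrence with a Hillis-Steele scan. All names are hypothetical, and this is a textbook technique rather than the authors' method.

```python
def combine(left, right):
    """Compose two affine maps x -> a*x + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(items, op):
    """Hillis-Steele scan: about log2(n) rounds, each round fully data-parallel."""
    result = list(items)
    shift = 1
    while shift < len(result):
        prev = result
        # Every element combines with the partial result `shift` places to its left.
        result = [prev[i] if i < shift else op(prev[i - shift], prev[i])
                  for i in range(len(prev))]
        shift *= 2
    return result


def solve_recurrence(a, b, x0):
    """Solve x[i] = a[i] * x[i-1] + b[i], with x[-1] = x0, using a prefix scan."""
    prefixes = inclusive_scan(list(zip(a, b)), combine)
    return [pa * x0 + pb for pa, pb in prefixes]


print(solve_recurrence([2, 3, 1, -1, 2], [1, 0, 4, 2, -3], 1))
```

The scan works because composition of affine maps is associative, so partial compositions can be formed in any grouping across processors.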
A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1994-01-01
A reflector antenna computer program based upon a simple discrete approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements, such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operations. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using a single-processor dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performance in running parallel dynamic simulation is compared and demonstrated.
Efficient Out of Core Sorting Algorithms for the Parallel Disks Model.
Kundeti, Vamsi; Rajasekaran, Sanguthevar
2011-11-01
In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However many of these merge based algorithms have large underlying constants in the time bounds, because they suffer from the lack of read parallelism on PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called the dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/O's required to sort the input close to the lower bound of [Formula: see text]. We experimentally verify our dirty sequence idea with the standard R-Way merge and show that our idea can reduce the number of parallel I/Os to sort on PDM significantly.
Study of improved ray tracing parallel algorithm for CGH of 3D objects on GPU
NASA Astrophysics Data System (ADS)
Cong, Bin; Jiang, Xiaoyu; Yao, Jun; Zhao, Kai
2014-11-01
An improved parallel algorithm for generating holograms of three-dimensional objects is presented. Based on the physical characteristics and mathematical properties of the original ray tracing algorithm for computer generated holograms (CGH), and using transform approximation and numerical analysis methods, we extract the parts of the ray tracing algorithm that are amenable to parallelization and implement them on a graphics processing unit (GPU). Through proper design of the parallel numerical procedure, we parallelize the processing of the two-dimensional slices of the three-dimensional object with CUDA. Based on the experiments, an effective method of dealing with the occlusion problem in ray tracing is proposed, as well as a method of generating holograms of 3D objects with the additive property. Our results indicate that the improved algorithm can effectively shorten the computing time. Depending on the sizes of the spatial object points and hologram pixels, the speed increases 20 to 70 times compared with the original ray tracing algorithm.
Xie, Dexuan; Dash, Ranjan K.; Beard, Daniel A.
2009-01-01
Fast algorithms for simulating mathematical models of coupled blood-tissue transport and metabolism are critical for the analysis of data on transport and reaction in tissues. Here, by combining the method of characteristics with the standard grid discretization technique, a novel algorithm is introduced for solving a general blood-tissue transport and metabolism model governed by a large system of one-dimensional semilinear first order partial differential equations. The key part of the algorithm is to approximate the model as a group of independent ordinary differential equation (ODE) systems such that each ODE system has the same size as the model and can be integrated independently. Thus the method can be easily implemented in parallel on a large scale multiprocessor computer. The accuracy of the algorithm is demonstrated for solving a simple blood-tissue exchange model introduced by Sangren and Sheppard (Bull. Math. Biophys. 15:387–394, 1953), which has an analytical solution. Numerical experiments made on a distributed-memory parallel computer (an HP Linux cluster) and a shared-memory parallel computer (an SGI Origin 2000) demonstrate the parallel efficiency of the algorithm. PMID: 20161089
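The reduction from a PDE to independent ODEs can be illustrated with a generic scalar case. The equation below is an assumed textbook form with constant velocity v and reaction term f, not the authors' full blood-tissue model, which couples several such equations.

```latex
\frac{\partial u}{\partial t} + v\,\frac{\partial u}{\partial x} = f(u)
\quad\Longrightarrow\quad
\frac{dx}{dt} = v, \qquad \frac{du}{dt} = f(u)
\quad \text{along each characteristic } x(t) = x_0 + v t .
```

Each starting point x_0 defines its own ODE, and these ODEs do not interact, which is why they can be integrated on separate processors.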
A VoMR-Tree Based Parallel Range Query Method on Distributed Spatial Database
NASA Astrophysics Data System (ADS)
Fu, Z.; Liu, S.
2012-07-01
The spatial index strongly affects the efficiency of spatial queries in a distributed spatial database. In this paper, we introduce a parallel spatial range query algorithm based on the VoMR-tree index, which incorporates Voronoi diagrams into the MR-tree and benefits from nearest-neighbor information. We first augment the MR-tree to store the nearest neighbors and construct the VoMR-tree index from the Voronoi diagram. We then propose a novel range query algorithm based on the VoMR-tree index. For processing a range query, we discuss a data partition method that improves efficiency through parallelization in the distributed database. A verification strategy is then proposed. We show the superiority of the proposed method by extensive experiments using data sets of various sizes. The experimental results reveal that the proposed method improves the performance of range query processing by up to three times in comparison with the widely used R-tree variants.
Speedup properties of phases in the execution profile of distributed parallel programs
Carlson, B.M.; Wagner, T.D.; Dowdy, L.W.; Worley, P.H.
1992-08-01
The execution profile of a distributed-memory parallel program specifies the number of busy processors as a function of time. Periods of homogeneous processor utilization are manifested in many execution profiles. These periods can usually be correlated with the algorithms implemented in the underlying parallel code. Three families of methods for smoothing execution profile data are presented. These approaches simplify the problem of detecting end points of periods of homogeneous utilization. These periods, called phases, are then examined in isolation, and their speedup characteristics are explored. A specific workload executed on an Intel iPSC/860 is used for validation of the techniques described.
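The three families of smoothing methods are not named in the abstract. As a stand-in, the sketch below smooths a busy-processor profile with a centred moving average and then splits it into phases of roughly constant utilization. The function names, the window, and the tolerance are illustrative assumptions.

```python
def moving_average(profile, window):
    """Centred moving average; the window shrinks near the ends of the profile."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        chunk = profile[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def find_phases(smoothed, tolerance):
    """Group consecutive samples that stay within `tolerance` of the phase's first sample."""
    phases = []
    start = 0
    for i in range(1, len(smoothed) + 1):
        if i == len(smoothed) or abs(smoothed[i] - smoothed[start]) > tolerance:
            phases.append((start, i))   # half-open interval [start, i)
            start = i
    return phases


# Hypothetical profile: number of busy processors sampled at regular intervals.
profile = [1, 2, 1, 1, 2, 15, 16, 14, 16, 15, 8, 7, 8, 8]
smoothed = moving_average(profile, 3)
print(find_phases(smoothed, 4.0))
```

Once the end points are known, the speedup behaviour of each phase can be studied separately by rerunning the workload on different processor counts.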
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems
NASA Technical Reports Server (NTRS)
Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.
2000-01-01
The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many subdomains called blocks, and solve the governing equations over these blocks. The dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the changes in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA-based code, ADPAC, to demonstrate the developed tools for dynamic load balancing.
Brown, C.
1990-04-11
This contract developed and disseminated papers, ideas, algorithms, analysis, software, applications, and implementations for parallel programming environments for computer vision and for vision applications. The work has been widely reported and highly influential. The most significant work centered on the Butterfly Parallel Processor, the MaxVideo pipelined parallel image processor, and the development of the real-time computer vision laboratory. For the Butterfly, the Psyche multi-model operating system was developed and the CONSUL autoparallelizing compiler was designed. Much basic and influential performance monitoring and debugging work was completed, resulting in working systems and novel algorithms. There was also significant research in systems and applications using other parallel architectures in the laboratory, such as the MaxVideo parallel pipelined image processor. The contract developed a heterogeneous parallel architecture involving pipelined and MIMD parallelism and integrated it with a robot head.
A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor
NASA Technical Reports Server (NTRS)
Rao, Hariprasad Nannapaneni
1989-01-01
The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
A sweep algorithm for massively parallel simulation of circuit-switched networks
NASA Technical Reports Server (NTRS)
Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.
1992-01-01
A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
NASA Astrophysics Data System (ADS)
Mattei, D.; Smith, I.; Ferrari, A.; Carbillet, M.
2010-10-01
Post-processing for exoplanet detection using direct imaging requires large data cubes and/or sophisticated signal processing techniques. For alt-azimuthal mounts, a projection effect called field rotation makes the potential planet rotate in a known manner on the set of images. For ground-based telescopes that use extreme adaptive optics and advanced coronagraphy, techniques based on field rotation are already broadly used and still being developed. In most such techniques, for a given initial position of the planet the planet intensity estimate is a linear function of the set of images. However, due to field rotation the modified instrumental response applied is not shift invariant like usual linear filters. Testing all possible initial positions is therefore very time-consuming. To reduce the processing time, we propose to compute each subset of initial positions on a different machine using parallel programming. In particular, the MOODS algorithm dedicated to the VLT-SPHERE instrument, which jointly estimates the light contributions of the star and the potential exoplanet, is parallelized on the Observatoire de la Cote d'Azur cluster. Different parallelization methods (OpenMP, MPI, Jobs Array) have been elaborated for the initial MOODS code and compared to each other. The one finally chosen splits the initial positions over the available processors, accounting as well as possible for the different constraints of the cluster structure: memory, job submission queues, number of available CPUs, cluster average load. In the end, a standard set of images is satisfactorily processed in a few hours instead of a few days.
Overview of a distributed parallel architecture for speech understanding
Bronson, E.C.; Siegel, L.J.
1982-01-01
The complexity of the speech understanding task requires extensive computation. To improve the processing speed, methods are explored by which tasks involved in speech understanding can be structured for execution on a parallel processing system. An architecture is described in which a speech understanding system is decomposed into a series of distributed processing computation stations. 24 references.
Dynamic overset grid communication on distributed memory parallel processors
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.
1993-01-01
A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.
NavP: Structured and Multithreaded Distributed Parallel Programming
NASA Technical Reports Server (NTRS)
Pan, Lei; Xu, Jingling
2006-01-01
This slide presentation reviews some of the issues around distributed parallel programming. It compares and contrasts two methods of programming: Single Program Multiple Data (SPMD) and Navigational Programming (NavP). It then reviews the distributed sequential computing (DSC) method and the methodology of NavP. Case studies are presented. It also reviews the work that is being done to enable the NavP system.
NASA Astrophysics Data System (ADS)
Xu, Dexiang
This dissertation presents a novel method of designing finite word length Finite Impulse Response (FIR) digital filters using a Real Parameter Parallel Genetic Algorithm (RPPGA). This algorithm is derived from basic Genetic Algorithms which are inspired by natural genetics principles. Both experimental results and theoretical studies in this work reveal that the RPPGA is a suitable method for determining the optimal or near optimal discrete coefficients of finite word length FIR digital filters. Performance of RPPGA is evaluated by comparing specifications of filters designed by other methods with filters designed by RPPGA. The parallel and spatial structures of the algorithm result in faster and more robust optimization than basic genetic algorithms. A filter designed by RPPGA is implemented in hardware to attenuate high frequency noise in a data acquisition system for collecting seismic signals. These studies may lead to more applications of the Real Parameter Parallel Genetic Algorithms in Electrical Engineering.
Interface for Parallel I/O from Componentized Visualization Algorithms
2008-09-16
The software is an interface layer over file I/O with features specifically designed for efficient parallel reads and writes. The interface provides multiple concrete implementations that easily allow the replacement of one interface with another. This feature allows a reader or writer implementation to work independently of whether parallel file I/O is available or desired. The software also contains extensions to some readers to allow it to use the file I/O functionality.
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD
NASA Technical Reports Server (NTRS)
Gropp, W. D.; Keyes, D. E.; McInnes, L. C.; Tidriri, M. D.
1998-01-01
Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, "routine" parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (Psi-NKS) algorithmic framework is presented as an answer. We show that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, Psi-NKS can simultaneously deliver: globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of Psi-NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of Psi-NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
Schmidberger, Markus; Vicedo, Esmeralda; Mansmann, Ulrich
2009-07-22
Microarray data repositories as well as large clinical applications of gene expression make it possible to analyse several hundred microarrays at one time. The preprocessing of large amounts of microarrays is still a challenge. The algorithms are limited by the available computer hardware. For example, building classification or prognostic rules from large microarray sets will be very time consuming. Here, preprocessing has to be a part of the cross-validation and resampling strategy which is necessary to estimate the rule's prediction quality honestly. This paper proposes the new Bioconductor package affyPara for parallelized preprocessing of Affymetrix microarray data. Partition of data can be applied on arrays and parallelization of algorithms is a straightforward consequence. The partition of data and distribution to several nodes solves the main memory problems and accelerates preprocessing by up to a factor of 20 for 200 or more arrays. affyPara is a free and open source package, under GPL license, available from the Bioconductor project at www.bioconductor.org. A user guide and examples are provided with the package.
Madduri, Kamesh; Bader, David A.
2009-02-15
Graph-theoretic abstractions are extensively used to analyze massive data sets. Temporal data streams from socioeconomic interactions, social networking web sites, communication traffic, and scientific computing can be intuitively modeled as graphs. We present the first study of novel high-performance combinatorial techniques for analyzing large-scale information networks, encapsulating dynamic interaction data in the order of billions of entities. We present new data structures to represent dynamic interaction networks, and discuss algorithms for processing parallel insertions and deletions of edges in small-world networks. With these new approaches, we achieve an average performance rate of 25 million structural updates per second and a parallel speedup of nearly 28 on a 64-way Sun UltraSPARC T2 multicore processor, for insertions and deletions to a small-world network of 33.5 million vertices and 268 million edges. We also design parallel implementations of fundamental dynamic graph kernels related to connectivity and centrality queries. Our implementations are freely distributed as part of the open-source SNAP (Small-world Network Analysis and Partitioning) complex network analysis framework.
Parallel OSEM Reconstruction Algorithm for Fully 3-D SPECT on a Beowulf Cluster.
Rong, Zhou; Tianyu, Ma; Yongjie, Jin
2005-01-01
In order to improve the computation speed of the ordered subset expectation maximization (OSEM) algorithm for fully 3-D single photon emission computed tomography (SPECT) reconstruction, an experimental Beowulf-type cluster was built and several parallel reconstruction schemes were described. We implemented a single-program-multiple-data (SPMD) parallel 3-D OSEM reconstruction algorithm based on the message passing interface (MPI) and tested it with combinations of different numbers of calculating processors and different sizes of voxel grid in reconstruction (64×64×64 and 128×128×128). Performance of parallelization was evaluated in terms of the speedup factor and parallel efficiency. This parallel implementation methodology is expected to help make fully 3-D OSEM algorithms more feasible in clinical SPECT studies.
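The two metrics named in this abstract have standard definitions, sketched below for illustration. The function names and the timings are hypothetical and are not taken from the paper; the sketch assumes the usual forms, speedup S = T_1 / T_p and efficiency E = S / p.

```python
def speedup(t_serial, t_parallel):
    """Speedup factor S = T_1 / T_p (serial time over parallel time)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / p, where p is the processor count."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial, t_parallel) / n_procs


# Hypothetical timings in seconds, chosen only to illustrate the formulas.
s = speedup(120.0, 20.0)                 # 6.0
e = parallel_efficiency(120.0, 20.0, 8)  # 0.75
```

An efficiency below 1 indicates time lost to communication, synchronization, or load imbalance relative to ideal linear scaling.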
A parallel algorithm for motion estimation in video coding using the bilinear transformation.
Konstantopoulos, Charalampos
2015-01-01
Accurate motion estimation between frames is important for drastically reducing data redundancy in video coding. However, advanced motion estimation methods are computationally intensive and their execution in real time usually requires a parallel implementation. In this paper, we investigate the parallel implementation of such a motion estimation technique. Specifically, we present a parallel algorithm for motion estimation based on the bilinear transformation on the well-known parallel model of the hypercube network and formally prove the time and the space complexity of the proposed algorithm. We also show that the parallel algorithm can also run on other hypercubic networks, such as butterfly, cube-connected-cycles, shuffle-exchange or de Bruijn network with only constant slowdown.
Measurements of parallel electron velocity distributions using whistler wave absorption
Thuecks, D. J.; Skiff, F.; Kletzing, C. A.
2012-08-15
We describe a diagnostic to measure the parallel electron velocity distribution in a magnetized plasma that is overdense ($\omega_{pe} > \omega_{ce}$). This technique utilizes resonant absorption of whistler waves by electrons with velocities parallel to a background magnetic field. The whistler waves were launched and received by a pair of dipole antennas immersed in a cylindrical discharge plasma at two positions along an axial background magnetic field. The whistler wave frequency was swept from somewhat below and up to the electron cyclotron frequency $\omega_{ce}$. As the frequency was swept, the wave was resonantly absorbed by the part of the electron phase space density which was Doppler shifted into resonance according to the relation $\omega - k_{\parallel} v_{\parallel} = \omega_{ce}$. The measured absorption is directly related to the reduced parallel electron distribution function integrated along the wave trajectory. The background theory and initial results from this diagnostic are presented here. Though this diagnostic is best suited to detect tail populations of the parallel electron distribution function, these first results show that this diagnostic is also rather successful in measuring the bulk plasma density and temperature both during the plasma discharge and into the afterglow.
Parallel Computing Environments and Methods for Power Distribution System Simulation
Lu, Ning; Taylor, Zachary T.; Chassin, David P.; Guttromson, Ross T.; Studham, Scott S.
2005-11-10
The development of cost-effective high-performance parallel computing on multi-processor supercomputers makes it attractive to port excessively time-consuming simulation software from personal computers (PCs) to supercomputers. The power distribution system simulator (PDSS) takes a bottom-up approach and simulates load at the appliance level, where detailed thermal models for appliances are used. This approach works well for a small power distribution system consisting of a few thousand appliances. When the number of appliances increases, the simulation uses up the PC memory and its run time increases to a point where the approach is no longer feasible for modeling a practical large power distribution system. This paper presents an effort made to port the PC-based PDSS to a 128-processor shared-memory supercomputer. The paper offers an overview of the parallel computing environment and a description of the modifications made to the PDSS model. The performance of the PDSS running on a standalone PC and on the supercomputer is compared. Future research directions for utilizing parallel computing in power distribution system simulation are also addressed.
Parallel algorithm for target recognition using a multiclass hash database
NASA Astrophysics Data System (ADS)
Uddin, Mosleh; Myler, Harley R.
1998-07-01
A method for recognition of unknown targets using large databases of model targets is discussed. Our approach is based on parallel processing of multi-class hash databases that are generated off-line. A geometric hashing technique is used on feature points of model targets to create each class database. Bit level coding is then performed to represent the models in an image format. Parallelism is achieved during the recognition phase. Feature points of an unknown target are passed to parallel processors each accessing an individual class database. Each processor reads a particular class of hash data base and indexes feature points of the unknown target. A simple voting technique is applied to determine the best match model with the unknown. The paper discusses our technique and the results from testing with unknown FLIR targets.
Parallelized event chain algorithm for dense hard sphere and polymer systems
Kampmann, Tobias A.; Boltz, Horst-Holger; Kierfeld, Jan
2015-01-15
We combine parallelization and cluster Monte Carlo for hard sphere systems and present a parallelized event chain algorithm for the hard disk system in two dimensions. For parallelization we use a spatial partitioning approach into simulation cells. We find that it is crucial for correctness to ensure detailed balance on the level of Monte Carlo sweeps by drawing the starting sphere of event chains within each simulation cell with replacement. We analyze the performance gains for the parallelized event chain and find a criterion for an optimal degree of parallelization. Because of the cluster nature of event chain moves massive parallelization will not be optimal. Finally, we discuss first applications of the event chain algorithm to dense polymer systems, i.e., bundle-forming solutions of attractive semiflexible polymers.
Distributed and parallel approach for handle and perform huge datasets
NASA Astrophysics Data System (ADS)
Konopko, Joanna
2015-12-01
Big Data refers to dynamic, large and disparate volumes of data that come from many different, mutually uncorrelated sources (tools, machines, sensors, mobile devices). It requires new, innovative and scalable technology to collect, host and analytically process the vast amount of data. A proper architecture for a system that processes huge data sets is needed. In this paper, distributed and parallel system architectures are compared using the example of the MapReduce (MR) Hadoop platform and a parallel database platform (DBMS). This paper also analyzes the problem of extracting and handling valuable information from petabytes of data. Both paradigms, MapReduce and parallel DBMS, are described and compared. A hybrid architecture approach is also proposed and could be used to solve the analyzed problem of storing and processing Big Data.
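To make the MapReduce paradigm mentioned in this abstract concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It does not use Hadoop or any of its APIs; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job(["big data big clusters", "data platforms"])
# counts == {"big": 2, "data": 2, "clusters": 1, "platforms": 1}
```

In a real distributed platform the map and reduce calls would run on different nodes and the shuffle would move data across the network; the sketch only shows the data flow.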
A computational fluid dynamics algorithm on a massively parallel computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The implementation and performance of a finite-difference algorithm for the compressible Navier-Stokes equations in two or three dimensions on the Connection Machine are described. This machine is a single-instruction multiple-data machine with up to 65536 physical processors. The implicit portion of the algorithm is of particular interest. Running times and megaflop rates are given for two- and three-dimensional problems. Included are comparisons with the standard codes on a Cray X-MP/48.
Pruning Neural Networks with Distribution Estimation Algorithms
Cantu-Paz, E
2003-01-15
This paper describes the application of four evolutionary algorithms (EAs) to the pruning of neural networks used in classification problems. Besides a simple genetic algorithm (GA), the paper considers three distribution estimation algorithms (DEAs): a compact GA, an extended compact GA, and the Bayesian Optimization Algorithm. The objective is to determine if the DEAs present advantages over the simple GA in terms of accuracy or speed in this problem. The experiments used a feed forward neural network trained with standard back propagation and public-domain and artificial data sets. The pruned networks seemed to have accuracy better than or equal to that of the original fully-connected networks. Only in a few cases did pruning result in less accurate networks. We found few differences in the accuracy of the networks pruned by the four EAs, but found important differences in the execution time. The results suggest that a simple GA with a small population might be the best algorithm for pruning networks on the data sets we tested.
Parallel algorithms for 2-D cylindrical transport equations of Eigenvalue problem
Wei, J.; Yang, S.
2013-07-01
This paper addresses the neutron transport eigenvalue problem in 2-D cylindrical geometry on unstructured grids. A discrete scheme combining Sn discrete ordinates with a discontinuous finite element method is built, and the parallel computation for the scheme is realized on MPI systems. Numerical experiments indicate that the designed parallel algorithm can reach perfect speedup and that it has good practicality and scalability. (authors)
2014-01-01
Background The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However, the memory and the execution time required by the algorithm are of O(n^3) and O(n^5) order, respectively, and so the algorithm is unaffordable for huge data sets. Results We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to
D`Azevedo, E.F.; Romine, C.H.
1992-09-01
The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. In a distributed memory parallel environment, the computation and subsequent distribution of these two values requires two separate communication and synchronization phases. In this paper, we present a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases. We give a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process. We also present empirical evidence of the stability of this modified algorithm.
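For orientation, the following is a minimal serial sketch of the standard conjugate gradient iteration, written to highlight the two inner products per iteration that the abstract refers to. It is the textbook formulation, not the authors' rearranged algorithm, and the helper names are hypothetical. In a distributed-memory setting each call to `dot` marked below would require a global reduction.

```python
def dot(u, v):
    """Inner product; a global reduction in a distributed-memory code."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Standard CG for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        pAp = dot(p, Ap)   # inner product 1: needed for the step length
        alpha = rr / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)  # inner product 2: needed for the new direction
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


# Small symmetric positive definite example; exact solution is (1/11, 7/11).
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Because the second inner product depends on the residual updated with the result of the first, the two reductions cannot be merged in this form, which is the synchronization cost the abstract describes.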
Seal, Sudip K; Perumalla, Kalyan S; Hirshman, Steven Paul
2013-01-01
Simulations that require solutions of block tridiagonal systems of equations rely on fast parallel solvers for runtime efficiency. Leading parallel solvers that are highly effective for general systems of equations, dense or sparse, are limited in scalability when applied to block tridiagonal systems. This paper presents scalability results as well as detailed analyses of two parallel solvers that exploit the special structure of block tridiagonal matrices to deliver superior performance, often by orders of magnitude. A rigorous analysis of their relative parallel runtimes is shown to reveal the existence of a critical block size that separates the parameter space spanned by the number of block rows, the block size and the processor count, into distinct regions that favor one or the other of the two solvers. Dependence of this critical block size on the above parameters as well as on machine-specific constants is established. These formal insights are supported by empirical results on up to 2,048 cores of a Cray XT4 system. To the best of our knowledge, this is the highest reported scalability for parallel block tridiagonal solvers to date.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm
NASA Technical Reports Server (NTRS)
Povitsky, A.
1998-01-01
In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines. This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty by about a factor of two compared with the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
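As background for the forward and backward steps mentioned in this abstract, here is a minimal serial sketch of the classical Thomas algorithm for one tridiagonal system. It is not the pipelined or reformulated parallel version described in the report; the function name and the array layout (with `a[0]` and `c[-1]` unused) are assumptions made for illustration.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    Assumes no pivoting is needed (e.g. a diagonally dominant matrix).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward step: eliminate the sub-diagonal, row by row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Backward step: back substitution from the last row upward.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# 3x3 example whose exact solution is (1, 1, 1).
x = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
```

Each step is a strict recurrence along the line, which is why a processor that owns only one segment of a line must wait for its neighbor; that idle time is what pipelined variants try to fill.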
Advanced Algorithms and Automation Tools for Discrete Ordinates Methods in Parallel Environments
Alireza Haghighat
2003-05-07
This final report discusses major accomplishments of a 3-year project under the DOE's NEER Program. The project has developed innovative and automated algorithms, codes, and tools for solving the discrete ordinates particle transport method efficiently in parallel environments. Using a number of benchmark and real-life problems, the performance and accuracy of the new algorithms have been measured and analyzed.
NASA Astrophysics Data System (ADS)
Tang, Zhili
2016-06-01
This paper solved aerodynamic drag reduction of transport wing fuselage configuration in transonic regime by using a parallel Nash evolutionary/deterministic hybrid optimization algorithm. Two sets of parameters are used, namely globally and locally. It is shown that optimizing separately local and global parameters by using Nash algorithms is far more efficient than considering these variables as a whole.
A portable implementation of ARPACK for distributed memory parallel architectures
Maschhoff, K.J.; Sorensen, D.C.
1996-12-31
ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and Message Passing Interface(MPI).
Parallel algorithms and architectures for computational structural mechanics
NASA Technical Reports Server (NTRS)
Patrick, Merrell; Ma, Shing; Mahajan, Umesh
1989-01-01
The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.
Parallel algorithms for simulating continuous time Markov chains
NASA Technical Reports Server (NTRS)
Nicol, David M.; Heidelberger, Philip
1992-01-01
We have previously shown that the mathematical technique of uniformization can serve as the basis of synchronization for the parallel simulation of continuous-time Markov chains. This paper reviews the basic method and compares five different methods based on uniformization, evaluating their strengths and weaknesses as a function of problem characteristics. The methods vary in their use of optimism, logical aggregation, communication management, and adaptivity. Performance evaluation is conducted on the Intel Touchstone Delta multiprocessor, using up to 256 processors.
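The uniformization technique itself can be sketched serially as follows. This is a minimal illustration of the basic idea only, not any of the five parallel methods compared in the paper: candidate event times are drawn from a Poisson process with a rate at least as large as every state's exit rate, and each candidate is either a real transition or a pseudo-event that leaves the state unchanged. The function name and the generator-matrix layout are assumptions.

```python
import random


def simulate_uniformized(Q, start, t_end, rng):
    """Simulate a continuous-time Markov chain by uniformization.

    Q is the generator matrix as a list of rows (each row sums to zero).
    Returns the list of (time, state) pairs at which the state changed.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # uniformization rate
    state, t = start, 0.0
    path = [(0.0, start)]
    while True:
        t += rng.expovariate(lam)          # next candidate event time
        if t >= t_end:
            break
        u = rng.random()
        cum = 0.0
        nxt = state                        # default: pseudo-event, no change
        for j in range(n):
            if j == state:
                continue
            cum += Q[state][j] / lam       # probability of jumping to j
            if u < cum:
                nxt = j
                break
        if nxt != state:
            path.append((t, nxt))
            state = nxt
    return path


# Two-state chain with hypothetical rates: 0 -> 1 at rate 2, 1 -> 0 at rate 1.
path = simulate_uniformized([[-2.0, 2.0], [1.0, -1.0]], 0, 5.0, random.Random(0))
```

Because the candidate event times do not depend on the state, they can be generated in advance, which is what makes them usable as synchronization points in a parallel simulation.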
A Parallel Processing Algorithm for Remote Sensing Classification
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
2005-01-01
A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance, 130 Gflops (Linpack Benchmark), at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
Distributed parallel computing in stochastic modeling of groundwater systems.
Dong, Yanhui; Li, Guomin; Xu, Haizhen
2013-03-01
Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling.
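The reported timing can be turned into a speedup and efficiency estimate. The arithmetic below assumes that "reduced to 3%" means the parallel run time is 3% of the serial run time, and that the processor count is the 50 processing threads; both readings are interpretations of the abstract rather than figures it states.

```latex
\[
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{1}{0.03} \approx 33.3,
\qquad
E = \frac{S}{p} \approx \frac{33.3}{50} \approx 0.67 .
\]
```

Under those assumptions, the system runs at roughly two thirds of ideal linear scaling.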
Constant-time parallel sorting algorithm and its optical implementation using smart pixels
NASA Astrophysics Data System (ADS)
Louri, Ahmed; Hatch, James A., Jr.; Na, Jongwhoa
1995-06-01
Sorting is a fundamental operation that has important implications in a vast number of areas. For instance, sorting is heavily utilized in applications such as database machines, in which hashing techniques are used to accelerate data-processing algorithms. It is also the basis for interprocessor message routing and has strong implications in video telecommunications. However, high-speed electronic sorting networks are difficult to implement with VLSI technology because of the dense, global connectivity required. Optics eliminates this bottleneck by offering global interconnects, massive parallelism, and noninterfering communications. We present a parallel sorting algorithm and its efficient optical implementation. The algorithm sorts n data elements in few steps, independent of the number of elements to be sorted. Thus it is a constant-time sorting algorithm [i.e., O(1) time]. We also estimate the system's performance to show that the proposed sorting algorithm can provide at least 2 orders of magnitude improvement in execution time over conventional electronic algorithms.
NASA Technical Reports Server (NTRS)
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
Performance of multiprocessors and parallel algorithms: Quick-sort, a case study
Patil, I.M.
1989-01-01
Performance of parallel algorithms on multiprocessors has been traditionally analyzed by looking at either the algorithm or the architecture of the multiprocessor system. However, it is important to study the combined effect of both these factors in order to evaluate and predict performance. A different methodology based on approximate trace-driven simulation is adopted in this thesis to study the performance of a class of non-numerical algorithms. Performance of parallel quick-sort and parallel quick-merge sort is investigated in order to demonstrate the methodology as well as develop an understanding of the limitations imposed by a cache-based single bus environment on achievable speedup. A wide range of issues including the effect of cache parameters, coherency protocol, scheduling mechanisms and technology effects are discussed in the context of performance of the two versions of parallel quick-sort.
An efficient parallel algorithm for the solution of a tridiagonal linear system of equations
NASA Technical Reports Server (NTRS)
Stone, H. S.
1971-01-01
Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log_2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
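To make the idea of recursive doubling concrete, here is a minimal sketch for the simplest case, a first-order linear recurrence x[i] = a[i] * x[i-1] + b[i] with x[-1] = 0. It is not the tridiagonal solver from the abstract, and the function names are hypothetical. Each recurrence step is treated as an affine map, and neighbouring maps are composed at doubling strides, so the outer loop runs about log_2 N times. The inner loop is written serially here, but its iterations are independent and could run in parallel.

```python
def serial_recurrence(a, b):
    """Reference solution: N dependent steps, one after another."""
    x, out = 0, []
    for ai, bi in zip(a, b):
        x = ai * x + bi
        out.append(x)
    return out


def recursive_doubling(a, b):
    """Solve x[i] = a[i] * x[i-1] + b[i], x[-1] = 0, in ceil(log2 N) sweeps.

    (A[i], B[i]) represents the affine map x -> A[i] * x + B[i]. After the
    sweep with stride s, entry i holds the composition of up to 2*s
    consecutive maps ending at i.
    """
    n = len(a)
    A, B = list(a), list(b)
    stride, sweeps = 1, 0
    while stride < n:
        new_A, new_B = A[:], B[:]
        for i in range(stride, n):  # iterations are independent of each other
            j = i - stride
            # Apply map j first, then map i: A[i] * (A[j] * x + B[j]) + B[i]
            new_A[i] = A[i] * A[j]
            new_B[i] = A[i] * B[j] + B[i]
        A, B = new_A, new_B
        stride *= 2
        sweeps += 1
    # With x[-1] = 0 the composed map evaluates to its constant term.
    return B, sweeps


a = [2, 3, 4, 5, 6]
b = [1, 1, 1, 1, 1]
x, sweeps = recursive_doubling(a, b)
print(x == serial_recurrence(a, b), sweeps)
```

The trade-off is more total arithmetic than the serial loop in exchange for a logarithmic number of dependent sweeps.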
On the impact of communication complexity in the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
An efficient parallel algorithm for the solution of a tridiagonal linear system of equations.
NASA Technical Reports Server (NTRS)
Stone, H. S.
1973-01-01
Tridiagonal linear systems of equations can be solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computation on computers of the Illiac IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log_2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
Multi-directional search: A direct search algorithm for parallel machines
Torczon, V.J.
1989-01-01
In recent years there has been a great deal of interest in the development of optimization algorithms which exploit the computational power of parallel computer architectures. The author has developed a new direct search algorithm, which he calls multi-directional search, that is ideally suited for parallel computation. His algorithm belongs to the class of direct search methods, a class of optimization algorithms which neither compute nor approximate any derivatives of the objective function. His work, in fact, was inspired by the simplex method of Spendley, Hext, and Himsworth, and the simplex method of Nelder and Mead. The multi-directional search algorithm is inherently parallel. The basic idea of the algorithm is to perform concurrent searches in multiple directions. These searches are free of any interdependencies, so the information required can be computed in parallel. A central result of his work is the convergence analysis for his algorithm. By requiring only that the function be continuously differentiable over a bounded level set, he can prove that a subsequence of the points generated by the multi-directional search algorithm converges to a stationary point of the objective function. This is of great interest since he knows of few convergence results for practical direct search algorithms. He also presents numerical results indicating that the multi-directional search algorithm is robust, even in the presence of noise. His results include comparisons with the Nelder-Mead simplex algorithm, the method of steepest descent, and a quasi-Newton method. One surprising conclusion of his numerical tests is that the Nelder-Mead simplex algorithm is not robust. He closes with some comments about future directions of research.
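The following is a simplified sketch of a multi-directional search iteration as it is commonly described: reflect the whole simplex through its best vertex, try an expansion if the reflection found an improvement, and otherwise contract toward the best vertex. The function name, the parameter values (expansion factor 2, contraction factor 0.5), and the stopping rule are assumptions made for illustration, not details taken from the thesis. The function evaluations inside each list comprehension are independent of one another, which is where the parallelism would come from; they run serially here.

```python
def multidirectional_search(f, simplex, mu=2.0, theta=0.5, max_iter=200, tol=1e-10):
    """Minimize f using reflect / expand / contract steps about the best vertex."""
    pts = [list(p) for p in simplex]
    vals = [f(p) for p in pts]
    for _ in range(max_iter):
        # Move the current best vertex to position 0.
        best = min(range(len(pts)), key=vals.__getitem__)
        pts[0], pts[best] = pts[best], pts[0]
        vals[0], vals[best] = vals[best], vals[0]
        v0, f0 = pts[0], vals[0]

        # Stop once the simplex has shrunk below the tolerance.
        size = max(abs(p[k] - v0[k]) for p in pts[1:] for k in range(len(v0)))
        if size < tol:
            break

        def scaled(s):
            # New vertices v0 + s * (v_i - v0) for every non-best vertex.
            return [[v0[k] + s * (p[k] - v0[k]) for k in range(len(v0))]
                    for p in pts[1:]]

        refl = scaled(-1.0)                      # reflection through v0
        refl_vals = [f(p) for p in refl]         # independent evaluations
        if min(refl_vals) < f0:
            expd = scaled(-mu)                   # expansion along the same directions
            expd_vals = [f(p) for p in expd]     # independent evaluations
            if min(expd_vals) < min(refl_vals):
                new, new_vals = expd, expd_vals
            else:
                new, new_vals = refl, refl_vals
        else:
            new = scaled(theta)                  # contraction toward v0
            new_vals = [f(p) for p in new]       # independent evaluations
        pts = [v0] + new
        vals = [f0] + new_vals

    best = min(range(len(pts)), key=vals.__getitem__)
    return pts[best], vals[best]


def quadratic(p):
    """Simple test objective with its minimum at (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


x_best, f_best = multidirectional_search(quadratic, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(x_best, f_best)
```

Note that no derivative of the objective is computed or approximated anywhere; only function values are compared.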
A parallel algorithm for generation and assembly of finite element stiffness and mass matrices
NASA Technical Reports Server (NTRS)
Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.
1991-01-01
A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David
2006-05-01
The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has quickly introduced new processing challenges. The price paid for the wealth of spatial and spectral information available from hyperspectral sensors is the enormous amount of data that they generate. Several applications exist, however, where having the desired information calculated quickly enough for practical use is highly desirable. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed up the computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques. Techniques include four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in selection of parallel hyperspectral algorithms for specific applications.
Event parallelism: Distributed memory parallel computing for high energy physics experiments
Nash, T.
1989-05-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs.
A parallel algorithm for thermo-hydro-mechanical analysis of deforming porous media
NASA Astrophysics Data System (ADS)
Wang, X.; Gawin, D.; Schrefler, B. A.
1996-11-01
In this paper, a parallel Newton-Raphson algorithm with domain decomposition is developed to solve fully coupled heat, water and gas flow in deformable porous media. The model makes use of the modified effective stress concept together with the capillary pressure relationship. Phase change and latent heat transfer are also taken into account. The chosen macroscopic field variables are displacement, capillary pressure, gas pressure and temperature. The parallel program is developed on a cluster of workstations. The PVM (Parallel Virtual Machine) system is used to handle communications among networked workstations. An implementation of this parallel method on workstations is discussed, the speedup and efficiency of this method being demonstrated by numerical examples.
An efficient parallel algorithm for three-dimensional analysis of subsidence above gas reservoirs
NASA Astrophysics Data System (ADS)
Schrefler, B. A.; Wang, X.; Salomoni, V. A.; Zuccolo, G.
1999-09-01
In this paper an efficient parallel algorithm to solve a three-dimensional problem of subsidence above exploited gas reservoirs is presented. The parallel program is developed on a cluster of workstations. The parallel virtual machine (PVM) system is used to handle communications among networked workstations. The method has advantages such as numbering of the finite element mesh in an arbitrary manner, simple programming organization, smaller core requirements and computation times. An implementation of this parallel method on workstations is discussed, the speed-up and efficiency of this method being demonstrated by a numerical example.
A dataflow analysis tool for parallel processing of algorithms
NASA Technical Reports Server (NTRS)
Jones, Robert L., III
1993-01-01
A graph-theoretic design process and software tool is presented for selecting a multiprocessing scheduling solution for a class of computational problems. The problems of interest are those that can be described using a dataflow graph and are intended to be executed repetitively on a set of identical parallel processors. Typical applications include signal processing and control law problems. Graph analysis techniques are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool is shown to facilitate the application of the design process to a given problem.
A parallel, volume-tracking algorithm for unstructured meshes
Mosso, S.J.; Swartz, B.K.; Kothe, D.B.; Ferrell, R.C.
1996-10-01
Many diverse areas of industry benefit from the use of volume of fluid methods to predict the movement of materials. Casting is a common method of part fabrication. The accurate prediction of the casting process is pivotal to industry. Mold design and casting is currently considered an art by industry. It typically involves many trial mold designs, and the rejection of defective parts is costly. Failure of cast parts, because residual stresses reduce the part's strength, can be catastrophic. Cast parts should have precise geometric details that reduce or eliminate the need for machining after casting. Volume of fluid codes will help designers predict how the molten metal fills a mold and where any trapped voids remain. Prediction of defects due to thermal contraction or expansion will eliminate defective, trial mold designs and speed the parts to market with fewer rejections. Increasing the predictability and therefore the accuracy of the casting process will reduce the art that is involved in mold design and parts casting. Here, recent enhancements to multidimensional volume-tracking algorithms are presented. Illustrations in two dimensions are given. The improvements include new, local algorithms for interface normal constructions and a new full remapping algorithm for time integration. These methods are used on structured and unstructured grids.
Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Chiou, Jin-Chern
1990-01-01
Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAE's) viewpoint. Constraint violations during the time integration process are minimized, and penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion are treated with a two-stage staggered explicit-implicit numerical algorithm, which takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained by using an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computational procedures, parallel implementation of the present constraint treatment techniques and the two-stage staggered explicit-implicit numerical algorithm was efficiently carried out. The DAE's and the constraint treatment techniques were transformed into arrowhead matrices, from which a Schur complement form was derived. By fully exploiting the sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient numerical algorithm is used to solve the system equations written in Schur complement form. A software testbed was designed and implemented on both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speedup of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.
A Self Consistent Multiprocessor Space Charge Algorithm that is Almost Embarrassingly Parallel
Edward Nissen, B. Erdelyi, S.L. Manikonda
2012-07-01
We present a space charge code that is self consistent, massively parallelizable, and requires very little communication between computer nodes, making the calculation almost embarrassingly parallel. This method is implemented in the code COSY Infinity, where the differential algebras used in this code are important to the algorithm's proper functioning. The method works by calculating the self consistent space charge distribution using the statistical moments of the test particles, and converting them into polynomial series coefficients. These coefficients are combined with differential algebraic integrals to form the potential and electric fields. The result is a map which contains the effects of space charge. This method allows for massive parallelization since its statistics-based solver doesn't require any binning of particles, and only requires a vector containing the partial sums of the statistical moments for the different nodes to be passed. All other calculations are done independently. The resulting maps can be used to analyze the system using normal form analysis, as well as advance particles in numbers and at speeds that were previously impossible.
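The communication pattern described above, where each node passes only a short vector of partial moment sums, can be sketched as follows. This is a one-dimensional toy with hypothetical function names, not the COSY Infinity implementation. Each "node" computes sums of powers of its own particle coordinates, and the only data combined across nodes is the element-wise sum of those vectors.

```python
def partial_moment_sums(xs, max_order):
    """Per-node work: [sum(x**k) for k = 0..max_order]; entry 0 is the particle count."""
    return [sum(x ** k for x in xs) for k in range(max_order + 1)]


def combine(partials):
    """Element-wise sum of the per-node vectors (the only inter-node communication)."""
    return [sum(column) for column in zip(*partials)]


def raw_moments(total):
    """Turn global power sums into raw moments E[x**k] by dividing by the count."""
    count = total[0]
    return [s / count for s in total]


# Two hypothetical nodes, each holding part of the particle distribution.
node_a = [1.0, 2.0]
node_b = [3.0, 4.0]
total = combine([partial_moment_sums(node_a, 2), partial_moment_sums(node_b, 2)])
print(raw_moments(total))
```

The size of the exchanged vector depends only on the moment order, not on the number of particles, which is why the communication cost stays small.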
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
NASA Technical Reports Server (NTRS)
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time, T_par, of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
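The dependence of speedup on the sequential fraction is usually modelled with Amdahl's law. The abstract does not name that model, so treating it as the relevant one is an assumption here, and the function name is hypothetical. If a fraction s of the work is inherently sequential, the speedup on p processors is 1 / (s + (1 - s) / p), which can never exceed 1 / s.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given sequential fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative values only: a code that is half sequential barely benefits
# from more processors, while a 5% sequential code scales much further.
for s in (0.5, 0.05):
    for p in (8, 64):
        print(f"serial fraction {s}, {p} processors: speedup {amdahl_speedup(s, p):.2f}")
```

With half of the work sequential, the speedup stays below 2 no matter how many processors are added.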
Execution models for mapping programs onto distributed memory parallel computers
NASA Technical Reports Server (NTRS)
Sussman, Alan
1992-01-01
The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.
Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua
2011-01-01
A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Different from existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of the gyro incremental angle samples and the number of the accelerometer incremental velocity samples. When the output sampling rate of inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, while using more gyro incremental angle and accelerometer incremental velocity samples in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high precision requirements of the system in a high dynamic environment, relative to the existing implementation on the DSP platform. PMID:22164058
Development of an aerodynamics algorithm for parallel-processing supercomputers
NASA Technical Reports Server (NTRS)
Swisshelm, Julie M.; Johnson, Gary M.
1988-01-01
An explicit flow solver, applicable to the hierarchy of model equations ranging from Euler to full Navier-Stokes, is combined with several techniques designed to reduce computational expense. The computational domain consists of local grid refinements embedded in a global coarse mesh, where the locations of these refinements are defined by the physics of the flow. Flow characteristics are also used to determine which set of model equations is appropriate for solution in each region, thereby reducing not only the number of grid points at which the solution must be obtained, but also the computational effort required to get that solution. Acceleration to steady-state is achieved by applying multigrid on each of the subgrids, regardless of the particular model equations being solved. Since each of these components is explicit, advantage can readily be taken of the vector- and parallel-processing capabilities of machines such as the Cray X-MP and Cray-2.
Parallelization and Algorithmic Enhancements of High Resolution IRAS Image Construction
NASA Technical Reports Server (NTRS)
Cao, Yu; Prince, Thomas A.; Tereby, Susan; Beichman, Charles A.
1996-01-01
The Infrared Astronomical Satellite carried out a nearly complete survey of the infrared sky, and the survey data are important for the study of many astrophysical phenomena. However, many data sets at other wavelengths have higher resolutions than that of the co-added IRAS maps, and high resolution IRAS images are strongly desired both for their own information content and their usefulness in correlation. The HIRES program was developed by the Infrared Processing and Analysis Center (IPAC) to produce high resolution (approx. 1') images from IRAS data using the Maximum Correlation Method (MCM). We describe the port of HIRES to the Intel Paragon, a massively parallel supercomputer, other software developments for mass production of HIRES images, and the IRAS Galaxy Atlas, a project to map the Galactic plane at 60 and 100 μm.
Hybrid image classification and parameter selection using a shared memory parallel algorithm
NASA Astrophysics Data System (ADS)
Phillips, Rhonda D.; Watson, Layne T.; Wynne, Randolph H.
2007-07-01
This work presents a shared memory parallel version of the hybrid classification algorithm IGSCR (iterative guided spectral class rejection) to facilitate the transition from serial to parallel processing. This transition is motivated by a demonstrated need for more computing power driven by the increasing size of remote sensing data sets due to higher resolution sensors, larger study regions, and the like. Parallel IGSCR was developed to produce fast and portable code using Fortran 95, OpenMP, and the Hierarchical Data Format version 5 (HDF5) and accompanying data access library. The intention of this work is to provide an efficient implementation of the established IGSCR classification algorithm. The applicability of the faster parallel IGSCR algorithm is demonstrated by classifying Landsat data covering most of Virginia, USA into forest and non-forest classes with approximately 90% accuracy. Parallel results are given using the SGI Altix 3300 shared memory computer and the SGI Altix 3700 with as many as 64 processors reaching speedups of almost 77. Parallel IGSCR allows an analyst to perform and assess multiple classifications to refine parameters. As an example, parallel IGSCR was used for a factorial analysis consisting of 42 classifications of a 1.2 GB image to select the number of initial classes (70) and class purity (70%) used for the remaining two images.
Parallel processing of layout data with selective data distribution
NASA Astrophysics Data System (ADS)
Pereira, Mark; Bhat, Nitin; Srinivas, Preethi
2006-10-01
With the increase in layout data (GDSII) size due to finer geometries and resolution enhancement techniques such as Optical Proximity Correction (OPC) and Phase Shift Mask (PSM), layout data is proving to be too voluminous to process on single-CPU machines. Post-layout tools have now moved towards distributed computing techniques to process this data more efficiently in terms of speed. Typical distributed computing architectures involve distributing the layout data to various workstations, with each workstation then processing its part of the data in parallel. This approach works well provided the amount of data to be distributed is not too large. As the size of the layout data is increasing significantly, the time taken to transfer the layout data between the workstations is turning out to be a major bottleneck. This bottleneck is further highlighted because the time taken for the actual operations scales down almost linearly as more workstations are employed in the distributed computing environment, and because the clock speed of the workstations continues to improve. The focus of this paper is on a smart way of distributing the layout data so that the amount of redundant data transfer is significantly reduced. This is achieved by selective data distribution, wherein the layout data is fragmented and each workstation is provided with minimal and sufficient layout information for it to determine the actual fragments required for its processing.
PMCR-Miner: parallel maximal confident association rules miner algorithm for microarray data set.
Zakaria, Wael; Kotb, Yasser; Ghaleb, Fayed F M
2015-01-01
The MCR-Miner algorithm aims to mine all maximal high confident association rules from the microarray up/down-expressed genes data set. This paper introduces two new algorithms: IMCR-Miner and PMCR-Miner. The IMCR-Miner algorithm is an extension of the MCR-Miner algorithm with some improvements. These improvements implement a novel way to store the samples of each gene in a list of unsigned integers in order to benefit from bitwise operations. In addition, the IMCR-Miner algorithm overcomes the drawbacks faced by the MCR-Miner algorithm by setting some restrictions to ignore repeated comparisons. The PMCR-Miner algorithm is a parallel version of the newly proposed IMCR-Miner algorithm. The PMCR-Miner algorithm is based on shared-memory systems and task parallelism, where no time is needed in the process of sharing and combining data between processors. The experimental results on real microarray data sets show that the PMCR-Miner algorithm is more efficient and scalable than its counterparts.
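A rough sketch of why storing each gene's samples as bits helps: the support of a set of genes becomes a bitwise AND followed by a bit count, and the confidence of a rule is a ratio of two such counts. The function names and the toy data are hypothetical, and a single Python integer stands in for the list of unsigned integers mentioned in the abstract.

```python
def to_bitset(sample_indices):
    """Encode the samples in which a gene is (say) up-expressed as set bits."""
    bits = 0
    for i in sample_indices:
        bits |= 1 << i
    return bits


def support(*bitsets):
    """Number of samples shared by all given genes: AND the bitsets, count the ones."""
    common = bitsets[0]
    for b in bitsets[1:]:
        common &= b
    return bin(common).count("1")


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    ante = support(antecedent)
    return support(antecedent, consequent) / ante if ante else 0.0


# Toy data: gene A is up-expressed in samples 0, 1, 3 and gene B in samples 1, 3, 4.
gene_a = to_bitset([0, 1, 3])
gene_b = to_bitset([1, 3, 4])
print(support(gene_a, gene_b), confidence(gene_a, gene_b))
```

One AND operation compares many samples at once, which is the benefit the abstract attributes to the unsigned-integer representation.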
A Two-Pass Exact Algorithm for Selection on Parallel Disk Systems
Mi, Tian; Rajasekaran, Sanguthevar
2014-01-01
Numerous OLAP queries process selection operations of “top N”, median, “top 5%”, in data warehousing applications. Selection is a well-studied problem that has numerous applications in the management of data and databases since, typically, any complex data query can be reduced to a series of basic operations such as sorting and selection. Parallel selection has also become an important fundamental operation, especially after parallel databases were introduced. In this paper, we present a deterministic algorithm, Recursive Sampling Selection (RSS), to solve the exact out-of-core selection problem, which we show needs no more than (2 + ε) passes (ε being a very small fraction). We have compared our RSS algorithm with two other algorithms in the literature, namely, the Deterministic Sampling Selection (DSS) and QuickSelect on the Parallel Disks Systems. Our analysis shows that DSS is a (2 + ε)-pass algorithm when the total number of input elements N is a polynomial in the memory size M (i.e., N = M^c for some constant c). In contrast, our proposed algorithm RSS runs in (2 + ε) passes without any assumptions. Experimental results indicate that both RSS and DSS outperform QuickSelect on the Parallel Disks Systems. In particular, the proposed algorithm RSS is more scalable and robust in handling big data when the input size is far greater than the core memory size, including the case of N ≫ M^c. PMID:25374478
Two and Three-Dimensional Nonlocal DFT for Inhomogeneous Fluids I: Algorithms and Parallelization
Frink, Laura J. Douglas; Salinger, Andrew
1999-08-09
Fluids adsorbed near surfaces, macromolecules, and in porous materials are inhomogeneous, exhibiting spatially varying density distributions. This inhomogeneity in the fluid plays an important role in controlling a wide variety of complex physical phenomena including wetting, self-assembly, corrosion, and molecular recognition. One of the key methods for studying the properties of inhomogeneous fluids in simple geometries has been density functional theory (DFT). However, there has been a conspicuous lack of calculations in complex 2D and 3D geometries. The computational difficulty arises from the need to perform nested integrals that are due to nonlocal terms in the free energy functional. These integral equations are expensive both in evaluation time and in memory requirements; however, the expense can be mitigated by intelligent algorithms and the use of parallel computers. This paper details our efforts to develop efficient numerical algorithms so that nonlocal DFT calculations in complex geometries that require two or three dimensions can be performed. The success of this implementation will enable the study of solvation effects at heterogeneous surfaces, in zeolites, in solvated (bio)polymers, and in colloidal suspensions.
Roche-Lima, Abiel; Thulasiram, Ruppa K.
2016-01-01
Finite automata in which each transition is augmented with an output label in addition to the familiar input label are called finite-state transducers. Transducers have been used to analyze some fundamental issues in bioinformatics. Weighted finite-state transducers have been proposed for pairwise alignment of DNA and protein sequences, as well as for developing kernels for computational biology. Machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on conditional probability computation, which is calculated using techniques such as pair-database creation, normalization (with Maximum-Likelihood normalization), and parameter optimization (with Expectation-Maximization, EM). These techniques are intrinsically costly to compute, even more so when applied to bioinformatics, because the database sizes are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Several experiments were conducted using the parallel and sequential algorithms on Westgrid (specifically, on the Breezy cluster). The results show that our parallel algorithm is scalable, because execution times are reduced considerably when the data size parameter is increased. Another experiment was conducted by changing the precision parameter; in this case, we obtain smaller execution times using the parallel algorithm. Finally, the number of threads used to execute the parallel algorithm on the Breezy cluster was changed. In this last experiment, the speedup increases considerably when more threads are used; however, it levels off when the number of threads is equal to or greater than 16.
Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases
Mata-Montero, Manrique; Shalaby, Nabil; Sheppard, Bradley
2013-01-01
Obtaining unique oligos from an EST database is a problem of great importance in bioinformatics, particularly in the discovery of new genes and the mapping of the human genome. Many algorithms have been developed to find unique oligos, many of which are much less time consuming than the traditional brute force approach. An algorithm was presented by Zheng et al. (2004) which finds the solution of the unique oligos search problem efficiently. We implement this algorithm as well as several new algorithms based on some theorems included in this paper. We demonstrate how, with these new algorithms, we can obtain unique oligos much faster than with previous ones. We parallelize these new algorithms to further improve the time of finding unique oligos. All algorithms are run on ESTs obtained from a Barley EST database. PMID:23653640
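As a point of reference for the brute force approach mentioned above, the following is a minimal sketch under a simplifying assumption: an oligo counts as unique if it occurs verbatim in exactly one EST. The abstract does not give the formal definition used by Zheng et al. (which may tolerate mismatches), so the function name `unique_oligos` and the exact-match criterion are illustrative only.

```python
from collections import defaultdict


def unique_oligos(ests, length):
    """Map each EST index to the oligos of the given length found only in that EST.

    Assumption for illustration: uniqueness means an exact substring match in
    exactly one sequence; mismatch tolerance is not modelled here.
    """
    owners = defaultdict(set)  # oligo -> indices of the ESTs that contain it
    for idx, est in enumerate(ests):
        for i in range(len(est) - length + 1):
            owners[est[i:i + length]].add(idx)

    result = {idx: [] for idx in range(len(ests))}
    for oligo, idxs in owners.items():
        if len(idxs) == 1:
            (only,) = idxs
            result[only].append(oligo)
    return {idx: sorted(oligos) for idx, oligos in result.items()}


if __name__ == "__main__":
    print(unique_oligos(["ACGTA", "CGTAC"], 3))
```

The scan over each EST is independent of the others, so the indexing loop is a natural unit of work to distribute when parallelizing.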
A practical algorithm for static analysis of parallel programs
McDowell, C.E.
1989-06-01
One approach to analyzing the behavior of a concurrent program requires determining the reachable program states. A program state consists of a set of task states, the values of shared variables used for synchronization, and local variables that derive their values directly from synchronization operations. However, the number of reachable states rises exponentially with the number of tasks and becomes intractable for many concurrent programs. A variation of this approach merges a set of related states into a single virtual state. Using this approach, the analysis of concurrent programs becomes feasible as the number of virtual states is often orders of magnitude less than the number of reachable states. This paper presents a method for determining the virtual states that describe the reachable program states, and the reduction in the number of states is analyzed. The algorithms given have been implemented in a static program analyzer for multitasking Fortran, and the results obtained are discussed.
Application of parallel distributed processing to space based systems
NASA Technical Reports Server (NTRS)
Macdonald, J. R.; Heffelfinger, H. L.
1987-01-01
The concept of using Parallel Distributed Processing (PDP) to enhance automated experiment monitoring and control is explored. Recent very large scale integration (VLSI) advances have made such applications an achievable goal. The PDP machine has demonstrated the ability to automatically organize stored information, handle unfamiliar and contradictory input data and perform the actions necessary. The PDP machine has demonstrated that it can perform inference and knowledge operations with greater speed and flexibility and at lower cost than traditional architectures. In applications where the rule set governing an expert system's decisions is difficult to formulate, PDP can be used to extract rules by associating the information an expert receives with the actions taken.
The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce
NASA Astrophysics Data System (ADS)
Chen, Xi; Zhou, Liqing
2015-12-01
With the development of satellite remote sensing technology and the growth of remote sensing image data, traditional remote sensing image segmentation technology cannot meet the processing and storage requirements of massive remote sensing imagery. This article applies cloud computing and parallel computing technology to the remote sensing image segmentation process and builds a cheap and efficient computer cluster system that uses parallel processing to implement the MeanShift algorithm for remote sensing image segmentation based on the MapReduce model. The approach not only ensures the quality of remote sensing image segmentation but also improves segmentation speed and better meets real-time requirements. The MapReduce-based parallel MeanShift algorithm for remote sensing image segmentation is shown to have practical significance and value.
Marucci, Evandro A; Zafalon, Geraldo F D; Momente, Julio C; Neves, Leandro A; Valêncio, Carlo R; Pinto, Alex R; Cansian, Adriano M; de Souza, Rogeria C G; Shiyou, Yang; Machado, José M
2014-01-01
With the advance of genomic research, the number of sequences involved in comparative methods has grown immensely. Among them, there are methods for similarity calculation, which are used by many bioinformatics applications. Due to the huge amount of data, combining low-complexity methods with parallel computing is becoming desirable. K-mer counting is a very efficient method with good biological results. In this work, the development of a parallel algorithm for multiple sequence similarity calculation using the k-mer counting method is proposed. Tests show that the algorithm has very good scalability and a nearly linear speedup: a 12x speedup was obtained on 14 nodes. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE. PMID:25140318
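The k-mer counting idea can be sketched as follows. The abstract does not specify the similarity formula, so the normalisation used here (shared k-mers divided by the k-mer count of the shorter sequence) and the function names are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Shared k-mers divided by the number of k-mers in the shorter sequence.

    This particular normalisation is an assumption, not taken from the paper.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(count, cb[kmer]) for kmer, count in ca.items())
    denom = min(len(a), len(b)) - k + 1
    return shared / denom if denom > 0 else 0.0


def similarity_matrix(seqs, k=3):
    """All-pairs similarity; every entry is independent of the others."""
    n = len(seqs)
    return [[kmer_similarity(seqs[i], seqs[j], k) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    for row in similarity_matrix(["ACGTAC", "ACGTTT", "GGGGGG"], 3):
        print(row)
```

Because each matrix entry depends only on its own pair of sequences, rows (or blocks of pairs) can be assigned to different nodes without communication until the results are gathered.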
Parallel asynchronous hardware implementation of image processing algorithms
NASA Technical Reports Server (NTRS)
Coon, Darryl D.; Perera, A. G. U.
1990-01-01
Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing state. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging up to seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.
Parallel processors and nonlinear structural dynamics algorithms and software
NASA Technical Reports Server (NTRS)
Belytschko, Ted
1990-01-01
Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.
A Parallel Compact Multi-Dimensional Numerical Algorithm with Aeroacoustics Applications
NASA Technical Reports Server (NTRS)
Povitsky, Alex; Morris, Philip J.
1999-01-01
In this study we propose a novel method to parallelize high-order compact numerical algorithms for the solution of three-dimensional PDEs (Partial Differential Equations) in a space-time domain. For this numerical integration most of the computer time is spent in computation of spatial derivatives at each stage of the Runge-Kutta temporal update. The most efficient direct method to compute spatial derivatives on a serial computer is a version of Gaussian elimination for narrow linear banded systems known as the Thomas algorithm. In a straightforward pipelined implementation of the Thomas algorithm processors are idle due to the forward and backward recurrences of the Thomas algorithm. To utilize processors during this time, we propose to use them for either non-local data independent computations, solving lines in the next spatial direction, or local data-dependent computations by the Runge-Kutta method. To achieve this goal, control of processor communication and computations by a static schedule is adopted. Thus, our parallel code is driven by a communication and computation schedule instead of the usual "creative programming" approach. The obtained parallelization speed-up of the novel algorithm is about twice as much as that for the standard pipelined algorithm and close to that for the explicit DRP algorithm.
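For readers unfamiliar with the Thomas algorithm named above, here is a minimal serial sketch for a tridiagonal system. The argument layout (`a` sub-diagonal, `b` diagonal, `c` super-diagonal, `d` right-hand side) is a convention chosen for this example, and no pivoting is done, so it assumes a well-conditioned (for example diagonally dominant) matrix.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system A x = d.

    a[i] is the sub-diagonal entry of row i (a[0] is ignored),
    b[i] the diagonal entry, c[i] the super-diagonal entry (c[-1] is ignored).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward recurrence: each row depends on the previous row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Backward recurrence: each unknown depends on the next one.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # Rows: [2 1 0], [1 2 1], [0 1 2]; the exact solution is [1, 2, 3].
    print(thomas_solve([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0]))
```

The two loops are the forward and backward recurrences: step `i` cannot start until step `i - 1` (or `i + 1`) has finished, which is why a processor holding a later segment of the line sits idle in a naive pipelined version.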
A space time-ensemble parallel nudged elastic band algorithm for molecular kinetics simulation
NASA Astrophysics Data System (ADS)
Nakano, Aiichiro
2008-02-01
A scalable parallel algorithm has been designed to study long-time dynamics of many-atom systems based on the nudged elastic band method, which performs mutually constrained molecular dynamics simulations for a sequence of atomic configurations (or states) to obtain a minimum energy path between initial and final local minimum-energy states. A directionally heated nudged elastic band method is introduced to search for thermally activated events without the knowledge of final states, which is then applied to an ensemble of bands in a path ensemble method for long-time simulation in the framework of the transition state theory. The resulting molecular kinetics (MK) simulation method is parallelized with a space-time-ensemble parallel nudged elastic band (STEP-NEB) algorithm, which employs spatial decomposition within each state, while temporal parallelism across the states within each band and band-ensemble parallelism are implemented using a hierarchy of communicator constructs in the Message Passing Interface library. The STEP-NEB algorithm exhibits good scalability with respect to spatial, temporal and ensemble decompositions on massively parallel computers. The MK simulation method is used to study low strain-rate deformation of amorphous silica.
On the impact of communication complexity on the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D. B.; Van Rosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
2016-08-19
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally-efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace, such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2D and a random hydraulic conductivity field in 3D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10^1 to ~10^2 in a multi-core computational environment. Furthermore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate- to large-scale problems.
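The linear system at the heart of each Levenberg-Marquardt iteration, and the reason a Krylov basis can be reused across damping parameters, can be written out as below. The symbols (residual vector r, Jacobian J, damping parameter \lambda, orthonormal basis V_k) are notation assumed for this note, and the Galerkin-style projection shown is a generic form, not necessarily the exact projection the authors implement.

```latex
\begin{align}
  % Standard damped normal equations for the parameter update \delta
  \left(J^{\top} J + \lambda I\right)\delta &= -J^{\top} r \\
  % Projection onto a k-dimensional subspace with V_k^{\top} V_k = I and \delta \approx V_k y
  \left(V_k^{\top} J^{\top} J\, V_k + \lambda I\right) y &= -V_k^{\top} J^{\top} r \\
  % Krylov subspaces are invariant under a shift of the matrix by \lambda I
  \mathcal{K}_k\!\left(J^{\top} J + \lambda I,\; J^{\top} r\right)
    &= \mathcal{K}_k\!\left(J^{\top} J,\; J^{\top} r\right)
\end{align}
```

The last identity is the standard shift invariance of Krylov subspaces: the subspace does not depend on \lambda, so a basis built for the first damping parameter remains valid for later ones, and only the small k-by-k system has to be solved again.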
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine
NASA Technical Reports Server (NTRS)
Lee, C. S. G.; Lin, C. T.
1989-01-01
The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
The convergence analysis of parallel genetic algorithm based on allied strategy
NASA Astrophysics Data System (ADS)
Lin, Feng; Sun, Wei; Chang, K. C.
2010-04-01
Genetic algorithms (GAs) have been applied to many difficult optimization problems such as track assignment and hypothesis management for multisensor integration and data fusion. However, premature convergence has been a main problem for GAs. In order to prevent premature convergence, we introduce an allied strategy based on biological evolution and present a parallel Genetic Algorithm with the allied strategy (PGAAS). The PGAAS can prevent premature convergence, increase the optimization speed, and has been successfully applied in a few applications. In this paper, we first present a Markov chain model of the PGAAS. Based on this model, we analyze the convergence property of PGAAS. We then present the proof of global convergence for the PGAAS algorithm. The experimental results show that PGAAS is an efficient and effective parallel genetic algorithm. Finally, we discuss several potential applications of the proposed methodology.
Algorithm comparison and benchmarking using a parallel spectral transform shallow water model
Worley, P.H.; Foster, I.T.; Toonen, B.
1995-04-01
In recent years, a number of computer vendors have produced supercomputers based on a massively parallel processing (MPP) architecture. These computers have been shown to be competitive in performance with conventional vector supercomputers for some applications. As spectral weather and climate models are heavy users of vector supercomputers, it is interesting to determine how these models perform on MPPs, and which MPPs are best suited to the execution of spectral models. The benchmarking of MPPs is complicated by the fact that different algorithms may be more efficient on different architectures. Hence, a comprehensive benchmarking effort must answer two related questions: which algorithm is most efficient on each computer, and how do the most efficient algorithms compare on different computers? In general, these are difficult questions to answer because of the high cost associated with implementing and evaluating a range of different parallel algorithms on each MPP platform.
Experiments with a Parallel Multi-Objective Evolutionary Algorithm for Scheduling
NASA Technical Reports Server (NTRS)
Brown, Matthew; Johnston, Mark D.
2013-01-01
Evolutionary multi-objective algorithms have great potential for scheduling in those situations where tradeoffs among competing objectives represent a key requirement. One challenge, however, is runtime performance, as a consequence of evolving not just a single schedule, but an entire population, while attempting to sample the Pareto frontier as accurately and uniformly as possible. The growing availability of multi-core processors in end user workstations, and even laptops, has raised the question of the extent to which such hardware can be used to speed up evolutionary algorithms. In this paper we report on early experiments in parallelizing a Generalized Differential Evolution (GDE) algorithm for scheduling long-range activities on NASA's Deep Space Network. Initial results show that significant speedups can be achieved, but that performance does not necessarily improve as more cores are utilized. We describe our preliminary results and some initial suggestions from parallelizing the GDE algorithm. Directions for future work are outlined.
The development of algorithms for parallel knowledge discovery using graphics accelerators
NASA Astrophysics Data System (ADS)
Zieliński, Paweł; Mulawka, Jan
2011-10-01
The paper broaches topics of selected knowledge discovery algorithms. Different implementations have been verified on parallel platforms, including graphics accelerators using CUDA technology, multi-core microprocessors using OpenMP and many graphics accelerators. Results of investigations have been compared in terms of performance and scalability. Different types of data representation were also tested. The possibilities of both platforms, using the classification algorithms: the k-nearest neighbors, support vector machines and logistic regression are discussed.
Parallel algorithms of relative radiometric correction for images of TH-1 satellite
NASA Astrophysics Data System (ADS)
Wang, Xiang; Zhang, Tingtao; Cheng, Jiasheng; Yang, Tao
2014-05-01
The first generation of transitive stereo-metric satellites in China, the TH-1 satellite, acquires three-line-array stereo images with a resolution of 5 meters, multispectral images with a resolution of 10 meters, and panchromatic high resolution images with a resolution of 2 meters. The procedure that takes high resolution images from level 0 to level 1A is called relative radiometric correction (RRC). The processing algorithm for high resolution images, with their large volumes of data, is complicated and time consuming. To increase processing speed, practitioners commonly apply parallel processing techniques based on CPUs or GPUs. This article first introduces the whole process and each step of the RRC algorithm currently in use for level 0 high resolution images. Second, the theory and characteristics of the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) parallel programming techniques are briefly described, as well as the advantages of parallel techniques in the image processing field. Third, for each step of the algorithm in use and based on the MPI+OpenMP hybrid paradigm, the parallelizability and parallelization strategies of three processing steps (Radiometric Correction, Splicing Pieces of TDICCD (Time Delay Integration Charge-Coupled Device), and Gray Level Adjustment among pieces of TDICCD) are discussed in depth, and the theoretical acceleration rates of each step and of the whole procedure are deduced according to the processing styles and the independence of the calculations. For the Splicing Pieces of TDICCD step, two different parallelization strategies are proposed, to be chosen with consideration of hardware capabilities. Finally, a series of experiments is carried out to verify the parallel algorithms using 2-meter panchromatic high resolution images from the TH-1 satellite, and the experimental results are analyzed. Strictly on the basis of former parallel algorithms, the programs in the experiments
On Parallel Push-Relabel based Algorithms for Bipartite Maximum Matching
Langguth, Johannes; Azad, Md Ariful; Halappanavar, Mahantesh; Manne, Fredrik
2014-07-01
We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial (graph) problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing maximum transversal of a matrix. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a testset comprised of a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for the parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.
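For context, the augmenting path-based family that the push-relabel algorithms are compared against can be sketched in its simplest serial form. This is the textbook depth-first augmenting path method, not the push-relabel algorithm studied in the paper and not one of the tuned implementations it benchmarks; the adjacency-list input format is an assumption of the example.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching by repeated depth-first augmentation.

    adj[u] lists the right-side vertices adjacent to left-side vertex u.
    Returns (matching size, match_right), where match_right[v] is the left
    vertex matched to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be rematched elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


if __name__ == "__main__":
    print(max_bipartite_matching([[0, 1], [0], [1, 2]], 3))
```

The recursion uses Python's call stack, so this sketch is only suitable for small graphs; production codes use iterative searches and initialization heuristics.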
Guo, Wensheng; Yang, Guowu; Wu, Wei; He, Lei; Sun, Mingyu
2014-01-01
In biological systems, the dynamic analysis method has gained increasing attention in the past decade. The Boolean network is the most common model of a genetic regulatory network. The interactions of activation and inhibition in the genetic regulatory network are modeled as a set of functions of the Boolean network, while the state transitions in the Boolean network reflect the dynamic property of a genetic regulatory network. A difficult problem for state transition analysis is the finding of attractors. In this paper, we modeled the genetic regulatory network as a Boolean network and proposed a solving algorithm to tackle the attractor finding problem. In the proposed algorithm, we partitioned the Boolean network into several blocks consisting of the strongly connected components according to their gradients, and defined the connections between blocks as decision nodes. Based on the solutions calculated on the decision nodes and using a satisfiability solving algorithm, we identified the attractors in the state transition graph of each block. The proposed algorithm is benchmarked on a variety of genetic regulatory networks. Compared with existing algorithms, it achieved similar performance on small test cases and outperformed them on larger and more complex ones, which reflects the trend in modern genetic regulatory networks. Furthermore, while the existing satisfiability-based algorithms cannot be parallelized due to their inherent algorithm design, the proposed algorithm exhibits good scalability on parallel computing architectures.
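To make the notion of an attractor concrete, the sketch below enumerates the attractors of a small synchronous Boolean network by exhaustive simulation. This brute-force approach is exponential in the number of genes and is not the partition-and-satisfiability algorithm the abstract proposes; the three-gene example network is invented for illustration.

```python
from itertools import product


def find_attractors(update_fns):
    """Return the set of attractors of a synchronous Boolean network.

    update_fns[i] maps the current state tuple to the next value of node i.
    Each attractor is a tuple of states in cycle order, rotated so that the
    smallest state comes first (fixed points are cycles of length one).
    """
    n = len(update_fns)

    def step(state):
        return tuple(int(bool(f(state))) for f in update_fns)

    attractors = set()
    for start in product((0, 1), repeat=n):
        seen = set()
        state = start
        while state not in seen:  # follow the trajectory until it repeats
            seen.add(state)
            state = step(state)
        cycle = [state]           # the first repeated state lies on the cycle
        nxt = step(state)
        while nxt != state:
            cycle.append(nxt)
            nxt = step(nxt)
        i = cycle.index(min(cycle))
        attractors.add(tuple(cycle[i:] + cycle[:i]))
    return attractors


if __name__ == "__main__":
    # Hypothetical network: x0' = x1, x1' = x0, x2' = x0 AND x1.
    fns = [lambda s: s[1], lambda s: s[0], lambda s: s[0] and s[1]]
    for attractor in sorted(find_attractors(fns)):
        print(attractor)
```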
Applying various algorithms for species distribution modelling.
Li, Xinhai; Wang, Yuan
2013-06-01
Species distribution models have been used extensively in many fields, including climate change biology, landscape ecology and conservation biology. In the past 3 decades, a number of new models have been proposed, yet researchers still find it difficult to select appropriate models for data and objectives. In this review, we aim to provide insight into the prevailing species distribution models for newcomers in the field of modelling. We compared 11 popular models, including regression models (the generalized linear model, the generalized additive model, the multivariate adaptive regression splines model and hierarchical modelling), classification models (mixture discriminant analysis, the generalized boosting model, and classification and regression tree analysis) and complex models (artificial neural network, random forest, genetic algorithm for rule set production and maximum entropy approaches). Our objectives are: (i) to compare the strengths and weaknesses of the models, their characteristics and identify suitable situations for their use (in terms of data type and species-environment relationships) and (ii) to provide guidelines for model application, including 3 steps: model selection, model formulation and parameter estimation. PMID:23731809
Ellison, C. Leland; Finn, J. M.; Qin, H.; Tang, William M.
2014-10-01
Structure-preserving algorithms obtained via discrete variational principles exhibit strong promise for the calculation of guiding center test particle trajectories. The non-canonical Hamiltonian structure of the guiding center equations forms a novel and challenging context for geometric integration. To demonstrate the practical relevance of these methods, a prototypical variational midpoint algorithm is applied to an experimental magnetic equilibrium. The stability characteristics, conservation properties, and implementation requirements associated with the variational algorithms are addressed. Furthermore, computational run time is reduced for large numbers of particles by parallelizing the calculation on GPU hardware.
A Parallel Domain Decomposition BEM Algorithm for Three Dimensional Exponentially Graded Elasticity
Ortiz Tavara, Jhonny E; Shelton Jr, William Allison; Mantic, Vladislav; Criado, Rafael; Paris, Federico; Gray, Leonard J
2008-01-01
A parallel domain decomposition boundary integral algorithm for three-dimensional exponentially graded elasticity has been developed. As this subdomain algorithm allows the grading direction to vary in the structure, geometries arising from practical FGM applications can be handled. Moreover, the boundary integral algorithm scales well with the number of processors, also helping to alleviate the high computational cost of evaluating the Green's function. Numerical results for cylindrical geometries show excellent agreement with the new analytical solution deduced for axisymmetric plane strain states in a radially graded material.
Creating IRT-Based Parallel Test Forms Using the Genetic Algorithm Method
ERIC Educational Resources Information Center
Sun, Koun-Tem; Chen, Yu-Jen; Tsai, Shu-Yen; Cheng, Chien-Fen
2008-01-01
In educational measurement, the construction of parallel test forms is often a combinatorial optimization problem that involves the time-consuming selection of items to construct tests having approximately the same test information functions (TIFs) and constraints. This article proposes a novel method, genetic algorithm (GA), to construct parallel…
Analysis and research on the balanced distribution of the network-based data in parallel database
NASA Astrophysics Data System (ADS)
He, JunHua
2011-12-01
The rapid development of parallel computer systems has made parallel operating environments mature and widely used in scientific computing and research in many fields; as a result, parallel databases have attracted growing attention and have become an important area of database research. Based on the characteristics of network-based parallel clusters and current trends in parallel computer systems, this paper analyzes the data skew problem of parallel databases on networked clusters of workstations and, given the data distribution characteristics of that environment, proposes a balanced data distribution scheme that can adapt to dynamic data.
Fast parallel molecular algorithms for DNA-based computation: factoring integers.
Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui
2005-06-01
The RSA public-key cryptosystem is an algorithm that converts input data into an unrecognizable encrypted form and converts the unrecognizable data back into its original decrypted form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates how to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that public-key cryptosystems are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
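The link between factoring and RSA security stated above can be illustrated with a toy example on a conventional computer. The sketch uses trial division on a very small modulus, which has nothing to do with the DNA-based algorithms of the paper; the function names are invented here, and `pow(e, -1, phi)` requires Python 3.8 or later.

```python
def trial_factor(n):
    """Return (p, q) with p <= q and p * q == n, found by trial division."""
    if n % 2 == 0:
        return 2, n // 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 2
    raise ValueError("n has no nontrivial factor")


def recover_private_exponent(n, e):
    """Derive an RSA private exponent from the public key by factoring n."""
    p, q = trial_factor(n)
    phi = (p - 1) * (q - 1)
    return pow(e, -1, phi)  # modular inverse of e modulo phi


if __name__ == "__main__":
    n, e = 61 * 53, 17            # toy public key
    d = recover_private_exponent(n, e)
    message = 65
    cipher = pow(message, e, n)   # encrypt with the public key
    print(d, pow(cipher, d, n))   # decrypt with the recovered exponent
```

Once the two primes are known, the private exponent follows from a single modular inverse, which is why the difficulty of factoring is the whole of the protection.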
Parallel image restoration with a two-dimensional likelihood-based algorithm.
Neifeld, Mark A; Wu, Yong
2002-08-10
We describe a pixelwise parallel algorithm for the restoration of images that have been corrupted by a low-pass optical channel and additive noise. This new algorithm is based on an iterative soft-decision method of error correction (i.e., turbo decoding) and offers performance on binary-valued imagery that is comparable to the Viterbi algorithm. We quantify the restoration performance of this new algorithm on random binary imagery for which it is superior to both the Wiener filter and the projection onto convex sets algorithms over a wide range of channels. For typical optical channels, the new algorithm is within 0.5 dB of the two-dimensional Viterbi restoration method [J. Opt. Soc. Am. A 17, 265 (2000)]. We also demonstrate the extension of our new algorithm to correlated and gray-scale images using vector quantization to mitigate the associated complexity burden. A highly parallel focal-plane implementation is also discussed, and a design study is presented to quantify the capabilities of such a VLSI hardware solution. We find that video-rate restoration on 252 x 252 pixel images is possible using current technology.
A new parallel algorithm for contact detection in finite element methods
Hendrickson, B.; Plimpton, S.; Attaway, S.; Vaughan, C.; Gardner, D.
1996-03-01
In finite-element, transient dynamics simulations, physical objects are typically modeled as Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. In many simulations, such as computations of impacts or explosions, portions of the deforming mesh come in contact with each other as the simulation progresses. These contacts must be detected and the forces they impart to the mesh must be computed at each timestep to accurately capture the physics of interest. While the finite-element portion of these computations is readily parallelized, the contact detection problem is difficult to implement efficiently on parallel computers and has been a bottleneck to achieving high performance on large parallel machines. In this paper we describe a new parallel algorithm for detecting contacts. Our approach differs from previous work in that we use two different parallel decompositions, a static one for the finite element analysis and a dynamic one for contact detection. We present results for this algorithm in a parallel version of the transient dynamics code PRONTO-3D running on a large Intel Paragon.
Reusable Component Model Development Approach for Parallel and Distributed Simulation
Zhu, Feng; Yao, Yiping; Chen, Huilong; Yao, Feng
2014-01-01
Model reuse is a key issue to be resolved in parallel and distributed simulation at present. However, component models built by different domain experts usually have diverse interfaces, are tightly coupled, and are closely bound to simulation platforms. As a result, they are difficult to reuse across different simulation platforms and applications. To address the problem, this paper first proposes a reusable component model framework. Based on this framework, our reusable model development approach is then elaborated, which contains two phases: (1) domain experts create simulation computational modules observing three principles to achieve their independence; (2) the model developer encapsulates these simulation computational modules with six standard service interfaces to improve their reusability. The case study of a radar model indicates that the model developed using our approach has good reusability and is easy to use in different simulation platforms and applications. PMID:24729751
NASA Astrophysics Data System (ADS)
Wang, Congzhe; Fang, Yuefa; Guo, Sheng
2015-07-01
Dimensional synthesis is one of the most difficult issues in the field of parallel robots with actuation redundancy. To deal with the optimal design of a redundantly actuated parallel robot used for ankle rehabilitation, a methodology of dimensional synthesis based on multi-objective optimization is presented. First, the dimensional synthesis of the redundant parallel robot is formulated as a nonlinear constrained multi-objective optimization problem. Then four objective functions, separately reflecting occupied space, input/output transmission and torque performances, and multi-criteria constraints, such as dimension, interference and kinematics, are defined. In consideration of the passive exercise of plantar/dorsiflexion requiring large output moment, a torque index is proposed. To cope with the actuation redundancy of the parallel robot, a new output transmission index is defined as well. The multi-objective optimization problem is solved by using a modified Differential Evolution (DE) algorithm, which is characterized by new selection and mutation strategies. Meanwhile, a special penalty method is presented to tackle the multi-criteria constraints. Finally, numerical experiments for different optimization algorithms are implemented. The computation results show that the proposed indices of output transmission and torque, and the constraint handling, are effective for the redundant parallel robot; the modified DE algorithm is superior to the other tested algorithms in terms of the ability of global search and the number of non-dominated solutions. The proposed methodology of multi-objective optimization can also be applied to the dimensional synthesis of other redundantly actuated parallel robots with only rotational movements.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
High-performance modeling acoustic and elastic waves using the parallel Dichotomy Algorithm
Fatyanov, Alexey G.; Terekhov, Andrew V.
2011-03-01
A high-performance parallel algorithm is proposed for modeling the propagation of acoustic and elastic waves in inhomogeneous media. An initial boundary-value problem is replaced by a series of boundary-value problems for a constant elliptic operator and different right-hand sides via the integral Laguerre transform. It is proposed to solve the difference equations by the conjugate gradient method for acoustic equations and by the GMRES(k) method for modeling elastic waves. The preconditioning operator is the Laplace operator, which is inverted using the variable separation method. The novelty of the proposed algorithm is the use of the Dichotomy Algorithm, which was designed for solving a series of tridiagonal systems of linear equations, in the context of the preconditioning operator inversion. By considering analytical solutions, it is shown that modeling wave processes over long times requires high-resolution meshes. The proposed parallel fine-mesh algorithm makes it possible to solve real seismic application problems in acceptable time and with high accuracy. By solving model problems, it is demonstrated that the considered parallel algorithm possesses high performance and efficiency over a wide range of the number of processors (from 2 to 8192).
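The Dichotomy Algorithm in this abstract targets series of tridiagonal systems. As a point of reference, the usual serial baseline for a single such system is the Thomas algorithm (forward elimination followed by back substitution). The Python sketch below shows that baseline only; it is not the Dichotomy Algorithm, and the function name and the small test system are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored.
    Assumes no pivoting is needed (e.g. a diagonally dominant matrix)."""
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Small Laplacian-like system whose exact solution is [1, 2, 3, 4]
lower = [0.0, -1.0, -1.0, -1.0]
diag = [2.0, 2.0, 2.0, 2.0]
upper = [-1.0, -1.0, -1.0, 0.0]
rhs = [0.0, 0.0, 0.0, 5.0]
solution = thomas_solve(lower, diag, upper, rhs)
```

Each step of the elimination depends on the previous one, which is what makes a tridiagonal solve hard to parallelize directly and motivates specialized parallel schemes.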
A new asynchronous parallel algorithm for inferring large-scale gene regulatory networks.
Xiao, Xiangyun; Zhang, Wei; Zou, Xiufen
2015-01-01
The reconstruction of gene regulatory networks (GRNs) from high-throughput experimental data has been considered one of the most important issues in systems biology research. With the development of high-throughput technology and the complexity of biological problems, we need to reconstruct GRNs that contain thousands of genes. However, when many existing algorithms are used to handle these large-scale problems, they will encounter two important issues: low accuracy and high computational cost. To overcome these difficulties, the main goal of this study is to design an effective parallel algorithm to infer large-scale GRNs based on high-performance parallel computing environments. In this study, we proposed a novel asynchronous parallel framework to improve the accuracy and lower the time complexity of large-scale GRN inference by combining splitting technology and ordinary differential equation (ODE)-based optimization. The presented algorithm uses the sparsity and modularity of GRNs to split whole large-scale GRNs into many small-scale modular subnetworks. Through the ODE-based optimization of all subnetworks in parallel and their asynchronous communications, we can easily obtain the parameters of the whole network. To test the performance of the proposed approach, we used well-known benchmark datasets from Dialogue for Reverse Engineering Assessments and Methods challenge (DREAM), experimentally determined GRN of Escherichia coli and one published dataset that contains more than 10 thousand genes to compare the proposed approach with several popular algorithms on the same high-performance computing environments in terms of both accuracy and time complexity. The numerical results demonstrate that our parallel algorithm exhibits obvious superiority in inferring large-scale GRNs.
In-line print defect inspection system based on parallelized algorithms
NASA Astrophysics Data System (ADS)
Lv, Chao; Zhou, Hongjun
2015-03-01
The core algorithm of an on-line print defect detection system is template matching. In this paper, we introduce an edge-based template matching method that uses Canny's edge detection to extract edge information and perform the matching. For all of the detection algorithms, the most difficult problem is execution time. To reduce execution time and improve efficiency, we introduce and compare four approaches: a pyramidal algorithm, a multicore and multi-threading algorithm based on OpenMP, a parallel algorithm based on the Intel AVX instruction set, and GPU computing based on the OpenCL model. The results reveal the different characteristics of each approach, so that the best one can be chosen for a given system.
A divide-and-inner product parallel algorithm for polynomial evaluation
Hu, Jie; Li, Lei; Nakamura, Tadao
1994-12-31
In this paper, a divide-and-inner product parallel algorithm for evaluating a polynomial of degree N (N+1=KL) on a MIMD computer is presented. It needs 2K + log{sub 2}L steps to evaluate a polynomial of degree N in parallel on L+1 processors (L{<=}2K-2log{sub 2}K), which is a decrease of log{sub 2}L steps as compared with the L-order Horner's method, and a decrease of (2log{sub 2}L){sup 1/2} steps as compared with some other MIMD algorithms. The new algorithm is simple in structure and easy to implement.
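The general idea of splitting the N+1 = KL coefficients into L blocks of K, evaluating the blocks independently, and combining them with an inner product can be sketched as follows. This is a sequential Python illustration under assumed names (`horner`, `divide_and_inner_product`); it does not reproduce the paper's processor schedule or its step counts.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Horner's rule."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result

def divide_and_inner_product(coeffs, x, K):
    """Split coeffs (lowest degree first) into blocks of K, evaluate each
    block independently, then combine the results with an inner product."""
    if len(coeffs) % K != 0:
        raise ValueError("number of coefficients must be a multiple of K")
    L = len(coeffs) // K
    # Each block evaluation is independent and could run on its own processor.
    block_values = [horner(coeffs[j * K:(j + 1) * K], x) for j in range(L)]
    xK = x ** K
    powers = [1]
    for _ in range(L - 1):
        powers.append(powers[-1] * xK)
    # Inner product; a parallel machine could use a log2(L)-depth tree sum.
    return sum(v * w for v, w in zip(block_values, powers))

coeffs = [3, -1, 4, 1, -5, 9, 2, -6]   # degree N = 7, so N + 1 = 8 = K * L
value = divide_and_inner_product(coeffs, 2, K=4)
```

With K = 4 and L = 2 the polynomial is written as q0(x) + x^4 q1(x), where q0 and q1 are cubic polynomials evaluated separately.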
Huang, Yu; Guo, Feng; Li, Yongling; Liu, Yufeng
2015-01-01
Parameter estimation for fractional-order chaotic systems is an important issue in fractional-order chaotic control and synchronization and could be essentially formulated as a multidimensional optimization problem. A novel algorithm called quantum parallel particle swarm optimization (QPPSO) is proposed to solve the parameter estimation for fractional-order chaotic systems. The parallel characteristic of quantum computing is used in QPPSO. This characteristic increases the calculation of each generation exponentially. The behavior of particles in quantum space is restrained by the quantum evolution equation, which consists of the current rotation angle, individual optimal quantum rotation angle, and global optimal quantum rotation angle. Numerical simulation based on several typical fractional-order systems and comparisons with some typical existing algorithms show the effectiveness and efficiency of the proposed algorithm. PMID:25603158
Programming environment for parallel vision algorithms. Annual report, February 1986-February 1987
Brown, C.
1987-02-01
During the second year of the award period, the Computer Science Department of the University of Rochester continued work in: 1) systems support algorithms, 2) the Butterfly programming environment, and 3) vision applications. This research produced several internal and external reports as well as much exportable code. The University of Rochester also employed DARPA Parallel Architecture Benchmark problems to test different algorithms using four different Butterfly programming environments. These tests produced several interesting results and demonstrated that the Butterfly architecture is a flexible general-purpose architecture that can be effectively programmed by non-experts, using tools developed at BBN and Rochester. The University of Rochester is continuing to study the issues and concerns surrounding the effective implementation of parallel algorithms.
Wiese, Kay C; Hendriks, Andrew; Deschênes, Alain; Ben Youssef, Belgacem
2005-09-01
This paper presents a fully parallel version of RnaPredict, a genetic algorithm (GA) for RNA secondary structure prediction. The research presented here builds on previous work and examines the impact of three different pseudorandom number generators (PRNGs) on the GA's performance. The three generators tested are the C standard library PRNG RAND, a parallelized multiplicative congruential generator (MCG), and a parallelized Mersenne Twister (MT). A fully parallel version of RnaPredict using the Message Passing Interface (MPI) was implemented on a 128-node Beowulf cluster. The PRNG comparison tests were performed with known structures whose sequences are 118, 122, 468, 543, and 556 nucleotides in length. The effects of the PRNGs are investigated and the predicted structures are compared to known structures. Results indicate that P-RnaPredict demonstrated good prediction accuracy, particularly so for shorter sequences.
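For readers unfamiliar with multiplicative congruential generators, the recurrence is x_{k+1} = a * x_k mod m. The Python sketch below uses the well-known "minimal standard" parameters a = 16807 and m = 2^31 - 1 and a leapfrog scheme for splitting one stream across processes. Both choices are assumptions made for illustration; the abstract does not state which parameters or which parallelization scheme RnaPredict uses.

```python
M = 2**31 - 1      # modulus (assumed: "minimal standard" generator)
A = 16807          # multiplier (assumed)

def mcg_serial(seed, count):
    """Return the first `count` values of x <- A * x mod M."""
    out, x = [], seed
    for _ in range(count):
        x = (A * x) % M
        out.append(x)
    return out

def mcg_leapfrog(seed, rank, nprocs, count):
    """Values rank, rank + nprocs, rank + 2*nprocs, ... of the serial stream."""
    x = (seed * pow(A, rank + 1, M)) % M    # first value owned by this rank
    stride = pow(A, nprocs, M)              # advances nprocs steps at once
    out = []
    for _ in range(count):
        out.append(x)
        x = (stride * x) % M
    return out

serial = mcg_serial(1, 12)
streams = [mcg_leapfrog(1, r, 4, 3) for r in range(4)]
# Interleave the per-rank streams back into serial order.
merged = [streams[i % 4][i // 4] for i in range(12)]
```

Because every rank draws a disjoint subsequence of the same stream, the interleaved parallel output matches the serial generator exactly, which makes parallel runs reproducible.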
[Parallel PLS algorithm using MapReduce and its application in spectral modeling].
Yang, Hui-Hua; Du, Ling-Ling; Li, Ling-Qiao; Tang, Tian-Biao; Guo, Tuo; Liang, Qiong-Lin; Wang, Yi-Ming; Luo, Guo-An
2012-09-01
Partial least squares (PLS) has been widely used in spectral analysis and modeling, but it is computation-intensive and time-consuming when dealing with massive data. To solve this problem effectively, a novel parallel PLS using MapReduce is proposed, which consists of two procedures: the parallelization of data standardization and the parallelization of principal component computation. Using NIR spectral modeling as an example, experiments were conducted on a Hadoop cluster, which is a collection of ordinary computers. The experimental results demonstrate that the proposed parallel PLS algorithm can handle massive spectra, significantly cuts down the modeling time, achieves a nearly linear speedup, and can be easily scaled up. PMID:23240405
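The first of the two procedures, parallel data standardization, maps naturally onto a map and reduce pair: each mapper emits per-column partial sums for its chunk, and the reducer adds them. The Python sketch below imitates that pattern in a single process with `map` and `functools.reduce`. It is an illustration under assumed function names, not the authors' Hadoop implementation.

```python
from functools import reduce
import math

def map_partial_stats(chunk):
    """Map step: per-column (count, sum, sum of squares) for one chunk of rows."""
    cols = len(chunk[0])
    sums = [sum(row[j] for row in chunk) for j in range(cols)]
    squares = [sum(row[j] ** 2 for row in chunk) for j in range(cols)]
    return len(chunk), sums, squares

def reduce_stats(a, b):
    """Reduce step: merge two partial results by element-wise addition."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])

def standardize(chunks):
    """Scale every column to zero mean and unit (population) variance."""
    n, sums, squares = reduce(reduce_stats, map(map_partial_stats, chunks))
    means = [s / n for s in sums]
    # Assumes no column is constant, so every standard deviation is nonzero.
    stds = [math.sqrt(sq / n - m * m) for sq, m in zip(squares, means)]
    return [[[(v - m) / s for v, m, s in zip(row, means, stds)]
             for row in chunk] for chunk in chunks]

chunks = [[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]]
scaled = standardize(chunks)
```

Only three small vectors per chunk cross the map and reduce boundary, so the communication cost is independent of the number of spectra in each chunk.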
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
Parallel Fock matrix construction with distributed shared memory model for the FMO-MO method.
Umeda, Hiroaki; Inadomi, Yuichi; Watanabe, Toshio; Yagi, Toru; Ishimoto, Takayoshi; Ikegami, Tsutomu; Tadano, Hiroto; Sakurai, Tetsuya; Nagashima, Umpei
2010-10-01
A parallel Fock matrix construction program for FMO-MO method has been developed with the distributed shared memory model. To construct a large-sized Fock matrix during FMO-MO calculations, a distributed parallel algorithm was designed to make full use of local memory to reduce communication, and was implemented on the Global Array toolkit. A benchmark calculation for a small system indicates that the parallelization efficiency of the matrix construction portion is as high as 93% at 1,024 processors. A large FMO-MO application on the epidermal growth factor receptor (EGFR) protein (17,246 atoms and 96,234 basis functions) was also carried out at the HF/6-31G level of theory, with the frontier orbitals being extracted by a Sakurai-Sugiura eigensolver. It takes 11.3 h for the FMO calculation, 49.1 h for the Fock matrix construction, and 10 min to extract 94 eigen-components on a PC cluster system using 256 processors.
Distributed edge detection algorithm based on wavelet transform for wireless video sensor network
NASA Astrophysics Data System (ADS)
Li, Qiulin; Hao, Qun; Song, Yong; Wang, Dongsheng
2010-12-01
Edge detection algorithms are critical to image processing and computer vision. Traditional edge detection algorithms are not suitable for wireless video sensor networks (WVSNs), in which the nodes have limited computational capability and resources. In this paper, a distributed edge detection algorithm based on the wavelet transform and designed for WVSNs is proposed. The wavelet transform decomposes the image into several parts, which are then assigned to different nodes over the wireless network. Each node performs edge detection on its sub-image, and all results are sent to the sink node, where fusion and synthesis, including image binarization and edge connection, are executed to produce the final edge image. A lifting scheme and a parallel distributed algorithm are adopted to improve efficiency while decreasing computational complexity. Experimental results show that this method achieves higher efficiency and better results.
Distributed edge detection algorithm based on wavelet transform for wireless video sensor network
NASA Astrophysics Data System (ADS)
Li, Qiulin; Hao, Qun; Song, Yong; Wang, Dongsheng
2011-05-01
Edge detection algorithms are critical to image processing and computer vision. Traditional edge detection algorithms are not suitable for wireless video sensor networks (WVSNs), in which the nodes have limited computational capability and resources. In this paper, a distributed edge detection algorithm based on the wavelet transform and designed for WVSNs is proposed. The wavelet transform decomposes the image into several parts, which are then assigned to different nodes over the wireless network. Each node performs edge detection on its sub-image, and all results are sent to the sink node, where fusion and synthesis, including image binarization and edge connection, are executed to produce the final edge image. A lifting scheme and a parallel distributed algorithm are adopted to improve efficiency while decreasing computational complexity. Experimental results show that this method achieves higher efficiency and better results.
NASA Astrophysics Data System (ADS)
Katouda, Michio; Nagase, Shigeru
An efficient parallel algorithm is developed for second-order Møller-Plesset perturbation theory with the resolution-of-identity approximation of two-electron repulsion integrals (RI-MP2) to perform MP2 energy calculations of large molecules on distributed memory processors. Benchmark calculations are carried out for taxol (C47H51NO14), valinomycin (C54H90N6O18), and two-layer nanographene sheets (C96H24)2, which show the high parallel efficiency of the developed algorithm.
NASA Astrophysics Data System (ADS)
Ghizawi, Nidal Awni
Computational Fluid Dynamics problems of engineering interest are among the most demanding scientific problems in terms of the massive computational resources they require. Only parallel architecture computers offer the promise of providing orders of magnitude greater computational power. A common feature of the currently available implicit flow solvers for the compressible Navier-Stokes equations is that the solution for a multi-dimensional problem is obtained by the solution of a set of dependent problems which must be computed in series. In this study, a lower upper cycle independent (LUCI) implicit parallel processing algorithm for solving the compressible Navier-Stokes equations is proposed. A characteristic feature of this algorithm is that the solution for a multi-dimensional problem is obtained by the superposition of the solution of a set of independent problems which, therefore, enhances its parallel processing functionality. The accuracy and stability of this algorithm are carefully analyzed and compared with those of other algorithms. Flow computations using the LUCI algorithm are performed for two test cases which show the symmetry preserving property of this algorithm and demonstrate its accuracy. Through employing the principle of pseudo-parallelism, effects of domain decomposition on the stability and convergence of the LUCI and the Symmetric Successive Over-Relaxation (SSOR) schemes (representative of cycle dependent implicit schemes) are analyzed and quantified. Parallel implementation details of the LUCI (in two VERSIONS: I and II) and the SSOR (in VERSION I) schemes using the standard (portable) Message Passing Interface (MPI) on two computational platforms are given. These platforms are: Lewis Advanced Cluster Environment (LACE) which is an example of Network of Workstations (NOWs), and the Ohio Supercomputer CRAY T3D massive parallel computing environment. Parallel performance results indicate that VERSION I of the LUCI scheme is superior to
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1999-01-01
The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
NASA Astrophysics Data System (ADS)
Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh
2015-07-01
This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for the current class of multicore architectures. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism; however, when it comes to 3D data migration, the resource requirements of the algorithm increase with the data size. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute-intensive part of the Kirchhoff depth migration algorithm is the calculation of traveltime tables, due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post- and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series supercomputer. Optimization, performance and scalability experiment results, along with the migration outcome, show the effectiveness of the parallel algorithm.
Lin, Lin; Yang, Chao; Lu, Jiangfeng; Ying, Lexing; E, Weinan
2009-09-25
We present an efficient parallel algorithm and its implementation for computing the diagonal of $H^{-1}$ where $H$ is a 2D Kohn-Sham Hamiltonian discretized on a rectangular domain using a standard second order finite difference scheme. This type of calculation can be used to obtain an accurate approximation to the diagonal of a Fermi-Dirac function of $H$ through a recently developed pole-expansion technique \cite{LinLuYingE2009}. The diagonal elements are needed in electronic structure calculations for quantum mechanical systems \cite{HohenbergKohn1964,KohnSham1965,DreizlerGross1990}. We show how an elimination tree is used to organize the parallel computation and how synchronization overhead is reduced by passing data level by level along this tree using the technique of local buffers and relative indices. We analyze the performance of our implementation by examining its load balance and communication overhead. We show that our implementation exhibits excellent weak scaling on a large-scale high performance distributed parallel machine. When compared with the standard approach for evaluating the diagonal of a Fermi-Dirac function of a Kohn-Sham Hamiltonian associated with a 2D electron quantum dot, the new pole-expansion technique that uses our algorithm to compute the diagonal of $(H-z_i I)^{-1}$ for a small number of poles $z_i$ is much faster, especially when the quantum dot contains many electrons.
Parallel implementation of the time-evolving block decimation algorithm for the Bose-Hubbard model
NASA Astrophysics Data System (ADS)
Urbanek, Miroslav; Soldán, Pavel
2016-02-01
A system of ultracold atoms in an optical lattice represents a powerful experimental setup for testing the fundamentals of quantum mechanics. While its microscopic interaction mechanisms are well understood, the system behavior for a moderate number of particles is difficult to simulate due to a high dimension of its many-body space. This article presents TEBDOL, a parallel implementation of the time-evolving block decimation (TEBD) algorithm that can efficiently simulate time evolution of a one-dimensional chain of atoms in optical lattices. We investigate the parallelization strategy and the strong and weak scaling with the number of processes.
A parallel finite volume algorithm for large-eddy simulation of turbulent flows
NASA Astrophysics Data System (ADS)
Bui, Trong Tri
1998-11-01
A parallel unstructured finite volume algorithm is developed for large-eddy simulation of compressible turbulent flows. Major components of the algorithm include piecewise linear least-square reconstruction of the unknown variables, trilinear finite element interpolation for the spatial coordinates, Roe flux difference splitting, and second-order MacCormack explicit time marching. The computer code is designed from the start to take full advantage of the additional computational capability provided by the current parallel computer systems. Parallel implementation is done using the message passing programming model and message passing libraries such as the Parallel Virtual Machine (PVM) and Message Passing Interface (MPI). The development of the numerical algorithm is presented in detail. The parallel strategy and issues regarding the implementation of a flow simulation code on the current generation of parallel machines are discussed. The results from parallel performance studies show that the algorithm is well suited for parallel computer systems that use the message passing programming model. Nearly perfect parallel speedup is obtained on MPP systems such as the Cray T3D and IBM SP2. Performance comparison with the older supercomputer systems such as the Cray YMP show that the simulations done on the parallel systems are approximately 10 to 30 times faster. The results of the accuracy and performance studies for the current algorithm are reported. To validate the flow simulation code, a number of Euler and Navier-Stokes simulations are done for internal duct flows. Inviscid Euler simulation of a very small amplitude acoustic wave interacting with a shock wave in a quasi-1D convergent-divergent nozzle shows that the algorithm is capable of simultaneously tracking the very small disturbances of the acoustic wave and capturing the shock wave. Navier-Stokes simulations are made for fully developed laminar flow in a square duct, developing laminar flow in a
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy
Tian, Yuling; Zhang, Hongxian
2016-01-01
For the purposes of information retrieval, users must find highly relevant documents within a system (often a quite large one comprising many individual documents) based on an input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor and a hot research topic. Several rank-learning methods based on machine learning techniques already exist that can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others with respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm effectively and rapidly identifies optimal ranking functions. PMID:27487242
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy.
Tian, Yuling; Zhang, Hongxian
2016-01-01
For the purposes of information retrieval, users must find highly relevant documents from within a system (and often a quite large one comprised of many individual documents) based on input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor, and a hot research topic-there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others in respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions. PMID:27487242
pSIN: A scalable, Parallel algorithm for Seismic INterferometry of large-N ambient-noise data
NASA Astrophysics Data System (ADS)
Chen, Po; Taylor, Nicholas J.; Dueker, Ken G.; Keifer, Ian S.; Wilson, Andra K.; McGuffy, Casey L.; Novitsky, Christopher G.; Spears, Alec J.; Holbrook, W. Steven
2016-08-01
Seismic interferometry is a technique for extracting deterministic signals (i.e., ambient-noise Green's functions) from recordings of ambient-noise wavefields through cross-correlation and other related signal processing techniques. The extracted ambient-noise Green's functions can be used in ambient-noise tomography for constructing seismic structure models of the Earth's interior. The amount of calculations involved in the seismic interferometry procedure can be significant, especially for ambient-noise datasets collected by large seismic sensor arrays (i.e., "large-N" data). We present an efficient parallel algorithm, named pSIN (Parallel Seismic INterferometry), for solving seismic interferometry problems on conventional distributed-memory computer clusters. The design of the algorithm is based on a two-dimensional partition of the ambient-noise data recorded by a seismic sensor array. We pay special attention to the balance of the computational load, inter-process communication overhead and memory usage across all MPI processes and we minimize the total number of I/O operations. We have tested the algorithm using a real ambient-noise dataset and obtained a significant amount of savings in processing time. Scaling tests have shown excellent strong scalability from 80 cores to over 2000 cores.
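The two ingredients named in this abstract, pairwise cross-correlation and a two-dimensional partition of the work, can be sketched as follows. This is a minimal serial illustration, not the pSIN implementation: the function names `cross_correlate` and `pair_blocks`, the time-domain correlation, and the square-tile layout are all assumptions made for the example, and no MPI is used.

```python
def cross_correlate(a, b, max_lag):
    """Return {lag: sum_t a[t] * b[t + lag]} for lags in [-max_lag, max_lag].

    Plain time-domain correlation; production codes typically use FFTs.
    """
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t in range(len(a)):
            u = t + lag
            if 0 <= u < len(b):
                total += a[t] * b[u]
        out[lag] = total
    return out


def pair_blocks(n_stations, block):
    """Split the upper triangle of the station-pair matrix into square tiles.

    Each tile is a list of (i, j) station pairs with i < j. In a parallel
    run, each tile could be handed to a different process (hypothetical
    layout, chosen only to illustrate a 2D partition).
    """
    tiles = []
    for i0 in range(0, n_stations, block):
        for j0 in range(i0, n_stations, block):
            pairs = [(i, j)
                     for i in range(i0, min(i0 + block, n_stations))
                     for j in range(j0, min(j0 + block, n_stations))
                     if i < j]
            if pairs:
                tiles.append(pairs)
    return tiles


if __name__ == "__main__":
    # Second trace is the first one delayed by one sample.
    cc = cross_correlate([0, 1, 0, 0], [0, 0, 1, 0], max_lag=2)
    print("lag with the largest correlation:", max(cc, key=cc.get))
    print("tiles for 4 stations, block size 2:", pair_blocks(4, 2))
```

Tiling the pair matrix means a process needs only the traces of the stations in its tile's rows and columns, which is one plausible way to limit both memory use and I/O.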
NASA Astrophysics Data System (ADS)
Wu, J.; Yang, Y.; Luo, Q.; Wu, J.
2012-12-01
This study presents a new hybrid multi-objective evolutionary algorithm, the niched Pareto tabu search combined with a genetic algorithm (NPTSGA), in which the global search ability of niched Pareto tabu search (NPTS) is improved by the diversification of candidate solutions arising from the evolving population of the nondominated sorting genetic algorithm II (NSGA-II). The NPTSGA is also coupled with the commonly used groundwater flow and transport codes, MODFLOW and MT3DMS, for multi-objective optimal design of groundwater remediation systems. The proposed methodology is then applied to a large-scale field groundwater remediation system for cleanup of a large trichloroethylene (TCE) plume at the Massachusetts Military Reservation (MMR) in Cape Cod, Massachusetts. Furthermore, a master-slave (MS) parallelization scheme based on the Message Passing Interface (MPI) is incorporated into the NPTSGA to carry out objective function evaluations in a distributed processor environment, which can greatly improve the efficiency of the NPTSGA in finding Pareto-optimal solutions to the real-world application. This study shows that the MS parallel NPTSGA, in comparison with the original NPTS and NSGA-II, can balance the tradeoff between diversity and optimality of solutions during the search process and is an efficient and effective tool for optimizing the multi-objective design of groundwater remediation systems under complicated hydrogeologic conditions.
Debelak, Rudolf; Tran, Ulrich S
2016-01-01
The analysis of polychoric correlations via principal component analysis and exploratory factor analysis is a well-known approach to determining the dimensionality of ordered categorical items. However, the application of these approaches has been considered critical because the polychoric correlation matrix may be indefinite. A possible solution to this problem is the application of smoothing algorithms. This study used a simulation to compare the effects of three smoothing algorithms, based on the Frobenius norm, on the adaptation of the eigenvalues and eigenvectors, and on minimum-trace factor analysis, on the accuracy of several variants of parallel analysis. We simulated different datasets which varied with respect to the size of the respondent sample, the size of the item set, the underlying factor model, the skewness of the response distributions and the number of response categories in each item. We found that a parallel analysis and principal component analysis of smoothed polychoric and Pearson correlations led to the most accurate results in detecting the number of major factors in simulated datasets when compared to the other methods we investigated. Of the methods used for smoothing polychoric correlation matrices, we recommend the algorithm based on minimum-trace factor analysis.
Parallel flow accumulation algorithms for graphical processing units with application to RUSLE model
NASA Astrophysics Data System (ADS)
Sten, Johan; Lilja, Harri; Hyväluoma, Jari; Westerholm, Jan; Aspnäs, Mats
2016-04-01
Digital elevation models (DEMs) are widely used in the modeling of surface hydrology, which typically includes the determination of flow directions and flow accumulation. The use of high-resolution DEMs increases the accuracy of flow accumulation computation, but as a drawback, the computational time may become excessively long if large areas are analyzed. In this paper we investigate the use of graphical processing units (GPUs) for efficient flow accumulation calculations. We present two new parallel flow accumulation algorithms based on dependency transfer and topological sorting and compare them to previously published flow transfer and indegree-based algorithms. We benchmark the GPU implementations against industry standards, ArcGIS and SAGA. With the flow-transfer D8 flow routing model and binary input data, a speed up of 19 is achieved compared to ArcGIS and 15 compared to SAGA. We show that on GPUs the topological sort-based flow accumulation algorithm leads on average to a speedup by a factor of 7 over the flow-transfer algorithm. Thus a total speed up of the order of 100 is achieved. We test the algorithms by applying them to the Revised Universal Soil Loss Equation (RUSLE) erosion model. For this purpose we present parallel versions of the slope, LS factor and RUSLE algorithms and show that the RUSLE erosion results for an area of 12 km x 24 km containing 72 million cells can be calculated in less than a second. Since flow accumulation is needed in many hydrological models, the developed algorithms may find use in many other applications than RUSLE modeling. The algorithm based on topological sorting is particularly promising for dynamic hydrological models where flow accumulations are repeatedly computed over an unchanged DEM.
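The idea behind flow accumulation by topological sorting can be sketched serially as follows. This is an illustration under stated assumptions, not the GPU algorithm from the paper: the DEM is assumed to be already reduced to a hypothetical `downstream` array in which each cell points to at most one receiving cell (as in D8 routing), and every cell contributes one unit of flow.

```python
from collections import deque


def flow_accumulation(downstream):
    """Accumulate flow over a single-direction (D8-style) flow graph.

    downstream[i] is the index of the cell that cell i drains into,
    or -1 if cell i is an outlet. Returns, for every cell, the number
    of cells draining through it (itself included).
    """
    n = len(downstream)
    indegree = [0] * n
    for d in downstream:
        if d >= 0:
            indegree[d] += 1

    acc = [1] * n  # every cell contributes one unit of its own
    # Cells with no upstream neighbours can be processed immediately.
    ready = deque(i for i in range(n) if indegree[i] == 0)
    while ready:
        cell = ready.popleft()
        d = downstream[cell]
        if d >= 0:
            acc[d] += acc[cell]
            indegree[d] -= 1
            if indegree[d] == 0:  # all upstream contributions received
                ready.append(d)
    return acc


if __name__ == "__main__":
    # Cells 0 and 1 drain into 2, which drains into the outlet 3.
    print(flow_accumulation([2, 2, 3, -1]))  # [1, 1, 3, 4]
```

Because the processing order depends only on the flow directions, it can be computed once and reused while the DEM stays unchanged, which is consistent with the abstract's remark about dynamic hydrological models.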
Fast parallel algorithms and enumeration techniques for partial k-trees
Narayanan, C.
1989-01-01
Recent research by several authors has resulted in a systematic way of developing linear-time sequential algorithms for a host of problems on a fairly general class of graphs variously known as bounded decomposable graphs, graphs of bounded treewidth, partial k-trees, etc. Partial k-trees arise in a variety of real-life applications such as network reliability, VLSI design and database systems, and hence fast sequential algorithms on these graphs have been found to be desirable. The linear-time methodologies were independently developed by Bern, Lawler, and Wong (10), Arnborg and Proskurowski (6), Bodlaender (14), and Courcelle (25). Wimer (89) significantly extended the work of Bern, Lawler and Wong. All of these approaches share the common thread of using dynamic programming on a tree structure. In particular, the methodology of Wimer uses a parse-tree as the data structure. The methodologies claim linear-time algorithms on partial k-trees for fixed k, for a number of combinatorial optimization problems, given the tree structure as input. It is known that obtaining the tree structure is NP-hard. This dissertation investigates three important classes of problems: (1) Developing parallel algorithms for constructing a k-tree embedding, finding a tree decomposition and, most notably, obtaining a parse-tree for a partial k-tree. (2) Developing parallel algorithms for parse-tree computations, testing isomorphism of k-trees, and finding a 2-tree embedding of a cactus. (3) Obtaining techniques for counting vertex/edge subsets satisfying a certain property in some classes of partial k-trees. The parallel algorithms the author has developed are in class NC and are either new or improve upon the existing results of Bodlaender (13). The difference equations he has obtained for counting certain subgraphs are not known in the literature so far.
An Optimal Parallel Algorithm for Constructing a Spanning Forest on Trapezoid Graphs
NASA Astrophysics Data System (ADS)
Honma, Hirotoshi; Masuyama, Shigeru
Given a simple graph G with n vertices, m edges and k connected components, the spanning forest problem is to find a spanning tree for each connected component of G. This problem has applications to the electrical power demand problem, computer network design, circuit analysis, etc. An optimal parallel algorithm for finding a spanning tree on a trapezoid graph was given by Bera et al.; it takes O(log n) time with O(n/log n) processors on the EREW (Exclusive-Read Exclusive-Write) PRAM. Bera et al.'s algorithm is very simple and elegant, and it correctly constructs a spanning tree when the graph is connected. However, their algorithm cannot accept a disconnected graph as input: when it is applied to a disconnected graph, a concurrent write occurs once for each connected component, so it cannot be run on the EREW PRAM. In this paper we present an O(log n) time parallel algorithm with O(n/log n) processors for constructing a spanning forest on a trapezoid graph G on the EREW PRAM, even if G is disconnected.
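To make the problem statement concrete, the following sketch builds a spanning forest of a general graph sequentially with a union-find structure. It is not the EREW PRAM algorithm of the paper and does not exploit the trapezoid structure; the function name `spanning_forest` and the edge-list input are assumptions for illustration only.

```python
def spanning_forest(n, edges):
    """Return the edges of a spanning forest of a graph on vertices 0..n-1.

    Sequential union-find sketch: an edge is kept exactly when it joins
    two vertices that are not yet connected, so the result has n - k
    edges for a graph with k connected components.
    """
    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


if __name__ == "__main__":
    # Two components: {0, 1, 2} (a triangle) and {3, 4}.
    print(spanning_forest(5, [(0, 1), (1, 2), (0, 2), (3, 4)]))
```

The disconnected input in the usage example is exactly the case the abstract singles out: one tree is produced per component.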
Distributed machine learning: Scaling up with coarse-grained parallelism
Provost, F.J.; Hennessy, D.N.
1994-12-31
Machine learning methods are becoming accepted as additions to the biologist's data-analysis tool kit. However, scaling these techniques up to large data sets, such as those in biological and medical domains, is problematic in terms of both the required computational search effort and the required memory (and the detrimental effects of excessive swapping). Our approach to scaling up to large data sets is to take advantage of the ubiquitous workstation networks that are generally available in scientific and engineering environments. This paper introduces the notion of the invariant-partitioning property: for certain evaluation criteria it is possible to partition a data set across multiple processors such that any rule that is satisfactory over the entire data set will also be satisfactory on at least one subset. In addition, by taking advantage of cooperation through interprocess communication, it is possible to build distributed learning algorithms such that only rules that are satisfactory over the entire data set will be learned. We describe a distributed learning system, CorPRL, that takes advantage of the invariant-partitioning property to learn from very large data sets, and present results demonstrating CorPRL's effectiveness in analyzing data from two databases.
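The invariant-partitioning property can be checked on a toy example. The sketch below assumes one particular evaluation criterion, the fraction of covered examples that are positive, which is a choice made for illustration and not necessarily a criterion used by CorPRL; the helper names are hypothetical. For this criterion the property follows because the score on the whole data set is a weighted average of the scores on the subsets, so at least one subset scores no lower than the whole.

```python
def precision(rule, rows):
    """Fraction of the rows covered by `rule` that carry label 1.

    rows is a list of (features, label) pairs. An uncovered subset is
    scored 0.0 here (an assumption made for this sketch).
    """
    covered = [label for features, label in rows if rule(features)]
    return sum(covered) / len(covered) if covered else 0.0


def satisfactory_somewhere(rule, partitions, threshold):
    """True if the rule meets the threshold on at least one subset."""
    return any(precision(rule, part) >= threshold for part in partitions)


if __name__ == "__main__":
    data = [(6, 1), (7, 1), (8, 0), (9, 1), (2, 0), (10, 0)]
    partitions = [data[:3], data[3:]]

    def rule(x):
        return x > 5

    # The rule clears the threshold on the whole data set, so it must
    # also clear it on at least one of the two subsets.
    print(precision(rule, data) >= 0.55)
    print(satisfactory_somewhere(rule, partitions, 0.55))
```

The converse does not hold: a rule can look good on one subset and poor overall, which is why the abstract adds cooperation between processes to filter such rules out.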
Farber, R.M.; Lapedes, A.S.; Rico-Martinez, R.; Kevrekidis, I.G.
1993-12-31
Time-delay mappings constructed using neural networks have proven successful in performing nonlinear system identification; however, because of their discrete nature, their use in bifurcation analysis of continuous-time systems is limited. This shortcoming can be avoided by embedding the neural networks in a training algorithm that mimics a numerical integrator. Both explicit and implicit integrators can be used. The former case is based on repeated evaluations of the network in a feedforward implementation; the latter relies on a recurrent network implementation. Here the algorithms and their implementation on parallel machines (SIMD and MIMD architectures) are discussed.
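The explicit-integrator case can be illustrated with a fourth-order Runge-Kutta step that evaluates the right-hand-side model several times in sequence. This is only a sketch: the abstract does not say which integrator was used, and here a plain Python callable `f` stands in for the trained network (both are assumptions).

```python
def rk4_step(f, x, h):
    """Advance x by one step of size h using classical RK4.

    f is any callable approximating dx/dt = f(x); in the setting of the
    abstract it would be the neural network, evaluated four times.
    """
    k1 = f(x)
    k2 = f(x + 0.5 * h * k1)
    k3 = f(x + 0.5 * h * k2)
    k4 = f(x + h * k3)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(f, x0, h, steps):
    """Apply rk4_step repeatedly and return the final state."""
    x = x0
    for _ in range(steps):
        x = rk4_step(f, x, h)
    return x


if __name__ == "__main__":
    # Stand-in for a trained network: the exact right-hand side dx/dt = -x.
    print(integrate(lambda x: -x, 1.0, 0.1, 10))  # close to exp(-1)
```

Because the step size `h` is explicit, the learned `f` describes a continuous-time vector field rather than a fixed-delay map, which is what makes bifurcation analysis of the continuous system possible.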
Farber, R.M.; Lapedes, A.S.; Rico-Martinez, R.; Kevrekidis, I.G. (Dept. of Chemical Engineering)
1993-01-01
Time-delay mappings constructed using neural networks have proven successful in performing nonlinear system identification; however, because of their discrete nature, their use in bifurcation analysis of continuous-time systems is limited. This shortcoming can be avoided by embedding the neural networks in a training algorithm that mimics a numerical integrator. Both explicit and implicit integrators can be used. The former case is based on repeated evaluations of the network in a feedforward implementation; the latter relies on a recurrent network implementation. Here the algorithms and their implementation on parallel machines (SIMD and MIMD architectures) are discussed.
Farber, R.M.; Lapedes, A.S.; Rico-Martinez, R.; Kevrekidis, I.G.
1993-06-01
Time-delay mappings constructed using neural networks have proven successful in performing nonlinear system identification; however, because of their discrete nature, their use in bifurcation analysis of continuous-time systems is limited. This shortcoming can be avoided by embedding the neural networks in a training algorithm that mimics a numerical integrator. Both explicit and implicit integrators can be used. The former case is based on repeated evaluations of the network in a feedforward implementation; the latter relies on a recurrent network implementation. Here the algorithms and their implementation on parallel machines (SIMD and MIMD architectures) are discussed.
A fast parallel algorithm for determining all roots of a polynomial with real roots
Benor, M.; Feig, E.; Kozen, D.; Tiwari, P.
1988-12-01
Given a polynomial $\rho(z)$ of degree $n$ with $m$-bit integer coefficients and an integer $\mu$, the problem of determining all its roots with error less than $2^{-\mu}$ is considered. It is shown that this problem is in the class NC if $\rho(z)$ has all real roots. Some very interesting properties of a Sturm sequence of a polynomial with distinct real roots are proved and used in the design of a fast parallel algorithm for this problem. Using Newton identities and a novel numerical integration scheme for evaluating a contour integral to high precision, this algorithm determines good approximations to the linear factors of $\rho(z)$.
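As background for the Sturm sequence mentioned above, the sketch below shows the classical use of such a sequence: counting the distinct real roots of a polynomial in an interval by comparing sign changes at the endpoints. It is a sequential, exact-arithmetic illustration and not the NC algorithm of the paper; the function names and the coefficient-list representation (highest degree first) are assumptions.

```python
from fractions import Fraction


def poly_rem(num, den):
    """Remainder of polynomial division (coefficients highest degree first)."""
    num = list(num)
    while len(num) >= len(den):
        factor = num[0] / den[0]
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num.pop(0)  # the leading coefficient is now zero
    while num and num[0] == 0:  # drop any remaining leading zeros
        num.pop(0)
    return num


def sturm_sequence(p):
    """Sturm sequence p, p', -rem(p, p'), ... using exact rationals."""
    p = [Fraction(c) for c in p]
    n = len(p) - 1
    dp = [c * (n - i) for i, c in enumerate(p[:-1])]
    seq = [p, dp]
    while True:
        r = poly_rem(seq[-2], seq[-1])
        if not r:
            return seq
        seq.append([-c for c in r])


def evaluate(p, x):
    """Horner evaluation of p at x."""
    value = Fraction(0)
    for c in p:
        value = value * x + c
    return value


def sign_changes(seq, x):
    """Number of sign changes in the sequence evaluated at x (zeros skipped)."""
    values = [v for v in (evaluate(q, x) for q in seq) if v != 0]
    return sum(1 for a, b in zip(values, values[1:]) if (a > 0) != (b > 0))


def count_real_roots(p, a, b):
    """Distinct real roots of p in (a, b], assuming p(a) != 0 and p(b) != 0."""
    seq = sturm_sequence(p)
    return sign_changes(seq, a) - sign_changes(seq, b)


if __name__ == "__main__":
    cubic = [1, -6, 11, -6]  # (x - 1)(x - 2)(x - 3)
    print(count_real_roots(cubic, 0, 4))               # 3
    print(count_real_roots(cubic, 0, Fraction(3, 2)))  # 1
```

Repeated counting on halved intervals isolates each root, which is the usual starting point for refining approximations to a requested precision.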
NASA Astrophysics Data System (ADS)
Work, Paul R.
1991-12-01
This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms
NASA Astrophysics Data System (ADS)
Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel
2016-04-01
Advances in graphics processing unit technology towards parallel architectures [1], comprising thousands of cores and many parallel threads, provide the hardware foundation for the rapid processing of various parallel applications in seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are established and decades of data are compiled together [2]. Yet many processes in seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, being independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] for assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
NASA Technical Reports Server (NTRS)
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the Cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for high-end computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language, that is, a programming language that is closer to a human's way of thinking than to a machine's. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequential/single-core code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify-Produce (CSP) and Normalize-Transpose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less
NASA Technical Reports Server (NTRS)
Tilton, James C.; Plaza, Antonio J. (Editor); Chang, Chein-I. (Editor)
2008-01-01
The hierarchical image segmentation algorithm (referred to as HSEG) is a hybrid of hierarchical step-wise optimization (HSWO) and constrained spectral clustering that produces a hierarchical set of image segmentations. HSWO is an iterative approach to region growing segmentation in which the optimal image segmentation is found at $N_R$ regions, given a segmentation at $N_R + 1$ regions. HSEG's addition of constrained spectral clustering makes it a computationally intensive algorithm for all but the smallest of images. To counteract this, a computationally efficient recursive approximation of HSEG (called RHSEG) has been devised. Further improvements in processing speed are obtained through a parallel implementation of RHSEG. This chapter describes this parallel implementation and demonstrates its computational efficiency on a Landsat Thematic Mapper test scene.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
Monte Carlo particle transport methods are being considered as a viable option for high-fidelity simulation of nuclear reactors. While Monte Carlo methods offer several potential advantages over deterministic methods, there are a number of algorithmic shortcomings that would prevent their immediate adoption for full-core analyses. In this thesis, algorithms are proposed both to ameliorate the degradation in parallel efficiency typically observed for large numbers of processors and to offer a means of decomposing large tally data that will be needed for reactor analysis. A nearest-neighbor fission bank algorithm was proposed and subsequently implemented in the OpenMC Monte Carlo code. A theoretical analysis of the communication pattern shows that the expected cost is $O(\sqrt{N})$ whereas traditional fission bank algorithms are $O(N)$ at best. The algorithm was tested on two supercomputers, the Intrepid Blue Gene/P and the Titan Cray XK7, and demonstrated nearly linear parallel scaling up to 163,840 processor cores on a full-core benchmark problem. An algorithm for reducing network communication arising from tally reduction was analyzed and implemented in OpenMC. The proposed algorithm groups only particle histories on a single processor into batches for tally purposes; in doing so it prevents all network communication for tallies until the very end of the simulation. The algorithm was tested, again on a full-core benchmark, and shown to reduce network communication substantially. A model was developed to predict the impact of load imbalances on the performance of domain decomposed simulations. The analysis demonstrated that load imbalances in domain decomposed simulations arise from two distinct phenomena: non-uniform particle densities and non-uniform spatial leakage. The dominant performance penalty for domain decomposition was shown to come from these physical effects rather than insufficient network bandwidth or high latency. The model predictions were verified with
Parallelizing Sylvester-like operations on a distributed memory computer
Hu, D.Y.; Sorensen, D.C.
1994-12-31
Discretization of linear operators arising in applied mathematics often leads to matrices with the following structure: $M(x) = (D \otimes A + B \otimes I_n + V)x$, where $x \in \mathbb{R}^{mn}$, $B, D \in \mathbb{R}^{n \times n}$, $A \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{mn \times mn}$; both $D$ and $V$ are diagonal. For notational convenience, the authors assume that both $A$ and $B$ are symmetric. All the results in this paper can be easily extended to the cases with general $A$ and $B$. The linear operator on $\mathbb{R}^{mn}$ defined above can be viewed as a generalization of the Sylvester operator: $S(x) = (I_m \otimes A + B \otimes I_n)x$. The authors therefore refer to it as a Sylvester-like operator, and the schemes discussed in this paper also apply to the Sylvester operator. In this paper, the authors present the SIMD scheme for parallelization of the Sylvester-like operator on a distributed memory computer. This scheme is designed to approach the best possible efficiency by avoiding unnecessary communication among processors.
Eroglu, Duygu Yilmaz; Ozmutlu, H Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job-sequence- and machine-dependent setup times and with the job splitting property. The first contribution of this paper is to introduce novel algorithms that perform splitting and scheduling simultaneously with a variable number of subjobs. We propose a simple chromosome structure constituted by random key numbers in a hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but they create additional difficulty when hybrid factors in local search are implemented. We developed algorithms that adapt the results of local search into the genetic algorithm with a minimum number of relocation operations on the genes' random key numbers. This is the second contribution of the paper. The third contribution is three new MIP models that perform splitting and scheduling simultaneously. The fourth contribution is the implementation of the GAspLAMIP, which lets us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature, and the results validate the effectiveness of the proposed algorithms. PMID:24977204
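The random key idea mentioned in this abstract can be sketched with a small decoder. The encoding used here, integer part of a key for the machine and fractional part for the position in that machine's sequence, is a common random-key convention assumed for illustration; the abstract does not specify GAspLA's chromosome layout, and job splitting is not modelled.

```python
def decode_random_keys(keys):
    """Decode one random key per job into per-machine job sequences.

    Assumed convention (not taken from the paper): int(key) selects the
    machine and the fractional part orders the jobs on that machine.
    Returns {machine: [job, job, ...]}.
    """
    schedule = {}
    # Sorting by the full key orders jobs by fractional part within
    # each machine, because jobs on one machine share the integer part.
    for job, key in sorted(enumerate(keys), key=lambda item: item[1]):
        schedule.setdefault(int(key), []).append(job)
    return schedule


if __name__ == "__main__":
    # Jobs 1 and 3 go to machine 0; jobs 2 and 0 go to machine 1.
    print(decode_random_keys([1.7, 0.2, 1.1, 0.9]))
```

Any vector of keys decodes to a feasible assignment and sequence, so standard crossover and mutation never produce an invalid schedule. The price is that a move found by local search has to be translated back into key values, which is the difficulty the abstract describes.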
NASA Astrophysics Data System (ADS)
Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.
2015-12-01
We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms differ from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption terminates simulation model runs early (for example, before the model calibration period has been completely simulated) when intermediate results indicate that the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model pre-emption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations as these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich as well as MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance
Hendrickson, B.; Plimpton, S.; Attaway, S.; Swegle, J.
1996-09-01
Transient dynamics simulations are commonly used to model phenomena such as car crashes, underwater explosions, and the response of shipping containers to high-speed impacts. Physical objects in such a simulation are typically represented by Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. Fluids (gasoline, water) or fluid-like materials (earth) in the simulation can be modeled using the techniques of smoothed particle hydrodynamics. Implementing a hybrid mesh/particle model on a massively parallel computer poses several difficult challenges. One challenge is to simultaneously parallelize and load-balance both the mesh and particle portions of the computation. A second challenge is to efficiently detect the contacts that occur within the deforming mesh and between mesh elements and particles as the simulation proceeds. These contacts impart forces to the mesh elements and particles which must be computed at each timestep to accurately capture the physics of interest. In this paper we describe new parallel algorithms for smoothed particle hydrodynamics and contact detection which turn out to have several key features in common. Additionally, we describe how to join the new algorithms with traditional parallel finite element techniques to create an integrated particle/mesh transient dynamics simulation. Our approach to this problem differs from previous work in that we use three different parallel decompositions, a static one for the finite element analysis and dynamic ones for particles and for contact detection. We have implemented our ideas in a parallel version of the transient dynamics code PRONTO-3D and present results for the code running on a large Intel Paragon.
Parallel algorithm for dominant points correspondences in robot binocular stereo vision
NASA Technical Reports Server (NTRS)
Al-Tammami, A.; Singh, B.
1993-01-01
This paper presents an algorithm to find the correspondences of points representing dominant features in robot stereo vision. The algorithm consists of two main steps: dominant point extraction and dominant point matching. In the feature extraction phase, the algorithm utilizes the widely used Moravec Interest Operator and two other operators: the Prewitt Operator and a new operator called the Gradient Angle Variance Operator. The Interest Operator in the Moravec algorithm was used to exclude featureless areas and simple edges which are oriented in the vertical, horizontal, and two diagonal directions. It was incorrectly detecting points on edges which are not in the four main directions (vertical, horizontal, and two diagonals). The new algorithm uses the Prewitt operator to exclude featureless areas, so that the Interest Operator is applied only on the edges to exclude simple edges and to leave interesting points. This modification speeds up the extraction process by approximately 5 times. The Gradient Angle Variance (GAV), an operator which calculates the variance of the gradient angle in a window around the point under concern, is then applied on the interesting points to exclude the redundant ones and leave the actual dominant ones. The matching phase is performed after the extraction of the dominant points in both stereo images. The matching starts with dominant points in the left image and does a local search, looking for corresponding dominant points in the right image. The search is geometrically constrained by the epipolar line of the parallel-axes stereo geometry and the maximum disparity of the application environment. If one dominant point in the right image lies in the search area, then it is the corresponding point of the reference dominant point in the left image. A parameter provided by the GAV is thresholded and used as a rough similarity measure to select the corresponding dominant point if there is more than one point in the search area. The correlation is used as
An efficient algorithm for estimating noise covariances in distributed systems
NASA Technical Reports Server (NTRS)
Dee, D. P.; Cohn, S. E.; Ghil, M.; Dalcher, A.
1985-01-01
An efficient computational algorithm for estimating the noise covariance matrices of large linear discrete stochastic-dynamic systems is presented. Such systems arise typically by discretizing distributed-parameter systems, and their size renders computational efficiency a major consideration. The proposed adaptive filtering algorithm is based on the ideas of Belanger, and is algebraically equivalent to his algorithm. The earlier algorithm, however, has computational complexity proportional to p to the 6th, where p is the number of observations of the system state, while the new algorithm has complexity proportional to only p-cubed. Further, the formulation of noise covariance estimation as a secondary filter, analogous to state estimation as a primary filter, suggests several generalizations of the earlier algorithm. The performance of the proposed algorithm is demonstrated for a distributed system arising in numerical weather prediction.
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarría-Miranda, Daniel
2009-05-29
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
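For readers unfamiliar with the kernel being parallelized, the following is a minimal sequential sketch of betweenness centrality on an unweighted graph using Brandes' algorithm. It is not the lock-free parallel method of the abstract above; the function name and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Sequential Brandes betweenness centrality for an unweighted graph.

    `adj` maps each vertex to a list of its neighbours. Scores are summed
    over ordered source/target pairs, so an undirected graph counts each
    unordered pair twice.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: breadth-first search from s, counting shortest paths.
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:               # w seen for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = betweenness_centrality(path_graph)
```

On the three-vertex path graph only the middle vertex lies on a shortest path between two other vertices, so it is the only vertex with a non-zero score.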
Optimized simulations of Olami-Feder-Christensen systems using parallel algorithms
NASA Astrophysics Data System (ADS)
Dominguez, Rachele; Necaise, Rance; Montag, Eric
The sequential nature of the Olami-Feder-Christensen (OFC) model for earthquake simulations limits the benefits of parallel computing approaches because of the frequent communication required between processors. We developed a parallel version of the OFC algorithm for multi-core processors. Our data, even for relatively small system sizes and low numbers of processors, indicate that increasing the number of processors provides significantly faster simulations, producing more efficient results than previous attempts that used network-based Beowulf clusters. Our algorithm optimizes performance by exploiting the multi-core processor architecture, minimizing communication time in contrast to the networked Beowulf-cluster approaches. Our multi-core algorithm is the basis for a new algorithm using GPUs that will drastically increase the number of processors available. Previous studies incorporating realistic structural features of faults into OFC models have revealed spatial and temporal patterns observed in real earthquake systems. The computational advances presented here will allow for studying interacting networks of faults, rather than individual faults, further enhancing our understanding of the relationship between the earth's structure and the triggering process. Support for this project comes from the Chenery Research Fund, the Rashkind Family Endowment, the Walter Williams Craigie Teaching Endowment, and the Schapiro Undergraduate Research Fellowship.
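To make the sequential dependence concrete, here is a minimal serial sketch of the OFC relaxation rule on a square lattice with open boundaries. It is not the authors' multi-core code: the function name, the values of `alpha` and `threshold`, and the choice to topple sites one at a time (rather than in synchronous sweeps) are assumptions made for illustration.

```python
def relax(stress, alpha=0.2, threshold=1.0):
    """Topple unstable sites until the whole lattice is below threshold.

    `stress` is a square list of lists and is modified in place. A site at
    or above `threshold` resets to zero and passes `alpha` times its stress
    to each of its (up to four) neighbours; stress leaving the lattice is
    lost. Returns the number of topplings (the avalanche size).
    """
    n = len(stress)
    topplings = 0
    unstable = [(i, j) for i in range(n) for j in range(n)
                if stress[i][j] >= threshold]
    while unstable:
        i, j = unstable.pop()
        if stress[i][j] < threshold:
            continue  # stale entry: this site has already toppled
        released = stress[i][j]
        stress[i][j] = 0.0
        topplings += 1
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n:
                stress[ni][nj] += alpha * released
                if stress[ni][nj] >= threshold:
                    unstable.append((ni, nj))
    return topplings


single = [[0.5, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 0.5]]
single_size = relax(single)

chain = [[0.5, 0.9, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 0.5]]
chain_size = relax(chain)
```

Each toppling changes the state of its neighbours, which is why an avalanche is hard to split across processors without frequent communication.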
A Parallel, Finite-Volume Algorithm for Large-Eddy Simulation of Turbulent Flows
NASA Technical Reports Server (NTRS)
Bui, Trong T.
1999-01-01
A parallel, finite-volume algorithm has been developed for large-eddy simulation (LES) of compressible turbulent flows. This algorithm includes piecewise linear least-square reconstruction, trilinear finite-element interpolation, Roe flux-difference splitting, and second-order MacCormack time marching. Parallel implementation is done using the message-passing programming model. In this paper, the numerical algorithm is described. To validate the numerical method for turbulence simulation, LES of fully developed turbulent flow in a square duct is performed for a Reynolds number of 320 based on the average friction velocity and the hydraulic diameter of the duct. Direct numerical simulation (DNS) results are available for this test case, and the accuracy of this algorithm for turbulence simulations can be ascertained by comparing the LES solutions with the DNS results. The effects of grid resolution, upwind numerical dissipation, and subgrid-scale dissipation on the accuracy of the LES are examined. Comparison with DNS results shows that the standard Roe flux-difference splitting dissipation adversely affects the accuracy of the turbulence simulation. For accurate turbulence simulations, only 3-5 percent of the standard Roe flux-difference splitting dissipation is needed.
Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan
2016-01-01
A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data. PMID:27304987
NASA Technical Reports Server (NTRS)
Krasteva, Denitza T.
1998-01-01
Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.). This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.
A vector-multiplication dominated parallel algorithm for the computation of real eigenvalue spectra
NASA Astrophysics Data System (ADS)
Clint, M.
1982-06-01
In order to exploit effectively the power of array and vector processors for the numerical solution of linear algebraic problems it is desirable to express algorithms principally in terms of vector and matrix operations. Algorithms which manipulate vectors and matrices at component level are best suited for execution on single processor hardware. Often, however, it is difficult, if not impossible, to construct efficient versions of such algorithms which are suitable for execution on parallel hardware. A method for computing the eigenvalues of real unsymmetric matrices with real eigenvalue spectra is presented. The method is an extension of the one described in ref. [1]. The algorithm makes heavy use of vector inner product evaluations. The manipulation of individual components of vectors and matrices is kept to a minimum. Essentially, the method involves the construction of a sequence of biorthogonal transformation matrices the combined effect of which is to diagonalise the matrix. The eigenvalues of the matrix are the diagonal elements of the final diagonalised form. If the eigenvectors of the matrix are also required the algorithm may be extended in a straightforward way. The effectiveness of the algorithm is demonstrated by an application of a sequential version to several small matrices and some comments are made about the time complexity of the parallel version.
Renaut, R.; He, Q.
1994-12-31
A new parallel iterative algorithm for unconstrained optimization by multisplitting is proposed. In this algorithm the original problem is split into a set of small optimization subproblems which are solved using well known sequential algorithms. These algorithms are iterative in nature, e.g. the DFP variable metric method. Here the authors use sequential algorithms based on an inexact subspace search, which is an extension to the usual idea of an inexact line search. Essentially the idea of the inexact line search for nonlinear minimization is that at each iteration the authors only find an approximate minimum in the line search direction. Hence by inexact subspace search, they mean that, instead of finding the minimum of the subproblem at each iteration, they do an incomplete downhill search to give an approximate minimum. Some convergence and numerical results for this algorithm will be presented. Further, the original theory will be generalized to the situation with a singular Hessian. Applications for nonlinear least squares problems will be presented. Experimental results will be presented for implementations on an Intel iPSC/860 Hypercube with 64 nodes as well as on the Intel Paragon.
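The ordinary inexact line search that the abstract above builds on can be illustrated with a backtracking search under the Armijo sufficient-decrease condition. This is one common choice and an assumption here: the abstract does not say which acceptance rule the authors use, and the sketch does not implement their subspace extension. The function name and the parameters `shrink` and `c` are hypothetical.

```python
def backtracking_line_search(f, x, grad, direction,
                             step=1.0, shrink=0.5, c=1e-4, max_iter=50):
    """Return (step, trial_point) satisfying the Armijo condition.

    The step is halved until f decreases by at least c * step * slope,
    where slope is the directional derivative along `direction`. The result
    is only an approximate minimizer along the line, which is the point of
    an inexact search.
    """
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad, direction))
    for _ in range(max_iter):
        trial = [xi + step * di for xi, di in zip(x, direction)]
        if f(trial) <= fx + c * step * slope:
            return step, trial
        step *= shrink
    raise RuntimeError("no acceptable step found; is `direction` a descent direction?")


def quadratic(p):
    """Simple ill-scaled test function x0^2 + 10 * x1^2."""
    return p[0] ** 2 + 10.0 * p[1] ** 2


start = [1.0, 1.0]
gradient = [2.0, 20.0]                    # gradient of `quadratic` at `start`
descent = [-g for g in gradient]          # steepest-descent direction
accepted_step, new_point = backtracking_line_search(quadratic, start, gradient, descent)
```

The search stops at the first step length that gives sufficient decrease instead of locating the exact minimum along the direction.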
Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX, Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace-collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation-based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, which convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.
NASA Astrophysics Data System (ADS)
de Lorenzi, Flavio; Debattista, Victor P.; Gerhard, Ortwin; Sambhus, Niranjan
2007-03-01
We describe a made-to-measure (M2M) algorithm for constructing N-particle models of stellar systems from observational data (χ²M2M), extending earlier ideas by Syer & Tremaine. The algorithm properly accounts for observational errors, is flexible, and can be applied to various systems and geometries. We implement this algorithm in a parallel code NMAGIC and carry out a sequence of tests to illustrate its power and performance. (i) We reconstruct an isotropic Hernquist model from density moments and projected kinematics and recover the correct differential energy distribution and intrinsic kinematics. (ii) We build a self-consistent oblate three-integral maximum rotator model and compare how the distribution function is recovered from integral field and slit kinematic data. (iii) We create a non-rotating and a figure rotating triaxial stellar particle model, reproduce the projected kinematics of the figure rotating system by a non-rotating system of the same intrinsic shape, and illustrate the signature of pattern rotation in this model. From these tests, we comment on the dependence of the results from χ²M2M on the initial model, the geometry, and the amount of available data.
Parallel algorithm of real-time infrared image restoration based on total variation theory
NASA Astrophysics Data System (ADS)
Zhu, Ran; Li, Miao; Long, Yunli; Zeng, Yaoyuan; An, Wei
2015-10-01
Image restoration is a necessary preprocessing step for infrared remote sensing applications. Traditional methods remove the noise but penalize the gradients corresponding to edges too heavily. Image restoration techniques based on variational approaches can solve this over-smoothing problem owing to their well-defined mathematical modeling of the restoration procedure. The total variation (TV) of the infrared image is introduced as an L1 regularization term added to the objective energy functional. This converts the restoration process into the optimization of a functional consisting of a fidelity term to the image data plus a regularization term. Infrared image restoration with the TV-L1 model makes full use of the acquired remote sensing data and preserves information at edges caused by clouds. The numerical implementation of the algorithm is presented in detail. Analysis indicates that the structure of this algorithm can be easily parallelized. Therefore a parallel implementation of the TV-L1 filter based on a multicore architecture with shared memory is proposed for infrared real-time remote sensing systems. Massive computation of image data is performed in parallel by cooperating threads running simultaneously on multiple cores. Several groups of synthetic infrared image data are used to validate the feasibility and effectiveness of the proposed parallel algorithm. A quantitative analysis of the restored image quality compared to the input image is presented. Experimental results show that the TV-L1 filter can restore the varying background image reasonably, and that its performance can meet the requirement of real-time image processing.
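The abstract above does not state the energy functional explicitly. For orientation, the form in which the TV-L1 model is usually written is given below; the symbols $f$ (observed image), $u$ (restored image), $\Omega$ (image domain), and $\lambda$ (weight balancing fidelity against regularization) are assumptions, and the paper's own notation and weighting may differ.

```latex
\min_{u} \; E(u) = \int_{\Omega} \lvert \nabla u \rvert \, dx
  \; + \; \lambda \int_{\Omega} \lvert u - f \rvert \, dx
```

The first integral is the total variation regularizer, which penalizes oscillations while still permitting sharp edges; the second is the L1 fidelity term tying $u$ to the data.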
Parallel CFD Algorithms for Aerodynamical Flow Solvers on Unstructured Meshes. Parts 1 and 2
NASA Technical Reports Server (NTRS)
Barth, Timothy J.; Kwak, Dochan (Technical Monitor)
1995-01-01
The Advisory Group for Aerospace Research and Development (AGARD) has requested my participation in the lecture series entitled Parallel Computing in Computational Fluid Dynamics to be held at the von Karman Institute in Brussels, Belgium on May 15-19, 1995. In addition, a request has been made from the US Coordinator for AGARD at the Pentagon for NASA Ames to hold a repetition of the lecture series on October 16-20, 1995. I have been asked to be a local coordinator for the Ames event. All AGARD lecture series events have attendance limited to NATO allied countries. A brief of the lecture series is provided in the attached enclosure. Specifically, I have been asked to give two lectures of approximately 75 minutes each on the subject of parallel solution techniques for the fluid flow equations on unstructured meshes. The title of my lectures is "Parallel CFD Algorithms for Aerodynamical Flow Solvers on Unstructured Meshes" (Parts I-II). The contents of these lectures will be largely review in nature and will draw upon previously published work in this area. Topics of my lectures will include: (1) Mesh partitioning algorithms. Recursive techniques based on coordinate bisection, Cuthill-McKee level structures, and spectral bisection. (2) Newton's method for large scale CFD problems. Size and complexity estimates for Newton's method, modifications for ensuring global convergence. (3) Techniques for constructing the Jacobian matrix. Analytic and numerical techniques for Jacobian matrix-vector products, constructing the transposed matrix, extensions to optimization and homotopy theories. (4) Iterative solution algorithms. Practical experience with GMRES and BiCGSTAB matrix solvers. (5) Parallel matrix preconditioning. Incomplete Lower-Upper (ILU) factorization, domain-decomposed ILU, approximate Schur complement strategies.
Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)
NASA Technical Reports Server (NTRS)
Riley, Christopher J.; Cheatwood, F. McNeil
1997-01-01
The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.
NASA Astrophysics Data System (ADS)
Chen, Jian-Lin; Li, Lei; Wang, Lin-Yuan; Cai, Ai-Long; Xi, Xiao-Qi; Zhang, Han-Ming; Li, Jian-Xin; Yan, Bin
2015-02-01
The projection matrix model is used to describe the physical relationship between reconstructed object and projection. Such a model has a strong influence on projection and backprojection, two vital operations in iterative computed tomographic reconstruction. The distance-driven model (DDM) is a state-of-the-art technology that simulates forward and back projections. This model has a low computational complexity and a relatively high spatial resolution; however, it includes only a few methods in a parallel operation with a matched model scheme. This study introduces a fast and parallelizable algorithm to improve the traditional DDM for computing the parallel projection and backprojection operations. Our proposed model has been implemented on a GPU (graphic processing unit) platform and has achieved satisfactory computational efficiency with no approximation. The runtime for the projection and backprojection operations with our model is approximately 4.5 s and 10.5 s per loop, respectively, with an image size of 256×256×256 and 360 projections with a size of 512×512. We compare several general algorithms that have been proposed for maximizing GPU efficiency by using the unmatched projection/backprojection models in a parallel computation. The imaging resolution is not sacrificed and remains accurate during computed tomographic reconstruction. Project supported by the National High Technology Research and Development Program of China (Grant No. 2012AA011603) and the National Natural Science Foundation of China (Grant No. 61372172).
Distributed Parallel Computing in Data Analysis of Osteoporosis.
Waleska Simões, Priscyla; Venson, Ramon; Comunello, Eros; Casagrande, Rogério Antônio; Bigaton, Everson; da Silva Carlessi, Lucas; da Rosa, Maria Inês; Martins, Paulo João
2015-01-01
This research aimed to compare the performance of two models of load balancing (Proportional and Autotuned algorithms) of the JPPF platform in the processing of data mining from a database with osteoporosis and osteopenia. When performing the analysis of execution times, it was observed that the Proportional algorithm performed better in all cases.
Execution time supports for adaptive scientific algorithms on distributed memory machines
NASA Technical Reports Server (NTRS)
Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey
1990-01-01
Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communications patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
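The idea of addressing a distributed array through a global index set can be sketched in a single process as follows. This is not the PARTI interface, which the abstract does not show: the function names, the block distribution, and the representation of each processor's memory as a Python list are all assumptions made for illustration.

```python
def owner_and_offset(global_index, block_size):
    """Translate a global index into (owning processor, local offset)
    under an assumed block distribution."""
    return divmod(global_index, block_size)


def build_schedule(global_indices, block_size):
    """Group the requested elements by owner.

    In a real runtime this is the communication pattern derived at run
    time: one message per owner, listing the local offsets it must supply.
    """
    schedule = {}
    for position, g in enumerate(global_indices):
        owner, offset = owner_and_offset(g, block_size)
        schedule.setdefault(owner, []).append((position, offset))
    return schedule


def gather(distributed, global_indices, block_size):
    """Fetch elements by global index from per-processor local arrays."""
    result = [None] * len(global_indices)
    for owner, requests in build_schedule(global_indices, block_size).items():
        for position, offset in requests:
            result[position] = distributed[owner][offset]
    return result


def scatter(distributed, global_indices, values, block_size):
    """Write values back to their owners by global index (in place)."""
    for owner, requests in build_schedule(global_indices, block_size).items():
        for position, offset in requests:
            distributed[owner][offset] = values[position]


# Three simulated processors holding a global array of 8 elements.
local_arrays = [[10, 11, 12], [13, 14, 15], [16, 17]]
fetched = gather(local_arrays, [7, 0, 4], block_size=3)
scatter(local_arrays, [4], [99], block_size=3)
```

The loop body only ever sees global indices, while the schedule hides which processor actually holds each element.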
A general purpose subroutine for fast fourier transform on a distributed memory parallel machine
NASA Technical Reports Server (NTRS)
Dubey, A.; Zubair, M.; Grosch, C. E.
1992-01-01
One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
NASA Technical Reports Server (NTRS)
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in detail. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on a triangular array and another based on a rectangular array, are presented for the multi-phase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-phase operations. Performance issues are also considered.
Goldberg, L.A.; Jerrum, M.; Leighton, T.; Rao, S.
1993-01-20
In this paper we consider the problem of interprocessor communication on a Completely Connected Optical Communication Parallel Computer (OCPC). The particular problem we study is that of realizing an h-relation. In this problem, each processor has at most h messages to send and at most h messages to receive. It is clear that any 1-relation can be realized in one communication step on an OCPC. However, the best known p-processor OCPC algorithm for realizing an arbitrary h-relation for h > 1 requires $\Theta(h + \log p)$ expected communication steps. (This algorithm is due to Valiant and is based on earlier work of Anderson and Miller.) Valiant's algorithm is optimal only for $h = \Omega(\log p)$ and it is an open question of Gereb-Graus and Tsantilas whether there is a faster algorithm for $h = o(\log p)$. In this paper we answer this question in the affirmative by presenting a $\Theta(h + \log \log p)$ communication step algorithm that realizes an arbitrary h-relation on a p-processor OCPC. We show that if $h \le \log p$ then the failure probability can be made as small as $p^{-\alpha}$ for any positive constant $\alpha$.
Muckley, Matthew J; Noll, Douglas C; Fessler, Jeffrey A
2015-02-01
Sparsity-promoting regularization is useful for combining compressed sensing assumptions with parallel MRI for reducing scan time while preserving image quality. Variable splitting algorithms are the current state-of-the-art algorithms for SENSE-type MR image reconstruction with sparsity-promoting regularization. These methods are very general and have been observed to work with almost any regularizer; however, the tuning of associated convergence parameters is a commonly-cited hindrance in their adoption. Conversely, majorize-minimize algorithms based on a single Lipschitz constant have been observed to be slow in shift-variant applications such as SENSE-type MR image reconstruction since the associated Lipschitz constants are loose bounds for the shift-variant behavior. This paper bridges the gap between the Lipschitz constant and the shift-variant aspects of SENSE-type MR imaging by introducing majorizing matrices in the range of the regularizer matrix. The proposed majorize-minimize methods (called BARISTA) converge faster than state-of-the-art variable splitting algorithms when combined with momentum acceleration and adaptive momentum restarting. Furthermore, the tuning parameters associated with the proposed methods are unitless convergence tolerances that are easier to choose than the constraint penalty parameters required by variable splitting algorithms.
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method, even without acceleration/preconditioning, is competitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the SI-DSA execution time generally rises more steeply with increasing P than the PGS execution time. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4,096.
Optimization of composite structures by estimation of distribution algorithms
NASA Astrophysics Data System (ADS)
Grosset, Laurent
The design of high performance composite laminates, such as those used in aerospace structures, leads to complex combinatorial optimization problems that cannot be addressed by conventional methods. These problems are typically solved by stochastic algorithms, such as evolutionary algorithms. This dissertation proposes a new evolutionary algorithm for composite laminate optimization, named Double-Distribution Optimization Algorithm (DDOA). DDOA belongs to the family of estimation of distributions algorithms (EDA) that build a statistical model of promising regions of the design space based on sets of good points, and use it to guide the search. A generic framework for introducing statistical variable dependencies by making use of the physics of the problem is proposed. The algorithm uses two distributions simultaneously: the marginal distributions of the design variables, complemented by the distribution of auxiliary variables. The combination of the two generates complex distributions at a low computational cost. The dissertation demonstrates the efficiency of DDOA for several laminate optimization problems where the design variables are the fiber angles and the auxiliary variables are the lamination parameters. The results show that its reliability in finding the optima is greater than that of a simple EDA and of a standard genetic algorithm, and that its advantage increases with the problem dimension. A continuous version of the algorithm is presented and applied to a constrained quadratic problem. Finally, a modification of the algorithm incorporating probabilistic and directional search mechanisms is proposed. The algorithm exhibits a faster convergence to the optimum and opens the way for a unified framework for stochastic and directional optimization.
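The basic EDA loop described in this abstract (fit a statistical model to a set of good points, then sample new candidates from it) can be sketched as follows. This is a minimal univariate EDA with independent marginals only, not DDOA itself: it has no auxiliary-variable distribution, and the function name `umda`, its parameters, and the smoothing constant are hypothetical choices made for illustration.

```python
import numpy as np

def umda(fitness, n_vars, n_values, pop=60, keep=20, gens=40, seed=0):
    """Univariate EDA sketch: maximize `fitness` over integer vectors.

    Each of the n_vars design variables takes a value in {0, ..., n_values-1}
    (for a laminate these could index a set of allowed fiber angles).
    """
    rng = np.random.default_rng(seed)
    # Start from uniform marginal distributions, one row per variable.
    probs = np.full((n_vars, n_values), 1.0 / n_values)
    best_x, best_f = None, -np.inf
    for _ in range(gens):
        # Sample each variable independently from its marginal.
        population = np.stack(
            [rng.choice(n_values, size=pop, p=probs[i]) for i in range(n_vars)],
            axis=1,
        )
        scores = np.array([fitness(x) for x in population])
        order = np.argsort(scores)[::-1]  # best first
        if scores[order[0]] > best_f:
            best_f, best_x = scores[order[0]], population[order[0]].copy()
        elite = population[order[:keep]]
        # Re-estimate the marginals from the selected points; the +1 is
        # Laplace smoothing so that no value's probability drops to zero.
        for i in range(n_vars):
            counts = np.bincount(elite[:, i], minlength=n_values) + 1.0
            probs[i] = counts / counts.sum()
    return best_x, best_f
```

A call such as `umda(lambda x: int(np.sum(x == 2)), n_vars=6, n_values=4)` searches for the vector whose entries all equal 2. Because the marginals are independent, this sketch cannot represent the variable dependencies that the dissertation introduces through auxiliary variables.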
Fast parallel algorithms that compute transitive closure of a fuzzy relation
NASA Technical Reports Server (NTRS)
Kreinovich, Vladik YA.
1993-01-01
The notion of a transitive closure of a fuzzy relation is very useful for clustering in pattern recognition, for fuzzy databases, etc. The original algorithm proposed by L. Zadeh (1971) requires the computation time O(n^4), where n is the number of elements in the relation. In 1974, J. C. Dunn proposed an O(n^2) algorithm. Since we must compute n(n-1)/2 different values s(a, b) (a not equal to b) that represent the fuzzy relation, and we need at least one computational step to compute each of these values, we cannot compute all of them in less than O(n^2) steps. So, Dunn's algorithm is in this sense optimal. For small n, it is ok. However, for big n (e.g., for big databases), it is still a lot, so it would be desirable to decrease the computation time (this problem was formulated by J. Bezdek). Since this decrease cannot be done on a sequential computer, the only way to do it is to use a computer with several processors working in parallel. We show that on a parallel computer, transitive closure can be computed in time O((log_2 n)^2).
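For orientation, the closure can be computed by repeated squaring, as in the sketch below. It assumes the common max-min definition of composition, which the abstract does not state, and it is not the algorithm of the report. It needs only about log2(n) compositions, and each composition is a set of independent max-of-min reductions, which is the kind of structure a parallel machine can exploit. The function names are hypothetical.

```python
import numpy as np

def maxmin_compose(r, s):
    """Max-min composition: out[i, j] = max over k of min(r[i, k], s[k, j])."""
    # Broadcasting builds an (i, k, j) array of pairwise minima,
    # which is then reduced with max over the middle index k.
    return np.max(np.minimum(r[:, :, None], s[None, :, :]), axis=1)

def fuzzy_transitive_closure(r):
    """Closure by repeated squaring: R <- max(R, R o R) until it stops changing."""
    closure = np.asarray(r, dtype=float)
    n = closure.shape[0]
    # After t squarings all chains of length <= 2**t are covered, so
    # ceil(log2(n)) squarings suffice; one extra pass detects the fixed point.
    max_passes = int(np.ceil(np.log2(max(n, 2)))) + 1
    for _ in range(max_passes):
        nxt = np.maximum(closure, maxmin_compose(closure, closure))
        if np.array_equal(nxt, closure):
            break
        closure = nxt
    return closure
```

This sequential sketch costs O(n^3) work per composition, so it illustrates the definition rather than competing with Dunn's O(n^2) bound.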
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarria-Miranda, Daniel
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
Optimizing ion channel models using a parallel genetic algorithm on graphical processors.
Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon
2012-01-01
We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs.
NASA Technical Reports Server (NTRS)
Fijany, Amir
1993-01-01
In this paper, parallel O(log N) algorithms for dynamic simulation of a single closed-chain rigid multibody system, as specialized to the case of a robot manipulator in contact with the environment, are developed.
Parallelization of the Volterra algorithm for linear optimal open loop control
NASA Astrophysics Data System (ADS)
Das, S. K.; Utku, S.; Salama, M.
1989-07-01
If and when the time variation of optimal controls of a linear system subject to known forces is required, they can be obtained by the computationally advantageous open loop Volterra formulation (as opposed to the costlier Riccati formulation). For the computation, the Volterra equation is discretized in the time domain via such schemes as trapezoidal integration or Simpson's rule, and the resulting linear system is solved to obtain the control vector values at discrete time points within the control time T. In the case of very large order systems (degrees of freedom ~ 5000) a parallel technique is absolutely necessary, and this paper presents an efficient parallel strategy with efficiencies between 80% and 100%. The algorithm uses s + 1 processors, s being the number of intervals within the control time T, and typically each processor handles one time point.
Parallel Simulation Algorithms for the Three Dimensional Strong-Strong Beam-Beam Interaction
Kabel, A.C.; /SLAC
2008-03-17
The strong-strong beam-beam effect is one of the most important effects limiting the luminosity of ring colliders. Little is known about it analytically, so most studies utilize numeric simulations. The two-dimensional realm is readily accessible to workstation-class computers (cf., e.g., [1, 2]), while three dimensions, which add effects such as phase averaging and the hourglass effect, require vastly higher amounts of CPU time. Thus, parallelization of three-dimensional simulation techniques is imperative; in the following we discuss parallelization strategies and describe the algorithms used in our simulation code, which will reach almost linear scaling of performance vs. number of CPUs for typical setups.
A study of the parallel algorithm for large-scale DC simulation of nonlinear systems
NASA Astrophysics Data System (ADS)
Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel
Newton-Raphson DC analysis of large-scale nonlinear circuits may be an extremely time-consuming process even if sparse matrix techniques and bypassing of nonlinear model calculations are used. A slight decrease in the time required for this task may be achieved on multi-core, multithreaded computers if the calculation of the mathematical models for the nonlinear elements and the stamp management of the sparse matrix entries are handled by concurrent processes. The numerical complexity can be further reduced via circuit decomposition and parallel solution of blocks, taking the BBD matrix structure as a point of departure. This block-parallel approach may give a considerable gain, though it depends strongly on the system topology and, of course, on the processor type. This contribution presents an easily parallelizable decomposition-based algorithm for DC simulation and provides a detailed study of its effectiveness.
Multimode fiber laser beam cleanup based on stochastic parallel gradient descent algorithm
NASA Astrophysics Data System (ADS)
Zhao, Hai-Chuan; Ma, Hao-Tong; Zhou, Pu; Wang, Xiao-Lin; Ma, Yan-Xing; Li, Xiao; Xu, Xiao-Jun; Zhao, Yi-Jun
2011-01-01
We present experimental research on multimode fiber laser beam cleanup based on a stochastic parallel gradient descent (SPGD) algorithm. The multimode laser is obtained by injecting a 1064 nm central wavelength single mode fiber laser into a multimode fiber, and the system is set up using phase-only liquid crystal spatial light modulators (LC-SLM). The quality evaluation function is increased by a factor of 10.5 and 65% of the laser energy is encircled in the central lobe when the system evolves from the open-loop into the closed-loop state. Experimental results indicate the feasibility of multimode fiber laser beam cleanup by adaptive optics (AO).
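A generic SPGD iteration can be sketched as follows: all control channels are perturbed at once by a random ±δ pattern, the quality metric is measured for the positive and negative perturbation, and the controls are moved along the perturbation in proportion to the measured difference. This is the textbook two-sided form and is only an assumption about the scheme used in the experiment; the function name, gain, and perturbation amplitude are hypothetical.

```python
import numpy as np

def spgd_maximize(metric, u0, gain=1.0, perturb=0.1, iters=1000, seed=0):
    """Maximize metric(u) with stochastic parallel gradient descent (ascent form)."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for _ in range(iters):
        # Random +/- perturbation applied to every control channel in parallel.
        delta = perturb * rng.choice([-1.0, 1.0], size=u.shape)
        # Two metric measurements give a directional estimate of the gradient.
        dj = metric(u + delta) - metric(u - delta)
        # Move along the perturbation, scaled by the observed metric change.
        u += gain * dj * delta
    return u
```

In an adaptive optics setting `u` would hold the phase values written to the modulator and `metric` would be a measured quantity such as power in the central lobe; no gradient of the metric is ever computed explicitly, which is why the method needs no wavefront sensor.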
Wang, Jian; Wang, Xiaolong; Jiang, Aipeng; Jiangzhou, Shu; Li, Ping
2014-01-01
A large-scale parallel-unit seawater reverse osmosis desalination plant contains many reverse osmosis (RO) units. If the operating conditions change, these RO units will not work at the optimal design points which are computed before the plant is built. The operational optimization problem (OOP) of the plant is to find out a scheduling of operation to minimize the total running cost when the change happens. In this paper, the OOP is modelled as a mixed-integer nonlinear programming problem. A two-stage differential evolution algorithm is proposed to solve this OOP. Experimental results show that the proposed method is satisfactory in solution quality. PMID:24701180
Du, Tingsong; Hu, Yang; Ke, Xianting
2015-01-01
An improved quantum artificial fish swarm algorithm (IQAFSA) for solving distributed network programming considering distributed generation is proposed in this work. IQAFSA is based on quantum computing, which has exponential acceleration for heuristic algorithms; it uses quantum bits to encode the artificial fish and updates them through the quantum revolving gate, preying behavior, following behavior, and variation of the quantum artificial fish in order to search for the optimal value. We then apply the proposed algorithm, the quantum artificial fish swarm algorithm (QAFSA), the basic artificial fish swarm algorithm (BAFSA), and the global edition artificial fish swarm algorithm (GAFSA) to simulation experiments on some typical test functions. The simulation results demonstrate that the proposed algorithm can escape from local extrema effectively and has higher convergence speed and better accuracy. Finally, IQAFSA is applied to distributed network problems, and the simulation results for a 33-bus radial distribution network system show that IQAFSA attains the minimum power loss compared with BAFSA, GAFSA, and QAFSA.
NASA Astrophysics Data System (ADS)
Zhao, Tao; Hwang, Feng-Nan; Cai, Xiao-Chuan
2016-07-01
We consider a quintic polynomial eigenvalue problem arising from the finite volume discretization of a quantum dot simulation problem. The problem is solved by the Jacobi-Davidson (JD) algorithm. Our focus is on how to achieve the quadratic convergence of JD in a way that is not only efficient but also scalable when the number of processor cores is large. For this purpose, we develop a projected two-level Schwarz preconditioned JD algorithm that exploits multilevel domain decomposition techniques. The pyramidal quantum dot calculation is carefully studied to illustrate the efficiency of the proposed method. Numerical experiments confirm that the proposed method has a good scalability for problems with hundreds of millions of unknowns on a parallel computer with more than 10,000 processor cores.
Estimating the atmospheric correlation length with stochastic parallel gradient descent algorithm.
Yazdani, R; Hajimahmoodzadeh, M; Fallah, H R
2014-03-01
The atmospheric turbulence measurement has received much attention in various fields due to its effects on wave propagation. One of the interesting parameters for characterization of the atmospheric turbulence is the Fried parameter or the atmospheric correlation length. We numerically investigate the feasibility of estimating the Fried parameter using a simple and low-cost system based on the stochastic parallel gradient descent (SPGD) algorithm without the need for wavefront sensing. We simulate the atmospheric turbulence using Zernike polynomials and employ a wavefront sensor-less adaptive optics system based on the SPGD algorithm and report the estimated Fried parameter after compensating for atmospheric-turbulence-induced phase distortions. Several simulations for different atmospheric turbulence strengths are presented to validate the proposed method.
Designing efficient parallel algorithms on mesh-connected computers with multiple broadcasting
Chen, Y.C.; Chen, W.T.; Chen, G.H.; Sheu, J.P.
1990-04-01
Semigroup and prefix computations on two-dimensional mesh-connected computers with multiple broadcasting (2-MCCMB's) are studied in this paper. Previously, only square 2-MCCMB's with N processing elements were considered for semigroup computations of N data items, and O(N^{1/6}) time was required. It is found that square machines are not the best form for semigroup computations, and an O(N^{1/8}) time algorithm is thus derived on an N^{5/8} × N^{3/8} rectangular 2-MCCMB. This time complexity can be further reduced to O(N^{1/9}) if fewer PE's are used. Following the same approach, parallel algorithms for prefix computations are also derived with the same time complexities.
Model-based spectral estimation of Doppler signals using parallel genetic algorithms.
Solano González, J; Rodríguez Vázquez, K; García Nocetti, D F
2000-05-01
Conventional spectral analysis methods use a fast Fourier transform (FFT) on consecutive or overlapping windowed data segments. For Doppler ultrasound signals, this approach suffers from an inadequate frequency resolution due to the time segment duration and the non-stationarity characteristics of the signals. Parametric or model-based estimators can give significant improvements in the time-frequency resolution at the expense of a higher computational complexity. This work describes an approach which implements in real time a parametric spectral estimator method using genetic algorithms (GAs) in order to find the optimum set of parameters for the adaptive filter that minimises the error function. The aim is to reduce the computational complexity of the conventional algorithm by using the simplicity associated with GAs and exploiting their parallel characteristics. This will allow the implementation of higher order filters, increasing the spectrum resolution, and opening a greater scope for using more complex methods. PMID:10767617
Attaway, S.W.; Hendrickson, B.A.; Plimpton, S.J.; Swegle, J.W.; Gardner, D.R.; Vaughan, C.T.
1997-05-01
An efficient, scalable, parallel algorithm for treating contacts in solid mechanics has been applied to interactions between particles in smooth particle hydrodynamics (SPH). The algorithm uses three different decompositions within a single timestep: (1) a static FE-decomposition of mesh elements; (2) a dynamic SPH-decomposition of SPH particles; and (3) a dynamic contact-decomposition of contact nodes and SPH particles. The overhead cost of such a scheme is the cost of moving mesh and particle data between the decompositions. This cost turns out to be small in practice, leading to a highly load-balanced decomposition in which to perform each of the three major computational stages within a timestep.
Experiences with serial and parallel algorithms for channel routing using simulated annealing
NASA Technical Reports Server (NTRS)
Brouwer, Randall Jay
1988-01-01
Two algorithms for channel routing using simulated annealing are presented. Simulated annealing is an optimization methodology which allows the solution process to back up out of local minima that may be encountered by inappropriate selections. By properly controlling the annealing process, it is very likely that the optimal solution to an NP-complete problem such as channel routing may be found. The algorithm presented proposes very relaxed restrictions on the types of allowable transformations, including overlapping nets. By freeing that restriction and controlling overlap situations with an appropriate cost function, the algorithm becomes very flexible and can be applied to many extensions of channel routing. The selection of the transformation utilizes a number of heuristics, still retaining the pseudorandom nature of simulated annealing. The algorithm was implemented as a serial program for a workstation, and a parallel program designed for a hypercube computer. The details of the serial implementation are presented, including many of the heuristics used and some of the resulting solutions.
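The mechanism that lets simulated annealing back out of local minima is its acceptance rule: a move that raises the cost is still accepted with a probability that shrinks as the temperature falls. The sketch below shows that rule in a generic minimizer. It is not the channel router described above; the geometric cooling schedule, the parameter values, and the `cost` and `neighbor` callables are all hypothetical placeholders.

```python
import math
import random

def simulated_annealing(cost, neighbor, state, t0=1.0, cooling=0.995,
                        steps=5000, seed=0):
    """Minimize cost(state); neighbor(state, rng) proposes a transformed state."""
    rng = random.Random(seed)
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        delta = cost(candidate) - current_cost
        # Always accept improvements; accept a worse move with
        # probability exp(-delta / temp) (the Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling  # geometric cooling schedule
    return best, best_cost
```

For a router, `state` would encode a track assignment, `neighbor` would apply one of the allowed transformations, and `cost` would include a penalty term for overlapping nets, in the spirit of the cost-function control of overlaps that the abstract mentions.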
A parallel algorithm for viewshed analysis in three-dimensional Digital Earth
NASA Astrophysics Data System (ADS)
Feng, Wang; Gang, Wang; Deji, Pan; Yuan, Liu; Liuzhong, Yang; Hongbo, Wang
2015-02-01
Viewshed analysis, often supported by geographic information systems, is widely used in the three-dimensional (3D) Digital Earth system. Many of the analyses involve the siting of features and real-time decision-making. Viewshed analysis is usually performed at a large scale, which poses substantial computational challenges, as geographic datasets continue to become increasingly large. Previous research on viewshed analysis has been generally limited to a single data structure (i.e., DEM), which cannot be used to analyze viewsheds in complicated scenes. In this paper, a real-time algorithm for viewshed analysis in Digital Earth is presented using the parallel computing of graphics processing units (GPUs). An occlusion for each geometric entity in the neighbor space of the viewshed point is generated according to line-of-sight. The region within the occlusion is marked by a stencil buffer within the programmable 3D visualization pipeline. The marked region is concurrently drawn in red. In contrast to traditional algorithms based on line-of-sight, the new algorithm, in which the viewshed calculation is integrated with the rendering module, is more efficient and stable. This proposed method of viewshed generation is closer to the reality of the virtual geographic environment. No DEM interpolation, which is seen as a computational burden, is needed. The algorithm was implemented in a 3D Digital Earth system (GeoBeans3D) with the DirectX application programming interface (API) and has been widely used in a range of applications.
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martin, Gabriel; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2016-04-01
Over the last few years, hyperspectral analysis has been applied in many remote sensing applications. In fact, hyperspectral unmixing has been a challenging task in hyperspectral data exploitation. This process consists of three stages: (i) estimation of the number of pure spectral signatures or endmembers, (ii) automatic identification of the estimated endmembers, and (iii) estimation of the fractional abundance of each endmember in each pixel of the scene. However, unmixing algorithms can be computationally very expensive, a fact that compromises their use in applications under real-time constraints. In recent years, several techniques have been proposed to solve the aforementioned problem, but until now most works have focused on the second and third stages. The execution cost of the first stage is usually lower than that of the other stages. Indeed, it can be optional if this estimate is known a priori. However, its acceleration on parallel architectures is still an interesting and open problem. In this paper we have addressed this issue focusing on the GENE algorithm, a promising geometry-based proposal introduced in [1]. We have evaluated our parallel implementation in terms of both accuracy and computational performance through Monte Carlo simulations for real and synthetic data experiments. Performance results on a modern GPU show satisfactory 16x speedup factors, which allow us to expect that this method could meet real-time requirements on a fully operational unmixing chain.
NASA Astrophysics Data System (ADS)
Zhang, Zhi-Yong; Tan, Han-Dong; Wang, Kun-Peng; Lin, Chang-Hong; Zhang, Bin; Xie, Mao-Bi
2016-03-01
Traditional two-dimensional (2D) complex resistivity forward modeling is based on Poisson's equation but spectral induced polarization (SIP) data are the coproducts of the induced polarization (IP) and the electromagnetic induction (EMI) effects. This is especially true under high frequencies, where the EMI effect can exceed the IP effect. 2D inversion that only considers the IP effect reduces the reliability of the inversion data. In this paper, we derive differential equations using Maxwell's equations. With the introduction of the Cole-Cole model, we use the finite-element method to conduct 2D SIP forward modeling that considers the EMI and IP effects simultaneously. The data-space Occam method, in which different constraints to the model smoothness and parametric boundaries are introduced, is then used to simultaneously obtain the four parameters of the Cole-Cole model using multi-array electric field data. This approach not only improves the stability of the inversion but also significantly reduces the solution ambiguity. To improve the computational efficiency, message passing interface programming was used to accelerate the 2D SIP forward modeling and inversion. Synthetic datasets were tested using both serial and parallel algorithms, and the tests suggest that the proposed parallel algorithm is robust and efficient.
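For readers unfamiliar with the four Cole-Cole parameters, the sketch below evaluates the standard (Pelton-type) complex resistivity rho(w) = rho0 * [1 - m * (1 - 1 / (1 + (i*w*tau)^c))], with zero-frequency resistivity rho0, chargeability m, time constant tau, and frequency exponent c. The assumption that the paper uses exactly this parameterization is ours; the abstract only says that four parameters are recovered. The function name is hypothetical.

```python
import numpy as np

def cole_cole_resistivity(freq_hz, rho0, m, tau, c):
    """Complex resistivity of the standard Cole-Cole model at the given frequencies.

    rho0: zero-frequency resistivity (ohm-m), m: chargeability (0..1),
    tau: time constant (s), c: frequency exponent (0..1).
    """
    omega = 2.0 * np.pi * np.asarray(freq_hz, dtype=float)  # angular frequency
    return rho0 * (1.0 - m * (1.0 - 1.0 / (1.0 + (1j * omega * tau) ** c)))
```

At very low frequency the value tends to rho0 and at very high frequency to rho0 * (1 - m); in between, the imaginary part is negative, which is the phase lag that IP measurements record.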
Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael
2015-04-08
The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. As a result, application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, therefore enabling the applications to achieve better time to solution and scalability on Titan during production.
NASA Astrophysics Data System (ADS)
Baba, Toshitaka; Takahashi, Narumi; Kaneda, Yoshiyuki; Ando, Kazuto; Matsuoka, Daisuke; Kato, Toshihiro
2015-12-01
Because of improvements in offshore tsunami observation technology, dispersion phenomena during tsunami propagation have often been observed in recent tsunamis, for example the 2004 Indian Ocean and 2011 Tohoku tsunamis. The dispersive propagation of tsunamis can be simulated by use of the Boussinesq model, but the model demands many computational resources. However, rapid progress has been made in parallel computing technology. In this study, we investigated a parallelized approach for dispersive tsunami wave modeling. Our new parallel software solves the nonlinear Boussinesq dispersive equations in spherical coordinates. A variable nested algorithm was used to increase spatial resolution in the target region. The software can also be used to predict tsunami inundation on land. We used the dispersive tsunami model to simulate the 2011 Tohoku earthquake on the Supercomputer K. Good agreement was apparent between the dispersive wave model results and the tsunami waveforms observed offshore. The finest bathymetric grid interval was 2/9 arcsec (approx. 5 m) along longitude and latitude lines. Use of this grid simulated tsunami soliton fission near the Sendai coast. Incorporating the three-dimensional shape of buildings and structures led to improved modeling of tsunami inundation.
Virtual parallel computing and a search algorithm using matrix product states.
Chamon, Claudio; Mucciolo, Eduardo R
2012-07-20
We propose a form of parallel computing on classical computers that is based on matrix product states. The virtual parallelization is accomplished by representing bits with matrices and by evolving these matrices from an initial product state that encodes multiple inputs. Matrix evolution follows from the sequential application of gates, as in a logical circuit. The actions of classical probabilistic one-bit gates and deterministic two-bit gates such as NAND are implemented in terms of matrix operations and, as opposed to quantum computing, it is possible to copy bits. We present a way to explore this method of computation to solve search problems and count the number of solutions. We argue that if the classical computational cost of testing solutions (witnesses) requires less than O(n^2) local two-bit gates acting on n bits, the search problem can be fully solved in subexponential time. Therefore, for this restricted type of search problem, the virtual parallelization scheme is faster than Grover's quantum algorithm.
A hybrid-algorithm-based parallel computing framework for optimal reservoir operation
NASA Astrophysics Data System (ADS)
Li, X.; Wei, J.; Li, T.; Wang, G.
2012-12-01
To date, various optimization models have been developed to offer optimal operating policies for reservoirs. Each optimization model has its own merits and limitations, and no general algorithm exists even today. At times, several optimization models have to be combined to obtain the desired results. In this paper, we present a parallel computing framework that combines various optimization models in a way that differs from traditional serial computing. The framework consists of three functional processor types: the master processor, slave processors, and the transfer processor. The master processor has a full computation scheme that allocates optimization models to slave processors; slave processors run the allocated optimization models; the transfer processor is in charge of communicating solutions among all slave processors. On this basis, the proposed framework can run various optimization models in parallel. Because solutions are communicated during iteration, the framework can also integrate the merits of the optimization models involved, and the performance of each optimization model can therefore be improved. Moreover, it can be concluded that the framework effectively improves solution quality and increases solution speed by making full use of the computing power of parallel computers.
NASA Astrophysics Data System (ADS)
Bogdanov, P. B.; Gorobets, A. V.; Sukov, S. A.
2013-08-01
The design of efficient algorithms for large-scale gas dynamics computations with hybrid (heterogeneous) computing systems whose high performance relies on massively parallel accelerators is addressed. A high-order accurate finite volume algorithm with polynomial reconstruction on unstructured hybrid meshes is used to compute compressible gas flows in domains of complex geometry. The basic operations of the algorithm are implemented in detail for massively parallel accelerators, including AMD and NVIDIA graphics processing units (GPUs). Major optimization approaches and a computation transfer technique are covered. The underlying programming tool is the Open Computing Language (OpenCL) standard, which performs on accelerators of various architectures, both existing and emerging.
Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets
Ikebata, Hisaki; Yoshida, Ryo
2015-01-01
Motivation: The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention as most early motif discovery methods lack the ability to handle large data of recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little regard to the accuracy of motif detection. Unlike such methods, our method focuses on increasing the detection accuracy while maintaining the computation efficiency at an acceptable level. The major advantage of our method is that it can mine diverse multiple motifs undetectable by current methods. Results: The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC is run on parallel interacting motif samplers. A repulsive force is generated when different motifs produced by different samplers near each other. Thus, different samplers explore different motifs. In this way, we can detect much more diverse motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover. Availability and implementation: A C++ implementation of RPMCMC and discovered cofactor motifs for the 228 ENCODE ChIP-seq datasets are available from http://daweb.ism.ac.jp/yoshidalab/motif. Contact: ikebata.hisaki@ism.ac.jp, yoshidar@ism.ac.jp Supplementary information: Supplementary data are available from Bioinformatics online. PMID:25583120
NASA Technical Reports Server (NTRS)
Sanyal, Soumya; Jain, Amit; Das, Sajal K.; Biswas, Rupak
2003-01-01
In this paper, we propose a distributed approach for mapping a single large application to a heterogeneous grid environment. To minimize the execution time of the parallel application, we distribute the mapping overhead to the available nodes of the grid. This approach not only provides a fast mapping of tasks to resources but is also scalable. We adopt a hierarchical grid model and accomplish the job of mapping tasks to this topology using a scheduler tree. Results show that our three-phase algorithm provides high quality mappings, and is fast and scalable.
NASA Technical Reports Server (NTRS)
Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)
2001-01-01
The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have more than 10^6 variables, so the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.
NASA Astrophysics Data System (ADS)
Niwa, Keiichi; Hayashida, Tomohiro; Sakawa, Masatoshi; Yang, Yishen
2010-10-01
We consider two-level programming problems in which there is one decision maker (the leader) at the upper level and two or more decision makers (the followers) at the lower level, and the decision variables of the leader and the followers are 0-1 variables. We assume that there is coordination among the followers, while the leader and the group of all the followers have no motivation to cooperate with each other. Fuzzy goals for the objective functions of the leader and the followers are introduced so as to take the fuzziness of their judgments into consideration. The leader maximizes the degree of satisfaction (the value of the membership function) and the followers choose in concert in order to maximize the minimum among their degrees of satisfaction. We propose a modified computational method that resolves problems with the existing computational method, which is based on the genetic algorithm, for obtaining the Stackelberg solution. Specifically, the distributed genetic algorithm is introduced for the upper level genetic algorithm, which handles the decision variables of the leader, in order to shorten the computational time of the existing method. Parallelization of the lower level genetic algorithm is also performed along with parallelization of the upper level genetic algorithm. In order to demonstrate the effectiveness of the proposed computational method, numerical experiments are carried out.
Storchi, Loriano; Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Quiney, Harry M
2013-12-10
We propose a new complete memory-distributed algorithm, which significantly improves the parallel implementation of the all-electron four-component Dirac-Kohn-Sham (DKS) module of BERTHA (J. Chem. Theory Comput. 2010, 6, 384). We devised an original procedure for mapping the DKS matrix between an efficient integral-driven distribution, guided by the structure of specific G-spinor basis sets and by density fitting algorithms, and the two-dimensional block-cyclic distribution scheme required by the ScaLAPACK library employed for the linear algebra operations. This implementation, because of the efficiency in the memory distribution, represents a leap forward in the applicability of the DKS procedure to arbitrarily large molecular systems and its porting on last-generation massively parallel systems. The performance of the code is illustrated by some test calculations on several gold clusters of increasing size. The DKS self-consistent procedure has been explicitly converged for two representative clusters, namely Au20 and Au34, for which the density of electronic states is reported and discussed. The largest gold cluster uses more than 39k basis functions and DKS matrices of the order of 23 GB. PMID:26592273
3-D asteroids using parallel graphics on NCUBE: A testbed for evaluating controller algorithms
Ho, A.; Fox, G.; Snyder, S.; Chu, D.; Mylner, T.
1989-04-18
We have implemented a 3-D Asteroids video game system on NCUBE. The system functions as a testbed for evaluating intelligent controller algorithms within a simulated space battle framework. The Asteroids game features battles between spacecraft in a 3-D toroidal space with inert meteorites of various sizes. It supports multiple players and mixed communication protocols. The game can be played in either interactive or batch mode. In interactive mode a player can maneuver a spacecraft by keyboard or graphics tablet control, as in a regular PC-based video game. The 3-D visual display of the game uses the NCUBE Real-Time Parallel Graphics Board, which has 16 NCUBE processors and a Hitachi HD63484 drawing/video chip. In batch mode spacecraft can be controlled by user-supplied software controllers. The modular structure of the game allows easy replacement of game objectives, game rules, and spacecraft controllers. The flexibility of module substitution allows fast prototyping of different controller strategies and algorithms that are constrained by various game rules. The system also allows algorithms that run on distinct subcubes of a hypercube to compete with one another. 6 refs., 1 fig.
Web based parallel/distributed medical data mining using software agents
Kargupta, H.; Stafford, B.; Hamzaoglu, I.
1997-12-31
This paper describes an experimental parallel/distributed data mining system PADMA (PArallel Data Mining Agents) that uses software agents for local data accessing and analysis and a web based interface for interactive data visualization. It also presents the results of applying PADMA for detecting patterns in unstructured texts of postmortem reports and laboratory test data for Hepatitis C patients.
A review of estimation of distribution algorithms in bioinformatics
Armañanzas, Rubén; Inza, Iñaki; Santana, Roberto; Saeys, Yvan; Flores, Jose Luis; Lozano, Jose Antonio; Peer, Yves Van de; Blanco, Rosa; Robles, Víctor; Bielza, Concha; Larrañaga, Pedro
2008-01-01
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain. PMID:18822112
Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.
1997-12-31
The aim of this work is to develop a 3D parallel program for the numerical solution of gas dynamics problems with heat conductivity on distributed memory computational systems (CS), such that the numerical results are independent of the number of processors involved. Two fundamentally different approaches to structuring massively parallel computations have been developed. The first approach uses a 3D data matrix decomposition that is reconstructed at each temporal cycle and extends parallelization algorithms for multiprocessor CS with shared memory. The second approach uses a 3D data matrix decomposition that is not reconstructed during a temporal cycle. The program was developed on the 8-processor CS MP-3 made at VNIIEF and was adapted to the massively parallel CS Meiko-2 at LLNL through the joint efforts of VNIIEF and LLNL staff. A large number of numerical experiments were carried out with up to 256 processors, and the parallelization efficiency was evaluated as a function of the number of processors and their parameters.
Feature Subset Selection by Estimation of Distribution Algorithms
Cantu-Paz, E
2002-01-17
This paper describes the application of four evolutionary algorithms to the identification of feature subsets for classification problems. Besides a simple GA, the paper considers three estimation of distribution algorithms (EDAs): a compact GA, an extended compact GA, and the Bayesian Optimization Algorithm. The objective is to determine if the EDAs present advantages over the simple GA in terms of accuracy or speed in this problem. The experiments used a Naive Bayes classifier and public-domain and artificial data sets. In contrast with previous studies, we did not find evidence to support or reject the use of EDAs for this problem.
Impacts of Time Delays on Distributed Algorithms for Economic Dispatch
Yang, Tao; Wu, Di; Sun, Yannan; Lian, Jianming
2015-07-26
The economic dispatch problem (EDP) is an important problem in power systems. It can be formulated as an optimization problem with the objective of minimizing the total generation cost subject to the power balance constraint and generator capacity limits. Recently, several consensus-based algorithms have been proposed to solve the EDP in a distributed manner. However, the impacts of communication time delays on these distributed algorithms are not fully understood, especially for the case where the communication network is directed, i.e., the information exchange is unidirectional. This paper investigates communication time delay effects on a distributed algorithm for directed communication networks. The algorithm has been tested by applying time delays to different types of information exchange. Several case studies are carried out to evaluate the effectiveness and performance of the algorithm in the presence of time delays in communication networks. It is found that time delays reduce the convergence rate and can even cause the algorithm to converge to an incorrect value or fail to converge.
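The optimization problem described in this abstract can be illustrated with a small centralized solver. This is only a sketch under stated assumptions: it is not the distributed consensus-based algorithm the paper studies, and it assumes each generator has a quadratic cost a_i*P^2 + b_i*P with a_i > 0. At the optimum, every generator that is not at a limit runs at the same marginal cost (lambda), so bisection on lambda finds the dispatch. The function name and the numbers in the usage note are hypothetical.

```python
def economic_dispatch(a, b, p_min, p_max, demand, tol=1e-9):
    """Centralized dispatch for quadratic costs a[i]*P**2 + b[i]*P (a[i] > 0).

    Returns (outputs, lam), where lam is the common marginal cost.
    """
    if not sum(p_min) <= demand <= sum(p_max):
        raise ValueError("demand is outside the total capacity range")

    def outputs(lam):
        # Each unit runs where its marginal cost 2*a*P + b equals lam,
        # clipped to its capacity limits.
        return [min(max((lam - bi) / (2.0 * ai), lo), hi)
                for ai, bi, lo, hi in zip(a, b, p_min, p_max)]

    # Bracket lam between the smallest and largest possible marginal costs.
    lam_lo = min(2.0 * ai * lo + bi for ai, bi, lo in zip(a, b, p_min))
    lam_hi = max(2.0 * ai * hi + bi for ai, bi, hi in zip(a, b, p_max))
    while lam_hi - lam_lo > tol:
        lam = 0.5 * (lam_lo + lam_hi)
        if sum(outputs(lam)) < demand:
            lam_lo = lam   # total generation too low: raise the price
        else:
            lam_hi = lam   # total generation sufficient: lower the price
    lam = 0.5 * (lam_lo + lam_hi)
    return outputs(lam), lam
```

For example, `economic_dispatch([0.01, 0.02], [2.0, 1.0], [0.0, 0.0], [100.0, 100.0], 150.0)` splits a demand of 150 between two hypothetical generators. A consensus-based method would instead have the generators agree on lambda by exchanging information with neighbors, which is where communication delays enter.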
Distributed Query Plan Generation Using Multiobjective Genetic Algorithm
Panicker, Shina; Vijay Kumar, T. V.
2014-01-01
A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. This distributed query plan generation (DQPG) problem has already been addressed using single objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, this DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize total LPC and minimize total CC. These objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability. PMID:24963513
XTP as a transport protocol for distributed parallel processing
Strayer, W.T.; Lewis, M.J.; Cline, R.E. Jr.
1994-12-31
The Xpress Transfer Protocol (XTP) is a flexible transport layer protocol designed to provide efficient service without dictating the communication paradigm or the delivery characteristics that qualify the paradigm. XTP provides the tools to build communication services appropriate to the application. Current data delivery solutions for many popular cluster computing environments use TCP and UDP. We examine TCP, UDP, and XTP with respect to the communication characteristics typical of parallel applications. We perform measurements of end-to-end latency for several paradigms important to cluster computing. An implementation of XTP is shown to be comparable to TCP in end-to-end latency on preestablished connections, and does better for paradigms where connections must be constructed on the fly.
A distributed parallel storage architecture and its potential application within EOSDIS
NASA Technical Reports Server (NTRS)
Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony
1994-01-01
We describe the architecture, implementation, and use of a scalable, high performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
A distributed parallel storage architecture and its potential application within EOSDIS
Johnston, W.E.; Tierney, B.; Feuquay, J.; Butzer, T.
1995-01-01
We describe the architecture, implementation, use, and potential use of a scalable, high-performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
Fast 2D DOA Estimation Algorithm by an Array Manifold Matching Method with Parallel Linear Arrays
Yang, Lisheng; Liu, Sheng; Li, Dong; Jiang, Qingping; Cao, Hailin
2016-01-01
In this paper, the problem of two-dimensional (2D) direction-of-arrival (DOA) estimation with parallel linear arrays is addressed. Two array manifold matching (AMM) approaches, in this work, are developed for the incoherent and coherent signals, respectively. The proposed AMM methods estimate the azimuth angle only with the assumption that the elevation angles are known or estimated. The proposed methods are time efficient since they do not require eigenvalue decomposition (EVD) or peak searching. In addition, the complexity analysis shows the proposed AMM approaches have lower computational complexity than many current state-of-the-art algorithms. The estimated azimuth angles produced by the AMM approaches are automatically paired with the elevation angles. More importantly, for estimating the azimuth angles of coherent signals, the aperture loss issue is avoided since a decorrelation procedure is not required for the proposed AMM method. Numerical studies demonstrate the effectiveness of the proposed approaches. PMID:26907301
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-10-23
The unbalanced assignment problem (UAP) is to optimally resolve the problem of assigning n jobs to m individuals (m < n), such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operation management and applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation.
Shin, Hyun-Ho; Yoon, Woong-Sup
2008-07-01
An Adaptive-Spatial Decomposition parallel algorithm was developed to increase computation efficiency for molecular dynamics simulations of nano-fluids. Injection of a liquid argon jet with a scale of 17.6 molecular diameters was investigated. A solid annular platinum injector was also solved simultaneously with the liquid injectant by adopting a solid modeling technique which incorporates phantom atoms. The viscous heat was naturally discharged through the solids so the liquid boiling problem was avoided with no separate use of temperature controlling methods. Parametric investigations of injection speed, wall temperature, and injector length were made. A sudden pressure drop at the orifice exit causes flash boiling of the liquid departing the nozzle exit with strong evaporation on the surface of the liquids, while rendering a slender jet. The elevation of the injection speed and the wall temperature causes an activation of the surface evaporation concurrent with reduction in the jet breakup length and the drop size.
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-01-01
The unbalanced assignment problem (UAP) is to optimally resolve the problem of assigning n jobs to m individuals (m < n), such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operation management and applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation. PMID:26512650
Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer
NASA Technical Reports Server (NTRS)
Godoy, William F.; Liu, Xu
2011-01-01
General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential extension in radiative solvers (e.g., Monte Carlo, discrete ordinates) at a minimal learning curve.
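For readers unfamiliar with the Gauss-Seidel iteration named in this abstract, the following is a minimal sequential sketch for a generic dense linear system A x = b. It is an illustration only: it does not reproduce the radiative transfer expressions or the GPU algorithms of the paper, and the function name, iteration cap, and tolerance are assumptions.

```python
def gauss_seidel(A, b, max_iterations=200, tol=1e-12):
    """Solve A x = b by Gauss-Seidel sweeps, starting from x = 0.

    Convergence is guaranteed when A is strictly diagonally dominant
    or symmetric positive definite; other matrices may diverge.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iterations):
        largest_change = 0.0
        for i in range(n):
            # Use the newest available values of the other unknowns.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_value = (b[i] - s) / A[i][i]
            largest_change = max(largest_change, abs(new_value - x[i]))
            x[i] = new_value
        if largest_change < tol:
            break
    return x
```

Each update inside a sweep depends on values computed earlier in the same sweep. This data dependence is what makes a direct parallel port non-trivial and motivates reordered or blocked variants on parallel hardware.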
Scalable load balancing for massively parallel distributed Monte Carlo particle transport
O'Brien, M. J.; Brantley, P. S.; Joy, K. I.
2013-07-01
In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time O(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrence Livermore National Laboratory. (authors)
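One well-known way to reach a global balance through pairwise steps in a logarithmic number of rounds is the dimension-exchange scheme, sketched below as a serial simulation. This is an assumption made for illustration and is not claimed to be the authors' algorithm. It further assumes that the processor count is a power of two and that work is infinitely divisible. In round k, processor i pairs with processor i XOR 2^k and the two average their loads.

```python
def pairwise_balance(loads):
    """Simulate dimension-exchange balancing; returns (balanced_loads, rounds)."""
    n = len(loads)
    if n < 1 or n & (n - 1):
        raise ValueError("processor count must be a power of two")
    loads = [float(w) for w in loads]
    rounds = 0
    step = 1
    while step < n:
        for i in range(n):
            j = i ^ step           # partner differs from i in exactly one bit
            if i < j:              # handle each pair once
                average = 0.5 * (loads[i] + loads[j])
                loads[i] = loads[j] = average
        step <<= 1
        rounds += 1
    return loads, rounds
```

For eight processors the loop runs three rounds, and every processor ends with the mean load. On a real machine all pairs in a round exchange simultaneously, so the run time grows with the number of rounds, log2(N), rather than with N.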
Parallel and distributed computation for fault-tolerant object recognition
NASA Technical Reports Server (NTRS)
Wechsler, Harry
1988-01-01
The distributed associative memory (DAM) model is suggested for distributed and fault-tolerant computation as it relates to object recognition tasks. The fault-tolerance is with respect to geometrical distortions (scale and rotation), noisy inputs, occlusion/overlap, and memory faults. An experimental system was developed for fault-tolerant structure recognition which shows the feasibility of such an approach. The approach is further extended to the problem of multisensory data integration and applied successfully to the recognition of colored polyhedral objects.
A parallel algorithm for computing the eigenvalues of a symmetric tridiagonal matrix
NASA Technical Reports Server (NTRS)
Swarztrauber, Paul N.
1993-01-01
A parallel algorithm, called polysection, is presented for computing the eigenvalues of a symmetric tridiagonal matrix. The method is based on a quadratic recurrence in which the characteristic polynomial is constructed on a binary tree from polynomials whose degree doubles at each level. Intervals that contain exactly one zero are determined by the zeros of polynomials at the previous level which ensures that different processors compute different zeros. The signs of the polynomials at the interval endpoints are determined a priori and used to guarantee that all zeros are found. The use of finite-precision arithmetic may result in multiple zeros; however, in this case, the intervals coalesce and their number determines exactly the multiplicity of the zero. For an N x N matrix the eigenvalues can be determined in O(log-squared N) time with N-squared processors and O(N) time with N processors. The method is compared with a parallel variant of bisection that requires O(N-squared) time on a single processor, O(N) time with N processors, and O(log N) time with N-squared processors.
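For orientation, the bisection baseline mentioned in the comparison can be sketched as follows. This is the textbook Sturm-count bisection, not the polysection method itself, and the function names `count_below` and `kth_eigenvalue` are hypothetical. Each eigenvalue index k is located independently, which is what lets different processors compute different eigenvalues.

```python
def count_below(diag, off, x):
    """Number of eigenvalues smaller than x for the symmetric tridiagonal
    matrix with main diagonal `diag` and off-diagonal `off`, counted as
    the negative pivots of the LDL^T factorization of (T - x*I)."""
    count = 0
    q = 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        q = (a - x) - b2 / q
        if q == 0.0:
            q = 1e-30  # nudge an exact zero pivot to avoid division by zero
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(diag, off, k, tol=1e-12, max_iter=200):
    """Bisection for the k-th smallest eigenvalue (k is 0-based)."""
    n = len(diag)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [
        (abs(off[i - 1]) if i > 0 else 0.0) + (abs(off[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(a - r for a, r in zip(diag, radius))
    hi = max(a + r for a, r in zip(diag, radius))
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if count_below(diag, off, mid) > k:
            hi = mid  # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


diag, off = [2.0, 2.0, 2.0], [-1.0, -1.0]
print([kth_eigenvalue(diag, off, k) for k in range(3)])
```

For this 3 x 3 example the exact eigenvalues are 2 - sqrt(2), 2 and 2 + sqrt(2), which the sketch should reproduce to within the tolerance.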
A novel parallel-rotation algorithm for atomistic Monte Carlo simulation of dense polymer systems
NASA Astrophysics Data System (ADS)
Santos, S.; Suter, U. W.; Müller, M.; Nievergelt, J.
2001-06-01
We develop and test a new elementary Monte Carlo move for use in the off-lattice simulation of polymer systems. This novel Parallel-Rotation algorithm (ParRot) permits moving very efficiently torsion angles that are deeply inside long chains in melts. The parallel-rotation move is extremely simple and is also demonstrated to be computationally efficient and appropriate for Monte Carlo simulation. The ParRot move does not affect the orientation of those parts of the chain outside the moving unit. The move consists of a concerted rotation around four adjacent skeletal bonds. No assumption is made concerning the backbone geometry other than that bond lengths and bond angles are held constant during the elementary move. Properly weighted sampling techniques are needed for ensuring detailed balance because the new move involves a correlated change in four degrees of freedom along the chain backbone. The ParRot move is supplemented with the classical Metropolis Monte Carlo, the Continuum-Configurational-Bias, and Reptation techniques in an isothermal-isobaric Monte Carlo simulation of melts of short and long chains. Comparisons are made with the capabilities of other Monte Carlo techniques to move the torsion angles in the middle of the chains. We demonstrate that ParRot constitutes a highly promising Monte Carlo move for the treatment of long polymer chains in the off-lattice simulation of realistic models of dense polymer systems.
Lilith: A scalable secure tool for massively parallel distributed computing
Armstrong, R.C.; Camp, L.J.; Evensky, D.A.; Gentile, A.C.
1997-06-01
Changes in high performance computing have necessitated the ability to utilize and interrogate potentially many thousands of processors. The ASCI (Advanced Strategic Computing Initiative) program conducted by the United States Department of Energy, for example, envisions thousands of distinct operating systems connected by low-latency gigabit-per-second networks. In addition multiple systems of this kind will be linked via high-capacity networks with latencies as low as the speed of light will allow. Code which spans systems of this sort must be scalable; yet constructing such code whether for applications, debugging, or maintenance is an unsolved problem. Lilith is a research software platform that attempts to answer these questions with an end toward meeting these needs. Presently, Lilith exists as a test-bed, written in Java, for various spanning algorithms and security schemes. The test-bed software has, and enforces, hooks allowing implementation and testing of various security schemes.
The study of real-time denoising algorithm based on parallel computing for the MEMS IR imager
NASA Astrophysics Data System (ADS)
Gong, Cheng; Hui, Mei; Dong, Liquan; Zhao, Yuejin
2011-11-01
In recent years, MEMS-based optical readout infrared imaging technology has become a research hotspot. Studies show that the MEMS-based optical readout infrared imager features a high frame rate. Given the high data throughput and the computational complexity of the denoising algorithm, it is difficult to guarantee real-time image processing. To improve processing speed and achieve real-time operation, we studied a denoising algorithm based on parallel computing using an FPGA (Field Programmable Gate Array). In the paper, we analyze the imaging characteristics of the MEMS-based optical readout infrared imager and design parallel computing methods for real-time denoising using a hardware description language. The experiment shows that the parallel computing denoising algorithm can improve infrared image processing speed enough to meet the real-time requirement.
Intelligent decision support algorithm for distribution system restoration.
Singh, Reetu; Mehfuz, Shabana; Kumar, Parmod
2016-01-01
The distribution system is the means of revenue for an electric utility. It needs to be restored as early as possible if any feeder or the complete system is tripped out due to a fault or any other cause. Further, uncertainty in the loads results in variations in the distribution network's parameters. Thus, an intelligent algorithm incorporating a hybrid fuzzy-grey relation, which can take the uncertainties into account and compare the sequences, is discussed for analysing and restoring the distribution system. Simulation studies are carried out to show the utility of the method by ranking the restoration plans for a typical distribution system. This algorithm also meets the smart grid requirements in terms of an automated restoration plan for a partial or full blackout of the network. PMID:27512634
Algorithmic techniques for computer vision on a fine-grained parallel machine
Little, J.J.; Blelloch, G.E.; Cass, T.A.
1989-03-01
The authors describe algorithms for several problems from computer vision, and illustrate how they are implemented using a set of primitive parallel operations. The primitives the authors use include general permutations, grid permutations, and the scan operation - a restricted form of the prefix computation. They cover well-known problems allowing us to concentrate on the implementations rather than the problems. First, they describe some simple routines such as border following, computing histograms and filtering. They then discuss several modules built on these routines including edge detection, Hough transforms, and connected component labeling. Finally, they describe how these modules are composed into higher level vision modules. By defining the routines using a set of primitive operations, they abstract away from a particular architecture. In particular, one does not have to worry about features of machines such as the number of processors or whether a tightly connected architecture has a hypercube network or a four-dimensional grid network. One still needs to worry about the relative performance of the primitives on particular machines. The authors discuss the tradeoffs among primitives and try to identify which primitives are most important for particular problems. All the primitives discussed are supported by the Connection Machine (CM), and they outline how they are implemented. They have implemented most of the algorithms described on the Connection Machine.
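As a rough illustration of how a scan primitive combines with a permutation, the sketch below emulates an exclusive plus-scan sequentially and uses it to pack flagged elements into a dense array. This is a generic data-parallel idiom rather than code from the paper, and the names `exclusive_plus_scan` and `pack` are hypothetical.

```python
from itertools import accumulate


def exclusive_plus_scan(values):
    """Exclusive prefix sum: element i receives the sum of values[0:i]."""
    if not values:
        return []
    running = list(accumulate(values))
    return [0] + running[:-1]


def pack(values, flags):
    """Move the flagged elements to the front, preserving their order.

    The scan over the 0/1 flags gives each kept element its destination
    index; the final loop stands in for a permutation step in which every
    processor would send its element at the same time.
    """
    bits = [1 if f else 0 for f in flags]
    dest = exclusive_plus_scan(bits)
    out = [None] * sum(bits)
    for value, bit, d in zip(values, bits, dest):
        if bit:
            out[d] = value
    return out


print(exclusive_plus_scan([3, 1, 4, 1, 5]))
print(pack(list("abcde"), [1, 0, 1, 1, 0]))
```

The same pattern could, for example, collect the pixels flagged as edge points into a compact list before a later processing stage.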
He, Hui; Fan, Guotao; Ye, Jianwei; Zhang, Weizhe
2013-01-01
It is of great significance to research the early warning system for large-scale network security incidents. It can improve the network system's emergency response capabilities, alleviate the cyber attacks' damage, and strengthen the system's counterattack ability. A comprehensive early warning system is presented in this paper, which combines active measurement and anomaly detection. The key visualization algorithm and technology of the system are mainly discussed. The large-scale network system's plane visualization is realized based on the divide and conquer thought. First, the topology of the large-scale network is divided into some small-scale networks by the MLkP/CR algorithm. Second, the sub graph plane visualization algorithm is applied to each small-scale network. Finally, the small-scale networks' topologies are combined into a topology based on the automatic distribution algorithm of force analysis. As the algorithm transforms the large-scale network topology plane visualization problem into a series of small-scale network topology plane visualization and distribution problems, it has higher parallelism and is able to handle the display of ultra-large-scale network topology. PMID:24191145
Distributed genetic algorithms for the floorplan design problem
NASA Technical Reports Server (NTRS)
Cohoon, James P.; Hegde, Shailesh U.; Martin, Worthy N.; Richards, Dana S.
1991-01-01
Designing a VLSI floorplan calls for arranging a given set of modules in the plane to minimize the weighted sum of area and wire-length measures. A method of solving the floorplan design problem using distributed genetic algorithms is presented. Distributed genetic algorithms, based on the paleontological theory of punctuated equilibria, offer a conceptual modification to the traditional genetic algorithms. Experimental results on several problem instances demonstrate the efficacy of this method and indicate the advantages of this method over other methods, such as simulated annealing. The method has performed better than the simulated annealing approach, both in terms of the average cost of the solutions found and the best-found solution, in almost all the problem instances tried.
An O(log^2 N) parallel algorithm for computing the eigenvalues of a symmetric tridiagonal matrix
NASA Technical Reports Server (NTRS)
Swarztrauber, Paul N.
1989-01-01
An O(log^2 N) parallel algorithm is presented for computing the eigenvalues of a symmetric tridiagonal matrix using a parallel algorithm for computing the zeros of the characteristic polynomial. The method is based on a quadratic recurrence in which the characteristic polynomial is constructed on a binary tree from polynomials whose degree doubles at each level. Intervals that contain exactly one zero are determined by the zeros of polynomials at the previous level which ensures that different processors compute different zeros. The exact behavior of the polynomials at the interval endpoints is used to eliminate the usual problems induced by finite precision arithmetic.
NASA Astrophysics Data System (ADS)
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2016-07-01
Real-life multi-objective engineering design problems are difficult and time-consuming optimization problems because of their high degree of nonlinearity, complexity and inhomogeneity. Nature-inspired multi-objective optimization algorithms are becoming popular for solving such problems. This paper proposes the original multi-objective Bat algorithm (MOBA) and its extended form, a novel parallel hybrid multi-objective Bat algorithm (PHMOBA), to generate shortest-length Golomb rulers, called optimal Golomb ruler (OGR) sequences, in a reasonable computation time. OGRs are applied in optical wavelength division multiplexing (WDM) systems as a channel-allocation algorithm to reduce four-wave mixing (FWM) crosstalk. The performance of both proposed algorithms in generating OGRs for optical WDM channel allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations indicate that the proposed PHMOBA generates OGRs for optical WDM systems more efficiently than the original MOBA and the other existing algorithms, and that it has a higher convergence and success rate than the original MOBA. The efficiency improvement of the proposed PHMOBA in generating OGRs of up to 20 marks, in terms of ruler length and total optical channel bandwidth (TBW), is 100%, whereas for the original MOBA it is 85%. Finally, the implications for further research are discussed.
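The defining property of a Golomb ruler, that every pair of marks is separated by a distinct distance, can be checked directly. The short sketch below covers only that check and the ruler length used as an objective; it is not part of MOBA or PHMOBA, and the function names are hypothetical.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """True if all marks are distinct and all pairwise differences differ."""
    if len(set(marks)) != len(marks):
        return False
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))


def ruler_length(marks):
    """Distance between the first and last mark."""
    return max(marks) - min(marks)


candidate = [0, 1, 4, 6]
print(is_golomb_ruler(candidate), ruler_length(candidate))
print(is_golomb_ruler([0, 1, 2, 4]))
```

Here [0, 1, 4, 6] passes because its six differences (1, 2, 3, 4, 5, 6) are all distinct, while [0, 1, 2, 4] fails because the difference 1 occurs twice. A search algorithm would use a check like this as a feasibility test while minimizing the ruler length.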
NASA Astrophysics Data System (ADS)
Bernabe, Sergio; Igual, Francisco D.; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2015-10-01
In the last decade, the issue of endmember variability has received considerable attention, particularly when each pixel is modeled as a linear combination of endmembers or pure materials. As a result, several models and algorithms have been developed for considering the effect of endmember variability in spectral unmixing and possibly including multiple endmembers in the spectral unmixing stage. One of the most popular approaches for this purpose is the multiple endmember spectral mixture analysis (MESMA) algorithm. The procedure executed by MESMA can be summarized as follows: (i) First, a standard linear spectral unmixing (LSU) or fully constrained linear spectral unmixing (FCLSU) algorithm is run in an iterative fashion; (ii) Then, we use different endmember combinations, randomly selected from a spectral library, to decompose each mixed pixel; (iii) Finally, the model with the best fit, i.e., with the lowest root mean square error (RMSE) in the reconstruction of the original pixel, is adopted. However, this procedure can be computationally very expensive because several endmember combinations need to be tested and several abundance estimation steps need to be conducted, which compromises the use of MESMA in applications under real-time constraints. In this paper we develop (for the first time in the literature) an efficient implementation of MESMA on different platforms using OpenCL, an open standard for parallel programming on heterogeneous systems. Our experiments have been conducted using a simulated data set and the clMAGMA mathematical library. Implementations of this kind, written in the same descriptive language for different architectures, are very important for assessing the feasibility of using heterogeneous platforms for efficient hyperspectral image processing in real remote sensing missions.
NASA Astrophysics Data System (ADS)
Maharudrayya, S.; Jayanti, S.; Deshpande, A. P.
Single U- and Z-type parallel-channel configurations for gas distributor plates in planar fuel cells reduce the pressure drop but give rise to the problem of severe flow maldistribution wherein some of the channels may be starved of the reactants. In this paper, previous analytical solutions obtained for single U- and Z-type flow configurations are extended to multiple U- and multiple Z-type flow configurations of interest to fuel cell applications. Algorithms to calculate flow distribution and pressure drop in multiple U- and Z-type flow configurations are developed. The results are validated by comparison with those obtained from three-dimensional computational fluid dynamics (CFD) simulations. It is found that some configurations significantly improve the flow distribution without incurring extra pressure drop. The possibility of unmatched distribution on the cathode and anode sides is also highlighted. Careful design of the flow configuration is therefore necessary for optimum performance.
On the consequences of bi-Maxwellian plasma distributions for parallel electric fields
NASA Technical Reports Server (NTRS)
Olsen, Richard C.
1992-01-01
The objective is to use the measurements of the equatorial particle distributions to obtain the parallel electric field structure and the evolution of the plasma distribution function along the field line. Appropriate use of kinetic theory allows us to use the measured (and inferred) particle distributions to obtain the electric field, and hence the variation in plasma density along the magnetic field line. The approach here is to utilize the adiabatic invariants and assume the plasma distributions are in equilibrium.
NASA Astrophysics Data System (ADS)
Vidal, David Jean-Emmanuel
Two different parallel lattice Boltzmann (LBM) algorithms have been devised for the simulation of flow through complex porous media. They are based on memory efficient LBM algorithms, namely the one-lattice and shift algorithms, combined with vector data structure, even fluid node vector partitioning domain decomposition and efficient data transfer layouts. The shift implementation also includes a single unit relaxation scheme that allows additional memory savings, but limits its validity to Newtonian fluids. Both provide high parallel performance by balancing the workload among the processors and reducing the amount of data that needs to be transferred, and both significantly reduce memory usage compared with previous parallel LBM codes presented in the literature. Theoretical parallel performance and memory usage models show that they also offer good scalability, and efficiencies as high as 79% are reported for simulations comprising several billion fluid nodes on 128 processors. The application of one of these algorithms to the simulation of flow through compressed packings made of highly polydisperse spheres has demonstrated the remarkable precision and efficiency of the proposed algorithm. As a result, a modified Carman-Kozeny correlation taking into account the compression level and the particle polydispersity has been formulated.
Cohen, J.D.; Dunbar, K.; McClelland, J.L.
1989-11-22
A growing body of evidence suggests that traditional views of automaticity are in need of revision. For example, automaticity has often been treated as an all-or-none phenomenon, and traditional theories have held that automatic processes are independent of attention. Yet recent empirical data suggest that automatic processes are continuous, and furthermore are subject to attentional control. In this paper we present a model of attention which addresses these issues. Using a parallel distributed processing framework we propose that the attributes of automaticity depend upon the strength of a processing pathway and that strength increases with training. Using the Stroop effect as an example, we show how automatic processes are continuous and emerge gradually with practice. Specifically, we present a computational model of the Stroop task which simulates the time course of processing as well as the effects of learning. This was accomplished by combining the cascade mechanism described by McClelland (1979) with the back propagation learning algorithm (Rumelhart, Hinton, Williams, 1986). The model is able to simulate performance in the standard Stroop task, as well as aspects of performance in variants of this task which manipulate SOA, response set, and degree of practice. In the discussion we contrast our model with other models, and indicate how it relates to many of the central issues in the literature on attention, automaticity, and interference.
NASA Astrophysics Data System (ADS)
Fukunaga, Takafumi
With the advent of powerful multi-core PC clusters, the computational performance of each node has increased dramatically, and this trend will continue. On the other hand, powerful network systems (Myrinet, Infiniband, etc.) are expensive, and they tend to make programming more difficult and degrade portability because they need dedicated libraries and protocol stacks. This paper proposes a relatively simple method to improve bandwidth-oriented parallel applications by improving communication performance without the above dedicated hardware, libraries, protocol stacks and IEEE802.3ad (LACP). Although this proposal resembles IEEE802.3ad in using multiple Ethernet ports, it performs as well as or better than IEEE802.3ad without LACP switches and drivers. Moreover, whereas the performance of LACP is influenced by the environment (MAC addresses, IP addresses, etc.) because its distribution algorithm uses these parameters, the proposed method achieves the same effect regardless of them.
A distributed Canny edge detector: algorithm and FPGA implementation.
Xu, Qian; Varadarajan, Srenivas; Chakrabarti, Chaitali; Karam, Lina J
2014-07-01
The Canny edge detector is one of the most widely used edge detection algorithms due to its superior performance. Unfortunately, not only is it computationally more intensive as compared with other edge detection algorithms, but it also has a higher latency because it is based on frame-level statistics. In this paper, we propose a mechanism to implement the Canny algorithm at the block level without any loss in edge detection performance compared with the original frame-level Canny algorithm. Directly applying the original Canny algorithm at the block-level leads to excessive edges in smooth regions and to loss of significant edges in high-detailed regions since the original Canny computes the high and low thresholds based on the frame-level statistics. To solve this problem, we present a distributed Canny edge detection algorithm that adaptively computes the edge detection thresholds based on the block type and the local distribution of the gradients in the image block. In addition, the new algorithm uses a nonuniform gradient magnitude histogram to compute block-based hysteresis thresholds. The resulting block-based algorithm has a significantly reduced latency and can be easily integrated with other block-based image codecs. It is capable of supporting fast edge detection of images and videos with high resolutions, including full-HD since the latency is now a function of the block size instead of the frame size. In addition, quantitative conformance evaluations and subjective tests show that the edge detection performance of the proposed algorithm is better than the original frame-based algorithm, especially when noise is present in the images. Finally, this algorithm is implemented using a 32 computing engine architecture and is synthesized on the Xilinx Virtex-5 FPGA. The synthesized architecture takes only 0.721 ms (including the SRAM READ/WRITE time and the computation time) to detect edges of 512 × 512 images in the USC SIPI database when clocked at 100
NASA Astrophysics Data System (ADS)
Sun, Yang; Wu, Ke-nan; Gao, Hong; Jin, Yu-qi
2015-02-01
A novel optimization method, the stochastic parallel proportional-integral-derivative (SPPID) algorithm, is proposed for high-resolution phase-distortion correction in wave-front sensorless adaptive optics (WSAO). To enhance the global search and self-adaptation of the stochastic parallel gradient descent (SPGD) algorithm, the residual error of the performance metric and its temporal integration are incorporated into the calculation of the incremental control signal. On the basis of the maximum fitting rate between the real wave-front and the corrector, a goal value of the metric is set as the reference. The residual error of the metric relative to the reference is transformed into proportional and integration terms to produce an adaptive step-size updating law for the SPGD algorithm. The adaptation of the step size leads the blind optimization toward the desired goal and helps it escape from local extrema. Unlike the conventional proportional-integral-derivative (PID) algorithm, the SPPID algorithm designs the incremental control signal as PI-by-D for adaptive adjustment of the control law in the SPGD algorithm. Experiments on high-resolution phase-distortion correction in "frozen" turbulences, based on influence function coefficients optimization, were carried out respectively using 128-by-128 typed spatial light modulators, a photo detector and a control computer. Results revealed that the presented algorithm offered better performance in both cases. The step-size update based on the residual error and its temporal integration was shown to resolve the severe local lock-in problem of the SPGD algorithm used in high-resolution adaptive optics.
ERIC Educational Resources Information Center
Vazquez Aranda, Armando I.; Henquin, Eduardo R.; Torres, Israel Rodriguez; Bisang, Jose M.
2012-01-01
A laboratory experiment is described to determine the primary current distribution in parallel-plate electrochemical reactors. The electrolyte is simulated by conductive paper and the electrodes are segmented to measure the current distribution. Experiments are reported with the electrolyte confined to the interelectrode gap, where the current…
Kundu, D.
1992-03-01
The demand for connecting small generators to the distribution system has increased considerably since 1986. Therefore, there is a need to summarize the technical requirements for generator connection. This paper highlights the technical requirements for connecting parallel generators to the Ontario Hydro Distribution Electricity System. It also discusses some applications to connect synchronous and induction generators to the DES.
A new distributed systems scheduling algorithm: a swarm intelligence approach
NASA Astrophysics Data System (ADS)
Haghi Kashani, Mostafa; Sarvizadeh, Raheleh; Jameii, Mahdi
2011-12-01
The scheduling problem in distributed systems is known to be NP-complete, and methods based on heuristic or metaheuristic search have been proposed to obtain optimal and suboptimal solutions. Task scheduling is a key factor in achieving better performance in distributed systems. In this paper, an efficient method based on a memetic algorithm is developed to solve the problem of distributed systems scheduling. To balance the load efficiently, Artificial Bee Colony (ABC) is applied as the local search in the proposed memetic algorithm. The proposed method is compared with an existing memetic-based approach in which the Learning Automata method is used as the local search. The results demonstrate that the proposed method outperforms that approach in terms of communication cost.
A new parallel-vector finite element analysis software on distributed-memory computers
NASA Technical Reports Server (NTRS)
Qin, Jiangning; Nguyen, Duc T.
1993-01-01
A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
Parallel kinematic mechanisms for distributed actuation of future structures
NASA Astrophysics Data System (ADS)
Lai, G.; Plummer, A. R.; Cleaver, D. J.; Zhou, H.
2016-09-01
Future machines will require distributed actuation integrated with load-bearing structures, so that they are lighter, move faster, use less energy, and are more adaptable. Good examples are shape-changing aircraft wings which can adapt precisely to the ideal aerodynamic form for current flying conditions, and light but powerful robotic manipulators which can interact safely with human co-workers. A 'tensegrity structure' is a good candidate for this application due to its potentially excellent stiffness and strength-to-weight ratio and a multi-element structure into which actuators could be embedded. This paper presents results of an analysis of an example practical actuated tensegrity structure consisting of 3 ‘unit cells’. A numerical method is used to determine the stability of the structure with varying actuator length, showing how four actuators can be used to control movement in three degrees of freedom as well as simultaneously maintaining the structural pre-load. An experimental prototype has been built, in which 4 pneumatic artificial muscles (PAMs) are embedded in one unit cell. The PAMs are controlled antagonistically, by high speed switching of on-off valves, to achieve control of position and structure pre-load. Experimental and simulation results are presented, and future prospects for the approach are discussed.
NASA Astrophysics Data System (ADS)
Sourbier, Florent; Operto, Stéphane; Virieux, Jean; Amestoy, Patrick; L'Excellent, Jean-Yves
2009-03-01
This is the first paper in a two-part series that describes a massively parallel code that performs 2D frequency-domain full-waveform inversion of wide-aperture seismic data for imaging complex structures. Full-waveform inversion methods, namely quantitative seismic imaging methods based on the resolution of the full wave equation, are computationally expensive. Therefore, designing efficient algorithms which take advantage of parallel computing facilities is critical for the appraisal of these approaches when applied to representative case studies and for further improvements. Full-waveform modelling requires the resolution of a large sparse system of linear equations which is performed with the massively parallel direct solver MUMPS for efficient multiple-shot simulations. Efficiency of the multiple-shot solution phase (forward/backward substitutions) is improved by using the BLAS3 library. The inverse problem relies on a classic local optimization approach implemented with a gradient method. The direct solver returns the multiple-shot wavefield solutions distributed over the processors according to a domain decomposition driven by the distribution of the LU factors. The domain decomposition of the wavefield solutions is used to compute in parallel the gradient of the objective function and the diagonal Hessian, this latter providing a suitable scaling of the gradient. The algorithm allows one to test different strategies for multiscale frequency inversion ranging from successive mono-frequency inversion to simultaneous multifrequency inversion. These different inversion strategies will be illustrated in the following companion paper. The parallel efficiency and the scalability of the code will also be quantified.
Mesoscale Simulations of Particulate Flows with Parallel Distributed Lagrange Multiplier Technique
Kanarska, Y
2010-03-24
Fluid particulate flows are common phenomena in nature and industry. Modeling of such flows at micro and macro levels, as well as establishing relationships between these approaches, is needed to understand properties of the particulate matter. We propose a computational technique based on the direct numerical simulation of the particulate flows. The numerical method is based on the distributed Lagrange multiplier technique following the ideas of Glowinski et al. (1999). Each particle is explicitly resolved on an Eulerian grid as a separate domain, using solid volume fractions. The fluid equations are solved through the entire computational domain; however, Lagrange multiplier constraints are applied inside the particle domain such that the fluid within any volume associated with a solid particle moves as an incompressible rigid body. Mutual forces for the fluid-particle interactions are internal to the system. Particles interact with the fluid via fluid dynamic equations, resulting in implicit fluid-rigid-body coupling relations that produce realistic fluid flow around the particles (i.e., no-slip boundary conditions). The particle-particle interactions are implemented using explicit force-displacement interactions for frictional inelastic particles similar to the DEM method of Cundall et al. (1979) with some modifications using a volume of an overlapping region as an input to the contact forces. The method is flexible enough to handle arbitrary particle shapes and size distributions. A parallel implementation of the method is based on the SAMRAI (Structured Adaptive Mesh Refinement Application Infrastructure) library, which allows handling of large amounts of rigid particles and enables local grid refinement. Accuracy and convergence of the presented method have been tested against known solutions for a falling sphere as well as by examining fluid flows through stationary particle beds (periodic and cubic packing). To evaluate code performance and validate particle
Stankovski, Z.
1995-12-31
The collision probability method in neutron transport, as applied to 2D geometries, consumes a great amount of computer time: for a typical 2D assembly calculation, about 90% of the computing time is spent in the collision probability evaluations. Consequently, RZ or 3D calculations become prohibitive. In this paper the author presents a simple but efficient parallel algorithm based on the message-passing host/node programming model. Parallelization was applied to the energy group treatment. Such an approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, which is a necessary condition for an industrial code. Sequential performance is also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SP1 and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with the IBM SP1. Because of the heterogeneity of the workstation network, the author did not expect high performance from this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors.
NASA Technical Reports Server (NTRS)
Sargent, Jeff Scott
1988-01-01
A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel approaches to controlling error in parallel cell-placement algorithms: Heuristic Cell-Coloring and Adaptive (Parallel Move) Sequence Control. Heuristic Cell-Coloring identifies sets of noninteracting cells that can be moved repeatedly, and in parallel, with no buildup of error in the placement cost. Adaptive Sequence Control allows multiple parallel cell moves to take place between global cell-position updates. This feedback mechanism is based on an error bound derived analytically from the traditional annealing move-acceptance profile. Placement results are presented for real industry circuits, and the performance of an implementation on the Intel iPSC/2 Hypercube is summarized. This algorithm runs 5 to 16 times faster than a previous program developed for the Hypercube, while producing placements of equivalent quality. An integrated place and route program for the Intel iPSC/2 Hypercube is currently being developed.
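The idea of grouping noninteracting cells can be sketched with a greedy graph coloring. This is only an illustration under a stated assumption: two cells are taken to "interact" when they share a net, which the abstract does not confirm, and the function name `color_cells` and the net-list input format are hypothetical.

```python
def color_cells(nets):
    """Greedily color cells so that cells sharing a net never get the same color.

    nets: list of nets, each a list of cell ids (assumed input format).
    Returns a dict mapping cell id -> color index.
    """
    # Build the interaction graph: cells on a common net are neighbors.
    neighbors = {}
    for net in nets:
        for cell in net:
            neighbors.setdefault(cell, set()).update(c for c in net if c != cell)

    color = {}
    for cell in sorted(neighbors):
        used = {color[n] for n in neighbors[cell] if n in color}
        k = 0
        while k in used:          # smallest color not used by a colored neighbor
            k += 1
        color[cell] = k
    return color
```

Under this assumption, all cells of one color share no net with each other, so moving them at the same time does not make any one move's cost estimate depend on another move in the same set.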
Chin, George; Choudhury, Sutanay; Kangas, Lars J.; McFarlane, Sally A.; Marquez, Andres
2011-09-01
Long viewed as a strong statistical inference technique, Bayesian networks have emerged as an important class of applications for high-performance computing. We have applied an architecture-conscious approach to parallelizing the Lauritzen-Spiegelhalter Junction Tree algorithm for exact inferencing in Bayesian networks. In optimizing the Junction Tree algorithm, we have implemented both in-clique and topological parallelism strategies to best leverage the fine-grained synchronization and massive-scale multithreading of the Cray XMT architecture. Two topological techniques were developed to parallelize the evidence propagation process through the Bayesian network. One technique involves performing intelligent scheduling of junction tree nodes based on its topology and relative size. The second technique involves decomposing the junction tree into a much finer tree-like representation to offer many more opportunities for parallelism. We evaluate these optimizations on five different Bayesian networks and report our findings and observations. Another important contribution of this paper is to demonstrate the application of massive-scale multithreading for load balancing and the use of implicit parallelism-based compiler optimizations in designing scalable inferencing algorithms.
Du, Tingsong; Hu, Yang; Ke, Xianting
2015-01-01
An improved quantum artificial fish swarm algorithm (IQAFSA) for solving distributed network programming considering distributed generation is proposed in this work. The IQAFSA, based on quantum computing, which offers exponential acceleration for heuristic algorithms, uses quantum bits to code artificial fish and uses the quantum revolving gate, preying behavior, following behavior, and variation of quantum artificial fish to update the artificial fish in searching for the optimal value. Then, we apply the proposed new algorithm, the quantum artificial fish swarm algorithm (QAFSA), the basic artificial fish swarm algorithm (BAFSA), and the global edition artificial fish swarm algorithm (GAFSA) to simulation experiments on some typical test functions. The simulation results demonstrate that the proposed algorithm can escape from local extrema effectively and has higher convergence speed and better accuracy. Finally, IQAFSA is applied to distributed network problems, and the simulation results for a 33-bus radial distribution network system show that IQAFSA achieves the minimum power loss compared with BAFSA, GAFSA, and QAFSA. PMID:26447713
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
Tao, X.; Lu, Q.
2014-02-15
In space plasmas, charged particles are frequently observed to possess a high-energy tail, which is often modeled by a kappa-type distribution function. In this work, the formation of the electron kappa distribution in the generation of parallel propagating whistler waves is investigated using fully nonlinear particle-in-cell (PIC) simulations. Previous research concluded that the bi-Maxwellian character of electron distributions is preserved in PIC simulations. We now demonstrate that for interactions between electrons and parallel propagating whistler waves, a non-Maxwellian high-energy tail can be formed, and a kappa distribution can be used to fit the electron distribution in the time-asymptotic limit. The κ-parameter is found to decrease with increasing initial temperature anisotropy or decreasing ratio of electron plasma frequency to cyclotron frequency. The results might be helpful in understanding the origin of electron kappa distributions observed in space plasmas.
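For orientation, one commonly used isotropic form of the kappa distribution is given here. The abstract does not state which form the authors fit, so the symbols (number density n, effective thermal speed θ, particle speed v) are assumptions made for illustration.

\[
f_\kappa(v) = \frac{n}{\pi^{3/2}\,\theta^{3}}\,
\frac{\Gamma(\kappa+1)}{\kappa^{3/2}\,\Gamma(\kappa-1/2)}
\left(1 + \frac{v^{2}}{\kappa\,\theta^{2}}\right)^{-(\kappa+1)}
\]

In this form the distribution approaches a Maxwellian as κ → ∞, while smaller κ gives a more pronounced power-law tail, so a decreasing κ-parameter corresponds to a stronger high-energy tail.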
Fault-tolerant distributed algorithms for agreement and election
Abu-Amara, H.H.
1988-01-01
The first part completely characterizes the shared-memory requirements for achieving agreement in an asynchronous system of fail-stop processes that die undetectably. There is no agreement protocol that uses only read and write operations, even if at most one process dies. This result implies the impossibility of Byzantine agreement in asynchronous message-passing systems. The second part considers the election problem on asynchronous complete networks when the processors are reliable but some of the channels may be intermittently faulty. To be consistent with the standard model of distributed algorithms, in which channel delays can be arbitrary but finite, it is assumed that channel failures are undetectable. An algorithm is given that correctly solves the problem when the channels fail before or during the execution of the algorithm.
Improving permafrost distribution modelling using feature selection algorithms
NASA Astrophysics Data System (ADS)
Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail
2016-04-01
The availability of an increasing number of spatial data on the occurrence of mountain permafrost allows the employment of machine learning (ML) classification algorithms for modelling the distribution of the phenomenon. One of the major problems when dealing with high-dimensional datasets is the number of input features (variables) involved. Application of ML classification algorithms to this large number of variables leads to the risk of overfitting, with the consequence of poor generalization/prediction. For this reason, applying feature selection (FS) techniques helps reduce the number of factors required and improves the knowledge of the adopted features and their relation with the studied phenomenon. Moreover, taking away irrelevant or redundant variables from the dataset effectively improves the quality of the ML prediction. This research deals with a comparative analysis of permafrost distribution models supported by FS variable importance assessment. The input dataset (dimension = 20-25, 10 m spatial resolution) was constructed using landcover maps, climate data and DEM derived variables (altitude, aspect, slope, terrain curvature, solar radiation, etc.). It was completed with permafrost evidence (geophysical and thermal data and rock glacier inventories) that serves as training permafrost data. The FS algorithms used indicated which variables appeared less statistically important for permafrost presence/absence. Three different algorithms were compared: Information Gain (IG), Correlation-based Feature Selection (CFS) and Random Forest (RF). IG is a filter technique that evaluates the worth of a predictor by measuring the information gain with respect to the permafrost presence/absence. Conversely, CFS is a wrapper technique that evaluates the worth of a subset of predictors by considering the individual predictive ability of each variable along with the degree of redundancy between them. Finally, RF is a ML algorithm that performs FS as part of its
NASA Astrophysics Data System (ADS)
Guo, Peng; Cheng, Wenming; Wang, Yi
2015-11-01
This article considers the parallel machine scheduling problem with step-deteriorating jobs and sequence-dependent setup times. The objective is to minimize the total tardiness by determining the allocation and sequence of jobs on identical parallel machines. In this problem, the processing time of each job is a step function dependent upon its starting time. An individual extended time is penalized when the starting time of a job is later than a specific deterioration date. The possibility of deterioration of a job makes the parallel machine scheduling problem more challenging than ordinary ones. A mixed integer programming model for the optimal solution is derived. Due to its NP-hard nature, a hybrid discrete cuckoo search algorithm is proposed to solve this problem. In order to generate a good initial swarm, a modified Biskup-Hermann-Gupta (BHG) heuristic called MBHG is incorporated into the population initialization. Several discrete operators are proposed in the random walk of Lévy flights and the crossover search. Moreover, a local search procedure based on variable neighbourhood descent is integrated into the algorithm as a hybrid strategy in order to improve the quality of elite solutions. Computational experiments are executed on two sets of randomly generated test instances. The results show that the proposed hybrid algorithm can yield better solutions in comparison with the commercial solver CPLEX® with a one hour time limit, the discrete cuckoo search algorithm and the existing variable neighbourhood search algorithm.
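The objective described in this abstract can be made concrete with a small evaluation routine. This is a sketch under assumptions rather than the authors' model: the function and parameter names are hypothetical, the first job on each machine is assumed to need no setup, and a job is assumed to deteriorate only when its start time is strictly later than its deterioration date.

```python
def total_tardiness(schedule, base, penalty, deterioration, due, setup):
    """Total tardiness of a schedule with step-deteriorating jobs.

    schedule      : list of job sequences, one per machine
    base[j]       : basic processing time of job j
    penalty[j]    : extra time added if job j starts after deterioration[j]
    due[j]        : due date of job j
    setup[i][j]   : setup time when job j directly follows job i on a machine
    """
    total = 0
    for sequence in schedule:
        t = 0            # current time on this machine
        prev = None
        for j in sequence:
            if prev is not None:
                t += setup[prev][j]          # sequence-dependent setup
            start = t
            # Step function: the processing time jumps once the start is too late.
            p = base[j] + (penalty[j] if start > deterioration[j] else 0)
            t = start + p
            total += max(0, t - due[j])      # tardiness of job j
            prev = j
    return total
```

A metaheuristic such as the cuckoo search mentioned above would call a routine like this to score each candidate allocation and sequence.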
NASA Astrophysics Data System (ADS)
Sourbier, F.; Operto, S.; Virieux, J.
2006-12-01
We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in Fortran 90 and uses MPI for parallelism. The algorithm was applied to a real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency domain, solving the wave equation requires resolution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computers to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided into 3 main steps. First, a symbolic analysis step performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization and estimates the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accommodate numerical pivoting and provides the LU factors distributed over all the processors. Third, the resolution is performed for multiple sources. To compute the gradient of the cost function, 2 simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source resolutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute in parallel the gradient of the cost function. Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor
Xu, Zhenzhen; Zou, Yongxing; Kong, Xiangjie
2015-01-01
To our knowledge, this paper investigates the first application of meta-heuristic algorithms to tackle the parallel machines scheduling problem with weighted late work criterion and common due date ([Formula: see text]). The late work criterion is a performance measure for scheduling problems that considers the length of the late parts of particular jobs when evaluating the quality of a schedule. Since this problem is known to be NP-hard, three meta-heuristic algorithms, namely ant colony system, genetic algorithm, and simulated annealing (SA), are designed and implemented. We also propose a novel algorithm named LDF (largest density first), which is derived from LPT (longest processing time first). The computational experiments compare these meta-heuristic algorithms with LDF, LPT and LS (list scheduling), and the experimental results show that SA performs the best in most cases. However, LDF is better than SA in some conditions; moreover, the running time of LDF is much shorter than that of SA. PMID:26702371
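The LPT baseline and the late work criterion named in this abstract can be sketched as follows. The abstract does not define LDF, so only the classical LPT rule is shown. The function names are hypothetical, and the late work of a job is taken, as an assumption consistent with the usual definition, to be the part of its processing time executed after the common due date.

```python
import heapq


def lpt_schedule(processing, machines):
    """Longest Processing Time first: returns the completion time of each job."""
    heap = [(0, m) for m in range(machines)]      # (current load, machine id)
    heapq.heapify(heap)
    completion = {}
    # Take jobs in order of decreasing processing time.
    for j in sorted(range(len(processing)), key=lambda j: -processing[j]):
        load, m = heapq.heappop(heap)             # least-loaded machine
        completion[j] = load + processing[j]
        heapq.heappush(heap, (completion[j], m))
    return completion


def weighted_late_work(processing, weights, completion, due):
    """Sum of weights times the portion of each job processed after the due date."""
    return sum(
        w * min(p, max(0, completion[j] - due))
        for j, (p, w) in enumerate(zip(processing, weights))
    )
```

For example, with processing times `[5, 3, 8, 2]`, weights `[1, 2, 1, 3]`, two machines and a common due date of 7, the LPT schedule completes the jobs at times 5, 8, 8 and 10, giving a weighted late work of 9.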
NASA Astrophysics Data System (ADS)
Guo, Z.; Xiong, S. M.
2015-05-01
An algorithm comprising adaptive mesh refinement (AMR) and parallel (Para-) computing capabilities was developed to efficiently solve the coupled phase field equations in 3-D. The AMR was achieved based on a gradient criterion and the point clustering algorithm introduced by Berger (1991). To reduce the time for mesh generation, a dynamic regridding approach was developed based on the magnitude of the maximum phase advancing velocity. Local data at each computing process was then constructed and parallel computation was realized based on the hierarchical grid structure created during the AMR. Numerical tests and simulations on single and multi-dendrite growth were performed, and results show that the proposed algorithm could shorten the computing time for 3-D phase field simulation by about two orders of magnitude and enable one to gain much more insight into the underlying physics of dendrite growth during solidification.
A Parallel Approach To Optimum Actuator Selection With a Genetic Algorithm
NASA Technical Reports Server (NTRS)
Rogers, James L.
2000-01-01
Recent discoveries in smart technologies have created a variety of aerodynamic actuators which have great potential to enable entirely new approaches to aerospace vehicle flight control. For a revolutionary concept such as a seamless aircraft with no moving control surfaces, there is a large set of candidate locations for placing actuators, resulting in a substantially larger number of combinations to examine in order to find an optimum placement satisfying the mission requirements. The placement of actuators on a wing determines the control effectiveness of the airplane. One approach to placement maximizes the moments about the pitch, roll, and yaw axes, while minimizing the coupling. Genetic algorithms have been instrumental in achieving good solutions to discrete optimization problems, such as the actuator placement problem. As a proof of concept, a genetic algorithm has been developed to find the minimum number of actuators required to provide uncoupled pitch, roll, and yaw control for a simplified, untapered, unswept wing model. To find the optimum placement by searching all possible combinations would require 1,100 hours. Formulating the problem as a multi-objective problem and modifying it to take advantage of the parallel processing capabilities of a multi-processor computer reduces the optimization time to 22 hours.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Błażewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napierała, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is a branch of fluid mechanics that uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shape optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting, and realistic visualizations. CFD problems are very complex and need a lot of computational power to produce results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment due to its simplicity of programming and good computational performance. We used a finite difference discretization of the Navier-Stokes equations, where the fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of local, surrounding cells; therefore it is well suited to the GPU-based architecture. In this paper we demonstrate how to efficiently use the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPUs. Finally, we discuss various challenges around the multi-GPU implementation, using matrix multiplication as an example.