Parallel and Distributed Computing Combinatorial Algorithms
1993-10-01
PARALLEL AND DISTRIBUTED COMPUTING COMBINATORIAL ALGORITHMS 6. AUTHOR(S) 2304/DS F49620-92-J-0125 DR. LEIGHTON 7 PERFORMING ORGANIZATION NAME...on several problems involving parallel and distributed computing and combinatorial optimization. This research is reported in the numerous papers that...network decomposition. In Proceedings of the Eleventh Annual ACM Symposium on Principles of Distributed Computing, August 1992. [15] B. Awerbuch, B
Parallel matrix transpose algorithms on distributed memory concurrent computers
Choi, J.; Walker, D.W.; Dongarra, J.J.
1993-10-01
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A·B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T·B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
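The role of the GCD and LCM can be checked with a small counting sketch. It assumes the usual block scattered (block-cyclic) ownership rule, in which block (i, j) lives on processor (i mod P, j mod Q); the function names are illustrative and are not taken from the PUMMA package. Under that assumption, each processor exchanges blocks with exactly LCM/GCD processors (possibly including itself), which is consistent with the step count quoted in the abstract. The sketch does not reproduce the authors' actual communication schedule.

```python
from math import gcd

def transpose_partners(p, q, P, Q):
    """Processors that (p, q) must send blocks to when transposing.

    Assumed ownership rule: block (i, j) lives on (i mod P, j mod Q).
    After the transpose it sits at position (j, i), owned by
    (j mod P, i mod Q).
    """
    lcm = P * Q // gcd(P, Q)
    partners = set()
    # One LCM x LCM tile of blocks covers every residue pattern.
    for i in range(p, lcm, P):
        for j in range(q, lcm, Q):
            partners.add((j % P, i % Q))
    return partners

def partner_count(P, Q):
    """Number of distinct destinations for processor (0, 0)."""
    return len(transpose_partners(0, 0, P, Q))
```

When P and Q are relatively prime the count equals P*Q, i.e. every processor talks to every processor (complete exchange); on a square grid with P = Q it drops to a single partner.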
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database
Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...
2013-01-01
Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm for creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in a 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n ghost layers up to a point where the whole partitioned mesh can be ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.
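The idea of ghost layers can be illustrated with a serial sketch. It assumes the mesh is given as a plain adjacency dictionary with a globally visible owner map; the names `adjacency`, `owner` and `ghost_layers` are hypothetical. A real distributed implementation would obtain the second and later layers through neighbor-to-neighbor messages rather than a global lookup.

```python
def ghost_layers(adjacency, owner, part, n_layers):
    """Return the entities that `part` needs as ghost copies.

    adjacency: dict entity -> iterable of neighboring entities
    owner:     dict entity -> id of the part that owns the entity
    """
    frontier = {e for e in adjacency if owner[e] == part}
    known = set(frontier)          # owned entities plus ghosts found so far
    ghosts = set()
    for _ in range(n_layers):
        next_frontier = set()
        for e in frontier:
            for nb in adjacency[e]:
                if nb not in known:
                    # nb is off-part, otherwise it would already be known.
                    known.add(nb)
                    next_frontier.add(nb)
                    ghosts.add(nb)
        frontier = next_frontier   # next layer grows from the new ghosts
    return ghosts
```

Asking for enough layers eventually ghosts every off-part entity reachable from the part, which mirrors the "whole partitioned mesh can be ghosted" limit mentioned in the abstract.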
Distributed-memory Parallel Algorithms for Matching and Coloring
Catalyurek, Umit; Dobrian, Florin; Gebremedhin, Assefaw H.; Halappanavar, Mahantesh; Pothen, Alex
2011-05-31
Graph matching and coloring constitute two fundamental classes of combinatorial problems having numerous established as well as emerging applications in computational science and engineering, high-performance computing, and informatics. We provide a snapshot of an on-going work on the design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical problems from these classes, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency and poor data locality, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including approximation, speculation and iteration, optimized communication, and randomization, in concert. We present preliminary results on weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale.
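The "speculation and iteration" technique for distance-1 coloring can be sketched in a few lines. This is a serial simulation written for illustration, not the authors' distributed-memory code: in each round every uncolored vertex picks a color from a stale snapshot (as concurrent processors would), conflicts are then detected, and the losers retry. The tie-breaking rule (the larger vertex id retries) is an assumption.

```python
def speculative_coloring(adjacency):
    """adjacency: dict vertex -> set of neighbors (undirected, comparable ids)."""
    color = {}
    pending = set(adjacency)
    while pending:
        snapshot = dict(color)  # settled colors visible at the start of the round
        # Speculation: each pending vertex picks the smallest color unused by
        # its settled neighbors, ignoring other pending vertices.
        for v in pending:
            used = {snapshot[u] for u in adjacency[v] if u in snapshot}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        # Conflict detection: of two adjacent pending vertices with the same
        # color, the one with the larger id must retry in the next round.
        retry = {v for v in pending
                 for u in adjacency[v]
                 if u in pending and u < v and color[u] == color[v]}
        for v in retry:
            del color[v]
        pending = retry
    return color
```

The loop terminates because the smallest pending vertex never loses a conflict, so at least one vertex settles per round.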
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving's meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia's "tiling" dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levels of parallelism on multiple instruction multiple data machines is indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations based on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
Parallel and Distributed Computing.
1986-12-12
program was devoted to parallel and distributed computing. Support for this part of the program was obtained from the present Army contract and a...Umesh Vazirani. A workshop on parallel and distributed computing was held from May 19 to May 23, 1986 and drew 141 participants. Keywords: Mathematical programming; Protocols; Randomized algorithms. (Author)
Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.
Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias
2011-01-01
The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in half the time and with a quarter of the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
A data distributed parallel algorithm for ray-traced volume rendering
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Painter, James S.; Hansen, Charles D.; Krogh, Michael F.
1993-01-01
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolume concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on both the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
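The final compositing step can be sketched as follows. The abstract does not name the blending rule, so this sketch assumes the standard "over" operator on premultiplied RGBA pixels and a front-to-back subimage order that is already known; `over` and `composite` are illustrative names.

```python
def over(front, back):
    """Composite two premultiplied (r, g, b, a) pixels, front over back."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    t = 1.0 - fa  # fraction of the back pixel that shows through
    return (fr + t * br, fg + t * bg, fb + t * bb, fa + t * ba)

def composite(subimages):
    """subimages: list of equal-length pixel lists, ordered front to back."""
    result = list(subimages[0])
    for image in subimages[1:]:
        result = [over(f, b) for f, b in zip(result, image)]
    return result
```

Because "over" is associative but not commutative, subimages can be merged pairwise in parallel as long as the front-to-back order is respected, which is why an order known a priori matters.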
Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim
2014-07-01
The surface line integral convolution (LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
NASA Astrophysics Data System (ADS)
Loring, B.; Karimabadi, H.; Rortershteyn, V.
2015-10-01
The surface line integral convolution (LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
NASA Astrophysics Data System (ADS)
Zheng, Yan
2015-03-01
The Internet of Things (IoT), which focuses on providing users with information exchange and intelligent control, has attracted a great deal of attention from researchers all over the world since the beginning of this century. The IoT consists of a large number of sensor nodes and data processing units, and its most important features are energy confinement, efficient communication and high redundancy. As the number of sensor nodes grows, communication efficiency and available communication bandwidth become bottlenecks. Much existing research is based on instances in which the number of joins is small. However, this does not suit the increasing number of multi-join queries across the whole Internet of Things. To improve the communication efficiency between parallel units in the distributed sensor network, this paper proposes a parallel query optimization algorithm based on a distribution attributes cost graph. The algorithm considers the storage information relations and the network communication cost, and establishes an optimized information changing rule. The experimental results show that the algorithm performs well and effectively uses the resources of each node in the distributed sensor network. Therefore, the execution efficiency of multi-join queries between different nodes can be improved.
Choi, Jaeyoung; Walker, D.W.; Dongarra, J.J.
1993-08-01
This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A·B, but also transposed multiplication routines C = A^T·B, C = A·B^T, and C = A^T·B^T, for a block scattered data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. Together, the PUMMA routines provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.
A data distributed, parallel algorithm for ray-traced volume rendering
Ma, Kwan-Liu; Painter, J.S.; Hansen, C.D.; Krogh, M.F.
1993-03-30
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and its implementation on networked workstations and a massively parallel computer, the Connection Machine CM-5. This algorithm distributes the data and the computational load to individual processing units to achieve fast, high-quality rendering of high-resolution data, even when only a modest amount of memory is available on each machine. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolume concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Implementations and tests on a group of networked workstations and on the Thinking Machines CM-5 demonstrate the practicality of our algorithm and expose different performance tuning issues for each platform. We use data sets from medical imaging and computational fluid dynamics simulations in the study of this algorithm.
Dong, Yu-Shuang; Xu, Gao-Chao; Fu, Xiao-Dong
2014-01-01
The cloud platform provides various services to users. More and more cloud centers provide infrastructure as the main way of operating. To improve the utilization rate of the cloud center and to decrease the operating cost, the cloud center provides services according to requirements of users by sharding the resources with virtualization. Considering both QoS for users and cost saving for cloud computing providers, we try to maximize performance and minimize energy cost as well. In this paper, we propose a distributed parallel genetic algorithm (DPGA) placement strategy for virtual machine deployment on a cloud platform. In the first stage, it executes the genetic algorithm in parallel and in a distributed manner on several selected physical hosts. It then continues with the genetic algorithm of the second stage, using the solutions obtained from the first stage as the initial population. The solution calculated by the genetic algorithm of the second stage is the optimal one of the proposed approach. The experimental results show that the proposed placement strategy of VM deployment can ensure QoS for users and that it is more effective and more energy efficient than other placement strategies on the cloud platform. PMID:25097872
Totally parallel multilevel algorithms
NASA Technical Reports Server (NTRS)
Frederickson, Paul O.
1988-01-01
Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which is referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.
Parallel Wolff Cluster Algorithms
NASA Astrophysics Data System (ADS)
Bae, S.; Ko, S. H.; Coddington, P. D.
The Wolff single-cluster algorithm is the most efficient method known for Monte Carlo simulation of many spin models. Due to the irregular size, shape and position of the Wolff clusters, this method does not easily lend itself to efficient parallel implementation, so that simulations using this method have thus far been confined to workstations and vector machines. Here we present two parallel implementations of this algorithm, and show that one gives fairly good performance on a MIMD parallel computer.
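For reference, a serial Wolff single-cluster update for the 2D Ising model can be sketched as below. The choice of model (Ising on a periodic square lattice, coupling J) is an assumption, since the abstract only says "many spin models", and the two parallel implementations it mentions are not reproduced here. The irregular growth of `cluster` from a random seed is what makes the method awkward to parallelize.

```python
import math
import random

def wolff_update(spins, L, beta, rng, J=1.0):
    """Flip one Wolff cluster on an L x L periodic Ising lattice, in place.

    spins: flat list of +1/-1 values of length L*L. Returns the cluster size.
    """
    p_add = 1.0 - math.exp(-2.0 * beta * J)  # bond activation probability
    seed = rng.randrange(L * L)
    s0 = spins[seed]
    cluster = {seed}
    stack = [seed]
    while stack:
        site = stack.pop()
        x, y = site % L, site // L
        neighbors = (
            (x + 1) % L + y * L,
            (x - 1) % L + y * L,
            x + ((y + 1) % L) * L,
            x + ((y - 1) % L) * L,
        )
        for nb in neighbors:
            # Each bond to an aligned spin outside the cluster is tested once.
            if nb not in cluster and spins[nb] == s0 and rng.random() < p_add:
                cluster.add(nb)
                stack.append(nb)
    for site in cluster:
        spins[site] = -s0
    return len(cluster)
```

At very low temperature the cluster spans the whole lattice, and at infinite temperature (beta = 0) it contains only the seed spin.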
Schatz, Martin D.; Kolda, Tamara G.; van de Geijn, Robert
2015-09-01
Large-scale datasets in computational chemistry typically require distributed-memory parallel methods to perform a special operation known as tensor contraction. Tensors are multidimensional arrays, and a tensor contraction is akin to matrix multiplication with special types of permutations. Creating an efficient algorithm and optimized implementation in this domain is complex, tedious, and error-prone. To address this, we develop a notation to express data distributions so that we can use automated methods to find optimized implementations for tensor contractions. We consider the spin-adapted coupled cluster singles and doubles method from computational chemistry and use our methodology to produce an efficient implementation. Experiments performed on the IBM Blue Gene/Q and Cray XC30 demonstrate both improved performance and reduced memory consumption.
Parallel Algorithms and Patterns
Robey, Robert W.
2016-06-16
This is a PowerPoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. The topic really deserves its own detailed discussion, which Gabe Rockefeller would like to develop.
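Two of the patterns named above, reductions and prefix scans, can be sketched serially. The Hillis-Steele scan and the pairwise tree reduction used here are common textbook formulations chosen for illustration; the presentation itself may use different variants.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum in about log2(n) sweeps."""
    data = list(values)
    step = 1
    while step < len(data):
        # Every element of this sweep depends only on the previous sweep,
        # so the whole comprehension could execute as one parallel step.
        data = [data[i] + data[i - step] if i >= step else data[i]
                for i in range(len(data))]
        step *= 2
    return data

def tree_reduce(values):
    """Pairwise (tree) sum reduction; the additions on each level are independent."""
    data = list(values)
    while len(data) > 1:
        pairs = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:
            pairs.append(data[-1])  # odd element carries over to the next level
        data = pairs
    return data[0] if data else 0
```

For example, `inclusive_scan([1, 2, 3, 4, 5])` yields the running totals and `tree_reduce` returns their final value.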
Parallel algorithm development
Adams, T.F.
1996-06-01
Rapid changes in parallel computing technology are causing significant changes in the strategies being used for parallel algorithm development. One approach is simply to write computer code in a standard language like FORTRAN 77 or with the expectation that the compiler will produce executable code that will run in parallel. The alternatives are: (1) to build explicit message passing directly into the source code; or (2) to write source code without explicit reference to message passing or parallelism, but use a general communications library to provide efficient parallel execution. Application of these strategies is illustrated with examples of codes currently under development.
Implementation of Parallel Algorithms
1993-06-30
their social relations or to achieve some goals. For example, we define a pair-wise force law of repulsion and attraction for a group of identical...quantization based compression schemes. Photo-refractive crystals, which provide high density recording in real time, are used as our holographic media. The...of Parallel Algorithms (J. Reif, ed.). Kluwer Academic Publishers, 1993. (4) "A Dynamic Separator Algorithm", D. Armon and J. Reif. To appear in
Predicting Protein Structure Using Parallel Genetic Algorithms.
1994-12-01
Predicting Protein Structure Using Parallel Genetic Algorithms...THESIS George H...Approved for public release; distribution unlimited AFIT/GCS/ENG/94D-03 Predicting Protein Structure Using Parallel Genetic Algorithms...1.2 Genetic Algorithms...1.3 The Protein Folding Problem
A Parallel Rendering Algorithm for MIMD Architectures
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.; Orloff, Tobias
1991-01-01
Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
Parallel Algorithms for Image Analysis.
1982-06-01
4. TITLE (and Subtitle) PARALLEL ALGORITHMS FOR IMAGE ANALYSIS 5. TYPE OF REPORT & PERIOD COVERED TECHNICAL 6. PERFORMING ORG. REPORT NUMBER TR-1180...(Continue on reverse side if necessary and identify by block number) Image processing; image analysis; parallel processing; cellular computers. 20...7. AUTHOR(S) Azriel Rosenfeld 8. CONTRACT OR GRANT NUMBER(S) AFOSR-77-3271 9
An efficient parallel termination detection algorithm
Baker, A. H.; Crivelli, S.; Jessup, E. R.
2004-05-27
Information local to any one processor is insufficient to monitor the overall progress of most distributed computations. Typically, a second distributed computation for detecting termination of the main computation is necessary. In order to be a useful computational tool, the termination detection routine must operate concurrently with the main computation, adding minimal overhead, and it must promptly and correctly detect termination when it occurs. In this paper, we present a new algorithm for detecting the termination of a parallel computation on distributed-memory MIMD computers that satisfies all of those criteria. A variety of termination detection algorithms have been devised. Of these, the algorithm presented by Sinha, Kale, and Ramkumar (henceforth, the SKR algorithm) is unique in its ability to adapt to the load conditions of the system on which it runs, thereby minimizing the impact of termination detection on performance. Because their algorithm also detects termination quickly, we consider it to be the most efficient practical algorithm presently available. The termination detection algorithm presented here was developed for use in the PMESC programming library for distributed-memory MIMD computers. Like the SKR algorithm, our algorithm adapts to system loads and imposes little overhead. Also like the SKR algorithm, ours is tree-based, and it does not depend on any assumptions about the physical interconnection topology of the processors or the specifics of the distributed computation. In addition, our algorithm is easier to implement and requires only half as many tree traverses as does the SKR algorithm. This paper is organized as follows. In section 2, we define our computational model. In section 3, we review the SKR algorithm. We introduce our new algorithm in section 4, and prove its correctness in section 5. We discuss its efficiency and present experimental results in section 6.
The Complexity of Parallel Algorithms,
1985-11-01
Much of this work was done in collaboration with my advisor, Ernst Mayr. He was also supported in part by ONR contract N00014-85-C-0731...Table...Helmbold and Mayr in their algorithm to compute an optimal two processor schedule [HM2]. One of the promising developments in parallel algorithms is that...can be solved by a fast parallel algorithm if the numbers are small. Helmbold and Mayr [HM1] have shown that, if the job times are
Parallel job-scheduling algorithms
Rodger, S.H.
1989-01-01
In this thesis, we consider solving job scheduling problems on the CREW PRAM model. We show how to adapt Cole's pipeline merge technique to yield several efficient parallel algorithms for a number of job scheduling problems and one optimal parallel algorithm for the following job scheduling problem: Given a set of n jobs defined by release times, deadlines and processing times, find a schedule that minimizes the maximum lateness of the jobs and allows preemption when the jobs are scheduled to run on one machine. In addition, we present the first NC algorithm for the following job scheduling problem: Given a set of n jobs defined by release times, deadlines and unit processing times, determine if there is a schedule of jobs on one machine, and calculate the schedule if it exists. We identify the notion of a canonical schedule, which is the type of schedule our algorithm computes if there is a schedule. Our algorithm runs in O((log n)^2) time and uses O(n^2 k^2) processors, where k is the minimum number of distinct offsets of release times or deadlines.
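As a point of reference for the first problem, the sequential solution is preemptive earliest-deadline-first scheduling, which is known to minimize maximum lateness on one machine with release times. The sketch below is that sequential baseline only; it is not the CREW PRAM algorithm of the thesis, and it assumes integer times so that the completion test is exact.

```python
import heapq

def max_lateness_preemptive(jobs):
    """jobs: list of (release, deadline, processing), integer, processing > 0.

    Runs preemptive earliest-deadline-first on one machine and returns the
    maximum lateness (completion time minus deadline) over all jobs.
    """
    order = sorted(jobs)               # by release time
    ready = []                         # heap of [deadline, remaining work]
    t, i, worst = 0, 0, float("-inf")
    while i < len(order) or ready:
        if not ready:                  # machine idle: jump to the next release
            t = max(t, order[i][0])
        while i < len(order) and order[i][0] <= t:
            _, d, p = order[i]
            heapq.heappush(ready, [d, p])
            i += 1
        job = ready[0]                 # earliest deadline among released jobs
        next_release = order[i][0] if i < len(order) else float("inf")
        run = min(job[1], next_release - t)  # run until done or preempted
        t += run
        job[1] -= run
        if job[1] == 0:
            heapq.heappop(ready)
            worst = max(worst, t - job[0])
    return worst
```

In the first test case below, the long job is preempted at time 1 by a job with an earlier deadline, and both still finish on time.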
Parallel Implicit Algorithms for CFD
NASA Technical Reports Server (NTRS)
Keyes, David E.
1998-01-01
The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD). "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in the Message Passing Interface (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSc library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSc during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSc framework.
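The statement that the Krylov iteration touches the Jacobian only through matrix-vector products can be illustrated with a matrix-free product. The finite-difference approximation below is one common way to obtain such a product and is an assumption here; the abstract does not say whether the project formed the sparse Jacobian explicitly or used differencing. `F`, `u`, `v` and `eps` are illustrative names.

```python
def jacobian_vector_product(F, u, v, eps=1e-7):
    """Approximate J(u) @ v with one extra residual evaluation.

    F: function mapping a list of floats to a list of floats (the residual)
    u: current Newton iterate, v: direction supplied by the Krylov solver
    """
    Fu = F(u)
    shifted = F([ui + eps * vi for ui, vi in zip(u, v)])
    # First-order forward difference: (F(u + eps*v) - F(u)) / eps
    return [(a - b) / eps for a, b in zip(shifted, Fu)]
```

A Krylov method such as GMRES would call this routine once per inner iteration, so the Jacobian never has to be stored.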
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.
1989-01-01
Several techniques for static and dynamic load balancing in vision systems are presented. These techniques are novel in the sense that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same, or have similar computational characteristics. These techniques are evaluated by applying them on a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.
A parallel algorithm for random searches
NASA Astrophysics Data System (ADS)
Wosniack, M. E.; Raposo, E. P.; Viswanathan, G. M.; da Luz, M. G. E.
2015-11-01
We discuss a parallelization procedure for a two-dimensional random search of a single individual, a typical sequential process. To assure the same features of the sequential random search in the parallel version, we analyze the former spatial patterns of the encountered targets for different search strategies and densities of homogeneously distributed targets. We identify a lognormal tendency for the distribution of distances between consecutively detected targets. Then, by assigning the distinct mean and standard deviation of this distribution for each corresponding configuration in the parallel simulations (constituted by parallel random walkers), we are able to recover important statistical properties, e.g., the target detection efficiency, of the original problem. The proposed parallel approach presents a speedup of nearly one order of magnitude compared with the sequential implementation. This algorithm can be easily adapted to different instances, such as searches in three dimensions. Its possible range of applicability covers problems in areas as diverse as automated computer searches in high-capacity databases and animal foraging.
Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak
2016-05-01
Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive formulation mixed-integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve further as more functionality is added in the future.
Parallel algorithms for unconstrained optimizations by multisplitting
He, Qing
1994-12-31
In this paper a new parallel iterative algorithm for unconstrained optimization using the idea of multisplitting is proposed. This algorithm uses the existing sequential algorithms without any parallelization. Some convergence and numerical results for this algorithm are presented. The experiments are performed on an Intel iPSC/860 Hyper Cube with 64 nodes. It is interesting that the sequential implementation on one node shows that if the problem is split properly, the algorithm converges much faster than one without splitting.
A parallel algorithm for global routing
NASA Technical Reports Server (NTRS)
Brouwer, Randall J.; Banerjee, Prithviraj
1990-01-01
A Parallel Hierarchical algorithm for Global Routing (PHIGURE) is presented. The router is based on the work of Burstein and Pelavin, but has many extensions for general global routing and parallel execution. Main features of the algorithm include structured hierarchical decomposition into separate independent tasks which are suitable for parallel execution and adaptive simplex solution for adding feedthroughs and adjusting channel heights for row-based layout. Alternative decomposition methods and the various levels of parallelism available in the algorithm are examined closely. The algorithm is described and results are presented for a shared-memory multiprocessor implementation.
Array distribution in data-parallel programs
NASA Technical Reports Server (NTRS)
Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.
1994-01-01
We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.
Linear Bregman algorithm implemented in parallel GPU
NASA Astrophysics Data System (ADS)
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms converge slowly and are therefore difficult to run on a PC. To deal with this issue, we use a parallel GPU to implement a widely used compressed sensing algorithm, the linear Bregman algorithm. The linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm involves only vector and matrix multiplication and a thresholding operation, and is simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with a traditional CPU implementation of the Bregman algorithm. In addition, we compare the parallel Bregman algorithm with other CS reconstruction algorithms, such as the OMP and TwIST algorithms. The results show that the parallel Bregman algorithm needs less time than these two algorithms and is thus more convenient for real-time object reconstruction, which is important given the fast-growing demand for information technology.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)]
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions ensuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Efficient Parallel Algorithm For Direct Numerical Simulation of Turbulent Flows
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Gatski, Thomas B.
1997-01-01
A distributed algorithm for a high-order-accurate finite-difference approach to the direct numerical simulation (DNS) of transition and turbulence in compressible flows is described. This work has two major objectives. The first objective is to demonstrate that parallel and distributed-memory machines can be successfully and efficiently used to solve computationally intensive and input/output intensive algorithms of the DNS class. The second objective is to show that the computational complexity involved in solving the tridiagonal systems inherent in the DNS algorithm can be reduced by algorithm innovations that obviate the need to use a parallelized tridiagonal solver.
Parallel simulated annealing algorithms for cell placement on hypercube multiprocessors
NASA Technical Reports Server (NTRS)
Banerjee, Prithviraj; Jones, Mark Howard; Sargent, Jeff S.
1990-01-01
Two parallel algorithms for standard cell placement using simulated annealing are developed to run on distributed-memory message-passing hypercube multiprocessors. The cells can be mapped in a two-dimensional area of a chip onto processors in an n-dimensional hypercube in two ways, such that both small and large cell exchange and displacement moves can be applied. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support the parallel cost evaluation. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. A dynamic parallel annealing schedule estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control.
Runtime support for parallelizing data mining algorithms
NASA Astrophysics Data System (ADS)
Jin, Ruoming; Agrawal, Gagan
2002-03-01
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm.
A parallel variable metric optimization algorithm
NASA Technical Reports Server (NTRS)
Straeter, T. A.
1973-01-01
An algorithm designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariate minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
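The three-step cycle described in this abstract can be sketched on a small quadratic. This is only an illustration under stated assumptions: the rank-one correction is taken to be the symmetric rank-one (SR1) update of an inverse-Hessian estimate, which the abstract does not confirm; only gradients are evaluated (function values are not needed on a quadratic); the line minimization uses a closed form that is exact only for quadratics; and a thread pool stands in for the parallel hardware. The matrix `A`, vector `B`, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test problem: f(x) = 0.5 * x^T A x - B^T x, with A symmetric positive definite.
A = [[4.0, 1.0], [1.0, 3.0]]
B = [1.0, 2.0]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def gradient(x):
    """Gradient of the quadratic test function: A x - B."""
    return [ax - bi for ax, bi in zip(matvec(A, x), B)]


def variable_metric_cycle(grad, x0, offsets):
    """One cycle: p gradients in parallel, p rank-one corrections, one line minimization."""
    n = len(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # initial metric
    g0 = grad(x0)

    # Step 1: evaluate the gradient at p trial points concurrently.
    points = [[xi + si for xi, si in zip(x0, s)] for s in offsets]
    with ThreadPoolExecutor() as pool:
        grads = list(pool.map(grad, points))

    # Step 2: p rank-one (SR1, an assumption) corrections of the inverse-Hessian estimate.
    for s, g in zip(offsets, grads):
        y = [gi - g0i for gi, g0i in zip(g, g0)]
        r = [si - hyi for si, hyi in zip(s, matvec(H, y))]
        denom = dot(r, y)
        if abs(denom) > 1e-12:  # skip ill-defined updates
            H = [[H[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]

    # Step 3: minimize along the Newton-like direction d = -H g0 (exact for a quadratic).
    d = [-hi for hi in matvec(H, g0)]
    g_trial = grad([xi + di for xi, di in zip(x0, d)])
    curvature = dot(d, [gt - g0i for gt, g0i in zip(g_trial, g0)])
    if curvature == 0.0:
        return list(x0)
    alpha = -dot(g0, d) / curvature
    return [xi + alpha * di for xi, di in zip(x0, d)]


if __name__ == "__main__":
    # With p = 2 equal to the dimension, one cycle reaches the minimizer (1/11, 7/11).
    print(variable_metric_cycle(gradient, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

On this two-dimensional problem the two corrections recover the exact inverse of `A`, so the single line minimization lands on the solution, matching the one-cycle convergence claim for p equal to the dimension.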
A parallel algorithm for channel routing on a hypercube
NASA Technical Reports Server (NTRS)
Brouwer, Randall; Banerjee, Prithviraj
1987-01-01
A new parallel simulated annealing algorithm for channel routing on a P processor hypercube is presented. The basic idea used is to partition a set of tracks equally among processors in the hypercube. In parallel, P/2 pairs of processors perform displacements and exchanges of nets between tracks, compute the changes in cost functions, and accept moves using a parallel annealing criterion. Through the use of a unique distributed data structure, it is possible to minimize message traffic and add versatility and efficiency in a parallel routing tool. The algorithm has been implemented and is being tested on some of the popular channel problems from the literature.
Parallel Algorithms for the Exascale Era
Robey, Robert W.
2016-10-19
New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this work has been done by undergraduates and published in leading scientific journals.
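Why reproducibility of global sums matters can be seen in a few lines. The sketch below mimics a parallel reduction by summing chunks separately and then combining the partial sums; the result changes with the number of chunks because floating-point addition is not associative. The input values are hypothetical, and `math.fsum` (a correctly rounded sum from the Python standard library) is used only as a stand-in for a reproducible summation; the abstract does not say which technique the LANL work uses.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic a parallel reduction: each 'rank' sums its chunk, then partials are combined."""
    size = (len(values) + n_chunks - 1) // n_chunks
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


if __name__ == "__main__":
    values = [1e16, 1.0, -1e16, 1.0]  # the exact sum is 2.0
    print(chunked_sum(values, 1))     # one "processor"
    print(chunked_sum(values, 2))     # two "processors" give a different answer
    print(math.fsum(values))          # correctly rounded, independent of order
```

The same data summed on one or two simulated processors yields different answers, neither of them exact, while the correctly rounded sum does not depend on how the data are ordered or split.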
A parallelization of the row-searching algorithm
NASA Astrophysics Data System (ADS)
Yaici, Malika; Khaled, Hayet; Khaled, Zakia; Bentahar, Athmane
2012-11-01
The problem dealt with in this paper concerns the parallelization of the row-searching algorithm, which searches for linearly dependent rows in a given matrix, and its implementation in an MPI (Message Passing Interface) environment. This algorithm is widely used in control theory and more specifically in solving the famous diophantine equation. An introduction to the diophantine equation is presented, then two parallelization approaches of the algorithm are detailed. The first distributes a set of rows on processes (processors) and the second makes a distribution per blocks. The sequential algorithm and its two parallel forms are implemented using MPI routines, then modelled using UML (Unified Modelling Language) and finally evaluated using algorithmic complexity.
Parallel algorithms for contour extraction and coding
NASA Astrophysics Data System (ADS)
Dinstein, Its'hak; Landau, Gad M.
1990-07-01
A parallel approach to contour extraction and coding on an Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM) is presented and analyzed. The algorithm is intended for binary images. The labeled contours can be represented by lists of coordinates, and/or chain codes, and/or any other user designed codes. Using O(n^2/log n) processors, the algorithm runs in O(log n) time, where n by n is the size of the processed binary image.
MULTIOBJECTIVE PARALLEL GENETIC ALGORITHM FOR WASTE MINIMIZATION
In this research we have developed an efficient multiobjective parallel genetic algorithm (MOPGA) for waste minimization problems. This MOPGA integrates PGAPack (Levine, 1996) and NSGA-II (Deb, 2000) with novel modifications. PGAPack is a master-slave parallel implementation of a...
Parallel algorithms for semi-lagrangian advection
NASA Astrophysics Data System (ADS)
Malevsky, A. V.; Thomas, S. J.
1997-08-01
Numerical time step limitations associated with the explicit treatment of advection-dominated problems in computational fluid dynamics are often relaxed by employing Eulerian-Lagrangian methods. These are also known as semi-Lagrangian methods in the atmospheric sciences. Such methods involve backward time integration of a characteristic equation to find the departure point of a fluid particle arriving at a Eulerian grid point. The value of the advected field at the departure point is obtained by interpolation. Both the trajectory integration and repeated interpolation influence accuracy. We compare the accuracy and performance of interpolation schemes based on piecewise cubic polynomials and cubic B-splines in the context of a distributed memory, parallel computing environment. The computational cost and interprocessor communication requirements for both methods are reported. Spline interpolation has better conservation properties but requires the solution of a global linear system, initially appearing to hinder a distributed memory implementation. The proposed parallel algorithm for multidimensional spline interpolation has almost the same communication overhead as local piecewise polynomial interpolation. We also compare various techniques for tracking trajectories given different values for the Courant number. Large Courant numbers require a high-order ODE solver involving multiple interpolations of the velocity field.
Parallel, Distributed Scripting with Python
Miller, P J
2002-05-24
Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc. These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.
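The dictionary password check mentioned above splits naturally into independent chunks of words. The following is a minimal sketch under stated assumptions: a salted SHA-256 digest stands in for the unspecified password encryption, and a thread pool on one machine stands in for the distributed workers the abstract has in mind. All function names and the example salt are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_word(word, salt):
    """Stand-in for the password encryption: a salted SHA-256 digest."""
    return hashlib.sha256((salt + word).encode("utf-8")).hexdigest()


def check_chunk(args):
    """Scan one chunk of the dictionary; return the matching word or None."""
    chunk, salt, target = args
    for word in chunk:
        if hash_word(word, salt) == target:
            return word
    return None


def parallel_check(words, salt, target, n_workers=4):
    """Distribute the dictionary over the workers and collect the first hit."""
    size = max(1, (len(words) + n_workers - 1) // n_workers)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for hit in pool.map(check_chunk, [(c, salt, target) for c in chunks]):
            if hit is not None:
                return hit
    return None


if __name__ == "__main__":
    dictionary = ["alpha", "bravo", "charlie", "delta", "secret", "zulu"]
    target = hash_word("secret", "ab")
    print(parallel_check(dictionary, "ab", target))
```

The only coordination needed is handing each worker its chunk and gathering one result per chunk, which is the "distribute the information and co-ordinate the work" step the abstract refers to.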
Parallel Algorithms for Computer Vision.
1987-01-01
73 755 PARALLEL ALGORITHMS FOR COMPUTER VISION (U) MASSACHUSETTS INST OF TECH CAMBRIDGE T POGGIO ET AL. JAN 87 ETL-0456 DACA7-05-C-8IIO m 7E F/0 1...regularization principles, such as edge detection, stereo, motion, surface interpolation and shape from shading. The basic members of class I are convolution...them in collaboration with Thinking Machines Corporation): * Parallel convolution * Zero-crossing detection * Stereo-matching * Surface reconstruction
Parallel Algorithms for Computer Vision.
1989-01-01
demonstrated the Vision Machine system processing images and recognizing objects through the integration of several visual cues. The first version of the...achievements. 2.1 The Vision Machine The overall organization of the Vision Machine system is based on parallel processing of the images by independent...smoothed and made dense by exploiting known constraints within each process (for example, that disparity is smooth). This is the stage of approximation
Parallel algorithms for dynamically partitioning unstructured grids
Diniz, P.; Plimpton, S.; Hendrickson, B.; Leland, R.
1994-10-01
Grid partitioning is the method of choice for decomposing a wide variety of computational problems into naturally parallel pieces. In problems where computational load on the grid or the grid itself changes as the simulation progresses, the ability to repartition dynamically and in parallel is attractive for achieving higher performance. We describe three algorithms suitable for parallel dynamic load-balancing which attempt to partition unstructured grids so that computational load is balanced and communication is minimized. The execution time of algorithms and the quality of the partitions they generate are compared to results from serial partitioners for two large grids. The integration of the algorithms into a parallel particle simulation is also briefly discussed.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Parallel Genetic Algorithm for Alpha Spectra Fitting
NASA Astrophysics Data System (ADS)
García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio
2005-01-01
We present a performance study of alpha-particle spectra fitting using a parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run the parallel GA to find an initial solution for the second step, in which we use the Levenberg-Marquardt (LM) method for a precise final fit. GA is a resource-intensive method, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and the number of processors is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal number of processors to use in a simulation.
Experiences with the PGAPack Parallel Genetic Algorithm library
Levine, D.; Hallstrom, P.; Noelle, D.; Walenz, B.
1997-07-01
PGAPack is the first widely distributed parallel genetic algorithm library. Since its release, several thousand copies have been distributed worldwide to interested users. In this paper we discuss the key components of the PGAPack design philosophy and present a number of application examples that use PGAPack.
Numerical Algorithms and Parallel Tasking.
1984-07-01
34 Principal Investigator, Virginia Klema, Research Staff, George Cybenko and Elizabeth Ducot. During the period, May 15, 1983 through May 14, 1984...Virginia Klema and Elizabeth Ducot have been supported for four months, and George Cybenko has been supported for one month. During this time system...algorithms or applications is the responsibility of the user. Virginia Klema and Elizabeth Ducot presented a description of the concurrent computing
Adapting Eclat algorithm to parallel environments with Charm++ library
NASA Astrophysics Data System (ADS)
Puścian, Marek; Grabski, Waldemar
2016-09-01
In this paper we describe an Eclat algorithm adapted to deal with growing data repositories. The presented solution uses a Master-Slave scheme to distribute data mining tasks among available computation nodes. Several improvements have been proposed and successfully implemented using the Charm++ library. This paper introduces optimization techniques to reduce communication cost and synchronization overhead. It also discusses the performance of the parallel Eclat algorithm on different databases and compares it with a parallel Apriori algorithm. The proposed approach has been illustrated with many experiments and measurements performed on a multiprocessor, multithreaded computer platform.
Parallel distributed computing using Python
NASA Astrophysics Data System (ADS)
Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro
2011-09-01
This work presents two software components aimed at reducing the cost of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated with PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
Scheduling parallel programs in distributed systems
Rommel, C.G.
1988-01-01
Scheduling parallel programs under the processor-sharing discipline for uniprocessors, multiprocessors, and distributed systems was studied. Two classes of parallel programs are considered: those without any IPC (called Fork-Join jobs) and those with asynchronous and uniform IPC (called clusters). The study is divided into two parts: (1) develops analytical solutions for Fork-Join jobs on uniprocessors and multiprocessors; and (2) develops and evaluates via simulation Fork-Join jobs and clusters on distributed systems. The types of site scheduling studied are TS-PS, where tasks of a job are scheduled independently at processor-sharing servers; JS-PS, in which tasks of a job are scheduled as a single entity at processor-sharing servers; and FCFS, where tasks of a job are scheduled independently by order of arrival. For Poisson job arrivals and exponentially distributed task service times, analytical solutions and computationally efficient bounds were found for Fork-Join TS-PS and JS-PS job response times. An algorithm was developed to schedule parallel programs in distributed systems. Over a wide range of parameters the algorithm was found to be superior to both no-load balancing, NLB, and shortest-queue first scheduling, SQF.
Fast parallel algorithms for short-range molecular dynamics
Plimpton, S.
1993-05-01
Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a subset of atoms; the second assigns each a subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently -- those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 10,000,000 atoms on three parallel supercomputers, the nCUBE 2, Intel iPSC/860, and Intel Delta. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and the Intel Delta performs about 30 times faster than a single Y-MP processor and 12 times faster than a single C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.
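Two of the three decompositions in this abstract differ only in how atoms are mapped to processors, which a short sketch can make concrete. It assumes a cubic simulation box with coordinates in [0, box_length) and a cubic processor grid; the function names and the row-major rank numbering are hypothetical and are not taken from the paper.

```python
def atom_decomposition(n_atoms, n_procs):
    """First algorithm: each processor gets a fixed subset of atoms (here, round-robin)."""
    owned = {rank: [] for rank in range(n_procs)}
    for atom_id in range(n_atoms):
        owned[atom_id % n_procs].append(atom_id)
    return owned


def owner_of(position, box_length, procs_per_dim):
    """Rank owning the cubic sub-region that contains `position`."""
    ix, iy, iz = (
        min(int(c / box_length * procs_per_dim), procs_per_dim - 1) for c in position
    )
    return (ix * procs_per_dim + iy) * procs_per_dim + iz


def spatial_decomposition(positions, box_length, procs_per_dim):
    """Third algorithm: each processor owns the atoms inside its fixed spatial region."""
    owned = {rank: [] for rank in range(procs_per_dim ** 3)}
    for atom_id, pos in enumerate(positions):
        owned[owner_of(pos, box_length, procs_per_dim)].append(atom_id)
    return owned


if __name__ == "__main__":
    positions = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.1, 0.1, 0.9)]
    print(atom_decomposition(len(positions), 2))
    print(spatial_decomposition(positions, box_length=1.0, procs_per_dim=2))
```

In the atom decomposition ownership never changes, whereas in the spatial decomposition an atom must be reassigned whenever it crosses a region boundary; in exchange, its short-range neighbors live on the same or adjacent processors.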
Parallel algorithms for the spectral transform method
Foster, I.T.; Worley, P.H.
1997-05-01
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, the authors describe these different parallel algorithms and report on computational experiments that they have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. The authors focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but they also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional fast Fourier transforms (FFTs) and other parallel transforms.
Parallel algorithms for the spectral transform method
Foster, I.T.; Worley, P.H.
1994-04-01
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional FFTs and other parallel transforms.
A parallel adaptive mesh refinement algorithm
NASA Technical Reports Server (NTRS)
Quirk, James J.; Hanebutte, Ulf R.
1993-01-01
Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
The PRISM project: Infrastructure and algorithms for parallel eigensolvers
Bischof, C.; Sun, X.; Huss-Lederman, S.; Tsao, A.
1993-12-31
The goal of the PRISM project is the development of infrastructure and algorithms for the parallel solution of eigenvalue problems. We are currently investigating a complete eigensolver based on the Invariant Subspace Decomposition Algorithm for dense symmetric matrices (SYISDA). After briefly reviewing the SYISDA approach, we discuss the algorithmic highlights of a distributed-memory implementation of an eigensolver based on this approach. These include a fast matrix-matrix multiplication algorithm, a new approach to parallel band reduction and tridiagonalization, and a harness for coordinating the divide-and-conquer parallelism in the problem. We also present performance results of these kernels as well as the overall SYISDA implementation on the Intel Touchstone Delta prototype and the IBM SP/1.
NASA Astrophysics Data System (ADS)
Gladwin, D.; Stewart, P.; Stewart, J.
2011-02-01
This article addresses the problem of maintaining a stable rectified DC output from the three-phase AC generator in a series-hybrid vehicle powertrain. The series-hybrid prime power source generally comprises an internal combustion (IC) engine driving a three-phase permanent magnet generator whose output is rectified to DC. A recent development has been to control the engine/generator combination by an electronically actuated throttle. This system can be represented as a nonlinear system with significant time delay. Previously, voltage control of the generator output has been achieved by model predictive methods such as the Smith Predictor. These methods rely on the incorporation of an accurate system model and time delay into the control algorithm, with a consequent increase in computational complexity in the real-time controller, and they necessarily depend to some extent on the accuracy of the models. Two complementary performance objectives exist for the control system: firstly, to maintain the IC engine at its optimal operating point, and secondly, to supply a stable DC supply to the traction drive inverters. Achievement of these goals minimises the transient energy storage requirements at the DC link, with a consequent reduction in both weight and cost. These objectives imply constant velocity operation of the IC engine under external load disturbances and changes in both operating conditions and vehicle speed set-points. In order to achieve these objectives and reduce the complexity of implementation, in this article a controller is designed by the use of Genetic Programming methods in the Simulink modelling environment, with the aim of obtaining a relatively simple controller for the time-delay system which does not rely on the implementation of real-time system models or time delay approximations in the controller. A methodology is presented to utilise the myriad of existing control blocks in the Simulink libraries to automatically evolve optimal control
Parallel Clustering Algorithms for Structured AMR
Gunney, B T; Wissink, A M; Hysom, D A
2005-10-26
We compare several different parallel implementation approaches for the clustering operations performed during adaptive gridding operations in patch-based structured adaptive mesh refinement (SAMR) applications. Specifically, we target the clustering algorithm of Berger and Rigoutsos (BR91), which is commonly used in many SAMR applications. The baseline for comparison is a simplistic parallel extension of the original algorithm that works well for up to O(10^2) processors. Our goal is a clustering algorithm for machines of up to O(10^5) processors, such as the 64K-processor IBM BlueGene/Light system. We first present an algorithm that avoids the unneeded communications of the simplistic approach to improve the clustering speed by up to an order of magnitude. We then present a new task-parallel implementation to further reduce communication wait time, adding another order of magnitude of improvement. The new algorithms also exhibit more favorable scaling behavior for our test problems. Performance is evaluated on a number of large scale parallel computer systems, including a 16K-processor BlueGene/Light system.
Parallelization of a blind deconvolution algorithm
NASA Astrophysics Data System (ADS)
Matson, Charles L.; Borelli, Kathy J.
2006-09-01
Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image, depending on how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.
Parallelization of Edge Detection Algorithm using MPI on Beowulf Cluster
NASA Astrophysics Data System (ADS)
Haron, Nazleeni; Amir, Ruzaini; Aziz, Izzatdin A.; Jung, Low Tan; Shukri, Siti Rohkmah
In this paper, we present the design of a parallel Sobel edge detection algorithm using Foster's methodology. The parallel algorithm is implemented using the MPI message passing library and a master/slave algorithm. Every processor performs the same sequential algorithm but on a different part of the image. Experimental results obtained on a Beowulf cluster are presented to demonstrate the performance of the parallel algorithm.
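The idea that every processor runs the same sequential Sobel code on a different part of the image can be illustrated with a minimal sketch. It is not the authors' implementation: the row-strip split, the function names `sobel_rows`, `split_rows` and `sobel_by_strips`, and the |gx| + |gy| magnitude approximation are all assumptions made for illustration, and the strips are processed one after another in a single process instead of being sent to MPI workers.

```python
def sobel_rows(image, first, last):
    """Return Sobel responses for rows first..last-1 of a 2-D list image.

    Border pixels are left at 0 so every 3x3 window stays inside the image.
    The magnitude is approximated as |gx| + |gy| (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = []
    for r in range(first, last):
        row = [0] * width
        if 0 < r < height - 1:
            for c in range(1, width - 1):
                gx = (image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                      - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1])
                gy = (image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                      - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1])
                row[c] = abs(gx) + abs(gy)
        out.append(row)
    return out


def split_rows(height, workers):
    """Split range(height) into contiguous [first, last) strips, one per worker."""
    base, extra = divmod(height, workers)
    bounds, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def sobel_by_strips(image, workers):
    """Apply the same sequential routine to each strip and stitch the results."""
    result = []
    for first, last in split_rows(len(image), workers):
        # In a message-passing version, each worker would need its strip plus
        # one neighbouring row on each side; here all strips read the same image.
        result.extend(sobel_rows(image, first, last))
    return result


# Hypothetical 5x5 test image with a vertical step edge.
image = [[0, 0, 10, 10, 10] for _ in range(5)]
whole = sobel_rows(image, 0, len(image))
strips = sobel_by_strips(image, 3)
```

Because each output row depends only on its own row and the two adjacent rows, the stitched strip result matches the whole-image result, which is what makes this decomposition attractive for a master/slave scheme.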
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research on parallel algorithm for sequential pattern mining
NASA Astrophysics Data System (ADS)
Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao
2008-03-01
Sequential pattern mining is the mining of frequent sequences related to time or other orders from a sequence database. Its initial motivation was to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field is no longer confined to business databases but has extended to new data sources such as the Web and advanced science fields such as DNA analysis. The data used in sequential pattern mining have two characteristics: massive volume and distributed storage. Most existing sequential pattern mining algorithms have not considered these characteristics together. Based on these traits and on parallel computing theory, this paper puts forward a new distributed parallel algorithm, SPP (Sequential Pattern Parallel). The algorithm abides by the principle of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets by applying the frequent concept and search space partition theory, and the second task is to build frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and does not generate candidate sequences, which reduces the access time and improves the mining efficiency. Based on a random data generation procedure and the different information structures designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that, compared with AprioriAll, the SPP algorithm had an excellent speedup factor and efficiency.
Parallelized quantum Monte Carlo algorithm with nonlocal worm updates.
Masaki-Kato, Akiko; Suzuki, Takafumi; Harada, Kenji; Todo, Synge; Kawashima, Naoki
2014-04-11
Based on the worm algorithm in the path-integral representation, we propose a general quantum Monte Carlo algorithm suitable for parallelizing on a distributed-memory computer by domain decomposition. Of particular importance is its application to large lattice systems of bosons and spins. A large number of worms are introduced and their population is controlled by a fictitious transverse field. For a benchmark, we study the size dependence of the Bose-condensation order parameter of the hard-core Bose-Hubbard model with L×L×βt=10240×10240×16, using 3200 computing cores, which shows good parallelization efficiency.
Parallel asynchronous systems and image processing algorithms
NASA Technical Reports Server (NTRS)
Coon, D. D.; Perera, A. G. U.
1989-01-01
A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.
Iterative algorithms for large sparse linear systems on parallel computers
NASA Technical Reports Server (NTRS)
Adams, L. M.
1982-01-01
Algorithms are developed for assembling in parallel the sparse systems of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Parallelization of the Pipelined Thomas Algorithm
NASA Technical Reports Server (NTRS)
Povitsky, A.
1998-01-01
In this study the following questions are addressed. Is it possible to improve the parallelization efficiency of the Thomas algorithm? How should the Thomas algorithm be formulated in order to get solved lines that are used as data for other computational tasks while processors are idle? To answer these questions, two-step pipelined algorithms (PAs) are introduced formally. It is shown that the idle processor time is invariant with respect to the order of backward and forward steps in PAs starting from one outermost processor. The advantage of PAs starting from two outermost processors is small. Versions of the pipelined Thomas algorithms considered here fall into the category of PAs. These results show that the parallelization efficiency of the Thomas algorithm cannot be improved directly. However, the processor idle time can be used if some data has been computed by the time processors become idle. To achieve this goal the Immediate Backward pipelined Thomas Algorithm (IB-PTA) is developed in this article. The backward step is computed immediately after the forward step has been completed for the first portion of lines. This enables the completion of the Thomas algorithm for some of these lines before processors become idle. An algorithm for generating a static processor schedule recursively is developed. This schedule is used to switch between forward and backward computations and to control communications between processors. The advantage of the IB-PTA over the basic PTA is the presence of solved lines, which are available for other computations, by the time processors become idle.
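For readers unfamiliar with the forward and backward steps mentioned above, the following is a minimal serial sketch of the Thomas algorithm for a single tridiagonal line. It is not the pipelined PTA or IB-PTA of the abstract; the function name `thomas_solve` and its argument layout are assumptions made for illustration, and no pivoting is performed, so the sketch assumes a well-conditioned (for example, diagonally dominant) system.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused), diag[i] multiplies x[i],
    and upper[i] multiplies x[i+1] (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward step: eliminate the sub-diagonal, row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot

    # Backward step: substitute from the last unknown back to the first.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Hypothetical 3x3 example whose exact solution is x = [1, 2, 3].
solution = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                        [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```

Each sweep carries a dependency from one row to the next, which is why a line split across processors leaves some of them idle and why the ordering of forward and backward steps matters in the pipelined variants.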
An algorithm on distributed mining association rules
NASA Astrophysics Data System (ADS)
Xu, Fan
2005-12-01
With the rapid development of the Internet/Intranet, distributed databases have become a broadly used environment in various areas. It is a critical task to mine association rules in distributed databases. Algorithms for distributed mining of association rules can be divided into two classes: DD algorithms and CD algorithms. A DD algorithm focuses on data partition optimization so as to enhance efficiency. A CD algorithm, on the other hand, considers a setting where the data is arbitrarily partitioned horizontally among the parties to begin with, and focuses on parallelizing the communication. A DD algorithm is not always applicable, however, because the data is often already partitioned at the time it is generated. In many cases, it cannot be gathered and repartitioned for reasons of security and secrecy, transmission cost, or sheer efficiency. A CD algorithm may be a more appealing solution for systems which are naturally distributed over large expanses, such as stock exchange and credit card systems. An FDM algorithm provides an enhancement to the CD algorithm. However, CD and FDM algorithms are both based on a net structure and execute on non-shareable resources. In practical applications, distributed databases are often star-structured. This paper proposes an algorithm based on star-structured networks, which are more practical in application, have lower maintenance costs, and are more practical to construct. In addition, the algorithm provides high efficiency in communication and extends well to parallel computation.
Parallel algorithms for boundary value problems
NASA Technical Reports Server (NTRS)
Lin, Avi
1991-01-01
A general approach to solve boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step where all the P available processors work in parallel, and the global step where one processor solves a tridiagonal linear system of the order P. The main advantages of this approach are twofold. First, this suggested approach is very flexible, especially in the local step and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Secondly, the communication complexity is very small and thus can be used as easily with shared memory machines. Several examples for using this strategy are discussed.
Parallel algorithms for boundary value problems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A general approach to solve boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step where all the P available processors work in parallel, and the global step where one processor solves a tridiagonal linear system of the order P. The main advantages of this approach are twofold. First, this suggested approach is very flexible, especially in the local step and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Secondly, the communication complexity is very small and thus can be used as easily with shared memory machines. Several examples for using this strategy are discussed.
Embodied and Distributed Parallel DJing.
Cappelen, Birgitta; Andersson, Anders-Petter
2016-01-01
Everyone has a right to take part in cultural events and activities, such as music performances and music making. Enforcing that right, within Universal Design, is often limited to a focus on physical access to public areas, hearing aids etc., or groups of persons with special needs performing in traditional ways. The latter might be people with disabilities, being musicians playing traditional instruments, or actors playing theatre. In this paper we focus on the innovative potential of including people with special needs, when creating new cultural activities. In our project RHYME our goal was to create health promoting activities for children with severe disabilities, by developing new musical and multimedia technologies. Because of the users' extreme demands and rich contribution, we ended up creating both a new genre of musical instruments and a new art form. We call this new art form Embodied and Distributed Parallel DJing, and the new genre of instruments for Empowering Multi-Sensorial Things.
Coupled cluster algorithms for networks of shared memory parallel processors
NASA Astrophysics Data System (ADS)
Bentz, Jonathan L.; Olson, Ryan M.; Gordon, Mark S.; Schmidt, Michael W.; Kendall, Ricky A.
2007-05-01
As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too increases the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques to parallelize a very important algorithm (often called the "gold standard") used in computational chemistry, the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS (General Atomic and Molecular Electronic Structure System) program suite [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347-1363] and the Distributed Data Interface (DDI) [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190]; however, the essential features of the algorithm (data distribution, load-balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.
Parallelized FVM algorithm for three-dimensional viscoelastic flows
NASA Astrophysics Data System (ADS)
Dou, H.-S.; Phan-Thien, N.
A parallel implementation of the finite volume method (FVM) for three-dimensional (3D) viscoelastic flows is developed on a distributed computing environment through Parallel Virtual Machine (PVM). The numerical procedure is based on the SIMPLEST algorithm using a staggered FVM discretization in Cartesian coordinates. The final discretized algebraic equations are solved with the TDMA method. The parallelisation of the program is implemented by a domain decomposition strategy, with a master/slave style programming paradigm and message passing through PVM. A load balancing strategy is proposed to reduce the communications between processors. The three-dimensional viscoelastic flow in a rectangular duct is computed with this program. The modified Phan-Thien-Tanner (MPTT) constitutive model is employed for the equation system closure. Computing results are validated on the secondary flow problem due to non-zero second normal stress difference N2. Three sets of meshes are used, and the effect of domain decomposition strategies on the performance is discussed. It is found that parallel efficiency is strongly dependent on the grid size and the number of processors for a given block number. The convergence rate as well as the total efficiency of domain decomposition depends upon the flow problem and the boundary conditions. The parallel efficiency increases with increasing problem size for a given block number. Compared with two-dimensional flow problems, the 3D parallelized algorithm has a lower efficiency owing to largely overlapped block interfaces, but the parallel algorithm is indeed a powerful means for large scale flow simulations.
Optical flow optimization using parallel genetic algorithm
NASA Astrophysics Data System (ADS)
Zavala-Romero, Olmo; Botella, Guillermo; Meyer-Bäse, Anke; Meyer Base, Uwe
2011-06-01
A new approach to optimize the parameters of a gradient-based optical flow model using a parallel genetic algorithm (GA) is proposed. The main characteristics of the optical flow algorithm are its bio-inspiration and robustness against contrast, static patterns and noise, besides working consistently with several optical illusions where other algorithms fail. This model depends on many parameters, which determine the number of channels, the orientations required, and the length and shape of the kernel functions used in the convolution stage, among many more. The GA is used to find a set of parameters which improve the accuracy of the optical flow on inputs where the ground-truth data is available. This set of parameters helps to understand which of them are better suited for each type of input and can be used to estimate the parameters of the optical flow algorithm when used with videos that share similar characteristics. The proposed implementation takes into account the embarrassingly parallel nature of the GA and uses the OpenMP Application Programming Interface (API) to speed up the process of estimating an optimal set of parameters. The information obtained in this work can be used to dynamically reconfigure systems, with potential applications in robotics, medical imaging and tracking.
Parallelism of the SANDstorm hash algorithm.
Torgerson, Mark Dolan; Draelos, Timothy John; Schroeppel, Richard Crabtree
2009-09-01
Mainstream cryptographic hashing algorithms are not parallelizable. This limits their speed, and they are not able to take advantage of the current trend of being run on multi-core platforms. Being limited in speed limits their usefulness as an authentication mechanism in secure communications. Sandia researchers have created a new cryptographic hashing algorithm, SANDstorm, which was specifically designed to take advantage of multi-core processing and be parallelizable on a wide range of platforms. This report describes a late-start LDRD effort to verify the parallelizability claims of the SANDstorm designers. We have shown, with operating code and bench testing, that the SANDstorm algorithm may be trivially parallelized on a wide range of hardware platforms. Implementations using OpenMP demonstrate a linear speedup with multiple cores. We have also shown significant performance gains with optimized C code and the use of assembly instructions to exploit particular platform capabilities.
FRPA: A Framework for Recursive Parallel Algorithms
2015-05-01
Math Kernel Library (MKL) [4] matrix multiplication routine on “skinny” matrices. Our double-precision Strassen-Winograd implementation, at just...Optimal Parallel Recursive Rectangular Matrix Multiplication,” in IEEE International Parallel & Distributed Processing Symposium, 2013. [4] Intel, “Math
Parallel/distributed direct method for solving linear systems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit near-optimal performance and enjoy several important features: (1) for large enough linear systems, the design of the appropriate parallel algorithm is insensitive to the number of processors, as its performance grows monotonically with them; (2) it is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) it can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) this set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmic changes.
Predicting mining activity with parallel genetic algorithms
Talaie, S.; Leigh, R.; Louis, S.J.; Raines, G.L.; Beyer, H.G.; O'Reilly, U.M.; Banzhaf, Arnold D.; Blum, W.; Bonabeau, C.; Cantu-Paz, E.W.
2005-01-01
We explore several different techniques in our quest to improve the overall model performance of a genetic algorithm calibrated probabilistic cellular automata. We use the Kappa statistic to measure correlation between ground truth data and data predicted by the model. Within the genetic algorithm, we introduce a new evaluation function sensitive to spatial correctness and we explore the idea of evolving different rule parameters for different subregions of the land. We reduce the time required to run a simulation from 6 hours to 10 minutes by parallelizing the code and employing a 10-node cluster. Our empirical results suggest that using the spatially sensitive evaluation function does indeed improve the performance of the model and our preliminary results also show that evolving different rule parameters for different regions tends to improve overall model performance. Copyright 2005 ACM.
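The abstract does not say which form of the Kappa statistic was used. Assuming the common Cohen's kappa for two categorical maps flattened to equal-length label lists, agreement between ground truth and model prediction could be computed as in the sketch below; the function name `cohen_kappa` and the example data are hypothetical.

```python
from collections import Counter


def cohen_kappa(observed, predicted):
    """Cohen's kappa between two equal-length sequences of class labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from the class frequencies.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    p_o = sum(o == p for o, p in zip(observed, predicted)) / n

    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    p_e = sum(obs_counts[k] * pred_counts[k] for k in obs_counts) / (n * n)

    if p_e == 1.0:
        # Both inputs contain a single identical class; agreement is total.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical cells: 1 = mining activity present, 0 = absent.
ground_truth = [1, 1, 0, 0]
model_output = [1, 0, 0, 0]
kappa = cohen_kappa(ground_truth, model_output)
```

Unlike raw percentage agreement, kappa discounts the matches that two maps with the same class frequencies would produce by chance, which matters when one class (such as "no activity") dominates the landscape.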
Algorithmic commonalities in the parallel environment
NASA Technical Reports Server (NTRS)
Mcanulty, Michael A.; Wainer, Michael S.
1987-01-01
The ultimate aim of this project was to analyze procedures from substantially different application areas to discover what is either common or peculiar in the process of conversion to the Massively Parallel Processor (MPP). Three areas were identified: molecular dynamic simulation, production systems (rule systems), and various graphics and vision algorithms. To date, only selected graphics procedures have been investigated. They are the most readily available, and produce the most visible results. These include simple polygon patch rendering, raycasting against a constructive solid geometric model, and stochastic or fractal based textured surface algorithms. Only the simplest of conversion strategies, mapping a major loop to the array, has been investigated so far. It is not entirely satisfactory.
A parallel dynamic programming algorithm for multi-reservoir system optimization
NASA Astrophysics Data System (ADS)
Li, Xiang; Wei, Jiahua; Li, Tiejian; Wang, Guangqian; Yeh, William W.-G.
2014-05-01
This paper develops a parallel dynamic programming algorithm to optimize the joint operation of a multi-reservoir system. First, a multi-dimensional dynamic programming (DP) model is formulated for a multi-reservoir system. Second, the DP algorithm is parallelized using a peer-to-peer parallel paradigm. The parallelization is based on the distributed memory architecture and the message passing interface (MPI) protocol. We consider both the distributed computing and distributed computer memory in the parallelization. The parallel paradigm aims at reducing the computation time as well as alleviating the computer memory requirement associated with running a multi-dimensional DP model. Next, we test the parallel DP algorithm on the classic, benchmark four-reservoir problem on a high-performance computing (HPC) system with up to 350 cores. Results indicate that the parallel DP algorithm exhibits good performance in parallel efficiency; the parallel DP algorithm is scalable and will not be restricted by the number of cores. Finally, the parallel DP algorithm is applied to a real-world, five-reservoir system in China. The results demonstrate the parallel efficiency and practical utility of the proposed methodology.
Parallel algorithm strategies for circuit simulation.
Thornquist, Heidi K.; Schiek, Richard Louis; Keiter, Eric Richard
2010-01-01
Circuit simulation tools (e.g., SPICE) have become invaluable in the development and design of electronic circuits. However, they have been pushed to their performance limits in addressing circuit design challenges that come from the technology drivers of smaller feature scales and higher integration. Improving the performance of circuit simulation tools through exploiting new opportunities in widely-available multi-processor architectures is a logical next step. Unfortunately, not all traditional simulation applications are inherently parallel, and quickly adapting mature application codes (even codes designed to be parallel applications) to new parallel paradigms can be prohibitively difficult. In general, performance is influenced by many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, the use of mini-applications, small self-contained proxies for real applications, is an excellent approach for rapidly exploring the parameter space of all these choices. In this report we present a multi-core performance study of Xyce, a transistor-level circuit simulation tool, and describe the future development of a mini-application for circuit simulation.
Parallel Harmony Search Based Distributed Energy Resource Optimization
Ceylan, Oguzhan; Liu, Guodong; Tomsovic, Kevin
2015-01-01
This paper presents a harmony search based parallel optimization algorithm to minimize voltage deviations in three-phase unbalanced electrical distribution systems and to maximize active power outputs of distributed energy resources (DR). The main contribution is to reduce the adverse impacts on the voltage profile as photovoltaic (PV) output or electric vehicle (EV) charging changes throughout a day. The IEEE 123-bus distribution test system is modified by adding DRs and EVs under different load profiles. The simulation results show that by using parallel computing techniques, heuristic methods may be used as an alternative optimization tool in electrical power distribution systems operation.
Parallel LU-factorization algorithms for dense matrices
Oppe, T.C.; Kincaid, D.R.
1987-05-01
Several serial and parallel algorithms for computing the LU-factorization of a dense matrix are investigated. Numerical experiments and programming considerations to reduce bank conflicts on the Cray X-MP4 parallel computer are presented. Speedup factors are given for the parallel algorithms. 15 refs., 6 tabs.
A parallel genetic algorithm for the set partitioning problem
Levine, D.
1994-05-01
In this dissertation the author reports on his efforts to develop a parallel genetic algorithm and apply it to the solution of the set partitioning problem -- a difficult combinatorial optimization problem used by many airlines as a mathematical model for flight crew scheduling. He developed a distributed steady-state genetic algorithm in conjunction with a specialized local search heuristic for solving the set partitioning problem. The genetic algorithm is based on an island model where multiple independent subpopulations each run a steady-state genetic algorithm on their subpopulation and occasionally fit strings migrate between the subpopulations. Tests on forty real-world set partitioning problems were carried out on up to 128 nodes of an IBM SP1 parallel computer. The authors found that performance, as measured by the quality of the solution found and the iteration on which it was found, improved as additional subpopulations were added to the computation. With larger numbers of subpopulations the genetic algorithm was regularly able to find the optimal solution to problems having up to a few thousand integer variables. In two cases, high-quality integer feasible solutions were found for problems with 36,699 and 43,749 integer variables, respectively. A notable limitation they found was the difficulty solving problems with many constraints.
Massively Parallel Algorithms for Solution of Schrodinger Equation
NASA Technical Reports Server (NTRS)
Fijany, Amir; Barhen, Jacob; Toomerian, Nikzad
1994-01-01
In this paper massively parallel algorithms for solution of Schrodinger equation are developed. Our results clearly indicate that the Crank-Nicolson method, in addition to its excellent numerical properties, is also highly suitable for massively parallel computation.
Towards Distributed Memory Parallel Program Analysis
Quinlan, D; Barany, G; Panas, T
2008-06-17
This paper presents a parallel attribute evaluation for distributed memory parallel computer architectures where previously only shared memory parallel support for this technique has been developed. Attribute evaluation is a part of how attribute grammars are used for program analysis within modern compilers. Within this work, we have extended ROSE, an open compiler infrastructure, with a distributed memory parallel attribute evaluation mechanism to support user defined global program analysis required for some forms of security analysis which cannot be addressed by a file by file view of large scale applications. As a result, user defined security analyses may now run in parallel without the user having to specify the way data is communicated between processors. The automation of communication enables an extensible open-source parallel program analysis infrastructure.
Differences Between Distributed and Parallel Systems
Brightwell, R.; Maccabe, A.B.; Rissen, R.
1998-10-01
Distributed systems have been studied for twenty years and are now coming into wider use as fast networks and powerful workstations become more readily available. In many respects a massively parallel computer resembles a network of workstations and it is tempting to port a distributed operating system to such a machine. However, there are significant differences between these two environments and a parallel operating system is needed to get the best performance out of a massively parallel system. This report characterizes the differences between distributed systems, networks of workstations, and massively parallel systems and analyzes the impact of these differences on operating system design. In the second part of the report, we introduce Puma, an operating system specifically developed for massively parallel systems. We describe Puma portals, the basic building blocks for message passing paradigms implemented on top of Puma, and show how the differences observed in the first part of the report have influenced the design and implementation of Puma.
Parallel implementation of the FETI-DPEM algorithm for general 3D EM simulations
NASA Astrophysics Data System (ADS)
Li, Yu-Jia; Jin, Jian-Ming
2009-05-01
A parallel implementation of the electromagnetic dual-primal finite element tearing and interconnecting algorithm (FETI-DPEM) is designed for general three-dimensional (3D) electromagnetic large-scale simulations. As a domain decomposition implementation of the finite element method, the FETI-DPEM algorithm provides fully decoupled subdomain problems and an excellent numerical scalability, and thus is well suited for parallel computation. The parallel implementation of the FETI-DPEM algorithm on a distributed-memory system using the message passing interface (MPI) is discussed in detail along with a few practical guidelines obtained from numerical experiments. Numerical examples are provided to demonstrate the efficiency of the parallel implementation.
A parallel algorithm for the non-symmetric eigenvalue problem
Dongarra, J.; Sidani, M.
1991-12-01
This paper describes a parallel algorithm for computing the eigenvalues and eigenvectors of a non-symmetric matrix. The algorithm is based on a divide-and-conquer procedure and uses an iterative refinement technique.
Parallelization of Nullspace Algorithm for the computation of metabolic pathways.
Jevremović, Dimitrije; Trinh, Cong T; Srienc, Friedrich; Sosa, Carlos P; Boley, Daniel
2011-06-01
Elementary mode analysis is a useful metabolic pathway analysis tool in understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady state conditions. However, computation of the elementary modes of a genome-scale metabolic network with 100-1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace Algorithm. In this work, we develop a distributed memory parallelization of the Nullspace Algorithm to handle efficiently the computation of the elementary modes of a large metabolic network. We give an implementation in C++ language with the support of MPI library functions for the parallel communication. Our proposed algorithm is accompanied with an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute-nodes and specific communication patterns to reduce the communication overhead and improve efficiency.
A general construction for parallelizing Metropolis−Hastings algorithms
Calderhead, Ben
2014-01-01
Markov chain Monte Carlo methods (MCMC) are essential tools for solving many modern-day statistical and computational problems; however, a major limitation is the inherently sequential nature of these algorithms. In this paper, we propose a natural generalization of the Metropolis−Hastings algorithm that allows for parallelizing a single chain using existing MCMC methods. We do so by proposing multiple points in parallel, then constructing and sampling from a finite-state Markov chain on the proposed points such that the overall procedure has the correct target density as its stationary distribution. Our approach is generally applicable and straightforward to implement. We demonstrate how this construction may be used to greatly increase the computational speed and statistical efficiency of a variety of existing MCMC methods, including Metropolis-Adjusted Langevin Algorithms and Adaptive MCMC. Furthermore, we show how it allows for a principled way of using every integration step within Hamiltonian Monte Carlo methods; our approach increases robustness to the choice of algorithmic parameters and results in increased accuracy of Monte Carlo estimates with little extra computational cost. PMID:25422442
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Parallel algorithms and architecture for computation of manipulator forward dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n^2), and the O(n^3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processor bounds are of O(log^2 n) and O(n^4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n^3) serial algorithms. Parallel computation of the O(n^3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation of the forward dynamics.
Parallel Breadth-First Search on Distributed Memory Systems
Computational Research Division; Buluc, Aydin; Madduri, Kamesh
2011-04-15
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
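As an illustration of the level-synchronous strategy named in the abstract above, the following is a minimal serial sketch in Python. It is not the authors' implementation: the function name `level_synchronous_bfs`, the dictionary-based adjacency format, and the small example graph are assumptions made for illustration. In a distributed version, each process would expand only the frontier vertices it owns and exchange newly discovered vertices before the next level begins.

```python
def level_synchronous_bfs(adjacency, source):
    """Return {vertex: level} for every vertex reachable from source.

    adjacency is a hypothetical dict mapping each vertex to a list of
    its neighbors. The whole frontier of one level is expanded before
    any vertex of the next level is visited.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # A distributed implementation would split this loop across
        # processes and synchronize once per level.
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Small undirected example graph; vertex 5 is isolated.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
levels = level_synchronous_bfs(graph, 0)
```

The per-level barrier is what makes the strategy simple to partition by vertex, at the cost of one communication round per BFS level.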
Sort-First, Distributed Memory Parallel Visualization and Rendering
Bethel, E. Wes; Humphreys, Greg; Paul, Brian; Brederson, J. Dean
2003-07-15
While commodity computing and graphics hardware has increased in capacity and dropped in cost, it is still quite difficult to make effective use of such systems for general-purpose parallel visualization and graphics. We describe the results of a recent project that provides a software infrastructure suitable for general-purpose use by parallel visualization and graphics applications. Our work combines and extends two technologies: Chromium, a stream-oriented framework that implements the OpenGL programming interface; and OpenRM Scene Graph, a pipelined-parallel scene graph interface for graphics data management. Using this combination, we implement a sort-first, distributed memory, parallel volume rendering application. We describe the performance characteristics in terms of bandwidth requirements and highlight key algorithmic considerations needed to implement the sort-first system. We characterize system performance using a distributed memory parallel volume rendering application, and present performance gains realized by using scene specific knowledge to accelerate rendering through reduced network bandwidth. The contribution of this work is an exploration of general-purpose, sort-first architecture performance characteristics as applied to distributed memory, commodity hardware, along with a description of the algorithmic support needed to realize parallel, sort-first implementations.
Parallel optimization algorithms and their implementation in VLSI design
NASA Technical Reports Server (NTRS)
Lee, G.; Feeley, J. J.
1991-01-01
Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
Applications and accuracy of the parallel diagonal dominant algorithm
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1993-01-01
The Parallel Diagonal Dominant (PDD) algorithm is a highly efficient, ideally scalable tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric, and anti-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the algorithm is a good candidate for the emerging massively parallel machines.
A Parallel Algorithm for Contact in a Finite Element Hydrocode
Pierce, Timothy G.
2003-06-01
A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies change as collision and sliding occur. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the same processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests is provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. This algorithm
An efficient parallel algorithm for accelerating computational protein design
Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang
2014-01-01
Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/~compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991
AN ALGORITHM FOR PARALLEL SN SWEEPS ON UNSTRUCTURED MESHES
S. D. PAUTZ
2000-12-01
We develop a new algorithm for performing parallel S_n sweeps on unstructured meshes. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned mesh. For typical problems and with "normal" mesh partitionings we have observed nearly linear speedups on up to 126 processors. This is an important and desirable result, since although analyses of structured meshes indicate that parallel sweeps will not scale with normal partitioning approaches, we do not observe any severe asymptotic degradation in the parallel efficiency with modest (≤100) levels of parallelism. This work is a fundamental step in the development of parallel S_n methods.
A new scheduling algorithm for parallel sparse LU factorization with static pivoting
Grigori, Laura; Li, Xiaoye S.
2002-08-20
In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of L' and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU_DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.
A Parallel Algorithm for the Vehicle Routing Problem
Groer, Christopher S; Golden, Bruce; Wasil, Edward
2011-01-01
The vehicle routing problem (VRP) is a difficult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly find high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of different scenarios.
Parallelization and automatic data distribution for nuclear reactor simulations
Liebrock, L.M.
1997-07-01
Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.
Distributed Minimum Hop Algorithms
1982-01-01
acknowledgement), node d starts iteration i+1, and otherwise the algorithm terminates. A detailed description of the algorithm is given in pidgin algol...precise behavior of the algorithm under these circumstances is described by the pidgin algol program in the appendix which is executed by each node. The...l) < N!(2) for each neighbor j, and thus by induction,J -1 N!(2-1) < n-i + (Z-1) + N!(Z-1), completing the proof. Algorithm Dl in Pidgin Algol It is
Distributed parallel messaging for multiprocessor systems
Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka
2013-06-04
A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit includes a switch interface for reading from the memory system when injecting packets into the network.
Algorithm Calculates Cumulative Poisson Distribution
NASA Technical Reports Server (NTRS)
Bowerman, Paul N.; Nolty, Robert C.; Scheuer, Ernest M.
1992-01-01
Algorithm calculates accurate values of cumulative Poisson distribution under conditions where other algorithms fail because numbers are so small (underflow) or so large (overflow) that computer cannot process them. Factors inserted temporarily to prevent underflow and overflow. Implemented in CUMPOIS computer program described in "Cumulative Poisson Distribution Program" (NPO-17714).
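The underflow and overflow problem described in the abstract above can be illustrated with a short Python sketch. This is not the CUMPOIS method, which inserts temporary scaling factors; the sketch substitutes a different, commonly used technique, summing the terms in log space. The function names `poisson_log_cdf` and `poisson_cdf` are hypothetical.

```python
import math


def poisson_log_cdf(n, lam):
    """Natural log of P(X <= n) for X ~ Poisson(lam), with lam > 0, n >= 0.

    Each term exp(-lam) * lam**k / k! is formed as a logarithm, so
    neither lam**k (overflow) nor exp(-lam) (underflow) is evaluated
    directly. The terms are then combined with a log-sum-exp step.
    """
    log_terms = [-lam + k * math.log(lam) - math.lgamma(k + 1)
                 for k in range(n + 1)]
    largest = max(log_terms)
    scaled_sum = sum(math.exp(t - largest) for t in log_terms)
    return largest + math.log(scaled_sum)


def poisson_cdf(n, lam):
    """P(X <= n), clamped to 1.0 to absorb rounding error."""
    return min(1.0, math.exp(poisson_log_cdf(n, lam)))


# A naive evaluation would fail here: exp(-2000) underflows to 0.0
# and 2000**2000 overflows a double.
mid_tail = poisson_cdf(2000, 2000.0)
```

Working with `poisson_log_cdf` directly is useful when the probability itself is too small to represent as a double.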
Distributed Parallel Particle Advection using Work Requesting
Muller, Cornelius; Camp, David; Hentschel, Bernd; Garth, Christoph
2013-09-30
Particle advection is an important vector field visualization technique that is difficult to apply to very large data sets in a distributed setting due to scalability limitations in existing algorithms. In this paper, we report on several experiments using work requesting dynamic scheduling which achieves balanced work distribution on arbitrary problems with minimal communication overhead. We present a corresponding prototype implementation, provide and analyze benchmark results, and compare our results to an existing algorithm.
A Parallel Prefix Algorithm for Almost Toeplitz Tridiagonal Systems
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Joslin, Ronald D.
1995-01-01
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study has been conducted to provide a simple truncation formula. Experimental results have been measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for symmetric, almost symmetric Toeplitz tridiagonal systems and for the compact scheme on high-performance computers.
An efficient parallel algorithm for matrix-vector multiplication
Hendrickson, B.; Leland, R.; Plimpton, S.
1993-03-01
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/sqrt(p) + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
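To show where a communication cost proportional to n divided by the square root of p can come from, the following Python sketch simulates a generic two-dimensional block decomposition of y = A x on a q x q processor grid (p = q * q). It is an assumption-laden illustration, not the hypercube algorithm of the paper: the function name `block_matvec` and the serial simulation of the processors are hypothetical. Each simulated processor touches only a vector piece of length n/q.

```python
def block_matvec(A, x, q):
    """Compute y = A x by simulating a q x q grid of processors.

    Processor (i, j) owns block A[i*b:(i+1)*b][j*b:(j+1)*b] and needs
    only the piece x[j*b:(j+1)*b], where b = n // q. Partial results
    are summed along each processor row to form the pieces of y.
    """
    n = len(A)
    assert n % q == 0, "this sketch assumes q divides n"
    b = n // q
    y = [0.0] * n
    for i in range(q):
        for j in range(q):
            # Local work of simulated processor (i, j).
            x_piece = x[j * b:(j + 1) * b]
            partial = [
                sum(A[i * b + r][j * b + c] * x_piece[c] for c in range(b))
                for r in range(b)
            ]
            # Simulated row-wise reduction of the partial results.
            for r in range(b):
                y[i * b + r] += partial[r]
    return y


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, -1, 2]
y = block_matvec(A, x, 2)
```

In a real distributed code the inner accumulation would be a reduction across the processors of one grid row, and the vector pieces would be distributed along grid columns.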
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1997-01-01
The multiblock reacting Navier-Stokes flow solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1994-01-01
The multiblock reacting Navier-Stokes flow-solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
A fast algorithm for reordering sparse matrices for parallel factorization
Lewis, J.G.; Peyton, B.W.; Pothen, A.
1989-01-01
Jess and Kees introduced a method for ordering a sparse symmetric matrix A for efficient parallel factorization. The parallel ordering is computed in two steps. First, the matrix A is ordered by some fill-reducing ordering. Second, a parallel ordering of A is computed from the filled graph that results from factoring A using the initial fill-reducing ordering. Among all orderings whose fill lies in the filled graph, this parallel ordering achieves the minimum number of parallel steps in the factorization of A. Jess and Kees did not specify the implementation details of an algorithm for either step of this scheme. Liu and Mirzaian (1987) designed an algorithm implementing the second step, but it has time and space requirements higher than the cost of computing common fill-reducing orderings. We present here a new fast algorithm that implements the parallel ordering step by exploiting the clique tree representation of a chordal graph. We succeed in reducing the cost of the parallel ordering step well below that of the fill-reducing step. Our algorithm has time and space complexity linear in the number of compressed subscripts of L, i.e., the sum of the sizes of the maximal cliques of the filled graph. Empirically we demonstrate running times nearly identical to Liu's heuristic Composite Rotations algorithm that approximates the minimum number of parallel steps. 21 refs., 3 figs., 4 tabs.
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
NASA Technical Reports Server (NTRS)
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel efficiency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Ray tracing on distributed memory parallel systems
NASA Technical Reports Server (NTRS)
Jensen, David W.; Reed, Daniel A.
1990-01-01
Among the many techniques in computer graphics, ray tracing is prized because it can render realistic images, albeit at great computational expense. In this note, the performance of several approaches to ray tracing on a distributed memory parallel system is evaluated. A set of performance instrumentation tools and their associated visualization software are used to identify the underlying causes of performance differences.
Parallel algorithms for interactive manipulation of digital terrain models
NASA Technical Reports Server (NTRS)
Davis, E. W.; Mcallister, D. F.; Nagaraj, V.
1988-01-01
Interactive three-dimensional graphics applications, such as terrain data representation and manipulation, require extensive arithmetic processing. Massively parallel machines are attractive for this application since they offer high computational rates, and grid connected architectures provide a natural mapping for grid based terrain models. Presented here are algorithms for data movement on the massive parallel processor (MPP) in support of pan and zoom functions over large data grids. It is an extension of earlier work that demonstrated real-time performance of graphics functions on grids that were equal in size to the physical dimensions of the MPP. When the dimensions of a data grid exceed the processing array size, data is packed in the array memory. Windows of the total data grid are interactively selected for processing. Movement of packed data is needed to distribute items across the array for efficient parallel processing. Execution time for data movement was found to exceed that for arithmetic aspects of graphics functions. Performance figures are given for routines written in MPP Pascal.
Exact parallel maximum clique algorithm for general and protein graphs.
Depolli, Matjaž; Konc, Janez; Rozman, Kati; Trobec, Roman; Janežič, Dušanka
2013-09-23
A new exact parallel maximum clique algorithm MaxCliquePara, which finds the maximum clique (the fully connected subgraph) in undirected general and protein graphs, is presented. First, a new branch and bound algorithm for finding a maximum clique on a single computer core, which builds on ideas presented in two published state-of-the-art sequential algorithms, is implemented. The new sequential MaxCliqueSeq algorithm is faster than the reference algorithms both on DIMACS benchmark graphs and on protein-derived product graphs used for protein structural comparisons. Next, the MaxCliqueSeq algorithm is parallelized by splitting the branch-and-bound search tree across multiple cores, resulting in the MaxCliquePara algorithm. The ability to exploit all cores efficiently makes the new parallel MaxCliquePara algorithm markedly superior to other tested algorithms. On a 12-core computer, the parallelization provides up to 2 orders of magnitude faster execution on the large DIMACS benchmark graphs and up to an order of magnitude faster execution on protein product graphs. The algorithms are freely accessible on http://commsys.ijs.si/~matjaz/maxclique.
NASA Astrophysics Data System (ADS)
Chen, Yufeng; Wu, Zebin; Sun, Le; Wei, Zhihui; Li, Yonglong
2016-04-01
With the gradual increase in the spatial and spectral resolution of hyperspectral images, the size of image data becomes larger and larger, and the complexity of processing algorithms is growing, which poses a big challenge to efficient massive hyperspectral image processing. Cloud computing technologies distribute computing tasks to a large number of computing resources for handling large data sets without the limitation of memory and computing resource of a single machine. This paper proposes a parallel pixel purity index (PPI) algorithm for unmixing massive hyperspectral images based on a MapReduce programming model for the first time in the literature. According to the characteristics of hyperspectral images, we describe the design principle of the algorithm, illustrate the main cloud unmixing processes of PPI, and analyze the time complexity of serial and parallel algorithms. Experimental results demonstrate that the parallel implementation of the PPI algorithm on the cloud can effectively process big hyperspectral data and accelerate the algorithm.
Parallel algorithms and architectures for the manipulator inertia matrix
Amin-Javaheri, M.
1989-01-01
Several parallel algorithms and architectures to compute the manipulator inertia matrix in real time are proposed. An O(N) and an O(log_2 N) parallel algorithm based upon recursive computation of the inertial parameters of sets of composite rigid bodies are formulated. One- and two-dimensional systolic architectures are presented to implement the O(N) parallel algorithm. A cube architecture is employed to compute the diagonal elements of the inertia matrix in O(log_2 N) time and the upper off-diagonal elements in O(N) time. The resulting K_1 O(N) + K_2 O(log_2 N) parallel algorithm is more efficient for a cube network implementation. All the architectural configurations are based upon a VLSI Robotics Processor exploiting fine-grain parallelism. In evaluating all the architectural configurations, significant performance parameters such as I/O time and idle time due to processor synchronization as well as CPU utilization and on-chip memory size are fully included. The O(N) and O(log_2 N) parallel algorithms adhere to the precedence relationships among the processors. To achieve a higher speedup factor, however, parallel algorithms in conjunction with Non-Strict Computational Models are devised to relax interprocess precedence and, as a result, to decrease the effective computational delays. The effectiveness of the Non-strict Computational Algorithms is verified by computer simulations, based on a PUMA 560 robot manipulator. It is demonstrated that a combination of parallel algorithms and architectures results in a very effective approach to achieve real-time response for computing the manipulator inertia matrix.
Algorithms for parallel and vector computations
NASA Technical Reports Server (NTRS)
Ortega, James M.
1995-01-01
This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March, 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners, and (4) SOR as a preconditioner.
Algorithmic support for commodity-based parallel computing systems.
Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann
2003-10-01
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in R^d, find k points that minimize their average pairwise L_1 distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
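The clustering problem stated above (choose k of n points so that their average pairwise L_1 distance is minimal) can be made concrete with an exhaustive search. This is a minimal sketch for illustration only: it is not one of the report's exact or approximate algorithms, it takes exponential time, and the function names and the free-processor coordinates are hypothetical.

```python
from itertools import combinations

def l1(p, q):
    """L_1 (Manhattan) distance, i.e. the hop count on a grid without wraparound."""
    return sum(abs(a - b) for a, b in zip(p, q))

def avg_pairwise_l1(points):
    """Average L_1 distance over all unordered pairs (needs at least 2 points)."""
    pairs = list(combinations(points, 2))
    return sum(l1(p, q) for p, q in pairs) / len(pairs)

def best_k_subset(points, k):
    """Try every k-subset and keep the one with the smallest average distance."""
    return min(combinations(points, k), key=avg_pairwise_l1)

# Hypothetical free processors on a 2D grid; the job needs k = 4 of them.
free = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (6, 5)]
chosen = best_k_subset(free, 4)
print(chosen, avg_pairwise_l1(chosen))
```

Here the compact 2x2 block is selected, which matches the intuition that localized allocations keep communication hops low.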
Hadoop neural network for parallel and distributed feature selection.
Hodge, Victoria J; O'Keefe, Simon; Austin, Jim
2016-06-01
In this paper, we introduce a theoretical basis for a Hadoop-based neural network for parallel and distributed feature selection in Big Data sets. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits with the Hadoop paradigm. There are many feature selectors described in the literature which all have various strengths and weaknesses. We present the implementation details of five feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop YARN. Hadoop allows parallel and distributed processing. Each feature selector can be divided into subtasks and the subtasks can then be processed in parallel. Multiple feature selectors can also be processed simultaneously (in parallel) allowing multiple feature selectors to be compared. We identify commonalities among the five features selectors. All can be processed in the framework using a single representation and the overall processing can also be greatly reduced by only processing the common aspects of the feature selectors once and propagating these aspects across all five feature selectors as necessary. This allows the best feature selector and the actual features to select to be identified for large and high dimensional data sets through exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop.
An improved spectral graph partitioning algorithm for mapping parallel computations
Hendrickson, B.; Leland, R.
1992-09-01
Efficient use of a distributed memory parallel computer requires that the computational load be balanced across processors in a way that minimizes interprocessor communication. We present a new domain mapping algorithm that extends recent work in which ideas from spectral graph theory have been applied to this problem. Our generalization of spectral graph bisection involves a novel use of multiple eigenvectors to allow for division of a computation into four or eight parts at each stage of a recursive decomposition. The resulting method is suitable for scientific computations like irregular finite elements or differences performed on hypercube or mesh architecture machines. Experimental results confirm that the new method provides better decompositions arrived at more economically and robustly than with previous spectral methods. We have also improved upon the known spectral lower bound for graph bisection.
Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
2014-01-01
It is very time consuming to solve fractional differential equations. The computational complexity of the two-dimensional fractional differential equation (2D-TFDE) with the iterative implicit finite difference method is O(M_x M_y N^2). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on a single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We believe that parallel computing technology will become a very basic method for computationally intensive fractional applications in the near future.
A Programming Environment for Parallel Vision Algorithms
1990-04-11
linear parallel speedup. Many applications for the image processing pipeline (including tracking, color histogramming, feature detection, frame-rate...pure logic. For example, a language based on algebra of real numbers might treat constraints such as "X = Y + Z", "X = Y x Z", and so on as primitives. A...however, time for a more usable version of the language. A front end processor is therefore being written to parse expressions written in an algebraic
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
2014-01-01
Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel
Multimodal Estimation of Distribution Algorithms.
Yang, Qiang; Chen, Wei-Neng; Li, Yun; Chen, C L Philip; Xu, Xiang-Min; Zhang, Jun
2016-02-15
Taking the advantage of estimation of distribution algorithms (EDAs) in preserving high diversity, this paper proposes a multimodal EDA. Integrated with clustering strategies for crowding and speciation, two versions of this algorithm are developed, which operate at the niche level. Then these two algorithms are equipped with three distinctive techniques: 1) a dynamic cluster sizing strategy; 2) an alternative utilization of Gaussian and Cauchy distributions to generate offspring; and 3) an adaptive local search. The dynamic cluster sizing affords a potential balance between exploration and exploitation and reduces the sensitivity to the cluster size in the niching methods. Taking advantages of Gaussian and Cauchy distributions, we generate the offspring at the niche level through alternatively using these two distributions. Such utilization can also potentially offer a balance between exploration and exploitation. Further, solution accuracy is enhanced through a new local search scheme probabilistically conducted around seeds of niches with probabilities determined self-adaptively according to fitness values of these seeds. Extensive experiments conducted on 20 benchmark multimodal problems confirm that both algorithms can achieve competitive performance compared with several state-of-the-art multimodal algorithms, which is supported by nonparametric tests. Especially, the proposed algorithms are very promising for complex problems with many local optima.
Fast parallel algorithm for CT image reconstruction.
Flores, Liubov A; Vidal, Vicent; Mayo, Patricia; Rodenas, Francisco; Verdú, Gumersindo
2012-01-01
In X-ray computed tomography (CT) the X rays are used to obtain the projection data needed to generate an image of the inside of an object. The image can be generated with different techniques. Iterative methods are more suitable for the reconstruction of images with high contrast and precision in noisy conditions and from a small number of projections. Their use may be important in portable scanners for their functionality in emergency situations. However, in practice, these methods are not widely used due to the high computational cost of their implementation. In this work we analyze iterative parallel image reconstruction with the Portable Extensive Toolkit for Scientific computation (PETSc).
Resource Management for Distributed Parallel Systems
NASA Technical Reports Server (NTRS)
Neuman, B. Clifford; Rao, Santosh
1993-01-01
Multiprocessor systems should exist in the larger context of distributed systems, allowing multiprocessor resources to be shared by those that need them. Unfortunately, typical multiprocessor resource management techniques do not scale to large networks. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in large networks and multiprocessor systems. To manage resources in such distributed parallel systems, PRM employs three types of managers: system managers, job managers, and node managers. There exist multiple independent instances of each type of manager, reducing bottlenecks. The complexity of each manager is further reduced because each is designed to utilize information at an appropriate level of abstraction.
Efficient sequential and parallel algorithms for record linkage
Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar
2014-01-01
Background and objective: Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods: Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results: Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions: We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
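The two ideas named in the Methods (removing identical records first, then linking similar records in a graph and taking connected components) can be sketched as follows. This is an illustrative simplification, not the authors' implementation: a plain sort over a set stands in for the radix sort, similarity is assumed to mean edit distance at most a hypothetical threshold, all pairs are compared, and the sample names are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def link_records(records, threshold=1):
    """Group records whose similarity graph connects them."""
    unique = sorted(set(records))        # stand-in for radix-sort deduplication
    parent = list(range(len(unique)))    # union-find forest over unique records

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if edit_distance(unique[i], unique[j]) <= threshold:
                parent[find(i)] = find(j)   # add an edge: merge components

    groups = {}
    for i, rec in enumerate(unique):
        groups.setdefault(find(i), []).append(rec)
    return sorted(groups.values())

records = ["john smith", "jon smith", "john smith",
           "mary jones", "mary jonas", "alan kay"]
clusters = link_records(records)
print(clusters)
```

Removing exact duplicates before the pairwise step shrinks the quadratic comparison, which is the motivation the abstract gives for sorting first.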
Cloud identification using genetic algorithms and massively parallel computation
NASA Technical Reports Server (NTRS)
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkably consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMs) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
A scalable parallel algorithm for multiple objective linear programs
NASA Technical Reports Server (NTRS)
Wiecek, Malgorzata M.; Zhang, Hong
1994-01-01
This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives especially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
A communication-less parallel algorithm for tridiagonal Toeplitz systems
NASA Astrophysics Data System (ADS)
McNally, Jeffrey M.; Garey, L. E.; Shaw, R. E.
2008-03-01
Diagonally dominant tridiagonal Toeplitz systems of linear equations arise in many application areas and have been well studied in the past. Modern interest in numerical linear algebra is often focusing on solving classic problems in parallel. In McNally [Fast parallel algorithms for tri-diagonal symmetric Toeplitz systems, MCS Thesis, University of New Brunswick, Saint John, 1999], an m processor Split & Correct algorithm was presented for approximating the solution to a symmetric tridiagonal Toeplitz linear system of equations. Nemani [Perturbation methods for circulant-banded systems and their parallel implementation, Ph.D. Thesis, University of New Brunswick, Saint John, 2001] and McNally (2003) adapted the works of Rojo [A new method for solving symmetric circulant tri-diagonal system of linear equations, Comput. Math. Appl. 20 (1990) 61-67], Yan and Chung [A fast algorithm for solving special tri-diagonal systems, Computing 52 (1994) 203-211] and McNally et al. [A split-correct parallel algorithm for solving tri-diagonal symmetric Toeplitz systems, Internat. J. Comput. Math. 75 (2000) 303-313] to the non-symmetric case. In this paper we present relevant background from these methods and then introduce an m processor scalable communication-less approximation algorithm for solving a diagonally dominant tridiagonal Toeplitz system of linear equations.
Parallel Detection Algorithm for Fast Frequency Hopping OFDM
NASA Astrophysics Data System (ADS)
Kun, Xu; Xiao-xin, Yi
2011-05-01
Fast frequency hopping OFDM (FFH-OFDM) exploits frequency diversity in one OFDM symbol to enhance conventional OFDM performance without using channel coding. Zero-forcing (ZF) and minimum mean square error (MMSE) equalization were first used to detect FFH-OFDM signals, with relatively poor bit error rate (BER) performance compared to the QR-based detection algorithm. This paper proposes a parallel detection algorithm (PDA) to further improve the BER performance with parallel interference cancelation (PIC) based on the MMSE criterion. Our proposed PDA not only improves the BER performance in the high signal to noise ratio (SNR) regime but also has a lower decoding delay than the QR-based detection algorithm while maintaining comparable computational complexity. Simulation results indicate that at BER = 10^-3 the PDA achieves a 5 dB SNR gain over the QR-based detection algorithm, and more as SNR increases.
Parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Amin-Javaheri, Masoud; Orin, David E.
1989-01-01
The development of an O(log_2 N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log_2 N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying size, and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log_2 N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
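Recursive doubling, as mentioned above, can be illustrated on a scalar first-order linear recurrence x_i = a_i * x_(i-1) + b_i. This is a generic sketch, not the paper's formulation: the actual recurrences involve the inertial quantities of composite rigid bodies, and the function names and coefficients here are assumptions. Each step is treated as an affine map, and maps a distance d apart are composed at each level, so N steps need about log_2 N levels; the inner loop is written serially but its iterations are independent.

```python
def serial_recurrence(a, b, x0):
    """Reference: x_i = a_i * x_(i-1) + b_i evaluated one step at a time."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs

def recursive_doubling(a, b, x0):
    """Same values, computed in ceil(log2 N) levels of independent updates."""
    n = len(a)
    maps = list(zip(a, b))   # maps[i] = (A, B) represents x -> A * x + B
    levels, d = 0, 1
    while d < n:
        new = list(maps)
        for i in range(d, n):                 # independent: parallelizable
            a_hi, b_hi = maps[i]              # covers the later steps
            a_lo, b_lo = maps[i - d]          # covers the earlier steps
            new[i] = (a_hi * a_lo, a_hi * b_lo + b_hi)  # hi applied after lo
        maps = new
        d *= 2
        levels += 1
    return [A * x0 + B for A, B in maps], levels

# Hypothetical integer coefficients so the comparison is exact.
a = [2, 3, 1, 4, 2, 1, 3, 2]
b = [1, 0, 2, 1, 3, 1, 0, 2]
values, levels = recursive_doubling(a, b, x0=1)
print(values, levels)
```

For N = 8 the doubling loop runs three levels, compared with eight dependent steps in the serial version.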
A biconjugate gradient type algorithm on massively parallel architectures
NASA Technical Reports Server (NTRS)
Freund, Roland W.; Hochbruck, Marlis
1991-01-01
The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
Highly parallel consistent labeling algorithm suitable for optoelectronic implementation.
Marsden, G C; Kiamilev, F; Esener, S; Lee, S H
1991-01-10
Constraint satisfaction problems require a search through a large set of possibilities. Consistent labeling is a method by which search spaces can be drastically reduced. We present a highly parallel consistent labeling algorithm, which achieves strong k-consistency for any value k and which can include higher-order constraints. The algorithm uses vector outer product, matrix summation, and matrix intersection operations. These operations require local computation with global communication and, therefore, are well suited to an optoelectronic implementation.
NavP: Structured and Multithreaded Distributed Parallel Programming
NASA Technical Reports Server (NTRS)
Pan, Lei
2007-01-01
We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the current prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and it does not change the code structure of an original algorithm. This is in sharp contrast to MP as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. Approaches such as DSM or HPF have failed to deliver satisfying performance as of today in contrast, even if they are relatively easy to use compared to MP; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine- (multithreading on shared memory) and coarse- (pipelined tasks on distributed memory) grained parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.
NASA Astrophysics Data System (ADS)
Mighell, Kenneth John
2011-11-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly-parallel algorithms. The CRBLASTER source code is freely available at the official application website at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800x800 pixel Hubble Space Telescope WFPC2 image takes 44 seconds with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. CRBLASTER is 7.4 times faster processing the same image on a single core on the same machine. Processing the same image with CRBLASTER simultaneously on all 8 cores of the same machine takes 0.875 seconds -- which is a speedup factor of 50.3 times faster than the IRAF script. A detailed analysis is presented of the performance of CRBLASTER using between 1 and 57 processors on a low-power Tilera 700-MHz 64-core TILE64 processor.
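The timing figures quoted above can be cross-checked with a few lines of arithmetic. The first three values come from the abstract; the single-core CRBLASTER time, the self-relative speedup, and the parallel efficiency are derived here from those rounded figures and are therefore approximate. The variable names are illustrative.

```python
t_iraf = 44.0              # s, IRAF script lacos_im.cl on one core (from the abstract)
single_core_factor = 7.4   # CRBLASTER vs. IRAF on one core (from the abstract)
t_crblaster_8 = 0.875      # s, CRBLASTER on all 8 cores (from the abstract)

# Derived quantities (not stated in the abstract).
t_crblaster_1 = t_iraf / single_core_factor       # about 5.9 s on one core
speedup_vs_iraf = t_iraf / t_crblaster_8          # matches the quoted 50.3
speedup_vs_self = t_crblaster_1 / t_crblaster_8   # about 6.8 on 8 cores
efficiency = speedup_vs_self / 8                  # about 0.85

print(round(speedup_vs_iraf, 1), round(speedup_vs_self, 1), round(efficiency, 2))
```

Most of the factor of 50.3 therefore comes from the faster single-core code, with the remaining factor of roughly 6.8 coming from parallel execution on 8 cores.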
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer
NASA Technical Reports Server (NTRS)
Jones, Mark Howard
1987-01-01
A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
NASA Astrophysics Data System (ADS)
Reif, John H.; Tyagi, Akhilesh
1997-10-01
Optical-computing technology offers new challenges to algorithm designers since it can perform an n-point discrete Fourier transform (DFT) computation in only unit time. Note that the DFT is a nontrivial computation in the parallel random-access machine model, a model of computing commonly used by parallel-algorithm designers. We develop two new models, the DFT VLSIO (very-large-scale integrated optics) and the DFT circuit, to capture this characteristic of optical computing. We also provide two paradigms for developing parallel algorithms in these models. Efficient parallel algorithms for many problems, including polynomial and matrix computations, sorting, and string matching, are presented. The sorting and string-matching algorithms are particularly noteworthy. Almost all these algorithms are within a polylog factor of the optical-computing (VLSIO) lower bounds derived by Barakat and Reif [Appl. Opt. 26, 1015 (1987)] and by Tyagi and Reif [Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing (Institute of Electrical and Electronics Engineers, New York, 1990), p. 14].
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
NASA Technical Reports Server (NTRS)
Lohn, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris; Norvig, Peter (Technical Monitor)
2000-01-01
We describe a parallel genetic algorithm (GA) that automatically generates circuit designs using evolutionary search. A circuit-construction programming language is introduced and we show how evolution can generate practical analog circuit designs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. We present experimental results as applied to analog filter and amplifier design tasks.
Parallel processors and nonlinear structural dynamics algorithms and software
NASA Technical Reports Server (NTRS)
Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.
1989-01-01
The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.
Parallel algorithm of VLBI software correlator under multiprocessor environment
NASA Astrophysics Data System (ADS)
Zheng, Weimin; Zhang, Dong
2007-11-01
The correlator is the key signal processing equipment of a Very Long Baseline Interferometry (VLBI) synthetic aperture telescope. It receives the mass data collected by the VLBI observatories and produces the visibility function of the target, which can be used for spacecraft positioning, baseline length measurement, synthesis imaging, and other scientific applications. VLBI data correlation is both data-intensive and computation-intensive. This paper presents the algorithms of two parallel software correlators under multiprocessor environments. A near real-time correlator for spacecraft tracking adopts pipelining and thread-parallel technology, and runs on SMP (Symmetric Multiple Processor) servers. Another high-speed prototype correlator, using a mixed Pthreads and MPI (Message Passing Interface) parallel algorithm, is realized on a small Beowulf cluster platform. Both correlators have a flexible structure, are scalable, and can correlate data from 10 stations.
The delayed coupling method: An algorithm for solving banded diagonal matrix problems in parallel
Mattor, N.; Williams, T.J.; Hewett, D.W.; Dimits, A.M.
1997-09-01
We present a new algorithm for solving banded diagonal matrix problems efficiently on distributed-memory parallel computers, designed originally for use in dynamic alternating-direction implicit partial differential equation solvers. The algorithm optimizes efficiency with respect to the number of numerical operations and to the amount of interprocessor communication. This is called the "delayed coupling method" because the communication is deferred until needed. We focus here on tridiagonal and periodic tridiagonal systems.
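For orientation, the serial baseline that parallel tridiagonal solvers are usually measured against is the Thomas algorithm. The sketch below is that standard serial method, not the delayed coupling method itself, whose details the abstract does not give; the function name and argument layout are our own choices.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    No pivoting is done, so the matrix is assumed diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: each row depends on the previous one,
    # which is the sequential chain a parallel method has to break.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Example: the 3x3 system with 2 on the diagonal and -1 off it.
solution = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                             [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
print(solution)
```

The forward sweep is inherently sequential, which is why distributed-memory solvers restructure the work so that each processor handles its own block and exchanges only a small amount of boundary data.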
Fast parallel algorithms: from images to level sets and labels
NASA Astrophysics Data System (ADS)
Nguyen, H. T.; Jung, Ken K.; Raghavan, Raghu
1990-07-01
Decomposition into level sets refers to assigning a code with respect to intensity or elevation, while labeling refers to assigning a code with respect to disconnected regions. We present a sequence of parallel algorithms for these two processes. The process of labeling includes reassigning labels into a natural sequence and comparing different labeling algorithms. We discuss the difference between edge-based and region-based labeling. The speed improvements in this labeling scheme come from the collective efficiency of different techniques. We have implemented these algorithms on an in-house built Geometric Single Instruction Multiple Data (GSIMD) parallel machine with global buses and a Multiple Instruction Multiple Data (MIMD) controller. This allows real-time image interpretation on live data at a rate that is much higher than video rate. The performance figures will be shown.
Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration
Masalma, Yahya; Jiao, Yu
2010-10-01
We implemented a scalable parallel quasi-Monte Carlo numerical high-dimensional integration for tera-scale data points. The implemented algorithm uses Sobol's quasi-random sequences to generate samples. Sobol's sequence was used to avoid clustering effects in the generated samples and to produce low-discrepancy samples which cover the entire integration domain. The performance of the algorithm was tested. The results obtained demonstrate the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using the hybrid MPI and OpenMP programming model to improve the performance of the algorithms. If the mixed model is used, attention should be paid to the scalability and accuracy.
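A minimal sketch of quasi-Monte Carlo integration over the unit square is shown below. To stay dependency-free it uses a Halton sequence as a stand-in low-discrepancy generator rather than the Sobol sequence used in the report; the function names and the test integrand are illustrative assumptions.

```python
def van_der_corput(index, base):
    """Radical inverse of `index` in `base`: one coordinate of a Halton point."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def halton_points(count, bases=(2, 3)):
    """First `count` Halton points; one prime base per dimension."""
    return [tuple(van_der_corput(i, b) for b in bases)
            for i in range(1, count + 1)]


def qmc_integrate(f, count, bases=(2, 3)):
    """Estimate the integral of f over the unit cube as a sample mean."""
    points = halton_points(count, bases)
    return sum(f(*p) for p in points) / count


# Integral of x*y over [0, 1]^2 is exactly 1/4.
estimate = qmc_integrate(lambda x, y: x * y, 4096)
print(estimate)
```

Because each point depends only on its index, a parallel version can give every process a disjoint index range and combine the partial sums at the end.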
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine can achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
A parallel stereo reconstruction algorithm with applications in entomology (APSRA)
NASA Astrophysics Data System (ADS)
Bhasin, Rajesh; Jang, Won Jun; Hart, John C.
2012-03-01
We propose a fast parallel algorithm for the reconstruction of three-dimensional point clouds of insects from binocular stereo image pairs using a hierarchical approach for disparity estimation. Entomologists study various features of insects to classify them, build their distribution maps, and discover genetic links between specimens, among various other essential tasks. This information is important to the pesticide and pharmaceutical industries, among others. Given the large collections of insects that entomologists analyze, it becomes difficult to physically handle an entire collection and share the data with researchers across the world. With the method presented in our work, entomologists can create an image database for their collections and use the 3D models to study the shape and structure of the insects, making the collections easier to maintain and share. Initial feedback shows that the reconstructed 3D models preserve the shape and size of the specimen. We further optimize our results by incorporating multiview stereo, which produces a better overall structure of the insects. Our main contribution is applying stereoscopic vision techniques to entomology to solve the problems faced by entomologists.
Parallel global optimization with the particle swarm algorithm.
Schutte, J F; Reinbolt, J A; Fregly, B J; Haftka, R T; George, A D
2004-12-07
Present day engineering optimization problems often impose large computational demands, resulting in long solution times even on a modern high-end processor. To obtain enhanced computational throughput and global search capability, we detail the coarse-grained parallelization of an increasingly popular global search method, the particle swarm optimization (PSO) algorithm. Parallel PSO performance was evaluated using two categories of optimization problems possessing multiple local minima: large-scale analytical test problems with computationally cheap function evaluations and medium-scale biomechanical system identification problems with computationally expensive function evaluations. For load-balanced analytical test problems formulated using 128 design variables, speedup was close to ideal and parallel efficiency above 95% for up to 32 nodes on a Beowulf cluster. In contrast, for load-imbalanced biomechanical system identification problems with 12 design variables, speedup plateaued and parallel efficiency decreased almost linearly with increasing number of nodes. The primary factor affecting parallel performance was the synchronization requirement of the parallel algorithm, which dictated that each iteration must wait for completion of the slowest fitness evaluation. When the analytical problems were solved using a fixed number of swarm iterations, a single population of 128 particles produced a better convergence rate than did multiple independent runs performed using sub-populations (8 runs with 16 particles, 4 runs with 32 particles, or 2 runs with 64 particles). These results suggest that (1) parallel PSO exhibits excellent parallel performance under load-balanced conditions, (2) an asynchronous implementation would be valuable for real-life problems subject to load imbalance, and (3) larger population sizes should be considered when multiple processors are available.
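The synchronization effect described above can be illustrated with a toy cost model: in a synchronous iteration, wall time is set by the most heavily loaded node, so efficiency is total work divided by (nodes × slowest node time). The function name, the round-robin assignment, and all cost values below are hypothetical and are not taken from the paper.

```python
import random


def synchronous_efficiency(eval_times, nodes):
    """Parallel efficiency of one synchronous swarm iteration.

    eval_times holds the cost of each particle's fitness evaluation;
    particles are dealt round-robin to nodes, and the iteration ends
    only when the slowest node has finished.
    """
    per_node = [sum(eval_times[k::nodes]) for k in range(nodes)]
    wall_time = max(per_node)
    return sum(eval_times) / (nodes * wall_time)


# Load-balanced case: every evaluation costs the same.
balanced = synchronous_efficiency([1.0] * 128, 32)

# Load-imbalanced case: evaluation cost varies from particle to particle.
rng = random.Random(1)
imbalanced_costs = [rng.uniform(0.5, 1.5) for _ in range(32)]
imbalanced = synchronous_efficiency(imbalanced_costs, 32)

print(f"balanced: {balanced:.2f}, imbalanced: {imbalanced:.2f}")
```

An asynchronous scheme would let a node start its next evaluation without waiting for the slowest one, which is the remedy the authors suggest for load-imbalanced problems.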
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design uses fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on an Nvidia GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Time parallelization of plasma simulations using the parareal algorithm
Samaddar, D.; Houlberg, Wayne A; Berry, Lee A; Elwasif, Wael R; Huysmans, G; Batchelor, Donald B
2011-01-01
Simulation of fusion plasmas involves a broad range of timescales. In magnetically confined plasmas, such as in ITER, the timescales associated with the microturbulence responsible for transport and the confinement timescales differ by a factor of 10^6 to 10^9. Simulating this entire range of timescales is currently impossible, even on the most powerful supercomputers available. Space parallelization has so far been the most common approach to solving partial differential equations. Space parallelization alone has led to computational saturation for fluid codes, which means that the wall time for computation does not decrease linearly with the increasing number of processors used. The application of the parareal algorithm to simulations of fusion plasmas opens a new avenue of parallelization, namely temporal parallelization. The algorithm has been successfully applied to plasma turbulence simulations, prior to which it had been applied to other, relatively simpler problems. This work explores the extension of the applicability of the parareal algorithm to ITER-relevant problems, starting with a diffusion-convection model.
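The parareal idea can be sketched on a scalar test equation. The code below applies the standard parareal correction, U[n+1] = G(U_new[n]) + F(U_old[n]) - G(U_old[n]), with forward Euler as both the coarse propagator G (few steps) and the fine propagator F (many steps). The test problem dy/dt = -y and all parameter values are illustrative assumptions, not the plasma models of the paper.

```python
import math


def euler(y, t0, t1, steps, rhs):
    """Advance y from t0 to t1 with `steps` forward-Euler steps."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = y + h * rhs(t, y)
        t += h
    return y


def parareal(rhs, y0, t_end, slices, iterations, coarse_steps=1, fine_steps=100):
    """Return the time-slice boundaries and the parareal solution there."""
    ts = [t_end * n / slices for n in range(slices + 1)]

    def coarse(y, n):
        return euler(y, ts[n], ts[n + 1], coarse_steps, rhs)

    def fine(y, n):
        return euler(y, ts[n], ts[n + 1], fine_steps, rhs)

    # Initial guess: one cheap serial sweep with the coarse propagator.
    u = [y0]
    for n in range(slices):
        u.append(coarse(u[n], n))

    for _ in range(iterations):
        # These fine solves are independent, so a real code would run
        # one per processor; here they are evaluated in a plain loop.
        fine_vals = [fine(u[n], n) for n in range(slices)]
        coarse_old = [coarse(u[n], n) for n in range(slices)]
        # Serial correction sweep with the cheap coarse propagator.
        new = [y0]
        for n in range(slices):
            new.append(coarse(new[n], n) + fine_vals[n] - coarse_old[n])
        u = new
    return ts, u


decay = lambda t, y: -y
ts, u = parareal(decay, 1.0, 2.0, slices=10, iterations=10)
serial_fine = euler(1.0, 0.0, 2.0, 10 * 100, decay)
print(u[-1], serial_fine, math.exp(-2.0))
```

After as many iterations as there are time slices, the result matches the serial fine solution; the method pays off only when far fewer iterations reach the required accuracy.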
Approximation algorithms for scheduling unrelated parallel machines with release dates
NASA Astrophysics Data System (ADS)
Avdeenko, T. V.; Mesentsev, Y. A.; Estraykh, I. V.
2017-01-01
In this paper we propose approaches to optimal scheduling of unrelated parallel machines with release dates. One approach is based on a dynamic programming scheme modified with adaptive narrowing of the search domain, which ensures its computational effectiveness. We discuss the complexity of synthesizing exact schedules and compare it with that of approximate, close-to-optimal solutions. We also explain how the algorithm works on an example with two unrelated parallel machines and five jobs with release dates. Performance results that show the efficiency of the proposed approach are given.
Fault Tolerant Statistical Signal Processing Algorithms for Parallel Architectures.
2014-09-26
AD-fi57 393. Fault Tolerant Statistical Signal Processing Algorithms for Parallel Architectures. Johns Hopkins University, Baltimore, MD, Dept. of Electrical... Technical report. Keywords: Fault Tolerance, Signal Processing, Parallel Architecture.
NASA Astrophysics Data System (ADS)
Mighell, Kenneth John
2010-10-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the
NASA Astrophysics Data System (ADS)
Hu, Hongda; Shu, Hong
2015-05-01
Heavy computation limits the use of Kriging interpolation methods in many real-time applications, especially with the ever-increasing problem size. Many researchers have realized that parallel processing techniques are critical to fully exploit computational resources and feasibly solve computation-intensive problems like Kriging. Much research has addressed the parallelization of the traditional approach to Kriging, but this computation-intensive procedure may not be suitable for high-resolution interpolation of spatial data. On the basis of a more effective serial approach, we propose an improved coarse-grained parallel algorithm to accelerate ordinary Kriging interpolation. In particular, the interpolation task of each unobserved point is considered as a basic parallel unit. To reduce time complexity and memory consumption, the large right-hand-side matrix in the Kriging linear system is transformed and fixed at only two columns and is therefore no longer directly dependent on the number of unobserved points. The MPI (Message Passing Interface) model is employed to implement our parallel programs in a homogeneous distributed memory system. Experimentally, the improved parallel algorithm performs better than the traditional one in spatial interpolation of annual average precipitation in Victoria, Australia. For example, when the number of processors is 24, the improved algorithm maintains a speed-up of 20.8 while the speed-up of the traditional algorithm only reaches 9.3. Likewise, the weak scaling efficiency of the improved algorithm is nearly 90% while that of the traditional algorithm drops to almost 40% with 16 processors. Experimental results also demonstrate that the performance of the improved algorithm is enhanced by increasing the problem size.
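As a reading aid, the quoted speed-ups at 24 processors can be converted to strong-scaling efficiency, E(p) = S(p)/p. This is our own arithmetic from the figures in the abstract and is a different measure from the weak scaling efficiencies the authors report for 16 processors.

$$E(p) = \frac{S(p)}{p}, \qquad E_{\text{improved}}(24) = \frac{20.8}{24} \approx 0.87, \qquad E_{\text{traditional}}(24) = \frac{9.3}{24} \approx 0.39$$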
A Distributed Polygon Retrieval Algorithm Using MapReduce
NASA Astrophysics Data System (ADS)
Guo, Q.; Palanisamy, B.; Karimi, H. A.
2015-07-01
The burst of large-scale spatial terrain data due to the proliferation of data acquisition devices like 3D laser scanners poses challenges to spatial data analysis and computation. Among many spatial analyses and computations, polygon retrieval is a fundamental operation which is often performed under real-time constraints. However, existing sequential algorithms fail to meet this demand for larger sizes of terrain data. Motivated by the MapReduce programming model, a well-adopted large-scale parallel data processing technique, we present a MapReduce-based polygon retrieval algorithm designed with the objective of reducing the IO and CPU loads of spatial data processing. By indexing the data based on a quad-tree approach, a significant amount of unneeded data is filtered in the filtering stage and it reduces the IO overhead. The indexed data also facilitates querying the relationship between the terrain data and query area in shorter time. The results of the experiments performed in our Hadoop cluster demonstrate that our algorithm performs significantly better than the existing distributed algorithms.
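The filtering role of a quad-tree index can be sketched in a single process. The class below indexes 2D points and answers a rectangular query while skipping every cell that does not intersect the query area. It is a simplified illustration under our own assumptions (points instead of terrain polygons, no MapReduce or Hadoop), and every name in it is hypothetical.

```python
class QuadTree:
    """Point quad-tree over the square [x0, x0 + size] x [y0, y0 + size]."""

    def __init__(self, x0, y0, size, capacity=4):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []
        self.children = None  # four sub-cells once this cell is split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split an over-full leaf; the size guard stops endless splitting
        # when many identical points are inserted.
        if len(self.points) > self.capacity and self.size > 1e-9:
            self._split()

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x0 + dx * half, self.y0 + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x0 + half else 0
        row = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * row + col]

    def query(self, qx0, qy0, qx1, qy1):
        """Return the points inside the query rectangle."""
        # Filtering step: a cell that misses the query is never opened.
        if (qx1 < self.x0 or qx0 > self.x0 + self.size or
                qy1 < self.y0 or qy0 > self.y0 + self.size):
            return []
        if self.children is None:
            return [p for p in self.points
                    if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1]
        found = []
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found


tree = QuadTree(0.0, 0.0, 8.0)
for i in range(8):
    for j in range(8):
        tree.insert((i + 0.5, j + 0.5))
hits = sorted(tree.query(0.0, 0.0, 2.0, 2.0))
print(hits)
```

In a distributed setting the same pruning would decide which data blocks need to be read at all, which is where the I/O saving comes from.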
Parallel and Grid-Based Data Mining - Algorithms, Models and Systems for High-Performance KDD
NASA Astrophysics Data System (ADS)
Congiusta, Antonio; Talia, Domenico; Trunfio, Paolo
Data Mining is often a computing-intensive and time-consuming process. For this reason, several Data Mining systems have been implemented on parallel computing platforms to achieve high performance in the analysis of large data sets. Moreover, when large data repositories are coupled with geographical distribution of data, users and systems, more sophisticated technologies are needed to implement high-performance distributed KDD systems. Since computational Grids emerged as privileged platforms for distributed computing, a growing number of Grid-based KDD systems has been proposed. In this chapter we first discuss different ways to exploit parallelism in the main Data Mining techniques and algorithms, then we discuss Grid-based KDD systems. Finally, we introduce the Knowledge Grid, an environment which makes use of standard Grid middleware to support the development of parallel and distributed knowledge discovery applications.
Feed-forward volume rendering algorithm for moderately parallel MIMD machines
NASA Technical Reports Server (NTRS)
Yagel, Roni
1993-01-01
Algorithms for direct volume rendering on parallel and vector processors are investigated. Volumes are transformed efficiently on parallel processors by dividing the data into slices and beams of voxels. Equal sized sets of slices along one axis are distributed to processors. Parallelism is achieved at two levels. Because each slice can be transformed independently of others, processors transform their assigned slices with no communication, thus providing maximum possible parallelism at the first level. Within each slice, consecutive beams are incrementally transformed using coherency in the transformation computation. Also, coherency across slices can be exploited to further enhance performance. This coherency yields the second level of parallelism through the use of the vector processing or pipelining. Other ongoing efforts include investigations into image reconstruction techniques, load balancing strategies, and improving performance.
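The first level of parallelism, distributing equal-sized sets of slices along one axis, amounts to a block partition of slice indices. The helper below is a hypothetical sketch of such a partition; the name and the handling of a remainder (earlier processors get one extra slice) are our assumptions.

```python
def assign_slices(num_slices, num_procs):
    """Split slice indices 0..num_slices-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_slices, num_procs)
    ranges, start = [], 0
    for rank in range(num_procs):
        # The first `extra` processors take one additional slice.
        count = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


for rank, slices in enumerate(assign_slices(10, 4)):
    print(rank, list(slices))
```

Because each slice is transformed independently, a processor needs no data from its neighbours once it holds its own range.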
Parallel Algorithms for Graph Optimization using Tree Decompositions
Sullivan, Blair D; Weerapurage, Dinesh P; Groer, Christopher S
2012-06-01
Although many $\mathcal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
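The simplest instance of this kind of dynamic programming is maximum weighted independent set on a tree, that is, a graph of tree-width 1. The sketch below is the textbook serial version for that special case, not the authors' parallel algorithm for general tree decompositions; the adjacency-list format is an assumption.

```python
def max_weight_independent_set(adj, weight, root=0):
    """Weight of a maximum weighted independent set of a tree.

    adj maps each vertex to a list of its neighbours; weight maps each
    vertex to a non-negative number.
    """
    # Breadth-first order from the root, so parents precede children.
    parent = {root: None}
    order = [root]
    for v in order:
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                order.append(u)

    take = {v: weight[v] for v in order}  # best value with v in the set
    skip = {v: 0 for v in order}          # best value with v left out
    # Process children before parents and push results upward.
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            take[p] += skip[v]
            skip[p] += max(take[v], skip[v])
    return max(take[root], skip[root])


# Path 0-1-2-3 with weights 1, 4, 5, 4: the best set is {1, 3}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
best = max_weight_independent_set(path, {0: 1, 1: 4, 2: 5, 3: 4})
print(best)
```

For larger tree-width each table entry ranges over all independent subsets of a bag rather than a single take/skip choice, which is the source of the memory pressure the abstract mentions.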
Pathfinder: A parallel search algorithm for concerted atomistic events
NASA Astrophysics Data System (ADS)
Nakano, Aiichiro
2007-02-01
An algorithm has been designed to search for the escape paths with the lowest activation barriers when starting from a local minimum-energy configuration of a many-atom system. The pathfinder algorithm combines: (1) a steered eigenvector-following method that guides a constrained escape from the convex region and subsequently climbs to a transition state tangentially to the eigenvector corresponding to the lowest negative Hessian eigenvalue; (2) discrete abstraction of the atomic configuration to systematically enumerate concerted events as linear combinations of atomistic events; (3) evolutionary control of the population dynamics of low activation-barrier events; and (4) hybrid task + spatial decompositions to implement massive search for complex events on parallel computers. The program exhibits good scalability on parallel computers and has been used to study concerted bond-breaking events in the fracture of alumina.
Load Balancing and Data Locality in the Parallelization of the Fast Multipole Algorithm
NASA Astrophysics Data System (ADS)
Banicescu, Ioana
Scientific problems are often irregular, large and computationally intensive. Efficient parallel implementations of algorithms that are employed in finding solutions to these problems play an important role in the development of science. This thesis studies the parallelization of a certain class of irregular scientific problems, the N-body problem, using a classical hierarchical algorithm: the Fast Multipole Algorithm (FMA). Hierarchical N-body algorithms in general, and the FMA in particular, are amenable to parallel execution. However, performance gains are difficult to obtain, due to load imbalances that are primarily caused by the irregular distribution of bodies and of computation domains. Understanding application characteristics is essential for obtaining high performance implementations on parallel machines. After surveying the available parallelism in the FMA, we address the problem of exploiting this parallelism with partitioning and scheduling techniques that optimally map it onto a parallel machine, the KSR1. The KSR1 is a parallel shared address-space machine with a hierarchical cache-only architecture. The tension between maintaining data locality and balancing processor loads requires a scheduling scheme that combines static techniques (that exploit data locality) with dynamic techniques (that improve load balancing). An effective combined scheduling scheme that balances processor loads and maintains locality, by exploiting self-similarity properties of fractals, is Fractiling. Fractiling is based on a probabilistic analysis. It thus accommodates load imbalances caused by predictable events (such as irregular data) as well as unpredictable events (such as data access latency). Fractiling adapts to algorithmic and system induced load imbalances while maximizing data locality. We used Fractiling to schedule a parallel FMA on the KSR1. Our parallel 2-d and 3-d FMA implementations were run using uniform and nonuniform data set distributions under a
A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix
NASA Technical Reports Server (NTRS)
Shroff, Gautam
1989-01-01
A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this problem typically display only linear convergence. Sequential norm-reducing algorithms also exist and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm, and certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.
HPC-NMF: A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization
Kannan, Ramakrishnan; Sukumar, Sreenivas R.; Ballard, Grey M.; Park, Haesun
2016-08-22
NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for $W$ and $H$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: it performs well for both dense and sparse matrices, and allows the user to choose any one of multiple algorithms for solving the updates to the low-rank factors $W$ and $H$ within the alternating iterations.
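For reference, the alternating scheme mentioned above is usually written as follows. The notation is the standard one and is assumed here rather than taken from the paper: $A$ is the nonnegative $m \times n$ data matrix, $k$ is the chosen rank, and $W$ ($m \times k$) and $H$ ($k \times n$) are the factors.

$$\min_{W \ge 0,\; H \ge 0} \; \lVert A - WH \rVert_F^2$$

$$H \leftarrow \arg\min_{H \ge 0} \lVert A - WH \rVert_F^2, \qquad W \leftarrow \arg\min_{W \ge 0} \lVert A - WH \rVert_F^2$$

Each update is a non-negative least squares problem with the other factor held fixed, and the two updates are repeated until a stopping criterion is met.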
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
In part 1, the architecture of NETRA is presented. A performance evaluation of NETRA using several common vision algorithms is also presented. The performance of algorithms when they are mapped on one cluster is described. It is shown that SIMD, MIMD, and systolic algorithms can be easily mapped onto processor clusters, and almost linear speedups are possible. For some algorithms, analytical performance results are compared with implementation performance results. It is observed that the analysis is very accurate. A performance analysis of parallel algorithms when mapped across clusters is presented. Mappings across clusters illustrate the importance and use of shared as well as distributed memory in achieving high performance. The parameters for evaluation are derived from the characteristics of the parallel algorithms, and these parameters are used to evaluate the alternative communication strategies in NETRA. Furthermore, the effect of communication interference from other processors in the system on the execution of an algorithm is studied. Using the analysis, the performance of many algorithms with different characteristics is presented. It is observed that if communication speeds are matched with the computation speeds, good speedups are possible when algorithms are mapped across clusters.
Superelement model based parallel algorithm for vehicle dynamics
NASA Astrophysics Data System (ADS)
Agrawal, O. P.; Danhof, K. J.; Kumar, R.
1994-05-01
This paper presents a superelement model based parallel algorithm for planar vehicle dynamics. The vehicle model is made up of a chassis and two suspension systems, each of which consists of an axle-wheel assembly and two trailing arms. In this model, the chassis is treated as a Cartesian element and each suspension system is treated as a superelement. The parameters associated with the superelements are computed using an inverse dynamics technique. Suspension shock absorbers and the tires are modeled by nonlinear springs and dampers. The Euler-Lagrange approach is used to develop the system equations of motion. This leads to a system of differential and algebraic equations in which the constraints internal to superelements appear only explicitly. The above formulation is implemented on a multiprocessor machine. The numerical flow chart is divided into modules and the computation of several modules is performed in parallel to gain computational efficiency. In this implementation, the master (parent processor) creates a pool of slaves (child processors) at the beginning of the program. The slaves remain in the pool until they are needed to perform certain tasks. Upon completion of a particular task, a slave returns to the pool. This improves the overall response time of the algorithm. The formulation presented is general, which makes it attractive for general-purpose code development. Speedups obtained in the different modules of the dynamic analysis computation are also presented. Results show that the superelement model based parallel algorithm can significantly reduce the vehicle dynamics simulation time.
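The pool pattern described above, where workers are created once and reused at every time step, can be sketched with Python's standard thread pool. The module functions and the toy state update are placeholders of our own; the paper's implementation targets a multiprocessor machine and is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(modules, state, steps, workers=2):
    """Run independent modules in parallel at each step, reusing one pool."""
    history = []
    # The pool is created once, before the time loop, so the cost of
    # starting workers is not paid again at every step.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(steps):
            # Hand each module to an idle worker.
            futures = {name: pool.submit(fn, state) for name, fn in modules.items()}
            # Collect the results; finished workers return to the pool.
            results = {name: f.result() for name, f in futures.items()}
            state = state + sum(results.values())  # toy update rule
            history.append(state)
    return history


# Hypothetical modules standing in for per-superelement computations.
modules = {
    "front_suspension": lambda s: 1,
    "rear_suspension": lambda s: 2,
}
print(simulate(modules, 0, steps=3))
```

Creating the pool outside the loop is the design point: repeated creation and teardown of workers would add latency to every step.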
A parallel genetic algorithm for the set partitioning problem
Levine, D.
1996-12-31
This paper describes a parallel genetic algorithm developed for the solution of the set partitioning problem, a difficult combinatorial optimization problem used by many airlines as a mathematical model for flight crew scheduling. The genetic algorithm is based on an island model where multiple independent subpopulations each run a steady-state genetic algorithm on their own subpopulation and occasionally fit strings migrate between the subpopulations. Tests on forty real-world set partitioning problems were carried out on up to 128 nodes of an IBM SP1 parallel computer. We found that performance, as measured by the quality of the solution found and the iteration on which it was found, improved as additional subpopulations were added to the computation. With larger numbers of subpopulations the genetic algorithm was regularly able to find the optimal solution to problems having up to a few thousand integer variables. In two cases, high-quality integer feasible solutions were found for problems with 36,699 and 43,749 integer variables, respectively. A notable limitation we found was the difficulty solving problems with many constraints.
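The island model described in this abstract can be illustrated with a minimal serial sketch. It is not the authors' implementation: the OneMax objective stands in for a set partitioning cost, the ring migration topology and all function names and parameters are assumptions made for illustration, and the islands are stepped in a loop rather than on separate nodes.

```python
import random

def fitness(bits):
    # Toy objective (OneMax), an assumed stand-in for a set partitioning cost.
    return sum(bits)

def steady_state_step(pop, rng):
    # One steady-state step: create a single child and let it replace the worst member.
    def pick():
        a, b = rng.sample(pop, 2)          # binary tournament selection
        return a if fitness(a) >= fitness(b) else b
    p1, p2 = pick(), pick()
    cut = rng.randrange(1, len(p1))        # one-point crossover
    child = p1[:cut] + p2[cut:]
    if rng.random() < 0.5:                 # occasional single bit-flip mutation
        child[rng.randrange(len(child))] ^= 1
    worst = min(range(len(pop)), key=lambda k: fitness(pop[k]))
    if fitness(child) >= fitness(pop[worst]):
        pop[worst] = child

def migrate(islands):
    # Ring migration (assumed topology): a copy of each island's best string
    # replaces the worst string of the next island.
    bests = [max(pop, key=fitness)[:] for pop in islands]
    for k, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = bests[k - 1]

def island_ga(n_islands=4, pop_size=20, n_bits=32, steps=2000,
              migrate_every=100, seed=0):
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n_bits)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for t in range(1, steps + 1):
        for pop in islands:                # on a parallel machine: one island per node
            steady_state_step(pop, rng)
        if t % migrate_every == 0:
            migrate(islands)
    return max((ind for pop in islands for ind in pop), key=fitness)
```

Calling `island_ga()` returns the best bit string found across all islands. Because islands interact only at migration time, the inner loop over islands is the part that would be distributed across nodes.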
A novel highly parallel algorithm for linearly unmixing hyperspectral images
NASA Astrophysics Data System (ADS)
Guerra, Raúl; López, Sebastián.; Callico, Gustavo M.; López, Jose F.; Sarmiento, Roberto
2014-10-01
Endmember extraction and abundance calculation are critical steps in linearly unmixing a hyperspectral image, for two main reasons. First, a set of accurate endmembers must be computed in order to obtain reliable abundance maps. Second, these time-consuming processes involve a huge number of operations. This work proposes an algorithm that estimates the endmembers of a hyperspectral image and their abundances at the same time. The main advantages of this algorithm are its high degree of parallelization and the mathematical simplicity of the operations implemented. The algorithm estimates the endmembers as virtual pixels. In particular, it applies the gradient descent method to iteratively refine the endmembers and the abundances, reducing the mean square error according to the linear unmixing model. Some mathematical restrictions must be added so that the method converges to a unique and realistic solution. Given the nature of the algorithm, these restrictions can be easily implemented. The results obtained with synthetic images demonstrate the good behavior of the proposed algorithm. Moreover, the results obtained with the well-known Cuprite dataset corroborate the benefits of our proposal.
Prototyping Parallel and Distributed Programs in Proteus
1990-10-01
Cole90, Gibb89]. " Highly-parallel processors - Applications for highly-parallel machines such as the CM-2 or the iPSC are programmed using data...Programming, (Prentice-Hall, Englewood Cliffs, NJ) 1990. [Gibb89] Gibbons, P.B., "A more practical PRAM model", in: Proceedings of the First ACM
Serial Order: A Parallel Distributed Processing Approach.
ERIC Educational Resources Information Center
Jordan, Michael I.
Human behavior shows a variety of serially ordered action sequences. This paper presents a theory of serial order which describes how sequences of actions might be learned and performed. In this theory, parallel interactions across time (coarticulation) and parallel interactions across space (dual-task interference) are viewed as two aspects of a…
Potts-model grain growth simulations: Parallel algorithms and applications
Wright, S.A.; Plimpton, S.J.; Swiler, T.P.
1997-08-01
Microstructural morphology and grain boundary properties often control the service properties of engineered materials. This report uses the Potts-model to simulate the development of microstructures in realistic materials. Three areas of microstructural morphology simulations were studied. They include the development of massively parallel algorithms for Potts-model grain growth simulations, modeling of mass transport via diffusion in these simulated microstructures, and the development of a gradient-dependent Hamiltonian to simulate columnar grain growth. Potts grain growth models for massively parallel supercomputers were developed for the conventional Potts-model in both two and three dimensions. Simulations using these parallel codes showed self-similar grain growth and no finite size effects for previously unapproachable large scale problems. In addition, new enhancements to the conventional Metropolis algorithm used in the Potts-model were developed to accelerate the calculations. These techniques enable both the sequential and parallel algorithms to run faster and use essentially an infinite number of grain orientation values to avoid non-physical grain coalescence events. Mass transport phenomena in polycrystalline materials were studied in two dimensions using numerical diffusion techniques on microstructures generated using the Potts-model. The results of the mass transport modeling showed excellent quantitative agreement with one-dimensional diffusion problems; however, the results also suggest that transient multi-dimensional diffusion effects cannot be parameterized as the product of the grain boundary diffusion coefficient and the grain boundary width. Instead, both properties are required. Gradient-dependent grain growth mechanisms were included in the Potts-model by adding an extra term to the Hamiltonian. Under normal grain growth, the primary driving term is the curvature of the grain boundary, which is included in the standard Potts-model Hamiltonian.
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation
NASA Technical Reports Server (NTRS)
Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.
1996-01-01
We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.
Automatic Management of Parallel and Distributed System Resources
NASA Technical Reports Server (NTRS)
Yan, Jerry; Ngai, Tin Fook; Lundstrom, Stephen F.
1990-01-01
Viewgraphs on automatic management of parallel and distributed system resources are presented. Topics covered include: parallel applications; intelligent management of multiprocessing systems; performance evaluation of parallel architecture; dynamic concurrent programs; compiler-directed system approach; lattice gaseous cellular automata; and sparse matrix Cholesky factorization.
Distributed and parallel Ada and the Ada 9X recommendations
NASA Technical Reports Server (NTRS)
Volz, Richard A.; Goldsack, Stephen J.; Theriault, R.; Waldrop, Raymond S.; Holzbacher-Valero, A. A.
1992-01-01
Recently, the DoD has sponsored work towards a new version of Ada, intended to support the construction of distributed systems. The revised version, often called Ada 9X, will become the new standard sometime in the 1990s. It is intended that Ada 9X should provide language features giving limited support for distributed system construction. The requirements for such features are given. Many of the most advanced computer applications involve embedded systems that are composed of parallel processors or networks of distributed computers. If Ada is to become the widely adopted language envisioned by many, it is essential that suitable compilers and tools be available to facilitate the creation of distributed and parallel Ada programs for these applications. The major language issues impacting distributed and parallel programming are reviewed, and some principles upon which distributed/parallel language systems should be built are suggested. Based upon these, alternative language concepts for distributed/parallel programming are analyzed.
Distributed and parallel Ada and the Ada 9X recommendations
Volz, R.A.; Goldsack, S.J.; Theriault, R.; Waldrop, R.S.; Holzbacher-Valero, A.A.
1992-04-01
Recently, the DoD has sponsored work towards a new version of Ada, intended to support the construction of distributed systems. The revised version, often called Ada9x, will become the new standard sometime in the 1990s. It is intended that Ada9x should provide language features giving limited support for distributed system construction. The requirements for such features are given. Many of the most advanced computer applications involve embedded systems that are composed of parallel processors or networks of distributed computers. If Ada is to become the widely adopted language envisioned by many, it is essential that suitable compilers and tools be available to facilitate the creation of distributed and parallel Ada programs for these applications. The major language issues impacting distributed and parallel programming are reviewed, and some principles upon which distributed/parallel language systems should be built are suggested. Based upon these, alternative language concepts for distributed/parallel programming are analyzed.
Space complexity of estimation of distribution algorithms.
Gao, Yong; Culberson, Joseph
2005-01-01
In this paper, we investigate the space complexity of the Estimation of Distribution Algorithms (EDAs), a class of sampling-based variants of the genetic algorithm. By analyzing the nature of EDAs, we identify criteria that characterize the space complexity of two typical implementation schemes of EDAs, the factorized distribution algorithm and Bayesian network-based algorithms. Using random additive functions as the prototype, we prove that the space complexity of the factorized distribution algorithm and Bayesian network-based algorithms is exponential in the problem size even if the optimization problem has a very sparse interaction structure.
Parallel hybrid algorithm for solution in electrical impedance equation
NASA Astrophysics Data System (ADS)
Ponomaryov, Volodymyr; Robles-Gonzalez, Marco; Bucio-Ramirez, Ariana; Ramirez-Tachiquin, Marco; Ramos-Diaz, Eduardo
2015-02-01
This work is dedicated to the analysis of the forward and the inverse problems in order to obtain a better approximation to the Electrical Impedance Tomography equation. For the forward problem we employ a numerical method based on the Taylor series in formal power, and for the inverse problem the Finite Element Method. For the forward problem, we propose a novel algorithm that employs a regularization technique for stability; additionally, parallel computing is used to obtain the solution faster. This modification yields an efficient solution of the forward problem. The solution found is then used in the inverse problem for the approximation employing the Finite Element Method. The algorithms employed in this work are developed in the structured programming paradigm in C++, including parallel processing; the run-time analysis is performed only for the forward problem because the Finite Element Method, being highly recursive, does not accept parallelism. Some examples are performed for this analysis, in which several conductivity functions are employed for two different cases: for the analytical cases, exponential and sinusoidal functions are used, and for the geometrical cases, a centered circle and a five-disk structure are examined as conductivity functions. The Lebesgue measure is used as the metric for error estimation in the forward problem, while in the inverse problem the PSNR, SSIM, and MSE criteria are applied to determine the convergence of both methods.
An experimental APL compiler for a distributed memory parallel machine
Ching, W.M.; Katz, A.
1994-12-31
The authors developed an experimental APL compiler for the IBM SP1 distributed memory parallel machine. It accepts classical APL programs, without additional directives, and generates parallelized C code for execution on the SP1 machine. The compiler exploits data parallelism in APL programs based on parallel high level primitives. Program variables are either replicated or partitioned. They also present performance data for five moderate size programs running on the SP1.
BMI optimization by using parallel UNDX real-coded genetic algorithm with Beowulf cluster
NASA Astrophysics Data System (ADS)
Handa, Masaya; Kawanishi, Michihiro; Kanki, Hiroshi
2007-12-01
This paper deals with a global optimization algorithm for Bilinear Matrix Inequalities (BMIs) based on the Unimodal Normal Distribution Crossover (UNDX) GA. First, by analyzing the structure of the BMIs, the existence of typical difficult structures is confirmed. Then, to improve the performance of the algorithm, and based on the results of the problem-structure analysis and consideration of the characteristic properties of BMIs, we propose an algorithm that uses a primary search direction with relaxed Linear Matrix Inequality (LMI) convex estimation. Moreover, we propose two types of evaluation methods for GA individuals based on LMI calculation that take the characteristic properties of BMIs further into account. In addition, to reduce computational time, we propose a parallelization of the real-coded GA (RCGA) using the Master-Worker paradigm with cluster computing techniques.
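The Master-Worker paradigm mentioned here is commonly applied to the fitness evaluation step of a GA. The sketch below shows only that pattern, under stated assumptions: the sphere function is a toy stand-in for the LMI-based evaluation, threads stand in for cluster nodes, and the names `fitness` and `evaluate_population` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(x):
    # Toy stand-in for the expensive LMI-based evaluation of one individual:
    # negative sphere function, so larger is better.
    return -sum(v * v for v in x)

def evaluate_population(population, n_workers=4):
    # The master hands individuals to a pool of workers and collects the
    # scores in the same order as the population.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fitness, population))

population = [[1.0, 2.0], [0.0, 0.0], [3.0, -1.0]]
scores = evaluate_population(population)
```

The master keeps selection and crossover to itself and farms out only the evaluations, which are independent of one another. On a Beowulf cluster the same structure would typically be expressed with message passing rather than threads.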
MM Algorithms for Some Discrete Multivariate Distributions.
Zhou, Hua; Lange, Kenneth
2010-09-01
The MM (minorization-maximization) principle is a versatile tool for constructing optimization algorithms. Every EM algorithm is an MM algorithm but not vice versa. This article derives MM algorithms for maximum likelihood estimation with discrete multivariate distributions such as the Dirichlet-multinomial and Connor-Mosimann distributions, the Neerchal-Morel distribution, the negative-multinomial distribution, certain distributions on partitions, and zero-truncated and zero-inflated distributions. These MM algorithms increase the likelihood at each iteration and reliably converge to the maximum from well-chosen initial values. Because they involve no matrix inversion, the algorithms are especially pertinent to high-dimensional problems. To illustrate the performance of the MM algorithms, we compare them to Newton's method on data used to classify handwritten digits.
A parallel algorithm for solving the 3d Schroedinger equation
Strickland, Michael; Yager-Elorriaga, David
2010-08-20
We describe a parallel algorithm for solving the time-independent 3d Schroedinger equation using the finite difference time domain (FDTD) method. We introduce an optimized parallelization scheme that reduces communication overhead between computational nodes. We demonstrate that the compute time, t, scales inversely with the number of computational nodes as $t \propto (N_{\mathrm{nodes}})^{-0.95 \pm 0.04}$. This makes it possible to solve the 3d Schroedinger equation on extremely large spatial lattices using a small computing cluster. In addition, we present a new method for precisely determining the energy eigenvalues and wavefunctions of quantum states based on a symmetry constraint on the FDTD initial condition. Finally, we discuss the usage of multi-resolution techniques in order to speed up convergence on extremely large lattices.
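To see what the reported scaling exponent implies, the short sketch below converts it into a predicted run time and a parallel efficiency. It assumes the power law holds all the way down to a single node and uses only the central value of the exponent, -0.95; the function names and the single-node time `t1` are hypothetical.

```python
def predicted_time(t1, n_nodes, exponent=-0.95):
    # Run time on n_nodes, given the single-node time t1 and an assumed
    # power law t = t1 * n_nodes**exponent.
    return t1 * n_nodes ** exponent

def parallel_efficiency(n_nodes, exponent=-0.95):
    # Speedup divided by node count; equals 1.0 for ideal scaling (exponent -1).
    speedup = n_nodes ** (-exponent)
    return speedup / n_nodes
```

Under these assumptions the efficiency on 64 nodes is 64 to the power -0.05, or roughly 81 percent, which is why the abstract describes the scaling as close to inverse in the node count.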
A Simple Physical Optics Algorithm Perfect for Parallel Computing
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1993-01-01
One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Adaptive Mesh Refinement Algorithms for Parallel Unstructured Finite Element Codes
Parsons, I D; Solberg, J M
2006-02-03
This project produced algorithms for and software implementations of adaptive mesh refinement (AMR) methods for solving practical solid and thermal mechanics problems on multiprocessor parallel computers using unstructured finite element meshes. The overall goal is to provide computational solutions that are accurate to some prescribed tolerance, and adaptivity is the correct path toward this goal. These new tools will enable analysts to conduct more reliable simulations at reduced cost, both in terms of analyst and computer time. Previous academic research in the field of adaptive mesh refinement has produced a voluminous literature focused on error estimators and demonstration problems; relatively little progress has been made on producing efficient implementations suitable for large-scale problem solving on state-of-the-art computer systems. Research issues that were considered include: effective error estimators for nonlinear structural mechanics; local meshing at irregular geometric boundaries; and constructing efficient software for parallel computing environments.
Carey, G.F.; Young, D.M.
1993-12-31
The program outlined here is directed to research on methods, algorithms, and software for distributed parallel supercomputers. Of particular interest are finite element methods and finite difference methods together with sparse iterative solution schemes for scientific and engineering computations of very large-scale systems. Both linear and nonlinear problems will be investigated. In the nonlinear case, applications with bifurcation to multiple solutions will be considered using continuation strategies. The parallelizable numerical methods of particular interest are a family of partitioning schemes embracing domain decomposition, element-by-element strategies, and multi-level techniques. The methods will be further developed incorporating parallel iterative solution algorithms with associated preconditioners in parallel computer software. The schemes will be implemented on distributed memory parallel architectures such as the CRAY MPP, Intel Paragon, the NCUBE3, and the Connection Machine. We will also consider other new architectures such as the Kendall-Square (KSQ) and proposed machines such as the TERA. The applications will focus on large-scale three-dimensional nonlinear flow and reservoir problems with strong convective transport contributions. These are legitimate grand challenge class computational fluid dynamics (CFD) problems of significant practical interest to DOE. The methods developed and algorithms will, however, be of wider interest.
Vascular system modeling in parallel environment - distributed and shared memory approaches.
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-07-01
This paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages, and therefore, this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multicore machines, show that both algorithms provide a significant speedup.
Algorithms for parallel flow solvers on message passing architectures
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
Parallel algorithms for finding cliques in a graph
NASA Astrophysics Data System (ADS)
Szabó, S.
2011-01-01
A clique is a subgraph in a graph that is complete in the sense that each two of its nodes are connected by an edge. Finding cliques in a given graph is an important procedure in discrete mathematical modeling. The paper will show how concepts such as splitting partitions, quasi coloring, node and edge dominance are related to clique search problems. In particular we will discuss the connection with parallel clique search algorithms. These concepts also suggest practical guide lines to inspect a given graph before starting a large scale search.
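The definition of a clique given above, and one simple way a clique search can be split into independent subproblems, can be sketched as follows. This is an illustration only and not the splitting-partition or coloring techniques of the paper: the decomposition by smallest node, the adjacency-set representation, and all function names are assumptions.

```python
from itertools import combinations

def is_clique(adj, nodes):
    # A node set is a clique when every pair of its nodes is joined by an edge.
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def grow(adj, clique, candidates):
    # Largest clique that extends `clique` using only nodes from `candidates`;
    # every candidate is adjacent to all nodes already in `clique`.
    best = clique
    for v in sorted(candidates):
        narrowed = {u for u in candidates if u in adj[v] and u > v}
        ext = grow(adj, clique + [v], narrowed)
        if len(ext) > len(best):
            best = ext
    return best

def subproblem(adj, v):
    # Independent task: the largest clique whose smallest node is v.
    return grow(adj, [v], {u for u in adj[v] if u > v})

def max_clique(adj):
    # The subproblems share no state, so each could run on its own processor.
    return max((subproblem(adj, v) for v in adj), key=len)

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(max_clique(adj))  # prints [0, 1, 2]
```

Exhaustive search of this kind is exponential in the worst case, which is why inspecting the graph before launching a large-scale search, as the abstract suggests, matters in practice.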
An intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-09-01
Poisson disk sampling has excellent spatial and spectral properties, and plays an important role in a variety of visual computing applications. Although many promising algorithms have been proposed for multidimensional sampling in Euclidean space, very few studies have been reported with regard to the problem of generating Poisson disks on surfaces due to the complicated nature of the surface. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. In sharp contrast to the conventional parallel approaches, our method neither partitions the given surface into small patches nor uses any spatial data structure to maintain the voids in the sampling domain. Instead, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. Our algorithm guarantees that the generated Poisson disks are uniformly and randomly distributed without bias. It is worth noting that our method is intrinsic and independent of the embedding space. This intrinsic feature allows us to generate Poisson disk patterns on arbitrary surfaces in $\mathbb{R}^n$. To our knowledge, this is the first intrinsic, parallel, and accurate algorithm for surface Poisson disk sampling. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
Efficient parallel algorithms for string editing and related problems
NASA Technical Reports Server (NTRS)
Apostolico, Alberto; Atallah, Mikhail J.; Larmore, Lawrence; Mcfaddin, H. S.
1988-01-01
The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x, or the substitution of a symbol of x with another symbol. This problem has a well-known O(|x||y|) time sequential solution (25). Efficient PRAM (parallel random-access machine) algorithms for the string editing problem are given. If m = min(|x|, |y|) and n = max(|x|, |y|), then the CREW bound is O(log m log n) time with O(mn/log m) processors. In all algorithms, space is O(mn).
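The well-known quadratic-time sequential solution referred to above is the standard dynamic program over prefixes of x and y. The sketch below is that sequential baseline, not the PRAM algorithms of the paper; the uniform per-operation weights and the function name `edit_cost` are assumptions.

```python
def edit_cost(x, y, ins=1, dele=1, sub=1):
    # prev[j] holds the minimum cost of turning the current prefix of x
    # into the first j symbols of y.
    m, n = len(x), len(y)
    prev = [j * ins for j in range(n + 1)]      # empty prefix of x: insert j symbols
    for i in range(1, m + 1):
        cur = [i * dele] + [0] * n              # empty prefix of y: delete i symbols
        for j in range(1, n + 1):
            same = x[i - 1] == y[j - 1]
            cur[j] = min(prev[j] + dele,                       # delete x[i-1]
                         cur[j - 1] + ins,                     # insert y[j-1]
                         prev[j - 1] + (0 if same else sub))   # match or substitute
        prev = cur
    return prev[n]

print(edit_cost("kitten", "sitting"))  # prints 3
```

The double loop performs O(|x||y|) work. Only two rows are kept here, so this variant returns the cost alone; recovering the edit script itself requires the full table.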
Embedded diagonally implicit Runge-Kutta algorithms on parallel computers
NASA Astrophysics Data System (ADS)
van der Houwen, P. J.; Sommeijer, B. P.; Couzy, W.
1992-01-01
This paper investigates diagonally implicit Runge-Kutta methods in which the implicit relations can be solved in parallel and are singly diagonal-implicit on each processor. The algorithms are based on diagonally implicit iteration of fully implicit Runge-Kutta methods of high order. The iteration scheme is chosen in such a way that the resulting algorithm is A(α)-stable or L(α)-stable with α equal or very close to π/2. In this way, highly stable, singly diagonal-implicit Runge-Kutta methods of orders up to 10 can be constructed. Because of the iterative nature of the methods, embedded formulas of lower orders are automatically available, allowing a strategy for step and order variation.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
NASA Technical Reports Server (NTRS)
Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris
2000-01-01
Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and 0s, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency
Multi-jagged: A scalable parallel spatial partitioning algorithm
Deveci, Mehmet; Rajamanickam, Sivasankaran; Devine, Karen D.; ...
2015-03-18
Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Lastly, our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.
Multi-jagged: A scalable parallel spatial partitioning algorithm
Deveci, Mehmet; Rajamanickam, Sivasankaran; Devine, Karen D.; Catalyurek, Umit V.
2015-03-18
Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Lastly, our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.
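The idea of recursive multi-section with a given number of parts in each dimension can be sketched serially as below. This is an illustration of the partitioning scheme only, not the parallel implementation described in the abstract and not Zoltan's API: the sort-based cut computation, the point representation as coordinate tuples, and the function names are assumptions.

```python
import random

def split_equal(points, k, dim):
    # Sort along one coordinate and cut into k contiguous chunks whose
    # sizes differ by at most one point.
    pts = sorted(points, key=lambda p: p[dim])
    n = len(pts)
    bounds = [(i * n) // k for i in range(k + 1)]
    return [pts[bounds[i]:bounds[i + 1]] for i in range(k)]

def multi_jagged(points, parts_per_dim):
    # Multi-section along dimension 0, then along dimension 1 inside each
    # slab, and so on. Cut positions are chosen per slab, so the cut lines
    # of neighbouring slabs need not line up (hence "jagged").
    parts = [list(points)]
    for dim, k in enumerate(parts_per_dim):
        parts = [chunk for part in parts for chunk in split_equal(part, k, dim)]
    return parts

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(100)]
parts = multi_jagged(points, (4, 2))   # 4 slabs in x, each cut into 2 in y
```

With one pass per dimension, 8 parts are produced here in two levels of cuts, whereas recursive bisection would need three levels. That reduction in the number of levels is what the abstract credits for the lower data movement.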
An Intrinsic Algorithm for Parallel Poisson Disk Sampling on Arbitrary Surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-03-08
Poisson disk sampling plays an important role in a variety of visual computing applications, owing to its useful statistical distribution properties and the absence of aliasing artifacts. While many effective techniques have been proposed to generate Poisson disk distributions in Euclidean space, relatively little work has been reported on the surface counterpart. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. We propose a new technique for parallelizing dart throwing. Unlike conventional approaches, which explicitly partition the spatial domain to generate the samples in parallel, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. It is worth noting that our algorithm is accurate, as the generated Poisson disks are uniformly and randomly distributed without bias. Our method is intrinsic in that all the computations are based on the intrinsic metric and are independent of the embedding space. This intrinsic feature allows us to generate Poisson disk distributions on arbitrary surfaces. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
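The priority-based conflict resolution can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes points in the Euclidean plane rather than an intrinsic surface metric, uses a brute-force conflict search, and emulates the parallel rounds serially. The function name `priority_dart_throwing` is hypothetical.

```python
import random
from math import dist

UNDECIDED, ACCEPTED, REJECTED = 0, 1, 2


def priority_dart_throwing(candidates, radius, seed=0):
    """Return indices of accepted candidates; accepted points are >= radius apart."""
    n = len(candidates)
    # Unique random priorities: a random permutation of 0..n-1.
    priority = list(range(n))
    random.Random(seed).shuffle(priority)
    # conflicts[i] lists every candidate closer than `radius` to candidate i.
    conflicts = [
        [j for j in range(n) if j != i and dist(candidates[i], candidates[j]) < radius]
        for i in range(n)
    ]
    state = [UNDECIDED] * n
    while UNDECIDED in state:
        # Each candidate's test reads only the previous state, so a real
        # implementation could evaluate all candidates of a round in parallel.
        winners = [
            i for i in range(n)
            if state[i] == UNDECIDED
            and all(state[j] == REJECTED or priority[j] < priority[i]
                    for j in conflicts[i])
        ]
        for i in winners:
            state[i] = ACCEPTED
        # Undecided neighbours of a winner can no longer be accepted.
        for i in winners:
            for j in conflicts[i]:
                if state[j] == UNDECIDED:
                    state[j] = REJECTED
    return [i for i in range(n) if state[i] == ACCEPTED]
```

The undecided candidate with the highest priority always wins its round, so the loop terminates, and two winners of the same round can never conflict because only one of them holds the higher priority.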
NASA Astrophysics Data System (ADS)
Hou, Zhen-Long; Wei, Xiao-Hui; Huang, Da-Nian; Sun, Xu
2015-09-01
We apply reweighted inversion focusing to full tensor gravity gradiometry data using message-passing interface (MPI) and compute unified device architecture (CUDA) parallel computing algorithms, and then combine MPI with CUDA to formulate a hybrid algorithm. Parallel computing performance metrics are introduced to analyze and compare the performance of the algorithms. We summarize the rules for the performance evaluation of parallel algorithms. We use model and real data from the Vinton salt dome to test the algorithms. We find a good match between model and real density data, and verify the high efficiency and feasibility of parallel computing algorithms in the inversion of full tensor gravity gradiometry data.
Efficient parallel algorithms for elastic plastic finite element analysis
NASA Astrophysics Data System (ADS)
Ding, K. Z.; Qin, Q.-H.; Cardew-Hall, M.; Kalyanasundaram, S.
2008-03-01
This paper presents our new development of parallel finite element algorithms for elastic plastic problems. The proposed method is based on dividing the original structure under consideration into a number of substructures, which are treated as isolated finite element models via the interface conditions. Throughout the analysis, each processor stores only the information relevant to its substructure and generates the local stiffness matrix. A parallel substructure-oriented preconditioned conjugate gradient method, combined with MR smoothing and a diagonal storage scheme, is employed to solve the linear systems of equations. After the displacements of the problem under consideration have been obtained, a substepping scheme is used to integrate the elastic plastic stress strain relations. The procedure outlined controls the error of the computed stress by choosing each substep size automatically according to a prescribed tolerance. The combination of these algorithms shows a good speedup as the number of processors increases, and makes possible the effective solution of 3D elastic plastic problems whose size is much too large for a single workstation.
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms
Arampatzis, Giorgos; Katsoulakis, Markos A.; Plechac, Petr; Taufer, Michela; Xu, Lifan
2012-10-01
We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decomposition corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel Fractional step kinetic Monte Carlo algorithms by employing the Trotter Theorem and its randomized variants; these schemes, (a) are partially asynchronous on each fractional step time-window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example, in Ising-type systems and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss work load balancing between processors and propose a re
Tsai, Ming-Chi; Tsui, Fu-Chiang; Wagner, Michael M
2007-10-11
Performing fast data analysis to detect disease outbreaks plays a critical role in real-time biosurveillance. In this paper, we describe and evaluate an Algorithm Distribution Manager Service (ADMS) based on grid technologies, which dynamically partitions and distributes detection algorithms across multiple computers. We compared the execution time of the analysis on a single computer and on a grid network (3 computing nodes), with and without dynamic algorithm distribution. We found that algorithms with long runtimes completed approximately three times earlier in the distributed environment than on a single computer, while algorithms with short runtimes performed worse in the distributed environment. A dynamic algorithm distribution approach also performed better than a static algorithm distribution approach. This pilot study shows great potential to reduce lengthy analysis time through dynamic algorithm partitioning and parallel processing, and provides the opportunity of distributing algorithms from a client to remote computers in a grid network.
The high performance parallel algorithm for Unified Gas-Kinetic Scheme
NASA Astrophysics Data System (ADS)
Li, Shiyi; Li, Qibing; Fu, Song; Xu, Jinxiu
2016-11-01
A high performance parallel algorithm for UGKS is developed to simulate three-dimensional internal and external flows on arbitrary grid systems. The physical domain and velocity domain are divided into different blocks and distributed according to a two-dimensional Cartesian topology, with intra-communicators in the physical domain for data exchange and other intra-communicators in the velocity domain for sum reduction to moment integrals. Numerical results of three-dimensional cavity flow and flow past a sphere agree well with the results from existing studies and validate the applicability of the algorithm. The scalability of the algorithm is tested on both small (1-16) and large (729-5832) numbers of processors. The measured speed-up ratio is near linear and thus the efficiency is around 1, which reveals the good scalability of the present algorithm.
Massively parallel algorithms for trace-driven cache simulations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Greenberg, Albert G.; Lubachevsky, Boris D.
1991-01-01
Trace driven cache simulation is central to computer design. A trace is a very long sequence of reference lines from main memory. At the t-th instant, reference x_t is hashed into a set of cache locations, the contents of which are then compared with x_t. If at the t-th instant x_t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x_t present for the (t+1)-st instant. The problem of parallel simulation of a subtrace of N references directed to a C line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least Recently Used (LRU) policy, which regardless of the set size C runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. Timings are presented of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference based line replacement policies is considered, which includes LRU as well as the Least Frequently Used and Random replacement policies. A simulation method is presented for any such policy that on any trace of length N directed to a C line set runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
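As a point of reference for the miss-detection problem described above, the following sketch shows a plain sequential LRU simulation of one cache set, together with an equivalent per-reference test based on the LRU stack property (a reference hits exactly when fewer than C distinct lines were referenced since the previous reference to the same line). Both functions are illustrative, with hypothetical names; neither is the EREW algorithm of the abstract, and the second is quadratic in the worst case.

```python
from collections import OrderedDict


def lru_misses(trace, set_size):
    """Sequential simulation: True at position t means reference t is a miss."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = []
    for x in trace:
        if x in cache:
            cache.move_to_end(x)  # x becomes the most recently used line
            misses.append(False)
        else:
            misses.append(True)
            if len(cache) == set_size:
                cache.popitem(last=False)  # evict the least recently used line
            cache[x] = None
    return misses


def lru_misses_by_distance(trace, set_size):
    """Same answer, but each reference is classified independently of the others."""
    last_seen = {}
    previous = []  # previous[t] = index of the prior reference to trace[t], or -1
    for t, x in enumerate(trace):
        previous.append(last_seen.get(x, -1))
        last_seen[x] = t
    misses = []
    for t, p in enumerate(previous):
        if p < 0:
            misses.append(True)  # first reference to this line
        else:
            distinct_between = len(set(trace[p + 1:t]))
            misses.append(distinct_between >= set_size)
    return misses
```

The second form shows why the problem admits parallel solutions: once the previous-reference index is known, the test for each reference no longer depends on the simulated cache state.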
NASA Astrophysics Data System (ADS)
Niknam, Mehdi; Thulasiraman, Parimala; Camorlinga, Sergio
2010-11-01
Connected component labelling is an essential step in image processing. We provide a parallel version of Suzuki's sequential connected component algorithm in order to speed up the labelling process. We also modify the algorithm to enable the labelling of gray-scale images. Because of the data dependencies in the algorithm, we used a pipeline-like method to exploit parallelism. The parallel algorithm achieved a speedup of 2.5 for an image size of 256 × 256 pixels using 4 processing threads.
Parallel algorithm for computing points on a computation front hyperplane
NASA Astrophysics Data System (ADS)
Krasnov, M. M.
2015-01-01
A parallel algorithm for computing points on a computation front hyperplane is described. This task arises in the computation of a quantity defined on a multidimensional rectangular domain. Three-dimensional domains are usually discussed, but the material is given in the general form, in which the number of dimensions is at least two. When the values of a quantity at different points are internally independent (which is frequently the case), the corresponding computations are independent as well and can be performed in parallel. However, if there are internal dependences (as, for example, in the Gauss-Seidel method for systems of linear equations), then the order of scanning points of the domain is an important issue. A conventional approach in this case is to form a computation front hyperplane (a usual plane in the three-dimensional case and a line in the two-dimensional case) that moves linearly across the domain at a certain angle. At every step in the course of motion of this hyperplane, its intersection points with the domain can be treated independently and, hence, in parallel, but the steps themselves are executed sequentially. At different steps, the intersection of the hyperplane with the entire domain can have a rather complex geometry, and the search for all points of the domain lying on the hyperplane at a given step is a nontrivial problem. This problem (i.e., the computation of the coordinates of points lying in the intersection of the domain with the hyperplane at a given step in the course of hyperplane motion) is addressed below. The computations over the points of the hyperplane can be executed in parallel.
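A minimal sketch of the front enumeration follows, under a simplifying assumption that is not stated in the abstract: the front has normal (1, ..., 1), so step s contains exactly the grid points whose indices sum to s. This matches a Gauss-Seidel-style dependence on the lower-index neighbour in each dimension. The function name `front_points` is hypothetical.

```python
def front_points(shape, step):
    """List all integer points p with 0 <= p[d] < shape[d] and sum(p) == step."""

    def rec(dim, remaining):
        if dim == len(shape) - 1:
            # The last coordinate is fully determined by the remaining sum.
            if 0 <= remaining < shape[dim]:
                yield (remaining,)
            return
        # Largest sum that the later dimensions can still contribute.
        tail_max = sum(n - 1 for n in shape[dim + 1:])
        lo = max(0, remaining - tail_max)
        hi = min(shape[dim] - 1, remaining)
        for c in range(lo, hi + 1):
            for rest in rec(dim + 1, remaining - c):
                yield (c,) + rest

    return list(rec(0, step))


def sweep(shape):
    """Yield the fronts in order; points within one front are mutually independent."""
    last_step = sum(n - 1 for n in shape)
    for step in range(last_step + 1):
        yield front_points(shape, step)
```

The bounds `lo` and `hi` prune coordinates that cannot lead to a point inside the box, so each front is produced without scanning the whole domain. The fronts returned by `sweep` must be processed one after another, while the points of a single front could be handed to parallel workers.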
Katouda, Michio; Nakajima, Takahito
2013-12-10
A new algorithm for massively parallel calculations of electron correlation energy of large molecules based on the resolution of identity second-order Møller-Plesset perturbation (RI-MP2) technique is developed and implemented into the quantum chemistry software NTChem. In this algorithm, a Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) hybrid parallel programming model is applied to attain efficient parallel performance on massively parallel supercomputers. An in-core storage scheme of intermediate data of three-center electron repulsion integrals utilizing the distributed memory is developed to eliminate input/output (I/O) overhead. The parallel performance of the algorithm is tested on massively parallel supercomputers such as the K computer (using up to 45 992 central processing unit (CPU) cores) and a commodity Intel Xeon cluster (using up to 8192 CPU cores). The parallel RI-MP2/cc-pVTZ calculation of two-layer nanographene sheets (C150H30)2 (number of atomic orbitals is 9640) is performed using 8991 nodes and 71 288 CPU cores of the K computer.
Distributed sensor data compression algorithm
NASA Astrophysics Data System (ADS)
Ambrose, Barry; Lin, Freddie
2006-04-01
Theoretically it is possible for two sensors to reliably send data at rates smaller than the sum of the necessary data rates for sending the data independently, essentially taking advantage of the correlation of sensor readings to reduce the data rate. In 2001, Caltech researchers Michelle Effros and Qian Zhao developed new techniques for data compression code design for correlated sensor data, which were published in a paper at the 2001 Data Compression Conference (DCC 2001). These techniques take advantage of correlations between two or more closely positioned sensors in a distributed sensor network. Given two signals, X and Y, the X signal is sent using standard data compression. The goal is to design a partition tree for the Y signal. The Y signal is sent using a code based on the partition tree. At the receiving end, if ambiguity arises when using the partition tree to decode the Y signal, the X signal is used to resolve the ambiguity. We have extended this work to increase the efficiency of the code search algorithms. Our results have shown that development of a highly integrated sensor network protocol that takes advantage of a correlation in sensor readings can result in 20-30% sensor data transport cost savings. In contrast, the best possible compression using state-of-the-art compression techniques that did not take into account the correlation of the incoming data signals achieved only 9-10% compression at most. This work was sponsored by MDA, but has very widespread applicability to ad hoc sensor networks, hyperspectral imaging sensors and vehicle health monitoring sensors for space applications.
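The role of the X signal as side information can be shown with a toy example. It assumes integer readings that differ by at most one between the two sensors, and it replaces the partition tree of the abstract with a simple modulo-3 partition; the function names are hypothetical and the example is not the code design of the cited work.

```python
def encode_y(y, modulus=3):
    """Send only the partition cell of y (its residue), not y itself."""
    return y % modulus


def decode_y(code, x, modulus=3):
    """Resolve the ambiguity of `code` using the correlated reading x.

    Assumes |x - y| <= modulus // 2, so exactly one value near x lies in the cell.
    """
    half = modulus // 2
    for candidate in range(x - half, x + half + 1):
        if candidate % modulus == code:
            return candidate
    raise ValueError("no candidate matches; correlation assumption violated")
```

For example, with x = 17 and y = 18 the encoder sends 18 % 3 = 0, and the decoder checks 16, 17, and 18 and returns 18. The Y sensor transmits one of three symbols regardless of the size of the reading, which is the kind of saving that correlation between sensors makes possible.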
Parallel-vector algorithms for particle simulations on shared-memory multiprocessors
Nishiura, Daisuke; Sakaguchi, Hide
2011-03-01
Over the last few decades, the computational demands of massive particle-based simulations for both scientific and industrial purposes have been continuously increasing. Hence, considerable efforts are being made to develop parallel computing techniques on various platforms. In such simulations, particles move freely within a given space, and so on a distributed-memory system, load balancing, i.e., assigning an equal number of particles to each processor, is not guaranteed. In contrast, shared-memory systems achieve better load balancing for particle models, but suffer from the intrinsic drawback of memory access competition, particularly during (1) pairing of contact candidates from among neighboring particles and (2) force summation for each particle. Here, novel algorithms are proposed to overcome these two problems. For the first problem, the key is a pre-conditioning process during which particle labels are sorted by the label of the cell in the domain to which the particles belong. Then, a list of contact candidates is constructed by pairing the sorted particle labels. For the second problem, a table comprising the list indexes of the contact candidate pairs is created and used to sum the contact forces acting on each particle for all contacts according to Newton's third law. With just these methods, memory access competition is avoided without additional redundant procedures. The parallel efficiency and compatibility of these two algorithms were evaluated in discrete element method (DEM) simulations on four types of shared-memory parallel computers: a multicore multiprocessor computer, scalar supercomputer, vector supercomputer, and graphics processing unit. The computational efficiency of a DEM code was found to be drastically improved with our algorithms on all but the scalar supercomputer. Thus, the developed parallel algorithms are useful on shared-memory parallel computers with sufficient memory bandwidth.
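The two steps described above can be sketched serially. The sketch assumes 2D positions, a uniform cell grid, and scalar pair forces for brevity; `build_pairs` and `sum_forces` are hypothetical names, and the data layout is only one plausible reading of the abstract.

```python
from collections import defaultdict
from math import floor


def build_pairs(positions, cell_size):
    """Sort particle labels by cell label, then pair particles in neighbouring cells."""
    label = [(floor(x / cell_size), floor(y / cell_size)) for x, y in positions]
    order = sorted(range(len(positions)), key=lambda i: label[i])
    members = defaultdict(list)
    for i in order:
        members[label[i]].append(i)
    pairs = []
    for i in order:
        cx, cy = label[i]
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in members.get((cx + dx, cy + dy), ()):
                    if i < j:  # store every candidate pair exactly once
                        pairs.append((i, j))
    return pairs


def sum_forces(n, pairs, pair_force):
    """Sum pair forces per particle using a table of pair indexes.

    pair_force[k] acts on the first particle of pairs[k]; by Newton's third
    law the second particle receives the opposite force.
    """
    table = [[] for _ in range(n)]
    for k, (i, j) in enumerate(pairs):
        table[i].append((k, 1.0))
        table[j].append((k, -1.0))
    # Each particle reads shared pair forces but writes only its own entry,
    # so parallel workers would not compete for the same memory location.
    return [sum(sign * pair_force[k] for k, sign in table[i]) for i in range(n)]
```

Computing each pair force once and gathering it through the table avoids concurrent writes to the same particle, which is the memory access competition that the abstract refers to.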
The design and implementation of MPI master-slave parallel genetic algorithm
NASA Astrophysics Data System (ADS)
Liu, Shuping; Cheng, Yanliu
2013-03-01
In this paper, an MPI master-slave parallel genetic algorithm is implemented by analyzing the basic genetic algorithm and parallel MPI programming, and by building a Linux cluster. The algorithm is tested on a maximum value problem (Rosenbrock's function). From an analysis of the test data, we identify the factors that influence the master-slave parallel genetic algorithm. The experimental data show that a balanced hardware configuration and software design optimization can improve system performance in a complex computing environment using master-slave parallel genetic algorithms.
Partitioning problems in parallel, pipelined and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, S.
1985-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
Partitioning problems in parallel, pipelined, and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, Shahid H.
1988-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A sum-bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single-tree structured parallel programs. In addition, the problem of partitioning chain-structured parallel programs across chain-connected systems is solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple-computer architectures for a wide range of problems of practical interest.
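A simplified instance of chain partitioning can be sketched as follows. It assumes a single chain of modules with known execution weights, contiguous assignment to identical processors, no communication costs, and no more processors than modules. It uses a straightforward dynamic program rather than the sum-bottleneck path algorithm of the abstract, and the name `chain_partition` is hypothetical.

```python
def chain_partition(weights, processors):
    """Split a chain into `processors` contiguous blocks minimizing the heaviest block.

    Returns (bottleneck, blocks), where blocks are half-open index ranges.
    """
    n = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    inf = float("inf")
    # best[p][i]: smallest bottleneck when the first i modules use p processors.
    best = [[inf] * (n + 1) for _ in range(processors + 1)]
    cut = [[0] * (n + 1) for _ in range(processors + 1)]
    best[0][0] = 0
    for p in range(1, processors + 1):
        for i in range(1, n + 1):
            for j in range(p - 1, i):
                # The last processor takes modules j..i-1.
                cost = max(best[p - 1][j], prefix[i] - prefix[j])
                if cost < best[p][i]:
                    best[p][i] = cost
                    cut[p][i] = j
    # Walk the recorded cuts backwards to recover the blocks.
    blocks = []
    i = n
    for p in range(processors, 0, -1):
        j = cut[p][i]
        blocks.append((j, i))
        i = j
    blocks.reverse()
    return best[processors][n], blocks
```

For weights [4, 2, 7, 1, 3, 5] on three processors the sketch returns a bottleneck of 8 with blocks (0, 2), (2, 4), (4, 6), that is, loads of 6, 8, and 8.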
Performance of a parallel algorithm for standard cell placement on the Intel Hypercube
NASA Technical Reports Server (NTRS)
Jones, Mark; Banerjee, Prithviraj
1987-01-01
A parallel simulated annealing algorithm for standard cell placement on the Intel Hypercube is presented. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than uniprocessor simulated annealing algorithms.
NASA Astrophysics Data System (ADS)
Di Pierro, Massimo
2001-11-01
We present a set of programming tools (classes and functions written in C++ and based on Message Passing Interface) for fast development of generic parallel (and non-parallel) lattice simulations. They are collectively called MDP 1.2. These programming tools include classes and algorithms for matrices, random number generators, distributed lattices (with arbitrary topology), fields and parallel iterations. No previous knowledge of MPI is required in order to use them. Some applications in electromagnetism, electronics, condensed matter and lattice QCD are presented.
Parallel Molecular Distributed Detection With Brownian Motion.
Rogers, Uri; Koh, Min-Sung
2016-12-01
This paper explores the in vivo distributed detection of an undesired biological agent's (BA's) biomarkers by a group of biologically sized nanomachines in an aqueous medium under drift. The term distributed indicates that the system information relative to the BA's presence is dispersed across the collection of nanomachines, where each nanomachine possesses limited communication, computation, and movement capabilities. Using Brownian motion with drift, a probabilistic detection and optimal data fusion framework, coined molecular distributed detection, is introduced that combines theory from both molecular communication and distributed detection. Using the optimal data fusion framework as a guide, simulation indicates that a sub-optimal fusion method exists, allowing for a significant reduction in implementation complexity while retaining BA detection accuracy.
Parallel Molecular Distributed Detection with Brownian Motion.
Rogers, Uri; Koh, Min-Sung
2016-12-05
This paper explores the in vivo distributed detection of an undesired biological agent's (BA's) biomarkers by a group of biologically sized nanomachines in an aqueous medium under drift. The term distributed indicates that the system information relative to the BA's presence is dispersed across the collection of nanomachines, where each nanomachine possesses limited communication, computation, and movement capabilities. Using Brownian motion with drift, a probabilistic detection and optimal data fusion framework, coined molecular distributed detection, is introduced that combines theory from both molecular communication and distributed detection. Using the optimal data fusion framework as a guide, simulation indicates that a suboptimal fusion method exists, allowing for a significant reduction in implementation complexity while retaining BA detection accuracy.
Adaptive link selection algorithms for distributed estimation
NASA Astrophysics Data System (ADS)
Xu, Songcen; de Lamare, Rodrigo C.; Poor, H. Vincent
2015-12-01
This paper presents adaptive link selection algorithms for distributed estimation and considers their application to wireless sensor networks and smart grids. In particular, exhaustive search-based least mean squares (LMS) / recursive least squares (RLS) link selection algorithms and sparsity-inspired LMS / RLS link selection algorithms that can exploit the topology of networks with poor-quality links are considered. The proposed link selection algorithms are then analyzed in terms of their stability, steady-state, and tracking performance and computational complexity. In comparison with the existing centralized or distributed estimation strategies, the key features of the proposed algorithms are as follows: (1) more accurate estimates and faster convergence speed can be obtained and (2) the network is equipped with the ability of link selection that can circumvent link failures and improve the estimation performance. The performance of the proposed algorithms for distributed estimation is illustrated via simulations in applications of wireless sensor networks and smart grids.
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Brouwer, Randall Jay
1991-01-01
The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Parallel multiphysics algorithms and software for computational nuclear engineering
NASA Astrophysics Data System (ADS)
Gaston, D.; Hansen, G.; Kadioglu, S.; Knoll, D. A.; Newman, C.; Park, H.; Permann, C.; Taitano, W.
2009-07-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis, where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation, where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques, we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provides the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications will be presented: PRONGHORN, our multiphysics gas cooled reactor simulation tool, and BISON, our multiphysics, multiscale fuel performance simulation tool.
NASA Technical Reports Server (NTRS)
Eidson, T. M.; Erlebacher, G.
1994-01-01
While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solution is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm pass the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm is shown that directly transforms a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
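The independence of the right-hand sides can be seen in a short sketch of Gaussian elimination for a tridiagonal system (the Thomas algorithm). For brevity the sketch treats a non-periodic system, whereas the abstract concerns the periodic case; it also assumes that no pivoting is needed, as for diagonally dominant matrices. The name `thomas_multi_rhs` and the array conventions are hypothetical.

```python
def thomas_multi_rhs(lower, diag, upper, rhs_list):
    """Solve A x = b for every b in rhs_list, with A tridiagonal.

    lower[i] = A[i][i-1] (lower[0] is ignored), diag[i] = A[i][i],
    upper[i] = A[i][i+1] (upper[-1] is ignored).
    """
    n = len(diag)
    # Elimination factors depend only on the matrix, so compute them once.
    denom = [0.0] * n
    c = [0.0] * n
    denom[0] = diag[0]
    c[0] = upper[0] / denom[0]
    for i in range(1, n):
        denom[i] = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom[i] if i < n - 1 else 0.0
    solutions = []
    # Each pass of this loop touches only its own right-hand side,
    # so the passes could run on different processors.
    for rhs in rhs_list:
        d = [0.0] * n
        d[0] = rhs[0] / denom[0]
        for i in range(1, n):  # forward elimination
            d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom[i]
        x = [0.0] * n
        x[-1] = d[-1]
        for i in range(n - 2, -1, -1):  # back substitution
            x[i] = d[i] - c[i] * x[i + 1]
        solutions.append(x)
    return solutions
```

In terms of the abstract, the transpose strategy corresponds to giving each processor a subset of `rhs_list` in full, while the distributed solver strategy splits the index range `i` across processors and passes the recurrence values along the chain.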
Some parallel algorithms on the four processor Cray X-MP4 supercomputer
Kincaid, D.R.; Oppe, T.C.
1988-05-01
Three numerical studies of parallel algorithms on a four processor Cray X-MP4 supercomputer are presented. These numerical experiments involve the following: a parallel version of ITPACKV 2C, a package for solving large sparse linear systems, a parallel version of the conjugate gradient method with line Jacobi preconditioning, and several parallel algorithms for computing the LU-factorization of dense matrices. 27 refs., 4 tabs.
A scalable parallel black oil simulator on distributed memory parallel computers
NASA Astrophysics Data System (ADS)
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Models and Measurements of Parallelism for a Distributed Computer System.
1982-01-01
that parallel execution of the processes comprising an application program will defray the overhead costs of distributed computing. This...of Different Approaches to Distributed Computing", Proceedings of the 1st International Conference on Distributed Computer Systems, Huntsville, AL...Oct. 1-5, 1979), pp. 222-232. [20] Liskov, B., "Primitives for Distributed Computing", Proceedings of the 7th Symposium on Operating System
NASA Astrophysics Data System (ADS)
Slattery, Stuart R.
2016-02-01
In this paper we analyze and extend mesh-free algorithms for three-dimensional data transfer problems in partitioned multiphysics simulations. We first provide a direct comparison between a mesh-based weighted residual method using the common-refinement scheme and two mesh-free algorithms leveraging compactly supported radial basis functions: one using a spline interpolation and one using a moving least square reconstruction. Through the comparison we assess both the conservation and accuracy of the data transfer obtained from each of the methods. We do so for a varying set of geometries with and without curvature and sharp features and for functions with and without smoothness and with varying gradients. Our results show that the mesh-based and mesh-free algorithms are complementary with cases where each was demonstrated to perform better than the other. We then focus on the mesh-free methods by developing a set of algorithms to parallelize them based on sparse linear algebra techniques. This includes a discussion of fast parallel radius searching in point clouds and restructuring the interpolation algorithms to leverage data structures and linear algebra services designed for large distributed computing environments. The scalability of our new algorithms is demonstrated on a leadership class computing facility using a set of basic scaling studies. These scaling studies show that for problems with reasonable load balance, our new algorithms for both spline interpolation and moving least square reconstruction demonstrate both strong and weak scalability using more than 100,000 MPI processes with billions of degrees of freedom in the data transfer operation.
Slattery, Stuart R.
2015-12-02
In this study we analyze and extend mesh-free algorithms for three-dimensional data transfer problems in partitioned multiphysics simulations. We first provide a direct comparison between a mesh-based weighted residual method using the common-refinement scheme and two mesh-free algorithms leveraging compactly supported radial basis functions: one using a spline interpolation and one using a moving least square reconstruction. Through the comparison we assess both the conservation and accuracy of the data transfer obtained from each of the methods. We do so for a varying set of geometries with and without curvature and sharp features and for functions with and without smoothness and with varying gradients. Our results show that the mesh-based and mesh-free algorithms are complementary with cases where each was demonstrated to perform better than the other. We then focus on the mesh-free methods by developing a set of algorithms to parallelize them based on sparse linear algebra techniques. This includes a discussion of fast parallel radius searching in point clouds and restructuring the interpolation algorithms to leverage data structures and linear algebra services designed for large distributed computing environments. The scalability of our new algorithms is demonstrated on a leadership class computing facility using a set of basic scaling studies. Finally, these scaling studies show that for problems with reasonable load balance, our new algorithms for both spline interpolation and moving least square reconstruction demonstrate both strong and weak scalability using more than 100,000 MPI processes with billions of degrees of freedom in the data transfer operation.
Para-GMRF: parallel algorithm for anomaly detection of hyperspectral image
NASA Astrophysics Data System (ADS)
Dong, Chao; Zhao, Huijie; Li, Na; Wang, Wei
2007-12-01
The hyperspectral imager is capable of simultaneously collecting hundreds of images of the observed area, each corresponding to a different wavelength channel, which makes it possible to discriminate man-made objects from the natural background. However, the price paid for this wealth of information is an enormous amount of data, usually hundreds of gigabytes per day. Turning this huge volume of data into useful information and knowledge in real time is critical for geoscientists. In this paper, the proposed parallel Gaussian-Markov random field (Para-GMRF) anomaly detection algorithm is an attempt to apply parallel computing technology to this problem. Based on the locality of the GMRF algorithm, we partition the 3-D hyperspectral image cube in the spatial domain and distribute data blocks to multiple computers for concurrent detection. Meanwhile, to achieve load balance, a work pool scheduler is designed for task assignment. The Para-GMRF algorithm is organized in a master-slave architecture, coded in the C programming language using the Message Passing Interface (MPI) library, and tested on a Beowulf cluster. Experimental results show that the Para-GMRF algorithm successfully meets the challenge and can be used in time-sensitive areas, such as environmental monitoring and battlefield reconnaissance.
Performance of a parallel algorithm for standard cell placement on the Intel Hypercube
NASA Technical Reports Server (NTRS)
Jones, Mark; Banerjee, Prithviraj
1987-01-01
A parallel simulated annealing algorithm for standard cell placement that is targeted to run on the Intel Hypercube is presented. A tree broadcasting strategy that is used extensively in our algorithm for updating cell locations in the parallel environment is presented. Studies on the performance of our algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms.
Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges
NASA Technical Reports Server (NTRS)
Djomehri, Mohammad Jahed; Biswas, R.; VanderWijngaart, R.; Yarrow, M.
2000-01-01
This paper describes several results of parallel and distributed computing using a large scale production flow solver program. A coarse grained parallelization based on clustering of discretization grids combined with partitioning of large grids for load balancing is presented. An assessment is given of its performance on distributed and distributed-shared memory platforms using large scale scientific problems. An experiment with this solver, adapted to a Wide Area Network execution environment, is presented. We also give a comparative performance assessment of computation and communication times on both the tightly and loosely coupled machines.
Parallelization of the Wolff single-cluster algorithm
NASA Astrophysics Data System (ADS)
Kaupužs, J.; Rimšāns, J.; Melnik, R. V. N.
2010-02-01
A parallel [open multiprocessing (OpenMP)] implementation of the Wolff single-cluster algorithm has been developed and tested for the three-dimensional (3D) Ising model. The developed procedure is generalizable to other lattice spin models and its effectiveness depends on the specific application at hand. The applicability of the developed methodology is discussed in the context of the applications, where a sophisticated shuffling scheme is used to generate pseudorandom numbers of high quality, and an iterative method is applied to find the critical temperature of the 3D Ising model with great accuracy. For the lattice with linear size L=1024, we have reached a speedup of about 1.79 times on two processors and about 2.67 times on four processors, as compared to the serial code. According to our estimation, a speedup of about three times on four processors is reachable for the O(n) models with n≥2. Furthermore, the application of the developed OpenMP code allows us to simulate larger lattices due to the greater operative (shared) memory available.
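The speedups quoted in this abstract can be turned into parallel efficiency and an experimentally determined serial fraction. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the serial fraction uses the standard Karp-Flatt metric, which the abstract itself does not mention.

```python
def efficiency(speedup, processors):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup / processors


def karp_flatt(speedup, processors):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


# Speedups reported in the abstract for L=1024 (processors -> speedup).
reported = {2: 1.79, 4: 2.67}

for p, s in reported.items():
    print(f"p={p}: efficiency={efficiency(s, p):.3f}, "
          f"serial fraction={karp_flatt(s, p):.3f}")
```

Under these definitions the efficiency falls from roughly 0.9 on two processors to roughly 0.67 on four, which is consistent with the abstract's remark that effectiveness depends on the application.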
Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu
1995-01-01
As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
A Fast parallel tridiagonal algorithm for a class of CFD applications
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Sun, Xian-He
1996-01-01
The parallel diagonal dominant (PDD) algorithm is an efficient tridiagonal solver. This paper presents for study a variation of the PDD algorithm, the reduced PDD algorithm. The new algorithm maintains the minimum communication provided by the PDD algorithm, but has a reduced operation count. The PDD algorithm also has a smaller operation count than the conventional sequential algorithm for many applications. Accuracy analysis is provided for the reduced PDD algorithm for symmetric Toeplitz tridiagonal (STT) systems. Implementation results on Langley's Intel Paragon and IBM SP2 show that both the PDD and reduced PDD algorithms are efficient and scalable.
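For context, the "conventional sequential algorithm" for tridiagonal systems is usually the Thomas algorithm (forward elimination followed by back substitution). The sketch below shows that sequential baseline only; it is not the PDD or reduced PDD algorithm, and the function name and argument layout are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the sequential Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are unused. No pivoting is performed, so the
    matrix is assumed to be diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Symmetric Toeplitz tridiagonal example: 4 on the diagonal, 1 off it.
solution = thomas_solve([0, 1, 1, 1], [4, 4, 4, 4], [1, 1, 1, 0], [6, 12, 18, 19])
print(solution)
```

The forward sweep carries a dependency from each row to the next, which is why this baseline does not parallelize directly and why partitioned methods such as PDD are of interest.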
Mesh Algorithms for PDE with Sieve I: Mesh Distribution
Knepley, Matthew G.; Karpeev, Dmitry A.
2009-01-01
We have developed a new programming framework, called Sieve, to support parallel numerical partial differential equation(s) (PDE) algorithms operating over distributed meshes. We have also developed a reference implementation of Sieve in C++ as a library of generic algorithms operating on distributed containers conforming to the Sieve interface. Sieve makes instances of the incidence relation, or arrows, the conceptual first-class objects represented in the containers. Further, generic algorithms acting on this arrow container are systematically used to provide natural geometric operations on the topology and also, through duality, on the data. Finally, coverings and duality are used to encode not only individual meshes, but all types of hierarchies underlying PDE data structures, including multigrid and mesh partitions. In order to demonstrate the usefulness of the framework, we show how the mesh partition data can be represented and manipulated using the same fundamental mechanisms used to represent meshes. We present the complete description of an algorithm to encode a mesh partition and then distribute a mesh, which is independent of the mesh dimension, element shape, or embedding. Moreover, data associated with the mesh can be similarly distributed with exactly the same algorithm. The use of a high level of abstraction within the Sieve leads to several benefits in terms of code reuse, simplicity, and extensibility. We discuss these benefits and compare our approach to other existing mesh libraries.
NASA Astrophysics Data System (ADS)
Boyko, Oleksiy; Zheleznyak, Mark
2015-04-01
The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI (Todini et al., 1996-2014) is developed and implemented in Ukraine. A parallel version of the code has been developed recently for use on multiprocessor systems: multicore/multiprocessor PCs and clusters. The algorithm is based on a binary-tree decomposition of the watershed that balances the amount of computation across all processors/cores. The Message Passing Interface (MPI) protocol is used as the parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated in case studies of flood prediction for the mountain watersheds of the Ukrainian Carpathian regions. The modeling results are compared with predictions based on lumped-parameter models.
The openGL visualization of the 2D parallel FDTD algorithm
NASA Astrophysics Data System (ADS)
Walendziuk, Wojciech
2005-02-01
This paper presents a way of visualization of a two-dimensional version of a parallel algorithm of the FDTD method. The visualization module was created on the basis of the OpenGL graphic standard with the use of the GLUT interface. In addition, the work includes the results of the efficiency of the parallel algorithm in the form of speedup charts.
NASA Technical Reports Server (NTRS)
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Parallel vision algorithms. Annual technical report No. 1, 1 October 1986-30 September 1987
Ibrahim, H.A.; Kender, J.R.; Brown, L.G.
1987-10-01
The objective of this project is to develop and implement, on highly parallel computers, vision algorithms that combine stereo, texture, and multi-resolution techniques for determining local surface orientation and depth. Such algorithms will immediately serve as front-ends for autonomous land vehicle navigation systems. During the first year of the project, efforts have concentrated on two fronts: first, developing and testing the parallel programming environment that will be used to develop, implement, and test the parallel vision algorithms; second, developing and testing multi-resolution stereo and texture algorithms. This report describes the status and progress on these two fronts. The authors describe first the programming environment developed and the mapping scheme that allows efficient use of the connection machine for pyramid (multi-resolution) algorithms. Second, they present algorithms and test results for multi-resolution stereo and texture algorithms. The initial results of efforts to integrate the stereo and texture algorithms are also presented.
Advanced algorithms for distributed fusion
NASA Astrophysics Data System (ADS)
Gelfand, A.; Smith, C.; Colony, M.; Bowman, C.; Pei, R.; Huynh, T.; Brown, C.
2008-03-01
The US Military has been undergoing a radical transition from a traditional "platform-centric" force to one capable of performing in a "Network-Centric" environment. This transformation will place all of the data needed to efficiently meet tactical and strategic goals at the warfighter's fingertips. With access to this information, the challenge of fusing data from across the battlespace into an operational picture for real-time Situational Awareness emerges. In such an environment, centralized fusion approaches will have limited application due to the constraints of real-time communications networks and computational resources. To overcome these limitations, we are developing a formalized architecture for fusion and track adjudication that allows the distribution of fusion processes over a dynamically created and managed information network. This network will support the incorporation and utilization of low level tracking information within the Army Distributed Common Ground System (DCGS-A) or Future Combat System (FCS). The framework is based on Bowman's Dual Node Network (DNN) architecture that utilizes a distributed network of interlaced fusion and track adjudication nodes to build and maintain a globally consistent picture across all assets.
High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn
2014-11-14
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operations. It involves a large set of differential and algebraic equations, which is computationally intensive and challenging to solve with single-processor dynamic simulation. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation, using Open Multi-processing (OpenMP) on a shared-memory platform and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performance in running parallel dynamic simulation is compared and demonstrated.
Speedup properties of phases in the execution profile of distributed parallel programs
Carlson, B.M.; Wagner, T.D.; Dowdy, L.W.; Worley, P.H.
1992-08-01
The execution profile of a distributed-memory parallel program specifies the number of busy processors as a function of time. Periods of homogeneous processor utilization are manifested in many execution profiles. These periods can usually be correlated with the algorithms implemented in the underlying parallel code. Three families of methods for smoothing execution profile data are presented. These approaches simplify the problem of detecting end points of periods of homogeneous utilization. These periods, called phases, are then examined in isolation, and their speedup characteristics are explored. A specific workload executed on an Intel iPSC/860 is used for validation of the techniques described.
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems
NASA Technical Reports Server (NTRS)
Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.
2000-01-01
The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many subdomains called blocks, and solve the governing equations over these blocks. The dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the changes in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
Postscript: Parallel Distributed Processing in Localist Models without Thresholds
ERIC Educational Resources Information Center
Plaut, David C.; McClelland, James L.
2010-01-01
The current authors reply to a response by Bowers on a comment by the current authors on the original article. Bowers (2010) mischaracterizes the goals of parallel distributed processing (PDP) research; explaining performance on cognitive tasks is the primary motivation. More important, his claim that localist models, such as the interactive…
NavP: Structured and Multithreaded Distributed Parallel Programming
NASA Technical Reports Server (NTRS)
Pan, Lei; Xu, Jingling
2006-01-01
This slide presentation reviews some of the issues around distributed parallel programming. It compares and contrasts two methods of programming: Single Program Multiple Data (SPMD) and Navigational Programming (NavP). It then reviews the distributed sequential computing (DSC) method and the methodology of NavP. Case studies are presented. It also reviews the work that is being done to enable the NavP system.
A simple parallel prefix algorithm for compact finite-difference schemes
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Joslin, Ronald D.
1993-01-01
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experimental results were measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for the compact scheme on high-performance computers.
Multi-Core Parallel Implementation of Data Filtering Algorithm for Multi-Beam Bathymetry Data
NASA Astrophysics Data System (ADS)
Liu, Tianyang; Xu, Weiming; Yin, Xiaodong; Zhao, Xiliang
In order to improve the processing speed of multi-beam bathymetry data, we propose a parallel filtering algorithm based on multi-thread technology. The algorithm consists of two parts. The first is the parallel data re-order step, in which the surveying area is divided into a regular grid and the discrete bathymetry data are assigned to grid cells in parallel. The second is the parallel filtering step, which divides the grid into blocks and executes the filtering process in each block in parallel. In the experiment, the speedup of the proposed algorithm reaches about 3.67 on an 8-core computer. The result shows that the method can improve computing efficiency significantly compared with the traditional algorithm.
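The two-part structure described in this abstract (re-order soundings into a regular grid, then filter blocks independently) can be sketched as follows. This is an illustrative assumption, not the authors' method: the abstract does not specify the filter, so a simple median-deviation test stands in for it, the binning is done serially for brevity, and all names and parameters (`cell_size`, `max_dev`, `workers`) are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def bin_soundings(soundings, cell_size):
    """Re-order step: assign each (x, y, depth) sounding to a grid cell."""
    cells = defaultdict(list)
    for x, y, depth in soundings:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y, depth))
    return cells


def filter_cell(points, max_dev):
    """Stand-in filter: drop depths far from the cell median."""
    med = median(p[2] for p in points)
    return [p for p in points if abs(p[2] - med) <= max_dev]


def parallel_filter(soundings, cell_size=10.0, max_dev=2.0, workers=4):
    """Filtering step: process every grid cell independently in a pool."""
    cells = bin_soundings(soundings, cell_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        kept = list(pool.map(lambda pts: filter_cell(pts, max_dev),
                             cells.values()))
    return [p for cell in kept for p in cell]


data = [(1, 1, 50.0), (2, 2, 50.5), (3, 3, 49.8), (4, 4, 80.0),
        (15, 1, 30.0), (16, 2, 30.2), (17, 3, 29.9)]
print(parallel_filter(data))
```

Because the cells share no state, the same decomposition maps onto processes or native threads. In CPython, threads illustrate the structure but will not by themselves speed up CPU-bound filtering.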
A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1994-01-01
A reflector antenna computer program based upon a simple discrete approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Measurements of parallel electron velocity distributions using whistler wave absorption
Thuecks, D. J.; Skiff, F.; Kletzing, C. A.
2012-08-15
We describe a diagnostic to measure the parallel electron velocity distribution in a magnetized plasma that is overdense ($\omega_{pe} > \omega_{ce}$). This technique utilizes resonant absorption of whistler waves by electrons with velocities parallel to a background magnetic field. The whistler waves were launched and received by a pair of dipole antennas immersed in a cylindrical discharge plasma at two positions along an axial background magnetic field. The whistler wave frequency was swept from somewhat below and up to the electron cyclotron frequency $\omega_{ce}$. As the frequency was swept, the wave was resonantly absorbed by the part of the electron phase space density which was Doppler shifted into resonance according to the relation $\omega - k_{\parallel} v_{\parallel} = \omega_{ce}$. The measured absorption is directly related to the reduced parallel electron distribution function integrated along the wave trajectory. The background theory and initial results from this diagnostic are presented here. Though this diagnostic is best suited to detect tail populations of the parallel electron distribution function, these first results show that this diagnostic is also rather successful in measuring the bulk plasma density and temperature both during the plasma discharge and into the afterglow.
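The resonance relation quoted in this abstract can be rearranged to show which parallel velocity is probed at a given wave frequency. The step below is plain algebra on that relation, with $k_{\parallel}$ the parallel wavenumber and $v_{\parallel}$ the parallel electron velocity; it is a reader's sketch, not a formula taken from the paper.

```latex
\[
\omega - k_{\parallel} v_{\parallel} = \omega_{ce}
\quad\Longrightarrow\quad
v_{\parallel} = \frac{\omega - \omega_{ce}}{k_{\parallel}}
\]
```

Since the sweep stays at or below $\omega_{ce}$, the numerator is non-positive, so for $k_{\parallel} > 0$ the resonant electrons travel opposite to the wave. Note that $k_{\parallel}$ itself depends on $\omega$ through the whistler dispersion relation, which the abstract does not give.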
Brown, C.
1990-04-11
This contract developed and disseminated papers, ideas, algorithms, analysis, software, applications, and implementations for parallel programming environments for computer vision and for vision applications. The work has been widely reported and highly influential. The most significant work centered on the Butterfly Parallel Processor, the MaxVideo pipelined parallel image processor, and the development of the real-time computer vision laboratory. For the Butterfly, the Psyche multi-model operating system was developed and the CONSUL autoparallelizing compiler was designed. Much basic and influential performance monitoring and debugging work was completed, resulting in working systems and novel algorithms. There was also significant research in systems and applications using other parallel architectures in the laboratory, such as the MaxVideo parallel pipelined image processor. The contract developed a heterogeneous parallel architecture involving pipelined and MIMD parallelism and integrated it with a robot head.
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.
Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim
2016-08-01
In this paper, we propose a GPU-based Cloud system for high-performance arrhythmia detection. The Pan-Tompkins algorithm is used for QRS detection, and we optimized the beat classification algorithm with K-Nearest Neighbor (K-NN). To support high-performance beat classification on the system, we parallelized the beat classification algorithm with CUDA to execute it on virtualized GPU devices on the Cloud system. The MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved a detection rate of about 93.5%, which is comparable to previous research, while our algorithm runs 2.5 times faster than a CPU-only detection algorithm.
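The classification step named in this abstract, K-NN, assigns a sample the majority label among its k closest training samples. Below is a minimal CPU-only sketch of that idea. It is not the authors' CUDA implementation, and the two-dimensional features and the labels are hypothetical placeholders rather than real ECG features.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D beat features with made-up labels.
training = [
    ((0.0, 0.0), "normal"), ((0.1, 0.2), "normal"), ((0.2, 0.1), "normal"),
    ((1.0, 1.0), "abnormal"), ((0.9, 1.1), "abnormal"),
]
print(knn_classify(training, (0.15, 0.1)))
print(knn_classify(training, (0.95, 1.0)))
```

The distance from the query to every training point is computed independently, which is the part that maps naturally onto GPU threads.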
A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor
NASA Technical Reports Server (NTRS)
Rao, Hariprasad Nannapaneni
1989-01-01
The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
A sweep algorithm for massively parallel simulation of circuit-switched networks
NASA Technical Reports Server (NTRS)
Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.
1992-01-01
A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
NASA Astrophysics Data System (ADS)
Ouyang, Bo; Shang, Weiwei
2016-03-01
The tension distribution problem has infinitely many solutions for cable-driven parallel manipulators (CDPMs) with redundant cables. A rapid optimization method for determining the optimal tension distribution is presented. The new optimization method is primarily based on the geometric properties of a polyhedron and convex analysis. The computational efficiency of the optimization method is improved by the designed projection algorithm, and a fast algorithm is proposed to determine which two of the lines intersect at the optimal point. Moreover, a method for keeping the operating point off the lower tension limit is developed. Simulation experiments are implemented on a six degree-of-freedom (6-DOF) CDPM with eight cables, and the results indicate that the new method is one order of magnitude faster than the standard simplex method. The optimal tension distribution is thus rapidly established in real time by the proposed method.
Parallel algorithms for semi-Lagrangian transport in global atmospheric circulation models
Drake, J.B.; Worley, P.H.; Michalakes, J.; Foster, I.T.
1995-02-01
Global atmospheric circulation models (GCM) typically have three primary algorithmic components: columnar physics, spectral transform, and semi-Lagrangian transport. In developing parallel implementations, these three components are equally important and can be examined somewhat independently. A two-dimensional horizontal data decomposition of the three-dimensional computational grid leaves all physics computations on processor, and the only efficiency issues arise in load balancing. A recently completed study by the authors of different approaches to parallelizing the spectral transform showed several viable algorithms. Preliminary results of an analogous study of algorithmic alternatives for parallel semi-Lagrangian transport are described here.
A parallel encryption algorithm for dual-core processor based on chaotic map
NASA Astrophysics Data System (ADS)
Liu, Jiahui; Song, Dahua; Xu, Yiqiu
2011-12-01
In this paper, we propose a parallel chaos-based encryption scheme that takes advantage of a dual-core processor. The chaos-based cryptosystem is generated by combining the logistic map and the Fibonacci sequence. The Fibonacci sequence is employed to convert the value of the logistic map to integer data. The parallel algorithm is designed with a master/slave communication model using the Message Passing Interface (MPI). The experimental results show that the chaotic cryptosystem possesses good statistical properties, and that the parallel algorithm outperforms the serial version of the algorithm. It is suitable for encrypting and decrypting large volumes of sensitive data or multimedia.
Object-oriented parallel algorithms for computing three-dimensional isopycnal flow
Concus, Paul; Golub, Gene H.; Sun, Yong
2000-12-01
In this paper, we derive an object-oriented parallel algorithm for three-dimensional isopycnal flow simulations. The matrix formulation is central to the algorithm. It enables us to apply an efficient preconditioned conjugate gradient linear solver for the global system of equations, and leads naturally to an object-oriented data structure design and parallel implementation. We discuss as well, in less detail, a similar algorithm based on the reduced system, suitable also for parallel computation. Favorable performances are observed on test problems.
Distributed and parallel approach for handle and perform huge datasets
NASA Astrophysics Data System (ADS)
Konopko, Joanna
2015-12-01
Big Data refers to dynamic, large and disparate volumes of data that come from many different sources (tools, machines, sensors, mobile devices) and are uncorrelated with each other. It requires new, innovative and scalable technology to collect, host and analytically process the vast amount of data. A proper architecture is needed for systems that process huge data sets. In this paper, distributed and parallel system architectures are compared using the example of the MapReduce (MR) Hadoop platform and a parallel database platform (DBMS). This paper also analyzes the problem of extracting and handling valuable information from petabytes of data. Both paradigms, MapReduce and parallel DBMS, are described and compared. A hybrid architecture approach is also proposed that could be used to solve the analyzed problem of storing and processing Big Data.
Liu, Lei; Zhao, Jing
2014-01-01
An efficient location-based query algorithm for protecting user privacy in distributed networks is given. The algorithm uses the users' location indexes and multiple parallel threads to quickly search and select all candidate anonymous sets that have more users and more uniformly distributed location information, which accelerates the temporal-spatial anonymity operations, and it allows users to configure custom privacy-preserving location query requests. Simulation results show that the proposed algorithm can simultaneously serve location queries for more users, improve the performance of the anonymous server, and satisfy the users' anonymous location requests. PMID:24790579
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
Ho, Qirong; Cipar, James; Cui, Henggang; Kim, Jin Kyu; Lee, Seunghak; Gibbons, Phillip B.; Gibson, Garth A.; Ganger, Gregory R.; Xing, Eric P.
2014-01-01
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes. PMID:25400488
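The staleness bound described in this abstract can be illustrated with a small bookkeeping sketch. It assumes each worker carries an integer clock and that a worker may run ahead of the slowest worker by at most a threshold `staleness`; the class name `SSPClock` and its methods are hypothetical and are not the parameter server interface from the paper.

```python
class SSPClock:
    """Toy bookkeeping for a Stale Synchronous Parallel staleness bound.

    Hypothetical illustration: the names and structure are assumptions,
    not the parameter server API described in the abstract.
    """

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers  # iterations completed per worker
        self.staleness = staleness       # maximum allowed clock gap

    def can_proceed(self, worker):
        # A worker may start its next iteration only if it is at most
        # `staleness` clocks ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker):
        # Advance the worker's clock if allowed; otherwise it must wait.
        if not self.can_proceed(worker):
            return False
        self.clocks[worker] += 1
        return True


ssp = SSPClock(num_workers=2, staleness=1)
print(ssp.tick(0))  # worker 0 completes an iteration
print(ssp.tick(0))  # still within the staleness bound
print(ssp.tick(0))  # gap exceeds the bound, so worker 0 has to wait
print(ssp.tick(1))  # the slowest worker catches up by one clock
print(ssp.tick(0))  # worker 0 may continue again
```

With `staleness=0` the sketch degenerates to lock-step (fully synchronous) execution, and a very large `staleness` approximates a fully asynchronous scheme, which are the two baselines the abstract compares against.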
Zhong, Cheng; Liu, Lei; Zhao, Jing
2014-01-01
An efficient location-based query algorithm for protecting user privacy in distributed networks is given. The algorithm uses the users' location indexes and multiple parallel threads to quickly search and select all candidate anonymous sets that have more users and more uniformly distributed location information, which accelerates the temporal-spatial anonymity operations, and it allows users to configure custom privacy-preserving location query requests. Simulation results show that the proposed algorithm can simultaneously serve location queries for more users, improve the performance of the anonymous server, and satisfy the users' anonymous location requests.
NASA Astrophysics Data System (ADS)
Mattei, D.; Smith, I.; Ferrari, A.; Carbillet, M.
2010-10-01
Post-processing for exoplanet detection using direct imaging requires large data cubes and/or sophisticated signal processing techniques. For alt-azimuthal mounts, a projection effect called field rotation makes the potential planet rotate in a known manner on the set of images. For ground-based telescopes that use extreme adaptive optics and advanced coronagraphy, techniques based on field rotation are already broadly used and are still being developed. In most such techniques, for a given initial position of the planet, the planet intensity estimate is a linear function of the set of images. However, due to field rotation the modified instrumental response applied is not shift invariant like usual linear filters. Testing all possible initial positions is therefore very time-consuming. To reduce the processing time, we propose to split the initial positions into subsets and process each subset on a different machine using parallel programming. In particular, the MOODS algorithm dedicated to the VLT-SPHERE instrument, which jointly estimates the light contributions of the star and the potential exoplanet, is parallelized on the Observatoire de la Cote d'Azur cluster. Different parallelization methods (OpenMP, MPI, Jobs Array) have been developed for the initial MOODS code and compared to each other. The one finally chosen splits the initial positions over the available processors while accounting as well as possible for the different constraints of the cluster structure: memory, job submission queues, number of available CPUs, cluster average load. In the end, a standard set of images is satisfactorily processed in a few hours instead of a few days.
New SIMD Algorithms for Cluster Labeling on Parallel Computers
NASA Astrophysics Data System (ADS)
Apostolakis, John; Coddington, Paul; Marinari, Enzo
Cluster algorithms are non-local Monte Carlo update schemes which can greatly increase the efficiency of computer simulations of spin models of magnets. The major computational task in these algorithms is connected component labeling, to identify clusters of connected sites on a lattice. We have devised some new SIMD component labeling algorithms, and implemented them on the Connection Machine. We investigate their performance when applied to the cluster update of the two-dimensional Ising spin model. These algorithms could also be applied to other problems which use connected component labeling, such as percolation and image analysis.
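Connected component labeling, named here as the major computational task, can be sketched with the simplest data-parallel approach: every site repeatedly takes the minimum label among itself and its connected neighbours until nothing changes. This is a generic illustration under assumed conventions (a 2-D grid of 0/1 sites, 4-neighbour connectivity, free boundaries); it is not one of the new SIMD algorithms of the abstract, and `label_clusters` is a hypothetical name.

```python
def label_clusters(grid):
    """Label 4-connected clusters of nonzero sites by local minimum propagation."""
    rows, cols = len(grid), len(grid[0])
    # Each occupied site starts with a unique label; empty sites get None.
    labels = [[r * cols + c if grid[r][c] else None for c in range(cols)]
              for r in range(rows)]
    changed = True
    while changed:
        changed = False
        new = [row[:] for row in labels]  # synchronous update, as on a SIMD machine
        for r in range(rows):
            for c in range(cols):
                if labels[r][c] is None:
                    continue
                best = labels[r][c]
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr][nc] is not None):
                        best = min(best, labels[nr][nc])
                if best != labels[r][c]:
                    new[r][c] = best
                    changed = True
        labels = new
    return labels


grid = [[1, 1, 0],
        [0, 1, 0],
        [1, 0, 1]]
for row in label_clusters(grid):
    print(row)
```

The number of sweeps grows with the longest path inside a cluster, which is slow for the large clusters that appear near criticality and is one reason more elaborate labeling schemes are of interest.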
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD
NASA Technical Reports Server (NTRS)
Gropp, W. D.; Keyes, D. E.; McInnes, L. C.; Tidriri, M. D.
1998-01-01
Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, "routine" parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (Psi-NKS) algorithmic framework is presented as an answer. We show that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, Psi-NKS can simultaneously deliver: globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of Psi-NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of Psi-NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
Globalized Newton-Krylov-Schwarz algorithms and software for parallel implicit CFD.
Gropp, W. D.; Keyes, D. E.; McInnes, L. C.; Tidriri, M. D.; Mathematics and Computer Science; Old Dominion Univ.; Iowa State Univ.
2000-01-01
Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. The authors therefore distill several recommendations from their experience and reading of the literature on various algorithmic components of ΨNKS, and they describe a freely available MPI-based portable parallel software implementation of the solver employed here.
Madduri, Kamesh; Bader, David A.
2009-02-15
Graph-theoretic abstractions are extensively used to analyze massive data sets. Temporal data streams from socioeconomic interactions, social networking web sites, communication traffic, and scientific computing can be intuitively modeled as graphs. We present the first study of novel high-performance combinatorial techniques for analyzing large-scale information networks, encapsulating dynamic interaction data on the order of billions of entities. We present new data structures to represent dynamic interaction networks, and discuss algorithms for processing parallel insertions and deletions of edges in small-world networks. With these new approaches, we achieve an average performance rate of 25 million structural updates per second and a parallel speedup of nearly 28 on a 64-way Sun UltraSPARC T2 multicore processor, for insertions and deletions to a small-world network of 33.5 million vertices and 268 million edges. We also design parallel implementations of fundamental dynamic graph kernels related to connectivity and centrality queries. Our implementations are freely distributed as part of the open-source SNAP (Small-world Network Analysis and Partitioning) complex network analysis framework.
The lower bound on complexity of parallel branch-and-bound algorithm for subset sum problem
NASA Astrophysics Data System (ADS)
Kolpakov, Roman; Posypkin, Mikhail
2016-10-01
The subset sum problem is a particular case of the Boolean knapsack problem in which each item's price equals its weight. This problem can be informally stated as searching for the densest packing of a set of items into a box with limited capacity. Recently, coarse-grain parallelization approaches to the Branch-and-Bound (B&B) method have attracted some attention due to the growing popularity of weakly-connected distributed computing platforms. In this paper we consider one such approach for solving the subset sum problem. In the first stage, one of the processors (the manager) performs some number of B&B steps, generating some subproblems. In the second stage, the generated subproblems are sent to other processors, one subproblem per processor. The processors solve the received subproblems completely; the manager collects all the obtained solutions and chooses the optimal one. For this algorithm we formally define the parallel execution model (the frontal scheme of parallelization) and the notion of frontal scheme complexity. We study the frontal scheme complexity for a series of subset sum problems.
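The two-stage scheme described in this abstract can be sketched as follows. The sketch makes several assumptions that the abstract does not state: the manager stage branches breadth-first on the first `depth` items, the worker stage uses a simple "sum of remaining weights" bound, and the workers are simulated by a sequential loop rather than separate processors. All function names are hypothetical.

```python
def expand(weights, capacity, depth):
    """Manager stage: branch on the first `depth` items, return feasible subproblems."""
    frontier = [(0, 0)]  # (index of next undecided item, current sum)
    for _ in range(min(depth, len(weights))):
        nxt = []
        for i, s in frontier:
            nxt.append((i + 1, s))                   # leave item i out
            if s + weights[i] <= capacity:
                nxt.append((i + 1, s + weights[i]))  # put item i in
        frontier = nxt
    return frontier


def solve(weights, capacity, start, current):
    """Worker stage: solve one subproblem completely by depth-first B&B."""
    best = current
    stack = [(start, current, sum(weights[start:]))]
    while stack:
        i, s, remaining = stack.pop()
        best = max(best, s)
        # Bound: even taking every remaining item cannot beat the incumbent.
        if i == len(weights) or s + remaining <= best:
            continue
        stack.append((i + 1, s, remaining - weights[i]))
        if s + weights[i] <= capacity:
            stack.append((i + 1, s + weights[i], remaining - weights[i]))
    return best


def frontal_subset_sum(weights, capacity, depth):
    subproblems = expand(weights, capacity, depth)
    # In a real run each subproblem would go to its own processor;
    # here the loop stands in for that distribution step.
    results = [solve(weights, capacity, i, s) for i, s in subproblems]
    return max(results)  # the manager picks the best of the collected solutions


print(frontal_subset_sum([8, 6, 5, 3], capacity=12, depth=2))
```

Because workers do not exchange incumbents in this sketch, each one prunes only against its own best value, which reflects the weakly-connected setting the abstract has in mind.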
NASA Technical Reports Server (NTRS)
Krosel, S. M.; Milner, E. J.
1982-01-01
The application of predictor-corrector integration algorithms developed for the digital parallel processing environment is investigated. The algorithms are implemented and evaluated through the use of a software simulator which provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real time performance, interprocessor communication, and algorithm startup are also discussed.
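For readers unfamiliar with the predictor-corrector idea, a minimal serial sketch is given here. It uses Heun's method (an explicit Euler predictor followed by a trapezoidal corrector), which is chosen only for brevity; it is an assumption, not the parallel algorithms evaluated in the report, and the test equation dy/dt = -y is likewise illustrative.

```python
import math


def heun(f, t0, y0, h, steps):
    """Integrate dy/dt = f(t, y) with a one-step predictor-corrector scheme."""
    t, y = t0, y0
    for _ in range(steps):
        slope = f(t, y)
        y_pred = y + h * slope                         # predictor: explicit Euler
        y = y + 0.5 * h * (slope + f(t + h, y_pred))   # corrector: trapezoidal rule
        t += h
    return y


# Illustrative test problem: dy/dt = -y, y(0) = 1, exact solution exp(-t).
approx = heun(lambda t, y: -y, 0.0, 1.0, h=0.1, steps=10)
print(approx, math.exp(-1.0))
```

Halving `h` (and doubling `steps`) should reduce the error by roughly a factor of four, since the scheme is second-order accurate; this is the kind of step-size effect on accuracy that the report examines.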
PDoublePop: An implementation of parallel genetic algorithm for function optimization
NASA Astrophysics Data System (ADS)
Tsoulos, Ioannis G.; Tzallas, Alexandros; Tsalikakis, Dimitris
2016-12-01
Software for the implementation of parallel genetic algorithms is presented in this article. The underlying genetic algorithm aims to locate the global minimum of a multidimensional function inside a rectangular hyperbox. The proposed software, named PDoublePop, implements a client-server model for parallel genetic algorithms with advanced features for the local genetic algorithms, such as an enhanced stopping rule, an advanced mutation scheme and periodic application of a local search procedure. The user may code the objective function in either C++ or Fortran 77. The method is tested on a series of well-known test functions and the results are reported.
NASA Astrophysics Data System (ADS)
Shim, Yunsic; Amar, Jacques G.
2005-03-01
The standard kinetic Monte Carlo algorithm is an extremely efficient method to carry out serial simulations of dynamical processes such as thin film growth. However, in some cases it is necessary to study systems over extended time and length scales, and therefore a parallel algorithm is desired. Here we describe an efficient, semirigorous synchronous sublattice algorithm for parallel kinetic Monte Carlo simulations. The accuracy and parallel efficiency are studied as a function of diffusion rate, processor size, and number of processors for a variety of simple models of epitaxial growth. The effects of fluctuations on the parallel efficiency are also studied. Since only local communications are required, linear scaling behavior is observed, e.g., the parallel efficiency is independent of the number of processors for fixed processor size.
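As background for the standard serial algorithm mentioned at the start of this abstract, one rejection-free kinetic Monte Carlo step can be sketched as below: an event is selected with probability proportional to its rate, and the clock advances by an exponentially distributed increment. This is a textbook sketch with the random numbers passed in explicitly for clarity; it is not the synchronous sublattice algorithm of the paper, and `kmc_step` is a hypothetical name.

```python
import math


def kmc_step(rates, u1, u2):
    """One rejection-free KMC step.

    rates: list of positive event rates
    u1:    uniform random number in [0, 1), selects the event
    u2:    uniform random number in (0, 1], sets the time increment
    """
    total = sum(rates)
    target = u1 * total
    cumulative = 0.0
    event = len(rates) - 1  # fallback guards against rounding at the upper end
    for index, rate in enumerate(rates):
        cumulative += rate
        if target < cumulative:
            event = index
            break
    dt = -math.log(u2) / total  # exponential waiting time with mean 1/total
    return event, dt


event, dt = kmc_step([1.0, 3.0], u1=0.5, u2=0.5)
print(event, dt)
```

In a simulation the two uniform numbers would come from a random number generator, and the rate list would be updated after each executed event.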
Parallel OSEM Reconstruction Algorithm for Fully 3-D SPECT on a Beowulf Cluster.
Rong, Zhou; Tianyu, Ma; Yongjie, Jin
2005-01-01
In order to improve the computation speed of the ordered subset expectation maximization (OSEM) algorithm for fully 3-D single photon emission computed tomography (SPECT) reconstruction, an experimental Beowulf-type cluster was built and several parallel reconstruction schemes were described. We implemented a single-program-multiple-data (SPMD) parallel 3-D OSEM reconstruction algorithm based on the Message Passing Interface (MPI) and tested it with combinations of different numbers of processors and different voxel grid sizes (64×64×64 and 128×128×128). Parallel performance was evaluated in terms of the speedup factor and parallel efficiency. This parallel implementation methodology is expected to help make fully 3-D OSEM algorithms more feasible in clinical SPECT studies.
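The two performance measures named in this abstract are conventionally defined as speedup S = T_serial / T_parallel and parallel efficiency E = S / p for p processors. A short sketch of these standard definitions follows; the timings in the example are hypothetical and are not results from the paper.

```python
def speedup(t_serial, t_parallel):
    """Speedup factor: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_procs):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / num_procs


# Hypothetical timings: 120 s serially, 20 s on 8 processors.
print(speedup(120.0, 20.0))        # 6.0
print(efficiency(120.0, 20.0, 8))  # 0.75
```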
Estimation of distribution algorithms with Kikuchi approximations.
Santana, Roberto
2005-01-01
The question of finding feasible ways for estimating probability distributions is one of the main challenges for Estimation of Distribution Algorithms (EDAs). To estimate the distribution of the selected solutions, EDAs use factorizations constructed according to graphical models. The class of factorizations that can be obtained from these probability models is highly constrained. Expanding the class of factorizations that could be employed for probability approximation is a necessary step for the conception of more robust EDAs. In this paper we introduce a method for learning a more general class of probability factorizations. The method combines a reformulation of a probability approximation procedure known in statistical physics as the Kikuchi approximation of energy, with a novel approach for finding graph decompositions. We present the Markov Network Estimation of Distribution Algorithm (MN-EDA), an EDA that uses Kikuchi approximations to estimate the distribution, and Gibbs Sampling (GS) to generate new points. A systematic empirical evaluation of MN-EDA is done in comparison with different Bayesian network based EDAs. From our experiments we conclude that the algorithm can outperform other EDAs that use traditional methods of probability approximation in the optimization of functions with strong interactions among their variables.
Nagurney, A.; Kim, D.S.
1989-01-01
The authors have applied parallel and serial variational inequality (VI) diagonal decomposition algorithms to large-scale multicommodity market equilibrium problems. These decomposition algorithms resolve the VI problems into single commodity problems, which are then solved as quadratic programming problems. The algorithms are implemented on an IBM 3090-600E, and randomly generated linear and nonlinear problems with as many as 100 markets and 12 commodities are solved. The computational results demonstrate that the parallel diagonal decomposition scheme is amenable to parallelization. This is the first time that multicommodity equilibrium problems of this scale and level of generality have been solved. Furthermore, this is the first study to compare the efficiencies of parallel and serial VI decomposition algorithms. Although the authors have selected as a prototype an equilibrium problem in economics, virtually any equilibrium problem can be formulated and studied as a variational inequality problem. Hence, their results are not limited to applications in economics and operations research.
A portable implementation of ARPACK for distributed memory parallel architectures
Maschhoff, K.J.; Sorensen, D.C.
1996-12-31
ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and the Message Passing Interface (MPI).
Parallelized event chain algorithm for dense hard sphere and polymer systems
Kampmann, Tobias A.; Boltz, Horst-Holger; Kierfeld, Jan
2015-01-15
We combine parallelization and cluster Monte Carlo for hard sphere systems and present a parallelized event chain algorithm for the hard disk system in two dimensions. For parallelization we use a spatial partitioning approach into simulation cells. We find that it is crucial for correctness to ensure detailed balance on the level of Monte Carlo sweeps by drawing the starting sphere of event chains within each simulation cell with replacement. We analyze the performance gains for the parallelized event chain and find a criterion for an optimal degree of parallelization. Because of the cluster nature of event chain moves, massive parallelization will not be optimal. Finally, we discuss first applications of the event chain algorithm to dense polymer systems, i.e., bundle-forming solutions of attractive semiflexible polymers.
Current distribution within parallel-connected battery cells
NASA Astrophysics Data System (ADS)
Brand, Martin J.; Hofmann, Markus H.; Steinhardt, Marco; Schuster, Simon F.; Jossen, Andreas
2016-12-01
Parallel connections can be found in many battery applications. Therefore, it is of high interest to understand how the current distributes within parallel battery cells. However, the number of publications on this topic is comparatively low. Furthermore, the measurement set-ups are often not clearly defined in existing publications and it is likely that additional impedances distorted the measured current distributions. In this work, the principles of current distributions within parallel-connected battery cells are investigated theoretically, with an equivalent electric circuit model, and by measurements. A measurement set-up is developed that does not significantly influence the measurements, as proven by impedance spectroscopy. On this basis, two parameter scenarios are analyzed: the ΔR scenario stands for battery cells with differing impedances but similar capacities and the ΔC scenario for differing capacities and similar impedances. Out of 172 brand-new lithium-ion battery cells, pairs are built to practically represent the ΔR and ΔC scenarios. If a charging pulse is applied to the ΔR scenario, currents initially divide according to the current divider but equalize in constant current phases. The current divider has no effect on ΔC pairs but, as a rule of thumb for long-term loads, currents divide according to the battery cell capacities.
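The two rules of thumb in this abstract can be written out numerically. The sketch assumes purely resistive cells at equal open-circuit voltage for the current divider, and a split proportional to capacity for the long-term rule; the resistance and capacity values are hypothetical, and both function names are illustrative.

```python
def split_by_resistance(total_current, resistances):
    """Current divider: each branch current is proportional to 1/R."""
    conductances = [1.0 / r for r in resistances]
    total_g = sum(conductances)
    return [total_current * g / total_g for g in conductances]


def split_by_capacity(total_current, capacities):
    """Long-term rule of thumb: branch current proportional to cell capacity."""
    total_c = sum(capacities)
    return [total_current * c / total_c for c in capacities]


# Hypothetical ΔR pair: 30 mΩ and 20 mΩ cells sharing a 10 A pulse.
print(split_by_resistance(10.0, [0.030, 0.020]))  # about 4 A and 6 A

# Hypothetical ΔC pair: 3.0 Ah and 2.0 Ah cells under a 10 A long-term load.
print(split_by_capacity(10.0, [3.0, 2.0]))        # about 6 A and 4 A
```

The lower-resistance cell carries the larger share at the start of a pulse, while under sustained load the larger-capacity cell carries the larger share.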
Distributed parallel computing in stochastic modeling of groundwater systems.
Dong, Yanhui; Li, Guomin; Xu, Haizhen
2013-03-01
Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling.
NASA Astrophysics Data System (ADS)
Brossier, R.
2011-04-01
Full waveform inversion (FWI) is an appealing seismic data-fitting procedure for the derivation of high-resolution quantitative models of the subsurface at various scales. Full modelling and inversion of visco-elastic waves from multiple seismic sources allow for the recovering of different physical parameters, although they remain computationally challenging tasks. An efficient massively parallel, frequency-domain FWI algorithm is implemented here on large-scale distributed-memory platforms for imaging two-dimensional visco-elastic media. The resolution of the elastodynamic equations, as the forward problem of the inversion, is performed in the frequency domain on unstructured triangular meshes, using a low-order finite element discontinuous Galerkin method. The linear system resulting from discretization of the forward problem is solved with a parallel direct solver. The inverse problem, which is presented as a non-linear local optimization problem, is solved in parallel with a quasi-Newton method, and this allows for reliable estimation of multiple classes of visco-elastic parameters. Two levels of parallelism are implemented in the algorithm, based on message passing interfaces and multi-threading, for optimal use of computational time and the core-memory resources available on modern distributed-memory multi-core computational platforms. The algorithm allows for imaging of realistic targets at various scales, ranging from near-surface geotechnic applications to crustal-scale exploration.
Perm Web: remote parallel and distributed volume visualization
Wittenbrink, C.M.; Kim, K.; Story, J.; Pang, A.; Hollerbach, K.; Max, N.
1997-01-01
In this paper we present a system for visualizing volume data from remote supercomputers (PermWeb). We have developed both parallel volume rendering algorithms, and the World Wide Web software for accessing the data at the remote sites. The implementation uses Hypertext Markup Language (HTML), Java, and Common Gateway Interface (CGI) scripts to connect World Wide Web (WWW) servers/clients to our volume renderers. The front ends are interactive Java classes for specification of view, shading, and classification inputs. We present performance results, and implementation details for connections to our computing resources at the University of California Santa Cruz including a MasPar MP-2, SGI Reality Engine-RE2, and SGI Challenge machines. We apply the system to the task of visualizing trabecular bone from finite element simulations. Fast volume rendering on remote compute servers through a web interface allows us to increase the accessibility of the results to more users. User interface issues, overviews of parallel algorithm developments, and overall system interfaces and protocols are presented. Access is available through Uniform Resource Locator (URL) http://www.cse.ucsc.edu/research/slvg/. 26 refs., 7 figs.
Parallel vision algorithms. Annual technical report No. 2, 1 October 1987-28 December 1988
Ibrahim, H.A.; Kender, J.R.; Brown, L.G.
1989-01-01
This Second Annual Technical Report covers the project activities during the period from October 1, 1987 through December 31, 1988. The objective of this project is to develop and implement, on highly parallel computers, vision algorithms that combine stereo, texture, and multi-resolution techniques for determining local surface orientation and depth. Such algorithms can serve as front-end components of autonomous land-vehicle vision systems. During the second year of the project, efforts concentrated on the following: first, implementing and testing on the Connection Machine the parallel programming environment that will be used to develop, implement and test our parallel vision algorithms; second, implementing and testing primitives for the multi-resolution stereo and texture algorithms, in this environment. Also, efforts were continued to refine techniques used in the texture algorithms, and to develop a system that integrates information from several shape-from-texture methods. This report describes the status and progress of these efforts. The authors describe first the programming environment implementation, and how to use it. They summarize the results for multi-resolution based depth-interpolation algorithms on parallel architectures. Then, they present algorithms and test results for the texture algorithms. Finally, the results of the efforts of integrating information from various shape-from-texture algorithms are presented.
A computational fluid dynamics algorithm on a massively parallel computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The implementation and performance of a finite-difference algorithm for the compressible Navier-Stokes equations in two or three dimensions on the Connection Machine are described. This machine is a single-instruction multiple-data machine with up to 65536 physical processors. The implicit portion of the algorithm is of particular interest. Running times and megaflop rates are given for two- and three-dimensional problems. Included are comparisons with the standard codes on a Cray X-MP/48.
Event parallelism: Distributed memory parallel computing for high energy physics experiments
Nash, T.
1989-05-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs.
LYDIAN: An Extensible Educational Animation Environment for Distributed Algorithms
ERIC Educational Resources Information Center
Koldehofe, Boris; Papatriantafilou, Marina; Tsigas, Philippas
2006-01-01
LYDIAN is an environment to support the teaching and learning of distributed algorithms. It provides a collection of distributed algorithms as well as continuous animations. Users can combine algorithms and animations with arbitrary network structures defining the interconnection and behavior of the distributed algorithm. Further, it facilitates…
NASA Astrophysics Data System (ADS)
Subramanian, Nithya
Optimization under uncertainty accounts for design variables and external parameters or factors with probabilistic distributions instead of fixed deterministic values; it enables problem formulations that might maximize or minimize an expected value while satisfying constraints using probabilities. For discrete optimization under uncertainty, a Monte Carlo Sampling (MCS) approach enables high-accuracy estimation of expectations but it also results in high computational expense. The Genetic Algorithm (GA) with a Population-Based Sampling (PBS) technique enables optimization under uncertainty with discrete variables at a lower computational expense than using Monte Carlo sampling for every fitness evaluation. Population-Based Sampling uses fewer samples in the exploratory phase of the GA and a larger number of samples when 'good designs' start emerging over the generations. This sampling technique therefore reduces the computational effort spent on 'poor designs' found in the initial phase of the algorithm. Parallel computation evaluates the expected value of the objective and constraints in parallel to reduce wall-clock time. A customized stopping criterion is also developed for the GA with Population-Based Sampling. The stopping criterion requires that the design with the minimum expected fitness value have at least 99% constraint satisfaction and have accumulated at least 10,000 samples. The average change in expected fitness values in the last ten consecutive generations is also monitored. The optimization of composite laminates using ply orientation angle as a discrete variable provides an example to demonstrate further developments of the GA with Population-Based Sampling for discrete optimization under uncertainty. The focus problem aims to reduce the expected weight of the composite laminate while treating the laminate's fiber volume fraction and externally applied loads as uncertain quantities following normal distributions. Construction of
Pruning Neural Networks with Distribution Estimation Algorithms
Cantu-Paz, E
2003-01-15
This paper describes the application of four evolutionary algorithms to the pruning of neural networks used in classification problems. Besides a simple genetic algorithm (GA), the paper considers three distribution estimation algorithms (DEAs): a compact GA, an extended compact GA, and the Bayesian Optimization Algorithm. The objective is to determine if the DEAs present advantages over the simple GA in terms of accuracy or speed in this problem. The experiments used a feed forward neural network trained with standard back propagation and public-domain and artificial data sets. The pruned networks seemed to have accuracy better than or equal to that of the original fully-connected networks. Only in a few cases did pruning result in less accurate networks. We found few differences in the accuracy of the networks pruned by the four EAs, but found important differences in the execution time. The results suggest that a simple GA with a small population might be the best algorithm for pruning networks on the data sets we tested.
D'Azevedo, E.F.; Romine, C.H.
1992-09-01
The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. In a distributed memory parallel environment, the computation and subsequent distribution of these two values requires two separate communication and synchronization phases. In this paper, we present a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases. We give a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process. We also present empirical evidence of the stability of this modified algorithm.
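For orientation, the following is a minimal sketch of the standard (textbook) conjugate gradient iteration that the abstract takes as its starting point, written in plain Python for a small dense symmetric positive definite system. It is not the authors' rearranged variant, and the names `dot`, `matvec`, and `conjugate_gradient` are illustrative choices. The comments mark the two inner products per iteration; in a distributed memory setting each one would correspond to a global reduction.

```python
def dot(u, v):
    """Inner product; in a distributed run this would be a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product, one row at a time."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Textbook CG for a symmetric positive definite matrix A (list of rows)."""
    x = [0.0] * len(b)
    r = list(b)                # residual r = b - A x for the initial guess x = 0
    p = list(r)                # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        pAp = dot(p, Ap)       # inner product 1: needed for the step length
        alpha = rr / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)     # inner product 2: needed for the new direction
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

Because the second inner product depends on the residual updated with the first, the two reductions cannot simply be merged in this form, which is the dependency a rearranged algorithm has to work around.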
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm
NASA Technical Reports Server (NTRS)
Povitsky, A.
1998-01-01
In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines. This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty by about a factor of two relative to the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
Advanced Algorithms and Automation Tools for Discrete Ordinates Methods in Parallel Environments
Alireza Haghighat
2003-05-07
This final report discusses major accomplishments of a 3-year project under the DOE's NEER Program. The project has developed innovative and automated algorithms, codes, and tools for solving the discrete ordinates particle transport method efficiently in parallel environments. Using a number of benchmark and real-life problems, the performance and accuracy of the new algorithms have been measured and analyzed.
A Parallel Processing Algorithm for Remote Sensing Classification
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
2005-01-01
A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance, 130 Gflops (Linpack Benchmark), at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
Parallel algorithms for simulating continuous time Markov chains
NASA Technical Reports Server (NTRS)
Nicol, David M.; Heidelberger, Philip
1992-01-01
We have previously shown that the mathematical technique of uniformization can serve as the basis of synchronization for the parallel simulation of continuous-time Markov chains. This paper reviews the basic method and compares five different methods based on uniformization, evaluating their strengths and weaknesses as a function of problem characteristics. The methods vary in their use of optimism, logical aggregation, communication management, and adaptivity. Performance evaluation is conducted on the Intel Touchstone Delta multiprocessor, using up to 256 processors.
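As background, the sketch below shows uniformization in its simplest sequential form, in plain Python. It is not one of the five parallel methods compared in the paper; the function name `simulate_uniformized`, the dictionary layout of `rates`, and the two-state example are assumptions made for illustration. The point it illustrates is that candidate event times are drawn from a single Poisson process whose rate does not depend on the current state, which is the property that makes those times usable as synchronization points.

```python
import random


def simulate_uniformized(rates, start, horizon, seed=0):
    """Simulate a continuous-time Markov chain by uniformization.

    rates[s] maps each successor state of s to its transition rate.
    Returns the list of (time, state) pairs at which the state changed.
    """
    rng = random.Random(seed)
    # Uniformization rate: at least the largest total exit rate of any state.
    lam = max(sum(out.values()) for out in rates.values())
    t, state = 0.0, start
    path = [(t, state)]
    if lam == 0.0:
        return path                      # every state is absorbing
    while True:
        t += rng.expovariate(lam)        # next event time, state-independent
        if t > horizon:
            return path
        u = rng.random() * lam
        acc = 0.0
        for target, rate in rates[state].items():
            acc += rate
            if u < acc:                  # real transition
                state = target
                path.append((t, state))
                break
        # If no branch was taken, this was a pseudo-event (self-loop).
```

A small usage example with a hypothetical two-state chain: `simulate_uniformized({"up": {"down": 1.0}, "down": {"up": 3.0}}, "up", 10.0)` returns the jump times and states visited up to time 10.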
Execution models for mapping programs onto distributed memory parallel computers
NASA Technical Reports Server (NTRS)
Sussman, Alan
1992-01-01
The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
NASA Technical Reports Server (NTRS)
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time of the application, T_par, depends on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
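The abstract does not name a model, but the effect it describes is usually quantified with Amdahl's law, which is assumed here purely for illustration. In this minimal sketch, `serial_fraction` is the share of the single-processor run time spent in inherently sequential code and `processors` is the number of processors; both names are hypothetical.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup when a fixed fraction of the work is sequential."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # Parallel run time relative to the serial run time, normalised to 1.
    relative_time = serial_fraction + parallel_fraction / processors
    return 1.0 / relative_time
```

Under this model the speedup can never exceed `1 / serial_fraction`, however many processors are used, so a code that is half sequential cannot be sped up by more than a factor of two.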
Image coding using parallel implementations of the embedded zerotree wavelet algorithm
NASA Astrophysics Data System (ADS)
Creusere, Charles D.
1996-03-01
We explore here the implementation of Shapiro's embedded zerotree wavelet (EZW) image coding algorithms on an array of parallel processors. To this end, we first consider the problem of parallelizing the basic wavelet transform, discussing past work in this area and the compatibility of that work with the zerotree coding process. From this discussion, we present a parallel partitioning of the transform which is computationally efficient and which allows the wavelet coefficients to be coded with little or no additional inter-processor communication. The key to achieving low data dependence between the processors is to ensure that each processor contains only entire zerotrees of wavelet coefficients after the decomposition is complete. We next quantify the rate-distortion tradeoffs associated with different levels of parallelization for a few variations of the basic coding algorithm. Studying these results, we conclude that the quality of the coder decreases as the number of parallel processors used to implement it increases. Noting that the performance of the parallel algorithm might be unacceptably poor for large processor arrays, we also develop an alternate algorithm which always achieves the same rate-distortion performance as the original sequential EZW algorithm at the cost of higher complexity and reduced scalability.
NASA Astrophysics Data System (ADS)
Yang, Huizhen; Li, Xinyang
2011-04-01
Optimizing the system performance metric directly is an important method for correcting wavefront aberrations in an adaptive optics (AO) system where wavefront sensing methods are unavailable or ineffective. An appropriate Deformable Mirror (DM) control algorithm is the key to successful wavefront correction. Based on several stochastic parallel optimization control algorithms, an adaptive optics system with a 61-element DM is simulated. Genetic Algorithm (GA), Stochastic Parallel Gradient Descent (SPGD), Simulated Annealing (SA) and Algorithm Of Pattern Extraction (Alopex) are compared in convergence speed and correction capability. The results show that all these algorithms have the ability to correct for atmospheric turbulence. Compared with least squares fitting, they almost obtain the best correction achievable for the 61-element DM. SA is the fastest and GA is the slowest among these algorithms. The number of perturbations required by GA is almost 20 times larger than that of SA, 15 times larger than that of SPGD and 9 times larger than that of Alopex.
NASA Technical Reports Server (NTRS)
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
An efficient parallel algorithm for the solution of a tridiagonal linear system of equations
NASA Technical Reports Server (NTRS)
Stone, H. S.
1971-01-01
Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log_2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
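To make the idea of recursive doubling concrete, here is a minimal sketch for the simplest case, a first-order linear recurrence x[i] = a[i] * x[i-1] + b[i] with the value before the first term taken as 0. It is not the full tridiagonal solver of the paper; the function names and the choice of recurrence are assumptions for illustration. Each element is treated as an affine map, and in every round each position composes its map with the one `d` positions to its left, with `d` doubling each round. All updates within a round are independent, so on a parallel machine the number of rounds, about log_2 N, determines the run time.

```python
def recurrence_serial(a, b):
    """Reference: evaluate x[i] = a[i] * x[i-1] + b[i] one term at a time."""
    x, prev = [], 0.0
    for ai, bi in zip(a, b):
        prev = ai * prev + bi
        x.append(prev)
    return x


def recurrence_recursive_doubling(a, b):
    """Evaluate the same recurrence in about log2(N) data-parallel rounds.

    Returns (x, rounds). The inner loop is written serially here, but every
    iteration of it reads only values from the previous round.
    """
    n = len(a)
    A, B = list(a), list(b)       # position i holds the map x -> A[i]*x + B[i]
    d, rounds = 1, 0
    while d < n:
        new_A, new_B = list(A), list(B)
        for i in range(d, n):     # independent updates: parallel on real hardware
            new_A[i] = A[i] * A[i - d]
            new_B[i] = A[i] * B[i - d] + B[i]
        A, B = new_A, new_B
        d *= 2
        rounds += 1
    return B, rounds
```

The parallel version performs more arithmetic in total than the serial loop; the gain is in the length of the dependency chain, not in the operation count.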
On the impact of communication complexity in the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David
2006-05-01
The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has introduced new processing challenges. The price paid for the wealth of spatial and spectral information available from hyperspectral sensors is the enormous amount of data that they generate. Several applications exist, however, where having the desired information calculated quickly enough for practical use is highly desirable. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed up the computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques. Techniques include four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in selection of parallel hyperspectral algorithms for specific applications.
Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication
2013-05-17
given subproblem share the same m-digit suffix . After the above communication is performed, the layout of Si and Ti has parameters (n/2, P/7, s − 1...MULTIPLICATION 11 Algorithm 2 CAPS, in detail Input: A, B, are n× n matrices P = number of processors rank = processor number base-7 as an array M = local
Parallel Implementation of the Wideband DOA Algorithm on the IBM Cell BE Processor
2010-05-01
Abstract—The Multiple Signal Classification ( MUSIC ) algorithm is a powerful technique for determining the Direction of Arrival (DOA) of signals...Broadband Engine Processor (Cell BE). The process of adapting the serial based MUSIC algorithm to the Cell BE will be analyzed in terms of parallelism and...using Multiple Signal Classification MUSIC algorithm [4] • Computation of Focus matrix • Computation of number of sources • Separation of Signal
Parallel field line and stream line tracing algorithms for space physics applications
NASA Astrophysics Data System (ADS)
Toth, G.; de Zeeuw, D.; Monostori, G.
2004-05-01
Field line and stream line tracing is required in various space physics applications, such as the coupling of the global magnetosphere and inner magnetosphere models, the coupling of the solar energetic particle and heliosphere models, or the modeling of comets, where the multispecies chemical equations are solved along stream lines of a steady state solution obtained with a single fluid MHD model. Tracing a vector field is an inherently serial process, which is difficult to parallelize. This is especially true when the data corresponding to the vector field is distributed over a large number of processors. We designed algorithms for the various applications, which scale well to a large number of processors. In the first algorithm the computational domain is divided into blocks. Each block is on a single processor. The algorithm follows the vector field inside the blocks, and calculates a mapping of the block surfaces. The blocks communicate the values at the coinciding surfaces, and the results are interpolated. Finally all block surfaces are defined and values inside the blocks are obtained. In the second algorithm all processors start integrating along the vector field inside the accessible volume. When the field line leaves the local subdomain, the position and other information is stored in a buffer. Periodically the processors exchange the buffers, and continue integration of the field lines until they reach a boundary. At that point the results are sent back to the originating processor. Efficiency is achieved by a careful phasing of computation and communication. In the third algorithm the results of a steady state simulation are stored on a hard drive. The vector field is contained in blocks. All processors read in all the grid and vector field data and the stream lines are integrated in parallel. If a stream line enters a block which has already been integrated, the results can be interpolated. By a clever ordering of the blocks the execution speed can be
apGA: An adaptive parallel genetic algorithm
Liepins, G.E.; Baluja, S.
1991-01-01
We develop apGA, a parallel variant of the standard generational GA, that combines aggressive search with perpetual novelty, yet is able to preserve enough genetic structure to optimally solve variably scaled, non-uniform block deceptive and hierarchical deceptive problems. apGA combines elitism, adaptive mutation, adaptive exponential scaling, and temporal memory. We present empirical results for six classes of problems, including the DeJong test suite. Although we have not investigated hybrids, we note that apGA could be incorporated into other recent GA variants such as GENITOR, CHC, and the recombination stage of mGA. 12 refs., 2 figs., 2 tabs.
A dataflow analysis tool for parallel processing of algorithms
NASA Technical Reports Server (NTRS)
Jones, Robert L., III
1993-01-01
A graph-theoretic design process and software tool is presented for selecting a multiprocessing scheduling solution for a class of computational problems. The problems of interest are those that can be described using a dataflow graph and are intended to be executed repetitively on a set of identical parallel processors. Typical applications include signal processing and control law problems. Graph analysis techniques are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool is shown to facilitate the application of the design process to a given problem.
Application of parallel distributed processing to space based systems
NASA Technical Reports Server (NTRS)
Macdonald, J. R.; Heffelfinger, H. L.
1987-01-01
The concept of using Parallel Distributed Processing (PDP) to enhance automated experiment monitoring and control is explored. Recent very large scale integration (VLSI) advances have made such applications an achievable goal. The PDP machine has demonstrated the ability to automatically organize stored information, handle unfamiliar and contradictory input data and perform the actions necessary. The PDP machine has demonstrated that it can perform inference and knowledge operations with greater speed and flexibility and at lower cost than traditional architectures. In applications where the rule set governing an expert system's decisions is difficult to formulate, PDP can be used to extract rules by associating the information an expert receives with the actions taken.
A Self Consistent Multiprocessor Space Charge Algorithm that is Almost Embarrassingly Parallel
Edward Nissen, B. Erdelyi, S.L. Manikonda
2012-07-01
We present a space charge code that is self consistent, massively parallelizable, and requires very little communication between computer nodes, making the calculation almost embarrassingly parallel. This method is implemented in the code COSY Infinity, where the differential algebras used in this code are important to the algorithm's proper functioning. The method works by calculating the self consistent space charge distribution using the statistical moments of the test particles, and converting them into polynomial series coefficients. These coefficients are combined with differential algebraic integrals to form the potential and electric fields. The result is a map which contains the effects of space charge. This method allows for massive parallelization since its statistics-based solver doesn't require any binning of particles, and only requires a vector containing the partial sums of the statistical moments for the different nodes to be passed. All other calculations are done independently. The resulting maps can be used to analyze the system using normal form analysis, as well as advance particles in numbers and at speeds that were previously impossible.
Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Chiou, Jin-Chern
1990-01-01
Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAE's) viewpoint. Constraint violations during the time integration process are minimized, and penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion are treated with a two-stage staggered explicit-implicit numerical algorithm that takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained by using an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computational procedures, a parallel implementation of the present constraint treatment techniques and the two-stage staggered explicit-implicit numerical algorithm was carried out efficiently. The DAE's and the constraint treatment techniques were transformed into arrowhead matrices, for which a Schur complement form was derived. By fully exploiting sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient numerical algorithm is used to solve the system equations written in Schur complement form. A software testbed was designed and implemented on both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speedup of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.
Chen, Wei-Chen; Ostrouchov, George; Pugmire, Dave; Prabhat; Wehner, Michael
2013-01-01
We develop a parallel EM algorithm for multivariate Gaussian mixture models and use it to perform model-based clustering of a large climate data set. Three variants of the EM algorithm are reformulated in parallel and a new variant that is faster is presented. All are implemented using the single program, multiple data (SPMD) programming model, which is able to take advantage of the combined collective memory of large distributed computer architectures to process larger data sets. Displays of the estimated mixture model rather than the data allow us to explore multivariate relationships in a way that scales to arbitrary size data. We study the performance of our methodology on simulated data and apply our methodology to a high resolution climate dataset produced by the community atmosphere model (CAM5). This article has supplementary material online.
A block-wise approximate parallel implementation for ART algorithm on CUDA-enabled GPU.
Fan, Zhongyin; Xie, Yaoqin
2015-01-01
Computed tomography (CT) has been widely used to acquire volumetric anatomical information in the diagnosis and treatment of illnesses in many clinics. However, the ART algorithm for reconstruction from under-sampled and noisy projections is still time-consuming. The goal of our work is to improve a block-wise approximate parallel implementation of the ART algorithm on a CUDA-enabled GPU to make the ART algorithm applicable to the clinical environment. The resulting method has several compelling features: (1) the rays are allotted into blocks, making the rays in the same block parallel; (2) the GPU implementation caters to the actual industrial and medical application demand. We test the algorithm on a digital Shepp-Logan phantom, and the results indicate that our method is more efficient than the existing CPU implementation. The high computation efficiency achieved in our algorithm makes it possible for clinicians to obtain real-time 3D images.
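For readers unfamiliar with ART, the sketch below shows the basic sequential row-action update (the Kaczmarz iteration) on a tiny linear system in plain Python. It is not the block-wise GPU implementation described in the abstract; the names `art_sweep` and `art`, the relaxation parameter `relax`, and the toy systems in the usage note are assumptions for illustration. Each row of `A` plays the role of one ray, and `b` holds the measured projections.

```python
def art_sweep(A, b, x, relax=1.0):
    """One ART sweep: project the estimate onto each ray equation in turn."""
    for row, bi in zip(A, b):
        norm_sq = sum(a * a for a in row)
        if norm_sq == 0.0:
            continue                      # ray misses every pixel; nothing to do
        residual = bi - sum(a * xj for a, xj in zip(row, x))
        step = relax * residual / norm_sq
        x = [xj + step * a for xj, a in zip(x, row)]
    return x


def art(A, b, sweeps=50, relax=1.0):
    """Run a fixed number of sweeps starting from the zero image."""
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        x = art_sweep(A, b, x, relax)
    return x
```

For example, `art([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0], sweeps=200)` approaches the exact solution of that 2x2 system. Because every update depends on the result of the previous ray, the plain iteration is sequential, which is why parallel versions group rays into blocks and accept an approximation.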
Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua
2011-01-01
A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased while using more gyro incremental angle and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm, relative to the existing implementation on the DSP platform, to meet the real-time and high-precision requirements of the system in a highly dynamic environment.
NASA Astrophysics Data System (ADS)
Hongchao, Ma; Wang, Zongyue
2011-02-01
This paper proposes a novel method for distributed data organization and parallel data retrieval from huge volume point clouds generated by airborne Light Detection and Ranging (LiDAR) technology under a cluster computing environment, in order to allow fast analysis, processing, and visualization of the point clouds within a given area. The proposed method is suitable for both grid and quadtree data structures. As for distribution strategy, cross distribution of the dataset would be more efficient than serial distribution in terms of non-redundant datasets, since a dataset is more uniformly distributed in the former arrangement. However, redundant datasets are necessary in order to meet the frequent need of input and output operations in multi-client scenarios: the first copy would be distributed by a cross distribution strategy while the second (and later) would be distributed by an iterated exchanging distribution strategy. Such a distribution strategy would distribute datasets more uniformly to each data server. In data retrieval, a greedy algorithm is used to allocate the query task to a data server, where the computing load is lightest if the data block needing to be retrieved is stored among multiple data servers. Experiments show that the method proposed in this paper can satisfy the demands of frequent and fast data query.
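The greedy retrieval step can be sketched as follows, in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: load is measured simply as the number of queries already assigned to a server, and the names `allocate_queries`, `replicas`, and the server and block labels are hypothetical.

```python
def allocate_queries(block_requests, replicas):
    """Assign each requested block to the least-loaded server that stores it.

    block_requests: list of block ids, in arrival order.
    replicas: dict mapping a block id to the list of servers holding a copy.
    Returns (assignment, load), where assignment is a list of (block, server).
    """
    load = {}
    for servers in replicas.values():
        for server in servers:
            load.setdefault(server, 0)
    assignment = []
    for block in block_requests:
        candidates = replicas[block]
        # Greedy choice: lightest current load; ties go to the first listed server.
        server = min(candidates, key=lambda s: load[s])
        load[server] += 1
        assignment.append((block, server))
    return assignment, load
```

With redundant copies spread over different servers, repeated requests for a popular block are alternated between its replicas instead of queuing on a single server.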
Algorithms for distributed and mobile sensing
NASA Astrophysics Data System (ADS)
Isler, Ibrahim Volkan
Sensing remote, complex and large environments is an important task that arises in diverse applications including planetary exploration, monitoring forest fires and the surveillance of large factories. Currently, automation of such sensing tasks in complex environments is achieved either by deploying many stationary sensors to the environment, or by mounting a sensor on a mobile device and using the device to sense the environment. The Eighties and Nineties witnessed tremendous advances in both distributed and mobile sensing technologies. To take advantage of these technologies, it is crucial to design algorithms to perform sensing tasks in an autonomous fashion. In this dissertation, we study four fundamental sensing problems that arise in sensing complex environments with distributed and mobile systems. For mobile sensing systems we study exploration and pursuit-evasion problems. In the exploration problem, the goal is to design a strategy for a mobile robot so that the robot sees every point in an unknown environment as quickly as possible. In the pursuit-evasion problem, the goal is to design a strategy for a pursuer to capture an adversarial evader. For distributed sensing systems we study placement and assignment problems. In the placement problem, the goal is to place sensors to an environment so that every point in the environment is in the range of at least one sensor. The assignment problem deals with the issue of assigning targets to sensors in a network, so that overall error in estimating the position of the targets is minimized. We present algorithms to perform these sensing tasks in an efficient fashion. Performance guarantees of the algorithms are mathematically proven and evaluated by simulations.
New sampling distributions for evolution algorithms
NASA Astrophysics Data System (ADS)
Sweeney, Francis Dermot
Evolution algorithms are stochastic optimization methods based on evolutionary principles. They have long been used in optimization, and are gaining in popularity. They are particularly useful in high dimensional problems, or in problems where gradient methods fail. Evolution strategies, a class of evolutionary algorithms, are stochastic searches which evolve by mutation. This work proposes a new mutation distribution for use in single objective optimization. Up to now, cost function information obtained by mutations that do not improve fitness has been discarded. In many problems, particularly when cost function calls are expensive, it is desirable to use all available information to guide the search. The new method in this work patches Gaussians of different variances together to create a sampling distribution which delivers mutations designed to direct the search away from regions where low values of fitness have been observed. Analytic results for this new method are derived on idealized problems. The method is compared with existing methods on a range of test problems, and its overall performance attributes are assessed. A new method for multiobjective optimization is also developed. Genetic Algorithms introduce innovation into their populations by a process of bit mutation. This small scale mutation is often insufficient to successfully direct the search, unless the initial population is of sufficient quality. The new method proposed here, termed Rank Biased Sampling, uses the population to create new members, which are resampled across the entire search space from a distribution designed to favor regions which are inadequately represented by the current population. Again, this method is compared to existing methods on some standard test problems. These new optimization methods are then applied to some real-world problems of engineering interest. The optimization routines developed in this work performed well on these applications, and provide good
Parallel algorithm to analyze the brain signals: application on epileptic spikes.
Keshri, Anup Kumar; Das, Barda Nand; Mallick, Dheeresh Kumar; Sinha, Rakesh Kumar
2011-02-01
In the current work, we propose a parallel algorithm for the recognition of epileptic spikes (ES) in EEG. Automated systems are used in the biomedical field to help doctors and pathologists by producing the results of an inspection in real time. Generally, the biomedical signal data to be processed are very large. A uniprocessor computer is limited in speed, so even the fastest available computer with the latest configuration may not produce results in real time for such immense computation. Parallel computing can be a useful tool for processing huge data sets at higher speed. The proposed algorithm applies data parallelism, in which multiple processors perform the same operation on different parts of the data to produce results quickly. All the processors are connected to each other by an interconnection network. The complexity of the algorithm was analyzed as Θ((n + δn) / N), where n is the length of the input data, N is the number of processors used in the algorithm, and δn is the amount of overlapped data between two consecutive intermediate processors (IPs). The algorithm is scalable, as the level of parallelism increases linearly with the number of processors. The algorithm has been implemented with the Message Passing Interface (MPI). It was tested with 60-min recorded EEG signal data files. The average recognition rate of ES was 95.68%.
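The role of the overlapped data between consecutive processors can be sketched as follows. This is only an illustration under stated assumptions: threads stand in for MPI ranks, the "spike detector" is a hypothetical local-maximum threshold test rather than the authors' recognizer, and the overlap must be at least 2 samples so that a peak sitting on a chunk boundary is still interior to one chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_with_overlap(n, workers, overlap):
    """Return (start, end) ranges; each chunk extends `overlap` samples into the next."""
    base = -(-n // workers)  # ceiling division
    return [(start, min(n, start + base + overlap)) for start in range(0, n, base)]


def find_peaks(chunk, offset, threshold):
    """Hypothetical detector: interior local maxima above a threshold, as global indices."""
    return [
        offset + j
        for j in range(1, len(chunk) - 1)
        if chunk[j] > threshold and chunk[j] >= chunk[j - 1] and chunk[j] > chunk[j + 1]
    ]


def detect_parallel(signal, workers, overlap, threshold):
    ranges = split_with_overlap(len(signal), workers, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker sees only its own slice, as a rank would in a message-passing setting.
        parts = pool.map(lambda r: find_peaks(signal[r[0]:r[1]], r[0], threshold), ranges)
        # Overlap can report the same peak twice, so merge through a set.
        return sorted({idx for part in parts for idx in part})


signal = [0, 0, 0, 0, 9, 0, 0, 8, 0, 0, 0, 0]
print(detect_parallel(signal, workers=3, overlap=2, threshold=5))
```

Each worker handles roughly n/N samples plus the overlap, which is the intuition behind the (n + δn)/N form of the stated complexity.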
Two and Three-Dimensional Nonlocal DFT for Inhomogeneous Fluids I: Algorithms and Parallelization
Frink, Laura J. Douglas; Salinger, Andrew
1999-08-09
Fluids adsorbed near surfaces, macromolecules, and in porous materials are inhomogeneous, exhibiting spatially varying density distributions. This inhomogeneity in the fluid plays an important role in controlling a wide variety of complex physical phenomena including wetting, self-assembly, corrosion, and molecular recognition. One of the key methods for studying the properties of inhomogeneous fluids in simple geometries has been density functional theory (DFT). However, there has been a conspicuous lack of calculations in complex 2D and 3D geometries. The computational difficulty arises from the need to perform nested integrals that are due to nonlocal terms in the free energy functional. These integral equations are expensive both in evaluation time and in memory requirements; however, the expense can be mitigated by intelligent algorithms and the use of parallel computers. This paper details our efforts to develop efficient numerical algorithms so that nonlocal DFT calculations in complex geometries that require two or three dimensions can be performed. The success of this implementation will enable the study of solvation effects at heterogeneous surfaces, in zeolites, in solvated (bio)polymers, and in colloidal suspensions.
NASA Astrophysics Data System (ADS)
Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.
2016-10-01
Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.
The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce
NASA Astrophysics Data System (ADS)
Chen, Xi; Zhou, Liqing
2015-12-01
With the development of satellite remote sensing technology and the growth of remote sensing image data, traditional remote sensing image segmentation technology cannot meet the processing and storage requirements of massive remote sensing images. This article applies cloud computing and parallel computing technology to the remote sensing image segmentation process, and builds a cheap and efficient computer cluster system that uses parallel processing to implement the MeanShift algorithm for remote sensing image segmentation based on the MapReduce model. The approach not only ensures the quality of remote sensing image segmentation but also improves segmentation speed and better meets real-time requirements. The MapReduce-based parallel MeanShift algorithm for remote sensing image segmentation thus has practical significance and value.
Marucci, Evandro A.; Neves, Leandro A.; Valêncio, Carlo R.; Pinto, Alex R.; Cansian, Adriano M.; de Souza, Rogeria C. G.; Shiyou, Yang; Machado, José M.
2014-01-01
With the advance of genomic research, the number of sequences involved in comparative methods has grown immensely. Among them, there are methods for similarity calculation, which are used by many bioinformatics applications. Due to the huge amount of data, combining low-complexity methods with parallel computing is becoming desirable. k-mers counting is a very efficient method with good biological results. In this work, the development of a parallel algorithm for multiple sequence similarity calculation using the k-mers counting method is proposed. Tests show that the algorithm has very good scalability and a nearly linear speedup. With 14 nodes, a 12x speedup was obtained. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE. PMID:25140318
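As a rough illustration of k-mers counting, the sketch below scores a pair of sequences by the fraction of k-mers they share. This is one common definition, assumed here for illustration; the abstract does not state which similarity formula the authors use. Because each pair is scored independently, the list of pairs is the natural unit to distribute across nodes.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k):
    """Shared k-mers (with multiplicity) divided by the k-mer count of the shorter sequence."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    shortest = min(len(a), len(b)) - k + 1
    return shared / shortest if shortest > 0 else 0.0


def similarity_matrix(seqs, k):
    """Score all pairs; every pair is independent, so this loop can be split across workers."""
    return {(i, j): kmer_similarity(seqs[i], seqs[j], k)
            for i, j in combinations(range(len(seqs)), 2)}


seqs = ["ACGTACGT", "ACGTACGT", "AAAAAAAA"]
print(similarity_matrix(seqs, k=3))
```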
Evaluation of a new parallel numerical parameter optimization algorithm for a dynamical system
NASA Astrophysics Data System (ADS)
Duran, Ahmet; Tuncel, Mehmet
2016-10-01
It is important to have a scalable parallel numerical parameter optimization algorithm for a dynamical system used in financial applications where time limitation is crucial. We use Message Passing Interface parallel programming and present such a new parallel algorithm for parameter estimation. For example, we apply the algorithm to the asset flow differential equations that have been developed and analyzed since 1989 (see [3-6] and references contained therein). We achieved speed-up for some time series to run up to 512 cores (see [10]). Unlike [10], we consider more extensive financial market situations, for example, in presence of low volatility, high volatility and stock market price at a discount/premium to its net asset value with varying magnitude, in this work. Moreover, we evaluated the convergence of the model parameter vector, the nonlinear least squares error and maximum improvement factor to quantify the success of the optimization process depending on the number of initial parameter vectors.
Parallel Algorithms for Computer Vision on the Connection Machine.
1986-11-01
INTELLIGENCE LAB J J LITTLE UNCLASSIFIED NOV 86 AI-M-928 DRCA76-85-C-0818 F/G 12/7 NL EmlolEllllllEIIIIIIEIIEIIIE El ..... 9’-2 4 2. 0 ~~1.8 .22 -C% .1...connect ed cornpontoit label ii ig (; ii’, - Bl~~~ ello ( h explained the use of se 1’irig. osltdoiayo i o tii - and devise~d the NIST algorithm), Mike 1...tinliated """ Edge (hetec t ion . Convolution 3ms 2rus . Find Zero-Crossings 0.5ms (.57, I Propagate lal) el 36ms 3r)ins * La urricrate c u rves 350ps
A Parallel Compact Multi-Dimensional Numerical Algorithm with Aeroacoustics Applications
NASA Technical Reports Server (NTRS)
Povitsky, Alex; Morris, Philip J.
1999-01-01
In this study we propose a novel method to parallelize high-order compact numerical algorithms for the solution of three-dimensional PDEs (Partial Differential Equations) in a space-time domain. For this numerical integration most of the computer time is spent in computation of spatial derivatives at each stage of the Runge-Kutta temporal update. The most efficient direct method to compute spatial derivatives on a serial computer is a version of Gaussian elimination for narrow linear banded systems known as the Thomas algorithm. In a straightforward pipelined implementation of the Thomas algorithm, processors are idle due to the forward and backward recurrences of the Thomas algorithm. To utilize processors during this time, we propose to use them for either non-local data independent computations, solving lines in the next spatial direction, or local data-dependent computations by the Runge-Kutta method. To achieve this goal, control of processor communication and computations by a static schedule is adopted. Thus, our parallel code is driven by a communication and computation schedule instead of the usual "creative programming" approach. The obtained parallelization speed-up of the novel algorithm is about twice as much as that for the standard pipelined algorithm and close to that for the explicit DRP algorithm.
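For reference, the serial Thomas algorithm mentioned above can be sketched as below. This is the textbook tridiagonal solver, not the authors' parallel schedule; the argument layout (with `lower[0]` and `upper[-1]` unused) is an assumption made for this example. The forward sweep depends on the previous row and the backward sweep on the next row, which is the pair of recurrences that leaves pipelined processors idle.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward recurrence: row i needs the result of row i - 1.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Backward recurrence: row i needs the solution at row i + 1.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# 3x3 system with 2 on the diagonal and 1 off the diagonal; the exact solution is all ones.
x = thomas_solve([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [3.0, 4.0, 3.0])
print(x)
```

The sketch does no pivoting, so it assumes a system (such as a diagonally dominant one) for which plain elimination is stable.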
NASA Astrophysics Data System (ADS)
Quillen, Alice C.; Moore, A.
2008-09-01
Planetesimal and dust dynamical simulations require collision and nearest neighbor detection. A brute force implementation for sorting interparticle distances requires O(N^2) computations for N particles, limiting the numbers of particles that have been simulated. Parallel algorithms recently developed for the GPU (graphics processing unit), such as the radix sort, can run as fast as O(N) and sort distances between a million particles in a few hundred milliseconds. We introduce improvements in collision and nearest neighbor detection algorithms and how we have incorporated them into our efficient parallel 2nd order democratic heliocentric method symplectic integrator written in NVIDIA's CUDA for the GPU.
Genetic algorithms in a distributed computing environment using PVM
Cronje, G.A.; Steeb, W.H.
1997-04-01
The Parallel Virtual Machine (PVM) is a software system that enables a collection of heterogeneous computer systems to be used as a coherent and flexible concurrent computation resource. We show that genetic algorithms can be implemented using a Parallel Virtual Machine and C++. Problems with constraints are also discussed.
NASA Astrophysics Data System (ADS)
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
2016-09-01
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2-D and a random hydraulic conductivity field in 3-D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10^{1} to ~10^{2} in a multicore computational environment. Therefore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate to large-scale problems.
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
2016-09-01
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally-efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace, such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2D and a random hydraulic conductivity field in 3D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10^{1} to ~10^{2} in a multi-core computational environment. Furthermore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate- to large-scale problems.
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
2016-09-01
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally-efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace, such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2D and a random hydraulic conductivity field in 3D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10^{1} to ~10^{2} in a multi-core computational environment. Furthermore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate- to large-scale problems.
Wang, Zhiteng; Zhang, Hongjun; Zhang, Rui; Li, Yong; Zhang, Xuliang
2014-01-01
Service-oriented modeling and simulation is a hot topic in the field of modeling and simulation, and service resources need to be called while a simulation task workflow is running. How to optimize service resource allocation to ensure that the task is completed effectively is an important issue in this area. In the military modeling and simulation field, it is important to improve the probability of success and the timeliness of simulation task workflows. Therefore, this paper proposes an optimization algorithm for multipath service resource parallel allocation, in which a multipath service resource parallel allocation model is built and a quantum optimization algorithm with a multiple-chain coding scheme is used to solve it. The multiple-chain coding scheme extends the parallel search space to improve search efficiency. Through simulation experiments, this paper investigates how different optimization algorithms, service allocation strategies, and path numbers affect the probability of success of the simulation task workflow, and the simulation results show that the optimization algorithm for multipath service resource parallel allocation is an effective method for improving the probability of success and timeliness of simulation task workflows.
Distributed Storage Algorithm for Geospatial Image Data Based on Data Access Patterns.
Pan, Shaoming; Li, Yongkai; Xu, Zhengquan; Chong, Yanwen
2015-01-01
Declustering techniques are widely used in distributed environments to reduce query response time through parallel I/O by splitting large files into several small blocks and then distributing those blocks among multiple storage nodes. Unfortunately, however, many small geospatial image data files cannot be further split for distributed storage. In this paper, we propose a complete theoretical system for the distributed storage of small geospatial image data files based on mining the access patterns of geospatial image data using their historical access log information. First, an algorithm is developed to construct an access correlation matrix based on the analysis of the log information, which reveals the patterns of access to the geospatial image data. Then, a practical heuristic algorithm is developed to determine a reasonable solution based on the access correlation matrix. Finally, a number of comparative experiments are presented, demonstrating that our algorithm displays a higher total parallel access probability than those of other algorithms by approximately 10-15% and that the performance can be further improved by more than 20% by simultaneously applying a copy storage strategy. These experiments show that the algorithm can be applied in distributed environments to help realize parallel I/O and thereby improve system performance.
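The two stages described above, building an access correlation matrix from logs and then placing files heuristically, can be sketched roughly as follows. Everything here is an assumption made for illustration: the log is modeled as a list of sessions (sets of files requested together), correlation is a plain co-access count, and the placement rule is a simple greedy heuristic that keeps frequently co-accessed files on different nodes so they can be read in parallel. The paper's actual matrix construction and heuristic may differ.

```python
from collections import defaultdict
from itertools import combinations


def access_correlation(sessions):
    """Count how often each pair of files appears in the same session."""
    corr = defaultdict(int)
    for files in sessions:
        for a, b in combinations(sorted(set(files)), 2):
            corr[(a, b)] += 1
    return corr


def place_files(files, corr, nodes):
    """Greedy placement: put each file on the node where it conflicts least."""
    def c(a, b):
        return corr.get((a, b), 0) + corr.get((b, a), 0)

    placement = [[] for _ in range(nodes)]
    # Handle the most strongly correlated files first; ties broken by name.
    order = sorted(files, key=lambda f: (-sum(c(f, g) for g in files if g != f), f))
    for f in order:
        best = min(
            range(nodes),
            key=lambda n: (sum(c(f, g) for g in placement[n]), len(placement[n]), n),
        )
        placement[best].append(f)
    return placement


sessions = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
corr = access_correlation(sessions)
placement = place_files(["a", "b", "c", "d"], corr, nodes=2)
print(placement)
```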
Experiments with a Parallel Multi-Objective Evolutionary Algorithm for Scheduling
NASA Technical Reports Server (NTRS)
Brown, Matthew; Johnston, Mark D.
2013-01-01
Evolutionary multi-objective algorithms have great potential for scheduling in those situations where tradeoffs among competing objectives represent a key requirement. One challenge, however, is runtime performance, as a consequence of evolving not just a single schedule, but an entire population, while attempting to sample the Pareto frontier as accurately and uniformly as possible. The growing availability of multi-core processors in end user workstations, and even laptops, has raised the question of the extent to which such hardware can be used to speed up evolutionary algorithms. In this paper we report on early experiments in parallelizing a Generalized Differential Evolution (GDE) algorithm for scheduling long-range activities on NASA's Deep Space Network. Initial results show that significant speedups can be achieved, but that performance does not necessarily improve as more cores are utilized. We describe our preliminary results and some initial suggestions from parallelizing the GDE algorithm. Directions for future work are outlined.
A New Parallel Algorithm Analogous to Elastic Net Method for Bipartite Subgraph Problem
NASA Astrophysics Data System (ADS)
Tang, Zheng; Wang, Rong Long; Wang, Jia Hai; Cao, Qi Ping
The goal of the bipartite subgraph problem, which is an NP-complete problem, is to remove the minimum number of edges in a given graph such that the remaining graph is a bipartite graph. Enlightened by the elastic net method that was introduced by Durbin and Willshaw for finding shortest route for the Traveling Salesman Problem (TSP), we proposed a new parallel algorithm for the bipartite subgraph problem. The approach jointly tends to satisfy the constraint condition and minimizes the number of removed edges. The collective computational properties of the proposed approach are also proved theoretically. A large number of instances have been simulated to verify the proposed algorithm. The simulation results show that our algorithm finds a solution superior to that of the best existing parallel algorithms.
NASA Astrophysics Data System (ADS)
Qin, Cheng-Zhi; Zhan, Lijun
2012-06-01
As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU
Tahara, Tatsuki; Shimozato, Yuki; Xia, Peng; Ito, Yasunori; Awatsuji, Yasuhiro; Nishio, Kenzo; Ura, Shogo; Matoba, Osamu; Kubota, Toshihiro
2012-08-27
We propose an image-reconstruction algorithm of parallel phase-shifting digital holography (PPSDH) which is a technique of single-shot phase-shifting interferometry. In the conventional algorithms in PPSDH, the residual 0th-order diffraction wave and the conjugate images cannot be removed completely and a part of space-bandwidth information is discarded. The proposed algorithm can remove these residual images by modifying the calculation of phase-shifting interferometry and by using Fourier transform technique, respectively. Then, several types of complex amplitudes are derived from a recorded hologram according to the directions in which the neighboring pixels used for carrying out the spatial phase-shifting interferometry are aligned. Several distributions are Fourier-transformed and wide space-bandwidth information of the object wave is obtained by selecting the spectrum among the Fourier-transformed images in each region of the spatial frequency domain and synthesizing a Fourier-transformed image from the spectrum.
Sentence comprehension: A parallel distributed processing approach. Technical report
McClelland, J.L.; St John, M.; Taraban, R.
1989-07-14
Basic aspects of conventional approaches to sentence comprehension are reviewed, and some of the difficulties faced by models that take these approaches are pointed out. An alternative approach, based on the principles of parallel distributed processing, is described, and it is shown how it offers different answers to basic questions about the nature of the language processing mechanism. An illustrative simulation model captures the key characteristics of the approach and illustrates how it can cope with the difficulties faced by conventional models. Alternative ways of conceptualizing basic aspects of language processing within the framework of this approach are considered, along with how the approach can address several arguments that might be brought to bear against it, and avenues for future development are suggested.
Reusable Component Model Development Approach for Parallel and Distributed Simulation
Zhu, Feng; Yao, Yiping; Chen, Huilong; Yao, Feng
2014-01-01
Model reuse is a key issue to be resolved in parallel and distributed simulation at present. However, component models built by different domain experts usually have diversiform interfaces, couple tightly, and bind with simulation platforms closely. As a result, they are difficult to be reused across different simulation platforms and applications. To address the problem, this paper first proposed a reusable component model framework. Based on this framework, then our reusable model development approach is elaborated, which contains two phases: (1) domain experts create simulation computational modules observing three principles to achieve their independence; (2) model developer encapsulates these simulation computational modules with six standard service interfaces to improve their reusability. The case study of a radar model indicates that the model developed using our approach has good reusability and it is easy to be used in different simulation platforms and applications. PMID:24729751
A new chirp scaling algorithm of bistatic SAR with parallel flight paths
NASA Astrophysics Data System (ADS)
Li, Ning; Wang, Luping
2011-10-01
Deriving a precise point target reference spectrum for bistatic SAR has long been a difficult problem. Many of the currently available algorithms rely on approximations in their derivation. This paper derives the precise expression in the Doppler-frequency domain for a configuration with parallel flight paths and constant velocity of each platform. A new chirp scaling algorithm is then put forward. Finally, simulations are given to demonstrate the good focusing performance.
OpenMP Parallelization and Optimization of Graph-based Machine Learning Algorithms
2016-05-01
mzhy@ucla.edu aekoniges@lbl.gov Abstract. We investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The...new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data ...supercomputer nodes (in our case a Cray XC30), and predict behavior on emerging testbed systems based on Intel’s Knights Corner and Landing processors. We
NASA Astrophysics Data System (ADS)
Yerkes, Christopher R.; Webster, Eric D.
1994-06-01
Advanced algorithms for synthetic aperture radar (SAR) imaging have in the past required computing capabilities only available from high performance special purpose hardware. Such architectures have tended to have short life cycles with respect to development expense. Current generation Massively Parallel Processors (MPP) are offering high performance capabilities necessary for such applications with both a scalable architecture and a longer projected life cycle. In this paper we explore issues associated with implementation of a SAR imaging algorithm on a mesh configured MPP architecture.
Parallel algorithms of relative radiometric correction for images of TH-1 satellite
NASA Astrophysics Data System (ADS)
Wang, Xiang; Zhang, Tingtao; Cheng, Jiasheng; Yang, Tao
2014-05-01
TH-1, the first generation of transitive stereo-metric satellites in China, can acquire three-line-array stereo images with a resolution of 5 meters, multispectral images with a resolution of 10 meters, and panchromatic high-resolution images with a resolution of 2 meters. The procedure that takes high-resolution images from level 0 to level 1A is called relative radiometric correction (RRC). The processing algorithm for high-resolution images, which involve large volumes of data, is complicated and time consuming. To increase processing speed, practitioners commonly apply parallel processing techniques based on CPUs or GPUs. This article first introduces the whole process, and each step, of the RRC algorithm currently applied to level 0 high-resolution images. Second, it briefly describes the theory and characteristics of the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) parallel programming techniques, as well as the advantages of parallel techniques in the image processing field. Third, for each step of the applied algorithm and based on the MPI+OpenMP hybrid paradigm, it discusses in depth the parallelizability and parallelization strategies of three processing steps: Radiometric Correction, Splicing Pieces of TDICCD (Time Delay Integration Charge-Coupled Device), and Gray Level Adjustment among pieces of TDICCD. It further deduces the theoretical acceleration rates of each step and of the whole procedure, according to the processing styles and independence of calculation. For the step Splicing Pieces of TDICCD, two different parallelization strategies are proposed, to be chosen according to hardware capabilities. Finally, a series of experiments is carried out to verify the parallel algorithms using 2-meter panchromatic high-resolution images from the TH-1 satellite, and the experimental results are analyzed. Strictly on the basis of former parallel algorithms, the programs in the experiments
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images
Du, Xiaogang; Dang, Jianwu; Wang, Yangping; Wang, Song; Lei, Tao
2016-01-01
The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU). PMID:28053653
Guo, Wensheng; Yang, Guowu; Wu, Wei; He, Lei; Sun, Mingyu
2014-01-01
In biological systems, the dynamic analysis method has gained increasing attention in the past decade. The Boolean network is the most common model of a genetic regulatory network. The interactions of activation and inhibition in the genetic regulatory network are modeled as a set of functions of the Boolean network, while the state transitions in the Boolean network reflect the dynamic property of a genetic regulatory network. A difficult problem for state transition analysis is the finding of attractors. In this paper, we modeled the genetic regulatory network as a Boolean network and proposed a solving algorithm to tackle the attractor finding problem. In the proposed algorithm, we partitioned the Boolean network into several blocks consisting of the strongly connected components according to their gradients, and defined the connections between blocks as decision nodes. Based on the solutions calculated on the decision nodes and using a satisfiability solving algorithm, we identified the attractors in the state transition graph of each block. The proposed algorithm is benchmarked on a variety of genetic regulatory networks. Compared with existing algorithms, it achieved similar performance on small test cases and outperformed them on larger and more complex ones, which reflects the trend in modern genetic regulatory networks. Furthermore, while the existing satisfiability-based algorithms cannot be parallelized due to their inherent algorithm design, the proposed algorithm exhibits good scalability on parallel computing architectures. PMID:24718686
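To make the notion of an attractor concrete, the following minimal sketch enumerates the attractors of a tiny synchronous Boolean network by brute force. It is not the block-partitioning, satisfiability-based method described in the abstract; the function names and the three-gene toy network are hypothetical and chosen only for illustration.

```python
from itertools import product


def find_attractors(update, n):
    """Enumerate attractors of a synchronous Boolean network with n nodes by brute force."""
    attractors = set()
    for state in product((0, 1), repeat=n):
        seen = {}
        path = []
        # Follow the trajectory until a state repeats; the repeated part is a cycle.
        while state not in seen:
            seen[state] = len(path)
            path.append(state)
            state = update(state)
        cycle = path[seen[state]:]
        # Rotate the cycle to start at its smallest state so each attractor is stored once.
        k = cycle.index(min(cycle))
        attractors.add(tuple(cycle[k:] + cycle[:k]))
    return attractors


def toy_update(s):
    """Hypothetical 3-gene network: a activates b, b activates c, c inhibits a."""
    a, b, c = s
    return (1 - c, a, b)


attractors = find_attractors(toy_update, 3)
```

Brute force visits all 2^n states, which is why methods such as the one in the abstract partition the network and use a satisfiability solver instead.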
On Parallel Push-Relabel based Algorithms for Bipartite Maximum Matching
Langguth, Johannes; Azad, Md Ariful; Halappanavar, Mahantesh; Manne, Fredrik
2014-07-01
We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial (graph) problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing maximum transversal of a matrix. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a testset comprised of a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for the parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.
NASA Astrophysics Data System (ADS)
Lin, K.-M.; Hu, M.-H.; Hung, C.-T.; Wu, J.-S.; Hwang, F.-N.; Chen, Y.-S.; Cheng, G.
2012-12-01
Development of a hybrid numerical algorithm that weakly couples the gas flow model (GFM) and the plasma fluid model (PFM) for simulating an atmospheric-pressure plasma jet (APPJ), and its acceleration by two approaches, is presented. The weak coupling between gas flow and discharge is introduced by exchanging the results obtained from the steady-state solution of the GFM and the cycle-averaged solution of the PFM, respectively. Approaches to reducing the overall runtime include parallel computing of the GFM and PFM solvers and employing a temporal multi-scale method (TMSM) for the PFM. Parallel computing of both solvers is realized using the domain decomposition method with the message passing interface (MPI) on distributed-memory machines. The TMSM considers only chemical reactions by ignoring the transport terms when integrating temporally the continuity equations of heavy species at each time step, and then the transport terms are restored only at an interval of time marching steps. The total reduction of runtime is 47% by applying the TMSM to the APPJ example presented in this study. Application of the proposed hybrid algorithm is demonstrated by simulating a parallel-plate helium APPJ impinging onto a substrate, for which the cycle-averaged properties of the 200th cycle are presented. The distribution patterns of species densities are strongly correlated with the background gas flow pattern, which shows that consideration of gas flow in APPJ simulations is critical.
Parallel Fock matrix construction with distributed shared memory model for the FMO-MO method.
Umeda, Hiroaki; Inadomi, Yuichi; Watanabe, Toshio; Yagi, Toru; Ishimoto, Takayoshi; Ikegami, Tsutomu; Tadano, Hiroto; Sakurai, Tetsuya; Nagashima, Umpei
2010-10-01
A parallel Fock matrix construction program for the FMO-MO method has been developed with the distributed shared memory model. To construct a large-sized Fock matrix during FMO-MO calculations, a distributed parallel algorithm was designed to make full use of local memory to reduce communication, and was implemented on the Global Array toolkit. A benchmark calculation for a small system indicates that the parallelization efficiency of the matrix construction portion is as high as 93% at 1,024 processors. A large FMO-MO application on the epidermal growth factor receptor (EGFR) protein (17,246 atoms and 96,234 basis functions) was also carried out at the HF/6-31G level of theory, with the frontier orbitals being extracted by a Sakurai-Sugiura eigensolver. It takes 11.3 h for the FMO calculation, 49.1 h for the Fock matrix construction, and 10 min to extract 94 eigen-components on a PC cluster system using 256 processors.
Creating IRT-Based Parallel Test Forms Using the Genetic Algorithm Method
ERIC Educational Resources Information Center
Sun, Koun-Tem; Chen, Yu-Jen; Tsai, Shu-Yen; Cheng, Chien-Fen
2008-01-01
In educational measurement, the construction of parallel test forms is often a combinatorial optimization problem that involves the time-consuming selection of items to construct tests having approximately the same test information functions (TIFs) and constraints. This article proposes a novel method, genetic algorithm (GA), to construct parallel…
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.
Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael
2017-03-30
Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about 20 times better than the fastest sequential algorithm and speed-up goes up to 30-40 on 64 threads.
Parallel algorithms for computer vision. Final report, 31 August 1988-31 January 1990
Poggio, T.
1990-04-01
The main effort in this project has been directed towards the development of an integrated vision system, the Vision Machine, based on a parallel supercomputer. The core of the Vision Machine is a set of parallel algorithms for visual recognition and navigation in an unstructured environment. The present version of the Vision Machine has been demonstrated to process images in close to real time by (1) first computing several low-level cues, such as edges, stereo disparity, optical flow, color and texture, (2) integrating them to extract a cartoon-like description of the scene in terms of the physical discontinuities of surfaces, and (3) using this cartoon in a recognition stage, based on parallel model matching. In addition to the development of the parallel algorithms, their implementation and testing, we have also done substantial work in several closely related areas. These include (1) design and fabrication of VLSI circuits to transfer some of the software algorithms to potentially cheap and fast hardware, (2) initial development of techniques to synthesize vision algorithms by learning, and (3) several projects involving autonomous navigation of small robots.
A 1 log N parallel algorithm for detecting convex hulls on image boards.
Lin, J C; Lin, J Y
1998-01-01
By finding the maximum and minimum of {y_i - m x_i | 1 ≤ i ≤ N}, [...] parallel algorithm to obtain the convex hull of N arbitrarily given points on an image board. The mathematical theory needed is included, and computation time is 1 log N.
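The geometric idea behind the extremes of y_i - m x_i can be sketched as follows: for a fixed slope m, the points that maximize or minimize y - m x are the ones touched by the two supporting lines of slope m, so they lie on the convex hull. The sketch below is only an illustration under that reading; the choice of slopes, the function name, and the serial loop are assumptions, and it does not reproduce the authors' parallel procedure or handle vertical supporting lines.

```python
def extreme_points(points, slopes):
    """Collect points that maximise or minimise y - m*x for each slope m."""
    found = set()
    for m in slopes:
        def offset(p, m=m):
            # Intercept of the line of slope m passing through point p.
            return p[1] - m * p[0]
        found.add(max(points, key=offset))
        found.add(min(points, key=offset))
    return found


# Four corners of a square plus two interior points (illustrative data).
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull_candidates = extreme_points(points, slopes=[-1, 0, 1])
```

Each slope can be processed independently, which is what makes this formulation attractive for a parallel machine.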
Ellison, C. Leland; Finn, J. M.; Qin, H.; Tang, William M.
2014-10-01
Structure-preserving algorithms obtained via discrete variational principles exhibit strong promise for the calculation of guiding center test particle trajectories. The non-canonical Hamiltonian structure of the guiding center equations forms a novel and challenging context for geometric integration. To demonstrate the practical relevance of these methods, a prototypical variational midpoint algorithm is applied to an experimental magnetic equilibrium. The stability characteristics, conservation properties, and implementation requirements associated with the variational algorithms are addressed. Furthermore, computational run time is reduced for large numbers of particles by parallelizing the calculation on GPU hardware.
Nexus: An interoperability layer for parallel and distributed computer systems
Foster, I.; Kesselman, C.; Olson, R.; Tuecke, S.
1994-05-01
Nexus is a set of services that can be used to implement various task-parallel languages, data-parallel languages, and message-passing libraries. Nexus is designed to permit the efficient portable implementation of individual parallel programming systems and the interoperability of programs developed with different tools. Nexus supports lightweight threading and active message technology, allowing integration of message passing and threads.
Fast parallel molecular algorithms for DNA-based computation: factoring integers.
Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui
2005-06-01
The RSA public-key cryptosystem is an algorithm that encrypts input data into an unrecognizable form and decrypts the unrecognizable data back into its original form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates how to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that public-key cryptosystems are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
Kakue, Takashi; Moritani, Yuri; Ito, Kenichi; Shimozato, Yuki; Awatsuji, Yasuhiro; Nishio, Kenzo; Ura, Shogo; Kubota, Toshihiro; Matoba, Osamu
2010-04-26
We propose an algorithm that can improve the quality of the image reconstructed from a single hologram recorded by the optical system of parallel four-step phase-shifting digital holography. The proposed algorithm applies the image-reconstruction algorithm of parallel two-step phase-shifting digital holography to the hologram so as to reduce errors in the reconstructed image and eliminate ghosts. We confirmed numerically that the proposed algorithm decreased the root mean square error in amplitude by 25%, and experimentally that it eliminated the ghosts.
NASA Astrophysics Data System (ADS)
Wang, Congzhe; Fang, Yuefa; Guo, Sheng
2015-07-01
Dimensional synthesis is one of the most difficult issues in the field of parallel robots with actuation redundancy. To deal with the optimal design of a redundantly actuated parallel robot used for ankle rehabilitation, a methodology of dimensional synthesis based on multi-objective optimization is presented. First, the dimensional synthesis of the redundant parallel robot is formulated as a nonlinear constrained multi-objective optimization problem. Then four objective functions, separately reflecting occupied space, input/output transmission and torque performances, and multi-criteria constraints, such as dimension, interference and kinematics, are defined. Because the passive exercise of plantar/dorsiflexion requires a large output moment, a torque index is proposed. To cope with the actuation redundancy of the parallel robot, a new output transmission index is defined as well. The multi-objective optimization problem is solved by using a modified Differential Evolution (DE) algorithm, which is characterized by new selection and mutation strategies. Meanwhile, a special penalty method is presented to tackle the multi-criteria constraints. Finally, numerical experiments for different optimization algorithms are implemented. The computation results show that the proposed indices of output transmission and torque, and the constraint handling, are effective for the redundant parallel robot; the modified DE algorithm is superior to the other tested algorithms in terms of global search ability and the number of non-dominated solutions. The proposed methodology of multi-objective optimization can also be applied to the dimensional synthesis of other redundantly actuated parallel robots with only rotational movements.
A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform
Yang, Lin; Gong, Leiguang; Zhang, Hong; Nosher, John L.; Foran, David J.
2013-01-01
Point matching is crucial for many computer vision applications. Establishing the correspondence between a large number of data points is a computationally intensive process. Some point matching related applications, such as medical image registration, require real time or near real time performance if applied to critical clinical applications like image assisted surgery. In this paper, we report a new multicore platform based parallel algorithm for fast point matching in the context of landmark based medical image registration. We introduce a non-regular data partition algorithm that uses the K-means clustering algorithm to group the landmarks based on the number of available processing cores, which optimizes the memory usage and data transfer. We have tested our method using the IBM Cell Broadband Engine (Cell/B.E.) platform. The results demonstrate a significant speedup over the sequential implementation. The proposed data partition and parallelization algorithm, though tested only on one multicore platform, is generic by design. Therefore the parallel algorithm can be extended to other computing platforms, as well as other point matching related applications. PMID:24308014
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Parallel algorithm for determining motion vectors in ice floe images by matching edge features
NASA Technical Reports Server (NTRS)
Manohar, M.; Ramapriyan, H. K.; Strong, J. P.
1988-01-01
A parallel algorithm is described to determine motion vectors of ice floes using time sequences of images of the Arctic ocean obtained from the Synthetic Aperture Radar (SAR) instrument flown on-board the SEASAT spacecraft. The algorithm, which is implemented on the MPP, locates corresponding objects based on their translationally and rotationally invariant features. It first approximates the edges in the images by polygons or sets of connected straight-line segments. Each such edge structure is then reduced to a seed point. Associated with each seed point are the descriptions (lengths, orientations and sequence numbers) of the lines constituting the corresponding edge structure. A parallel matching algorithm is used to match packed arrays of such descriptions to identify corresponding seed points in the two images. The matching algorithm is designed such that fragmentation and merging of ice floes are taken into account by accepting partial matches. The technique has been demonstrated to work on synthetic test patterns and real image pairs from SEASAT in times ranging from 0.5 to 0.7 seconds for 128 x 128 images.
Parallelism exploitation of a PCA algorithm for hyperspectral images using RVC-CAL
NASA Astrophysics Data System (ADS)
Lazcano, R.; Sidrach-Cardona, I.; Madroñal, D.; Desnos, K.; Pelcat, M.; Juárez, E.; Sanz, C.
2016-10-01
Hyperspectral imaging (HI) collects information from across the electromagnetic spectrum, covering a wide range of wavelengths. The tremendous development of this technology within the field of remote sensing has led to new research fields, such as automatic cancer detection or precision agriculture, but has also increased the performance requirements of the applications. For instance, strong time constraints need to be respected, since many applications require real-time responses. Achieving real-time performance is a challenge, as hyperspectral sensors generate high volumes of data to process. To meet this requirement, the initial image data first needs to be reduced by discarding redundancies and keeping only useful information. Then, the intrinsic parallelism in a system specification must be explicitly highlighted. In this paper, the PCA (Principal Component Analysis) algorithm is implemented using the RVC-CAL dataflow language, which specifies a system as a set of blocks or actors and allows its parallelization by scheduling the blocks over different processing units. Two implementations of PCA for hyperspectral images have been compared when aiming at obtaining the first few principal components: first, the algorithm has been implemented using the Jacobi approach for obtaining the eigenvectors; thereafter, the NIPALS-PCA algorithm, which approximates the principal components iteratively, has also been studied. Both implementations have been compared in terms of accuracy and computation time; then, the parallelization of both models has also been analyzed. These comparisons show promising results in terms of computation time and parallelization: the performance of the NIPALS-PCA algorithm is clearly better when only the first principal component is computed, while the partitioning of the algorithm execution over several cores shows an important speedup for the PCA-Jacobi. Thus, experimental results show the potential of RVC-CAL to automatically generate
Huang, Yu; Guo, Feng; Li, Yongling; Liu, Yufeng
2015-01-01
Parameter estimation for fractional-order chaotic systems is an important issue in fractional-order chaotic control and synchronization and could be essentially formulated as a multidimensional optimization problem. A novel algorithm called quantum parallel particle swarm optimization (QPPSO) is proposed to solve the parameter estimation for fractional-order chaotic systems. The parallel characteristic of quantum computing is used in QPPSO. This characteristic increases the calculation of each generation exponentially. The behavior of particles in quantum space is restrained by the quantum evolution equation, which consists of the current rotation angle, individual optimal quantum rotation angle, and global optimal quantum rotation angle. Numerical simulation based on several typical fractional-order systems and comparisons with some typical existing algorithms show the effectiveness and efficiency of the proposed algorithm.
An Optimal Parallel Algorithm for Constructing a Spanning Tree on Circular Permutation Graphs
NASA Astrophysics Data System (ADS)
Honma, Hirotoshi; Honma, Saki; Masuyama, Shigeru
The spanning tree problem is to find a tree that connects all the vertices of a graph G. This problem has many applications, such as electric power systems, computer network design and circuit analysis. Klein and Stein demonstrated that a spanning tree can be found in O(log n) time with O(n + m) processors on the CRCW PRAM. In general, it is known that more efficient parallel algorithms can be developed by restricting classes of graphs. Circular permutation graphs properly contain the set of permutation graphs as a subclass and were first introduced by Rotem and Urrutia. They provided an O(n^2.376) time recognition algorithm. Circular permutation graphs and their models find several applications in VLSI layout. In this paper, we propose an optimal parallel algorithm for constructing a spanning tree on circular permutation graphs. It runs in O(log n) time with O(n / log n) processors on the EREW PRAM.
A divide-and-inner product parallel algorithm for polynomial evaluation
Hu, Jie; Li, Lei; Nakamura, Tadao
1994-12-31
In this paper, a divide-and-inner product parallel algorithm for evaluating a polynomial of degree N (N+1 = KL) on a MIMD computer is presented. It needs 2K + log_2 L steps to evaluate a polynomial of degree N in parallel on L+1 processors (L ≤ 2K - 2 log_2 K), which is a decrease of log_2 L steps as compared with the L-order Horner's method, and a decrease of (2 log_2 L)^(1/2) steps as compared with some MIMD algorithms. The new algorithm is simple in structure and easy to implement.
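One plausible reading of the divide-and-inner-product idea is sketched below: the N+1 = KL coefficients are split into L blocks of K coefficients, each block is evaluated independently with Horner's rule (the part that could run one block per processor), and the block values are combined by an inner product with powers of x^K. This is an assumption-laden illustration, not the authors' exact schedule or step count; the function names are hypothetical and a serial loop stands in for the processors.

```python
def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def divide_and_inner_product(coeffs, x, L):
    """Evaluate a polynomial whose N+1 = K*L coefficients are split into L blocks."""
    if len(coeffs) % L != 0:
        raise ValueError("number of coefficients must be a multiple of L")
    K = len(coeffs) // L
    # Block j holds coefficients a[jK] .. a[jK + K - 1]; blocks are independent of each other.
    block_values = [horner(coeffs[j * K:(j + 1) * K], x) for j in range(L)]
    # Inner product of the block values with 1, x^K, x^(2K), ...
    xk = x ** K
    powers = [xk ** j for j in range(L)]
    return sum(b * p for b, p in zip(block_values, powers))


coeffs = [1, 2, 3, 4, 5, 6]  # 1 + 2x + 3x^2 + 4x^3 + 5x^4 + 6x^5
value = divide_and_inner_product(coeffs, 2, L=3)
```

With L = 3 the six coefficients form three blocks of K = 2, and the result agrees with plain Horner evaluation of the full polynomial.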
A Moldable Online Scheduling Algorithm and Its Application to Parallel Short Sequence Mapping
NASA Astrophysics Data System (ADS)
Saule, Erik; Bozdağ, Doruk; Catalyurek, Umit V.
A crucial step in DNA sequence analysis is mapping short sequences generated by next-generation instruments to a reference genome. In this paper, we focus on efficient online scheduling of multi-user parallel short sequence mapping queries on a multiprocessor system. With the availability of parallel execution models, the problem at hand becomes a moldable task scheduling problem where the number of processors needed to execute a task is determined by the scheduler. We propose an online scheduling algorithm to minimize the stretch of the tasks in the system. This metric provides improved fairness to small tasks compared to the flow time metric and is well suited to the nature of the problem. Experimental evaluation on two workload scenarios indicates that the algorithm results in significantly smaller stretch compared to a recent algorithm and that it is fairer to small tasks.
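The difference between flow time and stretch can be seen in a small calculation. The sketch below uses the common definition of stretch as flow time divided by the time the task would need if it ran alone; that definition, the function names, and the two example tasks are assumptions for illustration and are not taken from the paper.

```python
def flow_time(release, completion):
    """Time a task spends in the system, from release to completion."""
    return completion - release


def stretch(release, completion, work):
    """Flow time normalised by the task's own running time when run alone."""
    return flow_time(release, completion) / work


# (release, completion, work): a long task and a short task that waited behind it.
long_task = (0.0, 100.0, 100.0)
short_task = (0.0, 101.0, 1.0)

flows = [flow_time(*t[:2]) for t in (long_task, short_task)]
stretches = [stretch(*t) for t in (long_task, short_task)]
```

The two flow times are nearly equal (100 and 101), while the stretches are 1 and 101, which is why a stretch-based objective penalizes schedules that make short tasks wait behind long ones.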
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
[Parallel PLS algorithm using MapReduce and its application in spectral modeling].
Yang, Hui-Hua; Du, Ling-Ling; Li, Ling-Qiao; Tang, Tian-Biao; Guo, Tuo; Liang, Qiong-Lin; Wang, Yi-Ming; Luo, Guo-An
2012-09-01
Partial least squares (PLS) has been widely used in spectral analysis and modeling, but it is computation-intensive and time-consuming when dealing with massive data. To solve this problem effectively, a novel parallel PLS using MapReduce is proposed, which consists of two procedures: the parallelization of data standardization and the parallelization of principal component computation. Using NIR spectral modeling as an example, experiments were conducted on a Hadoop cluster, which is a collection of ordinary computers. The experimental results demonstrate that the proposed parallel PLS algorithm can handle massive spectra, significantly cuts down the modeling time, gains a basically linear speedup, and can be easily scaled up.
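The first of the two procedures, data standardization, maps naturally onto a map/reduce pattern: each chunk of rows emits per-column partial sums, and a reduce step adds them to obtain global means and standard deviations. The sketch below uses plain Python `map` and `functools.reduce` as a stand-in for a Hadoop job; the function names and the tiny data set are hypothetical, and the principal component step of the paper is not shown.

```python
from functools import reduce
import math


def map_partial_sums(chunk):
    """Map step: per-column count, sum, and sum of squares for one chunk of rows."""
    n_cols = len(chunk[0])
    sums = [sum(row[j] for row in chunk) for j in range(n_cols)]
    squares = [sum(row[j] ** 2 for row in chunk) for j in range(n_cols)]
    return len(chunk), sums, squares


def reduce_partial_sums(a, b):
    """Reduce step: add two partial results column by column."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])


def column_stats(chunks):
    """Global column means and population standard deviations from all chunks."""
    n, sums, squares = reduce(reduce_partial_sums, map(map_partial_sums, chunks))
    means = [s / n for s in sums]
    stds = [math.sqrt(max(q / n - m * m, 0.0)) for q, m in zip(squares, means)]
    return means, stds


def standardize(chunk, means, stds):
    """Second map step: centre and scale each row (constant columns are only centred)."""
    return [[(v - m) / s if s > 0 else v - m for v, m, s in zip(row, means, stds)]
            for row in chunk]


# Two chunks, as if stored on two different nodes (illustrative data).
chunks = [[[1.0, 10.0], [2.0, 10.0]], [[3.0, 10.0]]]
means, stds = column_stats(chunks)
scaled = [standardize(c, means, stds) for c in chunks]
```

Because the reduce step is associative, the partial sums can be combined in any order, which is what allows the work to be spread over many nodes.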
Wiese, Kay C; Hendriks, Andrew; Deschênes, Alain; Ben Youssef, Belgacem
2005-09-01
This paper presents a fully parallel version of RnaPredict, a genetic algorithm (GA) for RNA secondary structure prediction. The research presented here builds on previous work and examines the impact of three different pseudorandom number generators (PRNGs) on the GA's performance. The three generators tested are the C standard library PRNG RAND, a parallelized multiplicative congruential generator (MCG), and a parallelized Mersenne Twister (MT). A fully parallel version of RnaPredict using the Message Passing Interface (MPI) was implemented on a 128-node Beowulf cluster. The PRNG comparison tests were performed with known structures whose sequences are 118, 122, 468, 543, and 556 nucleotides in length. The effects of the PRNGs are investigated and the predicted structures are compared to known structures. Results indicate that P-RnaPredict demonstrated good prediction accuracy, particularly so for shorter sequences.
Distributed machine learning: Scaling up with coarse-grained parallelism
Provost, F.J.; Hennessy, D.N.
1994-12-31
Machine learning methods are becoming accepted as additions to the biologist's data-analysis tool kit. However, scaling these techniques up to large data sets, such as those in biological and medical domains, is problematic in terms of both the required computational search effort and required memory (and the detrimental effects of excessive swapping). Our approach to tackling the problem of scaling up to large datasets is to take advantage of the ubiquitous workstation networks that are generally available in scientific and engineering environments. This paper introduces the notion of the invariant-partitioning property--that for certain evaluation criteria it is possible to partition a data set across multiple processors such that any rule that is satisfactory over the entire data set will also be satisfactory on at least one subset. In addition, by taking advantage of cooperation through interprocess communication, it is possible to build distributed learning algorithms such that only rules that are satisfactory over the entire data set will be learned. We describe a distributed learning system, CorPRL, that takes advantage of the invariant-partitioning property to learn from very large data sets, and present results demonstrating CorPRL's effectiveness in analyzing data from two databases.
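The invariant-partitioning property can be illustrated with one evaluation criterion for which it holds. The sketch below assumes, as an illustration only, that a rule is "satisfactory" when the fraction of covered examples that are positive meets a threshold; the abstract does not name the criteria CorPRL uses, and all function names here are hypothetical. If the fraction were below the threshold on every subset, summing the counts would put it below the threshold on the whole set as well.

```python
from fractions import Fraction

def precision(rule, examples):
    """Fraction of covered examples that are positive, or None if nothing is covered."""
    covered = [label for features, label in examples if rule(features)]
    if not covered:
        return None
    return Fraction(sum(covered), len(covered))  # exact arithmetic, no rounding

def satisfactory(rule, examples, threshold):
    """Assumed criterion: precision on the covered examples meets the threshold."""
    p = precision(rule, examples)
    return p is not None and p >= threshold

def partition(examples, k):
    """Split the data set into k disjoint subsets, one per processor."""
    return [examples[i::k] for i in range(k)]

def satisfactory_somewhere(rule, subsets, threshold):
    """True if at least one subset finds the rule satisfactory."""
    return any(satisfactory(rule, subset, threshold) for subset in subsets)
```

A rule that passes on the full data set is therefore never missed by all processors, which is what lets each processor search its own subset independently before a global check.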
A Parallel Newton-Krylov-Schur Algorithm for the Reynolds-Averaged Navier-Stokes Equations
NASA Astrophysics Data System (ADS)
Osusky, Michal
Aerodynamic shape optimization and multidisciplinary optimization algorithms have the potential not only to improve conventional aircraft, but also to enable the design of novel configurations. By their very nature, these algorithms generate and analyze a large number of unique shapes, resulting in high computational costs. In order to improve their efficiency and enable their use in the early stages of the design process, a fast and robust flow solution algorithm is necessary. This thesis presents an efficient parallel Newton-Krylov-Schur flow solution algorithm for the three-dimensional Navier-Stokes equations coupled with the Spalart-Allmaras one-equation turbulence model. The algorithm employs second-order summation-by-parts (SBP) operators on multi-block structured grids with simultaneous approximation terms (SATs) to enforce block interface coupling and boundary conditions. The discrete equations are solved iteratively with an inexact-Newton method, while the linear system at each Newton iteration is solved using the flexible Krylov subspace iterative method GMRES with an approximate-Schur parallel preconditioner. The algorithm is thoroughly verified and validated, highlighting the correspondence of the current algorithm with several established flow solvers. The solution for a transonic flow over a wing on a mesh of medium density (15 million nodes) shows good agreement with experimental results. Using 128 processors, deep convergence is obtained in under 90 minutes. The solution of transonic flow over the Common Research Model wing-body geometry with grids with up to 150 million nodes exhibits the expected grid convergence behavior. This case was completed as part of the Fifth AIAA Drag Prediction Workshop, with the algorithm producing solutions that compare favourably with several widely used flow solvers. The algorithm is shown to scale well on over 6000 processors. The results demonstrate the effectiveness of the SBP-SAT spatial discretization, which can
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1999-01-01
The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
Komarov, Ivan; D'Souza, Roshan M
2012-01-01
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
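For orientation, the standard serial direct method that the GSSA variants build on can be sketched as follows. This is the textbook algorithm, not the authors' GPU variant; the function name and the data layout (a dict of species counts plus a list of propensity functions with stoichiometry dicts) are assumptions made for the example.

```python
import math
import random

def gillespie_direct(state, reactions, t_end, rng):
    """Simulate one trajectory with Gillespie's direct method.

    state:     dict mapping species name -> molecule count
    reactions: list of (propensity_function, stoichiometry_dict) pairs
    """
    t = 0.0
    state = dict(state)
    trajectory = [(t, dict(state))]
    while True:
        props = [propensity(state) for propensity, _ in reactions]
        a0 = sum(props)
        if a0 <= 0.0:
            break  # no reaction can fire any more
        # Waiting time is exponential with rate a0; 1 - random() lies in (0, 1].
        t += -math.log(1.0 - rng.random()) / a0
        if t > t_end:
            break
        # Choose reaction j with probability props[j] / a0.
        r = rng.random() * a0
        acc = 0.0
        for j, p in enumerate(props):
            acc += p
            if r < acc:
                break
        for species, change in reactions[j][1].items():
            state[species] += change
        trajectory.append((t, dict(state)))
    return trajectory
```

Each call produces one independent trajectory, so a parameter sweep amounts to many such calls with different seeds or rate constants, which is the coarse-grained parallelism the abstract refers to.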
Parallel distributed processing: Implications for cognition and development. Technical report
McClelland, J.L.
1988-07-11
This paper provides a brief overview of the connectionist or parallel distributed processing framework for modeling cognitive processes, and considers the application of the connectionist framework to problems of cognitive development. Several aspects of cognitive development might result from the process of learning as it occurs in multi-layer networks. This learning process has the characteristic that it reduces the discrepancy between expected and observed events. As it does this, representations develop on hidden units which dramatically change both the way in which the network represents the environment from which it learns and the expectations that the network generates about environmental events. The learning process exhibits relatively abrupt transitions corresponding to stage shifts in cognitive development. These points are illustrated using a network that learns to anticipate which side of a balance beam will go down, based on the number of weights on each side of the fulcrum and their distance from the fulcrum on each side of the beam. The network is trained in an environment in which weight more frequently governs which side will go down. It recapitulates the states of development seen in children, as well as the stage transitions, as it learns to represent weight and distance information.
Parallelizing Sylvester-like operations on a distributed memory computer
Hu, D.Y.; Sorensen, D.C.
1994-12-31
Discretization of linear operators arising in applied mathematics often leads to matrices with the following structure: $M(x) = (D \otimes A + B \otimes I_n + V)x$, where $x \in \mathbb{R}^{mn}$, $B, D \in \mathbb{R}^{n \times n}$, $A \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{mn \times mn}$; both $D$ and $V$ are diagonal. For notational convenience, the authors assume that both $A$ and $B$ are symmetric. All the results in this paper can be easily extended to the cases with general $A$ and $B$. The linear operator on $\mathbb{R}^{mn}$ defined above can be viewed as a generalization of the Sylvester operator: $S(x) = (I_m \otimes A + B \otimes I_n)x$. The authors therefore refer to it as a Sylvester-like operator. The schemes discussed in this paper therefore also apply to the Sylvester operator. In this paper, the authors present the SIMD scheme for parallelization of the Sylvester-like operator on a distributed memory computer. This scheme is designed to approach the best possible efficiency by avoiding unnecessary communication among processors.
Lin, Lin; Yang, Chao; Lu, Jiangfeng; Ying, Lexing; E, Weinan
2009-09-25
We present an efficient parallel algorithm and its implementation for computing the diagonal of $H^{-1}$, where $H$ is a 2D Kohn-Sham Hamiltonian discretized on a rectangular domain using a standard second order finite difference scheme. This type of calculation can be used to obtain an accurate approximation to the diagonal of a Fermi-Dirac function of $H$ through a recently developed pole-expansion technique \cite{LinLuYingE2009}. The diagonal elements are needed in electronic structure calculations for quantum mechanical systems \cite{HohenbergKohn1964,KohnSham1965,DreizlerGross1990}. We show how the elimination tree is used to organize the parallel computation and how synchronization overhead is reduced by passing data level by level along this tree using the technique of local buffers and relative indices. We analyze the performance of our implementation by examining its load balance and communication overhead. We show that our implementation exhibits excellent weak scaling on a large-scale high performance distributed parallel machine. When compared with the standard approach for evaluating the diagonal of a Fermi-Dirac function of a Kohn-Sham Hamiltonian associated with a 2D electron quantum dot, the new pole-expansion technique that uses our algorithm to compute the diagonal of $(H-z_i I)^{-1}$ for a small number of poles $z_i$ is much faster, especially when the quantum dot contains many electrons.
NASA Astrophysics Data System (ADS)
Zemlyanaya, E. V.; Bashashin, M. V.; Rahmonov, I. R.; Shukrinov, Yu. M.; Atanasova, P. Kh.; Volokhova, A. V.
2016-10-01
We consider a model of a system of long Josephson junctions (LJJ) with inductive and capacitive coupling. The corresponding system of nonlinear partial differential equations is solved by means of the standard three-point finite-difference approximation in the spatial coordinate, with the Runge-Kutta method used to solve the resulting Cauchy problem. A parallel algorithm is developed and implemented on the basis of MPI (Message Passing Interface) technology. The effect of the coupling between the JJs on the properties of the LJJ system is demonstrated. Numerical results are discussed from the viewpoint of the effectiveness of the parallel implementation.
Debelak, Rudolf; Tran, Ulrich S.
2016-01-01
The analysis of polychoric correlations via principal component analysis and exploratory factor analysis are well-known approaches to determine the dimensionality of ordered categorical items. However, the application of these approaches has been considered as critical due to the possible indefiniteness of the polychoric correlation matrix. A possible solution to this problem is the application of smoothing algorithms. This study compared the effects of three smoothing algorithms, based on the Frobenius norm, the adaption of the eigenvalues and eigenvectors, and on minimum-trace factor analysis, on the accuracy of various variations of parallel analysis by the means of a simulation study. We simulated different datasets which varied with respect to the size of the respondent sample, the size of the item set, the underlying factor model, the skewness of the response distributions and the number of response categories in each item. We found that a parallel analysis and principal component analysis of smoothed polychoric and Pearson correlations led to the most accurate results in detecting the number of major factors in simulated datasets when compared to the other methods we investigated. Of the methods used for smoothing polychoric correlation matrices, we recommend the algorithm based on minimum trace factor analysis. PMID:26845032
NASA Astrophysics Data System (ADS)
Wu, J.; Yang, Y.; Luo, Q.; Wu, J.
2012-12-01
This study presents a new hybrid multi-objective evolutionary algorithm, the niched Pareto tabu search combined with a genetic algorithm (NPTSGA), whereby the global search ability of niched Pareto tabu search (NPTS) is improved by the diversification of candidate solutions arising from the evolving nondominated sorting genetic algorithm II (NSGA-II) population. Also, the NPTSGA coupled with the commonly used groundwater flow and transport codes, MODFLOW and MT3DMS, is developed for multi-objective optimal design of groundwater remediation systems. The proposed methodology is then applied to a large-scale field groundwater remediation system for cleanup of a large trichloroethylene (TCE) plume at the Massachusetts Military Reservation (MMR) in Cape Cod, Massachusetts. Furthermore, a master-slave (MS) parallelization scheme based on the Message Passing Interface (MPI) is incorporated into the NPTSGA to implement objective function evaluations in a distributed processor environment, which can greatly improve the efficiency of the NPTSGA in finding Pareto-optimal solutions to the real-world application. This study shows that the MS parallel NPTSGA, in comparison with the original NPTS and NSGA-II, can balance the tradeoff between diversity and optimality of solutions during the search process and is an efficient and effective tool for optimizing the multi-objective design of groundwater remediation systems under complicated hydrogeologic conditions.
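The master-slave pattern described above, in which one process farms out objective function evaluations and then selects nondominated solutions, can be sketched on a single machine. This is an illustration under stated assumptions only: a thread pool stands in for the MPI worker ranks, the objective passed in is a cheap placeholder for a MODFLOW/MT3DMS run, both objectives are minimized, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(population, objective, workers=4):
    """Master step: hand each candidate design to a worker and collect the
    objective vectors in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(objective, population))

def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and strictly
    better somewhere (minimization)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(scores):
    """Keep only the nondominated objective vectors."""
    return [s for s in scores
            if not any(dominates(other, s) for other in scores)]
```

The evaluations are independent of one another, so this step parallelizes with no communication between workers; only the selection step needs the complete set of scores.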
pSIN: A scalable, Parallel algorithm for Seismic INterferometry of large-N ambient-noise data
NASA Astrophysics Data System (ADS)
Chen, Po; Taylor, Nicholas J.; Dueker, Ken G.; Keifer, Ian S.; Wilson, Andra K.; McGuffy, Casey L.; Novitsky, Christopher G.; Spears, Alec J.; Holbrook, W. Steven
2016-08-01
Seismic interferometry is a technique for extracting deterministic signals (i.e., ambient-noise Green's functions) from recordings of ambient-noise wavefields through cross-correlation and other related signal processing techniques. The extracted ambient-noise Green's functions can be used in ambient-noise tomography for constructing seismic structure models of the Earth's interior. The amount of calculations involved in the seismic interferometry procedure can be significant, especially for ambient-noise datasets collected by large seismic sensor arrays (i.e., "large-N" data). We present an efficient parallel algorithm, named pSIN (Parallel Seismic INterferometry), for solving seismic interferometry problems on conventional distributed-memory computer clusters. The design of the algorithm is based on a two-dimensional partition of the ambient-noise data recorded by a seismic sensor array. We pay special attention to the balance of the computational load, inter-process communication overhead and memory usage across all MPI processes and we minimize the total number of I/O operations. We have tested the algorithm using a real ambient-noise dataset and obtained a significant amount of savings in processing time. Scaling tests have shown excellent strong scalability from 80 cores to over 2000 cores.
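A two-dimensional partition of the pairwise cross-correlation work can be pictured as tiling the upper triangle of the station-by-station pair matrix. The sketch below is one plausible layout, not pSIN's actual scheme, whose details the abstract does not give; the block sizes, the round-robin assignment of tiles to ranks, and all function names are assumptions.

```python
def station_blocks(n_stations, n_blocks):
    """Split station indices into contiguous blocks of near-equal size."""
    base, extra = divmod(n_stations, n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def tiles_for_rank(rank, n_ranks, n_blocks):
    """Upper-triangular tiles (bi <= bj), dealt out round-robin to the ranks."""
    tiles = [(bi, bj) for bi in range(n_blocks) for bj in range(bi, n_blocks)]
    return tiles[rank::n_ranks]

def pairs_in_tile(blocks, tile):
    """Station pairs (i, j) with i < j that fall inside one tile."""
    bi, bj = tile
    return [(i, j) for i in blocks[bi] for j in blocks[bj] if i < j]
```

With this layout a rank only needs the recordings of the two station blocks that define its tile, which bounds both its memory use and the amount of data it reads.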
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy
Tian, Yuling; Zhang, Hongxian
2016-01-01
For the purposes of information retrieval, users must find highly relevant documents from within a system (and often a quite large one comprised of many individual documents) based on input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor, and a hot research topic–there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others in respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions. PMID:27487242
A parallel systematic-Monte Carlo algorithm for exploring conformational space.
Perez-Riverol, Yasset; Vera, Roberto; Mazola, Yuliet; Musacchio, Alexis
2012-01-01
Computational algorithms to explore the conformational space of small molecules are a complex and computationally demanding field in chemoinformatics. In this paper, a hybrid algorithm to explore the conformational space of organic molecules is presented. This hybrid algorithm is based on a systematic search approach combined with a Monte Carlo based method in order to obtain an ensemble of low-energy conformations simulating the flexibility of small chemical compounds. The Monte Carlo method uses the Metropolis criterion to accept or reject a conformation through an in-house implementation of the MMFF94s force field to calculate the conformational energy. The parallel design of this algorithm, based on the message passing interface (MPI) paradigm, was implemented. The results showed a performance increase in terms of speed and efficiency.
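The Metropolis accept/reject step mentioned above can be shown in a few lines. This sketch makes loud assumptions: the energy is a toy three-fold torsional term rather than the MMFF94s force field, a single torsion angle stands in for a full conformation, and the function names and parameters (`kT`, `step_size`) are hypothetical.

```python
import math
import random

def metropolis_accept(e_old, e_new, kT, rng):
    """Metropolis criterion: always accept a downhill move; accept an uphill
    move with probability exp(-(e_new - e_old) / kT)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)

def toy_torsion_energy(theta):
    """Hypothetical three-fold torsional energy; NOT the MMFF94s force field."""
    return 1.0 + math.cos(3.0 * theta)

def monte_carlo_search(theta0, steps, kT, step_size, rng):
    """Random-walk search over one torsion angle, returning the lowest
    (energy, angle) pair visited."""
    theta, energy = theta0, toy_torsion_energy(theta0)
    best = (energy, theta)
    for _ in range(steps):
        trial = theta + rng.uniform(-step_size, step_size)
        e_trial = toy_torsion_energy(trial)
        if metropolis_accept(energy, e_trial, kT, rng):
            theta, energy = trial, e_trial
            best = min(best, (energy, theta))
    return best
```

Occasionally accepting uphill moves is what lets the walk leave a local minimum, which a purely greedy search over conformations could not do.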
A Time-Optimal On-the-Fly Parallel Algorithm for Model Checking of Weak LTL Properties
NASA Astrophysics Data System (ADS)
Barnat, Jiří; Brim, Luboš; Ročkai, Petr
One of the most important open problems of parallel LTL model-checking is to design an on-the-fly scalable parallel algorithm with linear time complexity. Such an algorithm would give the optimality we have in sequential LTL model-checking. In this paper we give a partial solution to the problem. We propose an algorithm that has the required properties for a very rich subset of LTL properties, namely those expressible by weak Büchi automata.
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
NASA Technical Reports Server (NTRS)
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for high-end computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language, that is, a programming language that is closer to a human's way of thinking than to a machine's. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequential/single-core code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify-Produce (CSP) and Normalize-Transpose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less
Parallel algorithm for computation of second-order sequential best rotations
NASA Astrophysics Data System (ADS)
Redif, Soydan; Kasap, Server
2013-12-01
Algorithms for computing an approximate polynomial matrix eigenvalue decomposition of para-Hermitian systems have emerged as a powerful, generic signal processing tool. A technique that has shown much success in this regard is the sequential best rotation (SBR2) algorithm. Proposed is a scheme for parallelising SBR2 with a view to exploiting the modern architectural features and inherent parallelism of field-programmable gate array (FPGA) technology. Experiments show that the proposed scheme can achieve low execution times while requiring minimal FPGA resources.
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms
NASA Astrophysics Data System (ADS)
Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel
2016-04-01
Advances in graphics processing units' technology towards encompassing parallel architectures [1], comprised of thousands of cores and multiples of parallel threads, provide the foundation in terms of hardware for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel comparatively, are the: fuzzy k-means clustering with expert knowledge [7] in assigning overall clusters' number; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
NASA Technical Reports Server (NTRS)
Tilton, James C.; Plaza, Antonio J. (Editor); Chang, Chein-I. (Editor)
2008-01-01
The hierarchical image segmentation algorithm (referred to as HSEG) is a hybrid of hierarchical step-wise optimization (HSWO) and constrained spectral clustering that produces a hierarchical set of image segmentations. HSWO is an iterative approach to region growing segmentation in which the optimal image segmentation is found at $N_R$ regions, given a segmentation at $N_R + 1$ regions. HSEG's addition of constrained spectral clustering makes it a computationally intensive algorithm for all but the smallest of images. To counteract this, a computationally efficient recursive approximation of HSEG (called RHSEG) has been devised. Further improvements in processing speed are obtained through a parallel implementation of RHSEG. This chapter describes this parallel implementation and demonstrates its computational efficiency on a Landsat Thematic Mapper test scene.
Transform methods for developing parallel algorithms for cyclic-block signal processing
NASA Astrophysics Data System (ADS)
Marshall, T. G., Jr.
A class of FIR and IIR single and multirate parallel filtering algorithms is introduced in which blocks of inputs and outputs are processed on-the-fly in a cyclic manner. There is no inherent latency introduced by the decomposition procedure giving the parallelism, the system latency being primarily due to the component processors. The structure is particularly well-suited for systems in which the component processors are the familiar DSP chips optimized for convolution although other component structures can be accommodated. In particular, the automatic data shifting feature of the TMS320 series processors can be utilized in these algorithms. A transform notation, introduced for digital filter banks, is recast in the desired form for this application. The resulting structure of the system, in this notation, is a circulant matrix for FIR filtering or a related matrix in other cases. The cyclic properties of the system and useful implementation flexibility result from this matrix structure.
Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX, Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace-collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation-based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, which convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.
NASA Technical Reports Server (NTRS)
Krasteva, Denitza T.
1998-01-01
Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.). This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.
NASA Astrophysics Data System (ADS)
Hyams, Daniel Gaiennie
The primary objective of this study is to develop an efficient, scalable, parallel incompressible flow solver capable of performing viscous, high Reynolds number flow simulations for complex geometries using multielement unstructured grids. The present parallel unstructured viscous flow solver is based on domain decomposition for concurrent solution within subdomains assigned to multiple processors. The solution algorithm employs iterative solution of the implicit approximation, and its software implementation uses MPI message passing for interprocessor communication. Key parallelization issues addressed in this work are (1) definition of the iteration hierarchy, (2) treatment of connectivity between subdomain interfaces, and (3) methods for coupling of subdomains. A heuristic, semiempirical performance estimate is developed and evaluated. With this performance estimate, scalability characteristics of the solution algorithm may be calculated for a particular architecture and/or predicted for a given problem a priori. Validation and verification of the solution procedure are carried out on several small steady and unsteady model problems with excellent agreement to experimental, theoretical, and numerical results. The present parallel flow solver is demonstrated for large-scale meshes with viscous sublayer resolution ($y^+ \sim 1$) and approximately $10^6$ points or more. Complex geometry 3D applications include (1) a full-scale ship hull, (2) a SUBOFF model hull with stern appendages, (3) a fully-configured high-lift transport, and (4) a maneuvering tiltrotor aircraft. The first three computations are shown to agree well with available experimental data. The maneuvering tiltrotor aircraft simulation is a demonstration of capability for the parallel solution algorithm in the context of an extremely complex geometry and unsteady flowfield.
BSIRT: a block-iterative SIRT parallel algorithm using curvilinear projection model.
Zhang, Fa; Zhang, Jingrong; Lawrence, Albert; Ren, Fei; Wang, Xuan; Liu, Zhiyong; Wan, Xiaohua
2015-03-01
Large-field high-resolution electron tomography (ET) enables visualizing detailed mechanisms under global structure. As the field enlarges, the distortions of reconstruction and the processing time become more critical. Using the curvilinear projection model can improve the quality of large-field ET reconstruction, but its computational complexity further exacerbates the processing time. Moreover, there is no parallel strategy on GPU for iterative reconstruction methods with curvilinear projection. Here we propose a new block-iterative SIRT parallel algorithm with the curvilinear projection model (BSIRT) for large-field ET reconstruction, to improve the quality of reconstruction and accelerate the reconstruction process. We also develop some key techniques, including a block-iterative method with the curvilinear projection, a scope-based data decomposition method and a page-based data transfer scheme, to implement the parallelization of BSIRT on the GPU platform. Experimental results show that BSIRT can improve the reconstruction quality as well as the speed of the reconstruction process.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
Monte Carlo particle transport methods are being considered as a viable option for high-fidelity simulation of nuclear reactors. While Monte Carlo methods offer several potential advantages over deterministic methods, there are a number of algorithmic shortcomings that would prevent their immediate adoption for full-core analyses. In this thesis, algorithms are proposed both to ameliorate the degradation in parallel efficiency typically observed for large numbers of processors and to offer a means of decomposing large tally data that will be needed for reactor analysis. A nearest-neighbor fission bank algorithm was proposed and subsequently implemented in the OpenMC Monte Carlo code. A theoretical analysis of the communication pattern shows that the expected cost is O(√N) whereas traditional fission bank algorithms are O(N) at best. The algorithm was tested on two supercomputers, the Intrepid Blue Gene/P and the Titan Cray XK7, and demonstrated nearly linear parallel scaling up to 163,840 processor cores on a full-core benchmark problem. An algorithm for reducing network communication arising from tally reduction was analyzed and implemented in OpenMC. The proposed algorithm groups only particle histories on a single processor into batches for tally purposes; in doing so it prevents all network communication for tallies until the very end of the simulation. The algorithm was tested, again on a full-core benchmark, and shown to reduce network communication substantially. A model was developed to predict the impact of load imbalances on the performance of domain decomposed simulations. The analysis demonstrated that load imbalances in domain decomposed simulations arise from two distinct phenomena: non-uniform particle densities and non-uniform spatial leakage. The dominant performance penalty for domain decomposition was shown to come from these physical effects rather than insufficient network bandwidth or high latency. The model predictions were verified with
1986-11-29
Madison, Wisconsin, August 1982. [16] Fitzpatrick, D. T., Foderaro, J. K., Katevenis, M. G. H., Landman, H. A., Patterson, D. A., Peek, J. B., Peshkess...October 18-22, 1982. [33] Levitan, S. P., Parallel Algorithms and Architectures: A Programmer's Per-...
Class Notes: Programming Parallel Algorithms CS 15-840B (Fall 1992)
1993-02-01
840: Programming Parallel Algorithms Lecture #15 Scribe: Bob Wheeler Thursday, 6 Nov 92 Overview * Connected components (continued). * Minimum spanning...Sriram Sethuraman Singular value decomposition Ken Tew EEG analysis Eric Thayer Speech recognition Xuemei Wang & Bob Wheeler Matrix operations Matt...Computing, 14(4):862-874, 1985. [33] L. W. Tucker, C. R. Feynman, and D. M. Fritzsche. Object recognition using the Connection Machine. Proceedings CVPR
NASA Astrophysics Data System (ADS)
Chen, Ruijiu; Wang, Meng; Yan, Xinliang; Yang, Qiong; Lam, Yihua; Yang, Lei; Zhang, Yuhu
We developed a new program that parallelizes the periodic-signal tracking algorithm for isochronous mass spectrometry on GPUs. The computing time of data analysis is reduced by factors of ~71 and ~346 on the Tesla C1060 and Tesla K20c GPUs, respectively, compared with the old program on a Xeon E5540 CPU. With this new program we succeeded in performing real-time data analysis.
Eroglu, Duygu Yilmaz; Ozmutlu, H. Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job sequence- and machine-dependent setup times and with the job splitting property. The first contribution of this paper is to introduce novel algorithms that perform splitting and scheduling simultaneously with a variable number of subjobs. We propose a simple chromosome structure, constituted by random key numbers, for the hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but they create additional difficulty when hybrid factors in local search are implemented. The second contribution is a set of algorithms that adapt the results of the local search into the genetic algorithm with a minimum number of relocation operations on the genes' random key numbers. The third contribution is three new MIP models that perform splitting and scheduling simultaneously. The fourth contribution is the implementation of GAspLAMIP, which lets us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature, and the results validate the effectiveness of the proposed algorithms. PMID:24977204
Calculating Hurst exponent and neutron monitor data in a single parallel algorithm
NASA Astrophysics Data System (ADS)
Kussainov, A. S.; Kussainov, S. G.
2015-09-01
We implemented an algorithm for simultaneous parallel calculation of the Hurst exponent H and the fractal dimension D of a time series of interest. The parallel programming environment was provided by the OpenMPI library installed on three machines networked into a virtual cluster and running the Debian Wheezy operating system. We applied our program to a comparative analysis of week-and-a-half-long, one-minute-resolution, six-channel data from a neutron monitor. To verify the code, we applied it to a random Gaussian noise signal and to time series with manually introduced self-affinity features, both of which have well-known values of H and D. All results are in good correspondence with each other and are supported by modern theories of signal processing, thus confirming the validity of the implemented algorithms. Our code could be used as a standalone tool for the analysis of different time series data as well as for further work on the development and optimization of parallel algorithms for calculating time series parameters.
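As an illustration of the quantities named in this abstract, the following is a minimal sequential sketch of one common Hurst estimator, rescaled-range (R/S) analysis. The abstract does not say which estimator the authors used, so the method, the function names `hurst_rs` and `fractal_dimension`, and the `min_chunk` parameter are all assumptions made for illustration. The relation D = 2 - H is the usual one for a self-affine one-dimensional trace.

```python
import math


def hurst_rs(series, min_chunk=8):
    """Estimate the Hurst exponent H by rescaled-range (R/S) analysis.

    Hypothetical helper: the series is cut into chunks of doubling size,
    the mean R/S is computed per size, and H is the least-squares slope
    of log(R/S) against log(chunk size).
    """
    n = len(series)
    log_sizes, log_rs = [], []
    size = min_chunk
    while size <= n // 2:
        rs_values = []
        for start in range(0, n - size + 1, size):
            chunk = series[start:start + size]
            mean = sum(chunk) / size
            cum = lo = hi = sq = 0.0
            for x in chunk:
                d = x - mean
                cum += d            # cumulative deviation from the chunk mean
                lo = min(lo, cum)
                hi = max(hi, cum)
                sq += d * d
            std = math.sqrt(sq / size)
            if std > 0:
                rs_values.append((hi - lo) / std)  # range over std deviation
        if rs_values:
            log_sizes.append(math.log(size))
            log_rs.append(math.log(sum(rs_values) / len(rs_values)))
        size *= 2
    # Least-squares slope of log(R/S) versus log(size).
    m = len(log_sizes)
    mx = sum(log_sizes) / m
    my = sum(log_rs) / m
    num = sum((x - mx) * (y - my) for x, y in zip(log_sizes, log_rs))
    den = sum((x - mx) ** 2 for x in log_sizes)
    return num / den


def fractal_dimension(h):
    """Fractal dimension of a self-affine 1D trace, D = 2 - H."""
    return 2.0 - h
```

Each chunk's R/S value depends only on that chunk, so the inner loop over chunks (or over independent data channels) is the natural place to distribute work across processes. That is one plausible parallelization, not necessarily the one used in the paper.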
ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches.
Rognes, T
2001-04-01
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign was found to be as good as Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
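The first stage described above, the optimal ungapped alignment score on every diagonal of the alignment matrix, can be sketched sequentially as follows. This is only an illustration: ParAlign itself uses SIMD instructions and a real substitution matrix, whereas the fixed `match`/`mismatch` scores and the function name `diagonal_scores` here are assumptions.

```python
def diagonal_scores(query, target, match=2, mismatch=-1):
    """Best ungapped local alignment score on each diagonal.

    Diagonal d holds the cells (i, j) with j - i == d. Along a diagonal
    the best ungapped local score is the maximum-sum contiguous run,
    computed with a running sum that is reset whenever it drops below 0.
    """
    scores = {}
    for d in range(-(len(query) - 1), len(target)):
        best = run = 0
        i = max(0, -d)
        j = i + d
        while i < len(query) and j < len(target):
            s = match if query[i] == target[j] else mismatch
            run = max(0, run + s)
            best = max(best, run)
            i += 1
            j += 1
        scores[d] = best
    return scores
```

For example, `diagonal_scores("ACGT", "TTACGTTT")[2]` is 8, since the query matches the target exactly when shifted by two positions. Diagonals are independent of one another, which is what makes this stage suitable for data-parallel execution.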
Parallel pivoting combined with parallel reduction
NASA Technical Reports Server (NTRS)
Alaghband, Gita
1987-01-01
Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.
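A small sketch may help make the two ingredients named above concrete. It assumes the standard definitions: the Markowitz number of a nonzero at (i, j) is (r_i - 1)(c_j - 1), where r_i and c_j count the nonzeros in its row and column, and two pivots (i, j) and (k, l) are taken as compatible when they share no row or column and the entries (i, l) and (k, j) are both zero. The greedy selection and all function names are illustrative assumptions, and the numerical stability check mentioned in the abstract is omitted.

```python
def markowitz_counts(pattern):
    """Markowitz number (r_i - 1) * (c_j - 1) for every nonzero position.

    pattern is a set of (row, col) positions of nonzero entries.
    """
    rows, cols = {}, {}
    for i, j in pattern:
        rows[i] = rows.get(i, 0) + 1
        cols[j] = cols.get(j, 0) + 1
    return {(i, j): (rows[i] - 1) * (cols[j] - 1) for i, j in pattern}


def compatible(p, q, pattern):
    """True if pivots p and q can be eliminated simultaneously."""
    (i, j), (k, l) = p, q
    return (i != k and j != l
            and (i, l) not in pattern
            and (k, j) not in pattern)


def greedy_parallel_pivots(pattern):
    """Pick mutually compatible pivots, lowest Markowitz number first."""
    counts = markowitz_counts(pattern)
    chosen = []
    for cand in sorted(counts, key=lambda p: (counts[p], p)):
        if all(compatible(cand, c, pattern) for c in chosen):
            chosen.append(cand)
    return chosen
```

With the nonzero pattern `{(0, 0), (1, 1), (2, 2), (2, 0)}`, the sketch selects `(0, 0)` and `(1, 1)` and rejects `(2, 2)` because the entry at `(2, 0)` couples row 2 to the pivot column 0.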
Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation.
1986-03-01
P36-844. **VAX is a trademark of Digital Equipment Corporation. ...Paper 2L Parallel...ming, Computing Surveys, 9, March, pp. 29-59. Unix is a trademark of Bell Laboratories. VAX is a trademark of Digital Equipment Corporation...parallelism will not reduce the processor communications response time. Thus, there are associated costs and limitations (•) Amount of memory
Parallel algorithm for dominant points correspondences in robot binocular stereo vision
NASA Technical Reports Server (NTRS)
Al-Tammami, A.; Singh, B.
1993-01-01
This paper presents an algorithm to find the correspondences of points representing dominant features in robot stereo vision. The algorithm consists of two main steps: dominant point extraction and dominant point matching. In the feature extraction phase, the algorithm utilizes the widely used Moravec Interest Operator and two other operators: the Prewitt Operator and a new operator called the Gradient Angle Variance Operator. The Interest Operator in the Moravec algorithm was used to exclude featureless areas and simple edges which are oriented in the vertical, horizontal, and two diagonal directions. It was incorrectly detecting points on edges which are not in the four main directions (vertical, horizontal, and two diagonals). The new algorithm uses the Prewitt operator to exclude featureless areas, so that the Interest Operator is applied only on the edges to exclude simple edges and to leave interesting points. This modification speeds up the extraction process by approximately 5 times. The Gradient Angle Variance (GAV), an operator which calculates the variance of the gradient angle in a window around the point under concern, is then applied on the interesting points to exclude the redundant ones and leave the actual dominant ones. The matching phase is performed after the extraction of the dominant points in both stereo images. The matching starts with dominant points in the left image and does a local search, looking for corresponding dominant points in the right image. The search is geometrically constrained by the epipolar line of the parallel-axes stereo geometry and the maximum disparity of the application environment. If one dominant point in the right image lies in the search area, then it is the corresponding point of the reference dominant point in the left image. A parameter provided by the GAV is thresholded and used as a rough similarity measure to select the corresponding dominant point if there is more than one point in the search area. The correlation is used as
A general purpose subroutine for fast fourier transform on a distributed memory parallel machine
NASA Technical Reports Server (NTRS)
Dubey, A.; Zubair, M.; Grosch, C. E.
1992-01-01
One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
ClustalW-MPI: ClustalW analysis using distributed and parallel computing.
Li, Kuo-Bin
2003-08-12
ClustalW is a tool for aligning multiple protein or nucleotide sequences. The alignment is achieved via three steps: pairwise alignment, guide-tree generation and progressive alignment. ClustalW-MPI is a distributed and parallel implementation of ClustalW. All three steps have been parallelized to reduce the execution time. The software uses a message-passing library called MPI (Message Passing Interface) and runs on distributed workstation clusters as well as on traditional parallel computers.
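The pairwise alignment step is the easiest of the three to distribute, because every pair of sequences can be aligned independently. The sketch below deals the pairs out round-robin to a fixed number of workers. It is an assumption for illustration only: the abstract does not describe how ClustalW-MPI schedules its work, and `assign_pairs` is a hypothetical helper that uses no MPI calls.

```python
from itertools import combinations


def assign_pairs(num_sequences, num_workers):
    """Distribute all sequence pairs (i, j), i < j, round-robin to workers.

    Returns one list of pairs per worker. A static assignment like this
    ignores differing sequence lengths; a dynamic master-worker queue
    would balance load better.
    """
    work = [[] for _ in range(num_workers)]
    for k, pair in enumerate(combinations(range(num_sequences), 2)):
        work[k % num_workers].append(pair)
    return work
```

For 5 sequences and 3 workers, the 10 pairs are split into lists of 4, 3, and 3 pairs.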
G.A. Pope; K. Sephernoori; D.C. McKinney; M.F. Wheeler
1996-03-15
This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time that is required to obtain the solution as well as providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to CRAY T3D and CRAY T3E, distributed-memory, multiprocessor computers using CRAY-PVM as the interprocessor communication library. Also, the data communication routines were modified such that the portability of the original code across different computer architectures was made possible.
Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan
2016-01-01
A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network’s initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data. PMID:27304987
A massively parallel semi-Lagrangian algorithm for solving the transport equation
Manson, Russell; Wang, Dali
2010-01-01
The scalar transport equation underpins many models employed in science, engineering, technology and business. Application areas include, but are not restricted to, pollution transport, weather forecasting, video analysis and encoding (the optical flow equation), options and stock pricing (the Black-Scholes equation) and spatially explicit ecological models. Unfortunately finding numerical solutions to this equation which are fast and accurate is not trivial. Moreover, finding such numerical algorithms that can be implemented on high performance computer architectures efficiently is challenging. In this paper the authors describe a massively parallel algorithm for solving the advection portion of the transport equation. We present an approach here which is different to that used in most transport models and which we have tried and tested for various scenarios. The approach employs an intelligent domain decomposition based on the vector field of the system equations and thus automatically partitions the computational domain into algorithmically autonomous regions. The solution of a classic pure advection transport problem is shown to be conservative, monotonic and highly accurate at large time steps. Additionally we demonstrate that the algorithm is highly efficient for high performance computer architectures and thus offers a route towards massively parallel application.
Optimized simulations of Olami-Feder-Christensen systems using parallel algorithms
NASA Astrophysics Data System (ADS)
Dominguez, Rachele; Necaise, Rance; Montag, Eric
The sequential nature of the Olami-Feder-Christensen (OFC) model for earthquake simulations limits the benefits of parallel computing approaches because of the frequent communication required between processors. We developed a parallel version of the OFC algorithm for multi-core processors. Our data, even for relatively small system sizes and low numbers of processors, indicate that increasing the number of processors provides significantly faster simulations, producing more efficient results than previous attempts that used network-based Beowulf clusters. Our algorithm optimizes performance by exploiting the multi-core processor architecture, minimizing communication time in contrast to the networked Beowulf-cluster approaches. Our multi-core algorithm is the basis for a new algorithm using GPUs that will drastically increase the number of processors available. Previous studies incorporating realistic structural features of faults into OFC models have revealed spatial and temporal patterns observed in real earthquake systems. The computational advances presented here will allow for studying interacting networks of faults, rather than individual faults, further enhancing our understanding of the relationship between the earth's structure and the triggering process. Support for this project comes from the Chenery Research Fund, the Rashkind Family Endowment, the Walter Williams Craigie Teaching Endowment, and the Schapiro Undergraduate Research Fellowship.
2008-05-01
Reynolds’ Distributed Behavior Model [18]. Corner [5,6,20] ported the model from a single-processor Windows platform to a parallel Linux-based Beowulf ... Beowulf clusters consist of 1 to 16 processors and 1 to 1024 problems in intervals of powers of two. Each test is run 30 times for statistical analysis
Renaut, R.; He, Q.
1994-12-31
A new parallel iterative algorithm for unconstrained optimization by multisplitting is proposed. In this algorithm the original problem is split into a set of small optimization subproblems which are solved using well-known sequential algorithms. These algorithms are iterative in nature, e.g. the DFP variable metric method. Here the authors use sequential algorithms based on an inexact subspace search, which is an extension of the usual idea of an inexact line search. Essentially, the idea of the inexact line search for nonlinear minimization is that at each iteration one finds only an approximate minimum in the line search direction. Hence by inexact subspace search, they mean that, instead of finding the minimum of the subproblem at each iteration, they do an incomplete downhill search to give an approximate minimum. Some convergence and numerical results for this algorithm will be presented. Further, the original theory will be generalized to the situation with a singular Hessian. Applications for nonlinear least squares problems will be presented. Experimental results will be presented for implementations on an Intel iPSC/860 Hypercube with 64 nodes as well as on the Intel Paragon.
Parallel CFD Algorithms for Aerodynamical Flow Solvers on Unstructured Meshes. Parts 1 and 2
NASA Technical Reports Server (NTRS)
Barth, Timothy J.; Kwak, Dochan (Technical Monitor)
1995-01-01
The Advisory Group for Aerospace Research and Development (AGARD) has requested my participation in the lecture series entitled Parallel Computing in Computational Fluid Dynamics to be held at the von Karman Institute in Brussels, Belgium on May 15-19, 1995. In addition, a request has been made from the US Coordinator for AGARD at the Pentagon for NASA Ames to hold a repetition of the lecture series on October 16-20, 1995. I have been asked to be a local coordinator for the Ames event. All AGARD lecture series events have attendance limited to NATO allied countries. A brief of the lecture series is provided in the attached enclosure. Specifically, I have been asked to give two lectures of approximately 75 minutes each on the subject of parallel solution techniques for the fluid flow equations on unstructured meshes. The title of my lectures is "Parallel CFD Algorithms for Aerodynamical Flow Solvers on Unstructured Meshes" (Parts I-II). The contents of these lectures will be largely review in nature and will draw upon previously published work in this area. Topics of my lectures will include: (1) Mesh partitioning algorithms. Recursive techniques based on coordinate bisection, Cuthill-McKee level structures, and spectral bisection. (2) Newton's method for large scale CFD problems. Size and complexity estimates for Newton's method, modifications for ensuring global convergence. (3) Techniques for constructing the Jacobian matrix. Analytic and numerical techniques for Jacobian matrix-vector products, constructing the transposed matrix, extensions to optimization and homotopy theories. (4) Iterative solution algorithms. Practical experience with GMRES and BiCGSTAB matrix solvers. (5) Parallel matrix preconditioning. Incomplete Lower-Upper (ILU) factorization, domain-decomposed ILU, approximate Schur complement strategies.
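Of the mesh partitioning techniques listed under topic (1), recursive coordinate bisection is the simplest to illustrate. The sketch below is a generic version written for this listing, not material from the lectures: it partitions mesh points (given as coordinate tuples) by repeatedly cutting at the median along the axis of largest extent. The function name `coordinate_bisection` and its `levels` parameter are assumptions.

```python
def coordinate_bisection(points, levels):
    """Split points into up to 2**levels parts by recursive bisection.

    At each level the point set is sorted along the coordinate axis with
    the largest extent and cut at the median, so the two halves differ
    in size by at most one point.
    """
    if levels == 0 or len(points) <= 1:
        return [points]
    dims = len(points[0])
    axis = max(
        range(dims),
        key=lambda a: max(p[a] for p in points) - min(p[a] for p in points),
    )
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (coordinate_bisection(ordered[:mid], levels - 1)
            + coordinate_bisection(ordered[mid:], levels - 1))
```

Applied to an 8 x 2 grid of points with `levels=2`, this yields four parts of four points each, one per processor in a four-processor run. The method uses only coordinates and ignores mesh connectivity, which is why connectivity-aware schemes such as spectral bisection are listed alongside it.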
An efficient algorithm for estimating noise covariances in distributed systems
NASA Technical Reports Server (NTRS)
Dee, D. P.; Cohn, S. E.; Ghil, M.; Dalcher, A.
1985-01-01
An efficient computational algorithm for estimating the noise covariance matrices of large linear discrete stochastic-dynamic systems is presented. Such systems arise typically by discretizing distributed-parameter systems, and their size renders computational efficiency a major consideration. The proposed adaptive filtering algorithm is based on the ideas of Belanger, and is algebraically equivalent to his algorithm. The earlier algorithm, however, has computational complexity proportional to p to the 6th, where p is the number of observations of the system state, while the new algorithm has complexity proportional to only p-cubed. Further, the formulation of noise covariance estimation as a secondary filter, analogous to state estimation as a primary filter, suggests several generalizations of the earlier algorithm. The performance of the proposed algorithm is demonstrated for a distributed system arising in numerical weather prediction.
Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)
NASA Technical Reports Server (NTRS)
Riley, Christopher J.; Cheatwood, F. McNeil
1997-01-01
The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.
Skil: An imperative language with algorithmic skeletons for efficient distributed programming
Botorog, G.H.; Kuchen, H.
1996-12-31
In this paper we present Skil, an imperative language enhanced with higher-order functions and currying, as well as with a polymorphic type system. The high level of Skil allows the integration of algorithmic skeletons, i.e. of higher-order functions representing parallel computation patterns. At the same time, the language can be efficiently implemented. After describing a series of skeletons which work with distributed arrays, we give two examples of parallel programs implemented on the basis of skeletons, namely shortest paths in graphs and Gaussian elimination. Runtime measurements show that we approach the efficiency of message-passing C up to a factor between 1 and 2.5.
Logistics distribution centers location problem and algorithm under fuzzy environment
NASA Astrophysics Data System (ADS)
Yang, Lixing; Ji, Xiaoyu; Gao, Ziyou; Li, Keping
2007-11-01
The distribution center location problem is concerned with how to select distribution centers from a potential set so that the total relevant cost is minimized. This paper investigates the problem under a fuzzy environment. A chance-constrained programming model for the problem is designed and some properties of the model are investigated. A tabu search algorithm, a genetic algorithm, and a fuzzy simulation algorithm are integrated to seek an approximate best solution of the model. A numerical example is also given to show the application of the algorithm.
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method, even without acceleration/preconditioning, is competitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the trend of the SI-DSA execution times generally rises more steeply with increasing P than the PGS trend. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4,096.
Execution time supports for adaptive scientific algorithms on distributed memory machines
NASA Technical Reports Server (NTRS)
Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey
1990-01-01
Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communication patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
NASA Technical Reports Server (NTRS)
Fijany, Amir
1993-01-01
In this paper, parallel O(log N) algorithms are developed for dynamic simulation of a single closed-chain rigid multibody system, specialized to the case of a robot manipulator in contact with the environment.
Fast parallel algorithms that compute transitive closure of a fuzzy relation
NASA Technical Reports Server (NTRS)
Kreinovich, Vladik YA.
1993-01-01
The notion of a transitive closure of a fuzzy relation is very useful for clustering in pattern recognition, for fuzzy databases, etc. The original algorithm proposed by L. Zadeh (1971) requires computation time O(n(sup 4)), where n is the number of elements in the relation. In 1974, J. C. Dunn proposed an O(n(sup 2)) algorithm. Since we must compute n(n-1)/2 different values s(a, b) (a not equal to b) that represent the fuzzy relation, and we need at least one computational step to compute each of these values, we cannot compute all of them in fewer than O(n(sup 2)) steps. So Dunn's algorithm is in this sense optimal. For small n, this is acceptable. However, for big n (e.g., for big databases), it is still a lot, so it would be desirable to decrease the computation time (this problem was formulated by J. Bezdek). Since this decrease cannot be achieved on a sequential computer, the only way to do it is to use a computer with several processors working in parallel. We show that on a parallel computer, transitive closure can be computed in time O((log(sub 2)(n))(sup 2)).
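One standard way to reach a polylogarithmic parallel time, given here as a hedged sketch rather than as the paper's own algorithm, is repeated squaring under max-min composition. For a reflexive fuzzy relation on n elements, about log2(n) squarings give the closure, and each entry of a max-min product is a maximum over n terms that a parallel machine can reduce in O(log n) time. The code below is sequential, and the function names and the reflexivity assumption (diagonal entries equal to 1) are illustrative.

```python
def maxmin_product(a, b):
    """Max-min composition of two n x n fuzzy relations.

    Every entry is independent of the others, so on a parallel machine
    all n*n entries can be computed at the same time.
    """
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Max-min transitive closure of a reflexive fuzzy relation r.

    Assumes r[i][i] == 1 for all i, so the 2**t-th max-min power covers
    every path of length up to 2**t; paths longer than n - 1 add nothing.
    """
    n = len(r)
    closure = [row[:] for row in r]
    span = 1
    while span < n - 1:
        closure = maxmin_product(closure, closure)
        span *= 2
    return closure
```

For the relation `[[1, 0.8, 0.2], [0.8, 1, 0.5], [0.2, 0.5, 1]]`, the closure raises the entry between elements 0 and 2 from 0.2 to 0.5, the strength of the path through element 1.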
NASA Astrophysics Data System (ADS)
Terekhov, Andrew V.
2015-04-01
A spectral-difference parallel algorithm for modeling acoustic and elastic wave fields for the 2.5D geometry in the presence of irregular surface topography is considered. The initial boundary-value problem is transformed to a series of boundary-value problems for elliptic equations via the integral Laguerre transform with respect to time. For solving difference equations, it is proposed to use efficient parallel procedures based on the fast Fourier transform and the dichotomy algorithm; the latter was designed for solving systems of linear algebraic equations (SLAEs) with tridiagonal and block-tridiagonal matrices. A modification of the dichotomy algorithm for diagonally dominant matrices, which makes it possible to reduce the time of preparatory computations and increase scalability of the method relative to the number of processors, is considered. The influence of different methods of curved boundary approximation on the quality of solution is investigated; practical evaluation of accuracy is performed. Calculations of the wave field with the use of high-resolution meshes for the Canadian Foothills medium model are presented. Implementation of the complex frequency-shifted PML boundary conditions for a dynamic elasticity problem is considered in the context of the spectral-difference approach.
Optimizing ion channel models using a parallel genetic algorithm on graphical processors.
Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon
2012-01-01
We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs.
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarria-Miranda, Daniel
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
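As background for the kernel discussed above, here is a minimal sequential sketch of betweenness centrality on an unweighted graph, using the well-known breadth-first accumulation scheme due to Brandes. It is not the lock-free parallel algorithm of the paper. The adjacency-dictionary input format and the function name are assumptions made for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Betweenness centrality for an unweighted graph.

    adj maps each vertex to a list of neighbours. Scores count ordered
    (source, target) pairs, so halve them for an undirected graph.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}                      # BFS distance from s
        sigma = {v: 0 for v in adj}        # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}       # shortest-path predecessors
        order = []                         # vertices in BFS visiting order
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

The updates to `sigma` and `preds` inside the traversal are the shared writes that a parallel version has to protect or restructure.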
NASA Technical Reports Server (NTRS)
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in detail. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on a triangular array and another based on a rectangular array, are presented for the multi-phase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-phase operations. Performance issues are also considered.
Goldberg, L.A.; Jerrum, M.; Leighton, T.; Rao, S.
1993-01-20
In this paper we consider the problem of interprocessor communication on a Completely Connected Optical Communication Parallel Computer (OCPC). The particular problem we study is that of realizing an h-relation. In this problem, each processor has at most h messages to send and at most h messages to receive. It is clear that any 1-relation can be realized in one communication step on an OCPC. However, the best known p-processor OCPC algorithm for realizing an arbitrary h-relation for h > 1 requires Θ(h + log p) expected communication steps. (This algorithm is due to Valiant and is based on earlier work of Anderson and Miller.) Valiant's algorithm is optimal only for h = Ω(log p) and it is an open question of Gereb-Graus and Tsantilas whether there is a faster algorithm for h = o(log p). In this paper we answer this question in the affirmative by presenting a Θ(h + log log p) communication step algorithm that realizes an arbitrary h-relation on a p-processor OCPC. We show that if h ≤ log p then the failure probability can be made as small as p^(-α) for any positive constant α.
NASA Astrophysics Data System (ADS)
Finney, Greg A.; Persons, Christopher M.; Henning, Stephan; Hazen, Jessie; Whitley, Daniel
2014-06-01
IERUS Technologies, Inc. and the University of Alabama in Huntsville have partnered to perform characterization and development of algorithms and hardware for adaptive optics. To date the algorithm work has focused on implementation of the stochastic parallel gradient descent (SPGD) algorithm. SPGD is a metric-based approach in which a scalar metric is optimized by taking random perturbative steps for many actuators simultaneously. This approach scales to systems with a large number of actuators while maintaining bandwidth, while conventional methods are negatively impacted by the very large matrix multiplications that are required. The metric approach enables the use of higher speed sensors with fewer (or even a single) sensing element(s), enabling a higher control bandwidth. Furthermore, the SPGD algorithm is model-free, and thus is not strongly impacted by the presence of nonlinearities which degrade the performance of conventional phase reconstruction methods. Finally, for high energy laser applications, SPGD can be performed using the primary laser beam without the need for an additional beacon laser. The conventional SPGD algorithm was modified to use an adaptive gain to improve convergence while maintaining low steady state error. Results from laboratory experiments using phase plates as atmosphere surrogates will be presented, demonstrating areas in which the adaptive gain yields better performance and areas which require further investigation.
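The basic SPGD update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a fixed gain rather than the adaptive gain the abstract mentions, a two-sided perturbation (a common variant), and a hypothetical function name and parameter set.

```python
import random


def spgd_maximize(metric, u, gain=1.0, perturbation=0.05,
                  iterations=2000, seed=0):
    """Maximize a scalar metric over control values u with basic SPGD.

    All actuators are perturbed at once by +/- perturbation. The change in
    the metric, multiplied by each actuator's own perturbation, serves as
    a stochastic gradient estimate. The fixed gain is an assumption.
    """
    rng = random.Random(seed)
    u = list(u)
    for _ in range(iterations):
        delta = [perturbation * rng.choice((-1.0, 1.0)) for _ in u]
        j_plus = metric([ui + di for ui, di in zip(u, delta)])
        j_minus = metric([ui - di for ui, di in zip(u, delta)])
        dj = j_plus - j_minus
        # Every actuator is updated from the same scalar measurement dj.
        u = [ui + gain * dj * di for ui, di in zip(u, delta)]
    return u
```

Only two metric evaluations are needed per iteration regardless of the number of actuators, which is why the approach avoids the large matrix multiplications of conventional reconstruction.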
Muckley, Matthew J; Noll, Douglas C; Fessler, Jeffrey A
2015-02-01
Sparsity-promoting regularization is useful for combining compressed sensing assumptions with parallel MRI for reducing scan time while preserving image quality. Variable splitting algorithms are the current state-of-the-art algorithms for SENSE-type MR image reconstruction with sparsity-promoting regularization. These methods are very general and have been observed to work with almost any regularizer; however, the tuning of associated convergence parameters is a commonly-cited hindrance in their adoption. Conversely, majorize-minimize algorithms based on a single Lipschitz constant have been observed to be slow in shift-variant applications such as SENSE-type MR image reconstruction since the associated Lipschitz constants are loose bounds for the shift-variant behavior. This paper bridges the gap between the Lipschitz constant and the shift-variant aspects of SENSE-type MR imaging by introducing majorizing matrices in the range of the regularizer matrix. The proposed majorize-minimize methods (called BARISTA) converge faster than state-of-the-art variable splitting algorithms when combined with momentum acceleration and adaptive momentum restarting. Furthermore, the tuning parameters associated with the proposed methods are unitless convergence tolerances that are easier to choose than the constraint penalty parameters required by variable splitting algorithms.
MELD: A Logical Approach to Distributed and Parallel Programming
2012-03-01
extremely successful. A recent success story is the MapReduce programming model, which can be viewed as a somewhat more generalized version of the data...parallel model that is optimized for large scale clusters. In MapReduce , the data sharing and scheduling model is very simple: the computation for...models than MapReduce , but they do not allow the programmer to specify scheduling strategies or support formal proof techniques. Hellerstein’s group
Optimization of composite structures by estimation of distribution algorithms
NASA Astrophysics Data System (ADS)
Grosset, Laurent
The design of high performance composite laminates, such as those used in aerospace structures, leads to complex combinatorial optimization problems that cannot be addressed by conventional methods. These problems are typically solved by stochastic algorithms, such as evolutionary algorithms. This dissertation proposes a new evolutionary algorithm for composite laminate optimization, named Double-Distribution Optimization Algorithm (DDOA). DDOA belongs to the family of estimation of distributions algorithms (EDA) that build a statistical model of promising regions of the design space based on sets of good points, and use it to guide the search. A generic framework for introducing statistical variable dependencies by making use of the physics of the problem is proposed. The algorithm uses two distributions simultaneously: the marginal distributions of the design variables, complemented by the distribution of auxiliary variables. The combination of the two generates complex distributions at a low computational cost. The dissertation demonstrates the efficiency of DDOA for several laminate optimization problems where the design variables are the fiber angles and the auxiliary variables are the lamination parameters. The results show that its reliability in finding the optima is greater than that of a simple EDA and of a standard genetic algorithm, and that its advantage increases with the problem dimension. A continuous version of the algorithm is presented and applied to a constrained quadratic problem. Finally, a modification of the algorithm incorporating probabilistic and directional search mechanisms is proposed. The algorithm exhibits a faster convergence to the optimum and opens the way for a unified framework for stochastic and directional optimization.
Gopinath, T; Kumar, Anil
2006-12-01
Hadamard spectroscopy has earlier been used to speed-up multi-dimensional NMR experiments. In this work, we speed-up the two-dimensional quantum computing scheme, by using Hadamard spectroscopy in the indirect dimension, resulting in a scheme which is faster and requires the Fourier transformation only in the direct dimension. Two and three qubit quantum gates are implemented with an extra observer qubit. We also use one-dimensional Hadamard spectroscopy for binary information storage by spatial encoding and implementation of a parallel search algorithm.
Parallel SOR Iterative Algorithms and Performance Evaluation on a Linux Cluster
2005-06-01
Red-Black two-color SOR implementation. Two other iterative methods, Jacobi method is preferred. Yanheh [4] showed that the and Gauss-Seidel (G-S...The optimal value of ω lies in (0, 2). The choice of ω = 1 corresponds to the Gauss-Seidel iteration. 2.2 Red-Black SOR...paper, a parallel algorithm for the structure of a matrix or a grid. However, the red-black SOR method with domain decomposition is multi-color
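A red-black SOR sweep for the five-point Laplace stencil can be sketched as below. This is a serial illustration only, assuming a rectangular grid stored as a list of lists with fixed boundary values. The function name and default parameters are hypothetical and are not taken from the paper.

```python
def red_black_sor(grid, omega=1.5, sweeps=200):
    """Solve the discrete Laplace equation in place with red-black SOR.

    Boundary entries of grid are held fixed. Interior points are split
    into two colours by the parity of i + j. Points of one colour depend
    only on points of the other colour, so each half-sweep could be
    updated in parallel. omega = 1 reduces to Gauss-Seidel.
    """
    rows, cols = len(grid), len(grid[0])
    for _ in range(sweeps):
        for colour in (0, 1):
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 != colour:
                        continue
                    gauss_seidel = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                           + grid[i][j - 1] + grid[i][j + 1])
                    # Over-relax: move past the Gauss-Seidel value by omega.
                    grid[i][j] += omega * (gauss_seidel - grid[i][j])
    return grid
```

With domain decomposition, each processor would apply the same two half-sweeps to its own block and exchange boundary rows and columns between colours.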
Wang, Xiaolong; Jiang, Aipeng; Jiangzhou, Shu; Li, Ping
2014-01-01
A large-scale parallel-unit seawater reverse osmosis desalination plant contains many reverse osmosis (RO) units. If the operating conditions change, these RO units will not work at the optimal design points which are computed before the plant is built. The operational optimization problem (OOP) of the plant is to find out a scheduling of operation to minimize the total running cost when the change happens. In this paper, the OOP is modelled as a mixed-integer nonlinear programming problem. A two-stage differential evolution algorithm is proposed to solve this OOP. Experimental results show that the proposed method is satisfactory in solution quality. PMID:24701180
Parallel algorithm for linear feature detection from airborne LiDAR data
NASA Astrophysics Data System (ADS)
Mareboyana, Manohar; Chi, Paul
2006-05-01
Linear features from airport images correspond to runways, taxiways and roads. Detecting runways helps pilots to focus on runway incursions in poor visibility conditions. In this work, we attempt to detect linear features from a LiDAR swath in near real time using a parallel implementation on a G5-based Apple cluster called Xseed. Data from the LiDAR swath is converted into a uniform grid with nearest neighbor interpolation. The edges and gradient directions are computed using standard edge detection algorithms such as Canny's detector. Edge linking and detecting straight-line features are described. Preliminary results on Reno, Nevada airport data are included.
Software Model Checking for Verifying Distributed Algorithms
2014-10-28
Verification procedure is an intelligent exhaustive search of the state space of the design Model Checking 6 Verifying Synchronous Distributed App...Distributed App Sagar Chaki, June 11, 2014 © 2014 Carnegie Mellon University Tool Usage Project webpage (http://mcda.googlecode.com) • Tutorial
NASA Astrophysics Data System (ADS)
Kiesewetter, Simon; Drummond, Peter D.
2017-03-01
A variance reduction method for stochastic integration of Fokker-Planck equations is derived. This unifies the cumulant hierarchy and stochastic equation approaches to obtaining moments, giving a performance superior to either. We show that the brute force method of reducing sampling error by just using more trajectories in a sampled stochastic equation is not the best approach. The alternative of using a hierarchy of moment equations is also not optimal, as it may converge to erroneous answers. Instead, through Bayesian conditioning of the stochastic noise on the requirement that moment equations are satisfied, we obtain improved results with reduced sampling errors for a given number of stochastic trajectories. The method used here converges faster in time-step than Ito-Euler algorithms. This parallel optimized sampling (POS) algorithm is illustrated by several examples, including a bistable nonlinear oscillator case where moment hierarchies fail to converge.
Model-based spectral estimation of Doppler signals using parallel genetic algorithms.
Solano González, J; Rodríguez Vázquez, K; García Nocetti, D F
2000-05-01
Conventional spectral analysis methods use a fast Fourier transform (FFT) on consecutive or overlapping windowed data segments. For Doppler ultrasound signals, this approach suffers from an inadequate frequency resolution due to the time segment duration and the non-stationarity characteristics of the signals. Parametric or model-based estimators can give significant improvements in the time-frequency resolution at the expense of a higher computational complexity. This work describes an approach which implements in real-time a parametric spectral estimator method using genetic algorithms (GAs) in order to find the optimum set of parameters for the adaptive filter that minimises the error function. The aim is to reduce the computational complexity of the conventional algorithm by using the simplicity associated with GAs and exploiting their parallel characteristics. This will allow the implementation of higher order filters, increasing the spectrum resolution, and opening a greater scope for using more complex methods.
A parallel algorithm for solving the n-queens problem based on inspired computational model.
Wang, Zhaocai; Huang, Dongmei; Tan, Jian; Liu, Taigang; Zhao, Kai; Li, Lei
2015-05-01
DNA computing provides a promising method to solve computationally intractable problems. The n-queens problem is a well-known NP-hard problem, which arranges n queens on an n × n board in different rows, columns and diagonals so that no two queens attack each other. In this paper, we present a novel parallel DNA algorithm for solving the n-queens problem using DNA molecular operations based on a biologically inspired computational model. For the n-queens problem, we design flexible-length DNA strands representing elements of the allocation matrix, apply appropriate biological manipulations and obtain the solutions of the n-queens problem in proper length and O(n^2) time complexity. We extend the application of DNA molecular operations, simultaneously simplify the complexity of the computation, and use simulation to verify the feasibility of the DNA algorithm.
Du, Tingsong; Hu, Yang; Ke, Xianting
2015-01-01
An improved quantum artificial fish swarm algorithm (IQAFSA) for solving distributed network programming considering distributed generation is proposed in this work. The IQAFSA, based on quantum computing, which offers exponential acceleration for heuristic algorithms, uses quantum bits to encode artificial fish and uses the quantum revolving gate, preying behavior, following behavior, and variation of quantum artificial fish to update the artificial fish in the search for the optimal value. Then, we apply the proposed new algorithm, the quantum artificial fish swarm algorithm (QAFSA), the basic artificial fish swarm algorithm (BAFSA), and the global edition artificial fish swarm algorithm (GAFSA) to simulation experiments on some typical test functions. The simulation results demonstrate that the proposed algorithm can escape from local extrema effectively and has higher convergence speed and better accuracy. Finally, IQAFSA is applied to distributed network problems, and the simulation results for a 33-bus radial distribution network system show that IQAFSA attains the minimum power loss compared with BAFSA, GAFSA, and QAFSA.
Geiger, D.; Girosi, F.
1989-05-01
In recent years many researchers have investigated the use of Markov random fields (MRFs) for computer vision. They can be applied, for example, in the output of the visual processes to reconstruct surfaces from sparse and noisy depth data, or to integrate early vision processes to label physical discontinuities. Drawbacks of MRF models have been the computational complexity of the implementation and the difficulty in estimating the parameters of the model. This paper derives deterministic approximations to MRF models. One of the considered models is shown to give in a natural way the graduated non-convexity (GNC) algorithm. This model can be applied to smooth a field while preserving its discontinuities. A new model is then proposed: it allows the gradient of the field to be enhanced at the discontinuities and smoothed elsewhere. All the theoretical results are obtained in the framework of mean field theory, which is a well-known statistical mechanics technique. A fast, parallel, and iterative algorithm to solve the deterministic equations of the two models is presented, together with experiments on synthetic and real images. The algorithm is applied to the problem of surface reconstruction in the case of sparse data. A fast algorithm is also described that solves the problem of aligning the discontinuities of different visual models with intensity edges via integration.
A parallel algorithm for viewshed analysis in three-dimensional Digital Earth
NASA Astrophysics Data System (ADS)
Feng, Wang; Gang, Wang; Deji, Pan; Yuan, Liu; Liuzhong, Yang; Hongbo, Wang
2015-02-01
Viewshed analysis, often supported by geographic information systems, is widely used in the three-dimensional (3D) Digital Earth system. Many of the analyses involve the siting of features and real-time decision-making. Viewshed analysis is usually performed at a large scale, which poses substantial computational challenges, as geographic datasets continue to become increasingly large. Previous research on viewshed analysis has been generally limited to a single data structure (i.e., DEM), which cannot be used to analyze viewsheds in complicated scenes. In this paper, a real-time algorithm for viewshed analysis in Digital Earth is presented using the parallel computing of graphics processing units (GPUs). An occlusion for each geometric entity in the neighbor space of the viewshed point is generated according to line-of-sight. The region within the occlusion is marked by a stencil buffer within the programmable 3D visualization pipeline. The marked region is concurrently drawn in red. In contrast to traditional algorithms based on line-of-sight, the new algorithm, in which the viewshed calculation is integrated with the rendering module, is more efficient and stable. This proposed method of viewshed generation is closer to the reality of the virtual geographic environment. No DEM interpolation, which is seen as a computational burden, is needed. The algorithm was implemented in a 3D Digital Earth system (GeoBeans3D) with the DirectX application programming interface (API) and has been widely used in a range of applications.
NASA Astrophysics Data System (ADS)
Rerucha, Simon; Sarbort, Martin; Hola, Miroslava; Cizek, Martin; Hucl, Vaclav; Cip, Ondrej; Lazar, Josef
2016-12-01
Homodyne detection with only a single detector represents a promising approach in interferometric applications, enabling a significant reduction of the optical system complexity while preserving the fundamental resolution and dynamic range of single-frequency laser interferometers. We present the design, implementation and analysis of algorithmic methods for computational processing of the single-detector interference signal based on parallel pipelined processing suitable for real-time implementation on a programmable hardware platform (e.g. the FPGA - Field Programmable Gate Array or the SoC - System on Chip). The algorithmic methods incorporate (a) the single detector signal (sine) scaling, filtering, demodulation and mixing necessary for the second (cosine) quadrature signal reconstruction, followed by a conic section projection in the Cartesian plane, as well as (b) the phase unwrapping together with the goniometric and linear transformations needed for the scale linearization and periodic error correction. The digital computing scheme was designed for bandwidths up to tens of megahertz, which would allow displacements to be measured at velocities around half a metre per second. The algorithmic methods were tested in real-time operation with a PC-based reference implementation that took advantage of pipelined processing by balancing the computational load among multiple processor cores. The results indicate that the algorithmic methods are suitable for a wide range of applications [3] and that they bring fringe counting interferometry closer to industrial applications due to their optical setup simplicity and robustness, computational stability, scalability and cost-effectiveness.
Experiences with serial and parallel algorithms for channel routing using simulated annealing
NASA Technical Reports Server (NTRS)
Brouwer, Randall Jay
1988-01-01
Two algorithms for channel routing using simulated annealing are presented. Simulated annealing is an optimization methodology which allows the solution process to back up out of local minima that may be encountered by inappropriate selections. By properly controlling the annealing process, it is very likely that the optimal solution to an NP-complete problem such as channel routing may be found. The algorithm presented proposes very relaxed restrictions on the types of allowable transformations, including overlapping nets. By freeing that restriction and controlling overlap situations with an appropriate cost function, the algorithm becomes very flexible and can be applied to many extensions of channel routing. The selection of the transformation utilizes a number of heuristics, still retaining the pseudorandom nature of simulated annealing. The algorithm was implemented as a serial program for a workstation, and a parallel program designed for a hypercube computer. The details of the serial implementation are presented, including many of the heuristics used and some of the resulting solutions.
Fast and parallel spectral transform algorithms for global shallow water models. Doctoral thesis
Jakob, R.
1993-01-01
The dissertation examines spectral transform algorithms for the solution of the shallow water equations on the sphere and studies their implementation and performance on shared memory vector multiprocessors. Beginning with the standard spectral transform algorithm in vorticity divergence form and its implementation in the Fortran based parallel programming language Force, two modifications are researched. First, the transforms and matrices associated with the meridional derivatives of the associated Legendre functions are replaced by corresponding operations with the spherical harmonic coefficients. Second, based on the fast Fourier transform and the fast multipole method, a lower complexity algorithm is derived that uses fast transformations between Legendre and interior Fourier nodes, fast surface spherical truncation and a fast spherical Helmholtz solver. Because the global shallow water equations are similar to the horizontal dynamical component of general circulation models, the results can be applied to spectral transform numerical weather prediction and climate models. In general, the derived algorithms may speed up the solution of time dependent partial differential equations in spherical geometry.
Web based parallel/distributed medical data mining using software agents
Kargupta, H.; Stafford, B.; Hamzaoglu, I.
1997-12-31
This paper describes an experimental parallel/distributed data mining system PADMA (PArallel Data Mining Agents) that uses software agents for local data accessing and analysis and a web based interface for interactive data visualization. It also presents the results of applying PADMA for detecting patterns in unstructured texts of postmortem reports and laboratory test data for Hepatitis C patients.
A distributed scheduling algorithm for heterogeneous real-time systems
NASA Technical Reports Server (NTRS)
Zeineldine, Osman; El-Toweissy, Mohamed; Mukkamala, Ravi
1991-01-01
Much of the previous work on load balancing and scheduling in distributed environments was concerned with homogeneous systems and homogeneous loads. Several of the results indicated that random policies are as effective as other more complex load allocation policies. The effects of heterogeneity on scheduling algorithms for hard real-time systems are examined. A distributed scheduler designed specifically to handle heterogeneities in both nodes and node traffic is proposed. The performance of the algorithm is measured in terms of the percentage of jobs discarded. While a random task allocation is very sensitive to heterogeneities, the algorithm is shown to be robust to such non-uniformities in system components and load.
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martin, Gabriel; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2016-04-01
In recent years, hyperspectral analysis has been applied in many remote sensing applications. In fact, hyperspectral unmixing has been a challenging task in hyperspectral data exploitation. This process consists of three stages: (i) estimation of the number of pure spectral signatures or endmembers, (ii) automatic identification of the estimated endmembers, and (iii) estimation of the fractional abundance of each endmember in each pixel of the scene. However, unmixing algorithms can be computationally very expensive, a fact that compromises their use in applications under real-time constraints. In recent years, several techniques have been proposed to solve the aforementioned problem, but until now most works have focused on the second and third stages. The execution cost of the first stage is usually lower than that of the other stages. Indeed, it can be optional if this estimate is known a priori. However, its acceleration on parallel architectures is still an interesting and open problem. In this paper we have addressed this issue focusing on the GENE algorithm, a promising geometry-based proposal introduced in [1]. We have evaluated our parallel implementation in terms of both accuracy and computational performance through Monte Carlo simulations for real and synthetic data experiments. Performance results on a modern GPU show satisfactory 16x speedup factors, which allow us to expect that this method could meet real-time requirements on a fully operational unmixing chain.
Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael
2015-04-08
The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. As a result, application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, therefore enabling the applications to achieve better time to solution and scalability on Titan during production.
Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael
2015-04-08
The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. As a result, application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, therefore enabling the applications to achieve better time to solution and scalability on Titan during production.
NASA Astrophysics Data System (ADS)
Zhang, Zhi-Yong; Tan, Han-Dong; Wang, Kun-Peng; Lin, Chang-Hong; Zhang, Bin; Xie, Mao-Bi
2016-03-01
Traditional two-dimensional (2D) complex resistivity forward modeling is based on Poisson's equation but spectral induced polarization (SIP) data are the coproducts of the induced polarization (IP) and the electromagnetic induction (EMI) effects. This is especially true under high frequencies, where the EMI effect can exceed the IP effect. 2D inversion that only considers the IP effect reduces the reliability of the inversion data. In this paper, we derive differential equations using Maxwell's equations. With the introduction of the Cole-Cole model, we use the finite-element method to conduct 2D SIP forward modeling that considers the EMI and IP effects simultaneously. The data-space Occam method, in which different constraints to the model smoothness and parametric boundaries are introduced, is then used to simultaneously obtain the four parameters of the Cole-Cole model using multi-array electric field data. This approach not only improves the stability of the inversion but also significantly reduces the solution ambiguity. To improve the computational efficiency, message passing interface programming was used to accelerate the 2D SIP forward modeling and inversion. Synthetic datasets were tested using both serial and parallel algorithms, and the tests suggest that the proposed parallel algorithm is robust and efficient.
Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.
Tao, Liang; Kwan, Hon Keung
2012-07-01
Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which makes them attractive for real-time image processing.
Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.
1997-12-31
The aim of this work is to develop a 3D parallel program for the numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS), satisfying the condition that the numerical results are independent of the number of processors involved. Two fundamentally different approaches to the structure of massively parallel computations have been developed. The first approach uses a 3D data matrix decomposition that is reconstructed during each temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on a 3D data matrix decomposition that is not reconstructed during a temporal cycle. The program was developed on the 8-processor CS MP-3 made at VNIIEF and was adapted to the massively parallel CS Meiko-2 at LLNL through the joint efforts of VNIIEF and LLNL staff. A large number of numerical experiments have been carried out with different numbers of processors, up to 256, and the efficiency of parallelization has been evaluated as a function of the number of processors and their parameters.
Multirate parallel distributed compensation of a cluster in wireless sensor and actor networks
NASA Astrophysics Data System (ADS)
Yang, Chun-xi; Huang, Ling-yun; Zhang, Hao; Hua, Wang
2016-01-01
The stabilisation problem for one of the clusters with bounded multiple random time delays and packet dropouts in wireless sensor and actor networks is investigated in this paper. A new multirate switching model is constructed to describe the features of this single-input multiple-output linear system. Because controller design under multiple constraints is difficult in the multirate switching model, the model is converted to a Takagi-Sugeno fuzzy model. By designing a multirate parallel distributed compensation, a sufficient condition is established to ensure that the closed-loop fuzzy control system is globally exponentially stable. The multirate parallel distributed compensation gains can be obtained by solving an auxiliary convex optimisation problem. Finally, two numerical examples are given to show that, compared with solving for a switching controller, the multirate parallel distributed compensation can be obtained easily. Furthermore, it has stronger robust stability than an arbitrary switching controller and a single-rate parallel distributed compensation under the same conditions.
NASA Astrophysics Data System (ADS)
Baba, Toshitaka; Takahashi, Narumi; Kaneda, Yoshiyuki; Ando, Kazuto; Matsuoka, Daisuke; Kato, Toshihiro
2015-12-01
Because of improvements in offshore tsunami observation technology, dispersion phenomena during tsunami propagation have often been observed in recent tsunamis, for example the 2004 Indian Ocean and 2011 Tohoku tsunamis. The dispersive propagation of tsunamis can be simulated by use of the Boussinesq model, but the model demands many computational resources. However, rapid progress has been made in parallel computing technology. In this study, we investigated a parallelized approach for dispersive tsunami wave modeling. Our new parallel software solves the nonlinear Boussinesq dispersive equations in spherical coordinates. A variable nested algorithm was used to increase spatial resolution in the target region. The software can also be used to predict tsunami inundation on land. We used the dispersive tsunami model to simulate the 2011 Tohoku earthquake on the Supercomputer K. Good agreement was apparent between the dispersive wave model results and the tsunami waveforms observed offshore. The finest bathymetric grid interval was 2/9 arcsec (approx. 5 m) along longitude and latitude lines. Use of this grid simulated tsunami soliton fission near the Sendai coast. Incorporating the three-dimensional shape of buildings and structures led to improved modeling of tsunami inundation.
Multi-thread parallel algorithm for reconstructing 3D large-scale porous structures
NASA Astrophysics Data System (ADS)
Ju, Yang; Huang, Yaohui; Zheng, Jiangtao; Qian, Xu; Xie, Heping; Zhao, Xi
2017-04-01
Geomaterials inherently contain many discontinuous, multi-scale, geometrically irregular pores, forming a complex porous structure that governs their mechanical and transport properties. The development of an efficient reconstruction method for representing porous structures can significantly contribute toward providing a better understanding of the governing effects of porous structures on the properties of porous materials. In order to improve the efficiency of reconstructing large-scale porous structures, a multi-thread parallel scheme was incorporated into the simulated annealing reconstruction method. In the method, four correlation functions, which include the two-point probability function, the linear-path functions for the pore phase and the solid phase, and the fractal system function for the solid phase, were employed for better reproduction of the complex well-connected porous structures. In addition, a random sphere packing method and a self-developed pre-conditioning method were incorporated to cast the initial reconstructed model and select independent interchanging pairs for parallel multi-thread calculation, respectively. The accuracy of the proposed algorithm was evaluated by examining the similarity between the reconstructed structure and a prototype in terms of their geometrical, topological, and mechanical properties. Comparisons of the reconstruction efficiency of porous models with various scales indicated that the parallel multi-thread scheme significantly shortened the execution time for reconstruction of a large-scale well-connected porous model compared to a sequential single-thread procedure.
A hybrid-algorithm-based parallel computing framework for optimal reservoir operation
NASA Astrophysics Data System (ADS)
Li, X.; Wei, J.; Li, T.; Wang, G.
2012-12-01
To date, various optimization models have been developed to offer optimal operating policies for reservoirs. Each optimization model has its own merits and limitations, and no general algorithm exists even today. At times, several optimization models have to be combined to obtain the desired results. In this paper, we present a parallel computing framework that combines various optimization models in a different way from traditional serial computing. The framework consists of three functional processor types: a master processor, slave processors, and a transfer processor. The master processor holds the full computation scheme and allocates optimization models to slave processors; slave processors run the allocated optimization models; the transfer processor is in charge of communicating solutions among all slave processors. On this basis, the proposed framework can run various optimization models in parallel. Because of the solution communication, the framework can also integrate the merits of the optimization models involved during iteration, so the performance of each optimization model can be improved. Moreover, it can be concluded that the framework can effectively improve solution quality and increase solution speed by making full use of the computing power of parallel computers.
NASA Astrophysics Data System (ADS)
Moryakov, A. V.
2016-12-01
An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
A distributed parallel storage architecture and its potential application within EOSDIS
NASA Technical Reports Server (NTRS)
Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony
1994-01-01
We describe the architecture, implementation, and use of a scalable, high performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
Current distribution in parallel paths of the coils of a 50 Hz prototype dipole magnet
Otter, A.J.
1996-07-01
The prototype dipole made for TRIUMF's Kaon Factory proposal used coils with 12 parallel paths to reduce eddy current losses in the conductors. The ac current distribution in these paths was non-uniform due to different self and mutual inductances. Small differences in inductance can cause large circulating currents in the parallel windings. This paper describes the measurement of the inductances and shows an attempt to predict the current distribution for two alternative connection schemes.
A distributed parallel storage architecture and its potential application within EOSDIS
Johnston, W.E.; Tierney, B.; Feuquay, J.; Butzer, T.
1995-01-01
We describe the architecture, implementation, use, and potential use of a scalable, high-performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
NASA Technical Reports Server (NTRS)
Sanyal, Soumya; Jain, Amit; Das, Sajal K.; Biswas, Rupak
2003-01-01
In this paper, we propose a distributed approach for mapping a single large application to a heterogeneous grid environment. To minimize the execution time of the parallel application, we distribute the mapping overhead to the available nodes of the grid. This approach not only provides a fast mapping of tasks to resources but is also scalable. We adopt a hierarchical grid model and accomplish the job of mapping tasks to this topology using a scheduler tree. Results show that our three-phase algorithm provides high quality mappings, and is fast and scalable.
NASA Technical Reports Server (NTRS)
Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)
2001-01-01
The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have more than 10^6 variables, therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.
Parallel and distributed computation for fault-tolerant object recognition
NASA Technical Reports Server (NTRS)
Wechsler, Harry
1988-01-01
The distributed associative memory (DAM) model is suggested for distributed and fault-tolerant computation as it relates to object recognition tasks. The fault-tolerance is with respect to geometrical distortions (scale and rotation), noisy inputs, occlusion/overlap, and memory faults. An experimental system was developed for fault-tolerant structure recognition which shows the feasibility of such an approach. The approach is further extended to the problem of multisensory data integration and applied successfully to the recognition of colored polyhedral objects.
Rausch, Tobias; Thomas, Alun; Camp, Nicola J.; Cannon-Albright, Lisa A.; Facelli, Julio C.
2008-01-01
This paper describes a novel algorithm to analyze genetic linkage data using pattern recognition techniques and genetic algorithms (GA). The method allows a search for regions of the chromosome that may contain genetic variations that jointly predispose individuals to a particular disease. The method uses correlation analysis, filtering theory and genetic algorithms to achieve this goal. Because current genome scans use from hundreds to hundreds of thousands of markers, two versions of the method have been implemented. The first is an exhaustive analysis version that can be used to visualize, explore, and analyze small genetic data sets for two-marker correlations; the second is a GA version, which uses a parallel implementation allowing searches of higher-order correlations in large data sets. Results on simulated data sets indicate that the method can be informative in the identification of major disease loci and gene-gene interactions in genome-wide linkage data and that further exploration of these techniques is justified. The results presented for both variants of the method show that it can help genetic epidemiologists to identify promising combinations of genetic factors that might predispose to complex disorders. In particular, the correlation analysis of IBD expression patterns might hint at possible gene-gene interactions and the filtering might be a fruitful approach to distinguish true correlation signals from noise. PMID:18547558
A Portable 3D FFT Package for Distributed-Memory Parallel Architectures
NASA Technical Reports Server (NTRS)
Ding, H. Q.; Ferraro, R. D.; Gennery, D. B.
1995-01-01
A parallel algorithm for 3D FFTs is implemented as a series of local 1D FFTs combined with data transposes. This allows the use of vendor supplied (often fully optimized) sequential 1D FFTs. The FFTs are carried out in-place by using an in-place data transpose across the processors.
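The decomposition described above (local 1D FFTs alternating with data transposes) can be sketched in a few lines. This is a minimal single-process illustration, not the package's implementation: the function names `fft1d`, `fft_last_axis`, `rotate_axes`, and `fft3d` are hypothetical, the textbook radix-2 routine stands in for a vendor-supplied 1D FFT, and the transpose is done out of place in memory rather than in place across processors.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two.
    Stands in for a vendor-supplied sequential 1D FFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_last_axis(a):
    """Apply the 1D FFT to every line along the last axis of a[i][j][k]."""
    return [[fft1d(row) for row in plane] for plane in a]

def rotate_axes(a):
    """Transpose so that axes (i, j, k) become (k, i, j).
    In a distributed code this is the inter-processor data transpose."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]

def fft3d(a):
    """3D FFT as three rounds of 'local 1D FFTs, then transpose'.
    After three rotations the axes are back in (i, j, k) order."""
    for _ in range(3):
        a = rotate_axes(fft_last_axis(a))
    return a
```

Because each round only transforms along the axis that is currently local, the 1D routine never needs to know about the data layout; all layout changes are confined to the transpose step.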
Studies of parallel algorithms for the solution of a Fokker-Planck equation
Deck, D.; Samba, G.
1995-11-01
The study of laser-created plasmas often requires the use of a kinetic model rather than a hydrodynamic one. This model change occurs, for example, in the hot spot formation in an ICF experiment or during the relaxation of colliding plasmas. When the gradient scale lengths or the size of a given system are not small compared to the characteristic mean-free-path, we have to deal with non-equilibrium situations, which can be described by the distribution functions of every species in the system. We present here a numerical method in plane or spherical 1-D geometry, for the solution of a Fokker-Planck equation that describes the evolution of such functions in the phase space. The size and the time scale of kinetic simulations require the use of Massively Parallel Computers (MPP). We have adopted a message-passing strategy using Parallel Virtual Machine (PVM).
Scalable load balancing for massively parallel distributed Monte Carlo particle transport
O'Brien, M. J.; Brantley, P. S.; Joy, K. I.
2013-07-01
In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time O(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrence Livermore National Laboratory. (authors)
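One way to see how iterated pair-wise steps can balance N processors in O(log(N)) rounds is the classic hypercube "dimension exchange" pattern, sketched below. This is an assumption made for illustration, not necessarily the scheme used in the paper: the function name `pairwise_balance` is hypothetical, the work is treated as infinitely divisible, and the processors are simulated as entries of a list rather than as real ranks.

```python
def pairwise_balance(loads):
    """Simulate dimension-exchange load balancing.

    In round d, processor p pairs with processor p XOR 2**d and the two
    split their combined work evenly. With N a power of two, log2(N)
    rounds leave every processor holding the global mean.
    Returns (balanced_loads, number_of_rounds).
    """
    n = len(loads)
    if n & (n - 1):
        raise ValueError("number of processors must be a power of two")
    loads = list(loads)
    rounds = 0
    bit = 1
    while bit < n:
        for p in range(n):
            q = p ^ bit  # partner of p in this round
            if p < q:    # handle each pair once
                avg = (loads[p] + loads[q]) / 2
                loads[p] = loads[q] = avg
        bit <<= 1
        rounds += 1
    return loads, rounds
```

Each processor talks to exactly one partner per round and never inspects the global workload, which is why the cost grows with the number of rounds, log2(N), rather than with N.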
Lilith: A scalable secure tool for massively parallel distributed computing
Armstrong, R.C.; Camp, L.J.; Evensky, D.A.; Gentile, A.C.
1997-06-01
Changes in high performance computing have necessitated the ability to utilize and interrogate potentially many thousands of processors. The ASCI (Advanced Strategic Computing Initiative) program conducted by the United States Department of Energy, for example, envisions thousands of distinct operating systems connected by low-latency gigabit-per-second networks. In addition multiple systems of this kind will be linked via high-capacity networks with latencies as low as the speed of light will allow. Code which spans systems of this sort must be scalable; yet constructing such code whether for applications, debugging, or maintenance is an unsolved problem. Lilith is a research software platform that attempts to answer these questions with an end toward meeting these needs. Presently, Lilith exists as a test-bed, written in Java, for various spanning algorithms and security schemes. The test-bed software has, and enforces, hooks allowing implementation and testing of various security schemes.
Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs
2008-01-01
follows. In Sections 2 and 3, we present precise definitions for distributed programs, specifications, and fault-tolerance. We formally state the...Subsequently, experimental results and analysis are presented in Section 6. Related work is discussed in Section 7. Finally, we conclude in Section...infinite computation by stuttering at sl. On the other hand, if there exists a state sd such that there is no outgoing transition (or a self-loop
Measurement of parallel ion energy distribution function in PISCES plasma
Tynan, G.R.; Goebel, D.M.; Conn, R.W.
1987-08-01
The PISCES facility is used to conduct controlled plasma-surface interaction experiments. Plasma parameters typical of those found in the edge plasmas of major fusion confinement experiments are produced. In this work, the energy distribution of the ion flux incident on a material surface is measured using a gridded energy analyzer in place of a material sample. The full width at half maximum (FWHM) of the energy distribution of the ion flux is found to vary from 10 eV to 30 eV in both hydrogen and deuterium plasmas. Helium plasmas have a much lower FWHM energy spread than hydrogen and deuterium plasmas. The FWHM ion energy spread is found to be linearly related to the electron temperature. The most probable ion energy is found to be linearly related to the bias applied to the energy analyzer. Other plasma parameters have a weak influence upon the energy distribution of the ion flux. Two possible physical mechanisms for producing the observed results are introduced and suggestions for further work are made. The impact of the reported measurements on the materials experiments conducted in the PISCES facility is discussed and recommendations for future experiments are made. 11 refs., 13 figs.
On the consequences of bi-Maxwellian plasma distributions for parallel electric fields
NASA Technical Reports Server (NTRS)
Olsen, Richard C.
1992-01-01
The objective is to use the measurements of the equatorial particle distributions to obtain the parallel electric field structure and the evolution of the plasma distribution function along the field line. Appropriate use of kinetic theory allows us to use the measured (and inferred) particle distributions to obtain the electric field, and hence the variation of plasma density along the magnetic field line. The approach here is to utilize the adiabatic invariants and assume the plasma distributions are in equilibrium.
Learning factorizations in estimation of distribution algorithms using affinity propagation.
Santana, Roberto; Larrañaga, Pedro; Lozano, José A
2010-01-01
Estimation of distribution algorithms (EDAs) that use marginal product model factorizations have been widely applied to a broad range of mainly binary optimization problems. In this paper, we introduce the affinity propagation EDA (AffEDA) which learns a marginal product model by clustering a matrix of mutual information learned from the data using a very efficient message-passing algorithm known as affinity propagation. The introduced algorithm is tested on a set of binary and nonbinary decomposable functions and using a hard combinatorial class of problem known as the HP protein model. The results show that the algorithm is a very efficient alternative to other EDAs that use marginal product model factorizations such as the extended compact genetic algorithm (ECGA) and improves the quality of the results achieved by ECGA when the cardinality of the variables is increased.
A review of estimation of distribution algorithms in bioinformatics
Armañanzas, Rubén; Inza, Iñaki; Santana, Roberto; Saeys, Yvan; Flores, Jose Luis; Lozano, Jose Antonio; Peer, Yves Van de; Blanco, Rosa; Robles, Víctor; Bielza, Concha; Larrañaga, Pedro
2008-01-01
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain. PMID:18822112
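The EDA loop described above (sample candidates, keep the promising ones, re-learn the probabilistic model from them) can be illustrated with the simplest member of the family, a univariate model with independent bits. This is a minimal sketch on the toy OneMax problem (maximize the number of ones); the function name `umda_onemax` and all parameter values are hypothetical and are not drawn from the works reviewed.

```python
import random

def umda_onemax(n_bits=20, pop_size=100, n_select=50, generations=30, seed=0):
    """Univariate marginal distribution algorithm on OneMax.

    The probabilistic model is one Bernoulli probability per bit,
    re-estimated each generation from the best individuals.
    Returns (best_individual, best_fitness).
    """
    rng = random.Random(seed)
    probs = [0.5] * n_bits                      # initial model: uniform
    lo, hi = 1.0 / n_bits, 1.0 - 1.0 / n_bits   # keep probabilities off 0 and 1
    best = None
    for _ in range(generations):
        # 1. Sample a population from the current model.
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        # 2. Select the promising solutions (fitness = number of ones).
        pop.sort(key=sum, reverse=True)
        if best is None or sum(pop[0]) > sum(best):
            best = pop[0]
        selected = pop[:n_select]
        # 3. Learn the model from the selected solutions.
        probs = [min(hi, max(lo, sum(ind[i] for ind in selected) / n_select))
                 for i in range(n_bits)]
    return best, sum(best)
```

More elaborate EDAs keep the same three-step loop and differ mainly in step 3, where the independent-bit model is replaced by one that captures dependencies between variables.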
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-10-23
The unbalanced assignment problem (UAP) is the problem of optimally assigning n jobs to m individuals (m < n), such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operation management and applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation.
Fast 2D DOA Estimation Algorithm by an Array Manifold Matching Method with Parallel Linear Arrays
Yang, Lisheng; Liu, Sheng; Li, Dong; Jiang, Qingping; Cao, Hailin
2016-01-01
In this paper, the problem of two-dimensional (2D) direction-of-arrival (DOA) estimation with parallel linear arrays is addressed. Two array manifold matching (AMM) approaches, in this work, are developed for the incoherent and coherent signals, respectively. The proposed AMM methods estimate the azimuth angle only with the assumption that the elevation angles are known or estimated. The proposed methods are time efficient since they do not require eigenvalue decomposition (EVD) or peak searching. In addition, the complexity analysis shows the proposed AMM approaches have lower computational complexity than many current state-of-the-art algorithms. The estimated azimuth angles produced by the AMM approaches are automatically paired with the elevation angles. More importantly, for estimating the azimuth angles of coherent signals, the aperture loss issue is avoided since a decorrelation procedure is not required for the proposed AMM method. Numerical studies demonstrate the effectiveness of the proposed approaches. PMID:26907301
Quartic scaling MP2 for solids: A highly parallelized algorithm in the plane wave basis
NASA Astrophysics Data System (ADS)
Schäfer, Tobias; Ramberger, Benjamin; Kresse, Georg
2017-03-01
We present a low-complexity algorithm to calculate the correlation energy of periodic systems in second-order Møller-Plesset (MP2) perturbation theory. In contrast to previous approximation-free MP2 codes, our implementation possesses a quartic scaling, O(N^4), with respect to the system size N and offers an almost ideal parallelization efficiency. The general issue that the correlation energy converges slowly with the number of basis functions is eased by an internal basis set extrapolation. The key concept to reduce the scaling is to eliminate all summations over virtual orbitals which can be elegantly achieved in the Laplace transformed MP2 formulation using plane wave basis sets and fast Fourier transforms. Analogously, this approach could allow us to calculate second order screened exchange as well as particle-hole ladder diagrams with a similar low complexity. Hence, the presented method can be considered as a step towards systematically improved correlation energies.
Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer
NASA Technical Reports Server (NTRS)
Godoy, William F.; Liu, Xu
2011-01-01
General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential extension in radiative solvers (e.g., Monte Carlo, discrete ordinates) at a minimal learning curve.
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-01-01
The unbalanced assignment problem (UAP) is the problem of optimally assigning n jobs to m individuals (m < n), such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operation management and applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation. PMID:26512650
Impacts of Time Delays on Distributed Algorithms for Economic Dispatch
Yang, Tao; Wu, Di; Sun, Yannan; Lian, Jianming
2015-07-26
The economic dispatch problem (EDP) is an important problem in power systems. It can be formulated as an optimization problem with the objective of minimizing the total generation cost subject to the power balance constraint and generator capacity limits. Recently, several consensus-based algorithms have been proposed to solve the EDP in a distributed manner. However, the impacts of communication time delays on these distributed algorithms are not fully understood, especially for the case where the communication network is directed, i.e., the information exchange is unidirectional. This paper investigates communication time delay effects on a distributed algorithm for directed communication networks. The algorithm has been tested by applying time delays to different types of information exchange. Several case studies are carried out to evaluate the effectiveness and performance of the algorithm in the presence of time delays in communication networks. It is found that time delays have negative effects on the convergence rate, and can even cause the algorithm to converge to an incorrect value or fail to converge.
Cohen, J.D.; Dunbar, K.; McClelland, J.L.
1989-11-22
A growing body of evidence suggests that traditional views of automaticity are in need of revision. For example, automaticity has often been treated as an all-or-none phenomenon, and traditional theories have held that automatic processes are independent of attention. Yet recent empirical data suggest that automatic processes are continuous, and furthermore are subject to attentional control. In this paper we present a model of attention which addresses these issues. Using a parallel distributed processing framework we propose that the attributes of automaticity depend upon the strength of a processing pathway and that strength increases with training. Using the Stroop effect as an example, we show how automatic processes are continuous and emerge gradually with practice. Specifically, we present a computational model of the Stroop task which simulates the time course of processing as well as the effects of learning. This was accomplished by combining the cascade mechanism described by McClelland (1979) with the back propagation learning algorithm (Rumelhart, Hinton, Williams, 1986). The model is able to simulate performance in the standard Stroop task, as well as aspects of performance in variants of this task which manipulate SOA, response set, and degree of practice. In the discussion we contrast our model with other models, and indicate how it relates to many of the central issues in the literature on attention, automaticity, and interference.
Distributed query plan generation using multiobjective genetic algorithm.
Panicker, Shina; Kumar, T V Vijay
2014-01-01
A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. This distributed query plan generation (DQPG) problem has already been addressed using a single-objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, the DQPG problem is formulated and solved as a biobjective optimization problem whose two objectives are to minimize the total LPC and to minimize the total CC. These objectives are simultaneously optimized using the multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single-objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability.
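The biobjective formulation above can be illustrated with a small non-dominated filter. This is only a sketch of the Pareto-dominance test that underlies the first front of a non-dominated sort, not the NSGA-II algorithm itself; the plan names and (LPC, CC) cost pairs are hypothetical values chosen for illustration.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b in every objective
    and strictly better in at least one (both objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(plans):
    """Return the plans that no other plan dominates."""
    return {
        name: cost
        for name, cost in plans.items()
        if not any(dominates(other, cost) for other in plans.values())
    }


# Hypothetical query plans with (total LPC, total CC) costs.
plans = {"P1": (10, 40), "P2": (20, 20), "P3": (25, 30), "P4": (40, 10)}
front = pareto_front(plans)
print(sorted(front))  # P3 is dominated by P2, so it is filtered out
```

A single-objective formulation would collapse each pair into one sum and return one plan; keeping the two costs separate preserves the whole trade-off curve between local processing and communication.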
A novel parallel-rotation algorithm for atomistic Monte Carlo simulation of dense polymer systems
NASA Astrophysics Data System (ADS)
Santos, S.; Suter, U. W.; Müller, M.; Nievergelt, J.
2001-06-01
We develop and test a new elementary Monte Carlo move for use in the off-lattice simulation of polymer systems. This novel Parallel-Rotation algorithm (ParRot) permits very efficient moves of torsion angles that are deep inside long chains in melts. The parallel-rotation move is extremely simple and is also demonstrated to be computationally efficient and appropriate for Monte Carlo simulation. The ParRot move does not affect the orientation of those parts of the chain outside the moving unit. The move consists of a concerted rotation around four adjacent skeletal bonds. No assumption is made concerning the backbone geometry other than that bond lengths and bond angles are held constant during the elementary move. Properly weighted sampling techniques are needed to ensure detailed balance because the new move involves a correlated change in four degrees of freedom along the chain backbone. The ParRot move is supplemented with the classical Metropolis Monte Carlo, the Continuum-Configurational-Bias, and Reptation techniques in an isothermal-isobaric Monte Carlo simulation of melts of short and long chains. Comparisons are made with the capabilities of other Monte Carlo techniques to move the torsion angles in the middle of the chains. We demonstrate that ParRot constitutes a highly promising Monte Carlo move for the treatment of long polymer chains in the off-lattice simulation of realistic models of dense polymer systems.
ERIC Educational Resources Information Center
Vazquez Aranda, Armando I.; Henquin, Eduardo R.; Torres, Israel Rodriguez; Bisang, Jose M.
2012-01-01
A laboratory experiment is described to determine the primary current distribution in parallel-plate electrochemical reactors. The electrolyte is simulated by conductive paper and the electrodes are segmented to measure the current distribution. Experiments are reported with the electrolyte confined to the interelectrode gap, where the current…
ERIC Educational Resources Information Center
Hayton, James C.
2009-01-01
In the article "Exploring the Sensitivity of Horn's Parallel Analysis to the Distributional Form of Random Data," Dinno (this issue) provides strong evidence that the distribution of random data does not have a significant influence on the outcome of the analysis. Hayton appreciates the thorough approach to evaluating this assumption, and agrees…
A new parallel-vector finite element analysis software on distributed-memory computers
NASA Technical Reports Server (NTRS)
Qin, Jiangning; Nguyen, Duc T.
1993-01-01
A new parallel-vector finite element analysis software package, MPFEA (Massively Parallel-vector Finite Element Analysis), is developed for large-scale structural analysis on massively parallel computers with distributed memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. A block-skyline storage scheme along with vector-unrolling techniques is used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
NASA Astrophysics Data System (ADS)
Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran
2017-03-01
In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented for a state-of-the-art multicore CPU-based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand for compute time, memory, storage and I/O, along with the need for their effective management. The most resource-intensive modules of the algorithm are traveltime calculations and migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for a multicore CPU-based parallel system had been developed. Recently, we have worked on improving the parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It exhibits flexibility for migrating a large number of traces within the available node memory and with minimal requirements for storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments, and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
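The efficiency figures quoted above are consistent with the usual definition of parallel efficiency as speedup divided by the number of workers. The short check below assumes that definition and takes the 64 nodes as the worker count; the function name is illustrative.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Parallel efficiency as a fraction: achieved speedup / number of workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures reported in the abstract, both measured on 64 nodes.
print(f"{parallel_efficiency(49.05, 64):.2%}")  # prints 76.64%
print(f"{parallel_efficiency(32.00, 64):.2%}")  # prints 50.00%
```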
Kolakowska, A; Novotny, M A; Korniss, G
2003-04-01
We consider parallel simulations for asynchronous systems employing L processing elements that are arranged on a ring. Processors communicate only among the nearest neighbors and advance their local simulated time only if it is guaranteed that this does not violate causality. In simulations with no constraints, in the infinite L limit the utilization scales [Korniss et al., Phys. Rev. Lett. 84, 1351 (2000)]; but, the width of the virtual time horizon diverges (i.e., the measurement phase of the algorithm does not scale). In this work, we introduce a moving Delta-window global constraint, which modifies the algorithm so that the measurement phase scales as well. We present results of systematic studies in which the system size (i.e., L and the volume load per processor) as well as the constraint are varied. The Delta constraint eliminates the extreme fluctuations in the virtual time horizon, provides a bound on its width, and controls the average progress rate. The width of the Delta window can serve as a tuning parameter that, for a given volume load per processor, could be adjusted to optimize the utilization, so as to maximize the efficiency. This result may find numerous applications in modeling the evolution of general spatially extended short-range interacting systems with asynchronous dynamics, including dynamic Monte Carlo studies.
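A minimal sketch of a conservative update rule with a moving Delta-window on a ring is given here. It rests on assumptions that the abstract does not state: one site per processor, exponentially distributed local time increments, and a window measured from the current global minimum of the local times. All names are illustrative.

```python
import random


def simulate_ring(num_pes, steps, delta, seed=0):
    """Simulate local virtual times tau[i] of processors on a ring.

    A processor advances only if its time does not exceed that of either
    neighbor (causality) and lies within `delta` of the global minimum
    (the assumed window constraint). Returns (utilization, width), where
    utilization is the fraction of processor-steps that advanced and
    width is the final spread max(tau) - min(tau).
    """
    rng = random.Random(seed)
    tau = [0.0] * num_pes
    updates = 0
    for _ in range(steps):
        floor = min(tau)  # global minimum virtual time
        # Decide all updates from the same snapshot of tau.
        may_advance = [
            tau[i] <= min(tau[i - 1], tau[(i + 1) % num_pes])
            and tau[i] <= floor + delta
            for i in range(num_pes)
        ]
        for i, ok in enumerate(may_advance):
            if ok:
                tau[i] += rng.expovariate(1.0)
                updates += 1
    return updates / (num_pes * steps), max(tau) - min(tau)


utilization, width = simulate_ring(num_pes=64, steps=1000, delta=10.0)
print(utilization, width)
```

Rerunning with different `delta` values shows how the window trades utilization against the spread of the virtual time horizon; the processor holding the global minimum can always advance, so the simulation never deadlocks.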
An O(log^2 N) parallel algorithm for computing the eigenvalues of a symmetric tridiagonal matrix
NASA Technical Reports Server (NTRS)
Swarztrauber, Paul N.
1989-01-01
An O(log^2 N) parallel algorithm is presented for computing the eigenvalues of a symmetric tridiagonal matrix using a parallel algorithm for computing the zeros of the characteristic polynomial. The method is based on a quadratic recurrence in which the characteristic polynomial is constructed on a binary tree from polynomials whose degree doubles at each level. Intervals that contain exactly one zero are determined by the zeros of polynomials at the previous level, which ensures that different processors compute different zeros. The exact behavior of the polynomials at the interval endpoints is used to eliminate the usual problems induced by finite precision arithmetic.
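For orientation, the idea of isolating each eigenvalue in its own interval can be sketched with the classic sequential Sturm-count bisection. This is a swapped-in textbook technique, not the tree-based quadratic recurrence of the paper: it counts the eigenvalues below a shift from the signs of the pivots of the shifted matrix and then bisects. Function names are illustrative.

```python
def count_below(diag, off, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix
    (diagonal `diag`, off-diagonal `off`) that are smaller than x."""
    count = 0
    d = 1.0
    for k, a in enumerate(diag):
        b2 = off[k - 1] ** 2 if k > 0 else 0.0
        d = (a - x) - b2 / d  # pivot of the shifted matrix
        if d == 0.0:
            d = 1e-300  # nudge an exact zero pivot to avoid division by zero
        if d < 0.0:
            count += 1
    return count


def eigenvalue(diag, off, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) found by bisection."""
    n = len(diag)
    # Gershgorin discs give an interval containing every eigenvalue.
    radius = [
        (abs(off[i - 1]) if i > 0 else 0.0) + (abs(off[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(a - r for a, r in zip(diag, radius))
    hi = max(a + r for a, r in zip(diag, radius))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(diag, off, mid) > k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# 3x3 matrix with 2 on the diagonal and -1 on the off-diagonals.
print([eigenvalue([2.0, 2.0, 2.0], [-1.0, -1.0], k) for k in range(3)])
```

Because each index k is searched independently, the calls can run on different processors, which is the same property the abstract obtains from its one-zero-per-interval construction.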
Parallel computation of the SAR distribution in a 3D human head model
NASA Astrophysics Data System (ADS)
Walendziuk, Wojciech
2008-01-01
This work presents a way of parallel computation of the Specific Absorption Rate distribution. The parallel program used in the computation was based on the FDTD (Finite-Difference Time-Domain) method [1,2,3]. In order to establish communication among the computational nodes, the MPI (Message Passing Interface) standard was used [4,5,6]. The presented example of a human head numerical model was built with the use of MRI (Magnetic Resonance Image) pictures.
Parallel kinematic mechanisms for distributed actuation of future structures
NASA Astrophysics Data System (ADS)
Lai, G.; Plummer, A. R.; Cleaver, D. J.; Zhou, H.
2016-09-01
Future machines will require distributed actuation integrated with load-bearing structures, so that they are lighter, move faster, use less energy, and are more adaptable. Good examples are shape-changing aircraft wings which can adapt precisely to the ideal aerodynamic form for current flying conditions, and light but powerful robotic manipulators which can interact safely with human co-workers. A 'tensegrity structure' is a good candidate for this application due to its potentially excellent stiffness and strength-to-weight ratio and a multi-element structure into which actuators could be embedded. This paper presents results of an analysis of an example practical actuated tensegrity structure consisting of 3 ‘unit cells’. A numerical method is used to determine the stability of the structure with varying actuator length, showing how four actuators can be used to control movement in three degrees of freedom as well as simultaneously maintaining the structural pre-load. An experimental prototype has been built, in which 4 pneumatic artificial muscles (PAMs) are embedded in one unit cell. The PAMs are controlled antagonistically, by high speed switching of on-off valves, to achieve control of position and structure pre-load. Experimental and simulation results are presented, and future prospects for the approach are discussed.
NASA Astrophysics Data System (ADS)
Stramaglia, Sebastiano; Satalino, Giuseppe; Sternieri, A.; Anelli, P.; Blonda, Palma N.; Pasquariello, Guido
1998-10-01
We consider the problem of classification of remotely sensed data from LANDSAT Thematic Mapper images. The data were acquired in July 1986 over an area located in South Italy. We compare the performance obtained by feed-forward neural networks designed by a parallel genetic algorithm to determine their topology with that obtained by means of a multi-layer perceptron trained with the Back Propagation learning rule. The parallel genetic algorithm, implemented on the APE100/Quadrics platform, is based on the coding scheme recently proposed by Sternieri and Anelli and exploits a recently proposed environment for genetic algorithms on Quadrics, called AGAPE. The SASIMD architecture of Quadrics forces the chromosome representation. The coding scheme provides that the connection weights of the neural network are organized as a floating point string. The parallelization scheme adopted is the elitist coarse-grained stepping stone model, with migration occurring only towards neighboring processors. The fitness function depends on the mean square error. After fixing the total number of individuals and running the algorithm on Quadrics architectures with different numbers of processors, the proposed parallel genetic algorithm displayed a superlinear speedup. We report results obtained on a data set made of 1400 patterns.
NASA Astrophysics Data System (ADS)
Bleuler, Andreas; Teyssier, Romain; Carassou, Sébastien; Martizzi, Davide
2015-06-01
We introduce PHEW (Parallel HiErarchical Watershed), a new segmentation algorithm to detect structures in astrophysical fluid simulations, and its implementation into the adaptive mesh refinement (AMR) code RAMSES. PHEW works on the density field defined on the adaptive mesh, and can thus be used on the gas density or the dark matter density after a projection of the particles onto the grid. The algorithm is based on a 'watershed' segmentation of the computational volume into dense regions, followed by a merging of the segmented patches based on the saddle point topology of the density field. PHEW is capable of automatically detecting connected regions above the adopted density threshold, as well as the entire set of substructures within. Our algorithm is fully parallel and uses the MPI library. We describe in great detail the parallel algorithm and perform a scaling experiment which proves the capability of PHEW to run efficiently on massively parallel systems. Future work will add a particle unbinding procedure and the calculation of halo properties onto our segmentation algorithm, thus expanding the scope of PHEW to genuine halo finding.
NASA Astrophysics Data System (ADS)
Bernabe, Sergio; Igual, Francisco D.; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2015-10-01
In the last decade, the issue of endmember variability has received considerable attention, particularly when each pixel is modeled as a linear combination of endmembers or pure materials. As a result, several models and algorithms have been developed for considering the effect of endmember variability in spectral unmixing and possibly including multiple endmembers in the spectral unmixing stage. One of the most popular approaches for this purpose is the multiple endmember spectral mixture analysis (MESMA) algorithm. The procedure executed by MESMA can be summarized as follows: (i) First, a standard linear spectral unmixing (LSU) or fully constrained linear spectral unmixing (FCLSU) algorithm is run in an iterative fashion; (ii) Then, we use different endmember combinations, randomly selected from a spectral library, to decompose each mixed pixel; (iii) Finally, the model with the best fit, i.e., with the lowest root mean square error (RMSE) in the reconstruction of the original pixel, is adopted. However, this procedure can be computationally very expensive because several endmember combinations need to be tested and several abundance estimation steps need to be conducted, which compromises the use of MESMA in applications under real-time constraints. In this paper we develop (for the first time in the literature) an efficient implementation of MESMA on different platforms using OpenCL, an open standard for parallel programming on heterogeneous systems. Our experiments have been conducted using a simulated data set and the clMAGMA mathematical library. Implementations of this kind, which use the same descriptive language on different architectures, are very important in order to assess the possibility of using heterogeneous platforms for efficient hyperspectral image processing in real remote sensing missions.
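The model-selection step of the procedure summarized above can be sketched in a few lines. This is a simplified illustration, not the paper's OpenCL implementation: it uses unconstrained least squares rather than FCLSU, enumerates every endmember combination of a fixed size instead of sampling randomly, and works on a tiny hypothetical spectral library. All names and spectra are assumptions made for the example.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial
    pivoting (raises ZeroDivisionError if the matrix is singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def unmix(pixel, endmembers):
    """Unconstrained least-squares abundances and reconstruction RMSE."""
    k = len(endmembers)
    gram = [[sum(p * q for p, q in zip(endmembers[i], endmembers[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(p * q for p, q in zip(endmembers[i], pixel)) for i in range(k)]
    abund = solve(gram, rhs)  # normal equations
    recon = [sum(abund[i] * endmembers[i][band] for i in range(k))
             for band in range(len(pixel))]
    rmse = (sum((r - p) ** 2 for r, p in zip(recon, pixel)) / len(pixel)) ** 0.5
    return abund, rmse


def best_model(pixel, library, size):
    """Try every endmember combination of the given size; keep the lowest RMSE."""
    best = None
    for names in combinations(sorted(library), size):
        abund, rmse = unmix(pixel, [library[name] for name in names])
        if best is None or rmse < best[2]:
            best = (names, abund, rmse)
    return best


# Hypothetical four-band spectra.
library = {
    "soil": [0.20, 0.30, 0.40, 0.50],
    "veg": [0.05, 0.08, 0.50, 0.40],
    "water": [0.10, 0.08, 0.03, 0.01],
}
pixel = [0.3 * s + 0.7 * w for s, w in zip(library["soil"], library["water"])]
print(best_model(pixel, library, size=2))
```

Every combination requires its own abundance solve per pixel, which is the cost that motivates a parallel implementation.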
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
NASA Astrophysics Data System (ADS)
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2017-02-01
In real life, multi-objective engineering design problems are very tough and time-consuming optimization problems due to their high degree of nonlinearity, complexity and inhomogeneity. Nature-inspired multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes an original multi-objective Bat algorithm (MOBA) and its extended form, a novel parallel hybrid multi-objective Bat algorithm (PHMOBA), to generate shortest-length Golomb rulers, called optimal Golomb ruler (OGR) sequences, in a reasonable computation time. OGRs find application in optical wavelength division multiplexing (WDM) systems as a channel-allocation algorithm to reduce four-wave mixing (FWM) crosstalk. The performance of both proposed algorithms in generating OGRs for optical WDM channel allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations show that the proposed parallel hybrid multi-objective Bat algorithm works efficiently compared to the original multi-objective Bat algorithm and other existing algorithms in generating OGRs for optical WDM systems. PHMOBA has a higher convergence and success rate than the original MOBA. The efficiency improvement of the proposed PHMOBA in generating OGRs of up to 20 marks, in terms of ruler length and total optical channel bandwidth (TBW), is 100 %, whereas for the original MOBA it is 85 %. Finally, the implications for further research are also discussed.
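The defining property of a Golomb ruler, that all pairwise differences between marks are distinct, is easy to check directly. The sketch below shows only this validity test and the ruler length used as an objective; it is not the Bat algorithm from the paper, and the function names are illustrative.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """True if every pair of marks has a distinct difference."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))


def ruler_length(marks):
    """Distance between the smallest and largest mark."""
    return max(marks) - min(marks)


print(is_golomb_ruler([0, 1, 4, 9, 11]))  # True: a known optimal 5-mark ruler
print(ruler_length([0, 1, 4, 9, 11]))     # 11
print(is_golomb_ruler([0, 1, 2, 4]))      # False: the difference 1 occurs twice
```

In the WDM setting the marks correspond to channel positions, so distinct differences mean that no two channel pairs share the same spacing.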
NASA Astrophysics Data System (ADS)
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2016-07-01
In real life, multi-objective engineering design problems are very tough and time-consuming optimization problems due to their high degree of nonlinearity, complexity and inhomogeneity. Nature-inspired multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes an original multi-objective Bat algorithm (MOBA) and its extended form, a novel parallel hybrid multi-objective Bat algorithm (PHMOBA), to generate shortest-length Golomb rulers, called optimal Golomb ruler (OGR) sequences, in a reasonable computation time. OGRs find application in optical wavelength division multiplexing (WDM) systems as a channel-allocation algorithm to reduce four-wave mixing (FWM) crosstalk. The performance of both proposed algorithms in generating OGRs for optical WDM channel allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations show that the proposed parallel hybrid multi-objective Bat algorithm works efficiently compared to the original multi-objective Bat algorithm and other existing algorithms in generating OGRs for optical WDM systems. PHMOBA has a higher convergence and success rate than the original MOBA. The efficiency improvement of the proposed PHMOBA in generating OGRs of up to 20 marks, in terms of ruler length and total optical channel bandwidth (TBW), is 100 %, whereas for the original MOBA it is 85 %. Finally, the implications for further research are also discussed.
Mesoscale Simulations of Particulate Flows with Parallel Distributed Lagrange Multiplier Technique
Kanarska, Y
2010-03-24
Fluid particulate flows are common phenomena in nature and industry. Modeling of such flows at micro and macro levels, as well as establishing relationships between these approaches, is needed to understand the properties of particulate matter. We propose a computational technique based on the direct numerical simulation of particulate flows. The numerical method is based on the distributed Lagrange multiplier technique following the ideas of Glowinski et al. (1999). Each particle is explicitly resolved on an Eulerian grid as a separate domain, using solid volume fractions. The fluid equations are solved through the entire computational domain; however, Lagrange multiplier constraints are applied inside the particle domain such that the fluid within any volume associated with a solid particle moves as an incompressible rigid body. Mutual forces for the fluid-particle interactions are internal to the system. Particles interact with the fluid via fluid dynamic equations, resulting in implicit fluid-rigid-body coupling relations that produce realistic fluid flow around the particles (i.e., no-slip boundary conditions). The particle-particle interactions are implemented using explicit force-displacement interactions for frictional inelastic particles similar to the DEM method of Cundall et al. (1979), with some modifications using the volume of an overlapping region as an input to the contact forces. The method is flexible enough to handle arbitrary particle shapes and size distributions. A parallel implementation of the method is based on the SAMRAI (Structured Adaptive Mesh Refinement Application Infrastructure) library, which allows handling of large numbers of rigid particles and enables local grid refinement. Accuracy and convergence of the presented method have been tested against known solutions for a falling sphere as well as by examining fluid flows through stationary particle beds (periodic and cubic packing). To evaluate code performance and validate particle
Tao, X.; Lu, Q.
2014-02-15
In space plasmas, charged particles are frequently observed to possess a high-energy tail, which is often modeled by a kappa-type distribution function. In this work, the formation of the electron kappa distribution during the generation of parallel propagating whistler waves is investigated using fully nonlinear particle-in-cell (PIC) simulations. Previous research concluded that the bi-Maxwellian character of electron distributions is preserved in PIC simulations. We now demonstrate that for interactions between electrons and parallel propagating whistler waves, a non-Maxwellian high-energy tail can be formed, and a kappa distribution can be used to fit the electron distribution in the time-asymptotic limit. The κ-parameter is found to decrease with increasing initial temperature anisotropy or decreasing ratio of electron plasma frequency to cyclotron frequency. The results might be helpful for understanding the origin of electron kappa distributions observed in space plasmas.
Intelligent decision support algorithm for distribution system restoration.
Singh, Reetu; Mehfuz, Shabana; Kumar, Parmod
2016-01-01
The distribution system is the means of revenue for an electric utility. It needs to be restored as early as possible if any feeder or the complete system is tripped out due to a fault or any other cause. Further, uncertainty of the loads results in variations in the distribution network's parameters. Thus, an intelligent algorithm incorporating a hybrid fuzzy-grey relation, which can take into account the uncertainties and compare the sequences, is discussed to analyse and restore the distribution system. Simulation studies are carried out to show the utility of the method by ranking the restoration plans for a typical distribution system. This algorithm also meets the smart grid requirements in terms of an automated restoration plan for a partial or full blackout of the network.