Predicting performance of parallel computations
NASA Technical Reports Server (NTRS)
Mak, Victor W.; Lundstrom, Stephen F.
1990-01-01
An accurate and computationally efficient method for predicting the performance of a class of parallel computations running on concurrent systems is described. A parallel computation is modeled as a task system with precedence relationships expressed as a series-parallel directed acyclic graph. Resources in a concurrent system are modeled as service centers in a queuing network model. Using these two models as inputs, the method outputs predictions of expected execution time of the parallel computation and the concurrent system utilization. The method is validated against both detailed simulation and actual execution on a commercial multiprocessor. Using 100 test cases, the average error of the prediction when compared to simulation statistics is 1.7 percent, with a standard deviation of 1.5 percent; the maximum error is about 10 percent.
Visualizing Parallel Computer System Performance
NASA Technical Reports Server (NTRS)
Malony, Allen D.; Reed, Daniel A.
1988-01-01
Parallel computer systems are among the most complex of man's creations, making satisfactory performance characterization difficult. Despite this complexity, there are strong, indeed, almost irresistible, incentives to quantify parallel system performance using a single metric. The fallacy lies in succumbing to such temptations. A complete performance characterization requires not only an analysis of the system's constituent levels, it also requires both static and dynamic characterizations. Static or average behavior analysis may mask transients that dramatically alter system performance. Although the human visual system is remarkedly adept at interpreting and identifying anomalies in false color data, the importance of dynamic, visual scientific data presentation has only recently been recognized Large, complex parallel system pose equally vexing performance interpretation problems. Data from hardware and software performance monitors must be presented in ways that emphasize important events while eluding irrelevant details. Design approaches and tools for performance visualization are the subject of this paper.
High Performance Parallel Computational Nanotechnology
NASA Technical Reports Server (NTRS)
Saini, Subhash; Craw, James M. (Technical Monitor)
1995-01-01
At a recent press conference, NASA Administrator Dan Goldin encouraged NASA Ames Research Center to take a lead role in promoting research and development of advanced, high-performance computer technology, including nanotechnology. Manufacturers of leading-edge microprocessors currently perform large-scale simulations in the design and verification of semiconductor devices and microprocessors. Recently, the need for this intensive simulation and modeling analysis has greatly increased, due in part to the ever-increasing complexity of these devices, as well as the lessons of experiences such as the Pentium fiasco. Simulation, modeling, testing, and validation will be even more important for designing molecular computers because of the complex specification of millions of atoms, thousands of assembly steps, as well as the simulation and modeling needed to ensure reliable, robust and efficient fabrication of the molecular devices. The software for this capacity does not exist today, but it can be extrapolated from the software currently used in molecular modeling for other applications: semi-empirical methods, ab initio methods, self-consistent field methods, Hartree-Fock methods, molecular mechanics; and simulation methods for diamondoid structures. In as much as it seems clear that the application of such methods in nanotechnology will require powerful, highly powerful systems, this talk will discuss techniques and issues for performing these types of computations on parallel systems. We will describe system design issues (memory, I/O, mass storage, operating system requirements, special user interface issues, interconnects, bandwidths, and programming languages) involved in parallel methods for scalable classical, semiclassical, quantum, molecular mechanics, and continuum models; molecular nanotechnology computer-aided designs (NanoCAD) techniques; visualization using virtual reality techniques of structural models and assembly sequences; software required to
Interfacing Computer Aided Parallelization and Performance Analysis
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)
2003-01-01
When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.
Measuring performance of parallel computers. Final report
Sullivan, F.
1994-07-01
Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.
Misleading Performance Claims in Parallel Computations
Bailey, David H.
2009-05-29
In a previous humorous note entitled 'Twelve Ways to Fool the Masses,' I outlined twelve common ways in which performance figures for technical computer systems can be distorted. In this paper and accompanying conference talk, I give a reprise of these twelve 'methods' and give some actual examples that have appeared in peer-reviewed literature in years past. I then propose guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion, not only in the world of device simulation but also in the larger arena of technical computing.
Performance Evaluation in Network-Based Parallel Computing
NASA Technical Reports Server (NTRS)
Dezhgosha, Kamyar
1996-01-01
Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.
Performance issues for engineering analysis on MIMD parallel computers
Fang, H.E.; Vaughan, C.T.; Gardner, D.R.
1994-08-01
We discuss how engineering analysts can obtain greater computational resolution in a more timely manner from applications codes running on MIMD parallel computers. Both processor speed and memory capacity are important to achieving better performance than a serial vector supercomputer. To obtain good performance, a parallel applications code must be scalable. In addition, the aspect ratios of the subdomains in the decomposition of the simulation domain onto the parallel computer should be of order 1. We demonstrate these conclusions using simulations conducted with the PCTH shock wave physics code running on a Cray Y-MP, a 1024-node nCUBE 2, and an 1840-node Paragon.
Measuring performance of parallel computers. Progress report, 1989
Sullivan, F.
1994-07-01
Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.
Routing performance analysis and optimization within a massively parallel computer
Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen
2013-04-16
An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.
Energy Proportionality and Performance in Data Parallel Computing Clusters
Kim, Jinoh; Chou, Jerry; Rotem, Doron
2011-02-14
Energy consumption in datacenters has recently become a major concern due to the rising operational costs andscalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumedby the server nodes must be proportional to the amount of work performed. For data parallelism and fault tolerancepurposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A coveringset is a group of nodes that together contain at least one replica of the data blocks needed for performing computing tasks. In thiswork, we develop and analyze algorithms to maintain energy proportionality by discovering a covering set that minimizesenergy consumption while placing the remaining nodes in lowpower standby mode. Our algorithms can also discover coveringsets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that itcan discover k-covering sets, i.e., a set of nodes that contain at least k replicas of the data blocks. Our experimental results showthat we can achieve substantial energy saving without significant performance loss in diverse cluster configurations and workingenvironments.
Development of Message Passing Routines for High Performance Parallel Computations
NASA Technical Reports Server (NTRS)
Summers, Edward K.
2004-01-01
Computational Fluid Dynamics (CFD) calculations require a great deal of computing power for completing the detailed computations involved. In an effort shorten the time it takes to complete such calculations they are implemented on a parallel computer. In the case of a parallel computer some sort of message passing structure must be used to communicate between the computers because, unlike a single machine, each computer in a parallel computing cluster does not have access to all the data or run all the parts of the total program. Thus, message passing is used to divide up the data and send instructions to each machine. The nature of my work this summer involves programming the "message passing" aspect of the parallel computer. I am working on modifying an existing program, which was written with OpenMP, and does not use a multi-machine parallel computing structure, to work with Message Passing Interface (MPI) routines. The actual code is being written in the FORTRAN 90 programming language. My goal is to write a parameterized message passing structure that could be used for a variety of individual applications and implement it on Silicon Graphics Incorporated s (SGI) IRIX operating system. With this new parameterized structure engineers would be able to speed up computations for a wide variety of purposes without having to use larger and more expensive computing equipment from another division or another NASA center.
Application Specific Performance Technology for Productive Parallel Computing
Malony, Allen D.; Shende, Sameer
2008-09-30
Our accomplishments over the last three years of the DOE project Application- Specific Performance Technology for Productive Parallel Computing (DOE Agreement: DE-FG02-05ER25680) are described below. The project will have met all of its objectives by the time of its completion at the end of September, 2008. Two extensive yearly progress reports were produced in in March 2006 and 2007 and were previously submitted to the DOE Office of Advanced Scientific Computing Research (OASCR). Following an overview of the objectives of the project, we summarize for each of the project areas the achievements in the first two years, and then describe in some more detail the project accomplishments this past year. At the end, we discuss the relationship of the proposed renewal application to the work done on the current project.
Performing an allreduce operation on a plurality of compute nodes of a parallel computer
Faraj, Ahmad
2012-04-17
Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer. Each compute node includes at least two processing cores. Each processing core has contribution data for the allreduce operation. Performing an allreduce operation on a plurality of compute nodes of a parallel computer includes: establishing one or more logical rings among the compute nodes, each logical ring including at least one processing core from each compute node; performing, for each logical ring, a global allreduce operation using the contribution data for the processing cores included in that logical ring, yielding a global allreduce result for each processing core included in that logical ring; and performing, for each compute node, a local allreduce operation using the global allreduce results for each processing core on that compute node.
Performing a global barrier operation in a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2014-12-09
Executing computing tasks on a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier.
A Generic Scheduling Simulator for High Performance Parallel Computers
Yoo, B S; Choi, G S; Jette, M A
2001-08-01
It is well known that efficient job scheduling plays a crucial role in achieving high system utilization in large-scale high performance computing environments. A good scheduling algorithm should schedule jobs to achieve high system utilization while satisfying various user demands in an equitable fashion. Designing such a scheduling algorithm is a non-trivial task even in a static environment. In practice, the computing environment and workload are constantly changing. There are several reasons for this. First, the computing platforms constantly evolve as the technology advances. For example, the availability of relatively powerful commodity off-the-shelf (COTS) components at steadily diminishing prices have made it feasible to construct ever larger massively parallel computers in recent years [1, 4]. Second, the workload imposed on the system also changes constantly. The rapidly increasing compute resources have provided many applications developers with the opportunity to radically alter program characteristics and take advantage of these additional resources. New developments in software technology may also trigger changes in user applications. Finally, political climate change may alter user priorities or the mission of the organization. System designers in such dynamic environments must be able to accurately forecast the effect of changes in the hardware, software, and/or policies under consideration. If the environmental changes are significant, one must also reassess scheduling algorithms. Simulation has frequently been relied upon for this analysis, because other methods such as analytical modeling or actual measurements are usually too difficult or costly. A drawback of the simulation approach, however, is that developing a simulator is a time-consuming process. Furthermore, an existing simulator cannot be easily adapted to a new environment. In this research, we attempt to develop a generic job-scheduling simulator, which facilitates the evaluation of
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Not Available
1991-10-23
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Performing a local reduction operation on a parallel computer
Blocksome, Michael A; Faraj, Daniel A
2013-06-04
A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
Performing a local reduction operation on a parallel computer
Blocksome, Michael A.; Faraj, Daniel A.
2012-12-11
A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
High Performance Input/Output for Parallel Computer Systems
NASA Technical Reports Server (NTRS)
Ligon, W. B.
1996-01-01
The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, The Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX- like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications fiom levels 1,2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.
NASA Technical Reports Server (NTRS)
Morgan, Philip E.
2004-01-01
This final report contains reports of research related to the tasks "Scalable High Performance Computing: Direct and Lark-Eddy Turbulent FLow Simulations Using Massively Parallel Computers" and "Devleop High-Performance Time-Domain Computational Electromagnetics Capability for RCS Prediction, Wave Propagation in Dispersive Media, and Dual-Use Applications. The discussion of Scalable High Performance Computing reports on three objectives: validate, access scalability, and apply two parallel flow solvers for three-dimensional Navier-Stokes flows; develop and validate a high-order parallel solver for Direct Numerical Simulations (DNS) and Large Eddy Simulation (LES) problems; and Investigate and develop a high-order Reynolds averaged Navier-Stokes turbulence model. The discussion of High-Performance Time-Domain Computational Electromagnetics reports on five objectives: enhancement of an electromagnetics code (CHARGE) to be able to effectively model antenna problems; utilize lessons learned in high-order/spectral solution of swirling 3D jets to apply to solving electromagnetics project; transition a high-order fluids code, FDL3DI, to be able to solve Maxwell's Equations using compact-differencing; develop and demonstrate improved radiation absorbing boundary conditions for high-order CEM; and extend high-order CEM solver to address variable material properties. The report also contains a review of work done by the systems engineer.
NASA Astrophysics Data System (ADS)
Hou, Zhen-Long; Wei, Xiao-Hui; Huang, Da-Nian; Sun, Xu
2015-09-01
We apply reweighted inversion focusing to full tensor gravity gradiometry data using message-passing interface (MPI) and compute unified device architecture (CUDA) parallel computing algorithms, and then combine MPI with CUDA to formulate a hybrid algorithm. Parallel computing performance metrics are introduced to analyze and compare the performance of the algorithms. We summarize the rules for the performance evaluation of parallel algorithms. We use model and real data from the Vinton salt dome to test the algorithms. We find good match between model and real density data, and verify the high efficiency and feasibility of parallel computing algorithms in the inversion of full tensor gravity gradiometry data.
A high performance parallel computing architecture for robust image features
NASA Astrophysics Data System (ADS)
Zhou, Renyan; Liu, Leibo; Wei, Shaojun
2014-03-01
A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.
Performing an allreduce operation on a plurality of compute nodes of a parallel computer
Faraj, Ahmad
2013-02-12
Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer, each node including at least two processing cores, that include: performing, for each node, a local reduction operation using allreduce contribution data for the cores of that node, yielding, for each node, a local reduction result for one or more representative cores for that node; establishing one or more logical rings among the nodes, each logical ring including only one of the representative cores from each node; performing, for each logical ring, a global allreduce operation using the local reduction result for the representative cores included in that logical ring, yielding a global allreduce result for each representative core included in that logical ring; and performing, for each node, a local broadcast operation using the global allreduce results for each representative core on that node.
Hardware Efficient and High-Performance Networks for Parallel Computers.
NASA Astrophysics Data System (ADS)
Chien, Minze Vincent
High performance interconnection networks are the key to high utilization and throughput in large-scale parallel processing systems. Since many interconnection problems in parallel processing such as concentration, permutation and broadcast problems can be cast as sorting problems, this dissertation considers the problem of sorting on a new model, called an adaptive sorting network. It presents four adaptive binary sorters the first two of which are ordinary combinational circuits while the last two exploit time-multiplexing and pipelining techniques. These sorter constructions demonstrate that any sequence of n bits can be sorted in O(log^2n) bit-level delay, using O(n) constant fanin gates. This improves the cost complexity of Batcher's binary sorters by a factor of O(log^2n) while matching their sorting time. It is further shown that any sequence of n numbers can be sorted on the same model in O(log^2n) comparator-level delay using O(nlog nloglog n) comparators. The adaptive binary sorter constructions lead to new O(n) bit-level cost concentrators and superconcentrators with O(log^2n) bit-level delay. Their employment in recently constructed permutation and generalized connectors lead to permutation and generalized connection networks with O(nlog n) bit-level cost and O(log^3n) bit-level delay. These results provide the least bit-level cost for such networks with competitive delays. Finally, the dissertation considers a key issue in the implementation of interconnection networks, namely, the pin constraint. Current VLSI technologies can house a large number of switches in a single chip, but the mere fact that one chip cannot have too many pins precludes the possibility of implementing a large connection network on a single chip. The dissertation presents techniques for partitioning connection networks into identical modules of switches in such a way that each module is contained in a single chip with an arbitrarily specified number of pins, and that the cost of
Parallel beam dynamics calculations on high performance computers
NASA Astrophysics Data System (ADS)
Ryne, Robert; Habib, Salman
1997-02-01
Faced with a backlog of nuclear waste and weapons plutonium, as well as an ever-increasing public concern about safety and environmental issues associated with conventional nuclear reactors, many countries are studying new, accelerator-driven technologies that hold the promise of providing safe and effective solutions to these problems. Proposed projects include accelerator transmutation of waste (ATW), accelerator-based conversion of plutonium (ABC), accelerator-driven energy production (ADEP), and accelerator production of tritium (APT). Also, next-generation spallation neutron sources based on similar technology will play a major role in materials science and biological science research. The design of accelerators for these projects will require a major advance in numerical modeling capability. For example, beam dynamics simulations with approximately 100 million particles will be needed to ensure that extremely stringent beam loss requirements (less than a nanoampere per meter) can be met. Compared with typical present-day modeling using 10,000-100,000 particles, this represents an increase of 3-4 orders of magnitude. High performance computing (HPC) platforms make it possible to perform such large scale simulations, which require 10's of GBytes of memory. They also make it possible to perform smaller simulations in a matter of hours that would require months to run on a single processor workstation. This paper will describe how HPC platforms can be used to perform the numerically intensive beam dynamics simulations required for development of these new accelerator-driven technologies.
Performing an allreduce operation on a plurality of compute nodes of a parallel computer
Faraj, Ahmad
2013-07-09
Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer, each node including at least two processing cores, that include: establishing, for each node, a plurality of logical rings, each ring including a different set of at least one core on that node, each ring including the cores on at least two of the nodes; iteratively for each node: assigning each core of that node to one of the rings established for that node to which the core has not previously been assigned, and performing, for each ring for that node, a global allreduce operation using contribution data for the cores assigned to that ring or any global allreduce results from previous global allreduce operations, yielding current global allreduce results for each core; and performing, for each node, a local allreduce operation using the global allreduce results.
Geometrically nonlinear design sensitivity analysis on parallel-vector high-performance computers
NASA Technical Reports Server (NTRS)
Baddourah, Majdi A.; Nguyen, Duc T.
1993-01-01
Parallel-vector solution strategies for generation and assembly of element matrices, solution of the resulted system of linear equations, calculations of the unbalanced loads, displacements, stresses, and design sensitivity analysis (DSA) are all incorporated into the Newton Raphson (NR) procedure for nonlinear finite element analysis and DSA. Numerical results are included to show the performance of the proposed method for structural analysis and DSA in a parallel-vector computer environment.
NASA Technical Reports Server (NTRS)
Logan, Terry G.
1994-01-01
The purpose of this study is to investigate the performance of the integral equation computations using numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of computational performance of the MPP CM-5 computer and conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code for some cases.
High performance parallel architectures
Anderson, R.E. )
1989-09-01
In this paper the author describes current high performance parallel computer architectures. A taxonomy is presented to show computer architecture from the user programmer's point-of-view. The effects of the taxonomy upon the programming model are described. Some current architectures are described with respect to the taxonomy. Finally, some predictions about future systems are presented. 5 refs., 1 fig.
Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)
2002-01-01
The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames, the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
Towards a high performance parallel library to compute fluid and flexible structures interactions
NASA Astrophysics Data System (ADS)
Nagar, Prateek
LBM-IB method is useful and popular simulation technique that is adopted ubiquitously to solve Fluid-Structure interaction problems in computational fluid dynamics. These problems are known for utilizing computing resources intensively while solving mathematical equations involved in simulations. Problems involving such interactions are omnipresent, therefore, it is eminent that a faster and accurate algorithm exists for solving these equations, to reproduce a real-life model of such complex analytical problems in a shorter time period. LBM-IB being inherently parallel, proves to be an ideal candidate for developing a parallel software. This research focuses on developing a parallel software library, LBM-IB based on the algorithm proposed by [1] which is first of its kind that utilizes the high performance computing abilities of supercomputers procurable today. An initial sequential version of LBM-IB is developed that is used as a benchmark for correctness and performance evaluation of shared memory parallel versions. Two shared memory parallel versions of LBM-IB have been developed using OpenMP and Pthread library respectively. The OpenMP version is able to scale well enough, as good as 83% speedup on multicore machines for 8 cores. Based on the profiling and instrumentation done on this version, to improve the data-locality and increase the degree of parallelism, Pthread based data centric version is developed which is able to outperform the OpenMP version by 53% on manycore machines. A distributed version using the MPI interfaces on top of the cube based Pthread version has also been designed to be used by extreme scale distributed memory manycore systems.
Parallel-vector unsymmetric Eigen-Solver on high performance computers
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.; Jiangning, Qin
1993-01-01
The popular QR algorithm for solving all eigenvalues of an unsymmetric matrix is reviewed. Among the basic components in the QR algorithm, it was concluded from this study, that the reduction of an unsymmetric matrix to a Hessenberg form (before applying the QR algorithm itself) can be done effectively by exploiting the vector speed and multiple processors offered by modern high-performance computers. Numerical examples of several test cases have indicated that the proposed parallel-vector algorithm for converting a given unsymmetric matrix to a Hessenberg form offers computational advantages over the existing algorithm. The time saving obtained by the proposed methods is increased as the problem size increased.
High performance parallel computing of flows in complex geometries: I. Methods
NASA Astrophysics Data System (ADS)
Gourdain, N.; Gicquel, L.; Montagnac, M.; Vermorel, O.; Gazaix, M.; Staffelbach, G.; Garcia, M.; Boussuge, J.-F.; Poinsot, T.
2009-01-01
Efficient numerical tools coupled with high-performance computers, have become a key element of the design process in the fields of energy supply and transportation. However flow phenomena that occur in complex systems such as gas turbines and aircrafts are still not understood mainly because of the models that are needed. In fact, most computational fluid dynamics (CFD) predictions as found today in industry focus on a reduced or simplified version of the real system (such as a periodic sector) and are usually solved with a steady-state assumption. This paper shows how to overcome such barriers and how such a new challenge can be addressed by developing flow solvers running on high-end computing platforms, using thousands of computing cores. Parallel strategies used by modern flow solvers are discussed with particular emphases on mesh-partitioning, load balancing and communication. Two examples are used to illustrate these concepts: a multi-block structured code and an unstructured code. Parallel computing strategies used with both flow solvers are detailed and compared. This comparison indicates that mesh-partitioning and load balancing are more straightforward with unstructured grids than with multi-block structured meshes. However, the mesh-partitioning stage can be challenging for unstructured grids, mainly due to memory limitations of the newly developed massively parallel architectures. Finally, detailed investigations show that the impact of mesh-partitioning on the numerical CFD solutions, due to rounding errors and block splitting, may be of importance and should be accurately addressed before qualifying massively parallel CFD tools for a routine industrial use.
High performance parallel computing of flows in complex geometries: II. Applications
NASA Astrophysics Data System (ADS)
Gourdain, N.; Gicquel, L.; Staffelbach, G.; Vermorel, O.; Duchaine, F.; Boussuge, J.-F.; Poinsot, T.
2009-01-01
Present regulations in terms of pollutant emissions, noise and economical constraints, require new approaches and designs in the fields of energy supply and transportation. It is now well established that the next breakthrough will come from a better understanding of unsteady flow effects and by considering the entire system and not only isolated components. However, these aspects are still not well taken into account by the numerical approaches or understood whatever the design stage considered. The main challenge is essentially due to the computational requirements inferred by such complex systems if it is to be simulated by use of supercomputers. This paper shows how new challenges can be addressed by using parallel computing platforms for distinct elements of a more complex systems as encountered in aeronautical applications. Based on numerical simulations performed with modern aerodynamic and reactive flow solvers, this work underlines the interest of high-performance computing for solving flow in complex industrial configurations such as aircrafts, combustion chambers and turbomachines. Performance indicators related to parallel computing efficiency are presented, showing that establishing fair criterions is a difficult task for complex industrial applications. Examples of numerical simulations performed in industrial systems are also described with a particular interest for the computational time and the potential design improvements obtained with high-fidelity and multi-physics computing methods. These simulations use either unsteady Reynolds-averaged Navier-Stokes methods or large eddy simulation and deal with turbulent unsteady flows, such as coupled flow phenomena (thermo-acoustic instabilities, buffet, etc). Some examples of the difficulties with grid generation and data analysis are also presented when dealing with these complex industrial applications.
Performance of a parallel code for the Euler equations on hypercube computers
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Chan, Tony F.; Jesperson, Dennis C.; Tuminaro, Raymond S.
1990-01-01
The performance of hypercubes were evaluated on a computational fluid dynamics problem and the parallel environment issues were considered that must be addressed, such as algorithm changes, implementation choices, programming effort, and programming environment. The evaluation focuses on a widely used fluid dynamics code, FLO52, which solves the two dimensional steady Euler equations describing flow around the airfoil. The code development experience is described, including interacting with the operating system, utilizing the message-passing communication system, and code modifications necessary to increase parallel efficiency. Results from two hypercube parallel computers (a 16-node iPSC/2, and a 512-node NCUBE/ten) are discussed and compared. In addition, a mathematical model of the execution time was developed as a function of several machine and algorithm parameters. This model accurately predicts the actual run times obtained and is used to explore the performance of the code in interesting but yet physically realizable regions of the parameter space. Based on this model, predictions about future hypercubes are made.
A parallel-vector algorithm for rapid structural analysis on high-performance computers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.
1990-01-01
A fast, accurate Choleski method for the solution of symmetric systems of linear equations is presented. This direct method is based on a variable-band storage scheme and takes advantage of column heights to reduce the number of operations in the Choleski factorization. The method employs parallel computation in the outermost DO-loop and vector computation via the loop unrolling technique in the innermost DO-loop. The method avoids computations with zeros outside the column heights, and as an option, zeros inside the band. The close relationship between Choleski and Gauss elimination methods is examined. The minor changes required to convert the Choleski code to a Gauss code to solve non-positive-definite symmetric systems of equations are identified. The results for two large scale structural analyses performed on supercomputers, demonstrate the accuracy and speed of the method.
Next generation Purex modeling by way of parallel processing with high performance computers
DeMuth, S.F.
1993-08-01
The Plutonium and Uranium Extraction (Purex) process is the predominant method used worldwide for solvent extraction in reprocessing spent nuclear fuels. Proper flowsheet design has a significant impact on the character of the process waste. Past Purex flowsheet modeling has been based on equilibrium conditions. It can be shown for the Purex process that optimum separation does not necessarily occur at equilibrium conditions. The next generation Purex flowsheet models should incorporate the fundamental diffusion and chemical kinetic processes required to study time-dependent behavior. Use of parallel processing with high-performance computers will permit transient multistage and multispecies design calculations based on mass transfer with simultaneous chemical reaction models. This paper presents an applicable mass transfer with chemical reaction model for the Purex system and presents a parallel processing solution methodology.
Gaalop—High Performance Parallel Computing Based on Conformal Geometric Algebra
NASA Astrophysics Data System (ADS)
Hildenbrand, Dietmar; Pitt, Joachim; Koch, Andreas
We present Gaalop (Geometric algebra algorithms optimizer), our tool for high-performance computing based on conformal geometric algebra. The main goal of Gaalop is to realize implementations that are most likely faster than conventional solutions. In order to achieve this goal, our focus is on parallel target platforms like FPGA (field-programmable gate arrays) or the CUDA technology from NVIDIA. We describe the concepts, current status, and future perspectives of Gaalop dealing with optimized software implementations, hardware implementations, and mixed solutions. An inverse kinematics algorithm of a humanoid robot is described as an example.
NASA Technical Reports Server (NTRS)
Denning, Peter J.; Tichy, Walter F.
1990-01-01
Among the highly parallel computing architectures required for advanced scientific computation, those designated 'MIMD' and 'SIMD' have yielded the best results to date. The present development status evaluation of such architectures shown neither to have attained a decisive advantage in most near-homogeneous problems' treatment; in the cases of problems involving numerous dissimilar parts, however, such currently speculative architectures as 'neural networks' or 'data flow' machines may be entailed. Data flow computers are the most practical form of MIMD fine-grained parallel computers yet conceived; they automatically solve the problem of assigning virtual processors to the real processors in the machine.
NASA Technical Reports Server (NTRS)
Katz, D.; Cwik, T.; Sterling, T.
1998-01-01
This paper uses the parallel calculation of the radiation integral for examination of performance and compiler issues on a Beowulf-class computer. This type of computer, built from mass-market, commodity, off-the-shelf components, has limited communications performance and therefore also has a limited regime of codes for which it is suitable.
Frequency-dependent performance analysis of a parallel DSP-based computer system
NASA Astrophysics Data System (ADS)
Christou, Ch. S.
2014-11-01
The performance of a shared-memory low-cost high-performance DSP-Based multiprocessor system [3] is investigated, by varying the frequency of the core processor from 200MHz to 1GHZ, in steps of 200 MHZ, and keeping constant parameters such as the shared-memory-access-time and the prefetching-workload-size. The innovation of this Parallel DSP-Based computer system is the introduction of two small programmable small fast memories (Twins) between the processor and the shared bus interconnect. While one memory (Twin) transfers data from/to the shared memory, the other Twin supplies the core DSP-processor with data. Results indicate an increase of the shared-bus bottleneck as the core DSP processors' clock-rate increases. Workload of the Twins is processed faster thus greater the demand of the shared-bus. Results show an effectively supported robust parallel shared-memory system where fewer but faster (clocked with higher frequency) processors produce the same execution times as a greater number of slower processors, with most system configurations achieving perfect speedups, mainly due to the twin-prefetching mechanism.
Koniges, A.
1996-02-09
This project is a package of 11 individual CRADA`s plus hardware. This innovative project established a three-year multi-party collaboration that is significantly accelerating the availability of commercial massively parallel processing computing software technology to U.S. government, academic, and industrial end-users. This report contains individual presentations from nine principal investigators along with overall program information.
Turbomachinery CFD on parallel computers
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Milner, Edward J.; Quealy, Angela; Townsend, Scott E.
1992-01-01
The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized. Parallel turbomachinery CFD research at the NASA Lewis Research Center is then highlighted. These efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.
pWeb: A High-Performance, Parallel-Computing Framework for Web-Browser-Based Medical Simulation.
Halic, Tansel; Ahn, Woojin; De, Suvranu
2014-01-01
This work presents a pWeb - a new language and compiler for parallelization of client-side compute intensive web applications such as surgical simulations. The recently introduced HTML5 standard has enabled creating unprecedented applications on the web. Low performance of the web browser, however, remains the bottleneck of computationally intensive applications including visualization of complex scenes, real time physical simulations and image processing compared to native ones. The new proposed language is built upon web workers for multithreaded programming in HTML5. The language provides fundamental functionalities of parallel programming languages as well as the fork/join parallel model which is not supported by web workers. The language compiler automatically generates an equivalent parallel script that complies with the HTML5 standard. A case study on realistic rendering for surgical simulations demonstrates enhanced performance with a compact set of instructions. PMID:24732497
NASA Technical Reports Server (NTRS)
Denning, Peter J.; Tichy, Walter F.
1990-01-01
Highly parallel computing architectures are the only means to achieve the computation rates demanded by advanced scientific problems. A decade of research has demonstrated the feasibility of such machines and current research focuses on which architectures designated as multiple instruction multiple datastream (MIMD) and single instruction multiple datastream (SIMD) have produced the best results to date; neither shows a decisive advantage for most near-homogeneous scientific problems. For scientific problems with many dissimilar parts, more speculative architectures such as neural networks or data flow may be needed.
DeHart, Mark D; Williams, Mark L; Bowman, Stephen M
2010-01-01
The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement
Trajectory optimization using parallel shooting method on parallel computer
Wirthman, D.J.; Park, S.Y.; Vadali, S.R.
1995-03-01
The efficiency of a parallel shooting method on a parallel computer for solving a variety of optimal control guidance problems is studied. Several examples are considered to demonstrate that a speedup of nearly 7 to 1 is achieved with the use of 16 processors. It is suggested that further improvements in performance can be achieved by parallelizing in the state domain. 10 refs.
Massively parallel quantum computer simulator
NASA Astrophysics Data System (ADS)
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as a IBM BlueGene/L, a IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, a SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as benchmark for testing high-end parallel computers.
High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn
2014-11-14
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.
Parallel Computing Using Web Servers and "Servlets".
ERIC Educational Resources Information Center
Lo, Alfred; Bloor, Chris; Choi, Y. K.
2000-01-01
Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…
Merlin - Massively parallel heterogeneous computing
NASA Technical Reports Server (NTRS)
Wittie, Larry; Maples, Creve
1989-01-01
Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.
Parallel NPARC: Implementation and Performance
NASA Technical Reports Server (NTRS)
Townsend, S. E.
1996-01-01
Version 3 of the NPARC Navier-Stokes code includes support for large-grain (block level) parallelism using explicit message passing between a heterogeneous collection of computers. This capability has the potential for significant performance gains, depending upon the block data distribution. The parallel implementation uses a master/worker arrangement of processes. The master process assigns blocks to workers, controls worker actions, and provides remote file access for the workers. The processes communicate via explicit message passing using an interface library which provides portability to a number of message passing libraries, such as PVM (Parallel Virtual Machine). A Bourne shell script is used to simplify the task of selecting hosts, starting processes, retrieving remote files, and terminating a computation. This script also provides a simple form of fault tolerance. An analysis of the computational performance of NPARC is presented, using data sets from an F/A-18 inlet study and a Rocket Based Combined Cycle Engine analysis. Parallel speedup and overall computational efficiency were obtained for various NPARC run parameters on a cluster of IBM RS6000 workstations. The data show that although NPARC performance compares favorably with the estimated potential parallelism, typical data sets used with previous versions of NPARC will often need to be reblocked for optimum parallel performance. In one of the cases studied, reblocking increased peak parallel speedup from 3.2 to 11.8.
Computational results for parallel unstructured mesh computations
Jones, M.T.; Plassmann, P.E.
1994-12-31
The majority of finite element models in structural engineering are composed of unstructured meshes. These unstructured meshes are often very large and require significant computational resources; hence they are excellent candidates for massively parallel computation. Parallel solution of the sparse matrices that arise from such meshes has been studied heavily, and many good algorithms have been developed. Unfortunately, many of the other aspects of parallel unstructured mesh computation have gone largely ignored. The authors present a set of algorithms that allow the entire unstructured mesh computation process to execute in parallel -- including adaptive mesh refinement, equation reordering, mesh partitioning, and sparse linear system solution. They briefly describe these algorithms and state results regarding their running-time and performance. They then give results from the 512-processor Intel DELTA for a large-scale structural analysis problem. These results demonstrate that the new algorithms are scalable and efficient. The algorithms are able to achieve up to 2.2 gigaflops for this unstructured mesh problem.
Achieving high performance in numerical computations on RISC workstations and parallel systems
Goedecker, S.; Hoisie, A.
1997-08-20
The nominal peak speeds of both serial and parallel computers is raising rapidly. At the same time however it is becoming increasingly difficult to get out a significant fraction of this high peak speed from modern computer architectures. In this tutorial the authors give the scientists and engineers involved in numerically demanding calculations and simulations the necessary basic knowledge to write reasonably efficient programs. The basic principles are rather simple and the possible rewards large. Writing a program by taking into account optimization techniques related to the computer architecture can significantly speedup your program, often by factors of 10--100. As such, optimizing a program can for instance be a much better solution than buying a faster computer. If a few basic optimization principles are applied during program development, the additional time needed for obtaining an efficient program is practically negligible. In-depth optimization is usually only needed for a few subroutines or kernels and the effort involved is therefore also acceptable.
Parallel processing for scientific computations
NASA Technical Reports Server (NTRS)
Alkhatib, Hasan S.
1995-01-01
The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.
NASA Technical Reports Server (NTRS)
Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)
2001-01-01
The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have More than 10(exp 6) variables, therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.
Parallel computing in enterprise modeling.
Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.
2008-08-01
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.
Parallel processing for scientific computations
NASA Technical Reports Server (NTRS)
Alkhatib, Hasan S.
1991-01-01
The main contribution of the effort in the last two years is the introduction of the MOPPS system. After doing extensive literature search, we introduced the system which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications on a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.
A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)
NASA Astrophysics Data System (ADS)
van Werkhoven, B.; Maassen, J.; Kliphuis, M.; Dijkstra, H. A.; Brunnabend, S. E.; van Meersbergen, M.; Seinstra, F. J.; Bal, H. E.
2014-02-01
The Parallel Ocean Program (POP) is used in many strongly eddying ocean circulation simulations. Ideally it would be desirable to be able to do thousand-year-long simulations, but the current performance of POP prohibits these types of simulations. In this work, using a new distributed computing approach, two methods to improve the performance of POP are presented. The first is a block-partitioning scheme for the optimization of the load balancing of POP such that it can be run efficiently in a multi-platform setting. The second is the implementation of part of the POP model code on graphics processing units (GPUs). We show that the combination of both innovations also leads to a substantial performance increase when running POP simultaneously over multiple computational platforms.
A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)
NASA Astrophysics Data System (ADS)
van Werkhoven, B.; Maassen, J.; Kliphuis, M.; Dijkstra, H. A.; Brunnabend, S. E.; van Meersbergen, M.; Seinstra, F. J.; Bal, H. E.
2013-09-01
The Parallel Ocean Program (POP) is used in many strongly eddying ocean circulation simulations. Ideally one would like to do thousand-year long simulations, but the current performance of POP prohibits this type of simulations. In this work, using a new distributed computing approach, two innovations to improve the performance of POP are presented. The first is a new block partitioning scheme for the optimization of the load balancing of POP such that it can be run efficiently in a multi-platform setting. The second is an implementation of part of the POP model code on Graphics Processing Units. We show that the combination of both innovations leads to a substantial performance increase also when running POP simultaneously over multiple computational platforms.
Algorithmically specialized parallel computers
Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.
1985-01-01
This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.
Template based parallel checkpointing in a massively parallel computer system
Archer, Charles Jens; Inglett, Todd Alan
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
Reordering computations for parallel execution
NASA Technical Reports Server (NTRS)
Adams, L.
1985-01-01
The computations are reordered in the SOR algorithm to maintain the same asymptotic rate of convergence as the rowwise ordering to obtain parallelism at different levels. A parallel program is written to illustrate these ideas and actual machines for implementation of this program are discussed.
Chang, Justin; Karra, Satish; Nakshatrala, Kalyana B.
2016-07-26
It is well-known that the standard Galerkin formulation, which is often the formulation of choice under the finite element method for solving self-adjoint diffusion equations, does not meet maximum principles and the non-negative constraint for anisotropic diffusion equations. Recently, optimization-based methodologies that satisfy maximum principles and the non-negative constraint for steady-state and transient diffusion-type equations have been proposed. To date, these methodologies have been tested only on small-scale academic problems. The purpose of this paper is to systematically study the performance of the non-negative methodology in the context of high performance computing (HPC). PETSc and TAO libraries are, respectively, usedmore » for the parallel environment and optimization solvers. For large-scale problems, it is important for computational scientists to understand the computational performance of current algorithms available in these scientific libraries. The numerical experiments are conducted on the state-of-the-art HPC systems, and a single-core performance model is used to better characterize the efficiency of the solvers. Furthermore, our studies indicate that the proposed non-negative computational framework for diffusion-type equations exhibits excellent strong scaling for real-world large-scale problems.« less
Parallel computation using limited resources
Sugla, B.
1985-01-01
This thesis addresses itself to the task of designing and analyzing parallel algorithms when the resources of processors, communication, and time are limited. The two parts of this thesis deal with multiprocessor systems and VLSI - the two important parallel processing environments that are prevalent today. In the first part a time-processor-communication tradeoff analysis is conducted for two kinds of problems - N input, 1 output, and N input, N output computations. In the class of problems of the second kind, the problem of prefix computation, an important problem due to the number of naturally occurring computations it can model, is studied. Finally, a general methodology is given for design of parallel algorithms that can be used to optimize a given design to a wide set of architectural variations. The second part of the thesis considers the design of parallel algorithms for the VLSI model of computation when the resource of time is severely restricted.
Parallel Architecture For Robotics Computation
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1990-01-01
Universal Real-Time Robotic Controller and Simulator (URRCS) is highly parallel computing architecture for control and simulation of robot motion. Result of extensive algorithmic study of different kinematic and dynamic computational problems arising in control and simulation of robot motion. Study led to development of class of efficient parallel algorithms for these problems. Represents algorithmically specialized architecture, in sense capable of exploiting common properties of this class of parallel algorithms. System with both MIMD and SIMD capabilities. Regarded as processor attached to bus of external host processor, as part of bus memory.
Parallel computation with the force
NASA Technical Reports Server (NTRS)
Jordan, H. F.
1985-01-01
A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.
NASA Astrophysics Data System (ADS)
Alameda, J. C.
2011-12-01
Development and optimization of computational science models, particularly on high performance computers, and with the advent of ubiquitous multicore processor systems, practically on every system, has been accomplished with basic software tools, typically, command-line based compilers, debuggers, performance tools that have not changed substantially from the days of serial and early vector computers. However, model complexity, including the complexity added by modern message passing libraries such as MPI, and the need for hybrid code models (such as openMP and MPI) to be able to take full advantage of high performance computers with an increasing core count per shared memory node, has made development and optimization of such codes an increasingly arduous task. Additional architectural developments, such as many-core processors, only complicate the situation further. In this paper, we describe how our NSF-funded project, "SI2-SSI: A Productive and Accessible Development Workbench for HPC Applications Using the Eclipse Parallel Tools Platform" (WHPC) seeks to improve the Eclipse Parallel Tools Platform, an environment designed to support scientific code development targeted at a diverse set of high performance computing systems. Our WHPC project to improve Eclipse PTP takes an application-centric view to improve PTP. We are using a set of scientific applications, each with a variety of challenges, and using PTP to drive further improvements to both the scientific application, as well as to understand shortcomings in Eclipse PTP from an application developer perspective, to drive our list of improvements we seek to make. We are also partnering with performance tool providers, to drive higher quality performance tool integration. We have partnered with the Cactus group at Louisiana State University to improve Eclipse's ability to work with computational frameworks and extremely complex build systems, as well as to develop educational materials to incorporate into
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Biswas, Rupak; Frumkin, Michael; Feng, Huiyu; Biegel, Bryan (Technical Monitor)
2001-01-01
The contents include: 1) A brief history of NPB; 2) What is (not) being measured by NPB; 3) Irregular dynamic applications (UA Benchmark); and 4) Wide area distributed computing (NAS Grid Benchmarks-NGB). This paper is presented in viewgraph form.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
NASA Astrophysics Data System (ADS)
Chugunov, Svyatoslav; Li, Changying
2015-09-01
Parallel implementation of two numerical tools popular in optical studies of biological materials-Inverse Adding-Doubling (IAD) program and Monte Carlo Multi-Layered (MCML) program-was developed and tested in this study. The implementation was based on Message Passing Interface (MPI) and standard C-language. Parallel versions of IAD and MCML programs were compared to their sequential counterparts in validation and performance tests. Additionally, the portability of the programs was tested using a local high performance computing (HPC) cluster, Penguin-On-Demand HPC cluster, and Amazon EC2 cluster. Parallel IAD was tested with up to 150 parallel cores using 1223 input datasets. It demonstrated linear scalability and the speedup was proportional to the number of parallel cores (up to 150x). Parallel MCML was tested with up to 1001 parallel cores using problem sizes of 104-109 photon packets. It demonstrated classical performance curves featuring communication overhead and performance saturation point. Optimal performance curve was derived for parallel MCML as a function of problem size. Typical speedup achieved for parallel MCML (up to 326x) demonstrated linear increase with problem size. Precision of MCML results was estimated in a series of tests - problem size of 106 photon packets was found optimal for calculations of total optical response and 108 photon packets for spatially-resolved results. The presented parallel versions of MCML and IAD programs are portable on multiple computing platforms. The parallel programs could significantly speed up the simulation for scientists and be utilized to their full potential in computing systems that are readily available without additional costs.
Computational electromagnetics and parallel dense matrix computations
Forsman, K.; Kettunen, L.; Gropp, W.; Levine, D.
1995-06-01
We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.
Hayes, J C; Norman, M
1999-10-28
This report details an investigation into the efficacy of two approaches to solving the radiation diffusion equation within a radiation hydrodynamic simulation. Because leading-edge scientific computing platforms have evolved from large single-node vector processors to parallel aggregates containing tens to thousands of individual CPU's, the ability of an algorithm to maintain high compute efficiency when distributed over a large array of nodes is critically important. The viability of an algorithm thus hinges upon the tripartite question of numerical accuracy, total time to solution, and parallel efficiency.
Locating hardware faults in a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-04-13
Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
NASA Astrophysics Data System (ADS)
Gowtham, S.
Small clusters of gallium oxide, technologically important high temperature ceramic, together with interaction of nucleic acid bases with graphene and small-diameter carbon nanotube are focus of first principles calculations in this work. A high performance parallel computing platform is also developed to perform these calculations at Michigan Tech. First principles calculations are based on density functional theory employing either local density or gradient-corrected approximation together with plane wave and Gaussian basis sets. The bulk Ga2O3 is known to be a very good candidate for fabricating electronic devices that operate at high temperatures. To explore the properties of Ga2O3 at nanoscale, we have performed a systematic theoretical study on the small polyatomic gallium oxide clusters. The calculated results find that all lowest energy isomers of GamO n clusters are dominated by the Ga-O bonds over the metal-metal or the oxygen-oxygen bonds. Analysis of atomic charges suggest the clusters to be highly ionic similar to the case of bulk Ga2O3. In the study of sequential oxidation of these clusters starting from Ga3O, it is found that the most stable isomers display up to four different backbones of constituent atoms. Furthermore, the predicted configuration of the ground state of Ga2O is recently confirmed by the experimental results of Neumark's group. Guided by the results of calculations the study of gallium oxide clusters, performance related challenge of computational simulations, of producing high performance computers/platforms, has been addressed. Several engineering aspects were thoroughly studied during the design, development and implementation of the high performance parallel computing platform, RAMA, at Michigan Tech. In an attempt to stay true to the principles of Beowulf revolution, the RAMA cluster was extensively customized to make it easy to understand, and use - for administrators as well as end-users. Following the results of benchmark
The new landscape of parallel computer architecture
NASA Astrophysics Data System (ADS)
Shalf, John
2007-07-01
The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Computer-Aided Parallelizer and Optimizer
NASA Technical Reports Server (NTRS)
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
Parallel Algormiivls For Optical Digital Computers
NASA Astrophysics Data System (ADS)
Huang, Alan
1983-04-01
Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterized by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognizing a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed.
Parallel visualization on leadership computing resources
NASA Astrophysics Data System (ADS)
Peterka, T.; Ross, R. B.; Shen, H.-W.; Ma, K.-L.; Kendall, W.; Yu, H.
2009-07-01
Changes are needed in the way that visualization is performed, if we expect the analysis of scientific data to be effective at the petascale and beyond. By using similar techniques as those used to parallelize simulations, such as parallel I/O, load balancing, and effective use of interprocess communication, the supercomputers that compute these datasets can also serve as analysis and visualization engines for them. Our team is assessing the feasibility of performing parallel scientific visualization on some of the most powerful computational resources of the U.S. Department of Energy's National Laboratories in order to pave the way for analyzing the next generation of computational results. This paper highlights some of the conclusions of that research.
Parallel algorithms for optical digital computers
Huang, A.
1983-01-01
Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.
Parallel and vector computation in heat transfer
Georgiadis, J.G. ); Murthy, J.Y. )
1990-01-01
This collection of manuscripts complements a number of other volumes related to engineering numerical analysis in general; it also gives a preview of the potential contribution of vector and parallel computing to heat transfer. Contributions have been made from the fields of heat transfer, computational fluid mechanics or physics, and from researchers in industry or in academia. This work serves to indicate that new or modified numerical algorithms have to be developed depending on the hardware used (as the long titles of most of the papers in this volume imply). This volume contains six examples of numerical simulation on parallel and vector computers that demonstrate the competitiveness of the novel methodologies. A common thread through all the manuscripts is that they address problems involving irregular geometries or complex physics, or both. Comparative studies of the performance of certain algorithms on various computers are also presented. Most machines used in this work belong to the coarse- to medium-grain group (consisting of a few to a hundred processors) with architectures of the multiple-instruction-stream-multiple- data-stream (MIMD) type. Some of the machines used have both parallel and vector processors, while parallel computations are certainly emphasized. We hope that this work will contribute to the increasing involvement of heat transfer specialists with parallel computation.
Parallel processing for computer vision and display
Dew, P.M. . Dept. of Computer Studies); Earnshaw, R.A. ); Heywood, T.R. )
1989-01-01
The widespread availability of high performance computers has led to an increased awareness of the importance of visualization techniques particularly in engineering and science. However, many visualization tasks involve processing large amounts of data or manipulating complex computer models of 3D objects. For example, in the field of computer aided engineering it is often necessary to display an edit solid object (see Plate 1) which can take many minutes even on the fastest serial processors. Another example of a computationally intensive problem, this time from computer vision, is the recognition of objects in a 3D scene from a stereo image pair. To perform visualization tasks of this type in real and reasonable time it is necessary to exploit the advances in parallel processing that have taken place over the last decade. This book uniquely provides a collection of papers from leading visualization researchers with a common interest in the application and exploitation of parallel processing techniques.
Finite element computation with parallel VLSI
NASA Technical Reports Server (NTRS)
Mcgregor, J.; Salama, M.
1983-01-01
This paper describes a parallel processing computer consisting of a 16-bit microcomputer as a master processor which controls and coordinates the activities of 8086/8087 VLSI chip set slave processors working in parallel. The hardware is inexpensive and can be flexibly configured and programmed to perform various functions. This makes it a useful research tool for the development of, and experimentation with parallel mathematical algorithms. Application of the hardware to computational tasks involved in the finite element analysis method is demonstrated by the generation and assembly of beam finite element stiffness matrices. A number of possible schemes for the implementation of N-elements on N- or n-processors (N is greater than n) are described, and the speedup factors of their time consumption are determined as a function of the number of available parallel processors.
Efficient, massively parallel eigenvalue computation
NASA Technical Reports Server (NTRS)
Huo, Yan; Schreiber, Robert
1993-01-01
In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. An effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the Maspar MP-1 and MP-2 systems, is described. Results of numerical tests and timings are also presented.
NWChem: scalable parallel computational chemistry
van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat
2011-11-01
NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code1 specifically designed for large scale parallel applications2. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is build up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. Within the context created by the features
Impact of Parallel Computing on Large Scale Aeroelastic Computations
NASA Technical Reports Server (NTRS)
Guruswamy, Guru P.; Kwak, Dochan (Technical Monitor)
2000-01-01
Aeroelasticity is computationally one of the most intensive fields in aerospace engineering. Though over the last three decades the computational speed of supercomputers have substantially increased, they are still inadequate for large scale aeroelastic computations using high fidelity flow and structural equations. In addition to reaching a saturation in computational speed because of changes in economics, computer manufactures are stopping the manufacturing of mainframe type supercomputers. This has led computational aeroelasticians to face the gigantic task of finding alternate approaches for fulfilling their needs. The alternate path to over come speed and availability limitations of mainframe type supercomputers is to use parallel computers. During this decade several different architectures have evolved. In FY92 the US Government started the High Performance Computing and Communication (HPCC) program. As a participant in this program NASA developed several parallel computational tools for aeroelastic applications. This talk describes the impact of those application tools on high fidelity based multidisciplinary analysis.
Parallel Computation Of Forward Dynamics Of Manipulators
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1993-01-01
Report presents parallel algorithms and special parallel architecture for computation of forward dynamics of robotics manipulators. Products of effort to find best method of parallel computation to achieve required computational efficiency. Significant speedup of computation anticipated as well as cost reduction.
Efficient communication in massively parallel computers
Cypher, R.E.
1989-01-01
A fundamental operation in parallel computation is sorting. Sorting is important not only because it is required by many algorithms, but also because it can be used to implement irregular, pointer-based communication. The author studies two algorithms for sorting in massively parallel computers. First, he examines Shellsort. Shellsort is a sorting algorithm that is based on a sequence of parameters called increments. Shellsort can be used to create a parallel sorting device known as a sorting network. Researchers have suggested that if the correct increment sequence is used, an optimal size sorting network can be obtained. All published increment sequences have been monotonically decreasing. He shows that no monotonically decreasing increment sequence will yield an optimal size sorting network. Second, he presents a sorting algorithm called Cubesort. Cubesort is the fastest known sorting algorithm for a variety of parallel computers aver a wide range of parameters. He also presents a paradigm for developing parallel algorithms that have efficient communication. The paradigm, called the data reduction paradigm, consists of using a divide-and-conquer strategy. Both the division and combination phases of the divide-and-conquer algorithm may require irregular, pointer-based communication between processors. However, the problem is divided so as to limit the amount of data that must be communicated. As a result the communication can be performed efficiently. He presents data reduction algorithms for the image component labeling problem, the closest pair problem and four versions of the parallel prefix problem.
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Moitra, Stuti
1996-01-01
Various tridiagonal solvers have been proposed in recent years for different parallel platforms. In this paper, the performance of three tridiagonal solvers, namely, the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm, is studied. These algorithms are designed for distributed-memory machines and are tested on an Intel Paragon and an IBM SP2 machines. Measured results are reported in terms of execution time and speedup. Analytical study are conducted for different communication topologies and for different tridiagonal systems. The measured results match the analytical results closely. In addition to address implementation issues, performance considerations such as problem sizes and models of speedup are also discussed.
Optics Program Modified for Multithreaded Parallel Computing
NASA Technical Reports Server (NTRS)
Lou, John; Bedding, Dave; Basinger, Scott
2006-01-01
A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Parallel Pascal - An extended Pascal for parallel computers
NASA Technical Reports Server (NTRS)
Reeves, A. P.
1984-01-01
Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.
Opportunities in computational mechanics: Advances in parallel computing
Lesar, R.A.
1999-02-01
In this paper, the authors will discuss recent advances in computing power and the prospects for using these new capabilities for studying plasticity and failure. They will first review the new capabilities made available with parallel computing. They will discuss how these machines perform and how well their architecture might work on materials issues. Finally, they will give some estimates on the size of problems possible using these computers.
Pronk, Sander; Pouya, Iman; Lundborg, Magnus; Rotskoff, Grant; Wesén, Björn; Kasson, Peter M; Lindahl, Erik
2015-06-01
Computational chemistry and other simulation fields are critically dependent on computing resources, but few problems scale efficiently to the hundreds of thousands of processors available in current supercomputers-particularly for molecular dynamics. This has turned into a bottleneck as new hardware generations primarily provide more processing units rather than making individual units much faster, which simulation applications are addressing by increasingly focusing on sampling with algorithms such as free-energy perturbation, Markov state modeling, metadynamics, or milestoning. All these rely on combining results from multiple simulations into a single observation. They are potentially powerful approaches that aim to predict experimental observables directly, but this comes at the expense of added complexity in selecting sampling strategies and keeping track of dozens to thousands of simulations and their dependencies. Here, we describe how the distributed execution framework Copernicus allows the expression of such algorithms in generic workflows: dataflow programs. Because dataflow algorithms explicitly state dependencies of each constituent part, algorithms only need to be described on conceptual level, after which the execution is maximally parallel. The fully automated execution facilitates the optimization of these algorithms with adaptive sampling, where undersampled regions are automatically detected and targeted without user intervention. We show how several such algorithms can be formulated for computational chemistry problems, and how they are executed efficiently with many loosely coupled simulations using either distributed or parallel resources with Copernicus. PMID:26575558
Parallel Algebraic Multigrid Methods - High Performance Preconditioners
Yang, U M
2004-11-11
The development of high performance, massively parallel computers and the increasing demands of computationally challenging applications have necessitated the development of scalable solvers and preconditioners. One of the most effective ways to achieve scalability is the use of multigrid or multilevel techniques. Algebraic multigrid (AMG) is a very efficient algorithm for solving large problems on unstructured grids. While much of it can be parallelized in a straightforward way, some components of the classical algorithm, particularly the coarsening process and some of the most efficient smoothers, are highly sequential, and require new parallel approaches. This chapter presents the basic principles of AMG and gives an overview of various parallel implementations of AMG, including descriptions of parallel coarsening schemes and smoothers, some numerical results as well as references to existing software packages.
Synchronizing compute node time bases in a parallel computer
Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip
2014-12-30
Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
Synchronizing compute node time bases in a parallel computer
Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip
2015-01-27
Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta J.; Gimenez, J.; Caubet, J.
2003-01-01
Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
Broadcasting a message in a parallel computer
Berg, Jeremy E.; Faraj, Ahmad A.
2011-08-02
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.
LEWICE droplet trajectory calculations on a parallel computer
NASA Technical Reports Server (NTRS)
Caruso, Steven C.
1993-01-01
A parallel computer implementation (128 processors) of LEWICE, a NASA Lewis code used to predict the time-dependent ice accretion process for two-dimensional aerodynamic bodies of simple geometries, is described. Two-dimensional parallel droplet trajectory calculations are performed to demonstrate the potential benefits of applying parallel processing to ice accretion analysis. Parallel performance is evaluated as a function of the number of trajectories and the number of processors. For comparison, similar trajectory calculations are performed on single-processor Cray computers, and the best parallel results are found to be 33 and 23 times faster, respectively, than those of the Cray XMP and YMP.
Efficient Parallel Engineering Computing on Linux Workstations
NASA Technical Reports Server (NTRS)
Lou, John Z.
2010-01-01
A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).
An Expert Assistant for Computer Aided Parallelization
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Chun, Robert; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit
2004-01-01
The prototype implementation of an expert system was developed to assist the user in the computer aided parallelization process. The system interfaces to tools for automatic parallelization and performance analysis. By fusing static program structure information and dynamic performance analysis data the expert system can help the user to filter, correlate, and interpret the data gathered by the existing tools. Sections of the code that show poor performance and require further attention are rapidly identified and suggestions for improvements are presented to the user. In this paper we describe the components of the expert system and discuss its interface to the existing tools. We present a case study to demonstrate the successful use in full scale scientific applications.
Efficient parallel global garbage collection on massively parallel computers
Kamada, Tomio; Matsuoka, Satoshi; Yonezawa, Akinori
1994-12-31
On distributed-memory high-performance MPPs where processors are interconnected by an asynchronous network, efficient Garbage Collection (GC) becomes difficult due to inter-node references and references within pending, unprocessed messages. The parallel global GC algorithm (1) takes advantage of reference locality, (2) efficiently traverses references over nodes, (3) admits minimum pause time of ongoing computations, and (4) has been shown to scale up to 1024 node MPPs. The algorithm employs a global weight counting scheme to substantially reduce message traffic. The two methods for confirming the arrival of pending messages are used: one counts numbers of messages and the other uses network `bulldozing.` Performance evaluation in actual implementations on a multicomputer with 32-1024 nodes, Fujitsu AP1000, reveals various favorable properties of the algorithm.
Implementing clips on a parallel computer
NASA Technical Reports Server (NTRS)
Riley, Gary
1987-01-01
The C language integrated production system (CLIPS) is a forward chaining rule based language to provide training and delivery for expert systems. Conceptually, rule based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Parallel evolutionary computation in bioinformatics applications.
Pinho, Jorge; Sobral, João Luis; Rocha, Miguel
2013-05-01
A large number of optimization problems within the field of Bioinformatics require methods able to handle its inherent complexity (e.g. NP-hard problems) and also demand increased computational efforts. In this context, the use of parallel architectures is a necessity. In this work, we propose ParJECoLi, a Java based library that offers a large set of metaheuristic methods (such as Evolutionary Algorithms) and also addresses the issue of its efficient execution on a wide range of parallel architectures. The proposed approach focuses on the easiness of use, making the adaptation to distinct parallel environments (multicore, cluster, grid) transparent to the user. Indeed, this work shows how the development of the optimization library can proceed independently of its adaptation for several architectures, making use of Aspect-Oriented Programming. The pluggable nature of parallelism related modules allows the user to easily configure its environment, adding parallelism modules to the base source code when needed. The performance of the platform is validated with two case studies within biological model optimization. PMID:23127284
Parallel software support for computational structural mechanics
NASA Technical Reports Server (NTRS)
Jordan, Harry F.
1987-01-01
The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Parallel computing: One opportunity, four challenges
Gaudiot, J.-L.
1989-12-31
The author reviews briefly the area of parallel computer processing. This area has been expanding at a great rate in the past decade. Great strides have been made in the hardware area, and in the speed of performance of chips. However to some degree the hardware area is beginning to run into basic physical speed limits, which will slow the rate of advance of this area simply because of physical limitations. The author looks at ways that computer architecture, and software applications, can work to continue the rate of increase in computing power which has occurred over the past decade. Four particular areas are mentioned: programmability; communication network design; reliable operation; performance evaluation and benchmarking.
Parallel Performance of a Combustion Chemistry Simulation
Skinner, Gregg; Eigenmann, Rudolf
1995-01-01
We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.
A fast algorithm for parallel computation of multibody dynamics on MIMD parallel architectures
NASA Technical Reports Server (NTRS)
Fijany, Amir; Kwan, Gregory; Bagherzadeh, Nader
1993-01-01
In this paper the implementation of a parallel O(LogN) algorithm for computation of rigid multibody dynamics on a Hypercube MIMD parallel architecture is presented. To our knowledge, this is the first algorithm that achieves the time lower bound of O(LogN) by using an optimal number of O(N) processors. However, in addition to its theoretical significance, the algorithm is also highly efficient for practical implementation on commercially available MIMD parallel architectures due to its highly coarse grain size and simple communication and synchronization requirements. We present a multilevel parallel computation strategy for implementation of the algorithm on a Hypercube. This strategy allows the exploitation of parallelism at several computational levels as well as maximum overlapping of computation and communication to increase the performance of parallel computation.
Parallelized reliability estimation of reconfigurable computer networks
NASA Technical Reports Server (NTRS)
Nicol, David M.; Das, Subhendu; Palumbo, Dan
1990-01-01
A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.
Seismic imaging on massively parallel computers
Ober, C.C.; Oldfield, R.A.; Womble, D.E.; Mosher, C.C.
1997-07-01
A key to reducing the risks and costs associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Pre-stack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar-wave equation using finite differences. Current industry computational capabilities are insufficient for the application of finite-difference, 3-D, prestack, depth-migration algorithms. High performance computers and state-of-the-art algorithms and software are required to meet this need. As part of an ongoing ACTI project funded by the US Department of Energy, the authors have developed a finite-difference, 3-D prestack, depth-migration code for massively parallel computer systems. The goal of this work is to demonstrate that massively parallel computers (thousands of processors) can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite-difference, prestack, depth migration practical for oil and gas exploration.
Lattice QCD for parallel computers
NASA Astrophysics Data System (ADS)
Quadling, Henley Sean
Lattice QCD is an important tool in the investigation of Quantum Chromodynamics (QCD). This is particularly true at lower energies where traditional perturbative techniques fail, and where other non-perturbative theoretical efforts are not entirely satisfactory. Important features of QCD such as confinement and the masses of the low lying hadronic states have been demonstrated and calculated in lattice QCD simulations. In calculations such as these, non-lattice techniques in QCD have failed. However, despite the incredible advances in computer technology, a full solution of lattice QCD may still be in the too-distant future. Much effort is being expended in the search for ways to reduce the computational burden so that an adequate solution of lattice QCD is possible in the near future. There has been considerable progress in recent years, especially in the research of improved lattice actions. In this thesis, a new approach to lattice QCD algorithms is introduced, which results in very significant efficiency improvements. The new approach is explained in detail, evaluated and verified by comparing physics results with current lattice QCD simulations. The new sub-lattice layout methodology has been specifically designed for current and future hardware. Together with concurrent research into improved lattice actions and more efficient numerical algorithms, the very significant efficiency improvements demonstrated in this thesis can play an important role in allowing lattice QCD researchers access to much more realistic simulations. The techniques presented in this thesis also allow ambitious QCD simulations to be performed on cheap clusters of commodity computers.
Parallel computations and control of adaptive structures
NASA Technical Reports Server (NTRS)
Park, K. C.; Alvin, Kenneth F.; Belvin, W. Keith; Chong, K. P. (Editor); Liu, S. C. (Editor); Li, J. C. (Editor)
1991-01-01
The equations of motion for structures with adaptive elements for vibration control are presented for parallel computations to be used as a software package for real-time control of flexible space structures. A brief introduction of the state-of-the-art parallel computational capability is also presented. Time marching strategies are developed for an effective use of massive parallel mapping, partitioning, and the necessary arithmetic operations. An example is offered for the simulation of control-structure interaction on a parallel computer and the impact of the approach presented for applications in other disciplines than aerospace industry is assessed.
Remarks on parallel computations in MATLAB environment
NASA Astrophysics Data System (ADS)
Opalska, Katarzyna; Opalski, Leszek
2013-10-01
The paper attempts to summarize author's investigation of parallel computation capability of MATLAB environment in solving large ordinary differential equations (ODEs). Two MATLAB versions were tested and two parallelization techniques: one used multiple processors-cores, the other - CUDA compatible Graphics Processing Units (GPUs). A set of parameterized test problems was specially designed to expose different capabilities/limitations of the different variants of the parallel computation environment tested. Presented results illustrate clearly the superiority of the newer MATLAB version and, elapsed time advantage of GPU-parallelized computations for large dimensionality problems over the multiple processor-cores (with speed-up factor strongly dependent on the problem structure).
Running Geant on T. Node parallel computer
Jejcic, A.; Maillard, J.; Silva, J. ); Mignot, B. )
1990-08-01
AnInmos transputer-based computer has been utilized to overcome the difficulties due to the limitations on the processing abilities of event parallelism and multiprocessor farms (i.e., the so called bus-crisis) and the concern regarding the growing sizes of databases typical in High Energy Physics. This study was done on the T.Node parallel computer manufactured by TELMAT. Detailed figures are reported concerning the event parallelization. (AIP)
Modified mesh-connected parallel computers
Carlson, D.A. )
1988-10-01
The mesh-connected parallel computer is an important parallel processing organization that has been used in the past for the design of supercomputing systems. In this paper, the authors explore modifications of a mesh-connected parallel computer for the purpose of increasing the efficiency of executing important application programs. These modifications are made by adding one or more global mesh structures to the processing array. They show how our modifications allow asymptotic improvements in the efficiency of executing computations having low to medium interprocessor communication requirements (e.g., tree computations, prefix computations, finding the connected components of a graph). For computations with high interprocessor communication requirements such as sorting, they show that they offer no speedup. They also compare the modified mesh-connected parallel computer to other similar organizations including the pyramid, the X-tree, and the mesh-of-trees.
Computation and parallel implementation for early vision
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
1990-01-01
The problem of early vision is to transform one or more retinal illuminance images-pixel arrays-to image representations built out of such primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation of SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.
Parallel computation of manipulator inverse dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1991-01-01
In this article, parallel computation of manipulator inverse dynamics is investigated. A hierarchical graph-based mapping approach is devised to analyze the inherent parallelism in the Newton-Euler formulation at several computational levels, and to derive the features of an abstract architecture for exploitation of parallelism. At each level, a parallel algorithm represents the application of a parallel model of computation that transforms the computation into a graph whose structure defines the features of an abstract architecture, i.e., number of processors, communication structure, etc. Data-flow analysis is employed to derive the time lower bound in the computation as well as the sequencing of the abstract architecture. The features of the target architecture are defined by optimization of the abstract architecture to exploit maximum parallelism while minimizing architectural complexity. An architecture is designed and implemented that is capable of efficient exploitation of parallelism at several computational levels. The computation time of the Newton-Euler formulation for a 6-degree-of-freedom (dof) general manipulator is measured as 187 microsec. The increase in computation time for each additional dof is 23 microsec, which leads to a computation time of less than 500 microsec, even for a 12-dof redundant arm.
Optimal dynamic remapping of parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Reynolds, Paul F., Jr.
1987-01-01
A large class of computations are characterized by a sequence of phases, with phase changes occurring unpredictably. The decision problem was considered regarding the remapping of workload to processors in a parallel computation when the utility of remapping and the future behavior of the workload is uncertain, and phases exhibit stable execution requirements during a given phase, but requirements may change radically between phases. For these problems a workload assignment generated for one phase may hinder performance during the next phase. This problem is treated formally for a probabilistic model of computation with at most two phases. The fundamental problem of balancing the expected remapping performance gain against the delay cost was addressed. Stochastic dynamic programming is used to show that the remapping decision policy minimizing the expected running time of the computation has an extremely simple structure. Because the gain may not be predictable, the performance of a heuristic policy that does not require estimnation of the gain is examined. The heuristic method's feasibility is demonstrated by its use on an adaptive fluid dynamics code on a multiprocessor. The results suggest that except in extreme cases, the remapping decision problem is essentially that of dynamically determining whether gain can be achieved by remapping after a phase change. The results also suggest that this heuristic is applicable to computations with more than two phases.
Nonisothermal multiphase subsurface transport on parallel computers
Martinez, M.J.; Hopkins, P.L.; Shadid, J.N.
1997-10-01
We present a numerical method for nonisothermal, multiphase subsurface transport in heterogeneous porous media. The mathematical model considers nonisothermal two-phase (liquid/gas) flow, including capillary pressure effects, binary diffusion in the gas phase, conductive, latent, and sensible heat transport. The Galerkin finite element method is used for spatial discretization, and temporal integration is accomplished via a predictor/corrector scheme. Message-passing and domain decomposition techniques are used for implementing a scalable algorithm for distributed memory parallel computers. An illustrative application is shown to demonstrate capabilities and performance.
Hypercluster - Parallel processing for computational mechanics
NASA Technical Reports Server (NTRS)
Blech, Richard A.
1988-01-01
An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
Drought monitoring through parallel computing
Burrage, K.; Belward, J.; Lau, L.; Rezny, M.; Young, R.
1993-12-31
One area where high performance computing can make a significant social and economic impact in Australia (especially in view of the recent El-Nino) is in the accurate and efficient monitoring and prediction of drought conditions - both in terms of speed of calculation and in high quality visualization. As a consequence, the Queensland Department of Primary Industries (DPI) is developing a spatial model of pasture growth and utilization for monitoring, assessment and prediction of the future of the state`s rangeloads. This system incorporates soil class, pasture type, tree cover, herbivore density and meterological data. DPI`s drought research program aims to predict the occurrence of feed deficits and land condition alerts on a quarter to half shire basis over Queensland. This will provide a basis for large-scale management decisions by graziers and politicians alike.
Performance studies of the parallel VIM code
Shi, B.; Blomquist, R.N.
1996-05-01
In this paper, the authors evaluate the performance of the parallel version of the VIM Monte Carlo code on the IBM SPx at the High Performance Computing Research Facility at ANL. Three test problems with contrasting computational characteristics were used to assess effects in performance. A statistical method for estimating the inefficiencies due to load imbalance and communication is also introduced. VIM is a large scale continuous energy Monte Carlo radiation transport program and was parallelized using history partitioning, the master/worker approach, and p4 message passing library. Dynamic load balancing is accomplished when the master processor assigns chunks of histories to workers that have completed a previously assigned task, accommodating variations in the lengths of histories, processor speeds, and worker loads. At the end of each batch (generation), the fission sites and tallies are sent from each worker to the master process, contributing to the parallel inefficiency. All communications are between master and workers, and are serial. The SPx is a scalable 128-node parallel supercomputer with high-performance Omega switches of 63 {micro}sec latency and 35 MBytes/sec bandwidth. For uniform and reproducible performance, they used only the 120 identical regular processors (IBM RS/6000) and excluded the remaining eight planet nodes, which may be loaded by other`s jobs.
Parallel algorithms for mapping pipelined and parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
Parallel computing techniques for rotorcraft aerodynamics
NASA Astrophysics Data System (ADS)
Ekici, Kivanc
The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
Parallel Proximity Detection for Computer Simulation
NASA Technical Reports Server (NTRS)
Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)
1997-01-01
The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are includes by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.
Parallel Proximity Detection for Computer Simulations
NASA Technical Reports Server (NTRS)
Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)
1998-01-01
The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are included by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.
Collectively loading an application in a parallel computer
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Miller, Samuel J.; Mundy, Michael B.
2016-01-05
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
Massively Parallel Computing: A Sandia Perspective
Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.
1999-05-06
The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant break-throughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.
High-performance computing — an overview
NASA Astrophysics Data System (ADS)
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
PARALLEL GROUNDWATER COMPUTATIONS USING PVM
Multiprocessing provides an opportunity or faster execution of programs and increased use of idle computing resources, enabling more detailed examination of more comprehensive models. ultiprocessor architectures are currently diverse, experimental, and not widely available. VM (P...
Accurate modeling of parallel scientific computations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Townsend, James C.
1988-01-01
Scientific codes are usually parallelized by partitioning a grid among processors. To achieve top performance it is necessary to partition the grid so as to balance workload and minimize communication/synchronization costs. This problem is particularly acute when the grid is irregular, changes over the course of the computation, and is not known until load time. Critical mapping and remapping decisions rest on the ability to accurately predict performance, given a description of a grid and its partition. This paper discusses one approach to this problem, and illustrates its use on a one-dimensional fluids code. The models constructed are shown to be accurate, and are used to find optimal remapping schedules.
Compute Server Performance Results
NASA Technical Reports Server (NTRS)
Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)
1994-01-01
Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single processor rate of any vendor. However, if the price-performance ration (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPR's of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite,
Parallelization of ARC3D with Computer-Aided Tools
NASA Technical Reports Server (NTRS)
Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SCSI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, Cesspools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably well performance, for example, having a speedup of 30 for 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts for the automated parallelization process although user intervention in many of these parts are still necessary. Nevertheless, development and improvement of useful software tools, such as Cesspools, can help trim down many tedious parallelization details and improve the processing efficiency.
Identifying failure in a tree network of a parallel computer
Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.
2010-08-24
Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.
Seismic imaging on massively parallel computers
Ober, C.C.; Oldfield, R.; Womble, D.E.; VanDyke, J.; Dosanjh, S.
1996-03-01
Fast, accurate imaging of complex, oil-bearing geologies, such as overthrusts and salt domes, is the key to reducing the costs of domestic oil and gas exploration. Geophysicists say that the known oil reserves in the Gulf of Mexico could be significantly increased if accurate seismic imaging beneath salt domes was possible. A range of techniques exist for imaging these regions, but the highly accurate techniques involve the solution of the wave equation and are characterized by large data sets and large computational demands. Massively parallel computers can provide the computational power for these highly accurate imaging techniques. A brief introduction to seismic processing will be presented, and the implementation of a seismic-imaging code for distributed memory computers will be discussed. The portable code, Salvo, performs a wave equation-based, 3-D, prestack, depth imaging and currently runs on the Intel Paragon and the Cray T3D. It used MPI for portability, and has sustained 22 Mflops/sec/proc (compiled FORTRAN) on the Intel Paragon.
Simple, parallel virtual machines for extreme computations
NASA Astrophysics Data System (ADS)
Chokoufe Nejad, Bijan; Ohl, Thorsten; Reuter, Jürgen
2015-11-01
We introduce a virtual machine (VM) written in a numerically fast language like Fortran or C for evaluating very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte-code from the optimal matrix element generator, O'MEGA. Furthermore, this approach allows to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze hereby the scaling behavior with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and has in general runtimes in the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.
A scalable parallel black oil simulator on distributed memory parallel computers
NASA Astrophysics Data System (ADS)
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Fast Parallel Computation Of Manipulator Inverse Dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1991-01-01
Method for fast parallel computation of inverse dynamics problem, essential for real-time dynamic control and simulation of robot manipulators, undergoing development. Enables exploitation of high degree of parallelism and, achievement of significant computational efficiency, while minimizing various communication and synchronization overheads as well as complexity of required computer architecture. Universal real-time robotic controller and simulator (URRCS) consists of internal host processor and several SIMD processors with ring topology. Architecture modular and expandable: more SIMD processors added to match size of problem. Operate asynchronously and in MIMD fashion.
NASA Astrophysics Data System (ADS)
Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.
2016-05-01
In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
Methods for operating parallel computing systems employing sequenced communications
Benner, Robert E.; Gustafson, John L.; Montry, Gary R.
1999-01-01
A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.
CFD research, parallel computation and aerodynamic optimization
NASA Technical Reports Server (NTRS)
Ryan, James S.
1995-01-01
Over five years of research in Computational Fluid Dynamics and its applications are covered in this report. Using CFD as an established tool, aerodynamic optimization on parallel architectures is explored. The objective of this work is to provide better tools to vehicle designers. Submarine design requires accurate force and moment calculations in flow with thick boundary layers and large separated vortices. Low noise production is critical, so flow into the propulsor region must be predicted accurately. The High Speed Civil Transport (HSCT) has been the subject of recent work. This vehicle is to be a passenger vehicle with the capability of cutting overseas flight times by more than half. A successful design must surpass the performance of comparable planes. Fuel economy, other operational costs, environmental impact, and range must all be improved substantially. For all these reasons, improved design tools are required, and these tools must eventually integrate optimization, external aerodynamics, propulsion, structures, heat transfer and other disciplines.
Dynamic Load Balancing for Computational Plasticity on Parallel Computers
NASA Technical Reports Server (NTRS)
Pramono, Eddy; Simon, Horst
1994-01-01
The simulation of the computational plasticity on a complex structure remains a formidable computational task, especially when a highly nonlinear, complex material model was used. It appears that the computational requirements for a such problem can only be satisfied by massively parallel architectures. In order to effectively harness the tremendous computational power provided by such architectures, it is imperative to investigate and to study the algorithmic and implementation issues pertaining to dynamic load balancing for computational plasticity on a highly parallel, distributed-memory, multiple-instruction, multiple-data computers. This paper will measure the effectiveness of the algorithms developed in handling the dynamic load balancing.
Link failure detection in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Megerian, Mark G.; Smith, Brian E.
2010-11-09
Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.
Internode data communications in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.
2013-09-03
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Internode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
NASA Technical Reports Server (NTRS)
Abineri, Adrian; Chen, Y. P.
1992-01-01
APTEC Computer Systems is a Portland, Oregon based manufacturer of I/O computers. APTEC's work in the context of high density storage media is on programs requiring real-time data capture with low latency processing and storage requirements. An example of APTEC's work in this area is the Loral/Space Telescope-Data Archival and Distribution System. This is an existing Loral AeroSys designed system, which utilizes an APTEC I/O computer. The key attributes of a system architecture that is suitable for this environment are as follows: (1) data acquisition alternatives; (2) a wide range of supported mass storage devices; (3) data processing options; (4) data availability through standard network connections; and (5) an overall system architecture (hardware and software designed for high bandwidth and low latency). APTEC's approach is outlined in this document.
Fast Parallel Computation Of Multibody Dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Kwan, Gregory L.; Bagherzadeh, Nader
1996-01-01
Constraint-force algorithm fast, efficient, parallel-computation algorithm for solving forward dynamics problem of multibody system like robot arm or vehicle. Solves problem in minimum time proportional to log(N) by use of optimal number of processors proportional to N, where N is number of dynamical degrees of freedom: in this sense, constraint-force algorithm both time-optimal and processor-optimal parallel-processing algorithm.
A Formal Model for Real-Time Parallel Computation
Hui, Peter SY; Chikkagoudar, Satish
2012-12-29
The imposition of real-time constraints on a parallel computing environment--- specifically high-performance, cluster-computing systems--- introduces a variety of challenges with respect to the formal verification of the system's timing properties. In this paper, we briefly motivate the need for such a system, and we introduce an automaton-based method for performing such formal verification. We define the concept of a consistent parallel timing system: a hybrid system consisting of a set of timed automata (specifically, timed Buechi automata as well as a timed variant of standard finite automata), intended to model the timing properties of a well-behaved real-time parallel system. Finally, we give a brief case study to demonstrate the concepts in the paper: a parallel matrix multiplication kernel which operates within provable upper time bounds. We give the algorithm used, a corresponding consistent parallel timing system, and empirical results showing that the system operates under the specified timing constraints.
Parallel Computation of Airflow in the Human Lung Model
NASA Astrophysics Data System (ADS)
Lee, Taehun; Tawhai, Merryn; Hoffman, Eric. A.
2005-11-01
Parallel computations of airflow in the human lung based on domain decomposition are performed. The realistic lung model is segmented and reconstructed from CT images as part of an effort to build a normative atlas (NIH HL-04368) documenting airway geometry over 4 decades of age in healthy and disease-state adult humans. Because of the large number of the airway generation and the sheer complexity of the geometry, massively parallel computation of pulmonary airflow is carried out. We present the parallel algorithm implemented in the custom-developed characteristic-Galerkin finite element method, evaluate the speed-up and scalability of the scheme, and estimate the computing resources needed to simulate the airflow in the conducting airways of the human lungs. It is found that the special tree-like geometry enables the inter-processor communications to occur among only three or four processors for optimal parallelization irrespective of the number of processors involved in the computation.
Parallel aeroelastic computations for wing and wing-body configurations
NASA Technical Reports Server (NTRS)
Byun, Chansup
1994-01-01
The objective of this research is to develop computationally efficient methods for solving fluid-structural interaction problems by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures on parallel computers. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.
Wing-Body Aeroelasticity on Parallel Computers
NASA Technical Reports Server (NTRS)
Guruswamy, Guru P.; Byun, Chansup
1996-01-01
This article presents a procedure for computing the aeroelasticity of wing-body configurations on multiple-instruction, multiple-data parallel computers. In this procedure, fluids are modeled using Euler equations discretized by a finite difference method, and structures are modeled using finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. A parallel integration scheme is used to compute aeroelastic responses by solving the coupled fluid and structural equations concurrently while keeping modularity of each discipline. The present procedure is validated by computing the aeroelastic response of a wing and comparing with experiment. Aeroelastic computations are illustrated for a high speed civil transport type wing-body configuration.
Methods for operating parallel computing systems employing sequenced communications
Benner, R.E.; Gustafson, J.L.; Montry, G.R.
1999-08-10
A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.
Chare kernel; A runtime support system for parallel computations
Shu, W. ); Kale, L.V. )
1991-03-01
This paper presents the chare kernel system, which supports parallel computations with irregular structure. The chare kernel is a collection of primitive functions that manage chares, manipulative messages, invoke atomic computations, and coordinate concurrent activities. Programs written in the chare kernel language can be executed on different parallel machines without change. Users writing such programs concern themselves with the creation of parallel actions but not with assigning them to specific processors. The authors describe the design and implementation of the chare kernel. Performance of chare kernel programs on two hypercube machines, the Intel iPSC/2 and the NCUBE, is also given.
Parallel computing using a Lagrangian formulation
NASA Technical Reports Server (NTRS)
Liou, May-Fun; Loh, Ching Yuen
1991-01-01
A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on a 8192-processor CM-2 over a single processor of a CRAY-2.
Temporal fringe pattern analysis with parallel computing
Tuck Wah Ng; Kar Tien Ang; Argentini, Gianluca
2005-11-20
Temporal fringe pattern analysis is invaluable in transient phenomena studies but necessitates long processing times. Here we describe a parallel computing strategy based on the single-program multiple-data model and hyperthreading processor technology to reduce the execution time. In a two-node cluster workstation configuration we found that execution periods were reduced by 1.6 times when four virtual processors were used. To allow even lower execution times with an increasing number of processors, the time allocated for data transfer, data read, and waiting should be minimized. Parallel computing is found here to present a feasible approach to reduce execution times in temporal fringe pattern analysis.
Highly parallel computer architecture for robotic computation
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)
1991-01-01
In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective opeartion through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2014-08-12
Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm
Use of parallel computing in mass processing of laser data
NASA Astrophysics Data System (ADS)
Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.
2015-12-01
The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.
Rectilinear partitioning of irregular data parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1991-01-01
New mapping algorithms for domain oriented data-parallel computations, where the workload is distributed irregularly throughout the domain, but exhibits localized communication patterns are described. Researchers consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network. Discussed here is an improved algorithm for finding the optimal partitioning in one dimension, new algorithms for partitioning in two dimensions, and optimal partitioning in three dimensions. The application of these algorithms to real problems are discussed.
Three parallel computation methods for structural vibration analysis
NASA Technical Reports Server (NTRS)
Storaasli, Olaf; Bostic, Susan; Patrick, Merrell; Mahajan, Umesh; Ma, Shing
1988-01-01
The Lanczos (1950), multisectioning, and subspace iteration sequential methods for vibration analysis presently used as bases for three parallel algorithms are noted, in the aftermath of three example problems, to maintain reasonable accuracy in the computation of vibration frequencies. Significant computation time reductions are obtained as the number of processors increases. An analysis is made of the performance of each method, in order to characterize relative strengths and weaknesses as well as to identify those parameters that most strongly affect computation efficiency.
Parallel Domain Decomposition Preconditioning for Computational Fluid Dynamics
NASA Technical Reports Server (NTRS)
Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai; Kutler, Paul (Technical Monitor)
1998-01-01
This viewgraph presentation gives an overview of the parallel domain decomposition preconditioning for computational fluid dynamics. Details are given on some difficult fluid flow problems, stabilized spatial discretizations, and Newton's method for solving the discretized flow equations. Schur complement domain decomposition is described through basic formulation, simplifying strategies (including iterative subdomain and Schur complement solves, matrix element dropping, localized Schur complement computation, and supersparse computations), and performance evaluation.
Solving tridiagonal linear systems on the Butterfly parallel computer
Kumar, S.P.
1989-01-01
A parallel block partitioning method to solve a tri-diagonal system of linear equations is adapted to the BBN Butterfly multiprocessor. A performance analysis of the programming experiments on the 32-node Butterfly is presented. An upper bound on the number of processors to achieve the best performance with this method is derived. The computational results verify the theoretical speedup and efficiency results of the parallel algorithm over its serial counterpart. Also included is a study comparing performance runs of the same code on the Butterfly processor with a hardware floating point unit and on one with a software floating point facility. The total parallel time of the given code is considerably reduced by making use of the hardware floating point facility whereas the speedup and efficiency of the parallel program considerably improve on the system with software floating point capability. The achieved results are shown to be within 82% to 90% of the predicted performance.
Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges
NASA Technical Reports Server (NTRS)
Djomehri, Mohammad Jahed; Biswas, R.; VanderWijngaart, R.; Yarrow, M.
2000-01-01
This paper describes several results of parallel and distributed computing using a large scale production flow solver program. A coarse grained parallelization based on clustering of discretization grids combined with partitioning of large grids for load balancing is presented. An assessment is given of its performance on distributed and distributed-shared memory platforms using large scale scientific problems. An experiment with this solver, adapted to a Wide Area Network execution environment is presented. We also give a comparative performance assessment of computation and communication times on both the tightly and loosely-coupled machines.
Numerical simulation of supersonic wake flow with parallel computers
Wong, C.C.; Soetrisno, M.
1995-07-01
Simulating a supersonic wake flow field behind a conical body is a computing intensive task. It requires a large number of computational cells to capture the dominant flow physics and a robust numerical algorithm to obtain a reliable solution. High performance parallel computers with unique distributed processing and data storage capability can provide this need. They have larger computational memory and faster computing time than conventional vector computers. We apply the PINCA Navier-Stokes code to simulate a wind-tunnel supersonic wake experiment on Intel Gamma, Intel Paragon, and IBM SP2 parallel computers. These simulations are performed to study the mean flow in the near wake region of a sharp, 7-degree half-angle, adiabatic cone at Mach number 4.3 and freestream Reynolds number of 40,600. Overall the numerical solutions capture the general features of the hypersonic laminar wake flow and compare favorably with the wind tunnel data. With a refined and clustering grid distribution in the recirculation zone, the calculated location of the rear stagnation point is consistent with the 2D axisymmetric and 3D experiments. In this study, we also demonstrate the importance of having a large local memory capacity within a computer node and the effective utilization of the number of computer nodes to achieve good parallel performance when simulating a complex, large-scale wake flow problem.
Parallel Computational Fluid Dynamics: Current Status and Future Requirements
NASA Technical Reports Server (NTRS)
Simon, Horst D.; VanDalsem, William R.; Dagum, Leonardo; Kutler, Paul (Technical Monitor)
1994-01-01
One or the key objectives of the Applied Research Branch in the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Allies Research Center is the accelerated introduction of highly parallel machines into a full operational environment. In this report we discuss the performance results obtained from the implementation of some computational fluid dynamics (CFD) applications on the Connection Machine CM-2 and the Intel iPSC/860. We summarize some of the experiences made so far with the parallel testbed machines at the NAS Applied Research Branch. Then we discuss the long term computational requirements for accomplishing some of the grand challenge problems in computational aerosciences. We argue that only massively parallel machines will be able to meet these grand challenge requirements, and we outline the computer science and algorithm research challenges ahead.
Improving the performance of molecular dynamics simulations on parallel clusters.
Borstnik, Urban; Hodoscek, Milan; Janezic, Dusanka
2004-01-01
In this article a procedure is derived to obtain a performance gain for molecular dynamics (MD) simulations on existing parallel clusters. Parallel clusters use a wide array of interconnection technologies to connect multiple processors together, often at different speeds, such as multiple processor computers and networking. It is demonstrated how to configure existing programs for MD simulations to efficiently handle collective communication on parallel clusters with processor interconnections of different speeds. PMID:15032512
Applications of parallel supercomputers: Scientific results and computer science lessons
Fox, G.C.
1989-07-12
Parallel Computing has come of age with several commercial and inhouse systems that deliver supercomputer performance. We illustrate this with several major computations completed or underway at Caltech on hypercubes, transputer arrays and the SIMD Connection Machine CM-2 and AMT DAP. Applications covered are lattice gauge theory, computational fluid dynamics, subatomic string dynamics, statistical and condensed matter physics,theoretical and experimental astronomy, quantum chemistry, plasma physics, grain dynamics, computer chess, graphics ray tracing, and Kalman filters. We use these applications to compare the performance of several advanced architecture computers including the conventional CRAY and ETA-10 supercomputers. We describe which problems are suitable for which computers in the terms of a matching between problem and computer architecture. This is part of a set of lessons we draw for hardware, software, and performance. We speculate on the emergence of new academic disciplines motivated by the growing importance of computers. 138 refs., 23 figs., 10 tabs.
Parallel computing in atmospheric chemistry models
Rotman, D.
1996-02-01
Studies of atmospheric chemistry are of high scientific interest, involve computations that are complex and intense, and require enormous amounts of I/O. Current supercomputer computational capabilities are limiting the studies of stratospheric and tropospheric chemistry and will certainly not be able to handle the upcoming coupled chemistry/climate models. To enable such calculations, the authors have developed a computing framework that allows computations on a wide range of computational platforms, including massively parallel machines. Because of the fast paced changes in this field, the modeling framework and scientific modules have been developed to be highly portable and efficient. Here, the authors present the important features of the framework and focus on the atmospheric chemistry module, named IMPACT, and its capabilities. Applications of IMPACT to aircraft studies will be presented.
Intranode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E
2013-07-23
Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.
Intranode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E
2014-01-07
Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a computer node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.
Hydrologic Terrain Processing Using Parallel Computing
NASA Astrophysics Data System (ADS)
Tarboton, D. G.; Watson, D. W.; Wallace, R. M.; Schreuders, K.; Tesfa, T. K.
2009-12-01
Topography in the form of Digital Elevation Models (DEMs), is widely used to derive information for the modeling of hydrologic processes. Hydrologic terrain analysis augments the information content of digital elevation data by removing spurious pits, deriving a structured flow field, and calculating surfaces of hydrologic information derived from the flow field. The increasing availability of high-resolution terrain datasets for large areas poses a challenge for existing algorithms that process terrain data to extract this hydrologic information. This paper will describe parallel algorithms that have been developed to enhance hydrologic terrain pre-processing so that larger datasets can be more efficiently computed. Message Passing Interface (MPI) parallel implementations have been developed for pit removal, flow direction, and generalized flow accumulation methods within the Terrain Analysis Using Digital Elevation Models (TauDEM) package. The parallel algorithm works by decomposing the domain into striped or tiled data partitions where each tile is processed by a separate processor. This method also reduces the memory requirements of each processor so that larger size grids can be processed. The parallel pit removal algorithm is adapted from the method of Planchon and Darboux that starts from a high elevation then progressively scans the grid, lowering each grid cell to the maximum of the original elevation or the lowest neighbor. The MPI implementation reconciles elevations along process domain edges after each scan. Generalized flow accumulation extends flow accumulation approaches commonly available in GIS through the integration of multiple inputs and a broad class of algebraic rules into the calculation of flow related quantities. It is based on establishing a flow field through DEM grid cells, that is then used to evaluate any mathematical function that incorporates dependence on values of the quantity being evaluated at upslope (or downslope) grid cells
A parallel sparse algorithm targeting arterial fluid mechanics computations
NASA Astrophysics Data System (ADS)
Manguoglu, Murat; Takizawa, Kenji; Sameh, Ahmed H.; Tezduyar, Tayfun E.
2011-09-01
Iterative solution of large sparse nonsymmetric linear equation systems is one of the numerical challenges in arterial fluid-structure interaction computations. This is because the fluid mechanics parts of the fluid + structure block of the equation system that needs to be solved at every nonlinear iteration of each time step corresponds to incompressible flow, the computational domains include slender parts, and accurate wall shear stress calculations require boundary layer mesh refinement near the arterial walls. We propose a hybrid parallel sparse algorithm, domain-decomposing parallel solver (DDPS), to address this challenge. As the test case, we use a fluid mechanics equation system generated by starting with an arterial shape and flow field coming from an FSI computation and performing two time steps of fluid mechanics computation with a prescribed arterial shape change, also coming from the FSI computation. We show how the DDPS algorithm performs in solving the equation system and demonstrate the scalability of the algorithm.
Parallel computing using a Lagrangian formulation
NASA Technical Reports Server (NTRS)
Liou, May-Fun; Loh, Ching-Yuen
1992-01-01
This paper adopts a new Lagrangian formulation of the Euler equation for the calculation of two dimensional supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, we have achieved better than six times speed-up on a 8192-processor CM-2 over a single processor of a CRAY-2.
Traffic simulations on parallel computers using domain decomposition techniques
Hanebutte, U.R.; Tentner, A.M.
1995-12-31
Large scale simulations of Intelligent Transportation Systems (ITS) can only be achieved by using the computing resources offered by parallel computing architectures. Domain decomposition techniques are proposed which allow the performance of traffic simulations with the standard simulation package TRAF-NETSIM on a 128 nodes IBM SPx parallel supercomputer as well as on a cluster of SUN workstations. Whilst this particular parallel implementation is based on NETSIM, a microscopic traffic simulation model, the presented strategy is applicable to a broad class of traffic simulations. An outer iteration loop must be introduced in order to converge to a global solution. A performance study that utilizes a scalable test network that consist of square-grids is presented, which addresses the performance penalty introduced by the additional iteration loop.
Parallel computation of radio listening rates
NASA Astrophysics Data System (ADS)
Mazzariol, Marc; Gennart, Benoit A.; Hersch, Roger D.; Gomez, Manuel; Balsiger, Peter; Pellandini, Fausto; Leder, Markus; Wuethrich, Daniel; Feitknecht, Juerg
2000-10-01
Obtaining the listening rates of radio stations in function of time is an important instrument for determining the impact of publicity. Since many radio stations are financed by publicity, the exact determination of radio listening rates is vital to their existence and to further development. Existing methods of determining radio listening rates are based on face to face interviews or telephonic interviews made with a sample population. These traditional methods however require the cooperation and compliance of the participants. In order to significantly improve the determination of radio listening rates, special watches were created which incorporate a custom integrated circuit sampling the ambient sound during a few seconds every minutes. Each watch accumulates these compressed sound samples during one full week. Watches are then sent to an evaluation center, where the sound samples are matched with the sound samples recorded from candidate radio stations. The present paper describes the processing steps necessary for computing the radio listening rates, and shows how this application was parallelized on a cluster of PCs using the CAP Computer-aided parallelization framework. Since the application must run in a production environment, the paper describes also the support provided for graceful degradation in case of transient or permanent failure of one of the system's components. The parallel sound matching server offers a linear speedup up to a large number of processing nodes thanks to the fact that disk access operations across the network are done in pipeline with computations.
Dynamic traffic assignment on parallel computers
Nagel, K.; Frye, R.; Jakob, R.; Rickert, M.; Stretz, P.
1998-12-01
The authors describe part of the current framework of the TRANSIMS traffic research project at the Los Alamos National Laboratory. It includes parallel implementations of a route planner and a microscopic traffic simulation model. They present performance figures and results of an offline load-balancing scheme used in one of the iterative re-planning runs required for dynamic route assignment.
Algorithms for parallel and vector computations
NASA Technical Reports Server (NTRS)
Ortega, James M.
1995-01-01
This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March, 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners, and (4) SOR as a preconditioner.
Instant well-log inversion with a parallel computer
Kimminau, S.J.; Trivedi, H.
1993-08-01
Well-log analysis requires several vectors of input data to be inverted with a physical model that produces more vectors of output data. The problem is inherently suited to either vectorization or parallelization. PLATO (parallel log analysis, timely output) is a research prototype system that uses a parallel architecture computer with memory-mapped graphics to invert vector data and display the result rapidly. By combining this high-performance computing and display system with a graphical user interface, the analyst can interact with the system in real time'' and can visualize the result of changing parameters on up to 1,000 levels of computed volumes and reconstructed logs. It is expected that such instant'' inversion will remove the main disadvantages frequently cited for simultaneous analysis methods, namely difficulty in assessing sensitivity to different parameters and slow output response. Although the prototype system uses highly specific features of a parallel processor, a subsequent version has been implemented on a conventional (Serial) workstation with less performance but adequate functionality to preserve the apparently instant response. PLATO demonstrates the feasibility of petroleum computing applications combining an intuitive graphical interface, high-performance computing of physical models, and real-time output graphics.
National Combustion Code: Parallel Implementation and Performance
NASA Technical Reports Server (NTRS)
Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.
2000-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
Method for implementation of recursive hierarchical segmentation on parallel computers
NASA Technical Reports Server (NTRS)
Tilton, James C. (Inventor)
2005-01-01
A method, computer readable storage, and apparatus for implementing a recursive hierarchical segmentation algorithm on a parallel computing platform. The method includes setting a bottom level of recursion that defines where a recursive division of an image into sections stops dividing, and setting an intermediate level of recursion where the recursive division changes from a parallel implementation into a serial implementation. The segmentation algorithm is implemented according to the set levels. The method can also include setting a convergence check level of recursion with which the first level of recursion communicates with when performing a convergence check.
Parallel Computing Environments and Methods for Power Distribution System Simulation
Lu, Ning; Taylor, Zachary T.; Chassin, David P.; Guttromson, Ross T.; Studham, Scott S.
2005-11-10
The development of cost-effective high-performance parallel computing on multi-processor super computers makes it attractive to port excessively time consuming simulation software from personal computers (PC) to super computes. The power distribution system simulator (PDSS) takes a bottom-up approach and simulates load at appliance level, where detailed thermal models for appliances are used. This approach works well for a small power distribution system consisting of a few thousand appliances. When the number of appliances increases, the simulation uses up the PC memory and its run time increases to a point where the approach is no longer feasible to model a practical large power distribution system. This paper presents an effort made to port a PC-based power distribution system simulator (PDSS) to a 128-processor shared-memory super computer. The paper offers an overview of the parallel computing environment and a description of the modification made to the PDSS model. The performances of the PDSS running on a standalone PC and on the super computer are compared. Future research direction of utilizing parallel computing in the power distribution system simulation is also addressed.
Solving unstructured grid problems on massively parallel computers
NASA Technical Reports Server (NTRS)
Hammond, Steven W.; Schreiber, Robert
1990-01-01
A highly parallel graph mapping technique that enables one to efficiently solve unstructured grid problems on massively parallel computers is presented. Many implicit and explicit methods for solving discretized partial differential equations require each point in the discretization to exchange data with its neighboring points every time step or iteration. The cost of this communication can negate the high performance promised by massively parallel computing. To eliminate this bottleneck, the graph of the irregular problem is mapped into the graph representing the interconnection topology of the computer such that the sum of the distances that the messages travel is minimized. It is shown that using the heuristic mapping algorithm significantly reduces the communication time compared to a naive assignment of processes to processors.
CFD Analysis and Design Optimization Using Parallel Computers
NASA Technical Reports Server (NTRS)
Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James
1997-01-01
A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.
Parallel algorithms for computing linked list prefix
Han, Y. )
1989-06-01
Given a linked list chi/sub 1/, chi/sub 2/, ....chi/sub n/ with chi/sub i/ following chi/sub i-1/ in the list and an associative operation O, the linked list prefix problem is to compute all prefixes O/sup j//sub i=1/chi/sub 1/, j=1,2,...,n. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is O(n/rho + rho log rho), where n is the number of elements in the linked list and rho is the number of processors used. Therefore their algorithm is optimal when n {ge}rho/sup 2/logrho. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity O(n/rho + log rho) with small multiplicative constant. It is optimal when n {ge}rho log rho.
Parallel Computational Environment for Substructure Optimization
NASA Technical Reports Server (NTRS)
Gendy, Atef S.; Patnaik, Surya N.; Hopkins, Dale A.; Berke, Laszlo
1995-01-01
Design optimization of large structural systems can be attempted through a substructure strategy when convergence difficulties are encountered. When this strategy is used, the large structure is divided into several smaller substructures and a subproblem is defined for each substructure. The solution of the large optimization problem can be obtained iteratively through repeated solutions of the modest subproblems. Substructure strategies, in sequential as well as in parallel computational modes on a Cray YMP multiprocessor computer, have been incorporated in the optimization test bed CometBoards. CometBoards is an acronym for Comparative Evaluation Test Bed of Optimization and Analysis Routines for Design of Structures. Three issues, intensive computation, convergence of the iterative process, and analytically superior optimum, were addressed in the implementation of substructure optimization into CometBoards. Coupling between subproblems as well as local and global constraint grouping are essential for convergence of the iterative process. The substructure strategy can produce an analytically superior optimum different from what can be obtained by regular optimization. For the problems solved, substructure optimization in a parallel computational mode made effective use of all assigned processors.
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
NASA Technical Reports Server (NTRS)
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
A design methodology for portable software on parallel computers
NASA Technical Reports Server (NTRS)
Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.
1993-01-01
This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks are specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers
NASA Technical Reports Server (NTRS)
Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak
1996-01-01
This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.
Element-topology-independent preconditioners for parallel finite element computations
NASA Technical Reports Server (NTRS)
Park, K. C.; Alexander, Scott
1992-01-01
A family of preconditioners for the solution of finite element equations are presented, which are element-topology independent and thus can be applicable to element order-free parallel computations. A key feature of the present preconditioners is the repeated use of element connectivity matrices and their left and right inverses. The properties and performance of the present preconditioners are demonstrated via beam and two-dimensional finite element matrices for implicit time integration computations.
Parallel Analysis and Visualization on Cray Compute Node Linux
Pugmire, Dave; Ahern, Sean
2008-01-01
Capability computer systems are deployed to give researchers the computational power required to investigate and solve key challenges facing the scientific community. As the power of these computer systems increases, the computational problem domain typically increases in size, complexity and scope. These increases strain the ability of commodity analysis and visualization clusters to effectively perform post-processing tasks and provide critical insight and understanding to the computed results. An alternative to purchasing increasingly larger, separate analysis and visualization commodity clusters is to use the computational system itself to perform post-processing tasks. In this paper, the recent successful port of VisIt, a parallel, open source analysis and visualization tool, to compute node linux running on the Cray is detailed. Additionally, the unprecedented ability of this resource for analysis and visualization is discussed and a report on obtained results is presented.
Computational fluid dynamics on a massively parallel computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
A finite difference code was implemented for the compressible Navier-Stokes equations on the Connection Machine, a massively parallel computer. The code is based on the ARC2D/ARC3D program and uses the implicit factored algorithm of Beam and Warming. The codes uses odd-even elimination to solve linear systems. Timings and computation rates are given for the code, and a comparison is made with a Cray XMP.
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.
IPython: components for interactive and parallel computing across disciplines. (Invited)
NASA Astrophysics Data System (ADS)
Perez, F.; Bussonnier, M.; Frederic, J. D.; Froehle, B. M.; Granger, B. E.; Ivanov, P.; Kluyver, T.; Patterson, E.; Ragan-Kelley, B.; Sailer, Z.
2013-12-01
Scientific computing is an inherently exploratory activity that requires constantly cycling between code, data and results, each time adjusting the computations as new insights and questions arise. To support such a workflow, good interactive environments are critical. The IPython project (http://ipython.org) provides a rich architecture for interactive computing with: 1. Terminal-based and graphical interactive consoles. 2. A web-based Notebook system with support for code, text, mathematical expressions, inline plots and other rich media. 3. Easy to use, high performance tools for parallel computing. Despite its roots in Python, the IPython architecture is designed in a language-agnostic way to facilitate interactive computing in any language. This allows users to mix Python with Julia, R, Octave, Ruby, Perl, Bash and more, as well as to develop native clients in other languages that reuse the IPython clients. In this talk, I will show how IPython supports all stages in the lifecycle of a scientific idea: 1. Individual exploration. 2. Collaborative development. 3. Production runs with parallel resources. 4. Publication. 5. Education. In particular, the IPython Notebook provides an environment for "literate computing" with a tight integration of narrative and computation (including parallel computing). These Notebooks are stored in a JSON-based document format that provides an "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.
The Challenge of Massively Parallel Computing
WOMBLE,DAVID E.
1999-11-03
Since the mid-1980's, there have been a number of commercially available parallel computers with hundreds or thousands of processors. These machines have provided a new capability to the scientific community, and they been used successfully by scientists and engineers although with varying degrees of success. One of the reasons for the limited success is the difficulty, or perceived difficulty, in developing code for these machines. In this paper we discuss many of the issues and challenges in developing scalable hardware, system software and algorithms for machines comprising hundreds or thousands of processors.
Parallel computing for automated model calibration
Burke, John S.; Danielson, Gary R.; Schulz, Douglas A.; Vail, Lance W.
2002-07-29
Natural resources model calibration is a significant burden on computing and staff resources in modeling efforts. Most assessments must consider multiple calibration objectives (for example magnitude and timing of stream flow peak). An automated calibration process that allows real time updating of data/models, allowing scientists to focus effort on improving models is needed. We are in the process of building a fully featured multi objective calibration tool capable of processing multiple models cheaply and efficiently using null cycle computing. Our parallel processing and calibration software routines have been generically, but our focus has been on natural resources model calibration. So far, the natural resources models have been friendly to parallel calibration efforts in that they require no inter-process communication, only need a small amount of input data and only output a small amount of statistical information for each calibration run. A typical auto calibration run might involve running a model 10,000 times with a variety of input parameters and summary statistical output. In the past model calibration has been done against individual models for each data set. The individual model runs are relatively fast, ranging from seconds to minutes. The process was run on a single computer using a simple iterative process. We have completed two Auto Calibration prototypes and are currently designing a more feature rich tool. Our prototypes have focused on running the calibration in a distributed computing cross platform environment. They allow incorporation of?smart? calibration parameter generation (using artificial intelligence processing techniques). Null cycle computing similar to SETI@Home has also been a focus of our efforts. This paper details the design of the latest prototype and discusses our plans for the next revision of the software.
Utilizing parallel optimization in computational fluid dynamics
NASA Astrophysics Data System (ADS)
Kokkolaras, Michael
1998-12-01
General problems of interest in computational fluid dynamics are investigated by means of optimization. Specifically, in the first part of the dissertation, a method of optimal incremental function approximation is developed for the adaptive solution of differential equations. Various concepts and ideas utilized by numerical techniques employed in computational mechanics and artificial neural networks (e.g. function approximation and error minimization, variational principles and weighted residuals, and adaptive grid optimization) are combined to formulate the proposed method. The basis functions and associated coefficients of a series expansion, representing the solution, are optimally selected by a parallel direct search technique at each step of the algorithm according to appropriate criteria; the solution is built sequentially. In this manner, the proposed method is adaptive in nature, although a grid is neither built nor adapted in the traditional sense using a-posteriori error estimates. Variational principles are utilized for the definition of the objective function to be extremized in the associated optimization problems, ensuring that the problem is well-posed. Complicated data structures and expensive remeshing algorithms and systems solvers are avoided. Computational efficiency is increased by using low-order basis functions and concurrent computing. Numerical results and convergence rates are reported for a range of steady-state problems, including linear and nonlinear differential equations associated with general boundary conditions, and illustrate the potential of the proposed method. Fluid dynamics applications are emphasized. Conclusions are drawn by discussing the method's limitations, advantages, and possible extensions. The second part of the dissertation is concerned with the optimization of the viscous-inviscid-interaction (VII) mechanism in an airfoil flow analysis code. The VII mechanism is based on the concept of a transpiration velocity
Implementation and performance of parallel Prolog interpreter
Wei, S.; Kale, L.V.; Balkrishna, R. . Dept. of Computer Science)
1988-01-01
In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE--OR process model which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare-kernel--a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark pargrams on parallel machines including shared memory systems: an Alliant FX/8, Sequent and a MultiMax, and a non-shared memory systems: Intel iPSC/32 hypercube, in addition to its performance on a multiprocessor simulation system.
CFD Research, Parallel Computation and Aerodynamic Optimization
NASA Technical Reports Server (NTRS)
Ryan, James S.
1995-01-01
During the last five years, CFD has matured substantially. Pure CFD research remains to be done, but much of the focus has shifted to integration of CFD into the design process. The work under these cooperative agreements reflects this trend. The recent work, and work which is planned, is designed to enhance the competitiveness of the US aerospace industry. CFD and optimization approaches are being developed and tested, so that the industry can better choose which methods to adopt in their design processes. The range of computer architectures has been dramatically broadened, as the assumption that only huge vector supercomputers could be useful has faded. Today, researchers and industry can trade off time, cost, and availability, choosing vector supercomputers, scalable parallel architectures, networked workstations, or heterogenous combinations of these to complete required computations efficiently.
Optimized data communications in a parallel computer
Faraj, Daniel A
2014-10-21
A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.
Optimized data communications in a parallel computer
Faraj, Daniel A.
2014-08-19
A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.
Variable-Complexity Multidisciplinary Optimization on Parallel Computers
NASA Technical Reports Server (NTRS)
Grossman, Bernard; Mason, William H.; Watson, Layne T.; Haftka, Raphael T.
1998-01-01
This report covers work conducted under grant NAG1-1562 for the NASA High Performance Computing and Communications Program (HPCCP) from December 7, 1993, to December 31, 1997. The objective of the research was to develop new multidisciplinary design optimization (MDO) techniques which exploit parallel computing to reduce the computational burden of aircraft MDO. The design of the High-Speed Civil Transport (HSCT) air-craft was selected as a test case to demonstrate the utility of our MDO methods. The three major tasks of this research grant included: development of parallel multipoint approximation methods for the aerodynamic design of the HSCT, use of parallel multipoint approximation methods for structural optimization of the HSCT, mathematical and algorithmic development including support in the integration of parallel computation for items (1) and (2). These tasks have been accomplished with the development of a response surface methodology that incorporates multi-fidelity models. For the aerodynamic design we were able to optimize with up to 20 design variables using hundreds of expensive Euler analyses together with thousands of inexpensive linear theory simulations. We have thereby demonstrated the application of CFD to a large aerodynamic design problem. For the predicting structural weight we were able to combine hundreds of structural optimizations of refined finite element models with thousands of optimizations based on coarse models. Computations have been carried out on the Intel Paragon with up to 128 nodes. The parallel computation allowed us to perform combined aerodynamic-structural optimization using state of the art models of a complex aircraft configurations.
Pipeline and parallel architectures for computer communication systems
Reddi, A.V.
1983-01-01
Various existing communication precessor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.
Computationally efficient implementation of combustion chemistry in parallel PDF calculations
NASA Astrophysics Data System (ADS)
Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.
2009-08-01
In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
Communication-efficient parallel architectures and algorithms for image computations
Alnuweiri, H.M.
1989-01-01
The main purpose of this dissertation is the design of efficient parallel techniques for image computations which require global operations on image pixels, as well as the development of parallel architectures with special communication features which can support global data movement efficiently. The class of image problems considered in this dissertation involves global operations on image pixels, and irregular (data-dependent) data movement operations. Such problems include histogramming, component labeling, proximity computations, computing the Hough Transform, computing convexity of regions and related properties such as computing the diameter and a smallest area enclosing rectangle for each region. Images with multiple figures and multiple labeled-sets of pixels are also considered. Efficient solutions to such problems involve integer sorting, graph theoretic techniques, and techniques from computational geometry. Although such solutions are not computationally intensive (they all require O(n{sup 2}) operations to be performed on an n {times} n image), they require global communications. The emphasis here is on developing parallel techniques for data movement, reduction, and distribution, which lead to processor-time optimal solutions for such problems on the proposed organizations. The proposed parallel architectures are based on a memory array which can be viewed as an arrangement of memory modules in a k-dimensional space such that the modules are connected to buses placed parallel to the orthogonal axes of the space, and each bus is connected to one processor or a group of processors. It will be shown that such organizations are communication-efficient and are thus highly suited to the image problems considered here, and also to several other classes of problems. The proposed organizations have p processors and O(n{sup 2}) words of memory to process n {times} n images.
Performance analysis of parallel supernodal sparse LU factorization
Grigori, Laura; Li, Xiaoye S.
2004-02-05
We investigate performance characteristics for the LU factorization of large matrices with various sparsity patterns. We consider supernodal right-looking parallel factorization on a bi-dimensional grid of processors, making use of static pivoting. We develop a performance model and we validate it using the implementation in SuperLU-DIST, the real matrices and the IBM Power3 machine at NERSC. We use this model to obtain performance bounds on parallel computers, to perform scalability analysis and to identify performance bottlenecks. We also discuss the role of load balance and data distribution in this approach.
Parallel distance matrix computation for Matlab data mining
NASA Astrophysics Data System (ADS)
Skurowski, Przemysław; Staniszewski, Michał
2016-06-01
The paper presents utility functions for computing of a distance matrix, which plays a crucial role in data mining. The goal in the design was to enable operating on relatively large datasets by overcoming basic shortcoming - computing time - with an interface easy to use. The presented solution is a set of functions, which were created with emphasis on practical applicability in real life. The proposed solution is presented along the theoretical background for the performance scaling. Furthermore, different approaches of the parallel computing are analyzed, including shared memory, which is uncommon in Matlab environment.
Parallel multithread computing for spectroscopic analysis in optical coherence tomography
NASA Astrophysics Data System (ADS)
Trojanowski, Michal; Kraszewski, Maciej; Strakowski, Marcin; Pluciński, Jerzy
2014-05-01
Spectroscopic Optical Coherence Tomography (SOCT) is an extension of Optical Coherence Tomography (OCT). It allows gathering spectroscopic information from individual scattering points inside the sample. It is based on time-frequency analysis of interferometric signals. Such analysis requires calculating hundreds of Fourier transforms while performing a single A-scan. Additionally, further processing of acquired spectroscopic information is needed. This significantly increases the time of required computations. During last years, application of graphical processing units (GPU's) was proposed to reduce computation time in OCT by using parallel computing algorithms. GPU technology can be also used to speed-up signal processing in SOCT. However, parallel algorithms used in classical OCT need to be revised because of different character of analyzed data. The classical OCT requires processing of long, independent interferometric signals for obtaining subsequent A-scans. The difference with SOCT is that it requires processing of multiple, shorter signals, which differ only in a small part of samples. We have developed new algorithms for parallel signal processing for usage in SOCT, implemented with NVIDIA CUDA (Compute Unified Device Architecture). We present details of the algorithms and performance tests for analyzing data from in-house SD-OCT system. We also give a brief discussion about usefulness of developed algorithm. Presented algorithms might be useful for researchers working on OCT, as they allow to reduce computation time and are step toward real-time signal processing of SOCT data.
QCMPI: A parallel environment for quantum computing
NASA Astrophysics Data System (ADS)
Tabakin, Frank; Juliá-Díaz, Bruno
2009-06-01
QCMPI is a quantum computer (QC) simulation package written in Fortran 90 with parallel processing capabilities. It is an accessible research tool that permits rapid evaluation of quantum algorithms for a large number of qubits and for various "noise" scenarios. The prime motivation for developing QCMPI is to facilitate numerical examination of not only how QC algorithms work, but also to include noise, decoherence, and attenuation effects and to evaluate the efficacy of error correction schemes. The present work builds on an earlier Mathematica code QDENSITY, which is mainly a pedagogic tool. In that earlier work, although the density matrix formulation was featured, the description using state vectors was also provided. In QCMPI, the stress is on state vectors, in order to employ a large number of qubits. The parallel processing feature is implemented by using the Message-Passing Interface (MPI) protocol. A description of how to spread the wave function components over many processors is provided, along with how to efficiently describe the action of general one- and two-qubit operators on these state vectors. These operators include the standard Pauli, Hadamard, CNOT and CPHASE gates and also Quantum Fourier transformation. These operators make up the actions needed in QC. Codes for Grover's search and Shor's factoring algorithms are provided as examples. A major feature of this work is that concurrent versions of the algorithms can be evaluated with each version subject to alternate noise effects, which corresponds to the idea of solving a stochastic Schrödinger equation. The density matrix for the ensemble of such noise cases is constructed using parallel distribution methods to evaluate its eigenvalues and associated entropy. Potential applications of this powerful tool include studies of the stability and correction of QC processes using Hamiltonian based dynamics. Program summaryProgram title: QCMPI Catalogue identifier: AECS_v1_0 Program summary URL
Serial multiplier arrays for parallel computation
NASA Technical Reports Server (NTRS)
Winters, Kel
1990-01-01
Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal introduced for this application.
Boundary element analysis on vector and parallel computers
NASA Technical Reports Server (NTRS)
Kane, J. H.
1994-01-01
Boundary element analysis (BEA) can be characterized as a numerical technique that generally shifts the computational burden in the analysis toward numerical integration and the solution of nonsymmetric and either dense or blocked sparse systems of algebraic equations. Researchers have explored the concept that the fundamental characteristics of BEA can be exploited to generate effective implementations on vector and parallel computers. In this paper, the results of some of these investigations are discussed. The performance of overall algorithms for BEA on vector supercomputers, massively data parallel single instruction multiple data (SIMD), and relatively fine grained distributed memory multiple instruction multiple data (MIMD) computer systems is described. Some general trends and conclusions are discussed, along with indications of future developments that may prove fruitful in this regard.
Parallel Performance Optimization of the Direct Simulation Monte Carlo Method
NASA Astrophysics Data System (ADS)
Gao, Da; Zhang, Chonglin; Schwartzentruber, Thomas
2009-11-01
Although the direct simulation Monte Carlo (DSMC) particle method is more computationally intensive compared to continuum methods, it is accurate for conditions ranging from continuum to free-molecular, accurate in highly non-equilibrium flow regions, and holds potential for incorporating advanced molecular-based models for gas-phase and gas-surface interactions. As available computer resources continue their rapid growth, the DSMC method is continually being applied to increasingly complex flow problems. Although processor clock speed continues to increase, a trend of increasing multi-core-per-node parallel architectures is emerging. To effectively utilize such current and future parallel computing systems, a combined shared/distributed memory parallel implementation (using both Open Multi-Processing (OpenMP) and Message Passing Interface (MPI)) of the DSMC method is under development. The parallel implementation of a new state-of-the-art 3D DSMC code employing an embedded 3-level Cartesian mesh will be outlined. The presentation will focus on performance optimization strategies for DSMC, which includes, but is not limited to, modified algorithm designs, practical code-tuning techniques, and parallel performance optimization. Specifically, key issues important to the DSMC shared memory (OpenMP) parallel performance are identified as (1) granularity (2) load balancing (3) locality and (4) synchronization. Challenges and solutions associated with these issues as they pertain to the DSMC method will be discussed.
Broadcasting a message in a parallel computer
Archer, Charles J; Faraj, Daniel A
2014-11-18
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.
Broadcasting a message in a parallel computer
Archer, Charles J; Faraj, Ahmad A
2013-04-16
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.
Broadcasting collective operation contributions throughout a parallel computer
Faraj, Ahmad
2012-02-21
Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
Performance of the Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
As the input/output (I/O) needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. Initial experiments, reported in this paper, indicate that Galley is capable of providing high-performance 1/O to applications the applications that rely on them. In Section 3 we describe that access data in patterns that have been observed to be common.
Superfast robust digital image correlation analysis with parallel computing
NASA Astrophysics Data System (ADS)
Pan, Bing; Tian, Long
2015-03-01
Existing digital image correlation (DIC) using the robust reliability-guided displacement tracking (RGDT) strategy for full-field displacement measurement is a path-dependent process that can only be executed sequentially. This path-dependent tracking strategy not only limits the potential of DIC for further improvement of its computational efficiency but also wastes the parallel computing power of modern computers with multicore processors. To maintain the robustness of the existing RGDT strategy and to overcome its deficiency, an improved RGDT strategy using a two-section tracking scheme is proposed. In the improved RGDT strategy, the calculated points with correlation coefficients higher than a preset threshold are all taken as reliably computed points and given the same priority to extend the correlation analysis to their neighbors. Thus, DIC calculation is first executed in parallel at multiple points by separate independent threads. Then for the few calculated points with correlation coefficients smaller than the threshold, DIC analysis using existing RGDT strategy is adopted. Benefiting from the improved RGDT strategy and the multithread computing, superfast DIC analysis can be accomplished without sacrificing its robustness and accuracy. Experimental results show that the presented parallel DIC method performed on a common eight-core laptop can achieve about a 7 times speedup.
QCD on the highly parallel computer AP1000
NASA Astrophysics Data System (ADS)
Akemi, K.; de Forcrand, Ph.; Fujisaki, M.; Hashimoto, T.; Hege, H. C.; Hioki, S.; Makino, J.; Miyamura, O.; Nakamura, A.; Okuda, M.; Stamatescu, I. O.; Tago, Y.; Takaishi, T.; QCD TARO (QCD on Thousand cell ARray processorsOmnipurpose) Collaboration
We have been running quenched QCD simulations on 32 4 and 32 3 × 48 lattices using a 512 processor AP1000, which is a highly parallel computer with up to 1024 processing elements. We have developed programs for update, blocking and hadron propagator calculations. The pseudo heatbath and the overrelaxation algorithms were used for the update with performance of 2.6 and 2.0 μsec/link, respectively.
Semi-coarsening multigrid methods for parallel computing
Jones, J.E.
1996-12-31
Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.
Performance monitoring of parallel scientific applications
Skinner, David
2005-05-01
This paper introduces an infrastructure for efficiently collecting performance profiles from parallel HPC codes. Integrated Performance Monitoring (IPM) brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application. IPM maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage. IPM is open source, relies on portable software technologies and is scalable to thousands of tasks.
Parallel computing in genomic research: advances and applications
Ocaña, Kary; de Oliveira, Daniel
2015-01-01
Today’s genomic experiments have to process the so-called “biological big data” that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. PMID:26604801
Swift : fast, reliable, loosely coupled parallel computation.
Zhao, Y.; Hategan, M.; Clifford, B.; Foster, I.; von Laszewski, G.; Nefedova, V.; Raicu, I.; Stef-Praun, T.; Wilde, M.; Mathematics and Computer Science; Univ. of Chicago
2007-01-01
A common pattern in scientific computing involves the execution of many tasks that are coupled only in the sense that the output of one may be passed as input to one or more others - for example, as a file, or via a Web Services invocation. While such 'loosely coupled' computations can involve large amounts of computation and communication, the concerns of the programmer tend to be different than in traditional high performance computing, being focused on management issues relating to the large numbers of datasets and tasks (and often, the complexities inherent in 'messy' data organizations) rather than the optimization of interprocessor communication. To address these concerns, we have developed Swift, a system that combines a novel scripting language called SwiftScript with a powerful runtime system based on CoG Karajan and Falkon to allow for the concise specification, and reliable and efficient execution, of large loosely coupled computations. Swift adopts and adapts ideas first explored in the GriPhyN virtual data system, improving on that system in many regards. We describe the SwiftScript language and its use of XDTM to describe the logical structure of complex file system structures. We also present the Swift system and its use of CoG Karajan, Falkon, and Globus services to dispatch and manage the execution of many tasks in different execution environments. We summarize application experiences and detail performance experiments that quantify the cost of Swift operations.
SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008
2008-09-08
The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.
NASA Astrophysics Data System (ADS)
Shimizu, Futoshi; Kimizuka, Hajime; Kaburaki, Hideo
2002-08-01
A new parallel computing environment, called as ``Parallel Molecular Dynamics Stencil'', has been developed to carry out a large-scale short-range molecular dynamics simulation of solids. The stencil is written in C language using MPI for parallelization and designed successfully to separate and conceal parts of the programs describing cutoff schemes and parallel algorithms for data communication. This has been made possible by introducing the concept of image atoms. Therefore, only a sequential programming of the force calculation routine is required for executing the stencil in parallel environment. Typical molecular dynamics routines, such as various ensembles, time integration methods, and empirical potentials, have been implemented in the stencil. In the presentation, the performance of the stencil on parallel computers of Hitachi, IBM, SGI, and PC-cluster using the models of Lennard-Jones and the EAM type potentials for fracture problem will be reported.
Performance Analysis and Optimization on a Parallel Atmospheric General Circulation Model Code
NASA Technical Reports Server (NTRS)
Lou, J. Z.; Farrara, J. D.
1997-01-01
An analysis is presented of the primary factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on distributed-memory, massively parallel computer systems.
Local rollback for fault-tolerance in parallel computing systems
Blumrich, Matthias A.; Chen, Dong; Gara, Alan; Giampapa, Mark E.; Heidelberger, Philip; Ohmacht, Martin; Steinmacher-Burow, Burkhard; Sugavanam, Krishnan
2012-01-24
A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.
A microeconomic scheduler for parallel computers
NASA Technical Reports Server (NTRS)
Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex
1995-01-01
We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.
PArallel Reacting Multiphase FLOw Computational Fluid Dynamic Analysis
Energy Science and Technology Software Center (ESTSC)
2002-06-01
PARMFLO is a parallel multiphase reacting flow computational fluid dynamics (CFD) code. It can perform steady or unsteady simulations in three space dimensions. It is intended for use in engineering CFD analysis of industrial flow system components. Its parallel processing capabilities allow it to be applied to problems that use at least an order of magnitude more computational cells than the number that can be used on a typical single processor workstation (about 106 cellsmore » in parallel processing mode versus about io cells in serial processing mode). Alternately, by spreading the work of a CFD problem that could be run on a single workstation over a group of computers on a network, it can bring the runtime down by an order of magnitude or more (typically from many days to less than one day). The software was implemented using the industry standard Message-Passing Interface (MPI) and domain decomposition in one spatial direction. The phases of a flow problem may include an ideal gas mixture with an arbitrary number of chemical species, and dispersed droplet and particle phases. Regions of porous media may also be included within the domain. The porous media may be packed beds, foams, or monolith catalyst supports. With these features, the code is especially suited to analysis of mixing of reactants in the inlet chamber of catalytic reactors coupled to computation of product yields that result from the flow of the mixture through the catalyst coaled support structure.« less
PArallel Reacting Multiphase FLOw Computational Fluid Dynamic Analysis
Lottes, Steven A.
2002-06-01
PARMFLO is a parallel multiphase reacting flow computational fluid dynamics (CFD) code. It can perform steady or unsteady simulations in three space dimensions. It is intended for use in engineering CFD analysis of industrial flow system components. Its parallel processing capabilities allow it to be applied to problems that use at least an order of magnitude more computational cells than the number that can be used on a typical single processor workstation (about 106 cells in parallel processing mode versus about io cells in serial processing mode). Alternately, by spreading the work of a CFD problem that could be run on a single workstation over a group of computers on a network, it can bring the runtime down by an order of magnitude or more (typically from many days to less than one day). The software was implemented using the industry standard Message-Passing Interface (MPI) and domain decomposition in one spatial direction. The phases of a flow problem may include an ideal gas mixture with an arbitrary number of chemical species, and dispersed droplet and particle phases. Regions of porous media may also be included within the domain. The porous media may be packed beds, foams, or monolith catalyst supports. With these features, the code is especially suited to analysis of mixing of reactants in the inlet chamber of catalytic reactors coupled to computation of product yields that result from the flow of the mixture through the catalyst coaled support structure.
National Combustion Code Parallel Performance Enhancements
NASA Technical Reports Server (NTRS)
Quealy, Angela; Benyo, Theresa (Technical Monitor)
2002-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.
A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1994-01-01
A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Use Computer-Aided Tools to Parallelize Large CFD Applications
NASA Technical Reports Server (NTRS)
Jin, H.; Frumkin, M.; Yan, J.
2000-01-01
Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progressing in hardware architectures and increasing complexity of real applications in recent years, the problem becomes even more sever. Today, scalability and high performance are mostly involving handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, show good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, it can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often failed on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome the deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO is aimed at taking advantage of detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of
TSE computers - A means for massively parallel computations
NASA Technical Reports Server (NTRS)
Strong, J. P., III
1976-01-01
A description is presented of hardware concepts for building a massively parallel processing system for two-dimensional data. The processing system is to use logic arrays of 128 x 128 elements which perform over 16 thousand operations simultaneously. Attention is given to image data, logic arrays, basic image logic functions, a prototype negator, an interleaver device, image logic circuits, and an image memory circuit.
Parallel computation of three-dimensional nonlinear magnetostatic problems.
Levine, D.; Gropp, W.; Forsman, K.; Kettunen, L.; Mathematics and Computer Science; Tampere Univ. of Tech.
1999-02-01
We describe a general-purpose parallel electromagnetic code for computing accurate solutions to large computationally demanding, 3D, nonlinear magnetostatic problems. The code, CORAL, is based on a volume integral equation formulation. Using an IBM SP parallel computer and iterative solution methods, we successfully solved the dense linear systems inherent in such formulations. A key component of our work was the use of the PETSc library, which provides parallel portability and access to the latest linear algebra solution technology.
Rapid indirect trajectory optimization on highly parallel computing architectures
NASA Astrophysics Data System (ADS)
Antony, Thomas
Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical
Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Jin, Haoqiang; VanderWijngaart, Rob F.
2003-01-01
We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in bench-marks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
Implementation of ADI: Schemes on MIMD parallel computers
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1993-01-01
In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
Numerical computation on massively parallel hypercubes. [Connection machine
McBryan, O.A.
1986-01-01
We describe numerical computations on the Connection Machine, a massively parallel hypercube architecture with 65,536 single-bit processors and 32 Mbytes of memory. A parallel extension of COMMON LISP, provides access to the processors and network. The rich software environment is further enhanced by a powerful virtual processor capability, which extends the degree of fine-grained parallelism beyond 1,000,000. We briefly describe the hardware and indicate the principal features of the parallel programming environment. We then present implementations of SOR, multigrid and pre-conditioned conjugate gradient algorithms for solving partial differential equations on the Connection Machine. Despite the lack of floating point hardware, computation rates above 100 megaflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly. The software development effort is also facilitated by the fact that hypercube communications prove to be fast and essentially independent of distance. 29 refs., 4 figs.
Parallel CFD design on network-based computer
NASA Technical Reports Server (NTRS)
Cheung, Samson
1995-01-01
Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advanced computational fluid dynamics codes, which can be computationally expensive on mainframe supercomputers. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computing environment utilizing a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package is applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.
CFD Optimization on Network-Based Parallel Computer System
NASA Technical Reports Server (NTRS)
Cheung, Samson H.; Holst, Terry L. (Technical Monitor)
1994-01-01
Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advance computational fluid dynamics codes, which is computationally expensive in mainframe supercomputer. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computer on a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package has been applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.
Parallel computing of a climate model on the dawn 1000 by domain decomposition method
NASA Astrophysics Data System (ADS)
Bi, Xunqiang
1997-12-01
In this paper the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massive parallel computer made by National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. The potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputation are also discussed.
NASA Astrophysics Data System (ADS)
Gong, Yiyuan; Guan, Senlin; Nakamura, Morikazu
This paper investigates migration effects of parallel genetic algorithms (GAs) on the line topology of heterogeneous computing resources. Evolution process of parallel GAs is evaluated experimentally on two types of arrangements of heterogeneous computing resources: the ascending and descending order arrangements. Migration effects are evaluated from the viewpoints of scalability, chromosome diversity, migration frequency and solution quality. The results reveal that the performance of parallel GAs strongly depends on the design of the chromosome migration in which we need to consider the arrangement of heterogeneous computing resources, the migration frequency and so on. The results contribute to provide referential scheme of implementation of parallel GAs on heterogeneous computing resources.
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
Final Report: Super Instruction Architecture for Scalable Parallel Computations
Sanders, Beverly Ann; Bartlett, Rodney; Deumens, Erik
2013-12-23
The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACESIII software package and used it to study interesting scientific problems.
Center for Programming Models for Scalable Parallel Computing
John Mellor-Crummey
2008-02-29
Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implemention of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.