Science.gov

Sample records for beowulf-style parallel computer

  1. Parallel computing works

    SciTech Connect

    Not Available

    1991-10-23

    An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

  2. Highly parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.; Tichy, Walter F.

    1990-01-01

    Among the highly parallel computing architectures required for advanced scientific computation, those designated 'MIMD' and 'SIMD' have yielded the best results to date. The present development status evaluation of such architectures shown neither to have attained a decisive advantage in most near-homogeneous problems' treatment; in the cases of problems involving numerous dissimilar parts, however, such currently speculative architectures as 'neural networks' or 'data flow' machines may be entailed. Data flow computers are the most practical form of MIMD fine-grained parallel computers yet conceived; they automatically solve the problem of assigning virtual processors to the real processors in the machine.

  3. Highly parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.; Tichy, Walter F.

    1990-01-01

    Highly parallel computing architectures are the only means to achieve the computation rates demanded by advanced scientific problems. A decade of research has demonstrated the feasibility of such machines and current research focuses on which architectures designated as multiple instruction multiple datastream (MIMD) and single instruction multiple datastream (SIMD) have produced the best results to date; neither shows a decisive advantage for most near-homogeneous scientific problems. For scientific problems with many dissimilar parts, more speculative architectures such as neural networks or data flow may be needed.

  4. Parallel Computing in SCALE

    SciTech Connect

    DeHart, Mark D; Williams, Mark L; Bowman, Stephen M

    2010-01-01

    The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement

  5. Algorithmically specialized parallel computers

    SciTech Connect

    Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.

    1985-01-01

    This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.

  6. Reordering computations for parallel execution

    NASA Technical Reports Server (NTRS)

    Adams, L.

    1985-01-01

    The computations are reordered in the SOR algorithm to maintain the same asymptotic rate of convergence as the rowwise ordering to obtain parallelism at different levels. A parallel program is written to illustrate these ideas and actual machines for implementation of this program are discussed.

  7. Predicting performance of parallel computations

    NASA Technical Reports Server (NTRS)

    Mak, Victor W.; Lundstrom, Stephen F.

    1990-01-01

    An accurate and computationally efficient method for predicting the performance of a class of parallel computations running on concurrent systems is described. A parallel computation is modeled as a task system with precedence relationships expressed as a series-parallel directed acyclic graph. Resources in a concurrent system are modeled as service centers in a queuing network model. Using these two models as inputs, the method outputs predictions of expected execution time of the parallel computation and the concurrent system utilization. The method is validated against both detailed simulation and actual execution on a commercial multiprocessor. Using 100 test cases, the average error of the prediction when compared to simulation statistics is 1.7 percent, with a standard deviation of 1.5 percent; the maximum error is about 10 percent.

  8. Parallel computation using limited resources

    SciTech Connect

    Sugla, B.

    1985-01-01

    This thesis addresses itself to the task of designing and analyzing parallel algorithms when the resources of processors, communication, and time are limited. The two parts of this thesis deal with multiprocessor systems and VLSI - the two important parallel processing environments that are prevalent today. In the first part a time-processor-communication tradeoff analysis is conducted for two kinds of problems - N input, 1 output, and N input, N output computations. In the class of problems of the second kind, the problem of prefix computation, an important problem due to the number of naturally occurring computations it can model, is studied. Finally, a general methodology is given for design of parallel algorithms that can be used to optimize a given design to a wide set of architectural variations. The second part of the thesis considers the design of parallel algorithms for the VLSI model of computation when the resource of time is severely restricted.

  9. Parallel Architecture For Robotics Computation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1990-01-01

    Universal Real-Time Robotic Controller and Simulator (URRCS) is highly parallel computing architecture for control and simulation of robot motion. Result of extensive algorithmic study of different kinematic and dynamic computational problems arising in control and simulation of robot motion. Study led to development of class of efficient parallel algorithms for these problems. Represents algorithmically specialized architecture, in sense capable of exploiting common properties of this class of parallel algorithms. System with both MIMD and SIMD capabilities. Regarded as processor attached to bus of external host processor, as part of bus memory.

  10. Turbomachinery CFD on parallel computers

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Milner, Edward J.; Quealy, Angela; Townsend, Scott E.

    1992-01-01

    The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized. Parallel turbomachinery CFD research at the NASA Lewis Research Center is then highlighted. These efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.

  11. Massively parallel quantum computer simulator

    NASA Astrophysics Data System (ADS)

    De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.

    2007-01-01

    We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as a IBM BlueGene/L, a IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, a SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as benchmark for testing high-end parallel computers.

  12. Computational electromagnetics and parallel dense matrix computations

    SciTech Connect

    Forsman, K.; Kettunen, L.; Gropp, W.; Levine, D.

    1995-06-01

    We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.

  13. Merlin - Massively parallel heterogeneous computing

    NASA Technical Reports Server (NTRS)

    Wittie, Larry; Maples, Creve

    1989-01-01

    Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.

  14. Computational results for parallel unstructured mesh computations

    SciTech Connect

    Jones, M.T.; Plassmann, P.E.

    1994-12-31

    The majority of finite element models in structural engineering are composed of unstructured meshes. These unstructured meshes are often very large and require significant computational resources; hence they are excellent candidates for massively parallel computation. Parallel solution of the sparse matrices that arise from such meshes has been studied heavily, and many good algorithms have been developed. Unfortunately, many of the other aspects of parallel unstructured mesh computation have gone largely ignored. The authors present a set of algorithms that allow the entire unstructured mesh computation process to execute in parallel -- including adaptive mesh refinement, equation reordering, mesh partitioning, and sparse linear system solution. They briefly describe these algorithms and state results regarding their running-time and performance. They then give results from the 512-processor Intel DELTA for a large-scale structural analysis problem. These results demonstrate that the new algorithms are scalable and efficient. The algorithms are able to achieve up to 2.2 gigaflops for this unstructured mesh problem.

  15. Visualizing Parallel Computer System Performance

    NASA Technical Reports Server (NTRS)

    Malony, Allen D.; Reed, Daniel A.

    1988-01-01

    Parallel computer systems are among the most complex of man's creations, making satisfactory performance characterization difficult. Despite this complexity, there are strong, indeed, almost irresistible, incentives to quantify parallel system performance using a single metric. The fallacy lies in succumbing to such temptations. A complete performance characterization requires not only an analysis of the system's constituent levels, it also requires both static and dynamic characterizations. Static or average behavior analysis may mask transients that dramatically alter system performance. Although the human visual system is remarkedly adept at interpreting and identifying anomalies in false color data, the importance of dynamic, visual scientific data presentation has only recently been recognized Large, complex parallel system pose equally vexing performance interpretation problems. Data from hardware and software performance monitors must be presented in ways that emphasize important events while eluding irrelevant details. Design approaches and tools for performance visualization are the subject of this paper.

  16. High Performance Parallel Computational Nanotechnology

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    At a recent press conference, NASA Administrator Dan Goldin encouraged NASA Ames Research Center to take a lead role in promoting research and development of advanced, high-performance computer technology, including nanotechnology. Manufacturers of leading-edge microprocessors currently perform large-scale simulations in the design and verification of semiconductor devices and microprocessors. Recently, the need for this intensive simulation and modeling analysis has greatly increased, due in part to the ever-increasing complexity of these devices, as well as the lessons of experiences such as the Pentium fiasco. Simulation, modeling, testing, and validation will be even more important for designing molecular computers because of the complex specification of millions of atoms, thousands of assembly steps, as well as the simulation and modeling needed to ensure reliable, robust and efficient fabrication of the molecular devices. The software for this capacity does not exist today, but it can be extrapolated from the software currently used in molecular modeling for other applications: semi-empirical methods, ab initio methods, self-consistent field methods, Hartree-Fock methods, molecular mechanics; and simulation methods for diamondoid structures. In as much as it seems clear that the application of such methods in nanotechnology will require powerful, highly powerful systems, this talk will discuss techniques and issues for performing these types of computations on parallel systems. We will describe system design issues (memory, I/O, mass storage, operating system requirements, special user interface issues, interconnects, bandwidths, and programming languages) involved in parallel methods for scalable classical, semiclassical, quantum, molecular mechanics, and continuum models; molecular nanotechnology computer-aided designs (NanoCAD) techniques; visualization using virtual reality techniques of structural models and assembly sequences; software required to

  17. Efficient, massively parallel eigenvalue computation

    NASA Technical Reports Server (NTRS)

    Huo, Yan; Schreiber, Robert

    1993-01-01

    In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. An effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the Maspar MP-1 and MP-2 systems, is described. Results of numerical tests and timings are also presented.

  18. NWChem: scalable parallel computational chemistry

    SciTech Connect

    van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

    2011-11-01

    NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code1 specifically designed for large scale parallel applications2. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is build up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. Within the context created by the features

  19. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1995-01-01

    The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.

  20. Parallel Computation Of Forward Dynamics Of Manipulators

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1993-01-01

    Report presents parallel algorithms and special parallel architecture for computation of forward dynamics of robotics manipulators. Products of effort to find best method of parallel computation to achieve required computational efficiency. Significant speedup of computation anticipated as well as cost reduction.

  1. Trajectory optimization using parallel shooting method on parallel computer

    SciTech Connect

    Wirthman, D.J.; Park, S.Y.; Vadali, S.R.

    1995-03-01

    The efficiency of a parallel shooting method on a parallel computer for solving a variety of optimal control guidance problems is studied. Several examples are considered to demonstrate that a speedup of nearly 7 to 1 is achieved with the use of 16 processors. It is suggested that further improvements in performance can be achieved by parallelizing in the state domain. 10 refs.

  2. Parallel computing in enterprise modeling.

    SciTech Connect

    Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.

    2008-08-01

    This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.

  3. Parallel Pascal - An extended Pascal for parallel computers

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.

    1984-01-01

    Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

  4. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1991-01-01

    The main contribution of the effort in the last two years is the introduction of the MOPPS system. After doing extensive literature search, we introduced the system which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications on a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.

  5. Parallel Computing Using Web Servers and "Servlets".

    ERIC Educational Resources Information Center

    Lo, Alfred; Bloor, Chris; Choi, Y. K.

    2000-01-01

    Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…

  6. Broadcasting a message in a parallel computer

    DOEpatents

    Berg, Jeremy E.; Faraj, Ahmad A.

    2011-08-02

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.

  7. Implementing clips on a parallel computer

    NASA Technical Reports Server (NTRS)

    Riley, Gary

    1987-01-01

    The C language integrated production system (CLIPS) is a forward chaining rule based language to provide training and delivery for expert systems. Conceptually, rule based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.

  8. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  9. Template based parallel checkpointing in a massively parallel computer system

    DOEpatents

    Archer, Charles Jens; Inglett, Todd Alan

    2009-01-13

    A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.

  10. Parallel computation with the force

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1985-01-01

    A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.

  11. Remarks on parallel computations in MATLAB environment

    NASA Astrophysics Data System (ADS)

    Opalska, Katarzyna; Opalski, Leszek

    2013-10-01

    The paper attempts to summarize author's investigation of parallel computation capability of MATLAB environment in solving large ordinary differential equations (ODEs). Two MATLAB versions were tested and two parallelization techniques: one used multiple processors-cores, the other - CUDA compatible Graphics Processing Units (GPUs). A set of parameterized test problems was specially designed to expose different capabilities/limitations of the different variants of the parallel computation environment tested. Presented results illustrate clearly the superiority of the newer MATLAB version and, elapsed time advantage of GPU-parallelized computations for large dimensionality problems over the multiple processor-cores (with speed-up factor strongly dependent on the problem structure).

  12. Parallel computations and control of adaptive structures

    NASA Technical Reports Server (NTRS)

    Park, K. C.; Alvin, Kenneth F.; Belvin, W. Keith; Chong, K. P. (Editor); Liu, S. C. (Editor); Li, J. C. (Editor)

    1991-01-01

    The equations of motion for structures with adaptive elements for vibration control are presented for parallel computations to be used as a software package for real-time control of flexible space structures. A brief introduction of the state-of-the-art parallel computational capability is also presented. Time marching strategies are developed for an effective use of massive parallel mapping, partitioning, and the necessary arithmetic operations. An example is offered for the simulation of control-structure interaction on a parallel computer and the impact of the approach presented for applications in other disciplines than aerospace industry is assessed.

  13. Running Geant on T. Node parallel computer

    SciTech Connect

    Jejcic, A.; Maillard, J.; Silva, J. ); Mignot, B. )

    1990-08-01

    AnInmos transputer-based computer has been utilized to overcome the difficulties due to the limitations on the processing abilities of event parallelism and multiprocessor farms (i.e., the so called bus-crisis) and the concern regarding the growing sizes of databases typical in High Energy Physics. This study was done on the T.Node parallel computer manufactured by TELMAT. Detailed figures are reported concerning the event parallelization. (AIP)

  14. Modified mesh-connected parallel computers

    SciTech Connect

    Carlson, D.A. )

    1988-10-01

    The mesh-connected parallel computer is an important parallel processing organization that has been used in the past for the design of supercomputing systems. In this paper, the authors explore modifications of a mesh-connected parallel computer for the purpose of increasing the efficiency of executing important application programs. These modifications are made by adding one or more global mesh structures to the processing array. They show how our modifications allow asymptotic improvements in the efficiency of executing computations having low to medium interprocessor communication requirements (e.g., tree computations, prefix computations, finding the connected components of a graph). For computations with high interprocessor communication requirements such as sorting, they show that they offer no speedup. They also compare the modified mesh-connected parallel computer to other similar organizations including the pyramid, the X-tree, and the mesh-of-trees.

  15. Parallel computation of manipulator inverse dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1991-01-01

    In this article, parallel computation of manipulator inverse dynamics is investigated. A hierarchical graph-based mapping approach is devised to analyze the inherent parallelism in the Newton-Euler formulation at several computational levels, and to derive the features of an abstract architecture for exploitation of parallelism. At each level, a parallel algorithm represents the application of a parallel model of computation that transforms the computation into a graph whose structure defines the features of an abstract architecture, i.e., number of processors, communication structure, etc. Data-flow analysis is employed to derive the time lower bound in the computation as well as the sequencing of the abstract architecture. The features of the target architecture are defined by optimization of the abstract architecture to exploit maximum parallelism while minimizing architectural complexity. An architecture is designed and implemented that is capable of efficient exploitation of parallelism at several computational levels. The computation time of the Newton-Euler formulation for a 6-degree-of-freedom (dof) general manipulator is measured as 187 microsec. The increase in computation time for each additional dof is 23 microsec, which leads to a computation time of less than 500 microsec, even for a 12-dof redundant arm.

  16. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.

  17. Collectively loading an application in a parallel computer

    DOEpatents

    Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Miller, Samuel J.; Mundy, Michael B.

    2016-01-05

    Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.

  18. Parallel Computing Strategies for Irregular Algorithms

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

  19. Massively Parallel Computing: A Sandia Perspective

    SciTech Connect

    Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.

    1999-05-06

    The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant break-throughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.

  20. PARALLEL GROUNDWATER COMPUTATIONS USING PVM

    EPA Science Inventory

    Multiprocessing provides an opportunity or faster execution of programs and increased use of idle computing resources, enabling more detailed examination of more comprehensive models. ultiprocessor architectures are currently diverse, experimental, and not widely available. VM (P...

  1. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  2. Lattice QCD for parallel computers

    NASA Astrophysics Data System (ADS)

    Quadling, Henley Sean

    Lattice QCD is an important tool in the investigation of Quantum Chromodynamics (QCD). This is particularly true at lower energies where traditional perturbative techniques fail, and where other non-perturbative theoretical efforts are not entirely satisfactory. Important features of QCD such as confinement and the masses of the low lying hadronic states have been demonstrated and calculated in lattice QCD simulations. In calculations such as these, non-lattice techniques in QCD have failed. However, despite the incredible advances in computer technology, a full solution of lattice QCD may still be in the too-distant future. Much effort is being expended in the search for ways to reduce the computational burden so that an adequate solution of lattice QCD is possible in the near future. There has been considerable progress in recent years, especially in the research of improved lattice actions. In this thesis, a new approach to lattice QCD algorithms is introduced, which results in very significant efficiency improvements. The new approach is explained in detail, evaluated and verified by comparing physics results with current lattice QCD simulations. The new sub-lattice layout methodology has been specifically designed for current and future hardware. Together with concurrent research into improved lattice actions and more efficient numerical algorithms, the very significant efficiency improvements demonstrated in this thesis can play an important role in allowing lattice QCD researchers access to much more realistic simulations. The techniques presented in this thesis also allow ambitious QCD simulations to be performed on cheap clusters of commodity computers.

  3. Parallel and vector computation in heat transfer

    SciTech Connect

    Georgiadis, J.G. ); Murthy, J.Y. )

    1990-01-01

    This collection of manuscripts complements a number of other volumes related to engineering numerical analysis in general; it also gives a preview of the potential contribution of vector and parallel computing to heat transfer. Contributions have been made from the fields of heat transfer, computational fluid mechanics or physics, and from researchers in industry or in academia. This work serves to indicate that new or modified numerical algorithms have to be developed depending on the hardware used (as the long titles of most of the papers in this volume imply). This volume contains six examples of numerical simulation on parallel and vector computers that demonstrate the competitiveness of the novel methodologies. A common thread through all the manuscripts is that they address problems involving irregular geometries or complex physics, or both. Comparative studies of the performance of certain algorithms on various computers are also presented. Most machines used in this work belong to the coarse- to medium-grain group (consisting of a few to a hundred processors) with architectures of the multiple-instruction-stream-multiple- data-stream (MIMD) type. Some of the machines used have both parallel and vector processors, while parallel computations are certainly emphasized. We hope that this work will contribute to the increasing involvement of heat transfer specialists with parallel computation.

  4. Fast Parallel Computation Of Manipulator Inverse Dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1991-01-01

    Method for fast parallel computation of inverse dynamics problem, essential for real-time dynamic control and simulation of robot manipulators, undergoing development. Enables exploitation of high degree of parallelism and, achievement of significant computational efficiency, while minimizing various communication and synchronization overheads as well as complexity of required computer architecture. Universal real-time robotic controller and simulator (URRCS) consists of internal host processor and several SIMD processors with ring topology. Architecture modular and expandable: more SIMD processors added to match size of problem. Operate asynchronously and in MIMD fashion.

  5. Locating hardware faults in a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-04-13

    Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.

  6. Drought monitoring through parallel computing

    SciTech Connect

    Burrage, K.; Belward, J.; Lau, L.; Rezny, M.; Young, R.

    1993-12-31

    One area where high performance computing can make a significant social and economic impact in Australia (especially in view of the recent El-Nino) is in the accurate and efficient monitoring and prediction of drought conditions - both in terms of speed of calculation and in high quality visualization. As a consequence, the Queensland Department of Primary Industries (DPI) is developing a spatial model of pasture growth and utilization for monitoring, assessment and prediction of the future of the state`s rangeloads. This system incorporates soil class, pasture type, tree cover, herbivore density and meterological data. DPI`s drought research program aims to predict the occurrence of feed deficits and land condition alerts on a quarter to half shire basis over Queensland. This will provide a basis for large-scale management decisions by graziers and politicians alike.

  7. Finite element computation with parallel VLSI

    NASA Technical Reports Server (NTRS)

    Mcgregor, J.; Salama, M.

    1983-01-01

    This paper describes a parallel processing computer consisting of a 16-bit microcomputer as a master processor which controls and coordinates the activities of 8086/8087 VLSI chip set slave processors working in parallel. The hardware is inexpensive and can be flexibly configured and programmed to perform various functions. This makes it a useful research tool for the development of, and experimentation with parallel mathematical algorithms. Application of the hardware to computational tasks involved in the finite element analysis method is demonstrated by the generation and assembly of beam finite element stiffness matrices. A number of possible schemes for the implementation of N-elements on N- or n-processors (N is greater than n) are described, and the speedup factors of their time consumption are determined as a function of the number of available parallel processors.

  8. Dynamic Load Balancing for Computational Plasticity on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Pramono, Eddy; Simon, Horst

    1994-01-01

    The simulation of the computational plasticity on a complex structure remains a formidable computational task, especially when a highly nonlinear, complex material model was used. It appears that the computational requirements for a such problem can only be satisfied by massively parallel architectures. In order to effectively harness the tremendous computational power provided by such architectures, it is imperative to investigate and to study the algorithmic and implementation issues pertaining to dynamic load balancing for computational plasticity on a highly parallel, distributed-memory, multiple-instruction, multiple-data computers. This paper will measure the effectiveness of the algorithms developed in handling the dynamic load balancing.

  9. Link failure detection in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Megerian, Mark G.; Smith, Brian E.

    2010-11-09

    Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.

  10. Internode data communications in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.

    2013-09-03

    Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.

  11. Internode data communications in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.

  12. Fast Parallel Computation Of Multibody Dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Kwan, Gregory L.; Bagherzadeh, Nader

    1996-01-01

    Constraint-force algorithm fast, efficient, parallel-computation algorithm for solving forward dynamics problem of multibody system like robot arm or vehicle. Solves problem in minimum time proportional to log(N) by use of optimal number of processors proportional to N, where N is number of dynamical degrees of freedom: in this sense, constraint-force algorithm both time-optimal and processor-optimal parallel-processing algorithm.

  13. Efficient communication in massively parallel computers

    SciTech Connect

    Cypher, R.E.

    1989-01-01

    A fundamental operation in parallel computation is sorting. Sorting is important not only because it is required by many algorithms, but also because it can be used to implement irregular, pointer-based communication. The author studies two algorithms for sorting in massively parallel computers. First, he examines Shellsort. Shellsort is a sorting algorithm that is based on a sequence of parameters called increments. Shellsort can be used to create a parallel sorting device known as a sorting network. Researchers have suggested that if the correct increment sequence is used, an optimal size sorting network can be obtained. All published increment sequences have been monotonically decreasing. He shows that no monotonically decreasing increment sequence will yield an optimal size sorting network. Second, he presents a sorting algorithm called Cubesort. Cubesort is the fastest known sorting algorithm for a variety of parallel computers aver a wide range of parameters. He also presents a paradigm for developing parallel algorithms that have efficient communication. The paradigm, called the data reduction paradigm, consists of using a divide-and-conquer strategy. Both the division and combination phases of the divide-and-conquer algorithm may require irregular, pointer-based communication between processors. However, the problem is divided so as to limit the amount of data that must be communicated. As a result the communication can be performed efficiently. He presents data reduction algorithms for the image component labeling problem, the closest pair problem and four versions of the parallel prefix problem.

  14. Parallel Algormiivls For Optical Digital Computers

    NASA Astrophysics Data System (ADS)

    Huang, Alan

    1983-04-01

    Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterized by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognizing a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed.

  15. Parallel visualization on leadership computing resources

    NASA Astrophysics Data System (ADS)

    Peterka, T.; Ross, R. B.; Shen, H.-W.; Ma, K.-L.; Kendall, W.; Yu, H.

    2009-07-01

    Changes are needed in the way that visualization is performed, if we expect the analysis of scientific data to be effective at the petascale and beyond. By using similar techniques as those used to parallelize simulations, such as parallel I/O, load balancing, and effective use of interprocess communication, the supercomputers that compute these datasets can also serve as analysis and visualization engines for them. Our team is assessing the feasibility of performing parallel scientific visualization on some of the most powerful computational resources of the U.S. Department of Energy's National Laboratories in order to pave the way for analyzing the next generation of computational results. This paper highlights some of the conclusions of that research.

  16. Parallel algorithms for optical digital computers

    SciTech Connect

    Huang, A.

    1983-01-01

    Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.

  17. Optics Program Modified for Multithreaded Parallel Computing

    NASA Technical Reports Server (NTRS)

    Lou, John; Bedding, Dave; Basinger, Scott

    2006-01-01

    A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.

  18. Wing-Body Aeroelasticity on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Byun, Chansup

    1996-01-01

    This article presents a procedure for computing the aeroelasticity of wing-body configurations on multiple-instruction, multiple-data parallel computers. In this procedure, fluids are modeled using Euler equations discretized by a finite difference method, and structures are modeled using finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. A parallel integration scheme is used to compute aeroelastic responses by solving the coupled fluid and structural equations concurrently while keeping modularity of each discipline. The present procedure is validated by computing the aeroelastic response of a wing and comparing with experiment. Aeroelastic computations are illustrated for a high speed civil transport type wing-body configuration.

  19. Parallel processing for computer vision and display

    SciTech Connect

    Dew, P.M. . Dept. of Computer Studies); Earnshaw, R.A. ); Heywood, T.R. )

    1989-01-01

    The widespread availability of high performance computers has led to an increased awareness of the importance of visualization techniques particularly in engineering and science. However, many visualization tasks involve processing large amounts of data or manipulating complex computer models of 3D objects. For example, in the field of computer aided engineering it is often necessary to display an edit solid object (see Plate 1) which can take many minutes even on the fastest serial processors. Another example of a computationally intensive problem, this time from computer vision, is the recognition of objects in a 3D scene from a stereo image pair. To perform visualization tasks of this type in real and reasonable time it is necessary to exploit the advances in parallel processing that have taken place over the last decade. This book uniquely provides a collection of papers from leading visualization researchers with a common interest in the application and exploitation of parallel processing techniques.

  20. The new landscape of parallel computer architecture

    NASA Astrophysics Data System (ADS)

    Shalf, John

    2007-07-01

    The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.

  1. Temporal fringe pattern analysis with parallel computing

    SciTech Connect

    Tuck Wah Ng; Kar Tien Ang; Argentini, Gianluca

    2005-11-20

    Temporal fringe pattern analysis is invaluable in transient phenomena studies but necessitates long processing times. Here we describe a parallel computing strategy based on the single-program multiple-data model and hyperthreading processor technology to reduce the execution time. In a two-node cluster workstation configuration we found that execution periods were reduced by 1.6 times when four virtual processors were used. To allow even lower execution times with an increasing number of processors, the time allocated for data transfer, data read, and waiting should be minimized. Parallel computing is found here to present a feasible approach to reduce execution times in temporal fringe pattern analysis.

  2. Interfacing Computer Aided Parallelization and Performance Analysis

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)

    2003-01-01

    When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.

  3. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching Yuen

    1991-01-01

    A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on a 8192-processor CM-2 over a single processor of a CRAY-2.

  4. Highly parallel computer architecture for robotic computation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)

    1991-01-01

    In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  5. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective opeartion through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  6. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-08-12

    Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  7. Impact of Parallel Computing on Large Scale Aeroelastic Computations

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Kwak, Dochan (Technical Monitor)

    2000-01-01

    Aeroelasticity is computationally one of the most intensive fields in aerospace engineering. Though over the last three decades the computational speed of supercomputers have substantially increased, they are still inadequate for large scale aeroelastic computations using high fidelity flow and structural equations. In addition to reaching a saturation in computational speed because of changes in economics, computer manufactures are stopping the manufacturing of mainframe type supercomputers. This has led computational aeroelasticians to face the gigantic task of finding alternate approaches for fulfilling their needs. The alternate path to over come speed and availability limitations of mainframe type supercomputers is to use parallel computers. During this decade several different architectures have evolved. In FY92 the US Government started the High Performance Computing and Communication (HPCC) program. As a participant in this program NASA developed several parallel computational tools for aeroelastic applications. This talk describes the impact of those application tools on high fidelity based multidisciplinary analysis.

  8. Measuring performance of parallel computers. Final report

    SciTech Connect

    Sullivan, F.

    1994-07-01

    Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.

  9. Rectilinear partitioning of irregular data parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1991-01-01

    New mapping algorithms for domain oriented data-parallel computations, where the workload is distributed irregularly throughout the domain, but exhibits localized communication patterns are described. Researchers consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network. Discussed here is an improved algorithm for finding the optimal partitioning in one dimension, new algorithms for partitioning in two dimensions, and optimal partitioning in three dimensions. The application of these algorithms to real problems are discussed.

  10. Efficient parallel global garbage collection on massively parallel computers

    SciTech Connect

    Kamada, Tomio; Matsuoka, Satoshi; Yonezawa, Akinori

    1994-12-31

    On distributed-memory high-performance MPPs where processors are interconnected by an asynchronous network, efficient Garbage Collection (GC) becomes difficult due to inter-node references and references within pending, unprocessed messages. The parallel global GC algorithm (1) takes advantage of reference locality, (2) efficiently traverses references over nodes, (3) admits minimum pause time of ongoing computations, and (4) has been shown to scale up to 1024 node MPPs. The algorithm employs a global weight counting scheme to substantially reduce message traffic. The two methods for confirming the arrival of pending messages are used: one counts numbers of messages and the other uses network `bulldozing.` Performance evaluation in actual implementations on a multicomputer with 32-1024 nodes, Fujitsu AP1000, reveals various favorable properties of the algorithm.

  11. Parallel computing in atmospheric chemistry models

    SciTech Connect

    Rotman, D.

    1996-02-01

    Studies of atmospheric chemistry are of high scientific interest, involve computations that are complex and intense, and require enormous amounts of I/O. Current supercomputer computational capabilities are limiting the studies of stratospheric and tropospheric chemistry and will certainly not be able to handle the upcoming coupled chemistry/climate models. To enable such calculations, the authors have developed a computing framework that allows computations on a wide range of computational platforms, including massively parallel machines. Because of the fast paced changes in this field, the modeling framework and scientific modules have been developed to be highly portable and efficient. Here, the authors present the important features of the framework and focus on the atmospheric chemistry module, named IMPACT, and its capabilities. Applications of IMPACT to aircraft studies will be presented.

  12. Intranode data communications in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E

    2014-01-07

    Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a computer node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.

  13. Intranode data communications in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E

    2013-07-23

    Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.

  14. Hydrologic Terrain Processing Using Parallel Computing

    NASA Astrophysics Data System (ADS)

    Tarboton, D. G.; Watson, D. W.; Wallace, R. M.; Schreuders, K.; Tesfa, T. K.

    2009-12-01

    Topography in the form of Digital Elevation Models (DEMs), is widely used to derive information for the modeling of hydrologic processes. Hydrologic terrain analysis augments the information content of digital elevation data by removing spurious pits, deriving a structured flow field, and calculating surfaces of hydrologic information derived from the flow field. The increasing availability of high-resolution terrain datasets for large areas poses a challenge for existing algorithms that process terrain data to extract this hydrologic information. This paper will describe parallel algorithms that have been developed to enhance hydrologic terrain pre-processing so that larger datasets can be more efficiently computed. Message Passing Interface (MPI) parallel implementations have been developed for pit removal, flow direction, and generalized flow accumulation methods within the Terrain Analysis Using Digital Elevation Models (TauDEM) package. The parallel algorithm works by decomposing the domain into striped or tiled data partitions where each tile is processed by a separate processor. This method also reduces the memory requirements of each processor so that larger size grids can be processed. The parallel pit removal algorithm is adapted from the method of Planchon and Darboux that starts from a high elevation then progressively scans the grid, lowering each grid cell to the maximum of the original elevation or the lowest neighbor. The MPI implementation reconciles elevations along process domain edges after each scan. Generalized flow accumulation extends flow accumulation approaches commonly available in GIS through the integration of multiple inputs and a broad class of algebraic rules into the calculation of flow related quantities. It is based on establishing a flow field through DEM grid cells, that is then used to evaluate any mathematical function that incorporates dependence on values of the quantity being evaluated at upslope (or downslope) grid cells

  15. Parallel software support for computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1987-01-01

    The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.

  16. Parallel evolutionary computation in bioinformatics applications.

    PubMed

    Pinho, Jorge; Sobral, João Luis; Rocha, Miguel

    2013-05-01

    A large number of optimization problems within the field of Bioinformatics require methods able to handle its inherent complexity (e.g. NP-hard problems) and also demand increased computational efforts. In this context, the use of parallel architectures is a necessity. In this work, we propose ParJECoLi, a Java based library that offers a large set of metaheuristic methods (such as Evolutionary Algorithms) and also addresses the issue of its efficient execution on a wide range of parallel architectures. The proposed approach focuses on the easiness of use, making the adaptation to distinct parallel environments (multicore, cluster, grid) transparent to the user. Indeed, this work shows how the development of the optimization library can proceed independently of its adaptation for several architectures, making use of Aspect-Oriented Programming. The pluggable nature of parallelism related modules allows the user to easily configure its environment, adding parallelism modules to the base source code when needed. The performance of the platform is validated with two case studies within biological model optimization. PMID:23127284

  17. Synchronizing compute node time bases in a parallel computer

    DOEpatents

    Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip

    2014-12-30

    Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.

  18. Synchronizing compute node time bases in a parallel computer

    DOEpatents

    Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip

    2015-01-27

    Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.

  19. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching-Yuen

    1992-01-01

    This paper adopts a new Lagrangian formulation of the Euler equation for the calculation of two dimensional supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, we have achieved better than six times speed-up on a 8192-processor CM-2 over a single processor of a CRAY-2.

  20. Opportunities in computational mechanics: Advances in parallel computing

    SciTech Connect

    Lesar, R.A.

    1999-02-01

    In this paper, the authors will discuss recent advances in computing power and the prospects for using these new capabilities for studying plasticity and failure. They will first review the new capabilities made available with parallel computing. They will discuss how these machines perform and how well their architecture might work on materials issues. Finally, they will give some estimates on the size of problems possible using these computers.

  1. Parallel computation of radio listening rates

    NASA Astrophysics Data System (ADS)

    Mazzariol, Marc; Gennart, Benoit A.; Hersch, Roger D.; Gomez, Manuel; Balsiger, Peter; Pellandini, Fausto; Leder, Markus; Wuethrich, Daniel; Feitknecht, Juerg

    2000-10-01

    Obtaining the listening rates of radio stations in function of time is an important instrument for determining the impact of publicity. Since many radio stations are financed by publicity, the exact determination of radio listening rates is vital to their existence and to further development. Existing methods of determining radio listening rates are based on face to face interviews or telephonic interviews made with a sample population. These traditional methods however require the cooperation and compliance of the participants. In order to significantly improve the determination of radio listening rates, special watches were created which incorporate a custom integrated circuit sampling the ambient sound during a few seconds every minutes. Each watch accumulates these compressed sound samples during one full week. Watches are then sent to an evaluation center, where the sound samples are matched with the sound samples recorded from candidate radio stations. The present paper describes the processing steps necessary for computing the radio listening rates, and shows how this application was parallelized on a cluster of PCs using the CAP Computer-aided parallelization framework. Since the application must run in a production environment, the paper describes also the support provided for graceful degradation in case of transient or permanent failure of one of the system's components. The parallel sound matching server offers a linear speedup up to a large number of processing nodes thanks to the fact that disk access operations across the network are done in pipeline with computations.

  2. Efficient Parallel Engineering Computing on Linux Workstations

    NASA Technical Reports Server (NTRS)

    Lou, John Z.

    2010-01-01

    A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).

  3. Seismic imaging on massively parallel computers

    SciTech Connect

    Ober, C.C.; Oldfield, R.A.; Womble, D.E.; Mosher, C.C.

    1997-07-01

    A key to reducing the risks and costs associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Pre-stack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar-wave equation using finite differences. Current industry computational capabilities are insufficient for the application of finite-difference, 3-D, prestack, depth-migration algorithms. High performance computers and state-of-the-art algorithms and software are required to meet this need. As part of an ongoing ACTI project funded by the US Department of Energy, the authors have developed a finite-difference, 3-D prestack, depth-migration code for massively parallel computer systems. The goal of this work is to demonstrate that massively parallel computers (thousands of processors) can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite-difference, prestack, depth migration practical for oil and gas exploration.

  4. Parallelized reliability estimation of reconfigurable computer networks

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Das, Subhendu; Palumbo, Dan

    1990-01-01

    A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.

  5. Parallel algorithms for computing linked list prefix

    SciTech Connect

    Han, Y. )

    1989-06-01

    Given a linked list chi/sub 1/, chi/sub 2/, ....chi/sub n/ with chi/sub i/ following chi/sub i-1/ in the list and an associative operation O, the linked list prefix problem is to compute all prefixes O/sup j//sub i=1/chi/sub 1/, j=1,2,...,n. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is O(n/rho + rho log rho), where n is the number of elements in the linked list and rho is the number of processors used. Therefore their algorithm is optimal when n {ge}rho/sup 2/logrho. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity O(n/rho + log rho) with small multiplicative constant. It is optimal when n {ge}rho log rho.

  6. Parallel Computational Environment for Substructure Optimization

    NASA Technical Reports Server (NTRS)

    Gendy, Atef S.; Patnaik, Surya N.; Hopkins, Dale A.; Berke, Laszlo

    1995-01-01

    Design optimization of large structural systems can be attempted through a substructure strategy when convergence difficulties are encountered. When this strategy is used, the large structure is divided into several smaller substructures and a subproblem is defined for each substructure. The solution of the large optimization problem can be obtained iteratively through repeated solutions of the modest subproblems. Substructure strategies, in sequential as well as in parallel computational modes on a Cray YMP multiprocessor computer, have been incorporated in the optimization test bed CometBoards. CometBoards is an acronym for Comparative Evaluation Test Bed of Optimization and Analysis Routines for Design of Structures. Three issues, intensive computation, convergence of the iterative process, and analytically superior optimum, were addressed in the implementation of substructure optimization into CometBoards. Coupling between subproblems as well as local and global constraint grouping are essential for convergence of the iterative process. The substructure strategy can produce an analytically superior optimum different from what can be obtained by regular optimization. For the problems solved, substructure optimization in a parallel computational mode made effective use of all assigned processors.

  7. Parallel computing: One opportunity, four challenges

    SciTech Connect

    Gaudiot, J.-L.

    1989-12-31

    The author reviews briefly the area of parallel computer processing. This area has been expanding at a great rate in the past decade. Great strides have been made in the hardware area, and in the speed of performance of chips. However to some degree the hardware area is beginning to run into basic physical speed limits, which will slow the rate of advance of this area simply because of physical limitations. The author looks at ways that computer architecture, and software applications, can work to continue the rate of increase in computing power which has occurred over the past decade. Four particular areas are mentioned: programmability; communication network design; reliable operation; performance evaluation and benchmarking.

  8. Computation and parallel implementation for early vision

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    1990-01-01

    The problem of early vision is to transform one or more retinal illuminance images-pixel arrays-to image representations built out of such primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation of SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

  9. An Expert Assistant for Computer Aided Parallelization

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    The prototype implementation of an expert system was developed to assist the user in the computer aided parallelization process. The system interfaces to tools for automatic parallelization and performance analysis. By fusing static program structure information and dynamic performance analysis data the expert system can help the user to filter, correlate, and interpret the data gathered by the existing tools. Sections of the code that show poor performance and require further attention are rapidly identified and suggestions for improvements are presented to the user. In this paper we describe the components of the expert system and discuss its interface to the existing tools. We present a case study to demonstrate the successful use in full scale scientific applications.

  10. Parallel computing techniques for rotorcraft aerodynamics

    NASA Astrophysics Data System (ADS)

    Ekici, Kivanc

    The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).

  11. Computational fluid dynamics on a massively parallel computer

    NASA Technical Reports Server (NTRS)

    Jespersen, Dennis C.; Levit, Creon

    1989-01-01

    A finite difference code was implemented for the compressible Navier-Stokes equations on the Connection Machine, a massively parallel computer. The code is based on the ARC2D/ARC3D program and uses the implicit factored algorithm of Beam and Warming. The codes uses odd-even elimination to solve linear systems. Timings and computation rates are given for the code, and a comparison is made with a Cray XMP.

  12. Nonisothermal multiphase subsurface transport on parallel computers

    SciTech Connect

    Martinez, M.J.; Hopkins, P.L.; Shadid, J.N.

    1997-10-01

    We present a numerical method for nonisothermal, multiphase subsurface transport in heterogeneous porous media. The mathematical model considers nonisothermal two-phase (liquid/gas) flow, including capillary pressure effects, binary diffusion in the gas phase, conductive, latent, and sensible heat transport. The Galerkin finite element method is used for spatial discretization, and temporal integration is accomplished via a predictor/corrector scheme. Message-passing and domain decomposition techniques are used for implementing a scalable algorithm for distributed memory parallel computers. An illustrative application is shown to demonstrate capabilities and performance.

  13. The Challenge of Massively Parallel Computing

    SciTech Connect

    WOMBLE,DAVID E.

    1999-11-03

    Since the mid-1980's, there have been a number of commercially available parallel computers with hundreds or thousands of processors. These machines have provided a new capability to the scientific community, and they been used successfully by scientists and engineers although with varying degrees of success. One of the reasons for the limited success is the difficulty, or perceived difficulty, in developing code for these machines. In this paper we discuss many of the issues and challenges in developing scalable hardware, system software and algorithms for machines comprising hundreds or thousands of processors.

  14. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  15. Parallel computing for automated model calibration

    SciTech Connect

    Burke, John S.; Danielson, Gary R.; Schulz, Douglas A.; Vail, Lance W.

    2002-07-29

    Natural resources model calibration is a significant burden on computing and staff resources in modeling efforts. Most assessments must consider multiple calibration objectives (for example magnitude and timing of stream flow peak). An automated calibration process that allows real time updating of data/models, allowing scientists to focus effort on improving models is needed. We are in the process of building a fully featured multi objective calibration tool capable of processing multiple models cheaply and efficiently using null cycle computing. Our parallel processing and calibration software routines have been generically, but our focus has been on natural resources model calibration. So far, the natural resources models have been friendly to parallel calibration efforts in that they require no inter-process communication, only need a small amount of input data and only output a small amount of statistical information for each calibration run. A typical auto calibration run might involve running a model 10,000 times with a variety of input parameters and summary statistical output. In the past model calibration has been done against individual models for each data set. The individual model runs are relatively fast, ranging from seconds to minutes. The process was run on a single computer using a simple iterative process. We have completed two Auto Calibration prototypes and are currently designing a more feature rich tool. Our prototypes have focused on running the calibration in a distributed computing cross platform environment. They allow incorporation of?smart? calibration parameter generation (using artificial intelligence processing techniques). Null cycle computing similar to SETI@Home has also been a focus of our efforts. This paper details the design of the latest prototype and discusses our plans for the next revision of the software.

  16. Utilizing parallel optimization in computational fluid dynamics

    NASA Astrophysics Data System (ADS)

    Kokkolaras, Michael

    1998-12-01

    General problems of interest in computational fluid dynamics are investigated by means of optimization. Specifically, in the first part of the dissertation, a method of optimal incremental function approximation is developed for the adaptive solution of differential equations. Various concepts and ideas utilized by numerical techniques employed in computational mechanics and artificial neural networks (e.g. function approximation and error minimization, variational principles and weighted residuals, and adaptive grid optimization) are combined to formulate the proposed method. The basis functions and associated coefficients of a series expansion, representing the solution, are optimally selected by a parallel direct search technique at each step of the algorithm according to appropriate criteria; the solution is built sequentially. In this manner, the proposed method is adaptive in nature, although a grid is neither built nor adapted in the traditional sense using a-posteriori error estimates. Variational principles are utilized for the definition of the objective function to be extremized in the associated optimization problems, ensuring that the problem is well-posed. Complicated data structures and expensive remeshing algorithms and systems solvers are avoided. Computational efficiency is increased by using low-order basis functions and concurrent computing. Numerical results and convergence rates are reported for a range of steady-state problems, including linear and nonlinear differential equations associated with general boundary conditions, and illustrate the potential of the proposed method. Fluid dynamics applications are emphasized. Conclusions are drawn by discussing the method's limitations, advantages, and possible extensions. The second part of the dissertation is concerned with the optimization of the viscous-inviscid-interaction (VII) mechanism in an airfoil flow analysis code. The VII mechanism is based on the concept of a transpiration velocity

  17. Optimal dynamic remapping of parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Reynolds, Paul F., Jr.

    1987-01-01

    A large class of computations are characterized by a sequence of phases, with phase changes occurring unpredictably. The decision problem was considered regarding the remapping of workload to processors in a parallel computation when the utility of remapping and the future behavior of the workload is uncertain, and phases exhibit stable execution requirements during a given phase, but requirements may change radically between phases. For these problems a workload assignment generated for one phase may hinder performance during the next phase. This problem is treated formally for a probabilistic model of computation with at most two phases. The fundamental problem of balancing the expected remapping performance gain against the delay cost was addressed. Stochastic dynamic programming is used to show that the remapping decision policy minimizing the expected running time of the computation has an extremely simple structure. Because the gain may not be predictable, the performance of a heuristic policy that does not require estimnation of the gain is examined. The heuristic method's feasibility is demonstrated by its use on an adaptive fluid dynamics code on a multiprocessor. The results suggest that except in extreme cases, the remapping decision problem is essentially that of dynamically determining whether gain can be achieved by remapping after a phase change. The results also suggest that this heuristic is applicable to computations with more than two phases.

  18. Parallel Proximity Detection for Computer Simulations

    NASA Technical Reports Server (NTRS)

    Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)

    1998-01-01

    The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are included by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.

  19. CFD Research, Parallel Computation and Aerodynamic Optimization

    NASA Technical Reports Server (NTRS)

    Ryan, James S.

    1995-01-01

    During the last five years, CFD has matured substantially. Pure CFD research remains to be done, but much of the focus has shifted to integration of CFD into the design process. The work under these cooperative agreements reflects this trend. The recent work, and work which is planned, is designed to enhance the competitiveness of the US aerospace industry. CFD and optimization approaches are being developed and tested, so that the industry can better choose which methods to adopt in their design processes. The range of computer architectures has been dramatically broadened, as the assumption that only huge vector supercomputers could be useful has faded. Today, researchers and industry can trade off time, cost, and availability, choosing vector supercomputers, scalable parallel architectures, networked workstations, or heterogenous combinations of these to complete required computations efficiently.

  20. Optimized data communications in a parallel computer

    DOEpatents

    Faraj, Daniel A

    2014-10-21

    A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.

  1. Parallel Proximity Detection for Computer Simulation

    NASA Technical Reports Server (NTRS)

    Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)

    1997-01-01

    The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are includes by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.

  2. Optimized data communications in a parallel computer

    SciTech Connect

    Faraj, Daniel A.

    2014-08-19

    A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.

  3. Seismic imaging on massively parallel computers

    SciTech Connect

    Ober, C.C.; Oldfield, R.; Womble, D.E.; VanDyke, J.; Dosanjh, S.

    1996-03-01

    Fast, accurate imaging of complex, oil-bearing geologies, such as overthrusts and salt domes, is the key to reducing the costs of domestic oil and gas exploration. Geophysicists say that the known oil reserves in the Gulf of Mexico could be significantly increased if accurate seismic imaging beneath salt domes was possible. A range of techniques exist for imaging these regions, but the highly accurate techniques involve the solution of the wave equation and are characterized by large data sets and large computational demands. Massively parallel computers can provide the computational power for these highly accurate imaging techniques. A brief introduction to seismic processing will be presented, and the implementation of a seismic-imaging code for distributed memory computers will be discussed. The portable code, Salvo, performs a wave equation-based, 3-D, prestack, depth imaging and currently runs on the Intel Paragon and the Cray T3D. It used MPI for portability, and has sustained 22 Mflops/sec/proc (compiled FORTRAN) on the Intel Paragon.

  4. Simple, parallel virtual machines for extreme computations

    NASA Astrophysics Data System (ADS)

    Chokoufe Nejad, Bijan; Ohl, Thorsten; Reuter, Jürgen

    2015-11-01

    We introduce a virtual machine (VM) written in a numerically fast language like Fortran or C for evaluating very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte-code from the optimal matrix element generator, O'MEGA. Furthermore, this approach allows to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze hereby the scaling behavior with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and has in general runtimes in the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.

  5. A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer

    NASA Technical Reports Server (NTRS)

    Jespersen, Dennis C.; Levit, Creon

    1989-01-01

    The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.

  6. QCMPI: A parallel environment for quantum computing

    NASA Astrophysics Data System (ADS)

    Tabakin, Frank; Juliá-Díaz, Bruno

    2009-06-01

    QCMPI is a quantum computer (QC) simulation package written in Fortran 90 with parallel processing capabilities. It is an accessible research tool that permits rapid evaluation of quantum algorithms for a large number of qubits and for various "noise" scenarios. The prime motivation for developing QCMPI is to facilitate numerical examination of not only how QC algorithms work, but also to include noise, decoherence, and attenuation effects and to evaluate the efficacy of error correction schemes. The present work builds on an earlier Mathematica code QDENSITY, which is mainly a pedagogic tool. In that earlier work, although the density matrix formulation was featured, the description using state vectors was also provided. In QCMPI, the stress is on state vectors, in order to employ a large number of qubits. The parallel processing feature is implemented by using the Message-Passing Interface (MPI) protocol. A description of how to spread the wave function components over many processors is provided, along with how to efficiently describe the action of general one- and two-qubit operators on these state vectors. These operators include the standard Pauli, Hadamard, CNOT and CPHASE gates and also Quantum Fourier transformation. These operators make up the actions needed in QC. Codes for Grover's search and Shor's factoring algorithms are provided as examples. A major feature of this work is that concurrent versions of the algorithms can be evaluated with each version subject to alternate noise effects, which corresponds to the idea of solving a stochastic Schrödinger equation. The density matrix for the ensemble of such noise cases is constructed using parallel distribution methods to evaluate its eigenvalues and associated entropy. Potential applications of this powerful tool include studies of the stability and correction of QC processes using Hamiltonian based dynamics. Program summaryProgram title: QCMPI Catalogue identifier: AECS_v1_0 Program summary URL

  7. CFD research, parallel computation and aerodynamic optimization

    NASA Technical Reports Server (NTRS)

    Ryan, James S.

    1995-01-01

    Over five years of research in Computational Fluid Dynamics and its applications are covered in this report. Using CFD as an established tool, aerodynamic optimization on parallel architectures is explored. The objective of this work is to provide better tools to vehicle designers. Submarine design requires accurate force and moment calculations in flow with thick boundary layers and large separated vortices. Low noise production is critical, so flow into the propulsor region must be predicted accurately. The High Speed Civil Transport (HSCT) has been the subject of recent work. This vehicle is to be a passenger vehicle with the capability of cutting overseas flight times by more than half. A successful design must surpass the performance of comparable planes. Fuel economy, other operational costs, environmental impact, and range must all be improved substantially. For all these reasons, improved design tools are required, and these tools must eventually integrate optimization, external aerodynamics, propulsion, structures, heat transfer and other disciplines.

  8. Broadcasting a message in a parallel computer

    DOEpatents

    Archer, Charles J; Faraj, Daniel A

    2014-11-18

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.

  9. Broadcasting a message in a parallel computer

    DOEpatents

    Archer, Charles J; Faraj, Ahmad A

    2013-04-16

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.

  10. Accurate modeling of parallel scientific computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Townsend, James C.

    1988-01-01

    Scientific codes are usually parallelized by partitioning a grid among processors. To achieve top performance it is necessary to partition the grid so as to balance workload and minimize communication/synchronization costs. This problem is particularly acute when the grid is irregular, changes over the course of the computation, and is not known until load time. Critical mapping and remapping decisions rest on the ability to accurately predict performance, given a description of a grid and its partition. This paper discusses one approach to this problem, and illustrates its use on a one-dimensional fluids code. The models constructed are shown to be accurate, and are used to find optimal remapping schedules.

  11. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  12. A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture

    NASA Technical Reports Server (NTRS)

    Imbriale, W. A.; Cwik, T.

    1994-01-01

    A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.

  13. A fast algorithm for parallel computation of multibody dynamics on MIMD parallel architectures

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Kwan, Gregory; Bagherzadeh, Nader

    1993-01-01

    In this paper the implementation of a parallel O(LogN) algorithm for computation of rigid multibody dynamics on a Hypercube MIMD parallel architecture is presented. To our knowledge, this is the first algorithm that achieves the time lower bound of O(LogN) by using an optimal number of O(N) processors. However, in addition to its theoretical significance, the algorithm is also highly efficient for practical implementation on commercially available MIMD parallel architectures due to its highly coarse grain size and simple communication and synchronization requirements. We present a multilevel parallel computation strategy for implementation of the algorithm on a Hypercube. This strategy allows the exploitation of parallelism at several computational levels as well as maximum overlapping of computation and communication to increase the performance of parallel computation.

  14. Parallel computation of three-dimensional nonlinear magnetostatic problems.

    SciTech Connect

    Levine, D.; Gropp, W.; Forsman, K.; Kettunen, L.; Mathematics and Computer Science; Tampere Univ. of Tech.

    1999-02-01

    We describe a general-purpose parallel electromagnetic code for computing accurate solutions to large computationally demanding, 3D, nonlinear magnetostatic problems. The code, CORAL, is based on a volume integral equation formulation. Using an IBM SP parallel computer and iterative solution methods, we successfully solved the dense linear systems inherent in such formulations. A key component of our work was the use of the PETSc library, which provides parallel portability and access to the latest linear algebra solution technology.

  15. Parallel CFD design on network-based computer

    NASA Technical Reports Server (NTRS)

    Cheung, Samson

    1995-01-01

    Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advanced computational fluid dynamics codes, which can be computationally expensive on mainframe supercomputers. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computing environment utilizing a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package is applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.

  16. CFD Optimization on Network-Based Parallel Computer System

    NASA Technical Reports Server (NTRS)

    Cheung, Samson H.; Holst, Terry L. (Technical Monitor)

    1994-01-01

    Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advance computational fluid dynamics codes, which is computationally expensive in mainframe supercomputer. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computer on a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package has been applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.

  17. Parallel computing for probabilistic fatigue analysis

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

    1993-01-01

    This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.

  18. LEWICE droplet trajectory calculations on a parallel computer

    NASA Technical Reports Server (NTRS)

    Caruso, Steven C.

    1993-01-01

    A parallel computer implementation (128 processors) of LEWICE, a NASA Lewis code used to predict the time-dependent ice accretion process for two-dimensional aerodynamic bodies of simple geometries, is described. Two-dimensional parallel droplet trajectory calculations are performed to demonstrate the potential benefits of applying parallel processing to ice accretion analysis. Parallel performance is evaluated as a function of the number of trajectories and the number of processors. For comparison, similar trajectory calculations are performed on single-processor Cray computers, and the best parallel results are found to be 33 and 23 times faster, respectively, than those of the Cray XMP and YMP.

  19. Performance Evaluation in Network-Based Parallel Computing

    NASA Technical Reports Server (NTRS)

    Dezhgosha, Kamyar

    1996-01-01

    Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.

  20. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-11-12

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.

  1. On combining computational differentiation and toolkits for parallel scientific computing.

    SciTech Connect

    Bischof, C. H.; Buecker, H. M.; Hovland, P. D.

    2000-06-08

    Automatic differentiation is a powerful technique for evaluating derivatives of functions given in the form of a high-level programming language such as Fortran, C, or C++. The program is treated as a potentially very long sequence of elementary statements to which the chain rule of differential calculus is applied over and over again. Combining automatic differentiation and the organizational structure of toolkits for parallel scientific computing provides a mechanism for evaluating derivatives by exploiting mathematical insight on a higher level. In these toolkits, algorithmic structures such as BLAS-like operations, linear and nonlinear solvers, or integrators for ordinary differential equations can be identified by their standardized interfaces and recognized as high-level mathematical objects rather than as a sequence of elementary statements. In this note, the differentiation of a linear solver with respect to some parameter vector is taken as an example. Mathematical insight is used to reformulate this problem into the solution of multiple linear systems that share the same coefficient matrix but differ in their right-hand sides. The experiments reported here use ADIC, a tool for the automatic differentiation of C programs, and PETSC, an object-oriented toolkit for the parallel solution of scientific problems modeled by partial differential equations.

  2. Parallel image computation in clusters with task-distributor.

    PubMed

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computational expensive and memory consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application. PMID:27330898

  3. Distributing an executable job load file to compute nodes in a parallel computer

    DOEpatents

    Gooding, Thomas M.

    2016-09-13

    Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications link over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.

  4. Distributing an executable job load file to compute nodes in a parallel computer

    DOEpatents

    Gooding, Thomas M.

    2016-08-09

    Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications link over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.

  5. Development of Message Passing Routines for High Performance Parallel Computations

    NASA Technical Reports Server (NTRS)

    Summers, Edward K.

    2004-01-01

    Computational Fluid Dynamics (CFD) calculations require a great deal of computing power for completing the detailed computations involved. In an effort shorten the time it takes to complete such calculations they are implemented on a parallel computer. In the case of a parallel computer some sort of message passing structure must be used to communicate between the computers because, unlike a single machine, each computer in a parallel computing cluster does not have access to all the data or run all the parts of the total program. Thus, message passing is used to divide up the data and send instructions to each machine. The nature of my work this summer involves programming the "message passing" aspect of the parallel computer. I am working on modifying an existing program, which was written with OpenMP, and does not use a multi-machine parallel computing structure, to work with Message Passing Interface (MPI) routines. The actual code is being written in the FORTRAN 90 programming language. My goal is to write a parameterized message passing structure that could be used for a variety of individual applications and implement it on Silicon Graphics Incorporated s (SGI) IRIX operating system. With this new parameterized structure engineers would be able to speed up computations for a wide variety of purposes without having to use larger and more expensive computing equipment from another division or another NASA center.

  6. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Data communications in a parallel active messaging interface ('PAMI') or a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution of a compute node, including specification of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications instruction, the instruction characterized by instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance witht the instruction type, the transfer data from the origin endpoin to the target endpoint.

  7. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-10-29

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.

  8. Serial and parallel computation of Kane's equations for multibody dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir

    1991-01-01

    The analysis of the efficiency of algorithms resulting from Kane's Equation for serial and parallel computation of mass matrix is examined. The algorithms resulting from Kane's equation and Modified Kane's equations are detailed. An analysis was made of two classes of algorithms for computation of mass matrix: the Newton-Euler based algorithms and the Composite rigid body algorithms. An analysis was also made of the efficiency of different algorithms for serial and parallel computations. Conclusions are drawn and presented.

  9. Dynamic traffic assignment on parallel computers

    SciTech Connect

    Nagel, K.; Frye, R.; Jakob, R.; Rickert, M.; Stretz, P.

    1998-12-01

    The authors describe part of the current framework of the TRANSIMS traffic research project at the Los Alamos National Laboratory. It includes parallel implementations of a route planner and a microscopic traffic simulation model. They present performance figures and results of an offline load-balancing scheme used in one of the iterative re-planning runs required for dynamic route assignment.

  10. Algorithms for parallel and vector computations

    NASA Technical Reports Server (NTRS)

    Ortega, James M.

    1995-01-01

    This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March, 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners, and (4) SOR as a preconditioner.

  11. A scalable parallel black oil simulator on distributed memory parallel computers

    NASA Astrophysics Data System (ADS)

    Wang, Kun; Liu, Hui; Chen, Zhangxin

    2015-11-01

    This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.

  12. Parallel algorithms and architecture for computation of manipulator forward dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1989-01-01

    Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n exp 2), and the O(n exp 3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n exp 4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n exp 3) serial algorithms. Parallel computation of the O(n exp 3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation the forward dynamics.

  13. Performance issues for engineering analysis on MIMD parallel computers

    SciTech Connect

    Fang, H.E.; Vaughan, C.T.; Gardner, D.R.

    1994-08-01

    We discuss how engineering analysts can obtain greater computational resolution in a more timely manner from applications codes running on MIMD parallel computers. Both processor speed and memory capacity are important to achieving better performance than a serial vector supercomputer. To obtain good performance, a parallel applications code must be scalable. In addition, the aspect ratios of the subdomains in the decomposition of the simulation domain onto the parallel computer should be of order 1. We demonstrate these conclusions using simulations conducted with the PCTH shock wave physics code running on a Cray Y-MP, a 1024-node nCUBE 2, and an 1840-node Paragon.

  14. Mathematical model partitioning and packing for parallel computer calculation

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Milner, Edward J.

    1986-01-01

    This paper deals with the development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system. The identification of computational parallelism within the model equations is discussed. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. Next, an algorithm which packs the equations into a minimum number of processors is described. The results of applying the packing algorithm to a turboshaft engine model are presented.

  15. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  16. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  17. Parallel Computation of Airflow in the Human Lung Model

    NASA Astrophysics Data System (ADS)

    Lee, Taehun; Tawhai, Merryn; Hoffman, Eric. A.

    2005-11-01

    Parallel computations of airflow in the human lung based on domain decomposition are performed. The realistic lung model is segmented and reconstructed from CT images as part of an effort to build a normative atlas (NIH HL-04368) documenting airway geometry over 4 decades of age in healthy and disease-state adult humans. Because of the large number of the airway generation and the sheer complexity of the geometry, massively parallel computation of pulmonary airflow is carried out. We present the parallel algorithm implemented in the custom-developed characteristic-Galerkin finite element method, evaluate the speed-up and scalability of the scheme, and estimate the computing resources needed to simulate the airflow in the conducting airways of the human lungs. It is found that the special tree-like geometry enables the inter-processor communications to occur among only three or four processors for optimal parallelization irrespective of the number of processors involved in the computation.

  18. QCD on the Massively Parallel Computer AP1000

    NASA Astrophysics Data System (ADS)

    Akemi, K.; Fujisaki, M.; Okuda, M.; Tago, Y.; Hashimoto, T.; Hioki, S.; Miyamura, O.; Takaishi, T.; Nakamura, A.; de Forcrand, Ph.; Hege, C.; Stamatescu, I. O.

    We present the QCD-TARO program of calculations which uses the parallel computer AP1000 of Fujitsu. We discuss the results on scaling, correlation times and hadronic spectrum, some aspects of the implementation and the future prospects.

  19. Serial multiplier arrays for parallel computation

    NASA Technical Reports Server (NTRS)

    Winters, Kel

    1990-01-01

    Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal introduced for this application.

  20. Perspective volume rendering on Parallel Algebraic Logic (PAL) computer

    NASA Astrophysics Data System (ADS)

    Li, Hongzheng; Shi, Hongchi

    1998-09-01

    We propose a perspective volume graphics rendering algorithm on SIMD mesh-connected computers and implement the algorithm on the Parallel Algebraic Logic computer. The algorithm is a parallel ray casting algorithm. It decomposes the 3D perspective projection into two transformations that can be implemented in the SIMD fashion to solve the data redistribution problem caused by non-regular data access patterns in the perspective projection.

  1. Visualization on massively parallel computers using CM/AVS

    SciTech Connect

    Krogh, M.F.; Hansen, C.D.

    1993-09-01

    CM/AVS is a visualization environment for the massively parallel CM-5 from Thinking Machines. It provides a backend to the standard commercially available AVS visualization product. At the Advanced Computing Laboratory at Los Alamos National Laboratory, we have been experimenting and utilizing this software within our visualization environment. This paper describes our experiences with CM/AVS. The conclusions reached are applicable to any implimentation of visualization software within a massively parallel computing environment.

  2. History Matching in Parallel Computational Environments

    SciTech Connect

    Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav

    2004-08-31

    In the probabilistic approach for history matching, the information from the dynamic data is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, fluid properties, well configuration, flow constrains on wells etc. This implies probabilistic approach should then update different regions of the reservoir in different ways. This necessitates delineation of multiple reservoir domains in order to increase the accuracy of the approach. The research focuses on a probabilistic approach to integrate dynamic data that ensures consistency between reservoir models developed from one stage to the next. The algorithm relies on efficient parameterization of the dynamic data integration problem and permits rapid assessment of the updated reservoir model at each stage. The report also outlines various domain decomposition schemes from the perspective of increasing the accuracy of probabilistic approach of history matching. Research progress in three important areas of the project are discussed: {lg_bullet}Validation and testing the probabilistic approach to incorporating production data in reservoir models. {lg_bullet}Development of a robust scheme for identifying reservoir regions that will result in a more robust parameterization of the history matching process. {lg_bullet}Testing commercial simulators for parallel capability and development of a parallel algorithm for history matching.

  3. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  4. Parallel Computing for Probabilistic Response Analysis of High Temperature Composites

    NASA Technical Reports Server (NTRS)

    Sues, R. H.; Lua, Y. J.; Smith, M. D.

    1994-01-01

    The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.

  5. Use of parallel computing in mass processing of laser data

    NASA Astrophysics Data System (ADS)

    Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.

    2015-12-01

    The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.

  6. Parallel aeroelastic computations for wing and wing-body configurations

    NASA Technical Reports Server (NTRS)

    Byun, Chansup

    1994-01-01

    The objective of this research is to develop computationally efficient methods for solving fluid-structural interaction problems by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures on parallel computers. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.

  7. Numerical simulation of polymer flows: A parallel computing approach

    SciTech Connect

    Aggarwal, R.; Keunings, R.; Roux, F.X.

    1993-12-31

    We present a parallel algorithm for the numerical simulation of viscoelastic fluids on distributed memory computers. The algorithm has been implemented within a general-purpose commercial finite element package used in polymer processing applications. Results obtained on the Intel iPSC/860 computer demonstrate high parallel efficiency in complex flow problems. However, since the computational load is unknown a priori, load balancing is a challenging issue. We have developed an adaptive allocation strategy which dynamically reallocates the work load to the processors based upon the history of the computational procedure. We compare the results obtained with the adaptive and static scheduling schemes.

  8. n-body simulations using message passing parallel computers.

    NASA Astrophysics Data System (ADS)

    Grama, A. Y.; Kumar, V.; Sameh, A.

    The authors present new parallel formulations of the Barnes-Hut method for n-body simulations on message passing computers. These parallel formulations partition the domain efficiently incurring minimal communication overhead. This is in contrast to existing schemes that are based on sorting a large number of keys or on the use of global data structures. The new formulations are augmented by alternate communication strategies which serve to minimize communication overhead. The impact of these communication strategies is experimentally studied. The authors report on experimental results obtained from an astrophysical simulation on an nCUBE2 parallel computer.

  9. Access and visualization using clusters and other parallel computers

    NASA Technical Reports Server (NTRS)

    Katz, Daniel S.; Bergou, Attila; Berriman, Bruce; Block, Gary; Collier, Jim; Curkendall, Dave; Good, John; Husman, Laura; Jacob, Joe; Laity, Anastasia; Li, Peggy; Miller, Craig; Plesea, Lucian; Prince, Tom; Siegel, Herb; Williams, Roy

    2003-01-01

    JPL's Parallel Applications Technologies Group has been exploring the issues of data access and visualization of very large data sets over the past 10 or so years. this work has used a number of types of parallel computers, and today includes the use of commodity clusters. This talk will highlight some of the applications and tools we have developed, including how they use parallel computing resources, and specifically how we are using modern clusters. Our applications focus on NASA's needs; thus our data sets are usually related to Earth and Space Science, including data delivered from instruments in space, and data produced by telescopes on the ground.

  10. Chare kernel; A runtime support system for parallel computations

    SciTech Connect

    Shu, W. ); Kale, L.V. )

    1991-03-01

    This paper presents the chare kernel system, which supports parallel computations with irregular structure. The chare kernel is a collection of primitive functions that manage chares, manipulative messages, invoke atomic computations, and coordinate concurrent activities. Programs written in the chare kernel language can be executed on different parallel machines without change. Users writing such programs concern themselves with the creation of parallel actions but not with assigning them to specific processors. The authors describe the design and implementation of the chare kernel. Performance of chare kernel programs on two hypercube machines, the Intel iPSC/2 and the NCUBE, is also given.

  11. A Formal Model for Real-Time Parallel Computation

    SciTech Connect

    Hui, Peter SY; Chikkagoudar, Satish

    2012-12-29

    The imposition of real-time constraints on a parallel computing environment--- specifically high-performance, cluster-computing systems--- introduces a variety of challenges with respect to the formal verification of the system's timing properties. In this paper, we briefly motivate the need for such a system, and we introduce an automaton-based method for performing such formal verification. We define the concept of a consistent parallel timing system: a hybrid system consisting of a set of timed automata (specifically, timed Buechi automata as well as a timed variant of standard finite automata), intended to model the timing properties of a well-behaved real-time parallel system. Finally, we give a brief case study to demonstrate the concepts in the paper: a parallel matrix multiplication kernel which operates within provable upper time bounds. We give the algorithm used, a corresponding consistent parallel timing system, and empirical results showing that the system operates under the specified timing constraints.

  12. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.

  13. Partitioning problems in parallel, pipelined, and distributed computing

    SciTech Connect

    Bokhari, S.H.

    1988-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A sum-bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single-tree structured parallel programs. In addition, the problem of partitioning chain-structured parallel programs across chain-connected systems is solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple-computer architectures for a wide range of problems of practical interest.

  14. Swift : fast, reliable, loosely coupled parallel computation.

    SciTech Connect

    Zhao, Y.; Hategan, M.; Clifford, B.; Foster, I.; von Laszewski, G.; Nefedova, V.; Raicu, I.; Stef-Praun, T.; Wilde, M.; Mathematics and Computer Science; Univ. of Chicago

    2007-01-01

    A common pattern in scientific computing involves the execution of many tasks that are coupled only in the sense that the output of one may be passed as input to one or more others - for example, as a file, or via a Web Services invocation. While such 'loosely coupled' computations can involve large amounts of computation and communication, the concerns of the programmer tend to be different than in traditional high performance computing, being focused on management issues relating to the large numbers of datasets and tasks (and often, the complexities inherent in 'messy' data organizations) rather than the optimization of interprocessor communication. To address these concerns, we have developed Swift, a system that combines a novel scripting language called SwiftScript with a powerful runtime system based on CoG Karajan and Falkon to allow for the concise specification, and reliable and efficient execution, of large loosely coupled computations. Swift adopts and adapts ideas first explored in the GriPhyN virtual data system, improving on that system in many regards. We describe the SwiftScript language and its use of XDTM to describe the logical structure of complex file system structures. We also present the Swift system and its use of CoG Karajan, Falkon, and Globus services to dispatch and manage the execution of many tasks in different execution environments. We summarize application experiences and detail performance experiments that quantify the cost of Swift operations.

  15. Parallelization of ARC3D with Computer-Aided Tools

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SCSI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, Cesspools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably well performance, for example, having a speedup of 30 for 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts for the automated parallelization process although user intervention in many of these parts are still necessary. Nevertheless, development and improvement of useful software tools, such as Cesspools, can help trim down many tedious parallelization details and improve the processing efficiency.

  16. Parallel structures in human and computer memory

    NASA Astrophysics Data System (ADS)

    Kanerva, Pentti

    1986-08-01

    If we think of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library: We recognize a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. This paper is about how to construct a computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. The paper concludes that the frame problem of artificial intelligence could be solved by the use of such a memory if we were able to encode information about the world properly.

  17. Parallel structures in human and computer memory

    NASA Technical Reports Server (NTRS)

    Kanerva, P.

    1986-01-01

    If one thinks of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library. One recognizes a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. A computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do is constructed. Such a memory is remarkably similiar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. It is concluded that the frame problem of artificial intelligence could be solved by the use of such a memory if one were able to encode information about the world properly.

  18. Performing a global barrier operation in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-12-09

    Executing computing tasks on a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier.

  19. Misleading Performance Claims in Parallel Computations

    SciTech Connect

    Bailey, David H.

    2009-05-29

    In a previous humorous note entitled 'Twelve Ways to Fool the Masses,' I outlined twelve common ways in which performance figures for technical computer systems can be distorted. In this paper and accompanying conference talk, I give a reprise of these twelve 'methods' and give some actual examples that have appeared in peer-reviewed literature in years past. I then propose guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion, not only in the world of device simulation but also in the larger arena of technical computing.

  20. Parallel algorithms and archtectures for computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Patrick, Merrell; Ma, Shing; Mahajan, Umesh

    1989-01-01

    The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.

  1. Parallel computation using boundary elements in solid mechanics

    NASA Technical Reports Server (NTRS)

    Chien, L. S.; Sun, C. T.

    1990-01-01

    The inherent parallelism of the boundary element method is shown. The boundary element is formulated by assuming the linear variation of displacements and tractions within a line element. Moreover, MACSYMA symbolic program is employed to obtain the analytical results for influence coefficients. Three computational components are parallelized in this method to show the speedup and efficiency in computation. The global coefficient matrix is first formed concurrently. Then, the parallel Gaussian elimination solution scheme is applied to solve the resulting system of equations. Finally, and more importantly, the domain solutions of a given boundary value problem are calculated simultaneously. The linear speedups and high efficiencies are shown for solving a demonstrated problem on Sequent Symmetry S81 parallel computing system.

  2. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  3. A microeconomic scheduler for parallel computers

    NASA Technical Reports Server (NTRS)

    Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex

    1995-01-01

    We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.

  4. Nonlinear hierarchical substructural parallelism and computer architecture

    NASA Technical Reports Server (NTRS)

    Padovan, Joe

    1989-01-01

    Computer architecture is investigated in conjunction with the algorithmic structures of nonlinear finite-element analysis. To help set the stage for this goal, the development is undertaken by considering the wide-ranging needs associated with the analysis of rolling tires which possess the full range of kinematic, material and boundary condition induced nonlinearity in addition to gross and local cord-matrix material properties.

  5. A Lanczos eigenvalue method on a parallel computer

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.; Fulton, Robert E.

    1987-01-01

    Eigenvalue analyses of complex structures is a computationally intensive task which can benefit significantly from new and impending parallel computers. This study reports on a parallel computer implementation of the Lanczos method for free vibration analysis. The approach used here subdivides the major Lanczos calculation tasks into subtasks and introduces parallelism down to the subtask levels such as matrix decomposition and forward/backward substitution. The method was implemented on a commercial parallel computer and results were obtained for a long flexible space structure. While parallel computing efficiency for the Lanczos method was good for a moderate number of processors for the test problem, the greatest reduction in time was realized for the decomposition of the stiffness matrix, a calculation which took 70 percent of the time in the sequential program and which took 25 percent of the time on eight processors. For a sample calculation of the twenty lowest frequencies of a 486 degree of freedom problem, the total sequential computing time was reduced by almost a factor of ten using 16 processors.

  6. History Matching in Parallel Computational Environments

    SciTech Connect

    Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Yonghwee Kim; Sharad Yadav

    2006-08-31

    A novel methodology for delineating multiple reservoir domains for the purpose of history matching in a distributed computing environment has been proposed. A fully probabilistic approach to perturb permeability within the delineated zones is implemented. The combination of robust schemes for identifying reservoir zones and distributed computing significantly increase the accuracy and efficiency of the probabilistic approach. The information pertaining to the permeability variations in the reservoir that is contained in dynamic data is calibrated in terms of a deformation parameter rD. This information is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, well configuration, flow constrains etc. The probabilistic approach then has to account for multiple r{sub D} values in different regions of the reservoir. In order to delineate reservoir domains that can be characterized with different r{sub D} parameters, principal component analysis (PCA) of the Hessian matrix has been done. The Hessian matrix summarizes the sensitivity of the objective function at a given step of the history matching to model parameters. It also measures the interaction of the parameters in affecting the objective function. The basic premise of PC analysis is to isolate the most sensitive and least correlated regions. The eigenvectors obtained during the PCA are suitably scaled and appropriate grid block volume cut-offs are defined such that the resultant domains are neither too large (which increases interactions between domains) nor too small (implying ineffective history matching). The delineation of domains requires calculation of Hessian, which could be computationally costly and as well as restricts the current

  7. History Matching in Parallel Computational Environments

    SciTech Connect

    Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav

    2005-10-01

    A novel methodology for delineating multiple reservoir domains for the purpose of history matching in a distributed computing environment has been proposed. A fully probabilistic approach to perturb permeability within the delineated zones is implemented. The combination of robust schemes for identifying reservoir zones and distributed computing significantly increase the accuracy and efficiency of the probabilistic approach. The information pertaining to the permeability variations in the reservoir that is contained in dynamic data is calibrated in terms of a deformation parameter rD. This information is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, well configuration, flow constrains etc. The probabilistic approach then has to account for multiple r{sub D} values in different regions of the reservoir. In order to delineate reservoir domains that can be characterized with different rD parameters, principal component analysis (PCA) of the Hessian matrix has been done. The Hessian matrix summarizes the sensitivity of the objective function at a given step of the history matching to model parameters. It also measures the interaction of the parameters in affecting the objective function. The basic premise of PC analysis is to isolate the most sensitive and least correlated regions. The eigenvectors obtained during the PCA are suitably scaled and appropriate grid block volume cut-offs are defined such that the resultant domains are neither too large (which increases interactions between domains) nor too small (implying ineffective history matching). The delineation of domains requires calculation of Hessian, which could be computationally costly and as well as restricts the current approach to

  8. Identifying failure in a tree network of a parallel computer

    DOEpatents

    Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.

    2010-08-24

    Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.

  9. Methods for operating parallel computing systems employing sequenced communications

    DOEpatents

    Benner, R.E.; Gustafson, J.L.; Montry, G.R.

    1999-08-10

    A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.

  10. Methods for operating parallel computing systems employing sequenced communications

    DOEpatents

    Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

    1999-01-01

    A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

  11. Parallel Computational Fluid Dynamics: Current Status and Future Requirements

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; VanDalsem, William R.; Dagum, Leonardo; Kutler, Paul (Technical Monitor)

    1994-01-01

    One or the key objectives of the Applied Research Branch in the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Allies Research Center is the accelerated introduction of highly parallel machines into a full operational environment. In this report we discuss the performance results obtained from the implementation of some computational fluid dynamics (CFD) applications on the Connection Machine CM-2 and the Intel iPSC/860. We summarize some of the experiences made so far with the parallel testbed machines at the NAS Applied Research Branch. Then we discuss the long term computational requirements for accomplishing some of the grand challenge problems in computational aerosciences. We argue that only massively parallel machines will be able to meet these grand challenge requirements, and we outline the computer science and algorithm research challenges ahead.

  12. Modeling groundwater flow on massively parallel computers

    SciTech Connect

    Ashby, S.F.; Falgout, R.D.; Fogwell, T.W.; Tompson, A.F.B.

    1994-12-31

    The authors will explore the numerical simulation of groundwater flow in three-dimensional heterogeneous porous media. An interdisciplinary team of mathematicians, computer scientists, hydrologists, and environmental engineers is developing a sophisticated simulation code for use on workstation clusters and MPPs. To date, they have concentrated on modeling flow in the saturated zone (single phase), which requires the solution of a large linear system. they will discuss their implementation of preconditioned conjugate gradient solvers. The preconditioners under consideration include simple diagonal scaling, s-step Jacobi, adaptive Chebyshev polynomial preconditioning, and multigrid. They will present some preliminary numerical results, including simulations of groundwater flow at the LLNL site. They also will demonstrate the code`s scalability.

  13. Parallel Domain Decomposition Preconditioning for Computational Fluid Dynamics

    NASA Technical Reports Server (NTRS)

    Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai; Kutler, Paul (Technical Monitor)

    1998-01-01

    This viewgraph presentation gives an overview of the parallel domain decomposition preconditioning for computational fluid dynamics. Details are given on some difficult fluid flow problems, stabilized spatial discretizations, and Newton's method for solving the discretized flow equations. Schur complement domain decomposition is described through basic formulation, simplifying strategies (including iterative subdomain and Schur complement solves, matrix element dropping, localized Schur complement computation, and supersparse computations), and performance evaluation.

  14. Three parallel computation methods for structural vibration analysis

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf; Bostic, Susan; Patrick, Merrell; Mahajan, Umesh; Ma, Shing

    1988-01-01

    The Lanczos (1950), multisectioning, and subspace iteration sequential methods for vibration analysis presently used as bases for three parallel algorithms are noted, in the aftermath of three example problems, to maintain reasonable accuracy in the computation of vibration frequencies. Significant computation time reductions are obtained as the number of processors increases. An analysis is made of the performance of each method, in order to characterize relative strengths and weaknesses as well as to identify those parameters that most strongly affect computation efficiency.

  15. Parallel and pipeline computation of fast unitary transforms

    NASA Technical Reports Server (NTRS)

    Fino, B. J.; Algazi, V. R.

    1975-01-01

    The letter discusses the parallel and pipeline organization of fast-unitary-transform algorithms such as the fast Fourier transform, and points out the efficiency of a combined parallel-pipeline processor of a transform such as the Haar transform, in which (2 to the n-th power) -1 hardware 'butterflies' generate a transform of order 2 to the n-th power every computation cycle.

  16. Dynamic grid refinement for partial differential equations on parallel computers

    NASA Technical Reports Server (NTRS)

    Mccormick, S.; Quinlan, D.

    1989-01-01

    The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.

  17. A Parallel Genetic Algorithm for Automated Electronic Circuit Design

    NASA Technical Reports Server (NTRS)

    Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris

    2000-01-01

    Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and Os, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency

  18. A scalable parallel graph coloring algorithm for distributed memory computers.

    SciTech Connect

    Bozdag, Doruk; Manne, Fredrik; Gebremedhin, Assefaw H.; Catalyurek, Umit; Boman, Erik Gunnar

    2005-02-01

    In large-scale parallel applications a graph coloring is often carried out to schedule computational tasks. In this paper, we describe a new distributed memory algorithm for doing the coloring itself in parallel. The algorithm operates in an iterative fashion; in each round vertices are speculatively colored based on limited information, and then a set of incorrectly colored vertices, to be recolored in the next round, is identified. Parallel speedup is achieved in part by reducing the frequency of communication among processors. Experimental results on a PC cluster using up to 16 processors show that the algorithm is scalable.

  19. RAMA: A file system for massively parallel computers

    NASA Technical Reports Server (NTRS)

    Miller, Ethan L.; Katz, Randy H.

    1993-01-01

    This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.

  20. Parallel algorithm for computing 3-D reachable workspaces

    NASA Astrophysics Data System (ADS)

    Alameldin, Tarek K.; Sobh, Tarek M.

    1992-03-01

    The problem of computing the 3-D workspace for redundant articulated chains has applications in a variety of fields such as robotics, computer aided design, and computer graphics. The computational complexity of the workspace problem is at least NP-hard. The recent advent of parallel computers has made practical solutions for the workspace problem possible. Parallel algorithms for computing the 3-D workspace for redundant articulated chains with joint limits are presented. The first phase of these algorithms computes workspace points in parallel. The second phase uses workspace points that are computed in the first phase and fits a 3-D surface around the volume that encompasses the workspace points. The second phase also maps the 3- D points into slices, uses region filling to detect the holes and voids in the workspace, extracts the workspace boundary points by testing the neighboring cells, and tiles the consecutive contours with triangles. The proposed algorithms are efficient for computing the 3-D reachable workspace for articulated linkages, not only those with redundant degrees of freedom but also those with joint limits.

  1. Numerical simulation of supersonic wake flow with parallel computers

    SciTech Connect

    Wong, C.C.; Soetrisno, M.

    1995-07-01

    Simulating a supersonic wake flow field behind a conical body is a computing intensive task. It requires a large number of computational cells to capture the dominant flow physics and a robust numerical algorithm to obtain a reliable solution. High performance parallel computers with unique distributed processing and data storage capability can provide this need. They have larger computational memory and faster computing time than conventional vector computers. We apply the PINCA Navier-Stokes code to simulate a wind-tunnel supersonic wake experiment on Intel Gamma, Intel Paragon, and IBM SP2 parallel computers. These simulations are performed to study the mean flow in the near wake region of a sharp, 7-degree half-angle, adiabatic cone at Mach number 4.3 and freestream Reynolds number of 40,600. Overall the numerical solutions capture the general features of the hypersonic laminar wake flow and compare favorably with the wind tunnel data. With a refined and clustering grid distribution in the recirculation zone, the calculated location of the rear stagnation point is consistent with the 2D axisymmetric and 3D experiments. In this study, we also demonstrate the importance of having a large local memory capacity within a computer node and the effective utilization of the number of computer nodes to achieve good parallel performance when simulating a complex, large-scale wake flow problem.

  2. CFD Analysis and Design Optimization Using Parallel Computers

    NASA Technical Reports Server (NTRS)

    Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James

    1997-01-01

    A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.

  3. Parallel grid generation algorithm for distributed memory computers

    NASA Technical Reports Server (NTRS)

    Moitra, Stuti; Moitra, Anutosh

    1994-01-01

    A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.

  4. Multithreaded Model for Dynamic Load Balancing Parallel Adaptive PDE Computations

    NASA Technical Reports Server (NTRS)

    Chrisochoides, Nikos

    1995-01-01

    We present a multithreaded model for the dynamic load-balancing of numerical, adaptive computations required for the solution of Partial Differential Equations (PDE's) on multiprocessors. Multithreading is used as a means of exploring concurrency in the processor level in order to tolerate synchronization costs inherent to traditional (non-threaded) parallel adaptive PDE solvers. Our preliminary analysis for parallel, adaptive PDE solvers indicates that multithreading can be used an a mechanism to mask overheads required for the dynamic balancing of processor workloads with computations required for the actual numerical solution of the PDE's. Also, multithreading can simplify the implementation of dynamic load-balancing algorithms, a task that is very difficult for traditional data parallel adaptive PDE computations. Unfortunately, multithreading does not always simplify program complexity, often makes code re-usability not an easy task, and increases software complexity.

  5. Solving unstructured grid problems on massively parallel computers

    NASA Technical Reports Server (NTRS)

    Hammond, Steven W.; Schreiber, Robert

    1990-01-01

    A highly parallel graph mapping technique that enables one to efficiently solve unstructured grid problems on massively parallel computers is presented. Many implicit and explicit methods for solving discretized partial differential equations require each point in the discretization to exchange data with its neighboring points every time step or iteration. The cost of this communication can negate the high performance promised by massively parallel computing. To eliminate this bottleneck, the graph of the irregular problem is mapped into the graph representing the interconnection topology of the computer such that the sum of the distances that the messages travel is minimized. It is shown that using the heuristic mapping algorithm significantly reduces the communication time compared to a naive assignment of processes to processors.

  6. A parallel sparse algorithm targeting arterial fluid mechanics computations

    NASA Astrophysics Data System (ADS)

    Manguoglu, Murat; Takizawa, Kenji; Sameh, Ahmed H.; Tezduyar, Tayfun E.

    2011-09-01

    Iterative solution of large sparse nonsymmetric linear equation systems is one of the numerical challenges in arterial fluid-structure interaction computations. This is because the fluid mechanics parts of the fluid + structure block of the equation system that needs to be solved at every nonlinear iteration of each time step corresponds to incompressible flow, the computational domains include slender parts, and accurate wall shear stress calculations require boundary layer mesh refinement near the arterial walls. We propose a hybrid parallel sparse algorithm, domain-decomposing parallel solver (DDPS), to address this challenge. As the test case, we use a fluid mechanics equation system generated by starting with an arterial shape and flow field coming from an FSI computation and performing two time steps of fluid mechanics computation with a prescribed arterial shape change, also coming from the FSI computation. We show how the DDPS algorithm performs in solving the equation system and demonstrate the scalability of the algorithm.

  7. Panel on future directions in parallel computer architecture

    SciTech Connect

    VanTilborg, A.M. )

    1989-06-01

    One of the program highlights of the 15th Annual International Symposium on Computer Architecture, held May 30 - June 2, 1988 in Honolulu, was a panel session on future directions in parallel computer architecture. The panel was organized and chaired by the author, and was comprised of Prof. Jack Dennis (NASA Ames Research Institute for Advanced Computer Science), Prof. H.T. Kung (Carnegie Mellon), and Dr. Burton Smith (Tera Computer Company). The objective of the panel was to identify the likely trajectory of future parallel computer system progress, particularly from the sandpoint of marketplace acceptance. Approximately 250 attendees participated in the session, in which each panelist began with a ten minute viewgraph explanation of his views, followed by an open and sometimes lively exchange with the audience and fellow panelists. The session ran for ninety minutes.

  8. IPython: components for interactive and parallel computing across disciplines. (Invited)

    NASA Astrophysics Data System (ADS)

    Perez, F.; Bussonnier, M.; Frederic, J. D.; Froehle, B. M.; Granger, B. E.; Ivanov, P.; Kluyver, T.; Patterson, E.; Ragan-Kelley, B.; Sailer, Z.

    2013-12-01

    Scientific computing is an inherently exploratory activity that requires constantly cycling between code, data and results, each time adjusting the computations as new insights and questions arise. To support such a workflow, good interactive environments are critical. The IPython project (http://ipython.org) provides a rich architecture for interactive computing with: 1. Terminal-based and graphical interactive consoles. 2. A web-based Notebook system with support for code, text, mathematical expressions, inline plots and other rich media. 3. Easy to use, high performance tools for parallel computing. Despite its roots in Python, the IPython architecture is designed in a language-agnostic way to facilitate interactive computing in any language. This allows users to mix Python with Julia, R, Octave, Ruby, Perl, Bash and more, as well as to develop native clients in other languages that reuse the IPython clients. In this talk, I will show how IPython supports all stages in the lifecycle of a scientific idea: 1. Individual exploration. 2. Collaborative development. 3. Production runs with parallel resources. 4. Publication. 5. Education. In particular, the IPython Notebook provides an environment for "literate computing" with a tight integration of narrative and computation (including parallel computing). These Notebooks are stored in a JSON-based document format that provides an "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.

  9. Parallel Computing Environments and Methods for Power Distribution System Simulation

    SciTech Connect

    Lu, Ning; Taylor, Zachary T.; Chassin, David P.; Guttromson, Ross T.; Studham, Scott S.

    2005-11-10

    The development of cost-effective high-performance parallel computing on multi-processor super computers makes it attractive to port excessively time consuming simulation software from personal computers (PC) to super computes. The power distribution system simulator (PDSS) takes a bottom-up approach and simulates load at appliance level, where detailed thermal models for appliances are used. This approach works well for a small power distribution system consisting of a few thousand appliances. When the number of appliances increases, the simulation uses up the PC memory and its run time increases to a point where the approach is no longer feasible to model a practical large power distribution system. This paper presents an effort made to port a PC-based power distribution system simulator (PDSS) to a 128-processor shared-memory super computer. The paper offers an overview of the parallel computing environment and a description of the modification made to the PDSS model. The performances of the PDSS running on a standalone PC and on the super computer are compared. Future research direction of utilizing parallel computing in the power distribution system simulation is also addressed.

  10. Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.

  11. Method for implementation of recursive hierarchical segmentation on parallel computers

    NASA Technical Reports Server (NTRS)

    Tilton, James C. (Inventor)

    2005-01-01

    A method, computer readable storage, and apparatus for implementing a recursive hierarchical segmentation algorithm on a parallel computing platform. The method includes setting a bottom level of recursion that defines where a recursive division of an image into sections stops dividing, and setting an intermediate level of recursion where the recursive division changes from a parallel implementation into a serial implementation. The segmentation algorithm is implemented according to the set levels. The method can also include setting a convergence check level of recursion with which the first level of recursion communicates with when performing a convergence check.

  12. Small file aggregation in a parallel computing system

    DOEpatents

    Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Zhang, Jingwang

    2014-09-02

    Techniques are provided for small file aggregation in a parallel computing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallel computing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.

  13. Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges

    NASA Technical Reports Server (NTRS)

    Djomehri, Mohammad Jahed; Biswas, R.; VanderWijngaart, R.; Yarrow, M.

    2000-01-01

    This paper describes several results of parallel and distributed computing using a large scale production flow solver program. A coarse grained parallelization based on clustering of discretization grids combined with partitioning of large grids for load balancing is presented. An assessment is given of its performance on distributed and distributed-shared memory platforms using large scale scientific problems. An experiment with this solver, adapted to a Wide Area Network execution environment is presented. We also give a comparative performance assessment of computation and communication times on both the tightly and loosely-coupled machines.

  14. Implicit schemes and parallel computing in unstructured grid CFD

    NASA Technical Reports Server (NTRS)

    Venkatakrishnam, V.

    1995-01-01

    The development of implicit schemes for obtaining steady state solutions to the Euler and Navier-Stokes equations on unstructured grids is outlined. Applications are presented that compare the convergence characteristics of various implicit methods. Next, the development of explicit and implicit schemes to compute unsteady flows on unstructured grids is discussed. Next, the issues involved in parallelizing finite volume schemes on unstructured meshes in an MIMD (multiple instruction/multiple data stream) fashion are outlined. Techniques for partitioning unstructured grids among processors and for extracting parallelism in explicit and implicit solvers are discussed. Finally, some dynamic load balancing ideas, which are useful in adaptive transient computations, are presented.

  15. TSE computers - A means for massively parallel computations

    NASA Technical Reports Server (NTRS)

    Strong, J. P., III

    1976-01-01

    A description is presented of hardware concepts for building a massively parallel processing system for two-dimensional data. The processing system is to use logic arrays of 128 x 128 elements which perform over 16 thousand operations simultaneously. Attention is given to image data, logic arrays, basic image logic functions, a prototype negator, an interleaver device, image logic circuits, and an image memory circuit.

  16. Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2016-03-15

    Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.

  17. Communication-efficient parallel architectures and algorithms for image computations

    SciTech Connect

    Alnuweiri, H.M.

    1989-01-01

    The main purpose of this dissertation is the design of efficient parallel techniques for image computations which require global operations on image pixels, as well as the development of parallel architectures with special communication features which can support global data movement efficiently. The class of image problems considered in this dissertation involves global operations on image pixels, and irregular (data-dependent) data movement operations. Such problems include histogramming, component labeling, proximity computations, computing the Hough Transform, computing convexity of regions and related properties such as computing the diameter and a smallest area enclosing rectangle for each region. Images with multiple figures and multiple labeled-sets of pixels are also considered. Efficient solutions to such problems involve integer sorting, graph theoretic techniques, and techniques from computational geometry. Although such solutions are not computationally intensive (they all require O(n{sup 2}) operations to be performed on an n {times} n image), they require global communications. The emphasis here is on developing parallel techniques for data movement, reduction, and distribution, which lead to processor-time optimal solutions for such problems on the proposed organizations. The proposed parallel architectures are based on a memory array which can be viewed as an arrangement of memory modules in a k-dimensional space such that the modules are connected to buses placed parallel to the orthogonal axes of the space, and each bus is connected to one processor or a group of processors. It will be shown that such organizations are communication-efficient and are thus highly suited to the image problems considered here, and also to several other classes of problems. The proposed organizations have p processors and O(n{sup 2}) words of memory to process n {times} n images.

  18. Instant well-log inversion with a parallel computer

    SciTech Connect

    Kimminau, S.J.; Trivedi, H.

    1993-08-01

    Well-log analysis requires several vectors of input data to be inverted with a physical model that produces more vectors of output data. The problem is inherently suited to either vectorization or parallelization. PLATO (parallel log analysis, timely output) is a research prototype system that uses a parallel architecture computer with memory-mapped graphics to invert vector data and display the result rapidly. By combining this high-performance computing and display system with a graphical user interface, the analyst can interact with the system in real time'' and can visualize the result of changing parameters on up to 1,000 levels of computed volumes and reconstructed logs. It is expected that such instant'' inversion will remove the main disadvantages frequently cited for simultaneous analysis methods, namely difficulty in assessing sensitivity to different parameters and slow output response. Although the prototype system uses highly specific features of a parallel processor, a subsequent version has been implemented on a conventional (Serial) workstation with less performance but adequate functionality to preserve the apparently instant response. PLATO demonstrates the feasibility of petroleum computing applications combining an intuitive graphical interface, high-performance computing of physical models, and real-time output graphics.

  19. Applications of parallel supercomputers: Scientific results and computer science lessons

    SciTech Connect

    Fox, G.C.

    1989-07-12

    Parallel Computing has come of age with several commercial and inhouse systems that deliver supercomputer performance. We illustrate this with several major computations completed or underway at Caltech on hypercubes, transputer arrays and the SIMD Connection Machine CM-2 and AMT DAP. Applications covered are lattice gauge theory, computational fluid dynamics, subatomic string dynamics, statistical and condensed matter physics,theoretical and experimental astronomy, quantum chemistry, plasma physics, grain dynamics, computer chess, graphics ray tracing, and Kalman filters. We use these applications to compare the performance of several advanced architecture computers including the conventional CRAY and ETA-10 supercomputers. We describe which problems are suitable for which computers in the terms of a matching between problem and computer architecture. This is part of a set of lessons we draw for hardware, software, and performance. We speculate on the emergence of new academic disciplines motivated by the growing importance of computers. 138 refs., 23 figs., 10 tabs.

  20. A design methodology for portable software on parallel computers

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.

    1993-01-01

    This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks are specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured

  1. Parallel Computing for the Computed-Tomography Imaging Spectrometer

    NASA Technical Reports Server (NTRS)

    Lee, Seungwon

    2008-01-01

    This software computes the tomographic reconstruction of spatial-spectral data from raw detector images of the Computed-Tomography Imaging Spectrometer (CTIS), which enables transient-level, multi-spectral imaging by capturing spatial and spectral information in a single snapshot.

  2. Hybrid parallel computing architecture for multiview phase shifting

    NASA Astrophysics Data System (ADS)

    Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun

    2014-11-01

    The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.

  3. Requirements for supercomputing in energy research: The transition to massively parallel computing

    SciTech Connect

    Not Available

    1993-02-01

    This report discusses: The emergence of a practical path to TeraFlop computing and beyond; requirements of energy research programs at DOE; implementation: supercomputer production computing environment on massively parallel computers; and implementation: user transition to massively parallel computing.

  4. Solving tridiagonal linear systems on the Butterfly parallel computer

    SciTech Connect

    Kumar, S.P.

    1989-01-01

    A parallel block partitioning method to solve a tri-diagonal system of linear equations is adapted to the BBN Butterfly multiprocessor. A performance analysis of the programming experiments on the 32-node Butterfly is presented. An upper bound on the number of processors to achieve the best performance with this method is derived. The computational results verify the theoretical speedup and efficiency results of the parallel algorithm over its serial counterpart. Also included is a study comparing performance runs of the same code on the Butterfly processor with a hardware floating point unit and on one with a software floating point facility. The total parallel time of the given code is considerably reduced by making use of the hardware floating point facility whereas the speedup and efficiency of the parallel program considerably improve on the system with software floating point capability. The achieved results are shown to be within 82% to 90% of the predicted performance.

  5. Solution of partial differential equations on vector and parallel computers

    NASA Technical Reports Server (NTRS)

    Ortega, J. M.; Voigt, R. G.

    1985-01-01

    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.

  6. Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing

    NASA Technical Reports Server (NTRS)

    Ozguner, Fusun

    1996-01-01

    Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.

  7. Parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Amin-Javaheri, Masoud; Orin, David E.

    1989-01-01

    The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.

  8. Variable-Complexity Multidisciplinary Optimization on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Grossman, Bernard; Mason, William H.; Watson, Layne T.; Haftka, Raphael T.

    1998-01-01

    This report covers work conducted under grant NAG1-1562 for the NASA High Performance Computing and Communications Program (HPCCP) from December 7, 1993, to December 31, 1997. The objective of the research was to develop new multidisciplinary design optimization (MDO) techniques which exploit parallel computing to reduce the computational burden of aircraft MDO. The design of the High-Speed Civil Transport (HSCT) air-craft was selected as a test case to demonstrate the utility of our MDO methods. The three major tasks of this research grant included: development of parallel multipoint approximation methods for the aerodynamic design of the HSCT, use of parallel multipoint approximation methods for structural optimization of the HSCT, mathematical and algorithmic development including support in the integration of parallel computation for items (1) and (2). These tasks have been accomplished with the development of a response surface methodology that incorporates multi-fidelity models. For the aerodynamic design we were able to optimize with up to 20 design variables using hundreds of expensive Euler analyses together with thousands of inexpensive linear theory simulations. We have thereby demonstrated the application of CFD to a large aerodynamic design problem. For the predicting structural weight we were able to combine hundreds of structural optimizations of refined finite element models with thousands of optimizations based on coarse models. Computations have been carried out on the Intel Paragon with up to 128 nodes. The parallel computation allowed us to perform combined aerodynamic-structural optimization using state of the art models of a complex aircraft configurations.

  9. Measuring performance of parallel computers. Progress report, 1989

    SciTech Connect

    Sullivan, F.

    1994-07-01

    Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.

  10. Parallelization of implicit finite difference schemes in computational fluid dynamics

    NASA Technical Reports Server (NTRS)

    Decker, Naomi H.; Naik, Vijay K.; Nicoules, Michel

    1990-01-01

    Implicit finite difference schemes are often the preferred numerical schemes in computational fluid dynamics, requiring less stringent stability bounds than the explicit schemes. Each iteration in an implicit scheme involves global data dependencies in the form of second and higher order recurrences. Efficient parallel implementations of such iterative methods are considerably more difficult and non-intuitive. The parallelization of the implicit schemes that are used for solving the Euler and the thin layer Navier-Stokes equations and that require inversions of large linear systems in the form of block tri-diagonal and/or block penta-diagonal matrices is discussed. Three-dimensional cases are emphasized and schemes that minimize the total execution time are presented. Partitioning and scheduling schemes for alleviating the effects of the global data dependencies are described. An analysis of the communication and the computation aspects of these methods is presented. The effect of the boundary conditions on the parallel schemes is also discussed.

  11. Parallel matrix transpose algorithms on distributed memory concurrent computers

    SciTech Connect

    Choi, Jaeyoung; Dongarra, J. |; Walker, D.W.

    1994-12-31

    This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P {times} Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A {center_dot} B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T} {center_dot} B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

  12. Traffic simulations on parallel computers using domain decomposition techniques

    SciTech Connect

    Hanebutte, U.R.; Tentner, A.M.

    1995-12-31

    Large scale simulations of Intelligent Transportation Systems (ITS) can only be achieved by using the computing resources offered by parallel computing architectures. Domain decomposition techniques are proposed which allow the performance of traffic simulations with the standard simulation package TRAF-NETSIM on a 128 nodes IBM SPx parallel supercomputer as well as on a cluster of SUN workstations. Whilst this particular parallel implementation is based on NETSIM, a microscopic traffic simulation model, the presented strategy is applicable to a broad class of traffic simulations. An outer iteration loop must be introduced in order to converge to a global solution. A performance study that utilizes a scalable test network that consist of square-grids is presented, which addresses the performance penalty introduced by the additional iteration loop.

  13. Element-topology-independent preconditioners for parallel finite element computations

    NASA Technical Reports Server (NTRS)

    Park, K. C.; Alexander, Scott

    1992-01-01

    A family of preconditioners for the solution of finite element equations are presented, which are element-topology independent and thus can be applicable to element order-free parallel computations. A key feature of the present preconditioners is the repeated use of element connectivity matrices and their left and right inverses. The properties and performance of the present preconditioners are demonstrated via beam and two-dimensional finite element matrices for implicit time integration computations.

  14. Hardware packet pacing using a DMA in a parallel computer

    DOEpatents

    Chen, Dong; Heidelberger, Phillip; Vranas, Pavlos

    2013-08-13

    Method and system for hardware packet pacing using a direct memory access controller in a parallel computer which, in one aspect, keeps track of a total number of bytes put on the network as a result of a remote get operation, using a hardware token counter.

  15. Parallel Analysis and Visualization on Cray Compute Node Linux

    SciTech Connect

    Pugmire, Dave; Ahern, Sean

    2008-01-01

    Capability computer systems are deployed to give researchers the computational power required to investigate and solve key challenges facing the scientific community. As the power of these computer systems increases, the computational problem domain typically increases in size, complexity and scope. These increases strain the ability of commodity analysis and visualization clusters to effectively perform post-processing tasks and provide critical insight and understanding to the computed results. An alternative to purchasing increasingly larger, separate analysis and visualization commodity clusters is to use the computational system itself to perform post-processing tasks. In this paper, the recent successful port of VisIt, a parallel, open source analysis and visualization tool, to compute node linux running on the Cray is detailed. Additionally, the unprecedented ability of this resource for analysis and visualization is discussed and a report on obtained results is presented.

  16. Efficiently modeling neural networks on massively parallel computers

    NASA Technical Reports Server (NTRS)

    Farber, Robert M.

    1993-01-01

    Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.

  17. Fencing data transfers in a parallel active messaging interface of a parallel computer

    SciTech Connect

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-06-30

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  18. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-08-11

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  19. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-06-02

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  20. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-06-09

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  1. Computationally efficient implementation of combustion chemistry in parallel PDF calculations

    NASA Astrophysics Data System (ADS)

    Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.

    2009-08-01

    In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel

  2. A Multilevel Parallelization Framework for High-Order Stencil Computations

    NASA Astrophysics Data System (ADS)

    Dursun, Hikmet; Nomura, Ken-Ichi; Peng, Liu; Seymour, Richard; Wang, Weiqiang; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

    Stencil based computation on structured grids is a common kernel to broad scientific applications. The order of stencils increases with the required precision, and it is a challenge to optimize such high-order stencils on multicore architectures. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a 6 th order stencil based seismic wave propagation code on a suite of multicore architectures. Strong-scaling scalability tests exhibit superlinear speedup due to increasing cache capacity on Intel Harpertown and AMD Barcelona based clusters, whereas weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.

  3. Superfast robust digital image correlation analysis with parallel computing

    NASA Astrophysics Data System (ADS)

    Pan, Bing; Tian, Long

    2015-03-01

    Existing digital image correlation (DIC) using the robust reliability-guided displacement tracking (RGDT) strategy for full-field displacement measurement is a path-dependent process that can only be executed sequentially. This path-dependent tracking strategy not only limits the potential of DIC for further improvement of its computational efficiency but also wastes the parallel computing power of modern computers with multicore processors. To maintain the robustness of the existing RGDT strategy and to overcome its deficiency, an improved RGDT strategy using a two-section tracking scheme is proposed. In the improved RGDT strategy, the calculated points with correlation coefficients higher than a preset threshold are all taken as reliably computed points and given the same priority to extend the correlation analysis to their neighbors. Thus, DIC calculation is first executed in parallel at multiple points by separate independent threads. Then for the few calculated points with correlation coefficients smaller than the threshold, DIC analysis using existing RGDT strategy is adopted. Benefiting from the improved RGDT strategy and the multithread computing, superfast DIC analysis can be accomplished without sacrificing its robustness and accuracy. Experimental results show that the presented parallel DIC method performed on a common eight-core laptop can achieve about a 7 times speedup.

  4. Parallel multithread computing for spectroscopic analysis in optical coherence tomography

    NASA Astrophysics Data System (ADS)

    Trojanowski, Michal; Kraszewski, Maciej; Strakowski, Marcin; Pluciński, Jerzy

    2014-05-01

    Spectroscopic Optical Coherence Tomography (SOCT) is an extension of Optical Coherence Tomography (OCT). It allows gathering spectroscopic information from individual scattering points inside the sample. It is based on time-frequency analysis of interferometric signals. Such analysis requires calculating hundreds of Fourier transforms while performing a single A-scan. Additionally, further processing of acquired spectroscopic information is needed. This significantly increases the time of required computations. During last years, application of graphical processing units (GPU's) was proposed to reduce computation time in OCT by using parallel computing algorithms. GPU technology can be also used to speed-up signal processing in SOCT. However, parallel algorithms used in classical OCT need to be revised because of different character of analyzed data. The classical OCT requires processing of long, independent interferometric signals for obtaining subsequent A-scans. The difference with SOCT is that it requires processing of multiple, shorter signals, which differ only in a small part of samples. We have developed new algorithms for parallel signal processing for usage in SOCT, implemented with NVIDIA CUDA (Compute Unified Device Architecture). We present details of the algorithms and performance tests for analyzing data from in-house SD-OCT system. We also give a brief discussion about usefulness of developed algorithm. Presented algorithms might be useful for researchers working on OCT, as they allow to reduce computation time and are step toward real-time signal processing of SOCT data.

  5. Aggregating job exit statuses of a plurality of compute nodes executing a parallel application

    SciTech Connect

    Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Mundy, Michael B.

    2015-07-21

    Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.

  6. A Simple Physical Optics Algorithm Perfect for Parallel Computing

    NASA Technical Reports Server (NTRS)

    Imbriale, W. A.; Cwik, T.

    1993-01-01

    One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.

  7. Boundary element analysis on vector and parallel computers

    NASA Technical Reports Server (NTRS)

    Kane, J. H.

    1994-01-01

    Boundary element analysis (BEA) can be characterized as a numerical technique that generally shifts the computational burden in the analysis toward numerical integration and the solution of nonsymmetric and either dense or blocked sparse systems of algebraic equations. Researchers have explored the concept that the fundamental characteristics of BEA can be exploited to generate effective implementations on vector and parallel computers. In this paper, the results of some of these investigations are discussed. The performance of overall algorithms for BEA on vector supercomputers, massively data parallel single instruction multiple data (SIMD), and relatively fine grained distributed memory multiple instruction multiple data (MIMD) computer systems is described. Some general trends and conclusions are discussed, along with indications of future developments that may prove fruitful in this regard.

  8. Parallel computing in genomic research: advances and applications

    PubMed Central

    Ocaña, Kary; de Oliveira, Daniel

    2015-01-01

    Today’s genomic experiments have to process the so-called “biological big data” that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. PMID:26604801

  9. New Parallel computing framework for radiation transport codes

    SciTech Connect

    Kostin, M.A.; Mokhov, N.V.; Niita, K.; /JAERI, Tokai

    2010-09-01

    A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.

  10. Computing NLTE Opacities -- Node Level Parallel Calculation

    SciTech Connect

    Holladay, Daniel

    2015-09-11

    Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.

  11. Parallel distance matrix computation for Matlab data mining

    NASA Astrophysics Data System (ADS)

    Skurowski, Przemysław; Staniszewski, Michał

    2016-06-01

    The paper presents utility functions for computing of a distance matrix, which plays a crucial role in data mining. The goal in the design was to enable operating on relatively large datasets by overcoming basic shortcoming - computing time - with an interface easy to use. The presented solution is a set of functions, which were created with emphasis on practical applicability in real life. The proposed solution is presented along the theoretical background for the performance scaling. Furthermore, different approaches of the parallel computing are analyzed, including shared memory, which is uncommon in Matlab environment.

  12. Experiences with the Lanczos method on a parallel computer

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.; Fulton, Robert E.

    1987-01-01

    A parallel computer implementation of the Lanczos method for the free-vibration analysis of structures is considered, and results for two example problems show substantial time-reduction over the sequential solutions. The major Lanczos calculation tasks are subdivided into subtasks, and parallelism is introduced at the subtask level. A speedup of 7.8 on eight processors was obtained for the decomposition step of the problem involving a 60-m three-longeron space mast, and a speedup of 14.6 on 16 processors was obtained for the decomposition step of the problem involving a blade-stiffened graphite-epoxy panel.

  13. Radiation transport on unstructured mesh with parallel computers

    SciTech Connect

    Fan, W.C.; Drumm, C.R.

    2000-07-01

    This paper summarizes the developmental work on a deterministic transport code that provides multidimensional radiation transport capabilities on an unstructured mesh. The second-order form of the Boltzmann transport equation is solved utilizing the discrete ordinates angular differencing and the Galerkin finite element spatial differencing. The discretized system, which couples the spatial-angular dependence, is solved simultaneously using a parallel conjugate-gradient (CG) iterative solver. This approach eliminates the need for the conventional inner iterations over the discrete directions and is well-suited for massively parallel computers.

  14. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2015-02-03

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.

  15. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.

  16. Object-Based Parallel Framework for Scientific Computing

    NASA Astrophysics Data System (ADS)

    Pierce, Brian; Omelchenko, Y. A.

    1999-11-01

    We have developed a library of software in Fortran 90 and MPI for running simulations on massively parallel facilities. This is modeled after Omelchenko's FLAME code which was written in C++. With Fortran 90 we found several advantages, such as the array syntax and the intrinsic functions. The parallel portion of this library is achieved by dividing the data into subdomains, and distributing the subdomains among the processors to be computed concurrently (with periodic updates in neighboring region information as is necessary). The library is flexible enough so that one can use it to run simulations on any number of processors, and the user can divide up the data between the processors in an arbitrary fashion. We have tested this library for correctness and speed by using it to conduct simulations on a parallel cluster at General Atomics and on a serial workstation.

  17. Parallel block schemes for large scale least squares computations

    SciTech Connect

    Golub, G.H.; Plemmons, R.J.; Sameh, A.

    1986-04-01

    Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment of the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.

  18. 3D seismic imaging on massively parallel computers

    SciTech Connect

    Womble, D.E.; Ober, C.C.; Oldfield, R.

    1997-02-01

    The ability to image complex geologies such as salt domes in the Gulf of Mexico and thrusts in mountainous regions is a key to reducing the risk and cost associated with oil and gas exploration. Imaging these structures, however, is computationally expensive. Datasets can be terabytes in size, and the processing time required for the multiple iterations needed to produce a velocity model can take months, even with the massively parallel computers available today. Some algorithms, such as 3D, finite-difference, prestack, depth migration remain beyond the capacity of production seismic processing. Massively parallel processors (MPPs) and algorithms research are the tools that will enable this project to provide new seismic processing capabilities to the oil and gas industry. The goals of this work are to (1) develop finite-difference algorithms for 3D, prestack, depth migration; (2) develop efficient computational approaches for seismic imaging and for processing terabyte datasets on massively parallel computers; and (3) develop a modular, portable, seismic imaging code.

  19. PArallel Reacting Multiphase FLOw Computational Fluid Dynamic Analysis

    Energy Science and Technology Software Center (ESTSC)

    2002-06-01

    PARMFLO is a parallel multiphase reacting flow computational fluid dynamics (CFD) code. It can perform steady or unsteady simulations in three space dimensions. It is intended for use in engineering CFD analysis of industrial flow system components. Its parallel processing capabilities allow it to be applied to problems that use at least an order of magnitude more computational cells than the number that can be used on a typical single processor workstation (about 106 cellsmore » in parallel processing mode versus about io cells in serial processing mode). Alternately, by spreading the work of a CFD problem that could be run on a single workstation over a group of computers on a network, it can bring the runtime down by an order of magnitude or more (typically from many days to less than one day). The software was implemented using the industry standard Message-Passing Interface (MPI) and domain decomposition in one spatial direction. The phases of a flow problem may include an ideal gas mixture with an arbitrary number of chemical species, and dispersed droplet and particle phases. Regions of porous media may also be included within the domain. The porous media may be packed beds, foams, or monolith catalyst supports. With these features, the code is especially suited to analysis of mixing of reactants in the inlet chamber of catalytic reactors coupled to computation of product yields that result from the flow of the mixture through the catalyst coaled support structure.« less

  20. Distributed parallel computing in stochastic modeling of groundwater systems.

    PubMed

    Dong, Yanhui; Li, Guomin; Xu, Haizhen

    2013-03-01

    Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling. PMID:22823593

  1. PArallel Reacting Multiphase FLOw Computational Fluid Dynamic Analysis

    SciTech Connect

    Lottes, Steven A.

    2002-06-01

    PARMFLO is a parallel multiphase reacting flow computational fluid dynamics (CFD) code. It can perform steady or unsteady simulations in three space dimensions. It is intended for use in engineering CFD analysis of industrial flow system components. Its parallel processing capabilities allow it to be applied to problems that use at least an order of magnitude more computational cells than the number that can be used on a typical single processor workstation (about 106 cells in parallel processing mode versus about io cells in serial processing mode). Alternately, by spreading the work of a CFD problem that could be run on a single workstation over a group of computers on a network, it can bring the runtime down by an order of magnitude or more (typically from many days to less than one day). The software was implemented using the industry standard Message-Passing Interface (MPI) and domain decomposition in one spatial direction. The phases of a flow problem may include an ideal gas mixture with an arbitrary number of chemical species, and dispersed droplet and particle phases. Regions of porous media may also be included within the domain. The porous media may be packed beds, foams, or monolith catalyst supports. With these features, the code is especially suited to analysis of mixing of reactants in the inlet chamber of catalytic reactors coupled to computation of product yields that result from the flow of the mixture through the catalyst coaled support structure.

  2. Implementation of ADI: Schemes on MIMD parallel computers

    NASA Technical Reports Server (NTRS)

    Vanderwijngaart, Rob F.

    1993-01-01

    In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.

  3. The 2nd Symposium on the Frontiers of Massively Parallel Computations

    NASA Technical Reports Server (NTRS)

    Mills, Ronnie (Editor)

    1988-01-01

    Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.

  4. Numerical computation on massively parallel hypercubes. [Connection machine

    SciTech Connect

    McBryan, O.A.

    1986-01-01

    We describe numerical computations on the Connection Machine, a massively parallel hypercube architecture with 65,536 single-bit processors and 32 Mbytes of memory. A parallel extension of COMMON LISP, provides access to the processors and network. The rich software environment is further enhanced by a powerful virtual processor capability, which extends the degree of fine-grained parallelism beyond 1,000,000. We briefly describe the hardware and indicate the principal features of the parallel programming environment. We then present implementations of SOR, multigrid and pre-conditioned conjugate gradient algorithms for solving partial differential equations on the Connection Machine. Despite the lack of floating point hardware, computation rates above 100 megaflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly. The software development effort is also facilitated by the fact that hypercube communications prove to be fast and essentially independent of distance. 29 refs., 4 figs.

  5. Executing a gather operation on a parallel computer

    DOEpatents

    Archer, Charles J.; Ratterman, Joseph D.

    2012-03-20

    Methods, apparatus, and computer program products are disclosed for executing a gather operation on a parallel computer according to embodiments of the present invention. Embodiments include configuring, by the logical root, a result buffer or the logical root, the result buffer having positions, each position corresponding to a ranked node in the operational group and for storing contribution data gathered from that ranked node. Embodiments also include repeatedly for each position in the result buffer: determining, by each compute node of an operational group, whether the current position in the result buffer corresponds with the rank of the compute node, if the current position in the result buffer corresponds with the rank of the compute node, contributing, by that compute node, the compute node's contribution data, if the current position in the result buffer does not correspond with the rank of the compute node, contributing, by that compute node, a value of zero for the contribution data, and storing, by the logical root in the current position in the result buffer, results of a bitwise OR operation of all the contribution data by all compute nodes of the operational group for the current position, the results received through the global combining network.

  6. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A; Faraj, Daniel A

    2013-06-04

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

  7. Parallel computation with adaptive methods for elliptic and hyperbolic systems

    SciTech Connect

    Benantar, M.; Biswas, R.; Flaherty, J.E.; Shephard, M.S.

    1990-01-01

    We consider the solution of two dimensional vector systems of elliptic and hyperbolic partial differential equations on a shared memory parallel computer. For elliptic problems, the spatial domain is discretized using a finite quadtree mesh generation procedure and the differential system is discretized by a finite element-Galerkin technique with a piecewise linear polynomial basis. Resulting linear algebraic systems are solved using the conjugate gradient technique with element-by-element and symmetric successive over-relaxation preconditioners. Stiffness matrix assembly and linear system solutions are processed in parallel with computations scheduled on noncontiguous quadrants of the tree in order to minimize process synchronization. Determining noncontiguous regions by coloring the regular finite quadtree structure is far simpler than coloring elements of the unstructured mesh that the finite quadtree procedure generates. We describe linear-time complexity coloring procedures that use six and eight colors.

  8. Final Report: Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    Mellor-Crummey, John

    2011-09-13

    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  9. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A.; Faraj, Daniel A.

    2012-12-11

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

  10. Establishing a group of endpoints in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.; Xue, Hanhong

    2016-02-02

    A parallel computer executes a number of tasks, each task includes a number of endpoints and the endpoints are configured to support collective operations. In such a parallel computer, establishing a group of endpoints receiving a user specification of a set of endpoints included in a global collection of endpoints, where the user specification defines the set in accordance with a predefined virtual representation of the endpoints, the predefined virtual representation is a data structure setting forth an organization of tasks and endpoints included in the global collection of endpoints and the user specification defines the set of endpoints without a user specification of a particular endpoint; and defining a group of endpoints in dependence upon the predefined virtual representation of the endpoints and the user specification.

  11. Probabilistic structural mechanics research for parallel processing computers

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Martin, William R.

    1991-01-01

    Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods was hampered by their computationally intense nature. Solution of PSM problems requires repeated analyses of structures that are often large, and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large scale PSM problems practical.

  12. QCD on the highly parallel computer AP1000

    NASA Astrophysics Data System (ADS)

    Akemi, K.; de Forcrand, Ph.; Fujisaki, M.; Hashimoto, T.; Hege, H. C.; Hioki, S.; Makino, J.; Miyamura, O.; Nakamura, A.; Okuda, M.; Stamatescu, I. O.; Tago, Y.; Takaishi, T.; QCD TARO (QCD on Thousand cell ARray processorsOmnipurpose) Collaboration

    We have been running quenched QCD simulations on 32 4 and 32 3 × 48 lattices using a 512 processor AP1000, which is a highly parallel computer with up to 1024 processing elements. We have developed programs for update, blocking and hadron propagator calculations. The pseudo heatbath and the overrelaxation algorithms were used for the update with performance of 2.6 and 2.0 μsec/link, respectively.

  13. Multithreaded processor architecture for parallel symbolic computation. Technical report

    SciTech Connect

    Fujita, T.

    1987-09-01

    This paper describes the Multilisp Architecture for Symbolic Applications (MASA), which is a multithreaded processor architecture for parallel symbolic computation with various features intended for effective Multilisp program execution. The principal mechanisms exploited for this processor are multiple contexts, interleaved pipeline execution from separate instruction streams, and synchronization based on a bit in each memory cell. The tagged architecture approach is taken for Lisp program execution, and trap conditions are provided for future object manipulation and garbage collection.

  14. Modeling of supersonic combustor flows using parallel computing

    NASA Technical Reports Server (NTRS)

    Riggins, D.; Underwood, M.; Mcmillin, B.; Reeves, L.; Lu, E. J.-L.

    1992-01-01

    While current 3D CFD codes and modeling techniques have been shown capable of furnishing engineering data for complex scramjet flowfields, the usefulness of such efforts is primarily limited by solutions' CPU time requirements, and secondarily by memory requirements. Attention is presently given to the use of parallel computing capabilities for engineering CFD tools for the analysis of supersonic reacting flows, and to an illustrative incompressible CFD problem using up to 16 iPSC/2 processors with single-domain decomposition.

  15. Pipeline and parallel architectures for computer communication systems

    SciTech Connect

    Reddi, A.V.

    1983-01-01

    Various existing communication precessor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.

  16. Rapid indirect trajectory optimization on highly parallel computing architectures

    NASA Astrophysics Data System (ADS)

    Antony, Thomas

    Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical

  17. Parallel Computation of the Regional Ocean Modeling System (ROMS)

    SciTech Connect

    Wang, P; Song, Y T; Chao, Y; Zhang, H

    2005-04-05

    The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds of processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.

  18. Domain decomposition methods for the parallel computation of reacting flows

    NASA Technical Reports Server (NTRS)

    Keyes, David E.

    1988-01-01

    Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate for it approximately 10-fold speedup on 16 processors.

  19. A Parallel Iterative Method for Computing Molecular Absorption Spectra.

    PubMed

    Koval, Peter; Foerster, Dietrich; Coulaud, Olivier

    2010-09-14

    We describe a fast parallel iterative method for computing molecular absorption spectra within TDDFT linear response and using the LCAO method. We use a local basis of "dominant products" to parametrize the space of orbital products that occur in the LCAO approach. In this basis, the dynamic polarizability is computed iteratively within an appropriate Krylov subspace. The iterative procedure uses a matrix-free GMRES method to determine the (interacting) density response. The resulting code is about 1 order of magnitude faster than our previous full-matrix method. This acceleration makes the speed of our TDDFT code comparable with codes based on Casida's equation. The implementation of our method uses hybrid MPI and OpenMP parallelization in which load balancing and memory access are optimized. To validate our approach and to establish benchmarks, we compute spectra of large molecules on various types of parallel machines. The methods developed here are fairly general, and we believe they will find useful applications in molecular physics/chemistry, even for problems that are beyond TDDFT, such as organic semiconductors, particularly in photovoltaics. PMID:26616067

  20. Local rollback for fault-tolerance in parallel computing systems

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Gara, Alan; Giampapa, Mark E.; Heidelberger, Philip; Ohmacht, Martin; Steinmacher-Burow, Burkhard; Sugavanam, Krishnan

    2012-01-24

    A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.

  1. Parallel and vector computation for stochastic optimal control applications

    NASA Technical Reports Server (NTRS)

    Hanson, F. B.

    1989-01-01

    A general method for parallel and vector numerical solutions of stochastic dynamic programming problems is described for optimal control of general nonlinear, continuous time, multibody dynamical systems, perturbed by Poisson as well as Gaussian random white noise. Possible applications include lumped flight dynamics models for uncertain environments, such as large scale and background random atmospheric fluctuations. The numerical formulation is highly suitable for a vector multiprocessor or vectorizing supercomputer, and results exhibit high processor efficiency and numerical stability. Advanced computing techniques, data structures, and hardware help alleviate Bellman's curse of dimensionality in dynamic programming computations.

  2. Performing an allreduce operation on a plurality of compute nodes of a parallel computer

    DOEpatents

    Faraj, Ahmad

    2012-04-17

    Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer. Each compute node includes at least two processing cores. Each processing core has contribution data for the allreduce operation. Performing an allreduce operation on a plurality of compute nodes of a parallel computer includes: establishing one or more logical rings among the compute nodes, each logical ring including at least one processing core from each compute node; performing, for each logical ring, a global allreduce operation using the contribution data for the processing cores included in that logical ring, yielding a global allreduce result for each processing core included in that logical ring; and performing, for each compute node, a local allreduce operation using the global allreduce results for each processing core on that compute node.

  3. Runtime optimization of an application executing on a parallel computer

    DOEpatents

    Faraj, Daniel A.; Smith, Brian E.

    2013-01-29

    Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.

  4. Runtime optimization of an application executing on a parallel computer

    DOEpatents

    Faraj, Daniel A; Smith, Brian E

    2014-11-18

    Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.

  5. Runtime optimization of an application executing on a parallel computer

    DOEpatents

    Faraj, Daniel A; Smith, Brian E

    2014-11-25

    Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.

  6. Use Computer-Aided Tools to Parallelize Large CFD Applications

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Yan, J.

    2000-01-01

    Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progressing in hardware architectures and increasing complexity of real applications in recent years, the problem becomes even more sever. Today, scalability and high performance are mostly involving handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, show good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, it can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often failed on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome the deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO is aimed at taking advantage of detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of

  7. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel A.

    2014-07-22

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  8. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2013-09-03

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  9. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A; Mamidala, Amith R

    2014-02-11

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  10. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-09-16

    Eager send data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address.

  11. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-09-02

    Eager send data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address.

  12. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    SciTech Connect

    Nash, T.

    1989-05-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs.

  13. Data communications for a collective operation in a parallel active messaging interface of a parallel computer

    SciTech Connect

    Faraj, Daniel A.

    2015-11-19

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  14. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Davis, Kristan D; Faraj, Daniel A

    2013-07-09

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  15. Data communications for a collective operation in a parallel active messaging interface of a parallel computer

    DOEpatents

    Faraj, Daniel A

    2013-07-16

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  16. Semi-coarsening multigrid methods for parallel computing

    SciTech Connect

    Jones, J.E.

    1996-12-31

    Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.

  17. Parallel matrix transpose algorithms on distributed memory concurrent computers

    SciTech Connect

    Choi, J.; Walker, D.W.; Dongarra, J.J. |

    1993-10-01

    This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A{center_dot}B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T}{center_dot}B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

  18. DMA shared byte counters in a parallel computer

    DOEpatents

    Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

    2010-04-06

    A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.

  19. Final Report: Super Instruction Architecture for Scalable Parallel Computations

    SciTech Connect

    Sanders, Beverly Ann; Bartlett, Rodney; Deumens, Erik

    2013-12-23

    The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACESIII software package and used it to study interesting scientific problems.

  20. Determining collective barrier operation skew in a parallel computer

    SciTech Connect

    Faraj, Daniel A.

    2015-11-24

    Determining collective barrier operation skew in a parallel computer that includes a number of compute nodes organized into an operational group includes: for each of the nodes until each node has been selected as a delayed node: selecting one of the nodes as a delayed node; entering, by each node other than the delayed node, a collective barrier operation; entering, after a delay by the delayed node, the collective barrier operation; receiving an exit signal from a root of the collective barrier operation; and measuring, for the delayed node, a barrier completion time. The barrier operation skew is calculated by: identifying, from the compute nodes' barrier completion times, a maximum barrier completion time and a minimum barrier completion time and calculating the barrier operation skew as the difference of the maximum and the minimum barrier completion time.

  1. 3D finite-difference seismic migration with parallel computers

    SciTech Connect

    Ober, C.C.; Gjertsen, R.; Minkoff, S.; Womble, D.E.

    1998-11-01

    The ability to image complex geologies such as salt domes in the Gulf of Mexico and thrusts in mountainous regions is essential for reducing the risk associated with oil exploration. Imaging these structures, however, is computationally expensive as datasets can be terabytes in size. Traditional ray-tracing migration methods cannot handle complex velocity variations commonly found near such salt structures. Instead the authors use the full 3D acoustic wave equation, discretized via a finite difference algorithm. They reduce the cost of solving the apraxial wave equation by a number of numerical techniques including the method of fractional steps and pipelining the tridiagonal solves. The imaging code, Salvo, uses both frequency parallelism (generally 90% efficient) and spatial parallelism (65% efficient). Salvo has been tested on synthetic and real data and produces clear images of the subsurface even beneath complicated salt structures.

  2. Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    John Mellor-Crummey

    2008-02-29

    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implemention of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.

  3. Parallel computation of automatic differentiation applied to magnetic field calculations

    SciTech Connect

    Hinkins, R.L. |

    1994-09-01

    The author presents a parallelization of an accelerator physics application to simulate magnetic field in three dimensions. The problem involves the evaluation of high order derivatives with respect to two variables of a multivariate function. Automatic differentiation software had been used with some success, but the computation time was prohibitive. The implementation runs on several platforms, including a network of workstations using PVM, a MasPar using MPFortran, and a CM-5 using CMFortran. A careful examination of the code led to several optimizations that improved its serial performance by a factor of 8.7. The parallelization produced further improvements, especially on the MasPar with a speedup factor of 620. As a result a problem that took six days on a SPARC 10/41 now runs in minutes on the MasPar, making it feasible for physicists at Lawrence Berkeley Laboratory to simulate larger magnets.

  4. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    SciTech Connect

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  5. Parallel computation of multigroup reactivity coefficient using iterative method

    SciTech Connect

    Susmikanti, Mike; Dewayatna, Winter

    2013-09-09

    One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.

  6. Parallel computation of multigroup reactivity coefficient using iterative method

    NASA Astrophysics Data System (ADS)

    Susmikanti, Mike; Dewayatna, Winter

    2013-09-01

    One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.

  7. Review of An Introduction to Parallel and Vector Scientific Computing

    SciTech Connect

    Bailey, David H.; Lefton, Lew

    2006-06-30

    On one hand, the field of high-performance scientific computing is thriving beyond measure. Performance of leading-edge systems on scientific calculations, as measured say by the Top500 list, has increased by an astounding factor of 8000 during the 15-year period from 1993 to 2008, which is slightly faster even than Moore's Law. Even more importantly, remarkable advances in numerical algorithms, numerical libraries and parallel programming environments have led to improvements in the scope of what can be computed that are entirely on a par with the advances in computing hardware. And these successes have spread far beyond the confines of large government-operated laboratories, many universities, modest-sized research institutes and private firms now operate clusters that differ only in scale from the behemoth systems at the large-scale facilities. In the wake of these recent successes, researchers from fields that heretofore have not been part of the scientific computing world have been drawn into the arena. For example, at the recent SC07 conference, the exhibit hall, which long has hosted displays from leading computer systems vendors and government laboratories, featured some 70 exhibitors who had not previously participated. In spite of all these exciting developments, and in spite of the clear need to present these concepts to a much broader technical audience, there is a perplexing dearth of training material and textbooks in the field, particularly at the introductory level. Only a handful of universities offer coursework in the specific area of highly parallel scientific computing, and instructors of such courses typically rely on custom-assembled material. For example, the present reviewer and Robert F. Lucas relied on materials assembled in a somewhat ad-hoc fashion from colleagues and personal resources when presenting a course on parallel scientific computing at the University of California, Berkeley, a few years ago. Thus it is indeed refreshing to see

  8. Hyper-parallel photonic quantum computation with coupled quantum dots

    NASA Astrophysics Data System (ADS)

    Ren, Bao-Cang; Deng, Fu-Guo

    2014-04-01

    It is well known that a parallel quantum computer is more powerful than a classical one. So far, there are some important works about the construction of universal quantum logic gates, the key elements in quantum computation. However, they are focused on operating on one degree of freedom (DOF) of quantum systems. Here, we investigate the possibility of achieving scalable hyper-parallel quantum computation based on two DOFs of photon systems. We construct a deterministic hyper-controlled-not (hyper-CNOT) gate operating on both the spatial-mode and the polarization DOFs of a two-photon system simultaneously, by exploiting the giant optical circular birefringence induced by quantum-dot spins in double-sided optical microcavities as a result of cavity quantum electrodynamics (QED). This hyper-CNOT gate is implemented by manipulating the four qubits in the two DOFs of a two-photon system without auxiliary spatial modes or polarization modes. It reduces the operation time and the resources consumed in quantum information processing, and it is more robust against the photonic dissipation noise, compared with the integration of several cascaded CNOT gates in one DOF.

  9. An experiment in hurricane track prediction using parallel computing methods

    NASA Technical Reports Server (NTRS)

    Song, Chang G.; Jwo, Jung-Sing; Lakshmivarahan, S.; Dhall, S. K.; Lewis, John M.; Velden, Christopher S.

    1994-01-01

    The barotropic model is used to explore the advantages of parallel processing in deterministic forecasting. We apply this model to the track forecasting of hurricane Elena (1985). In this particular application, solutions to systems of elliptic equations are the essence of the computational mechanics. One set of equations is associated with the decomposition of the wind into irrotational and nondivergent components - this determines the initial nondivergent state. Another set is associated with recovery of the streamfunction from the forecasted vorticity. We demonstrate that direct parallel methods based on accelerated block cyclic reduction (BCR) significantly reduce the computational time required to solve the elliptic equations germane to this decomposition and forecast problem. A 72-h track prediction was made using incremental time steps of 16 min on a network of 3000 grid points nominally separated by 100 km. The prediction took 30 sec on the 8-processor Alliant FX/8 computer. This was a speed-up of 3.7 when compared to the one-processor version. The 72-h prediction of Elena's track was made as the storm moved toward Florida's west coast. Approximately 200 km west of Tampa Bay, Elena executed a dramatic recurvature that ultimately changed its course toward the northwest. Although the barotropic track forecast was unable to capture the hurricane's tight cycloidal looping maneuver, the subsequent northwesterly movement was accurately forecasted as was the location and timing of landfall near Mobile Bay.

  10. Large Scale EM Wave Propagation Analysis using FDTD Parallel Computation on Computer System for Education

    NASA Astrophysics Data System (ADS)

    Sonoda, Jun

    This paper describes the study of a fast electromagnetic (EM) wave propagation analysis that can solve electrically large domains using finite difference time domain (FDTD) method on cluster of personal computers (PC cluster). It reports an implementation of parallel FDTD using an MPI library on PC clusters of the computer system for education. Use of this method demonstrates that the speed-up ratio achieved for problem size 1200 × 1200 is about 55.0 using FDTD on 80 PCs. And also, indoor propagation of UWB pulse on the floor (1095.4λ× 98.6λ) is analyzed by the parallel FDTD using 40 PCs, computational time and memory have been reduced by 1/36.4 and 1/39.9, respectively. The results demonstrate that the parallel FDTD using PC cluster can analyze electrically large problems low computational costs than novel FDTD.

  11. Domain decomposition methods for the parallel computation of reacting flows

    NASA Astrophysics Data System (ADS)

    Keyes, David E.

    1989-05-01

    Domain decomposition is a natural route to parallel computing for partial differential equation solvers. In this procedure, subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, we make comparisons between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demostrate for it approximately 10-fold speedup on 16 processsors. The three special features of reacting flow models in relation to these linear systems are: the possibly large number of degrees of freedom per gridpoint, the dominance of dense intra-point source-term coupling over inter-point convective-diffusive coupling throughout significant portions of the flow-field and strong nonlinearities which restrict the time step to small values (independent of linear algebraic considerations) throughout significant portions of the iteration history. Though these features are exploited to advantage herein, many aspects of the paper are applicable to the modeling of general convective-diffusive systems.

  12. An efficient parallel algorithm for accelerating computational protein design

    PubMed Central

    Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang

    2014-01-01

    Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991

  13. Parallel Computation of Persistent Homology using the Blowup Complex

    SciTech Connect

    Lewis, Ryan; Morozov, Dmitriy

    2015-04-27

    We describe a parallel algorithm that computes persistent homology, an algebraic descriptor of a filtered topological space. Our algorithm is distinguished by operating on a spatial decomposition of the domain, as opposed to a decomposition with respect to the filtration. We rely on a classical construction, called the Mayer--Vietoris blowup complex, to glue global topological information about a space from its disjoint subsets. We introduce an efficient algorithm to perform this gluing operation, which may be of independent interest, and describe how to process the domain hierarchically. We report on a set of experiments that help assess the strengths and identify the limitations of our method.

  14. Optimal cooperative CubeSat maneuvers obtained through parallel computing

    NASA Astrophysics Data System (ADS)

    Ghosh, Alexander; Coverstone, Victoria

    2015-02-01

    CubeSats, the class of small standardized satellites, are quickly becoming a prevalent scientific research tool. The desire to perform ambitious missions using multiple CubeSats will lead to innovations in thruster technology and will require new tools for the development of cooperative trajectory planning. To meet this need, a new software tool was created to compute propellant-minimizing maneuvers for two or more CubeSats. By including parallelization techniques, this tool is shown to run significantly faster than its serial counterpart.

  15. Routing performance analysis and optimization within a massively parallel computer

    DOEpatents

    Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen

    2013-04-16

    An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.

  16. Parallel biomolecular computation on surfaces with advanced finite automata.

    PubMed

    Soreni, Michal; Yogev, Sivan; Kossoy, Elizaveta; Shoham, Yuval; Keinan, Ehud

    2005-03-23

    A biomolecular, programmable 3-symbol-3-state finite automaton is reported. This automaton computes autonomously with all of its components, including hardware, software, input, and output being biomolecules mixed together in solution. The hardware consisted of two enzymes: an endonuclease, BbvI, and T4 DNA ligase. The software (transition rules represented by transition molecules) and the input were double-stranded (ds) DNA oligomers. Computation was carried out by autonomous processing of the input molecules via repetitive cycles of restriction, hybridization, and ligation reactions to produce a final-state output in the form of a dsDNA molecule. The 3-symbol-3-state deterministic automaton is an extension of the 2-symbol-2-state automaton previously reported, and theoretically it can be further expanded to a 37-symbol-3-state automaton. The applicability of this design was further amplified by employing surface-anchored input molecules, using the surface plasmon resonance technology to monitor the computation steps in real time. Computation was performed by alternating the feed solutions between endonuclease and a solution containing the ligase, ATP, and appropriate transition molecules. The output detection involved final ligation with one of three soluble detection molecules. Parallel computation and stepwise detection were carried out automatically with a Biacore chip that was loaded with four different inputs. PMID:15771530

  17. Application Specific Performance Technology for Productive Parallel Computing

    SciTech Connect

    Malony, Allen D.; Shende, Sameer

    2008-09-30

    Our accomplishments over the last three years of the DOE project Application- Specific Performance Technology for Productive Parallel Computing (DOE Agreement: DE-FG02-05ER25680) are described below. The project will have met all of its objectives by the time of its completion at the end of September, 2008. Two extensive yearly progress reports were produced in in March 2006 and 2007 and were previously submitted to the DOE Office of Advanced Scientific Computing Research (OASCR). Following an overview of the objectives of the project, we summarize for each of the project areas the achievements in the first two years, and then describe in some more detail the project accomplishments this past year. At the end, we discuss the relationship of the proposed renewal application to the work done on the current project.

  18. Signal processing applications of massively parallel charge domain computing devices

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Barhen, Jacob (Inventor); Toomarian, Nikzad (Inventor)

    1999-01-01

    The present invention is embodied in a charge coupled device (CCD)/charge injection device (CID) architecture capable of performing a Fourier transform by simultaneous matrix vector multiplication (MVM) operations in respective plural CCD/CID arrays in parallel in O(1) steps. For example, in one embodiment, a first CCD/CID array stores charge packets representing a first matrix operator based upon permutations of a Hartley transform and computes the Fourier transform of an incoming vector. A second CCD/CID array stores charge packets representing a second matrix operator based upon different permutations of a Hartley transform and computes the Fourier transform of an incoming vector. The incoming vector is applied to the inputs of the two CCD/CID arrays simultaneously, and the real and imaginary parts of the Fourier transform are produced simultaneously in the time required to perform a single MVM operation in a CCD/CID array.

  19. Applications of massively parallel computers in telemetry processing

    NASA Technical Reports Server (NTRS)

    El-Ghazawi, Tarek A.; Pritchard, Jim; Knoble, Gordon

    1994-01-01

    Telemetry processing refers to the reconstruction of full resolution raw instrumentation data with artifacts, of space and ground recording and transmission, removed. Being the first processing phase of satellite data, this process is also referred to as level-zero processing. This study is aimed at investigating the use of massively parallel computing technology in providing level-zero processing to spaceflights that adhere to the recommendations of the Consultative Committee on Space Data Systems (CCSDS). The workload characteristics, of level-zero processing, are used to identify processing requirements in high-performance computing systems. An example of level-zero functions on a SIMD MPP, such as the MasPar, is discussed. The requirements in this paper are based in part on the Earth Observing System (EOS) Data and Operation System (EDOS).

  20. Factorization of large integers on a massively parallel computer

    SciTech Connect

    Davis, J.A.; Holdridge, D.B.

    1988-01-01

    Our interest in integer factorization at Sandia National Laboratories is motivated by cryptographic applications and in particular the security of the RSA encryption-decryption algorithm. We have implemented our version of the quadratic sieve procedure on the NCUBE computer with 1024 processors (nodes). The new code is significantly different in all important aspects from the program used to factor number of order 10/sup 70/ on a single processor CRAY computer. Capabilities of parallel processing and limitation of small local memory necessitated this entirely new implementation. This effort involved several restarts as realizations of program structures that seemed appealing bogged down due to inter-processor communications. We are presently working with integers of magnitude about 10/sup 70/ in tuning this code to the novel hardware. 6 refs., 3 figs.

  1. On the relationship between parallel computation and graph embedding

    SciTech Connect

    Gupta, A.K.

    1989-01-01

    The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sized binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.

  2. Energy Proportionality and Performance in Data Parallel Computing Clusters

    SciTech Connect

    Kim, Jinoh; Chou, Jerry; Rotem, Doron

    2011-02-14

    Energy consumption in datacenters has recently become a major concern due to the rising operational costs andscalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumedby the server nodes must be proportional to the amount of work performed. For data parallelism and fault tolerancepurposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A coveringset is a group of nodes that together contain at least one replica of the data blocks needed for performing computing tasks. In thiswork, we develop and analyze algorithms to maintain energy proportionality by discovering a covering set that minimizesenergy consumption while placing the remaining nodes in lowpower standby mode. Our algorithms can also discover coveringsets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that itcan discover k-covering sets, i.e., a set of nodes that contain at least k replicas of the data blocks. Our experimental results showthat we can achieve substantial energy saving without significant performance loss in diverse cluster configurations and workingenvironments.

  3. Parallel Molecular Dynamics Stencil : a new parallel computing environment for a large-scale molecular dynamics simulation of solids

    NASA Astrophysics Data System (ADS)

    Shimizu, Futoshi; Kimizuka, Hajime; Kaburaki, Hideo

    2002-08-01

    A new parallel computing environment, called as ``Parallel Molecular Dynamics Stencil'', has been developed to carry out a large-scale short-range molecular dynamics simulation of solids. The stencil is written in C language using MPI for parallelization and designed successfully to separate and conceal parts of the programs describing cutoff schemes and parallel algorithms for data communication. This has been made possible by introducing the concept of image atoms. Therefore, only a sequential programming of the force calculation routine is required for executing the stencil in parallel environment. Typical molecular dynamics routines, such as various ensembles, time integration methods, and empirical potentials, have been implemented in the stencil. In the presentation, the performance of the stencil on parallel computers of Hitachi, IBM, SGI, and PC-cluster using the models of Lennard-Jones and the EAM type potentials for fracture problem will be reported.

  4. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  5. Parallel Computation of Unsteady Flows on a Network of Workstations

    NASA Technical Reports Server (NTRS)

    1997-01-01

    Parallel computation of unsteady flows requires significant computational resources. The utilization of a network of workstations seems an efficient solution to the problem where large problems can be treated at a reasonable cost. This approach requires the solution of several problems: 1) the partitioning and distribution of the problem over a network of workstation, 2) efficient communication tools, 3) managing the system efficiently for a given problem. Of course, there is the question of the efficiency of any given numerical algorithm to such a computing system. NPARC code was chosen as a sample for the application. For the explicit version of the NPARC code both two- and three-dimensional problems were studied. Again both steady and unsteady problems were investigated. The issues studied as a part of the research program were: 1) how to distribute the data between the workstations, 2) how to compute and how to communicate at each node efficiently, 3) how to balance the load distribution. In the following, a summary of these activities is presented. Details of the work have been presented and published as referenced.

  6. Low latency, high bandwidth data communications between compute nodes in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2010-11-02

    Methods, parallel computers, and computer program products are disclosed for low latency, high bandwidth data communications between compute nodes in a parallel computer. Embodiments include receiving, by an origin direct memory access (`DMA`) engine of an origin compute node, data for transfer to a target compute node; sending, by the origin DMA engine of the origin compute node to a target DMA engine on the target compute node, a request to send (`RTS`) message; transferring, by the origin DMA engine, a predetermined portion of the data to the target compute node using memory FIFO operation; determining, by the origin DMA engine whether an acknowledgement of the RTS message has been received from the target DMA engine; if the an acknowledgement of the RTS message has not been received, transferring, by the origin DMA engine, another predetermined portion of the data to the target compute node using a memory FIFO operation; and if the acknowledgement of the RTS message has been received by the origin DMA engine, transferring, by the origin DMA engine, any remaining portion of the data to the target compute node using a direct put operation.

  7. Cloud identification using genetic algorithms and massively parallel computation

    NASA Technical Reports Server (NTRS)

    Buckles, Bill P.; Petry, Frederick E.

    1996-01-01

    As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user

  8. Seismic imaging using finite-differences and parallel computers

    SciTech Connect

    Ober, C.C.

    1997-12-31

    A key to reducing the risks and costs of associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Prestack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar wave equation using finite differences. As part of an ongoing ACTI project funded by the US Department of Energy, a finite difference, 3-D prestack, depth migration code has been developed. The goal of this work is to demonstrate that massively parallel computers can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite difference, prestack, depth migration practical for oil and gas exploration. Several problems had to be addressed to get an efficient code for the Intel Paragon. These include efficient I/O, efficient parallel tridiagonal solves, and high single-node performance. Furthermore, to provide portable code the author has been restricted to the use of high-level programming languages (C and Fortran) and interprocessor communications using MPI. He has been using the SUNMOS operating system, which has affected many of his programming decisions. He will present images created from two verification datasets (the Marmousi Model and the SEG/EAEG 3D Salt Model). Also, he will show recent images from real datasets, and point out locations of improved imaging. Finally, he will discuss areas of current research which will hopefully improve the image quality and reduce computational costs.

  9. Pacing a data transfer operation between compute nodes on a parallel computer

    DOEpatents

    Blocksome, Michael A.

    2011-09-13

    Methods, systems, and products are disclosed for pacing a data transfer between compute nodes on a parallel computer that include: transferring, by an origin compute node, a chunk of an application message to a target compute node; sending, by the origin compute node, a pacing request to a target direct memory access (`DMA`) engine on the target compute node using a remote get DMA operation; determining, by the origin compute node, whether a pacing response to the pacing request has been received from the target DMA engine; and transferring, by the origin compute node, a next chunk of the application message if the pacing response to the pacing request has been received from the target DMA engine.

  10. A Testbed of Parallel Kernels for Computer Science Research

    SciTech Connect

    Bailey, David; Demmel, James; Ibrahim, Khaled; Kaiser, Alex; Koniges, Alice; Madduri, Kamesh; Shalf, John; Strohmaier, Erich; Williams, Samuel

    2010-04-30

    For several decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models for optimal performance, efficiency, and productivity. Unfortunately, this guidance is most often taken from the existing software/hardware ecosystem. Architects attempt to provide micro-architectural solutions to improve performance on fixed binaries. Researchers tweak compilers to improve code generation for existing architectures and implementations, and they may invent new programming models for fixed processor and memory architectures and computational algorithms. In today's rapidly evolving world of on-chip parallelism, these isolated and iterative improvements to performance may miss superior solutions in the same way gradient descent optimization techniques may get stuck in local minima. In an initial study, we have developed an alternate approach that, rather than starting with an existing hardware/software solution laced with hidden assumptions, defines the computational problems of interest and invites architects, researchers and programmers to implement novel hardware/ software co-designed solutions. Our work builds on the previous ideas of computational dwarfs, motifs, and parallel patterns by selecting a representative set of essential problems for which we provide: An algorithmic description; scalable problem definition; illustrative reference implementations; verification schemes. For simplicity, we focus initially on the computational problems of interest to the scientific computing community but proclaim the methodology (and perhaps a subset of the problems) as applicable to other communities. We intend to broaden the coverage of this problem space through stronger community involvement. Previous work has established a broad categorization of numerical methods of interest to the scientific computing, in the spirit of the NAS Benchmarks, which pioneered the basic idea of a 'pencil and paper benchmark' in the 1990s. The

  11. Implementation of Parallel Computing Technology to Vortex Flow

    NASA Technical Reports Server (NTRS)

    Dacles-Mariani, Jennifer

    1999-01-01

    Mainframe supercomputers such as the Cray C90 was invaluable in obtaining large scale computations using several millions of grid points to resolve salient features of a tip vortex flow over a lifting wing. However, real flight configurations require tracking not only of the flow over several lifting wings but its growth and decay in the near- and intermediate- wake regions, not to mention the interaction of these vortices with each other. Resolving and tracking the evolution and interaction of these vortices shed from complex bodies is computationally intensive. Parallel computing technology is an attractive option in solving these flows. In planetary science vortical flows are also important in studying how planets and protoplanets form when cosmic dust and gases become gravitationally unstable and eventually form planets or protoplanets. The current paradigm for the formation of planetary systems maintains that the planets accreted from the nebula of gas and dust left over from the formation of the Sun. Traditional theory also indicate that such a preplanetary nebula took the form of flattened disk. The coagulation of dust led to the settling of aggregates toward the midplane of the disk, where they grew further into asteroid-like planetesimals. Some of the issues still remaining in this process are the onset of gravitational instability, the role of turbulence in the damping of particles and radial effects. In this study the focus will be with the role of turbulence and the radial effects.

  12. Ocean predictability studies in a parallel computing environment. Final report

    SciTech Connect

    Leary, R.H.

    1992-11-01

    The goal of the SDSC effort described here is to evaluate the performance potential of the Oberhuber isopycnal (OPYC) ocean global circulation model on the 64-node iPSC/860 parallel computer at SDSC and its near term successor, the Intel Paragon, relative to that of a single vector processor of a CRAY Y-MP and its near term successor, the CRAY C90. This effort is in support of a larger joint project with researchers at Scripps Institution of Oceanography to study the properties of long (10 to 100-year scale), computationally intensive integrations of ocean global circulation models. Generally, performance of the OPYC model on the iPSC/860 has proved quite disappointing, with a simplified version of the entire model running at approximately 1.4 Mflops on a single i860 node in its original form and after extensive optimization, including coding in assembler, at 3.2 Mflops. The author estimates overall performance to be limited to 75 Mflops or less on the 64-node machine, as compared to 180 Mflops for a single CRAY Y-MP processor and 500 Mflops for a single CRAY C90 processor. Similarly, he believes an implementation on an Intel Paragon, even with several hundred nodes, will not be competitive with a single processor C90 implementation. Additionally, the Hamburg Large Scale Geostrophic (LSG) model has been implemented on high-performance workstations and evaluated for its computational potential as an alternative to OPYC for the long term integrations.

  13. Parallel computer graphics algorithms for the Connection Machine

    SciTech Connect

    Richardson, J.F.

    1990-01-01

    Many of the classes of computer graphics algorithms and polygon storage schemes can be adapted for parallel execution on various parallel architectures. The connection machine is one such architecture that should be thought of as a multiprocessor grid that can be reconfigured into standard 2-dimensional mesh and n-dimensional hypercube architectures. The classes of algorithms considered in this paper are SPLINES; POLYGON STORAGE; TRIANGULARIZATION; and SYMBOLIC INPUT. The target Connection Machine (hearafter designated as CM) for the algorithms of this paper has 8192 physical processors. Each physical processor has 8 kilobytes of local memory plus an arithmetic-logic unit. All processors can communicate with any other processor through a router. Thus this CM has a shared memory of 64 megabytes when used as a standard multiprocessor (MIMD) architecture. In addition, the CM interconnection structure can simulate a 2-dimensional mesh and n-dimensional hypercube (SIMD) architecture with the mesh being the default architecture. The front end for the CM is a Symbolics and the high level language is LISP or FORTRAN.

  14. Accelerating Astronomy & Astrophysics in the New Era of Parallel Computing: GPUs, Phi and Cloud Computing

    NASA Astrophysics Data System (ADS)

    Ford, Eric B.; Dindar, Saleh; Peters, Jorg

    2015-08-01

    The realism of astrophysical simulations and statistical analyses of astronomical data are set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures, requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network network characteristics.I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than order of magnitude speed-ups and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs.I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer

  15. A high performance parallel computing architecture for robust image features

    NASA Astrophysics Data System (ADS)

    Zhou, Renyan; Liu, Leibo; Wei, Shaojun

    2014-03-01

    A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.

  16. On implicit Runge-Kutta methods for parallel computations

    NASA Technical Reports Server (NTRS)

    Keeling, Stephen L.

    1987-01-01

    Implicit Runge-Kutta methods which are well-suited for parallel computations are characterized. It is claimed that such methods are first of all, those for which the associated rational approximation to the exponential has distinct poles, and these are called multiply explicit (MIRK) methods. Also, because of the so-called order reduction phenomenon, there is reason to require that these poles be real. Then, it is proved that a necessary condition for a q-stage, real MIRK to be A sub 0-stable with maximal order q + 1 is that q = 1, 2, 3, or 5. Nevertheless, it is shown that for every positive integer q, there exists a q-stage, real MIRK which is I-stable with order q. Finally, some useful examples of algebraically stable MIRKs are given.

  17. Optimized collectives using a DMA on a parallel computer

    DOEpatents

    Chen, Dong; Gabor, Dozsa; Giampapa, Mark E.; Heidelberger; Phillip

    2011-02-08

    Optimizing collective operations using direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.

  18. A Generic Scheduling Simulator for High Performance Parallel Computers

    SciTech Connect

    Yoo, B S; Choi, G S; Jette, M A

    2001-08-01

    It is well known that efficient job scheduling plays a crucial role in achieving high system utilization in large-scale high performance computing environments. A good scheduling algorithm should schedule jobs to achieve high system utilization while satisfying various user demands in an equitable fashion. Designing such a scheduling algorithm is a non-trivial task even in a static environment. In practice, the computing environment and workload are constantly changing. There are several reasons for this. First, the computing platforms constantly evolve as the technology advances. For example, the availability of relatively powerful commodity off-the-shelf (COTS) components at steadily diminishing prices have made it feasible to construct ever larger massively parallel computers in recent years [1, 4]. Second, the workload imposed on the system also changes constantly. The rapidly increasing compute resources have provided many applications developers with the opportunity to radically alter program characteristics and take advantage of these additional resources. New developments in software technology may also trigger changes in user applications. Finally, political climate change may alter user priorities or the mission of the organization. System designers in such dynamic environments must be able to accurately forecast the effect of changes in the hardware, software, and/or policies under consideration. If the environmental changes are significant, one must also reassess scheduling algorithms. Simulation has frequently been relied upon for this analysis, because other methods such as analytical modeling or actual measurements are usually too difficult or costly. A drawback of the simulation approach, however, is that developing a simulator is a time-consuming process. Furthermore, an existing simulator cannot be easily adapted to a new environment. In this research, we attempt to develop a generic job-scheduling simulator, which facilitates the evaluation of

  19. Routing Application for Parallel computatIon of Discharge

    NASA Astrophysics Data System (ADS)

    David, C. H.; Maidment, D. R.; Yang, Z.

    2008-12-01

    Today, meteorological models can predict storm events and climate patterns, but there are few models that connect atmospheric models to the hydraulics of rivers. Such a model is necessary to advance the prediction of events such as floods or droughts. Furthermore, the increasingly available Geographic Information System-based hydrographic datasets offer ways to use actual mapped river for routing. Such GIS-based datasets include NHDPlus at the continental-scale [USEPA and USGS, 2007]and HydroSHEDS at the global- scale [Lehner, et al., 2006]. The objective of ongoing work is to develop RAPID (Routing Application for Parallel computatIon of Discharge), a large-scale river routing model that: - has physical representation of river flow - allows for coupling with both land surface models and groundwater models. In particular enabling bi-directional exchanges between rivers and aquifers through a computation of water volume and flow on a reach-to-reach basis - has some specific treatment for man-made infrastructures and anthropogenic actions (dams, pumping) - benefits from the latest scientific computing developments (supercomputing facilities and high performance parallel computing libraries) - will benefit from the increasingly available Geographic Information System hydrographic databases. RAPID has already been tested over France through a ten-year coupling with SIM-France [Habets, et al., 2008, using a river network made of 24,264 river reaches. RAPID has been adapted to run on the Lonestar supercomputer (http://www.tacc.utexas.edu/resources/hpcsystems/) and to allow the use of the NHDPlus dataset. RAPID is now being tested with the 74,615 river reaches of the entire Texas Gulf. This type of innovative model has strong implications for the future of hydrology. Such a tool can help improve the understanding of the effect of climate change on water resources as well as provide information on how many gages are needed and where are they needed the most. More broadly

  20. Parallel Computation of Orbit Determination for Space Debris Population

    NASA Astrophysics Data System (ADS)

    Olmedo, Estrella; Sanchez-Ortiz, Noelia; Ramos-Lerate, Mercedes

    2009-03-01

    In this work we present an algorithm for computing Orbit Determination for Space Debris population. The method presents a high degree of parallelism. That means that the number of available computers divides the computational effort. The context of this work and the later scope is to have the capability of cataloguing and correlating the Space Debris population. In this sense, as better the accuracy provided by the orbit determination is, more accurate will be the estimation of the state vectors corresponding to the debris objects and better will be the accuracy of the future catalogue of Space Debris. As more objects we can determinate the corresponding orbit, more complete will be the future catalogue. Therefore numerical tools for orbit determination are a key point in the development of a future ESSAS. The first time that a new object is observed, six measurements (these measurements may come from RADAR, Ground Based Telescope or Space Based Telescope) are required for computing an Initial Orbit Determination (IOD). After that, the Initial Estimated State Vector (IESV) is improved within the next-coming measurement. The idea of this method is the following. From six initial measurements, we compute the IOD following the same ideas of [1]. We compute also the initial knowledge covariance matrix (IKCM) corresponding to the IESV. In general, the numerical error of the IOD is too big for processing the following measurements with a conventional numerical filter (like the Square Root Information Filter (SRIF)). The problem is that the improvement of the accuracy in the IOD is not an easy task in those cases with large initial error. However the computed IKCM give a realistic approximation of the committed error in the IOD. The proposed algorithm uses the IKCM for generating a cloud of IESVs. All the IESV inside the cloud are processed with a new and much smaller IKCM by using SRIF. In such a way that the ones that are close enough to the real state vector (and thus

  1. Parallel Computation of the Topology of Level Sets

    SciTech Connect

    Pascucci, V; Cole-McLaughlin, K

    2004-12-16

    we can compute the Contour Tree in linear time in many practical cases where t = O(n{sup 1-{epsilon}}). We report the running times for a parallel implementation, showing good scalability with the number of processors.

  2. Low cost, highly effective parallel computing achieved through a Beowulf cluster.

    PubMed

    Bitner, Marc; Skelton, Gordon

    2003-01-01

    A Beowulf cluster is a means of bringing together several computers and using software and network components to make this cluster of computers appear and function as one computer with multiple parallel computing processors. A cluster of computers can provide comparable computing power usually found only in very expensive super computers or servers. PMID:12724866

  3. Scalable High Performance Computing: Direct and Large-Eddy Turbulent Flow Simulations Using Massively Parallel Computers

    NASA Technical Reports Server (NTRS)

    Morgan, Philip E.

    2004-01-01

    This final report contains reports of research related to the tasks "Scalable High Performance Computing: Direct and Lark-Eddy Turbulent FLow Simulations Using Massively Parallel Computers" and "Devleop High-Performance Time-Domain Computational Electromagnetics Capability for RCS Prediction, Wave Propagation in Dispersive Media, and Dual-Use Applications. The discussion of Scalable High Performance Computing reports on three objectives: validate, access scalability, and apply two parallel flow solvers for three-dimensional Navier-Stokes flows; develop and validate a high-order parallel solver for Direct Numerical Simulations (DNS) and Large Eddy Simulation (LES) problems; and Investigate and develop a high-order Reynolds averaged Navier-Stokes turbulence model. The discussion of High-Performance Time-Domain Computational Electromagnetics reports on five objectives: enhancement of an electromagnetics code (CHARGE) to be able to effectively model antenna problems; utilize lessons learned in high-order/spectral solution of swirling 3D jets to apply to solving electromagnetics project; transition a high-order fluids code, FDL3DI, to be able to solve Maxwell's Equations using compact-differencing; develop and demonstrate improved radiation absorbing boundary conditions for high-order CEM; and extend high-order CEM solver to address variable material properties. The report also contains a review of work done by the systems engineer.

  4. Dynamic modeling of Tampa Bay urban development using parallel computing

    USGS Publications Warehouse

    Xian, G.; Crane, M.; Steinwand, D.

    2005-01-01

    Urban land use and land cover has changed significantly in the environs of Tampa Bay, Florida, over the past 50 years. Extensive urbanization has created substantial change to the region's landscape and ecosystems. This paper uses a dynamic urban-growth model, SLEUTH, which applies six geospatial data themes (slope, land use, exclusion, urban extent, transportation, hillside), to study the process of urbanization and associated land use and land cover change in the Tampa Bay area. To reduce processing time and complete the modeling process within an acceptable period, the model is recoded and ported to a Beowulf cluster. The parallel-processing computer system accomplishes the massive amount of computation the modeling simulation requires. SLEUTH calibration process for the Tampa Bay urban growth simulation spends only 10 h CPU time. The model predicts future land use/cover change trends for Tampa Bay from 1992 to 2025. Urban extent is predicted to double in the Tampa Bay watershed between 1992 and 2025. Results show an upward trend of urbanization at the expense of a decline of 58% and 80% in agriculture and forested lands, respectively. ?? 2005 Elsevier Ltd. All rights reserved.

  5. Modeling the fracture of ice sheets on parallel computers.

    SciTech Connect

    Waisman, Haim; Bell, Robin; Keyes, David; Boman, Erik Gunnar; Tuminaro, Raymond Stephen

    2010-03-01

    The objective of this project is to investigate the complex fracture of ice and understand its role within larger ice sheet simulations and global climate change. At the present time, ice fracture is not explicitly considered within ice sheet models due in part to large computational costs associated with the accurate modeling of this complex phenomena. However, fracture not only plays an extremely important role in regional behavior but also influences ice dynamics over much larger zones in ways that are currently not well understood. Dramatic illustrations of fracture-induced phenomena most notably include the recent collapse of ice shelves in Antarctica (e.g. partial collapse of the Wilkins shelf in March of 2008 and the diminishing extent of the Larsen B shelf from 1998 to 2002). Other fracture examples include ice calving (fracture of icebergs) which is presently approximated in simplistic ways within ice sheet models, and the draining of supraglacial lakes through a complex network of cracks, a so called ice sheet plumbing system, that is believed to cause accelerated ice sheet flows due essentially to lubrication of the contact surface with the ground. These dramatic changes are emblematic of the ongoing change in the Earth's polar regions and highlight the important role of fracturing ice. To model ice fracture, a simulation capability will be designed centered around extended finite elements and solved by specialized multigrid methods on parallel computers. In addition, appropriate dynamic load balancing techniques will be employed to ensure an approximate equal amount of work for each processor.

  6. Analysis of composite ablators using massively parallel computation

    NASA Technical Reports Server (NTRS)

    Shia, David

    1995-01-01

    In this work, the feasibility of using massively parallel computation to study the response of ablative materials is investigated. Explicit and implicit finite difference methods are used on a massively parallel computer, the Thinking Machines CM-5. The governing equations are a set of nonlinear partial differential equations. The governing equations are developed for three sample problems: (1) transpiration cooling, (2) ablative composite plate, and (3) restrained thermal growth testing. The transpiration cooling problem is solved using a solution scheme based solely on the explicit finite difference method. The results are compared with available analytical steady-state through-thickness temperature and pressure distributions and good agreement between the numerical and analytical solutions is found. It is also found that a solution scheme based on the explicit finite difference method has the following advantages: incorporates complex physics easily, results in a simple algorithm, and is easily parallelizable. However, a solution scheme of this kind needs very small time steps to maintain stability. A solution scheme based on the implicit finite difference method has the advantage that it does not require very small times steps to maintain stability. However, this kind of solution scheme has the disadvantages that complex physics cannot be easily incorporated into the algorithm and that the solution scheme is difficult to parallelize. A hybrid solution scheme is then developed to combine the strengths of the explicit and implicit finite difference methods and minimize their weaknesses. This is achieved by identifying the critical time scale associated with the governing equations and applying the appropriate finite difference method according to this critical time scale. The hybrid solution scheme is then applied to the ablative composite plate and restrained thermal growth problems. The gas storage term is included in the explicit pressure calculation of both

  7. Iterative algorithms for large sparse linear systems on parallel computers

    NASA Technical Reports Server (NTRS)

    Adams, L. M.

    1982-01-01

    Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.

  8. Massively parallel computational fluid dynamics calculations for aerodynamics and aerothermodynamics applications

    SciTech Connect

    Payne, J.L.; Hassan, B.

    1998-09-01

    Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.

  9. Dynamic remapping decisions in multi-phase parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, D. M.; Reynolds, P. F., Jr.

    1986-01-01

    The effectiveness of any given mapping of workload to processors in a parallel system is dependent on the stochastic behavior of the workload. Program behavior is often characterized by a sequence of phases, with phase changes occurring unpredictably. During a phase, the behavior is fairly stable, but may become quite different during the next phase. Thus a workload assignment generated for one phase may hinder performance during the next phase. We consider the problem of deciding whether to remap a paralled computation in the face of uncertainty in remapping's utility. Fundamentally, it is necessary to balance the expected remapping performance gain against the delay cost of remapping. This paper treats this problem formally by constructing a probabilistic model of a computation with at most two phases. We use stochastic dynamic programming to show that the remapping decision policy which minimizes the expected running time of the computation has an extremely simple structure: the optimal decision at any step is followed by comparing the probability of remapping gain against a threshold. This theoretical result stresses the importance of detecting a phase change, and assessing the possibility of gain from remapping. We also empirically study the sensitivity of optimal performance to imprecise decision threshold. Under a wide range of model parameter values, we find nearly optimal performance if remapping is chosen simply when the gain probability is high. These results strongly suggest that except in extreme cases, the remapping decision problem is essentially that of dynamically determining whether gain can be achieved by remapping after a phase change; precise quantification of the decision model parameters is not necessary.

  10. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, J.

    1999-01-01

    A new atmospheric objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 1 X 1 lat-lon grid with 18 levels of heights and winds and 10 levels of moisture) using 120,000 observations in 17 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system is totally portable and can run on several different architectures at once. In addition, the system can easily scale up to 100 or more CPUS. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from 1 to 32 CPUs is 18%. In addition, the analysis results are identical regardless of the number of processors used. This system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. Static tests with a 2 X 2.5 resolution version of this system showed it's analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from several months of cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (O-F statistics) as the current operational system.

  11. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, James G.

    1999-01-01

    A new objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 2 x 2.5 lat-lon grid with 20 levels of heights and winds and 10 levels of moisture) using 120,000 observations in less than 3 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system Ls totally portable and can run on -several different architectures at once. In addition, the system can easily scale up to 100 or more CPUS. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from I to 32 CPus is 18%. in addition, the analysis results are identical regardless of the number of processors used. T'his system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. It also includes a new quality control (buddy check) system. Static tests with the system showed it's analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from a 2-month cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (0-F statistics) throughout the entire two months.

  12. Massively parallel computation of RCS with finite elements

    NASA Technical Reports Server (NTRS)

    Parker, Jay

    1993-01-01

    One of the promising combinations of finite element approaches for scattering problems uses Whitney edge elements, spherical vector wave-absorbing boundary conditions, and bi-conjugate gradient solution for the frequency-domain near field. Each of these approaches may be criticized. Low-order elements require high mesh density, but also result in fast, reliable iterative convergence. Spherical wave-absorbing boundary conditions require additional space to be meshed beyond the most minimal near-space region, but result in fully sparse, symmetric matrices which keep storage and solution times low. Iterative solution is somewhat unpredictable and unfriendly to multiple right-hand sides, yet we find it to be uniformly fast on large problems to date, given the other two approaches. Implementation of these approaches on a distributed memory, message passing machine yields huge dividends, as full scalability to the largest machines appears assured and iterative solution times are well-behaved for large problems. We present times and solutions for computed RCS for a conducting cube and composite permeability/conducting sphere on the Intel ipsc860 with up to 16 processors solving over 200,000 unknowns. We estimate problems of approximately 10 million unknowns, encompassing 1000 cubic wavelengths, may be attempted on a currently available 512 processor machine, but would be exceedingly tedious to prepare. The most severe bottlenecks are due to the slow rate of mesh generation on non-parallel machines and the large transfer time from such a machine to the parallel processor. One solution, in progress, is to create and then distribute a coarse mesh among the processors, followed by systematic refinement within each processor. Elimination of redundant node definitions at the mesh-partition surfaces, snap-to-surface post processing of the resulting mesh for good modelling of curved surfaces, and load-balancing redistribution of new elements after the refinement are auxiliary

  13. Nexus: An interoperability layer for parallel and distributed computer systems

    SciTech Connect

    Foster, I.; Kesselman, C.; Olson, R.; Tuecke, S.

    1994-05-01

    Nexus is a set of services that can be used to implement various task-parallel languages, data-parallel languages, and message-passing libraries. Nexus is designed to permit the efficient portable implementation of individual parallel programming systems and the interoperability of programs developed with different tools. Nexus supports lightweight threading and active message technology, allowing integration of message passing and threads.

  14. A class of parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1989-01-01

    Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.

  15. Parallel In Situ Indexing for Data-intensive Computing

    SciTech Connect

    Kim, Jinoh; Abbasi, Hasan; Chacon, Luis; Docan, Ciprian; Klasky, Scott; Liu, Qing; Podhorszki, Norbert; Shoshani, Arie; Wu, Kesheng

    2011-09-09

    As computing power increases exponentially, vast amount of data is created by many scientific re- search activities. However, the bandwidth for storing the data to disks and reading the data from disks has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing relatively small fraction of data records. As data sets increase in sizes, more and more analysts need to use selective data access, which makes indexing an even more important for improving data access. The challenge is that most implementations of in- dexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt the in situ processing technology to generate the indexes, thus removing the need of read- ing data from disks and to build indexes in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve the data access time up to 200 times depending on the fraction of data selected, and using in situ data processing system can effectively reduce the time needed to create the indexes, up to 10 times with our in situ technique when using identical parallel settings.

  16. Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel

    2016-05-03

    In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.

  17. Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel A.

    2016-03-01

    In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.

  18. BRaTS@Home and BOINC Distributed Computing for Parallel Computation

    NASA Astrophysics Data System (ADS)

    Coss, David Raymond; Flores, R.

    2008-09-01

    Utilizing Internet connectivity, the Berkeley Open Infrastructure for Network Computing (BOINC) provides parallel computing power without the expense of purchasing a computer cluster. BOINC, written in C++, is an open source system, acting as an intermediary between the project server and the BOINC client on the volunteer's computer. By using the idle time of computers of volunteer participants, BOINC allows scientists to build a computer cluster at the price of one server. As an example of such computational capabilities, I have developed BRaTS@Home, standing for BRaTS Ray Trace Simulation, using the BOINC distributed computing system to perform gravitational lensing ray-tracing simulations. Though BRaTS@Home is only one of many projects, 182 users in 26 different countries participate in the project. From June 2007 to April 2008, 795 computers have connected to the project server, providing an average computing power of 1.1 billion floating point operations per second(FLOPS), while the entire BOINC system averages over 1000 teraFLOPS, as of April 2008. Preliminary results of the project's gravitational ray-tracing simulations will be shown.

  19. SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008

    SciTech Connect

    2008-09-08

    The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.

  20. Parallel processing architecture for computing inverse differential kinematic equations of the PUMA arm

    NASA Technical Reports Server (NTRS)

    Hsia, T. C.; Lu, G. Z.; Han, W. H.

    1987-01-01

    In advanced robot control problems, on-line computation of inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be inplemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.

  1. Parallel multiphysics algorithms and software for computational nuclear engineering

    NASA Astrophysics Data System (ADS)

    Gaston, D.; Hansen, G.; Kadioglu, S.; Knoll, D. A.; Newman, C.; Park, H.; Permann, C.; Taitano, W.

    2009-07-01

    There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.

  2. Parallelizing Sylvester-like operations on a distributed memory computer

    SciTech Connect

    Hu, D.Y.; Sorensen, D.C.

    1994-12-31

    Discretization of linear operators arising in applied mathematics often leads to matrices with the following structure: M(x) = (D {circle_times} A + B {circle_times} I{sub n} + V)x, where x {element_of} R{sup mn}, B, D {element_of} R{sup nxn}, A {element_of} R{sup mxm} and V {element_of} R{sup mnxmn}; both D and V are diagonal. For the notational convenience, the authors assume that both A and B are symmetric. All the results through this paper can be easily extended to the cases with general A and B. The linear operator on R{sup mn} defined above can be viewed as a generalization of the Sylvester operator: S(x) = (I{sub m} {circle_times} A + B {circle_times} I{sub n})x. The authors therefore refer to it as a Sylvester-like operator. The schemes discussed in this paper therefore also apply to Sylvester operator. In this paper, the authors present the SIMD scheme for parallelization of the Sylvester-like operator on a distributed memory computer. This scheme is designed to approach the best possible efficiency by avoiding unnecessary communication among processors.

  3. Three-dimensional radiative transfer on a massively parallel computer

    NASA Astrophysics Data System (ADS)

    Vath, H. M.

    1994-04-01

    We perform 3D radiative transfer calculations in non-local thermodynamic equilibrium (NLTE) in the simple two-level atom approximation on the Mas-Par MP-1, which contains 8192 processors and is a single instruction multiple data (SIMD) machine, an example of the new generation of massively parallel computers. On such a machine, all processors execute the same command at a given time, but on different data. To make radiative transfer calculations efficient, we must re-consider the numerical methods and storage of data. To solve the transfer equation, we adopt the short characteristic method and examine different acceleration methods to obtain the source function. We use the ALI method and test local and non-local operators. Furthermore, we compare the Ng and the orthomin methods of acceleration. We also investigate the use of multi-grid methods to get fast solutions for the NLTE case. In order to test these numerical methods, we apply them to two problems with and without periodic boundary conditions.

  4. Parallel computing of a climate model on the dawn 1000 by domain decomposition method

    NASA Astrophysics Data System (ADS)

    Bi, Xunqiang

    1997-12-01

    In this paper the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massive parallel computer made by National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. The potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputation are also discussed.

  5. Migration Effects of Parallel Genetic Algorithms on Line Topologies of Heterogeneous Computing Resources

    NASA Astrophysics Data System (ADS)

    Gong, Yiyuan; Guan, Senlin; Nakamura, Morikazu

    This paper investigates migration effects of parallel genetic algorithms (GAs) on the line topology of heterogeneous computing resources. Evolution process of parallel GAs is evaluated experimentally on two types of arrangements of heterogeneous computing resources: the ascending and descending order arrangements. Migration effects are evaluated from the viewpoints of scalability, chromosome diversity, migration frequency and solution quality. The results reveal that the performance of parallel GAs strongly depends on the design of the chromosome migration in which we need to consider the arrangement of heterogeneous computing resources, the migration frequency and so on. The results contribute to provide referential scheme of implementation of parallel GAs on heterogeneous computing resources.

  6. Full tensor gravity gradiometry data inversion: Performance analysis of parallel computing algorithms

    NASA Astrophysics Data System (ADS)

    Hou, Zhen-Long; Wei, Xiao-Hui; Huang, Da-Nian; Sun, Xu

    2015-09-01

    We apply reweighted inversion focusing to full tensor gravity gradiometry data using message-passing interface (MPI) and compute unified device architecture (CUDA) parallel computing algorithms, and then combine MPI with CUDA to formulate a hybrid algorithm. Parallel computing performance metrics are introduced to analyze and compare the performance of the algorithms. We summarize the rules for the performance evaluation of parallel algorithms. We use model and real data from the Vinton salt dome to test the algorithms. We find good match between model and real density data, and verify the high efficiency and feasibility of parallel computing algorithms in the inversion of full tensor gravity gradiometry data.

  7. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  8. Research in Computational Aeroscience Applications Implemented on Advanced Parallel Computing Systems

    NASA Technical Reports Server (NTRS)

    Wigton, Larry

    1996-01-01

    Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.

  9. Aerodynamic Influence Coefficient Computations Using Euler/Navier-Stokes Equations on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Farhangnia, Mehrdad; Bhatia, Kumar; Guruswamy, Guru; VanDalsem, William R. (Technical Monitor)

    1996-01-01

    Modem design requirements for an aircraft push current technologies used in the design process to their limit or sometimes require more advanced technologies to meet the requirement. New design requirements always demand to improve the operational performance. Accurate prediction of aerodynamic coefficients is essential to improve the performance. For example, in the design of an advanced subsonic civil transport, since the fluid flow at transonic regime shows strong nonlinearities, high fidelity equations, such as the Euler or Navier-Stokes equations predict flow characteristics more accurately than the linear aerodynamics, which are widely used in the current design process However, high fidelity flow equations are computationally expensive and require an order of magnitude longer time to obtain aerodynamic coefficients required in the design. Parallel computing is one possibility to cut down the computational turn-around time in using high fidelity equations so that high fidelity equations would be incorporated into the design process. By doing so, high fidelity equations would be used in the routine design process. This work will demonstrate the feasibility of using high fidelity flow equations in a design process by computing aerodynamic influence coefficients of a wing-body-empennage configuration on a multiple-instruction, multiple-data parallel computer.

  10. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-11

    Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  11. Parallel MIMD computation. The HEP supercomputer and its application

    SciTech Connect

    Kawalik, J.S.

    1985-01-01

    This book examines the Denelcor Heterogeneous Element Processor. Topics considered include HEP architecture, performance, implicit parallelism detection, the SISAL programming language, execution support for HEP SISAL, logic programming, data-flow techniques, solving ordinary differential equations, parallel algorithms for recurrence and tridiagonal equations, hydrocodes, research at Los Alamos, and the solution of boundary-value problems on the HEP.

  12. Efficient graph algorithms for sequential and parallel computers. Doctoral thesis

    SciTech Connect

    Goldberg, A.V.

    1987-02-01

    This thesis studies graph algorithms, both in sequential and parallel contexts. In the outline of the thesis, algorithm complexities are stated in terms of the the number of vertices n, the number of edges m, the largest absolute value of capacities U, and the largest absolute value of costs C. Chapter 1 introduces a new approach to the maximum flow problem that leads to better algorithms for the problem. Chapter 2 is devoted to the minimum cost flow problem, which is a generalization of the maximum flow problem. Chapter 3 addresses implementation of parallel algorithms through a case study of an implementation of a parallel maximum flow algorithm. Parallel prefix operations play an important role in the implementation. Present experimental results achieved by the implementation are presented. Present parallel symmetry-breaking techniques are the main topic of Chapter 4.

  13. User documentation for PVODE, an ODE solver for parallel computers

    SciTech Connect

    Hindmarsh, A.C., LLNL

    1998-05-01

    PVODE is a general purpose ordinary differential equation (ODE) solver for stiff and nonstiff ODES It is based on CVODE [5] [6], which is written in ANSI- standard C PVODE uses MPI (Message-Passing Interface) [8] and a revised version of the vector module in CVODE to achieve parallelism and portability PVODE is intended for the SPMD (Single Program Multiple Data) environment with distributed memory, in which all vectors are identically distributed across processors In particular, the vector module is designed to help the user assign a contiguous segment of a given vector to each of the processors for parallel computation The idea is for each processor to solve a certain fixed subset of the ODES To better understand PVODE, we first need to understand CVODE and its historical background The ODE solver CVODE, which was written by Cohen and Hindmarsh, combines features of two earlier Fortran codes, VODE [l] and VODPK [3] Those two codes were written by Brown, Byrne, and Hindmarsh. Both use variable-coefficient multi-step integration methods, and address both stiff and nonstiff systems (Stiffness is defined as the presence of one or more very small damping time constants ) VODE uses direct linear algebraic techniques to solve the underlying banded or dense linear systems of equations in conjunction with a modified Newton method in the stiff ODE case On the other hand, VODPK uses a preconditioned Krylov iterative method [2] to solve the underlying linear system User-supplied preconditioners directly address the dominant source of stiffness Consequently, CVODE implements both the direct and iterative methods Currently, with regard to the nonlinear and linear system solution, PVODE has three method options available. functional iteration, Newton iteration with a diagonal approximate Jacobian, and Newton iteration with the iterative method SPGMR (Scaled Preconditioned Generalized Minimal Residual method) Both CVODE and PVODE are written in such a way that other linear

  14. Evaluation of DEC`s GIGAswitch for distributed parallel computing

    SciTech Connect

    Chen, H.; Hutchins, J.; Brandt, J.

    1993-10-01

    One of Sandia`s research efforts is to reduce the end-to-end communication delay in a parallel-distributed computing environment. GIGAswitch is DEC`s implementation of a gigabit local area network based on switched FDDI technology. Using the GIGAswitch, the authors intend to minimize the medium access latency suffered by shared-medium FDDI technology. Experimental results show that the GIGAswitch adds 16.5 microseconds of switching and bridging delay to an end-to-end communication. Although the added latency causes a 1.8% throughput degradation and a 5% line efficiency degradation, the availability of dedicated bandwidth is much more than what is available to a workstation on a shared medium. For example, ten directly connected workstations each would have a dedicated bandwidth of 95 Mbps, but if they were sharing the FDDI bandwidth, each would have 10% of the total bandwidth, i.e., less than 10 Mbps. In addition, they have found that when there is no output port contention, the switch`s aggregate bandwidth will scale up to multiples of its port bandwidth. However, with output port contention, the throughput and latency performance suffered significantly. Their mathematical and simulation models indicate that the GIGAswitch line efficiency could be as low as 63% when there are nine input ports contending for the same output port. The data indicate that the delay introduced by contention at the server workstation is 50 times that introduced by the GIGAswitch. The authors conclude that the GIGAswitch meets the performance requirements of today`s high-end workstations and that the switched FDDI technology provides an alternative that utilizes existing workstation interfaces while increasing the aggregate bandwidth. However, because the speed of workstations is increasing by a factor of 2 every 1.5 years, the switched FDDI technology is only good as an interim solution.

  15. Algorithmic support for commodity-based parallel computing systems.

    SciTech Connect

    Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann

    2003-10-01

    The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.

  16. Three-Dimensional Radiative Transfer on a Massively Parallel Computer.

    NASA Astrophysics Data System (ADS)

    Vath, Horst Michael

    1994-01-01

    We perform three-dimensional radiative transfer calculations on the MasPar MP-1, which contains 8192 processors and is a single instruction multiple data (SIMD) machine, an example of the new generation of massively parallel computers. To make radiative transfer calculations efficient, we must re-consider the numerical methods and methods of storage of data that have been used with serial machines. We developed a numerical code which efficiently calculates images and spectra of astrophysical systems as seen from different viewing directions and at different wavelengths. We use this code to examine a number of different astrophysical systems. First we image the HI distribution of model galaxies. Then we investigate the galaxy NGC 5055, which displays a radial asymmetry in its optical appearance. This can be explained by the presence of dust in the outer HI disk far beyond the optical disk. As the formation of dust is connected to the presence of stars, the existence of dust in outer regions of this galaxy could have consequences for star formation at a time when this galaxy was just forming. Next we use the code for polarized radiative transfer. We first discuss the numerical computation of the required cyclotron opacities and use them to calculate spectra of AM Her systems, binaries containing accreting magnetic white dwarfs. Then we obtain spectra of an extended polar cap. Previous calculations did not consider the three -dimensional extension of the shock. We find that this results in a significant underestimate of the radiation emitted in the shock. Next we calculate the spectrum of the intermediate polar RE 0751+14. For this system we obtain a magnetic field of ~10 MG, which has consequences for the evolution of intermediate polars. Finally we perform 3D radiative transfer in NLTE in the two-level atom approximation. To solve the transfer equation in this case, we adapt the short characteristic method and examine different acceleration methods to obtain the

  17. Parallel paving: An algorithm for generating distributed, adaptive, all-quadrilateral meshes on parallel computers

    SciTech Connect

    Lober, R.R.; Tautges, T.J.; Vaughan, C.T.

    1997-03-01

    Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving`s meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia`s {open_quotes}tiling{close_quotes} dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.

  18. Applications of the Aurora parallel Prolog system to computational molecular biology

    SciTech Connect

    Lusk, E.L.; Overbeek, R.; Mudambi, S.; Szeredi, P.

    1993-09-01

    We describe an investigation into the use of the Aurora parallel Prolog system in two applications within the area of computational molecular biology. The computational requirements were large, due to the nature of the applications, and were large, due to the nature of the applications, and were carried out on a scalable parallel computer the BBN ``Butterfly`` TC-2000. Results include both a demonstration that logic programming can be effective in the context of demanding applications on large-scale parallel machines, and some insights into parallel programming in Prolog.

  19. The finite element machine - An assessment of the impact of parallel computing on future finite element computations

    NASA Technical Reports Server (NTRS)

    Fulton, R. E.

    1986-01-01

    The requirements of complex aerospace vehicles combined with the age of structural analysis systems enhance the need to advance technology toward a new generation of structural analysis capability. Recent and impeding advances in parallel and supercomputers provide the opportunity to significantly improve these structural analysis capabilities for large order finite element problems. Long-term research in parallel computing, associated with the NASA Finite Element Machine project, is discussed. The results show the potential of parallel computers to provide substantial increases in computation speed over sequential computers. Results are given for sample problems in the areas of eigenvalue analysis and transient response.

  20. Experimental free-space optical network for massively parallel computers

    NASA Astrophysics Data System (ADS)

    Araki, S.; Kajita, M.; Kasahara, K.; Kubota, K.; Kurihara, K.; Redmond, I.; Schenfeld, E.; Suzaki, T.

    1996-03-01

    A free-space optical interconnection scheme is described for massively parallel processors based on the interconnection-cached network architecture. The optical network operates in a circuit-switching mode. Combined with a packet-switching operation among the circuit-switched optical channels, a high-bandwidth, low-latency network for massively parallel processing results. The design and assembly of a 64-channel experimental prototype is discussed, and operational results are presented.

  1. Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation

    PubMed Central

    Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho

    2014-01-01

    The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting. PMID:27081299

  2. Cooperative storage of shared files in a parallel computing system with dynamic block size

    DOEpatents

    Bent, John M.; Faibish, Sorin; Grider, Gary

    2015-11-10

    Improved techniques are provided for parallel writing of data to a shared object in a parallel computing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallel computing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).

  3. Parallel computation of a maximum-likelihood estimator of a physical map.

    PubMed Central

    Bhandarkar, S M; Machaka, S A; Shete, S S; Kota, R N

    2001-01-01

    Reconstructing a physical map of a chromosome from a genomic library presents a central computational problem in genetics. Physical map reconstruction in the presence of errors is a problem of high computational complexity that provides the motivation for parallel computing. Parallelization strategies for a maximum-likelihood estimation-based approach to physical map reconstruction are presented. The estimation procedure entails a gradient descent search for determining the optimal spacings between probes for a given probe ordering. The optimal probe ordering is determined using a stochastic optimization algorithm such as simulated annealing or microcanonical annealing. A two-level parallelization strategy is proposed wherein the gradient descent search is parallelized at the lower level and the stochastic optimization algorithm is simultaneously parallelized at the higher level. Implementation and experimental results on a distributed-memory multiprocessor cluster running the parallel virtual machine (PVM) environment are presented using simulated and real hybridization data. PMID:11238392

  4. High Performance Input/Output for Parallel Computer Systems

    NASA Technical Reports Server (NTRS)

    Ligon, W. B.

    1996-01-01

    The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, The Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX- like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications fiom levels 1,2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.

  5. Beyond input-output computings: error-driven emergence with parallel non-distributed slime mold computer.

    PubMed

    Aono, Masashi; Gunji, Yukio-Pegio

    2003-10-01

    The emergence derived from errors is the key importance for both novel computing and novel usage of the computer. In this paper, we propose an implementable experimental plan for the biological computing so as to elicit the emergent property of complex systems. An individual plasmodium of the true slime mold Physarum polycephalum acts in the slime mold computer. Modifying the Elementary Cellular Automaton as it entails the global synchronization problem upon the parallel computing provides the NP-complete problem solved by the slime mold computer. The possibility to solve the problem by giving neither all possible results nor explicit prescription of solution-seeking is discussed. In slime mold computing, the distributivity in the local computing logic can change dynamically, and its parallel non-distributed computing cannot be reduced into the spatial addition of multiple serial computings. The computing system based on exhaustive absence of the super-system may produce, something more than filling the vacancy. PMID:14563567

  6. A comparative study of serial and parallel aeroelastic computations of wings

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Guruswamy, Guru P.

    1994-01-01

    A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.

  7. Chaining direct memory access data transfer operations for compute nodes in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.

    2010-09-28

    Methods, systems, and products are disclosed for chaining DMA data transfer operations for compute nodes in a parallel computer that include: receiving, by an origin DMA engine on an origin node in an origin injection FIFO buffer for the origin DMA engine, a RGET data descriptor specifying a DMA transfer operation data descriptor on the origin node and a second RGET data descriptor on the origin node, the second RGET data descriptor specifying a target RGET data descriptor on the target node, the target RGET data descriptor specifying an additional DMA transfer operation data descriptor on the origin node; creating, by the origin DMA engine, an RGET packet in dependence upon the RGET data descriptor, the RGET packet containing the DMA transfer operation data descriptor and the second RGET data descriptor; and transferring, by the origin DMA engine to a target DMA engine on the target node, the RGET packet.

  8. Performing an allreduce operation on a plurality of compute nodes of a parallel computer

    DOEpatents

    Faraj, Ahmad

    2013-07-09

    Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer, each node including at least two processing cores, that include: establishing, for each node, a plurality of logical rings, each ring including a different set of at least one core on that node, each ring including the cores on at least two of the nodes; iteratively for each node: assigning each core of that node to one of the rings established for that node to which the core has not previously been assigned, and performing, for each ring for that node, a global allreduce operation using contribution data for the cores assigned to that ring or any global allreduce results from previous global allreduce operations, yielding current global allreduce results for each core; and performing, for each node, a local allreduce operation using the global allreduce results.

  9. Performing an allreduce operation on a plurality of compute nodes of a parallel computer

    DOEpatents

    Faraj, Ahmad

    2013-02-12

    Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer, each node including at least two processing cores, that include: performing, for each node, a local reduction operation using allreduce contribution data for the cores of that node, yielding, for each node, a local reduction result for one or more representative cores for that node; establishing one or more logical rings among the nodes, each logical ring including only one of the representative cores from each node; performing, for each logical ring, a global allreduce operation using the local reduction result for the representative cores included in that logical ring, yielding a global allreduce result for each representative core included in that logical ring; and performing, for each node, a local broadcast operation using the global allreduce results for each representative core on that node.

  10. POET on DAISy: Experiences in parallel computing on commodity workstation clusters

    SciTech Connect

    Durant, J.L.; Yam, C.; Bui-Pham, M.; Wyckoff, P.; Armstrong, R.

    1997-12-31

    Recent years have seen rapid increases in the power of parallel computers. However, easy, efficient use of these resources has been hampered by lack of appropriate software tools. To address this, we have developed POET, the Parallel Object-oriented Environment and Toolkit. POET is a frame-based approach to scientific computing on parallel platforms. It seeks to insulate the user from the details of implementing an efficient parallel solution to the user`s problem, freeing the user to concentrate on the details of the physics behind the model. We will discuss our use of POET on DAISy, our cluster of Pentium Pro workstations.

  11. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  12. POET (parallel object-oriented environment and toolkit) and frameworks for scientific distributed computing

    SciTech Connect

    Armstrong, R.; Cheung, A.

    1997-01-01

    Frameworks for parallel computing have recently become popular as a means for preserving parallel algorithms as reusable components. Frameworks for parallel computing in general, and POET in particular, focus on finding ways to orchestrate and facilitate cooperation between components that implement the parallel algorithms. Since performance is a key requirement for POET applications, CORBA or CORBA-like systems are eschewed for a SPMD message-passing architecture common to the world of distributed-parallel computing. Though the system is written in C++ for portability, the behavior of POET is more like a classical framework, such as Smalltalk. POET seeks to be a general platform for scientific parallel algorithm components which can be modified, linked, mixed and matched to a user`s specification. The purpose of this work is to identify a means for parallel code reuse and to make parallel computing more accessible to scientists whose expertise is outside the field of parallel computing. The POET framework provides two things: (1) an object model for parallel components that allows cooperation without being restrictive; (2) services that allow components to access and manage user data and message-passing facilities, etc. This work has evolved through application of a series of real distributed-parallel scientific problems. The paper focuses on what is required for parallel components to cooperate and at the same time remain ``black-boxes`` that users can drop into the frame without having to know the exquisite details of message-passing, data layout, etc. The paper walks through a specific example of a chemically reacting flow application. The example is implemented in POET and the authors identify component cooperation, usability and reusability in an anecdotal fashion.

  13. Locating hardware faults in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-01-12

    Hardware faults location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.

  14. Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

    NASA Technical Reports Server (NTRS)

    Weeks, Cindy Lou

    1986-01-01

    Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.

  15. Wing-Body Aeroelasticity Using Finite-Difference Fluid/Finite-Element Structural Equations on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Guruswamy, Guru P.; Kutler, Paul (Technical Monitor)

    1994-01-01

    In recent years significant advances have been made for parallel computers in both hardware and software. Now parallel computers have become viable tools in computational mechanics. Many application codes developed on conventional computers have been modified to benefit from parallel computers. Significant speedups in some areas have been achieved by parallel computations. For single-discipline use of both fluid dynamics and structural dynamics, computations have been made on wing-body configurations using parallel computers. However, only a limited amount of work has been completed in combining these two disciplines for multidisciplinary applications. The prime reason is the increased level of complication associated with a multidisciplinary approach. In this work, procedures to compute aeroelasticity on parallel computers using direct coupling of fluid and structural equations will be investigated for wing-body configurations. The parallel computer selected for computations is an Intel iPSC/860 computer which is a distributed-memory, multiple-instruction, multiple data (MIMD) computer with 128 processors. In this study, the computational efficiency issues of parallel integration of both fluid and structural equations will be investigated in detail. The fluid and structural domains will be modeled using finite-difference and finite-element approaches, respectively. Results from the parallel computer will be compared with those from the conventional computers using a single processor. This study will provide an efficient computational tool for the aeroelastic analysis of wing-body structures on MIMD type parallel computers.

  16. From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators

    NASA Astrophysics Data System (ADS)

    Yim, Keun Soo

    This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of

  17. Execution models for mapping programs onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.

  18. JPARSS: A Java Parallel Network Package for Grid Computing

    SciTech Connect

    Chen, Jie; Akers, Walter; Chen, Ying; Watson, William

    2002-03-01

    The emergence of high speed wide area networks makes grid computinga reality. However grid applications that need reliable data transfer still have difficulties to achieve optimal TCP performance due to network tuning of TCP window size to improve bandwidth and to reduce latency on a high speed wide area network. This paper presents a Java package called JPARSS (Java Parallel Secure Stream (Socket)) that divides data into partitions that are sent over several parallel Java streams simultaneously and allows Java or Web applications to achieve optimal TCP performance in a grid environment without the necessity of tuning TCP window size. This package enables single sign-on, certificate delegation and secure or plain-text data transfer using several security components based on X.509 certificate and SSL. Several experiments will be presented to show that using Java parallelstreams is more effective than tuning TCP window size. In addition a simple architecture using Web services

  19. Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment

    PubMed Central

    2014-01-01

    Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel

  20. MEDUSA - An overset grid flow solver for network-based parallel computer systems

    NASA Technical Reports Server (NTRS)

    Smith, Merritt H.; Pallis, Jani M.

    1993-01-01

    Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallel computer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallel computers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallel computer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.

  1. Parallel of low-level computer vision algorithms on a multi-DSP system

    NASA Astrophysics Data System (ADS)

    Liu, Huaida; Jia, Pingui; Li, Lijian; Yang, Yiping

    2011-06-01

    Parallel hardware becomes a commonly used approach to satisfy the intensive computation demands of computer vision systems. A multiprocessor architecture based on hypercube interconnecting digital signal processors (DSPs) is described to exploit the temporal and spatial parallelism. This paper presents a parallel implementation of low level vision algorithms designed on multi-DSP system. The convolution operation has been parallelized by using redundant boundary partitioning. Performance of the parallel convolution operation is investigated by varying the image size, mask size and the number of processors. Experimental results show that the speedup is close to the ideal value. However, it can be found that the loading imbalance of processor can significantly affect the computation time and speedup of the multi- DSP system.

  2. Methods for design and evaluation of parallel computating systems (The PISCES project)

    NASA Technical Reports Server (NTRS)

    Pratt, Terrence W.; Wise, Robert; Haught, Mary JO

    1989-01-01

    The PISCES project started in 1984 under the sponsorship of the NASA Computational Structural Mechanics (CSM) program. A PISCES 1 programming environment and parallel FORTRAN were implemented in 1984 for the DEC VAX (using UNIX processes to simulate parallel processes). This system was used for experimentation with parallel programs for scientific applications and AI (dynamic scene analysis) applications. PISCES 1 was ported to a network of Apollo workstations by N. Fitzgerald.

  3. A note on parallel and pipeline computation of fast unitary transforms

    NASA Technical Reports Server (NTRS)

    Fino, B. J.; Algazi, V. R.

    1974-01-01

    The parallel and pipeline organization of fast unitary transform algorithms such as the Fast Fourier Transform are discussed. The efficiency is pointed out of a combined parallel-pipeline processor of a transform such as the Haar transform in which 2 to the n minus 1 power hardware butterflies generate a transform of order 2 to the n power every computation cycle.

  4. Monte Carlo simulations of converging laser beam propagating in turbid media with parallel computing

    NASA Astrophysics Data System (ADS)

    Wu, Di; Lu, Jun Q.; Hu, Xin H.; Zhao, S. S.

    1999-11-01

    Due to its flexibility and simplicity, Monte Carlo method is often used to study light propagation in turbid medium where the photons are treated like classic particles being scattered and absorbed randomly based on a radiative transfer theory. However, due to the need of large number of photons to produce statistically significance results, this type of calculations requires large computing resources. To overcome such difficulty, we implemented parallel computing technique into our Monte Carlo simulations. The algorithm is based on the fact that the classic particles are uncorrelated, and the trajectories of multiple photons can be tracked simultaneously. When a beam of focused light incident to the medium, the incident photons are divided into groups according to the available processes on a parallel machine and the calculations are carried out in parallel. Utilizing PVM (Parallel Virtual Machine, a parallel computing software), the parallel programs in both C and FORTRAN are developed on the massive parallel computer Cray T3E at the North Carolina Supercomputer Center and a local PC-cluster network running UNIX/Sun Solaris. The parallel performances of our codes have been excellent on both Cray T3E and the PC clusters. In this paper, we present results on a focusing laser beam propagating through a highly scattering and diluted solution of intralipid. The dependence of the spatial distribution of light near the focal point on the concentration of intralipid solution is studied and its significance is discussed.

  5. Applications of Parallel Computational Methods to Charged-Particle Beam Dynamics

    SciTech Connect

    Kabel, A.; Cai, Y.; Dohlus, M.; Sen, T.; Uplenchwar, R.; /SLAC /DESY

    2007-10-16

    The availability of parallel computation hardware and the advent of standardized programming interfaces has made a new class of beam dynamics problems accessible to numerical simulations. We describe recent progress in code development for simulations of coherent synchrotron radiation and the weak-strong and strong-strong beam-beam interaction. Parallelization schemes will be discussed, and typical results will be presented.

  6. Special purpose parallel computer architecture for real-time control and simulation in robotic applications

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1993-01-01

    This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.

  7. Paging memory from random access memory to backing storage in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Inglett, Todd A; Ratterman, Joseph D; Smith, Brian E

    2013-05-21

    Paging memory from random access memory (`RAM`) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.

  8. Low latency, high bandwidth data communications between compute nodes in a parallel computer

    DOEpatents

    Blocksome, Michael A

    2014-04-01

    Methods, systems, and products are disclosed for data transfers between nodes in a parallel computer that include: receiving, by an origin DMA on an origin node, a buffer identifier for a buffer containing data for transfer to a target node; sending, by the origin DMA to the target node, a RTS message; transferring, by the origin DMA, a data portion to the target node using a memory FIFO operation that specifies one end of the buffer from which to begin transferring the data; receiving, by the origin DMA, an acknowledgement of the RTS message from the target node; and transferring, by the origin DMA in response to receiving the acknowledgement, any remaining data portion to the target node using a direct put operation that specifies the other end of the buffer from which to begin transferring the data, including initiating the direct put operation without invoking an origin processing core.

  9. Low latency, high bandwidth data communications between compute nodes in a parallel computer

    DOEpatents

    Blocksome, Michael A

    2014-04-22

    Methods, systems, and products are disclosed for data transfers between nodes in a parallel computer that include: receiving, by an origin DMA on an origin node, a buffer identifier for a buffer containing data for transfer to a target node; sending, by the origin DMA to the target node, a RTS message; transferring, by the origin DMA, a data portion to the target node using a memory FIFO operation that specifies one end of the buffer from which to begin transferring the data; receiving, by the origin DMA, an acknowledgement of the RTS message from the target node; and transferring, by the origin DMA in response to receiving the acknowledgement, any remaining data portion to the target node using a direct put operation that specifies the other end of the buffer from which to begin transferring the data, including initiating the direct put operation without invoking an origin processing core.

  10. Self-pacing direct memory access data transfer operations for compute nodes in a parallel computer

    SciTech Connect

    Blocksome, Michael A

    2015-02-17

    Methods, apparatus, and products are disclosed for self-pacing DMA data transfer operations for nodes in a parallel computer that include: transferring, by an origin DMA on an origin node, a RTS message to a target node, the RTS message specifying an message on the origin node for transfer to the target node; receiving, in an origin injection FIFO for the origin DMA from a target DMA on the target node in response to transferring the RTS message, a target RGET descriptor followed by a DMA transfer operation descriptor, the DMA descriptor for transmitting a message portion to the target node, the target RGET descriptor specifying an origin RGET descriptor on the origin node that specifies an additional DMA descriptor for transmitting an additional message portion to the target node; processing, by the origin DMA, the target RGET descriptor; and processing, by the origin DMA, the DMA transfer operation descriptor.

  11. Low latency, high bandwidth data communications between compute nodes in a parallel computer

    DOEpatents

    Blocksome, Michael A

    2013-07-02

    Methods, systems, and products are disclosed for data transfers between nodes in a parallel computer that include: receiving, by an origin DMA on an origin node, a buffer identifier for a buffer containing data for transfer to a target node; sending, by the origin DMA to the target node, a RTS message; transferring, by the origin DMA, a data portion to the target node using a memory FIFO operation that specifies one end of the buffer from which to begin transferring the data; receiving, by the origin DMA, an acknowledgement of the RTS message from the target node; and transferring, by the origin DMA in response to receiving the acknowledgement, any remaining data portion to the target node using a direct put operation that specifies the other end of the buffer from which to begin transferring the data, including initiating the direct put operation without invoking an origin processing core.

  12. Line-plane broadcasting in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Berg, Jeremy E.; Blocksome, Michael A.; Smith, Brian E.

    2010-06-08

    Methods, apparatus, and products are disclosed for line-plane broadcasting in a data communications network of a parallel computer, the parallel computer comprising a plurality of compute nodes connected together through the network, the network optimized for point to point data communications and characterized by at least a first dimension, a second dimension, and a third dimension, that include: initiating, by a broadcasting compute node, a broadcast operation, including sending a message to all of the compute nodes along an axis of the first dimension for the network; sending, by each compute node along the axis of the first dimension, the message to all of the compute nodes along an axis of the second dimension for the network; and sending, by each compute node along the axis of the second dimension, the message to all of the compute nodes along an axis of the third dimension for the network.

  13. Line-plane broadcasting in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Berg, Jeremy E.; Blocksome, Michael A.; Smith, Brian E.

    2010-11-23

    Methods, apparatus, and products are disclosed for line-plane broadcasting in a data communications network of a parallel computer, the parallel computer comprising a plurality of compute nodes connected together through the network, the network optimized for point to point data communications and characterized by at least a first dimension, a second dimension, and a third dimension, that include: initiating, by a broadcasting compute node, a broadcast operation, including sending a message to all of the compute nodes along an axis of the first dimension for the network; sending, by each compute node along the axis of the first dimension, the message to all of the compute nodes along an axis of the second dimension for the network; and sending, by each compute node along the axis of the second dimension, the message to all of the compute nodes along an axis of the third dimension for the network.

  14. Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Peters, Amanda E.; Ratterman, Joseph D.; Smith, Brian E.

    2012-04-17

    Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.

  15. Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Peters, Amanda A.; Ratterman, Joseph D.; Smith, Brian E.

    2012-01-10

    Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.

  16. Parallelized tree-code for clusters of personal computers

    NASA Astrophysics Data System (ADS)

    Viturro, H. R.; Carpintero, D. D.

    2000-02-01

    We present a tree-code for integrating the equations of the motion of collisionless systems, which has been fully parallelized and adapted to run in several PC-based processors simultaneously, using the well-known PVM message passing library software. SPH algorithms, not yet included, may be easily incorporated to the code. The code is written in ANSI C; it can be freely downloaded from a public ftp site. Simulations of collisions of galaxies are presented, with which the performance of the code is tested.

  17. The development and operation of Edinburgh Parallel Computing Centre`s summer scholarship programme

    SciTech Connect

    Wilson, G.V.; MacDonald, N.B.; Thornborrow, C.; Brough, C.M.

    1994-12-31

    Between 1987 and 1994, more than 100 students in a broad range of disciplines worked as summer scholars at Edinburgh Parallel Computing Centre. Many of these students have since taken their parallel computing skills into graduate work and industry, and over a quarter of EPCC`s technical staff are alumni of the Programme. This report describes the evolution and present operation of the Summer Scholarship Programme, and its costs and benefits.

  18. Final Report -- Center for Programmng Models for Scalable Parallel Computing (UIUC subgroup)

    SciTech Connect

    Marianne Winslett; Michael Folk

    2007-03-31

    The mission of the Center for Scalable Programming Models (Pmodels) was to create new ways of programming parallel computers that are much easier for humans to conceptualize, that allow programs to be written, updated and debugged quickly, and that run extremely efficiently---even on computers with thousands or even millions of processors. At UIUC, our work for Pmodels focused on support for I/O in a massively parallel environment, and included both research and technology transfer activities.

  19. Non-parallel processing: Gendered attrition in academic computer science

    NASA Astrophysics Data System (ADS)

    Cohoon, Joanne Louise Mcgrath

    2000-10-01

    This dissertation addresses the issue of disproportionate female attrition from computer science as an instance of gender segregation in higher education. By adopting a theoretical framework from organizational sociology, it demonstrates that the characteristics and processes of computer science departments strongly influence female retention. The empirical data identifies conditions under which women are retained in the computer science major at comparable rates to men. The research for this dissertation began with interviews of students, faculty, and chairpersons from five computer science departments. These exploratory interviews led to a survey of faculty and chairpersons at computer science and biology departments in Virginia. The data from these surveys are used in comparisons of the computer science and biology disciplines, and for statistical analyses that identify which departmental characteristics promote equal attrition for male and female undergraduates in computer science. This three-pronged methodological approach of interviews, discipline comparisons, and statistical analyses shows that departmental variation in gendered attrition rates can be explained largely by access to opportunity, relative numbers, and other characteristics of the learning environment. Using these concepts, this research identifies nine factors that affect the differential attrition of women from CS departments. These factors are: (1) The gender composition of enrolled students and faculty; (2) Faculty turnover; (3) Institutional support for the department; (4) Preferential attitudes toward female students; (5) Mentoring and supervising by faculty; (6) The local job market, starting salaries, and competitiveness of graduates; (7) Emphasis on teaching; and (8) Joint efforts for student success. This work contributes to our understanding of the gender segregation process in higher education. In addition, it contributes information that can lead to effective solutions for an

  20. Parallel language constructs for tensor product computations on loosely coupled architectures

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Vanrosendale, John

    1989-01-01

    Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.

  1. Multiscale Methods, Parallel Computation, and Neural Networks for Real-Time Computer Vision.

    NASA Astrophysics Data System (ADS)

    Battiti, Roberto

    1990-01-01

    This thesis presents new algorithms for low and intermediate level computer vision. The guiding ideas in the presented approach are those of hierarchical and adaptive processing, concurrent computation, and supervised learning. Processing of the visual data at different resolutions is used not only to reduce the amount of computation necessary to reach the fixed point, but also to produce a more accurate estimation of the desired parameters. The presented adaptive multiple scale technique is applied to the problem of motion field estimation. Different parts of the image are analyzed at a resolution that is chosen in order to minimize the error in the coefficients of the differential equations to be solved. Tests with video-acquired images show that velocity estimation is more accurate over a wide range of motion with respect to the homogeneous scheme. In some cases introduction of explicit discontinuities coupled to the continuous variables can be used to avoid propagation of visual information from areas corresponding to objects with different physical and/or kinematic properties. The human visual system uses concurrent computation in order to process the vast amount of visual data in "real -time." Although with different technological constraints, parallel computation can be used efficiently for computer vision. All the presented algorithms have been implemented on medium grain distributed memory multicomputers with a speed-up approximately proportional to the number of processors used. A simple two-dimensional domain decomposition assigns regions of the multiresolution pyramid to the different processors. The inter-processor communication needed during the solution process is proportional to the linear dimension of the assigned domain, so that efficiency is close to 100% if a large region is assigned to each processor. Finally, learning algorithms are shown to be a viable technique to engineer computer vision systems for different applications starting from

  2. Implementation of a 3D mixing layer code on parallel computers

    NASA Technical Reports Server (NTRS)

    Roe, K.; Thakur, R.; Dang, T.; Bogucz, E.

    1995-01-01

    This paper summarizes our progress and experience in the development of a Computational-Fluid-Dynamics code on parallel computers to simulate three-dimensional spatially-developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers. The code was then converted for use on parallel computers using the conventional message-passing technique, while we have not been able to compile the code with the present version of HPF compilers.

  3. Solving very large, sparse linear systems on mesh-connected parallel computers

    NASA Technical Reports Server (NTRS)

    Opsahl, Torstein; Reif, John

    1987-01-01

    The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.

  4. A new parallel-vector finite element analysis software on distributed-memory computers

    NASA Technical Reports Server (NTRS)

    Qin, Jiangning; Nguyen, Duc T.

    1993-01-01

    A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.

  5. Managing internode data communications for an uninitialized process in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

    2014-05-20

    A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes, MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.

  6. Parallel computation for blood cell classification in medical hyperspectral imagery

    NASA Astrophysics Data System (ADS)

    Li, Wei; Wu, Lucheng; Qiu, Xianbo; Ran, Qiong; Xie, Xiaoming

    2016-09-01

    With the advantage of fine spectral resolution, hyperspectral imagery provides great potential for cell classification. This paper provides a promising classification system including the following three stages: (1) band selection for a subset of spectral bands with distinctive and informative features, (2) spectral-spatial feature extraction, such as local binary patterns (LBP), and (3) followed by an effective classifier. Moreover, these three steps are further implemented on graphics processing units (GPU) respectively, which makes the system real-time and more practical. The GPU parallel implementation is compared with the serial implementation on central processing units (CPU). Experimental results based on real medical hyperspectral data demonstrate that the proposed system is able to offer high accuracy and fast speed, which are appealing for cell classification in medical hyperspectral imagery.

  7. Parallel linear equation solvers for finite element computations

    NASA Technical Reports Server (NTRS)

    Ortega, James M.; Poole, Gene; Vaughan, Courtenay; Cleary, Andrew; Averick, Brett

    1989-01-01

    The overall objective of this research is to develop efficient methods for the solution of linear and nonlinear systems of equations on parallel and supercomputers, and to apply these methods to the solution of problems in structural analysis. Attention has been given so far only to linear equations. The methods considered for the solution of the stiffness equation Kx=f have been Choleski factorization and the conjugate gradient iteration with SSOR and Incomplete Choleski preconditioning. More detail on these methods will be given on subsequent slides. These methods have been used to solve for the static displacements for the mast and panel focus problems in conjunction with the CSM testbed system based on NICE/SPAR.

  8. Protein engineering by highly parallel screening of computationally designed variants.

    PubMed

    Sun, Mark G F; Seo, Moon-Hyeong; Nim, Satra; Corbi-Verge, Carles; Kim, Philip M

    2016-07-01

    Current combinatorial selection strategies for protein engineering have been successful at generating binders against a range of targets; however, the combinatorial nature of the libraries and their vast undersampling of sequence space inherently limit these methods due to the difficulty in finely controlling protein properties of the engineered region. Meanwhile, great advances in computational protein design that can address these issues have largely been underutilized. We describe an integrated approach that computationally designs thousands of individual protein binders for high-throughput synthesis and selection to engineer high-affinity binders. We show that a computationally designed library enriches for tight-binding variants by many orders of magnitude as compared to conventional randomization strategies. We thus demonstrate the feasibility of our approach in a proof-of-concept study and successfully obtain low-nanomolar binders using in vitro and in vivo selection systems. PMID:27453948

  9. Protein engineering by highly parallel screening of computationally designed variants

    PubMed Central

    Sun, Mark G. F.; Seo, Moon-Hyeong; Nim, Satra; Corbi-Verge, Carles; Kim, Philip M.

    2016-01-01

    Current combinatorial selection strategies for protein engineering have been successful at generating binders against a range of targets; however, the combinatorial nature of the libraries and their vast undersampling of sequence space inherently limit these methods due to the difficulty in finely controlling protein properties of the engineered region. Meanwhile, great advances in computational protein design that can address these issues have largely been underutilized. We describe an integrated approach that computationally designs thousands of individual protein binders for high-throughput synthesis and selection to engineer high-affinity binders. We show that a computationally designed library enriches for tight-binding variants by many orders of magnitude as compared to conventional randomization strategies. We thus demonstrate the feasibility of our approach in a proof-of-concept study and successfully obtain low-nanomolar binders using in vitro and in vivo selection systems. PMID:27453948

  10. Dynamic remapping of parallel computations with varying resource demands

    NASA Technical Reports Server (NTRS)

    Nicol, D. M.; Saltz, J. H.

    1986-01-01

    A large class of computational problems is characterized by frequent synchronization, and computational requirements which change as a function of time. When such a problem must be solved on a message passing multiprocessor machine, the combination of these characteristics lead to system performance which decreases in time. Performance can be improved with periodic redistribution of computational load; however, redistribution can exact a sometimes large delay cost. We study the issue of deciding when to invoke a global load remapping mechanism. Such a decision policy must effectively weigh the costs of remapping against the performance benefits. We treat this problem by constructing two analytic models which exhibit stochastically decreasing performance. One model is quite tractable; we are able to describe the optimal remapping algorithm, and the optimal decision policy governing when to invoke that algorithm. However, computational complexity prohibits the use of the optimal remapping decision policy. We then study the performance of a general remapping policy on both analytic models. This policy attempts to minimize a statistic W(n) which measures the system degradation (including the cost of remapping) per computation step over a period of n steps. We show that as a function of time, the expected value of W(n) has at most one minimum, and that when this minimum exists it defines the optimal fixed-interval remapping policy. Our decision policy appeals to this result by remapping when it estimates that W(n) is minimized. Our performance data suggests that this policy effectively finds the natural frequency of remapping. We also use the analytic models to express the relationship between performance and remapping cost, number of processors, and the computation's stochastic activity.

  11. Fencing network direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-07-07

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  12. Fencing network direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-07-14

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  13. Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles

    NASA Technical Reports Server (NTRS)

    Ryan, James S.; Weeratunga, Sisira

    1993-01-01

    Multidisciplinary design optimization of aircraft will require unprecedented capabilities of both analysis software and computer hardware. The speed and accuracy of the analysis will depend heavily on the computational fluid dynamics (CFD) module which is used. A new CFD module has been developed to combine the robust accuracy of conventional codes with the ability to run on parallel architectures. This is achieved by parallelizing the ARC3D algorithm, a central-differenced Navier-Stokes method, on the Intel iPSC/860. The computed solutions are identical to those from conventional machines. Computational speed on 64 processors is comparable to the rate on one Cray Y-MP processor and will increase as new generations of parallel computers become available.

  14. Medical image processing utilizing neural networks trained on a massively parallel computer.

    PubMed

    Kerr, J P; Bartlett, E B

    1995-07-01

    While finding many applications in science, engineering, and medicine, artificial neural networks (ANNs) have typically been limited to small architectures. In this paper, we demonstrate how very large architecture neural networks can be trained for medical image processing utilizing a massively parallel, single-instruction multiple data (SIMD) computer. The two- to three-orders of magnitude improvement in processing time attainable using a parallel computer makes it practical to train very large architecture ANNs. As an example we have trained several ANNs to demonstrate the tomographic reconstruction of 64 x 64 single photon emission computed tomography (SPECT) images from 64 planar views of the images. The potential for these large architecture ANNs lies in the fact that once the neural network is properly trained on the parallel computer the corresponding interconnection weight file can be loaded on a serial computer. Subsequently, relatively fast processing of all novel images can be performed on a PC or workstation. PMID:7497701

  15. A Lanczos eigenvalue method on a parallel computer. [for large complex space structure free vibration analysis

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.; Fulton, Robert E.

    1987-01-01

    Eigenvalue analyses of complex structures is a computationally intensive task which can benefit significantly from new and impending parallel computers. This study reports on a parallel computer implementation of the Lanczos method for free vibration analysis. The approach used here subdivides the major Lanczos calculation tasks into subtasks and introduces parallelism down to the subtask levels such as matrix decomposition and forward/backward substitution. The method was implemented on a commercial parallel computer and results were obtained for a long flexible space structure. While parallel computing efficiency is problem and computer dependent, the efficiency for the Lanczos method was good for a moderate number of processors for the test problem. The greatest reduction in time was realized for the decomposition of the stiffness matrix, a calculation which took 70 percent of the time in the sequential program and which took 25 percent of the time on eight processors. For a sample calculation of the twenty lowest frequencies of a 486 degree of freedom problem, the total sequential computing time was reduced by almost a factor of ten using 16 processors.

  16. Large-eddy simulation of the Rayleigh-Taylor instability on a massively parallel computer

    SciTech Connect

    Amala, P.A.K.

    1995-03-01

    A computational model for the solution of the three-dimensional Navier-Stokes equations is developed. This model includes a turbulence model: a modified Smagorinsky eddy-viscosity with a stochastic backscatter extension. The resultant equations are solved using finite difference techniques: the second-order explicit Lax-Wendroff schemes. This computational model is implemented on a massively parallel computer. Programming models on massively parallel computers are next studied. It is desired to determine the best programming model for the developed computational model. To this end, three different codes are tested on a current massively parallel computer: the CM-5 at Los Alamos. Each code uses a different programming model: one is a data parallel code; the other two are message passing codes. Timing studies are done to determine which method is the fastest. The data parallel approach turns out to be the fastest method on the CM-5 by at least an order of magnitude. The resultant code is then used to study a current problem of interest to the computational fluid dynamics community. This is the Rayleigh-Taylor instability. The Lax-Wendroff methods handle shocks and sharp interfaces poorly. To this end, the Rayleigh-Taylor linear analysis is modified to include a smoothed interface. The linear growth rate problem is then investigated. Finally, the problem of the randomly perturbed interface is examined. Stochastic backscatter breaks the symmetry of the stationary unstable interface and generates a mixing layer growing at the experimentally observed rate. 115 refs., 51 figs., 19 tabs.

  17. APES-based procedure for super-resolution SAR imagery with GPU parallel computing

    NASA Astrophysics Data System (ADS)

    Jia, Weiwei; Xu, Xiaojian; Xu, Guangyao

    2015-10-01

    The amplitude and phase estimation (APES) algorithm is widely used in modern spectral analysis. Compared with conventional Fourier transform (FFT), APES results in lower sidelobes and narrower spectral peaks. However, in synthetic aperture radar (SAR) imaging with large scene, without parallel computation, it is difficult to apply APES directly to super-resolution radar image processing due to its great amount of calculation. In this paper, a procedure is proposed to achieve target extraction and parallel computing of APES for super-resolution SAR imaging. Numerical experimental are carried out on Tesla K40C with 745 MHz GPU clock rate and 2880 CUDA cores. Results of SAR image with GPU parallel computing show that the parallel APES is remarkably more efficient than that of CPU-based with the same super-resolution.

  18. MAX - An advanced parallel computer for space applications

    NASA Technical Reports Server (NTRS)

    Lewis, Blair F.; Bunker, Robert L.

    1991-01-01

    MAX is a fault-tolerant multicomputer hardware and software architecture designed to meet the needs of NASA spacecraft systems. It consists of conventional computing modules (computers) connected via a dual network topology. One network is used to transfer data among the computers and between computers and I/O devices. This network's topology is arbitrary. The second network operates as a broadcast medium for operating system synchronization messages and supports the operating system's Byzantine resilience. A fully distributed operating system supports multitasking in an asynchronous event and data driven environment. A large grain dataflow paradigm is used to coordinate the multitasking and provide easy control of concurrency. It is the basis of the system's fault tolerance and allows both static and dynamical location of tasks. Redundant execution of tasks with software voting of results may be specified for critical tasks. The dataflow paradigm also supports simplified software design, test and maintenance. A unique feature is a method for reliably patching code in an executing dataflow application.

  19. Parallel processing algorithms for hydrocodes on a computer with MIMD architecture (DENELCOR's HEP)

    SciTech Connect

    Hicks, D.L.

    1983-11-01

    In real time simulation/prediction of complex systems such as water-cooled nuclear reactors, if reactor operators had fast simulator/predictors to check the consequences of their operations before implementing them, events such as the incident at Three Mile Island might be avoided. However, existing simulator/predictors such as RELAP run slower than real time on serial computers. It appears that the only way to overcome the barrier to higher computing rates is to use computers with architectures that allow concurrent computations or parallel processing. The computer architecture with the greatest degree of parallelism is labeled Multiple Instruction Stream, Multiple Data Stream (MIMD). An example of a machine of this type is the HEP computer by DENELCOR. It appears that hydrocodes are very well suited for parallelization on the HEP. It is a straightforward exercise to parallelize explicit, one-dimensional Lagrangean hydrocodes in a zone-by-zone parallelization. Similarly, implicit schemes can be parallelized in a zone-by-zone fashion via an a priori, symbolic inversion of the tridiagonal matrix that arises in an implicit scheme. These techniques are extended to Eulerian hydrocodes by using Harlow's rezone technique. The extension from single-phase Eulerian to two-phase Eulerian is straightforward. This step-by-step extension leads to hydrocodes with zone-by-zone parallelization that are capable of two-phase flow simulation. Extensions to two and three spatial dimensions can be achieved by operator splitting. It appears that a zone-by-zone parallelization is the best way to utilize the capabilities of an MIMD machine. 40 references.

  20. Hardware Efficient and High-Performance Networks for Parallel Computers.

    NASA Astrophysics Data System (ADS)

    Chien, Minze Vincent

    High performance interconnection networks are the key to high utilization and throughput in large-scale parallel processing systems. Since many interconnection problems in parallel processing such as concentration, permutation and broadcast problems can be cast as sorting problems, this dissertation considers the problem of sorting on a new model, called an adaptive sorting network. It presents four adaptive binary sorters the first two of which are ordinary combinational circuits while the last two exploit time-multiplexing and pipelining techniques. These sorter constructions demonstrate that any sequence of n bits can be sorted in O(log^2n) bit-level delay, using O(n) constant fanin gates. This improves the cost complexity of Batcher's binary sorters by a factor of O(log^2n) while matching their sorting time. It is further shown that any sequence of n numbers can be sorted on the same model in O(log^2n) comparator-level delay using O(nlog nloglog n) comparators. The adaptive binary sorter constructions lead to new O(n) bit-level cost concentrators and superconcentrators with O(log^2n) bit-level delay. Their employment in recently constructed permutation and generalized connectors lead to permutation and generalized connection networks with O(nlog n) bit-level cost and O(log^3n) bit-level delay. These results provide the least bit-level cost for such networks with competitive delays. Finally, the dissertation considers a key issue in the implementation of interconnection networks, namely, the pin constraint. Current VLSI technologies can house a large number of switches in a single chip, but the mere fact that one chip cannot have too many pins precludes the possibility of implementing a large connection network on a single chip. The dissertation presents techniques for partitioning connection networks into identical modules of switches in such a way that each module is contained in a single chip with an arbitrarily specified number of pins, and that the cost of

  1. Lilith: A scalable secure tool for massively parallel distributed computing

    SciTech Connect

    Armstrong, R.C.; Camp, L.J.; Evensky, D.A.; Gentile, A.C.

    1997-06-01

    Changes in high performance computing have necessitated the ability to utilize and interrogate potentially many thousands of processors. The ASCI (Advanced Strategic Computing Initiative) program conducted by the United States Department of Energy, for example, envisions thousands of distinct operating systems connected by low-latency gigabit-per-second networks. In addition multiple systems of this kind will be linked via high-capacity networks with latencies as low as the speed of light will allow. Code which spans systems of this sort must be scalable; yet constructing such code whether for applications, debugging, or maintenance is an unsolved problem. Lilith is a research software platform that attempts to answer these questions with an end toward meeting these needs. Presently, Lilith exists as a test-bed, written in Java, for various spanning algorithms and security schemes. The test-bed software has, and enforces, hooks allowing implementation and testing of various security schemes.

  2. A domain decomposition study of massively parallel computing in compressible gas dynamics

    NASA Astrophysics Data System (ADS)

    Wong, C. C.; Blottner, F. G.; Payne, J. L.; Soetrisno, M.

    1995-03-01

    The appropriate utilization of massively parallel computers for solving the Navier-Stokes equations is investigated and determined from an engineering perspective. The issues investigated are: (1) Should strip or patch domain decomposition of the spatial mesh be used to reduce computer time? (2) How many computer nodes should be used for a problem with a given sized mesh to reduce computer time? (3) Is the convergence of the Navier-Stokes solution procedure (LU-SGS) adversely influenced by the domain decomposition approach? The results of the paper show that the present Navier-Stokes solution technique has good performance on a massively parallel computer for transient flow problems. For steady-state problems with a large number of mesh cells, the solution procedure will require significant computer time due to an increased number of iterations to achieve a converged solution. There is an optimum number of computer nodes to use for a problem with a given global mesh size.

  3. CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design

    PubMed Central

    Chen, Yong

    2016-01-01

    A set of conserved binding sites recognized by a transcription factor is called a motif, which can be found by many applications of comparative genomics for identifying over-represented segments. Moreover, when numerous putative motifs are predicted from a collection of genome-wide data, their similarity data can be represented as a large graph, where these motifs are connected to one another. However, an efficient clustering algorithm is desired for clustering the motifs that belong to the same groups and separating the motifs that belong to different groups, or even deleting an amount of spurious ones. In this work, a new motif clustering algorithm, CLIMP, is proposed by using maximal cliques and sped up by parallelizing its program. When a synthetic motif dataset from the database JASPAR, a set of putative motifs from a phylogenetic foot-printing dataset, and a set of putative motifs from a ChIP dataset are used to compare the performances of CLIMP and two other high-performance algorithms, the results demonstrate that CLIMP mostly outperforms the two algorithms on the three datasets for motif clustering, so that it can be a useful complement of the clustering procedures in some genome-wide motif prediction pipelines. CLIMP is available at http://sqzhang.cn/climp.html. PMID:27487245

  4. Applicability of Parallel Computing to Partial Wave Analysis

    NASA Astrophysics Data System (ADS)

    Ruger, Justin; Gilfoyle, Gerard; Weygand, Dennis; CLAS Collaboration

    2013-10-01

    Bound states of Quantum Chromodynamics (QCD) give insights into the nature of confinement, a key element of the strong interaction. States may be identified from weak signals extracted from the analysis of high statistics data from reactions with many final state particles. One of the best tools for the analysis of these reactions is Partial Wave Analysis (PWA). PWA transforms an ensemble of experimental data from a large acceptance detector from free particle eigenstates to angular momentum eigenstates. The PWA program must be fast enough to deal with the large amounts of data available currently, as processing time scales with the number of events. The scope of this research is to study the applicability and scalability of Intel's Xeon Phi using the Many Integrated Core (MIC) architecture when applied to the existing PWA code at Jefferson Laboratory. An algorithm was developed for the Xeon Phi and scaled across 240 available threads, giving parallel functionality to the PWA which was originally written serially. This scaling can make the fitting process fifteen times faster. Supported by the US Department of Energy.

  5. CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design.

    PubMed

    Zhang, Shaoqiang; Chen, Yong

    2016-01-01

    A set of conserved binding sites recognized by a transcription factor is called a motif, which can be found by many applications of comparative genomics for identifying over-represented segments. Moreover, when numerous putative motifs are predicted from a collection of genome-wide data, their similarity data can be represented as a large graph, where these motifs are connected to one another. However, an efficient clustering algorithm is desired for clustering the motifs that belong to the same groups and separating the motifs that belong to different groups, or even deleting an amount of spurious ones. In this work, a new motif clustering algorithm, CLIMP, is proposed by using maximal cliques and sped up by parallelizing its program. When a synthetic motif dataset from the database JASPAR, a set of putative motifs from a phylogenetic foot-printing dataset, and a set of putative motifs from a ChIP dataset are used to compare the performances of CLIMP and two other high-performance algorithms, the results demonstrate that CLIMP mostly outperforms the two algorithms on the three datasets for motif clustering, so that it can be a useful complement of the clustering procedures in some genome-wide motif prediction pipelines. CLIMP is available at http://sqzhang.cn/climp.html. PMID:27487245

  6. Particle/Continuum Hybrid Simulation in a Parallel Computing Environment

    NASA Technical Reports Server (NTRS)

    Baganoff, Donald

    1996-01-01

    The objective of this study was to modify an existing parallel particle code based on the direct simulation Monte Carlo (DSMC) method to include a Navier-Stokes (NS) calculation so that a hybrid solution could be developed. In carrying out this work, it was determined that the following five issues had to be addressed before extensive program development of a three dimensional capability was pursued: (1) find a set of one-sided kinetic fluxes that are fully compatible with the DSMC method, (2) develop a finite volume scheme to make use of these one-sided kinetic fluxes, (3) make use of the one-sided kinetic fluxes together with DSMC type boundary conditions at a material surface so that velocity slip and temperature slip arise naturally for near-continuum conditions, (4) find a suitable sampling scheme so that the values of the one-sided fluxes predicted by the NS solution at an interface between the two domains can be converted into the correct distribution of particles to be introduced into the DSMC domain, (5) carry out a suitable number of tests to confirm that the developed concepts are valid, individually and in concert for a hybrid scheme.

  7. A survey of synchronization methods for parallel computers

    SciTech Connect

    Dinning, A. )

    1989-07-01

    This article examines how traditional synchronization methods influence the design of MIMD multiprocessors. This particular class of architectures is one in which high-level synchronization plays an important role. Although vector processors, dataflow machines, and single instruction, multiple-data (SIMD) computers are highly synchronized, their synchronization is generally an explicit part of the control flow and is executed as part of every instruction. In MIMD multiprocessors, synchronization must occur on demand, so more sophisticated schemes are needed.

  8. Parallel computer processing and modeling: applications for the ICU

    NASA Astrophysics Data System (ADS)

    Baxter, Grant; Pranger, L. Alex; Draghic, Nicole; Sims, Nathaniel M.; Wiesmann, William P.

    2003-07-01

    Current patient monitoring procedures in hospital intensive care units (ICUs) generate vast quantities of medical data, much of which is considered extemporaneous and not evaluated. Although sophisticated monitors to analyze individual types of patient data are routinely used in the hospital setting, this equipment lacks high order signal analysis tools for detecting long-term trends and correlations between different signals within a patient data set. Without the ability to continuously analyze disjoint sets of patient data, it is difficult to detect slow-forming complications. As a result, the early onset of conditions such as pneumonia or sepsis may not be apparent until the advanced stages. We report here on the development of a distributed software architecture test bed and software medical models to analyze both asynchronous and continuous patient data in real time. Hardware and software has been developed to support a multi-node distributed computer cluster capable of amassing data from multiple patient monitors and projecting near and long-term outcomes based upon the application of physiologic models to the incoming patient data stream. One computer acts as a central coordinating node; additional computers accommodate processing needs. A simple, non-clinical model for sepsis detection was implemented on the system for demonstration purposes. This work shows exceptional promise as a highly effective means to rapidly predict and thereby mitigate the effect of nosocomial infections.

  9. Domain decomposition: A bridge between nature and parallel computers

    NASA Technical Reports Server (NTRS)

    Keyes, David E.

    1992-01-01

    Domain decomposition is an intuitive organizing principle for a partial differential equation (PDE) computation, both physically and architecturally. However, its significance extends beyond the readily apparent issues of geometry and discretization, on one hand, and of modular software and distributed hardware, on the other. Engineering and computer science aspects are bridged by an old but recently enriched mathematical theory that offers the subject not only unity, but also tools for analysis and generalization. Domain decomposition induces function-space and operator decompositions with valuable properties. Function-space bases and operator splittings that are not derived from domain decompositions generally lack one or more of these properties. The evolution of domain decomposition methods for elliptically dominated problems has linked two major algorithmic developments of the last 15 years: multilevel and Krylov methods. Domain decomposition methods may be considered descendants of both classes with an inheritance from each: they are nearly optimal and at the same time efficiently parallelizable. Many computationally driven application areas are ripe for these developments. A progression is made from a mathematically informal motivation for domain decomposition methods to a specific focus on fluid dynamics applications. To be introductory rather than comprehensive, simple examples are provided while convergence proofs and algorithmic details are left to the original references; however, an attempt is made to convey their most salient features, especially where this leads to algorithmic insight.

  10. Parallel Computing of Multi-scale Finite Element Sheet Forming Analyses Based on Crystallographic Homogenization Method

    SciTech Connect

    Kuramae, Hiroyuki; Okada, Kenji; Uetsuji, Yasutomo; Nakamachi, Eiji; Tam, Nguyen Ngoc; Nakamura, Yasunori

    2005-08-05

    Since the multi-scale finite element analysis (FEA) requires large computation time, development of the parallel computing technique for the multi-scale analysis is inevitable. A parallel elastic/crystalline viscoplastic FEA code based on a crystallographic homogenization method has been developed using PC cluster. The homogenization scheme is introduced to compute macro-continuum plastic deformations and material properties by considering a polycrystal texture. Since the dynamic explicit method is applied to this method, the analysis using micro crystal structures computes the homogenized stresses in parallel based on domain partitioning of macro-continuum without solving simultaneous linear equations. The micro-structure is defined by the Scanning Electron Microscope (SEM) and the Electron Back Scan Diffraction (EBSD) measurement based crystal orientations. In order to improve parallel performance of elastoplasticity analysis, which dynamically and partially increases computational costs during the analysis, a dynamic workload balancing technique is introduced to the parallel analysis. The technique, which is an automatic task distribution method, is realized by adaptation of subdomain size for macro-continuum to maintain the computational load balancing among cluster nodes. The analysis code is applied to estimate the polycrystalline sheet metal formability.

  11. Implementation of the Distributed Parallel Program for Geoid Heights Computation Using MPI and Openmp

    NASA Astrophysics Data System (ADS)

    Lee, S.; Kim, J.; Jung, Y.; Choi, J.; Choi, C.

    2012-07-01

    Much research have been carried out using optimization algorithms for developing high-performance program, under the parallel computing environment with the evolution of the computer hardware technology such as dual-core processor and so on. Then, the studies by the parallel computing in geodesy and surveying fields are not so many. The present study aims to reduce running time for the geoid heights computation and carrying out least-squares collocation to improve its accuracy using distributed parallel technology. A distributed parallel program was developed in which a multi-core CPU-based PC cluster was adopted using MPI and OpenMP library. Geoid heights were calculated by the spherical harmonic analysis using the earth geopotential model of the National Geospatial-Intelligence Agency(2008). The geoid heights around the Korean Peninsula were calculated and tested in diskless-based PC cluster environment. As results, for the computing geoid heights by a earth geopotential model, the distributed parallel program was confirmed more effective to reduce the computational time compared to the sequential program.

  12. A transient FETI methodology for large-scale parallel implicit computations in structural mechanics, part 2

    NASA Technical Reports Server (NTRS)

    Farhat, Charbel; Crivelli, Luis

    1993-01-01

    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallellize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet and perhaps will never be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient than explicit codes when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.

  13. A transient FETI methodology for large-scale parallel implicit computations in structural mechanics

    NASA Technical Reports Server (NTRS)

    Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier

    1992-01-01

    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.

  14. Numerical characterization of nonlinear dynamical systems using parallel computing: The role of GPUS approach

    NASA Astrophysics Data System (ADS)

    Fazanaro, Filipe I.; Soriano, Diogo C.; Suyama, Ricardo; Madrid, Marconi K.; Oliveira, José Raimundo de; Muñoz, Ignacio Bravo; Attux, Romis

    2016-08-01

    The characterization of nonlinear dynamical systems and their attractors in terms of invariant measures, basins of attractions and the structure of their vector fields usually outlines a task strongly related to the underlying computational cost. In this work, the practical aspects related to the use of parallel computing - specially the use of Graphics Processing Units (GPUS) and of the Compute Unified Device Architecture (CUDA) - are reviewed and discussed in the context of nonlinear dynamical systems characterization. In this work such characterization is performed by obtaining both local and global Lyapunov exponents for the classical forced Duffing oscillator. The local divergence measure was employed by the computation of the Lagrangian Coherent Structures (LCSS), revealing the general organization of the flow according to the obtained separatrices, while the global Lyapunov exponents were used to characterize the attractors obtained under one or more bifurcation parameters. These simulation sets also illustrate the required computation time and speedup gains provided by different parallel computing strategies, justifying the employment and the relevance of GPUS and CUDA in such extensive numerical approach. Finally, more than simply providing an overview supported by a representative set of simulations, this work also aims to be a unified introduction to the use of the mentioned parallel computing tools in the context of nonlinear dynamical systems, providing codes and examples to be executed in MATLAB and using the CUDA environment, something that is usually fragmented in different scientific communities and restricted to specialists on parallel computing strategies.

  15. Time efficient 3-D electromagnetic modeling on massively parallel computers

    SciTech Connect

    Alumbaugh, D.L.; Newman, G.A.

    1995-08-01

    A numerical modeling algorithm has been developed to simulate the electromagnetic response of a three dimensional earth to a dipole source for frequencies ranging from 100 to 100MHz. The numerical problem is formulated in terms of a frequency domain--modified vector Helmholtz equation for the scattered electric fields. The resulting differential equation is approximated using a staggered finite difference grid which results in a linear system of equations for which the matrix is sparse and complex symmetric. The system of equations is solved using a preconditioned quasi-minimum-residual method. Dirichlet boundary conditions are employed at the edges of the mesh by setting the tangential electric fields equal to zero. At frequencies less than 1MHz, normal grid stretching is employed to mitigate unwanted reflections off the grid boundaries. For frequencies greater than this, absorbing boundary conditions must be employed by making the stretching parameters of the modified vector Helmholtz equation complex which introduces loss at the boundaries. To allow for faster calculation of realistic models, the original serial version of the code has been modified to run on a massively parallel architecture. This modification involves three distinct tasks; (1) mapping the finite difference stencil to a processor stencil which allows for the necessary information to be exchanged between processors that contain adjacent nodes in the model, (2) determining the most efficient method to input the model which is accomplished by dividing the input into ``global`` and ``local`` data and then reading the two sets in differently, and (3) deciding how to output the data which is an inherently nonparallel process.

  16. Parallel optimization of pixel purity index algorithm for massive hyperspectral images in cloud computing environment

    NASA Astrophysics Data System (ADS)

    Chen, Yufeng; Wu, Zebin; Sun, Le; Wei, Zhihui; Li, Yonglong

    2016-04-01

    With the gradual increase in the spatial and spectral resolution of hyperspectral images, the size of image data becomes larger and larger, and the complexity of processing algorithms is growing, which poses a big challenge to efficient massive hyperspectral image processing. Cloud computing technologies distribute computing tasks to a large number of computing resources for handling large data sets without the limitation of memory and computing resource of a single machine. This paper proposes a parallel pixel purity index (PPI) algorithm for unmixing massive hyperspectral images based on a MapReduce programming model for the first time in the literature. According to the characteristics of hyperspectral images, we describe the design principle of the algorithm, illustrate the main cloud unmixing processes of PPI, and analyze the time complexity of serial and parallel algorithms. Experimental results demonstrate that the parallel implementation of the PPI algorithm on the cloud can effectively process big hyperspectral data and accelerate the algorithm.

  17. Benchmarking ocean circulation models on massively parallel computers

    SciTech Connect

    Poling, D.A.

    1997-08-01

    General circulation models are becoming the premier theoretical tools for studying the complex structure of the global climate. GEONET was envisioned as exercising the resources developed for the nuclear weapons program to address environmental problems. The similarity of circulation models to weapons codes made them an attractive field for them to develop expertise. The author hoped to become an active player in mainline climate research through computer simulation. This is the final report of a three-year, Laboratory-Directed Research and Development (LDRD) project at the Los Alamos National Laboratory (LANL). The intention of this research was to establish the Laboratory in mainstream climate research in conjunction with the GEONET project.

  18. Accelerating Neuroimage Registration through Parallel Computation of Similarity Metric

    PubMed Central

    Luo, Yun-gang; Liu, Ping; Shi, Lin; Luo, Yishan; Yi, Lei; Li, Ang; Qin, Jing; Heng, Pheng-Ann; Wang, Defeng

    2015-01-01

    Neuroimage registration is crucial for brain morphometric analysis and treatment efficacy evaluation. However, existing advanced registration algorithms such as FLIRT and ANTs are not efficient enough for clinical use. In this paper, a GPU implementation of FLIRT with the correlation ratio (CR) as the similarity metric and a GPU accelerated correlation coefficient (CC) calculation for the symmetric diffeomorphic registration of ANTs have been developed. The comparison with their corresponding original tools shows that our accelerated algorithms can greatly outperform the original algorithm in terms of computational efficiency. This paper demonstrates the great potential of applying these registration tools in clinical applications. PMID:26352412

  19. Programming a massively parallel, computation universal system: static behavior

    SciTech Connect

    Lapedes, A.; Farber, R.

    1986-01-01

    In previous work by the authors, the ''optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ''neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called ''hidden neurons''), and thus deals with nets capable of universal computation. 22 refs.

  20. Parallel processing experiences on the Denelcor HEP computer

    SciTech Connect

    Hayes, A.H.

    1984-01-01

    Recent experiments conducted on a Denelcor HEP (Heterogeneous Element Processor) computer are discussed in this paper. Algorithm research was done on four types of problems of interest to the Los Alamos National Laboratory: (1) Monte Carlo, using GAMTAB, in which the interaction of photons with matter is analyzed; (2) Hydrodynamics; (3) Reactor Safety, in which the operation of a nuclear reactor is simulated; and (4) Particle-in-Cell, in which electrostatic interaction of plasma beams are studied. Means of maximizing programming efficiency are analyzed with ways of speeding up processing determined.

  1. Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

    NASA Technical Reports Server (NTRS)

    Logan, Terry G.

    1994-01-01

    The purpose of this study is to investigate the performance of the integral equation computations using numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of computational performance of the MPP CM-5 computer and conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code for some cases.

  2. ParCYCLIC: finite element modelling of earthquake liquefaction response on parallel computers

    NASA Astrophysics Data System (ADS)

    Peng, Jun; Lu, Jinchi; Law, Kincho H.; Elgamal, Ahmed

    2004-10-01

    This paper presents the computational procedures and solution strategy employed in ParCYCLIC, a parallel non-linear finite element program developed based on an existing serial code CYCLIC for the analysis of cyclic seismically-induced liquefaction problems. In ParCYCLIC, finite elements are employed within an incremental plasticity, coupled solid-fluid formulation. A constitutive model developed for simulating liquefaction-induced deformations is a main component of this analysis framework. The elements of the computational strategy, designed for distributed-memory message-passing parallel computer systems, include: (a) an automatic domain decomposer to partition the finite element mesh; (b) nodal ordering strategies to minimize storage space for the matrix coefficients; (c) an efficient scheme for the allocation of sparse matrix coefficients among the processors; and (d) a parallel sparse direct solver. Application of ParCYCLIC to simulate 3-D geotechnical experimental models is demonstrated. The computational results show excellent parallel performance and scalability of ParCYCLIC on parallel computers with a large number of processors. Copyright

  3. Parallel computation with molecular-motor-propelled agents in nanofabricated networks

    PubMed Central

    Nicolau, Dan V.; Lard, Mercy; Korten, Till; van Delft, Falco C. M. J. M.; Persson, Malin; Bengtsson, Elina; Månsson, Alf; Diez, Stefan; Linke, Heiner; Nicolau, Dan V.

    2016-01-01

    The combinatorial nature of many important mathematical problems, including nondeterministic-polynomial-time (NP)-complete problems, places a severe limitation on the problem size that can be solved with conventional, sequentially operating electronic computers. There have been significant efforts in conceiving parallel-computation approaches in the past, for example: DNA computation, quantum computation, and microfluidics-based computation. However, these approaches have not proven, so far, to be scalable and practical from a fabrication and operational perspective. Here, we report the foundations of an alternative parallel-computation system in which a given combinatorial problem is encoded into a graphical, modular network that is embedded in a nanofabricated planar device. Exploring the network in a parallel fashion using a large number of independent, molecular-motor-propelled agents then solves the mathematical problem. This approach uses orders of magnitude less energy than conventional computers, thus addressing issues related to power consumption and heat dissipation. We provide a proof-of-concept demonstration of such a device by solving, in a parallel fashion, the small instance {2, 5, 9} of the subset sum problem, which is a benchmark NP-complete problem. Finally, we discuss the technical advances necessary to make our system scalable with presently available technology. PMID:26903637

  4. Parallel computation with molecular-motor-propelled agents in nanofabricated networks.

    PubMed

    Nicolau, Dan V; Lard, Mercy; Korten, Till; van Delft, Falco C M J M; Persson, Malin; Bengtsson, Elina; Månsson, Alf; Diez, Stefan; Linke, Heiner; Nicolau, Dan V

    2016-03-01

    The combinatorial nature of many important mathematical problems, including nondeterministic-polynomial-time (NP)-complete problems, places a severe limitation on the problem size that can be solved with conventional, sequentially operating electronic computers. There have been significant efforts in conceiving parallel-computation approaches in the past, for example: DNA computation, quantum computation, and microfluidics-based computation. However, these approaches have not proven, so far, to be scalable and practical from a fabrication and operational perspective. Here, we report the foundations of an alternative parallel-computation system in which a given combinatorial problem is encoded into a graphical, modular network that is embedded in a nanofabricated planar device. Exploring the network in a parallel fashion using a large number of independent, molecular-motor-propelled agents then solves the mathematical problem. This approach uses orders of magnitude less energy than conventional computers, thus addressing issues related to power consumption and heat dissipation. We provide a proof-of-concept demonstration of such a device by solving, in a parallel fashion, the small instance {2, 5, 9} of the subset sum problem, which is a benchmark NP-complete problem. Finally, we discuss the technical advances necessary to make our system scalable with presently available technology. PMID:26903637

  5. A massively parallel computational approach to coupled thermoelastic/porous gas flow problems

    NASA Technical Reports Server (NTRS)

    Shia, David; Mcmanus, Hugh L.

    1995-01-01

    A new computational scheme for coupled thermoelastic/porous gas flow problems is presented. Heat transfer, gas flow, and dynamic thermoelastic governing equations are expressed in fully explicit form, and solved on a massively parallel computer. The transpiration cooling problem is used as an example problem. The numerical solutions have been verified by comparison to available analytical solutions. Transient temperature, pressure, and stress distributions have been obtained. Small spatial oscillations in pressure and stress have been observed, which would be impractical to predict with previously available schemes. Comparisons between serial and massively parallel versions of the scheme have also been made. The results indicate that for small scale problems the serial and parallel versions use practically the same amount of CPU time. However, as the problem size increases the parallel version becomes more efficient than the serial version.

  6. Partitioning and packing mathematical simulation models for calculation on parallel computers

    NASA Technical Reports Server (NTRS)

    Arpasi, D. J.; Milner, E. J.

    1986-01-01

    The development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system is described. Degrees of parallelism (i.e., coupling between the equations) and their impact on parallel processing are discussed. The problem of identifying computational parallelism within sets of closely coupled equations that require the exchange of current values of variables is described. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. An algorithm which packs the equations into a minimum number of processors is also described. The results of the packing algorithm when applied to a turbojet engine model are presented in terms of processor utilization.

  7. Efficient solid state NMR powder simulations using SMP and MPP parallel computation.

    PubMed

    Kristensen, Jørgen Holm; Farnan, Ian

    2003-04-01

    Methods for parallel simulation of solid state NMR powder spectra are presented for both shared and distributed memory parallel supercomputers. For shared memory architectures the performance of simulation programs implementing the OpenMP application programming interface is evaluated. It is demonstrated that the design of correct and efficient shared memory parallel programs is difficult as the performance depends on data locality and cache memory effects. The distributed memory parallel programming model is examined for simulation programs using the MPI message passing interface. The results reveal that both shared and distributed memory parallel computation are very efficient with an almost perfect application speedup and may be applied to the most advanced powder simulations. PMID:12713968

  8. Geometrically nonlinear design sensitivity analysis on parallel-vector high-performance computers

    NASA Technical Reports Server (NTRS)

    Baddourah, Majdi A.; Nguyen, Duc T.

    1993-01-01

    Parallel-vector solution strategies for generation and assembly of element matrices, solution of the resulted system of linear equations, calculations of the unbalanced loads, displacements, stresses, and design sensitivity analysis (DSA) are all incorporated into the Newton Raphson (NR) procedure for nonlinear finite element analysis and DSA. Numerical results are included to show the performance of the proposed method for structural analysis and DSA in a parallel-vector computer environment.

  9. Parallel computation for reservoir thermal simulation: An overlapping domain decomposition approach

    NASA Astrophysics Data System (ADS)

    Wang, Zhongxiao

    2005-11-01

    In this dissertation, we are involved in parallel computing for the thermal simulation of multicomponent, multiphase fluid flow in petroleum reservoirs. We report the development and applications of such a simulator. Unlike many efforts made to parallelize locally the solver of a linear equations system which affects the performance the most, this research takes a global parallelization strategy by decomposing the computational domain into smaller subdomains. This dissertation addresses the domain decomposition techniques and, based on the comparison, adopts an overlapping domain decomposition method. This global parallelization method hands over each subdomain to a single processor of the parallel computer to process. Communication is required when handling overlapping regions between subdomains. For this purpose, MPI (message passing interface) is used for data communication and communication control. A physical and mathematical model is introduced for the reservoir thermal simulation. Numerical tests on two sets of industrial data of practical oilfields indicate that this model and the parallel implementation match the history data accurately. Therefore, we expect to use both the model and the parallel code to predict oil production and guide the design, implementation and real-time fine tuning of new well operating schemes. A new adaptive mechanism to synchronize processes on different processors has been introduced, which not only ensures the computational accuracy but also improves the time performance. To accelerate the convergence rate of iterative solution of the large linear equations systems derived from the discretization of governing equations of our physical and mathematical model in space and time, we adopt the ORTHOMIN method in conjunction with an incomplete LU factorization preconditioning technique. Important improvements have been made in both ORTHOMIN method and incomplete LU factorization in order to enhance time performance without affecting

  10. Seventh SIAM Conference on Parallel Processing for Scientific Computing. Final technical report

    SciTech Connect

    1996-10-01

    The Seventh SIAM Conference on Parallel Processing for Scientific Computing was held in downtown San Francisco on the dates above. More than 400 people attended the meeting. This SIAM conference is, in this organizer`s opinion, the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Other, related areas, most notably parallel software and applications, are also well represented. The strong contributed sessions and minisymposia at the meeting attest to these claims.

  11. The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete; Saini, Subhash (Technical Monitor)

    1998-01-01

    Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.

  12. Sound Radiation from Ducted Fans Using Computational Aeroacoustics on Parallel Computers.

    NASA Astrophysics Data System (ADS)

    Ozyoruk, Yusuf

    1995-01-01

    As a component of a more advanced, new generation fan noise prediction technology, a computational aeroacoustics algorithm has been developed using an entirely new approach. Unlike previous approaches, the current method accounts for the nonuniform background flow and aerodynamic-acoustic coupling issues by solving the 3-D, time-dependent, full nonlinear Euler equations (although the developed computer program is a Navier-Stokes solver). The equations are solved on a 3-D body fitted curvilinear coordinate system using temporally and spatially 4th-order accurate finite difference, Runge -Kutta time integration. The time-accurate flow field is determined only in a relatively small physical domain using nonreflecting boundary conditions on its outer boundaries. A moving surface Kirchhoff method using the formulation of Farassat and Myers has been developed and coupled to the flow solver for far-field noise predictions. The acoustic field is obtained by subtracting the mean field from the total field. To establish the mean flow field, steady state solutions are required and Jameson's full approximation storage multigrid method has been extended to make use of the current high resolution algorithm for obtaining such solutions fast. Formulations in cylindrical coordinates together with cell-centered finite differencing are used to effectively treat the grid singularity along the centerline. Well designed grids aid this treatment. A 3-D grid generator has been developed using the conformal mappings of Ives and Menor to provide the hybrid radiation code with capabilities for very rapid and good quality mesh generation. The hybrid radiation code has been written in CM-Fortran, which is essentially High Performance Fortran. Some novel optimization procedures have been developed and implemented in the code, which runs efficiently on the CM-200 and CM-5 parallel computers. The code has been tested solving a large variety of problems, ranging from an oscillating piston problem

  13. Reverse Computation for Rollback-based Fault Tolerance in Large Parallel Systems

    SciTech Connect

    Perumalla, Kalyan S; Park, Alfred J

    2013-01-01

    Reverse computation is presented here as an important future direction in addressing the challenge of fault tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.

  14. Automated CFD Parameter Studies on Distributed Parallel Computers

    NASA Technical Reports Server (NTRS)

    Rogers, Stuart E.; Aftosmis, Michael; Pandya, Shishir; Tejnil, Edward; Ahmad, Jasim; Kwak, Dochan (Technical Monitor)

    2002-01-01

    The objective of the current work is to build a prototype software system which will automated the process of running CFD jobs on Information Power Grid (IPG) resources. This system should remove the need for user monitoring and intervention of every single CFD job. It should enable the use of many different computers to populate a massive run matrix in the shortest time possible. Such a software system has been developed, and is known as the AeroDB script system. The approach taken for the development of AeroDB was to build several discrete modules. These include a database, a job-launcher module, a run-manager module to monitor each individual job, and a web-based user portal for monitoring of the progress of the parameter study. The details of the design of AeroDB are presented in the following section. The following section provides the results of a parameter study which was performed using AeroDB for the analysis of a reusable launch vehicle (RLV). The paper concludes with a section on the lessons learned in this effort, and ideas for future work in this area.

  15. Modeling the Fracture of Ice Sheets on Parallel Computers

    SciTech Connect

    Waisman, Haim; Tuminaro, Ray

    2013-10-10

    The objective of this project was to investigate the complex fracture of ice and understand its role within larger ice sheet simulations and global climate change. This objective was achieved by developing novel physics based models for ice, novel numerical tools to enable the modeling of the physics and by collaboration with the ice community experts. At the present time, ice fracture is not explicitly considered within ice sheet models due in part to large computational costs associated with the accurate modeling of this complex phenomena. However, fracture not only plays an extremely important role in regional behavior but also influences ice dynamics over much larger zones in ways that are currently not well understood. To this end, our research findings through this project offers significant advancement to the field and closes a large gap of knowledge in understanding and modeling the fracture of ice sheets in the polar regions. Thus, we believe that our objective has been achieved and our research accomplishments are significant. This is corroborated through a set of published papers, posters and presentations at technical conferences in the field. In particular significant progress has been made in the mechanics of ice, fracture of ice sheets and ice shelves in polar regions and sophisticated numerical methods that enable the solution of the physics in an efficient way.

  16. Parallel beam dynamics calculations on high performance computers

    NASA Astrophysics Data System (ADS)

    Ryne, Robert; Habib, Salman

    1997-02-01

    Faced with a backlog of nuclear waste and weapons plutonium, as well as an ever-increasing public concern about safety and environmental issues associated with conventional nuclear reactors, many countries are studying new, accelerator-driven technologies that hold the promise of providing safe and effective solutions to these problems. Proposed projects include accelerator transmutation of waste (ATW), accelerator-based conversion of plutonium (ABC), accelerator-driven energy production (ADEP), and accelerator production of tritium (APT). Also, next-generation spallation neutron sources based on similar technology will play a major role in materials science and biological science research. The design of accelerators for these projects will require a major advance in numerical modeling capability. For example, beam dynamics simulations with approximately 100 million particles will be needed to ensure that extremely stringent beam loss requirements (less than a nanoampere per meter) can be met. Compared with typical present-day modeling using 10,000-100,000 particles, this represents an increase of 3-4 orders of magnitude. High performance computing (HPC) platforms make it possible to perform such large scale simulations, which require 10's of GBytes of memory. They also make it possible to perform smaller simulations in a matter of hours that would require months to run on a single processor workstation. This paper will describe how HPC platforms can be used to perform the numerically intensive beam dynamics simulations required for development of these new accelerator-driven technologies.

  17. Lower bounds on parallel, distributed, and automata computations

    SciTech Connect

    Gereb-Graus, M.

    1989-01-01

    In this thesis the author presents a collection of lower bound results from several areas of computer science. Conventional wisdom states that lower bounds are much more difficult to prove than upper bounds. To get an upper bound one has to demonstrate just one scheme with the appropriate complexity. On the other hand, to prove lower bounds one has to deal with all possible schemes. The difficulty of lower bounds can be further demonstrated by the fact that wherever for some problem he has a very large gap between the lower and the upper bound, the conjecture for the truth usually is the known upper bound. His first two results are impossibility results for finite state automata. A hierarchy of complexity classes on tree languages (analogous to the polynomial hierarchy) accepted by alternating finite state machines is introduced. It turns out that the alternating class is equal to the well known tree language class accepted by the treeautomata. By separating the deterministic and the nondeterministic classes of his hierarchy he gives a negative answer to the folklore question whether the expressive power of the treeautomata is the same as that of the finite state automaton that can walk on the edges of the tree (bugautomaton). He proves that three-head one-way DFA cannot perform string-matching, that is, no three-head one-way DFA accepts the language L = (x{number sign}y {vert bar} x is a substring of y, where x,y {element of} (0,1){sup *}). He proves that in a one round fair coin flipping (or voting) scheme with n participants, there is at least one participant who has a chance to decide the outcome with probability at least 3/n {minus} o(1/n).

  18. A parallel real-time computing-cluster implementation of spotlight SAR processing

    NASA Astrophysics Data System (ADS)

    Mathew, Bipin; Rabinkin, Daniel

    2005-05-01

    The high resolution imaging capability of Synthetic Aperture Radar (SAR) is largely unaffected by atmospheric conditions and has proven to be an indispensable asset in a variety of military and civilian applications. Application of SAR methodology for real-time imaging however carries with it the large computational complexity and storage requirements of the image-forming algorithms. Recently however, the rapidly diminishing cost of computing hardware and the related ascent of cluster-based computing, has made parallelization of these algorithms an appealing area of investigation. This paper describes a parallel SAR processor developed at MIT Lincoln Laboratory. Several novel technologies were employed in it's implementation, including pMatlab which is a parallel extension of standard Matlab that is also being developed at MIT Lincoln Laboratory. These technologies will be described later in the document. We begin with a brief description of the basic SAR algorithm.

  19. Scalable simulations for directed self-assembly patterning with the use of GPU parallel computing

    NASA Astrophysics Data System (ADS)

    Yoshimoto, Kenji; Peters, Brandon L.; Khaira, Gurdaman S.; de Pablo, Juan J.

    2012-03-01

    Directed self-assembly (DSA) patterning has been increasingly investigated as an alternative lithographic process for future technology nodes. One of the critical specs for DSA patterning is defects generated through annealing process or by roughness of pre-patterned structure. Due to their high sensitivity to the process and wafer conditions, however, characterization of those defects still remain challenging. DSA simulations can be a powerful tool to predict the formation of the DSA defects. In this work, we propose a new method to perform parallel computing of DSA Monte Carlo (MC) simulations. A consumer graphics card was used to access its hundreds of processing units for parallel computing. By partitioning the simulation system into non-interacting domains, we were able to run MC trial moves in parallel on multiple graphics-processing units (GPUs). Our results show a significant improvement in computational performance.

  20. Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming

    NASA Technical Reports Server (NTRS)

    Dorband, John E.; Aburdene, Maurice F.

    2002-01-01

    Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.

  1. Adaptive methods and parallel computation for partial differential equations. Final report

    SciTech Connect

    Biswas, R.; Benantar, M.; Flaherty, J.E.

    1992-05-01

    Consider the adaptive solution of two-dimensional vector systems of hyperbolic and elliptic partial differential equations on shared-memory parallel computers. Hyperbolic systems are approximated by an explicit finite volume technique and solved by a recursive local mesh refinement procedure on a tree-structured grid. Local refinement of the time steps and spatial cells of a coarse base mesh is performed in regions where a refinement indicator exceeds a prescribed tolerance. Computational procedures that sequentially traverse the tree while processing solutions on each grid in parallel, that process solutions at the same tree level in parallel, and that dynamically assign processors to nodes of the tree have been developed and applied to an example. Computational results comparing a variety of heuristic processor load balancing techniques and refinement strategies are presented.

  2. Parallel computing simulation of fluid flow in the unsaturated zone of Yucca Mountain, Nevada.

    PubMed

    Zhang, Keni; Wu, Yu-Shu; Bodvarsson, G S

    2003-01-01

    This paper presents the application of parallel computing techniques to large-scale modeling of fluid flow in the unsaturated zone (UZ) at Yucca Mountain, Nevada. In this study, parallel computing techniques, as implemented into the TOUGH2 code, are applied in large-scale numerical simulations on a distributed-memory parallel computer. The modeling study has been conducted using an over-1-million-cell three-dimensional numerical model, which incorporates a wide variety of field data for the highly heterogeneous fractured formation at Yucca Mountain. The objective of this study is to analyze the impact of various surface infiltration scenarios (under current and possible future climates) on flow through the UZ system, using various hydrogeological conceptual models with refined grids. The results indicate that the 1-million-cell models produce better resolution results and reveal some flow patterns that cannot be obtained using coarse-grid modeling models. PMID:12714301

  3. Massively parallel computing simulation of fluid flow in the unsaturated zone of Yucca Mountain, Nevada

    SciTech Connect

    Zhang, Keni; Wu, Yu-Shu; Bodvarsson, G.S.

    2001-08-31

    This paper presents the application of parallel computing techniques to large-scale modeling of fluid flow in the unsaturated zone (UZ) at Yucca Mountain, Nevada. In this study, parallel computing techniques, as implemented into the TOUGH2 code, are applied in large-scale numerical simulations on a distributed-memory parallel computer. The modeling study has been conducted using an over-one-million-cell three-dimensional numerical model, which incorporates a wide variety of field data for the highly heterogeneous fractured formation at Yucca Mountain. The objective of this study is to analyze the impact of various surface infiltration scenarios (under current and possible future climates) on flow through the UZ system, using various hydrogeological conceptual models with refined grids. The results indicate that the one-million-cell models produce better resolution results and reveal some flow patterns that cannot be obtained using coarse-grid modeling models.

  4. Storing files in a parallel computing system based on user-specified parser function

    DOEpatents

    Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Manzanares, Adam; Torres, Aaron

    2014-10-21

    Techniques are provided for storing files in a parallel computing system based on a user-specified parser function. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a parser from the distributed application for processing the plurality of files prior to storage; and storing one or more of the plurality of files in one or more storage nodes of the parallel computing system based on the processing by the parser. The plurality of files comprise one or more of a plurality of complete files and a plurality of sub-files. The parser can optionally store only those files that satisfy one or more semantic requirements of the parser. The parser can also extract metadata from one or more of the files and the extracted metadata can be stored with one or more of the plurality of files and used for searching for files.

  5. Experiments with data flow on a general-purpose parallel computer. Memorandum report

    SciTech Connect

    Spertus, E.; Dally, W.J.

    1991-01-01

    The MIT J-Machine, a massively-parallel computer, is an experiment in providing general-purpose mechanisms for communication, synchronization, and naming that will support a wide variety of parallel models of computation. Having universal mechanisms allows the separation of issues of language design and machine organization. The authors have developed two experimental dataflow programming systems for the J-Machine. For the first system, they adapted Papadopoulos' explicit token store to implement static and then dynamic dataflow. Each node in a dataflow graph is expanded into a sequence of code, each of which is scheduled individually at runtime. For a later system, they made use of Iannucci's hybrid execution model to combine several dataflow graph nodes into a single sequence, decreasing scheduling overhead. By combining the strengths of the two systems, it is possible to produce a system with competitive performance. They have demonstrated the feasibility of efficiently executing dataflow programs on a general purpose parallel computer.

  6. Spatiotemporal Domain Decomposition for Massive Parallel Computation of Space-Time Kernel Density

    NASA Astrophysics Data System (ADS)

    Hohl, A.; Delmelle, E. M.; Tang, W.

    2015-07-01

    Accelerated processing capabilities are deemed critical when conducting analysis on spatiotemporal datasets of increasing size, diversity and availability. High-performance parallel computing offers the capacity to solve computationally demanding problems in a limited timeframe, but likewise poses the challenge of preventing processing inefficiency due to workload imbalance between computing resources. Therefore, when designing new algorithms capable of implementing parallel strategies, careful spatiotemporal domain decomposition is necessary to account for heterogeneity in the data. In this study, we perform octtree-based adaptive decomposition of the spatiotemporal domain for parallel computation of space-time kernel density. In order to avoid edge effects near subdomain boundaries, we establish spatiotemporal buffers to include adjacent data-points that are within the spatial and temporal kernel bandwidths. Then, we quantify computational intensity of each subdomain to balance workloads among processors. We illustrate the benefits of our methodology using a space-time epidemiological dataset of Dengue fever, an infectious vector-borne disease that poses a severe threat to communities in tropical climates. Our parallel implementation of kernel density reaches substantial speedup compared to sequential processing, and achieves high levels of workload balance among processors due to great accuracy in quantifying computational intensity. Our approach is portable of other space-time analytical tests.

  7. Fluid/Structure Interaction Studies of Aircraft Using High Fidelity Equations on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru; VanDalsem, William (Technical Monitor)

    1994-01-01

    Abstract Aeroelasticity which involves strong coupling of fluids, structures and controls is an important element in designing an aircraft. Computational aeroelasticity using low fidelity methods such as the linear aerodynamic flow equations coupled with the modal structural equations are well advanced. Though these low fidelity approaches are computationally less intensive, they are not adequate for the analysis of modern aircraft such as High Speed Civil Transport (HSCT) and Advanced Subsonic Transport (AST) which can experience complex flow/structure interactions. HSCT can experience vortex induced aeroelastic oscillations whereas AST can experience transonic buffet associated structural oscillations. Both aircraft may experience a dip in the flutter speed at the transonic regime. For accurate aeroelastic computations at these complex fluid/structure interaction situations, high fidelity equations such as the Navier-Stokes for fluids and the finite-elements for structures are needed. Computations using these high fidelity equations require large computational resources both in memory and speed. Current conventional super computers have reached their limitations both in memory and speed. As a result, parallel computers have evolved to overcome the limitations of conventional computers. This paper will address the transition that is taking place in computational aeroelasticity from conventional computers to parallel computers. The paper will address special techniques needed to take advantage of the architecture of new parallel computers. Results will be illustrated from computations made on iPSC/860 and IBM SP2 computer by using ENSAERO code that directly couples the Euler/Navier-Stokes flow equations with high resolution finite-element structural equations.

  8. Applications of Parallel Computation in Micro-Mechanics and Finite Element Method

    NASA Technical Reports Server (NTRS)

    Tan, Hui-Qian

    1996-01-01

    This project discusses the application of parallel computations related with respect to material analyses. Briefly speaking, we analyze some kind of material by elements computations. We call an element a cell here. A cell is divided into a number of subelements called subcells and all subcells in a cell have the identical structure. The detailed structure will be given later in this paper. It is obvious that the problem is "well-structured". SIMD machine would be a better choice. In this paper we try to look into the potentials of SIMD machine in dealing with finite element computation by developing appropriate algorithms on MasPar, a SIMD parallel machine. In section 2, the architecture of MasPar will be discussed. A brief review of the parallel programming language MPL also is given in that section. In section 3, some general parallel algorithms which might be useful to the project will be proposed. And, combining with the algorithms, some features of MPL will be discussed in more detail. In section 4, the computational structure of cell/subcell model will be given. The idea of designing the parallel algorithm for the model will be demonstrated. Finally in section 5, a summary will be given.

  9. Parallel, distributed and GPU computing technologies in single-particle electron microscopy

    PubMed Central

    Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

    2009-01-01

    Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686

  10. Exploiting parallel computing with limited program changes using a network of microcomputers

    NASA Technical Reports Server (NTRS)

    Rogers, J. L., Jr.; Sobieszczanski-Sobieski, J.

    1985-01-01

    Network computing and multiprocessor computers are two discernible trends in parallel processing. The computational behavior of an iterative distributed process in which some subtasks are completed later than others because of an imbalance in computational requirements is of significant interest. The effects of asynchronus processing was studied. A small existing program was converted to perform finite element analysis by distributing substructure analysis over a network of four Apple IIe microcomputers connected to a shared disk, simulating a parallel computer. The substructure analysis uses an iterative, fully stressed, structural resizing procedure. A framework of beams divided into three substructures is used as the finite element model. The effects of asynchronous processing on the convergence of the design variables are determined by not resizing particular substructures on various iterations.

  11. Implementation of the morphological shared-weight neural network (MSNN) for target recognition on the Parallel Algebraic Logic (PAL) computer

    NASA Astrophysics Data System (ADS)

    Li, Hongzheng; Shi, Hongchi; Gader, Paul D.; Keller, James M.

    1998-09-01

    The morphological shared-weight neural network (MSNN) is an effective approach to automatic target recognition. Implementation of the network in parallel is critical for real-time target recognition systems. Although there is significant parallelism inherent in the MSNN, it is a challenge to implement it on an SIMD parallel computer consisting of a large array of simple processing elements. This paper discusses issues related to detection accuracy and throughput in implementing the MSNN on the Parallel Algebraic Logic computer.

  12. User-microprogrammable, local host computer with low-level parallelism

    SciTech Connect

    Tomita, S.; Shibayama, K.; Kitamura, T.; Nakata, T.; Hagiwara, H.

    1983-01-01

    This paper describes the architecture of a dynamically microprogrammable computer with low-level parallelism, called QA-2, which is designed as a high-performance, local host computer for laboratory use. The architectural principle of the QA-2 is the marriage of high-speed, parallel processing capability offered by four powerful arithmetic and logic units (ALUS) with architectural flexibility provided by large scale, dynamic user-microprogramming. By changing its writable control storage dynamically, the QA-2 can be tailored to a wide spectrum of research-oriented applications covering high-level language processing and real-time processing. 11 references.

  13. Acceleration of the matrix multiplication of Radiance three phase daylighting simulations with parallel computing on heterogeneous hardware of personal computer

    SciTech Connect

    Zuo, Wangda; McNeil, Andrew; Wetter, Michael; Lee, Eleanor S.

    2013-05-23

    Building designers are increasingly relying on complex fenestration systems to reduce energy consumed for lighting and HVAC in low energy buildings. Radiance, a lighting simulation program, has been used to conduct daylighting simulations for complex fenestration systems. Depending on the configurations, the simulation can take hours or even days using a personal computer. This paper describes how to accelerate the matrix multiplication portion of a Radiance three-phase daylight simulation by conducting parallel computing on heterogeneous hardware of a personal computer. The algorithm was optimized and the computational part was implemented in parallel using OpenCL. The speed of new approach was evaluated using various daylighting simulation cases on a multicore central processing unit and a graphics processing unit. Based on the measurements and analysis of the time usage for the Radiance daylighting simulation, further speedups can be achieved by using fast I/O devices and storing the data in a binary format.

  14. DVS-SOFTWARE: An Effective Tool for Applying Highly Parallelized Hardware To Computational Geophysics

    NASA Astrophysics Data System (ADS)

    Herrera, I.; Herrera, G. S.

    2015-12-01

    Most geophysical systems are macroscopic physical systems. The behavior prediction of such systems is carried out by means of computational models whose basic models are partial differential equations (PDEs) [1]. Due to the enormous size of the discretized version of such PDEs it is necessary to apply highly parallelized super-computers. For them, at present, the most efficient software is based on non-overlapping domain decomposition methods (DDM). However, a limiting feature of the present state-of-the-art techniques is due to the kind of discretizations used in them. Recently, I. Herrera and co-workers using 'non-overlapping discretizations' have produced the DVS-Software which overcomes this limitation [2]. The DVS-software can be applied to a great variety of geophysical problems and achieves very high parallel efficiencies (90%, or so [3]). It is therefore very suitable for effectively applying the most advanced parallel supercomputers available at present. In a parallel talk, in this AGU Fall Meeting, Graciela Herrera Z. will present how this software is being applied to advance MOD-FLOW. Key Words: Parallel Software for Geophysics, High Performance Computing, HPC, Parallel Computing, Domain Decomposition Methods (DDM)REFERENCES [1]. Herrera Ismael and George F. Pinder, Mathematical Modelling in Science and Engineering: An axiomatic approach", John Wiley, 243p., 2012. [2]. Herrera, I., de la Cruz L.M. and Rosas-Medina A. "Non Overlapping Discretization Methods for Partial, Differential Equations". NUMER METH PART D E, 30: 1427-1454, 2014, DOI 10.1002/num 21852. (Open source) [3]. Herrera, I., & Contreras Iván "An Innovative Tool for Effectively Applying Highly Parallelized Software To Problems of Elasticity". Geofísica Internacional, 2015 (In press)

  15. Application of parallel computing to seismic damage process simulation of an arch dam

    NASA Astrophysics Data System (ADS)

    Zhong, Hong; Lin, Gao; Li, Jianbo

    2010-06-01

    The simulation of damage process of high arch dam subjected to strong earthquake shocks is significant to the evaluation of its performance and seismic safety, considering the catastrophic effect of dam failure. However, such numerical simulation requires rigorous computational capacity. Conventional serial computing falls short of that and parallel computing is a fairly promising solution to this problem. The parallel finite element code PDPAD was developed for the damage prediction of arch dams utilizing the damage model with inheterogeneity of concrete considered. Developed with programming language Fortran, the code uses a master/slave mode for programming, domain decomposition method for allocation of tasks, MPI (Message Passing Interface) for communication and solvers from AZTEC library for solution of large-scale equations. Speedup test showed that the performance of PDPAD was quite satisfactory. The code was employed to study the damage process of a being-built arch dam on a 4-node PC Cluster, with more than one million degrees of freedom considered. The obtained damage mode was quite similar to that of shaking table test, indicating that the proposed procedure and parallel code PDPAD has a good potential in simulating seismic damage mode of arch dams. With the rapidly growing need for massive computation emerged from engineering problems, parallel computing will find more and more applications in pertinent areas.

  16. Efficient multi-objective calibration of a computationally intensive hydrologic model with parallel computing software in Python

    Technology Transfer Automated Retrieval System (TEKTRAN)

    With enhanced data availability, distributed watershed models for large areas with high spatial and temporal resolution are increasingly used to understand water budgets and examine effects of human activities and climate change/variability on water resources. Developing parallel computing software...

  17. OpenMP-style parallelism in data-centered multicore computing with R

    SciTech Connect

    Jiang, Lei; Patel, Pragneshkumar B; Ostrouchov, George; Jamitzky, Ferdinand

    2012-01-01

    R is a domain specific language widely used for data analysis by the statistics community as well as by researchers in finance, biology, social sciences, and many other disciplines. As R programs are linked to input data, the exponential growth of available data makes high-performance computing with R imperative. To ease the process of writing parallel programs in R, code transformation from a sequential program to a parallel version would bring much convenience to R users. In this paper, we present our work in semiautomatic parallelization of R codes with user-added OpenMPstyle pragmas. While such pragmas are used at the frontend, we take advantage of multiple parallel backends with different R packages. We provide flexibility for importing parallelism with plug-in components, impose built-in MapReduce for data processing, and also maintain code reusability. We illustrate the advantage of the on-the-fly mechanisms which can lead to significant applications in data-centered parallel computing.

  18. Parallel variable-band Choleski solvers for computational structural analysis applications on vector multiprocessor supercomputers

    NASA Technical Reports Server (NTRS)

    Poole, E. L.; Overman, A. L.

    1991-01-01

    A Choleski method used to solve linear systems of equations that arise in large scale structural analyses is described. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is also used for two different parallel implementations, demonstrating the use of CRAY macrotasking. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both the CRAY-2 and CRAY Y-MP computers. CPU and wall clock timings are given for the various parallel methods and are compared to single processor timings of the same algorithm. Computation rates over 1 GIGAFLOP (1 billion floating point operations per second) on a four processor CRAY-2 and over 2 GIGAFLOPS on an eight processor CRAY Y-MP are demonstrated as measured by wall clock time in a dedicated environment. Reduced wall clock times for the parallel methods relative to the single processor implementation of the same Choleski algorithm are also demonstrated for runs made in multi-user mode.

  19. Fast parallel algorithms that compute transitive closure of a fuzzy relation

    NASA Technical Reports Server (NTRS)

    Kreinovich, Vladik YA.

    1993-01-01

    The notion of a transitive closure of a fuzzy relation is very useful for clustering in pattern recognition, for fuzzy databases, etc. The original algorithm proposed by L. Zadeh (1971) requires the computation time O(n(sup 4)), where n is the number of elements in the relation. In 1974, J. C. Dunn proposed a O(n(sup 2)) algorithm. Since we must compute n(n-1)/2 different values s(a, b) (a not equal to b) that represent the fuzzy relation, and we need at least one computational step to compute each of these values, we cannot compute all of them in less than O(n(sup 2)) steps. So, Dunn's algorithm is in this sense optimal. For small n, it is ok. However, for big n (e.g., for big databases), it is still a lot, so it would be desirable to decrease the computation time (this problem was formulated by J. Bezdek). Since this decrease cannot be done on a sequential computer, the only way to do it is to use a computer with several processors working in parallel. We show that on a parallel computer, transitive closure can be computed in time O((log(sub 2)(n))2).

  20. Optimization on a Network-based Parallel Computer System for Supersonic Laminar Wing Design

    NASA Technical Reports Server (NTRS)

    Garcia, Joseph A.; Cheung, Samson; Holst, Terry L. (Technical Monitor)

    1995-01-01

    A set of Computational Fluid Dynamics (CFD) routines and flow transition prediction tools are integrated into a network based parallel numerical optimization routine. Through this optimization routine, the design of a 2-D airfoil and an infinitely swept wing will be studied in order to advance the design cycle capability of supersonic laminar flow wings. The goal of advancing supersonic laminar flow wing design is achieved by wisely choosing the design variables used in the optimization routine. The design variables are represented by the theory of Fourier series and potential theory. These theories, combined with the parallel CFD flow routines and flow transition prediction tools, provide a design space for a global optimal point to be searched. Finally, the parallel optimization routine enables gradient evaluations to be performed in a fast and parallel fashion.

  1. Aeroelasticity of wing and wing-body configurations on parallel computers

    NASA Technical Reports Server (NTRS)

    Byun, Chansup

    1995-01-01

    The objective of this research is to develop computationally efficient methods for solving aeroelasticity problems on parallel computers. Both uncoupled and coupled methods are studied in this research. For the uncoupled approach, the conventional U-g method is used to determine the flutter boundary. The generalized aerodynamic forces required are obtained by the pulse transfer-function analysis method. For the coupled approach, the fluid-structure interaction is obtained by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.

  2. Method and apparatus of parallel computing with simultaneously operating stream prefetching and list prefetching engines

    SciTech Connect

    Boyle, Peter A.; Christ, Norman H.; Gara, Alan; Mawhinney, Robert D.; Ohmacht, Martin; Sugavanam, Krishnan

    2012-12-11

    A prefetch system improves a performance of a parallel computing system. The parallel computing system includes a plurality of computing nodes. A computing node includes at least one processor and at least one memory device. The prefetch system includes at least one stream prefetch engine and at least one list prefetch engine. The prefetch system operates those engines simultaneously. After the at least one processor issues a command, the prefetch system passes the command to a stream prefetch engine and a list prefetch engine. The prefetch system operates the stream prefetch engine and the list prefetch engine to prefetch data to be needed in subsequent clock cycles in the processor in response to the passed command.

  3. Effecting a broadcast with an allreduce operation on a parallel computer

    DOEpatents

    Almasi, Gheorghe; Archer, Charles J.; Ratterman, Joseph D.; Smith, Brian E.

    2010-11-02

    A parallel computer comprises a plurality of compute nodes organized into at least one operational group for collective parallel operations. Each compute node is assigned a unique rank and is coupled for data communications through a global combining network. One compute node is assigned to be a logical root. A send buffer and a receive buffer is configured. Each element of a contribution of the logical root in the send buffer is contributed. One or more zeros corresponding to a size of the element are injected. An allreduce operation with a bitwise OR using the element and the injected zeros is performed. And the result for the allreduce operation is determined and stored in each receive buffer.

  4. A parallel computational framework for integrated surface-subsurface flow and transport simulations

    NASA Astrophysics Data System (ADS)

    Park, Y.; Hwang, H.; Sudicky, E. A.

    2010-12-01

    HydroGeoSphere is a 3D control-volume finite element hydrologic model describing fully-integrated surface and subsurface water flow and solute and thermal energy transport. Because the model solves tighly-coupled highly-nonlinear partial differential equations, often applied at regional and continental scales (for example, to analyze the impact of climate change on water resources), high performance computing (HPC) is essential. The target parallelization includes the composition of the Jacobian matrix for the iterative linearization method and the sparse-matrix solver, a preconditioned Bi-CGSTAB. The matrix assembly is parallelized by using a coarse-grained scheme in that the local matrix compositions can be performed independently. The preconditioned Bi-CGSTAB algorithm performs a number of LU substitutions, matrix-vector multiplications, and inner products, where the parallelization of the LU substitution is not trivial. The parallelization of the solver is achieved by partitioning the domain into equal-size subdomains, with an efficient reordering scheme. The computational flow of the Bi-CGSTAB solver is also modified to reduce the parallelization overhead and to be suitable for parallel architectures. The parallelized model is tested on several benchmark simulations which include linear and nonlinear flow problems involving various domain sizes and degrees of hydrologic complexities. The performance is evaluated in terms of computational robustness and efficiency, using standard scaling performance measures. The results of simulation profiling indicate that the efficiency becomes higher with an increasing number of nodes/elements in the mesh, for increasingly nonlinear transient simulations, and with domains of irregular geometry. These characteristics are promising for the large-scale analysis water resources problems involved integrated surface/subsurface flow regimes.

  5. The use of inexact ODE solver in waveform relaxation methods on a massively parallel computer

    SciTech Connect

    Luk, W.S.; Wing, O.

    1995-12-01

    This paper presents the use of inexact ordinary differential equation (ODE) solver in waveform relaxation methods for solving initial value problems: Since the conventional ODE solvers are inherently sequential, the inexact ODE solver is used by taking time points from only previous waveform iteration for time integration. As a result, this method is truly massively parallel, as the equation is completely unfolded both in system and in time. Convergence analysis shows that the spectral radius of the iteration equation resulting from the {open_quotes}inexact{close_quotes} solver is the same as that from the standard method, and hence the new method is robust. The parallel implementation issues on the DECmpp 12000/Sx computer will also be discussed. Numerical results illustrate that though the number of iterations in the inexact method is increased over the exact method, as expected, the computation time is much reduced because of the large-scale parallelism.

  6. Fast parallel molecular algorithms for DNA-based computation: factoring integers.

    PubMed

    Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui

    2005-06-01

    The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that the cryptosystems using public-key are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations. PMID:16117023

  7. Three-dimensional aerodynamic design optimization using discrete sensitivity analysis and parallel computing

    NASA Astrophysics Data System (ADS)

    Oloso, Amidu Olawale

    A hybrid automatic differentiation/incremental iterative method was implemented in the general purpose advanced computational fluid dynamics code (CFL3D Version 4.1) to yield a new code (CFL3D.ADII) that is capable of computing consistently discrete first order sensitivity derivatives for complex geometries. With the exception of unsteady problems, the new code retains all the useful features and capabilities of the original CFL3D flow analysis code. The superiority of the new code over a carefully applied method of finite-differences is demonstrated. A coarse grain, scalable, distributed-memory, parallel version of CFL3D.ADII was developed based on "derivative stripmining". In this data-parallel approach, an identical copy of CFL3D.ADII is executed on each processor with different derivative input files. The effect of communication overhead on the overall parallel computational efficiency is negligible. However, the fraction of CFL3D.ADII duplicated on all processors has significant impact on the computational efficiency. To reduce the large execution time associated with the sequential 1-D line search in gradient-based aerodynamic optimization, an alternative parallel approach was developed. The execution time of the new approach was reduced effectively to that of one flow analysis, regardless of the number of function evaluations in the 1-D search. The new approach was found to yield design results that are essentially identical to those obtained from the traditional sequential approach but at much smaller execution time. The parallel CFL3D.ADII and the parallel 1-D line search are demonstrated in shape improvement studies of a realistic High Speed Civil Transport (HSCT) wing/body configuration represented by over 100 design variables and 200,000 grid points in inviscid supersonic flow on the 16 node IBM SP2 parallel computer at the Numerical Aerospace Simulation (NAS) facility, NASA Ames Research Center. In addition to making the handling of such a large

  8. An improved coarse-grained parallel algorithm for computational acceleration of ordinary Kriging interpolation

    NASA Astrophysics Data System (ADS)

    Hu, Hongda; Shu, Hong

    2015-05-01

    Heavy computation limits the use of Kriging interpolation methods in many real-time applications, especially with the ever-increasing problem size. Many researchers have realized that parallel processing techniques are critical to fully exploit computational resources and feasibly solve computation-intensive problems like Kriging. Much research has addressed the parallelization of traditional approach to Kriging, but this computation-intensive procedure may not be suitable for high-resolution interpolation of spatial data. On the basis of a more effective serial approach, we propose an improved coarse-grained parallel algorithm to accelerate ordinary Kriging interpolation. In particular, the interpolation task of each unobserved point is considered as a basic parallel unit. To reduce time complexity and memory consumption, the large right hand side matrix in the Kriging linear system is transformed and fixed at only two columns and therefore no longer directly relevant to the number of unobserved points. The MPI (Message Passing Interface) model is employed to implement our parallel programs in a homogeneous distributed memory system. Experimentally, the improved parallel algorithm performs better than the traditional one in spatial interpolation of annual average precipitation in Victoria, Australia. For example, when the number of processors is 24, the improved algorithm keeps speed-up at 20.8 while the speed-up of the traditional algorithm only reaches 9.3. Likewise, the weak scaling efficiency of the improved algorithm is nearly 90% while that of the traditional algorithm almost drops to 40% with 16 processors. Experimental results also demonstrate that the performance of the improved algorithm is enhanced by increasing the problem size.

  9. Alleviating Search Uncertainty through Concept Associations: Automatic Indexing, Co-Occurrence Analysis, and Parallel Computing.

    ERIC Educational Resources Information Center

    Chen, Hsinchun; Martinez, Joanne; Kirchhoff, Amy; Ng, Tobun D.; Schatz, Bruce R.

    1998-01-01

    Grounded on object filtering, automatic indexing, and co-occurrence analysis, an experiment was performed using a parallel supercomputer to analyze over 400,000 abstracts in an INSPEC computer engineering collection. A user evaluation revealed that system-generated thesauri were better than the human-generated INSPEC subject thesaurus in concept…

  10. Parallel Structures of Computer-Assisted Signature Pedagogy: The Case of Integrated Spreadsheets

    ERIC Educational Resources Information Center

    Abramovich, Sergei; Easton, Jonathan; Hayes, Victoria O.

    2012-01-01

    This article was motivated by the authors' work on a project with a group of 2nd-grade students in a computer lab of a rural school in upstate New York. From this project, one goal of which was to provide a capstone experience for a teacher candidate in teaching application-oriented mathematics with technology, the ideas about parallel structures…

  11. Tracking the Continuity of Language Comprehension: Computer Mouse Trajectories Suggest Parallel Syntactic Processing

    ERIC Educational Resources Information Center

    Farmer, Thomas A.; Cargill, Sarah A.; Hindy, Nicholas C.; Dale, Rick; Spivey, Michael J.

    2007-01-01

    Although several theories of online syntactic processing assume the parallel activation of multiple syntactic representations, evidence supporting simultaneous activation has been inconclusive. Here, the continuous and non-ballistic properties of computer mouse movements are exploited, by recording their streaming x, y coordinates to procure…

  12. High Performance Parallel Processing Project: Industrial computing initiative. Progress reports for fiscal year 1995

    SciTech Connect

    Koniges, A.

    1996-02-09

    This project is a package of 11 individual CRADA`s plus hardware. This innovative project established a three-year multi-party collaboration that is significantly accelerating the availability of commercial massively parallel processing computing software technology to U.S. government, academic, and industrial end-users. This report contains individual presentations from nine principal investigators along with overall program information.

  13. A distributed, dynamic, parallel computational model: the role of noise in velocity storage

    PubMed Central

    Merfeld, Daniel M.

    2012-01-01

    Networks of neurons perform complex calculations using distributed, parallel computation, including dynamic “real-time” calculations required for motion control. The brain must combine sensory signals to estimate the motion of body parts using imperfect information from noisy neurons. Models and experiments suggest that the brain sometimes optimally minimizes the influence of noise, although it remains unclear when and precisely how neurons perform such optimal computations. To investigate, we created a model of velocity storage based on a relatively new technique–“particle filtering”–that is both distributed and parallel. It extends existing observer and Kalman filter models of vestibular processing by simulating the observer model many times in parallel with noise added. During simulation, the variance of the particles defining the estimator state is used to compute the particle filter gain. We applied our model to estimate one-dimensional angular velocity during yaw rotation, which yielded estimates for the velocity storage time constant, afferent noise, and perceptual noise that matched experimental data. We also found that the velocity storage time constant was Bayesian optimal by comparing the estimate of our particle filter with the estimate of the Kalman filter, which is optimal. The particle filter demonstrated a reduced velocity storage time constant when afferent noise increased, which mimics what is known about aminoglycoside ablation of semicircular canal hair cells. This model helps bridge the gap between parallel distributed neural computation and systems-level behavioral responses like the vestibuloocular response and perception. PMID:22514288

  14. Solutions of TEAM Problem No. 13 using integral equations in a sequential and parallel computing environment

    SciTech Connect

    Kettunen, L.; Forsman, K.; Levine, D.; Gropp, W.

    1993-12-31

    In this paper a brief discussion of h-type volume integral formulations implemented in GFUNET/CORAL code is given and solutions of TEAM benchmark No. 13 are shown. GFUNET/CORAL is a general purpose code for 2D and 3D magnetostatics. Solutions of TEAM problem No. 13 are computed using both a sequential and parallel version of GFUNET/CORAL.

  15. A single user efficiency measure for evaluation of parallel or pipeline computer architectures

    NASA Technical Reports Server (NTRS)

    Jones, W. P.

    1978-01-01

    A precise statement of the relationship between sequential computation at one rate, parallel or pipeline computation at a much higher rate, the data movement rate between levels of memory, the fraction of inherently sequential operations or data that must be processed sequentially, the fraction of data to be moved that cannot be overlapped with computation, and the relative computational complexity of the algorithms for the two processes, scalar and vector, was developed. The relationship should be applied to the multirate processes that obtain in the employment of various new or proposed computer architectures for computational aerodynamics. The relationship, an efficiency measure that the single user of the computer system perceives, argues strongly in favor of separating scalar and vector processes, sometimes referred to as loosely coupled processes, to achieve optimum use of hardware.

  16. Parallel algorithms for computational geometry utilizing a fixed number of processors

    SciTech Connect

    Strader, R.G.

    1988-01-01

    The design of algorithms for systems where both communication and computation are important is presented. Approaches to parallel computation and the underlying theoretical models are surveyed. two models of computation are developed, both based on a divide-and-conquer strategy. The first utilizes a tree-like merge resulting in several levels of communication and computation, the total number determined by the number of processors. The second model contains a fixed number of levels independent of the number of processors. Using the notation from the survey and the models of computation, algorithms are designed for the computational geometry problems of finding the convex hull and Delaunay triangulation for a set of uniform random points in the Euclidean plane. Communication and computation timing measurements based on these algorithms are presented and analyzed. The results are then generalized to predict the behavior of expanded problems. Architectural support, partitioning issues, and limitations of this approach are discussed.

  17. Some computational challenges of developing efficient parallel algorithms for data-dependent computations in thermal-hydraulics supercomputer applications

    SciTech Connect

    Woodruff, S.B.

    1992-05-01

    The Transient Reactor Analysis Code (TRAC), which features a two- fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers algorithms, such as a neural net representation, that do not exhibit algorithms, such as a neural net representation, that do not exhibit load-balancing problems.

  18. MiniGhost : a miniapp for exploring boundary exchange strategies using stencil computations in scientific parallel computing.

    SciTech Connect

    Barrett, Richard Frederick; Heroux, Michael Allen; Vaughan, Courtenay Thomas

    2012-04-01

    A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spacial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available architectures provide strong inter-node bandwidth relative to latency costs, many codes 'bulk up' these messages by aggregating data into a message as a means of reducing the number of messages. A renewed focus on non-traditional architectures and architecture features provides new opportunities for exploring alternatives to this programming approach. In this report we describe miniGhost, a 'miniapp' designed for exploration of the capabilities of current as well as emerging and future architectures within the context of these sorts of applications. MiniGhost joins the suite of miniapps developed as part of the Mantevo project.

  19. Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems

    NASA Technical Reports Server (NTRS)

    Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.

    2000-01-01

    The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.

  20. A parallel simulated annealing algorithm for standard cell placement on a hypercube computer

    NASA Technical Reports Server (NTRS)

    Jones, Mark Howard

    1987-01-01

    A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.