Science.gov

Sample records for parallel scientific computing

  1. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1995-01-01

    The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.

  2. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1991-01-01

    The main contribution of the effort in the last two years is the introduction of the MOPPS system. After doing extensive literature search, we introduced the system which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications on a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.

  3. Parallel hypergraph partitioning for scientific computing.

    SciTech Connect

    Heaphy, Robert; Devine, Karen Dragon; Catalyurek, Umit; Bisseling, Robert; Hendrickson, Bruce Alan; Boman, Erik Gunnar

    2005-07-01

    Graph partitioning is often used for load balancing in parallel computing, but it is known that hypergraph partitioning has several advantages. First, hypergraphs more accurately model communication volume, and second, they are more expressive and can better represent nonsymmetric problems. Hypergraph partitioning is particularly suited to parallel sparse matrix-vector multiplication, a common kernel in scientific computing. We present a parallel software package for hypergraph (and sparse matrix) partitioning developed at Sandia National Labs. The algorithm is a variation on multilevel partitioning. Our parallel implementation is novel in that it uses a two-dimensional data distribution among processors. We present empirical results that show our parallel implementation achieves good speedup on several large problems (up to 33 million nonzeros) with up to 64 processors on a Linux cluster.

  4. PISCES: An environment for parallel scientific computation

    NASA Technical Reports Server (NTRS)

    Pratt, T. W.

    1985-01-01

    The parallel implementation of scientific computing environment (PISCES) is a project to provide high-level programming environments for parallel MIMD computers. Pisces 1, the first of these environments, is a FORTRAN 77 based environment which runs under the UNIX operating system. The Pisces 1 user programs in Pisces FORTRAN, an extension of FORTRAN 77 for parallel processing. The major emphasis in the Pisces 1 design is in providing a carefully specified virtual machine that defines the run-time environment within which Pisces FORTRAN programs are executed. Each implementation then provides the same virtual machine, regardless of differences in the underlying architecture. The design is intended to be portable to a variety of architectures. Currently Pisces 1 is implemented on a network of Apollo workstations and on a DEC VAX uniprocessor via simulation of the task level parallelism. An implementation for the Flexible Computing Corp. FLEX/32 is under construction. An introduction to the Pisces 1 virtual computer and the FORTRAN 77 extensions is presented. An example of an algorithm for the iterative solution of a system of equations is given. The most notable features of the design are the provision for several granularities of parallelism in programs and the provision of a window mechanism for distributed access to large arrays of data.

  5. Accurate modeling of parallel scientific computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Townsend, James C.

    1988-01-01

    Scientific codes are usually parallelized by partitioning a grid among processors. To achieve top performance it is necessary to partition the grid so as to balance workload and minimize communication/synchronization costs. This problem is particularly acute when the grid is irregular, changes over the course of the computation, and is not known until load time. Critical mapping and remapping decisions rest on the ability to accurately predict performance, given a description of a grid and its partition. This paper discusses one approach to this problem, and illustrates its use on a one-dimensional fluids code. The models constructed are shown to be accurate, and are used to find optimal remapping schedules.

  6. Lattice gauge theory on the Intel parallel scientific computer

    NASA Astrophysics Data System (ADS)

    Gottlieb, Steven

    1990-08-01

    Intel Scientific Computers (ISC) has just started producing its third general of parallel computer, the iPSC/860. Based on the i860 chip that has a peak performance of 80 Mflops and with a current maximum of 128 nodes, this computer should achieve speeds in excess of those obtainable on conventional vector supercomputers. The hardware, software and computing techniques appropriate for lattice gauge theory calculations are described. The differences between a staggered fermion conjugate gradient program written under CANOPY and for the iPSC are detailed.

  7. Review of An Introduction to Parallel and Vector Scientific Computing

    SciTech Connect

    Bailey, David H.; Lefton, Lew

    2006-06-30

    On one hand, the field of high-performance scientific computing is thriving beyond measure. Performance of leading-edge systems on scientific calculations, as measured say by the Top500 list, has increased by an astounding factor of 8000 during the 15-year period from 1993 to 2008, which is slightly faster even than Moore's Law. Even more importantly, remarkable advances in numerical algorithms, numerical libraries and parallel programming environments have led to improvements in the scope of what can be computed that are entirely on a par with the advances in computing hardware. And these successes have spread far beyond the confines of large government-operated laboratories, many universities, modest-sized research institutes and private firms now operate clusters that differ only in scale from the behemoth systems at the large-scale facilities. In the wake of these recent successes, researchers from fields that heretofore have not been part of the scientific computing world have been drawn into the arena. For example, at the recent SC07 conference, the exhibit hall, which long has hosted displays from leading computer systems vendors and government laboratories, featured some 70 exhibitors who had not previously participated. In spite of all these exciting developments, and in spite of the clear need to present these concepts to a much broader technical audience, there is a perplexing dearth of training material and textbooks in the field, particularly at the introductory level. Only a handful of universities offer coursework in the specific area of highly parallel scientific computing, and instructors of such courses typically rely on custom-assembled material. For example, the present reviewer and Robert F. Lucas relied on materials assembled in a somewhat ad-hoc fashion from colleagues and personal resources when presenting a course on parallel scientific computing at the University of California, Berkeley, a few years ago. Thus it is indeed refreshing to see

  8. Applications of parallel supercomputers: Scientific results and computer science lessons

    SciTech Connect

    Fox, G.C.

    1989-07-12

    Parallel Computing has come of age with several commercial and inhouse systems that deliver supercomputer performance. We illustrate this with several major computations completed or underway at Caltech on hypercubes, transputer arrays and the SIMD Connection Machine CM-2 and AMT DAP. Applications covered are lattice gauge theory, computational fluid dynamics, subatomic string dynamics, statistical and condensed matter physics,theoretical and experimental astronomy, quantum chemistry, plasma physics, grain dynamics, computer chess, graphics ray tracing, and Kalman filters. We use these applications to compare the performance of several advanced architecture computers including the conventional CRAY and ETA-10 supercomputers. We describe which problems are suitable for which computers in the terms of a matching between problem and computer architecture. This is part of a set of lessons we draw for hardware, software, and performance. We speculate on the emergence of new academic disciplines motivated by the growing importance of computers. 138 refs., 23 figs., 10 tabs.

  9. SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008

    SciTech Connect

    2008-09-08

    The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.

  10. Parallel computers

    SciTech Connect

    Treveaven, P.

    1989-01-01

    This book presents an introduction to object-oriented, functional, and logic parallel computing on which the fifth generation of computer systems will be based. Coverage includes concepts for parallel computing languages, a parallel object-oriented system (DOOM) and its language (POOL), an object-oriented multilevel VLSI simulator using POOL, and implementation of lazy functional languages on parallel architectures.

  11. Parallel computing works

    SciTech Connect

    Not Available

    1991-10-23

    An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

  12. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    SciTech Connect

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  13. MiniGhost : a miniapp for exploring boundary exchange strategies using stencil computations in scientific parallel computing.

    SciTech Connect

    Barrett, Richard Frederick; Heroux, Michael Allen; Vaughan, Courtenay Thomas

    2012-04-01

    A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spacial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available architectures provide strong inter-node bandwidth relative to latency costs, many codes 'bulk up' these messages by aggregating data into a message as a means of reducing the number of messages. A renewed focus on non-traditional architectures and architecture features provides new opportunities for exploring alternatives to this programming approach. In this report we describe miniGhost, a 'miniapp' designed for exploration of the capabilities of current as well as emerging and future architectures within the context of these sorts of applications. MiniGhost joins the suite of miniapps developed as part of the Mantevo project.

  14. Highly parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.; Tichy, Walter F.

    1990-01-01

    Highly parallel computing architectures are the only means to achieve the computation rates demanded by advanced scientific problems. A decade of research has demonstrated the feasibility of such machines and current research focuses on which architectures designated as multiple instruction multiple datastream (MIMD) and single instruction multiple datastream (SIMD) have produced the best results to date; neither shows a decisive advantage for most near-homogeneous scientific problems. For scientific problems with many dissimilar parts, more speculative architectures such as neural networks or data flow may be needed.

  15. Improving scalability with loop transformations and message aggregation in parallel object-oriented frameworks for scientific computing

    SciTech Connect

    Bassetti, F.; Davis, K.; Quinlan, D.

    1998-09-01

    Application codes reliably achieve performance far less than the advertised capabilities of existing architectures, and this problem is worsening with increasingly-parallel machines. For large-scale numerical applications, stencil operations often impose the great part of the computational cost, and the primary sources of inefficiency are the costs of message passing and poor cache utilization. This paper proposes and demonstrates optimizations for stencil and stencil-like computations for both serial and parallel environments that ameliorate these sources of inefficiency. Achieving scalability, they believe, requires both algorithm design and compile-time support. The optimizations they present are automatable because the stencil-like computations are implemented at a high level of abstraction using object-oriented parallel array class libraries. These optimizations, which are beyond the capabilities of today compilers, may be performed automatically by a preprocessor such as the one they are currently developing.

  16. Merlin - Massively parallel heterogeneous computing

    NASA Technical Reports Server (NTRS)

    Wittie, Larry; Maples, Creve

    1989-01-01

    Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.

  17. Visualizing Parallel Computer System Performance

    NASA Technical Reports Server (NTRS)

    Malony, Allen D.; Reed, Daniel A.

    1988-01-01

    Parallel computer systems are among the most complex of man's creations, making satisfactory performance characterization difficult. Despite this complexity, there are strong, indeed, almost irresistible, incentives to quantify parallel system performance using a single metric. The fallacy lies in succumbing to such temptations. A complete performance characterization requires not only an analysis of the system's constituent levels, it also requires both static and dynamic characterizations. Static or average behavior analysis may mask transients that dramatically alter system performance. Although the human visual system is remarkedly adept at interpreting and identifying anomalies in false color data, the importance of dynamic, visual scientific data presentation has only recently been recognized Large, complex parallel system pose equally vexing performance interpretation problems. Data from hardware and software performance monitors must be presented in ways that emphasize important events while eluding irrelevant details. Design approaches and tools for performance visualization are the subject of this paper.

  18. Parallel Computing in SCALE

    SciTech Connect

    DeHart, Mark D; Williams, Mark L; Bowman, Stephen M

    2010-01-01

    The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement

  19. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  20. Parallel distributed computing using Python

    NASA Astrophysics Data System (ADS)

    Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro

    2011-09-01

    This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.

  1. Parallel computation and computers for artificial intelligence

    SciTech Connect

    Kowalik, J.S. )

    1988-01-01

    This book discusses Parallel Processing in Artificial Intelligence; Parallel Computing using Multilisp; Execution of Common Lisp in a Parallel Environment; Qlisp; Restricted AND-Parallel Execution of Logic Programs; PARLOG: Parallel Programming in Logic; and Data-driven Processing of Semantic Nets. Attention is also given to: Application of the Butterfly Parallel Processor in Artificial Intelligence; On the Range of Applicability of an Artificial Intelligence Machine; Low-level Vision on Warp and the Apply Programming Mode; AHR: A Parallel Computer for Pure Lisp; FAIM-1: An Architecture for Symbolic Multi-processing; and Overview of Al Application Oriented Parallel Processing Research in Japan.

  2. Computers in Scientific Instrumentation.

    ERIC Educational Resources Information Center

    Enke, C. G.

    1982-01-01

    Computer applications in scientific instrumentation are traced from early data processing to modern computer-based instruments. Probable pathways toward instruments with increased "intelligence" include, among others, implementation of hierarchical computer networks and microprocessor controllers and the simplification of programing. The…

  3. Scientific Grid computing.

    PubMed

    Coveney, Peter V

    2005-08-15

    We introduce a definition of Grid computing which is adhered to throughout this Theme Issue. We compare the evolution of the World Wide Web with current aspirations for Grid computing and indicate areas that need further research and development before a generally usable Grid infrastructure becomes available. We discuss work that has been done in order to make scientific Grid computing a viable proposition, including the building of Grids, middleware developments, computational steering and visualization. We review science that has been enabled by contemporary computational Grids, and associated progress made through the widening availability of high performance computing.

  4. Parallel algorithms for matrix computations

    SciTech Connect

    Plemmons, R.J.

    1990-01-01

    The present conference on parallel algorithms for matrix computations encompasses both shared-memory systems and distributed-memory systems, as well as combinations of the two, to provide an overall perspective on parallel algorithms for both dense and sparse matrix computations in solving systems of linear equations, dense or structured problems related to least-squares computations, eigenvalue computations, singular-value computations, and rapid elliptic solvers. Specific issues addressed include the influence of parallel and vector architectures on algorithm design, computations for distributed-memory architectures such as hypercubes, solutions for sparse symmetric positive definite linear systems, symbolic and numeric factorizations, and triangular solutions. Also addressed are reference sources for parallel and vector numerical algorithms, sources for machine architectures, and sources for programming languages.

  5. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.

  6. Parallel computation using limited resources

    SciTech Connect

    Sugla, B.

    1985-01-01

    This thesis addresses itself to the task of designing and analyzing parallel algorithms when the resources of processors, communication, and time are limited. The two parts of this thesis deal with multiprocessor systems and VLSI - the two important parallel processing environments that are prevalent today. In the first part a time-processor-communication tradeoff analysis is conducted for two kinds of problems - N input, 1 output, and N input, N output computations. In the class of problems of the second kind, the problem of prefix computation, an important problem due to the number of naturally occurring computations it can model, is studied. Finally, a general methodology is given for design of parallel algorithms that can be used to optimize a given design to a wide set of architectural variations. The second part of the thesis considers the design of parallel algorithms for the VLSI model of computation when the resource of time is severely restricted.

  7. Structural mechanics computations on parallel computing platforms

    SciTech Connect

    Kulak, R.F.; Plaskacz, E.J.; Pfeiffer, P.A.

    1995-06-01

    With recent advances in parallel supercomputers and network-connected workstations, the solution to large scale structural engineering problems has now become tractable. High-performance computer architectures, which are usually available at large universities and national laboratories, now can solve large nonlinear problems. At the other end of the spectrum, network connected workstations can be configured to become a distributed-parallel computer. This approach is attractive to small, medium and large engineering firms. This paper describes the development of a parallelized finite element computer program for the solution of static, nonlinear structural mechanics problems.

  8. Experimental Parallel-Processing Computer

    NASA Technical Reports Server (NTRS)

    Mcgregor, J. W.; Salama, M. A.

    1986-01-01

    Master processor supervises slave processors, each with its own memory. Computer with parallel processing serves as inexpensive tool for experimentation with parallel mathematical algorithms. Speed enhancement obtained depends on both nature of problem and structure of algorithm used. In parallel-processing architecture, "bank select" and control signals determine which one, if any, of N slave processor memories accessible to master processor at any given moment. When so selected, slave memory operates as part of master computer memory. When not selected, slave memory operates independently of main memory. Slave processors communicate with each other via input/output bus.

  9. Accelerating Scientific Computations using FPGAs

    NASA Astrophysics Data System (ADS)

    Pell, O.; Atasu, K.; Mencer, O.

    Field Programmable Gate Arrays (FPGAs) are semiconductor devices that contain a grid of programmable cells, which the user configures to implement any digital circuit of up to a few million gates. Modern FPGAs allow the user to reconfigure these circuits many times each second, making FPGAs fully programmable and general purpose. Recent FPGA technology provides sufficient resources to tackle scientific applications on large-scale parallel systems. As a case study, we implement the Fast Fourier Transform [1] in a flexible floating point implementation. We utilize A Stream Compiler [2] (ASC) which combines C++ syntax with flexible floating point support by providing a 'HWfloat' data-type. The resulting FFT can be targeted to a variety of FPGA platforms in FFTW-style, though not yet completely automatically. The resulting FFT circuit can be adapted to the particular resources available on the system. The optimal implementation of an FFT accelerator depends on the length and dimensionality of the FFT, the available FPGA area, the available hard DSP blocks, the FPGA board architecture, and the precision and range of the application [3]. Software-style object-orientated abstractions allow us to pursue an accelerated pace of development by maximizing re-use of design patterns. ASC allows a few core hardware descriptions to generate hundreds of different circuit variants to meet particular speed, area and precision goals. The key to achieving maximum acceleration of FFT computation is to match memory and compute bandwidths so that maximum use is made of computational resources. Modern FPGAs contain up to hundreds of independent SRAM banks to store intermediate results, providing ample scope for optimizing memory parallelism. At 175Mhz, one of Maxeler's Radix-4 FFT cores computes 4x as many 1024pt FFTs per second as a dual Pentium-IV Xeon machine running FFTW. Eight such parallel cores fit onto the largest FPGA in the Xilinx Virtex-4 family, providing a 32x speed-up over

  10. Scientific Computing with Python

    NASA Astrophysics Data System (ADS)

    Beazley, D. M.

    Scripting languages have become a powerful tool for the construction of flexible scientific software because they provide scientists with an interpreted programming environment, can be easily interfaced with existing software written in C, C++, and Fortran, and can serve as a framework for modular software construction. In this paper, I describe the process of adding a scripting language to a scientific computing project by focusing on the use of Python with a large-scale molecular dynamics code developed for materials science research at Los Alamos National Laboratory. Although this application is not related to astronomical data analysis, the problems, solutions, and lessons learned may be of interest to researchers who are considering the use of scripting languages with their own projects.

  11. Turbomachinery CFD on parallel computers

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Milner, Edward J.; Quealy, Angela; Townsend, Scott E.

    1992-01-01

    The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized. Parallel turbomachinery CFD research at the NASA Lewis Research Center is then highlighted. These efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.

  12. Computer simulation and scientific visualization

    SciTech Connect

    Weber, D.P.; Moszur, F.M.

    1990-01-01

    The simulation of processes in engineering and the physical sciences has progressed rapidly over the last several years. With rapid developments in supercomputers, parallel processing, numerical algorithms and software, scientists and engineers are now positioned to quantitatively simulate systems requiring many billions of arithmetic operations. The need to understand and assimilate such massive amounts of data has been a driving force in the development of both hardware and software to create visual representations of the underling physical systems. In this paper, and the accompanying videotape, the evolution and development of the visualization process in scientific computing will be reviewed. Specific applications and associated imaging hardware and software technology illustrate both the computational needs and the evolving trends. 6 refs.

  13. Parallel computing in atmospheric chemistry models

    SciTech Connect

    Rotman, D.

    1996-02-01

    Studies of atmospheric chemistry are of high scientific interest, involve computations that are complex and intense, and require enormous amounts of I/O. Current supercomputer computational capabilities are limiting the studies of stratospheric and tropospheric chemistry and will certainly not be able to handle the upcoming coupled chemistry/climate models. To enable such calculations, the authors have developed a computing framework that allows computations on a wide range of computational platforms, including massively parallel machines. Because of the fast paced changes in this field, the modeling framework and scientific modules have been developed to be highly portable and efficient. Here, the authors present the important features of the framework and focus on the atmospheric chemistry module, named IMPACT, and its capabilities. Applications of IMPACT to aircraft studies will be presented.

  14. Computational electromagnetics and parallel dense matrix computations

    SciTech Connect

    Forsman, K.; Kettunen, L.; Gropp, W.; Levine, D.

    1995-06-01

    We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.

  15. The science of computing - Parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, P. J.

    1985-01-01

    Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concommitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.

  16. Parallel computers and parallel algorithms for CFD: An introduction

    NASA Astrophysics Data System (ADS)

    Roose, Dirk; Vandriessche, Rafael

    1995-10-01

    This text presents a tutorial on those aspects of parallel computing that are important for the development of efficient parallel algorithms and software for computational fluid dynamics. We first review the main architectural features of parallel computers and we briefly describe some parallel systems on the market today. We introduce some important concepts concerning the development and the performance evaluation of parallel algorithms. We discuss how work load imbalance and communication costs on distributed memory parallel computers can be minimized. We present performance results for some CFD test cases. We focus on applications using structured and block structured grids, but the concepts and techniques are also valid for unstructured grids.

  17. Computing contingency statistics in parallel.

    SciTech Connect

    Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre

    2010-09-01

    Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel.We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.

  18. High Performance Parallel Computational Nanotechnology

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    At a recent press conference, NASA Administrator Dan Goldin encouraged NASA Ames Research Center to take a lead role in promoting research and development of advanced, high-performance computer technology, including nanotechnology. Manufacturers of leading-edge microprocessors currently perform large-scale simulations in the design and verification of semiconductor devices and microprocessors. Recently, the need for this intensive simulation and modeling analysis has greatly increased, due in part to the ever-increasing complexity of these devices, as well as the lessons of experiences such as the Pentium fiasco. Simulation, modeling, testing, and validation will be even more important for designing molecular computers because of the complex specification of millions of atoms, thousands of assembly steps, as well as the simulation and modeling needed to ensure reliable, robust and efficient fabrication of the molecular devices. The software for this capacity does not exist today, but it can be extrapolated from the software currently used in molecular modeling for other applications: semi-empirical methods, ab initio methods, self-consistent field methods, Hartree-Fock methods, molecular mechanics; and simulation methods for diamondoid structures. In as much as it seems clear that the application of such methods in nanotechnology will require powerful, highly powerful systems, this talk will discuss techniques and issues for performing these types of computations on parallel systems. We will describe system design issues (memory, I/O, mass storage, operating system requirements, special user interface issues, interconnects, bandwidths, and programming languages) involved in parallel methods for scalable classical, semiclassical, quantum, molecular mechanics, and continuum models; molecular nanotechnology computer-aided designs (NanoCAD) techniques; visualization using virtual reality techniques of structural models and assembly sequences; software required to

  19. Parallel Tensor Compression for Large-Scale Scientific Data.

    SciTech Connect

    Kolda, Tamara G.; Ballard, Grey; Austin, Woody Nathan

    2015-10-01

    As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data. By viewing the data as a dense five way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 10000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.

  20. Trajectory optimization using parallel shooting method on parallel computer

    SciTech Connect

    Wirthman, D.J.; Park, S.Y.; Vadali, S.R.

    1995-03-01

    The efficiency of a parallel shooting method on a parallel computer for solving a variety of optimal control guidance problems is studied. Several examples are considered to demonstrate that a speedup of nearly 7 to 1 is achieved with the use of 16 processors. It is suggested that further improvements in performance can be achieved by parallelizing in the state domain. 10 refs.

  1. Parallel computing in enterprise modeling.

    SciTech Connect

    Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.

    2008-08-01

    This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.

  2. Parallel Pascal - An extended Pascal for parallel computers

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.

    1984-01-01

    Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

  3. Advanced Scientific Computing Environment Team new scientific database management task

    SciTech Connect

    Church, J.P.; Roberts, J.C.; Sims, R.N.; Smetana, A.O.; Westmoreland, B.W.

    1991-06-01

    The mission of the ASCENT Team is to continually keep pace with, evaluate, and select emerging computing technologies to define and implement prototypic scientific environments that maximize the ability of scientists and engineers to manage scientific data. These environments are to be implemented in a manner consistent with the site computing architecture and standards and NRTSC/SCS strategic plans for scientific computing. The major trends in computing hardware and software technology clearly indicate that the future computer'' will be a network environment that comprises supercomputers, graphics boxes, mainframes, clusters, workstations, terminals, and microcomputers. This network computer'' will have an architecturally transparent operating system allowing the applications code to run on any box supplying the required computing resources. The environment will include a distributed database and database managing system(s) that permits use of relational, hierarchical, object oriented, GIS, et al, databases. To reach this goal requires a stepwise progression from the present assemblage of monolithic applications codes running on disparate hardware platforms and operating systems. The first steps include converting from the existing JOSHUA system to a new J80 system that complies with modern language standards, development of a new J90 prototype to provide JOSHUA capabilities on Unix platforms, development of portable graphics tools to greatly facilitate preparation of input and interpretation of output; and extension of Jvv'' concepts and capabilities to distributed and/or parallel computing environments.

  4. An Expert Assistant for Computer Aided Parallelization

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    The prototype implementation of an expert system was developed to assist the user in the computer aided parallelization process. The system interfaces to tools for automatic parallelization and performance analysis. By fusing static program structure information and dynamic performance analysis data the expert system can help the user to filter, correlate, and interpret the data gathered by the existing tools. Sections of the code that show poor performance and require further attention are rapidly identified and suggestions for improvements are presented to the user. In this paper we describe the components of the expert system and discuss its interface to the existing tools. We present a case study to demonstrate the successful use in full scale scientific applications.

  5. Parallel Computing Using Web Servers and "Servlets".

    ERIC Educational Resources Information Center

    Lo, Alfred; Bloor, Chris; Choi, Y. K.

    2000-01-01

    Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…

  6. Broadcasting a message in a parallel computer

    DOEpatents

    Berg, Jeremy E.; Faraj, Ahmad A.

    2011-08-02

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.

  7. Software Engineering for Scientific Computer Simulations

    NASA Astrophysics Data System (ADS)

    Post, Douglass E.; Henderson, Dale B.; Kendall, Richard P.; Whitney, Earl M.

    2004-11-01

    Computer simulation is becoming a very powerful tool for analyzing and predicting the performance of fusion experiments. Simulation efforts are evolving from including only a few effects to many effects, from small teams with a few people to large teams, and from workstations and small processor count parallel computers to massively parallel platforms. Successfully making this transition requires attention to software engineering issues. We report on the conclusions drawn from a number of case studies of large scale scientific computing projects within DOE, academia and the DoD. The major lessons learned include attention to sound project management including setting reasonable and achievable requirements, building a good code team, enforcing customer focus, carrying out verification and validation and selecting the optimum computational mathematics approaches.

  8. Computers and Computation. Readings from Scientific American.

    ERIC Educational Resources Information Center

    Fenichel, Robert R.; Weizenbaum, Joseph

    A collection of articles from "Scientific American" magazine has been put together at this time because the current period in computer science is one of consolidation rather than innovation. A few years ago, computer science was moving so swiftly that even the professional journals were more archival than informative; but today it is much easier…

  9. Accelerating Scientific Discovery Through Computation and Visualization

    PubMed Central

    Sims, James S.; Hagedorn, John G.; Ketcham, Peter M.; Satterfield, Steven G.; Griffin, Terence J.; George, William L.; Fowler, Howland A.; am Ende, Barbara A.; Hung, Howard K.; Bohn, Robert B.; Koontz, John E.; Martys, Nicos S.; Bouldin, Charles E.; Warren, James A.; Feder, David L.; Clark, Charles W.; Filla, B. James; Devaney, Judith E.

    2000-01-01

    The rate of scientific discovery can be accelerated through computation and visualization. This acceleration results from the synergy of expertise, computing tools, and hardware for enabling high-performance computation, information science, and visualization that is provided by a team of computation and visualization scientists collaborating in a peer-to-peer effort with the research scientists. In the context of this discussion, high performance refers to capabilities beyond the current state of the art in desktop computing. To be effective in this arena, a team comprising a critical mass of talent, parallel computing techniques, visualization algorithms, advanced visualization hardware, and a recurring investment is required to stay beyond the desktop capabilities. This article describes, through examples, how the Scientific Applications and Visualization Group (SAVG) at NIST has utilized high performance parallel computing and visualization to accelerate condensate modeling, (2) fluid flow in porous materials and in other complex geometries, (3) flows in suspensions, (4) x-ray absorption, (5) dielectric breakdown modeling, and (6) dendritic growth in alloys. PMID:27551642

  10. Implementing clips on a parallel computer

    NASA Technical Reports Server (NTRS)

    Riley, Gary

    1987-01-01

    The C language integrated production system (CLIPS) is a forward chaining rule based language to provide training and delivery for expert systems. Conceptually, rule based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.

  11. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  12. Enabling Computational Technologies for Terascale Scientific Simulations

    SciTech Connect

    Ashby, S.F.

    2000-08-24

    We develop scalable algorithms and object-oriented code frameworks for terascale scientific simulations on massively parallel processors (MPPs). Our research in multigrid-based linear solvers and adaptive mesh refinement enables Laboratory programs to use MPPs to explore important physical phenomena. For example, our research aids stockpile stewardship by making practical detailed 3D simulations of radiation transport. The need to solve large linear systems arises in many applications, including radiation transport, structural dynamics, combustion, and flow in porous media. These systems result from discretizations of partial differential equations on computational meshes. Our first research objective is to develop multigrid preconditioned iterative methods for such problems and to demonstrate their scalability on MPPs. Scalability describes how total computational work grows with problem size; it measures how effectively additional resources can help solve increasingly larger problems. Many factors contribute to scalability: computer architecture, parallel implementation, and choice of algorithm. Scalable algorithms have been shown to decrease simulation times by several orders of magnitude.

  13. File-access characteristics of parallel scientific workloads

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David; Purakayastha, Apratim; Best, Michael; Ellis, Carla Schlatter

    1995-01-01

    Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of parallel file systems. The design of a high-performance parallel file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of parallel file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide parallel file-system design.

  14. Template based parallel checkpointing in a massively parallel computer system

    DOEpatents

    Archer, Charles Jens; Inglett, Todd Alan

    2009-01-13

    A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.

  15. Parallel computation with the force

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1985-01-01

    A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.

  16. Reservoir Thermal Recover Simulation on Parallel Computers

    NASA Astrophysics Data System (ADS)

    Li, Baoyan; Ma, Yuanle

    The rapid development of parallel computers has provided a hardware background for massive refine reservoir simulation. However, the lack of parallel reservoir simulation software has blocked the application of parallel computers on reservoir simulation. Although a variety of parallel methods have been studied and applied to black oil, compositional, and chemical model numerical simulations, there has been limited parallel software available for reservoir simulation. Especially, the parallelization study of reservoir thermal recovery simulation has not been fully carried out, because of the complexity of its models and algorithms. The authors make use of the message passing interface (MPI) standard communication library, the domain decomposition method, the block Jacobi iteration algorithm, and the dynamic memory allocation technique to parallelize their serial thermal recovery simulation software NUMSIP, which is being used in petroleum industry in China. The parallel software PNUMSIP was tested on both IBM SP2 and Dawn 1000A distributed-memory parallel computers. The experiment results show that the parallelization of I/O has great effects on the efficiency of parallel software PNUMSIP; the data communication bandwidth is also an important factor, which has an influence on software efficiency. Keywords: domain decomposition method, block Jacobi iteration algorithm, reservoir thermal recovery simulation, distributed-memory parallel computer

  17. Scientific Computing on the Grid

    SciTech Connect

    Allen, Gabrielle; Seidel, Edward; Shalf, John

    2001-12-12

    Computer simulations are becoming increasingly important as the only means for studying and interpreting the complex processes of nature. Yet the scope and accuracy of these simulations are severely limited by available computational power, even using today's most powerful supercomputers. As we endeavor to simulate the true complexity of nature, we will require much larger scale calculations than are possible at present. Such dynamic and large scale applications will require computational grids and grids require development of new latency tolerant algorithms, and sophisticated code frameworks like Cactus to carry out more complex and high fidelity simulations with a massive degree of parallelism.

  18. Comparisons of some large scientific computers

    NASA Technical Reports Server (NTRS)

    Credeur, K. R.

    1981-01-01

    In 1975, the National Aeronautics and Space Administration (NASA) began studies to assess the technical and economic feasibility of developing a computer having sustained computational speed of one billion floating point operations per second and a working memory of at least 240 million words. Such a powerful computer would allow computational aerodynamics to play a major role in aeronautical design and advanced fluid dynamics research. Based on favorable results from these studies, NASA proceeded with developmental plans. The computer was named the Numerical Aerodynamic Simulator (NAS). To help insure that the estimated cost, schedule, and technical scope were realistic, a brief study was made of past large scientific computers. Large discrepancies between inception and operation in scope, cost, or schedule were studied so that they could be minimized with NASA's proposed new compter. The main computers studied were the ILLIAC IV, STAR 100, Parallel Element Processor Ensemble (PEPE), and Shuttle Mission Simulator (SMS) computer. Comparison data on memory and speed were also obtained on the IBM 650, 704, 7090, 360-50, 360-67, 360-91, and 370-195; the CDC 6400, 6600, 7600, CYBER 203, and CYBER 205; CRAY 1; and the Advanced Scientific Computer (ASC). A few lessons learned conclude the report.

  19. West Virginia US Department of Energy experimental program to stimulate competitive research. Section 2: Human resource development; Section 3: Carbon-based structural materials research cluster; Section 3: Data parallel algorithms for scientific computing

    SciTech Connect

    Not Available

    1994-02-02

    This report consists of three separate but related reports. They are (1) Human Resource Development, (2) Carbon-based Structural Materials Research Cluster, and (3) Data Parallel Algorithms for Scientific Computing. To meet the objectives of the Human Resource Development plan, the plan includes K--12 enrichment activities, undergraduate research opportunities for students at the state`s two Historically Black Colleges and Universities, graduate research through cluster assistantships and through a traineeship program targeted specifically to minorities, women and the disabled, and faculty development through participation in research clusters. One research cluster is the chemistry and physics of carbon-based materials. The objective of this cluster is to develop a self-sustaining group of researchers in carbon-based materials research within the institutions of higher education in the state of West Virginia. The projects will involve analysis of cokes, graphites and other carbons in order to understand the properties that provide desirable structural characteristics including resistance to oxidation, levels of anisotropy and structural characteristics of the carbons themselves. In the proposed cluster on parallel algorithms, research by four WVU faculty and three state liberal arts college faculty are: (1) modeling of self-organized critical systems by cellular automata; (2) multiprefix algorithms and fat-free embeddings; (3) offline and online partitioning of data computation; and (4) manipulating and rendering three dimensional objects. This cluster furthers the state Experimental Program to Stimulate Competitive Research plan by building on existing strengths at WVU in parallel algorithms.

  20. Parallel computations and control of adaptive structures

    NASA Technical Reports Server (NTRS)

    Park, K. C.; Alvin, Kenneth F.; Belvin, W. Keith; Chong, K. P. (Editor); Liu, S. C. (Editor); Li, J. C. (Editor)

    1991-01-01

    The equations of motion for structures with adaptive elements for vibration control are presented for parallel computations to be used as a software package for real-time control of flexible space structures. A brief introduction of the state-of-the-art parallel computational capability is also presented. Time marching strategies are developed for an effective use of massive parallel mapping, partitioning, and the necessary arithmetic operations. An example is offered for the simulation of control-structure interaction on a parallel computer and the impact of the approach presented for applications in other disciplines than aerospace industry is assessed.

  1. Data-parallel algorithms for image computing

    NASA Astrophysics Data System (ADS)

    Carlotto, Mark J.

    1990-11-01

    Data-parallel algorithms for image computing on the Connection Machine are described. After a brief review of some basic programming concepts in *Lip, a parallel extension of Common Lisp, data-parallel programming paradigms based on a local (diffusion-like) model of computation, the scan model of computation, a general interprocessor communications model, and a region-based model are introduced. Algorithms for connected component labeling, distance transformation, Voronoi diagrams, finding minimum cost paths, local means, shape-from-shading, hidden surface calculations, affine transformation, oblique parallel projection, and spatial operations over regions are presented. An new algorithm for interpolating irregularly spaced data via Voronoi diagrams is also described.

  2. Running Geant on T. Node parallel computer

    SciTech Connect

    Jejcic, A.; Maillard, J.; Silva, J. ); Mignot, B. )

    1990-08-01

    AnInmos transputer-based computer has been utilized to overcome the difficulties due to the limitations on the processing abilities of event parallelism and multiprocessor farms (i.e., the so called bus-crisis) and the concern regarding the growing sizes of databases typical in High Energy Physics. This study was done on the T.Node parallel computer manufactured by TELMAT. Detailed figures are reported concerning the event parallelization. (AIP)

  3. Modified mesh-connected parallel computers

    SciTech Connect

    Carlson, D.A. )

    1988-10-01

    The mesh-connected parallel computer is an important parallel processing organization that has been used in the past for the design of supercomputing systems. In this paper, the authors explore modifications of a mesh-connected parallel computer for the purpose of increasing the efficiency of executing important application programs. These modifications are made by adding one or more global mesh structures to the processing array. They show how our modifications allow asymptotic improvements in the efficiency of executing computations having low to medium interprocessor communication requirements (e.g., tree computations, prefix computations, finding the connected components of a graph). For computations with high interprocessor communication requirements such as sorting, they show that they offer no speedup. They also compare the modified mesh-connected parallel computer to other similar organizations including the pyramid, the X-tree, and the mesh-of-trees.

  4. HPC Infrastructure for Solid Earth Simulation on Parallel Computers

    NASA Astrophysics Data System (ADS)

    Nakajima, K.; Chen, L.; Okuda, H.

    2004-12-01

    Recently, various types of parallel computers with various types of architectures and processing elements (PE) have emerged, which include PC clusters and the Earth Simulator. Moreover, users can easily access to these computer resources through network on Grid environment. It is well-known that thorough tuning is required for programmers to achieve excellent performance on each computer. The method for tuning strongly depends on the type of PE and architecture. Optimization by tuning is a very tough work, especially for developers of applications. Moreover, parallel programming using message passing library such as MPI is another big task for application programmers. In GeoFEM project (http://gefeom.tokyo.rist.or.jp), authors have developed a parallel FEM platform for solid earth simulation on the Earth Simulator, which supports parallel I/O, parallel linear solvers and parallel visualization. This platform can efficiently hide complicated procedures for parallel programming and optimization on vector processors from application programmers. This type of infrastructure is very useful. Source codes developed on PC with single processor is easily optimized on massively parallel computer by linking the source code to the parallel platform installed on the target computer. This parallel platform, called HPC Infrastructure will provide dramatic efficiency, portability and reliability in development of scientific simulation codes. For example, line number of the source codes is expected to be less than 10,000 and porting legacy codes to parallel computer takes 2 or 3 weeks. Original GeoFEM platform supports only I/O, linear solvers and visualization. In the present work, further development for adaptive mesh refinement (AMR) and dynamic load-balancing (DLB) have been carried out. In this presentation, examples of large-scale solid earth simulation using the Earth Simulator will be demonstrated. Moreover, recent results of a parallel computational steering tool using an

  5. The BLAZE language: A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, P.; Vanrosendale, J.

    1985-01-01

    A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.

  6. The BLAZE language - A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Van Rosendale, John

    1987-01-01

    A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.

  7. Heisenberg: Paralleling Scientific and Historical Methods

    NASA Astrophysics Data System (ADS)

    Cofield, Calla

    2007-04-01

    Werner Heisenberg is an important historical subject within the physics community partly because his actions as a human being are discussed nearly as often as his work as a physicist. But does the scientific community establish it's historical ideas with the same methods and standards as it's scientific conclusions? I interviewed Heisenberg's son, Jochen Heisenberg, a professor of physics at UNH. Despite a great amount of literature on Werner Heisenberg, only one historian has interviewed Jochen about his father and few have interviewed Werner's wife. Nature is mysterious and unpredictable, but it doesn't lie or distort like humans, and we believe it can give ``honest'' results. But are we keeping the same standards with history that we do with science? Are we holding historians to these standards and if not, is it up to scientists to not only be keepers of scientific understanding, but historical understanding as well? Shouldn't we record history by using the scientific method, by weighing the best sources of data differently than the less reliable, and are we right to be as stubborn about changing our views on history as we are about changing our views on nature?

  8. Mapping Radiosity Computations to Parallel Processors.

    NASA Astrophysics Data System (ADS)

    Singh, Gautam Bir

    The radiosity method for rendering scenes is gaining popularity because of its ability to accurately model the energy distribution in an environment. As this photonic energy distribution is independent of the viewer's position, generating scenes for different viewpoints only requires hidden surface removal and can be performed in real-time. This makes it more attractive than ray tracing as a technique for modeling illumination. It is quite conceivable that radiosity method will be used for applications in scientific visualization, lighting simulations, CAD/CAM, virtual reality, and medical imaging. Computing radiosity of a scene with moderate to high complexity is tantamount to solving a system of tens of thousands of linear equations. Iterative linear system solvers, such as Gauss-Seidel, Jacobi, or conjugate descent, are quite demanding for a system of equations this large. An alternate approach, known as progressive refinement, offers some computational tractability and delivers an approximate solution relatively quickly. This dissertation presents the results of partitioning the radiosity computation to suitably map on a variety of multiprocessor classes. The effect of problem decomposition on computation and communication components is studied for the shared memory, the message passing and the loosely coupled distributed memory multiprocessors. Kendall Square Research's KSR1 and Intel hypercube iPSC/860 were used for experimenting with the shared memory and message-passing algorithms respectively. A network of IBM RS/6000 was used for understanding coarse grain parallelization techniques. These experiments demonstrated that optimality of parallel algorithms must be considered as a < machine, algorithm > pair. Thus the notion of program portability must also take machine architecture in consideration beside allowing for software compatibility. As the number of polygons for processing complex scenes continues to grow, the subdivision in the object space become

  9. Collectively loading an application in a parallel computer

    DOEpatents

    Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Miller, Samuel J.; Mundy, Michael B.

    2016-01-05

    Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.

  10. Interconnections For Stacked Parallel Computer Modules

    NASA Technical Reports Server (NTRS)

    Johannesson, Richard T.

    1996-01-01

    Concept for interconnecting modules in parallel computers leads to cheaper, smaller, lighter, lower-power computing systems for aerospace, industrial, business, and consumer applications. Computer modules stacked and interconnected in various configurations. Connections among stacks controlled by switching within gateways and/or by addresses on buses.

  11. Parallel unstructured grid generation for computational aerosciences

    NASA Technical Reports Server (NTRS)

    Shephard, Mark S.

    1993-01-01

    The objective of this research project is to develop efficient parallel automatic grid generation procedures for use in computational aerosciences. This effort is focused on a parallel version of the Finite Octree grid generator. Progress made during the first six months is reported.

  12. Parallel Computing Strategies for Irregular Algorithms

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

  13. Massively Parallel Computing: A Sandia Perspective

    SciTech Connect

    Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.

    1999-05-06

    The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant break-throughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.

  14. Charon Message-Passing Toolkit for Scientific Computations

    NASA Technical Reports Server (NTRS)

    VanderWijngarrt, Rob F.; Saini, Subhash (Technical Monitor)

    1998-01-01

    The Charon toolkit for piecemeal development of high-efficiency parallel programs for scientific computing is described. The portable toolkit, callable from C and Fortran, provides flexible domain decompositions and high-level distributed constructs for easy translation of serial legacy code or design to distributed environments. Gradual tuning can subsequently be applied to obtain high performance, possibly by using explicit message passing. Charon also features general structured communications that support stencil-based computations with complex recurrences. Through the separation of partitioning and distribution, the toolkit can also be used for blocking of uni-processor code, and for debugging of parallel algorithms on serial machines. An elaborate review of recent parallelization aids is presented to highlight the need for a toolkit like Charon. Some performance results of parallelizing the NAS Parallel Benchmark SP program using Charon are given, showing good scalability. Some performance results of parallelizing the NAS Parallel Benchmark SP program using Charon are given, showing good scalability.

  15. Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges

    NASA Technical Reports Server (NTRS)

    Djomehri, Mohammad Jahed; Biswas, R.; VanderWijngaart, R.; Yarrow, M.

    2000-01-01

    This paper describes several results of parallel and distributed computing using a large scale production flow solver program. A coarse grained parallelization based on clustering of discretization grids combined with partitioning of large grids for load balancing is presented. An assessment is given of its performance on distributed and distributed-shared memory platforms using large scale scientific problems. An experiment with this solver, adapted to a Wide Area Network execution environment is presented. We also give a comparative performance assessment of computation and communication times on both the tightly and loosely-coupled machines.

  16. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  17. IPython: components for interactive and parallel computing across disciplines. (Invited)

    NASA Astrophysics Data System (ADS)

    Perez, F.; Bussonnier, M.; Frederic, J. D.; Froehle, B. M.; Granger, B. E.; Ivanov, P.; Kluyver, T.; Patterson, E.; Ragan-Kelley, B.; Sailer, Z.

    2013-12-01

    Scientific computing is an inherently exploratory activity that requires constantly cycling between code, data and results, each time adjusting the computations as new insights and questions arise. To support such a workflow, good interactive environments are critical. The IPython project (http://ipython.org) provides a rich architecture for interactive computing with: 1. Terminal-based and graphical interactive consoles. 2. A web-based Notebook system with support for code, text, mathematical expressions, inline plots and other rich media. 3. Easy to use, high performance tools for parallel computing. Despite its roots in Python, the IPython architecture is designed in a language-agnostic way to facilitate interactive computing in any language. This allows users to mix Python with Julia, R, Octave, Ruby, Perl, Bash and more, as well as to develop native clients in other languages that reuse the IPython clients. In this talk, I will show how IPython supports all stages in the lifecycle of a scientific idea: 1. Individual exploration. 2. Collaborative development. 3. Production runs with parallel resources. 4. Publication. 5. Education. In particular, the IPython Notebook provides an environment for "literate computing" with a tight integration of narrative and computation (including parallel computing). These Notebooks are stored in a JSON-based document format that provides an "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.

  18. On evaluating parallel computer systems

    NASA Technical Reports Server (NTRS)

    Adams, George B., III; Brown, Robert L.; Denning, Peter J.

    1985-01-01

    A workshop was held in an attempt to program real problems on the MIT Static Data Flow Machine. Most of the architecture of the machine was specified but some parts were incomplete. The main purpose for the workshop was to explore principles for the evaluation of computer systems employing new architectures. Principles explored were: (1) evaluation must be an integral, ongoing part of a project to develop a computer of radically new architecture; (2) the evaluation should seek to measure the usability of the system as well as its performance; (3) users from the application domains must be an integral part of the evaluation process; and (4) evaluation results should be fed back into the design process. It is concluded that the general organizational principles are achievable in practice from this workshop.

  19. Link failure detection in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Megerian, Mark G.; Smith, Brian E.

    2010-11-09

    Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.

  20. Internode data communications in a parallel computer

    SciTech Connect

    Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.

    2013-09-03

    Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.

  1. Internode data communications in a parallel computer

    SciTech Connect

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.

  2. Locating hardware faults in a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-04-13

    Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.

  3. Finite element computation with parallel VLSI

    NASA Technical Reports Server (NTRS)

    Mcgregor, J.; Salama, M.

    1983-01-01

    This paper describes a parallel processing computer consisting of a 16-bit microcomputer as a master processor which controls and coordinates the activities of 8086/8087 VLSI chip set slave processors working in parallel. The hardware is inexpensive and can be flexibly configured and programmed to perform various functions. This makes it a useful research tool for the development of, and experimentation with parallel mathematical algorithms. Application of the hardware to computational tasks involved in the finite element analysis method is demonstrated by the generation and assembly of beam finite element stiffness matrices. A number of possible schemes for the implementation of N-elements on N- or n-processors (N is greater than n) are described, and the speedup factors of their time consumption are determined as a function of the number of available parallel processors.

  4. Efficient communication in massively parallel computers

    SciTech Connect

    Cypher, R.E.

    1989-01-01

    A fundamental operation in parallel computation is sorting. Sorting is important not only because it is required by many algorithms, but also because it can be used to implement irregular, pointer-based communication. The author studies two algorithms for sorting in massively parallel computers. First, he examines Shellsort. Shellsort is a sorting algorithm that is based on a sequence of parameters called increments. Shellsort can be used to create a parallel sorting device known as a sorting network. Researchers have suggested that if the correct increment sequence is used, an optimal size sorting network can be obtained. All published increment sequences have been monotonically decreasing. He shows that no monotonically decreasing increment sequence will yield an optimal size sorting network. Second, he presents a sorting algorithm called Cubesort. Cubesort is the fastest known sorting algorithm for a variety of parallel computers aver a wide range of parameters. He also presents a paradigm for developing parallel algorithms that have efficient communication. The paradigm, called the data reduction paradigm, consists of using a divide-and-conquer strategy. Both the division and combination phases of the divide-and-conquer algorithm may require irregular, pointer-based communication between processors. However, the problem is divided so as to limit the amount of data that must be communicated. As a result the communication can be performed efficiently. He presents data reduction algorithms for the image component labeling problem, the closest pair problem and four versions of the parallel prefix problem.

  5. Toward a science of parallel computation

    SciTech Connect

    Worlton, W.J.

    1986-01-01

    The evolution of parallel processing over the past several decades can be viewed as the development of a new scientific discipline. Parallel processing has been, and is, undergoing the same evolutionary stages that are common to the development of scientific disciplines in general: exploration, focusing, and maturity. That parallel processing is not yet a science can readily be appreciated by its lack of some of the characteristics typical of mature sciences, such as prescriptive terminology, comprehensive taxonomies, and authoritative fundamental principles. A great deal of outstanding work has been done and the field is experiencing the beginnings of its ''focusing'' phase, i.e., support is being concentrated in a set of the more promising approaches selected from among the larger set of exploratory projects. However, the possible set of parallel-processing concepts is so extensive that exploratory work will probably continue for one or two more decades. In the meantime, the growing maturity of the field will be reflected in the increasing clarity and precision of the terminology, the development of systematic classification of the domain of discourse, the development of basic principles, and the growing number of commercial products that are the outcome of the research and development projects on which support is being focused. In this paper we develop some generalizations of taxonomies and use basic principles to draw conclusions about the extensibility of parallel processor architectures. 7 refs., 5 figs., 2 tabs.

  6. Parallel algorithms for optical digital computers

    SciTech Connect

    Huang, A.

    1983-01-01

    Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.

  7. Optics Program Modified for Multithreaded Parallel Computing

    NASA Technical Reports Server (NTRS)

    Lou, John; Bedding, Dave; Basinger, Scott

    2006-01-01

    A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.

  8. The new landscape of parallel computer architecture

    NASA Astrophysics Data System (ADS)

    Shalf, John

    2007-07-01

    The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.

  9. Wing-Body Aeroelasticity on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Byun, Chansup

    1996-01-01

    This article presents a procedure for computing the aeroelasticity of wing-body configurations on multiple-instruction, multiple-data parallel computers. In this procedure, fluids are modeled using Euler equations discretized by a finite difference method, and structures are modeled using finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. A parallel integration scheme is used to compute aeroelastic responses by solving the coupled fluid and structural equations concurrently while keeping modularity of each discipline. The present procedure is validated by computing the aeroelastic response of a wing and comparing with experiment. Aeroelastic computations are illustrated for a high speed civil transport type wing-body configuration.

  10. Interfacing Computer Aided Parallelization and Performance Analysis

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)

    2003-01-01

    When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.

  11. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching Yuen

    1991-01-01

    A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on a 8192-processor CM-2 over a single processor of a CRAY-2.

  12. Temporal fringe pattern analysis with parallel computing

    SciTech Connect

    Tuck Wah Ng; Kar Tien Ang; Argentini, Gianluca

    2005-11-20

    Temporal fringe pattern analysis is invaluable in transient phenomena studies but necessitates long processing times. Here we describe a parallel computing strategy based on the single-program multiple-data model and hyperthreading processor technology to reduce the execution time. In a two-node cluster workstation configuration we found that execution periods were reduced by 1.6 times when four virtual processors were used. To allow even lower execution times with an increasing number of processors, the time allocated for data transfer, data read, and waiting should be minimized. Parallel computing is found here to present a feasible approach to reduce execution times in temporal fringe pattern analysis.

  13. Highly parallel computer architecture for robotic computation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)

    1991-01-01

    In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  14. Overview of ICLASS research: Reliable and parallel computing

    NASA Technical Reports Server (NTRS)

    Iyer, Ravi K.

    1987-01-01

    An overview of Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS) Research: Reliable and Parallel Computing is presented. Topics covered include: reliable and fault tolerant computing; fault tolerant multiprocessor architectures; fault tolerant matrix computation; and parallel processing.

  15. A high performance scientific cloud computing environment for materials simulations

    NASA Astrophysics Data System (ADS)

    Jorissen, K.; Vila, F. D.; Rehr, J. J.

    2012-09-01

    We describe the development of a scientific cloud computing (SCC) platform that offers high performance computation capability. The platform consists of a scientific virtual machine prototype containing a UNIX operating system and several materials science codes, together with essential interface tools (an SCC toolset) that offers functionality comparable to local compute clusters. In particular, our SCC toolset provides automatic creation of virtual clusters for parallel computing, including tools for execution and monitoring performance, as well as efficient I/O utilities that enable seamless connections to and from the cloud. Our SCC platform is optimized for the Amazon Elastic Compute Cloud (EC2). We present benchmarks for prototypical scientific applications and demonstrate performance comparable to local compute clusters. To facilitate code execution and provide user-friendly access, we have also integrated cloud computing capability in a JAVA-based GUI. Our SCC platform may be an alternative to traditional HPC resources for materials science or quantum chemistry applications.

  16. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-08-12

    Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  17. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective opeartion through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  18. Parallel Analysis and Visualization on Cray Compute Node Linux

    SciTech Connect

    Pugmire, Dave; Ahern, Sean

    2008-01-01

    Capability computer systems are deployed to give researchers the computational power required to investigate and solve key challenges facing the scientific community. As the power of these computer systems increases, the computational problem domain typically increases in size, complexity and scope. These increases strain the ability of commodity analysis and visualization clusters to effectively perform post-processing tasks and provide critical insight and understanding to the computed results. An alternative to purchasing increasingly larger, separate analysis and visualization commodity clusters is to use the computational system itself to perform post-processing tasks. In this paper, the recent successful port of VisIt, a parallel, open source analysis and visualization tool, to compute node linux running on the Cray is detailed. Additionally, the unprecedented ability of this resource for analysis and visualization is discussed and a report on obtained results is presented.

  19. Parallel Object-Oriented Computation Applied to a Finite Element Problem

    NASA Technical Reports Server (NTRS)

    Weissman, Jon B.; Grimshaw, Andrew S.; Ferraro, Robert

    1993-01-01

    The conventional wisdom in the scientific computing community is that the best way to solve large-scale numerically intensive scientific problems on today's parallel MIMD computers is to use Fortran or C programmed in a data-parallel style using low-level message-passing primitives. This approach inevitably leads to nonportable codes, extensive development time, and restricts parallel programming to the domain of the expert programmer. We believe that these problems are not inherent to parallel computing but are the result of the tools used. We will show that comparable performance can be achieved with little effort if better tools that present higher level abstractions are used.

  20. Charon Message-Passing Toolkit for Scientific Computations

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Saini, Subhash (Technical Monitor)

    1998-01-01

    The Charon toolkit for piecemeal development of high-efficiency parallel programs for scientific computing is described. The portable toolkit, callable from C and Fortran, provides flexible domain decompositions and high-level distributed constructs for easy translation of serial legacy code or design to distributed environments. Gradual tuning can subsequently be applied to obtain high performance, possibly by using explicit message passing. Charon also features general structured communications that support stencil-based computations with complex recurrences. Through the separation of partitioning and distribution, the toolkit can also be used for blocking of uni-processor code, and for debugging of parallel algorithms on serial machines. An elaborate review of recent parallelization aids is presented to highlight the need for a toolkit like Charon. Some performance results of parallelizing the NAS Parallel Benchmark SP program using Charon are given, showing good scalability.

  1. Parallel computation of Feynman loop integrals

    NASA Astrophysics Data System (ADS)

    de Doncker, E.; Yuasa, F.

    2012-12-01

    The need for large numbers of compute-intensive integrals, arising in quantum field theory perturbation calculations, justifies the parallelization of loop integrals. In earlier work, we devised effective multivariate methods by iterated (repeated) adaptive numerical integration and extrapolation, applicable for some problem classes where standard multivariate integration techniques fail through integrand singularities. In repeated integration, the function evaluations for the outer integral are independent integral computations for the next lower level and can be distributed to threads in a multi-core parallelization. The distribution level determines the number of dimensions in the inner integral below it and thus the granularity of the computation. We discuss a multi-threaded implementation over the OpenMP Application Program Interface.

  2. Tpetra, and the use of generic programming in scientific computing

    SciTech Connect

    Baker, Christopher G; Heroux, Dr. Michael A

    2012-01-01

    We present Tpetra, a Trilinos package for parallel linear algebra primitives implementing the Petra object model. We describe Tpetra s design, based on generic programming via C++ templated types and template metaprogramming. We discuss some benefits of this approach in the context of scientific computing, with illustrations consisting of code and notable empirical results.

  3. Rectilinear partitioning of irregular data parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1991-01-01

    New mapping algorithms for domain oriented data-parallel computations, where the workload is distributed irregularly throughout the domain, but exhibits localized communication patterns are described. Researchers consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network. Discussed here is an improved algorithm for finding the optimal partitioning in one dimension, new algorithms for partitioning in two dimensions, and optimal partitioning in three dimensions. The application of these algorithms to real problems are discussed.

  4. Efficient parallel global garbage collection on massively parallel computers

    SciTech Connect

    Kamada, Tomio; Matsuoka, Satoshi; Yonezawa, Akinori

    1994-12-31

    On distributed-memory high-performance MPPs where processors are interconnected by an asynchronous network, efficient Garbage Collection (GC) becomes difficult due to inter-node references and references within pending, unprocessed messages. The parallel global GC algorithm (1) takes advantage of reference locality, (2) efficiently traverses references over nodes, (3) admits minimum pause time of ongoing computations, and (4) has been shown to scale up to 1024 node MPPs. The algorithm employs a global weight counting scheme to substantially reduce message traffic. The two methods for confirming the arrival of pending messages are used: one counts numbers of messages and the other uses network `bulldozing.` Performance evaluation in actual implementations on a multicomputer with 32-1024 nodes, Fujitsu AP1000, reveals various favorable properties of the algorithm.

  5. Java Parallel Secure Stream for Grid Computing

    SciTech Connect

    Chen, Jie; Akers, Walter; Chen, Ying; Watson, William

    2001-09-01

    The emergence of high speed wide area networks makes grid computing a reality. However grid applications that need reliable data transfer still have difficulties to achieve optimal TCP performance due to network tuning of TCP window size to improve the bandwidth and to reduce latency on a high speed wide area network. This paper presents a pure Java package called JPARSS (Java Par-allel Secure Stream) that divides data into partitions that are sent over several parallel Java streams simultaneously and allows Java or Web applications to achieve optimal TCP performance in a gird environment without the necessity of tuning the TCP window size. Several experimental results are provided to show that using parallel stream is more effective than tuning TCP window size. In addi-tion X.509 certificate based single sign-on mechanism and SSL based connection establishment are integrated into this package. Finally a few applications using this package will be discussed.

  6. Intranode data communications in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E

    2014-01-07

    Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a computer node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.

  7. Intranode data communications in a parallel computer

    SciTech Connect

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E

    2013-07-23

    Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.

  8. Advances in parallel computer technology for desktop atmospheric dispersion models

    SciTech Connect

    Bian, X.; Ionescu-Niscov, S.; Fast, J.D.; Allwine, K.J.

    1996-12-31

    Desktop models are those models used by analysts with varied backgrounds, for performing, for example, air quality assessment and emergency response activities. These models must be robust, well documented, have minimal and well controlled user inputs, and have clear outputs. Existing coarse-grained parallel computers can provide significant increases in computation speed in desktop atmospheric dispersion modeling without considerable increases in hardware cost. This increased speed will allow for significant improvements to be made in the scientific foundations of these applied models, in the form of more advanced diffusion schemes and better representation of the wind and turbulence fields. This is especially attractive for emergency response applications where speed and accuracy are of utmost importance. This paper describes one particular application of coarse-grained parallel computer technology to a desktop complex terrain atmospheric dispersion modeling system. By comparing performance characteristics of the coarse-grained parallel version of the model with the single-processor version, we will demonstrate that applying coarse-grained parallel computer technology to desktop atmospheric dispersion modeling systems will allow us to address critical issues facing future requirements of this class of dispersion models.

  9. Synchronizing compute node time bases in a parallel computer

    DOEpatents

    Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip

    2015-01-27

    Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.

  10. Synchronizing compute node time bases in a parallel computer

    DOEpatents

    Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip

    2014-12-30

    Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.

  11. Opportunities in computational mechanics: Advances in parallel computing

    SciTech Connect

    Lesar, R.A.

    1999-02-01

    In this paper, the authors will discuss recent advances in computing power and the prospects for using these new capabilities for studying plasticity and failure. They will first review the new capabilities made available with parallel computing. They will discuss how these machines perform and how well their architecture might work on materials issues. Finally, they will give some estimates on the size of problems possible using these computers.

  12. Problem solving on computers with parallel organization of computations

    SciTech Connect

    Glushkov, V.M.; Molchanov, I.N.

    1981-07-01

    In principle, parallelism can be implemented on different levels: the level of construction of physical models, objects or processes representing the phenomenon being studied, the level of the solution method, the level of the algorithm, the level of the program, the level of data exchange in the computer, and the level of data input and output. The authors consider some of the available parallelism techniques. 9 references.

  13. Parallel Computation of the Regional Ocean Modeling System (ROMS)

    SciTech Connect

    Wang, P; Song, Y T; Chao, Y; Zhang, H

    2005-04-05

    The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds of processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.

  14. FastQuery: A Parallel Indexing System for Scientific Data

    SciTech Connect

    Chou, Jerry; Wu, Kesheng; Prabhat,

    2011-07-29

    Modern scientific datasets present numerous data management and analysis challenges. State-of-the- art index and query technologies such as FastBit can significantly improve accesses to these datasets by augmenting the user data with indexes and other secondary information. However, a challenge is that the indexes assume the relational data model but the scientific data generally follows the array data model. To match the two data models, we design a generic mapping mechanism and implement an efficient input and output interface for reading and writing the data and their corresponding indexes. To take advantage of the emerging many-core architectures, we also develop a parallel strategy for indexing using threading technology. This approach complements our on-going MPI-based parallelization efforts. We demonstrate the flexibility of our software by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using data from a particle accelerator model and a global climate model. We also conducted a detailed performance study using these scientific datasets. The results show that FastQuery speeds up the query time by a factor of 2.5x to 50x, and it reduces the indexing time by a factor of 16 on 24 cores.

  15. Efficient Parallel Engineering Computing on Linux Workstations

    NASA Technical Reports Server (NTRS)

    Lou, John Z.

    2010-01-01

    A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).

  16. A design methodology for portable software on parallel computers

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.

    1993-01-01

    This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks are specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured

  17. Seismic imaging on massively parallel computers

    SciTech Connect

    Ober, C.C.; Oldfield, R.A.; Womble, D.E.; Mosher, C.C.

    1997-07-01

    A key to reducing the risks and costs associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Pre-stack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar-wave equation using finite differences. Current industry computational capabilities are insufficient for the application of finite-difference, 3-D, prestack, depth-migration algorithms. High performance computers and state-of-the-art algorithms and software are required to meet this need. As part of an ongoing ACTI project funded by the US Department of Energy, the authors have developed a finite-difference, 3-D prestack, depth-migration code for massively parallel computer systems. The goal of this work is to demonstrate that massively parallel computers (thousands of processors) can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite-difference, prestack, depth migration practical for oil and gas exploration.

  18. Parallelized reliability estimation of reconfigurable computer networks

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Das, Subhendu; Palumbo, Dan

    1990-01-01

    A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.

  19. Parallelism for quantum computation with qudits

    SciTech Connect

    O'Leary, Dianne P.; Brennen, Gavin K.; Bullock, Stephen S.

    2006-09-15

    Robust quantum computation with d-level quantum systems (qudits) poses two requirements: fast, parallel quantum gates and high-fidelity two-qudit gates. We first describe how to implement parallel single-qudit operations. It is by now well known that any single-qudit unitary can be decomposed into a sequence of Givens rotations on two-dimensional subspaces of the qudit state space. Using a coupling graph to represent physically allowed couplings between pairs of qudit states, we then show that the logical depth (time) of the parallel gate sequence is equal to the height of an associated tree. The implementation of a given unitary can then optimize the tradeoff between gate time and resources used. These ideas are illustrated for qudits encoded in the ground hyperfine states of the alkali-metal atoms {sup 87}Rb and {sup 133}Cs. Second, we provide a protocol for implementing parallelized nonlocal two-qudit gates using the assistance of entangled qubit pairs. Using known protocols for qubit entanglement purification, this offers the possibility of high-fidelity two-qudit gates.

  20. Parallel algorithms for computing linked list prefix

    SciTech Connect

    Han, Y. )

    1989-06-01

    Given a linked list chi/sub 1/, chi/sub 2/, ....chi/sub n/ with chi/sub i/ following chi/sub i-1/ in the list and an associative operation O, the linked list prefix problem is to compute all prefixes O/sup j//sub i=1/chi/sub 1/, j=1,2,...,n. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is O(n/rho + rho log rho), where n is the number of elements in the linked list and rho is the number of processors used. Therefore their algorithm is optimal when n {ge}rho/sup 2/logrho. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity O(n/rho + log rho) with small multiplicative constant. It is optimal when n {ge}rho log rho.

  1. Parallel computing: One opportunity, four challenges

    SciTech Connect

    Gaudiot, J.-L.

    1989-12-31

    The author reviews briefly the area of parallel computer processing. This area has been expanding at a great rate in the past decade. Great strides have been made in the hardware area, and in the speed of performance of chips. However to some degree the hardware area is beginning to run into basic physical speed limits, which will slow the rate of advance of this area simply because of physical limitations. The author looks at ways that computer architecture, and software applications, can work to continue the rate of increase in computing power which has occurred over the past decade. Four particular areas are mentioned: programmability; communication network design; reliable operation; performance evaluation and benchmarking.

  2. Computation and parallel implementation for early vision

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    1990-01-01

    The problem of early vision is to transform one or more retinal illuminance images-pixel arrays-to image representations built out of such primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation of SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

  3. Computational fluid dynamics on a massively parallel computer

    NASA Technical Reports Server (NTRS)

    Jespersen, Dennis C.; Levit, Creon

    1989-01-01

    A finite difference code was implemented for the compressible Navier-Stokes equations on the Connection Machine, a massively parallel computer. The code is based on the ARC2D/ARC3D program and uses the implicit factored algorithm of Beam and Warming. The codes uses odd-even elimination to solve linear systems. Timings and computation rates are given for the code, and a comparison is made with a Cray XMP.

  4. Parallel Computing in Information Retrieval--An Updated Review.

    ERIC Educational Resources Information Center

    Macfarlane, A.; And Others

    1997-01-01

    Reviews the progress of parallel computing in information retrieval (IR) and stresses the importance of the motivation in using parallel computing for text retrieval. Analyzes parallel IR systems using a classification defined by Rasmussen; describes retrieval models used in parallel information processing; and suggests areas of needed research.…

  5. Parallel computing techniques for rotorcraft aerodynamics

    NASA Astrophysics Data System (ADS)

    Ekici, Kivanc

    The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).

  6. Parallel block schemes for large scale least squares computations

    SciTech Connect

    Golub, G.H.; Plemmons, R.J.; Sameh, A.

    1986-04-01

    Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment of the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.

  7. Parallel algorithm for computing points on a computation front hyperplane

    NASA Astrophysics Data System (ADS)

    Krasnov, M. M.

    2015-01-01

    A parallel algorithm for computing points on a computation front hyperplane is described. This task arises in the computation of a quantity defined on a multidimensional rectangular domain. Three-dimensional domains are usually discussed, but the material is given in the general form when the number of measurements is at least two. When the values of a quantity at different points are internally independent (which is frequently the case), the corresponding computations are independent as well and can be performed in parallel. However, if there are internal dependences (as, for example, in the Gauss-Seidel method for systems of linear equations), then the order of scanning points of the domain is an important issue. A conventional approach in this case is to form a computation front hyperplane (a usual plane in the three-dimensional case and a line in the two-dimensional case) that moves linearly across the domain at a certain angle. At every step in the course of motion of this hyperplane, its intersection points with the domain can be treated independently and, hence, in parallel, but the steps themselves are executed sequentially. At different steps, the intersection of the hyperplane with the entire domain can have a rather complex geometry and the search for all points of the domain lying on the hyperplane at a given step is a nontrivial problem. This problem (i.e., the computation of the coordinates of points lying in the intersection of the domain with the hyperplane at a given step in the course of hyperplane motion) is addressed below. The computations over the points of the hyperplane can be executed in parallel.

  8. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  9. Nonisothermal multiphase subsurface transport on parallel computers

    SciTech Connect

    Martinez, M.J.; Hopkins, P.L.; Shadid, J.N.

    1997-10-01

    We present a numerical method for nonisothermal, multiphase subsurface transport in heterogeneous porous media. The mathematical model considers nonisothermal two-phase (liquid/gas) flow, including capillary pressure effects, binary diffusion in the gas phase, conductive, latent, and sensible heat transport. The Galerkin finite element method is used for spatial discretization, and temporal integration is accomplished via a predictor/corrector scheme. Message-passing and domain decomposition techniques are used for implementing a scalable algorithm for distributed memory parallel computers. An illustrative application is shown to demonstrate capabilities and performance.

  10. Scientific computing environment for the 1980s

    NASA Technical Reports Server (NTRS)

    Bailey, F. R.

    1986-01-01

    An emerging scientific computing environment in which computers are used not only to solve large-scale models, but are also integrated into the daily activities of scientists and engineers, is discussed. The requirements of the scientific user in this environment are reviewed, and the hardware environment is described, including supercomputers, work stations, mass storage, and communications. Significant increases in memory capacity to keep pace with performance increases, the introduction of powerful graphics displays into the work station, and networking to integrate many computers are stressed. The emerging system software environment is considered, including the operating systems, communications software, and languages. New scientific user tools and utilities that will become available are described.

  11. Optimal dynamic remapping of parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Reynolds, Paul F., Jr.

    1987-01-01

    A large class of computations are characterized by a sequence of phases, with phase changes occurring unpredictably. The decision problem was considered regarding the remapping of workload to processors in a parallel computation when the utility of remapping and the future behavior of the workload is uncertain, and phases exhibit stable execution requirements during a given phase, but requirements may change radically between phases. For these problems a workload assignment generated for one phase may hinder performance during the next phase. This problem is treated formally for a probabilistic model of computation with at most two phases. The fundamental problem of balancing the expected remapping performance gain against the delay cost was addressed. Stochastic dynamic programming is used to show that the remapping decision policy minimizing the expected running time of the computation has an extremely simple structure. Because the gain may not be predictable, the performance of a heuristic policy that does not require estimnation of the gain is examined. The heuristic method's feasibility is demonstrated by its use on an adaptive fluid dynamics code on a multiprocessor. The results suggest that except in extreme cases, the remapping decision problem is essentially that of dynamically determining whether gain can be achieved by remapping after a phase change. The results also suggest that this heuristic is applicable to computations with more than two phases.

  12. Dynamic file-access characteristics of a production parallel scientific workload

    NASA Technical Reports Server (NTRS)

    Kotz, David; Nieuwejaar, Nils

    1994-01-01

    Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data in parallel to hundreds or thousands of processors. Most successful systems are based on a solid understanding of the expected workload, but thus far there have been no comprehensive workload characterizations of multiprocessor file systems. This paper presents the results of a three week tracing study in which all file-related activity on a massively parallel computer was recorded. Our instrumentation differs from previous efforts in that it collects information about every I/O request and about the mix of jobs running in a production environment. We also present the results of a trace-driven caching simulation and recommendations for designers of multiprocessor file systems.

  13. CFD Research, Parallel Computation and Aerodynamic Optimization

    NASA Technical Reports Server (NTRS)

    Ryan, James S.

    1995-01-01

    During the last five years, CFD has matured substantially. Pure CFD research remains to be done, but much of the focus has shifted to integration of CFD into the design process. The work under these cooperative agreements reflects this trend. The recent work, and work which is planned, is designed to enhance the competitiveness of the US aerospace industry. CFD and optimization approaches are being developed and tested, so that the industry can better choose which methods to adopt in their design processes. The range of computer architectures has been dramatically broadened, as the assumption that only huge vector supercomputers could be useful has faded. Today, researchers and industry can trade off time, cost, and availability, choosing vector supercomputers, scalable parallel architectures, networked workstations, or heterogenous combinations of these to complete required computations efficiently.

  14. Parallel Proximity Detection for Computer Simulations

    NASA Technical Reports Server (NTRS)

    Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)

    1998-01-01

    The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are included by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.

  15. Parallel Proximity Detection for Computer Simulation

    NASA Technical Reports Server (NTRS)

    Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)

    1997-01-01

    The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are includes by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.

  16. Optimized data communications in a parallel computer

    DOEpatents

    Faraj, Daniel A.

    2014-08-19

    A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.

  17. Optimized data communications in a parallel computer

    SciTech Connect

    Faraj, Daniel A

    2014-10-21

    A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.

  18. Final Report: Super Instruction Architecture for Scalable Parallel Computations

    SciTech Connect

    Sanders, Beverly Ann; Bartlett, Rodney; Deumens, Erik

    2013-12-23

    The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACESIII software package and used it to study interesting scientific problems.

  19. A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer

    NASA Technical Reports Server (NTRS)

    Jespersen, Dennis C.; Levit, Creon

    1989-01-01

    The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.

  20. Simple, parallel virtual machines for extreme computations

    NASA Astrophysics Data System (ADS)

    Chokoufe Nejad, Bijan; Ohl, Thorsten; Reuter, Jürgen

    2015-11-01

    We introduce a virtual machine (VM) written in a numerically fast language like Fortran or C for evaluating very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte-code from the optimal matrix element generator, O'MEGA. Furthermore, this approach allows to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze hereby the scaling behavior with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and has in general runtimes in the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.

  1. A Component Architecture for High-Performance Scientific Computing

    SciTech Connect

    Bernholdt, D E; Allan, B A; Armstrong, R; Bertrand, F; Chiu, K; Dahlgren, T L; Damevski, K; Elwasif, W R; Epperly, T W; Govindaraju, M; Katz, D S; Kohl, J A; Krishnan, M; Kumfert, G; Larson, J W; Lefantzi, S; Lewis, M J; Malony, A D; McInnes, L C; Nieplocha, J; Norris, B; Parker, S G; Ray, J; Shende, S; Windus, T L; Zhou, S

    2004-12-14

    The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.

  2. QCMPI: A parallel environment for quantum computing

    NASA Astrophysics Data System (ADS)

    Tabakin, Frank; Juliá-Díaz, Bruno

    2009-06-01

    QCMPI is a quantum computer (QC) simulation package written in Fortran 90 with parallel processing capabilities. It is an accessible research tool that permits rapid evaluation of quantum algorithms for a large number of qubits and for various "noise" scenarios. The prime motivation for developing QCMPI is to facilitate numerical examination of not only how QC algorithms work, but also to include noise, decoherence, and attenuation effects and to evaluate the efficacy of error correction schemes. The present work builds on an earlier Mathematica code QDENSITY, which is mainly a pedagogic tool. In that earlier work, although the density matrix formulation was featured, the description using state vectors was also provided. In QCMPI, the stress is on state vectors, in order to employ a large number of qubits. The parallel processing feature is implemented by using the Message-Passing Interface (MPI) protocol. A description of how to spread the wave function components over many processors is provided, along with how to efficiently describe the action of general one- and two-qubit operators on these state vectors. These operators include the standard Pauli, Hadamard, CNOT and CPHASE gates and also Quantum Fourier transformation. These operators make up the actions needed in QC. Codes for Grover's search and Shor's factoring algorithms are provided as examples. A major feature of this work is that concurrent versions of the algorithms can be evaluated with each version subject to alternate noise effects, which corresponds to the idea of solving a stochastic Schrödinger equation. The density matrix for the ensemble of such noise cases is constructed using parallel distribution methods to evaluate its eigenvalues and associated entropy. Potential applications of this powerful tool include studies of the stability and correction of QC processes using Hamiltonian based dynamics. Program summaryProgram title: QCMPI Catalogue identifier: AECS_v1_0 Program summary URL

  3. Accelerating Scientific Discovery Through Computation and Visualization II

    PubMed Central

    Sims, James S.; George, William L.; Satterfield, Steven G.; Hung, Howard K.; Hagedorn, John G.; Ketcham, Peter M.; Griffin, Terence J.; Hagstrom, Stanley A.; Franiatte, Julien C.; Bryant, Garnett W.; Jaskólski, W.; Martys, Nicos S.; Bouldin, Charles E.; Simmons, Vernon; Nicolas, Oliver P.; Warren, James A.; am Ende, Barbara A.; Koontz, John E.; Filla, B. James; Pourprix, Vital G.; Copley, Stefanie R.; Bohn, Robert B.; Peskin, Adele P.; Parker, Yolanda M.; Devaney, Judith E.

    2002-01-01

    This is the second in a series of articles describing a wide variety of projects at NIST that synergistically combine physical science and information science. It describes, through examples, how the Scientific Applications and Visualization Group (SAVG) at NIST has utilized high performance parallel computing, visualization, and machine learning to accelerate research. The examples include scientific collaborations in the following areas: (1) High Precision Energies for few electron atomic systems, (2) Flows of suspensions, (3) X-ray absorption, (4) Molecular dynamics of fluids, (5) Nanostructures, (6) Dendritic growth in alloys, (7) Screen saver science, (8) genetic programming. PMID:27446728

  4. CFD research, parallel computation and aerodynamic optimization

    NASA Technical Reports Server (NTRS)

    Ryan, James S.

    1995-01-01

    Over five years of research in Computational Fluid Dynamics and its applications are covered in this report. Using CFD as an established tool, aerodynamic optimization on parallel architectures is explored. The objective of this work is to provide better tools to vehicle designers. Submarine design requires accurate force and moment calculations in flow with thick boundary layers and large separated vortices. Low noise production is critical, so flow into the propulsor region must be predicted accurately. The High Speed Civil Transport (HSCT) has been the subject of recent work. This vehicle is to be a passenger vehicle with the capability of cutting overseas flight times by more than half. A successful design must surpass the performance of comparable planes. Fuel economy, other operational costs, environmental impact, and range must all be improved substantially. For all these reasons, improved design tools are required, and these tools must eventually integrate optimization, external aerodynamics, propulsion, structures, heat transfer and other disciplines.

  5. Broadcasting a message in a parallel computer

    DOEpatents

    Archer, Charles J; Faraj, Ahmad A

    2013-04-16

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.

  6. Broadcasting a message in a parallel computer

    DOEpatents

    Archer, Charles J; Faraj, Daniel A

    2014-11-18

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.

  7. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  8. 77 FR 62231 - DOE/Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-12

    .../Advanced Scientific Computing Advisory Committee AGENCY: Office of Science, Department of Energy. ACTION: Notice of Open Meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing...: Melea Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown Building;...

  9. 76 FR 31945 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-06-02

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION... the Advanced Scientific Computing Advisory Committee (ASCAC). The Federal Advisory Committee Act (Pub... INFORMATION CONTACT: Melea Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown...

  10. Application Specific Performance Technology for Productive Parallel Computing

    SciTech Connect

    Malony, Allen D.; Shende, Sameer

    2008-09-30

    Our accomplishments over the last three years of the DOE project Application- Specific Performance Technology for Productive Parallel Computing (DOE Agreement: DE-FG02-05ER25680) are described below. The project will have met all of its objectives by the time of its completion at the end of September, 2008. Two extensive yearly progress reports were produced in in March 2006 and 2007 and were previously submitted to the DOE Office of Advanced Scientific Computing Research (OASCR). Following an overview of the objectives of the project, we summarize for each of the project areas the achievements in the first two years, and then describe in some more detail the project accomplishments this past year. At the end, we discuss the relationship of the proposed renewal application to the work done on the current project.

  11. Advanced Scientific Computing Environment Team new scientific database management task. Progress report

    SciTech Connect

    Church, J.P.; Roberts, J.C.; Sims, R.N.; Smetana, A.O.; Westmoreland, B.W.

    1991-06-01

    The mission of the ASCENT Team is to continually keep pace with, evaluate, and select emerging computing technologies to define and implement prototypic scientific environments that maximize the ability of scientists and engineers to manage scientific data. These environments are to be implemented in a manner consistent with the site computing architecture and standards and NRTSC/SCS strategic plans for scientific computing. The major trends in computing hardware and software technology clearly indicate that the future ``computer`` will be a network environment that comprises supercomputers, graphics boxes, mainframes, clusters, workstations, terminals, and microcomputers. This ``network computer`` will have an architecturally transparent operating system allowing the applications code to run on any box supplying the required computing resources. The environment will include a distributed database and database managing system(s) that permits use of relational, hierarchical, object oriented, GIS, et al, databases. To reach this goal requires a stepwise progression from the present assemblage of monolithic applications codes running on disparate hardware platforms and operating systems. The first steps include converting from the existing JOSHUA system to a new J80 system that complies with modern language standards, development of a new J90 prototype to provide JOSHUA capabilities on Unix platforms, development of portable graphics tools to greatly facilitate preparation of input and interpretation of output; and extension of ``Jvv`` concepts and capabilities to distributed and/or parallel computing environments.

  12. Evaluation of leading scalar and vector architectures for scientific computations

    SciTech Connect

    Simon, Horst D.; Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Ethier, Stephane; Shalf, John

    2004-04-20

    The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This project examines the performance of the cacheless vector Earth Simulator (ES) and compares it to superscalar cache-based IBM Power3 system. Results demonstrate that the ES is significantly faster than the Power3 architecture, highlighting the tremendous potential advantage of the ES for numerical simulation. However, vectorization of a particle-in-cell application (GTC) greatly increased the memory footprint preventing loop-level parallelism and limiting scalability potential.

  13. Using MPI File Caching to Improve Parallel Write Performance for Large-Scale Scientific Applications

    SciTech Connect

    Liao, Wei-keng; Ching, Avery; Coloma, Kenin; Nisar, Arifa; Choudhary, Alok; Chen, Jackie; Sankaran, Ramanan; Klasky, Scott A

    2007-01-01

    Typical large-scale scientific applications periodically write checkpoint files to save the computational state throughout execution. Existing parallel file systems improve such write-only I/O patterns through the use of clientside file caching and write-behind strategies. In distributed environments where files are rarely accessed by more than one client concurrently, file caching has achieved significant success; however, in parallel applications where multiple clients manipulate a shared file, cache coherence control can serialize I/O. We have designed a thread based caching layer for the MPI I/O library, which adds a portable caching system closer to user applications so more information about the application's I/O patterns is available for better coherence control. We demonstrate the impact of our caching solution on parallel write performance with a comprehensive evaluation that includes a set of widely used I/O benchmarks and production application I/O kernels.

  14. Using MPI file caching to improve parallel write performance for large-scale scientific applications

    SciTech Connect

    Sankaran, Ramanan; Liao, Wei-Keng; Chen, Jacqueline H; Klasky, Scott A; Choudhary, Alok

    2007-01-01

    Typical large-scale scientific applications periodically write checkpoint files to save the computational state throughout execution. Existing parallel file systems improve such write-only I/O patterns through the use of client-side file caching and write-behind strategies. In distributed environments where files are rarely accessed by more than one client concurrently, file caching has achieved significant success; however, in parallel applications where multiple clients manipulate a shared file, cache coherence control can serialize I/O. We have designed a thread based caching layer for the MPI I/O library, which adds a portable caching system closer to user applications so more information about the application's I/O patterns is available for better coherence control. We demonstrate the impact of our caching solution on parallel write performance with a comprehensive evaluation that includes a set of widely used I/O benchmarks and production application I/O kernels.

  15. Parallel solvers for reservoir simulation on MIMD computers

    SciTech Connect

    Piault, E.; Willien, F.; Roux, F.X.

    1995-12-01

    We have investigated parallel solvers for reservoir simulation. We compare different solvers and preconditioners using T3D and SP1 parallel computers. We use block diagonal domain decomposition preconditioner with non-overlapping sub-domains.

  16. Parallel reservoir computing using optical amplifiers.

    PubMed

    Vandoorne, Kristof; Dambre, Joni; Verstraeten, David; Schrauwen, Benjamin; Bienstman, Peter

    2011-09-01

    Reservoir computing (RC), a computational paradigm inspired on neural systems, has become increasingly popular in recent years for solving a variety of complex recognition and classification problems. Thus far, most implementations have been software-based, limiting their speed and power efficiency. Integrated photonics offers the potential for a fast, power efficient and massively parallel hardware implementation. We have previously proposed a network of coupled semiconductor optical amplifiers as an interesting test case for such a hardware implementation. In this paper, we investigate the important design parameters and the consequences of process variations through simulations. We use an isolated word recognition task with babble noise to evaluate the performance of the photonic reservoirs with respect to traditional software reservoir implementations, which are based on leaky hyperbolic tangent functions. Our results show that the use of coherent light in a well-tuned reservoir architecture offers significant performance benefits. The most important design parameters are the delay and the phase shift in the system's physical connections. With optimized values for these parameters, coherent semiconductor optical amplifier (SOA) reservoirs can achieve better results than traditional simulated reservoirs. We also show that process variations hardly degrade the performance, but amplifier noise can be detrimental. This effect must therefore be taken into account when designing SOA-based RC implementations.

  17. Parallel CE/SE Computations via Domain Decomposition

    NASA Technical Reports Server (NTRS)

    Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung

    2000-01-01

    This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.

  18. A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture

    NASA Technical Reports Server (NTRS)

    Imbriale, W. A.; Cwik, T.

    1994-01-01

    A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.

  19. Scientific data analysis on data-parallel platforms.

    SciTech Connect

    Ulmer, Craig D.; Bayer, Gregory W.; Choe, Yung Ryn; Roe, Diana C.

    2010-09-01

    As scientific computing users migrate to petaflop platforms that promise to generate multi-terabyte datasets, there is a growing need in the community to be able to embed sophisticated analysis algorithms in the computing platforms' storage systems. Data Warehouse Appliances (DWAs) are attractive for this work, due to their ability to store and process massive datasets efficiently. While DWAs have been utilized effectively in data-mining and informatics applications, they remain largely unproven in scientific workloads. In this paper we present our experiences in adapting two mesh analysis algorithms to function on five different DWA architectures: two Netezza database appliances, an XtremeData dbX database, a LexisNexis DAS, and multiple Hadoop MapReduce clusters. The main contribution of this work is insight into the differences between these DWAs from a user's perspective. In addition, we present performance measurements for ten DWA systems to help understand the impact of different architectural trade-offs in these systems.

  20. Parallel computation of three-dimensional nonlinear magnetostatic problems.

    SciTech Connect

    Levine, D.; Gropp, W.; Forsman, K.; Kettunen, L.; Mathematics and Computer Science; Tampere Univ. of Tech.

    1999-02-01

    We describe a general-purpose parallel electromagnetic code for computing accurate solutions to large computationally demanding, 3D, nonlinear magnetostatic problems. The code, CORAL, is based on a volume integral equation formulation. Using an IBM SP parallel computer and iterative solution methods, we successfully solved the dense linear systems inherent in such formulations. A key component of our work was the use of the PETSc library, which provides parallel portability and access to the latest linear algebra solution technology.

  1. I/O-Efficient Scientific Computation Using TPIE

    NASA Technical Reports Server (NTRS)

    Vengroff, Darren Erik; Vitter, Jeffrey Scott

    1996-01-01

    In recent years, input/output (I/O)-efficient algorithms for a wide variety of problems have appeared in the literature. However, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to support I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only with explicit read and write calls, but also the complex memory management that must be performed for I/O-efficient computation. In this paper we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks while others are of our own creation. We demonstrate that the central processing unit (CPU) overhead required to manage I/O is small and that even with just a single disk, the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.

  2. CFD Optimization on Network-Based Parallel Computer System

    NASA Technical Reports Server (NTRS)

    Cheung, Samson H.; VanDalsem, William (Technical Monitor)

    1994-01-01

    Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advance computational fluid dynamics codes, which is computationally expensive in mainframe supercomputer. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computer on a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package has been applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.

  3. Parallel CFD design on network-based computer

    NASA Technical Reports Server (NTRS)

    Cheung, Samson

    1995-01-01

    Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advanced computational fluid dynamics codes, which can be computationally expensive on mainframe supercomputers. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computing environment utilizing a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package is applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.

  4. Accelerating Astronomy & Astrophysics in the New Era of Parallel Computing: GPUs, Phi and Cloud Computing

    NASA Astrophysics Data System (ADS)

    Ford, Eric B.; Dindar, Saleh; Peters, Jorg

    2015-08-01

    The realism of astrophysical simulations and statistical analyses of astronomical data are set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures, requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network network characteristics.I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than order of magnitude speed-ups and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs.I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer

  5. The 9th INS scientific computational programs

    NASA Astrophysics Data System (ADS)

    1993-06-01

    Twelve groups were accepted as the 9-th INS scientific Computational Programs (ISCP). This ISCP can use full resources of the INS central computer without an operational limitation. The 9-th ISCP started on June/1/92 and ended on March/30/93. Some results were already published in journals or appeared in physical meetings. This is a sum of the used statistics of the 9-th ISCP and reports presented by groups at the 9-th ISCP.

  6. A Scientific Cloud Computing Platform for Condensed Matter Physics

    NASA Astrophysics Data System (ADS)

    Jorissen, K.; Johnson, W.; Vila, F. D.; Rehr, J. J.

    2013-03-01

    Scientific Cloud Computing (SCC) makes possible calculations with high performance computational tools, without the need to purchase or maintain sophisticated hardware and software. We have recently developed an interface dubbed SC2IT that controls on-demand virtual Linux clusters within the Amazon EC2 cloud platform. Using this interface we have developed a more advanced, user-friendly SCC Platform configured especially for condensed matter calculations. This platform contains a GUI, based on a new Java version of SC2IT, that permits calculations of various materials properties. The cloud platform includes Virtual Machines preconfigured for parallel calculations and several precompiled and optimized materials science codes for electronic structure and x-ray and electron spectroscopy. Consequently this SCC makes state-of-the-art condensed matter calculations easy to access for general users. Proof-of-principle performance benchmarks show excellent parallelization and communication performance. Supported by NSF grant OCI-1048052

  7. LEWICE droplet trajectory calculations on a parallel computer

    NASA Technical Reports Server (NTRS)

    Caruso, Steven C.

    1993-01-01

    A parallel computer implementation (128 processors) of LEWICE, a NASA Lewis code used to predict the time-dependent ice accretion process for two-dimensional aerodynamic bodies of simple geometries, is described. Two-dimensional parallel droplet trajectory calculations are performed to demonstrate the potential benefits of applying parallel processing to ice accretion analysis. Parallel performance is evaluated as a function of the number of trajectories and the number of processors. For comparison, similar trajectory calculations are performed on single-processor Cray computers, and the best parallel results are found to be 33 and 23 times faster, respectively, than those of the Cray XMP and YMP.

  8. Performance Evaluation in Network-Based Parallel Computing

    NASA Technical Reports Server (NTRS)

    Dezhgosha, Kamyar

    1996-01-01

    Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.

  9. Review of parallel computing methods and tools for FPGA technology

    NASA Astrophysics Data System (ADS)

    Cieszewski, Radosław; Linczuk, Maciej; Pozniak, Krzysztof; Romaniuk, Ryszard

    2013-10-01

    Parallel computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated using parallel computing techniques. Specialized parallel computer architectures are used for accelerating speci c tasks. High-Energy Physics Experiments measuring systems often use FPGAs for ne-grained computation. FPGA combines many bene ts of both software and ASIC implementations. Like software, the mapped circuit is exible, and can be recon gured over the lifetime of the system. FPGAs therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute operations of traditional processors, and possibly exploiting a greater level of parallelism. Creating parallel programs implemented in FPGAs is not trivial. This paper presents existing methods and tools for ne-grained computation implemented in FPGA using Behavioral Description and High Level Programming Languages.

  10. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-11-12

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.

  11. Access and visualization using clusters and other parallel computers

    NASA Technical Reports Server (NTRS)

    Katz, D. S.; Bergou, A.; Berriman, B.; Block, G.; Collier, J.; Curkendall, D.; Good, J.; Husman, L.; Jacob, J.; Laity, A.; Li, P.; Miller, C.; Plesea, L.; Prince, T.; Siegel, H.; Williams, R.

    2003-01-01

    JPL's Parallel Applications Technologies Group has been exploring the issues of data access and visualization of very large data sets over the past 10 years. this work has used a number of types of parallel computers, and today includes the use of commodity clusters. This talk will highlight some of the applications and tools we have developed, including how they use parallel computing resources, and specifically how we are using modern clusters.

  12. Parallel In Situ Indexing for Data-intensive Computing

    SciTech Connect

    Kim, Jinoh; Abbasi, Hasan; Chacon, Luis; Docan, Ciprian; Klasky, Scott; Liu, Qing; Podhorszki, Norbert; Shoshani, Arie; Wu, Kesheng

    2011-09-09

    As computing power increases exponentially, vast amount of data is created by many scientific re- search activities. However, the bandwidth for storing the data to disks and reading the data from disks has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing relatively small fraction of data records. As data sets increase in sizes, more and more analysts need to use selective data access, which makes indexing an even more important for improving data access. The challenge is that most implementations of in- dexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt the in situ processing technology to generate the indexes, thus removing the need of read- ing data from disks and to build indexes in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve the data access time up to 200 times depending on the fraction of data selected, and using in situ data processing system can effectively reduce the time needed to create the indexes, up to 10 times with our in situ technique when using identical parallel settings.

  13. Scientific Computing Kernels on the Cell Processor

    SciTech Connect

    Williams, Samuel W.; Shalf, John; Oliker, Leonid; Kamil, Shoaib; Husbands, Parry; Yelick, Katherine

    2007-04-04

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the recently-released STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on a 3.2GHz Cell blade. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

  14. Knowledge-Based Parallel Performance Technology for Scientific Application Competitiveness Final Report

    SciTech Connect

    Malony, Allen D; Shende, Sameer

    2011-08-15

    The primary goal of the University of Oregon's DOE "œcompetitiveness" project was to create performance technology that embodies and supports knowledge of performance data, analysis, and diagnosis in parallel performance problem solving. The target of our development activities was the TAU Performance System and the technology accomplishments reported in this and prior reports have all been incorporated in the TAU open software distribution. In addition, the project has been committed to maintaining strong interactions with the DOE SciDAC Performance Engineering Research Institute (PERI) and Center for Technology for Advanced Scientific Component Software (TASCS). This collaboration has proved valuable for translation of our knowledge-based performance techniques to parallel application development and performance engineering practice. Our outreach has also extended to the DOE Advanced CompuTational Software (ACTS) collection and project. Throughout the project we have participated in the PERI and TASCS meetings, as well as the ACTS annual workshops.

  15. Parallel image computation in clusters with task-distributor.

    PubMed

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computational expensive and memory consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application.

  16. Parallel image computation in clusters with task-distributor.

    PubMed

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computational expensive and memory consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application. PMID:27330898

  17. Distributing an executable job load file to compute nodes in a parallel computer

    DOEpatents

    Gooding, Thomas M.

    2016-09-13

    Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications link over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.

  18. Distributing an executable job load file to compute nodes in a parallel computer

    DOEpatents

    Gooding, Thomas M.

    2016-08-09

    Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications link over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.

  19. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Data communications in a parallel active messaging interface ('PAMI') or a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution of a compute node, including specification of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications instruction, the instruction characterized by instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance witht the instruction type, the transfer data from the origin endpoin to the target endpoint.

  20. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-10-29

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.

  1. C++ expression templates performance issues in scientific computing

    SciTech Connect

    Bassetti, F.; Davis, K.; Quinlan, D.

    1997-10-01

    Ever-increasing size and complexity of software applications and libraries in parallel scientific computing is making implementation in the programming languages traditional for this field--FORTRAN 77 and C--impractical. The major impediment to the progression to a higher-level language such as C++ is attaining FORTRAN 77 or C performance, which is considered absolutely necessary by many practitioners. The use of template metaprogramming in C++, in the form of so-called expression templates to generate custom C++ code, holds great promise for getting C performance from C++ in the context of operations on array-like objects. Several sophisticated expression template implementations of parallel array-class libraries exist, and in certain circumstances their promise of performance is realized. Unfortunately this is not uniformly the case; this paper explores the major reasons that this is so.

  2. Parallel, distributed and GPU computing technologies in single-particle electron microscopy

    PubMed Central

    Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

    2009-01-01

    Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686

  3. A scalable parallel black oil simulator on distributed memory parallel computers

    NASA Astrophysics Data System (ADS)

    Wang, Kun; Liu, Hui; Chen, Zhangxin

    2015-11-01

    This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.

  4. Scientific Visualization and Computational Science: Natural Partners

    NASA Technical Reports Server (NTRS)

    Uselton, Samuel P.; Lasinski, T. A. (Technical Monitor)

    1995-01-01

    Scientific visualization is developing rapidly, stimulated by computational science, which is gaining acceptance as a third alternative to theory and experiment. Computational science is based on numerical simulations of mathematical models derived from theory. But each individual simulation is like a hypothetical experiment; initial conditions are specified, and the result is a record of the observed conditions. Experiments can be simulated for situations that can not really be created or controlled. Results impossible to measure can be computed.. Even for observable values, computed samples are typically much denser. Numerical simulations also extend scientific exploration where the mathematics is analytically intractable. Numerical simulations are used to study phenomena from subatomic to intergalactic scales and from abstract mathematical structures to pragmatic engineering of everyday objects. But computational science methods would be almost useless without visualization. The obvious reason is that the huge amounts of data produced require the high bandwidth of the human visual system, and interactivity adds to the power. Visualization systems also provide a single context for all the activities involved from debugging the simulations, to exploring the data, to communicating the results. Most of the presentations today have their roots in image processing, where the fundamental task is: Given an image, extract information about the scene. Visualization has developed from computer graphics, and the inverse task: Given a scene description, make an image. Visualization extends the graphics paradigm by expanding the possible input. The goal is still to produce images; the difficulty is that the input is not a scene description displayable by standard graphics methods. Visualization techniques must either transform the data into a scene description or extend graphics techniques to display this odd input. Computational science is a fertile field for visualization

  5. Performance issues for engineering analysis on MIMD parallel computers

    SciTech Connect

    Fang, H.E.; Vaughan, C.T.; Gardner, D.R.

    1994-08-01

    We discuss how engineering analysts can obtain greater computational resolution in a more timely manner from applications codes running on MIMD parallel computers. Both processor speed and memory capacity are important to achieving better performance than a serial vector supercomputer. To obtain good performance, a parallel applications code must be scalable. In addition, the aspect ratios of the subdomains in the decomposition of the simulation domain onto the parallel computer should be of order 1. We demonstrate these conclusions using simulations conducted with the PCTH shock wave physics code running on a Cray Y-MP, a 1024-node nCUBE 2, and an 1840-node Paragon.

  6. Advanced Scientific Computing Research Network Requirements

    SciTech Connect

    Bacon, Charles; Bell, Greg; Canon, Shane; Dart, Eli; Dattoria, Vince; Goodwin, Dave; Lee, Jason; Hicks, Susan; Holohan, Ed; Klasky, Scott; Lauzon, Carolyn; Rogers, Jim; Shipman, Galen; Skinner, David; Tierney, Brian

    2013-03-08

    The Energy Sciences Network (ESnet) is the primary provider of network connectivity for the U.S. Department of Energy (DOE) Office of Science (SC), the single largest supporter of basic research in the physical sciences in the United States. In support of SC programs, ESnet regularly updates and refreshes its understanding of the networking requirements of the instruments, facilities, scientists, and science programs that it serves. This focus has helped ESnet to be a highly successful enabler of scientific discovery for over 25 years. In October 2012, ESnet and the Office of Advanced Scientific Computing Research (ASCR) of the DOE SC organized a review to characterize the networking requirements of the programs funded by the ASCR program office. The requirements identified at the review are summarized in the Findings section, and are described in more detail in the body of the report.

  7. Trilinos for emerging parallel computing systems.

    SciTech Connect

    Heroux, Michael Allen

    2010-08-01

    Trilinos is an object-oriented software framework to enabled the solution of large-scale, complex multiphysics engineering and scientific problems. Different Trilinos packages build on each other to create a stack providing the necessary capability: (1) Non-linear solver; (2) Linear solver/preconditioner; (3) Distributed linear algebra; and (4) Local linear algebra.

  8. AVES: A Computer Cluster System approach for INTEGRAL Scientific Analysis

    NASA Astrophysics Data System (ADS)

    Federici, M.; Martino, B. L.; Natalucci, L.; Umbertini, P.

    The AVES computing system, based on an "Cluster" architecture is a fully integrated, low cost computing facility dedicated to the archiving and analysis of the INTEGRAL data. AVES is a modular system that uses the software resource manager (SLURM) and allows almost unlimited expandibility (65,536 nodes and hundreds of thousands of processors); actually is composed by 30 Personal Computers with Quad-Cores CPU able to reach the computing power of 300 Giga Flops (300x10{9} Floating point Operations Per Second), with 120 GB of RAM and 7.5 Tera Bytes (TB) of storage memory in UFS configuration plus 6 TB for users area. AVES was designed and built to solve growing problems raised from the analysis of the large data amount accumulated by the INTEGRAL mission (actually about 9 TB) and due to increase every year. The used analysis software is the OSA package, distributed by the ISDC in Geneva. This is a very complex package consisting of dozens of programs that can not be converted to parallel computing. To overcome this limitation we developed a series of programs to distribute the workload analysis on the various nodes making AVES automatically divide the analysis in N jobs sent to N cores. This solution thus produces a result similar to that obtained by the parallel computing configuration. In support of this we have developed tools that allow a flexible use of the scientific software and quality control of on-line data storing. The AVES software package is constituted by about 50 specific programs. Thus the whole computing time, compared to that provided by a Personal Computer with single processor, has been enhanced up to a factor 70.

  9. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  10. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  11. QCD on the Massively Parallel Computer AP1000

    NASA Astrophysics Data System (ADS)

    Akemi, K.; Fujisaki, M.; Okuda, M.; Tago, Y.; Hashimoto, T.; Hioki, S.; Miyamura, O.; Takaishi, T.; Nakamura, A.; de Forcrand, Ph.; Hege, C.; Stamatescu, I. O.

    We present the QCD-TARO program of calculations which uses the parallel computer AP1000 of Fujitsu. We discuss the results on scaling, correlation times and hadronic spectrum, some aspects of the implementation and the future prospects.

  12. Computational Biology, Advanced Scientific Computing, and Emerging Computational Architectures

    SciTech Connect

    2007-06-27

    This CRADA was established at the start of FY02 with $200 K from IBM and matching funds from DOE to support post-doctoral fellows in collaborative research between International Business Machines and Oak Ridge National Laboratory to explore effective use of emerging petascale computational architectures for the solution of computational biology problems. 'No cost' extensions of the CRADA were negotiated with IBM for FY03 and FY04.

  13. A Testbed of Parallel Kernels for Computer Science Research

    SciTech Connect

    Bailey, David; Demmel, James; Ibrahim, Khaled; Kaiser, Alex; Koniges, Alice; Madduri, Kamesh; Shalf, John; Strohmaier, Erich; Williams, Samuel

    2010-04-30

    For several decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models for optimal performance, efficiency, and productivity. Unfortunately, this guidance is most often taken from the existing software/hardware ecosystem. Architects attempt to provide micro-architectural solutions to improve performance on fixed binaries. Researchers tweak compilers to improve code generation for existing architectures and implementations, and they may invent new programming models for fixed processor and memory architectures and computational algorithms. In today's rapidly evolving world of on-chip parallelism, these isolated and iterative improvements to performance may miss superior solutions in the same way gradient descent optimization techniques may get stuck in local minima. In an initial study, we have developed an alternate approach that, rather than starting with an existing hardware/software solution laced with hidden assumptions, defines the computational problems of interest and invites architects, researchers and programmers to implement novel hardware/ software co-designed solutions. Our work builds on the previous ideas of computational dwarfs, motifs, and parallel patterns by selecting a representative set of essential problems for which we provide: An algorithmic description; scalable problem definition; illustrative reference implementations; verification schemes. For simplicity, we focus initially on the computational problems of interest to the scientific computing community but proclaim the methodology (and perhaps a subset of the problems) as applicable to other communities. We intend to broaden the coverage of this problem space through stronger community involvement. Previous work has established a broad categorization of numerical methods of interest to the scientific computing, in the spirit of the NAS Benchmarks, which pioneered the basic idea of a 'pencil and paper benchmark' in the 1990s. The

  14. Serial multiplier arrays for parallel computation

    NASA Technical Reports Server (NTRS)

    Winters, Kel

    1990-01-01

    Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal introduced for this application.

  15. Parallel Computing for Probabilistic Response Analysis of High Temperature Composites

    NASA Technical Reports Server (NTRS)

    Sues, R. H.; Lua, Y. J.; Smith, M. D.

    1994-01-01

    The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.

  16. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  17. Parallel aeroelastic computations for wing and wing-body configurations

    NASA Technical Reports Server (NTRS)

    Byun, Chansup

    1994-01-01

    The objective of this research is to develop computationally efficient methods for solving fluid-structural interaction problems by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures on parallel computers. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.

  18. History Matching in Parallel Computational Environments

    SciTech Connect

    Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav

    2004-08-31

    In the probabilistic approach for history matching, the information from the dynamic data is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, fluid properties, well configuration, flow constrains on wells etc. This implies probabilistic approach should then update different regions of the reservoir in different ways. This necessitates delineation of multiple reservoir domains in order to increase the accuracy of the approach. The research focuses on a probabilistic approach to integrate dynamic data that ensures consistency between reservoir models developed from one stage to the next. The algorithm relies on efficient parameterization of the dynamic data integration problem and permits rapid assessment of the updated reservoir model at each stage. The report also outlines various domain decomposition schemes from the perspective of increasing the accuracy of probabilistic approach of history matching. Research progress in three important areas of the project are discussed: {lg_bullet}Validation and testing the probabilistic approach to incorporating production data in reservoir models. {lg_bullet}Development of a robust scheme for identifying reservoir regions that will result in a more robust parameterization of the history matching process. {lg_bullet}Testing commercial simulators for parallel capability and development of a parallel algorithm for history matching.

  19. Routing Application for Parallel computatIon of Discharge

    NASA Astrophysics Data System (ADS)

    David, C. H.; Maidment, D. R.; Yang, Z.

    2008-12-01

    Today, meteorological models can predict storm events and climate patterns, but there are few models that connect atmospheric models to the hydraulics of rivers. Such a model is necessary to advance the prediction of events such as floods or droughts. Furthermore, the increasingly available Geographic Information System-based hydrographic datasets offer ways to use actual mapped river for routing. Such GIS-based datasets include NHDPlus at the continental-scale [USEPA and USGS, 2007]and HydroSHEDS at the global- scale [Lehner, et al., 2006]. The objective of ongoing work is to develop RAPID (Routing Application for Parallel computatIon of Discharge), a large-scale river routing model that: - has physical representation of river flow - allows for coupling with both land surface models and groundwater models. In particular enabling bi-directional exchanges between rivers and aquifers through a computation of water volume and flow on a reach-to-reach basis - has some specific treatment for man-made infrastructures and anthropogenic actions (dams, pumping) - benefits from the latest scientific computing developments (supercomputing facilities and high performance parallel computing libraries) - will benefit from the increasingly available Geographic Information System hydrographic databases. RAPID has already been tested over France through a ten-year coupling with SIM-France [Habets, et al., 2008, using a river network made of 24,264 river reaches. RAPID has been adapted to run on the Lonestar supercomputer (http://www.tacc.utexas.edu/resources/hpcsystems/) and to allow the use of the NHDPlus dataset. RAPID is now being tested with the 74,615 river reaches of the entire Texas Gulf. This type of innovative model has strong implications for the future of hydrology. Such a tool can help improve the understanding of the effect of climate change on water resources as well as provide information on how many gages are needed and where are they needed the most. More broadly

  20. The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete; Saini, Subhash (Technical Monitor)

    1998-01-01

    Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.

  1. 78 FR 6087 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-01-29

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing..., Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U. S. Department of...

  2. 75 FR 9887 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-03-04

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing... Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U.S. Department...

  3. 76 FR 9765 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-02-22

    ... Advanced Scientific Computing Advisory Committee AGENCY: Office of Science, Department of Energy. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing..., Office of Advanced Scientific Computing Research, SC-21/Germantown Building, U.S. Department of...

  4. 78 FR 41046 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-07-09

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION... hereby given that the Advanced Scientific Computing Advisory Committee will be renewed for a two-year... (DOE), on the Advanced Scientific Computing Research Program managed by the Office of...

  5. 75 FR 64720 - DOE/Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-10-20

    .../Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing... Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U. S. Department...

  6. 76 FR 41234 - Advanced Scientific Computing Advisory Committee Charter Renewal

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-13

    ... Advanced Scientific Computing Advisory Committee Charter Renewal AGENCY: Department of Energy, Office of... Administration, notice is hereby given that the Advanced Scientific Computing Advisory Committee will be renewed... concerning the Advanced Scientific Computing program in response only to charges from the Director of...

  7. 75 FR 57742 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-09-22

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION... Scientific Computing Advisory Committee (ASCAC). Federal Advisory Committee Act (Pub. L. 92-463, 86 Stat. 770...: Melea Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown Building;...

  8. 78 FR 56871 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-09-16

    ... Advanced Scientific Computing Advisory Committee AGENCY: Office of Science, Department of Energy. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing... Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U.S. Department...

  9. 77 FR 45345 - DOE/Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-07-31

    .../Advanced Scientific Computing Advisory Committee AGENCY: Office of Science, Department of Energy. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing... Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U.S. Department...

  10. 78 FR 50404 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-08-19

    ... Advanced Scientific Computing Advisory Committee AGENCY: Office of Science, Department of Energy. ] ACTION... Scientific Computing Advisory Committee (ASCAC). The Federal Advisory Committee Act (Pub. L. 92-463, 86 Stat... INFORMATION CONTACT: Melea Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown...

  11. 75 FR 43518 - Advanced Scientific Computing Advisory Committee; Meeting

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-26

    ... Advanced Scientific Computing Advisory Committee; Meeting AGENCY: Office of Science, DOE. ACTION: Notice of open meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing Advisory..., Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U. S. Department of...

  12. 77 FR 12823 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-03-02

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION: Notice of Open Meeting. SUMMARY: This notice announces a meeting of the Advanced Scientific Computing..., Office of Advanced Scientific Computing Research, SC-21/Germantown Building, U.S. Department of...

  13. Chare kernel; A runtime support system for parallel computations

    SciTech Connect

    Shu, W. ); Kale, L.V. )

    1991-03-01

    This paper presents the chare kernel system, which supports parallel computations with irregular structure. The chare kernel is a collection of primitive functions that manage chares, manipulative messages, invoke atomic computations, and coordinate concurrent activities. Programs written in the chare kernel language can be executed on different parallel machines without change. Users writing such programs concern themselves with the creation of parallel actions but not with assigning them to specific processors. The authors describe the design and implementation of the chare kernel. Performance of chare kernel programs on two hypercube machines, the Intel iPSC/2 and the NCUBE, is also given.

  14. Access and visualization using clusters and other parallel computers

    NASA Technical Reports Server (NTRS)

    Katz, Daniel S.; Bergou, Attila; Berriman, Bruce; Block, Gary; Collier, Jim; Curkendall, Dave; Good, John; Husman, Laura; Jacob, Joe; Laity, Anastasia; Li, Peggy; Miller, Craig; Plesea, Lucian; Prince, Tom; Siegel, Herb; Williams, Roy

    2003-01-01

    JPL's Parallel Applications Technologies Group has been exploring the issues of data access and visualization of very large data sets over the past 10 or so years. this work has used a number of types of parallel computers, and today includes the use of commodity clusters. This talk will highlight some of the applications and tools we have developed, including how they use parallel computing resources, and specifically how we are using modern clusters. Our applications focus on NASA's needs; thus our data sets are usually related to Earth and Space Science, including data delivered from instruments in space, and data produced by telescopes on the ground.

  15. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.

  16. Parallel Computation of the Topology of Level Sets

    SciTech Connect

    Pascucci, V; Cole-McLaughlin, K

    2004-12-16

    This paper introduces two efficient algorithms that compute the Contour Tree of a 3D scalar field F and its augmented version with the Betti numbers of each isosurface. The Contour Tree is a fundamental data structure in scientific visualization that is used to preprocess the domain mesh to allow optimal computation of isosurfaces with minimal overhead storage. The Contour Tree can also be used to build user interfaces reporting the complete topological characterization of a scalar field, as shown in Figure 1. Data exploration time is reduced since the user understands the evolution of level set components with changing isovalue. The Augmented Contour Tree provides even more accurate information segmenting the range space of the scalar field in portion of invariant topology. The exploration time for a single isosurface is also improved since its genus is known in advance. Our first new algorithm augments any given Contour Tree with the Betti numbers of all possible corresponding isocontours in linear time with the size of the tree. Moreover we show how to extend the scheme introduced in [3] with the Betti number computation without increasing its complexity. Thus, we improve on the time complexity from our previous approach [10] from O(m log m) to O(n log n + m), where m is the number of cells and n is the number of vertices in the domain of F. Our second contribution is a new divide-and-conquer algorithm that computes the Augmented Contour Tree with improved efficiency. The approach computes the output Contour Tree by merging two intermediate Contour Trees and is independent of the interpolant. In this way we confine any knowledge regarding a specific interpolant to an independent function that computes the tree for a single cell. We have implemented this function for the trilinear interpolant and plan to replace it with higher order interpolants when needed. The time complexity is O(n + t log n), where t is the number of critical points of F. For the first time

  17. Performing a global barrier operation in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-12-09

    Executing computing tasks on a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier.

  18. Parallelization of ARC3D with Computer-Aided Tools

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SCSI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, Cesspools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably well performance, for example, having a speedup of 30 for 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts for the automated parallelization process although user intervention in many of these parts are still necessary. Nevertheless, development and improvement of useful software tools, such as Cesspools, can help trim down many tedious parallelization details and improve the processing efficiency.

  19. Parallel algorithms and archtectures for computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Patrick, Merrell; Ma, Shing; Mahajan, Umesh

    1989-01-01

    The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.

  20. Parallel structures in human and computer memory

    NASA Technical Reports Server (NTRS)

    Kanerva, P.

    1986-01-01

    If one thinks of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library. One recognizes a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. A computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do is constructed. Such a memory is remarkably similiar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. It is concluded that the frame problem of artificial intelligence could be solved by the use of such a memory if one were able to encode information about the world properly.

  1. Parallel structures in human and computer memory

    NASA Astrophysics Data System (ADS)

    Kanerva, Pentti

    1986-08-01

    If we think of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library: We recognize a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. This paper is about how to construct a computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. The paper concludes that the frame problem of artificial intelligence could be solved by the use of such a memory if we were able to encode information about the world properly.

  2. Java Performance for Scientific Applications on LLNL Computer Systems

    SciTech Connect

    Kapfer, C; Wissink, A

    2002-05-10

    Languages in use for high performance computing at the laboratory--Fortran (f77 and f90), C, and C++--have many years of development behind them and are generally considered the fastest available. However, Fortran and C do not readily extend to object-oriented programming models, limiting their capability for very complex simulation software. C++ facilitates object-oriented programming but is a very complex and error-prone language. Java offers a number of capabilities that these other languages do not. For instance it implements cleaner (i.e., easier to use and less prone to errors) object-oriented models than C++. It also offers networking and security as part of the language standard, and cross-platform executables that make it architecture neutral, to name a few. These features have made Java very popular for industrial computing applications. The aim of this paper is to explain the trade-offs in using Java for large-scale scientific applications at LLNL. Despite its advantages, the computational science community has been reluctant to write large-scale computationally intensive applications in Java due to concerns over its poor performance. However, considerable progress has been made over the last several years. The Java Grande Forum [1] has been promoting the use of Java for large-scale computing. Members have introduced efficient array libraries, developed fast just-in-time (JIT) compilers, and built links to existing packages used in high performance parallel computing.

  3. Misleading Performance Claims in Parallel Computations

    SciTech Connect

    Bailey, David H.

    2009-05-29

    In a previous humorous note entitled 'Twelve Ways to Fool the Masses,' I outlined twelve common ways in which performance figures for technical computer systems can be distorted. In this paper and accompanying conference talk, I give a reprise of these twelve 'methods' and give some actual examples that have appeared in peer-reviewed literature in years past. I then propose guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion, not only in the world of device simulation but also in the larger arena of technical computing.

  4. Integrating multiple scientific computing needs via a Private Cloud infrastructure

    NASA Astrophysics Data System (ADS)

    Bagnasco, S.; Berzano, D.; Brunetti, R.; Lusso, S.; Vallero, S.

    2014-06-01

    In a typical scientific computing centre, diverse applications coexist and share a single physical infrastructure. An underlying Private Cloud facility eases the management and maintenance of heterogeneous use cases such as multipurpose or application-specific batch farms, Grid sites catering to different communities, parallel interactive data analysis facilities and others. It allows to dynamically and efficiently allocate resources to any application and to tailor the virtual machines according to the applications' requirements. Furthermore, the maintenance of large deployments of complex and rapidly evolving middleware and application software is eased by the use of virtual images and contextualization techniques; for example, rolling updates can be performed easily and minimizing the downtime. In this contribution we describe the Private Cloud infrastructure at the INFN-Torino Computer Centre, that hosts a full-fledged WLCG Tier-2 site and a dynamically expandable PROOF-based Interactive Analysis Facility for the ALICE experiment at the CERN LHC and several smaller scientific computing applications. The Private Cloud building blocks include the OpenNebula software stack, the GlusterFS filesystem (used in two different configurations for worker- and service-class hypervisors) and the OpenWRT Linux distribution (used for network virtualization). A future integration into a federated higher-level infrastructure is made possible by exposing commonly used APIs like EC2 and by using mainstream contextualization tools like CloudInit.

  5. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  6. Identifying failure in a tree network of a parallel computer

    DOEpatents

    Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.

    2010-08-24

    Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.

  7. A Lanczos eigenvalue method on a parallel computer

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.; Fulton, Robert E.

    1987-01-01

    Eigenvalue analyses of complex structures is a computationally intensive task which can benefit significantly from new and impending parallel computers. This study reports on a parallel computer implementation of the Lanczos method for free vibration analysis. The approach used here subdivides the major Lanczos calculation tasks into subtasks and introduces parallelism down to the subtask levels such as matrix decomposition and forward/backward substitution. The method was implemented on a commercial parallel computer and results were obtained for a long flexible space structure. While parallel computing efficiency for the Lanczos method was good for a moderate number of processors for the test problem, the greatest reduction in time was realized for the decomposition of the stiffness matrix, a calculation which took 70 percent of the time in the sequential program and which took 25 percent of the time on eight processors. For a sample calculation of the twenty lowest frequencies of a 486 degree of freedom problem, the total sequential computing time was reduced by almost a factor of ten using 16 processors.

  8. A microeconomic scheduler for parallel computers

    NASA Technical Reports Server (NTRS)

    Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex

    1995-01-01

    We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.

  9. A compositional reservoir simulator on distributed memory parallel computers

    SciTech Connect

    Rame, M.; Delshad, M.

    1995-12-31

    This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented.

  10. Animated computer graphics models of space and earth sciences data generated via the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

    1987-01-01

    The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

  11. History Matching in Parallel Computational Environments

    SciTech Connect

    Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav

    2005-10-01

    A novel methodology for delineating multiple reservoir domains for the purpose of history matching in a distributed computing environment has been proposed. A fully probabilistic approach to perturb permeability within the delineated zones is implemented. The combination of robust schemes for identifying reservoir zones and distributed computing significantly increase the accuracy and efficiency of the probabilistic approach. The information pertaining to the permeability variations in the reservoir that is contained in dynamic data is calibrated in terms of a deformation parameter rD. This information is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, well configuration, flow constrains etc. The probabilistic approach then has to account for multiple r{sub D} values in different regions of the reservoir. In order to delineate reservoir domains that can be characterized with different rD parameters, principal component analysis (PCA) of the Hessian matrix has been done. The Hessian matrix summarizes the sensitivity of the objective function at a given step of the history matching to model parameters. It also measures the interaction of the parameters in affecting the objective function. The basic premise of PC analysis is to isolate the most sensitive and least correlated regions. The eigenvectors obtained during the PCA are suitably scaled and appropriate grid block volume cut-offs are defined such that the resultant domains are neither too large (which increases interactions between domains) nor too small (implying ineffective history matching). The delineation of domains requires calculation of Hessian, which could be computationally costly and as well as restricts the current approach to

  12. Methods for operating parallel computing systems employing sequenced communications

    DOEpatents

    Benner, R.E.; Gustafson, J.L.; Montry, G.R.

    1999-08-10

    A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.

  13. Methods for operating parallel computing systems employing sequenced communications

    DOEpatents

    Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

    1999-01-01

    A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

  14. Parallel Computational Fluid Dynamics: Current Status and Future Requirements

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; VanDalsem, William R.; Dagum, Leonardo; Kutler, Paul (Technical Monitor)

    1994-01-01

    One or the key objectives of the Applied Research Branch in the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Allies Research Center is the accelerated introduction of highly parallel machines into a full operational environment. In this report we discuss the performance results obtained from the implementation of some computational fluid dynamics (CFD) applications on the Connection Machine CM-2 and the Intel iPSC/860. We summarize some of the experiences made so far with the parallel testbed machines at the NAS Applied Research Branch. Then we discuss the long term computational requirements for accomplishing some of the grand challenge problems in computational aerosciences. We argue that only massively parallel machines will be able to meet these grand challenge requirements, and we outline the computer science and algorithm research challenges ahead.

  15. Parallel computation of geometry control in adaptive truss structures

    NASA Technical Reports Server (NTRS)

    Ramesh, A. V.; Utku, S.; Wada, B. K.

    1992-01-01

    The fast computation of geometry control in adaptive truss structures involves two distinct parts: the efficient integration of the inverse kinematic differential equations that govern the geometry control and the fast computation of the Jacobian, which appears on the right-hand-side of the inverse kinematic equations. This paper present an efficient parallel implementation of the Jacobian computation on an MIMD machine. Large speedup from the parallel implementation is obtained, which reduces the Jacobian computation to an O(M-squared/n) procedure on an n-processor machine, where M is the number of members in the adaptive truss. The parallel algorithm given here is a good candidate for on-line geometry control of adaptive structures using attached processors.

  16. Six Years of Parallel Computing at NAS (1987 - 1993): What Have we Learned?

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; Cooper, D. M. (Technical Monitor)

    1994-01-01

    In the fall of 1987 the age of parallelism at NAS began with the installation of a 32K processor CM-2 from Thinking Machines. In 1987 this was described as an "experiment" in parallel processing. In the six years since, NAS acquired a series of parallel machines, and conducted an active research and development effort focused on the use of highly parallel machines for applications in the computational aerosciences. In this time period parallel processing for scientific applications evolved from a fringe research topic into the one of main activities at NAS. In this presentation I will review the history of parallel computing at NAS in the context of the major progress, which has been made in the field in general. I will attempt to summarize the lessons we have learned so far, and the contributions NAS has made to the state of the art. Based on these insights I will comment on the current state of parallel computing (including the HPCC effort) and try to predict some trends for the next six years.

  17. Numerical characterization of nonlinear dynamical systems using parallel computing: The role of GPUS approach

    NASA Astrophysics Data System (ADS)

    Fazanaro, Filipe I.; Soriano, Diogo C.; Suyama, Ricardo; Madrid, Marconi K.; Oliveira, José Raimundo de; Muñoz, Ignacio Bravo; Attux, Romis

    2016-08-01

    The characterization of nonlinear dynamical systems and their attractors in terms of invariant measures, basins of attractions and the structure of their vector fields usually outlines a task strongly related to the underlying computational cost. In this work, the practical aspects related to the use of parallel computing - specially the use of Graphics Processing Units (GPUS) and of the Compute Unified Device Architecture (CUDA) - are reviewed and discussed in the context of nonlinear dynamical systems characterization. In this work such characterization is performed by obtaining both local and global Lyapunov exponents for the classical forced Duffing oscillator. The local divergence measure was employed by the computation of the Lagrangian Coherent Structures (LCSS), revealing the general organization of the flow according to the obtained separatrices, while the global Lyapunov exponents were used to characterize the attractors obtained under one or more bifurcation parameters. These simulation sets also illustrate the required computation time and speedup gains provided by different parallel computing strategies, justifying the employment and the relevance of GPUS and CUDA in such extensive numerical approach. Finally, more than simply providing an overview supported by a representative set of simulations, this work also aims to be a unified introduction to the use of the mentioned parallel computing tools in the context of nonlinear dynamical systems, providing codes and examples to be executed in MATLAB and using the CUDA environment, something that is usually fragmented in different scientific communities and restricted to specialists on parallel computing strategies.

  18. Scientific development of a massively parallel ocean climate model. Progress report for 1992--1993 and Continuing request for 1993--1994 to CHAMMP (Computer Hardware, Advanced Mathematics, and Model Physics)

    SciTech Connect

    Semtner, A.J. Jr.; Chervin, R.M.

    1993-05-01

    During the second year of CHAMMP funding to the principal investigators, progress has been made in the proposed areas of research, as follows: investigation of the physics of the thermohaline circulation; examination of resolution effects on ocean general circulation; and development of a massively parallel ocean climate model.

  19. InSAR Scientific Computing Environment

    NASA Astrophysics Data System (ADS)

    Gurrola, E. M.; Rosen, P. A.; Sacco, G.; Zebker, H. A.; Simons, M.; Sandwell, D. T.

    2010-12-01

    The InSAR Scientific Computing Environment (ISCE) is a software development effort in its second year within the NASA Advanced Information Systems and Technology program. The ISCE will provide a new computing environment for geodetic image processing for InSAR sensors that will enable scientists to reduce measurements directly from radar satellites and aircraft to new geophysical products without first requiring them to develop detailed expertise in radar processing methods. The environment can serve as the core of a centralized processing center to bring Level-0 raw radar data up to Level-3 data products, but is adaptable to alternative processing approaches for science users interested in new and different ways to exploit mission data. The NRC Decadal Survey-recommended DESDynI mission will deliver data of unprecedented quantity and quality, making possible global-scale studies in climate research, natural hazards, and Earth's ecosystem. The InSAR Scientific Computing Environment is planned to become a key element in processing DESDynI data into higher level data products and it is expected to enable a new class of analyses that take greater advantage of the long time and large spatial scales of these new data, than current approaches. At the core of ISCE is both legacy processing software from the JPL/Caltech ROI_PAC repeat-pass interferometry package as well as a new InSAR processing package containing more efficient and more accurate processing algorithms being developed at Stanford for this project that is based on experience gained in developing processors for missions such as SRTM and UAVSAR. Around the core InSAR processing programs we are building object-oriented wrappers to enable their incorporation into a more modern, flexible, extensible software package that is informed by modern programming methods, including rigorous componentization of processing codes, abstraction and generalization of data models, and a robust, intuitive user interface with

  20. OPENING REMARKS: Scientific Discovery through Advanced Computing

    NASA Astrophysics Data System (ADS)

    Strayer, Michael

    2006-01-01

    Good morning. Welcome to SciDAC 2006 and Denver. I share greetings from the new Undersecretary for Energy, Ray Orbach. Five years ago SciDAC was launched as an experiment in computational science. The goal was to form partnerships among science applications, computer scientists, and applied mathematicians to take advantage of the potential of emerging terascale computers. This experiment has been a resounding success. SciDAC has emerged as a powerful concept for addressing some of the biggest challenges facing our world. As significant as these successes were, I believe there is also significance in the teams that achieved them. In addition to their scientific aims these teams have advanced the overall field of computational science and set the stage for even larger accomplishments as we look ahead to SciDAC-2. I am sure that many of you are expecting to hear about the results of our current solicitation for SciDAC-2. I’m afraid we are not quite ready to make that announcement. Decisions are still being made and we will announce the results later this summer. Nearly 250 unique proposals were received and evaluated, involving literally thousands of researchers, postdocs, and students. These collectively requested more than five times our expected budget. This response is a testament to the success of SciDAC in the community. In SciDAC-2 our budget has been increased to about 70 million for FY 2007 and our partnerships have expanded to include the Environment and National Security missions of the Department. The National Science Foundation has also joined as a partner. These new partnerships are expected to expand the application space of SciDAC, and broaden the impact and visibility of the program. We have, with our recent solicitation, expanded to turbulence, computational biology, and groundwater reactive modeling and simulation. We are currently talking with the Department’s applied energy programs about risk assessment, optimization of complex systems - such

  1. Modeling groundwater flow on massively parallel computers

    SciTech Connect

    Ashby, S.F.; Falgout, R.D.; Fogwell, T.W.; Tompson, A.F.B.

    1994-12-31

    The authors will explore the numerical simulation of groundwater flow in three-dimensional heterogeneous porous media. An interdisciplinary team of mathematicians, computer scientists, hydrologists, and environmental engineers is developing a sophisticated simulation code for use on workstation clusters and MPPs. To date, they have concentrated on modeling flow in the saturated zone (single phase), which requires the solution of a large linear system. they will discuss their implementation of preconditioned conjugate gradient solvers. The preconditioners under consideration include simple diagonal scaling, s-step Jacobi, adaptive Chebyshev polynomial preconditioning, and multigrid. They will present some preliminary numerical results, including simulations of groundwater flow at the LLNL site. They also will demonstrate the code`s scalability.

  2. Parallel Domain Decomposition Preconditioning for Computational Fluid Dynamics

    NASA Technical Reports Server (NTRS)

    Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai; Kutler, Paul (Technical Monitor)

    1998-01-01

    This viewgraph presentation gives an overview of the parallel domain decomposition preconditioning for computational fluid dynamics. Details are given on some difficult fluid flow problems, stabilized spatial discretizations, and Newton's method for solving the discretized flow equations. Schur complement domain decomposition is described through basic formulation, simplifying strategies (including iterative subdomain and Schur complement solves, matrix element dropping, localized Schur complement computation, and supersparse computations), and performance evaluation.

  3. Computational mechanics analysis tools for parallel-vector supercomputers

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.; Qin, J.

    1993-01-01

    Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization algorithm and domain decomposition. The source code for many of these algorithms is available from NASA Langley.

  4. Computational mechanics analysis tools for parallel-vector supercomputers

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.; Nguyen, Duc T.; Baddourah, Majdi; Qin, Jiangning

    1993-01-01

    Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigensolution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition. The source code for many of these algorithms is available.

  5. Parallel and pipeline computation of fast unitary transforms

    NASA Technical Reports Server (NTRS)

    Fino, B. J.; Algazi, V. R.

    1975-01-01

    The letter discusses the parallel and pipeline organization of fast-unitary-transform algorithms such as the fast Fourier transform, and points out the efficiency of a combined parallel-pipeline processor of a transform such as the Haar transform, in which (2 to the n-th power) -1 hardware 'butterflies' generate a transform of order 2 to the n-th power every computation cycle.

  6. Dynamic grid refinement for partial differential equations on parallel computers

    NASA Technical Reports Server (NTRS)

    Mccormick, S.; Quinlan, D.

    1989-01-01

    The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.

  7. Numerical simulation of supersonic wake flow with parallel computers

    SciTech Connect

    Wong, C.C.; Soetrisno, M.

    1995-07-01

    Simulating a supersonic wake flow field behind a conical body is a computing intensive task. It requires a large number of computational cells to capture the dominant flow physics and a robust numerical algorithm to obtain a reliable solution. High performance parallel computers with unique distributed processing and data storage capability can provide this need. They have larger computational memory and faster computing time than conventional vector computers. We apply the PINCA Navier-Stokes code to simulate a wind-tunnel supersonic wake experiment on Intel Gamma, Intel Paragon, and IBM SP2 parallel computers. These simulations are performed to study the mean flow in the near wake region of a sharp, 7-degree half-angle, adiabatic cone at Mach number 4.3 and freestream Reynolds number of 40,600. Overall the numerical solutions capture the general features of the hypersonic laminar wake flow and compare favorably with the wind tunnel data. With a refined and clustering grid distribution in the recirculation zone, the calculated location of the rear stagnation point is consistent with the 2D axisymmetric and 3D experiments. In this study, we also demonstrate the importance of having a large local memory capacity within a computer node and the effective utilization of the number of computer nodes to achieve good parallel performance when simulating a complex, large-scale wake flow problem.

  8. Parallel algorithm for computing 3-D reachable workspaces

    NASA Astrophysics Data System (ADS)

    Alameldin, Tarek K.; Sobh, Tarek M.

    1992-03-01

    The problem of computing the 3-D workspace for redundant articulated chains has applications in a variety of fields such as robotics, computer aided design, and computer graphics. The computational complexity of the workspace problem is at least NP-hard. The recent advent of parallel computers has made practical solutions for the workspace problem possible. Parallel algorithms for computing the 3-D workspace for redundant articulated chains with joint limits are presented. The first phase of these algorithms computes workspace points in parallel. The second phase uses workspace points that are computed in the first phase and fits a 3-D surface around the volume that encompasses the workspace points. The second phase also maps the 3- D points into slices, uses region filling to detect the holes and voids in the workspace, extracts the workspace boundary points by testing the neighboring cells, and tiles the consecutive contours with triangles. The proposed algorithms are efficient for computing the 3-D reachable workspace for articulated linkages, not only those with redundant degrees of freedom but also those with joint limits.

  9. RAMA: A file system for massively parallel computers

    NASA Technical Reports Server (NTRS)

    Miller, Ethan L.; Katz, Randy H.

    1993-01-01

    This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.

  10. Toward an automated parallel computing environment for geosciences

    NASA Astrophysics Data System (ADS)

    Zhang, Huai; Liu, Mian; Shi, Yaolin; Yuen, David A.; Yan, Zhenzhen; Liang, Guoping

    2007-08-01

    Software for geodynamic modeling has not kept up with the fast growing computing hardware and network resources. In the past decade supercomputing power has become available to most researchers in the form of affordable Beowulf clusters and other parallel computer platforms. However, to take full advantage of such computing power requires developing parallel algorithms and associated software, a task that is often too daunting for geoscience modelers whose main expertise is in geosciences. We introduce here an automated parallel computing environment built on open-source algorithms and libraries. Users interact with this computing environment by specifying the partial differential equations, solvers, and model-specific properties using an English-like modeling language in the input files. The system then automatically generates the finite element codes that can be run on distributed or shared memory parallel machines. This system is dynamic and flexible, allowing users to address different problems in geosciences. It is capable of providing web-based services, enabling users to generate source codes online. This unique feature will facilitate high-performance computing to be integrated with distributed data grids in the emerging cyber-infrastructures for geosciences. In this paper we discuss the principles of this automated modeling environment and provide examples to demonstrate its versatility.

  11. Solving unstructured grid problems on massively parallel computers

    NASA Technical Reports Server (NTRS)

    Hammond, Steven W.; Schreiber, Robert

    1990-01-01

    A highly parallel graph mapping technique that enables one to efficiently solve unstructured grid problems on massively parallel computers is presented. Many implicit and explicit methods for solving discretized partial differential equations require each point in the discretization to exchange data with its neighboring points every time step or iteration. The cost of this communication can negate the high performance promised by massively parallel computing. To eliminate this bottleneck, the graph of the irregular problem is mapped into the graph representing the interconnection topology of the computer such that the sum of the distances that the messages travel is minimized. It is shown that using the heuristic mapping algorithm significantly reduces the communication time compared to a naive assignment of processes to processors.

  12. CFD Analysis and Design Optimization Using Parallel Computers

    NASA Technical Reports Server (NTRS)

    Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James

    1997-01-01

    A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.

  13. Multithreaded Model for Dynamic Load Balancing Parallel Adaptive PDE Computations

    NASA Technical Reports Server (NTRS)

    Chrisochoides, Nikos

    1995-01-01

    We present a multithreaded model for the dynamic load-balancing of numerical, adaptive computations required for the solution of Partial Differential Equations (PDE's) on multiprocessors. Multithreading is used as a means of exploring concurrency in the processor level in order to tolerate synchronization costs inherent to traditional (non-threaded) parallel adaptive PDE solvers. Our preliminary analysis for parallel, adaptive PDE solvers indicates that multithreading can be used an a mechanism to mask overheads required for the dynamic balancing of processor workloads with computations required for the actual numerical solution of the PDE's. Also, multithreading can simplify the implementation of dynamic load-balancing algorithms, a task that is very difficult for traditional data parallel adaptive PDE computations. Unfortunately, multithreading does not always simplify program complexity, often makes code re-usability not an easy task, and increases software complexity.

  14. Parallel grid generation algorithm for distributed memory computers

    NASA Technical Reports Server (NTRS)

    Moitra, Stuti; Moitra, Anutosh

    1994-01-01

    A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.

  15. Scientific Writing & the Scientific Method: Parallel "Hourglass" Structure in Form & Content

    ERIC Educational Resources Information Center

    Schulte, Bruce A.

    2003-01-01

    The writing of a scientific article offers a sharp divergence from the standard essay paper familiar to most students from English class. Supporting documentation is embedded in the scientific prose, not footnoted along the bottom of each page. Instructing students in the art of scientific writing can be truly challenging. After years of…

  16. Institute for Scientific Computing Research Fiscal Year 2002 Annual Report

    SciTech Connect

    Keyes, D E; McGraw, J R; Bodtker, L K

    2003-03-11

    The Institute for Scientific Computing Research (ISCR) at Lawrence Livermore National Laboratory is jointly administered by the Computing Applications and Research Department (CAR) and the University Relations Program (URP), and this joint relationship expresses its mission. An extensively externally networked ISCR cost-effectively expands the level and scope of national computational science expertise available to the Laboratory through CAR. The URP, with its infrastructure for managing six institutes and numerous educational programs at LLNL, assumes much of the logistical burden that is unavoidable in bridging the Laboratory's internal computational research environment with that of the academic community. As large-scale simulations on the parallel platforms of DOE's Advanced Simulation and Computing (ASCI) become increasingly important to the overall mission of LLNL, the role of the ISCR expands in importance, accordingly. Relying primarily on non-permanent staffing, the ISCR complements Laboratory research in areas of the computer and information sciences that are needed at the frontier of Laboratory missions. The ISCR strives to be the ''eyes and ears'' of the Laboratory in the computer and information sciences, in keeping the Laboratory aware of and connected to important external advances. It also attempts to be ''feet and hands, in carrying those advances into the Laboratory and incorporating them into practice. In addition to conducting research, the ISCR provides continuing education opportunities to Laboratory personnel, in the form of on-site workshops taught by experts on novel software or hardware technologies. The ISCR also seeks to influence the research community external to the Laboratory to pursue Laboratory-related interests and to train the workforce that will be required by the Laboratory. Part of the performance of this function is interpreting to the external community appropriate (unclassified) aspects of the Laboratory's own contributions

  17. Parallel Computing Environments and Methods for Power Distribution System Simulation

    SciTech Connect

    Lu, Ning; Taylor, Zachary T.; Chassin, David P.; Guttromson, Ross T.; Studham, Scott S.

    2005-11-10

    The development of cost-effective high-performance parallel computing on multi-processor super computers makes it attractive to port excessively time consuming simulation software from personal computers (PC) to super computes. The power distribution system simulator (PDSS) takes a bottom-up approach and simulates load at appliance level, where detailed thermal models for appliances are used. This approach works well for a small power distribution system consisting of a few thousand appliances. When the number of appliances increases, the simulation uses up the PC memory and its run time increases to a point where the approach is no longer feasible to model a practical large power distribution system. This paper presents an effort made to port a PC-based power distribution system simulator (PDSS) to a 128-processor shared-memory super computer. The paper offers an overview of the parallel computing environment and a description of the modification made to the PDSS model. The performances of the PDSS running on a standalone PC and on the super computer are compared. Future research direction of utilizing parallel computing in the power distribution system simulation is also addressed.

  18. Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.

  19. Implicit schemes and parallel computing in unstructured grid CFD

    NASA Technical Reports Server (NTRS)

    Venkatakrishnam, V.

    1995-01-01

    The development of implicit schemes for obtaining steady state solutions to the Euler and Navier-Stokes equations on unstructured grids is outlined. Applications are presented that compare the convergence characteristics of various implicit methods. Next, the development of explicit and implicit schemes to compute unsteady flows on unstructured grids is discussed. Next, the issues involved in parallelizing finite volume schemes on unstructured meshes in an MIMD (multiple instruction/multiple data stream) fashion are outlined. Techniques for partitioning unstructured grids among processors and for extracting parallelism in explicit and implicit solvers are discussed. Finally, some dynamic load balancing ideas, which are useful in adaptive transient computations, are presented.

  20. Method for implementation of recursive hierarchical segmentation on parallel computers

    NASA Technical Reports Server (NTRS)

    Tilton, James C. (Inventor)

    2005-01-01

    A method, computer readable storage, and apparatus for implementing a recursive hierarchical segmentation algorithm on a parallel computing platform. The method includes setting a bottom level of recursion that defines where a recursive division of an image into sections stops dividing, and setting an intermediate level of recursion where the recursive division changes from a parallel implementation into a serial implementation. The segmentation algorithm is implemented according to the set levels. The method can also include setting a convergence check level of recursion with which the first level of recursion communicates with when performing a convergence check.

  1. Small file aggregation in a parallel computing system

    DOEpatents

    Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Zhang, Jingwang

    2014-09-02

    Techniques are provided for small file aggregation in a parallel computing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallel computing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.

  2. Computational acceleration for MR image reconstruction in partially parallel imaging.

    PubMed

    Ye, Xiaojing; Chen, Yunmei; Huang, Feng

    2011-05-01

    In this paper, we present a fast numerical algorithm for solving total variation and l(1) (TVL1) based image reconstruction with application in partially parallel magnetic resonance imaging. Our algorithm uses variable splitting method to reduce computational cost. Moreover, the Barzilai-Borwein step size selection method is adopted in our algorithm for much faster convergence. Experimental results on clinical partially parallel imaging data demonstrate that the proposed algorithm requires much fewer iterations and/or less computational cost than recently developed operator splitting and Bregman operator splitting methods, which can deal with a general sensing matrix in reconstruction framework, to get similar or even better quality of reconstructed images. PMID:20833599

  3. TSE computers - A means for massively parallel computations

    NASA Technical Reports Server (NTRS)

    Strong, J. P., III

    1976-01-01

    A description is presented of hardware concepts for building a massively parallel processing system for two-dimensional data. The processing system is to use logic arrays of 128 x 128 elements which perform over 16 thousand operations simultaneously. Attention is given to image data, logic arrays, basic image logic functions, a prototype negator, an interleaver device, image logic circuits, and an image memory circuit.

  4. Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2016-03-15

    Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.

  5. 76 FR 45786 - Advanced Scientific Computing Advisory Committee; Meeting

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-08-01

    ... Advanced Scientific Computing Advisory Committee; Meeting AGENCY: Office of Science, Department of Energy... Computing Advisory Committee (ASCAC). Federal Advisory Committee Act (Pub. L. 92-463, 86 Stat. 770) requires... INFORMATION CONTACT: Melea Baker, Office of Advanced Scientific Computing Research; SC-21/Germantown...

  6. 78 FR 64931 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-10-30

    ... Advanced Scientific Computing Advisory Committee AGENCY: Office of Science, Department of Energy. ACTION... Computing Advisory Committee (ASCAC). This meeting replaces the cancelled ASCAC meeting that was to be held... Advanced Scientific Computing Research; SC-21/Germantown Building; U. S. Department of Energy;...

  7. 76 FR 64330 - Advanced Scientific Computing Advisory Committee

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-18

    ... Advanced Scientific Computing Advisory Committee AGENCY: Department of Energy, Office of Science. ACTION... Computing Advisory Committee (ASCAC). The Federal Advisory Committee Act (Pub. L. 92-463, 86 Stat. 770..., Office of Advanced Scientific Computing Research; SC-21/Germantown Building; U. S. Department of...

  8. Requirements for supercomputing in energy research: The transition to massively parallel computing

    SciTech Connect

    Not Available

    1993-02-01

    This report discusses: The emergence of a practical path to TeraFlop computing and beyond; requirements of energy research programs at DOE; implementation: supercomputer production computing environment on massively parallel computers; and implementation: user transition to massively parallel computing.

  9. Parallel Computing for the Computed-Tomography Imaging Spectrometer

    NASA Technical Reports Server (NTRS)

    Lee, Seungwon

    2008-01-01

    This software computes the tomographic reconstruction of spatial-spectral data from raw detector images of the Computed-Tomography Imaging Spectrometer (CTIS), which enables transient-level, multi-spectral imaging by capturing spatial and spectral information in a single snapshot.

  10. Hybrid parallel computing architecture for multiview phase shifting

    NASA Astrophysics Data System (ADS)

    Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun

    2014-11-01

    The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.

  11. InSAR Scientific Computing Environment (Invited)

    NASA Astrophysics Data System (ADS)

    Rosen, P. A.; Gurrola, E. M.; Sacco, G.; Zebker, H. A.; Simons, M.; Sandwell, D. T.

    2009-12-01

    The InSAR Scientific Computing Environment (ISCE) is a new development effort within the NASA Advanced Information Systems and Technology program, with the intent of recasting the JPL/Caltech ROI_PAC repeat-pass interferometry package into a modern, reconfigurable, open-source computing environment. The new capability initiates the next generation of geodetic imaging processing technology for InSAR sensors, providing flexibility and extensibility in reducing measurements from radar satellites and aircraft to new geophysical products. The NRC Decadal Survey recommended DESDynI mission will deliver to the science community data of unprecedented quantity and quality, making possible global-scale studies in climate research, natural hazards, and Earth’s ecosystem. DESDynI will provide time series and multi-image measurements that permit four-dimensional models of Earth surface processes so that, for example, climate-induced changes over time become apparent and quantifiable. In this paper, we describe the Environment, and illustrate how it can facility space-based geodesy from InSAR. The ISCE invokes object oriented scripts to control legacy and new codes, and abstracts and generalizes the data model for efficient manipulation of objects among modules. The module interfaces are suitable for command-line execution or GUI-programming. It exposes users gradually to its levels of capability, allowing novices to apply it readily for simple tasks and for experienced users to mine the data with great facility. The intent of the effort is to encourage user contributions to the code, creating an open source community that will extend its life and utility.

  12. Solution of partial differential equations on vector and parallel computers

    NASA Technical Reports Server (NTRS)

    Ortega, J. M.; Voigt, R. G.

    1985-01-01

    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.

  13. Analysis of multigrid methods on massively parallel computers: Architectural implications

    NASA Technical Reports Server (NTRS)

    Matheson, Lesley R.; Tarjan, Robert E.

    1993-01-01

    We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.

  14. The InSAR Scientific Computing Environment

    NASA Technical Reports Server (NTRS)

    Rosen, Paul A.; Gurrola, Eric; Sacco, Gian Franco; Zebker, Howard

    2012-01-01

    We have developed a flexible and extensible Interferometric SAR (InSAR) Scientific Computing Environment (ISCE) for geodetic image processing. ISCE was designed from the ground up as a geophysics community tool for generating stacks of interferograms that lend themselves to various forms of time-series analysis, with attention paid to accuracy, extensibility, and modularity. The framework is python-based, with code elements rigorously componentized by separating input/output operations from the processing engines. This allows greater flexibility and extensibility in the data models, and creates algorithmic code that is less susceptible to unnecessary modification when new data types and sensors are available. In addition, the components support provenance and checkpointing to facilitate reprocessing and algorithm exploration. The algorithms, based on legacy processing codes, have been adapted to assume a common reference track approach for all images acquired from nearby orbits, simplifying and systematizing the geometry for time-series analysis. The framework is designed to easily allow user contributions, and is distributed for free use by researchers. ISCE can process data from the ALOS, ERS, EnviSAT, Cosmo-SkyMed, RadarSAT-1, RadarSAT-2, and TerraSAR-X platforms, starting from Level-0 or Level 1 as provided from the data source, and going as far as Level 3 geocoded deformation products. With its flexible design, it can be extended with raw/meta data parsers to enable it to work with radar data from other platforms

  15. Solving tridiagonal linear systems on the Butterfly parallel computer

    SciTech Connect

    Kumar, S.P.

    1989-01-01

    A parallel block partitioning method to solve a tri-diagonal system of linear equations is adapted to the BBN Butterfly multiprocessor. A performance analysis of the programming experiments on the 32-node Butterfly is presented. An upper bound on the number of processors to achieve the best performance with this method is derived. The computational results verify the theoretical speedup and efficiency results of the parallel algorithm over its serial counterpart. Also included is a study comparing performance runs of the same code on the Butterfly processor with a hardware floating point unit and on one with a software floating point facility. The total parallel time of the given code is considerably reduced by making use of the hardware floating point facility whereas the speedup and efficiency of the parallel program considerably improve on the system with software floating point capability. The achieved results are shown to be within 82% to 90% of the predicted performance.

  16. Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing

    NASA Technical Reports Server (NTRS)

    Ozguner, Fusun

    1996-01-01

    Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.

  17. Parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Amin-Javaheri, Masoud; Orin, David E.

    1989-01-01

    The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.

  18. Variable-Complexity Multidisciplinary Optimization on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Grossman, Bernard; Mason, William H.; Watson, Layne T.; Haftka, Raphael T.

    1998-01-01

    This report covers work conducted under grant NAG1-1562 for the NASA High Performance Computing and Communications Program (HPCCP) from December 7, 1993, to December 31, 1997. The objective of the research was to develop new multidisciplinary design optimization (MDO) techniques which exploit parallel computing to reduce the computational burden of aircraft MDO. The design of the High-Speed Civil Transport (HSCT) air-craft was selected as a test case to demonstrate the utility of our MDO methods. The three major tasks of this research grant included: development of parallel multipoint approximation methods for the aerodynamic design of the HSCT, use of parallel multipoint approximation methods for structural optimization of the HSCT, mathematical and algorithmic development including support in the integration of parallel computation for items (1) and (2). These tasks have been accomplished with the development of a response surface methodology that incorporates multi-fidelity models. For the aerodynamic design we were able to optimize with up to 20 design variables using hundreds of expensive Euler analyses together with thousands of inexpensive linear theory simulations. We have thereby demonstrated the application of CFD to a large aerodynamic design problem. For the predicting structural weight we were able to combine hundreds of structural optimizations of refined finite element models with thousands of optimizations based on coarse models. Computations have been carried out on the Intel Paragon with up to 128 nodes. The parallel computation allowed us to perform combined aerodynamic-structural optimization using state of the art models of a complex aircraft configurations.

  19. CRBLASTER: A Parallel-Processing Computational Framework for Embarrassingly Parallel Image-Analysis Algorithms

    NASA Astrophysics Data System (ADS)

    Mighell, Kenneth John

    2010-10-01

    The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the

  20. Measuring performance of parallel computers. Progress report, 1989

    SciTech Connect

    Sullivan, F.

    1994-07-01

    Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.

  1. Element-topology-independent preconditioners for parallel finite element computations

    NASA Technical Reports Server (NTRS)

    Park, K. C.; Alexander, Scott

    1992-01-01

    A family of preconditioners for the solution of finite element equations are presented, which are element-topology independent and thus can be applicable to element order-free parallel computations. A key feature of the present preconditioners is the repeated use of element connectivity matrices and their left and right inverses. The properties and performance of the present preconditioners are demonstrated via beam and two-dimensional finite element matrices for implicit time integration computations.

  2. Hardware packet pacing using a DMA in a parallel computer

    DOEpatents

    Chen, Dong; Heidelberger, Phillip; Vranas, Pavlos

    2013-08-13

    Method and system for hardware packet pacing using a direct memory access controller in a parallel computer which, in one aspect, keeps track of a total number of bytes put on the network as a result of a remote get operation, using a hardware token counter.

  3. An introduction to performance debugging for parallel computers

    SciTech Connect

    Gropp, W.

    1995-02-07

    Programming parallel computers for performance is a difficult task that requires careful attention to both single-node performance and data exchange between processors. This paper discusses some of the sources of poor performance, ways to identify them in an application, and a few ways to address these issues.

  4. PCCM2: A GCM adapted for scalable parallel computers

    SciTech Connect

    Drake, J.; Semeraro, B.D.; Worley, P.; Foster, I.; Michalakes, J.; Toonen, B.; Hack, J.J.; Williamson, D.L.

    1994-01-01

    The Computer Hardware, Advanced Mathematics and Model Physics (CHAMMP) program seeks to provide climate researchers with an advanced modeling capability for the study of global change issues. One of the more ambitious projects being undertaken in the CHAMMP program is the development of PCCM2, an adaptation of the Community Climate Model (CCM2) for scalable parallel computers. PCCM2 uses a message-passing, domain-decomposition approach, in which each processor is allocated responsibility for computation on one part of the computational grid, and messages are generated to communicate data between processors. Much of the research effort associated with development of a parallel code of this sort is concerned with identifying efficient decomposition and communication strategies. In PCCM2, this task is complicated by the need to support both semi-Lagrangian transport and spectral transport. Load balancing and parallel I/O techniques are also required. In this paper, the authors review the various parallel algorithms used in PCCM2 and the work done to arrive at a validated model.

  5. Efficiently modeling neural networks on massively parallel computers

    SciTech Connect

    Farber, R.M.

    1992-01-01

    Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper will describe the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SMM computers and can be implemented on computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors. We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can be extend to arbitrarily large networks by merging the memory space of separate processors with fast adjacent processor inter-processor communications. This paper will consider the simulation of only feed forward neural network although this method is extendible to recurrent networks.

  6. Efficiently modeling neural networks on massively parallel computers

    SciTech Connect

    Farber, R.M.

    1992-12-01

    Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper will describe the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SMM computers and can be implemented on computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors. We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can be extend to arbitrarily large networks by merging the memory space of separate processors with fast adjacent processor inter-processor communications. This paper will consider the simulation of only feed forward neural network although this method is extendible to recurrent networks.

  7. Efficiently modeling neural networks on massively parallel computers

    NASA Technical Reports Server (NTRS)

    Farber, Robert M.

    1993-01-01

    Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.

  8. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-06-09

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  9. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-06-30

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  10. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-08-11

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  11. Fencing data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-06-02

    Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

  12. Computationally efficient implementation of combustion chemistry in parallel PDF calculations

    NASA Astrophysics Data System (ADS)

    Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.

    2009-08-01

    In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel

  13. High Performance Input/Output for Parallel Computer Systems

    NASA Technical Reports Server (NTRS)

    Ligon, W. B.

    1996-01-01

    The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, The Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX- like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications fiom levels 1,2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.

  14. Aggregating job exit statuses of a plurality of compute nodes executing a parallel application

    SciTech Connect

    Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Mundy, Michael B.

    2015-07-21

    Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.

  15. Parallel multithread computing for spectroscopic analysis in optical coherence tomography

    NASA Astrophysics Data System (ADS)

    Trojanowski, Michal; Kraszewski, Maciej; Strakowski, Marcin; Pluciński, Jerzy

    2014-05-01

    Spectroscopic Optical Coherence Tomography (SOCT) is an extension of Optical Coherence Tomography (OCT). It allows gathering spectroscopic information from individual scattering points inside the sample. It is based on time-frequency analysis of interferometric signals. Such analysis requires calculating hundreds of Fourier transforms while performing a single A-scan. Additionally, further processing of acquired spectroscopic information is needed. This significantly increases the time of required computations. During last years, application of graphical processing units (GPU's) was proposed to reduce computation time in OCT by using parallel computing algorithms. GPU technology can be also used to speed-up signal processing in SOCT. However, parallel algorithms used in classical OCT need to be revised because of different character of analyzed data. The classical OCT requires processing of long, independent interferometric signals for obtaining subsequent A-scans. The difference with SOCT is that it requires processing of multiple, shorter signals, which differ only in a small part of samples. We have developed new algorithms for parallel signal processing for usage in SOCT, implemented with NVIDIA CUDA (Compute Unified Device Architecture). We present details of the algorithms and performance tests for analyzing data from in-house SD-OCT system. We also give a brief discussion about usefulness of developed algorithm. Presented algorithms might be useful for researchers working on OCT, as they allow to reduce computation time and are step toward real-time signal processing of SOCT data.

  16. A Simple Physical Optics Algorithm Perfect for Parallel Computing

    NASA Technical Reports Server (NTRS)

    Imbriale, W. A.; Cwik, T.

    1993-01-01

    One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.

  17. New parallel vision environment in heterogeneous networked computing

    NASA Astrophysics Data System (ADS)

    Kim, Kyung N.; Choi, Tae-Sun; Ramakrishna, R. S.

    1998-09-01

    Many vision tasks are very complex and computationally intensive. Real time requirements further aggravate the situation. They usually involve both structured (low-level vision) and unstructured (high-level vision) computations. Parallel approaches offer hope in this context. Parallel approaches to vision tasks and scheduling schemes for their implementation receive special emphasis in this paper. Architectural issues are also addressed. The aim is to design algorithms which can be implemented on low cost heterogeneous networks running PVM. Issues connected with general purpose architectures also receive attention. The proposed ideas have been illustrated through a practical example (of eye location from an image sequence). Next generation multimedia environments are expected to routinely employ such high performance computing platforms.

  18. Parallel distance matrix computation for Matlab data mining

    NASA Astrophysics Data System (ADS)

    Skurowski, Przemysław; Staniszewski, Michał

    2016-06-01

    The paper presents utility functions for computing of a distance matrix, which plays a crucial role in data mining. The goal in the design was to enable operating on relatively large datasets by overcoming basic shortcoming - computing time - with an interface easy to use. The presented solution is a set of functions, which were created with emphasis on practical applicability in real life. The proposed solution is presented along the theoretical background for the performance scaling. Furthermore, different approaches of the parallel computing are analyzed, including shared memory, which is uncommon in Matlab environment.

  19. Computing NLTE Opacities -- Node Level Parallel Calculation

    SciTech Connect

    Holladay, Daniel

    2015-09-11

    Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.

  20. Parallel computing in genomic research: advances and applications.

    PubMed

    Ocaña, Kary; de Oliveira, Daniel

    2015-01-01

    Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities.

  1. Parallel computing in genomic research: advances and applications.

    PubMed

    Ocaña, Kary; de Oliveira, Daniel

    2015-01-01

    Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. PMID:26604801

  2. Parallel computing in genomic research: advances and applications

    PubMed Central

    Ocaña, Kary; de Oliveira, Daniel

    2015-01-01

    Today’s genomic experiments have to process the so-called “biological big data” that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. PMID:26604801

  3. New Parallel computing framework for radiation transport codes

    SciTech Connect

    Kostin, M.A.; Mokhov, N.V.; Niita, K.; /JAERI, Tokai

    2010-09-01

    A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.

  4. Scientific Word Processing for Personal Computers.

    ERIC Educational Resources Information Center

    Canham, Geoffrey W. Rayner

    1987-01-01

    Discusses some of the variables teachers should consider when selecting scientific word processing software. Briefly describes some of the attributes of 17 software packages that were developed to accommodate scientific formulae and chemical structures. Provides the names and addresses of the respective software producers. (TW)

  5. Experiences with the Lanczos method on a parallel computer

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.; Fulton, Robert E.

    1987-01-01

    A parallel computer implementation of the Lanczos method for the free-vibration analysis of structures is considered, and results for two example problems show substantial time-reduction over the sequential solutions. The major Lanczos calculation tasks are subdivided into subtasks, and parallelism is introduced at the subtask level. A speedup of 7.8 on eight processors was obtained for the decomposition step of the problem involving a 60-m three-longeron space mast, and a speedup of 14.6 on 16 processors was obtained for the decomposition step of the problem involving a blade-stiffened graphite-epoxy panel.

  6. Computational Simulations and the Scientific Method

    NASA Technical Reports Server (NTRS)

    Kleb, Bil; Wood, Bill

    2005-01-01

    As scientific simulation software becomes more complicated, the scientific-software implementor's need for component tests from new model developers becomes more crucial. The community's ability to follow the basic premise of the Scientific Method requires independently repeatable experiments, and model innovators are in the best position to create these test fixtures. Scientific software developers also need to quickly judge the value of the new model, i.e., its cost-to-benefit ratio in terms of gains provided by the new model and implementation risks such as cost, time, and quality. This paper asks two questions. The first is whether other scientific software developers would find published component tests useful, and the second is whether model innovators think publishing test fixtures is a feasible approach.

  7. Constructing Scientific Arguments Using Evidence from Dynamic Computational Climate Models

    ERIC Educational Resources Information Center

    Pallant, Amy; Lee, Hee-Sun

    2015-01-01

    Modeling and argumentation are two important scientific practices students need to develop throughout school years. In this paper, we investigated how middle and high school students (N = 512) construct a scientific argument based on evidence from computational models with which they simulated climate change. We designed scientific argumentation…

  8. Diderot: a Domain-Specific Language for Portable Parallel Scientific Visualization and Image Analysis.

    PubMed

    Kindlmann, Gordon; Chiw, Charisee; Seltzer, Nicholas; Samuels, Lamont; Reppy, John

    2016-01-01

    Many algorithms for scientific visualization and image analysis are rooted in the world of continuous scalar, vector, and tensor fields, but are programmed in low-level languages and libraries that obscure their mathematical foundations. Diderot is a parallel domain-specific language that is designed to bridge this semantic gap by providing the programmer with a high-level, mathematical programming notation that allows direct expression of mathematical concepts in code. Furthermore, Diderot provides parallel performance that takes advantage of modern multicore processors and GPUs. The high-level notation allows a concise and natural expression of the algorithms and the parallelism allows efficient execution on real-world datasets. PMID:26529733

  9. Diderot: a Domain-Specific Language for Portable Parallel Scientific Visualization and Image Analysis.

    PubMed

    Kindlmann, Gordon; Chiw, Charisee; Seltzer, Nicholas; Samuels, Lamont; Reppy, John

    2016-01-01

    Many algorithms for scientific visualization and image analysis are rooted in the world of continuous scalar, vector, and tensor fields, but are programmed in low-level languages and libraries that obscure their mathematical foundations. Diderot is a parallel domain-specific language that is designed to bridge this semantic gap by providing the programmer with a high-level, mathematical programming notation that allows direct expression of mathematical concepts in code. Furthermore, Diderot provides parallel performance that takes advantage of modern multicore processors and GPUs. The high-level notation allows a concise and natural expression of the algorithms and the parallelism allows efficient execution on real-world datasets.

  10. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.

  11. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2015-02-03

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.

  12. The 2nd Symposium on the Frontiers of Massively Parallel Computations

    NASA Technical Reports Server (NTRS)

    Mills, Ronnie (Editor)

    1988-01-01

    Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.

  13. User documentation for PVODE, an ODE solver for parallel computers

    SciTech Connect

    Hindmarsh, A.C., LLNL

    1998-05-01

    PVODE is a general purpose ordinary differential equation (ODE) solver for stiff and nonstiff ODES It is based on CVODE [5] [6], which is written in ANSI- standard C PVODE uses MPI (Message-Passing Interface) [8] and a revised version of the vector module in CVODE to achieve parallelism and portability PVODE is intended for the SPMD (Single Program Multiple Data) environment with distributed memory, in which all vectors are identically distributed across processors In particular, the vector module is designed to help the user assign a contiguous segment of a given vector to each of the processors for parallel computation The idea is for each processor to solve a certain fixed subset of the ODES To better understand PVODE, we first need to understand CVODE and its historical background The ODE solver CVODE, which was written by Cohen and Hindmarsh, combines features of two earlier Fortran codes, VODE [l] and VODPK [3] Those two codes were written by Brown, Byrne, and Hindmarsh. Both use variable-coefficient multi-step integration methods, and address both stiff and nonstiff systems (Stiffness is defined as the presence of one or more very small damping time constants ) VODE uses direct linear algebraic techniques to solve the underlying banded or dense linear systems of equations in conjunction with a modified Newton method in the stiff ODE case On the other hand, VODPK uses a preconditioned Krylov iterative method [2] to solve the underlying linear system User-supplied preconditioners directly address the dominant source of stiffness Consequently, CVODE implements both the direct and iterative methods Currently, with regard to the nonlinear and linear system solution, PVODE has three method options available. functional iteration, Newton iteration with a diagonal approximate Jacobian, and Newton iteration with the iterative method SPGMR (Scaled Preconditioned Generalized Minimal Residual method) Both CVODE and PVODE are written in such a way that other linear

  14. PArallel Reacting Multiphase FLOw Computational Fluid Dynamic Analysis

    2002-06-01

    PARMFLO is a parallel multiphase reacting flow computational fluid dynamics (CFD) code. It can perform steady or unsteady simulations in three space dimensions. It is intended for use in engineering CFD analysis of industrial flow system components. Its parallel processing capabilities allow it to be applied to problems that use at least an order of magnitude more computational cells than the number that can be used on a typical single processor workstation (about 106 cellsmore » in parallel processing mode versus about io cells in serial processing mode). Alternately, by spreading the work of a CFD problem that could be run on a single workstation over a group of computers on a network, it can bring the runtime down by an order of magnitude or more (typically from many days to less than one day). The software was implemented using the industry standard Message-Passing Interface (MPI) and domain decomposition in one spatial direction. The phases of a flow problem may include an ideal gas mixture with an arbitrary number of chemical species, and dispersed droplet and particle phases. Regions of porous media may also be included within the domain. The porous media may be packed beds, foams, or monolith catalyst supports. With these features, the code is especially suited to analysis of mixing of reactants in the inlet chamber of catalytic reactors coupled to computation of product yields that result from the flow of the mixture through the catalyst coaled support structure.« less

  15. Distributed parallel computing in stochastic modeling of groundwater systems.

    PubMed

    Dong, Yanhui; Li, Guomin; Xu, Haizhen

    2013-03-01

    Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling.

  16. 3D seismic imaging on massively parallel computers

    SciTech Connect

    Womble, D.E.; Ober, C.C.; Oldfield, R.

    1997-02-01

    The ability to image complex geologies such as salt domes in the Gulf of Mexico and thrusts in mountainous regions is a key to reducing the risk and cost associated with oil and gas exploration. Imaging these structures, however, is computationally expensive. Datasets can be terabytes in size, and the processing time required for the multiple iterations needed to produce a velocity model can take months, even with the massively parallel computers available today. Some algorithms, such as 3D, finite-difference, prestack, depth migration remain beyond the capacity of production seismic processing. Massively parallel processors (MPPs) and algorithms research are the tools that will enable this project to provide new seismic processing capabilities to the oil and gas industry. The goals of this work are to (1) develop finite-difference algorithms for 3D, prestack, depth migration; (2) develop efficient computational approaches for seismic imaging and for processing terabyte datasets on massively parallel computers; and (3) develop a modular, portable, seismic imaging code.

  17. Parallel computation of meshless methods for explicit dynamic analysis.

    SciTech Connect

    Danielson, K. T.; Hao, S.; Liu, W. K.; Uras, R. A.; Li, S.; Reactor Engineering; Northwestern Univ.; Waterways Experiment Station

    2000-03-10

    A parallel computational implementation of modern meshless methods is presented for explicit dynamic analysis. The procedures are demonstrated by application of the Reproducing Kernel Particle Method (RKPM). Aspects of a coarse grain parallel paradigm are detailed for a Lagrangian formulation using model partitioning. Integration points are uniquely defined on separate processors and particle definitions are duplicated, as necessary, so that all support particles for each point are defined locally on the corresponding processor. Several partitioning schemes are considered and a reduced graph-based procedure is presented. Partitioning issues are discussed and procedures to accommodate essential boundary conditions in parallel are presented. Explicit MPI message passing statements are used for all communications among partitions on different processors. The effectiveness of the procedure is demonstrated by highly deformable inelastic example problems.

  18. Implementation of ADI: Schemes on MIMD parallel computers

    NASA Technical Reports Server (NTRS)

    Vanderwijngaart, Rob F.

    1993-01-01

    In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.

  19. Computing through Scientific Abstractions in SysBioPS

    SciTech Connect

    Chin, George; Stephan, Eric G.; Gracio, Deborah K.

    2004-10-13

    Today, biologists and bioinformaticists have a tremendous amount of computational power at their disposal. With the availability of supercomputers, burgeoning scientific databases and digital libraries such as GenBank and PubMed, and pervasive computational environments such as the Grid, biologists have access to a wealth of computational capabilities and scientific data at hand. Yet, the rapid development of computational technologies has far exceeded the typical biologist’s ability to effectively apply the technology in their research. Computational sciences research and development efforts such as the Biology Workbench, BioSPICE (Biological Simulation Program for Intra-Cellular Evaluation), and BioCoRE (Biological Collaborative Research Environment) are important in connecting biologists and their scientific problems to computational infrastructures. On the Computational Cell Environment and Heuristic Entity-Relationship Building Environment projects at the Pacific Northwest National Laboratory, we are jointly developing a new breed of scientific problem solving environment called SysBioPSE that will allow biologists to access and apply computational resources in the scientific research context. In contrast to other computational science environments, SysBioPSE operates as an abstraction layer above a computational infrastructure. The goal of SysBioPSE is to allow biologists to apply computational resources in the context of the scientific problems they are addressing and the scientific perspectives from which they conduct their research. More specifically, SysBioPSE allows biologists to capture and represent scientific concepts and theories and experimental processes, and to link these views to scientific applications, data repositories, and computer systems.

  20. Scientific computing infrastructure and services in Moldova

    NASA Astrophysics Data System (ADS)

    Bogatencov, P. P.; Secrieru, G. V.; Degteariov, N. V.; Iliuha, N. P.

    2016-09-01

    In recent years distributed information processing and high-performance computing technologies (HPC, distributed Cloud and Grid computing infrastructures) for solving complex tasks with high demands of computing resources are actively developing. In Moldova the works on creation of high-performance and distributed computing infrastructures were started relatively recently due to participation in implementation of a number of international projects. Research teams from Moldova participated in a series of regional and pan-European projects that allowed them to begin forming the national heterogeneous computing infrastructure, get access to regional and European computing resources, and expand the range and areas of solving tasks.

  1. Executing a gather operation on a parallel computer

    DOEpatents

    Archer, Charles J.; Ratterman, Joseph D.

    2012-03-20

    Methods, apparatus, and computer program products are disclosed for executing a gather operation on a parallel computer according to embodiments of the present invention. Embodiments include configuring, by the logical root, a result buffer or the logical root, the result buffer having positions, each position corresponding to a ranked node in the operational group and for storing contribution data gathered from that ranked node. Embodiments also include repeatedly for each position in the result buffer: determining, by each compute node of an operational group, whether the current position in the result buffer corresponds with the rank of the compute node, if the current position in the result buffer corresponds with the rank of the compute node, contributing, by that compute node, the compute node's contribution data, if the current position in the result buffer does not correspond with the rank of the compute node, contributing, by that compute node, a value of zero for the contribution data, and storing, by the logical root in the current position in the result buffer, results of a bitwise OR operation of all the contribution data by all compute nodes of the operational group for the current position, the results received through the global combining network.

  2. Numerical computation on massively parallel hypercubes. [Connection machine

    SciTech Connect

    McBryan, O.A.

    1986-01-01

    We describe numerical computations on the Connection Machine, a massively parallel hypercube architecture with 65,536 single-bit processors and 32 Mbytes of memory. A parallel extension of COMMON LISP, provides access to the processors and network. The rich software environment is further enhanced by a powerful virtual processor capability, which extends the degree of fine-grained parallelism beyond 1,000,000. We briefly describe the hardware and indicate the principal features of the parallel programming environment. We then present implementations of SOR, multigrid and pre-conditioned conjugate gradient algorithms for solving partial differential equations on the Connection Machine. Despite the lack of floating point hardware, computation rates above 100 megaflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly. The software development effort is also facilitated by the fact that hypercube communications prove to be fast and essentially independent of distance. 29 refs., 4 figs.

  3. Berkeley Lab Computing Sciences: Accelerating Scientific Discovery

    SciTech Connect

    Hules, John A

    2008-12-12

    Scientists today rely on advances in computer science, mathematics, and computational science, as well as large-scale computing and networking facilities, to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab's Computing Sciences organization researches, develops, and deploys new tools and technologies to meet these needs and to advance research in such areas as global climate change, combustion, fusion energy, nanotechnology, biology, and astrophysics.

  4. Probabilistic structural mechanics research for parallel processing computers

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Martin, William R.

    1991-01-01

    Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods was hampered by their computationally intense nature. Solution of PSM problems requires repeated analyses of structures that are often large, and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large scale PSM problems practical.

  5. Establishing a group of endpoints in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.; Xue, Hanhong

    2016-02-02

    A parallel computer executes a number of tasks, each task includes a number of endpoints and the endpoints are configured to support collective operations. In such a parallel computer, establishing a group of endpoints receiving a user specification of a set of endpoints included in a global collection of endpoints, where the user specification defines the set in accordance with a predefined virtual representation of the endpoints, the predefined virtual representation is a data structure setting forth an organization of tasks and endpoints included in the global collection of endpoints and the user specification defines the set of endpoints without a user specification of a particular endpoint; and defining a group of endpoints in dependence upon the predefined virtual representation of the endpoints and the user specification.

  6. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A.; Faraj, Daniel A.

    2012-12-11

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

  7. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A; Faraj, Daniel A

    2013-06-04

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

  8. Final Report: Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    Mellor-Crummey, John

    2011-09-13

    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  9. Portable Extensible Toolkit for Scientific Computation

    1995-06-30

    PETSC2.0 is a software toolkit for portable, parallel (and serial) numerical solution of partial differential equations and minimization problems. It includes software for the solution of linear and nonlinear systems of equations. These codes are written in a data-structure-neutral manner to enable easy reuse and flexibility.

  10. About parallel computing on spatial rotations in spin mesomorphic structures

    NASA Astrophysics Data System (ADS)

    Nesterov, Michal M.; Tarkhanov, Victor I.

    2004-05-01

    An approach to coherent parallel computing on finite spatial rotations is considered. It is shown that such rotations are elements of 4-dimensional (4D) space and to design any algorithms with them we need to change Boole logic to more complicated rotational one. The latter is provided by formalism of Cayley-Klein parameters, which is compatible with geometric Clifford algebra and all its elements of pure and mixed grades: vectors, scalars, paravectors, quaternions, spinors, etc.

  11. Pipeline and parallel architectures for computer communication systems

    SciTech Connect

    Reddi, A.V.

    1983-01-01

    Various existing communication precessor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.

  12. Performing an allreduce operation on a plurality of compute nodes of a parallel computer

    DOEpatents

    Faraj, Ahmad

    2012-04-17

    Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer. Each compute node includes at least two processing cores. Each processing core has contribution data for the allreduce operation. Performing an allreduce operation on a plurality of compute nodes of a parallel computer includes: establishing one or more logical rings among the compute nodes, each logical ring including at least one processing core from each compute node; performing, for each logical ring, a global allreduce operation using the contribution data for the processing cores included in that logical ring, yielding a global allreduce result for each processing core included in that logical ring; and performing, for each compute node, a local allreduce operation using the global allreduce results for each processing core on that compute node.

  13. Rapid indirect trajectory optimization on highly parallel computing architectures

    NASA Astrophysics Data System (ADS)

    Antony, Thomas

    Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical

  14. A Parallel Iterative Method for Computing Molecular Absorption Spectra.

    PubMed

    Koval, Peter; Foerster, Dietrich; Coulaud, Olivier

    2010-09-14

    We describe a fast parallel iterative method for computing molecular absorption spectra within TDDFT linear response and using the LCAO method. We use a local basis of "dominant products" to parametrize the space of orbital products that occur in the LCAO approach. In this basis, the dynamic polarizability is computed iteratively within an appropriate Krylov subspace. The iterative procedure uses a matrix-free GMRES method to determine the (interacting) density response. The resulting code is about 1 order of magnitude faster than our previous full-matrix method. This acceleration makes the speed of our TDDFT code comparable with codes based on Casida's equation. The implementation of our method uses hybrid MPI and OpenMP parallelization in which load balancing and memory access are optimized. To validate our approach and to establish benchmarks, we compute spectra of large molecules on various types of parallel machines. The methods developed here are fairly general, and we believe they will find useful applications in molecular physics/chemistry, even for problems that are beyond TDDFT, such as organic semiconductors, particularly in photovoltaics.

  15. Domain decomposition methods for the parallel computation of reacting flows

    NASA Technical Reports Server (NTRS)

    Keyes, David E.

    1988-01-01

    Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate for it approximately 10-fold speedup on 16 processors.

  16. Parallel and vector computation for stochastic optimal control applications

    NASA Technical Reports Server (NTRS)

    Hanson, F. B.

    1989-01-01

    A general method for parallel and vector numerical solutions of stochastic dynamic programming problems is described for optimal control of general nonlinear, continuous time, multibody dynamical systems, perturbed by Poisson as well as Gaussian random white noise. Possible applications include lumped flight dynamics models for uncertain environments, such as large scale and background random atmospheric fluctuations. The numerical formulation is highly suitable for a vector multiprocessor or vectorizing supercomputer, and results exhibit high processor efficiency and numerical stability. Advanced computing techniques, data structures, and hardware help alleviate Bellman's curse of dimensionality in dynamic programming computations.

  17. Local rollback for fault-tolerance in parallel computing systems

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Gara, Alan; Giampapa, Mark E.; Heidelberger, Philip; Ohmacht, Martin; Steinmacher-Burow, Burkhard; Sugavanam, Krishnan

    2012-01-24

    A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.

  18. Runtime optimization of an application executing on a parallel computer

    DOEpatents

    Faraj, Daniel A; Smith, Brian E

    2014-11-25

    Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.

  19. Runtime optimization of an application executing on a parallel computer

    DOEpatents

    Faraj, Daniel A.; Smith, Brian E.

    2013-01-29

    Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.

  20. Runtime optimization of an application executing on a parallel computer

    DOEpatents

    Faraj, Daniel A; Smith, Brian E

    2014-11-18

    Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.

  1. A language comparison for scientific computing on MIMD architectures

    NASA Technical Reports Server (NTRS)

    Jones, Mark T.; Patrick, Merrell L.; Voigt, Robert G.

    1989-01-01

    Choleski's method for solving banded symmetric, positive definite systems is implemented on a multiprocessor computer using three FORTRAN based parallel programming languages, the Force, PISCES and Concurrent FORTRAN. The capabilities of the language for expressing parallelism and their user friendliness are discussed, including readability of the code, debugging assistance offered, and expressiveness of the languages. The performance of the different implementations is compared. It is argued that PISCES, using the Force for medium-grained parallelism, is the appropriate choice for programming Choleski's method on the multiprocessor computer, Flex/32.

  2. Piecewise - Parabolic Methods for Parallel Computation with Applications to Unstable Fluid Flow in 2 and 3 Dimensions

    SciTech Connect

    Woodward, P. R.

    2003-03-26

    This report summarizes the results of the project entitled, ''Piecewise-Parabolic Methods for Parallel Computation with Applications to Unstable Fluid Flow in 2 and 3 Dimensions'' This project covers a span of many years, beginning in early 1987. It has provided over that considerable period the core funding to my research activities in scientific computation at the University of Minnesota. It has supported numerical algorithm development, application of those algorithms to fundamental fluid dynamics problems in order to demonstrate their effectiveness, and the development of scientific visualization software and systems to extract scientific understanding from those applications.

  3. Use Computer-Aided Tools to Parallelize Large CFD Applications

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Yan, J.

    2000-01-01

    Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progressing in hardware architectures and increasing complexity of real applications in recent years, the problem becomes even more sever. Today, scalability and high performance are mostly involving handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, show good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, it can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often failed on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome the deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO is aimed at taking advantage of detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of

  4. InSAR Scientific Computing Environment - The Home Stretch

    NASA Astrophysics Data System (ADS)

    Rosen, P. A.; Gurrola, E. M.; Sacco, G.; Zebker, H. A.

    2011-12-01

    The Interferometric Synthetic Aperture Radar (InSAR) Scientific Computing Environment (ISCE) is a software development effort in its third and final year within the NASA Advanced Information Systems and Technology program. The ISCE is a new computing environment for geodetic image processing for InSAR sensors enabling scientists to reduce measurements directly from radar satellites to new geophysical products with relative ease. The environment can serve as the core of a centralized processing center to bring Level-0 raw radar data up to Level-3 data products, but is adaptable to alternative processing approaches for science users interested in new and different ways to exploit mission data. Upcoming international SAR missions will deliver data of unprecedented quantity and quality, making possible global-scale studies in climate research, natural hazards, and Earth's ecosystem. The InSAR Scientific Computing Environment has the functionality to become a key element in processing data from NASA's proposed DESDynI mission into higher level data products, supporting a new class of analyses that take advantage of the long time and large spatial scales of these new data. At the core of ISCE is a new set of efficient and accurate InSAR algorithms. These algorithms are placed into an object-oriented, flexible, extensible software package that is informed by modern programming methods, including rigorous componentization of processing codes, abstraction and generalization of data models. The environment is designed to easily allow user contributions, enabling an open source community to extend the framework into the indefinite future. ISCE supports data from nearly all of the available satellite platforms, including ERS, EnviSAT, Radarsat-1, Radarsat-2, ALOS, TerraSAR-X, and Cosmo-SkyMed. The code applies a number of parallelization techniques and sensible approximations for speed. It is configured to work on modern linux-based computers with gcc compilers and python

  5. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    SciTech Connect

    Nash, T.

    1989-05-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs.

  6. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Davis, Kristan D; Faraj, Daniel A

    2013-07-09

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  7. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A; Mamidala, Amith R

    2014-02-11

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  8. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2013-09-03

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  9. Data communications in a parallel active messaging interface of a parallel computer

    SciTech Connect

    Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-09-02

    Eager send data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address.

  10. Data communications in a parallel active messaging interface of a parallel computer

    SciTech Connect

    Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-09-16

    Eager send data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address.

  11. Data communications for a collective operation in a parallel active messaging interface of a parallel computer

    SciTech Connect

    Faraj, Daniel A.

    2015-11-19

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  12. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel A.

    2014-07-22

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  13. Data communications for a collective operation in a parallel active messaging interface of a parallel computer

    DOEpatents

    Faraj, Daniel A

    2013-07-16

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  14. Semi-coarsening multigrid methods for parallel computing

    SciTech Connect

    Jones, J.E.

    1996-12-31

    Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.

  15. Computer Science Techniques Applied to Parallel Atomistic Simulation

    NASA Astrophysics Data System (ADS)

    Nakano, Aiichiro

    1998-03-01

    Recent developments in parallel processing technology and multiresolution numerical algorithms have established large-scale molecular dynamics (MD) simulations as a new research mode for studying materials phenomena such as fracture. However, this requires large system sizes and long simulated times. We have developed: i) Space-time multiresolution schemes; ii) fuzzy-clustering approach to hierarchical dynamics; iii) wavelet-based adaptive curvilinear-coordinate load balancing; iv) multilevel preconditioned conjugate gradient method; and v) spacefilling-curve-based data compression for parallel I/O. Using these techniques, million-atom parallel MD simulations are performed for the oxidation dynamics of nanocrystalline Al. The simulations take into account the effect of dynamic charge transfer between Al and O using the electronegativity equalization scheme. The resulting long-range Coulomb interaction is calculated efficiently with the fast multipole method. Results for temperature and charge distributions, residual stresses, bond lengths and bond angles, and diffusivities of Al and O will be presented. The oxidation of nanocrystalline Al is elucidated through immersive visualization in virtual environments. A unique dual-degree education program at Louisiana State University will also be discussed in which students can obtain a Ph.D. in Physics & Astronomy and a M.S. from the Department of Computer Science in five years. This program fosters interdisciplinary research activities for interfacing High Performance Computing and Communications with large-scale atomistic simulations of advanced materials. This work was supported by NSF (CAREER Program), ARO, PRF, and Louisiana LEQSF.

  16. Parallel matrix transpose algorithms on distributed memory concurrent computers

    SciTech Connect

    Choi, J.; Walker, D.W.; Dongarra, J.J. |

    1993-10-01

    This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A{center_dot}B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T}{center_dot}B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

  17. Simulation methods for advanced scientific computing

    SciTech Connect

    Booth, T.E.; Carlson, J.A.; Forster, R.A.

    1998-11-01

    This is the final report of a three-year, Laboratory Directed Research and Development (LDRD) project at the Los Alamos National Laboratory (LANL). The objective of the project was to create effective new algorithms for solving N-body problems by computer simulation. The authors concentrated on developing advanced classical and quantum Monte Carlo techniques. For simulations of phase transitions in classical systems, they produced a framework generalizing the famous Swendsen-Wang cluster algorithms for Ising and Potts models. For spin-glass-like problems, they demonstrated the effectiveness of an extension of the multicanonical method for the two-dimensional, random bond Ising model. For quantum mechanical systems, they generated a new method to compute the ground-state energy of systems of interacting electrons. They also improved methods to compute excited states when the diffusion quantum Monte Carlo method is used and to compute longer time dynamics when the stationary phase quantum Monte Carlo method is used.

  18. Determining collective barrier operation skew in a parallel computer

    SciTech Connect

    Faraj, Daniel A.

    2015-11-24

    Determining collective barrier operation skew in a parallel computer that includes a number of compute nodes organized into an operational group includes: for each of the nodes until each node has been selected as a delayed node: selecting one of the nodes as a delayed node; entering, by each node other than the delayed node, a collective barrier operation; entering, after a delay by the delayed node, the collective barrier operation; receiving an exit signal from a root of the collective barrier operation; and measuring, for the delayed node, a barrier completion time. The barrier operation skew is calculated by: identifying, from the compute nodes' barrier completion times, a maximum barrier completion time and a minimum barrier completion time and calculating the barrier operation skew as the difference of the maximum and the minimum barrier completion time.

  19. DMA shared byte counters in a parallel computer

    SciTech Connect

    Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

    2010-04-06

    A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.

  20. Massively parallel computing on an organic molecular layer

    NASA Astrophysics Data System (ADS)

    Bandyopadhyay, Anirban; Pati, Ranjit; Sahu, Satyajit; Peper, Ferdinand; Fujita, Daisuke

    2010-05-01

    Modern computers operate at enormous speeds-capable of executing in excess of 1013 instructions per second-but their sequential approach to processing, by which logical operations are performed one after another, has remained unchanged since the 1950s. In contrast, although individual neurons of the human brain fire at around just 103 times per second, the simultaneous collective action of millions of neurons enables them to complete certain tasks more efficiently than even the fastest supercomputer. Here we demonstrate an assembly of molecular switches that simultaneously interact to perform a variety of computational tasks including conventional digital logic, calculating Voronoi diagrams, and simulating natural phenomena such as heat diffusion and cancer growth. As well as representing a conceptual shift from serial-processing with static architectures, our parallel, dynamically reconfigurable approach could provide a means to solve otherwise intractable computational problems.

  1. Solving sparse triangular linear systems on parallel computers

    NASA Technical Reports Server (NTRS)

    Anderson, Edward; Saad, Youcef

    1989-01-01

    This paper describes and compares three parallel algorithms for solving sparse triangular systems of equations. These methods involve some preprocessing overhead and are primarily of interest in solving many systems with the same coefficient matrix. The first approach is to use a fixed blocksize and form the inverse of the diagonal blocks. The second approach is to use a variable blocksize and reorder the unknowns so that the diagonal blocks are diagonal matrices. The latter technique is called level scheduling because of how it is represented in the adjacency graph, and both row-wise and jagged diagonal storage for the off-diagonal blocks are considered. These techniques are analyzed for general parallel computers and experiments are presented for the eight-processor Alliant FX/8.

  2. Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    John Mellor-Crummey

    2008-02-29

    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implemention of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.

  3. Multicore Challenges and Benefits for High Performance Scientific Computing

    DOE PAGESBeta

    Nielsen, Ida M. B.; Janssen, Curtis L.

    2008-01-01

    Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexitymore » of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.« less

  4. Parallel computation of multigroup reactivity coefficient using iterative method

    NASA Astrophysics Data System (ADS)

    Susmikanti, Mike; Dewayatna, Winter

    2013-09-01

    One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.

  5. National Energy Research Scientific Computing Center 2007 Annual Report

    SciTech Connect

    Hules, John A.; Bashor, Jon; Wang, Ucilia; Yarris, Lynn; Preuss, Paul

    2008-10-23

    This report presents highlights of the research conducted on NERSC computers in a variety of scientific disciplines during the year 2007. It also reports on changes and upgrades to NERSC's systems and services aswell as activities of NERSC staff.

  6. Remarks on Some Scientific Applications of Computers for Development.

    ERIC Educational Resources Information Center

    Lions, J. L.

    Some suggestions are made on how to solve the main difficulties in problems of education and related questions of computer applications. The general outline is as follows: scientific applications for development, scientific background needed, the main difficulties, the present situation, how to start, how to keep highly trained workers in their…

  7. Basic mathematical function libraries for scientific computation

    NASA Technical Reports Server (NTRS)

    Galant, David C.

    1989-01-01

    Ada packages implementing selected mathematical functions for the support of scientific and engineering applications were written. The packages provide the Ada programmer with the mathematical function support found in the languages Pascal and FORTRAN as well as an extended precision arithmetic and a complete complex arithmetic. The algorithms used are fully described and analyzed. Implementation assumes that the Ada type FLOAT objects fully conform to the IEEE 754-1985 standard for single binary floating-point arithmetic, and that INTEGER objects are 32-bit entities. Codes for the Ada packages are included as appendixes.

  8. A hybrid method for the parallel computation of Green's functions

    SciTech Connect

    Petersen, Dan Erik; Li Song; Stokbro, Kurt; Sorensen, Hans Henrik B.; Hansen, Per Christian; Skelboe, Stig; Darve, Eric

    2009-08-01

    Quantum transport models for nanodevices using the non-equilibrium Green's function method require the repeated calculation of the block tridiagonal part of the Green's and lesser Green's function matrices. This problem is related to the calculation of the inverse of a sparse matrix. Because of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green's and lesser Green's function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only require computing a small number of entries of the inverse matrix. Then, we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size.

  9. An experiment in hurricane track prediction using parallel computing methods

    NASA Technical Reports Server (NTRS)

    Song, Chang G.; Jwo, Jung-Sing; Lakshmivarahan, S.; Dhall, S. K.; Lewis, John M.; Velden, Christopher S.

    1994-01-01

    The barotropic model is used to explore the advantages of parallel processing in deterministic forecasting. We apply this model to the track forecasting of hurricane Elena (1985). In this particular application, solutions to systems of elliptic equations are the essence of the computational mechanics. One set of equations is associated with the decomposition of the wind into irrotational and nondivergent components - this determines the initial nondivergent state. Another set is associated with recovery of the streamfunction from the forecasted vorticity. We demonstrate that direct parallel methods based on accelerated block cyclic reduction (BCR) significantly reduce the computational time required to solve the elliptic equations germane to this decomposition and forecast problem. A 72-h track prediction was made using incremental time steps of 16 min on a network of 3000 grid points nominally separated by 100 km. The prediction took 30 sec on the 8-processor Alliant FX/8 computer. This was a speed-up of 3.7 when compared to the one-processor version. The 72-h prediction of Elena's track was made as the storm moved toward Florida's west coast. Approximately 200 km west of Tampa Bay, Elena executed a dramatic recurvature that ultimately changed its course toward the northwest. Although the barotropic track forecast was unable to capture the hurricane's tight cycloidal looping maneuver, the subsequent northwesterly movement was accurately forecasted as was the location and timing of landfall near Mobile Bay.

  10. Hyper-parallel photonic quantum computation with coupled quantum dots.

    PubMed

    Ren, Bao-Cang; Deng, Fu-Guo

    2014-04-11

    It is well known that a parallel quantum computer is more powerful than a classical one. So far, there are some important works about the construction of universal quantum logic gates, the key elements in quantum computation. However, they are focused on operating on one degree of freedom (DOF) of quantum systems. Here, we investigate the possibility of achieving scalable hyper-parallel quantum computation based on two DOFs of photon systems. We construct a deterministic hyper-controlled-not (hyper-CNOT) gate operating on both the spatial-mode and the polarization DOFs of a two-photon system simultaneously, by exploiting the giant optical circular birefringence induced by quantum-dot spins in double-sided optical microcavities as a result of cavity quantum electrodynamics (QED). This hyper-CNOT gate is implemented by manipulating the four qubits in the two DOFs of a two-photon system without auxiliary spatial modes or polarization modes. It reduces the operation time and the resources consumed in quantum information processing, and it is more robust against the photonic dissipation noise, compared with the integration of several cascaded CNOT gates in one DOF.

  11. Scientific development of a massively parallel ocean climate model. Final report

    SciTech Connect

    Semtner, A.J.; Chervin, R.M.

    1996-09-01

    Over the last three years, very significant advances have been made in refining the grid resolution of ocean models and in improving the physical and numerical treatments of ocean hydrodynamics. Some of these advances have occurred as a result of the successful transition of ocean models onto massively parallel computers, which has been led by Los Alamos investigators. Major progress has been made in simulating global ocean circulation and in understanding various ocean climatic aspects such as the effect of wind driving on heat and freshwater transports. These steps have demonstrated the capability to conduct realistic decadal to century ocean integrations at high resolution on massively parallel computers.

  12. An efficient parallel algorithm for accelerating computational protein design

    PubMed Central

    Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang

    2014-01-01

    Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991

  13. PS3 CELL Development for Scientific Computation and Research

    NASA Astrophysics Data System (ADS)

    Christiansen, M.; Sevre, E.; Wang, S. M.; Yuen, D. A.; Liu, S.; Lyness, M. D.; Broten, M.

    2007-12-01

    The Cell processor is one of the most powerful processors on the market, and researchers in the earth sciences may find its parallel architecture to be very useful. A cell processor, with 7 cores, can easily be obtained for experimentation by purchasing a PlayStation 3 (PS3) and installing linux and the IBM SDK. Each core of the PS3 is capable of 25 GFLOPS giving a potential limit of 150 GFLOPS when using all 6 SPUs (synergistic processing units) by using vectorized algorithms. We have used the Cell's computational power to create a program which takes simulated tsunami datasets, parses them, and returns a colorized height field image using ray casting techniques. As expected, the time required to create an image is inversely proportional to the number of SPUs used. We believe that this trend will continue when multiple PS3s are chained using OpenMP functionality and are in the process of researching this. By using the Cell to visualize tsunami data, we have found that its greatest feature is its power. This fact entwines well with the needs of the scientific community where the limiting factor is time. Any algorithm, such as the heat equation, that can be subdivided into multiple parts can take advantage of the PS3 Cell's ability to split the computations across the 6 SPUs reducing required run time by one sixth. Further vectorization of the code can allow for 4 simultanious floating point operations by using the SIMD (single instruction multiple data) capabilities of the SPU increasing efficiency 24 times.

  14. Scientific computations section monthly report, November 1993

    SciTech Connect

    Buckner, M.R.

    1993-12-30

    This progress report from the Savannah River Technology Center contains abstracts from papers from the computational modeling, applied statistics, applied physics, experimental thermal hydraulics, and packaging and transportation groups. Specific topics covered include: engineering modeling and process simulation, criticality methods and analysis, plutonium disposition.

  15. Parallel Computation of Persistent Homology using the Blowup Complex

    SciTech Connect

    Lewis, Ryan; Morozov, Dmitriy

    2015-04-27

    We describe a parallel algorithm that computes persistent homology, an algebraic descriptor of a filtered topological space. Our algorithm is distinguished by operating on a spatial decomposition of the domain, as opposed to a decomposition with respect to the filtration. We rely on a classical construction, called the Mayer--Vietoris blowup complex, to glue global topological information about a space from its disjoint subsets. We introduce an efficient algorithm to perform this gluing operation, which may be of independent interest, and describe how to process the domain hierarchically. We report on a set of experiments that help assess the strengths and identify the limitations of our method.

  16. Semi-automatic process partitioning for parallel computation

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

    1988-01-01

    On current multiprocessor architectures one must carefully distribute data in memory in order to achieve high performance. Process partitioning is the operation of rewriting an algorithm as a collection of tasks, each operating primarily on its own portion of the data, to carry out the computation in parallel. A semi-automatic approach to process partitioning is considered in which the compiler, guided by advice from the user, automatically transforms programs into such an interacting task system. This approach is illustrated with a picture processing example written in BLAZE, which is transformed into a task system maximizing locality of memory reference.

  17. Routing performance analysis and optimization within a massively parallel computer

    DOEpatents

    Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen

    2013-04-16

    An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.

  18. Charon Toolkit for Parallel, Implicit Structured-Grid Computations: Functional Design

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Kutler, Paul (Technical Monitor)

    1997-01-01

    In a previous report the design concepts of Charon were presented. Charon is a toolkit that aids engineers in developing scientific programs for structured-grid applications to be run on MIMD parallel computers. It constitutes an augmentation of the general-purpose MPI-based message-passing layer, and provides the user with a hierarchy of tools for rapid prototyping and validation of parallel programs, and subsequent piecemeal performance tuning. Here we describe the implementation of the domain decomposition tools used for creating data distributions across sets of processors. We also present the hierarchy of parallelization tools that allows smooth translation of legacy code (or a serial design) into a parallel program. Along with the actual tool descriptions, we will present the considerations that led to the particular design choices. Many of these are motivated by the requirement that Charon must be useful within the traditional computational environments of Fortran 77 and C. Only the Fortran 77 syntax will be presented in this report.

  19. The FORCE: A portable parallel programming language supporting computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna

    1989-01-01

    This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is a part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be imnplemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.

  20. Performance modeling of parallel computations in resource-constrained systems

    SciTech Connect

    Ghodsi, M.

    1989-01-01

    In this research, the author studies several problems related to the modeling and performance analysis of parallel computations. First, the author investigates the properties of a new task graph model, called the generalized task graphs, which can represent the nondeterminism involved in the parallel search algorithms. In parallel search, several possible solutions to a problem are carried out concurrently, with the intention of having only one successful result. The general task graphs are prone to problems such as deadlock, unboundedness, and unsafeness. The author defines the notion of well-formedness of task graphs by viewing them as high level Petri nets and provide necessary and sufficient conditions under which a generalized task graph is well formed. Next, the author proposes a new approximate iterative algorithm to predict the performance of a resource constrained queueing network, running a number of statistically identical jobs with internal concurrency. The jobs are assumed to be instances of an arbitrary task graph. The queueing network includes a limited number of identical passive resources. A task must acquire one unit of the passive resources before receiving service. Detailed experimental results are presented which show that the algorithm converges quite fast and is reasonably accurate.

  1. Factorization of large integers on a massively parallel computer

    SciTech Connect

    Davis, J.A.; Holdridge, D.B.

    1988-01-01

    Our interest in integer factorization at Sandia National Laboratories is motivated by cryptographic applications and in particular the security of the RSA encryption-decryption algorithm. We have implemented our version of the quadratic sieve procedure on the NCUBE computer with 1024 processors (nodes). The new code is significantly different in all important aspects from the program used to factor number of order 10/sup 70/ on a single processor CRAY computer. Capabilities of parallel processing and limitation of small local memory necessitated this entirely new implementation. This effort involved several restarts as realizations of program structures that seemed appealing bogged down due to inter-processor communications. We are presently working with integers of magnitude about 10/sup 70/ in tuning this code to the novel hardware. 6 refs., 3 figs.

  2. Parallel computing-based sclera recognition for human identification

    NASA Astrophysics Data System (ADS)

    Lin, Yong; Du, Eliza Y.; Zhou, Zhi

    2012-06-01

    Compared to iris recognition, sclera recognition which uses line descriptor can achieve comparable recognition accuracy in visible wavelengths. However, this method is too time-consuming to be implemented in a real-time system. In this paper, we propose a GPU-based parallel computing approach to reduce the sclera recognition time. We define a new descriptor in which the information of KD tree structure and sclera edge are added. Registration and matching task is divided into subtasks in various sizes according to their computation complexities. Every affine transform parameters are generated by searching on KD tree. Texture memory, constant memory, and shared memory are used to store templates and transform matrixes. The experiment results show that the proposed method executed on GPU can dramatically improve the sclera matching speed in hundreds of times without accuracy decreasing.

  3. Signal processing applications of massively parallel charge domain computing devices

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Barhen, Jacob (Inventor); Toomarian, Nikzad (Inventor)

    1999-01-01

    The present invention is embodied in a charge coupled device (CCD)/charge injection device (CID) architecture capable of performing a Fourier transform by simultaneous matrix vector multiplication (MVM) operations in respective plural CCD/CID arrays in parallel in O(1) steps. For example, in one embodiment, a first CCD/CID array stores charge packets representing a first matrix operator based upon permutations of a Hartley transform and computes the Fourier transform of an incoming vector. A second CCD/CID array stores charge packets representing a second matrix operator based upon different permutations of a Hartley transform and computes the Fourier transform of an incoming vector. The incoming vector is applied to the inputs of the two CCD/CID arrays simultaneously, and the real and imaginary parts of the Fourier transform are produced simultaneously in the time required to perform a single MVM operation in a CCD/CID array.

  4. Progress in Parallel Schur Complement Preconditioning for Computational Fluid Dynamics

    NASA Technical Reports Server (NTRS)

    Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    We consider preconditioning methods for nonself-adjoint advective-diffusive systems based on a non-overlapping Schur complement procedure for arbitrary triangulated domains. The ultimate goal of this research is to develop scalable preconditioning algorithms for fluid flow discretizations on parallel computing architectures. In our implementation of the Schur complement preconditioning technique, the triangulation is first partitioned into a number of subdomains using the METIS multi-level k-way partitioning code. This partitioning induces a natural 2X2 partitioning of the p.d.e. discretization matrix. By considering various inverse approximations of the 2X2 system, we have developed a family of robust preconditioning techniques. A computer code based on these ideas has been developed and tested on the IBM SP2 and the SGI Power Challenge array using MPI message passing protocol. A number of example CFD calculations will be presented to illustrate and assess various Schur complement approximations.

  5. Solving the Cauchy-Riemann equations on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations; a set of coupled first order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, and SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.

  6. Analytic reconstruction approach for parallel translational computed tomography.

    PubMed

    Kong, Huihua; Yu, Hengyong

    2015-01-01

    To develop low-cost and low-dose computed tomography (CT) scanners for developing countries, recently a parallel translational computed tomography (PTCT) is proposed, and the source and detector are translated oppositely with respect to the imaging object without a slip-ring. In this paper, we develop an analytic filtered-backprojection (FBP)-type reconstruction algorithm for two dimensional (2D) fan-beam PTCT and extend it to three dimensional (3D) cone-beam geometry in a Feldkamp-type framework. Particularly, a weighting function is constructed to deal with data redundancy for multiple translations PTCT to eliminate image artifacts. Extensive numerical simulations are performed to validate and evaluate the proposed analytic reconstruction algorithms, and the results confirm their correctness and merits. PMID:25882732

  7. On the relationship between parallel computation and graph embedding

    SciTech Connect

    Gupta, A.K.

    1989-01-01

    The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sized binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.

  8. Energy Proportionality and Performance in Data Parallel Computing Clusters

    SciTech Connect

    Kim, Jinoh; Chou, Jerry; Rotem, Doron

    2011-02-14

    Energy consumption in datacenters has recently become a major concern due to the rising operational costs andscalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumedby the server nodes must be proportional to the amount of work performed. For data parallelism and fault tolerancepurposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A coveringset is a group of nodes that together contain at least one replica of the data blocks needed for performing computing tasks. In thiswork, we develop and analyze algorithms to maintain energy proportionality by discovering a covering set that minimizesenergy consumption while placing the remaining nodes in lowpower standby mode. Our algorithms can also discover coveringsets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that itcan discover k-covering sets, i.e., a set of nodes that contain at least k replicas of the data blocks. Our experimental results showthat we can achieve substantial energy saving without significant performance loss in diverse cluster configurations and workingenvironments.

  9. Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations

    SciTech Connect

    Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; Van der Wijngaart, Rob

    2003-05-01

    The growing gap between sustained and peak performance for scientific applications is a well-known problem in high end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly out performs the cache-based architectures. However, certain applications are not easily amenable to vectorization and would re quire extensive algorithm and implementation reengineering to utilize the SX-6 effectively.

  10. Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob

    2003-01-01

    The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the perfor- mance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.

  11. Building Cognition: The Construction of Computational Representations for Scientific Discovery

    ERIC Educational Resources Information Center

    Chandrasekharan, Sanjay; Nersessian, Nancy J.

    2015-01-01

    Novel computational representations, such as simulation models of complex systems and video games for scientific discovery (Foldit, EteRNA etc.), are dramatically changing the way discoveries emerge in science and engineering. The cognitive roles played by such computational representations in discovery are not well understood. We present a…

  12. Parallel Molecular Dynamics Stencil : a new parallel computing environment for a large-scale molecular dynamics simulation of solids

    NASA Astrophysics Data System (ADS)

    Shimizu, Futoshi; Kimizuka, Hajime; Kaburaki, Hideo

    2002-08-01

    A new parallel computing environment, called as ``Parallel Molecular Dynamics Stencil'', has been developed to carry out a large-scale short-range molecular dynamics simulation of solids. The stencil is written in C language using MPI for parallelization and designed successfully to separate and conceal parts of the programs describing cutoff schemes and parallel algorithms for data communication. This has been made possible by introducing the concept of image atoms. Therefore, only a sequential programming of the force calculation routine is required for executing the stencil in parallel environment. Typical molecular dynamics routines, such as various ensembles, time integration methods, and empirical potentials, have been implemented in the stencil. In the presentation, the performance of the stencil on parallel computers of Hitachi, IBM, SGI, and PC-cluster using the models of Lennard-Jones and the EAM type potentials for fracture problem will be reported.

  13. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  14. Introduction to the LaRC central scientific computing complex

    NASA Technical Reports Server (NTRS)

    Shoosmith, John N.

    1993-01-01

    The computers and associated equipment that make up the Central Scientific Computing Complex of the Langley Research Center are briefly described. The electronic networks that provide access to the various components of the complex and a number of areas that can be used by Langley and contractors staff for special applications (scientific visualization, image processing, software engineering, and grid generation) are also described. Flight simulation facilities that use the central computers are described. Management of the complex, procedures for its use, and available services and resources are discussed. This document is intended for new users of the complex, for current users who wish to keep appraised of changes, and for visitors who need to understand the role of central scientific computers at Langley.

  15. Parallel Computation of Unsteady Flows on a Network of Workstations

    NASA Technical Reports Server (NTRS)

    1997-01-01

    Parallel computation of unsteady flows requires significant computational resources. The utilization of a network of workstations seems an efficient solution to the problem where large problems can be treated at a reasonable cost. This approach requires the solution of several problems: 1) the partitioning and distribution of the problem over a network of workstation, 2) efficient communication tools, 3) managing the system efficiently for a given problem. Of course, there is the question of the efficiency of any given numerical algorithm to such a computing system. NPARC code was chosen as a sample for the application. For the explicit version of the NPARC code both two- and three-dimensional problems were studied. Again both steady and unsteady problems were investigated. The issues studied as a part of the research program were: 1) how to distribute the data between the workstations, 2) how to compute and how to communicate at each node efficiently, 3) how to balance the load distribution. In the following, a summary of these activities is presented. Details of the work have been presented and published as referenced.

  16. ASCR Cybersecurity for Scientific Computing Integrity

    SciTech Connect

    Piesert, Sean

    2015-02-27

    The Department of Energy (DOE) has the responsibility to address the energy, environmental, and nuclear security challenges that face our nation. Much of DOE’s enterprise involves distributed, collaborative teams; a signi¬cant fraction involves “open science,” which depends on multi-institutional, often international collaborations that must access or share signi¬cant amounts of information between institutions and over networks around the world. The mission of the Office of Science is the delivery of scienti¬c discoveries and major scienti¬c tools to transform our understanding of nature and to advance the energy, economic, and national security of the United States. The ability of DOE to execute its responsibilities depends critically on its ability to assure the integrity and availability of scienti¬c facilities and computer systems, and of the scienti¬c, engineering, and operational software and data that support its mission.

  17. Low latency, high bandwidth data communications between compute nodes in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2010-11-02

    Methods, parallel computers, and computer program products are disclosed for low latency, high bandwidth data communications between compute nodes in a parallel computer. Embodiments include receiving, by an origin direct memory access (`DMA`) engine of an origin compute node, data for transfer to a target compute node; sending, by the origin DMA engine of the origin compute node to a target DMA engine on the target compute node, a request to send (`RTS`) message; transferring, by the origin DMA engine, a predetermined portion of the data to the target compute node using memory FIFO operation; determining, by the origin DMA engine whether an acknowledgement of the RTS message has been received from the target DMA engine; if the an acknowledgement of the RTS message has not been received, transferring, by the origin DMA engine, another predetermined portion of the data to the target compute node using a memory FIFO operation; and if the acknowledgement of the RTS message has been received by the origin DMA engine, transferring, by the origin DMA engine, any remaining portion of the data to the target compute node using a direct put operation.

  18. Parallel processing using an optical delay-based reservoir computer

    NASA Astrophysics Data System (ADS)

    Van der Sande, Guy; Nguimdo, Romain Modeste; Verschaffelt, Guy

    2016-04-01

    Delay systems subject to delayed optical feedback have recently shown great potential in solving computationally hard tasks. By implementing a neuro-inspired computational scheme relying on the transient response to optical data injection, high processing speeds have been demonstrated. However, reservoir computing systems based on delay dynamics discussed in the literature are designed by coupling many different stand-alone components which lead to bulky, lack of long-term stability, non-monolithic systems. Here we numerically investigate the possibility of implementing reservoir computing schemes based on semiconductor ring lasers. Semiconductor ring lasers are semiconductor lasers where the laser cavity consists of a ring-shaped waveguide. SRLs are highly integrable and scalable, making them ideal candidates for key components in photonic integrated circuits. SRLs can generate light in two counterpropagating directions between which bistability has been demonstrated. We demonstrate that two independent machine learning tasks , even with different nature of inputs with different input data signals can be simultaneously computed using a single photonic nonlinear node relying on the parallelism offered by photonics. We illustrate the performance on simultaneous chaotic time series prediction and a classification of the Nonlinear Channel Equalization. We take advantage of different directional modes to process individual tasks. Each directional mode processes one individual task to mitigate possible crosstalk between the tasks. Our results indicate that prediction/classification with errors comparable to the state-of-the-art performance can be obtained even with noise despite the two tasks being computed simultaneously. We also find that a good performance is obtained for both tasks for a broad range of the parameters. The results are discussed in detail in [Nguimdo et al., IEEE Trans. Neural Netw. Learn. Syst. 26, pp. 3301-3307, 2015

  19. Development of magnetron sputtering simulator with GPU parallel computing

    NASA Astrophysics Data System (ADS)

    Sohn, Ilyoup; Kim, Jihun; Bae, Junkyeong; Lee, Jinpil

    2014-12-01

    Sputtering devices are widely used in the semiconductor and display panel manufacturing process. Currently, a number of surface treatment applications using magnetron sputtering techniques are being used to improve the efficiency of the sputtering process, through the installation of magnets outside the vacuum chamber. Within the internal space of the low pressure chamber, plasma generated from the combination of a rarefied gas and an electric field is influenced interactively. Since the quality of the sputtering and deposition rate on the substrate is strongly dependent on the multi-physical phenomena of the plasma regime, numerical simulations using PIC-MCC (Particle In Cell, Monte Carlo Collision) should be employed to develop an efficient sputtering device. In this paper, the development of a magnetron sputtering simulator based on the PIC-MCC method and the associated numerical techniques are discussed. To solve the electric field equations in the 2-D Cartesian domain, a Poisson equation solver based on the FDM (Finite Differencing Method) is developed and coupled with the Monte Carlo Collision method to simulate the motion of gas particles influenced by an electric field. The magnetic field created from the permanent magnet installed outside the vacuum chamber is also numerically calculated using Biot-Savart's Law. All numerical methods employed in the present PIC code are validated by comparison with analytical and well-known commercial engineering software results, with all of the results showing good agreement. Finally, the developed PIC-MCC code is parallelized to be suitable for general purpose computing on graphics processing unit (GPGPU) acceleration, so as to reduce the large computation time which is generally required for particle simulations. The efficiency and accuracy of the GPGPU parallelized magnetron sputtering simulator are examined by comparison with the calculated results and computation times from the original serial code. It is found that

  20. Seismic imaging using finite-differences and parallel computers

    SciTech Connect

    Ober, C.C.

    1997-12-31

    A key to reducing the risks and costs of associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Prestack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar wave equation using finite differences. As part of an ongoing ACTI project funded by the US Department of Energy, a finite difference, 3-D prestack, depth migration code has been developed. The goal of this work is to demonstrate that massively parallel computers can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite difference, prestack, depth migration practical for oil and gas exploration. Several problems had to be addressed to get an efficient code for the Intel Paragon. These include efficient I/O, efficient parallel tridiagonal solves, and high single-node performance. Furthermore, to provide portable code the author has been restricted to the use of high-level programming languages (C and Fortran) and interprocessor communications using MPI. He has been using the SUNMOS operating system, which has affected many of his programming decisions. He will present images created from two verification datasets (the Marmousi Model and the SEG/EAEG 3D Salt Model). Also, he will show recent images from real datasets, and point out locations of improved imaging. Finally, he will discuss areas of current research which will hopefully improve the image quality and reduce computational costs.

  1. Cloud identification using genetic algorithms and massively parallel computation

    NASA Technical Reports Server (NTRS)

    Buckles, Bill P.; Petry, Frederick E.

    1996-01-01

    As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user

  2. Pacing a data transfer operation between compute nodes on a parallel computer

    DOEpatents

    Blocksome, Michael A.

    2011-09-13

    Methods, systems, and products are disclosed for pacing a data transfer between compute nodes on a parallel computer that include: transferring, by an origin compute node, a chunk of an application message to a target compute node; sending, by the origin compute node, a pacing request to a target direct memory access (`DMA`) engine on the target compute node using a remote get DMA operation; determining, by the origin compute node, whether a pacing response to the pacing request has been received from the target DMA engine; and transferring, by the origin compute node, a next chunk of the application message if the pacing response to the pacing request has been received from the target DMA engine.

  3. Ferrofluids: Modeling, numerical analysis, and scientific computation

    NASA Astrophysics Data System (ADS)

    Tomas, Ignacio

    This dissertation presents some developments in the Numerical Analysis of Partial Differential Equations (PDEs) describing the behavior of ferrofluids. The most widely accepted PDE model for ferrofluids is the Micropolar model proposed by R.E. Rosensweig. The Micropolar Navier-Stokes Equations (MNSE) is a subsystem of PDEs within the Rosensweig model. Being a simplified version of the much bigger system of PDEs proposed by Rosensweig, the MNSE are a natural starting point of this thesis. The MNSE couple linear velocity u, angular velocity w, and pressure p. We propose and analyze a first-order semi-implicit fully-discrete scheme for the MNSE, which decouples the computation of the linear and angular velocities, is unconditionally stable and delivers optimal convergence rates under assumptions analogous to those used for the Navier-Stokes equations. Moving onto the much more complex Rosensweig's model, we provide a definition (approximation) for the effective magnetizing field h, and explain the assumptions behind this definition. Unlike previous definitions available in the literature, this new definition is able to accommodate the effect of external magnetic fields. Using this definition we setup the system of PDEs coupling linear velocity u, pressure p, angular velocity w, magnetization m, and magnetic potential ϕ We show that this system is energy-stable and devise a numerical scheme that mimics the same stability property. We prove that solutions of the numerical scheme always exist and, under certain simplifying assumptions, that the discrete solutions converge. A notable outcome of the analysis of the numerical scheme for the Rosensweig's model is the choice of finite element spaces that allow the construction of an energy-stable scheme. Finally, with the lessons learned from Rosensweig's model, we develop a diffuse-interface model describing the behavior of two-phase ferrofluid flows and present an energy-stable numerical scheme for this model. For a

  4. Building a High Performance Computing Infrastructure for Novosibirsk Scientific Center

    NASA Astrophysics Data System (ADS)

    Adakin, A.; Belov, S.; Chubarov, D.; Kalyuzhny, V.; Kaplin, V.; Kuchin, N.; Lomakin, S.; Nikultsev, V.; Sukharev, A.; Zaytsev, A.

    2011-12-01

    Novosibirsk Scientific Center (NSC), also known worldwide as Akademgorodok, is one of the largest Russian scientific centers hosting Novosibirsk State University (NSU) and more than 35 research organizations of the Siberian Branch of Russian Academy of Sciences including Budker Institute of Nuclear Physics (BINP), Institute of Computational Technologies (ICT), and Institute of Computational Mathematics and Mathematical Geophysics (ICM&MG). Since each institute has specific requirements on the architecture of the computing farms involved in its research field, currentiy we've got several computing facilities hosted by NSC institutes, each optimized for the particular set of tasks, of which the largest are the NSU Supercomputer Center, Siberian Supercomputer Center (ICM&MG), and a Grid Computing Facility of BINP. Recendy a dedicated optical network with the initial bandwidth of 10 Gbps connecting these three facilities was built in order to make it possible to share the computing resources among the research communities of participating institutes, thus providing a common platform for building the computing infrastructure for various scientific projects. Unification of the computing infrastructure is achieved by extensive use of virtualization technologies based on XEN and KVM platforms. The solution implemented was tested thoroughly within the computing environment of KEDR detector experiment which is being carried out at BINP, and foreseen to be applied to the use cases of other HEP experiments in the upcoming future.

  5. InSAR Scientific Computing Environment

    NASA Technical Reports Server (NTRS)

    Rosen, Paul A.; Sacco, Gian Franco; Gurrola, Eric M.; Zabker, Howard A.

    2011-01-01

    This computing environment is the next generation of geodetic image processing technology for repeat-pass Interferometric Synthetic Aperture (InSAR) sensors, identified by the community as a needed capability to provide flexibility and extensibility in reducing measurements from radar satellites and aircraft to new geophysical products. This software allows users of interferometric radar data the flexibility to process from Level 0 to Level 4 products using a variety of algorithms and for a range of available sensors. There are many radar satellites in orbit today delivering to the science community data of unprecedented quantity and quality, making possible large-scale studies in climate research, natural hazards, and the Earth's ecosystem. The proposed DESDynI mission, now under consideration by NASA for launch later in this decade, would provide time series and multiimage measurements that permit 4D models of Earth surface processes so that, for example, climate-induced changes over time would become apparent and quantifiable. This advanced data processing technology, applied to a global data set such as from the proposed DESDynI mission, enables a new class of analyses at time and spatial scales unavailable using current approaches. This software implements an accurate, extensible, and modular processing system designed to realize the full potential of InSAR data from future missions such as the proposed DESDynI, existing radar satellite data, as well as data from the NASA UAVSAR (Uninhabited Aerial Vehicle Synthetic Aperture Radar), and other airborne platforms. The processing approach has been re-thought in order to enable multi-scene analysis by adding new algorithms and data interfaces, to permit user-reconfigurable operation and extensibility, and to capitalize on codes already developed by NASA and the science community. The framework incorporates modern programming methods based on recent research, including object-oriented scripts controlling legacy and

  6. Implementation of Parallel Computing Technology to Vortex Flow

    NASA Technical Reports Server (NTRS)

    Dacles-Mariani, Jennifer

    1999-01-01

    Mainframe supercomputers such as the Cray C90 was invaluable in obtaining large scale computations using several millions of grid points to resolve salient features of a tip vortex flow over a lifting wing. However, real flight configurations require tracking not only of the flow over several lifting wings but its growth and decay in the near- and intermediate- wake regions, not to mention the interaction of these vortices with each other. Resolving and tracking the evolution and interaction of these vortices shed from complex bodies is computationally intensive. Parallel computing technology is an attractive option in solving these flows. In planetary science vortical flows are also important in studying how planets and protoplanets form when cosmic dust and gases become gravitationally unstable and eventually form planets or protoplanets. The current paradigm for the formation of planetary systems maintains that the planets accreted from the nebula of gas and dust left over from the formation of the Sun. Traditional theory also indicate that such a preplanetary nebula took the form of flattened disk. The coagulation of dust led to the settling of aggregates toward the midplane of the disk, where they grew further into asteroid-like planetesimals. Some of the issues still remaining in this process are the onset of gravitational instability, the role of turbulence in the damping of particles and radial effects. In this study the focus will be with the role of turbulence and the radial effects.

  7. Parallel computer graphics algorithms for the Connection Machine

    SciTech Connect

    Richardson, J.F.

    1990-01-01

    Many of the classes of computer graphics algorithms and polygon storage schemes can be adapted for parallel execution on various parallel architectures. The connection machine is one such architecture that should be thought of as a multiprocessor grid that can be reconfigured into standard 2-dimensional mesh and n-dimensional hypercube architectures. The classes of algorithms considered in this paper are SPLINES; POLYGON STORAGE; TRIANGULARIZATION; and SYMBOLIC INPUT. The target Connection Machine (hearafter designated as CM) for the algorithms of this paper has 8192 physical processors. Each physical processor has 8 kilobytes of local memory plus an arithmetic-logic unit. All processors can communicate with any other processor through a router. Thus this CM has a shared memory of 64 megabytes when used as a standard multiprocessor (MIMD) architecture. In addition, the CM interconnection structure can simulate a 2-dimensional mesh and n-dimensional hypercube (SIMD) architecture with the mesh being the default architecture. The front end for the CM is a Symbolics and the high level language is LISP or FORTRAN.

  8. A high performance parallel computing architecture for robust image features

    NASA Astrophysics Data System (ADS)

    Zhou, Renyan; Liu, Leibo; Wei, Shaojun

    2014-03-01

    A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.

  9. Optimized collectives using a DMA on a parallel computer

    DOEpatents

    Chen, Dong; Gabor, Dozsa; Giampapa, Mark E.; Heidelberger; Phillip

    2011-02-08

    Optimizing collective operations using direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.

  10. On implicit Runge-Kutta methods for parallel computations

    NASA Technical Reports Server (NTRS)

    Keeling, Stephen L.

    1987-01-01

    Implicit Runge-Kutta methods which are well-suited for parallel computations are characterized. It is claimed that such methods are first of all, those for which the associated rational approximation to the exponential has distinct poles, and these are called multiply explicit (MIRK) methods. Also, because of the so-called order reduction phenomenon, there is reason to require that these poles be real. Then, it is proved that a necessary condition for a q-stage, real MIRK to be A sub 0-stable with maximal order q + 1 is that q = 1, 2, 3, or 5. Nevertheless, it is shown that for every positive integer q, there exists a q-stage, real MIRK which is I-stable with order q. Finally, some useful examples of algebraically stable MIRKs are given.

  11. A Generic Scheduling Simulator for High Performance Parallel Computers

    SciTech Connect

    Yoo, B S; Choi, G S; Jette, M A

    2001-08-01

    It is well known that efficient job scheduling plays a crucial role in achieving high system utilization in large-scale high performance computing environments. A good scheduling algorithm should schedule jobs to achieve high system utilization while satisfying various user demands in an equitable fashion. Designing such a scheduling algorithm is a non-trivial task even in a static environment. In practice, the computing environment and workload are constantly changing. There are several reasons for this. First, the computing platforms constantly evolve as the technology advances. For example, the availability of relatively powerful commodity off-the-shelf (COTS) components at steadily diminishing prices have made it feasible to construct ever larger massively parallel computers in recent years [1, 4]. Second, the workload imposed on the system also changes constantly. The rapidly increasing compute resources have provided many applications developers with the opportunity to radically alter program characteristics and take advantage of these additional resources. New developments in software technology may also trigger changes in user applications. Finally, political climate change may alter user priorities or the mission of the organization. System designers in such dynamic environments must be able to accurately forecast the effect of changes in the hardware, software, and/or policies under consideration. If the environmental changes are significant, one must also reassess scheduling algorithms. Simulation has frequently been relied upon for this analysis, because other methods such as analytical modeling or actual measurements are usually too difficult or costly. A drawback of the simulation approach, however, is that developing a simulator is a time-consuming process. Furthermore, an existing simulator cannot be easily adapted to a new environment. In this research, we attempt to develop a generic job-scheduling simulator, which facilitates the evaluation of

  12. High-Performance Parallel Computing. Final report, 1 February 1984-31 January 1985

    SciTech Connect

    Browne, J.C.; Lipovski, G.J.

    1986-01-22

    The 1984/85 accomplishments of the research project High-Performance Parallel Computing included bringing the prototype of the Texas Reconfigurable Array Computer (TRAC) to a configuration and to a state of stability where it could support execution of simple assembly language programs; initial development of a unified model of parallel computation which is a basis for a programming environment uniting process and data flow models of parallel computation; bringing to operational status on an alternative host one of the two parallel programming languages (the Computation Structures Language, CSL) originally intended for use on TRAC; exploration of the expressive capabilities of this programming language; initiation of development of a graphical programming language based on the unified model of parallel computation mentioned previously; major progress on a graphically interfaced Petri Net-based performance modeling system for parallel computations and development of algorithms for scheduling of circuits to realize configurations in configurable Banyan network-based computer architectures.

  13. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    PubMed

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies.

  14. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    PubMed

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. PMID:20674066

  15. Low cost, highly effective parallel computing achieved through a Beowulf cluster.

    PubMed

    Bitner, Marc; Skelton, Gordon

    2003-01-01

    A Beowulf cluster is a means of bringing together several computers and using software and network components to make this cluster of computers appear and function as one computer with multiple parallel computing processors. A cluster of computers can provide comparable computing power usually found only in very expensive super computers or servers.

  16. DOE Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee Report on Scientific and Technical Information

    SciTech Connect

    Hey, Tony; Agarwal, Deborah; Borgman, Christine; Cartaro, Concetta; Crivelli, Silvia; Van Dam, Kerstin Kleese; Luce, Richard; Arjun, Shankar; Trefethen, Anne; Wade, Alex; Williams, Dean

    2015-09-04

    The Advanced Scientific Computing Advisory Committee (ASCAC) was charged to form a standing subcommittee to review the Department of Energy’s Office of Scientific and Technical Information (OSTI) and to begin by assessing the quality and effectiveness of OSTI’s recent and current products and services and to comment on its mission and future directions in the rapidly changing environment for scientific publication and data. The Committee met with OSTI staff and reviewed available products, services and other materials. This report summaries their initial findings and recommendations.

  17. Component-based software for high-performance scientific computing

    NASA Astrophysics Data System (ADS)

    Alexeev, Yuri; Allan, Benjamin A.; Armstrong, Robert C.; Bernholdt, David E.; Dahlgren, Tamara L.; Gannon, Dennis; Janssen, Curtis L.; Kenny, Joseph P.; Krishnan, Manojkumar; Kohl, James A.; Kumfert, Gary; Curfman McInnes, Lois; Nieplocha, Jarek; Parker, Steven G.; Rasmussen, Craig; Windus, Theresa L.

    2005-01-01

    Recent advances in both computational hardware and multidisciplinary science have given rise to an unprecedented level of complexity in scientific simulation software. This paper describes an ongoing grass roots effort aimed at addressing complexity in high-performance computing through the use of Component-Based Software Engineering (CBSE). Highlights of the benefits and accomplishments of the Common Component Architecture (CCA) Forum and SciDAC ISIC are given, followed by an illustrative example of how the CCA has been applied to drive scientific discovery in quantum chemistry. Thrusts for future research are also described briefly.

  18. Parallel Computation of Orbit Determination for Space Debris Population

    NASA Astrophysics Data System (ADS)

    Olmedo, Estrella; Sanchez-Ortiz, Noelia; Ramos-Lerate, Mercedes

    2009-03-01

    In this work we present an algorithm for computing Orbit Determination for Space Debris population. The method presents a high degree of parallelism. That means that the number of available computers divides the computational effort. The context of this work and the later scope is to have the capability of cataloguing and correlating the Space Debris population. In this sense, as better the accuracy provided by the orbit determination is, more accurate will be the estimation of the state vectors corresponding to the debris objects and better will be the accuracy of the future catalogue of Space Debris. As more objects we can determinate the corresponding orbit, more complete will be the future catalogue. Therefore numerical tools for orbit determination are a key point in the development of a future ESSAS. The first time that a new object is observed, six measurements (these measurements may come from RADAR, Ground Based Telescope or Space Based Telescope) are required for computing an Initial Orbit Determination (IOD). After that, the Initial Estimated State Vector (IESV) is improved within the next-coming measurement. The idea of this method is the following. From six initial measurements, we compute the IOD following the same ideas of [1]. We compute also the initial knowledge covariance matrix (IKCM) corresponding to the IESV. In general, the numerical error of the IOD is too big for processing the following measurements with a conventional numerical filter (like the Square Root Information Filter (SRIF)). The problem is that the improvement of the accuracy in the IOD is not an easy task in those cases with large initial error. However the computed IKCM give a realistic approximation of the committed error in the IOD. The proposed algorithm uses the IKCM for generating a cloud of IESVs. All the IESV inside the cloud are processed with a new and much smaller IKCM by using SRIF. In such a way that the ones that are close enough to the real state vector (and thus

  19. Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format.

    PubMed

    Bustamam, Alhadi; Burrage, Kevin; Hamilton, Nicholas A

    2012-01-01

    Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However,with increasing vast amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses CUDA tool for implementing a massively parallel computing environment in the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU is efficiently lowering the latency time, thus, circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized ELLPACK-R sparse format to allow the effective and fine-grain massively parallel processing to cope with the sparse nature of interaction networks data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on CPU. Thus, large-scale parallel computation on off-the-shelf desktop-machines, that were previously only possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.

  20. Scalable High Performance Computing: Direct and Large-Eddy Turbulent Flow Simulations Using Massively Parallel Computers

    NASA Technical Reports Server (NTRS)

    Morgan, Philip E.

    2004-01-01

    This final report contains reports of research related to the tasks "Scalable High Performance Computing: Direct and Lark-Eddy Turbulent FLow Simulations Using Massively Parallel Computers" and "Devleop High-Performance Time-Domain Computational Electromagnetics Capability for RCS Prediction, Wave Propagation in Dispersive Media, and Dual-Use Applications. The discussion of Scalable High Performance Computing reports on three objectives: validate, access scalability, and apply two parallel flow solvers for three-dimensional Navier-Stokes flows; develop and validate a high-order parallel solver for Direct Numerical Simulations (DNS) and Large Eddy Simulation (LES) problems; and Investigate and develop a high-order Reynolds averaged Navier-Stokes turbulence model. The discussion of High-Performance Time-Domain Computational Electromagnetics reports on five objectives: enhancement of an electromagnetics code (CHARGE) to be able to effectively model antenna problems; utilize lessons learned in high-order/spectral solution of swirling 3D jets to apply to solving electromagnetics project; transition a high-order fluids code, FDL3DI, to be able to solve Maxwell's Equations using compact-differencing; develop and demonstrate improved radiation absorbing boundary conditions for high-order CEM; and extend high-order CEM solver to address variable material properties. The report also contains a review of work done by the systems engineer.

  1. Understanding Performance of Parallel Scientific Simulation Codes using Open|SpeedShop

    SciTech Connect

    Ghosh, K K

    2011-11-07

    Conclusions of this presentation are: (1) Open SpeedShop's (OSS) is convenient to use for large, parallel, scientific simulation codes; (2) Large codes benefit from uninstrumented execution; (3) Many experiments can be run in a short time - might need multiple shots e.g. usertime for caller-callee, hwcsamp for HW counters; (4) Decent idea of code's performance is easily obtained; (5) Statistical sampling calls for decent number of samples; and (6) HWC data is very useful for micro-analysis but can be tricky to analyze.

  2. Accelerating Dust Storm Simulation by Balancing Task Allocation in Parallel Computing Environment

    NASA Astrophysics Data System (ADS)

    Gui, Z.; Yang, C.; XIA, J.; Huang, Q.; YU, M.

    2013-12-01

    Dust storm has serious negative impacts on environment, human health, and assets. The continuing global climate change has increased the frequency and intensity of dust storm in the past decades. To better understand and predict the distribution, intensity and structure of dust storm, a series of dust storm models have been developed, such as Dust Regional Atmospheric Model (DREAM), the NMM meteorological module (NMM-dust) and Chinese Unified Atmospheric Chemistry Environment for Dust (CUACE/Dust). The developments and applications of these models have contributed significantly to both scientific research and our daily life. However, dust storm simulation is a data and computing intensive process. Normally, a simulation for a single dust storm event may take several days or hours to run. It seriously impacts the timeliness of prediction and potential applications. To speed up the process, high performance computing is widely adopted. By partitioning a large study area into small subdomains according to their geographic location and executing them on different computing nodes in a parallel fashion, the computing performance can be significantly improved. Since spatiotemporal correlations exist in the geophysical process of dust storm simulation, each subdomain allocated to a node need to communicate with other geographically adjacent subdomains to exchange data. Inappropriate allocations may introduce imbalance task loads and unnecessary communications among computing nodes. Therefore, task allocation method is the key factor, which may impact the feasibility of the paralleling. The allocation algorithm needs to carefully leverage the computing cost and communication cost for each computing node to minimize total execution time and reduce overall communication cost for the entire system. This presentation introduces two algorithms for such allocation and compares them with evenly distributed allocation method. Specifically, 1) In order to get optimized solutions, a

  3. Dynamic modeling of Tampa Bay urban development using parallel computing

    NASA Astrophysics Data System (ADS)

    Xian, George; Crane, Mike; Steinwand, Dan

    2005-08-01

    Urban land use and land cover has changed significantly in the environs of Tampa Bay, Florida, over the past 50 years. Extensive urbanization has created substantial change to the region's landscape and ecosystems. This paper uses a dynamic urban-growth model, SLEUTH, which applies six geospatial data themes (slope, land use, exclusion, urban extent, transportation, hillside), to study the process of urbanization and associated land use and land cover change in the Tampa Bay area. To reduce processing time and complete the modeling process within an acceptable period, the model is recoded and ported to a Beowulf cluster. The parallel-processing computer system accomplishes the massive amount of computation the modeling simulation requires. SLEUTH calibration process for the Tampa Bay urban growth simulation spends only 10 h CPU time. The model predicts future land use/cover change trends for Tampa Bay from 1992 to 2025. Urban extent is predicted to double in the Tampa Bay watershed between 1992 and 2025. Results show an upward trend of urbanization at the expense of a decline of 58% and 80% in agriculture and forested lands, respectively.

  4. Dynamic modeling of Tampa Bay urban development using parallel computing

    USGS Publications Warehouse

    Xian, G.; Crane, M.; Steinwand, D.

    2005-01-01

    Urban land use and land cover has changed significantly in the environs of Tampa Bay, Florida, over the past 50 years. Extensive urbanization has created substantial change to the region's landscape and ecosystems. This paper uses a dynamic urban-growth model, SLEUTH, which applies six geospatial data themes (slope, land use, exclusion, urban extent, transportation, hillside), to study the process of urbanization and associated land use and land cover change in the Tampa Bay area. To reduce processing time and complete the modeling process within an acceptable period, the model is recoded and ported to a Beowulf cluster. The parallel-processing computer system accomplishes the massive amount of computation the modeling simulation requires. SLEUTH calibration process for the Tampa Bay urban growth simulation spends only 10 h CPU time. The model predicts future land use/cover change trends for Tampa Bay from 1992 to 2025. Urban extent is predicted to double in the Tampa Bay watershed between 1992 and 2025. Results show an upward trend of urbanization at the expense of a decline of 58% and 80% in agriculture and forested lands, respectively. ?? 2005 Elsevier Ltd. All rights reserved.

  5. Modeling the fracture of ice sheets on parallel computers.

    SciTech Connect

    Waisman, Haim; Bell, Robin; Keyes, David; Boman, Erik Gunnar; Tuminaro, Raymond Stephen

    2010-03-01

    The objective of this project is to investigate the complex fracture of ice and understand its role within larger ice sheet simulations and global climate change. At the present time, ice fracture is not explicitly considered within ice sheet models due in part to large computational costs associated with the accurate modeling of this complex phenomena. However, fracture not only plays an extremely important role in regional behavior but also influences ice dynamics over much larger zones in ways that are currently not well understood. Dramatic illustrations of fracture-induced phenomena most notably include the recent collapse of ice shelves in Antarctica (e.g. partial collapse of the Wilkins shelf in March of 2008 and the diminishing extent of the Larsen B shelf from 1998 to 2002). Other fracture examples include ice calving (fracture of icebergs) which is presently approximated in simplistic ways within ice sheet models, and the draining of supraglacial lakes through a complex network of cracks, a so called ice sheet plumbing system, that is believed to cause accelerated ice sheet flows due essentially to lubrication of the contact surface with the ground. These dramatic changes are emblematic of the ongoing change in the Earth's polar regions and highlight the important role of fracturing ice. To model ice fracture, a simulation capability will be designed centered around extended finite elements and solved by specialized multigrid methods on parallel computers. In addition, appropriate dynamic load balancing techniques will be employed to ensure an approximate equal amount of work for each processor.

  6. PARLO: PArallel Run-Time Layout Optimization for Scientific Data Explorations with Heterogeneous Access Pattern

    SciTech Connect

    Gong, Zhenhuan; Boyuka, David; Zou, X; Liu, Gary; Podhorszki, Norbert; Klasky, Scott A; Ma, Xiaosong; Samatova, Nagiza F

    2013-01-01

    Download Citation Email Print Request Permissions Save to Project The size and scope of cutting-edge scientific simulations are growing much faster than the I/O and storage capabilities of their run-time environments. The growing gap is exacerbated by exploratory, data-intensive analytics, such as querying simulation data with multivariate, spatio-temporal constraints, which induces heterogeneous access patterns that stress the performance of the underlying storage system. Previous work addresses data layout and indexing techniques to improve query performance for a single access pattern, which is not sufficient for complex analytics jobs. We present PARLO a parallel run-time layout optimization framework, to achieve multi-level data layout optimization for scientific applications at run-time before data is written to storage. The layout schemes optimize for heterogeneous access patterns with user-specified priorities. PARLO is integrated with ADIOS, a high-performance parallel I/O middleware for large-scale HPC applications, to achieve user-transparent, light-weight layout optimization for scientific datasets. It offers simple XML-based configuration for users to achieve flexible layout optimization without the need to modify or recompile application codes. Experiments show that PARLO improves performance by 2 to 26 times for queries with heterogeneous access patterns compared to state-of-the-art scientific database management systems. Compared to traditional post-processing approaches, its underlying run-time layout optimization achieves a 56% savings in processing time and a reduction in storage overhead of up to 50%. PARLO also exhibits a low run-time resource requirement, while also limiting the performance impact on running applications to a reasonable level.

  7. Massively parallel computational fluid dynamics calculations for aerodynamics and aerothermodynamics applications

    SciTech Connect

    Payne, J.L.; Hassan, B.

    1998-09-01

    Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.

  8. Exploring Cloud Computing for Large-scale Scientific Applications

    SciTech Connect

    Lin, Guang; Han, Binh; Yin, Jian; Gorton, Ian

    2013-06-27

    This paper explores cloud computing for large-scale data-intensive scientific applications. Cloud computing is attractive because it provides hardware and software resources on-demand, which relieves the burden of acquiring and maintaining a huge amount of resources that may be used only once by a scientific application. However, unlike typical commercial applications that often just requires a moderate amount of ordinary resources, large-scale scientific applications often need to process enormous amount of data in the terabyte or even petabyte range and require special high performance hardware with low latency connections to complete computation in a reasonable amount of time. To address these challenges, we build an infrastructure that can dynamically select high performance computing hardware across institutions and dynamically adapt the computation to the selected resources to achieve high performance. We have also demonstrated the effectiveness of our infrastructure by building a system biology application and an uncertainty quantification application for carbon sequestration, which can efficiently utilize data and computation resources across several institutions.

  9. Analysis of composite ablators using massively parallel computation

    NASA Technical Reports Server (NTRS)

    Shia, David

    1995-01-01

    In this work, the feasibility of using massively parallel computation to study the response of ablative materials is investigated. Explicit and implicit finite difference methods are used on a massively parallel computer, the Thinking Machines CM-5. The governing equations are a set of nonlinear partial differential equations. The governing equations are developed for three sample problems: (1) transpiration cooling, (2) ablative composite plate, and (3) restrained thermal growth testing. The transpiration cooling problem is solved using a solution scheme based solely on the explicit finite difference method. The results are compared with available analytical steady-state through-thickness temperature and pressure distributions and good agreement between the numerical and analytical solutions is found. It is also found that a solution scheme based on the explicit finite difference method has the following advantages: incorporates complex physics easily, results in a simple algorithm, and is easily parallelizable. However, a solution scheme of this kind needs very small time steps to maintain stability. A solution scheme based on the implicit finite difference method has the advantage that it does not require very small times steps to maintain stability. However, this kind of solution scheme has the disadvantages that complex physics cannot be easily incorporated into the algorithm and that the solution scheme is difficult to parallelize. A hybrid solution scheme is then developed to combine the strengths of the explicit and implicit finite difference methods and minimize their weaknesses. This is achieved by identifying the critical time scale associated with the governing equations and applying the appropriate finite difference method according to this critical time scale. The hybrid solution scheme is then applied to the ablative composite plate and restrained thermal growth problems. The gas storage term is included in the explicit pressure calculation of both

  10. Dynamic remapping decisions in multi-phase parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, D. M.; Reynolds, P. F., Jr.

    1986-01-01

    The effectiveness of any given mapping of workload to processors in a parallel system is dependent on the stochastic behavior of the workload. Program behavior is often characterized by a sequence of phases, with phase changes occurring unpredictably. During a phase, the behavior is fairly stable, but may become quite different during the next phase. Thus a workload assignment generated for one phase may hinder performance during the next phase. We consider the problem of deciding whether to remap a paralled computation in the face of uncertainty in remapping's utility. Fundamentally, it is necessary to balance the expected remapping performance gain against the delay cost of remapping. This paper treats this problem formally by constructing a probabilistic model of a computation with at most two phases. We use stochastic dynamic programming to show that the remapping decision policy which minimizes the expected running time of the computation has an extremely simple structure: the optimal decision at any step is followed by comparing the probability of remapping gain against a threshold. This theoretical result stresses the importance of detecting a phase change, and assessing the possibility of gain from remapping. We also empirically study the sensitivity of optimal performance to imprecise decision threshold. Under a wide range of model parameter values, we find nearly optimal performance if remapping is chosen simply when the gain probability is high. These results strongly suggest that except in extreme cases, the remapping decision problem is essentially that of dynamically determining whether gain can be achieved by remapping after a phase change; precise quantification of the decision model parameters is not necessary.

  11. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, J.

    1999-01-01

    A new atmospheric objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 1 X 1 lat-lon grid with 18 levels of heights and winds and 10 levels of moisture) using 120,000 observations in 17 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system is totally portable and can run on several different architectures at once. In addition, the system can easily scale up to 100 or more CPUS. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from 1 to 32 CPUs is 18%. In addition, the analysis results are identical regardless of the number of processors used. This system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. Static tests with a 2 X 2.5 resolution version of this system showed it's analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from several months of cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (O-F statistics) as the current operational system.

  12. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, James G.

    1999-01-01

    A new objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 2 x 2.5 lat-lon grid with 20 levels of heights and winds and 10 levels of moisture) using 120,000 observations in less than 3 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system Ls totally portable and can run on -several different architectures at once. In addition, the system can easily scale up to 100 or more CPUS. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from I to 32 CPus is 18%. in addition, the analysis results are identical regardless of the number of processors used. T'his system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. It also includes a new quality control (buddy check) system. Static tests with the system showed it's analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from a 2-month cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (0-F statistics) throughout the entire two months.

  13. Massively parallel computation of RCS with finite elements

    NASA Technical Reports Server (NTRS)

    Parker, Jay

    1993-01-01

    One of the promising combinations of finite element approaches for scattering problems uses Whitney edge elements, spherical vector wave-absorbing boundary conditions, and bi-conjugate gradient solution for the frequency-domain near field. Each of these approaches may be criticized. Low-order elements require high mesh density, but also result in fast, reliable iterative convergence. Spherical wave-absorbing boundary conditions require additional space to be meshed beyond the most minimal near-space region, but result in fully sparse, symmetric matrices which keep storage and solution times low. Iterative solution is somewhat unpredictable and unfriendly to multiple right-hand sides, yet we find it to be uniformly fast on large problems to date, given the other two approaches. Implementation of these approaches on a distributed memory, message passing machine yields huge dividends, as full scalability to the largest machines appears assured and iterative solution times are well-behaved for large problems. We present times and solutions for computed RCS for a conducting cube and composite permeability/conducting sphere on the Intel ipsc860 with up to 16 processors solving over 200,000 unknowns. We estimate problems of approximately 10 million unknowns, encompassing 1000 cubic wavelengths, may be attempted on a currently available 512 processor machine, but would be exceedingly tedious to prepare. The most severe bottlenecks are due to the slow rate of mesh generation on non-parallel machines and the large transfer time from such a machine to the parallel processor. One solution, in progress, is to create and then distribute a coarse mesh among the processors, followed by systematic refinement within each processor. Elimination of redundant node definitions at the mesh-partition surfaces, snap-to-surface post processing of the resulting mesh for good modelling of curved surfaces, and load-balancing redistribution of new elements after the refinement are auxiliary

  14. Ontology-Driven Discovery of Scientific Computational Entities

    ERIC Educational Resources Information Center

    Brazier, Pearl W.

    2010-01-01

    Many geoscientists use modern computational resources, such as software applications, Web services, scientific workflows and datasets that are readily available on the Internet, to support their research and many common tasks. These resources are often shared via human contact and sometimes stored in data portals; however, they are not necessarily…

  15. A class of parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1989-01-01

    Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.

  16. Design of a massively parallel computer using bit serial processing elements

    NASA Technical Reports Server (NTRS)

    Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing

    1995-01-01

    A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.

  17. Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel A.

    2016-03-01

    In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.

  18. Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel

    2016-05-03

    In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.

  19. Institute for Scientific Computing Research Annual Report: Fiscal Year 2004

    SciTech Connect

    Keyes, D E

    2005-02-07

    Large-scale scientific computation and all of the disciplines that support and help to validate it have been placed at the focus of Lawrence Livermore National Laboratory (LLNL) by the Advanced Simulation and Computing (ASC) program of the National Nuclear Security Administration (NNSA) and the Scientific Discovery through Advanced Computing (SciDAC) initiative of the Office of Science of the Department of Energy (DOE). The maturation of computational simulation as a tool of scientific and engineering research is underscored in the November 2004 statement of the Secretary of Energy that, ''high performance computing is the backbone of the nation's science and technology enterprise''. LLNL operates several of the world's most powerful computers--including today's single most powerful--and has undertaken some of the largest and most compute-intensive simulations ever performed. Ultrascale simulation has been identified as one of the highest priorities in DOE's facilities planning for the next two decades. However, computers at architectural extremes are notoriously difficult to use efficiently. Furthermore, each successful terascale simulation only points out the need for much better ways of interacting with the resulting avalanche of data. Advances in scientific computing research have, therefore, never been more vital to LLNL's core missions than at present. Computational science is evolving so rapidly along every one of its research fronts that to remain on the leading edge, LLNL must engage researchers at many academic centers of excellence. In Fiscal Year 2004, the Institute for Scientific Computing Research (ISCR) served as one of LLNL's main bridges to the academic community with a program of collaborative subcontracts, visiting faculty, student internships, workshops, and an active seminar series. The ISCR identifies researchers from the academic community for computer science and computational science collaborations with LLNL and hosts them for short- and

  20. A Visual Database System for Image Analysis on Parallel Computers and its Application to the EOS Amazon Project

    NASA Technical Reports Server (NTRS)

    Shapiro, Linda G.; Tanimoto, Steven L.; Ahrens, James P.

    1996-01-01

    The goal of this task was to create a design and prototype implementation of a database environment that is particular suited for handling the image, vision and scientific data associated with the NASA's EOC Amazon project. The focus was on a data model and query facilities that are designed to execute efficiently on parallel computers. A key feature of the environment is an interface which allows a scientist to specify high-level directives about how query execution should occur.

  1. Parallel processing architecture for computing inverse differential kinematic equations of the PUMA arm

    NASA Technical Reports Server (NTRS)

    Hsia, T. C.; Lu, G. Z.; Han, W. H.

    1987-01-01

    In advanced robot control problems, on-line computation of inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be inplemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.

  2. SAF - Sets and Fields parallel I/O and scientific data modeling system

    2005-07-01

    SAF is being developed as part of the Data Models and Formats (DMF) component of the Accelerated Strategic Computing Initiative (ASCI). SAF represents a revolutionary approach to interoperation of high performance, scientific computing applications based upon rigorous, math oriented data modeling principles. Previous technologies have required all applications to use the same data structures and/or mesh objects to represent scientific data or lead to an ever expanding set of incrementally different data structures and/or meshmore » objects. SAF addresses this problem by providing a small set of mathematical building blocks, sets, relations and fields, out of which a wide variety of scientific data can be characterized. Applications literally model their data by assembling these building blocks. Sets and fields building blocks are at once, both primitive and abstract: * They are primitive enough to model a wide variety of scientific data. * They are abstract enough to model the data in terms of what it represents in a mathematical or physical sense independent of how it is represented in an implementation sense. For example, while there are many ways to represent the airflow over the wing of a supersonic aircraft in a computer program, there is only one mathematical/physical interpretation: a field of 3D velocity vectors over a 2D surface. This latter description is immutable. It is independent of any particular representation or implementation choices. Understanding this what versus how relationship, that is what is represented versus how it is represented, is key to developing a solution for large scale integration of scientific software.« less

  3. SAF - Sets and Fields parallel I/O and scientific data modeling system

    SciTech Connect

    Matzke, Robb; Illescas, Eric; Espen, Peter; Jones, Jake S.; Sjaardema, Gregory; Miller, Mark C.; Schoof, Larry A.; Reus, James F.; Arrighi, William; Hitt, Ray T.; O'Brien, Matthew J.

    2005-07-01

    SAF is being developed as part of the Data Models and Formats (DMF) component of the Accelerated Strategic Computing Initiative (ASCI). SAF represents a revolutionary approach to interoperation of high performance, scientific computing applications based upon rigorous, math oriented data modeling principles. Previous technologies have required all applications to use the same data structures and/or mesh objects to represent scientific data or lead to an ever expanding set of incrementally different data structures and/or mesh objects. SAF addresses this problem by providing a small set of mathematical building blocks, sets, relations and fields, out of which a wide variety of scientific data can be characterized. Applications literally model their data by assembling these building blocks. Sets and fields building blocks are at once, both primitive and abstract: * They are primitive enough to model a wide variety of scientific data. * They are abstract enough to model the data in terms of what it represents in a mathematical or physical sense independent of how it is represented in an implementation sense. For example, while there are many ways to represent the airflow over the wing of a supersonic aircraft in a computer program, there is only one mathematical/physical interpretation: a field of 3D velocity vectors over a 2D surface. This latter description is immutable. It is independent of any particular representation or implementation choices. Understanding this what versus how relationship, that is what is represented versus how it is represented, is key to developing a solution for large scale integration of scientific software.

  4. Parallel computations on pedigree data through mapping to configurable computing devices

    PubMed Central

    Henshall, John M; Little, Bryce Alvin

    2006-01-01

    Pedigree data structures have a number of applications in genetics, including the estimation of allelic or haplotype probabilities in humans and agricultural species, and the estimation of breeding values in agricultural species. Sequential algorithms for general purpose CPU-based computers are commonly used, but are inadequate for some tasks on large data sets. We show that pedigree data can be directly represented on Field Programmable Gate Arrays (FPGA), allowing highly efficient massively parallel simulation of the flow of genes. Operating on the whole pedigree in parallel, the transmission of genes can occur for all individuals in a single clock cycle. By using FPGA, the algorithms to estimate inbreeding coefficients and allelic probabilities are shown to operate hundreds to thousands of times faster than the corresponding sequentially based algorithms. Where problems can be largely represented in an integer form, FPGA provide an efficient platform for computations on pedigree data. PMID:16635449

  5. Parallelizing Sylvester-like operations on a distributed memory computer

    SciTech Connect

    Hu, D.Y.; Sorensen, D.C.

    1994-12-31

    Discretization of linear operators arising in applied mathematics often leads to matrices with the following structure: M(x) = (D {circle_times} A + B {circle_times} I{sub n} + V)x, where x {element_of} R{sup mn}, B, D {element_of} R{sup nxn}, A {element_of} R{sup mxm} and V {element_of} R{sup mnxmn}; both D and V are diagonal. For the notational convenience, the authors assume that both A and B are symmetric. All the results through this paper can be easily extended to the cases with general A and B. The linear operator on R{sup mn} defined above can be viewed as a generalization of the Sylvester operator: S(x) = (I{sub m} {circle_times} A + B {circle_times} I{sub n})x. The authors therefore refer to it as a Sylvester-like operator. The schemes discussed in this paper therefore also apply to Sylvester operator. In this paper, the authors present the SIMD scheme for parallelization of the Sylvester-like operator on a distributed memory computer. This scheme is designed to approach the best possible efficiency by avoiding unnecessary communication among processors.

  6. Three-dimensional radiative transfer on a massively parallel computer

    NASA Technical Reports Server (NTRS)

    Vath, H. M.

    1994-01-01

    We perform 3D radiative transfer calculations in non-local thermodynamic equilibrium (NLTE) in the simple two-level atom approximation on the Mas-Par MP-1, which contains 8192 processors and is a single instruction multiple data (SIMD) machine, an example of the new generation of massively parallel computers. On such a machine, all processors execute the same command at a given time, but on different data. To make radiative transfer calculations efficient, we must re-consider the numerical methods and storage of data. To solve the transfer equation, we adopt the short characteristic method and examine different acceleration methods to obtain the source function. We use the ALI method and test local and non-local operators. Furthermore, we compare the Ng and the orthomin methods of acceleration. We also investigate the use of multi-grid methods to get fast solutions for the NLTE case. In order to test these numerical methods, we apply them to two problems with and without periodic boundary conditions.

  7. Parallel multiphysics algorithms and software for computational nuclear engineering

    NASA Astrophysics Data System (ADS)

    Gaston, D.; Hansen, G.; Kadioglu, S.; Knoll, D. A.; Newman, C.; Park, H.; Permann, C.; Taitano, W.

    2009-07-01

    There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.

  8. A parallel computational fluid dynamics unstructured grid generator

    NASA Astrophysics Data System (ADS)

    Davis, Deborah E.

    1993-12-01

    This research addressed the development of a parallel computational fluid dynamics unstructured grid generator using Delaunay triangulation. The generator is applied to simple elliptical and cylindrical two-dimensional bodies. The methodologies used included Watson's point insertion algorithm, Holmes and Snyder's point creation algorithm, a discretized surface definition, Anderson's clustering function, and a Laplacian smoother. The first version of the software involved a processor boundary exchange at the end of each iteration with no inter-processor communications during the iterations. The second version used inter-processor communication during each iteration instead of the boundary exchange. Version 1 demonstrated a speedup of 1.8 for some portions of the code, but proved to be unscalable for more than two nodes due to the interdependency of the triangular elements. The results of Version 2 were similar. Two distribution methodologies, a simple 360-degree distribution and recursive spectral bisection (RSB), were examined. For the initial grid distribution, the distribution generated by the RSB code would be similar to the distribution generated by the 360-degree methodology and would require significantly more time to execute.

  9. Acts -- A collection of high performing software tools for scientific computing

    SciTech Connect

    Drummond, L.A.; Marques, O.A.

    2002-11-01

    During the past decades there has been a continuous growth in the number of physical and societal problems that have been successfully studied and solved by means of computational modeling and simulation. Further, many new discoveries depend on high performance computer simulations to satisfy their demands for large computational resources and short response time. The Advanced CompuTational Software (ACTS) Collection brings together a number of general-purpose computational tool development projects funded and supported by the U.S. Department of Energy (DOE). These tools make it easier for scientific code developers to write high performance applications for parallel computers. They tackle a number of computational issues that are common to a large number of scientific applications, mainly implementation of numerical algorithms, and support for code development, execution and optimization. The ACTS collection promotes code portability, reusability, reduction of duplicate efforts, and tool maturity. This paper presents a brief introduction to the functionality available in ACTS. It also highlight the tools that are in demand by Climate and Weather modelers.

  10. Educational NASA Computational and Scientific Studies (enCOMPASS)

    NASA Technical Reports Server (NTRS)

    Memarsadeghi, Nargess

    2013-01-01

    Educational NASA Computational and Scientific Studies (enCOMPASS) is an educational project of NASA Goddard Space Flight Center aimed at bridging the gap between computational objectives and needs of NASA's scientific research, missions, and projects, and academia's latest advances in applied mathematics and computer science. enCOMPASS achieves this goal via bidirectional collaboration and communication between NASA and academia. Using developed NASA Computational Case Studies in university computer science/engineering and applied mathematics classes is a way of addressing NASA's goals of contributing to the Science, Technology, Education, and Math (STEM) National Objective. The enCOMPASS Web site at http://encompass.gsfc.nasa.gov provides additional information. There are currently nine enCOMPASS case studies developed in areas of earth sciences, planetary sciences, and astrophysics. Some of these case studies have been published in AIP and IEEE's Computing in Science and Engineering magazines. A few university professors have used enCOMPASS case studies in their computational classes and contributed their findings to NASA scientists. In these case studies, after introducing the science area, the specific problem, and related NASA missions, students are first asked to solve a known problem using NASA data and past approaches used and often published in a scientific/research paper. Then, after learning about the NASA application and related computational tools and approaches for solving the proposed problem, students are given a harder problem as a challenge for them to research and develop solutions for. This project provides a model for NASA scientists and engineers on one side, and university students, faculty, and researchers in computer science and applied mathematics on the other side, to learn from each other's areas of work, computational needs and solutions, and the latest advances in research and development. This innovation takes NASA science and

  11. Research in Computational Aeroscience Applications Implemented on Advanced Parallel Computing Systems

    NASA Technical Reports Server (NTRS)

    Wigton, Larry

    1996-01-01

    Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.

  12. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  13. A Computing Environment to Support Repeatable Scientific Big Data Experimentation of World-Wide Scientific Literature

    SciTech Connect

    Schlicher, Bob G; Kulesz, James J; Abercrombie, Robert K; Kruse, Kara L

    2015-01-01

    A principal tenant of the scientific method is that experiments must be repeatable and relies on ceteris paribus (i.e., all other things being equal). As a scientific community, involved in data sciences, we must investigate ways to establish an environment where experiments can be repeated. We can no longer allude to where the data comes from, we must add rigor to the data collection and management process from which our analysis is conducted. This paper describes a computing environment to support repeatable scientific big data experimentation of world-wide scientific literature, and recommends a system that is housed at the Oak Ridge National Laboratory in order to provide value to investigators from government agencies, academic institutions, and industry entities. The described computing environment also adheres to the recently instituted digital data management plan mandated by multiple US government agencies, which involves all stages of the digital data life cycle including capture, analysis, sharing, and preservation. It particularly focuses on the sharing and preservation of digital research data. The details of this computing environment are explained within the context of cloud services by the three layer classification of Software as a Service , Platform as a Service , and Infrastructure as a Service .

  14. New computing environments:Parallel, vector and systolic

    SciTech Connect

    Wouk, A.

    1986-01-01

    This book presents papers on supercomputers and array processors. Topics considered include nested dissection, the systolic level 2 BLAS, parallel processing a hydrodynamic shock wave problem, MACH-1, portable standard LISP on the Cray, distributed combinator evaluation, performance and library issues, scale problems, multiprocessor architecture, the MIDAS multiprocessor system, parallel algorithms for incompressible and compressible flows on a multiprocessor, and parallel algorithms for elliptic equations.

  15. Computer network access to scientific information systems for minority universities

    NASA Astrophysics Data System (ADS)

    Thomas, Valerie L.; Wakim, Nagi T.

    1993-08-01

    The evolution of computer networking technology has lead to the establishment of a massive networking infrastructure which interconnects various types of computing resources at many government, academic, and corporate institutions. A large segment of this infrastructure has been developed to facilitate information exchange and resource sharing within the scientific community. The National Aeronautics and Space Administration (NASA) supports both the development and the application of computer networks which provide its community with access to many valuable multi-disciplinary scientific information systems and on-line databases. Recognizing the need to extend the benefits of this advanced networking technology to the under-represented community, the National Space Science Data Center (NSSDC) in the Space Data and Computing Division at the Goddard Space Flight Center has developed the Minority University-Space Interdisciplinary Network (MU-SPIN) Program: a major networking and education initiative for Historically Black Colleges and Universities (HBCUs) and Minority Universities (MUs). In this paper, we will briefly explain the various components of the MU-SPIN Program while highlighting how, by providing access to scientific information systems and on-line data, it promotes a higher level of collaboration among faculty and students and NASA scientists.

  16. The Potential of the Cell Processor for Scientific Computing

    SciTech Connect

    Williams, Samuel; Shalf, John; Oliker, Leonid; Husbands, Parry; Kamil, Shoaib; Yelick, Katherine

    2005-10-14

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of the using the forth coming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. We are the first to present quantitative Cell performance data on scientific kernels and show direct comparisons against leading superscalar (AMD Opteron), VLIW (IntelItanium2), and vector (Cray X1) architectures. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop both analytical models and simulators to predict kernel performance. Our work also explores the complexity of mapping several important scientific algorithms onto the Cells unique architecture. Additionally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

  17. Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications

    SciTech Connect

    Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael

    2015-04-08

    The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. As a result, application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, therefore enabling the applications to achieve better time to solution and scalability on Titan during production.

  18. Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications

    DOE PAGESBeta

    Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael

    2015-04-08

    The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on themore » performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. As a result, application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, therefore enabling the applications to achieve better time to solution and scalability on Titan during production.« less

  19. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-11

    Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  20. Technologies for Large Data Management in Scientific Computing

    NASA Astrophysics Data System (ADS)

    Pace, Alberto

    2014-01-01

    In recent years, intense usage of computing has been the main strategy of investigations in several scientific research projects. The progress in computing technology has opened unprecedented opportunities for systematic collection of experimental data and the associated analysis that were considered impossible only few years ago. This paper focuses on the strategies in use: it reviews the various components that are necessary for an effective solution that ensures the storage, the long term preservation, and the worldwide distribution of large quantities of data that are necessary in a large scientific research project. The paper also mentions several examples of data management solutions used in High Energy Physics for the CERN Large Hadron Collider (LHC) experiments in Geneva, Switzerland which generate more than 30,000 terabytes of data every year that need to be preserved, analyzed, and made available to a community of several tenth of thousands scientists worldwide.

  1. Efficient graph algorithms for sequential and parallel computers. Doctoral thesis

    SciTech Connect

    Goldberg, A.V.

    1987-02-01

    This thesis studies graph algorithms, both in sequential and parallel contexts. In the outline of the thesis, algorithm complexities are stated in terms of the the number of vertices n, the number of edges m, the largest absolute value of capacities U, and the largest absolute value of costs C. Chapter 1 introduces a new approach to the maximum flow problem that leads to better algorithms for the problem. Chapter 2 is devoted to the minimum cost flow problem, which is a generalization of the maximum flow problem. Chapter 3 addresses implementation of parallel algorithms through a case study of an implementation of a parallel maximum flow algorithm. Parallel prefix operations play an important role in the implementation. Present experimental results achieved by the implementation are presented. Present parallel symmetry-breaking techniques are the main topic of Chapter 4.

  2. Recent Scientific Evidence and Technical Developments in Cardiovascular Computed Tomography.

    PubMed

    Marcus, Roy; Ruff, Christer; Burgstahler, Christof; Notohamiprodjo, Mike; Nikolaou, Konstantin; Geisler, Tobias; Schroeder, Stephen; Bamberg, Fabian

    2016-05-01

    In recent years, coronary computed tomography angiography has become an increasingly safe and noninvasive modality for the evaluation of the anatomical structure of the coronary artery tree with diagnostic benefits especially in patients with a low-to-intermediate pretest probability of disease. Currently, increasing evidence from large randomized diagnostic trials is accumulating on the diagnostic impact of computed tomography angiography for the management of patients with acute and stable chest pain syndrome. At the same time, technical advances have substantially reduced adverse effects and limiting factors, such as radiation exposure, the amount of iodinated contrast agent, and scanning time, rendering the technique appropriate for broader clinical applications. In this work, we review the latest developments in computed tomography technology and describe the scientific evidence on the use of cardiac computed tomography angiography to evaluate patients with acute and stable chest pain syndrome.

  3. Recent Scientific Evidence and Technical Developments in Cardiovascular Computed Tomography.

    PubMed

    Marcus, Roy; Ruff, Christer; Burgstahler, Christof; Notohamiprodjo, Mike; Nikolaou, Konstantin; Geisler, Tobias; Schroeder, Stephen; Bamberg, Fabian

    2016-05-01

    In recent years, coronary computed tomography angiography has become an increasingly safe and noninvasive modality for the evaluation of the anatomical structure of the coronary artery tree with diagnostic benefits especially in patients with a low-to-intermediate pretest probability of disease. Currently, increasing evidence from large randomized diagnostic trials is accumulating on the diagnostic impact of computed tomography angiography for the management of patients with acute and stable chest pain syndrome. At the same time, technical advances have substantially reduced adverse effects and limiting factors, such as radiation exposure, the amount of iodinated contrast agent, and scanning time, rendering the technique appropriate for broader clinical applications. In this work, we review the latest developments in computed tomography technology and describe the scientific evidence on the use of cardiac computed tomography angiography to evaluate patients with acute and stable chest pain syndrome. PMID:27025303

  4. Memory Benchmarks for SMP-Based High Performance Parallel Computers

    SciTech Connect

    Yoo, A B; de Supinski, B; Mueller, F; Mckee, S A

    2001-11-20

    As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominates the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.

  5. Algorithmic support for commodity-based parallel computing systems.

    SciTech Connect

    Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann

    2003-10-01

    The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.

  6. Smart Libraries: Best SQE Practices for Libraries with an Emphasis on Scientific Computing

    SciTech Connect

    Miller, M C; Reus, J F; Matzke, R P; Koziol, Q A; Cheng, A P

    2004-12-15

    As scientific computing applications grow in complexity, more and more functionality is being packaged in independently developed libraries. Worse, as the computing environments in which these applications run grow in complexity, it gets easier to make mistakes in building, installing and using libraries as well as the applications that depend on them. Unfortunately, SQA standards so far developed focus primarily on applications, not libraries. We show that SQA standards for libraries differ from applications in many respects. We introduce and describe a variety of practices aimed at minimizing the likelihood of making mistakes in using libraries and at maximizing users' ability to diagnose and correct them when they occur. We introduce the term Smart Library to refer to a library that is developed with these basic principles in mind. We draw upon specific examples from existing products we believe incorporate smart features: MPI, a parallel message passing library, and HDF5 and SAF, both of which are parallel I/O libraries supporting scientific computing applications. We conclude with a narrative of some real-world experiences in using smart libraries with Ale3d, VisIt and SAF.

  7. O2scl: Object-oriented scientific computing library

    NASA Astrophysics Data System (ADS)

    Steiner, Andrew W.

    2014-08-01

    O2scl is an object-oriented library for scientific computing in C++ useful for solving, minimizing, differentiating, integrating, interpolating, optimizing, approximating, analyzing, fitting, and more. Many classes operate on generic function and vector types; it includes classes based on GSL and CERNLIB. O2scl also contains code for computing the basic thermodynamic integrals for fermions and bosons, for generating almost all of the most common equations of state of nuclear and neutron star matter, and for solving the TOV equations. O2scl can be used on Linux, Mac and Windows (Cygwin) platforms and has extensive documentation.

  8. Applications of the Aurora parallel Prolog system to computational molecular biology

    SciTech Connect

    Lusk, E.L.; Overbeek, R.; Mudambi, S.; Szeredi, P.

    1993-09-01

    We describe an investigation into the use of the Aurora parallel Prolog system in two applications within the area of computational molecular biology. The computational requirements were large, due to the nature of the applications, and were large, due to the nature of the applications, and were carried out on a scalable parallel computer the BBN ``Butterfly`` TC-2000. Results include both a demonstration that logic programming can be effective in the context of demanding applications on large-scale parallel machines, and some insights into parallel programming in Prolog.

  9. The transition to massively parallel computing within a production environment at a DOE access center

    SciTech Connect

    McCoy, M.G.

    1993-04-01

    In contemplating the transition from sequential to MP computing, the National Energy Research Supercomputer Center (NERSC) is faced with the frictions inherent in the duality of its mission. There have been two goals, the first has been to provide a stable, serviceable, production environment to the user base, the second to bring the most capable early serial supercomputers to the Center to make possible the leading edge simulations. This seeming conundrum has in reality been a source of strength. The task of meeting both goals was faced before with the CRAY 1 which, as delivered, was all iron; so the problems associated with the advent of parallel computers are not entirely new, but they are serious. Current vector supercomputers, such as the C90, offer mature production environments, including software tools, a large applications base, and generality; these machines can be used to attack the spectrum of scientific applications by a large user base knowledgeable in programming techniques for this architecture. Parallel computers to date have offered less developed, even rudimentary, working environments, a sparse applications base, and forced specialization. They have been specialized in terms of programming models, and specialized in terms of the kinds of applications which would do well on the machines. Given this context, why do many service computer centers feel that now is the time to cease or slow the procurement of traditional vector supercomputers in favor of MP systems? What are some of the issues that NERSC must face to engineer a smooth transition? The answers to these questions are multifaceted and by no means completely clear. However, a route exists as a result of early efforts at the Laboratories combined with research within the HPCC Program. One can begin with an analysis of why the hardware and software appearing shortly should be made available to the mainstream, and then address what would be required in an initial production environment.

  10. Parallel paving: An algorithm for generating distributed, adaptive, all-quadrilateral meshes on parallel computers

    SciTech Connect

    Lober, R.R.; Tautges, T.J.; Vaughan, C.T.

    1997-03-01

    Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving`s meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia`s {open_quotes}tiling{close_quotes} dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.

  11. Performance analysis of large scale parallel CFD computing based on Code_Saturne

    NASA Astrophysics Data System (ADS)

    Shang, Zhi

    2013-02-01

    In order to run computational fluid dynamics (CFD) codes on large scales, parallel computing has to be employed. For instance, on Petascale computing, general parallel computing without any optimization is not enough, especially for complex industrial issues that employ a large number of mesh cells to capture the details of the geometry. How to distribute these mesh cells among the multi-processors for Terascale and Petascale systems to obtain a good performance on parallel computing is really a challenge. Some mesh partitioning software packages, such as Metis, ParMetis, PT-Scotch and Zoltan, were chosen as the candidates ported into Code_Saturne to test if they can lead Code_Saturne towards Petascale and Exascale parallel CFD computing. Through the studies, it was found that mesh partitioning optimization software packages based on the graph mesh partitioning method can help the CFD code obtain good mesh distributions for high performance computing (HPC).

  12. Experimental free-space optical network for massively parallel computers

    NASA Astrophysics Data System (ADS)

    Araki, S.; Kajita, M.; Kasahara, K.; Kubota, K.; Kurihara, K.; Redmond, I.; Schenfeld, E.; Suzaki, T.

    1996-03-01

    A free-space optical interconnection scheme is described for massively parallel processors based on the interconnection-cached network architecture. The optical network operates in a circuit-switching mode. Combined with a packet-switching operation among the circuit-switched optical channels, a high-bandwidth, low-latency network for massively parallel processing results. The design and assembly of a 64-channel experimental prototype is discussed, and operational results are presented.

  13. Charon Message-Passing Toolkit for Scientific Computations

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Yan, Jerry (Technical Monitor)

    2000-01-01

    Charon is a library, callable from C and Fortran, that aids the conversion of structured-grid legacy codes-such as those used in the numerical computation of fluid flows-into parallel, high- performance codes. Key are functions that define distributed arrays, that map between distributed and non-distributed arrays, and that allow easy specification of common communications on structured grids. The library is based on the widely accepted MPI message passing standard. We present an overview of the functionality of Charon, and some representative results.

  14. Building Cognition: The Construction of Computational Representations for Scientific Discovery.

    PubMed

    Chandrasekharan, Sanjay; Nersessian, Nancy J

    2015-11-01

    Novel computational representations, such as simulation models of complex systems and video games for scientific discovery (Foldit, EteRNA etc.), are dramatically changing the way discoveries emerge in science and engineering. The cognitive roles played by such computational representations in discovery are not well understood. We present a theoretical analysis of the cognitive roles such representations play, based on an ethnographic study of the building of computational models in a systems biology laboratory. Specifically, we focus on a case of model-building by an engineer that led to a remarkable discovery in basic bioscience. Accounting for such discoveries requires a distributed cognition (DC) analysis, as DC focuses on the roles played by external representations in cognitive processes. However, DC analyses by and large have not examined scientific discovery, and they mostly focus on memory offloading, particularly how the use of existing external representations changes the nature of cognitive tasks. In contrast, we study discovery processes and argue that discoveries emerge from the processes of building the computational representation. The building process integrates manipulations in imagination and in the representation, creating a coupled cognitive system of model and modeler, where the model is incorporated into the modeler's imagination. This account extends DC significantly, and we present some of the theoretical and application implications of this extended account.

  15. Cooperative storage of shared files in a parallel computing system with dynamic block size

    DOEpatents

    Bent, John M.; Faibish, Sorin; Grider, Gary

    2015-11-10

    Improved techniques are provided for parallel writing of data to a shared object in a parallel computing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallel computing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).

  16. Beyond input-output computings: error-driven emergence with parallel non-distributed slime mold computer.

    PubMed

    Aono, Masashi; Gunji, Yukio-Pegio

    2003-10-01

    The emergence derived from errors is the key importance for both novel computing and novel usage of the computer. In this paper, we propose an implementable experimental plan for the biological computing so as to elicit the emergent property of complex systems. An individual plasmodium of the true slime mold Physarum polycephalum acts in the slime mold computer. Modifying the Elementary Cellular Automaton as it entails the global synchronization problem upon the parallel computing provides the NP-complete problem solved by the slime mold computer. The possibility to solve the problem by giving neither all possible results nor explicit prescription of solution-seeking is discussed. In slime mold computing, the distributivity in the local computing logic can change dynamically, and its parallel non-distributed computing cannot be reduced into the spatial addition of multiple serial computings. The computing system based on exhaustive absence of the super-system may produce, something more than filling the vacancy.

  17. InSAR Scientific Computing Environment on the Cloud

    NASA Astrophysics Data System (ADS)

    Rosen, P. A.; Shams, K. S.; Gurrola, E. M.; George, B. A.; Knight, D. S.

    2012-12-01

    In response to the needs of the international scientific and operational Earth observation communities, spaceborne Synthetic Aperture Radar (SAR) systems are being tasked to produce enormous volumes of raw data daily, with availability to scientists to increase substantially as more satellites come online and data becomes more accessible through more open data policies. The availability of these unprecedentedly dense and rich datasets has led to the development of sophisticated algorithms that can take advantage of them. In particular, interferometric time series analysis of SAR data provides insights into the changing earth and requires substantial computational power to process data across large regions and over large time periods. This poses challenges for existing infrastructure, software, and techniques required to process, store, and deliver the results to the global community of scientists. The current state-of-the-art solutions employ traditional data storage and processing applications that require download of data to the local repositories before processing. This approach is becoming untenable in light of the enormous volume of data that must be processed in an iterative and collaborative manner. We have analyzed and tested new cloud computing and virtualization approaches to address these challenges within the context of InSAR in the earth science community. Cloud computing is democratizing computational and storage capabilities for science users across the world. The NASA Jet Propulsion Laboratory has been an early adopter of this technology, successfully integrating cloud computing in a variety of production applications ranging from mission operations to downlink data processing. We have ported a new InSAR processing suite called ISCE (InSAR Scientific Computing Environment) to a scalable distributed system running in the Amazon GovCloud to demonstrate the efficacy of cloud computing for this application. We have integrated ISCE with Polyphony to

  18. Paradigms and strategies for scientific computing on distributed memory concurrent computers

    SciTech Connect

    Foster, I.T.; Walker, D.W.

    1994-06-01

    In this work we examine recent advances in parallel languages and abstractions that have the potential for improving the programmability and maintainability of large-scale, parallel, scientific applications running on high performance architectures and networks. This paper focuses on Fortran M, a set of extensions to Fortran 77 that supports the modular design of message-passing programs. We describe the Fortran M implementation of a particle-in-cell (PIC) plasma simulation application, and discuss issues in the optimization of the code. The use of two other methodologies for parallelizing the PIC application are considered. The first is based on the shared object abstraction as embodied in the Orca language. The second approach is the Split-C language. In Fortran M, Orca, and Split-C the ability of the programmer to control the granularity of communication is important is designing an efficient implementation.

  19. National Energy Research Scientific Computing Center (NERSC): Advancing the frontiers of computational science and technology

    SciTech Connect

    Hules, J.

    1996-11-01

    National Energy Research Scientific Computing Center (NERSC) provides researchers with high-performance computing tools to tackle science`s biggest and most challenging problems. Founded in 1974 by DOE/ER, the Controlled Thermonuclear Research Computer Center was the first unclassified supercomputer center and was the model for those that followed. Over the years the center`s name was changed to the National Magnetic Fusion Energy Computer Center and then to NERSC; it was relocated to LBNL. NERSC, one of the largest unclassified scientific computing resources in the world, is the principal provider of general-purpose computing services to DOE/ER programs: Magnetic Fusion Energy, High Energy and Nuclear Physics, Basic Energy Sciences, Health and Environmental Research, and the Office of Computational and Technology Research. NERSC users are a diverse community located throughout US and in several foreign countries. This brochure describes: the NERSC advantage, its computational resources and services, future technologies, scientific resources, and computational science of scale (interdisciplinary research over a decade or longer; examples: combustion in engines, waste management chemistry, global climate change modeling).

  20. A comparative study of serial and parallel aeroelastic computations of wings

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Guruswamy, Guru P.

    1994-01-01

    A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.

  1. The Centre of High-Performance Scientific Computing, Geoverbund, ABC/J - Geosciences enabled by HPSC

    NASA Astrophysics Data System (ADS)

    Kollet, Stefan; Görgen, Klaus; Vereecken, Harry; Gasper, Fabian; Hendricks-Franssen, Harrie-Jan; Keune, Jessica; Kulkarni, Ketan; Kurtz, Wolfgang; Sharples, Wendy; Shrestha, Prabhakar; Simmer, Clemens; Sulis, Mauro; Vanderborght, Jan

    2016-04-01

    The Centre of High-Performance Scientific Computing (HPSC TerrSys) was founded 2011 to establish a centre of competence in high-performance scientific computing in terrestrial systems and the geosciences enabling fundamental and applied geoscientific research in the Geoverbund ABC/J (geoscientfic research alliance of the Universities of Aachen, Cologne, Bonn and the Research Centre Jülich, Germany). The specific goals of HPSC TerrSys are to achieve relevance at the national and international level in (i) the development and application of HPSC technologies in the geoscientific community; (ii) student education; (iii) HPSC services and support also to the wider geoscientific community; and in (iv) the industry and public sectors via e.g., useful applications and data products. A key feature of HPSC TerrSys is the Simulation Laboratory Terrestrial Systems, which is located at the Jülich Supercomputing Centre (JSC) and provides extensive capabilities with respect to porting, profiling, tuning and performance monitoring of geoscientific software in JSC's supercomputing environment. We will present a summary of success stories of HPSC applications including integrated terrestrial model development, parallel profiling and its application from watersheds to the continent; massively parallel data assimilation using physics-based models and ensemble methods; quasi-operational terrestrial water and energy monitoring; and convection permitting climate simulations over Europe. The success stories stress the need for a formalized education of students in the application of HPSC technologies in future.

  2. Performing an allreduce operation on a plurality of compute nodes of a parallel computer

    DOEpatents

    Faraj, Ahmad

    2013-07-09

    Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer, each node including at least two processing cores, that include: establishing, for each node, a plurality of logical rings, each ring including a different set of at least one core on that node, each ring including the cores on at least two of the nodes; iteratively for each node: assigning each core of that node to one of the rings established for that node to which the core has not previously been assigned, and performing, for each ring for that node, a global allreduce operation using contribution data for the cores assigned to that ring or any global allreduce results from previous global allreduce operations, yielding current global allreduce results for each core; and performing, for each node, a local allreduce operation using the global allreduce results.

  3. Performing an allreduce operation on a plurality of compute nodes of a parallel computer

    DOEpatents

    Faraj, Ahmad

    2013-02-12

    Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer, each node including at least two processing cores, that include: performing, for each node, a local reduction operation using allreduce contribution data for the cores of that node, yielding, for each node, a local reduction result for one or more representative cores for that node; establishing one or more logical rings among the nodes, each logical ring including only one of the representative cores from each node; performing, for each logical ring, a global allreduce operation using the local reduction result for the representative cores included in that logical ring, yielding a global allreduce result for each representative core included in that logical ring; and performing, for each node, a local broadcast operation using the global allreduce results for each representative core on that node.

  4. Chaining direct memory access data transfer operations for compute nodes in a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.

    2010-09-28

    Methods, systems, and products are disclosed for chaining DMA data transfer operations for compute nodes in a parallel computer that include: receiving, by an origin DMA engine on an origin node in an origin injection FIFO buffer for the origin DMA engine, a RGET data descriptor specifying a DMA transfer operation data descriptor on the origin node and a second RGET data descriptor on the origin node, the second RGET data descriptor specifying a target RGET data descriptor on the target node, the target RGET data descriptor specifying an additional DMA transfer operation data descriptor on the origin node; creating, by the origin DMA engine, an RGET packet in dependence upon the RGET data descriptor, the RGET packet containing the DMA transfer operation data descriptor and the second RGET data descriptor; and transferring, by the origin DMA engine to a target DMA engine on the target node, the RGET packet.

  5. Advances in time-domain electromagnetic simulation capabilities through the use of overset grids and massively parallel computing

    NASA Astrophysics Data System (ADS)

    Blake, Douglas Clifton

    A new methodology is presented for conducting numerical simulations of electromagnetic scattering and wave-propagation phenomena on massively parallel computing platforms. A process is constructed which is rooted in the Finite-Volume Time-Domain (FVTD) technique to create a simulation capability that is both versatile and practical. In terms of versatility, the method is platform independent, is easily modifiable, and is capable of solving a large number of problems with no alterations. In terms of practicality, the method is sophisticated enough to solve problems of engineering significance and is not limited to mere academic exercises. In order to achieve this capability, techniques are integrated from several scientific disciplines including computational fluid dynamics, computational electromagnetics, and parallel computing. The end result is the first FVTD solver capable of utilizing the highly flexible overset-gridding process in a distributed-memory computing environment. In the process of creating this capability, work is accomplished to conduct the first study designed to quantify the effects of domain-decomposition dimensionality on the parallel performance of hyperbolic partial differential equations solvers; to develop a new method of partitioning a computational domain comprised of overset grids; and to provide the first detailed assessment of the applicability of overset grids to the field of computational electromagnetics. Using these new methods and capabilities, results from a large number of wave propagation and scattering simulations are presented. The overset-grid FVTD algorithm is demonstrated to produce results of comparable accuracy to single-grid simulations while simultaneously shortening the grid-generation process and increasing the flexibility and utility of the FVTD technique. Furthermore, the new domain-decomposition approaches developed for overset grids are shown to be capable of producing partitions that are better load balanced and

  6. Monte Carlo simulation of light propagation in skin tissue phantoms using a parallel computing method

    NASA Astrophysics Data System (ADS)

    Wu, Di M.; Zhao, S. S.; Lu, Jun Q.; Hu, Xin-Hua

    2000-06-01

    In Monte Carlo simulations of light propagating in biological tissues, photons propagating in the media are described as classic particles being scattered and absorbed randomly in the media, and their path are tracked individually. To obtain any statistically significant results, however, a large number of photons is needed in the simulations and the calculations are time consuming and sometime impossible with existing computing resource, especially when considering the inhomogeneous boundary conditions. To overcome this difficulty, we have implemented a parallel computing technique into our Monte Carlo simulations. And this moment is well justified due to the nature of the Monte Carlo simulation. Utilizing the PVM (Parallel Virtual Machine, a parallel computing software package), parallel codes in both C and Fortran have been developed on the massive parallel computer of Cray T3E and a local PC-network running Unix/Sun Solaris. Our results show that parallel computing can significantly reduce the running time and make efficient usage of low cost personal computers. In this report, we present a numerical study of light propagation in a slab phantom of skin tissue using the parallel computing technique.

  7. Locating hardware faults in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-01-12

    Hardware faults location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.

  8. OPENING REMARKS: SciDAC: Scientific Discovery through Advanced Computing

    NASA Astrophysics Data System (ADS)

    Strayer, Michael

    2005-01-01

    Good morning. Welcome to SciDAC 2005 and San Francisco. SciDAC is all about computational science and scientific discovery. In a large sense, computational science characterizes SciDAC and its intent is change. It transforms both our approach and our understanding of science. It opens new doors and crosses traditional boundaries while seeking discovery. In terms of twentieth century methodologies, computational science may be said to be transformational. There are a number of examples to this point. First are the sciences that encompass climate modeling. The application of computational science has in essence created the field of climate modeling. This community is now international in scope and has provided precision results that are challenging our understanding of our environment. A second example is that of lattice quantum chromodynamics. Lattice QCD, while adding precision and insight to our fundamental understanding of strong interaction dynamics, has transformed our approach to particle and nuclear science. The individual investigator approach has evolved to teams of scientists from different disciplines working side-by-side towards a common goal. SciDAC is also undergoing a transformation. This meeting is a prime example. Last year it was a small programmatic meeting tracking progress in SciDAC. This year, we have a major computational science meeting with a variety of disciplines and enabling technologies represented. SciDAC 2005 should position itself as a new corner stone for Computational Science and its impact on science. As we look to the immediate future, FY2006 will bring a new cycle to SciDAC. Most of the program elements of SciDAC will be re-competed in FY2006. The re-competition will involve new instruments for computational science, new approaches for collaboration, as well as new disciplines. There will be new opportunities for virtual experiments in carbon sequestration, fusion, and nuclear power and nuclear waste, as well as collaborations

  9. Molecular Science Computing Facility Scientific Challenges: Linking Across Scales

    SciTech Connect

    De Jong, Wibe A.; Windus, Theresa L.

    2005-07-01

    The purpose of this document is to define the evolving science drivers for performing environmental molecular research at the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) and to provide guidance associated with the next-generation high-performance computing center that must be developed at EMSL's Molecular Science Computing Facility (MSCF) in order to address this critical research. The MSCF is the pre-eminent computing facility?supported by the U.S. Department of Energy's (DOE's) Office of Biological and Environmental Research (BER)?tailored to provide the fastest time-to-solution for current computational challenges in chemistry and biology, as well as providing the means for broad research in the molecular and environmental sciences. The MSCF provides integral resources and expertise to emerging EMSL Scientific Grand Challenges and Collaborative Access Teams that are designed to leverage the multiple integrated research capabilities of EMSL, thereby creating a synergy between computation and experiment to address environmental molecular science challenges critical to DOE and the nation.

  10. Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM

    NASA Astrophysics Data System (ADS)

    de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl

    2002-03-01

    We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 105-106) in reasonable CPU times. The performances of our implementation of the pa! rallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.

  11. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  12. Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

    NASA Technical Reports Server (NTRS)

    Weeks, Cindy Lou

    1986-01-01

    Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.

  13. Parallel computing for simultaneous iterative tomographic imaging by graphics processing units

    NASA Astrophysics Data System (ADS)

    Bello-Maldonado, Pedro D.; López, Ricardo; Rogers, Colleen; Jin, Yuanwei; Lu, Enyue

    2016-05-01

    In this paper, we address the problem of accelerating inversion algorithms for nonlinear acoustic tomographic imaging by parallel computing on graphics processing units (GPUs). Nonlinear inversion algorithms for tomographic imaging often rely on iterative algorithms for solving an inverse problem, thus computationally intensive. We study the simultaneous iterative reconstruction technique (SIRT) for the multiple-input-multiple-output (MIMO) tomography algorithm which enables parallel computations of the grid points as well as the parallel execution of multiple source excitation. Using graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA) programming model an overall improvement of 26.33x was achieved when combining both approaches compared with sequential algorithms. Furthermore we propose an adaptive iterative relaxation factor and the use of non-uniform weights to improve the overall convergence of the algorithm. Using these techniques, fast computations can be performed in parallel without the loss of image quality during the reconstruction process.

  14. Wing-Body Aeroelasticity Using Finite-Difference Fluid/Finite-Element Structural Equations on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Guruswamy, Guru P.; Kutler, Paul (Technical Monitor)

    1994-01-01

    In recent years significant advances have been made for parallel computers in both hardware and software. Now parallel computers have become viable tools in computational mechanics. Many application codes developed on conventional computers have been modified to benefit from parallel computers. Significant speedups in some areas have been achieved by parallel computations. For single-discipline use of both fluid dynamics and structural dynamics, computations have been made on wing-body configurations using parallel computers. However, only a limited amount of work has been completed in combining these two disciplines for multidisciplinary applications. The prime reason is the increased level of complication associated with a multidisciplinary approach. In this work, procedures to compute aeroelasticity on parallel computers using direct coupling of fluid and structural equations will be investigated for wing-body configurations. The parallel computer selected for computations is an Intel iPSC/860 computer which is a distributed-memory, multiple-instruction, multiple data (MIMD) computer with 128 processors. In this study, the computational efficiency issues of parallel integration of both fluid and structural equations will be investigated in detail. The fluid and structural domains will be modeled using finite-difference and finite-element approaches, respectively. Results from the parallel computer will be compared with those from the conventional computers using a single processor. This study will provide an efficient computational tool for the aeroelastic analysis of wing-body structures on MIMD type parallel computers.

  15. Comparisons of Two Viscous Models for Vortex Methods in Parallel Computation

    NASA Astrophysics Data System (ADS)

    Lee, Sang Hwan; Jin, Dong Sik; Yoon, Jin Sup

    A parallel implementation of vortex methods dealing with unsteady viscous flows on a distributed computing environment through Parallel Virtual Machine (PVM) is reported in this paper. We test the recently developed diffusion schemes of vortex methods. We directly compare the particle strength exchange method with the vorticity distribution method in terms of their accuracy and computational efficiency. Comparisons between both viscous models described are presented for the impulsively started flows past a circular cylinder at Reynolds number 60. We also present the comparisons of both methods in their parallel computation efficiency and speed-up ratio.

  16. From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators

    NASA Astrophysics Data System (ADS)

    Yim, Keun Soo

    This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of

  17. Epistemology of scientific inquiry and computer-supported collaborative learning

    NASA Astrophysics Data System (ADS)

    Hakkarainen, Kai Pekka Juhani

    1998-12-01

    The problem addressed in the study was whether 10- and 11-year-old children, collaborating within a computer-supported classroom, could learn a process of inquiry that represented certain principal features of scientific inquiry, namely (1) engagement in increasingly deep levels of explanation, (2) progressive generation of subordinate questions, and (3) collaborative effort to advance explanations. Technical infrastructure for the study was provided by the Computer-Supported Intentional Learning Environments, CSILE. The study was entirely based on qualitative content analysis of students' written productions posted to CSILE's database. Five studies were carried out to analyze CSILE students' process of inquiry. The first two studies aimed at analyzing changes in CSILE students' culture of inquiry in two CSILE classrooms across a three-year period. The results of the studies indicate that the classroom culture changed over three years following the introduction of CSILE. The explanatory level of knowledge produced by the students became increasingly deeper in tracking from the first to third year representing the first principal feature of scientific inquiry. Moreover, between-student communication increasingly focused on facilitating advancement of explanation (the third principal feature). These effects were substantial only in one classroom; the teacher of this class provided strong pedagogical support and epistemological guidance for the students. Detailed analysis of this classroom's inquiry, carried out in the last three studies, indicated that with teacher's guidance the students were able to produce meaningful intuitive explanations as well as go beyond the functional and empirical nature of their intuitive explanations and appropriate theoretical scientific explanations (the first principal feature). Advancement of the students' inquiry appeared to be closely associated with generation of new subordinate questions (the second principal feature) and peer

  18. MEDUSA - An overset grid flow solver for network-based parallel computer systems

    NASA Technical Reports Server (NTRS)

    Smith, Merritt H.; Pallis, Jani M.

    1993-01-01

    Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallel computer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallel computers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallel computer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.

  19. Execution models for mapping programs onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.

  20. JPARSS: A Java Parallel Network Package for Grid Computing

    SciTech Connect

    Chen, Jie; Akers, Walter; Chen, Ying; Watson, William

    2002-03-01

    The emergence of high speed wide area networks makes grid computinga reality. However grid applications that need reliable data transfer still have difficulties to achieve optimal TCP performance due to network tuning of TCP window size to improve bandwidth and to reduce latency on a high speed wide area network. This paper presents a Java package called JPARSS (Java Parallel Secure Stream (Socket)) that divides data into partitions that are sent over several parallel Java streams simultaneously and allows Java or Web applications to achieve optimal TCP performance in a grid environment without the necessity of tuning TCP window size. This package enables single sign-on, certificate delegation and secure or plain-text data transfer using several security components based on X.509 certificate and SSL. Several experiments will be presented to show that using Java parallelstreams is more effective than tuning TCP window size. In addition a simple architecture using Web services