A performance comparison of the Cray-2 and the Cray X-MP
NASA Technical Reports Server (NTRS)
Schmickley, Ronald; Bailey, David H.
1986-01-01
A suite of thirteen large Fortran benchmark codes was run on Cray-2 and Cray X-MP supercomputers. These codes were a mix of compute-intensive scientific application programs (mostly Computational Fluid Dynamics) and some special vectorized computation exercise programs. For the general class of programs tested on the Cray-2, most of which were not specially tuned for speed, the floating-point operation rates varied, under a variety of system load configurations, from 40 percent to 125 percent of X-MP rates. It is concluded that the Cray-2, in the original system configuration studied (without memory pseudo-banking), will run untuned Fortran code at, on average, about 70 percent of X-MP speed.
Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers
NASA Technical Reports Server (NTRS)
Skiles, J. W.; Schulbach, C. H.
1994-01-01
Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We describe the Cray shared-memory vector architecture and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speedup of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5 percent of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.
FFTs in external or hierarchical memory
NASA Technical Reports Server (NTRS)
Bailey, David H.
1989-01-01
A description is given of advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) use strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly 2 Gflops.
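The abstract does not spell out the algorithm, but a common way to realize a two-pass, unit-stride external FFT is the "four-step" factorization, in which an N = N1 x N2 transform becomes column FFTs, a twiddle multiplication, row FFTs, and a transpose. The following minimal in-memory numpy sketch is my own illustration of that factorization, not Bailey's code:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Compute an N-point DFT (N = n1*n2) via the four-step factorization.

    Minimal in-memory sketch of the classic decomposition used by many
    out-of-core / hierarchical-memory FFTs: column FFTs, a twiddle multiply,
    row FFTs, and a transpose. Illustrative only.
    """
    n = n1 * n2
    assert x.size == n
    # View the input as a matrix a[j2, j1] = x[j1 + n1*j2].
    a = x.reshape(n2, n1)
    # Step 1: length-n2 FFTs down the columns.
    b = np.fft.fft(a, axis=0)
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*j1*k2/N).
    k2 = np.arange(n2).reshape(n2, 1)
    j1 = np.arange(n1).reshape(1, n1)
    b *= np.exp(-2j * np.pi * j1 * k2 / n)
    # Step 3: length-n1 FFTs along the rows.
    c = np.fft.fft(b, axis=1)
    # Step 4: transpose so the output appears in natural (ordered) layout.
    return c.T.ravel()

x = np.random.rand(1024) + 1j * np.random.rand(1024)
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```

Roughly speaking, each FFT phase corresponds to one pass over the data, which is how an ordered transform in as few as two passes over external storage becomes possible.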
DOE Office of Scientific and Technical Information (OSTI.GOV)
Maxwell, Don E; Ezell, Matthew A; Becklehimer, Jeff
While sites generally have systems in place to monitor the health of Cray computers themselves, often the cooling systems are ignored until a computer failure requires investigation into the source of the failure. The Liebert XDP units used to cool the Cray XE/XK models as well as the Cray proprietary cooling system used for the Cray XC30 models provide data useful for health monitoring. Unfortunately, this valuable information is often available only to custom solutions not accessible by a center-wide monitoring system or is simply ignored entirely. In this paper, methods and tools used to harvest the available monitoring data are discussed, and the implementation needed to integrate the data into a center-wide monitoring system at the Oak Ridge National Laboratory is provided.
Transferring ecosystem simulation codes to supercomputers
NASA Technical Reports Server (NTRS)
Skiles, J. W.; Schulbach, C. H.
1995-01-01
Many ecosystem simulation computer codes have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Supercomputing platforms (both parallel and distributed systems) have been largely unused, however, because of the perceived difficulty in accessing and using the machines. Also, significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers must be considered. We have transferred a grassland simulation model (developed on a VAX) to a Cray Y-MP/C90. We describe porting the model to the Cray and the changes we made to exploit the parallelism in the application and improve code execution. The Cray executed the model 30 times faster than the VAX and 10 times faster than a Unix workstation. We achieved an additional speedup of 30 percent by using the compiler's vectorizing and in-lining capabilities. The code runs at only about 5 percent of the Cray's peak speed because it ineffectively uses the vector and parallel processing capabilities of the Cray. We expect that by restructuring the code, it could execute an additional six to ten times faster.
Y-MP floating point and Cholesky factorization
NASA Technical Reports Server (NTRS)
Carter, Russell
1991-01-01
The floating-point arithmetics implemented in the Cray 2 and Cray Y-MP computer systems are nearly identical, but large-scale computations performed on the two systems have exhibited significant differences in accuracy. The difference in accuracy is analyzed for the Cholesky factorization algorithm, and it is found that the source of the difference is the subtract magnitude operation of the Cray Y-MP. Results from numerical experiments for a range of problem sizes are presented, and an efficient method for improving the accuracy of the factorization obtained on the Y-MP is described.
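For readers unfamiliar with where rounding enters, the numpy sketch below is a generic unblocked Cholesky factorization, my own illustration rather than the code analyzed in the report; it shows the long subtraction-accumulating inner products through which a machine's subtract magnitude rounding behaviour propagates. The accuracy-recovery method itself is described in the report and not reproduced here.

```python
import numpy as np

def cholesky_lower(a):
    """Unblocked Cholesky factorization A = L L^T for symmetric positive definite A.

    The running subtractions in the inner products below are where the
    rounding of the subtract operation enters the error analysis.
    Minimal illustrative sketch only.
    """
    a = np.array(a, dtype=float)
    n = a.shape[0]
    l = np.zeros_like(a)
    for j in range(n):
        # Diagonal entry: a_jj minus the accumulated squares of row j of L.
        d = a[j, j] - np.dot(l[j, :j], l[j, :j])
        l[j, j] = np.sqrt(d)
        # Column j below the diagonal.
        for i in range(j + 1, n):
            l[i, j] = (a[i, j] - np.dot(l[i, :j], l[j, :j])) / l[j, j]
    return l

# Quick check on a random symmetric positive definite matrix.
m = np.random.rand(6, 6)
spd = m @ m.T + 6 * np.eye(6)
assert np.allclose(cholesky_lower(spd) @ cholesky_lower(spd).T, spd)
```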
NASA Technical Reports Server (NTRS)
Gillian, Ronnie E.; Lotts, Christine G.
1988-01-01
The Computational Structural Mechanics (CSM) Activity at Langley Research Center is developing methods for structural analysis on modern computers. To facilitate that research effort, an applications development environment has been constructed to insulate the researcher from the many computer operating systems of a widely distributed computer network. The CSM Testbed development system was ported to the Numerical Aerodynamic Simulator (NAS) Cray-2, at the Ames Research Center, to provide a high end computational capability. This paper describes the implementation experiences, the resulting capability, and the future directions for the Testbed on supercomputers.
New computing systems and their impact on structural analysis and design
NASA Technical Reports Server (NTRS)
Noor, Ahmed K.
1989-01-01
A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
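One standard way to write the decomposition the abstract refers to (my own notation, not necessarily the paper's) is

u = u_s + u_a,  with  u_s = (u + R u)/2  and  u_a = (u - R u)/2,

where R is the reflection operator associated with the structure's plane of symmetry. Each of u_s and u_a is obtained from a reduced problem involving only a fraction of the degrees of freedom of the original finite element model, and the reduced problems can be solved concurrently on separate processors, which is the source of the parallelism exploited on the CRAY X-MP/4 and Alliant FX/8.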
SNS programming environment user's guide
NASA Technical Reports Server (NTRS)
Tennille, Geoffrey M.; Howser, Lona M.; Humes, D. Creig; Cronin, Catherine K.; Bowen, John T.; Drozdowski, Joseph M.; Utley, Judith A.; Flynn, Theresa M.; Austin, Brenda A.
1992-01-01
The computing environment is briefly described for the Supercomputing Network Subsystem (SNS) of the Central Scientific Computing Complex of NASA Langley. The major SNS computers are a CRAY-2, a CRAY Y-MP, a CONVEX C-210, and a CONVEX C-220. The software is described that is common to all of these computers, including: the UNIX operating system, computer graphics, networking utilities, mass storage, and mathematical libraries. Also described is file management, validation, SNS configuration, documentation, and customer services.
Large-scale structural analysis: The structural analyst, the CSM Testbed and the NAS System
NASA Technical Reports Server (NTRS)
Knight, Norman F., Jr.; Mccleary, Susan L.; Macy, Steven C.; Aminpour, Mohammad A.
1989-01-01
The Computational Structural Mechanics (CSM) activity is developing advanced structural analysis and computational methods that exploit high-performance computers. Methods are developed in the framework of the CSM testbed software system and applied to representative complex structural analysis problems from the aerospace industry. An overview of the CSM testbed methods development environment is presented and some numerical methods developed on a CRAY-2 are described. Selected application studies performed on the NAS CRAY-2 are also summarized.
NASA Technical Reports Server (NTRS)
Gentzsch, W.
1982-01-01
Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.
Integrating Grid Services into the Cray XT4 Environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
NERSC; Cholia, Shreyas; Lin, Hwa-Chun Wendy
2009-05-01
The 38,640-core Cray XT4 "Franklin" system at the National Energy Research Scientific Computing Center (NERSC) is a massively parallel resource available to Department of Energy researchers that also provides on-demand grid computing to the Open Science Grid. The integration of grid services on Franklin presented various challenges, including fundamental differences between the interactive and compute nodes, a stripped-down compute-node operating system without dynamic library support, a shared-root environment, and idiosyncratic application launching. In our work, we describe how we resolved these challenges on a running, general-purpose production system to provide on-demand compute, storage, accounting, and monitoring services through generic grid interfaces that mask the underlying system-specific details for the end user.
Performance of the fusion code GYRO on four generations of Cray computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fahey, Mark R
2014-01-01
GYRO is a code used for the direct numerical simulation of plasma microturbulence. It has been ported to a variety of modern MPP platforms, including several commodity clusters, IBM SPs, and Cray XC, XT, and XE series machines. We briefly describe the mathematical structure of the equations, the data layout, and the redistribution scheme. Also, while the performance and scaling of GYRO on many of these systems has been shown before, here we show the comparative performance and scaling on four generations of Cray supercomputers, including the newest addition, the Cray XC30. The more recently added hybrid OpenMP/MPI implementation also shows a great deal of promise on custom HPC systems that utilize fast CPUs and proprietary interconnects. Four machines of varying sizes were used in the experiment, all of which are located at the National Institute for Computational Sciences at the University of Tennessee at Knoxville and at Oak Ridge National Laboratory. The advantages, limitations, and performance of using each system are discussed.
Application of high-performance computing to numerical simulation of human movement
NASA Technical Reports Server (NTRS)
Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
1995-01-01
We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
Production Experiences with the Cray-Enabled TORQUE Resource Manager
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ezell, Matthew A; Maxwell, Don E; Beer, David
High performance computing resources utilize batch systems to manage the user workload. Cray systems are uniquely different from typical clusters due to Cray's Application Level Placement Scheduler (ALPS). ALPS manages binary transfer, job launch and monitoring, and error handling. Batch systems require special support to integrate with ALPS using an XML protocol called BASIL. Previous versions of Adaptive Computing's TORQUE and Moab batch suite integrated with ALPS from within Moab, using Perl scripts to interface with BASIL. This occasionally led to problems when the components became unsynchronized. Version 4.1 of the TORQUE Resource Manager introduced new features that allow it to integrate directly with ALPS using BASIL. This paper describes production experiences at Oak Ridge National Laboratory using the new TORQUE software versions, as well as ongoing and future work to improve TORQUE.
Multitasking the three-dimensional transport code TORT on CRAY platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azmy, Y.Y.; Barnett, D.A.; Burre, C.A.
1996-04-01
The multitasking options in the three-dimensional neutral particle transport code TORT, originally implemented for Cray's CTSS operating system, are revived and extended to run on Cray Y-MP and C90 computers using the UNICOS operating system. These include two coarse-grained domain decompositions: across octants and across directions within an octant, termed Octant Parallel (OP) and Direction Parallel (DP), respectively. Parallel performance of the DP is significantly enhanced by increasing the task grain size and reducing load imbalance via dynamic scheduling of the discrete angles among the participating tasks. Substantial wall clock speedup factors, approaching 4.5 using 8 tasks, have been measured in a time-sharing environment, and generally depend on the test problem specifications, number of tasks, and machine loading during execution.
Edison - A New Cray Supercomputer Advances Discovery at NERSC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dosanjh, Sudip; Parkinson, Dula; Yelick, Kathy
2014-02-06
When a supercomputing center installs a new system, users are invited to make heavy use of the computer as part of the rigorous testing. In this video, find out what top scientists have discovered using Edison, a Cray XC30 supercomputer, and how NERSC's newest supercomputer will accelerate their future research.
Edison - A New Cray Supercomputer Advances Discovery at NERSC
Dosanjh, Sudip; Parkinson, Dula; Yelick, Kathy; Trebotich, David; Broughton, Jeff; Antypas, Katie; Lukic, Zarija; Borrill, Julian; Draney, Brent; Chen, Jackie
2018-01-16
When a supercomputing center installs a new system, users are invited to make heavy use of the computer as part of the rigorous testing. In this video, find out what top scientists have discovered using Edison, a Cray XC30 supercomputer, and how NERSC's newest supercomputer will accelerate their future research.
NASA Technical Reports Server (NTRS)
Purdon, David J.; Baruah, Pranab K.; Bussoletti, John E.; Epton, Michael A.; Massena, William A.; Nelson, Franklin D.; Tsurusaki, Kiyoharu
1990-01-01
The Maintenance Document Version 3.0 is a guide to the PAN AIR software system, a system which computes the subsonic or supersonic linear potential flow about a body of nearly arbitrary shape, using a higher order panel method. The document describes the overall system and each program module of the system. Sufficient detail is given for program maintenance, updating, and modification. It is assumed that the reader is familiar with programming and CRAY computer systems. The PAN AIR system was written in FORTRAN 4 language except for a few CAL language subroutines which exist in the PAN AIR library. Structured programming techniques were used to provide code documentation and maintainability. The operating systems accommodated are COS 1.11, COS 1.12, COS 1.13, and COS 1.14 on the CRAY 1S, 1M, and X-MP computing systems. The system is comprised of a data base management system, a program library, an execution control module, and nine separate FORTRAN technical modules. Each module calculates part of the posed PAN AIR problem. The data base manager is used to communicate between modules and within modules. The technical modules must be run in a prescribed fashion for each PAN AIR problem. In order to ease the problem of supplying the many JCL cards required to execute the modules, a set of CRAY procedures (PAPROCS) was created to automatically supply most of the JCL cards. Most of this document has not changed for Version 3.0. It now, however, strictly applies only to PAN AIR version 3.0. The major changes are: (1) additional sections covering the new FDP module (which calculates streamlines and offbody points); (2) a complete rewrite of the section on the MAG module; and (3) strict applicability to CRAY computing systems.
CSM Testbed Development and Large-Scale Structural Applications
NASA Technical Reports Server (NTRS)
Knight, Norman F., Jr.; Gillian, R. E.; Mccleary, Susan L.; Lotts, C. G.; Poole, E. L.; Overman, A. L.; Macy, S. C.
1989-01-01
A research activity called Computational Structural Mechanics (CSM) conducted at the NASA Langley Research Center is described. This activity is developing advanced structural analysis and computational methods that exploit high-performance computers. Methods are developed in the framework of the CSM Testbed software system and applied to representative complex structural analysis problems from the aerospace industry. An overview of the CSM Testbed methods development environment is presented and some new numerical methods developed on a CRAY-2 are described. Selected application studies performed on the NAS CRAY-2 are also summarized.
The solution of linear systems of equations with a structural analysis code on the NAS CRAY-2
NASA Technical Reports Server (NTRS)
Poole, Eugene L.; Overman, Andrea L.
1988-01-01
Two methods for solving linear systems of equations on the NAS Cray-2 are described. One is a direct method; the other is an iterative method. Both methods exploit the architecture of the Cray-2, particularly the vectorization, and are aimed at structural analysis applications. To demonstrate and evaluate the methods, they were installed in a finite element structural analysis code denoted the Computational Structural Mechanics (CSM) Testbed. A description of the techniques used to integrate the two solvers into the Testbed is given. Storage schemes, memory requirements, operation counts, and reformatting procedures are discussed. Finally, results from the new methods are compared with results from the initial Testbed sparse Choleski equation solver for three structural analysis problems. The new direct solvers described achieve the highest computational rates of the methods compared. The new iterative methods are not able to achieve as high computation rates as the vectorized direct solvers but are best for well conditioned problems which require fewer iterations to converge to the solution.
ARC2D - EFFICIENT SOLUTION METHODS FOR THE NAVIER-STOKES EQUATIONS (CRAY VERSION)
NASA Technical Reports Server (NTRS)
Pulliam, T. H.
1994-01-01
ARC2D is a computational fluid dynamics program developed at the NASA Ames Research Center specifically for airfoil computations. The program uses implicit finite-difference techniques to solve two-dimensional Euler equations and thin layer Navier-Stokes equations. It is based on the Beam and Warming implicit approximate factorization algorithm in generalized coordinates. The methods are either time accurate or accelerated non-time accurate steady state schemes. The evolution of the solution through time is physically realistic; good solution accuracy is dependent on mesh spacing and boundary conditions. The mathematical development of ARC2D begins with the strong conservation law form of the two-dimensional Navier-Stokes equations in Cartesian coordinates, which admits shock capturing. The Navier-Stokes equations can be transformed from Cartesian coordinates to generalized curvilinear coordinates in a manner that permits one computational code to serve a wide variety of physical geometries and grid systems. ARC2D includes an algebraic mixing length model to approximate the effect of turbulence. In cases of high Reynolds number viscous flows, thin layer approximation can be applied. ARC2D allows for a variety of solutions to stability boundaries, such as those encountered in flows with shocks. The user has considerable flexibility in assigning geometry and developing grid patterns, as well as in assigning boundary conditions. However, the ARC2D model is most appropriate for attached and mildly separated boundary layers; no attempt is made to model wake regions and widely separated flows. The techniques have been successfully used for a variety of inviscid and viscous flowfield calculations. The Cray version of ARC2D is written in FORTRAN 77 for use on Cray series computers and requires approximately 5Mb memory. The program is fully vectorized. The tape includes variations for the COS and UNICOS operating systems. Also included is a sample routine for CONVEX computers to emulate Cray system time calls, which should be easy to modify for other machines as well. The standard distribution media for this version is a 9-track 1600 BPI ASCII Card Image format magnetic tape. The Cray version was developed in 1987. The IBM ES/3090 version is an IBM port of the Cray version. It is written in IBM VS FORTRAN and has the capability of executing in both vector and parallel modes on the MVS/XA operating system and in vector mode on the VM/XA operating system. Various options of the IBM VS FORTRAN compiler provide new features for the ES/3090 version, including 64-bit arithmetic and up to 2 GB of virtual addressability. The IBM ES/3090 version is available only as a 9-track, 1600 BPI IBM IEBCOPY format magnetic tape. The IBM ES/3090 version was developed in 1989. The DEC RISC ULTRIX version is a DEC port of the Cray version. It is written in FORTRAN 77 for RISC-based Digital Equipment platforms. The memory requirement is approximately 7Mb of main memory. It is available in UNIX tar format on TK50 tape cartridge. The port to DEC RISC ULTRIX was done in 1990. COS and UNICOS are trademarks and Cray is a registered trademark of Cray Research, Inc. IBM, ES/3090, VS FORTRAN, MVS/XA, and VM/XA are registered trademarks of International Business Machines. DEC and ULTRIX are registered trademarks of Digital Equipment Corporation.
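For orientation, the delta-form approximate factorization at the heart of the Beam and Warming scheme can be written schematically (my own summary from the general literature, omitting viscous and artificial dissipation terms, not copied from the ARC2D documentation) as

(I + h delta_xi A^n)(I + h delta_eta B^n) dQ^n = -h (delta_xi E^n + delta_eta F^n),   Q^(n+1) = Q^n + dQ^n,

where A and B are the flux Jacobians, delta_xi and delta_eta are difference operators in the two curvilinear directions, and h is the time step; each factor yields block-tridiagonal systems that are solved by sweeps in one coordinate direction at a time.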
Chemical calculations on Cray computers
NASA Technical Reports Server (NTRS)
Taylor, Peter R.; Bauschlicher, Charles W., Jr.; Schwenke, David W.
1989-01-01
The influence of recent developments in supercomputing on computational chemistry is discussed with particular reference to Cray computers and their pipelined vector/limited parallel architectures. After reviewing Cray hardware and software the performance of different elementary program structures are examined, and effective methods for improving program performance are outlined. The computational strategies appropriate for obtaining optimum performance in applications to quantum chemistry and dynamics are discussed. Finally, some discussion is given of new developments and future hardware and software improvements.
NASA Technical Reports Server (NTRS)
Tennille, Geoffrey M.; Howser, Lona M.
1993-01-01
This document briefly describes the use of the CRAY supercomputers that are an integral part of the Supercomputing Network Subsystem of the Central Scientific Computing Complex at LaRC. Features of the CRAY supercomputers are covered, including: FORTRAN, C, PASCAL, architectures of the CRAY-2 and CRAY Y-MP, the CRAY UNICOS environment, batch job submittal, debugging, performance analysis, parallel processing, utilities unique to CRAY, and documentation. The document is intended for all CRAY users as a ready reference to frequently asked questions and to more detailed information contained in the vendor manuals. It is appropriate for both the novice and the experienced user.
Data communication requirements for the advanced NAS network
NASA Technical Reports Server (NTRS)
Levin, Eugene; Eaton, C. K.; Young, Bruce
1986-01-01
The goal of the Numerical Aerodynamic Simulation (NAS) Program is to provide a powerful computational environment for advanced research and development in aeronautics and related disciplines. The present NAS system consists of a Cray 2 supercomputer connected by a data network to a large mass storage system, to sophisticated local graphics workstations, and by remote communications to researchers throughout the United States. The program plan is to continue acquiring the most powerful supercomputers as they become available. In the 1987/1988 time period it is anticipated that a computer with 4 times the processing speed of a Cray 2 will be obtained and by 1990 an additional supercomputer with 16 times the speed of the Cray 2. The implications of this 20-fold increase in processing power on the data communications requirements are described. The analysis was based on models of the projected workload and system architecture. The results are presented together with the estimates of their sensitivity to assumptions inherent in the models.
NASA Langley Research Center's distributed mass storage system
NASA Technical Reports Server (NTRS)
Pao, Juliet Z.; Humes, D. Creig
1993-01-01
There is a trend in institutions with high performance computing and data management requirements to explore mass storage systems with peripherals directly attached to a high speed network. The Distributed Mass Storage System (DMSS) Project at NASA LaRC is building such a system and expects to put it into production use by the end of 1993. This paper presents the design of the DMSS, some experiences in its development and use, and a performance analysis of its capabilities. The special features of this system are: (1) workstation class file servers running UniTree software; (2) third party I/O; (3) HIPPI network; (4) HIPPI/IPI3 disk array systems; (5) Storage Technology Corporation (STK) ACS 4400 automatic cartridge system; (6) CRAY Research Incorporated (CRI) CRAY Y-MP and CRAY-2 clients; (7) file server redundancy provision; and (8) a transition mechanism from the existent mass storage system to the DMSS.
Gigaflop performance on a CRAY-2: Multitasking a computational fluid dynamics application
NASA Technical Reports Server (NTRS)
Tennille, Geoffrey M.; Overman, Andrea L.; Lambiotte, Jules J.; Streett, Craig L.
1991-01-01
The methodology is described for converting a large, long-running applications code that executed on a single processor of a CRAY-2 supercomputer to a version that executed efficiently on multiple processors. Although the conversion of every application is different, a discussion of the types of modification used to achieve gigaflop performance is included to assist others in the parallelization of applications for CRAY computers, especially those that were developed for other computers. An existing application, from the discipline of computational fluid dynamics, that had utilized over 2000 hrs of CPU time on CRAY-2 during the previous year was chosen as a test case to study the effectiveness of multitasking on a CRAY-2. The nature of dominant calculations within the application indicated that a sustained computational rate of 1 billion floating-point operations per second, or 1 gigaflop, might be achieved. The code was first analyzed and modified for optimal performance on a single processor in a batch environment. After optimal performance on a single CPU was achieved, the code was modified to use multiple processors in a dedicated environment. The results of these two efforts were merged into a single code that had a sustained computational rate of over 1 gigaflop on a CRAY-2. Timings and analysis of performance are given for both single- and multiple-processor runs.
NASA Technical Reports Server (NTRS)
Howe, G.; Saunders, D.
1983-01-01
Users of the CDC 7600 at Ames are assisted in making the transition to the CRAY-1. Similarities and differences in the basic JCL are summarized, and a dozen or so examples of typical batch jobs for the two systems are shown in parallel. Some changes to look for in FORTRAN programs and in the use of UPDATE are also indicated. No attempt is made to cover magnetic tape handling. The material here should not be considered a substitute for reading the more conventional manuals or the User's Guide for the Advanced Computational Facility, available from the Computer Information Center.
ATLAS and LHC computing on CRAY
NASA Astrophysics Data System (ADS)
Sciacca, F. G.; Haug, S.; ATLAS Collaboration
2017-10-01
Access to and exploitation of large-scale computing resources, such as those offered by general-purpose HPC centres, is one important measure for ATLAS and the other Large Hadron Collider experiments in order to meet the challenge posed by the full exploitation of future data within the constraints of flat budgets. We report on the effort of moving the Swiss WLCG T2 computing, serving ATLAS, CMS and LHCb, from a dedicated cluster to the large Cray systems at the Swiss National Supercomputing Centre CSCS. These systems not only offer very efficient hardware, cooling, and highly competent operators, but also have large backfill potential due to their size and multidisciplinary usage, and offer potential gains from economies of scale. Technical solutions, performance, expected return, and future plans are discussed.
ARC2D - EFFICIENT SOLUTION METHODS FOR THE NAVIER-STOKES EQUATIONS (DEC RISC ULTRIX VERSION)
NASA Technical Reports Server (NTRS)
Biyabani, S. R.
1994-01-01
ARC2D is a computational fluid dynamics program developed at the NASA Ames Research Center specifically for airfoil computations. The program uses implicit finite-difference techniques to solve two-dimensional Euler equations and thin layer Navier-Stokes equations. It is based on the Beam and Warming implicit approximate factorization algorithm in generalized coordinates. The methods are either time accurate or accelerated non-time accurate steady state schemes. The evolution of the solution through time is physically realistic; good solution accuracy is dependent on mesh spacing and boundary conditions. The mathematical development of ARC2D begins with the strong conservation law form of the two-dimensional Navier-Stokes equations in Cartesian coordinates, which admits shock capturing. The Navier-Stokes equations can be transformed from Cartesian coordinates to generalized curvilinear coordinates in a manner that permits one computational code to serve a wide variety of physical geometries and grid systems. ARC2D includes an algebraic mixing length model to approximate the effect of turbulence. In cases of high Reynolds number viscous flows, thin layer approximation can be applied. ARC2D allows for a variety of solutions to stability boundaries, such as those encountered in flows with shocks. The user has considerable flexibility in assigning geometry and developing grid patterns, as well as in assigning boundary conditions. However, the ARC2D model is most appropriate for attached and mildly separated boundary layers; no attempt is made to model wake regions and widely separated flows. The techniques have been successfully used for a variety of inviscid and viscous flowfield calculations. The Cray version of ARC2D is written in FORTRAN 77 for use on Cray series computers and requires approximately 5Mb memory. The program is fully vectorized. The tape includes variations for the COS and UNICOS operating systems. Also included is a sample routine for CONVEX computers to emulate Cray system time calls, which should be easy to modify for other machines as well. The standard distribution media for this version is a 9-track 1600 BPI ASCII Card Image format magnetic tape. The Cray version was developed in 1987. The IBM ES/3090 version is an IBM port of the Cray version. It is written in IBM VS FORTRAN and has the capability of executing in both vector and parallel modes on the MVS/XA operating system and in vector mode on the VM/XA operating system. Various options of the IBM VS FORTRAN compiler provide new features for the ES/3090 version, including 64-bit arithmetic and up to 2 GB of virtual addressability. The IBM ES/3090 version is available only as a 9-track, 1600 BPI IBM IEBCOPY format magnetic tape. The IBM ES/3090 version was developed in 1989. The DEC RISC ULTRIX version is a DEC port of the Cray version. It is written in FORTRAN 77 for RISC-based Digital Equipment platforms. The memory requirement is approximately 7Mb of main memory. It is available in UNIX tar format on TK50 tape cartridge. The port to DEC RISC ULTRIX was done in 1990. COS and UNICOS are trademarks and Cray is a registered trademark of Cray Research, Inc. IBM, ES/3090, VS FORTRAN, MVS/XA, and VM/XA are registered trademarks of International Business Machines. DEC and ULTRIX are registered trademarks of Digital Equipment Corporation.
Efficient multitasking of Choleski matrix factorization on CRAY supercomputers
NASA Technical Reports Server (NTRS)
Overman, Andrea L.; Poole, Eugene L.
1991-01-01
A Choleski method is described and used to solve linear systems of equations that arise in large scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gittens, Alex; Kottalam, Jey; Yang, Jiyan
We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1TB size dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly access of matrices and vector computation using SIMD units. We report these results and their implications on the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.
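As background on the method itself, a CX factorization approximates A by C X, where C holds actual columns of A chosen with probabilities proportional to (approximate) leverage scores and X = C^+ A is a least-squares fit. The numpy sketch below is a minimal generic illustration under my own naming, not the paper's Spark or C implementation:

```python
import numpy as np

def randomized_cx(a, k, c, rng=None):
    """Randomized CX factorization: A ~= C X, with C a subset of A's columns.

    Columns are sampled with probabilities proportional to approximate rank-k
    leverage scores obtained from a random sketch. Generic sketch of the
    technique, not the paper's implementation.
    """
    rng = np.random.default_rng(rng)
    m, n = a.shape
    # Approximate the top-k right singular subspace via a Gaussian sketch.
    omega = rng.standard_normal((m, k + 10))
    _, _, vt = np.linalg.svd(omega.T @ a, full_matrices=False)
    v_k = vt[:k].T                      # n x k basis for the row space
    lev = np.sum(v_k**2, axis=1)        # approximate column leverage scores
    p = lev / lev.sum()
    cols = rng.choice(n, size=c, replace=False, p=p)
    c_mat = a[:, cols]
    x = np.linalg.pinv(c_mat) @ a       # least-squares fit X = C^+ A
    return c_mat, x, cols

a = np.random.rand(200, 50)
c_mat, x, cols = randomized_cx(a, k=5, c=10, rng=0)
rel_err = np.linalg.norm(a - c_mat @ x) / np.linalg.norm(a)
```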
Internal computational fluid mechanics on supercomputers for aerospace propulsion systems
NASA Technical Reports Server (NTRS)
Andersen, Bernhard H.; Benson, Thomas J.
1987-01-01
The accurate calculation of three-dimensional internal flowfields for application towards aerospace propulsion systems requires computational resources available only on supercomputers. A survey is presented of three-dimensional calculations of hypersonic, transonic, and subsonic internal flowfields conducted at the Lewis Research Center. A steady state Parabolized Navier-Stokes (PNS) solution of flow in a Mach 5.0, mixed compression inlet, a Navier-Stokes solution of flow in the vicinity of a terminal shock, and a PNS solution of flow in a diffusing S-bend with vortex generators are presented and discussed. All of these calculations were performed on either the NAS Cray-2 or the Lewis Research Center Cray XMP.
Introducing Argonne’s Theta Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
None
Theta, the Argonne Leadership Computing Facility’s (ALCF) new Intel-Cray supercomputer, is officially open to the research community. Theta’s massively parallel, many-core architecture puts the ALCF on the path to Aurora, the facility’s future Intel-Cray system. Capable of nearly 10 quadrillion calculations per second, Theta enables researchers to break new ground in scientific investigations that range from modeling the inner workings of the brain to developing new materials for renewable energy applications.
A performance comparison of current HPC systems: Blue Gene/Q, Cray XE6 and InfiniBand systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kerbyson, Darren J.; Barker, Kevin J.; Vishnu, Abhinav
2014-01-01
We present here a performance analysis of three current architectures that have become commonplace in the High Performance Computing world. Blue Gene/Q is the third generation of systems from IBM that use modestly performing cores, but at large scale, in order to achieve high performance. The XE6 is the latest in a long line of Cray systems that use a 3-D topology, but the first to use the Gemini interconnection network. InfiniBand provides the flexibility of using compute nodes from many vendors that can be connected in many possible topologies. The performance characteristics of each vary vastly, and the way in which nodes are allocated in each type of system can significantly impact achieved performance. In this work we compare these three systems using a combination of micro-benchmarks and a set of production applications. In addition we also examine the differences in performance variability observed on each system and quantify the lost performance using a combination of both empirical measurements and performance models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.
NASA Technical Reports Server (NTRS)
McGuire, Tim
1998-01-01
In this paper, we report the results of our recent research on the application of a multiprocessor Cray T916 supercomputer in modeling super-thermal electron transport in the earth's magnetic field. In general, this mathematical model requires numerical solution of a system of partial differential equations. The code we use for this model is moderately vectorized. By using Amdahl's Law for vector processors, it can be verified that the code is about 60% vectorized on a Cray computer. Speedup factors on the order of 2.5 were obtained compared to the unvectorized code. In the following sections, we discuss the methodology of improving the code. In addition to our goal of optimizing the code for solution on the Cray computer, we had the goal of scalability in mind. Scalability combines the concepts of portability with near-linear speedup. Specifically, a scalable program is one whose performance is portable across many different architectures with differing numbers of processors for many different problem sizes. Though we have access to a Cray at this time, the goal was to also have code which would run well on a variety of architectures.
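The 60 percent figure and the factor of about 2.5 are consistent under Amdahl's Law for vector processors: if a fraction f of the work vectorizes and vector code runs v times faster than scalar code, the overall speedup is

S(v) = 1 / ((1 - f) + f / v),

which for f of about 0.6 is bounded by 1 / (1 - 0.6) = 2.5 even as v grows without bound, matching the observed speedup of roughly 2.5 over the unvectorized code.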
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Development of a CRAY 1 version of the SINDA program. [thermo-structural analyzer program
NASA Technical Reports Server (NTRS)
Juba, S. M.; Fogerson, P. E.
1982-01-01
The SINDA thermal analyzer program was transferred from the UNIVAC 1110 computer to a CYBER and then to a CRAY 1. Significant changes to the code of the program were required in order to execute efficiently on the CYBER and CRAY. The program was tested on the CRAY using a thermal math model of the shuttle which was too large to run on either the UNIVAC or CYBER. An effort was then begun to further modify the code of SINDA in order to make effective use of the vector capabilities of the CRAY.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mubarak, Misbah; Ross, Robert B.
This technical report describes the experiments performed to validate the MPI performance measurements reported by the CODES dragonfly network simulation with the Theta Cray XC system at the Argonne Leadership Computing Facility (ALCF).
NASA Technical Reports Server (NTRS)
Babrauckas, Theresa
2000-01-01
The Affordable High Performance Computing (AHPC) project demonstrated that high-performance computing based on a distributed network of computer workstations is a cost-effective alternative to vector supercomputers for running CPU- and memory-intensive design and analysis tools. The AHPC project created an integrated system called a Network Supercomputer. By connecting computer workstations through a network and utilizing the workstations when they are idle, the resulting distributed-workstation environment has the same performance and reliability levels as the Cray C90 vector supercomputer at less than 25 percent of the C90 cost. In fact, a cost comparison between a Cray C90 supercomputer and Sun workstations showed that the set of networked workstations equivalent to a C90 costs approximately 8 percent as much as the C90.
Multitasking a three-dimensional Navier-Stokes algorithm on the Cray-2
NASA Technical Reports Server (NTRS)
Swisshelm, Julie M.
1989-01-01
A three-dimensional computational aerodynamics algorithm has been multitasked for efficient parallel execution on the Cray-2. It provides a means for examining the multitasking performance of a complete CFD application code. An embedded zonal multigrid scheme is used to solve the Reynolds-averaged Navier-Stokes equations for an internal flow model problem. The explicit nature of each component of the method allows a spatial partitioning of the computational domain to achieve a well-balanced task load for MIMD computers with vector-processing capability. Experiments have been conducted with both two- and three-dimensional multitasked cases. The best speedup attained by an individual task group was 3.54 on four processors of the Cray-2, while the entire solver yielded a speedup of 2.67 on four processors for the three-dimensional case. The multiprocessing efficiency of various types of computational tasks is examined, performance on two Cray-2s with different memory access speeds is compared, and extrapolation to larger problems is discussed.
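Expressed as parallel efficiency E = S/p, these figures correspond to about 3.54/4 = 0.89 for the best individual task group and 2.67/4 = 0.67 for the complete three-dimensional solver on four Cray-2 processors, the difference reflecting the portions of the zonal multigrid scheme that multitask less effectively.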
Force user's manual: A portable, parallel FORTRAN
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.
1990-01-01
The use of Force, a parallel, portable FORTRAN for shared-memory parallel computers, is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray Y-MP, Convex 220, Flex/32, Encore, Sequent, and Alliant computers on which it is installed.
Implementation of an ADI method on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
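For reference, the per-line Gaussian elimination that the ADI sweeps rely on is the Thomas algorithm, sketched below in numpy under my own naming (not the study's code); its forward and backward recurrences are inherently sequential, which is why the bit-serial SIMD MPP used cyclic elimination instead.

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by Gaussian elimination (Thomas algorithm).

    lower[i] multiplies x[i-1], diag[i] multiplies x[i], and upper[i]
    multiplies x[i+1]. The recurrences below are sequential along the line,
    which is why SIMD machines favour cyclic (odd-even) elimination instead.
    """
    n = len(diag)
    d = np.array(diag, dtype=float)
    b = np.array(rhs, dtype=float)
    for i in range(1, n):                     # forward elimination
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        b[i] -= w * b[i - 1]
    x = np.zeros(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x

# Quick check against a dense solve on a diagonally dominant system.
n = 8
lo = np.r_[0.0, np.random.rand(n - 1)]
di = 4.0 + np.random.rand(n)
up = np.r_[np.random.rand(n - 1), 0.0]
a = np.diag(di) + np.diag(lo[1:], -1) + np.diag(up[:-1], 1)
rhs = np.random.rand(n)
assert np.allclose(thomas_solve(lo, di, up, rhs), np.linalg.solve(a, rhs))
```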
FAST: A multi-processed environment for visualization of computational fluid dynamics
NASA Technical Reports Server (NTRS)
Bancroft, Gordon V.; Merritt, Fergus J.; Plessel, Todd C.; Kelaita, Paul G.; Mccabe, R. Kevin
1991-01-01
Three-dimensional, unsteady, multi-zoned fluid dynamics simulations over full-scale aircraft are typical of the problems being investigated at NASA Ames' Numerical Aerodynamic Simulation (NAS) facility on CRAY-2 and CRAY Y-MP supercomputers. With multiple processor workstations available in the 10-30 Mflop range, we feel that these new developments in scientific computing warrant a new approach to the design and implementation of analysis tools. These larger, more complex problems create a need for new visualization techniques not possible with the existing software or systems available as of this writing. The visualization techniques will change as the supercomputing environment, and hence the scientific methods employed, evolves even further. The Flow Analysis Software Toolkit (FAST), an implementation of a software system for fluid mechanics analysis, is discussed.
Hot Chips and Hot Interconnects for High End Computing Systems
NASA Technical Reports Server (NTRS)
Saini, Subhash
2005-01-01
I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in SGI Altix systems and clusters, respectively; 4. The IBM System-on-a-Chip used in IBM BlueGene/L; 5. The HP Alpha EV68 processor used in the DOE ASCI Q cluster; 6. The SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in the NEC SX-6/7; 8. The Power 4+ processor, which is used in the Hitachi SR11000; 9. The NEC proprietary processor used in the Earth Simulator. The IBM POWER5 and Red Storm computing systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on the latest NAS Parallel Benchmark results (MPI, OpenMP/HPF, and hybrid MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).
Strategies for vectorizing the sparse matrix vector product on the CRAY XMP, CRAY 2, and CYBER 205
NASA Technical Reports Server (NTRS)
Bauschlicher, Charles W., Jr.; Partridge, Harry
1987-01-01
Large, randomly sparse matrix-vector products are important in a number of applications in computational chemistry, such as matrix diagonalization and the solution of simultaneous equations. Vectorization of this process is considered for the CRAY XMP, CRAY 2, and CYBER 205, using a matrix of dimension 20,000 with 1 to 6 percent nonzeros. Efficient scatter/gather capabilities add coding flexibility and yield significant improvements in performance. For the CYBER 205, it is shown that minor changes in the I/O can reduce the CPU time by a factor of 50. Similar changes in the CRAY codes make a far smaller improvement.
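To make the role of gather concrete, the numpy sketch below (my own illustration, not the paper's code) performs a sparse matrix-vector product in compressed sparse row form; the indexed load x[cols] is the gather operation that hardware scatter/gather support allows a vector machine to execute efficiently.

```python
import numpy as np

def csr_matvec(values, col_index, row_ptr, x):
    """y = A x with A stored in compressed sparse row (CSR) form.

    The x[cols] step is a gather: an indexed load from the dense vector at
    the column positions of the nonzeros in the current row. Illustrative
    sketch only.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        cols = col_index[start:end]
        y[i] = np.dot(values[start:end], x[cols])   # gather + dot product
    return y

# 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
vals = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
cols = np.array([0, 2, 1, 0, 2])
rptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(csr_matvec(vals, cols, rptr, x), [7.0, 6.0, 17.0])
```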
The International Conference on Vector and Parallel Computing (2nd)
1989-01-17
Contents include "Computation of the SVD of Bidiagonal Matrices" and "Lattice QCD as a Large Scale Scientific Computation"; the latter describes lattice QCD code vectorized for the IBM 3090 Vector Facility and benchmarked on a large number of computers, including the Cray X-MP and Cray 2 vector systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Christoph, G.G.; Jackson, K.A.; Neuman, M.C.
An effective method for detecting computer misuse is the automatic auditing and analysis of on-line user activity. This activity is reflected in the system audit record, by changes in the vulnerability posture of the system configuration, and in other evidence found through active testing of the system. In 1989 we started developing an automatic misuse detection system for the Integrated Computing Network (ICN) at Los Alamos National Laboratory. Since 1990 this system has been operational, monitoring a variety of network systems and services. We call it the Network Anomaly Detection and Intrusion Reporter, or NADIR. During the last year and a half, we expanded NADIR to include processing of audit and activity records for the Cray UNICOS operating system. This new component is called the UNICOS Real-time NADIR, or UNICORN. UNICORN summarizes user activity and system configuration information in statistical profiles. In near real-time, it can compare current activity to historical profiles and test activity against expert rules that express our security policy and define improper or suspicious behavior. It reports suspicious behavior to security auditors and provides tools to aid in follow-up investigations. UNICORN is currently operational on four Crays in Los Alamos' main computing network, the ICN.
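As a rough illustration of the kind of profile-based checking described above (entirely my own simplified sketch with made-up field names and thresholds, not NADIR or UNICORN code), current activity can be compared with a historical statistical profile and with explicit policy rules:

```python
# Hypothetical per-user historical profile: (mean, stddev) of daily counts.
profile = {"logins": (5.0, 1.5), "failed_su": (0.2, 0.4), "files_read": (120.0, 40.0)}

# Today's observed activity for the same user (made-up numbers).
today = {"logins": 6, "failed_su": 5, "files_read": 900}

def deviation_score(obs, mean, std):
    """z-like score of how far today's count sits from the historical profile."""
    return abs(obs - mean) / max(std, 1e-6)

alerts = []
for field, (mean, std) in profile.items():
    z = deviation_score(today[field], mean, std)
    if z > 3.0:                       # statistical anomaly threshold
        alerts.append(f"{field}: deviation score {z:.1f}")

# Expert rule encoding a site security policy (illustrative only).
if today["failed_su"] >= 3:
    alerts.append("rule: repeated failed privilege escalation attempts")

print(alerts)
```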
Parallel computation in a three-dimensional elastic-plastic finite-element analysis
NASA Technical Reports Server (NTRS)
Shivakumar, K. N.; Bigelow, C. A.; Newman, J. C., Jr.
1992-01-01
A CRAY parallel processing technique called autotasking was implemented in a three-dimensional elasto-plastic finite-element code. The technique was evaluated on two CRAY supercomputers, a CRAY 2 and a CRAY Y-MP. Autotasking was implemented in all major portions of the code, except the matrix equations solver. Compiler directives alone were not able to properly multitask the code; user-inserted directives were required to achieve better performance. It was noted that the connect time, rather than wall-clock time, was more appropriate to determine speedup in multiuser environments. For a typical example problem, a speedup of 2.1 (1.8 when the solution time was included) was achieved in a dedicated environment and 1.7 (1.6 with solution time) in a multiuser environment on a four-processor CRAY 2 supercomputer. The speedup on a three-processor CRAY Y-MP was about 2.4 (2.0 with solution time) in a multiuser environment.
Computing Operating Characteristics Of Bearing/Shaft Systems
NASA Technical Reports Server (NTRS)
Moore, James D.
1996-01-01
SHABERTH computer program predicts operating characteristics of bearings in multibearing load-support system. Lubricated and nonlubricated bearings modeled. Calculates loads, torques, temperatures, and fatigue lives of ball and/or roller bearings on single shaft. Provides for analysis of reaction of system to termination of supply of lubricant to bearings and other lubricated mechanical elements. Valuable in design and analysis of shaft/bearing systems. Two versions of SHABERTH available. Cray version (LEW-14860), "Computing Thermal Performances Of Shafts and Bearings". IBM PC version (MFS-28818), written for IBM PC-series and compatible computers running MS-DOS.
Scalable Vector Media-processors for Embedded Systems
2002-05-01
This thesis excerpt includes a bibliography entry citing M. August, G. Brost, C. Hsiung, and C. Schiffleger, "Cray X-MP: The Birth of a Supercomputer," IEEE Computer, 22(1):45-52, January 1989.
Distributed Finite Element Analysis Using a Transputer Network
NASA Technical Reports Server (NTRS)
Watson, James; Favenesi, James; Danial, Albert; Tombrello, Joseph; Yang, Dabby; Reynolds, Brian; Turrentine, Ronald; Shephard, Mark; Baehmann, Peggy
1989-01-01
The principal objective of this research effort was to demonstrate the extraordinarily cost effective acceleration of finite element structural analysis problems using a transputer-based parallel processing network. This objective was accomplished in the form of a commercially viable parallel processing workstation. The workstation is a desktop size, low-maintenance computing unit capable of supercomputer performance yet costs two orders of magnitude less. To achieve the principal research objective, a transputer based structural analysis workstation termed XPFEM was implemented with linear static structural analysis capabilities resembling commercially available NASTRAN. Finite element model files, generated using the on-line preprocessing module or external preprocessing packages, are downloaded to a network of 32 transputers for accelerated solution. The system currently executes at about one third Cray X-MP24 speed but additional acceleration appears likely. For the NASA selected demonstration problem of a Space Shuttle main engine turbine blade model with about 1500 nodes and 4500 independent degrees of freedom, the Cray X-MP24 required 23.9 seconds to obtain a solution while the transputer network, operated from an IBM PC-AT compatible host computer, required 71.7 seconds. Consequently, the $80,000 transputer network demonstrated a cost-performance ratio about 60 times better than the $15,000,000 Cray X-MP24 system.
Comparing the Performance of Blue Gene/Q with Leading Cray XE6 and InfiniBand Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kerbyson, Darren J.; Barker, Kevin J.; Vishnu, Abhinav
2013-01-21
Three types of systems dominate the current High Performance Computing landscape: the Cray XE6, the IBM Blue Gene, and commodity clusters using InfiniBand. These systems have quite different characteristics, making the choice for a particular deployment difficult. The XE6 uses Cray's proprietary Gemini 3-D torus interconnect with two nodes at each network endpoint. The latest IBM Blue Gene/Q uses a single socket integrating processor and communication in a 5-D torus network. InfiniBand provides the flexibility of using nodes from many vendors connected in many possible topologies. The performance characteristics of each vary vastly, along with their utilization models. In this work we compare the performance of these three systems using a combination of micro-benchmarks and a set of production applications. In particular, we discuss the causes of variability in performance across the systems and also quantify where performance is lost using a combination of measurements and models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.
Barrier-breaking performance for industrial problems on the CRAY C916
DOE Office of Scientific and Technical Information (OSTI.GOV)
Graffunder, S.K.
1993-12-31
Nine applications, including third-party codes, were submitted to the Gordon Bell Prize committee showing the CRAY C916 supercomputer providing record-breaking time to solution for industrial problems in several disciplines. Performance was obtained by balancing raw hardware speed; effective use of large, real, shared memory; compiler vectorization and autotasking; hand optimization; asynchronous I/O techniques; and new algorithms. The highest performance among the submissions was 11.1 GFLOPS out of a peak advertised performance of 16 GFLOPS for the CRAY C916 system. One program achieved a 15.45 speedup from the compiler with just two hand-inserted directives to scope variables properly for the mathematical library. New I/O techniques hide tens of gigabytes of I/O behind parallel computations. Finally, new iterative solver algorithms have demonstrated times to solution on one CPU up to 70 times faster than the best direct solvers.
NASA Technical Reports Server (NTRS)
Rogers, S. E.
1994-01-01
INS3D computes steady-state solutions to the incompressible Navier-Stokes equations. The INS3D approach utilizes pseudo-compressibility combined with an approximate factorization scheme. This computational fluid dynamics (CFD) code has been verified on problems such as flow through a channel, flow over a backward-facing step, and flow over a circular cylinder. Three-dimensional cases include flow over an ogive cylinder, flow through a rectangular duct, wind tunnel inlet flow, cylinder-wall juncture flow, and flow through multiple posts mounted between two plates. INS3D uses a pseudo-compressibility approach in which a time derivative of pressure is added to the continuity equation, which together with the momentum equations forms a set of four equations with pressure and velocity as the dependent variables. The coordinates of the equations are transformed for general three-dimensional applications. The equations are advanced in time by the implicit, non-iterative, approximately-factored, finite-difference scheme of Beam and Warming. The numerical stability of the scheme depends on the use of higher-order smoothing terms to damp out higher-frequency oscillations caused by second-order central differencing. The artificial compressibility introduces pressure (sound) waves of finite speed (whereas the speed of sound would be infinite in an incompressible fluid). As the solution converges, these pressure waves die out, causing the derivative of pressure with respect to time to approach zero. Thus, continuity is satisfied for the incompressible fluid in the steady state. Computational efficiency is achieved using a diagonal algorithm. A block tri-diagonal option is also available. When a steady-state solution is reached, the modified continuity equation will satisfy the divergence-free velocity field condition. INS3D is capable of handling several different types of boundaries encountered in numerical simulations, including solid-surface, inflow and outflow, and far-field boundaries. Three machine versions of INS3D are available. INS3D for the CRAY is written in CRAY FORTRAN for execution on a CRAY X-MP under COS, INS3D for the IBM is written in FORTRAN 77 for execution on an IBM 3090 under the VM or MVS operating system, and INS3D for DEC RISC-based systems is written in RISC FORTRAN for execution on a DEC workstation running RISC ULTRIX 3.1 or later. The CRAY version has a central memory requirement of 730279 words. The central memory requirement for the IBM version is 150Mb. The memory requirement for the DEC RISC ULTRIX version is 3Mb of main memory. INS3D was developed in 1987. The port to the IBM was done in 1990. The port to the DECstation 3100 was done in 1991. CRAY is a registered trademark of Cray Research Inc. IBM is a registered trademark of International Business Machines. DEC, DECstation, and ULTRIX are trademarks of the Digital Equipment Corporation.
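Written out, the formulation the abstract describes takes the standard artificial-compressibility form below; β is the pseudo-compressibility parameter and τ the pseudo-time, generic notation rather than symbols drawn from the INS3D documentation. As the pseudo-time derivative of pressure vanishes at convergence, the first equation reduces to the divergence-free condition on the velocity field.

```latex
\frac{1}{\beta}\,\frac{\partial p}{\partial \tau} + \frac{\partial u_j}{\partial x_j} = 0,
\qquad
\frac{\partial u_i}{\partial \tau} + \frac{\partial (u_i u_j)}{\partial x_j}
  = -\frac{\partial p}{\partial x_i} + \nu\,\frac{\partial^2 u_i}{\partial x_j \partial x_j}
```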
A parallel finite-difference method for computational aerodynamics
NASA Technical Reports Server (NTRS)
Swisshelm, Julie M.
1989-01-01
A finite-difference scheme for solving complex three-dimensional aerodynamic flow problems on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed.
A Performance Evaluation of the Cray X1 for Scientific Applications
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak; Borrill, Julian; Canning, Andrew; Carter, Jonathan; Djomehri, M. Jahed; Shan, Hongzhang; Skinner, David
2003-01-01
The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to build high-end capability and capacity computers because of their generality, scalability, and cost effectiveness. However, the recent development of massively parallel vector systems is having a significant effect on the supercomputing landscape. In this paper, we compare the performance of the recently-released Cray X1 vector system with that of the cacheless NEC SX-6 vector machine, and the superscalar cache-based IBM Power3 and Power4 architectures for scientific applications. Overall results demonstrate that the X1 is quite promising, but performance improvements are expected as the hardware, systems software, and numerical libraries mature. Code reengineering to effectively utilize the complex architecture may also lead to significant efficiency enhancements.
Theoretical research program to study chemical reactions in AOTV bow shock tubes
NASA Technical Reports Server (NTRS)
Taylor, P.
1986-01-01
Progress in the development of computational methods for the characterization of chemical reactions in aerobraking orbit transfer vehicle (AOTV) propulsive flows is reported. Two main areas of code development were undertaken: (1) the implementation of CASSCF (complete active space self-consistent field) and SCF (self-consistent field) analytical first derivatives on the CRAY X-MP; and (2) the installation of the complete set of electronic structure codes on the CRAY 2. In the area of application calculations the main effort was devoted to performing full configuration-interaction calculations and using these results to benchmark other methods. Preprints describing some of the systems studied are included.
Optimal Full Information Synthesis for Flexible Structures Implemented on Cray Supercomputers
NASA Technical Reports Server (NTRS)
Lind, Rick; Balas, Gary J.
1995-01-01
This paper considers an algorithm for synthesis of optimal controllers for full information feedback. The synthesis procedure reduces to a single linear matrix inequality which may be solved via established convex optimization algorithms. The computational cost of the optimization is investigated. It is demonstrated that the problem dimension and corresponding matrices can become large for practical engineering problems. For large-order systems, this process is impractical on standard workstations. A flexible structure is presented as a design example. Control synthesis requires several days on a workstation but may be solved in a reasonable amount of time using a Cray supercomputer.
A vectorized Lanczos eigensolver for high-performance computers
NASA Technical Reports Server (NTRS)
Bostic, Susan W.
1990-01-01
The computational strategies used to implement a Lanczos-based eigensolver on the latest generation of supercomputers are described. Several examples of structural vibration and buckling problems are presented that show the effects of using optimization techniques to increase the vectorization of the computational steps. The data storage and access schemes and the tools and strategies that best exploit the computer resources are presented. The method is implemented on the Convex C220, the Cray 2, and the Cray Y-MP computers. Results show that very good computation rates are achieved for the most computationally intensive steps of the Lanczos algorithm and that the Lanczos algorithm is many times faster than other methods extensively used in the past.
The SGI/CRAY T3E: Experiences and Insights
NASA Technical Reports Server (NTRS)
Bernard, Lisa Hamet
1999-01-01
The focus of the HPCC Earth and Space Sciences (ESS) Project is capability computing - pushing highly scalable computing testbeds to their performance limits. The drivers of this focus are the Grand Challenge problems in Earth and space science: those that could not be addressed in a capacity computing environment where large jobs must continually compete for resources. These Grand Challenge codes require a high degree of communication, large memory, and very large I/O (throughout the duration of the processing, not just in loading initial conditions and saving final results). This set of parameters led to the selection of an SGI/Cray T3E as the current ESS Computing Testbed. The T3E at the Goddard Space Flight Center is a unique computational resource within NASA. As such, it must be managed to effectively support the diverse research efforts across the NASA research community yet still enable the ESS Grand Challenge Investigator teams to achieve their performance milestones, for which the system was intended. To date, all Grand Challenge Investigator teams have achieved the 10 GFLOPS milestone, eight of nine have achieved the 50 GFLOPS milestone, and three have achieved the 100 GFLOPS milestone. In addition, many technical papers have been published highlighting results achieved on the NASA T3E, including some at this Workshop. The successes enabled by the NASA T3E computing environment are best illustrated by the 512 PE upgrade funded by the NASA Earth Science Enterprise earlier this year. Never before has an HPCC computing testbed been so well received by the general NASA science community that it was deemed critical to the success of a core NASA science effort. NASA looks forward to many more success stories before the conclusion of the NASA-SGI/Cray cooperative agreement in June 1999.
Applications of CFD and visualization techniques
NASA Technical Reports Server (NTRS)
Saunders, James H.; Brown, Susan T.; Crisafulli, Jeffrey J.; Southern, Leslie A.
1992-01-01
In this paper, three applications are presented to illustrate current techniques for flow calculation and visualization. The first two applications use a commercial computational fluid dynamics (CFD) code, FLUENT, performed on a Cray Y-MP. The results are animated with the aid of data visualization software, apE. The third application simulates a particulate deposition pattern using techniques inspired by developments in nonlinear dynamical systems. These computations were performed on personal computers.
NAS technical summaries: Numerical aerodynamic simulation program, March 1991 - February 1992
NASA Technical Reports Server (NTRS)
1992-01-01
NASA created the Numerical Aerodynamic Simulation (NAS) Program in 1987 to focus resources on solving critical problems in aeroscience and related disciplines by utilizing the power of the most advanced supercomputers available. The NAS Program provides scientists with the necessary computing power to solve today's most demanding computational fluid dynamics problems and serves as a pathfinder in integrating leading-edge supercomputing technologies, thus benefiting other supercomputer centers in Government and industry. This report contains selected scientific results from the 1991-92 NAS Operational Year, March 4, 1991 to March 3, 1992, which is the fifth year of operation. During this year, the scientific community was given access to a Cray-2 and a Cray Y-MP. The Cray-2, the first generation supercomputer, has four processors, 256 megawords of central memory, and a total sustained speed of 250 million floating point operations per second. The Cray Y-MP, the second generation supercomputer, has eight processors and a total sustained speed of one billion floating point operations per second. Additional memory was installed this year, doubling capacity from 128 to 256 megawords of solid-state storage-device memory. Because of its higher performance, the Cray Y-MP delivered approximately 77 percent of the total number of supercomputer hours used during this year.
Experiences and results multitasking a hydrodynamics code on global and local memory machines
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mandell, D.
1987-01-01
A one-dimensional, time-dependent Lagrangian hydrodynamics code using a Godunov solution method has been multitasked for the Cray X-MP/48, the Intel iPSC hypercube, the Alliant FX series, and the IBM RP3 computers. Actual multitasking results have been obtained for the Cray, Intel, and Alliant computers, and simulated results were obtained for the Cray and RP3 machines. The differences in the methods required to multitask on each of the machines are discussed. Results are presented for a sample problem involving a shock wave moving down a channel. Comparisons are made between theoretical speedups, predicted by Amdahl's law, and the actual speedups obtained. The problems of debugging on the different machines are also described.
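For reference, the theoretical speedup referred to above is given by Amdahl's law; here f denotes the fraction of the work that can be multitasked and N the number of processors (generic notation, not the report's):

```latex
S(N) = \frac{1}{(1 - f) + f/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - f}
```

For example, with f = 0.95 a two-processor run is limited to a speedup of about 1.90, consistent with the near-2 dual-processor speedups quoted elsewhere in these summaries.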
The ASC Sequoia Programming Model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seager, M
2008-08-06
In the late 1980's and early 1990's, Lawrence Livermore National Laboratory was deeply engrossed in determining the next generation programming model for the Integrated Design Codes (IDC) beyond vectorization for the Cray 1s series of computers. The vector model, developed in the mid 1970's first for the CDC 7600 and later extended from stack-based vector operations to memory-to-memory operations for the Cray 1s, lasted approximately 20 years (see Slide 5). The Cray vector era was deemed an extremely long-lived era as it allowed vector codes to be developed over time (the Cray 1s were faster in scalar mode than the CDC 7600) with vector unit utilization increasing incrementally over time. The other attributes of the Cray vector era at LLNL were that we developed, supported and maintained the operating system (LTSS and later NLTSS), communications protocols (LINCS), compilers (Civic Fortran77 and Model), operating system tools (e.g., batch system, job control scripting, loaders, debuggers, editors, graphics utilities, you name it), and math and highly machine-optimized libraries (e.g., SLATEC and STACKLIB). Although LTSS was adopted by Cray for early system generations, they later developed the COS and UNICOS operating systems and environments on their own. In the late 1970s and early 1980s two trends appeared that made the Cray vector programming model (described above, including both the hardware and system software aspects) seem potentially dated and slated for major revision. These trends were the appearance of low-cost CMOS microprocessors and their attendant departmental and mini-computers, and later workstations and personal computers. With the widespread adoption of Unix in the early 1980s, it appeared that LLNL (and the other DOE Labs) would be left out of the mainstream of computing without a rapid transition to these 'Killer Micros' and modern OS and tools environments. The other interesting advance in the period was that systems were being developed with multiple 'cores' and were called Symmetric Multi-Processor or Shared Memory Processor (SMP) systems. The parallel revolution had begun. The Laboratory started a small 'parallel processing project' in 1983 to study the new technology and its application to scientific computing with four people: Tim Axelrod, Pete Eltgroth, Paul Dubois and Mark Seager. Two years later, Eugene Brooks joined the team. This team focused on Unix and 'killer micro' SMPs. Indeed, Eugene Brooks was credited with coining the 'Killer Micro' term. After several generations of SMP platforms (e.g., the Sequent Balance 8000 with 8 33MHz MC32032s, the Alliant FX8 with 8 MC68020s and FPGA-based vector units, and finally the BBN Butterfly with 128 cores), it became apparent to us that the killer micro revolution would indeed overtake the Crays and that we definitely needed a new programming and systems model. The model developed by Mark Seager and Dale Nielsen focused on both the system aspects (Slide 3) and the code development aspects (Slide 4). Although now succinctly captured in two attached slides, at the time there was tremendous ferment in the research community as to what parallel programming model would emerge, dominate and survive. In addition, we wanted a model that would provide portability between platforms of a single generation but also longevity over multiple, and hopefully many, generations. Only after we developed the 'Livermore Model' and worked it out in considerable detail did it become obvious that what we came up with was the right approach.
In a nutshell, the applications programming model of the Livermore Model posited that SMP parallelism would ultimately not scale indefinitely and one would have to bite the bullet and implement MPI parallelism within the Integrated Design Code (IDC). We also had a major emphasis on doing everything in a completely standards-based, portable methodology with POSIX/Unix as the target environment. We decided against specialized libraries like STACKLIB for performance, but kept as many general-purpose, portable math libraries as were needed by the codes. Third, we assumed that the SMPs in clusters would evolve in time to become more powerful, feature rich and, in particular, offer more cores. Thus, we focused on OpenMP and POSIX Pthreads for programming SMP parallelism. These code porting efforts were led by Dale Nielsen, A-Division code group leader, and Randy Christensen, B-Division code group leader. Most of the porting effort revolved around removing 'Crayisms' in the codes: artifacts of LTSS/NLTSS, Civic compiler extensions beyond Fortran77, I/O libraries, and dealing with new code control languages (we switched to Perl and later to Python). Adding MPI to the codes was initially problematic and error prone because the programmers used MPI directly and sprinkled the calls throughout the code.
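A minimal sketch of the layered model described above: MPI between nodes, OpenMP threads within each SMP node. Everything here (program name, the toy reduction) is illustrative and nothing is taken from the IDC sources.

```fortran
! Sketch of hybrid MPI + OpenMP: each MPI rank sums part of a series,
! threads share the work inside the rank, and MPI combines the ranks.
program hybrid_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nranks, i
  integer, parameter :: n = 1000000
  real(kind=8) :: local_sum, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  ! OpenMP parallelism inside the node ...
  local_sum = 0.0d0
  !$omp parallel do reduction(+:local_sum)
  do i = rank*n + 1, (rank+1)*n
     local_sum = local_sum + 1.0d0 / real(i,8)**2
  end do
  !$omp end parallel do

  ! ... MPI parallelism across nodes.
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'partial sum of 1/i**2 =', global_sum
  call MPI_Finalize(ierr)
end program hybrid_sketch
```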
DOE Office of Scientific and Technical Information (OSTI.GOV)
Computational Research Division, Lawrence Berkeley National Laboratory; NERSC, Lawrence Berkeley National Laboratory; Computer Science Department, University of California, Berkeley
2009-05-04
We apply auto-tuning to a hybrid MPI-pthreads lattice Boltzmann computation running on the Cray XT4 at the National Energy Research Scientific Computing Center (NERSC). Previous work showed that multicore-specific auto-tuning can improve the performance of lattice Boltzmann magnetohydrodynamics (LBMHD) by a factor of 4 when running on dual- and quad-core Opteron dual-socket SMPs. We extend these studies to the distributed memory arena via a hybrid MPI/pthreads implementation. In addition to conventional auto-tuning at the local SMP node, we tune at the message-passing level to determine the optimal aspect ratio as well as the correct balance between MPI tasks and threads per MPI task. Our study presents a detailed performance analysis when moving along an isocurve of constant hardware usage: fixed total memory, total cores, and total nodes. Overall, our work points to approaches for improving intra- and inter-node efficiency on large-scale multicore systems for demanding scientific applications.
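As a small illustration of one axis of that search, the sketch below enumerates every MPI-task/thread split that keeps the cores per node fixed; the node size of 8 cores is an assumed example (e.g., a dual-socket quad-core node), not a parameter of the NERSC system.

```fortran
! Sketch: enumerate the (MPI tasks) x (threads per task) combinations
! that keep the total cores per node fixed, i.e. the balance the
! auto-tuner described above searches over.
program balance_sketch
  implicit none
  integer, parameter :: cores_per_node = 8   ! assumed node size
  integer :: tasks, threads
  do tasks = 1, cores_per_node
     if (mod(cores_per_node, tasks) == 0) then
        threads = cores_per_node / tasks
        print '(a,i2,a,i2,a)', ' candidate: ', tasks, ' MPI tasks x ', threads, ' threads'
     end if
  end do
end program balance_sketch
```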
High Performance Programming Using Explicit Shared Memory Model on Cray T3D
NASA Technical Reports Server (NTRS)
Simon, Horst D.; Saini, Subhash; Grassi, Charles
1994-01-01
The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under the Cray Research Adaptive Fortran (CRAFT) model, four programming methods (data parallel, work sharing, message-passing using PVM, and the explicit shared memory model) are available to users. However, at this time the data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that neither standard PVM nor CRI's PVM exploits the hardware capabilities of the T3D. The reasons for the poor performance of PVM as a native message-passing library are presented. This is illustrated by the performance of the NAS Parallel Benchmarks (NPB) programmed in the explicit shared memory model on the Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than that obtained using the explicit shared memory model. A similar degradation is seen on the CM-5, where the performance of applications using the native message-passing library CMMD is also about 4 to 5 times less than that of data parallel methods. The issues involved in programming in the explicit shared memory model (such as barriers, synchronization, and invalidating and aligning the data cache) are discussed. Comparative performance of the NPB using the explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, and IBM SP1 is presented.
MAGNA (Materially and Geometrically Nonlinear Analysis). Part I. Finite Element Analysis Manual.
1982-12-01
The manual describes procedures for operating the program, modifying storage capacity, preparing input data, estimating computer run times, and interpreting the output. The table of contents covers the CDC program version (including reserved file names and typical execution times on CDC computers), the CRAY program version (job control language, modification of storage capacity, and execution times on the CRAY-1 computer), the VAX program version, and the input data.
NASA Technical Reports Server (NTRS)
Wigton, Larry
1996-01-01
Improvements to the numerical linear algebra routines used in new Navier-Stokes codes, specifically Tim Barth's unstructured-grid code, with spin-offs to TRANAIR, are reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models was written. The primary focus of this work was on improving matrix-iterative methods. New algorithms have been developed which exploit the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
Investigating the impact of the cielo cray XE6 architecture on scientific application codes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rajan, Mahesh; Barrett, Richard; Pedretti, Kevin Thomas Tauke
2010-12-01
Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus providing a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1983-09-09
This Validation Summary Report (VSR) for the Cray Research, Inc., CRAY FORTRAN Translator (CFT) Version 1.11 Bugfix 1 running under the CRAY Operating System (COS) Version 1.12 provides a consolidated summary of the results obtained from the validation of the subject compiler against the 1978 FORTRAN Standard (X3.9-1978/FIPS PUB 69). The compiler was validated against the Full Level FORTRAN level of FIPS PUB 69. The VSR is made up of several sections showing all the discrepancies found, if any. These include an overview of the validation, which lists all categories of discrepancies together with the tests which failed.
NASA Technical Reports Server (NTRS)
Logan, Terry G.
1994-01-01
The purpose of this study is to investigate the performance of integral equation computations using a numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of the computational performance of the MPP CM-5 computer and the conventional Cray Y-MP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on the CM-5 with 32, 64, and 128 nodes, along with those on the Cray Y-MP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code nearly matches or outperforms the equivalent serial FORTRAN code for some cases.
NASA Technical Reports Server (NTRS)
Biyabani, S. R.
1994-01-01
INS3D computes steady-state solutions to the incompressible Navier-Stokes equations. The INS3D approach utilizes pseudo-compressibility combined with an approximate factorization scheme. This computational fluid dynamics (CFD) code has been verified on problems such as flow through a channel, flow over a backward-facing step, and flow over a circular cylinder. Three-dimensional cases include flow over an ogive cylinder, flow through a rectangular duct, wind tunnel inlet flow, cylinder-wall juncture flow, and flow through multiple posts mounted between two plates. INS3D uses a pseudo-compressibility approach in which a time derivative of pressure is added to the continuity equation, which together with the momentum equations forms a set of four equations with pressure and velocity as the dependent variables. The coordinates of the equations are transformed for general three-dimensional applications. The equations are advanced in time by the implicit, non-iterative, approximately-factored, finite-difference scheme of Beam and Warming. The numerical stability of the scheme depends on the use of higher-order smoothing terms to damp out higher-frequency oscillations caused by second-order central differencing. The artificial compressibility introduces pressure (sound) waves of finite speed (whereas the speed of sound would be infinite in an incompressible fluid). As the solution converges, these pressure waves die out, causing the derivative of pressure with respect to time to approach zero. Thus, continuity is satisfied for the incompressible fluid in the steady state. Computational efficiency is achieved using a diagonal algorithm. A block tri-diagonal option is also available. When a steady-state solution is reached, the modified continuity equation will satisfy the divergence-free velocity field condition. INS3D is capable of handling several different types of boundaries encountered in numerical simulations, including solid-surface, inflow and outflow, and far-field boundaries. Three machine versions of INS3D are available. INS3D for the CRAY is written in CRAY FORTRAN for execution on a CRAY X-MP under COS, INS3D for the IBM is written in FORTRAN 77 for execution on an IBM 3090 under the VM or MVS operating system, and INS3D for DEC RISC-based systems is written in RISC FORTRAN for execution on a DEC workstation running RISC ULTRIX 3.1 or later. The CRAY version has a central memory requirement of 730279 words. The central memory requirement for the IBM version is 150Mb. The memory requirement for the DEC RISC ULTRIX version is 3Mb of main memory. INS3D was developed in 1987. The port to the IBM was done in 1990. The port to the DECstation 3100 was done in 1991. CRAY is a registered trademark of Cray Research Inc. IBM is a registered trademark of International Business Machines. DEC, DECstation, and ULTRIX are trademarks of the Digital Equipment Corporation.
Early MIMD experience on the CRAY X-MP
NASA Astrophysics Data System (ADS)
Rhoades, Clifford E.; Stevens, K. G.
1985-07-01
This paper describes some early experience with converting four physics simulation programs to the CRAY X-MP, a current Multiple Instruction, Multiple Data (MIMD) computer consisting of two processors, each with an architecture similar to that of the CRAY-1. As a multi-processor, the CRAY X-MP together with the high-speed Solid-state Storage Device (SSD) is an ideal machine upon which to study MIMD algorithms for solving the equations of mathematical physics because it is fast enough to run real problems. The computer programs used in this study are all FORTRAN versions of original production codes. They range in sophistication from a one-dimensional numerical simulation of collisionless plasma to a two-dimensional hydrodynamics code with heat flow to a couple of three-dimensional fluid dynamics codes with varying degrees of viscous modeling. Early research with a dual processor configuration has shown speed-ups ranging from 1.55 to 1.98. It has been observed that a few simple extensions to FORTRAN allow a typical programmer to achieve a remarkable level of efficiency. These extensions involve the concept of memory local to a concurrent subprogram and memory common to all concurrent subprograms.
Researchers Mine Information from Next-Generation Subsurface Flow Simulations
Gedenk, Eric D.
2015-12-01
A research team based at Virginia Tech University leveraged computing resources at the US Department of Energy's (DOE's) Oak Ridge National Laboratory to explore subsurface multiphase flow phenomena that can't be experimentally observed. Using the Cray XK7 Titan supercomputer at the Oak Ridge Leadership Computing Facility, the team took Micro-CT images of subsurface geologic systems and created two-phase flow simulations. The team's model development has implications for computational research pertaining to carbon sequestration, oil recovery, and contaminant transport.
LASL benchmark performance 1978. [CDC STAR-100, 6600, 7600, Cyber 73, and CRAY-1
DOE Office of Scientific and Technical Information (OSTI.GOV)
McKnight, A.L.
1979-08-01
This report presents the results of running several benchmark programs on a CDC STAR-100, a Cray Research CRAY-1, a CDC 6600, a CDC 7600, and a CDC Cyber 73. The benchmark effort included CRAY-1's at several installations running different operating systems and compilers. This benchmark is part of an ongoing program at Los Alamos Scientific Laboratory to collect performance data and monitor the development trend of supercomputers. 3 tables.
GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hargrove, Paul H.; Bonachea, Dan
This document is a deliverable for milestone STPM17-6 of the Exascale Computing Project, delivered by WBS 2.3.1.14. It reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as “specializations”, primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are available in the GASNet-EX 2018.3.0 release.
Modeling high-temperature superconductors and metallic alloys on the Intel IPSC/860
NASA Astrophysics Data System (ADS)
Geist, G. A.; Peyton, B. W.; Shelton, W. A.; Stocks, G. M.
Oak Ridge National Laboratory has embarked on several computational Grand Challenges, which require the close cooperation of physicists, mathematicians, and computer scientists. One of these projects is the determination of the material properties of alloys from first principles and, in particular, the electronic structure of high-temperature superconductors. While the present focus of the project is on superconductivity, the approach is general enough to permit study of other properties of metallic alloys such as strength and magnetic properties. This paper describes the progress to date on this project. We include a description of a self-consistent KKR-CPA method, parallelization of the model, and the incorporation of a dynamic load balancing scheme into the algorithm. We also describe the development and performance of a consolidated KKR-CPA code capable of running on CRAYs, workstations, and several parallel computers without source code modification. Performance of this code on the Intel iPSC/860 is also compared to a CRAY 2, CRAY YMP, and several workstations. Finally, some density of state calculations of two perovskite superconductors are given.
Optimization of large matrix calculations for execution on the Cray X-MP vector supercomputer
NASA Technical Reports Server (NTRS)
Hornfeck, William A.
1988-01-01
A considerable volume of large computational computer codes was developed for NASA over the past twenty-five years. This code represents algorithms developed for machines of an earlier generation. With the emergence of the vector supercomputer as a viable, commercially available machine, an opportunity exists to evaluate optimization strategies to improve the efficiency of existing software. This opportunity arises primarily from architectural differences between the latest generation of large-scale machines and the earlier, mostly uniprocessor, machines. A software package being used by NASA to perform computations on large matrices is described, and a strategy for its conversion to the Cray X-MP vector supercomputer is also described.
Antenna pattern control using impedance surfaces
NASA Technical Reports Server (NTRS)
Balanis, Constantine A.; Liu, Kefeng
1992-01-01
During this research period, we effectively transferred existing computer codes from the CRAY supercomputer to workstation-based systems. The workstation-based version of our code preserved the accuracy of the numerical computations while giving a much better turn-around time than the CRAY supercomputer. This relieved us of heavy dependence on the supercomputer account budget and made the codes developed in this research project more practical for applications. The analysis of pyramidal horns with impedance surfaces was our major focus during this research period. Three different modeling algorithms for analyzing lossy impedance surfaces were investigated and compared with measured data. Through this investigation, we discovered that a hybrid Fourier transform technique, which uses the eigenmodes in the stepped waveguide section and the Fourier-transformed field distributions across the stepped discontinuities for lossy impedance coatings, gives better accuracy in analyzing lossy coatings. After a further refinement of the present technique, we will perform an accurate radiation pattern synthesis in the coming reporting period.
NASA Technical Reports Server (NTRS)
Perkey, D. J.; Kreitzberg, C. W.
1984-01-01
The dynamic prediction model, along with its macro-processor capability and data flow system, from the Drexel Limited-Area and Mesoscale Prediction System (LAMPS) was converted and recoded for the Perkin-Elmer 3220. The previous version of this model was written for the Control Data Corporation 7600 and CRAY-1A computer environment which existed until recently at the National Center for Atmospheric Research. The purpose of this conversion was to prepare LAMPS for porting to computer environments other than that encountered at NCAR. The emphasis was shifted from programming tasks to model simulation and evaluation tests.
CDC to CRAY FORTRAN conversion manual
NASA Technical Reports Server (NTRS)
Mcgary, C.; Diebert, D.
1983-01-01
Documentation describing software differences between two general purpose computers for scientific applications is presented. Descriptions of the use of the FORTRAN and FORTRAN 77 high level programming language on a CDC 7600 under SCOPE and a CRAY XMP under COS are offered. Itemized differences of the FORTRAN language sets of the two machines are also included. The material is accompanied by numerous examples of preferred programming techniques for the two machines.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel W.
Coupled-cluster methods provide highly accurate models of molecular structure by explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy efficient manner. We achieve up to 240× speedup compared with the best optimized shared memory implementation. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.
Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel; ...
2017-03-08
Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.
High Performance Programming Using Explicit Shared Memory Model on the Cray T3D
NASA Technical Reports Server (NTRS)
Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)
1994-01-01
The Cray T3D is the first-phase system in Cray Research Inc.'s (CRI) three-phase massively parallel processing program. In this report we describe the architecture of the T3D, as well as the CRAFT (Cray Research Adaptive Fortran) programming model, and contrast it with PVM, which is also supported on the T3D. We present some performance data based on the NAS Parallel Benchmarks to illustrate both architectural and software features of the T3D.
Performance Analysis of the Unitree Central File
NASA Technical Reports Server (NTRS)
Pentakalos, Odysseas I.; Flater, David
1994-01-01
This report consists of two parts. The first part briefly comments on the documentation status of two major systems at NASA's Center for Computational Sciences, specifically the Cray C98 and the Convex C3830. The second part describes the work done on improving the performance of file transfers between the Unitree Mass Storage System running on the Convex file server and the users' workstations distributed over a large geographic area.
TRASYS - THERMAL RADIATION ANALYZER SYSTEM (CRAY VERSION WITH NASADIG)
NASA Technical Reports Server (NTRS)
Anderson, G. E.
1994-01-01
The Thermal Radiation Analyzer System, TRASYS, is a computer software system with generalized capability to solve the radiation related aspects of thermal analysis problems. TRASYS computes the total thermal radiation environment for a spacecraft in orbit. The software calculates internode radiation interchange data as well as incident and absorbed heat rate data originating from environmental radiant heat sources. TRASYS provides data of both types in a format directly usable by such thermal analyzer programs as SINDA/FLUINT (available from COSMIC, program number MSC-21528). One primary feature of TRASYS is that it allows users to write their own driver programs to organize and direct the preprocessor and processor library routines in solving specific thermal radiation problems. The preprocessor first reads and converts the user's geometry input data into the form used by the processor library routines. Then, the preprocessor accepts the user's driving logic, written in the TRASYS modified FORTRAN language. In many cases, the user has a choice of routines to solve a given problem. Users may also provide their own routines where desirable. In particular, the user may write output routines to provide for an interface between TRASYS and any thermal analyzer program using the R-C network concept. Input to the TRASYS program consists of Options and Edit data, Model data, and Logic Flow and Operations data. Options and Edit data provide for basic program control and user edit capability. The Model data describe the problem in terms of geometry and other properties. This information includes surface geometry data, documentation data, nodal data, block coordinate system data, form factor data, and flux data. Logic Flow and Operations data house the user's driver logic, including the sequence of subroutine calls and the subroutine library. Output from TRASYS consists of two basic types of data: internode radiation interchange data, and incident and absorbed heat rate data. The flexible structure of TRASYS allows considerable freedom in the definition and choice of solution method for a thermal radiation problem. The program's flexible structure has also allowed TRASYS to retain the same basic input structure as the authors update it in order to keep up with changing requirements. Among its other important features are the following: 1) up to 3200 node problem size capability with shadowing by intervening opaque or semi-transparent surfaces; 2) choice of diffuse, specular, or diffuse/specular radiant interchange solutions; 3) a restart capability that minimizes recomputing; 4) macroinstructions that automatically provide the executive logic for orbit generation that optimizes the use of previously completed computations; 5) a time variable geometry package that provides automatic pointing of the various parts of an articulated spacecraft and an automatic look-back feature that eliminates redundant form factor calculations; 6) capability to specify submodel names to identify sets of surfaces or components as an entity; and 7) subroutines to perform functions which save and recall the internodal and/or space form factors in subsequent steps for nodes with fixed geometry during a variable geometry run. There are two machine versions of TRASYS v27: a DEC VAX version and a Cray UNICOS version. Both versions require installation of the NASADIG library (MSC-21801 for DEC VAX or COS-10049 for CRAY), which is available from COSMIC either separately or bundled with TRASYS. 
The NASADIG (NASA Device Independent Graphics Library) plot package provides a pictorial representation of input geometry, orbital/orientation parameters, and heating rate output as a function of time. NASADIG supports Tektronix terminals. The CRAY version of TRASYS v27 is written in FORTRAN 77 for batch or interactive execution and has been implemented on CRAY X-MP and CRAY Y-MP series computers running UNICOS. The standard distribution medium for MSC-21959 (CRAY version without NASADIG) is a 1600 BPI 9-track magnetic tape in UNIX tar format. The standard distribution medium for COS-10040 (CRAY version with NASADIG) is a set of two 6250 BPI 9-track magnetic tapes in UNIX tar format. Alternate distribution media and formats are available upon request. The DEC VAX version of TRASYS v27 is written in FORTRAN 77 for batch execution (only the plotting driver program is interactive) and has been implemented on a DEC VAX 8650 computer under VMS. Since the source codes for MSC-21030 and COS-10026 are in VAX/VMS text library files and DEC Command Language files, COSMIC will only provide these programs in the following formats: MSC-21030, TRASYS (DEC VAX version without NASADIG) is available on a 1600 BPI 9-track magnetic tape in VAX BACKUP format (standard distribution medium) or in VAX BACKUP format on a TK50 tape cartridge; COS-10026, TRASYS (DEC VAX version with NASADIG), is available in VAX BACKUP format on a set of three 6250 BPI 9-track magnetic tapes (standard distribution medium) or a set of three TK50 tape cartridges in VAX BACKUP format. TRASYS was last updated in 1993.
A parallel algorithm for generation and assembly of finite element stiffness and mass matrices
NASA Technical Reports Server (NTRS)
Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.
1991-01-01
A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
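A minimal sketch of the node-by-node idea, using 1D bar elements with unit stiffness; this is purely illustrative and not the authors' implementation. Each row of the global stiffness matrix is assembled independently from the elements attached to that node, so the loop over nodes, rather than a loop over elements, is the one that can be distributed across processors without write conflicts.

```fortran
! Sketch: node-by-node assembly of a 1D chain of bar elements with
! unit stiffness.  Row i of K receives contributions only from the
! elements attached to node i (the element to its left and right).
program node_assembly_sketch
  implicit none
  integer, parameter :: nnode = 5
  real(kind=8) :: k(nnode, nnode)
  integer :: i

  k = 0.0d0
  do i = 1, nnode                      ! this loop is the parallel one
     if (i > 1) then                   ! element to the left of node i
        k(i, i-1) = k(i, i-1) - 1.0d0
        k(i, i)   = k(i, i)   + 1.0d0
     end if
     if (i < nnode) then               ! element to the right of node i
        k(i, i)   = k(i, i)   + 1.0d0
        k(i, i+1) = k(i, i+1) - 1.0d0
     end if
  end do

  do i = 1, nnode
     print '(5f6.1)', k(i, :)
  end do
end program node_assembly_sketch
```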
Achieving High Performance on the i860 Microprocessor
NASA Technical Reports Server (NTRS)
Lee, King; Kutler, Paul (Technical Monitor)
1998-01-01
The i860 is a high performance microprocessor used in the Intel Touchstone project. This paper proposes a paradigm for programming the i860 that is modelled on the vector instructions of the Cray computers. Fortran callable assembler subroutines were written that mimic the concurrent vector instructions of the Cray. Cache takes the place of vector registers. Using this paradigm we have achieved twice the performance of compiled code on a traditional solve.
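A minimal sketch of the strip-mining style that paradigm suggests (array names and the strip length are assumed, not taken from the paper): a long loop is processed in fixed-length strips so that each strip stays resident in cache, much as a Cray processes 64-element vector registers.

```fortran
! Sketch: strip-mined axpy loop; the inner strip plays the role of a
! vector register, with cache holding the working set.
program stripmine_sketch
  implicit none
  integer, parameter :: n = 1000, strip = 64
  real :: x(n), y(n), a
  integer :: is, i

  x = 1.0
  y = 2.0
  a = 0.5

  do is = 1, n, strip                      ! outer loop over strips
     do i = is, min(is + strip - 1, n)     ! inner loop fits in cache
        y(i) = y(i) + a * x(i)
     end do
  end do

  print *, 'y(1), y(n) =', y(1), y(n)      ! expected: 2.5 2.5
end program stripmine_sketch
```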
Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks
NASA Technical Reports Server (NTRS)
Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias;
2006-01-01
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem, and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmark (IMB) results to study the performance of 11 MPI communication functions on these systems.
Solving large-scale dynamic systems using band Lanczos method in Rockwell NASTRAN on CRAY X-MP
NASA Technical Reports Server (NTRS)
Gupta, V. K.; Zillmer, S. D.; Allison, R. E.
1986-01-01
Improved cost effectiveness, using better models, more accurate and faster algorithms, and large-scale computing, offers more representative dynamic analyses. The band Lanczos eigensolution method was implemented in Rockwell's version of the 1984 COSMIC-released NASTRAN finite element structural analysis computer program to effectively solve for structural vibration modes, including those of large complex systems exceeding 10,000 degrees of freedom. The Lanczos vectors were re-orthogonalized locally using the Lanczos method and globally using the modified Gram-Schmidt method, sweeping out rigid-body modes and previously generated modes and Lanczos vectors. The truncated band matrix was solved for vibration frequencies and mode shapes using Givens rotations. Numerical examples are included to demonstrate the cost effectiveness and accuracy of the method as implemented in Rockwell NASTRAN. The CRAY version is based on RPK's COSMIC/NASTRAN. The band Lanczos method was more reliable and accurate and converged faster than the single-vector Lanczos method. The band Lanczos method was comparable to the subspace iteration method, which is a block version of the inverse power method; however, the subspace matrix tended to be fully populated in the case of subspace iteration and not as sparse as a band matrix.
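For contrast with the band (block) variant described above, a minimal sketch of the single-vector Lanczos recurrence on a small dense test matrix; this is illustrative only and not the NASTRAN implementation. The recurrence reduces a symmetric matrix A to a tridiagonal matrix T whose eigenvalues approximate those of A; the band method carries a block of vectors instead of one and adds the re-orthogonalization steps the abstract describes.

```fortran
! Sketch: single-vector Lanczos tridiagonalization of a small
! symmetric matrix; alpha holds the diagonal of T, beta the
! off-diagonal.
program lanczos_sketch
  implicit none
  integer, parameter :: n = 4, m = 4
  real(kind=8) :: a(n,n), v(n,0:m), alpha(m), beta(0:m), w(n)
  integer :: j

  ! A small symmetric test matrix
  a = reshape( (/ 4.d0, 1.d0, 0.d0, 0.d0, &
                  1.d0, 3.d0, 1.d0, 0.d0, &
                  0.d0, 1.d0, 2.d0, 1.d0, &
                  0.d0, 0.d0, 1.d0, 1.d0 /), (/ n, n /) )

  v(:,0)  = 0.d0
  beta(0) = 0.d0
  v(:,1)  = 0.d0
  v(1,1)  = 1.d0                       ! unit starting vector

  do j = 1, m
     w = matmul(a, v(:,j)) - beta(j-1)*v(:,j-1)
     alpha(j) = dot_product(w, v(:,j))
     w = w - alpha(j)*v(:,j)
     beta(j) = sqrt(dot_product(w, w))
     if (j < m .and. beta(j) > 0.d0) v(:,j+1) = w / beta(j)
  end do

  print *, 'diagonal of T:    ', alpha
  print *, 'off-diagonal of T:', beta(1:m-1)
end program lanczos_sketch
```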
TOP500 Sublist for November 2001
DOE Office of Scientific and Technical Information (OSTI.GOV)
Strohmaier, Erich; Meuer, Hans W.; Dongarra, Jack J.
2001-11-09
18th Edition of TOP500 List of World's Fastest Supercomputers Released MANNHEIM, GERMANY; KNOXVILLE, TENN.; BERKELEY, CALIF. In what has become a much-anticipated event in the world of high-performance computing, the 18th edition of the TOP500 list of the world's fastest supercomputers was released today (November 9, 2001). The latest edition of the twice-yearly ranking finds IBM as the leader in the field, with 32 percent in terms of installed systems and 37 percent in terms of total performance of all the installed systems. In a surprise move Hewlett-Packard captured the second place with 30 percent of the systems. Most of these systems are smaller in size and as a consequence HP's share of installed performance is smaller with 15 percent. This is still enough for second place in this category. SGI, Cray and Sun follow in the number of TOP500 systems with 41 (8 percent), 39 (8 percent), and 31 (6 percent) respectively. In the category of installed performance Cray Inc. keeps the third position with 11 percent ahead of SGI (8 percent) and Compaq (8 percent).
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.; Storaasli, Olaf O.; Qin, Jiangning; Qamar, Ramzi
1994-01-01
An automatic differentiation tool (ADIFOR) is incorporated into a finite element based structural analysis program for shape and non-shape design sensitivity analysis of structural systems. The entire analysis and sensitivity procedures are parallelized and vectorized for high performance computation. Small scale examples to verify the accuracy of the proposed program and a medium scale example to demonstrate the parallel vector performance on multiple CRAY C90 processors are included.
NASA Technical Reports Server (NTRS)
Mulac, Richard A.; Celestina, Mark L.; Adamczyk, John J.; Misegades, Kent P.; Dawson, Jef M.
1987-01-01
A procedure is outlined which utilizes parallel processing to solve the inviscid form of the average-passage equation system for multistage turbomachinery along with a description of its implementation in a FORTRAN computer code, MSTAGE. A scheme to reduce the central memory requirements of the program is also detailed. Both the multitasking and I/O routines referred to are specific to the Cray X-MP line of computers and its associated SSD (Solid-State Disk). Results are presented for a simulation of a two-stage rocket engine fuel pump turbine.
A secure file manager for UNIX
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeVries, R.G.
1990-12-31
The development of a secure file management system for a UNIX-based computer facility with supercomputers and workstations is described. Specifically, UNIX in its usual form does not address: (1) Operation which would satisfy rigorous security requirements. (2) Online space management in an environment where total data demands would be many times the actual online capacity. (3) Making the file management system part of a computer network in which users of any computer in the local network could retrieve data generated on any other computer in the network. The characteristics of UNIX can be exploited to develop a portable, secure file manager which would operate on computer systems ranging from workstations to supercomputers. Implementation considerations making unusual use of UNIX features, rather than requiring extensive internal system changes, are described, and implementation using the Cray Research Inc. UNICOS operating system is outlined.
The growth of the UniTree mass storage system at the NASA Center for Computational Sciences
NASA Technical Reports Server (NTRS)
Tarshish, Adina; Salmon, Ellen
1993-01-01
In October 1992, the NASA Center for Computational Sciences made its Convex-based UniTree system generally available to users. The ensuing months saw the growth of near-online data from nil to nearly three terabytes, a doubling of the number of CPU's on the facility's Cray YMP (the primary data source for UniTree), and the necessity for an aggressive regimen for repacking sparse tapes and hierarchical 'vaulting' of old files to freestanding tape. Connectivity was enhanced as well with the addition of UltraNet HiPPI. This paper describes the increasing demands placed on the storage system's performance and throughput that resulted from the significant augmentation of compute-server processor power and network speed.
A transient FETI methodology for large-scale parallel implicit computations in structural mechanics
NASA Technical Reports Server (NTRS)
Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier
1992-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
The SGI/Cray T3E: Experiences and Insights
NASA Technical Reports Server (NTRS)
Bernard, Lisa Hamet
1998-01-01
The NASA Goddard Space Flight Center is home to the fifth most powerful supercomputer in the world, a 1024-processor SGI/Cray T3E-600. The original 512-processor system was placed at Goddard in March 1997 as part of a cooperative agreement between the High Performance Computing and Communications Program's Earth and Space Sciences Project (ESS) and SGI/Cray Research. The goal of this system is to facilitate achievement of the Project milestones of 10, 50 and 100 GFLOPS sustained performance on selected Earth and space science application codes. The additional 512 processors were purchased in March 1998 by the NASA Earth Science Enterprise for the NASA Seasonal to Interannual Prediction Project (NSIPP). These two "halves" still operate as a single system, and must satisfy the unique requirements of both aforementioned groups, as well as guest researchers from the Earth, space, microgravity, manned space flight and aeronautics communities. Few large scalable parallel systems are configured for capability computing, so models are hard to find. This unique environment has created a challenging system administration task, and has yielded some insights into the supercomputing needs of the various NASA Enterprises, as well as insights into the strengths and weaknesses of the T3E architecture and software. The T3E is a distributed memory system in which the processing elements (PE's) are connected by a low latency, high bandwidth bidirectional 3-D torus. Due to the focus on high speed communication between PE's, the T3E requires PE's to be allocated contiguously per job. Further, jobs will only execute on the user-specified number of PE's, and PE timesharing is possible but impractical. With a highly varied job mix in both size and runtime of jobs, the resulting scenario is PE fragmentation and an inability to achieve near 100% utilization. SGI/Cray has provided several scheduling and configuration tools to minimize the impact of fragmentation. These tools include PScheD (the political scheduler), GRM (the global resource manager) and NQE (the Network Queuing Environment). Features and impact of these tools will be discussed, as will resulting performance and utilization data. As a distributed memory system, the T3E is designed to be programmed through explicit message passing. Consequently, certain assumptions related to code design are made by the operating system (UNICOS/mk) and its scheduling tools. With the exception of HPF, which does run on the T3E, however poorly, alternative programming styles have the potential to impact the T3E in unexpected and undesirable ways. Several examples will be presented (preceded by the disclaimer, "Don't try this at home! Violators will be prosecuted!").
ORNL Cray X1 evaluation status report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Agarwal, P.K.; Alexander, R.A.; Apra, E.
2004-05-01
On August 15, 2002 the Department of Energy (DOE) selected the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) to deploy a new scalable vector supercomputer architecture for solving important scientific problems in climate, fusion, biology, nanoscale materials and astrophysics. "This program is one of the first steps in an initiative designed to provide U.S. scientists with the computational power that is essential to 21st century scientific leadership," said Dr. Raymond L. Orbach, director of the department's Office of Science. In FY03, CCS procured a 256-processor Cray X1 to evaluate the processors, memory subsystem, scalability of the architecture, software environment and to predict the expected sustained performance on key DOE applications codes. The results of the micro-benchmarks and kernel benchmarks show the architecture of the Cray X1 to be exceptionally fast for most operations. The best results are shown on large problems, where it is not possible to fit the entire problem into the cache of the processors. These large problems are exactly the types of problems that are important for the DOE and ultra-scale simulation. Application performance is found to be markedly improved by this architecture: - Large-scale simulations of high-temperature superconductors run 25 times faster than on an IBM Power4 cluster using the same number of processors. - Best performance of the parallel ocean program (POP v1.4.3) is 50 percent higher than on Japan's Earth Simulator and 5 times higher than on an IBM Power4 cluster. - A fusion application, global GYRO transport, was found to be 16 times faster on the X1 than on an IBM Power3. The increased performance allowed simulations to fully resolve questions raised by a prior study. - The transport kernel in the AGILE-BOLTZTRAN astrophysics code runs 15 times faster than on an IBM Power4 cluster using the same number of processors. - Molecular dynamics simulations related to the phenomenon of photon echo run 8 times faster than previously achieved. Even at 256 processors, the Cray X1 system is already outperforming other supercomputers with thousands of processors for a certain class of applications such as climate modeling and some fusion applications. This evaluation is the outcome of a number of meetings with both high-performance computing (HPC) system vendors and application experts over the past 9 months and has received broad-based support from the scientific community and other agencies.
Attaching IBM-compatible 3380 disks to Cray X-MP
DOE Office of Scientific and Technical Information (OSTI.GOV)
Engert, D.E.; Midlock, J.L.
1989-01-01
A method of attaching IBM-compatible 3380 disks directly to a Cray X-MP via the XIOP with a BMC is described. The IBM 3380 disks appear to the UNICOS operating system as DD-29 disks with UNICOS file systems. IBM 3380 disks provide cheap, reliable large capacity disk storage. Combined with a small number of high-speed Cray disks, the IBM disks provide for the bulk of the storage for small files and infrequently used files. Cray Research designed the BMC and its supporting software in the XIOP to allow IBM tapes and other devices to be attached to the X-MP. No hardware changes were necessary, and we added less than 2000 lines of code to the XIOP to accomplish this project. This system has been in operation for over eight months. Future enhancements such as the use of a cache controller and attachment to a Y-MP are also described.
NASA Technical Reports Server (NTRS)
Chan, J. S.; Freeman, J. A.
1984-01-01
The viscous, axisymmetric flow in the thrust chamber of the space shuttle main engine (SSME) was computed on the CRAY 205 computer using the general interpolants method (GIM) code. Results show that the Navier-Stokes codes can be used for these flows to study trends and viscous effects as well as determine flow patterns; but further research and development is needed before they can be used as production tools for nozzle performance calculations. The GIM formulation, numerical scheme, and computer code are described. The actual SSME nozzle computation showing grid points, flow contours, and flow parameter plots is discussed. The computer system and run times/costs are detailed.
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Tumeo, Antonino; Secchi, Simone
Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, we introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10% of accuracy. Emulation is only from 25 to 200 times slower than real time.
NASA Technical Reports Server (NTRS)
Fatoohi, Rod; Saini, Subhash; Ciotti, Robert
2006-01-01
We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnects of these systems as well as to compare these interconnects. We measured network bandwidth using different numbers of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray X1 shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
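The point-to-point measurements in such benchmark sets are typically variations of an MPI ping-pong; a minimal sketch (generic MPI, not the actual benchmark code used in the paper) is:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong between ranks 0 and 1 (run with at least 2 ranks): rank 0 sends
 * a buffer, rank 1 echoes it back; bandwidth is estimated from the time. */
int main(int argc, char **argv)
{
    int rank, reps = 100, nbytes = 1 << 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = calloc(nbytes, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)   /* two messages of nbytes per repetition */
        printf("bandwidth: %.1f MB/s\n", 2.0 * reps * nbytes / (t1 - t0) / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```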
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wolfe, A.
1986-03-10
Supercomputing software is moving into high gear, spurred by the rapid spread of supercomputers into new applications. The critical challenge is how to develop tools that will make it easier for programmers to write applications that take advantage of vectorizing in the classical supercomputer and the parallelism that is emerging in supercomputers and minisupercomputers. Writing parallel software is a challenge that every programmer must face because parallel architectures are springing up across the range of computing. Cray is developing a host of tools for programmers. Tools to support multitasking (in supercomputer parlance, multitasking means dividing up a single program to run on multiple processors) are high on Cray's agenda. On tap for multitasking is Premult, dubbed a microtasking tool. As a preprocessor for Cray's CFT77 FORTRAN compiler, Premult will provide fine-grain multitasking.
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1999-01-01
The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
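In standard multigrid notation (not quoted from the paper), the FAS coarse-level problem solved on each agglomerated level can be written as

\[ A_H(u_H) = A_H\big(\hat{I}_h^H u_h\big) + I_h^H\big(f_h - A_h(u_h)\big), \]

where \(A_h\) and \(A_H\) are the fine- and coarse-level nonlinear operators, \(I_h^H\) and \(\hat{I}_h^H\) restrict residuals and solutions respectively, and the coarse correction \(u_H - \hat{I}_h^H u_h\) is interpolated back to the fine level after the coarse solve.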
User and Performance Impacts from Franklin Upgrades
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Yun
2009-05-10
The NERSC flagship computer, the Cray XT4 system "Franklin", has gone through three major upgrades during the past year: a quad-core upgrade, a CLE 2.1 upgrade, and an IO upgrade. In this paper, we discuss various aspects of the user impacts of these upgrades, such as user access, user environment, and user issues. The performance impacts on the kernel benchmarks and selected application benchmarks will also be presented.
High performance computing applications in neurobiological research
NASA Technical Reports Server (NTRS)
Ross, Muriel D.; Cheng, Rei; Doshay, David G.; Linton, Samuel W.; Montgomery, Kevin; Parnas, Bruce R.
1994-01-01
The human nervous system is a massively parallel processor of information. The vast numbers of neurons, synapses, and circuits are daunting to those seeking to understand the neural basis of consciousness and intellect. Pervasive obstacles are the lack of knowledge of the detailed, three-dimensional (3-D) organization of even a simple neural system and the paucity of large-scale, biologically relevant computer simulations. We use high performance graphics workstations and supercomputers to study the 3-D organization of gravity sensors as a prototype architecture foreshadowing more complex systems. Scaled-down simulations run on a Silicon Graphics workstation and scaled-up, three-dimensional versions run on the Cray Y-MP and CM5 supercomputers.
Climate Ocean Modeling on a Beowulf Class System
NASA Technical Reports Server (NTRS)
Cheng, B. N.; Chao, Y.; Wang, P.; Bondarenko, M.
2000-01-01
With the growing power and shrinking cost of personal computers, the availability of fast Ethernet interconnections, and public domain software packages, it is now possible to combine them to build desktop parallel computers (named Beowulf or PC clusters) at a fraction of what it would cost to buy systems of comparable power from supercomputer companies. This led us to build and assemble our own system, specifically for climate ocean modeling. In this article, we present our experience with such a system, discuss its network performance, and provide some performance comparison data with both the HP SPP2000 and the Cray T3E for an ocean model used in present-day oceanographic research.
Climate Data Assimilation on a Massively Parallel Supercomputer
NASA Technical Reports Server (NTRS)
Ding, Hong Q.; Ferraro, Robert D.
1996-01-01
We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to 512 nodes of an Intel Paragon. The preconditioned Conjugate Gradient solver achieves a sustained 18 Gflops performance. Consequently, we achieve an unprecedented 100-fold reduction in time to solution on the Intel Paragon over a single head of a Cray C90. This not only exceeds the daily performance requirement of the Data Assimilation Office at NASA's Goddard Space Flight Center, but also makes it possible to explore much larger and challenging data assimilation problems which are unthinkable on a traditional computer platform such as the Cray C90.
NAS (Numerical Aerodynamic Simulation Program) technical summaries, March 1989 - February 1990
NASA Technical Reports Server (NTRS)
1990-01-01
Given here are selected scientific results from the Numerical Aerodynamic Simulation (NAS) Program's third year of operation. During this year, the scientific community was given access to a Cray-2 and a Cray Y-MP supercomputer. Topics covered include flow field analysis of fighter wing configurations, large-scale ocean modeling, the Space Shuttle flow field, advanced computational fluid dynamics (CFD) codes for rotary-wing airloads and performance prediction, turbulence modeling of separated flows, airloads and acoustics of rotorcraft, vortex-induced nonlinearities on submarines, and standing oblique detonation waves.
TRASYS - THERMAL RADIATION ANALYZER SYSTEM (DEC VAX VERSION WITH NASADIG)
NASA Technical Reports Server (NTRS)
Anderson, G. E.
1994-01-01
The Thermal Radiation Analyzer System, TRASYS, is a computer software system with generalized capability to solve the radiation related aspects of thermal analysis problems. TRASYS computes the total thermal radiation environment for a spacecraft in orbit. The software calculates internode radiation interchange data as well as incident and absorbed heat rate data originating from environmental radiant heat sources. TRASYS provides data of both types in a format directly usable by such thermal analyzer programs as SINDA/FLUINT (available from COSMIC, program number MSC-21528). One primary feature of TRASYS is that it allows users to write their own driver programs to organize and direct the preprocessor and processor library routines in solving specific thermal radiation problems. The preprocessor first reads and converts the user's geometry input data into the form used by the processor library routines. Then, the preprocessor accepts the user's driving logic, written in the TRASYS modified FORTRAN language. In many cases, the user has a choice of routines to solve a given problem. Users may also provide their own routines where desirable. In particular, the user may write output routines to provide for an interface between TRASYS and any thermal analyzer program using the R-C network concept. Input to the TRASYS program consists of Options and Edit data, Model data, and Logic Flow and Operations data. Options and Edit data provide for basic program control and user edit capability. The Model data describe the problem in terms of geometry and other properties. This information includes surface geometry data, documentation data, nodal data, block coordinate system data, form factor data, and flux data. Logic Flow and Operations data house the user's driver logic, including the sequence of subroutine calls and the subroutine library. Output from TRASYS consists of two basic types of data: internode radiation interchange data, and incident and absorbed heat rate data. The flexible structure of TRASYS allows considerable freedom in the definition and choice of solution method for a thermal radiation problem. The program's flexible structure has also allowed TRASYS to retain the same basic input structure as the authors update it in order to keep up with changing requirements. Among its other important features are the following: 1) up to 3200 node problem size capability with shadowing by intervening opaque or semi-transparent surfaces; 2) choice of diffuse, specular, or diffuse/specular radiant interchange solutions; 3) a restart capability that minimizes recomputing; 4) macroinstructions that automatically provide the executive logic for orbit generation that optimizes the use of previously completed computations; 5) a time variable geometry package that provides automatic pointing of the various parts of an articulated spacecraft and an automatic look-back feature that eliminates redundant form factor calculations; 6) capability to specify submodel names to identify sets of surfaces or components as an entity; and 7) subroutines to perform functions which save and recall the internodal and/or space form factors in subsequent steps for nodes with fixed geometry during a variable geometry run. There are two machine versions of TRASYS v27: a DEC VAX version and a Cray UNICOS version. Both versions require installation of the NASADIG library (MSC-21801 for DEC VAX or COS-10049 for CRAY), which is available from COSMIC either separately or bundled with TRASYS. 
The NASADIG (NASA Device Independent Graphics Library) plot package provides a pictorial representation of input geometry, orbital/orientation parameters, and heating rate output as a function of time. NASADIG supports Tektronix terminals. The CRAY version of TRASYS v27 is written in FORTRAN 77 for batch or interactive execution and has been implemented on CRAY X-MP and CRAY Y-MP series computers running UNICOS. The standard distribution medium for MSC-21959 (CRAY version without NASADIG) is a 1600 BPI 9-track magnetic tape in UNIX tar format. The standard distribution medium for COS-10040 (CRAY version with NASADIG) is a set of two 6250 BPI 9-track magnetic tapes in UNIX tar format. Alternate distribution media and formats are available upon request. The DEC VAX version of TRASYS v27 is written in FORTRAN 77 for batch execution (only the plotting driver program is interactive) and has been implemented on a DEC VAX 8650 computer under VMS. Since the source codes for MSC-21030 and COS-10026 are in VAX/VMS text library files and DEC Command Language files, COSMIC will only provide these programs in the following formats: MSC-21030, TRASYS (DEC VAX version without NASADIG) is available on a 1600 BPI 9-track magnetic tape in VAX BACKUP format (standard distribution medium) or in VAX BACKUP format on a TK50 tape cartridge; COS-10026, TRASYS (DEC VAX version with NASADIG), is available in VAX BACKUP format on a set of three 6250 BPI 9-track magnetic tapes (standard distribution medium) or a set of three TK50 tape cartridges in VAX BACKUP format. TRASYS was last updated in 1993.
TRASYS - THERMAL RADIATION ANALYZER SYSTEM (DEC VAX VERSION WITHOUT NASADIG)
NASA Technical Reports Server (NTRS)
Vogt, R. A.
1994-01-01
The Thermal Radiation Analyzer System, TRASYS, is a computer software system with generalized capability to solve the radiation related aspects of thermal analysis problems. TRASYS computes the total thermal radiation environment for a spacecraft in orbit. The software calculates internode radiation interchange data as well as incident and absorbed heat rate data originating from environmental radiant heat sources. TRASYS provides data of both types in a format directly usable by such thermal analyzer programs as SINDA/FLUINT (available from COSMIC, program number MSC-21528). One primary feature of TRASYS is that it allows users to write their own driver programs to organize and direct the preprocessor and processor library routines in solving specific thermal radiation problems. The preprocessor first reads and converts the user's geometry input data into the form used by the processor library routines. Then, the preprocessor accepts the user's driving logic, written in the TRASYS modified FORTRAN language. In many cases, the user has a choice of routines to solve a given problem. Users may also provide their own routines where desirable. In particular, the user may write output routines to provide for an interface between TRASYS and any thermal analyzer program using the R-C network concept. Input to the TRASYS program consists of Options and Edit data, Model data, and Logic Flow and Operations data. Options and Edit data provide for basic program control and user edit capability. The Model data describe the problem in terms of geometry and other properties. This information includes surface geometry data, documentation data, nodal data, block coordinate system data, form factor data, and flux data. Logic Flow and Operations data house the user's driver logic, including the sequence of subroutine calls and the subroutine library. Output from TRASYS consists of two basic types of data: internode radiation interchange data, and incident and absorbed heat rate data. The flexible structure of TRASYS allows considerable freedom in the definition and choice of solution method for a thermal radiation problem. The program's flexible structure has also allowed TRASYS to retain the same basic input structure as the authors update it in order to keep up with changing requirements. Among its other important features are the following: 1) up to 3200 node problem size capability with shadowing by intervening opaque or semi-transparent surfaces; 2) choice of diffuse, specular, or diffuse/specular radiant interchange solutions; 3) a restart capability that minimizes recomputing; 4) macroinstructions that automatically provide the executive logic for orbit generation that optimizes the use of previously completed computations; 5) a time variable geometry package that provides automatic pointing of the various parts of an articulated spacecraft and an automatic look-back feature that eliminates redundant form factor calculations; 6) capability to specify submodel names to identify sets of surfaces or components as an entity; and 7) subroutines to perform functions which save and recall the internodal and/or space form factors in subsequent steps for nodes with fixed geometry during a variable geometry run. There are two machine versions of TRASYS v27: a DEC VAX version and a Cray UNICOS version. Both versions require installation of the NASADIG library (MSC-21801 for DEC VAX or COS-10049 for CRAY), which is available from COSMIC either separately or bundled with TRASYS. 
The NASADIG (NASA Device Independent Graphics Library) plot package provides a pictorial representation of input geometry, orbital/orientation parameters, and heating rate output as a function of time. NASADIG supports Tektronix terminals. The CRAY version of TRASYS v27 is written in FORTRAN 77 for batch or interactive execution and has been implemented on CRAY X-MP and CRAY Y-MP series computers running UNICOS. The standard distribution medium for MSC-21959 (CRAY version without NASADIG) is a 1600 BPI 9-track magnetic tape in UNIX tar format. The standard distribution medium for COS-10040 (CRAY version with NASADIG) is a set of two 6250 BPI 9-track magnetic tapes in UNIX tar format. Alternate distribution media and formats are available upon request. The DEC VAX version of TRASYS v27 is written in FORTRAN 77 for batch execution (only the plotting driver program is interactive) and has been implemented on a DEC VAX 8650 computer under VMS. Since the source codes for MSC-21030 and COS-10026 are in VAX/VMS text library files and DEC Command Language files, COSMIC will only provide these programs in the following formats: MSC-21030, TRASYS (DEC VAX version without NASADIG) is available on a 1600 BPI 9-track magnetic tape in VAX BACKUP format (standard distribution medium) or in VAX BACKUP format on a TK50 tape cartridge; COS-10026, TRASYS (DEC VAX version with NASADIG), is available in VAX BACKUP format on a set of three 6250 BPI 9-track magnetic tapes (standard distribution medium) or a set of three TK50 tape cartridges in VAX BACKUP format. TRASYS was last updated in 1993.
Experiences From NASA/Langley's DMSS Project
NASA Technical Reports Server (NTRS)
1996-01-01
There is a trend in institutions with high performance computing and data management requirements to explore mass storage systems with peripherals directly attached to a high speed network. The Distributed Mass Storage System (DMSS) Project at the NASA Langley Research Center (LaRC) has placed such a system into production use. This paper will present the experiences, both good and bad, we have had with this system since putting it into production usage. The system is comprised of: 1) National Storage Laboratory (NSL)/UniTree 2.1, 2) IBM 9570 HIPPI attached disk arrays (both RAID 3 and RAID 5), 3) IBM RS6000 server, 4) HIPPI/IPI3 third party transfers between the disk array systems and the supercomputer clients, a CRAY Y-MP and a CRAY 2, 5) a "warm spare" file server, 6) transition software to convert from CRAY's Data Migration Facility (DMF) based system to DMSS, 7) an NSC PS32 HIPPI switch, and 8) a STK 4490 robotic library accessed from the IBM RS6000 block mux interface. This paper will cover: the performance of the DMSS in the areas of file transfer rates, migration and recall, and file manipulation (listing, deleting, etc.); the appropriateness of a workstation class of file server for NSL/UniTree with LaRC's present storage requirements in mind; the role of the third party transfers between the supercomputers and the DMSS disk array systems; a detailed comparison (both in performance and functionality) between the DMF and DMSS systems; LaRC's enhancements to the NSL/UniTree system administration environment; the mechanism for DMSS to provide file server redundancy; statistics on the availability of DMSS; and the design and experiences with the locally developed transparent transition software which allowed us to make over 1.5 million DMF files available to NSL/UniTree with minimal system outage.
Multitasking the code ARC3D. [for computational fluid dynamics
NASA Technical Reports Server (NTRS)
Barton, John T.; Hsiung, Christopher C.
1986-01-01
The CRAY multitasking system was developed in order to utilize all four processors and sharply reduce the wall clock run time. This paper describes the techniques used to modify the computational fluid dynamics code ARC3D for this run and analyzes the achieved speedup. The ARC3D code solves either the Euler or thin-layer N-S equations using an implicit approximate factorization scheme. Results indicate that multitask processing can be used to achieve wall clock speedup factors of over three times, depending on the nature of the program code being used. Multitasking appears to be particularly advantageous for large-memory problems running on multiple CPU computers.
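A standard way to relate such speedups to the multitasked fraction of the code (Amdahl's law; the formula is not part of the original abstract) is

\[ S(p) = \frac{1}{(1-f) + f/p}, \]

where \(f\) is the fraction of the run that executes in parallel on \(p\) processors; a wall-clock speedup above 3 on \(p = 4\) processors requires \(f \gtrsim 0.89\).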
EFFECTS OF TUMORS ON INHALED PHARMACOLOGIC DRUGS: II. PARTICLE MOTION
ABSTRACT
Computer simulations were conducted to describe drug particle motion in human lung bifurcations with tumors. The computations used FIDAP with a Cray T90 supercomputer. The objective was to better understand particle behavior as affected by particle characteristics...
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1990-01-01
Practical engineering applications can often be formulated as constrained optimization problems. There are several solution algorithms for solving a constrained optimization problem. One approach is to convert a constrained problem into a series of unconstrained problems. Furthermore, unconstrained solution algorithms can be used as part of the constrained solution algorithms. Structural optimization is an iterative process where one starts with an initial design, and a finite element structural analysis is then performed to calculate the response of the system (such as displacements, stresses, eigenvalues, etc.). Based upon the sensitivity information on the objective and constraint functions, an optimizer such as ADS or IDESIGN can be used to find the new, improved design. For the structural analysis phase, the equation solver for the system of simultaneous, linear equations plays a key role since it is needed for static, eigenvalue, or dynamic analysis. For practical, large-scale structural analysis-synthesis applications, computational time can be excessively large. Thus, it is necessary to have a new structural analysis-synthesis code which employs new solution algorithms to exploit both the parallel and vector capabilities offered by modern, high performance computers such as the Convex, Cray-2 and Cray Y-MP computers. The objective of this research project is, therefore, to incorporate the latest developments in the parallel-vector equation solver, PVSOLVE, into a widely used finite-element production code such as SAP-4. Furthermore, several nonlinear unconstrained optimization subroutines have also been developed and tested under a parallel computer environment. The unconstrained optimization subroutines are not only useful in their own right, but they can also be incorporated into a more popular constrained optimization code, such as ADS.
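One common way to turn a constrained problem into a series of unconstrained ones, as described above, is a quadratic penalty method. The toy example below (illustrative C, not the ADS or IDESIGN algorithm) minimizes a simple objective under one inequality constraint by repeatedly calling an unconstrained minimizer, here plain gradient descent, with an increasing penalty weight.

```c
#include <stdio.h>

/* Quadratic penalty method on a toy problem (purely illustrative):
 *   minimize f(x) = x0^2 + x1^2   subject to  g(x) = 1 - x0 - x1 <= 0.
 * Each outer pass minimizes the unconstrained function f + rho*max(0,g)^2
 * by gradient descent, then increases rho.  The exact constrained solution
 * is x0 = x1 = 0.5. */
int main(void)
{
    double x[2] = {0.0, 0.0}, rho = 1.0;
    for (int outer = 0; outer < 4; outer++) {
        double step = 0.4 / (2.0 + 4.0 * rho);         /* stable step size    */
        for (int it = 0; it < 2000; it++) {            /* unconstrained solve */
            double g = 1.0 - x[0] - x[1];
            double viol = g > 0.0 ? g : 0.0;            /* only violations     */
            double g0 = 2.0 * x[0] - 2.0 * rho * viol;  /* gradient components */
            double g1 = 2.0 * x[1] - 2.0 * rho * viol;
            x[0] -= step * g0;
            x[1] -= step * g1;
        }
        rho *= 10.0;                                    /* tighten the penalty */
    }
    printf("x = (%.4f, %.4f)\n", x[0], x[1]);           /* approaches (0.5, 0.5) */
    return 0;
}
```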
DOE Office of Scientific and Technical Information (OSTI.GOV)
G.A. Pope; K. Sephernoori; D.C. McKinney
1996-03-15
This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time that is required to obtain the solution as well as providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to CRAY T3D and CRAY T3E, distributed-memory, multiprocessor computers using CRAY-PVM as the interprocessor communication library. Also, the data communication routines were modified such that the portability of the original code across different computer architectures was made possible.
RISC Processors and High Performance Computing
NASA Technical Reports Server (NTRS)
Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)
1995-01-01
In this tutorial, we will discuss the top five current RISC microprocessors: the IBM Power2, which is used in the IBM RS6000/590 workstation and in the IBM SP2 parallel supercomputer; the DEC Alpha, which is used in the DEC Alpha workstation and in the Cray T3D; the MIPS R8000, which is used in the SGI Power Challenge; the HP PA-RISC 7100, which is used in the HP 700 series workstations and in the Convex Exemplar; and the Cray proprietary processor, which is used in the new Cray J916. The architecture of these microprocessors will first be presented. The effective performance of these processors will then be compared, both by citing standard benchmarks and also in the context of implementing real applications. In the process, different programming models such as data parallel (CM Fortran and HPF) and message passing (PVM and MPI) will be introduced and compared. The latest NAS Parallel Benchmark (NPB) absolute performance and performance per dollar figures will be presented. The next generation of the NPB will also be described. The tutorial will conclude with a discussion of general trends in the field of high performance computing, including likely future developments in hardware and software technology, and the relative roles of vector supercomputers, tightly coupled parallel computers, and clusters of workstations. This tutorial will provide a unique cross-machine comparison not available elsewhere.
Parallel Navier-Stokes computations on shared and distributed memory architectures
NASA Technical Reports Server (NTRS)
Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar
1995-01-01
We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SP1). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.
NASA Astrophysics Data System (ADS)
Filipcic, A.; Haug, S.; Hostettler, M.; Walker, R.; Weber, M.
2015-12-01
The Piz Daint Cray XC30 HPC system at CSCS, the Swiss National Supercomputing Centre, was the highest-ranked European system on TOP500 in 2014, also featuring GPU accelerators. Event generation and detector simulation for the ATLAS experiment have been enabled for this machine. We report on the technical solutions, performance, HPC policy challenges and possible future opportunities for HEP on extreme HPC systems. In particular, a custom-made integration with the ATLAS job submission system has been developed via the Advanced Resource Connector (ARC) middleware. Furthermore, a partial GPU acceleration of the Geant4 detector simulations has been implemented.
A Programming Model Performance Study Using the NAS Parallel Benchmarks
Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...
2010-01-01
Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS, to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an Infiniband cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve equal performance to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. Also we compare performance differences between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.
OPAL: An Open-Source MPI-IO Library over Cray XT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, Weikuan; Vetter, Jeffrey S; Canon, Richard Shane
Parallel IO over Cray XT is supported by a vendor-supplied MPI-IO package. This package contains a proprietary ADIO implementation built on top of the sysio library. While it is reasonable to maintain a stable code base for application scientists' convenience, it is also very important for system developers and researchers to analyze and assess the effectiveness of parallel IO software and, accordingly, tune and optimize the MPI-IO implementation. A proprietary parallel IO code base precludes such flexibility. On the other hand, a generic UFS-based MPI-IO implementation is typically used on many Linux-based platforms. We have developed an open-source MPI-IO package over Lustre, referred to as OPAL (OPportunistic and Adaptive MPI-IO Library over Lustre). OPAL provides a single source-code base for MPI-IO over Lustre on Cray XT and Linux platforms. Compared to the Cray implementation, OPAL provides a number of good features, including arbitrary specification of striping patterns and Lustre-stripe-aligned file domain partitioning. This paper presents the performance comparisons between OPAL and Cray's proprietary implementation. Our evaluation demonstrates that OPAL achieves performance comparable to the Cray implementation. We also exemplify the benefits of an open source package in revealing the underpinnings of the parallel IO performance.
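At the application level, the kind of collective, offset-based write that such a library services can be sketched with generic MPI-IO calls (the striping_factor hint shown is a common ROMIO/Lustre hint and is only an assumption here, not necessarily OPAL's interface):

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank writes a contiguous block of doubles at its own offset using a
 * collective MPI-IO call; a file-system hint requests a striping layout. */
int main(int argc, char **argv)
{
    int rank, count = 1 << 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(count * sizeof *buf);
    for (int i = 0; i < count; i++) buf[i] = rank;       /* fill with rank id */

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");           /* hint: 8 stripes  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}
```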
Parallel Calculation of Sensitivity Derivatives for Aircraft Design using Automatic Differentiation
NASA Technical Reports Server (NTRS)
Bischof, C. H.; Green, L. L.; Haigler, K. J.; Knauff, T. L., Jr.
1994-01-01
Sensitivity derivative (SD) calculation via automatic differentiation (AD) typical of that required for the aerodynamic design of a transport-type aircraft is considered. Two ways of computing SD via code generated by the ADIFOR automatic differentiation tool are compared for efficiency and applicability to problems involving large numbers of design variables. A vector implementation on a Cray Y-MP computer is compared with a coarse-grained parallel implementation on an IBM SP1 computer, employing a Fortran M wrapper. The SD are computed for a swept transport wing in turbulent, transonic flow; the number of geometric design variables varies from 1 to 60 with coupling between a wing grid generation program and a state-of-the-art, 3-D computational fluid dynamics program, both augmented for derivative computation via AD. For a small number of design variables, the Cray Y-MP implementation is much faster. As the number of design variables grows, however, the IBM SP1 becomes an attractive alternative in terms of compute speed, job turnaround time, and total memory available for solutions with large numbers of design variables. The coarse-grained parallel implementation also can be moved easily to a network of workstations.
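Forward-mode automatic differentiation, which is what ADIFOR-generated code performs for each design variable, can be illustrated with a tiny dual-number example (illustrative C; ADIFOR itself transforms Fortran source rather than using dual numbers):

```c
#include <stdio.h>
#include <math.h>

/* Forward-mode AD with dual numbers: carry a value v and its derivative d
 * with respect to one design variable through every operation. */
typedef struct { double v, d; } Dual;

static Dual dmul(Dual a, Dual b) { return (Dual){a.v * b.v, a.d * b.v + a.v * b.d}; }
static Dual dadd(Dual a, Dual b) { return (Dual){a.v + b.v, a.d + b.d}; }
static Dual dsin(Dual a)         { return (Dual){sin(a.v), cos(a.v) * a.d}; }

int main(void)
{
    Dual x = {2.0, 1.0};                  /* seed: d/dx of x is 1          */
    Dual f = dadd(dmul(x, x), dsin(x));   /* f(x) = x^2 + sin(x)           */
    printf("f = %.6f, df/dx = %.6f\n", f.v, f.d);   /* df/dx = 2x + cos(x) */
    return 0;
}
```

Propagating one derivative per design variable this way is what makes the cost of forward-mode AD grow with the number of design variables, which is why the abstract contrasts the vector and coarse-grained parallel implementations as that number increases.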
Exploring Accelerating Science Applications with FPGAs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Storaasli, Olaf O; Strenski, Dave
2007-01-01
FPGA hardware and tools (VHDL, Viva, MitrionC and CHiMPS) are described. FPGA performance is evaluated on two Cray XD1 systems (Virtex-II Pro 50 and Virtex-4 LX160) for human genome (DNA and protein) sequence comparisons for a computational biology code (FASTA). Scalable FPGA speedups of 50X (Virtex-II) and 100X (Virtex-4) over a 2.2 GHz Opteron were achieved. Coding and IO issues faced for human genome data are described.
Study of the TRAC Airfoil Table Computational System
NASA Technical Reports Server (NTRS)
Hu, Hong
1999-01-01
The report documents the study of the application of the TRAC airfoil table computational package (TRACFOIL) to the prediction of 2D airfoil force and moment data over a wide range of angle of attack and Mach number. TRACFOIL generates the standard C-81 airfoil table for input into rotorcraft comprehensive codes such as CAMRAD. The existing TRACFOIL computer package was successfully modified to run on Digital Alpha workstations and on Cray C90 supercomputers. A step-by-step instruction for using the package on both computer platforms is provided. Application of the newer version of TRACFOIL is made for two airfoil sections. The C-81 data obtained using the TRACFOIL method are compared with wind-tunnel data and results are presented.
The Hopper System: How the Largest XE6 in the World Went From Requirements to Reality.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Antypas, Katie; Butler, Tina; Carter, Jonathan
This paper will discuss the entire process of acquiring and deploying Hopper from the first vendor market surveys to providing 3.8 million hours of production cycles per day for NERSC users. Installing the latest system at NERSC has been both a logistical and technical adventure. Balancing compute requirements with power, cooling, and space limitations drove the initial choice and configuration of the XE6, and a number of first-of-a-kind features implemented in collaboration with Cray have resulted in a high performance, usable, and reliable system.
Using Strassen's algorithm to accelerate the solution of linear systems
NASA Technical Reports Server (NTRS)
Bailey, David H.; Lee, King; Simon, Horst D.
1990-01-01
Strassen's algorithm for fast matrix-matrix multiplication has been implemented for matrices of arbitrary shapes on the CRAY-2 and CRAY Y-MP supercomputers. Several techniques have been used to reduce the scratch space requirement for this algorithm while simultaneously preserving a high level of performance. When the resulting Strassen-based matrix multiply routine is combined with some routines from the new LAPACK library, LU decomposition can be performed with rates significantly higher than those achieved by conventional means. We succeeded in factoring a 2048 x 2048 matrix on the CRAY Y-MP at a rate equivalent to 325 MFLOPS.
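For reference, the standard Strassen recursion (not specific to this implementation) replaces the eight block products of conventional 2 x 2 block multiplication with seven:

\[
\begin{aligned}
P_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & P_2 &= (A_{21}+A_{22})\,B_{11}, \\
P_3 &= A_{11}(B_{12}-B_{22}), & P_4 &= A_{22}(B_{21}-B_{11}), \\
P_5 &= (A_{11}+A_{12})\,B_{22}, & P_6 &= (A_{21}-A_{11})(B_{11}+B_{12}), \\
P_7 &= (A_{12}-A_{22})(B_{21}+B_{22}),
\end{aligned}
\]
\[
C_{11} = P_1+P_4-P_5+P_7, \qquad C_{12} = P_3+P_5, \qquad C_{21} = P_2+P_4, \qquad C_{22} = P_1-P_2+P_3+P_6,
\]

giving an operation count of \(O(n^{\log_2 7}) \approx O(n^{2.81})\); the extra block sums are the source of the scratch-space requirement the abstract describes.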
Use of Continuous Integration Tools for Application Performance Monitoring
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vergara Larrea, Veronica G; Joubert, Wayne; Fuson, Christopher B
High performance computing systems are becoming increasingly complex, both in node architecture and in the multiple layers of software stack required to compile and run applications. As a consequence, the likelihood is increasing for application performance regressions to occur as a result of routine upgrades of system software components which interact in complex ways. The purpose of this study is to evaluate the effectiveness of continuous integration tools for application performance monitoring on HPC systems. In addition, this paper also describes a prototype system for application performance monitoring based on Jenkins, a Java-based continuous integration tool. The monitoring system described leverages several features in Jenkins to track application performance results over time. Preliminary results and lessons learned from monitoring applications on Cray systems at the Oak Ridge Leadership Computing Facility are presented.
Enabling Diverse Software Stacks on Supercomputers using High Performance Virtual Clusters.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Younge, Andrew J.; Pedretti, Kevin; Grant, Ryan
While large-scale simulations have been the hallmark of the High Performance Computing (HPC) community for decades, Large Scale Data Analytics (LSDA) workloads are gaining attention within the scientific community not only as a processing component to large HPC simulations, but also as standalone scientific tools for knowledge discovery. With the path towards Exascale, new HPC runtime systems are also emerging in a way that differs from classical distributed computing models. However, system software for such capabilities on the latest extreme-scale DOE supercomputing needs to be enhanced to more appropriately support these types of emerging software ecosystems. In this paper, we propose the use of Virtual Clusters on advanced supercomputing resources to enable systems to support not only HPC workloads, but also emerging big data stacks. Specifically, we have deployed the KVM hypervisor within Cray's Compute Node Linux on an XC-series supercomputer testbed. We also use libvirt and QEMU to manage and provision VMs directly on compute nodes, leveraging Ethernet-over-Aries network emulation. To our knowledge, this is the first known use of KVM on a true MPP supercomputer. We investigate the overhead of our solution using HPC benchmarks, both evaluating single-node performance as well as weak scaling of a 32-node virtual cluster. Overall, we find single node performance of our solution using KVM on a Cray is very efficient with near-native performance. However, overhead increases by up to 20% as virtual cluster size increases, due to limitations of the Ethernet-over-Aries bridged network. Furthermore, we deploy Apache Spark with large data analysis workloads in a Virtual Cluster, effectively demonstrating how diverse software ecosystems can be supported by High Performance Virtual Clusters.
Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Secchi, Simone; Tumeo, Antonino; Villa, Oreste
Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems in which a large virtually-shared address space is mapped onto a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has proved to be a promising approach to achieve good accuracy in reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.
A Pacific Ocean general circulation model for satellite data assimilation
NASA Technical Reports Server (NTRS)
Chao, Y.; Halpern, D.; Mechoso, C. R.
1991-01-01
A tropical Pacific Ocean General Circulation Model (OGCM) to be used in satellite data assimilation studies is described. The transfer of the OGCM from a CYBER-205 at NOAA's Geophysical Fluid Dynamics Laboratory to a CRAY-2 at NASA's Ames Research Center is documented. Two 3-year model integrations from identical initial conditions but performed on those two computers are compared. The model simulations are very similar to each other, as expected, but the simulation performed on the higher-precision CRAY-2 is smoother than that on the lower-precision CYBER-205. The CYBER-205 and CRAY-2 use 32- and 64-bit mantissa arithmetic, respectively. The major features of the oceanic circulation in the tropical Pacific, namely the North Equatorial Current, the North Equatorial Countercurrent, the South Equatorial Current, and the Equatorial Undercurrent, are realistically reproduced and their seasonal cycles are described. The OGCM provides a powerful tool for the study of tropical oceans and for the assimilation of satellite altimetry data.
NASA Technical Reports Server (NTRS)
Swisshelm, Julie M.
1989-01-01
An explicit flow solver, applicable to the hierarchy of model equations ranging from Euler to full Navier-Stokes, is combined with several techniques designed to reduce computational expense. The computational domain consists of local grid refinements embedded in a global coarse mesh, where the locations of these refinements are defined by the physics of the flow. Flow characteristics are also used to determine which set of model equations is appropriate for solution in each region, thereby reducing not only the number of grid points at which the solution must be obtained, but also the computational effort required to get that solution. Acceleration to steady-state is achieved by applying multigrid on each of the subgrids, regardless of the particular model equations being solved. Since each of these components is explicit, advantage can readily be taken of the vector- and parallel-processing capabilities of machines such as the Cray X-MP and Cray-2.
Using a Cray Y-MP as an array processor for a RISC Workstation
NASA Technical Reports Server (NTRS)
Lamaster, Hugh; Rogallo, Sarah J.
1992-01-01
As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980's, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to use in a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation that can have a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment which demonstrates that matrix multiplication can be executed remotely on a large system to speed the execution over that experienced on a workstation is described.
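As a rough illustration of the amortization argument, the Python sketch below uses made-up throughput and bandwidth figures (not measured Y-MP or workstation numbers) to compare multiplying two n x n matrices locally against shipping them over a network and multiplying remotely; the cubic growth of arithmetic against the quadratic growth of data moved is what makes the RPC worthwhile for large n.

    def offload_pays_off(n, local_flops=1e7, remote_flops=3e8,
                         net_bytes_per_sec=1e6, word_bytes=8):
        # 2*n^3 floating point operations for an n x n matrix multiply,
        # 3*n^2 words moved (send A and B, receive C)
        flops = 2.0 * n ** 3
        bytes_moved = 3.0 * n ** 2 * word_bytes
        t_local = flops / local_flops
        t_remote = flops / remote_flops + bytes_moved / net_bytes_per_sec
        return t_remote < t_local

    for n in (64, 256, 1024):
        print(n, offload_pays_off(n))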
Multitasking and microtasking experience on the NAS Cray-2 and ACF Cray X-MP
NASA Technical Reports Server (NTRS)
Raiszadeh, Farhad
1987-01-01
The fast Fourier transform (FFT) kernel of the NAS benchmark program has been utilized to experiment with the multitasking library on the Cray-2 and Cray X-MP/48, and microtasking directives on the Cray X-MP. Some performance figures are shown, and the state of multitasking software is described.
Comparison of scientific computing platforms for MCNP4A Monte Carlo calculations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hendricks, J.S.; Brockhoff, R.C.
1994-04-01
The performance of seven computer platforms is evaluated with the widely used and internationally available MCNP4A Monte Carlo radiation transport code. All results are reproducible and are presented in such a way as to enable comparison with computer platforms not in the study. The authors observed that the HP/9000-735 workstation runs MCNP 50% faster than the Cray YMP 8/64. Compared with the Cray YMP 8/64, the IBM RS/6000-560 is 68% as fast, the Sun Sparc10 is 66% as fast, the Silicon Graphics ONYX is 90% as fast, the Gateway 2000 model 4DX2-66V personal computer is 27% as fast, and the Sun Sparc2 is 24% as fast. In addition to comparing the timing performance of the seven platforms, the authors observe that changes in compilers and software over the past 2 yr have resulted in only modest performance improvements, hardware improvements have enhanced performance by less than a factor of approximately 3, timing studies are very problem dependent, and MCNP4A runs about as fast as MCNP4.
Mining Software Usage with the Automatic Library Tracking Database (ALTD)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hadri, Bilel; Fahey, Mark R
2013-01-01
Tracking software usage is important for HPC centers, computer vendors, code developers and funding agencies to provide more efficient and targeted software support, and to forecast needs and guide HPC software effort towards the Exascale era. However, accurately tracking software usage on HPC systems has been a challenging task. In this paper, we present a tool called Automatic Library Tracking Database (ALTD) that has been developed and put in production on several Cray systems. The ALTD infrastructure prototype automatically and transparently stores information about libraries linked into an application at compilation time and also the executables launched in a batch job. We will illustrate the usage of libraries, compilers and third party software applications on a system managed by the National Institute for Computational Sciences.
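The following Python sketch illustrates only the general idea of intercepting a link step and logging the libraries that were requested; the log location, record format, and the bare call to ld are hypothetical, whereas ALTD itself wraps the compiler and job-launch wrappers on Cray systems and stores its records in a central database.

    import datetime
    import json
    import os
    import subprocess
    import sys

    LOG = os.path.expanduser("~/.link_log.jsonl")   # hypothetical log location

    def main():
        args = sys.argv[1:]
        # keep only the arguments that name libraries
        libs = [a for a in args
                if a.startswith("-l") or a.endswith((".a", ".so"))]
        with open(LOG, "a") as f:
            f.write(json.dumps({"time": datetime.datetime.now().isoformat(),
                                "libraries": libs}) + "\n")
        # forward everything to the real linker unchanged
        sys.exit(subprocess.call(["ld"] + args))

    if __name__ == "__main__":
        main()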
Performance of a plasma fluid code on the Intel parallel computers
NASA Technical Reports Server (NTRS)
Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.
1992-01-01
One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchstone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchstone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.
Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2
NASA Technical Reports Server (NTRS)
Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad
1995-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that the PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the computational cost is dominated by the FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32 nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32-node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.
Three-dimensional transonic potential flow about complex 3-dimensional configurations
NASA Technical Reports Server (NTRS)
Reyhner, T. A.
1984-01-01
An analysis has been developed and a computer code written to predict three-dimensional subsonic or transonic potential flow fields about lifting or nonlifting configurations. Possible configurations include inlets, nacelles, nacelles with ground planes, S-ducts, turboprop nacelles, wings, and wing-pylon-nacelle combinations. The solution of the full partial differential equation for compressible potential flow written in terms of a velocity potential is obtained using finite differences, line relaxation, and multigrid. The analysis uses either a cylindrical or Cartesian coordinate system. The computational mesh is not body fitted. The analysis has been programmed in FORTRAN for both the CDC CYBER 203 and the CRAY-1 computers. Comparisons of computed results with experimental measurement are presented. Descriptions of the program input and output formats are included.
Tuning HDF5 subfiling performance on parallel file systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Byna, Suren; Chaarawi, Mohamad; Koziol, Quincey
Subfiling is a technique used on parallel file systems to reduce locking and contention issues when multiple compute nodes interact with the same storage target node. Subfiling provides a compromise between the single shared file approach that instigates the lock contention problems on parallel file systems and having one file per process, which results in generating a massive and unmanageable number of files. In this paper, we evaluate and tune the performance of the recently implemented subfiling feature in HDF5. Specifically, we explain the implementation strategy of the subfiling feature in HDF5, provide examples of using the feature, and evaluate and tune parallel I/O performance of this feature with the parallel file systems of the Cray XC40 system at NERSC (Cori), which include a burst buffer storage and a Lustre disk-based storage. We also evaluate I/O performance on the Cray XC30 system, Edison, at NERSC. Our results show a performance advantage of 1.2X to 6X with subfiling compared to writing a single shared HDF5 file. We present our exploration of configurations, such as the number of subfiles and the number of Lustre storage targets used to store files, as optimization parameters to obtain superior I/O performance. Based on this exploration, we discuss recommendations for achieving good I/O performance as well as limitations of using the subfiling feature.
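The sketch below illustrates only the concept of subfiling, using mpi4py and a parallel build of h5py: ranks are split into groups and each group writes its own HDF5 file instead of a single shared one. The group size and file names are arbitrary assumptions; the actual HDF5 subfiling feature evaluated in the paper is implemented inside the library and is configured differently.

    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    ranks_per_subfile = 8                      # assumed grouping factor
    color = comm.rank // ranks_per_subfile
    sub = comm.Split(color, comm.rank)         # communicator for one subfile

    data = np.full(1024, comm.rank, dtype="f8")
    # each group writes collectively to its own file, reducing lock contention
    with h5py.File("part_%d.h5" % color, "w", driver="mpio", comm=sub) as f:
        dset = f.create_dataset("x", (sub.size, data.size), dtype="f8")
        dset[sub.rank, :] = data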
LARCRIM user's guide, version 1.0
NASA Technical Reports Server (NTRS)
Davis, John S.; Heaphy, William J.
1993-01-01
LARCRIM is a relational database management system (RDBMS) which performs the conventional duties of an RDBMS with the added feature that it can store attributes which consist of arrays or matrices. This makes it particularly valuable for scientific data management. It is accessible as a stand-alone system and through an application program interface. The stand-alone system may be executed in two modes: menu or command. The menu mode prompts the user for the input required to create, update, and/or query the database. The command mode requires the direct input of LARCRIM commands. Although LARCRIM is an update of an old database family, its performance on modern computers is quite satisfactory. LARCRIM is written in FORTRAN 77 and runs under the UNIX operating system. Versions have been released for the following computers: SUN (3 & 4), Convex, IRIS, Hewlett-Packard, CRAY 2 & Y-MP.
Improved Access to Supercomputers Boosts Chemical Applications.
ERIC Educational Resources Information Center
Borman, Stu
1989-01-01
Supercomputing is described in terms of computing power and abilities. The increase in availability of supercomputers for use in chemical calculations and modeling is reported. Efforts of the National Science Foundation and Cray Research are highlighted. (CW)
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver
NASA Technical Reports Server (NTRS)
Ajmani, Kumud; Taylor, Arthur C., III
1994-01-01
This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier-Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
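For readers unfamiliar with the solver, the following is a small serial SciPy sketch of a preconditioned GMRES solve on a toy tridiagonal system; the matrix stands in for a linearized flow Jacobian and the right-hand side for a sensitivity source term, and none of it reflects the paper's parallel, distributed-memory implementation.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import LinearOperator, gmres, spilu

    n = 200
    A = diags([-1.0, 2.1, -1.0], offsets=[-1, 0, 1],
              shape=(n, n), format="csc")      # stand-in for the flow Jacobian
    b = np.ones(n)                             # stand-in for the sensitivity right-hand side

    ilu = spilu(A)                             # incomplete-LU preconditioner
    M = LinearOperator((n, n), ilu.solve)
    x, info = gmres(A, b, M=M)
    print("converged" if info == 0 else "not converged")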
Parallelization of ARC3D with Computer-Aided Tools
NASA Technical Reports Server (NTRS)
Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SGI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools CAPTools. Steps of parallelizing this code and requirements for achieving better performance are discussed. The generated parallel version has achieved reasonably good performance, for example, a speedup of 30 for 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts of the automated parallelization process, although user intervention in many of these parts is still necessary. Nevertheless, development and improvement of useful software tools, such as CAPTools, can help trim down many tedious parallelization details and improve the processing efficiency.
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1998-01-01
A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
Scaling of data communications for an advanced supercomputer network
NASA Technical Reports Server (NTRS)
Levin, E.; Eaton, C. K.; Young, Bruce
1986-01-01
The goal of NASA's Numerical Aerodynamic Simulation (NAS) Program is to provide a powerful computational environment for advanced research and development in aeronautics and related disciplines. The present NAS system consists of a Cray 2 supercomputer connected by a data network to a large mass storage system, to sophisticated local graphics workstations and by remote communication to researchers throughout the United States. The program plan is to continue acquiring the most powerful supercomputers as they become available. The implications of a projected 20-fold increase in processing power on the data communications requirements are described.
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
Performance Analysis of the NAS Y-MP Workload
NASA Technical Reports Server (NTRS)
Bergeron, Robert J.; Kutler, Paul (Technical Monitor)
1997-01-01
This paper describes the performance characteristics of the computational workloads on the NAS Cray Y-MP machines, a Y-MP 832 and later a Y-MP 8128. Hardware measurements indicated that the Y-MP workload performance matured over time, ultimately sustaining an average throughput of 0.8 GFLOPS and a vector operation fraction of 87%. The measurements also revealed an operation rate exceeding 1 per clock period, a well-balanced architecture featuring a strong utilization of vector functional units, and an efficient memory organization. Introduction of the larger memory 8128 increased throughput by allowing a more efficient utilization of CPUs. Throughput also depended on the metering of the batch queues; low-idle Saturday workloads required a buffer of small jobs to prevent memory starvation of the CPU. UNICOS required about 7% of total CPU time to service the 832 workloads; this overhead decreased to 5% for the 8128 workloads. While most of the system time went to service I/O requests, efficient scheduling prevented excessive idle due to I/O wait. System measurements disclosed no obvious bottlenecks in the response of the machine and UNICOS to the workloads. In most cases, Cray-provided software tools were quite sufficient for measuring the performance of both the machine and the operating system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Werner, N.E.; Van Matre, S.W.
1985-05-01
This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.
Understanding the Cray X1 System
NASA Technical Reports Server (NTRS)
Cheung, Samson
2004-01-01
This paper helps the reader understand the characteristics of the Cray X1 vector supercomputer system, and provides hints and information to enable the reader to port codes to the system. It provides a comparison between the basic performance of the X1 platform and other platforms that are available at NASA Ames Research Center. A set of codes, solving the Laplacian equation with different parallel paradigms, is used to understand some features of the X1 compiler. An example code from the NAS Parallel Benchmarks is used to demonstrate performance optimization on the X1 platform.
1986-10-10
Ames Director William 'Bill' Ballhaus (center left) joins visitor Sir Jeffrey Pope from the Royal Aircraft Industry, England (center right) at the NAS Facility Cray 2 computer with Ron Deiss, NAS Deputy Manager (L) and Vic Peterson, Ames Deputy Director (R).
Supercomputers for engineering analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Goudreau, G.L.; Benson, D.J.; Hallquist, J.O.
1986-07-01
The Cray-1 and Cray X-MP/48 experience in engineering computations at the Lawrence Livermore National Laboratory is surveyed. The fully vectorized explicit DYNA and implicit NIKE finite element codes are discussed with respect to solid and structural mechanics. The main efficiencies for production analyses are currently obtained by simple CFT compiler exploitation of pipeline architecture for inner do-loop optimization. Current development of outer-loop multitasking is also discussed. Applications emphasis will be on 3D examples spanning earth penetrator loads analysis, target lethality assessment, and crashworthiness. The use of a vectorized large deformation shell element in both DYNA and NIKE has substantially expanded 3D nonlinear capability. 25 refs., 7 figs.
Lightweight computational steering of very large scale molecular dynamics simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Beazley, D.M.; Lomdahl, P.S.
1996-09-01
We present a computational steering approach for controlling, analyzing, and visualizing very large scale molecular dynamics simulations involving tens to hundreds of millions of atoms. Our approach relies on extensible scripting languages and an easy to use tool for building extensions and modules. The system is extremely easy to modify, works with existing C code, is memory efficient, and can be used from inexpensive workstations and networks. We demonstrate how we have used this system to manipulate data from production MD simulations involving as many as 104 million atoms running on the CM-5 and Cray T3D. We also show how this approach can be used to build systems that integrate common scripting languages (including Tcl/Tk, Perl, and Python), simulation code, user extensions, and commercial data analysis packages.
Designing Next Generation Massively Multithreaded Architectures for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumeo, Antonino; Secchi, Simone; Villa, Oreste
Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory reference aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.
NASA Astrophysics Data System (ADS)
Tripathi, Vijay S.; Yeh, G. T.
1993-06-01
Sophisticated and highly computation-intensive models of transport of reactive contaminants in groundwater have been developed in recent years. Application of such models to real-world contaminant transport problems, e.g., simulation of groundwater transport of 10-15 chemically reactive elements (e.g., toxic metals) and relevant complexes and minerals in two and three dimensions over a distance of several hundred meters, requires high-performance computers including supercomputers. Although not widely recognized as such, the computational complexity and demand of these models compare with well-known computation-intensive applications including weather forecasting and quantum chemical calculations. A survey of the performance of a variety of available hardware, as measured by the run times for a reactive transport model HYDROGEOCHEM, showed that while supercomputers provide the fastest execution times for such problems, relatively low-cost reduced instruction set computer (RISC) based scalar computers provide the best performance-to-price ratio. Because supercomputers like the Cray X-MP are inherently multiuser resources, often the RISC computers also provide much better turnaround times. Furthermore, RISC-based workstations provide the best platforms for "visualization" of groundwater flow and contaminant plumes. The most notable result, however, is that current workstations costing less than $10,000 provide performance within a factor of 5 of a Cray X-MP.
Parallelization of the FLAPW method
NASA Astrophysics Data System (ADS)
Canning, A.; Mannstadt, W.; Freeman, A. J.
2000-08-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.
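A schematic of the distribution idea, in Python with mpi4py, is shown below: each processor owns a slice of every state's plane-wave coefficient vector, so an overlap between two states is computed locally and completed with a reduction. The vector length and random coefficients are placeholders; the actual FLAPW code distributes its basis in Fortran and layers much more on top of this.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    n_pw_total = 1 << 14                        # assumed number of plane-wave components
    n_local = n_pw_total // comm.size           # slice owned by this processor

    rng = np.random.default_rng(comm.rank)
    psi_a = rng.standard_normal(n_local)        # local pieces of two states
    psi_b = rng.standard_normal(n_local)

    # overlap <a|b>: local partial dot product, then a global sum
    local_dot = float(np.dot(psi_a, psi_b))
    overlap = comm.allreduce(local_dot, op=MPI.SUM)
    if comm.rank == 0:
        print(overlap)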
Adaptation of MSC/NASTRAN to a supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gloudeman, J.F.; Hodge, J.C.
1982-01-01
MSC/NASTRAN is a large-scale general purpose digital computer program which solves a wide variety of engineering analysis problems by the finite element method. The program capabilities include static and dynamic structural analysis (linear and nonlinear), heat transfer, acoustics, electromagnetism and other types of field problems. It is used worldwide by large and small companies in such diverse fields as automotive, aerospace, civil engineering, shipbuilding, offshore oil, industrial equipment, chemical engineering, biomedical research, optics and government research. The paper presents the significant aspects of the adaptation of MSC/NASTRAN to the Cray-1. First, the general architecture and predominant functional use of MSC/NASTRAN are discussed to help explain the imperatives and the challenges of this undertaking. The key characteristics of the Cray-1 which influenced the decision to undertake this effort are then reviewed to help identify performance targets. An overview of the MSC/NASTRAN adaptation effort is then given to help define the scope of the project. Finally, some measures of MSC/NASTRAN's operational performance on the Cray-1 are given, along with a few guidelines to help avoid improper interpretation. 17 references.
Optimization strategies for molecular dynamics programs on Cray computers and scalar work stations
NASA Astrophysics Data System (ADS)
Unekis, Michael J.; Rice, Betsy M.
1994-12-01
We present results of timing runs and different optimization strategies for a prototype molecular dynamics program that simulates shock waves in a two-dimensional (2-D) model of a reactive energetic solid. The performance of the program may be improved substantially by simple changes to the Fortran or by employing various vendor-supplied compiler optimizations. The optimum strategy varies among the machines used and will vary depending upon the details of the program. The effect of various compiler options and vendor-supplied subroutine calls is demonstrated. Comparison is made between two scalar workstations (IBM RS/6000 Model 370 and Model 530) and several Cray supercomputers (X-MP/48, Y-MP8/128, and C-90/16256). We find that for a scientific application program dominated by sequential, scalar statements, a relatively inexpensive high-end workstation such as the IBM RS/6000 RISC series will outperform single-processor performance of the Cray X-MP/48 and perform competitively with single-processor performance of the Y-MP8/128 and C-90/16256.
Porting the AVS/Express scientific visualization software to Cray XT4.
Leaver, George W; Turner, Martin J; Perrin, James S; Mummery, Paul M; Withers, Philip J
2011-08-28
Remote scientific visualization, where rendering services are provided by larger scale systems than are available on the desktop, is becoming increasingly important as dataset sizes increase beyond the capabilities of desktop workstations. Uptake of such services relies on access to suitable visualization applications and the ability to view the resulting visualization in a convenient form. We consider five rules from the e-Science community to meet these goals with the porting of a commercial visualization package to a large-scale system. The application uses message-passing interface (MPI) to distribute data among data processing and rendering processes. The use of MPI in such an interactive application is not compatible with restrictions imposed by the Cray system being considered. We present details, and performance analysis, of a new MPI proxy method that allows the application to run within the Cray environment yet still support MPI communication required by the application. Example use cases from materials science are considered.
Close to real life. [solving for transonic flow about lifting airfoils using supercomputers]
NASA Technical Reports Server (NTRS)
Peterson, Victor L.; Bailey, F. Ron
1988-01-01
NASA's Numerical Aerodynamic Simulation (NAS) facility for CFD modeling of highly complex aerodynamic flows employs as its basic hardware two Cray-2s, an ETA-10 Model Q, an Amdahl 5880 mainframe computer that furnishes both support processing and access to 300 Gbytes of disk storage, several minicomputers and superminicomputers, and a Thinking Machines 16,000-device 'connection machine' processor. NAS, which was the first supercomputer facility to standardize operating-system and communication software on all processors, has done important Space Shuttle aerodynamics simulations and will be critical to the configurational refinement of the National Aerospace Plane and its integrated powerplant, which will involve complex, high temperature reactive gasdynamic computations.
Extensions and improvements on XTRAN3S
NASA Technical Reports Server (NTRS)
Borland, C. J.
1989-01-01
Improvements to the XTRAN3S computer program are summarized. Work on this code, for steady and unsteady aerodynamic and aeroelastic analysis in the transonic flow regime has concentrated on the following areas: (1) Maintenance of the XTRAN3S code, including correction of errors, enhancement of operational capability, and installation on the Cray X-MP system; (2) Extension of the vectorization concepts in XTRAN3S to include additional areas of the code for improved execution speed; (3) Modification of the XTRAN3S algorithm for improved numerical stability for swept, tapered wing cases and improved computational efficiency; and (4) Extension of the wing-only version of XTRAN3S to include pylon and nacelle or external store capability.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Development of a Dynamic Time Sharing Scheduled Environment Final Report CRADA No. TC-824-94E
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jette, M.; Caliga, D.
Massively parallel computers, such as the Cray T3D, have historically supported resource sharing solely with space sharing. In that method, multiple problems are solved by executing them on distinct processors. This project developed a dynamic time- and space-sharing scheduler to achieve greater interactivity and throughput than could be achieved with space-sharing alone. CRI and LLNL worked together on the design, testing, and review aspects of this project. There were separate software deliverables. CRI implemented a general purpose scheduling system as per the design specifications. LLNL ported the local gang scheduler software to the LLNL Cray T3D. In this approach, processors are allocated simultaneously to all components of a parallel program (in a "gang"). Program execution is preempted as needed to provide for interactivity. Programs are also relocated to different processors as needed to efficiently pack the computer's torus of processors. In phase one, CRI developed an interface specification after discussions with LLNL for system-level software supporting a time- and space-sharing environment on the LLNL T3D. The two parties also discussed interface specifications for external control tools (such as scheduling policy tools and system administration tools) and applications programs. CRI assumed responsibility for the writing and implementation of all the necessary system software in this phase. In phase two, CRI implemented job-rolling on the Cray T3D, a mechanism for preempting a program, saving its state to disk, and later restoring its state to memory for continued execution. LLNL ported its gang scheduler to the LLNL T3D utilizing the CRI interface implemented in phases one and two. During phase three, the functionality and effectiveness of the LLNL gang scheduler was assessed to provide input to CRI time- and space-sharing efforts. CRI will utilize this information in the development of general schedulers suitable for other sites and future architectures.
The Science of Computing: Virtual Memory
NASA Technical Reports Server (NTRS)
Denning, Peter J.
1986-01-01
In the March-April issue, I described how a computer's storage system is organized as a hierarchy consisting of cache, main memory, and secondary memory (e.g., disk). The cache and main memory form a subsystem that functions like main memory but attains speeds approaching cache. What happens if a program and its data are too large for the main memory? This is not a frivolous question. Every generation of computer users has been frustrated by insufficient memory. A new line of computers may have sufficient storage for the computations of its predecessor, but new programs will soon exhaust its capacity. In 1960, a long-range planning committee at MIT dared to dream of a computer with 1 million words of main memory. In 1985, the Cray-2 was delivered with 256 million words. Computational physicists dream of computers with 1 billion words. Computer architects have done an outstanding job of enlarging main memories, yet they have never kept up with demand. Only the shortsighted believe they can.
Smart active pilot-in-the-loop systems
NASA Astrophysics Data System (ADS)
Thomas, Segun
1995-04-01
Representation of the on-orbit microgravity environment in a 1-g environment is a continuing problem in space engineering analysis, procedures development and crew training. A way of adequately depicting weightlessness in the performance of on-orbit tasks is by a realistic (or real-time) computer based representation that provides the look, touch, and feel of on-orbit operation. This paper describes how a facility, the Systems Engineering Simulator at the Johnson Space Center, is utilizing recent advances in computer processing power and multiprocessing capability to intelligently represent all systems, sub-systems and environmental elements associated with space flight operations. It first describes the computer hardware and interconnection between processors; the computer software responsible for task scheduling, health monitoring, sub-system and environment representation; and the control room and crew station. It then describes the mathematical models that represent the dynamics of contact between the Mir and the Space Shuttle during the upcoming US and Russian Shuttle/Mir space mission. Results are presented comparing the response of the smart, active pilot-in-the-loop system to a non-time-critical CRAY model. A final example of how these systems are utilized is given in the development that supported the highly successful Hubble Space Telescope repair mission.
Parallelization of the FLAPW method and comparison with the PPW method
NASA Astrophysics Data System (ADS)
Canning, Andrew; Mannstadt, Wolfgang; Freeman, Arthur
2000-03-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining electronic and magnetic properties of crystals and surfaces. In the past the FLAPW method has been limited to systems of about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell running on up to 512 processors on a Cray T3E parallel supercomputer. Some results will also be presented on a comparison of the plane-wave pseudopotential method and the FLAPW method on large systems.
NASA Technical Reports Server (NTRS)
Houston, Johnny L.
1990-01-01
Program EAGLE (Eglin Arbitrary Geometry Implicit Euler) is a multiblock grid generation and steady-state flow solver system. This system combines a boundary conforming surface generation, a composite block structure grid generation scheme, and a multiblock implicit Euler flow solver algorithm. The three codes are intended to be used sequentially from the definition of the configuration under study to the flow solution about the configuration. EAGLE was specifically designed to aid in the analysis of both freestream and interference flow field configurations. These configurations can be comprised of single or multiple bodies ranging from simple axisymmetric airframes to complex aircraft shapes with external weapons. Each body can be arbitrarily shaped with or without multiple lifting surfaces. Program EAGLE is written to compile and execute efficiently on any CRAY machine with or without Solid State Disk (SSD) devices. Also, the code uses namelist inputs which are supported by all CRAY machines using the FORTRAN compiler CFT77. The use of namelist inputs makes it easier for the user to understand the inputs and to operate Program EAGLE. Recently, the code was modified to operate on other computers, especially the Sun Sparc4 workstation. Several two-dimensional grid configurations were completely and successfully developed using EAGLE. Currently, EAGLE is being used for three-dimensional grid applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
National Energy Research Supercomputing Center; He, Yun; Kramer, William T.C.
2008-05-07
The newest workhorse of the National Energy Research Scientific Computing Center is a Cray XT4 with 9,736 dual-core nodes. This paper summarizes Franklin user experiences from the friendly early user period through the production period. Selected successful user stories along with top issues affecting user experiences are presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Doerfler, Douglas; Austin, Brian; Cook, Brandon
There are many potential issues associated with deploying the Intel Xeon Phi (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi core is a fraction of a Xeon core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.
An Automated Parallel Image Registration Technique Based on the Correlation of Wavelet Features
NASA Technical Reports Server (NTRS)
LeMoigne, Jacqueline; Campbell, William J.; Cromp, Robert F.; Zukor, Dorothy (Technical Monitor)
2001-01-01
With the increasing importance of multiple platform/multiple remote sensing missions, fast and automatic integration of digital data from disparate sources has become critical to the success of these endeavors. Our work utilizes maxima of wavelet coefficients to form the basic features of a correlation-based automatic registration algorithm. Our wavelet-based registration algorithm is tested successfully with data from the National Oceanic and Atmospheric Administration (NOAA) Advanced Very High Resolution Radiometer (AVHRR) and the Landsat Thematic Mapper (TM), which differ by translation and/or rotation. By the choice of high-frequency wavelet features, this method is similar to an edge-based correlation method, but by exploiting the multi-resolution nature of a wavelet decomposition, our method achieves higher computational speeds for comparable accuracies. This algorithm has been implemented on a Single Instruction Multiple Data (SIMD) massively parallel computer, the MasPar MP-2, as well as on the Cray T3D, the Cray T3E and a Beowulf cluster of Pentium workstations.
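To illustrate the feature choice, the sketch below (Python, PyWavelets, SciPy) registers two images that differ by a small translation only: the magnitudes of the coarsest detail coefficients serve as edge-like features and a cross-correlation peak gives the shift on the decimated grid. The wavelet choice, level, and translation-only restriction are simplifying assumptions; the algorithm above also handles rotation and was implemented on parallel hardware.

    import numpy as np
    import pywt
    from scipy.signal import fftconvolve

    def coarse_shift(ref, img, wavelet="db2", level=3):
        def features(x):
            coeffs = pywt.wavedec2(x, wavelet, level=level)
            ch, cv, cd = coeffs[1]             # detail coefficients at the coarsest level
            return np.abs(ch) + np.abs(cv) + np.abs(cd)

        f_ref, f_img = features(ref), features(img)
        # cross-correlation via convolution with a flipped template
        corr = fftconvolve(f_ref, f_img[::-1, ::-1], mode="full")
        peak = np.unravel_index(np.argmax(corr), corr.shape)
        dy = peak[0] - (f_img.shape[0] - 1)
        dx = peak[1] - (f_img.shape[1] - 1)
        # shifts were measured on the decimated grid, so scale them back up
        return dy * 2 ** level, dx * 2 ** level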
The HARNESS Workbench: Unified and Adaptive Access to Diverse HPC Platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sunderam, Vaidy S.
2012-03-20
The primary goal of the Harness WorkBench (HWB) project is to investigate innovative software environments that will help enhance the overall productivity of applications science on diverse HPC platforms. Two complementary frameworks were designed: one, a virtualized command toolkit for application building, deployment, and execution, that provides a common view across diverse HPC systems, in particular the DOE leadership computing platforms (Cray, IBM, SGI, and clusters); and two, a unified runtime environment that consolidates access to runtime services via an adaptive framework for execution-time and post-processing activities. A prototype of the first was developed based on the concept of a 'system-call virtual machine' (SCVM), to enhance portability of the HPC application deployment process across heterogeneous high-end machines. The SCVM approach to portable builds is based on the insertion of toolkit-interpretable directives into original application build scripts. Modifications resulting from these directives preserve the semantics of the original build instruction flow. The execution of the build script is controlled by our toolkit, which intercepts build script commands in a manner transparent to the end-user. We have applied this approach to a scientific production code (Gamess-US) on the Cray-XT5 machine. The second facet, termed Unibus, aims to facilitate provisioning and aggregation of multifaceted resources from resource providers' and end-users' perspectives. To achieve that, Unibus proposes a Capability Model and mediators (resource drivers) to virtualize access to diverse resources, and soft and successive conditioning to enable automatic and user-transparent resource provisioning. A proof of concept implementation has demonstrated the viability of this approach on high-end machines, grid systems and computing clouds.
SHABERTH - ANALYSIS OF A SHAFT BEARING SYSTEM (CRAY VERSION)
NASA Technical Reports Server (NTRS)
Coe, H. H.
1994-01-01
The SHABERTH computer program was developed to predict operating characteristics of bearings in a multibearing load support system. Lubricated and non-lubricated bearings can be modeled. SHABERTH calculates the loads, torques, temperatures, and fatigue life for ball and/or roller bearings on a single shaft. The program also allows for an analysis of the system reaction to the termination of lubricant supply to the bearings and other lubricated mechanical elements. SHABERTH has proven to be a valuable tool in the design and analysis of shaft bearing systems. The SHABERTH program is structured with four nested calculation schemes. The thermal scheme performs steady state and transient temperature calculations which predict system temperatures for a given operating state. The bearing dimensional equilibrium scheme uses the bearing temperatures, predicted by the temperature mapping subprograms, and the rolling element raceway load distribution, predicted by the bearing subprogram, to calculate bearing diametral clearance for a given operating state. The shaft-bearing system load equilibrium scheme calculates bearing inner ring positions relative to the respective outer rings such that the external loading applied to the shaft is brought into equilibrium by the rolling element loads which develop at each bearing inner ring for a given operating state. The bearing rolling element and cage load equilibrium scheme calculates the rolling element and cage equilibrium positions and rotational speeds based on the relative inner-outer ring positions, inertia effects, and friction conditions. The ball bearing subprograms in the current SHABERTH program have several model enhancements over similar programs. These enhancements include an elastohydrodynamic (EHD) film thickness model that accounts for thermal heating in the contact area and lubricant film starvation; a new model for traction combined with an asperity load sharing model; a model for the hydrodynamic rolling and shear forces in the inlet zone of lubricated contacts, which accounts for the degree of lubricant film starvation; modeling normal and friction forces between a ball and a cage pocket, which account for the transition between the hydrodynamic and elastohydrodynamic regimes of lubrication; and a model of the effect on fatigue life of the ratio of the EHD plateau film thickness to the composite surface roughness. SHABERTH is intended to be as general as possible. The models in SHABERTH allow for the complete mathematical simulation of real physical systems. Systems are limited to a maximum of five bearings supporting the shaft, a maximum of thirty rolling elements per bearing, and a maximum of one hundred temperature nodes. The SHABERTH program structure is modular and has been designed to permit refinement and replacement of various component models as the need and opportunities develop. A preprocessor is included in the IBM PC version of SHABERTH to provide a user friendly means of developing SHABERTH models and executing the resulting code. The preprocessor allows the user to create and modify data files with minimal effort and a reduced chance for errors. Data is utilized as it is entered; the preprocessor then decides what additional data is required to complete the model. Only this required information is requested. The preprocessor can accommodate data input for any SHABERTH compatible shaft bearing system model. The system may include ball bearings, roller bearings, and/or tapered roller bearings. 
SHABERTH is written in FORTRAN 77, and two machine versions are available from COSMIC. The CRAY version (LEW-14860) has a RAM requirement of 176K of 64 bit words. The IBM PC version (MFS-28818) is written for IBM PC series and compatible computers running MS-DOS, and includes a sample MS-DOS executable. For execution, the PC version requires at least 1Mb of RAM and an 80386 or 486 processor machine with an 80x87 math co-processor. The standard distribution medium for the IBM PC version is a set of two 5.25 inch 360K MS-DOS format diskettes. The contents of the diskettes are compressed using the PKWARE archiving tools. The utility to unarchive the files, PKUNZIP.EXE, is included. The standard distribution medium for the CRAY version is also a 5.25 inch 360K MS-DOS format diskette, but alternate distribution media and formats are available upon request. The original version of SHABERTH was developed in FORTRAN IV at Lewis Research Center for use on a UNIVAC 1100 series computer. The Cray version was released in 1988, and was updated in 1990 to incorporate fluid rheological data for Rocket Propellant 1 (RP-1), thereby allowing the analysis of bearings lubricated with RP-1. The PC version is a port of the 1990 CRAY version and was developed in 1992 by SRS Technologies under contract to NASA Marshall Space Flight Center.
Improving User Notification on Frequently Changing HPC Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fuson, Christopher B; Renaud, William A
2016-01-01
Today's HPC center user environments can be very complex. Centers often contain multiple large, complicated computational systems, each with its own user environment. Changes to a system's environment can be very impactful; however, a center's user environment is, in one way or another, frequently changing. Because of this, it is vital for centers to notify users of change. For users, untracked changes can be costly, resulting in unnecessary debug time as well as wasting valuable compute allocations and research time. Communicating frequent change to diverse user communities is a common and ongoing task for HPC centers. This paper will cover the OLCF's current processes and methods used to communicate change to users of the center's large Cray systems and supporting resources. The paper will share lessons learned and goals as well as practices, tools, and methods used to continually improve and reach members of the OLCF user community.
Compute Server Performance Results
NASA Technical Reports Server (NTRS)
Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)
1994-01-01
Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single processor rate of any vendor. However, if the price-performance ratio (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPRs of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite.
NASA Astrophysics Data System (ADS)
Georgiev, K.; Zlatev, Z.
2010-11-01
The Danish Eulerian Model (DEM) is an Eulerian model for studying the transport of air pollutants on a large scale. Originally, the model was developed at the National Environmental Research Institute of Denmark. The model computational domain covers Europe and some neighbouring parts of the Atlantic Ocean, Asia, and Africa. If the DEM is applied on fine grids, its discretization leads to a huge computational problem, which implies that such a model must be run only on high-performance computer architectures. The implementation and tuning of such a complex large-scale model on each different computer is a non-trivial task. Here, some comparison results of running this model on different kinds of vector computers (CRAY C92A, Fujitsu, etc.), parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, Macintosh G4 clusters, etc.), parallel computers with shared memory (SGI Origin, SUN, etc.), and parallel computers with two levels of parallelism (IBM SMP, IBM BlueGene/P, clusters of multiprocessor nodes, etc.) will be presented. The main idea in the parallel version of DEM is a domain partitioning approach. The effective use of the cache and hierarchical memories of modern computers, as well as the performance, speed-ups, and efficiency achieved, will be discussed. The parallel code of DEM, created by using the MPI standard library, appears to be highly portable and shows good efficiency and scalability on different kinds of vector and parallel computers. Some important applications of the computer model output are presented in short.
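A minimal sketch of the domain-partitioning idea, using mpi4py, is given below: the grid is cut into one strip per processor and a single halo cell is exchanged with each neighbour before a transport step. The strip decomposition, grid size, and dummy data are assumptions for illustration; DEM partitions its full multi-dimensional grid and does far more work per step.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.rank, comm.size

    n_global = 1024
    n_local = n_global // size                 # one strip of the domain per processor
    field = np.zeros(n_local + 2)              # interior cells plus two halo cells
    field[1:-1] = rank                         # dummy concentrations

    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    # exchange halo cells with both neighbours before each transport step
    comm.Sendrecv(field[1:2], dest=left, recvbuf=field[-1:], source=right)
    comm.Sendrecv(field[-2:-1], dest=right, recvbuf=field[:1], source=left)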
ERA 1103 UNIVAC 2 Calculating Machine
1955-09-21
The new 10-by 10-Foot Supersonic Wind Tunnel at the Lewis Flight Propulsion Laboratory included high-tech data acquisition and analysis systems. The reliable gathering of pressure, speed, temperature, and other data from test runs in the facilities was critical to the research process. Throughout the 1940s and early 1950s female employees, known as computers, recorded all test data and performed initial calculations by hand. The introduction of punch card computers in the late 1940s gradually reduced the number of hands-on calculations. In the mid-1950s new computational machines were installed in the office building of the 10-by 10-Foot tunnel. The new systems included this UNIVAC 1103 vacuum tube computer, the lab's first centralized computer system. The programming was done on paper tape and fed into the machine. The 10-by 10 computer center also included the Lewis-designed Computer Automated Digital Encoder (CADDE) and Digital Automated Multiple Pressure Recorder (DAMPR) systems, which converted test data to binary-coded decimal numbers and recorded test pressures automatically, respectively. The systems primarily served the 10-by 10, but were also applied to the other large facilities. Engineering Research Associates (ERA) developed the initial UNIVAC computer for the Navy in the late 1940s. In 1952 the company designed a commercial version, the UNIVAC 1103. The 1103 was the first computer designed by Seymour Cray and one of the first commercially successful scientific computers.
NASA Technical Reports Server (NTRS)
Cullimore, B.
1994-01-01
SINDA, the Systems Improved Numerical Differencing Analyzer, is a software system for solving lumped parameter representations of physical problems governed by diffusion-type equations. SINDA was originally designed for analyzing thermal systems represented in electrical analog, lumped parameter form, although its use may be extended to include other classes of physical systems which can be modeled in this form. As a thermal analyzer, SINDA can handle such interrelated phenomena as sublimation, diffuse radiation within enclosures, transport delay effects, and sensitivity analysis. FLUINT, the FLUid INTegrator, is an advanced one-dimensional fluid analysis program that solves arbitrary fluid flow networks. The working fluids can be single phase vapor, single phase liquid, or two phase. The SINDA'85/FLUINT system permits the mutual influences of thermal and fluid problems to be analyzed. The SINDA system consists of a programming language, a preprocessor, and a subroutine library. The SINDA language is designed for working with lumped parameter representations and finite difference solution techniques. The preprocessor accepts programs written in the SINDA language and converts them into standard FORTRAN. The SINDA library consists of a large number of FORTRAN subroutines that perform a variety of commonly needed actions. The use of these subroutines can greatly reduce the programming effort required to solve many problems. A complete run of a SINDA'85/FLUINT model is a four-step process. First, the user's desired model is run through the preprocessor, which writes out data files for the processor to read and translates the user's program code. Second, the translated code is compiled. The third step requires linking the user's code with the processor library. Finally, the processor is executed. SINDA'85/FLUINT supports models with up to 20,000 nodes, 100,000 conductors, 100 thermal submodels, and 10 fluid submodels. SINDA'85/FLUINT can also model two phase flow, capillary devices, user defined fluids, gravity and acceleration body forces on a fluid, and variable volumes. SINDA'85/FLUINT offers the following numerical solution techniques: the finite difference formulation of the explicit method is the forward-difference explicit approximation, and the formulation of the implicit method is the Crank-Nicolson approximation. The program allows simulation of non-uniform heating and facilitates modeling thin-walled heat exchangers. The ability to model non-equilibrium behavior within two-phase volumes is included. Recent improvements to the program were made in modeling real evaporator-pumps and other capillary-assist evaporators. SINDA'85/FLUINT is available by license for a period of ten (10) years to approved licensees. The licensed program product includes the source code and one copy of the supporting documentation. Additional copies of the documentation may be purchased separately at any time. SINDA'85/FLUINT is written in FORTRAN 77. Version 2.3 has been implemented on Cray series computers running UNICOS, CONVEX computers running CONVEX OS, and DEC RISC computers running ULTRIX. Binaries are included with the Cray version only. The Cray version of SINDA'85/FLUINT also contains SINGE, an additional graphics program developed at Johnson Space Center. Both source and executable code are provided for SINGE. Users wishing to create their own SINGE executable will also need the NASA Device Independent Graphics Library (NASADIG, previously known as SMDDIG; UNIX version, MSC-22001).
The Cray and CONVEX versions of SINDA'85/FLUINT are available on 9-track 1600 BPI UNIX tar format magnetic tapes. The CONVEX version is also available on a 0.25-inch streaming magnetic tape cartridge in UNIX tar format. The DEC RISC ULTRIX version is available on a TK50 magnetic tape cartridge in UNIX tar format. SINDA was developed in 1971, and fluid capability was first added in 1975. SINDA'85/FLUINT version 2.3 was released in 1990.
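For reference, the two time-integration schemes named above take the following standard textbook forms for a lumped-parameter equation dT/dt = f(T); this is generic notation, not SINDA'85/FLUINT's internal formulation:

    T^{n+1} = T^n + \Delta t \, f(T^n)                                      % forward-difference explicit
    T^{n+1} = T^n + \tfrac{\Delta t}{2} \left[ f(T^n) + f(T^{n+1}) \right]  % Crank-Nicolson (implicit)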
High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Adolf, Robert D.; Al-Saffar, Sinan
2010-10-04
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture and present the results of deploying it for the analysis of the Billion Triple dataset with respect to its semantic factors.
Multitasking the three-dimensional shock wave code CTH on the Cray X-MP/416
DOE Office of Scientific and Technical Information (OSTI.GOV)
McGlaun, J.M.; Thompson, S.L.
1988-01-01
CTH is a software system under development at Sandia National Laboratories Albuquerque that models multidimensional, multi-material, large-deformation, strong shock wave physics. CTH was carefully designed to both vectorize and multitask on the Cray X-MP/416. All of the physics routines are vectorized except the thermodynamics and the interface tracer. All of the physics routines are multitasked except the boundary conditions. The Los Alamos National Laboratory multitasking library was used for the multitasking. The resulting code is easy to maintain, easy to understand, gives the same answers as the unitasked code, and achieves a measured speedup of approximately 3.5 on the four-CPU Cray. This document discusses the design, prototyping, development, and debugging of CTH. It also covers the architectural features of CTH that enhance multitasking, the granularity of the tasks, and the synchronization of tasks. The utility of system software such as simulators and interactive debuggers is also discussed. 5 refs., 7 tabs.
A leap forward with UTK's Cray XC30
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fahey, Mark R
2014-01-01
This paper shows a significant productivity leap for several science groups and the accomplishments they have made to date on Darter - a Cray XC30 at the University of Tennessee Knoxville. The increased productivity is due to faster processors and a faster interconnect combined in a new generation from Cray, which nevertheless retains a programming environment very similar to that of previous generations of Cray machines, making porting easy.
Implementing TCP/IP and a socket interface as a server in a message-passing operating system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hipp, E.; Wiltzius, D.
1990-03-01
The UNICOS 4.3BSD network code and socket transport interface are the basis of an explicit network server for NLTSS, a message passing operating system on the Cray YMP. A BSD socket user library provides access to the network server using an RPC mechanism. The advantages of this server methodology are its modularity and extensibility to migrate to future protocol suites (e.g. OSI) and transport interfaces. In addition, the network server is implemented in an explicit multi-tasking environment to take advantage of the Cray YMP multi-processor platform. 19 refs., 5 figs.
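The abstract describes access through a BSD socket user library; the C fragment below is a generic 4.3BSD-style client sketch (socket, connect, write) meant only to illustrate that interface style. It is not the NLTSS library or its RPC mechanism, and the address and port are placeholders.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Generic BSD socket client: open a TCP stream, send a request, close.
     * The host address and port below are placeholders, not NLTSS values. */
    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof srv);
        srv.sin_family = AF_INET;
        srv.sin_port = htons(7000);                     /* placeholder port    */
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr); /* placeholder address */

        if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
            perror("connect");
            return 1;
        }
        const char *req = "STATUS\n";
        write(fd, req, strlen(req));
        close(fd);
        return 0;
    }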
DOE Office of Scientific and Technical Information (OSTI.GOV)
Painter, J.; McCormick, P.; Krogh, M.
This paper presents the ACL (Advanced Computing Lab) Message Passing Library. It is a high throughput, low latency communications library, based on Thinking Machines Corp.'s CMMD, upon which message passing applications can be built. The library has been implemented on the Cray T3D, Thinking Machines CM-5, SGI workstations, and on top of PVM.
Research on Spectroscopy, Opacity, and Atmospheres
NASA Technical Reports Server (NTRS)
Kurucz, Robert L.
1999-01-01
A web site (cfakus.harvard.edu) has been set up to make the calculations accessible; the data can also be accessed by FTP. It has all of the atomic and diatomic molecular data, tables of distribution function opacities, grids of model atmospheres, colors, fluxes, etc., programs that are ready for distribution, and most of the recent papers developed during this grant. Atlases and computed spectra will be added as they are completed. New atomic and molecular calculations will be added as they are completed. The atomic programs that had been running on a Cray at the San Diego Supercomputer Center can now run on the VAXes and the Alpha. The work started with Ni and Co because there were new laboratory analyses that included isotopic and hyperfine splitting. Those calculations are described in the appended abstract for the 6th Atomic Spectroscopy and Oscillator Strengths meeting in Victoria last summer. A surprising finding is that quadrupole transitions have been grossly in error because mixing with higher levels has not been included. All levels up through n=9 for Fe I and II, the spectra for which the most information is available, are now included. After Fe I and Fe II, all other spectra are "easy". ATLAS12, the opacity sampling program for computing models with arbitrary abundances, has been put on the web server. A new distribution function opacity program for workstations that replaces the one used on the Cray at the San Diego Supercomputer Center has been written. Each set of abundances would take 100 Cray hours costing $100,000.
RIP-REMOTE INTERACTIVE PARTICLE-TRACER
NASA Technical Reports Server (NTRS)
Rogers, S. E.
1994-01-01
Remote Interactive Particle-tracing (RIP) is a distributed-graphics program which computes particle traces for computational fluid dynamics (CFD) solution data sets. A particle trace is a line which shows the path a massless particle in a fluid will take; it is a visual image of where the fluid is going. The program is able to compute and display particle traces at a speed of about one trace per second because it runs on two machines concurrently. The data used by the program is contained in two files. The solution file contains data on density, momentum and energy quantities of a flow field at discrete points in three-dimensional space, while the grid file contains the physical coordinates of each of the discrete points. RIP requires two computers. A local graphics workstation interfaces with the user for program control and graphics manipulation, and a remote machine interfaces with the solution data set and performs time-intensive computations. The program utilizes two machines in a distributed mode for two reasons. First, the data to be used by the program is usually generated on the supercomputer. RIP avoids having to convert and transfer the data, eliminating any memory limitations of the local machine. Second, as computing the particle traces can be computationally expensive, RIP utilizes the power of the supercomputer for this task. Although the remote site code was developed on a CRAY, it is possible to port this to any supercomputer class machine with a UNIX-like operating system. Integration of a velocity field from a starting physical location produces the particle trace. The remote machine computes the particle traces using the particle-tracing subroutines from PLOT3D/AMES, a CFD post-processing graphics program available from COSMIC (ARC-12779). These routines use a second-order predictor-corrector method to integrate the velocity field. Then the remote program sends graphics tokens to the local machine via a remote-graphics library. The local machine interprets the graphics tokens and draws the particle traces. The program is menu driven. RIP is implemented on the silicon graphics IRIS 3000 (local workstation) with an IRIX operating system and on the CRAY2 (remote station) with a UNICOS 1.0 or 2.0 operating system. The IRIS 4D can be used in place of the IRIS 3000. The program is written in C (67%) and FORTRAN 77 (43%) and has an IRIS memory requirement of 4 MB. The remote and local stations must use the same user ID. PLOT3D/AMES unformatted data sets are required for the remote machine. The program was developed in 1988.
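The particle-tracing routines are described above as using a second-order predictor-corrector integration of the velocity field; the C sketch below shows that idea (a Heun-type step) for a single trace point. The function sample_velocity is a hypothetical stand-in for interpolation of the velocity field on the grid, not code from RIP or PLOT3D/AMES.

    /* Second-order predictor-corrector (Heun) step for a particle trace.
     * sample_velocity() is a hypothetical placeholder for interpolation
     * of the velocity field at a physical location.                      */
    typedef struct { double x, y, z; } vec3;

    vec3 sample_velocity(vec3 p);   /* provided elsewhere (hypothetical) */

    vec3 trace_step(vec3 p, double dt) {
        vec3 v0 = sample_velocity(p);                      /* predictor slope */
        vec3 pp = { p.x + dt * v0.x, p.y + dt * v0.y, p.z + dt * v0.z };
        vec3 v1 = sample_velocity(pp);                     /* corrector slope */
        vec3 out = {
            p.x + 0.5 * dt * (v0.x + v1.x),
            p.y + 0.5 * dt * (v0.y + v1.y),
            p.z + 0.5 * dt * (v0.z + v1.z)
        };
        return out;                                        /* next trace point */
    }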
CLARET user's manual: Mainframe Logs. Revision 1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Frobose, R.H.
1984-11-12
CLARET (Computer Logging and RETrieval) is a stand-alone PDP 11/23 system that can support 16 terminals. It provides a forms-oriented front end by which operators enter online activity logs for the Lawrence Livermore National Laboratory's OCTOPUS computer network. The logs are stored on the PDP 11/23 disks for later retrieval, and hardcopy reports are generated both automatically and upon request. Online viewing of the current logs is provided to management. As each day's logs are completed, the information is automatically sent to a CRAY and included in an online database system. The terminal used for the CLARET system is amore » dual-port Hewlett Packard 2626 terminal that can be used as either the CLARET logging station or as an independent OCTOPUS terminal. Because this is a stand-alone system, it does not depend on the availability of the OCTOPUS network to run and, in the event of a power failure, can be brought up independently.« less
Integrated risk/cost planning models for the US Air Traffic system
NASA Technical Reports Server (NTRS)
Mulvey, J. M.; Zenios, S. A.
1985-01-01
A prototype network planning model for the U.S. Air Traffic control system is described. The model encompasses the dual objectives of managing collision risks and transportation costs where traffic flows can be related to these objectives. The underlying structure is a network graph with nonseparable convex costs; the model is solved efficiently by capitalizing on its intrinsic characteristics. Two specialized algorithms for solving the resulting problems are described: (1) truncated Newton, and (2) simplicial decomposition. The feasibility of the approach is demonstrated using data collected from a control center in the Midwest. Computational results with different computer systems are presented, including a vector supercomputer (CRAY-XMP). The risk/cost model has two primary uses: (1) as a strategic planning tool using aggregate flight information, and (2) as an integrated operational system for forecasting congestion and monitoring (controlling) flow throughout the U.S. In the latter case, access to a supercomputer is required due to the model's enormous size.
NASA Astrophysics Data System (ADS)
Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.
2018-07-01
This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192^3 grid points for the scalar field computed with 8192 XK7 nodes.
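The key OpenMP 4.5 feature mentioned above is the NOWAIT clause on TARGET constructs, which turns the device region into a deferred task so that host-side work can proceed concurrently. The C fragment below is a minimal, generic illustration of that pattern; the arrays and loop bodies are placeholders, not the CCD kernels of the paper.

    #include <omp.h>

    /* Overlap device work with host work using a deferred target region.
     * a[], b[] and the loop bodies are placeholders, not the CCD kernels. */
    void step(double *a, double *b, int n) {
        /* Offloaded update; NOWAIT makes the target region asynchronous. */
        #pragma omp target teams distribute parallel for map(tofrom: a[0:n]) nowait
        for (int i = 0; i < n; i++)
            a[i] = 0.5 * (a[i] + 1.0);

        /* Host-side work (e.g. communication staging) proceeds concurrently. */
        for (int i = 0; i < n; i++)
            b[i] += 1.0;

        #pragma omp taskwait   /* synchronize with the deferred target task */
    }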
Computational physics in RISC environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rhoades, C.E. Jr.
The new high performance Reduced Instruction Set Computers (RISC) promise near Cray-level performance at near personal-computer prices. This paper explores the performance, conversion and compatibility issues associated with developing, testing and using our traditional, large-scale simulation models in the RISC environments exemplified by the IBM RS6000 and MIPS R3000 machines. The questions of operating systems (CTSS versus UNIX), compilers (Fortran, C, pointers) and data are addressed in detail. Overall, it is concluded that the RISC environments are practical for a very wide range of computational physics activities. Indeed, all but the very largest two- and three-dimensional codes will work quite well, particularly in a single user environment. Easily projected hardware-performance increases will revolutionize the field of computational physics. The way we do research will change profoundly in the next few years. There is, however, nothing more difficult to plan, nor more dangerous to manage than the creation of this new world.
Computational physics in RISC environments. Revision 1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rhoades, C.E. Jr.
The new high performance Reduced Instruction Set Computers (RISC) promise near Cray-level performance at near personal-computer prices. This paper explores the performance, conversion and compatibility issues associated with developing, testing and using our traditional, large-scale simulation models in the RISC environments exemplified by the IBM RS6000 and MIPS R3000 machines. The questions of operating systems (CTSS versus UNIX), compilers (Fortran, C, pointers) and data are addressed in detail. Overall, it is concluded that the RISC environments are practical for a very wide range of computational physics activities. Indeed, all but the very largest two- and three-dimensional codes will work quite well, particularly in a single user environment. Easily projected hardware-performance increases will revolutionize the field of computational physics. The way we do research will change profoundly in the next few years. There is, however, nothing more difficult to plan, nor more dangerous to manage than the creation of this new world.
Improvements to the Unstructured Mesh Generator MESH3D
NASA Technical Reports Server (NTRS)
Thomas, Scott D.; Baker, Timothy J.; Cliff, Susan E.
1999-01-01
The AIRPLANE process starts with an aircraft geometry stored in a CAD system. The surface is modeled with a mesh of triangles and then the flow solver produces pressures at surface points which may be integrated to find forces and moments. The biggest advantage is that the grid generation bottleneck of the CFD process is eliminated when an unstructured tetrahedral mesh is used. MESH3D is the key to turning around the first analysis of a CAD geometry in days instead of weeks. The flow solver part of AIRPLANE has proven to be robust and accurate over a decade of use at NASA. It has been extensively validated with experimental data and compares well with other Euler flow solvers. AIRPLANE has been applied to all the HSR geometries treated at Ames over the course of the HSR program in order to verify the accuracy of other flow solvers. The unstructured approach makes handling complete and complex geometries very simple because only the surface of the aircraft needs to be discretized, i.e. covered with triangles. The volume mesh is created automatically by MESH3D. AIRPLANE runs well on multiple platforms. Vectorization on the Cray Y-MP is reasonable for a code that uses indirect addressing. Massively parallel computers such as the IBM SP2, SGI Origin 2000, and the Cray T3E have been used with an MPI version of the flow solver and the code scales very well on these systems. AIRPLANE can run on a desktop computer as well. AIRPLANE has a future. The unstructured technologies developed as part of the HSR program are now targeting high Reynolds number viscous flow simulation. The pacing item in this effort is Navier-Stokes mesh generation.
National resource for computation in chemistry, phase I: evaluation and recommendations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1980-05-01
The National Resource for Computation in Chemistry (NRCC) was inaugurated at the Lawrence Berkeley Laboratory (LBL) in October 1977, with joint funding by the Department of Energy (DOE) and the National Science Foundation (NSF). The chief activities of the NRCC include: assembling a staff of eight postdoctoral computational chemists, establishing an office complex at LBL, purchasing a midi-computer and graphics display system, administering grants of computer time, conducting nine workshops in selected areas of computational chemistry, compiling a library of computer programs with adaptations and improvements, initiating a software distribution system, providing user assistance and consultation on request. This report presents assessments and recommendations of an Ad Hoc Review Committee appointed by the DOE and NSF in January 1980. The recommendations are that NRCC should: (1) not fund grants for computing time or research but leave that to the relevant agencies, (2) continue the Workshop Program in a mode similar to Phase I, (3) abandon in-house program development and establish instead a competitive external postdoctoral program in chemistry software development administered by the Policy Board and Director, and (4) not attempt a software distribution system (leaving that function to the QCPE). Furthermore, (5) DOE should continue to make its computational facilities available to outside users (at normal cost rates) and should find some way to allow the chemical community to gain occasional access to a CRAY-level computer.
High performance semantic factoring of giga-scale semantic graph databases.
DOE Office of Scientific and Technical Information (OSTI.GOV)
al-Saffar, Sinan; Adolf, Bob; Haglin, David
2010-10-01
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture and present the results of deploying it for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.
A comparison of the Cray-2 performance before and after the installation of memory pseudo-banking
NASA Technical Reports Server (NTRS)
Schmickley, Ronald D.; Bailey, David H.
1987-01-01
A suite of 13 large Fortran benchmark codes were run on a Cray-2 configured with memory pseudo-banking circuits, and floating point operation rates were measured for each under a variety of system load configurations. These were compared with similar flop measurements taken on the same system before installation of the pseudo-banking. A useful memory access efficiency parameter was defined and calculated for both sets of performance rates, allowing a crude quantitative measure of the improvement in efficiency due to pseudo-banking. Programs were categorized as either highly scalar (S) or highly vectorized (V) and either memory-intensive or register-intensive, giving 4 categories: S-memory, S-register, V-memory, and V-register. Using flop rates as a simple quantifier of these 4 categories, a scatter plot of efficiency gain vs Mflops roughly illustrates the improvement in floating point processing speed due to pseudo-banking. On the Cray-2 system tested this improvement ranged from 1 percent for S-memory codes to about 12 percent for V-memory codes. No significant gains were made for V-register codes, which was to be expected.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dennig, Yasmin
Sandia National Laboratories has a long history of significant contributions to the high performance computing community and industry. Our innovative computer architectures allowed the United States to become the first to break the teraFLOP barrier—propelling us to the international spotlight. Our advanced simulation and modeling capabilities have been integral in high consequence US operations such as Operation Burnt Frost. Strong partnerships with industry leaders, such as Cray, Inc. and Goodyear, have enabled them to leverage our high performance computing (HPC) capabilities to gain a tremendous competitive edge in the marketplace. As part of our continuing commitment to providing modern computing infrastructure and systems in support of Sandia missions, we made a major investment in expanding Building 725 to serve as the new home of HPC systems at Sandia. Work is expected to be completed in 2018 and will result in a modern facility of approximately 15,000 square feet of computer center space. The facility will be ready to house the newest National Nuclear Security Administration/Advanced Simulation and Computing (NNSA/ASC) Prototype platform being acquired by Sandia, with delivery in late 2019 or early 2020. This new system will enable continuing advances by Sandia science and engineering staff in the areas of operating system R&D, operation cost effectiveness (power and innovative cooling technologies), user environment and application code performance.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wasserman, H.J.
1996-02-01
The second generation of the Digital Equipment Corp. (DEC) DECchip Alpha AXP microprocessor is referred to as the 21164. From the viewpoint of numerically-intensive computing, the primary difference between it and its predecessor, the 21064, is that the 21164 has twice the multiply/add throughput per clock period (CP), a maximum of two floating point operations (FLOPS) per CP vs. one for the 21064. The AlphaServer 8400 is a shared-memory multiprocessor server system that can accommodate up to 12 CPUs and up to 14 GB of memory. In this report we will compare single processor performance of the 8400 system with that of the International Business Machines Corp. (IBM) RISC System/6000 POWER-2 microprocessor running at 66 MHz, the Silicon Graphics, Inc. (SGI) MIPS R8000 microprocessor running at 75 MHz, and the Cray Research, Inc. CRAY J90. The performance comparison is based on a set of Fortran benchmark codes that represent a portion of the Los Alamos National Laboratory supercomputer workload. The advantage of using these codes is that they span a wide range of computational characteristics, such as vectorizability, problem size, and memory access pattern. The primary disadvantage of using them is that detailed, quantitative analysis of performance behavior of all codes on all machines is difficult. One important addition to the benchmark set appears for the first time in this report. Whereas the older version was written for a vector processor, the newer version is more optimized for microprocessor architectures. Therefore, we have, for the first time, an opportunity to measure performance on a single application using implementations that expose the respective strengths of vector and superscalar architectures. All results in this report are from single processors. A subsequent article will explore shared-memory multiprocessing performance of the 8400 system.
NASA Technical Reports Server (NTRS)
Noor, Ahmed K.; Peters, Jeanne M.
1989-01-01
A computational procedure is presented for the nonlinear dynamic analysis of unsymmetric structures on vector multiprocessor systems. The procedure is based on a novel hierarchical partitioning strategy in which the response of the unsymmetric structure is approximated by a combination of symmetric and antisymmetric response vectors (modes), each obtained by using only a fraction of the degrees of freedom of the original finite element model. The three key elements of the procedure which result in a high degree of concurrency throughout the solution process are: (1) a mixed (or primitive variable) formulation with independent shape functions for the different fields; (2) operator splitting or restructuring of the discrete equations at each time step to delineate the symmetric and antisymmetric vectors constituting the response; and (3) a two-level iterative process for generating the response of the structure. An assessment is made of the effectiveness of the procedure on the CRAY X-MP/4 computers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gorentla Venkata, Manjunath; Shamis, Pavel; Graham, Richard L
2013-01-01
Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI Allreduce and MPI Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of this hierarchy. Using this design, we implement MPI Allreduce and MPI Reduce operations (and their nonblocking variants MPI Iallreduce and MPI Ireduce) for all message sizes, and evaluate on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, which is a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform the production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility and portability. On InfiniBand systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations also show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x, compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms the Cray MPI by 145%. The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up the total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
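The nonblocking variants evaluated above follow the MPI-3 interface; the C sketch below shows the generic usage pattern of MPI_Iallreduce, with room for independent computation before the wait. It illustrates only the standard MPI calls, not Cheetah's hierarchical algorithm.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank, global = 0.0;
        MPI_Request req;

        /* Start the nonblocking reduction, then overlap independent work. */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ... independent computation could proceed here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* reduction result now valid */
        if (rank == 0) printf("sum of ranks = %g\n", global);

        MPI_Finalize();
        return 0;
    }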
Solving large sparse eigenvalue problems on supercomputers
NASA Technical Reports Server (NTRS)
Philippe, Bernard; Saad, Youcef
1988-01-01
An important problem in scientific computing consists of finding a few eigenvalues and corresponding eigenvectors of a very large and sparse matrix. The most popular methods to solve these problems are based on projection techniques on appropriate subspaces. The main attraction of these methods is that they only require the use of the matrix in the form of matrix by vector multiplications. The implementations on supercomputers of two such methods for symmetric matrices, namely Lanczos' method and Davidson's method, are compared. Since one of the most important operations in these two methods is the multiplication of vectors by the sparse matrix, methods of performing this operation efficiently are discussed. The advantages and the disadvantages of each method are compared and implementation aspects are discussed. Numerical experiments on a one-processor CRAY-2 and CRAY X-MP are reported. Possible parallel implementations are also discussed.
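Since the dominant kernel discussed above is the sparse matrix-vector product, a minimal compressed sparse row (CSR) version is sketched below in C for reference. The paper itself considers storage schemes suited to vector hardware, so this scalar loop is only the generic form.

    /* y = A*x with A stored in compressed sparse row (CSR) format.
     * row_ptr has n+1 entries; col_idx/val hold the nonzeros row by row. */
    void csr_matvec(int n, const int *row_ptr, const int *col_idx,
                    const double *val, const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }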
Vectorized and multitasked solution of the few-group neutron diffusion equations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zee, S.K.; Turinsky, P.J.; Shayer, Z.
1989-03-01
A numerical algorithm with parallelism was used to solve the two-group, multidimensional neutron diffusion equations on computers characterized by shared memory, vector pipeline, and multi-CPU architecture features. Specifically, solutions were obtained on the Cray X/MP-48, the IBM-3090 with vector facilities, and the FPS-164. The material-centered mesh finite difference method approximation and outer-inner iteration method were employed. Parallelism was introduced in the inner iterations using the cyclic line successive overrelaxation iterative method and solving in parallel across lines. The outer iterations were completed using the Chebyshev semi-iterative method that allows parallelism to be introduced in both space and energy groups. For the three-dimensional model, power, soluble boron, and transient fission product feedbacks were included. Concentrating on the pressurized water reactor (PWR), the thermal-hydraulic calculation of moderator density assumed single-phase flow and a closed flow channel, allowing parallelism to be introduced in the solution across the radial plane. Using a pinwise-detail, quarter-core model of a typical PWR in cycle 1, for the two-dimensional model without feedback the measured million floating point operations per second (MFLOPS)/vector speedups were 83/11.7, 18/2.2, and 2.4/5.6 on the Cray, IBM, and FPS without multitasking, respectively. Lower performance was observed with a coarser mesh, i.e., shorter vector length, due to vector pipeline start-up. For an 18 x 18 x 30 (x-y-z) three-dimensional model with feedback of the same core, MFLOPS/vector speedups of approximately 61/6.7 and an execution time of 0.8 CPU seconds on the Cray without multitasking were measured. Finally, using two CPUs and the vector pipelines of the Cray, a multitasking efficiency of 81% was noted for the three-dimensional model.
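As a point of reference for the inner iteration, the C sketch below shows one sweep of plain point successive overrelaxation on a five-point stencil. It is a simplified stand-in for the cyclic line SOR used in the paper, with the source array assumed to be pre-scaled by the mesh spacing.

    /* One point-SOR sweep for a 2-D five-point diffusion stencil.
     * src is assumed to already include the h^2 scaling of the source.
     * This is a simplified scalar illustration of the inner iteration;
     * the paper itself uses a cyclic *line* SOR solved across lines.    */
    void sor_sweep(int nx, int ny, double omega,
                   double phi[nx][ny], const double src[nx][ny]) {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++) {
                double gs = 0.25 * (phi[i-1][j] + phi[i+1][j]
                                  + phi[i][j-1] + phi[i][j+1] - src[i][j]);
                phi[i][j] += omega * (gs - phi[i][j]);   /* overrelaxation */
            }
    }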
Modeling Subsurface Reactive Flows Using Leadership-Class Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mills, Richard T; Hammond, Glenn; Lichtner, Peter
2009-01-01
We describe our experiences running PFLOTRAN - a code for simulation of coupled hydro-thermal-chemical processes in variably saturated, non-isothermal, porous media - on leadership-class supercomputers, including initial experiences running on the petaflop incarnation of Jaguar, the Cray XT5 at the National Center for Computational Sciences at Oak Ridge National Laboratory. PFLOTRAN utilizes fully implicit time-stepping and is built on top of the Portable, Extensible Toolkit for Scientific Computation (PETSc). We discuss some of the hurdles to 'at scale' performance with PFLOTRAN and the progress we have made in overcoming them on leadership-class computer architectures.
Parallelized reliability estimation of reconfigurable computer networks
NASA Technical Reports Server (NTRS)
Nicol, David M.; Das, Subhendu; Palumbo, Dan
1990-01-01
A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.
A Performance Evaluation of the Cray X1 for Scientific Applications
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak; Borrill, Julian; Canning, Andrew; Carter, Jonathan; Djomehri, M. Jahed; Shan, Hongzhang; Skinner, David
2004-01-01
The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors used to build high-end platforms that balance capability and cost effectiveness. However, the recent development of massively parallel vector systems is having a significant effect on the supercomputing landscape. In this paper, we compare the performance of the recently released Cray X1 vector system with that of the cacheless NEC SX-6 vector machine, and the superscalar cache-based IBM Power3 and Power4 architectures for scientific applications. Overall results demonstrate that the X1 is quite promising, but performance improvements are expected as the hardware, systems software, and numerical libraries mature. Code reengineering to effectively utilize the complex architecture may also lead to significant efficiency enhancements.
Scalable nuclear density functional theory with Sky3D
NASA Astrophysics Data System (ADS)
Afibuzzaman, Md; Schuetrumpf, Bastian; Aktulga, Hasan Metin
2018-02-01
In nuclear astrophysics, quantum simulations of large inhomogeneous dense systems as they appear in the crusts of neutron stars present big challenges. The number of particles in a simulation with periodic boundary conditions is strongly limited due to the immense computational cost of the quantum methods. In this paper, we describe techniques for an efficient and scalable parallel implementation of Sky3D, a nuclear density functional theory solver that operates on an equidistant grid. Presented techniques allow Sky3D to achieve good scaling and high performance on a large number of cores, as demonstrated through detailed performance analysis on a Cray XC40 supercomputer.
NASA Astrophysics Data System (ADS)
Clay, M. P.; Yeung, P. K.; Buaria, D.; Gotoh, T.
2017-11-01
Turbulent mixing at high Schmidt number is a multiscale problem which places demanding requirements on direct numerical simulations to resolve fluctuations down to the Batchelor scale. We use a dual-grid, dual-scheme and dual-communicator approach where velocity and scalar fields are computed by separate groups of parallel processes, the latter using a combined compact finite difference (CCD) scheme on a finer grid with a static 3-D domain decomposition free of the communication overhead of memory transposes. A high degree of scalability is achieved for an 8192^3 scalar field at Schmidt number 512 in turbulence with a modest inertial range, by overlapping communication with computation whenever possible. On the Cray XE6 partition of Blue Waters, use of a dedicated thread for communication combined with OpenMP locks and nested parallelism reduces CCD timings by 34% compared to an MPI baseline. The code has been further optimized for the 27-petaflops Cray XK7 machine Titan using GPUs as accelerators with the latest OpenMP 4.5 directives, giving a 2.7X speedup compared to CPU-only execution at the largest problem size. Supported by NSF Grant ACI-1036170, the NCSA Blue Waters Project with subaward via UIUC, and a DOE INCITE allocation at ORNL.
S-HARP: A parallel dynamic spectral partitioner
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sohn, A.; Simon, H.
1998-01-01
Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. The authors present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP, fine grain loop level parallelism and coarse grain recursive parallelism. The parallel partitioner has been implemented in the Message Passing Interface on the Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64-processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15-fold speedup on 64 processors while ParaMeTiS1.0 gives a few-fold speedup. Experimental results demonstrate that S-HARP is three to 10 times faster than the dynamic partitioners ParaMeTiS and Jostle on six computational meshes of size over 100,000 vertices.
The Portals 4.0 network programming interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin
2012-11-01
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
Automated Generation of Message-Passing Programs: An Evaluation Using CAPTools
NASA Technical Reports Server (NTRS)
Hribar, Michelle R.; Jin, Haoqiang; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
Scientists at NASA Ames Research Center have been developing computational aeroscience applications on highly parallel architectures over the past ten years. During that same time period, a steady transition of hardware and system software also occurred, forcing us to expend great effort in migrating and re-coding our applications. As applications and machine architectures become increasingly complex, the cost and time required for this process will become prohibitive. In this paper, we present the first set of results in our evaluation of interactive parallelization tools. In particular, we evaluate CAPTools' ability to parallelize computational aeroscience applications. CAPTools was tested on serial versions of the NAS Parallel Benchmarks and ARC3D, a computational fluid dynamics application, on two platforms: the SGI Origin 2000 and the Cray T3E. This evaluation includes performance, amount of user interaction required, limitations and portability. Based on these results, a discussion on the feasibility of computer aided parallelization of aerospace applications is presented along with suggestions for future work.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thorson, L.D.
A description is given of a new version of the TRUMP (UCRL-14754) computer code, NOTRUMP, which runs on both the CDC-7600 and CRAY-1. There are slight differences in the input and major changes in output capability. A postprocessor, AFTER, is available to manipulate some of the new output features. Old data decks for TRUMP will normally run with only minor changes.
Parallel-vector out-of-core equation solver for computational mechanics
NASA Technical Reports Server (NTRS)
Qin, J.; Agarwal, T. K.; Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.
1993-01-01
A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/output (I/O) time is reduced by using the asynchronous BUFFER IN and BUFFER OUT statements, which can be executed simultaneously with the CPU instructions. The parallel and vector capability provided by the supercomputers is also exploited to enhance the performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.
Comparison of Implicit Collocation Methods for the Heat Equation
NASA Technical Reports Server (NTRS)
Kouatchou, Jules; Jezequel, Fabienne; Zukor, Dorothy (Technical Monitor)
2001-01-01
We combine a high-order compact finite difference scheme to approximate spatial derivatives and collocation techniques for the time component to numerically solve the two-dimensional heat equation. We use two approaches to implement the collocation methods. The first one is based on an explicit computation of the coefficients of polynomials and the second one relies on differential quadrature. We compare them by studying their merits and analyzing their numerical performance. All our computations, based on parallel algorithms, are carried out on the CRAY SV1.
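For context, one widely used fourth-order compact approximation of the second derivative on a uniform grid of spacing h is shown below; it illustrates the family of schemes referred to above, not necessarily the exact scheme or collocation details of the paper:

    f''_{i-1} + 10\, f''_{i} + f''_{i+1} = \frac{12}{h^{2}} \left( f_{i-1} - 2 f_{i} + f_{i+1} \right)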
NASA Astrophysics Data System (ADS)
Guan, W.; Cheng, X.; Huang, J.; Huber, G.; Li, W.; McCammon, J. A.; Zhang, B.
2018-06-01
RPYFMM is a software package for the efficient evaluation of the potential field governed by the Rotne-Prager-Yamakawa (RPY) tensor interactions in biomolecular hydrodynamics simulations. In our algorithm, the RPY tensor is decomposed as a linear combination of four Laplace interactions, each of which is evaluated using the adaptive fast multipole method (FMM) (Greengard and Rokhlin, 1997) where the exponential expansions are applied to diagonalize the multipole-to-local translation operators. RPYFMM offers a unified execution on both shared and distributed memory computers by leveraging the DASHMM library (DeBuhr et al., 2016, 2018). Preliminary numerical results show that the interactions for a molecular system of 15 million particles (beads) can be computed within one second on a Cray XC30 cluster using 12,288 cores, while achieving approximately 54% strong-scaling efficiency.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jackson, K.A.; Neuman, M.C.; Simmonds, D.D.
An effective method for detecting computer misuse is the automatic monitoring and analysis of on-line user activity. This activity is reflected in the system audit record, in the system vulnerability posture, and in other evidence found through active testing of the system. During the last several years we have implemented an automatic misuse detection system at Los Alamos. This is the Network Anomaly Detection and Intrusion Reporter (NADIR). We are currently expanding NADIR to include processing of the Cray UNICOS operating system. This new component is called the UNICOS Realtime NADIR, or UNICORN. UNICORN summarizes user activity and system configuration in statistical profiles. It compares these profiles to expert rules that define security policy and improper or suspicious behavior. It reports suspicious behavior to security auditors and provides tools to aid in follow-up investigations. The first phase of UNICORN development is nearing completion, and will be operational in late 1994.
Three-dimensional multigrid Navier-Stokes computations for turbomachinery applications
NASA Astrophysics Data System (ADS)
Subramanian, S. V.
1989-07-01
The fully three-dimensional, time-dependent compressible Navier-Stokes equations in cylindrical coordinates are used, in conjunction with a multistage Runge-Kutta numerical integration scheme for solution of the governing flow equations, to simulate complex flowfields within turbomachinery components, capturing the effects of viscosity, compressibility, blade rotation, and tip clearance. Computed results are presented for selected cascades, emphasizing the code's capabilities in the accurate prediction of such features as airfoil loadings, exit flow angles, shocks, and secondary flows. Computations for several test cases have been performed on a Cray Y-MP, using nearly 90,000 grid points.
TOSCA calculations and measurements for the SLAC SLC damping ring dipole magnet
NASA Astrophysics Data System (ADS)
Early, R. A.; Cobb, J. K.
1985-04-01
The SLAC damping ring dipole magnet was originally designed with removable nose pieces at the ends. Recently, a set of magnetic measurements was taken of the vertical component of induction along the center of the magnet for four different pole-end configurations and several current settings. The three dimensional computer code TOSCA, which is currently installed on the National Magnetic Fusion Energy Computer Center's Cray X-MP, was used to compute field values for the four configurations at current settings near saturation. Comparisons were made for magnetic induction as well as effective magnetic lengths for the different configurations.
1993 Gordon Bell Prize Winners
NASA Technical Reports Server (NTRS)
Karp, Alan H.; Simon, Horst; Heller, Don; Cooper, D. M. (Technical Monitor)
1994-01-01
The Gordon Bell Prize recognizes significant achievements in the application of supercomputers to scientific and engineering problems. In 1993, finalists were named for work in three categories: (1) Performance, which recognizes those who solved a real problem in the quickest elapsed time. (2) Price/performance, which encourages the development of cost-effective supercomputing. (3) Compiler-generated speedup, which measures how well compiler writers are facilitating the programming of parallel processors. The winners were announced November 17 at the Supercomputing 93 conference in Portland, Oregon. Gordon Bell, an independent consultant in Los Altos, California, is sponsoring $2,000 in prizes each year for 10 years to promote practical parallel processing research. This is the sixth year of the prize, which Computer administers. Something unprecedented in Gordon Bell Prize competition occurred this year: A computer manufacturer was singled out for recognition. Nine entries reporting results obtained on the Cray C90 were received, seven of the submissions orchestrated by Cray Research. Although none of these entries showed sufficiently high performance to win outright, the judges were impressed by the breadth of applications that ran well on this machine, all nine running at more than a third of the peak performance of the machine.
Vectorization of a particle simulation method for hypersonic rarefied flow
NASA Technical Reports Server (NTRS)
Mcdonald, Jeffrey D.; Baganoff, Donald
1988-01-01
An efficient particle simulation technique for hypersonic rarefied flows is presented at an algorithmic and implementation level. The implementation is for a vector computer architecture, specifically the Cray-2. The method models an ideal diatomic Maxwell molecule with three translational and two rotational degrees of freedom. Algorithms are designed specifically for compatibility with fine grain parallelism by reducing the number of data dependencies in the computation. By insisting on this compatibility, the method is capable of performing simulation on a much larger scale than previously possible. A two-dimensional simulation of supersonic flow over a wedge is carried out for the near-continuum limit where the gas is in equilibrium and the ideal solution can be used as a check on the accuracy of the gas model employed in the method. Also, a three-dimensional, Mach 8, rarefied flow about a finite-span flat plate at a 45 degree angle of attack was simulated. It utilized over 10 to the 7th particles carried through 400 discrete time steps in less than one hour of Cray-2 CPU time. This problem was chosen to exhibit the capability of the method in handling a large number of particles and a true three-dimensional geometry.
Aerodynamic optimization studies on advanced architecture computers
NASA Technical Reports Server (NTRS)
Chawla, Kalpana
1995-01-01
The approach to carrying out multi-discipline aerospace design studies in the future, especially in massively parallel computing environments, comprises choosing (1) suitable solvers to compute solutions to equations characterizing a discipline, and (2) efficient optimization methods. In addition, for aerodynamic optimization problems, (3) smart methodologies must be selected to modify the surface shape. In this research effort, a 'direct' optimization method is implemented on the Cray C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes solver, OVERFLOW, to compute flow solutions. The optimization method is chosen such that it can accommodate multi-discipline optimization in future computations. In this work, however, only single discipline aerodynamic optimization will be included.
Efficient simulation of incompressible viscous flow over multi-element airfoils
NASA Technical Reports Server (NTRS)
Rogers, Stuart E.; Wiltberger, N. Lyn; Kwak, Dochan
1992-01-01
The incompressible, viscous, turbulent flow over single and multi-element airfoils is numerically simulated in an efficient manner by solving the incompressible Navier-Stokes equations. The computer code uses the method of pseudo-compressibility with an upwind-differencing scheme for the convective fluxes and an implicit line-relaxation solution algorithm. The motivation for this work includes interest in studying the high-lift take-off and landing configurations of various aircraft. In particular, accurate computation of lift and drag at various angles of attack, up to stall, is desired. Two different turbulence models are tested in computing the flow over an NACA 4412 airfoil; an accurate prediction of stall is obtained. The approach used for multi-element airfoils involves the use of multiple zones of structured grids fitted to each element. Two different approaches are compared: a patched system of grids, and an overlaid Chimera system of grids. Computational results are presented for two-element, three-element, and four-element airfoil configurations. Excellent agreement with experimental surface pressure coefficients is seen. The code converges in less than 200 iterations, requiring on the order of one minute of CPU time (on a CRAY YMP) per element in the airfoil configuration.
Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP
NASA Technical Reports Server (NTRS)
Chan, Tony F.; Fatoohi, Rod A.
1990-01-01
The results of multitasking implementation of a domain decomposition fast Poisson solver on eight processors of the Cray Y-MP are presented. The object of this research is to study the performance of domain decomposition methods on a Cray supercomputer and to analyze the performance of different multitasking techniques using highly parallel algorithms. Two implementations of multitasking are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). A conventional FFT-based fast Poisson solver is also multitasked. The results of different implementations are compared and analyzed. A speedup of over 7.4 on the Cray Y-MP running in a dedicated environment is achieved for all cases.
Reversible Parallel Discrete-Event Execution of Large-scale Epidemic Outbreak Models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
The spatial scale, runtime speed and behavioral detail of epidemic outbreak simulations together require the use of large-scale parallel processing. In this paper, an optimistic parallel discrete event execution of a reaction-diffusion simulation model of epidemic outbreaks is presented, with an implementation over the µsik simulator. Rollback support is achieved with the development of a novel reversible model that combines reverse computation with a small amount of incremental state saving. Parallel speedup and other runtime performance metrics of the simulation are tested on a small (8,192-core) Blue Gene/P system, while scalability is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes (up to several hundred million individuals in the largest case) are exercised.
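The reverse-computation-plus-incremental-state-saving idea can be illustrated with a toy event handler. The Python sketch below is hypothetical and not the paper's reaction-diffusion model: constructive increments are undone by computing their inverses, while a destructive overwrite is undone from a small saved record.

class Cell:
    def __init__(self):
        self.susceptible = 1000
        self.infected = 1
        self.rate = 0.05

def forward_infect(cell, new_rate):
    saved = cell.rate              # incremental state saving for the destructive write
    cell.susceptible -= 1          # reversible by computation
    cell.infected += 1             # reversible by computation
    cell.rate = new_rate
    return saved                   # kept with the event until it can be committed

def reverse_infect(cell, saved):
    cell.rate = saved              # restore from the saved record
    cell.infected -= 1             # inverse of the forward increments
    cell.susceptible += 1

cell = Cell()
saved = forward_infect(cell, 0.08)   # optimistic execution of the event
reverse_infect(cell, saved)          # rollback after a causality violation
assert (cell.susceptible, cell.infected, cell.rate) == (1000, 1, 0.05)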
The portals 4.0.1 network programming interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin
2013-04-01
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
A study of workstation computational performance for real-time flight simulation
NASA Technical Reports Server (NTRS)
Maddalon, Jeffrey M.; Cleveland, Jeff I., II
1995-01-01
With recent advances in microprocessor technology, some have suggested that modern workstations provide enough computational power to properly operate a real-time simulation. This paper presents the results of a computational benchmark, based on actual real-time flight simulation code used at Langley Research Center, which was executed on various workstation-class machines. The benchmark was executed on different machines from several companies including: CONVEX Computer Corporation, Cray Research, Digital Equipment Corporation, Hewlett-Packard, Intel, International Business Machines, Silicon Graphics, and Sun Microsystems. The machines are compared by their execution speed, computational accuracy, and porting effort. The results of this study show that the raw computational power needed for real-time simulation is now offered by workstations.
High Performance Descriptive Semantic Analysis of Semantic Graph Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Adolf, Robert D.; al-Saffar, Sinan
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to understand their inherent semantic structure, whether codified in explicit ontologies or not. Our group is researching novel methods for what we call descriptive semantic analysis of RDF triplestores, to serve purposes of analysis, interpretation, visualization, and optimization. But data size and computational complexity make it increasingly necessary to bring high performance computational resources to bear on this task. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multi-threaded architecture of the Cray XMT platform, conventional servers, and large data stores. In this paper we describe that architecture and our methods, and present the results of our analyses of basic properties, connected components, namespace interaction, and typed paths for the Billion Triple Challenge 2010 dataset.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Finite Element Flow Code Optimization on the Cray T3D,
1997-04-01
At the present time, the system is configured with 512 processing elements and 32.8 Gigabytes of memory. Through a gift of time from MSCI and other arrangements, the AHPCRC has limited access to this system.
NASA Astrophysics Data System (ADS)
Eisenbach, Markus; Larkin, Jeff; Lutjens, Justin; Rennich, Steven; Rogers, James H.
2017-02-01
The Locally Self-consistent Multiple Scattering (LSMS) code solves the first principles Density Functional theory Kohn-Sham equation for a wide range of materials with a special focus on metals, alloys and metallic nano-structures. It has traditionally exhibited near perfect scalability on massively parallel high performance computer architectures. We present our efforts to exploit GPUs to accelerate the LSMS code to enable first principles calculations of O(100,000) atoms and statistical physics sampling of finite temperature properties. We reimplement the scattering matrix calculation for GPUs with a block matrix inversion algorithm that only uses accelerator memory. Using the Cray XK7 system Titan at the Oak Ridge Leadership Computing Facility we achieve a sustained performance of 14.5PFlop/s and a speedup of 8.6 compared to the CPU only code.
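The abstract does not spell out the block-inversion algorithm; the NumPy sketch below illustrates one standard way a single diagonal block of a matrix inverse can be obtained by block (Schur complement) elimination, which is the general idea behind such scattering-matrix kernels. Matrix sizes and values are illustrative and no GPU offload is shown.

import numpy as np

def leading_block_of_inverse(M, k):
    # For M = [[A, B], [C, D]], the (1,1) block of M^{-1} equals
    # (A - B D^{-1} C)^{-1}, so only the k-by-k Schur complement is inverted.
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    X = np.linalg.solve(D, C)          # avoids forming D^{-1} explicitly
    return np.linalg.inv(A - B @ X)

rng = np.random.default_rng(0)
n, k = 200, 32
M = rng.standard_normal((n, n)) + n * np.eye(n)     # well-conditioned test matrix
block = leading_block_of_inverse(M, k)
print(np.allclose(block, np.linalg.inv(M)[:k, :k])) # True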
Understanding Aprun Use Patterns
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lin, Hwa-Chun Wendy
2009-05-06
On the Cray XT, aprun is the command to launch an application to a set of compute nodes reserved through the Application Level Placement Scheduler (ALPS). At the National Energy Research Scientific Computing Center (NERSC), interactive aprun is disabled. That is, invocations of aprun have to go through the batch system. Batch scripts can and often do contain several apruns which either use subsets of the reserved nodes in parallel, or use all reserved nodes in consecutive apruns. In order to better understand how NERSC users run on the XT, it is necessary to associate aprun information with jobs. It is surprisingly more challenging than it sounds. In this paper, we describe those challenges and how we solved them to produce daily per-job reports for completed apruns. We also describe additional uses of the data, e.g. adjusting charging policy accordingly or associating node failures with jobs/users, and plans for enhancements.
Airloads on Bluff Bodies, with Application to the Rotor-Induced Downloads on Tilt-Rotor Aircraft.
1983-09-01
... interference aerodynamics ... on hover performance (Ref. 11) ... to study the two-dimensional section characteristics of a wing in the wake of a ... resources for large numbers of vortices; a typical case requires 10-15 min CPU time on the Ames Cray 1S computer. Figure 6 shows a typical result. ... CPU time per case on a Prime 550 computer to converge to a steady solution; this would be equivalent to one or two seconds on ...
HEAT.PRO - THERMAL IMBALANCE FORCE SIMULATION AND ANALYSIS USING PDE2D
NASA Technical Reports Server (NTRS)
Vigue, Y.
1994-01-01
HEAT.PRO calculates the thermal imbalance force resulting from satellite surface heating. The heated body of a satellite re-radiates energy at a rate that is proportional to its temperature, losing the energy in the form of photons. By conservation of momentum, this momentum flux out of the body creates a reaction force against the radiation surface, and the net thermal force can be observed as a small perturbation that affects long term orbital behavior of the satellite. HEAT.PRO calculates this thermal imbalance force and then determines its effects on satellite orbits, especially where the Earth's shadowing of an orbiting satellite causes periodic changes in the spacecraft's thermal environment. HEAT.PRO implements a finite element method routine called PDE2D which incorporates material properties to determine the solar panel surface temperatures. The nodal temperatures are computed at specified time steps and are used to determine the magnitude and direction of the thermal force on the spacecraft. These calculations are based on the solar panel orientation and satellite's position with respect to the earth and sun. It is necessary to have accurate, current knowledge of surface emissivity, thermal conductivity, heat capacity, and material density. These parameters, which may change due to degradation of materials in the environment of space, influence the nodal temperatures that are computed and thus the thermal force calculations. HEAT.PRO was written in FORTRAN 77 for Cray series computers running UNICOS. The source code contains directives for and is used as input to the required partial differential equation solver, PDE2D. HEAT.PRO is available on a 9-track 1600 BPI magnetic tape in UNIX tar format (standard distribution medium) or a .25 inch streaming magnetic tape cartridge in UNIX tar format. An electronic copy of the documentation in Macintosh Microsoft Word format is included on the distribution tape. HEAT.PRO was developed in 1991. Cray and UNICOS are registered trademarks of Cray Research, Inc. UNIX is a trademark of AT&T Bell Laboratories. PDE2D is available from Granville Sewell, Mathematics Dept., University of Texas at El Paso, El Paso, Texas 79968.
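A minimal sketch of the final force step HEAT.PRO performs is given below (Python with hypothetical facet data, not the PDE2D-based code): assuming each facet emits diffusely, the photon recoil per facet is taken as (2/3) * emissivity * sigma * T^4 * area / c along the inward normal, and the facet contributions are summed into a net thermal imbalance force.

import numpy as np

SIGMA = 5.670374419e-8      # Stefan-Boltzmann constant, W m^-2 K^-4
C = 299792458.0             # speed of light, m s^-1

def thermal_force(temps, areas, normals, emissivity):
    # temps [K], areas [m^2], outward unit normals (N, 3), emissivity per facet
    flux = emissivity * SIGMA * temps**4             # emitted power per unit area
    recoil = (2.0 / 3.0) * flux * areas / C          # per-facet force magnitude for a diffuse emitter
    return -(recoil[:, None] * normals).sum(axis=0)  # reaction acts opposite the outward normal

# toy example: hot sun-facing side and cooler back side of a solar panel
temps = np.array([330.0, 290.0])
areas = np.array([10.0, 10.0])
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
emissivity = np.array([0.8, 0.8])
print(thermal_force(temps, areas, normals, emissivity))   # net force vector, newtons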
Highly parallel implementation of non-adiabatic Ehrenfest molecular dynamics
NASA Astrophysics Data System (ADS)
Kanai, Yosuke; Schleife, Andre; Draeger, Erik; Anisimov, Victor; Correa, Alfredo
2014-03-01
While the adiabatic Born-Oppenheimer approximation tremendously lowers computational effort, many questions in modern physics, chemistry, and materials science require an explicit description of coupled non-adiabatic electron-ion dynamics. Electronic stopping, i.e. the energy transfer of a fast projectile atom to the electronic system of the target material, is a notorious example. We recently implemented real-time time-dependent density functional theory based on the plane-wave pseudopotential formalism in the Qbox/qb@ll codes. We demonstrate that explicit integration using a fourth-order Runge-Kutta scheme is very suitable for modern highly parallelized supercomputers. Applying the new implementation to systems with hundreds of atoms and thousands of electrons, we achieved excellent performance and scalability on a large number of nodes both on the BlueGene based "Sequoia" system at LLNL as well as the Cray architecture of "Blue Waters" at NCSA. As an example, we discuss our work on computing the electronic stopping power of aluminum and gold for hydrogen projectiles, showing an excellent agreement with experiment. These first-principles calculations allow us to gain important insight into the fundamental physics of electronic stopping.
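The explicit fourth-order Runge-Kutta propagation mentioned above can be sketched on a toy Hermitian Hamiltonian; the Python below illustrates the integrator only (an assumed small model matrix, not the plane-wave pseudopotential Kohn-Sham Hamiltonian of Qbox/qb@ll).

import numpy as np

def rk4_step(psi, H, dt):
    # one explicit RK4 step of i d(psi)/dt = H psi
    rhs = lambda p: -1j * (H @ p)
    k1 = rhs(psi)
    k2 = rhs(psi + 0.5 * dt * k1)
    k3 = rhs(psi + 0.5 * dt * k2)
    k4 = rhs(psi + dt * k3)
    return psi + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
H = (A + A.conj().T) / 2.0                     # small Hermitian stand-in Hamiltonian
psi = np.zeros(8, dtype=complex)
psi[0] = 1.0
for _ in range(1000):
    psi = rk4_step(psi, H, dt=1e-3)
print(np.linalg.norm(psi))                     # stays close to 1 for a small enough dt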
IBM PC enhances the world's future
NASA Technical Reports Server (NTRS)
Cox, Jozelle
1988-01-01
Although the purpose of this research is to illustrate the importance of computers to the public, particularly the IBM PC, present examinations will include computers developed before the IBM PC was brought into use. IBM, as well as other computing facilities, began serving the public years ago, and is continuing to find ways to enhance the existence of man. With new developments in supercomputers like the Cray-2, and the recent advances in artificial intelligence programming, the human race is gaining knowledge at a rapid pace. All have benefited from the development of computers in the world; not only have they brought new assets to life, but have made life more and more of a challenge everyday.
A computational/experimental study of the flow around a body of revolution at angle of attack
NASA Technical Reports Server (NTRS)
Zilliac, Gregory G.
1986-01-01
The incompressible Navier-Stokes equations are numerically solved for steady flow around an ogive-cylinder (fineness ratio 4.5) at angle of attack. The three-dimensional vortical flow is investigated with emphasis on the tip and the near wake region. The implicit, finite-difference computation is performed on the CRAY X-MP computer using the method of pseudo-compressibility. Comparisons of computational results with results of a companion towing tank experiment are presented for two symmetric leeside flow cases of moderate angles of attack. The topology of the flow is discussed and conclusions are drawn concerning the growth and stability of the primary vortices.
NASA Technical Reports Server (NTRS)
Tennille, Geoffrey M.; Howser, Lona M.
1993-01-01
The use of the CONVEX computers that are an integral part of the Supercomputing Network Subsystems (SNS) of the Central Scientific Computing Complex of LaRC is briefly described. Features of the CONVEX computers that are significantly different from the CRAY supercomputers are covered, including: FORTRAN, C, architecture of the CONVEX computers, the CONVEX environment, batch job submittal, debugging, performance analysis, utilities unique to CONVEX, and documentation. This revision reflects the addition of the Applications Compiler and X-based debugger, CXdb. The document is intended for all CONVEX users as a ready reference to frequently asked questions and to more detailed information contained in the vendor manuals. It is appropriate for both the novice and the experienced user.
NASA Astrophysics Data System (ADS)
Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.
2018-01-01
We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
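As an illustration of a preconditioned block iterative eigensolver of the kind described, the sketch below uses SciPy's LOBPCG with a block of starting vectors and a simple Jacobi preconditioner on an assumed sparse symmetric test matrix; the actual nuclear configuration interaction Hamiltonian, starting-guess construction and preconditioner are of course far more elaborate.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, lobpcg

n, nev = 2000, 8
diag = np.linspace(1.0, 100.0, n)
H = sp.diags(diag) + sp.random(n, n, density=1e-3, random_state=0)
H = (H + H.T) / 2.0                                       # symmetric sparse test matrix

M = LinearOperator((n, n), matvec=lambda x: x / diag)     # simple Jacobi preconditioner
X0 = np.random.default_rng(0).standard_normal((n, nev))   # block of starting vectors

vals, vecs = lobpcg(H, X0, M=M, largest=False, tol=1e-8, maxiter=200)
print(np.sort(vals))                                      # approximations to the lowest eigenvalues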
NASA Technical Reports Server (NTRS)
Gupta, K. K.
1997-01-01
A multidisciplinary, finite element-based, highly graphics-oriented, linear and nonlinear analysis capability that includes such disciplines as structures, heat transfer, linear aerodynamics, computational fluid dynamics, and controls engineering has been achieved by integrating several new modules in the original STARS (STructural Analysis RoutineS) computer program. Each individual analysis module is general-purpose in nature and is effectively integrated to yield aeroelastic and aeroservoelastic solutions of complex engineering problems. Examples of advanced NASA Dryden Flight Research Center projects analyzed by the code in recent years include the X-29A, F-18 High Alpha Research Vehicle/Thrust Vectoring Control System, B-52/Pegasus Generic Hypersonics, National AeroSpace Plane (NASP), SR-71/Hypersonic Launch Vehicle, and High Speed Civil Transport (HSCT) projects. Extensive graphics capabilities exist for convenient model development and postprocessing of analysis results. The program is written in modular form in standard FORTRAN language to run on a variety of computers, such as the IBM RISC/6000, SGI, DEC, Cray, and personal computer; associated graphics codes use OpenGL and IBM/graPHIGS language for color depiction. This program is available from COSMIC, the NASA agency for distribution of computer programs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Ediger, David; Jiang, Karl
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Ediger, David; Jiang, Karl
2009-05-29
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
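For reference, the serial baseline that lock-free parallel betweenness implementations build on is Brandes' algorithm; a minimal Python sketch on a toy unweighted graph is shown below (the lock-free, multithreaded aspects that are the papers' contribution are not represented).

from collections import deque

def betweenness(adj):
    # adj: dict node -> list of neighbour nodes (unweighted graph);
    # for an undirected graph each pair of endpoints is counted in both directions.
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1      # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                       # BFS from s
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                                   # dependency accumulation
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(betweenness(g))   # node 3 lies on the most shortest paths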
Efficient Iterative Methods Applied to the Solution of Transonic Flows
NASA Astrophysics Data System (ADS)
Wissink, Andrew M.; Lyrintzis, Anastasios S.; Chronopoulos, Anthony T.
1996-02-01
We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested: a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel Thinking Machines CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems.
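A minimal sketch of the Newton-Krylov idea (inexact Newton with an ILU-preconditioned GMRES inner solve) is given below in Python/SciPy on an assumed one-dimensional model problem; it is not the transonic small disturbance solver, and it uses GMRES only, not the OSOmin variant.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import gmres, spilu, LinearOperator

def residual(u):
    # model nonlinear problem: -u'' + u^3 = 1 with homogeneous Dirichlet BCs
    n = u.size; h = 1.0 / (n + 1)
    up = np.concatenate(([0.0], u, [0.0]))
    return -(up[2:] - 2 * up[1:-1] + up[:-2]) / h**2 + u**3 - 1.0

def jacobian(u):
    n = u.size; h = 1.0 / (n + 1)
    lap = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
    return (lap + sp.diags(3.0 * u**2)).tocsc()

u = np.zeros(200)
for _ in range(20):                            # outer Newton loop
    F = residual(u)
    if np.linalg.norm(F) < 1e-10:
        break
    J = jacobian(u)                            # exact analytical Jacobian
    ilu = spilu(J, drop_tol=1e-4)              # ILU preconditioner
    M = LinearOperator(J.shape, matvec=ilu.solve)
    du, info = gmres(J, -F, M=M)               # inexact inner linear solve
    u += du
print(np.linalg.norm(residual(u)))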
Parallel computing using a Lagrangian formulation
NASA Technical Reports Server (NTRS)
Liou, May-Fun; Loh, Ching Yuen
1991-01-01
A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on an 8192-processor CM-2 over a single processor of a CRAY-2.
Parallel computing using a Lagrangian formulation
NASA Technical Reports Server (NTRS)
Liou, May-Fun; Loh, Ching-Yuen
1992-01-01
This paper adopts a new Lagrangian formulation of the Euler equation for the calculation of two dimensional supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, we have achieved better than six times speed-up on an 8192-processor CM-2 over a single processor of a CRAY-2.
The computation of pi to 29,360,000 decimal digits using Borweins' quartically convergent algorithm
NASA Technical Reports Server (NTRS)
Bailey, David H.
1988-01-01
The quartically convergent numerical algorithm developed by Borwein and Borwein (1987) for 1/pi is implemented via a prime-modulus-transform multiprecision technique on the NASA Ames Cray-2 supercomputer to compute the first 2.936 x 10 to the 7th digits of the decimal expansion of pi. The history of pi computations is briefly recalled; the most recent algorithms are characterized; the implementation procedures are described; and samples of the output listing are presented. Statistical analyses show that the present decimal expansion is completely random, with only acceptable numbers of long repeating strings and single-digit runs.
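For reference, the Borweins' quartically convergent iteration for 1/pi can be sketched with ordinary multiprecision arithmetic; the Python/mpmath example below runs at a few thousand digits rather than 29 million, and the prime-modulus-transform multiplication machinery of the Cray-2 implementation is replaced by mpmath's built-in arithmetic.

from mpmath import mp, sqrt

mp.dps = 2000                          # working precision, in decimal digits
y = sqrt(2) - 1
a = 6 - 4 * sqrt(2)
for k in range(6):                     # correct digits roughly quadruple each pass
    r = (1 - y**4) ** 0.25
    y = (1 - r) / (1 + r)
    a = a * (1 + y)**4 - 2**(2 * k + 3) * y * (1 + y + y**2)

print(mp.pi - 1 / a)                   # error is down at the level of the working precision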
Implementation of the Automated Numerical Model Performance Metrics System
2011-09-26
... As of this writing, the DSRC IBM AIX machines DaVinci and Pascal, and the Cray XT Einstein all use the PBS batch queuing system for ... Appendix A – General Automation System: this system provides general purpose tools and a general way to automatically run ...
Collective Framework and Performance Optimizations to Open MPI for Cray XT Platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ladd, Joshua S; Gorentla Venkata, Manjunath; Shamis, Pavel
2011-01-01
The performance and scalability of collective operations play a key role in the performance and scalability of many scientific applications. Within the Open MPI code base we have developed a general purpose hierarchical collective operations framework called Cheetah, and applied it at large scale on the Oak Ridge Leadership Computing Facility (OLCF) Jaguar platform, obtaining better performance and scalability than the native MPI implementation. This paper discusses Cheetah's design and implementation, and optimizations to the framework for Cray XT5 platforms. Our results show that Cheetah's Broadcast and Barrier perform better than the native MPI implementation. For medium data, Cheetah's Broadcast outperforms the native MPI implementation by 93% at a 49,152-process problem size. For small and large data, it outperforms the native MPI implementation by 10% and 9%, respectively, at a 24,576-process problem size. Cheetah's Barrier performs 10% better than the native MPI implementation at a 12,288-process problem size.
Systolic array IC for genetic computation
NASA Technical Reports Server (NTRS)
Anderson, D.
1991-01-01
Measuring similarities between large sequences of genetic information is a formidable task requiring enormous amounts of computer time. Geneticists claim that nearly two months of CRAY-2 time are required to run a single comparison of the known database against the new bases that will be found this year, and more than a CRAY-2 year for next year's genetic discoveries, and so on. The DNA IC, designed at HP-ICBD in cooperation with the California Institute of Technology and the Jet Propulsion Laboratory, is being implemented in order to move the task of genetic comparison onto workstations and personal computers, while vastly improving performance. The chip is a systolic (pumped) array comprised of 16 processors, control logic, and global RAM, totaling 400,000 FETS. At 12 MHz, each chip performs 2.7 billion 16 bit operations per second. Using 35 of these chips in series on one PC board (performing nearly 100 billion operations per second), a sequence of 560 bases can be compared against the eventual total genome of 3 billion bases, in minutes--on a personal computer. While the designed purpose of the DNA chip is for genetic research, other disciplines requiring similarity measurements between strings of 7 bit encoded data could make use of this chip as well. Cryptography and speech recognition are two examples. A mix of full custom design and standard cells, in CMOS34, were used to achieve these goals. Innovative test methods were developed to enhance controllability and observability in the array. This paper describes these techniques as well as the chip's functionality. This chip was designed in the 1989-90 timeframe.
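The dynamic-programming similarity recurrence that such systolic sequence comparators pipeline in hardware can be written down directly; the Python sketch below is a plain serial Smith-Waterman-style local similarity score with illustrative match/mismatch/gap values, not the DNA IC's fixed-point datapath.

def local_similarity(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]           # score matrix, zero-initialized
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,       # match / mismatch
                          H[i - 1][j] + gap,         # gap in b
                          H[i][j - 1] + gap)         # gap in a
            best = max(best, H[i][j])
    return best

print(local_similarity("ACACACTA", "AGCACACA"))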
Massively parallel quantum computer simulator
NASA Astrophysics Data System (ADS)
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as benchmark for testing high-end parallel computers.
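The core state-vector technique behind such simulators is easy to sketch: store all 2^n amplitudes and apply a gate by contracting it against the target qubit's axis. The Python below is a toy illustration (Hadamards on three qubits), not the portable MPI code described.

import numpy as np

def apply_single_qubit_gate(state, gate, target, n_qubits):
    # state: complex array of length 2**n_qubits; gate: 2x2 unitary
    psi = state.reshape([2] * n_qubits)
    psi = np.moveaxis(psi, target, 0)            # bring the target qubit's axis to the front
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

n = 3
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                                   # |000>
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
for q in range(n):
    state = apply_single_qubit_gate(state, H, q, n)
print(np.round(np.abs(state)**2, 3))             # uniform superposition: all probabilities 1/8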
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan
This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
A progress report on UNICOS misuse detection at Los Alamos
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thompson, J.L.; Jackson, K.A.; Stallings, C.A.
An effective method for detecting computer misuse is the automatic monitoring and analysis of on-line user activity. During the past year, Los Alamos enhanced its Network Anomaly Detection and Intrusion Reporter (NADIR) to include analysis of user activity on Los Alamos' UNICOS Crays. In near real-time, NADIR compares user activity to historical profiles and tests activity against expert rules. The expert rules express Los Alamos' security policy and define improper or suspicious behavior. NADIR reports suspicious behavior to security auditors and provides tools to aid in follow-up investigations. This paper describes the implementation to date of the UNICOS component of NADIR, along with the operational experiences and future plans for the system.
An analysis of file migration in a UNIX supercomputing environment
NASA Technical Reports Server (NTRS)
Miller, Ethan L.; Katz, Randy H.
1992-01-01
The supercomputer center at the National Center for Atmospheric Research (NCAR) migrates large numbers of files to and from its mass storage system (MSS) because there is insufficient space to store them on the Cray supercomputer's local disks. This paper presents an analysis of file migration data collected over two years. The analysis shows that requests to the MSS are periodic, with one day and one week periods. Read requests to the MSS account for the majority of the periodicity, as write requests are relatively constant over the course of a week. Additionally, reads show a far greater fluctuation than writes over a day and week since reads are driven by human users while writes are machine-driven.
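A hypothetical sketch of the kind of periodicity analysis described is shown below: bucket request timestamps into hourly counts and look for spectral peaks near one-day and one-week periods (the trace here is synthetic, not NCAR MSS data).

import numpy as np

def dominant_periods_hours(hourly_counts, top=3):
    x = hourly_counts - hourly_counts.mean()
    power = np.abs(np.fft.rfft(x))**2
    freqs = np.fft.rfftfreq(len(x), d=1.0)          # cycles per hour
    order = np.argsort(power[1:])[::-1] + 1         # rank bins by power, skipping DC
    return [1.0 / freqs[k] for k in order[:top]]    # periods in hours

# synthetic 8-week trace with daily and weekly structure plus noise
t = np.arange(8 * 7 * 24)
counts = (100
          + 40 * np.sin(2 * np.pi * t / 24)
          + 15 * np.sin(2 * np.pi * t / (24 * 7))
          + np.random.default_rng(0).normal(0, 5, t.size))
print(dominant_periods_hours(counts))   # expect ~24 and ~168 hours among the top periods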
Performance of the engineering analysis and data system 2 common file system
NASA Technical Reports Server (NTRS)
Debrunner, Linda S.
1993-01-01
The Engineering Analysis and Data System (EADS) was used from April 1986 to July 1993 to support large scale scientific and engineering computation (e.g. computational fluid dynamics) at Marshall Space Flight Center. The need for an updated system resulted in an RFP in June 1991, after which a contract was awarded to Cray Grumman. EADS II was installed in February 1993, and by July 1993 most users were migrated. EADS II is a network of heterogeneous computer systems supporting scientific and engineering applications. The Common File System (CFS) is a key component of this system. The CFS provides a seamless, integrated environment to the users of EADS II including both disk and tape storage. UniTree software is used to implement this hierarchical storage management system. The performance of the CFS suffered during the early months of the production system. Several of the performance problems were traced to software bugs which have been corrected. Other problems were associated with hardware. However, the use of NFS in UniTree UCFM software limits the performance of the system. The performance issues related to the CFS have led to a need to develop a greater understanding of the CFS organization. This paper will first describe the EADS II with emphasis on the CFS. Then, a discussion of mass storage systems will be presented, and methods of measuring the performance of the Common File System will be outlined. Finally, areas for further study will be identified and conclusions will be drawn.
Efficient simulation of incompressible viscous flow over multi-element airfoils
NASA Technical Reports Server (NTRS)
Rogers, Stuart E.; Wiltberger, N. Lyn; Kwak, Dochan
1993-01-01
The incompressible, viscous, turbulent flow over single and multi-element airfoils is numerically simulated in an efficient manner by solving the incompressible Navier-Stokes equations. The solution algorithm employs the method of pseudo compressibility and utilizes an upwind differencing scheme for the convective fluxes, and an implicit line-relaxation scheme. The motivation for this work includes interest in studying high-lift take-off and landing configurations of various aircraft. In particular, accurate computation of lift and drag at various angles of attack up to stall is desired. Two different turbulence models are tested in computing the flow over an NACA 4412 airfoil; an accurate prediction of stall is obtained. The approach used for multi-element airfoils involves the use of multiple zones of structured grids fitted to each element. Two different approaches are compared; a patched system of grids, and an overlaid Chimera system of grids. Computational results are presented for two-element, three-element, and four-element airfoil configurations. Excellent agreement with experimental surface pressure coefficients is seen. The code converges in less than 200 iterations, requiring on the order of one minute of CPU time on a CRAY YMP per element in the airfoil configuration.
New tools using the hardware performance monitor to help users tune programs on the Cray X-MP
DOE Office of Scientific and Technical Information (OSTI.GOV)
Engert, D.E.; Rudsinski, L.; Doak, J.
1991-09-25
The performance of a Cray system is highly dependent on the tuning techniques used by individuals on their codes. Many of our users were not taking advantage of the tuning tools that allow them to monitor their own programs by using the Hardware Performance Monitor (HPM). We therefore modified UNICOS to collect HPM data for all processes and to report Mflop ratings based on users, programs, and time used. Our tuning efforts are now being focused on the users and programs that have the best potential for performance improvements. These modifications and some of the more striking performance improvements are described.
Deploying Darter - A Cray XC30 System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fahey, Mark R; Budiardja, Reuben D; Crosby, Lonnie D
The University of Tennessee, Knoxville acquired a Cray XC30 supercomputer, called Darter, with a peak performance of 248.9 Teraflops. Darter was deployed in late March of 2013 with a very aggressive production timeline - the system was deployed, accepted, and placed into production in only 2 weeks. The Spring Experiment for the Center for Analysis and Prediction of Storms (CAPS) largely drove the accelerated timeline, as the experiment was scheduled to start in mid-April. The Consortium for Advanced Simulation of Light Water Reactors (CASL) project also needed access and was able to meet their tight deadlines on the newly acquired XC30. Darter's accelerated deployment and operations schedule resulted in substantial scientific impacts within the research community as well as immediate real-world impacts such as early severe tornado warnings.
Input-independent, Scalable and Fast String Matching on the Cray XMT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Chavarría-Miranda, Daniel; Maschhoff, Kristyn J
2009-05-25
String searching is at the core of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real-time, string searching solutions. For these conditions, many software implementations (if not all) targeting conventional cache-based microprocessors do not perform well. They either exhibit overall low performance or exhibit highly variable performance depending on the types of inputs. For this reason, real-time state of the art solutions rely on the use of either custom hardware or Field-Programmable Gate Arrays (FPGAs) at the expense of overall system flexibility and programmability. This paper presents a software based implementation of the Aho-Corasick string searching algorithm on the Cray XMT multithreaded shared memory machine. Our solution relies on the particular features of the XMT architecture and on several algorithmic strategies: it is fast, scalable and its performance is virtually content-independent. On a 128-processor Cray XMT, it reaches a scanning speed of ≈ 28 Gbps with a performance variability below 10%. In the 10 Gbps performance range, variability is below 2.5%. By comparison, an Intel dual-socket, 8-core system running at 2.66 GHz achieves a peak performance which varies from 500 Mbps to 10 Gbps depending on the type of input and dictionary size.
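For reference, a minimal serial Python sketch of the Aho-Corasick automaton (goto trie, failure links, output sets) is given below; the XMT-specific data layout and multithreading that make the paper's implementation content-independent and scalable are not represented.

from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                              # phase 1: build the trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:                                        # phase 2: failure links by BFS
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]                  # inherit matches ending at the fallback state
    return goto, fail, out

def search(text, goto, fail, out):
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))        # (start index, pattern)
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", *automaton))                 # matches 'she', 'he' and 'hers'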
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas
The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the Cray XK7 Titan, at its maximal heterogeneous capability. Such simulations had not been possible before, because the time-to-solution fell short by more than a factor of 10 of what is needed to complete one physics case in less than 5 days of wall-clock time. Frontier techniques employed include nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous application interactions, and dynamic repartitioning.
Implementations of BLAST for parallel computers.
Jülich, A
1995-02-01
The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.
1990-11-12
This feature prevents any significant unexpected and undesired size overhead introduced by the automatic inlining of a called subprogram. ... PRESERVELAYOUT forces the 5.5.1 compiler to maintain the Ada source order of a given record type, thereby preventing the compiler from performing this ... Environment, Volume 2: Programming Guide ... assignments to the copied array in Ada do not affect the Fortran version of the array. The dimensions and order of ...
NASA Technical Reports Server (NTRS)
Adams, Neil S.; Bollenbacher, Gary
1992-01-01
This report discusses the development and underlying mathematics of a rigid-body computer model of a proposed cryogenic on-orbit liquid depot storage, acquisition, and transfer spacecraft (COLD-SAT). This model, referred to in this report as the COLD-SAT dynamic model, consists of both a trajectory model and an attitudinal model. All disturbance forces and torques expected to be significant for the actual COLD-SAT spacecraft are modeled to the required degree of accuracy. Control and experimental thrusters are modeled, as well as fluid slosh. The model also computes microgravity disturbance accelerations at any specified point in the spacecraft. The model was developed by using the Boeing EASY5 dynamic analysis package and will run on Apollo, Cray, and other computing platforms.
NASA Technical Reports Server (NTRS)
Chan, Gordon C.; Turner, Horace Q.
1990-01-01
COSMIC/NASTRAN, as it is supported and maintained by COSMIC, runs on four main-frame computers - CDC, VAX, IBM and UNIVAC. COSMIC/NASTRAN on other computers, such as CRAY, AMDAHL, PRIME, CONVEX, etc., is available commercially from a number of third party organizations. All these computers, with their own one-of-a-kind operating systems, make NASTRAN machine dependent. The job control language (JCL), the file management, and the program execution procedure of these computers are vastly different, although 95 percent of NASTRAN source code was written in standard ANSI FORTRAN 77. The advantage of the UNIX operating system is that it has no machine boundary. UNIX is becoming widely used in many workstations, mini's, super-PC's, and even some main-frame computers. NASTRAN for the UNIX operating system is definitely the way to go in the future, and makes NASTRAN available to a host of computers, big and small. Since 1985, many NASTRAN improvements and enhancements were made to conform to the ANSI FORTRAN 77 standards. A major UNIX migration effort was incorporated into COSMIC NASTRAN 1990 release. As a pioneer work for the UNIX environment, a version of COSMIC 89 NASTRAN was officially released in October 1989 for DEC ULTRIX VAXstation 3100 (with VMS extensions). A COSMIC 90 NASTRAN version for DEC ULTRIX DECstation 3100 (with RISC) is planned for April 1990 release. Both workstations are UNIX based computers. The COSMIC 90 NASTRAN will be made available on a TK50 tape for the DEC ULTRIX workstations. Previously in 1988, an 88 NASTRAN version was tested successfully on a SiliconGraphics workstation.
Cheetah: A Framework for Scalable Hierarchical Collective Operations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Graham, Richard L; Gorentla Venkata, Manjunath; Ladd, Joshua S
2011-01-01
Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous with increasing node and core-per-node counts. Also, a growing number of data-access mechanisms, of varying characteristics, are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. It is flexible, with run-time hierarchy specification, and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations, MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5, and a small InfiniBand based cluster. At 49,152 processes our barrier implementation outperforms the optimized native implementation by 75%. 32 Byte and one Mega-Byte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.
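The hierarchical idea can be sketched with mpi4py (a hypothetical illustration, not Cheetah's implementation): split the world communicator into on-node subcommunicators, broadcast between node leaders first, then within each node, so the payload crosses the network only once per node.

from mpi4py import MPI

world = MPI.COMM_WORLD
node = world.Split_type(MPI.COMM_TYPE_SHARED)          # ranks sharing a node
leaders = world.Split(0 if node.rank == 0 else MPI.UNDEFINED, world.rank)

def hierarchical_bcast(obj):
    # assumes the source is world rank 0, which by construction above is both
    # a node leader (node.rank == 0) and rank 0 of the leaders communicator
    if node.rank == 0:
        obj = leaders.bcast(obj, root=0)   # step 1: across the network, leaders only
    return node.bcast(obj, root=0)         # step 2: within each node, over shared memory

data = {"step": 42} if world.rank == 0 else None
data = hierarchical_bcast(data)
print(world.rank, data)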
ARCGRAPH SYSTEM - AMES RESEARCH GRAPHICS SYSTEM
NASA Technical Reports Server (NTRS)
Hibbard, E. A.
1994-01-01
Ames Research Graphics System, ARCGRAPH, is a collection of libraries and utilities which assist researchers in generating, manipulating, and visualizing graphical data. In addition, ARCGRAPH defines a metafile format that contains device independent graphical data. This file format is used with various computer graphics manipulation and animation packages at Ames, including SURF (COSMIC Program ARC-12381) and GAS (COSMIC Program ARC-12379). In its full configuration, the ARCGRAPH system consists of a two stage pipeline which may be used to output graphical primitives. Stage one is associated with the graphical primitives (i.e. moves, draws, color, etc.) along with the creation and manipulation of the metafiles. Five distinct data filters make up stage one. They are: 1) PLO which handles all 2D vector primitives, 2) POL which handles all 3D polygonal primitives, 3) RAS which handles all 2D raster primitives, 4) VEC which handles all 3D raster primitives, and 5) PO2 which handles all 2D polygonal primitives. Stage two is associated with the process of displaying graphical primitives on a device. To generate the various graphical primitives, create and reprocess ARCGRAPH metafiles, and access the device drivers in the VDI (Video Device Interface) library, users link their applications to ARCGRAPH's GRAFIX library routines. Both FORTRAN and C language versions of the GRAFIX and VDI libraries exist for enhanced portability within these respective programming environments. The ARCGRAPH libraries were developed on a VAX running VMS. Minor documented modification of various routines, however, allows the system to run on the following computers: Cray X-MP running COS (no C version); Cray 2 running UNICOS; DEC VAX running BSD 4.3 UNIX, or Ultrix; SGI IRIS Turbo running GL2-W3.5 and GL2-W3.6; Convex C1 running UNIX; Amdahl 5840 running UTS; Alliant FX8 running UNIX; Sun 3/160 running UNIX (no native device driver); Stellar GS1000 running Stellex (no native device driver); and an SGI IRIS 4D running IRIX (no native device driver). Currently with version 7.0 of ARCGRAPH, the VDI library supports the following output devices: A VT100 terminal with a RETRO-GRAPHICS board installed, a VT240 using the Tektronix 4010 emulation capability, an SGI IRIS turbo using the native GL2 library, a Tektronix 4010, a Tektronix 4105, and the Tektronix 4014. ARCGRAPH version 7.0 was developed in 1988.
Data Transfer Study HPSS Archiving
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wynne, James; Parete-Koon, Suzanne T; Mitchell, Quinn
2015-01-01
The movement of the large amounts of data produced by codes run in a High Performance Computing (HPC) environment can be a bottleneck for project workflows. To balance filesystem capacity and performance requirements, HPC centers enforce data management policies to purge old files to make room for new computation and analysis results. Users at Oak Ridge Leadership Computing Facility (OLCF) and many other HPC user facilities must archive data to avoid data loss during purges, therefore the time associated with data movement for archiving is something that all users must consider. This study observed the difference in transfer speed from the originating location on the Lustre filesystem to the more permanent High Performance Storage System (HPSS). The tests were done with a number of different transfer methods for files that spanned a variety of sizes and compositions that reflect OLCF user data. This data will be used to help users of Titan and other Cray supercomputers plan their workflow and data transfers so that they are most efficient for their project. We will also discuss best practice for maintaining data at shared user facilities.
Structural/aerodynamic Blade Analyzer (SAB) User's Guide, Version 1.0
NASA Technical Reports Server (NTRS)
Morel, M. R.
1994-01-01
The structural/aerodynamic blade (SAB) analyzer provides an automated tool for the static-deflection analysis of turbomachinery blades with aerodynamic and rotational loads. A structural code calculates a deflected blade shape using aerodynamic loads input. An aerodynamic solver computes aerodynamic loads using deflected blade shape input. The two programs are iterated automatically until deflections converge. Currently, SAB version 1.0 is interfaced with MSC/NASTRAN to perform the structural analysis and PROP3D to perform the aerodynamic analysis. This document serves as a guide for the operation of the SAB system with specific emphasis on its use at NASA Lewis Research Center (LeRC). This guide consists of six chapters: an introduction which gives a summary of SAB; SAB's methodology, component files, links, and interfaces; input/output file structure; setup and execution of the SAB files on the Cray computers; hints and tips to advise the user; and an example problem demonstrating the SAB process. In addition, four appendices are presented to define the different computer programs used within the SAB analyzer and describe the required input decks.
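The SAB iteration loop can be sketched generically; in the Python below the aerodynamic and structural solvers are trivial stand-ins (hypothetical formulas, in place of PROP3D and MSC/NASTRAN) and the loop simply repeats loads-then-deflections until the blade shape stops changing.

import numpy as np

def aero_loads(shape):
    # stand-in aerodynamic solver: load grows with the local incidence of the shape
    return 1.0 + 0.3 * np.sin(shape)

def structural_deflection(loads, stiffness=5.0):
    # stand-in structural solver: deflection proportional to load
    return loads / stiffness

shape = np.zeros(50)                            # undeflected blade
for it in range(100):
    loads = aero_loads(shape)                   # aerodynamic step
    new_shape = structural_deflection(loads)    # structural step
    change = np.max(np.abs(new_shape - shape))
    shape = new_shape
    if change < 1e-10:                          # deflections have converged
        break
print(it, change)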
Analysis of internal flows relative to the space shuttle main engine
NASA Technical Reports Server (NTRS)
1987-01-01
Cooperative efforts between the Lockheed-Huntsville Computational Mechanics Group and the NASA-MSFC Computational Fluid Dynamics staff have resulted in improved capabilities for numerically simulating incompressible flows generic to the Space Shuttle Main Engine (SSME). A well established and documented CFD code was obtained, modified, and applied to laminar and turbulent flows of the type occurring in the SSME Hot Gas Manifold. The INS3D code was installed on the NASA-MSFC CRAY-XMP computer system and is currently being used by NASA engineers. Studies to perform a transient analysis of the FPB were conducted. The COBRA/TRAC code is recommended for simulating the transient flow of oxygen into the LOX manifold. Property data for modifying the code to represent LOX/GOX flow was collected. The ALFA code was developed and recommended for representing the transient combustion in the preburner. These two codes will couple through the transient boundary conditions to simulate the startup and/or shutdown of the fuel preburner. A study, NAS8-37461, is currently being conducted to implement this modeling effort.
POMESH - DIFFRACTION ANALYSIS OF REFLECTOR ANTENNAS
NASA Technical Reports Server (NTRS)
Hodges, R. E.
1994-01-01
POMESH is a computer program capable of predicting the performance of reflector antennas. Both far field pattern and gain calculations are performed using the Physical Optics (PO) approximation of the equivalent surface currents. POMESH is primarily intended for relatively small reflectors. It is useful in situations where the surface is described by irregular data that must be interpolated and for cases where the surface derivatives are not known. This method is flexible and robust and also supports near field calculations. Because of the near field computation ability, this computational engine is quite useful for subreflector computations. The program is constructed in a highly modular form so that it may be readily adapted to perform tasks other than the one that is explicitly described here. Since the computationally intensive portions of the algorithm are simple loops, the program can be easily adapted to take advantage of vector processor and parallel architectures. In POMESH the reflector is represented as a piecewise planar surface comprised of triangular regions known as facets. A uniform physical optics (PO) current is assumed to exist on each triangular facet. Then, the PO integral on a facet is approximated by the product of the PO current value at the center and the area of the triangle. In this way, the PO integral over the reflector surface is reduced to a summation of the contribution from each triangular facet. The source horn, or feed, that illuminates the subreflector is approximated by a linear combination of plane patterns. POMESH contains three polarization pattern definitions for the feed; a linear x-polarized element, linear y-polarized element, and a circular polarized element. If a more general feed pattern is required, it is a simple matter to replace the subroutine that implements the pattern definitions. POMESH obtains information necessary to specify the coordinate systems, location of other data files, and parameters of the desired calculation from a user provided data file. A numerical description of the principal plane patterns of the source horn must also be provided. The program is supplied with an analytically defined parabolic reflector surface. However, it is a simple matter to replace it with a user defined reflector surface. Output is given in the form of a data stream to the terminal; a summary of the parameters used in the computation and some sample results in a file; and a data file of the results of the pattern calculations suitable for plotting. POMESH is written in FORTRAN 77 for execution on CRAY series computers running UNICOS. With minor modifications, it has also been successfully implemented on a Sun4 series computer running SunOS, a DEC VAX series computer running VMS, and an IBM PC series computer running OS/2. It requires 2.5Mb of RAM under SunOS 4.1.1, 2.5Mb of RAM under VMS 5-4.3, and 2.5Mb of RAM under OS/2. The OS/2 version requires the Lahey F77L compiler. The standard distribution medium for this program is one 5.25 inch 360K MS-DOS format diskette. It is also available on a .25 inch streaming magnetic tape cartridge in UNIX tar format and a 9-track 1600 BPI magnetic tape in DEC VAX FILES-11 format. POMESH was developed in 1989 and is a copyrighted work with all copyright vested in NASA. CRAY and UNICOS are registered trademarks of Cray Research, Inc. SunOS and Sun4 are trademarks of Sun Microsystems, Inc. DEC, DEC FILES-11, VAX and VMS are trademarks of Digital Equipment Corporation.
IBM PC and OS/2 are registered trademarks of International Business Machines, Inc. UNIX is a registered trademark of Bell Laboratories.
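A highly simplified scalar sketch of the facet summation POMESH describes is given below: the physical-optics radiation integral is approximated by summing, over triangular facets, the current sampled at the facet centre times the facet area and a far-field phase factor. Vector currents, polarization, feed illumination and constant factors are all omitted, and the flat "reflector" used here is purely illustrative.

import numpy as np

def far_field_sum(centres, areas, currents, k, rhat):
    # centres (N, 3), areas (N,), complex currents (N,), unit observation direction rhat (3,)
    phase = np.exp(1j * k * (centres @ rhat))        # far-field phase of each facet centre
    return np.sum(currents * areas * phase)

# toy "reflector": facets tiling a 1 m square plate with uniform unit current
n = 40
xc = (np.arange(n) + 0.5) / n - 0.5
X, Y = np.meshgrid(xc, xc, indexing="ij")
centres = np.column_stack([X.ravel(), Y.ravel(), np.zeros(n * n)])
areas = np.full(n * n, 1.0 / (n * n))
currents = np.ones(n * n, dtype=complex)
k = 2 * np.pi / 0.03                                 # wavenumber for a ~3 cm wavelength

for theta_deg in (0, 2, 5):
    th = np.radians(theta_deg)
    rhat = np.array([np.sin(th), 0.0, np.cos(th)])
    print(theta_deg, abs(far_field_sum(centres, areas, currents, k, rhat)))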
Using the K-25 C TD Common File System: A guide to CFSI (CFS Interface)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1989-12-01
A CFS (Common File System) is a large, centralized file management and storage facility based on software developed at Los Alamos National Laboratory. This manual is a guide to use of the CFS available to users of the Cray UNICOS system at Martin Marietta Energy Systems, Inc., in Oak Ridge, Tennessee.
Efficient iterative methods applied to the solution of transonic flows
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wissink, A.M.; Lyrintzis, A.S.; Chronopoulos, A.T.
1996-02-01
We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested: a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel Thinking Machines CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems. 38 refs., 14 figs., 7 tabs.
NASA Astrophysics Data System (ADS)
Eisenbach, Markus
The Locally Self-consistent Multiple Scattering (LSMS) code solves the first-principles density functional theory Kohn-Sham equation for a wide range of materials, with a special focus on metals, alloys, and metallic nanostructures. It has traditionally exhibited near-perfect scalability on massively parallel high performance computer architectures. We present our efforts to exploit GPUs to accelerate the LSMS code, enabling first-principles calculations of O(100,000) atoms and statistical physics sampling of finite-temperature properties. Using the Cray XK7 system Titan at the Oak Ridge Leadership Computing Facility, we achieve a sustained performance of 14.5 PFlop/s and a speedup of 8.6 compared to the CPU-only code. This work has been sponsored by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Material Sciences and Engineering Division, and by the Office of Advanced Scientific Computing. This work used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; ...
2017-09-14
In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
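The preconditioned block iterative idea can be illustrated with SciPy's LOBPCG solver standing in for the paper's method; the actual solver, starting-guess construction, and structured preconditioner are specific to the CI problem and are not reproduced here. Everything below (matrix, guesses, preconditioner) is an illustrative stand-in:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

# Stand-in for the sparse symmetric CI Hamiltonian
rng = np.random.default_rng(0)
n, k = 5000, 8                                     # dimension, number of lowest states
H = sp.random(n, n, density=1e-3, random_state=rng)
H = (H + H.T) + sp.diags(np.arange(1.0, n + 1))    # symmetric, diagonally dominant

# Starting block: the paper builds good guesses from a smaller CI space;
# here we simply use coordinate vectors of the smallest diagonal entries.
X = np.zeros((n, k))
X[:k, :] = np.eye(k)

# Diagonal (Jacobi) preconditioner as a simple stand-in for their preconditioner
M = sp.diags(1.0 / H.diagonal())

vals, vecs = lobpcg(H, X, M=M, largest=False, tol=1e-6, maxiter=200)
```

A block solver applies the operator to several vectors at once, which is exactly what raises arithmetic intensity relative to the one-vector-at-a-time Lanczos recurrence.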
Vectorized program architectures for supercomputer-aided circuit design
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rizzoli, V.; Ferlito, M.; Neri, A.
1986-01-01
Vector processors (supercomputers) can be effectively employed in MIC or MMIC applications to solve problems of large numerical size, such as broad-band nonlinear design or statistical design (yield optimization). In order to fully exploit the capabilities of vector hardware, any program architecture must be structured accordingly. This paper presents a possible approach to the "semantic" vectorization of microwave circuit design software. Speed-up factors of the order of 50 can be obtained on a typical vector processor (Cray X-MP) with respect to the most powerful scalar computers (CDC 7600), with cost reductions of more than one order of magnitude. This could broaden the horizon of microwave CAD techniques to include problems that are practically out of the reach of conventional systems.
NASA Technical Reports Server (NTRS)
Raju, I. S.; Newman, J. C., Jr.
1993-01-01
A computer program, surf3d, was developed that uses the 3D finite-element method to calculate the stress-intensity factors for surface, corner, and embedded cracks in finite-thickness plates with and without circular holes. The cracks are assumed to be either elliptic or part-elliptic in shape. The computer program uses eight-noded hexahedral elements to model the solid, with a skyline storage and solver. The stress-intensity factors are evaluated using the force method, the crack-opening displacement method, and the 3D virtual crack closure method. In the manual, the input to and output of the surf3d program are described. The manual also demonstrates the use of the program and describes the calculation of the stress-intensity factors. Several examples with sample data files are included with the manual. To facilitate modeling of the user's crack configuration and loading, a companion preprocessor program, gensurf, that generates the data for surf3d was also developed. gensurf is a three-dimensional mesh generator that requires minimal input and builds a complete data file for surf3d. The program surf3d is operational on Unix machines such as the Cray Y-MP, Cray-2, and Convex C-220.
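For readers unfamiliar with the virtual crack closure technique, the sketch below shows the identity it rests on: the energy release rate per mode is the work done by the crack-front nodal forces acting through the relative displacements of the node pair just behind the front. The single-node-pair form and all names are illustrative, not surf3d's actual implementation:

```python
import numpy as np

def vcct_sif(f_tip, du_behind, delta_a, width, E, nu, plane_strain=True):
    """Energy release rates and stress-intensity factors by virtual crack closure.

    f_tip     : (3,) crack-front nodal forces (opening, sliding, tearing)
    du_behind : (3,) relative displacements of the node pair behind the front
    delta_a   : element length along the crack advance
    width     : element width along the crack front
    """
    G = f_tip * du_behind / (2.0 * delta_a * width)   # G_I, G_II, G_III
    Ep = E / (1.0 - nu**2) if plane_strain else E
    K = np.sqrt(Ep * G)   # valid for modes I and II; mode III uses the shear modulus
    return G, K
```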
High Performance Computing Software Applications for Space Situational Awareness
NASA Astrophysics Data System (ADS)
Giuliano, C.; Schumacher, P.; Matson, C.; Chun, F.; Duncan, B.; Borelli, K.; Desonia, R.; Gusciora, G.; Roe, K.
The High Performance Computing Software Applications Institute for Space Situational Awareness (HSAI-SSA) has completed its first full year of applications development. The emphasis of our work in this first year was on improving space surveillance sensor models and image enhancement software. These applications are the Space Surveillance Network Analysis Model (SSNAM), the Air Force Space Fence simulation (SimFence), and the physically constrained iterative deconvolution (PCID) image enhancement software tool. Specifically, we have demonstrated order-of-magnitude speed-ups in these codes running on the latest Cray XD1 Linux supercomputer (Hoku) at the Maui High Performance Computing Center. The software application improvements that HSAI-SSA has made have had a significant impact on the warfighter and have fundamentally changed the role of high performance computing in SSA.
Time-partitioning simulation models for calculation on parallel computers
NASA Technical Reports Server (NTRS)
Milner, Edward J.; Blech, Richard A.; Chima, Rodrick V.
1987-01-01
A technique allowing time-staggered solution of partial differential equations is presented in this report. Using this technique, called time-partitioning, simulation execution speedup is proportional to the number of processors used, because all processors operate simultaneously, each updating the solution grid at a different time point. The technique is limited neither by the number of processors available nor by the dimension of the solution grid. Time-partitioning was used to obtain the flow pattern through a cascade of airfoils, modeled by the Euler partial differential equations. An execution speedup factor of 1.77 was achieved using a two-processor Cray X-MP/24 computer.
Application of a distributed network in computational fluid dynamic simulations
NASA Technical Reports Server (NTRS)
Deshpande, Manish; Feng, Jinzhang; Merkle, Charles L.; Deshpande, Ashish
1994-01-01
A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using Parallel Virtual Machine (PVM) and compared with its performance on a CRAY Y-MP and on an Intel iPSC/860. The problem is relatively computationally intensive and has a communication structure based primarily on nearest-neighbor communication, making it ideally suited to message passing. Such problems are frequently encountered in computational fluid dynamics (CFD), and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in message passing in order to produce a near-optimal solution. Results are presented for various grid sizes using up to eight processors.
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion
NASA Technical Reports Server (NTRS)
Bailey, David H.; Ferguson, Helaman R. P.
1988-01-01
Techniques are described for computing matrix inverses by algorithms that are highly suited to massively parallel computation. The techniques are based on an algorithm suggested by Strassen (1969). Variations of this scheme use matrix Newton iterations and other methods to improve the numerical stability while at the same time preserving a very high level of parallelism. One-processor Cray-2 implementations of these schemes range from one that is up to 55 percent faster than a conventional library routine to one that is slower than a library routine but achieves excellent numerical stability. The problem of computing the solution to a single set of linear equations is discussed, and it is shown that this problem can also be solved efficiently using these techniques.
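The matrix Newton iteration at the heart of such schemes is short enough to state directly. A minimal NumPy sketch follows (illustrative, not the authors' Cray-2 code), using the classical starting guess X0 = A^T/(||A||_1 ||A||_inf), for which convergence is guaranteed for nonsingular A. Both products per step are plain matrix multiplies, which is why the scheme pairs naturally with Strassen multiplication and parallel hardware:

```python
import numpy as np

def newton_inverse(A, iters=50):
    """Matrix Newton iteration X <- X(2I - AX); quadratically convergent."""
    n = A.shape[0]
    # Classical safe starting guess: X0 = A^T / (||A||_1 * ||A||_inf)
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I2 = 2.0 * np.eye(n)
    for _ in range(iters):
        X = X @ (I2 - A @ X)    # two matrix multiplies per step
    return X

# Quick check on a well-conditioned random matrix
A = np.random.default_rng(1).standard_normal((64, 64)) + 8.0 * np.eye(64)
print(np.linalg.norm(newton_inverse(A) @ A - np.eye(64)))   # ~ machine epsilon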
Performance evaluation of the Engineering Analysis and Data Systems (EADS) 2
NASA Technical Reports Server (NTRS)
Debrunner, Linda S.
1994-01-01
The Engineering Analysis and Data System (EADS) II (1) was installed in March 1993 to provide high performance computing for science and engineering at Marshall Space Flight Center (MSFC). EADS II increased the computing capabilities over the existing EADS facility in the areas of throughput and mass storage. EADS II includes a Vector Processor Compute System (VPCS), a Virtual Memory Compute System, a Common File System (CFS), and a Common Output System (COS), as well as an Image Processing Station, Mini Super Computers, and Intelligent Workstations. These facilities are interconnected by a sophisticated network system. This work considers only the performance of the VPCS and the CFS. The VPCS is a Cray Y-MP. The CFS is implemented on an RS/6000 using the UniTree Mass Storage System. To better meet the science and engineering computing requirements, EADS II must be monitored, its performance analyzed, and appropriate modifications for performance improvement made. Implementing this approach requires tools to assist in performance monitoring and analysis. In Spring 1994, PerfStat 2.0 was purchased to meet these needs for the VPCS and the CFS. PerfStat (2) is a set of tools that can be used to analyze both historical and real-time performance data. Its flexible design allows significant user customization. The user identifies what data is collected, how it is classified, and how it is displayed for evaluation. Both graphical and tabular displays are supported. The capability of the PerfStat tool was evaluated, appropriate modifications to EADS II to optimize throughput and enhance productivity were suggested and implemented, and the effects of these modifications on system performance were observed. In this paper, the PerfStat tool is described, and its use with EADS II is outlined briefly. Next, the evaluation of the VPCS, as well as the modifications made to the system, are described. Finally, conclusions are drawn and recommendations for future work are outlined.
Numerical Technology for Large-Scale Computational Electromagnetics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sharpe, R; Champagne, N; White, D
The key bottleneck of implicit computational electromagnetics tools for large complex geometries is the solution of the resulting linear system of equations. The goal of this effort was to research and develop critical numerical technology that alleviates this bottleneck for large-scale computational electromagnetics (CEM). The mathematical operators and numerical formulations used in this arena of CEM yield linear equations that are complex valued, unstructured, and indefinite. Also, simultaneously applying multiple mathematical modeling formulations to different portions of a complex problem (hybrid formulations) results in a mixed-structure linear system, further increasing the computational difficulty. Typically, these hybrid linear systems are solved using a direct solution method, which was acceptable for Cray-class machines but does not scale adequately for ASCI-class machines. Additionally, LLNL's previously existing linear solvers were not well suited for the linear systems that are created by hybrid implicit CEM codes. Hence, a new approach was required to make effective use of ASCI-class computing platforms and to enable the next generation design capabilities. Multiple approaches were investigated, including the latest sparse-direct methods developed by our ASCI collaborators. In addition, approaches that combine domain decomposition (or matrix partitioning) with general-purpose iterative methods and special-purpose preconditioners were investigated. Special-purpose preconditioners that take advantage of the structure of the matrix were adapted and developed based on intimate knowledge of the matrix properties. Finally, new operator formulations were developed that radically improve the conditioning of the resulting linear systems, thus greatly reducing solution time. The goal was to enable the solution of CEM problems that are 10 to 100 times larger than our previous capability.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bland, Arthur S Buddy; Hack, James J; Baker, Ann E
Oak Ridge National Laboratory's (ORNL's) Cray XT5 supercomputer, Jaguar, kicked off the era of petascale scientific computing in 2008 with applications that sustained more than a thousand trillion floating point calculations per second - or 1 petaflop. Jaguar continues to grow even more powerful as it helps researchers broaden the boundaries of knowledge in virtually every domain of computational science, including weather and climate, nuclear energy, geosciences, combustion, bioenergy, fusion, and materials science. Their insights promise to broaden our knowledge in areas that are vitally important to the Department of Energy (DOE) and the nation as a whole, particularly energy assurance and climate change. The science of the 21st century, however, will demand further revolutions in computing, supercomputers capable of a million trillion calculations a second - 1 exaflop - and beyond. These systems will allow investigators to continue attacking global challenges through modeling and simulation and to unravel longstanding scientific questions. Creating such systems will also require new approaches to daunting challenges. High-performance systems of the future will need to be codesigned for scientific and engineering applications with best-in-class communications networks and data-management infrastructures and teams of skilled researchers able to take full advantage of these new resources. The Oak Ridge Leadership Computing Facility (OLCF) provides the nation's most powerful open resource for capability computing, with a sustainable path that will maintain and extend national leadership for DOE's Office of Science (SC). The OLCF has engaged a world-class team to support petascale science and to take a dramatic step forward, fielding new capabilities for high-end science. This report highlights the successful delivery and operation of a petascale system and shows how the OLCF fosters application development teams, developing cutting-edge tools and resources for next-generation systems.
ALCF Data Science Program: Productive Data-centric Supercomputing
NASA Astrophysics Data System (ADS)
Romero, Nichols; Vishwanath, Venkatram
The ALCF Data Science Program (ADSP) is targeted at big data science problems that require leadership computing resources. The goal of the program is to explore and improve a variety of computational methods that will enable data-driven discoveries across all scientific disciplines. The projects will focus on data science techniques covering a wide area of discovery including but not limited to uncertainty quantification, statistics, machine learning, deep learning, databases, pattern recognition, image processing, graph analytics, data mining, real-time data analysis, and complex and interactive workflows. Project teams will be among the first to access Theta, ALCF's forthcoming 8.5 petaflops Intel/Cray system. The program will transition to the 200 petaflop/s Aurora supercomputing system when it becomes available. In 2016, four projects were selected to kick off the ADSP. The selected projects span experimental and computational sciences and range from modeling the brain to discovering new materials for solar-powered windows to simulating collision events at the Large Hadron Collider (LHC). The program will have a regular call for proposals, with the next call expected in Spring 2017. See http://www.alcf.anl.gov/alcf-data-science-program for details. This research used resources of the ALCF, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Research on Spectroscopy, Opacity, and Atmospheres
NASA Technical Reports Server (NTRS)
Kurucz, Robert L.
1999-01-01
To make my calculations more readily accessible I have set up a web site, cfaku5.harvard.edu, that can also be accessed by FTP. It has five 9GB disks that hold all of my atomic and diatomic molecular data, my tables of distribution function opacities, my grids of model atmospheres, colors, fluxes, etc., my programs that are ready for distribution, and most of my recent papers. Atlases and computed spectra will be added as they are completed. New atomic and molecular calculations will be added as they are completed. I got my atomic programs that had been running on a Cray at the San Diego Supercomputer Center to run on my Vaxes and Alpha. I started with Ni and Co because there were new laboratory analyses that included isotopic and hyperfine splitting. Those calculations are described in the appended abstract for the 6th Atomic Spectroscopy and Oscillator Strengths meeting in Victoria last summer. A surprising finding is that quadrupole transitions have been grossly in error because mixing with higher levels has not been included. I now have enough memory in my Alpha to treat 3000 x 3000 matrices. I now include all levels up through n=9 for Fe I and Fe II, the spectra for which the most information is available. I am finishing those calculations right now. After Fe I and Fe II, all other spectra are "easy", and I will be in mass production. ATLAS12, my opacity sampling program for computing models with arbitrary abundances, has been put on the web server. I wrote a new distribution function opacity program for workstations that replaces the one I used on the Cray at the San Diego Supercomputer Center. Each set of abundances would take 100 Cray hours costing $100,000. I ran 25 cases. Each of my opacity CDs contains three abundances. I have a new program running on the Alpha that takes about a week. I am going to have to get a faster processor or I will have to dedicate a whole workstation just to opacities.
Lanczos eigensolution method for high-performance computers
NASA Technical Reports Server (NTRS)
Bostic, Susan W.
1991-01-01
The theory, computational analysis, and applications are presented of a Lanczos algorithm on high performance computers. The computationally intensive steps of the algorithm are identified as: the matrix factorization, the forward/backward equation solution, and the matrix-vector multiplies. These computational steps are optimized to exploit the vector and parallel capabilities of high performance computers. The savings in computational time from applying optimization techniques such as variable-band and sparse data storage and access, loop unrolling, use of local memory, and compiler directives are presented. Two large scale structural analysis applications are described: the buckling of a composite blade-stiffened panel with a cutout, and the vibration analysis of a high speed civil transport. The sequential computational time for the panel problem executed on a CONVEX computer of 181.6 seconds was decreased to 14.1 seconds with the optimized vector algorithm. The best computational time of 23 seconds for the transport problem with 17,000 degrees of freedom was on the Cray Y-MP using an average of 3.63 processors.
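The kernel being optimized is the Lanczos recurrence itself. A minimal sketch follows; it is illustrative, and in structural analysis the operator is usually the shift-inverted pencil, so matvec would wrap the matrix factorization and the forward/backward solves the abstract identifies as hot spots:

```python
import numpy as np

def lanczos(matvec, n, m, rng=np.random.default_rng(0)):
    """m-step Lanczos reduction of a symmetric operator to tridiagonal form.

    The three kernels being tuned appear explicitly below: the operator
    application (matvec), the AXPY-style vector updates, and the dot products.
    """
    Q = np.zeros((n, m + 1))
    alpha, beta = np.zeros(m), np.zeros(m)
    q0 = rng.standard_normal(n)
    Q[:, 0] = q0 / np.linalg.norm(q0)
    for j in range(m):
        w = matvec(Q[:, j])                    # operator application
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]     # vector update
        alpha[j] = Q[:, j] @ w                 # dot product
        w -= alpha[j] * Q[:, j]
        beta[j] = np.linalg.norm(w)
        if beta[j] == 0.0:                     # exact invariant subspace found
            m = j + 1
            break
        Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    return np.linalg.eigvalsh(T)               # Ritz values approximate eigenvalues
```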
NASA Technical Reports Server (NTRS)
Powers, Alan K.
1994-01-01
The Numerical Aerodynamic Simulation Facility's (NAS) CRAY C916/1024 accesses a "virtual" on-line file system, which is expanding beyond a terabyte of information. This paper presents some options for fine-tuning the Data Migration Facility (DMF) to stretch the on-line disk capacity and explores the transitions to newer devices (STK 4490, ER90, RAID).
Heart Fibrillation and Parallel Supercomputers
NASA Technical Reports Server (NTRS)
Kogan, B. Y.; Karplus, W. J.; Chudin, E. E.
1997-01-01
The Luo-Rudy cardiac cell mathematical model is implemented on the parallel supercomputer Cray T3D. The splitting algorithm, combined with a variable time step and an explicit method of integration, provides reasonable solution times and almost perfect scaling for rectilinear wave propagation. The computer simulation makes it possible to observe new phenomena: the break-up of spiral waves caused by intracellular calcium dynamics, and the non-uniformity of the calcium distribution in space during the onset of the spiral wave.
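The splitting idea, advancing the stiff local membrane kinetics separately from the diffusive coupling on each step, can be sketched compactly. The fragment below uses FitzHugh-Nagumo kinetics as a simple stand-in for the Luo-Rudy model, a fixed time step, and periodic boundaries for brevity; all parameters are illustrative:

```python
import numpy as np

def lap2d(v, h=1.0):
    """Five-point Laplacian with periodic boundaries (for brevity)."""
    return (np.roll(v, 1, 0) + np.roll(v, -1, 0)
            + np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4.0 * v) / h**2

def fhn(v, w, a=0.1, eps=0.01, b=0.5):
    """FitzHugh-Nagumo kinetics, a stand-in for the Luo-Rudy membrane model."""
    return v * (1.0 - v) * (v - a) - w, eps * (v - b * w)

def split_step(v, w, dt, d=0.1):
    """One splitting step: explicit diffusion, then explicit local kinetics."""
    v = v + dt * d * lap2d(v)       # PDE part (intercellular coupling)
    dv, dw = fhn(v, w)              # ODE part (local membrane dynamics)
    return v + dt * dv, w + dt * dw

v = np.zeros((256, 256)); w = np.zeros_like(v)
v[:, :8] = 1.0                      # stimulate one edge to launch a plane wave
for _ in range(2000):
    v, w = split_step(v, w, dt=0.05)
```

Because the kinetics update is purely local, the grid can be divided among processors with communication needed only for the diffusion stencil, which is what makes the near-perfect scaling plausible.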
Parallel FEM Simulation of Electromechanics in the Heart
NASA Astrophysics Data System (ADS)
Xia, Henian; Wong, Kwai; Zhao, Xiaopeng
2011-11-01
Cardiovascular disease is the leading cause of death in America. Computer simulation of the complicated dynamics of the heart could provide valuable quantitative guidance for diagnosis and treatment of heart problems. In this paper, we present an integrated numerical model which encompasses the interaction of cardiac electrophysiology, electromechanics, and mechanoelectrical feedback. The model is solved by the finite element method on a Linux cluster and on Kraken, a Cray XT5 supercomputer. Dynamical influences between the effects of electromechanical coupling and mechanoelectrical feedback are shown.
Scalability Analysis of Gleipnir: A Memory Tracing and Profiling Tool, on Titan
DOE Office of Scientific and Technical Information (OSTI.GOV)
Janjusic, Tommy; Kartsaklis, Christos; Wang, Dali
2013-01-01
Application performance is hindered by a variety of factors, but most notably by the well-known CPU-memory speed gap (also known as the memory wall). Understanding an application's memory behavior is key when trying to optimize performance. Understanding application performance properties is facilitated by various performance profiling tools. The scope of profiling tools varies in complexity, ease of deployment, profiling performance, and the detail of profiled information. Specifically, using profiling tools for performance analysis is a common task when optimizing and understanding scientific applications on complex and large-scale systems such as Cray's XK7. This paper describes the performance characteristics of using Gleipnir, a memory tracing tool, on the Titan Cray XK7 system when instrumenting large applications such as the Community Earth System Model. Gleipnir is a memory tracing tool built as a plug-in for the Valgrind instrumentation framework. The goal of Gleipnir is to provide fine-grained trace information. The generated traces are a stream of executed memory transactions mapped to internal structures per process, thread, function, and finally the data structure or variable. Our focus was to expose tool performance characteristics when using Gleipnir in combination with external tools, such as the cache simulator GlCSim, to characterize the tool's overall performance. In this paper we describe our experience with deploying Gleipnir on the Titan Cray XK7 system, report on the tool's ease of use, and analyze run-time performance characteristics under various workloads. While all performance aspects are important, we mainly focus on I/O characteristics analysis due to the emphasis on the tool's output, which consists of trace files. Moreover, the tool depends on the run-time system to provide the necessary infrastructure to expose low-level system detail; therefore, we also discuss the theoretical benefits that could be achieved if such modules were present.
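A trace stream of the kind described, memory transactions attributed to process, thread, function, and variable, might be consumed with a record type like the following. The field layout is assumed purely for illustration and is not Gleipnir's documented format:

```python
from dataclasses import dataclass

@dataclass
class TraceRecord:
    """One memory transaction in a Gleipnir-style trace (fields assumed)."""
    op: str        # 'L' load, 'S' store, 'M' modify
    address: int   # virtual address of the access
    size: int      # bytes accessed
    process: int
    thread: int
    function: str
    variable: str  # resolved data structure or variable, when available

def parse_line(line: str) -> TraceRecord:
    """Parse one whitespace-separated trace line into a record."""
    op, addr, size, pid, tid, func, var = line.split()
    return TraceRecord(op, int(addr, 16), int(size), int(pid), int(tid), func, var)
```

Even this toy format makes the I/O emphasis of the paper concrete: one record per memory access means trace files grow far faster than the program's own data.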
Portability and Cross-Platform Performance of an MPI-Based Parallel Polygon Renderer
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1999-01-01
Visualizing the results of computations performed on large-scale parallel computers is a challenging problem, due to the size of the datasets involved. One approach is to perform the visualization and graphics operations in place, exploiting the available parallelism to obtain the necessary rendering performance. Over the past several years, we have been developing algorithms and software to support visualization applications on NASA's parallel supercomputers. Our results have been incorporated into a parallel polygon rendering system called PGL. PGL was initially developed on tightly-coupled distributed-memory message-passing systems, including Intel's iPSC/860 and Paragon, and IBM's SP2. Over the past year, we have ported it to a variety of additional platforms, including the HP Exemplar, SGI Origin2000, Cray T3E, and clusters of Sun workstations. In implementing PGL, we have had two primary goals: cross-platform portability and high performance. Portability is important because (1) our manpower resources are limited, making it difficult to develop and maintain multiple versions of the code, and (2) NASA's complement of parallel computing platforms is diverse and subject to frequent change. Performance is important in delivering adequate rendering rates for complex scenes and ensuring that parallel computing resources are used effectively. Unfortunately, these two goals are often at odds. In this paper we report on our experiences with portability and performance of the PGL polygon renderer across a range of parallel computing platforms.
NASA Astrophysics Data System (ADS)
Strayer, Michael
2007-09-01
Good morning. Welcome to Boston, the home of the Red Sox, Celtics and Bruins, baked beans, tea parties, Robert Parker, and SciDAC 2007. A year ago I stood before you to share the legacy of the first SciDAC program and identify the challenges that we must address on the road to petascale computing, a road E. E. Cummings described as `. . . never traveled, gladly beyond any experience.' Today, I want to explore the preparations for the rapidly approaching extreme scale (X-scale) generation. These preparations are the first step propelling us along the road of burgeoning scientific discovery enabled by the application of X-scale computing. We look to petascale computing and beyond to open up a world of discovery that cuts across scientific fields and leads us to a greater understanding of not only our world, but our universe. As part of the President's American Competitiveness Initiative, the ASCR Office has been preparing a ten year vision for computing. As part of this planning, LBNL, together with ORNL and ANL, hosted three town hall meetings on Simulation and Modeling at the Exascale for Energy, Ecological Sustainability and Global Security (E3). The proposed E3 initiative is organized around four programmatic themes: engaging our top scientists, engineers, computer scientists and applied mathematicians; investing in pioneering large-scale science; developing scalable analysis algorithms and storage architectures to accelerate discovery; and accelerating the build-out and future development of the DOE open computing facilities. It is clear that we have only just started down the path to extreme scale computing. Plan to attend Thursday's session on the out-briefing and discussion of these meetings. The road to the petascale has been at best rocky. In FY07, the continuing resolution provided 12% less money for Advanced Scientific Computing than either the President, the Senate, or the House. As a consequence, many of you had to absorb a no cost extension for your SciDAC work. I am pleased that the President's FY08 budget restores the funding for SciDAC. Quoting from the Advanced Scientific Computing Research description in the House Energy and Water Development Appropriations Bill for FY08, "Perhaps no other area of research at the Department is so critical to sustaining U.S. leadership in science and technology, revolutionizing the way science is done and improving research productivity." As a society we need to revolutionize our approaches to energy, environmental and global security challenges. As we go forward along the road to the X-scale generation, the use of computation will continue to be a critical tool, along with theory and experiment, in understanding the behavior of the fundamental components of nature as well as for fundamental discovery and exploration of the behavior of complex systems. The foundation to overcome these societal challenges will build from the experiences and knowledge gained as you, members of our SciDAC research teams, work together to attack problems at the tera- and peta-scale. If SciDAC is viewed as an experiment for revolutionizing scientific methodology, then a strategic goal of the ASCR program must be to broaden the intellectual base prepared to address the challenges of the new X-scale generation of computing. We must focus our computational science experiences gained over the past five years on the opportunities introduced with extreme scale computing. Our facilities are on a path to provide the resources needed to undertake the first part of our journey.
Using the newly upgraded 119 teraflop Cray XT system at the Leadership Computing Facility, SciDAC research teams have in three days performed a 100-year study of the time evolution of the atmospheric CO2 concentration originating from the land surface. The simulation of the El Nino/Southern Oscillation which was part of this study has been characterized as `the most impressive new result in ten years'. Researchers have also gained new insight into the behavior of superheated ionic gas in the ITER reactor as a result of an AORSA run on 22,500 processors that achieved over 87 trillion calculations per second (87 teraflops), which is 74% of the system's theoretical peak. Tomorrow, Argonne and IBM will announce that the first IBM Blue Gene/P, a 100 teraflop system, will be shipped to the Argonne Leadership Computing Facility later this fiscal year. By the end of FY2007, ASCR high performance and leadership computing resources will include the 114 teraflop IBM Blue Gene/P; a 102 teraflop Cray XT4 at NERSC; and a 119 teraflop Cray XT system at Oak Ridge. Before ringing in the New Year, Oak Ridge will upgrade to 250 teraflops with the replacement of the dual core processors with quad core processors, and Argonne will upgrade to between 250-500 teraflops; next year, a petascale Cray Baker system is scheduled for delivery at Oak Ridge. The multidisciplinary teams in our SciDAC Centers for Enabling Technologies and our SciDAC Institutes must continue to work with our Scientific Application teams to overcome the barriers that prevent effective use of these new systems. These challenges include: the need for new algorithms as well as operating system and runtime software and tools which scale to parallel systems composed of hundreds of thousands of processors; program development environments and tools which scale effectively and provide ease of use for developers and scientific end users; and visualization and data management systems that support moving, storing, analyzing, manipulating and visualizing multi-petabytes of scientific data and objects. The SciDAC Centers, located primarily at our DOE national laboratories, will take the lead in ensuring that critical computer science and applied mathematics issues are addressed in a timely and comprehensive fashion, and will address issues associated with the research software lifecycle. In contrast, the SciDAC Institutes, which are university-led centers of excellence, will have more flexibility to pursue new research topics through a range of research collaborations. The Institutes will also work to broaden the intellectual and researcher base, conducting short courses and summer schools to take advantage of new high performance computing capabilities. The SciDAC Outreach Center at Lawrence Berkeley National Laboratory complements the outreach efforts of the SciDAC Institutes. The Outreach Center is our clearinghouse for SciDAC activities and resources and will communicate with the high performance computing community, in part to understand their needs for workshops, summer schools and institutes. SciDAC is not ASCR's only effort to broaden the computational science community needed to meet the challenges of the new X-scale generation. I hope that you were able to attend the Computational Science Graduate Fellowship poster session last night. ASCR developed the fellowship in 1991 to meet the nation's growing need for scientists and technology professionals with advanced computer skills. CSGF, now jointly funded between ASCR and NNSA, is more than a traditional academic fellowship.
It has provided more than 200 of the best and brightest graduate students with guidance, support and community in preparing them as computational scientists. Today CSGF alumni are bringing their diverse top-level skills and knowledge to research teams at DOE laboratories and in industries such as Procter & Gamble, Lockheed Martin and Intel. At universities they are working to train the next generation of computational scientists. To build on this success, we intend to develop a wholly new Early Career Principal Investigator (ECPI) program. Our objective is to stimulate academic research in scientific areas within ASCR's purview, especially among faculty in the early stages of their academic careers. Last February, we lost Ken Kennedy, one of the leading lights of our community. As we move forward into the extreme computing generation, his vision and insight will be greatly missed. In memorial to Ken Kennedy, we shall designate the ECPI grants to beginning faculty in Computer Science as the Ken Kennedy Fellowship. Watch the ASCR website for more information about ECPI and other early career programs in the computational sciences. We look to you, our scientists, researchers, and visionaries, to take X-scale computing and use it to explode scientific discovery in your fields. We at SciDAC will work to ensure that this tool is the sharpest, most precise and most efficient instrument to carve away the unknown and reveal the most exciting secrets and stimulating scientific discoveries of our time. The partnership between research and computing is the marriage that will spur greater discovery, and as Spenser said to Susan in Robert Parker's novel `Sudden Mischief', `We stick together long enough, and we may get as smart as hell'. Michael Strayer
User's manual for PANDA II: A computer code for calculating equations of state
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kerley, G.I.
1991-07-18
PANDA is an interactive computer code that is used to compute equations of state (EOS) for many classes of materials over a wide range of densities and temperatures. The first step in the development of a general EOS model is to determine the EOS for a one-component system, consisting of a single solid or fluid phase and a single chemical species. The results of several such calculations can then be combined to construct EOS for multiphase and multicomponent systems. For one-component solids and fluids, PANDA offers a variety of options for modeling various contributions to the EOS: the zero-Kelvin isotherm, lattice vibrations, fluid degrees of freedom, thermal electronic excitation and ionization, and molecular vibrational and rotational degrees of freedom. Two options are available for computing EOS for multicomponent systems from separate EOS for the individual species and phases. The phase transition model is used for a system of immiscible phases, each having the same chemical composition. In the mixture model, the components can be either miscible or immiscible and can have different chemical compositions; mixtures can be either inert or reactive. PANDA provides over 50 commands that are used to define the EOS models, to make calculations and compare the models to experimental data, and to generate and maintain tabular EOS libraries for use in hydrocodes and other applications. Versions of the code are available for Cray (UNICOS and CTSS), SUN (UNIX), and VAX (VMS) machines, and a small version is available for personal computers (DOS). This report describes the EOS models, use of the commands, and several sample problems. 92 refs., 7 figs., 10 tabs.
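The additive structure of such EOS models maps directly onto code. The sketch below combines arbitrary free-energy contributions F_i(rho, T) and recovers pressure and internal energy from the standard thermodynamic identities P = rho^2 dF/drho and E = F - T dF/dT; the two sample contributions are placeholder functions for illustration only, not PANDA's physical models:

```python
import numpy as np

def helmholtz(rho, T, terms):
    """Total specific free energy as a sum of independent contributions F_i(rho, T)."""
    return sum(f(rho, T) for f in terms)

def eos(rho, T, terms, dr=1e-6, dT=1e-6):
    """Pressure and internal energy via P = rho^2 dF/drho, E = F - T dF/dT."""
    F = lambda r, t: helmholtz(r, t, terms)
    dFdr = (F(rho + dr, T) - F(rho - dr, T)) / (2.0 * dr)   # central differences
    dFdT = (F(rho, T + dT) - F(rho, T - dT)) / (2.0 * dT)
    return rho**2 * dFdr, F(rho, T) - T * dFdT

# Placeholder contributions (illustrative shapes only, not PANDA's models):
cold = lambda rho, T: 4.5e9 * (rho - 2.7)**2 / rho             # zero-Kelvin curve
thermal = lambda rho, T: -2.0e6 * T * np.log(1.0 + T / 300.0)  # thermal term
print(eos(2.7, 300.0, [cold, thermal]))
```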
OpenSHMEM-UCX : Evaluation of UCX for implementing OpenSHMEM Programming Model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Matthew B; Gorentla Venkata, Manjunath; Aderholdt, William Ferrol
2016-01-01
The OpenSHMEM reference implementation was developed towards the goal of developing an open source and high-performing OpenSHMEM implementation. To achieve portability and performance across various networks, the OpenSHMEM reference implementation uses GASNet and UCCS for network operations. Recently, new network layers have emerged with the promise of providing high performance, scalability, and portability for HPC applications. In this paper, we implement the OpenSHMEM reference implementation to use the UCX framework for network operations. Then, we evaluate its performance and scalability on Cray XK systems to understand UCX's suitability for developing the OpenSHMEM programming model. Further, we develop a benchmark called SHOMS for evaluating the OpenSHMEM implementation. Our experimental results show that OpenSHMEM-UCX outperforms the vendor-supplied OpenSHMEM implementation in most cases on the Cray XK system, by up to 40% with respect to message rate and up to 70% for the execution of application kernels.
Performance of the Cray T3D and Emerging Architectures on Canopy QCD Applications
NASA Astrophysics Data System (ADS)
Fischler, Mark; Uchima, Mike
1996-03-01
The Cray T3D, an MIMD system with NUMA shared memory capabilities and in principle very low communications latency, can support the Canopy framework for grid-oriented applications. CANOPY has been ported to the T3D, with the intent of making it available to a spectrum of users. The performance of the T3D running Canopy has been benchmarked on five QCD applications extensively run on ACPMAPS at Fermilab, requiring a variety of data access patterns. The net performance and scaling behavior reveals an efficiency relative to peak Gflops almost identical to that achieved on ACPMAPS. Detailed studies of the major factors impacting performance are presented. Generalizations applying this analysis to the newly emerging crop of commercial systems reveal where their limitations will lie. On these applications, efficiencies of above 25% are not to be expected; eliminating overheads due to Canopy will improve matters, but by less than a factor of two.
Performance of the Cray T3D and emerging architectures on canopy QCD applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischler, M.; Uchima, M.
1995-11-01
The Cray T3D, an MIMD system with NUMA shared memory capabilities and in principle very low communications latency, can support the Canopy framework for grid-oriented applications. CANOPY has been ported to the T3D, with the intent of making it available to a spectrum of users. The performance of the T3D running Canopy has been benchmarked on five QCD applications extensively run on ACPMAPS at Fermilab, requiring a variety of data access patterns. The net performance and scaling behavior reveals an efficiency relative to peak Gflops almost identical to that achieved on ACPMAPS. Detailed studies of the major factors impacting performance are presented. Generalizations applying this analysis to the newly emerging crop of commercial systems reveal where their limitations will lie. On these applications, efficiencies of above 25% are not to be expected; eliminating overheads due to Canopy will improve matters, but by less than a factor of two.
Computation of transonic flow about helicopter rotor blades
NASA Technical Reports Server (NTRS)
Arieli, R.; Tauber, M. E.; Saunders, D. A.; Caughey, D. A.
1986-01-01
An inviscid, nonconservative, three-dimensional full-potential flow code, ROT22, has been developed for computing the quasi-steady flow about a lifting rotor blade. The code is valid throughout the subsonic and transonic regime. Calculations from the code are compared with detailed laser velocimeter measurements made in the tip region of a nonlifting rotor at a tip Mach number of 0.95 and zero advance ratio. In addition, comparisons are made with chordwise surface pressure measurements obtained in a wind tunnel for a nonlifting rotor blade at transonic tip speeds at advance ratios from 0.40 to 0.50. The overall agreement between theoretical calculations and experiment is very good. A typical run on a CRAY X-MP computer requires about 30 CPU seconds for one rotor position at transonic tip speed.
Solving the Cauchy-Riemann equations on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations, a set of coupled first-order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; the FLEX/32, an MIMD machine with 20 processors; and the Cray-2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures, and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
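A relaxation scheme of this kind is easy to sketch. Since the Cauchy-Riemann equations imply each solution component is harmonic, the fragment below relaxes the equivalent Laplace equation with a Jacobi sweep; this is a generic illustration, not the authors' scheme on the first-order system. Note that each sweep is a single data-parallel array expression, which is what lets one kernel map onto SIMD, MIMD, and vector machines alike:

```python
import numpy as np

def jacobi_sweeps(u, n_sweeps):
    """Jacobi relaxation for Laplace's equation; boundary values held fixed.

    The right-hand side is evaluated in full before assignment, so each
    sweep is a true Jacobi update expressed as one array operation.
    """
    for _ in range(n_sweeps):
        u[1:-1, 1:-1] = 0.25 * (u[2:, 1:-1] + u[:-2, 1:-1]
                                + u[1:-1, 2:] + u[1:-1, :-2])
    return u

u = np.zeros((128, 128))
u[0, :] = 1.0                 # Dirichlet data on one boundary
u = jacobi_sweeps(u, 500)
```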
Numerical simulation of three dimensional transonic flows
NASA Technical Reports Server (NTRS)
Sahu, Jubaraj; Steger, Joseph L.
1987-01-01
The three-dimensional flow over a projectile has been computed using an implicit, approximately factored, partially flux-split algorithm. A simple composite grid scheme has been developed in which a single grid is partitioned into a series of smaller grids, for applications which require an external large-memory device such as the SSD of the CRAY X-MP/48, or multitasking. The accuracy and stability of the composite grid scheme have been tested by numerically simulating the flow over an ellipsoid at angle of attack and comparing the solution with a single-grid solution. The flowfield over a projectile at M = 0.96 and 4 deg angle of attack has been computed using a fine grid and compared with experiment.
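The composite grid idea, one logical grid carved into smaller pieces that fit in core memory, can be illustrated generically; this is not the authors' code, and the overlap width and partitioning axis here are arbitrary choices:

```python
import numpy as np

def partition_grid(grid, nparts, overlap=1, axis=0):
    """Split one structured grid into overlapping sub-grids along one axis.

    Each piece can be processed in core (or by a separate task) while
    neighbors share 'overlap' planes for the stencil coupling.
    """
    n = grid.shape[axis]
    edges = np.linspace(0, n, nparts + 1, dtype=int)
    pieces = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sl = [slice(None)] * grid.ndim
        sl[axis] = slice(max(lo - overlap, 0), min(hi + overlap, n))
        pieces.append(grid[tuple(sl)])
    return pieces
```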
PARVMEC: An Efficient, Scalable Implementation of the Variational Moments Equilibrium Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seal, Sudip K; Hirshman, Steven Paul; Wingen, Andreas
The ability to sustain magnetically confined plasma in a state of stable equilibrium is crucial for optimal and cost-effective operation of fusion devices like tokamaks and stellarators. The Variational Moments Equilibrium Code (VMEC) is the de facto serial application used by fusion scientists to compute magnetohydrodynamic (MHD) equilibria and study the physics of three-dimensional plasmas in confined configurations. Modern fusion energy experiments have larger system scales and more interactive experimental workflows, both demanding faster analysis turnaround times on computational workloads that are stressing the capabilities of sequential VMEC. In this paper, we present PARVMEC, an efficient, parallel version of its sequential counterpart, capable of scaling to thousands of processors on distributed memory machines. PARVMEC is a non-linear code, with multiple numerical physics modules, each with its own computational complexity. A detailed speedup analysis supported by scaling results on 1,024 cores of a Cray XC30 supercomputer is presented. Depending on the mode of PARVMEC execution, speedup improvements of one to two orders of magnitude are reported. PARVMEC equips fusion scientists for the first time with a state-of-the-art capability for rapid, high-fidelity analyses of magnetically confined plasmas at unprecedented scales.
Highlights of X-Stack ExM Deliverable Swift/T
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wozniak, Justin M.
Swift/T is a key success from the ExM: System support for extreme-scale, many-task applications X-Stack project, which proposed to use concurrent dataflow as an innovative programming model to exploit extreme parallelism in exascale computers. The Swift/T component of the project reimplemented the Swift language from scratch to allow applications that compose scientific modules together to be built and run on available petascale computers (Blue Gene, Cray). Swift/T does this via a new compiler and runtime that generates and executes the application as an MPI program. We assume that mission-critical emerging exascale applications will be composed as scalable applications using existing software components, connected by data dependencies. Developers wrap native code fragments using a higher-level language, then build composite applications to form a computational experiment. This exemplifies hierarchical concurrency: lower-level messaging libraries are used for fine-grained parallelism; high-level control is used for inter-task coordination. These patterns are best expressed with dataflow, but static DAGs (i.e., other workflow languages) limit the applications that can be built; they do not provide the expressiveness of Swift, such as conditional execution, iteration, and recursive functions.
NASADIG - NASA DEVICE INDEPENDENT GRAPHICS LIBRARY (AMDAHL VERSION)
NASA Technical Reports Server (NTRS)
Rogers, J. E.
1994-01-01
The NASA Device Independent Graphics Library, NASADIG, can be used with many computer-based engineering and management applications. The library gives the user the opportunity to translate data into effective graphic displays for presentation. The software offers many features which allow the user flexibility in creating graphics. These include two-dimensional plots, subplot projections in 3D-space, surface contour line plots, and surface contour color-shaded plots. Routines for three-dimensional plotting, wireframe surface plots, surface plots with hidden line removal, and surface contour line plots are provided. Other features include polar and spherical coordinate plotting, world map plotting utilizing either cylindrical equidistant or Lambert equal area projection, plot translation, plot rotation, plot blowup, splines and polynomial interpolation, area blanking control, multiple log/linear axes, legends and text control, curve thickness control, and multiple text fonts (18 regular, 4 bold). NASADIG contains several groups of subroutines. Included are subroutines for plot area and axis definition; text set-up and display; area blanking; line style set-up, interpolation, and plotting; color shading and pattern control; legend, text block, and character control; device initialization; mixed alphabets setting; and other useful functions. The usefulness of many routines is dependent on the prior definition of basic parameters. The program's control structure uses a serial-level construct with each routine restricted for activation at some prescribed level(s) of problem definition. NASADIG provides the following output device drivers: Selanar 100XL, VECTOR Move/Draw ASCII and PostScript files, Tektronix 40xx, 41xx, and 4510 Rasterizer, DEC VT-240 (4014 mode), IBM AT/PC compatible with SmartTerm 240 emulator, HP Lasergrafix Film Recorder, QMS 800/1200, DEC LN03+ Laserprinters, and HP LaserJet (Series III). NASADIG is written in FORTRAN and is available for several platforms. NASADIG 5.7 is available for DEC VAX series computers running VMS 5.0 or later (MSC-21801), Cray X-MP and Y-MP series computers running UNICOS (COS-10049), and Amdahl 5990 mainframe computers running UTS (COS-10050). NASADIG 5.1 is available for UNIX-based operating systems (MSC-22001). The UNIX version has been successfully implemented on Sun4 series computers running SunOS, SGI IRIS computers running IRIX, Hewlett Packard 9000 computers running HP-UX, and Convex computers running Convex OS (MSC-22001). The standard distribution medium for MSC-21801 is a set of two 6250 BPI 9-track magnetic tapes in DEC VAX BACKUP format. It is also available on a set of two TK50 tape cartridges in DEC VAX BACKUP format. The standard distribution medium for COS-10049 and COS-10050 is a 6250 BPI 9-track magnetic tape in UNIX tar format. Other distribution media and formats may be available upon request. The standard distribution medium for MSC-22001 is a .25 inch streaming magnetic tape cartridge (Sun QIC-24) in UNIX tar format. Alternate distribution media and formats are available upon request. With minor modification, the UNIX source code can be ported to other platforms including IBM PC/AT series computers and compatibles. NASADIG is also available bundled with TRASYS, the Thermal Radiation Analysis System (COS-10026, DEC VAX version; COS-10040, CRAY version).
NASADIG - NASA DEVICE INDEPENDENT GRAPHICS LIBRARY (UNIX VERSION)
NASA Technical Reports Server (NTRS)
Rogers, J. E.
1994-01-01
The NASA Device Independent Graphics Library, NASADIG, can be used with many computer-based engineering and management applications. The library gives the user the opportunity to translate data into effective graphic displays for presentation. The software offers many features which allow the user flexibility in creating graphics. These include two-dimensional plots, subplot projections in 3D-space, surface contour line plots, and surface contour color-shaded plots. Routines for three-dimensional plotting, wireframe surface plots, surface plots with hidden line removal, and surface contour line plots are provided. Other features include polar and spherical coordinate plotting, world map plotting utilizing either cylindrical equidistant or Lambert equal area projection, plot translation, plot rotation, plot blowup, splines and polynomial interpolation, area blanking control, multiple log/linear axes, legends and text control, curve thickness control, and multiple text fonts (18 regular, 4 bold). NASADIG contains several groups of subroutines. Included are subroutines for plot area and axis definition; text set-up and display; area blanking; line style set-up, interpolation, and plotting; color shading and pattern control; legend, text block, and character control; device initialization; mixed alphabets setting; and other useful functions. The usefulness of many routines is dependent on the prior definition of basic parameters. The program's control structure uses a serial-level construct with each routine restricted for activation at some prescribed level(s) of problem definition. NASADIG provides the following output device drivers: Selanar 100XL, VECTOR Move/Draw ASCII and PostScript files, Tektronix 40xx, 41xx, and 4510 Rasterizer, DEC VT-240 (4014 mode), IBM AT/PC compatible with SmartTerm 240 emulator, HP Lasergrafix Film Recorder, QMS 800/1200, DEC LN03+ Laserprinters, and HP LaserJet (Series III). NASADIG is written in FORTRAN and is available for several platforms. NASADIG 5.7 is available for DEC VAX series computers running VMS 5.0 or later (MSC-21801), Cray X-MP and Y-MP series computers running UNICOS (COS-10049), and Amdahl 5990 mainframe computers running UTS (COS-10050). NASADIG 5.1 is available for UNIX-based operating systems (MSC-22001). The UNIX version has been successfully implemented on Sun4 series computers running SunOS, SGI IRIS computers running IRIX, Hewlett Packard 9000 computers running HP-UX, and Convex computers running Convex OS (MSC-22001). The standard distribution medium for MSC-21801 is a set of two 6250 BPI 9-track magnetic tapes in DEC VAX BACKUP format. It is also available on a set of two TK50 tape cartridges in DEC VAX BACKUP format. The standard distribution medium for COS-10049 and COS-10050 is a 6250 BPI 9-track magnetic tape in UNIX tar format. Other distribution media and formats may be available upon request. The standard distribution medium for MSC-22001 is a .25 inch streaming magnetic tape cartridge (Sun QIC-24) in UNIX tar format. Alternate distribution media and formats are available upon request. With minor modification, the UNIX source code can be ported to other platforms including IBM PC/AT series computers and compatibles. NASADIG is also available bundled with TRASYS, the Thermal Radiation Analysis System (COS-10026, DEC VAX version; COS-10040, CRAY version).
μπ: A Scalable and Transparent System for Simulating MPI Programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S
2010-01-01
μπ is a scalable, transparent system for experimenting with the execution of parallel programs on simulated computing platforms. The level of simulated detail can be varied for application behavior as well as for machine characteristics. Unique features of μπ are repeatability of execution, scalability to millions of simulated (virtual) MPI ranks, scalability to hundreds of thousands of host (real) MPI ranks, portability of the system to a variety of host supercomputing platforms, and the ability to experiment with scientific applications whose source code is available. The set of source-code interfaces supported by μπ is being expanded to support a wider set of applications, and MPI-based scientific computing benchmarks are being ported. In proof-of-concept experiments, μπ has been successfully exercised to spawn and sustain very large-scale executions of an MPI test program given in source-code form. Low slowdowns are observed, due to its use of a purely discrete event style of execution, and due to the scalability and efficiency of the underlying parallel discrete event simulation engine, μsik. In the largest runs, μπ has been executed on up to 216,000 cores of a Cray XT5 supercomputer, successfully simulating over 27 million virtual MPI ranks, each virtual rank containing its own thread context, and all ranks fully synchronized by virtual time.
Dynamic overset grid communication on distributed memory parallel processors
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.
1993-01-01
A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.
Network issues for large mass storage requirements
NASA Technical Reports Server (NTRS)
Perdue, James
1992-01-01
File servers and supercomputing environments need high performance networks to balance the I/O requirements seen in today's demanding computing scenarios. UltraNet is one solution, permitting both high aggregate transfer rates and high task-to-task transfer rates, as demonstrated in actual tests. UltraNet provides this capability as both a server-to-server and server-to-client access network, giving the supercomputing center the following advantages: highest-performance transport-level connections (up to 40 MBytes/sec effective rates); throughput matching that of the emerging high-performance disk technologies, such as RAID, parallel head transfer devices, and software striping; support for standard network and file system applications using a sockets-based application program interface, such as FTP, rcp, rdump, etc.; support for access to the Network File System (NFS) and large aggregate bandwidth for heavy NFS usage; access to a distributed, hierarchical data server capability using the DISCOS UniTree product; and support for file server solutions available from multiple vendors, including Cray, Convex, Alliant, FPS, IBM, and others.
Methodology for analysis and simulation of large multidisciplinary problems
NASA Technical Reports Server (NTRS)
Russell, William C.; Ikeda, Paul J.; Vos, Robert G.
1989-01-01
The Integrated Structural Modeling (ISM) program is being developed for the Air Force Weapons Laboratory and will be available for Air Force work. Its goal is to provide a design, analysis, and simulation tool intended primarily for directed energy weapons (DEW), kinetic energy weapons (KEW), and surveillance applications. The code is designed to run on DEC (VMS and UNIX), IRIS, Alliant, and Cray hosts. Several technical disciplines are included in ISM, namely structures, controls, optics, thermal, and dynamics. Four topics from the broad ISM goal are discussed. The first is project configuration management and includes two major areas: the software and database arrangement and the system model control. The second is interdisciplinary data transfer and refers to exchange of data between various disciplines such as structures and thermal. Third is a discussion of the integration of component models into one system model, i.e., multiple discipline model synthesis. Last is a presentation of work on a distributed processing computing environment.
UPEML Version 2.0: A machine-portable CDC Update emulator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mehlhorn, T.A.; Young, M.F.
1987-05-01
UPEML is a machine-portable CDC Update emulation program. UPEML is written in ANSI standard Fortran-77 and is relatively simple and compact. It is capable of emulating a significant subset of the standard CDC Update functions, including program library creation and subsequent modification. Machine-portability is an essential attribute of UPEML. UPEML was written primarily to facilitate the use of CDC-based scientific packages on alternate computer systems such as the VAX 11/780 and the IBM 3081. UPEML has also been successfully used on the multiprocessor ELXSI, on CRAYs under both COS and CTSS operating systems, on APOLLO workstations, and on the HP-9000. Version 2.0 includes enhanced error checking, full ASCII character support, a program library audit capability, and a partial update option in which only selected or modified decks are written to the compile file. Further enhancements include checks for overlapping corrections, processing of nested calls to common decks, and reads and addfiles from alternate input files.
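For readers unfamiliar with the CDC Update format that UPEML emulates, the following is a minimal sketch of an Update input stream: a common deck, a regular deck that pulls it in with *CALL, and a correction set that replaces one line. Deck names, sequence numbers, and the correction itself are illustrative, not taken from UPEML's distribution.

    *COMDECK PARAMS
          INTEGER NMAX
          PARAMETER (NMAX = 100)
    *DECK SOLVER
          SUBROUTINE SOLVER(A, N)
    *CALL PARAMS
          REAL A(NMAX)
          INTEGER N, I
          DO 10 I = 1, N
             A(I) = 0.0
       10 CONTINUE
          END
    *IDENT FIX1
    *DELETE SOLVER.8
          DO 10 I = 1, MIN(N, NMAX)

Lines following a *DELETE directive are inserted in place of the deleted range; correction sets like FIX1 are where the overlapping-correction checks mentioned above come into play.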
TECA: Petascale pattern recognition for climate science
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prabhat; Byna, Surendra; Vishwanath, Venkatram
Climate change is one of the most pressing challenges facing humanity in the 21st century. Climate simulations provide us with a unique opportunity to examine the effects of anthropogenic emissions. High-resolution climate simulations produce "Big Data": contemporary climate archives are ≈5 PB in size and we expect future archives to measure on the order of exabytes. In this work, we present the successful application of the TECA (Toolkit for Extreme Climate Analysis) framework for extracting extreme weather patterns, such as Tropical Cyclones, Atmospheric Rivers, and Extra-Tropical Cyclones, from TB-sized simulation datasets. TECA has been run at full scale on Cray XE6 and IBM BG/Q systems, and has reduced the runtime for pattern detection tasks from years to hours. TECA has been utilized to evaluate the performance of various computational models in reproducing the statistics of extreme weather events, and for characterizing the change in frequency of storm systems in the future.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brower, Richard C.
This proposal is to develop the software and algorithmic infrastructure needed for the numerical study of quantum chromodynamics (QCD), and of theories that have been proposed to describe physics beyond the Standard Model (BSM) of high energy physics, on current and future computers. This infrastructure will enable users (1) to improve the accuracy of QCD calculations to the point where they no longer limit what can be learned from high-precision experiments that seek to test the Standard Model, and (2) to determine the predictions of BSM theories in order to understand which of them are consistent with the data that will soon be available from the LHC. Work will include the extension and optimization of community codes for the next generation of leadership-class computers, the IBM Blue Gene/Q and the Cray XE/XK, and for the dedicated hardware funded for our field by the Department of Energy. Members of our collaboration at Brookhaven National Laboratory and Columbia University worked on the design of the Blue Gene/Q, and have begun to develop software for it. Under this grant we will build upon their experience to produce high-efficiency production codes for this machine. Cray XE/XK computers with many thousands of GPU accelerators will soon be available, and the dedicated commodity clusters we obtain with DOE funding include growing numbers of GPUs. We will work with our partners in NVIDIA's Emerging Technology group to scale our existing software to thousands of GPUs, and to produce highly efficient production codes for these machines. Work under this grant will also include the development of new algorithms for the effective use of heterogeneous computers, and their integration into our codes. It will include improvements of Krylov solvers and the development of new multigrid methods in collaboration with members of the FASTMath SciDAC Institute, using their HYPRE framework, as well as work on improved symplectic integrators.
Accelerating Science with the NERSC Burst Buffer Early User Program
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bhimji, Wahid; Bard, Debbie; Romanus, Melissa
NVRAM-based Burst Buffers are an important part of the emerging HPC storage landscape. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory recently installed one of the first Burst Buffer systems as part of its new Cori supercomputer, collaborating with Cray on the development of the DataWarp software. NERSC has a diverse user base comprised of over 6500 users in 700 different projects spanning a wide variety of scientific computing applications. The use-cases of the Burst Buffer at NERSC are therefore also considerable and diverse. We describe here performance measurements and lessons learned from the Burst Buffer Early User Program at NERSC, which selected a number of research projects to gain early access to the Burst Buffer and exercise its capability to enable new scientific advancements. To the best of our knowledge this is the first time a Burst Buffer has been stressed at scale by diverse, real user workloads, and therefore these lessons will be of considerable benefit to shaping the developing use of Burst Buffers at HPC centers.
Beyond the Face of Race: Emo-Cognitive Explorations of White Neurosis and Racial Cray-Cray
ERIC Educational Resources Information Center
Matias, Cheryl E.; DiAngelo, Robin
2013-01-01
In this article, the authors focus on the emotional and cognitive context that underlies whiteness. They employ interdisciplinary approaches of critical Whiteness studies and critical race theory to entertain how common White responses to racial material stem from the need for Whites to deny race, a traumatizing process that begins in childhood.…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dongarra, J.J.; Hewitt, T.
1985-08-01
This note describes some experiments on simple, dense linear algebra algorithms. These experiments show that the CRAY X-MP is capable of small-grain multitasking arising from standard implementations of LU and Cholesky decomposition. The implementation described here provides the "fastest" execution rate for LU decomposition, 718 MFLOPS for a matrix of order 1000.
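As context for the result above, here is a plain-Fortran sketch of the standard right-looking LU factorization (without pivoting, for brevity); at each step k, the updates of the trailing columns j = k+1..n are mutually independent, which is the small-grain parallelism the note exploits. This is an illustrative kernel, not the authors' 718-MFLOPS implementation.

    ! Right-looking LU factorization of the n-by-n matrix A (no pivoting),
    ! overwriting A with L (strictly below the diagonal) and U (on and
    ! above it).  For each k, the column updates in the j loop are
    ! independent and can be handed to separate tasks.
    subroutine lu_factor(a, n)
      implicit none
      integer, intent(in) :: n
      real(8), intent(inout) :: a(n, n)
      integer :: i, j, k
      do k = 1, n - 1
        do i = k + 1, n
          a(i, k) = a(i, k) / a(k, k)      ! form multipliers (column of L)
        end do
        do j = k + 1, n                    ! independent column updates
          do i = k + 1, n
            a(i, j) = a(i, j) - a(i, k) * a(k, j)
          end do
        end do
      end do
    end subroutine lu_factor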
Space shuttle main engine numerical modeling code modifications and analysis
NASA Technical Reports Server (NTRS)
Ziebarth, John P.
1988-01-01
The user of computational fluid dynamics (CFD) codes must be concerned with the accuracy and efficiency of the codes if they are to be used for timely design and analysis of complicated three-dimensional fluid flow configurations. A brief discussion of how accuracy and efficiency affect the CFD solution process is given. A more detailed discussion of how efficiency can be enhanced by using a few Cray Research Inc. utilities to address vectorization is presented, and these utilities are applied to a three-dimensional Navier-Stokes CFD code (INS3D).
Unstructured-grid methods development: Lessons learned
NASA Technical Reports Server (NTRS)
Batina, John T.
1991-01-01
The development of unstructured grid methods for the solution of the equations of fluid flow is summarized, and some of the lessons learned are shared. The 3-D Euler equations are solved, including spatial discretizations, temporal discretizations, and boundary conditions. An example calculation with an upwind implicit method using a CFL (Courant-Friedrichs-Lewy) number of infinity is presented for the Boeing 747 aircraft. The results were obtained in less than one hour of CPU time on a Cray-2 computer, demonstrating the speed and robustness of the present capability.
Parallel Climate Data Assimilation PSAS Package
NASA Technical Reports Server (NTRS)
Ding, Hong Q.; Chan, Clara; Gennery, Donald B.; Ferraro, Robert D.
1996-01-01
We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to a 512-node Intel Paragon. The equation solver achieves a sustained 18 Gflops performance. As a result, we achieved an unprecedented 100-fold solution time reduction on the Intel Paragon parallel platform over the Cray C90. This not only meets and exceeds the DAO time requirements, but also significantly enlarges the window of exploration in climate data assimilation.
Computer Center Reference Manual
1988-06-20
Optimization of Supercomputer Use on EADS II System
NASA Technical Reports Server (NTRS)
Ahmed, Ardsher
1998-01-01
The main objective of this research was to optimize supercomputer use to achieve better throughput and utilization of supercomputers and to help facilitate the movement of non-supercomputing (inappropriate for supercomputer) codes to mid-range systems for better use of Government resources at Marshall Space Flight Center (MSFC). This work involved the survey of architectures available on EADS II and monitoring customer (user) applications running on a CRAY T90 system.
Large-Scale Simulations of Plastic Neural Networks on Neuromorphic Hardware
Knight, James C.; Tully, Philip J.; Kaplan, Bernhard A.; Lansner, Anders; Furber, Steve B.
2016-01-01
SpiNNaker is a digital, neuromorphic architecture designed for simulating large-scale spiking neural networks at speeds close to biological real-time. Rather than using bespoke analog or digital hardware, the basic computational unit of a SpiNNaker system is a general-purpose ARM processor, allowing it to be programmed to simulate a wide variety of neuron and synapse models. This flexibility is particularly valuable in the study of biological plasticity phenomena. A recently proposed learning rule based on the Bayesian Confidence Propagation Neural Network (BCPNN) paradigm offers a generic framework for modeling the interaction of different plasticity mechanisms using spiking neurons. However, it can be computationally expensive to simulate large networks with BCPNN learning since it requires multiple state variables for each synapse, each of which needs to be updated every simulation time-step. We discuss the trade-offs in efficiency and accuracy involved in developing an event-based BCPNN implementation for SpiNNaker based on an analytical solution to the BCPNN equations, and detail the steps taken to fit this within the limited computational and memory resources of the SpiNNaker architecture. We demonstrate this learning rule by learning temporal sequences of neural activity within a recurrent attractor network which we simulate at scales of up to 2.0 × 10^4 neurons and 5.1 × 10^7 plastic synapses: the largest plastic neural network ever to be simulated on neuromorphic hardware. We also run a comparable simulation on a Cray XC-30 supercomputer system and find that, if it is to match the run-time of our SpiNNaker simulation, the supercomputer system uses approximately 45× more power. This suggests that cheaper, more power-efficient neuromorphic systems are becoming useful discovery tools in the study of plasticity in large-scale brain models. PMID:27092061
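The "analytical solution" approach mentioned above can be illustrated with a single exponentially decaying trace: because dz/dt = -z/tau has a closed-form solution between spikes, the state needs updating only at spike events rather than every time-step. The sketch below shows the idea on one trace; the constants and spike times are illustrative, and this is not the SpiNNaker BCPNN code itself.

    program event_trace
      ! Event-driven update of a decaying trace: between spikes,
      ! z(t) = z(t_last) * exp(-(t - t_last)/tau), evaluated lazily
      ! at each event instead of being integrated every time-step.
      implicit none
      real :: z = 0.0, tau = 20.0, t_last = 0.0
      real :: spikes(4) = [5.0, 12.0, 30.0, 31.0]   ! assumed spike times (ms)
      integer :: k
      do k = 1, size(spikes)
        z = z * exp(-(spikes(k) - t_last) / tau)    ! exact decay since last event
        z = z + 1.0                                 ! increment caused by the spike
        t_last = spikes(k)
        print '(a, f6.1, a, f8.4)', ' t =', t_last, '   z =', z
      end do
    end program event_trace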
NASA Technical Reports Server (NTRS)
Badavi, F. F.
1989-01-01
Aerodynamic loads on a multi-bladed helicopter rotor in forward flight at transonic tip conditions are calculated. The unsteady, three-dimensional, time-accurate compressible Reynolds-averaged thin layer Navier-Stokes equations are solved in a rotating coordinate system on a body-conformed, curvilinear grid of C-H topology. Detailed boundary layer and global numerical comparisons of NACA-0012 symmetrical and CAST7-158 supercritical airfoils are made under identical forward flight conditions. The rotor wake effects are modeled by applying a correction to the geometric angle of attack of the blade. This correction is obtained by computing the local induced downwash velocity with a free wake analysis program. The calculations are performed on the Numerical Aerodynamic Simulation Cray 2 and the VPS32 (a derivative of a Cyber 205 at the Langley Research Center) for a model helicopter rotor in forward flight.
FORTRAN multitasking library for use on the ELXSI 6400 and the CRAY XMP
DOE Office of Scientific and Technical Information (OSTI.GOV)
Montry, G.R.
1985-07-16
A library of FORTRAN-based multitasking routines has been written for the ELXSI 6400 and the CRAY XMP. This library is designed to make multitasking codes easily transportable between machines with different hardware configurations. The library provides enhanced error checking and diagnostics over vendor-supplied multitasking intrinsics. The library also contains multitasking control structures not normally supplied by the vendor.
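The abstract does not reproduce the library's interfaces, so the following is a hypothetical sketch of the portability idea only: user code calls one neutral entry point everywhere, only the wrapper body changes between the ELXSI and the CRAY, and the enhanced argument checking is centralized in the wrapper. All names here are invented.

    ! Hypothetical portability wrapper (invented names): the body would
    ! dispatch to the vendor's task-creation primitive on each machine;
    ! here a serial fallback stands in so the sketch is self-contained.
    subroutine task_start(id, work, ierr)
      implicit none
      integer, intent(in)  :: id
      integer, intent(out) :: ierr
      external :: work
      ierr = 0
      if (id <= 0) then
        ierr = 1            ! reject bad task handles before dispatch
        return
      end if
      call work()           ! a vendor-specific task start would replace this
    end subroutine task_start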
Developing software to use parallel processing effectively. Final report, June-December 1987
DOE Office of Scientific and Technical Information (OSTI.GOV)
Center, J.
1988-10-01
This report describes the difficulties involved in writing efficient parallel programs and describes the hardware and software support currently available for generating software that utilizes parallel processing effectively. Historically, the processing rate of single-processor computers has increased by one order of magnitude every five years. However, this pace is slowing since electronic circuitry is coming up against physical barriers. Unfortunately, the complexity of engineering and research problems continues to require ever more processing power (far in excess of the maximum estimated 3 Gflops achievable by single-processor computers). For this reason, parallel-processing architectures are receiving considerable interest, since they offer high performance more cheaply than a single-processor supercomputer, such as the Cray.
Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)
2002-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) using different programming paradigms and architectures. Results show that, for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems; cache reuse may be more important than reducing communication; it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gain. An implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
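For concreteness, here is a minimal, self-contained sketch of unpreconditioned CG on a small symmetric positive definite tridiagonal system; the sparse matrix-vector product and dot products in this loop are exactly the kernels whose ordering and distribution the paper studies. The system and sizes are illustrative.

    program cg_demo
      ! Conjugate Gradient for A*x = b with A = tridiag(-1, 2, -1),
      ! a standard SPD test matrix.
      implicit none
      integer, parameter :: n = 100
      real(8) :: x(n), r(n), p(n), q(n), b(n)
      real(8) :: alpha, beta, rho, rho_old
      integer :: it
      b = 1.0d0
      x = 0.0d0
      r = b                              ! r = b - A*x with x = 0
      p = r
      rho = dot_product(r, r)
      do it = 1, 200
        call amul(p, q)                  ! q = A*p (sparse mat-vec kernel)
        alpha = rho / dot_product(p, q)
        x = x + alpha * p
        r = r - alpha * q
        rho_old = rho
        rho = dot_product(r, r)
        if (sqrt(rho) < 1.0d-10) exit
        beta = rho / rho_old
        p = r + beta * p
      end do
      print *, 'iterations:', it, '  residual norm:', sqrt(rho)
    contains
      subroutine amul(v, w)
        real(8), intent(in)  :: v(n)
        real(8), intent(out) :: w(n)
        integer :: i
        do i = 1, n
          w(i) = 2.0d0 * v(i)
          if (i > 1) w(i) = w(i) - v(i-1)
          if (i < n) w(i) = w(i) - v(i+1)
        end do
      end subroutine amul
    end program cg_demo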
High Temperature Composite Analyzer (HITCAN) demonstration manual, version 1.0
NASA Technical Reports Server (NTRS)
Singhal, S. N.; Lackney, J. J.; Murthy, P. L. N.
1993-01-01
This manual comprises a variety of demonstration cases for the HITCAN (HIgh Temperature Composite ANalyzer) code. HITCAN is a general purpose computer program for predicting nonlinear global structural and local stress-strain response of arbitrarily oriented, multilayered high temperature metal matrix composite structures. HITCAN is written in FORTRAN 77 computer language and has been configured and executed on the NASA Lewis Research Center CRAY XMP and YMP computers. Detailed description of all program variables and terms used in this manual may be found in the User's Manual. The demonstration includes various cases to illustrate the features and analysis capabilities of the HITCAN computer code. These cases include: (1) static analysis, (2) nonlinear quasi-static (incremental) analysis, (3) modal analysis, (4) buckling analysis, (5) fiber degradation effects, (6) fabrication-induced stresses for a variety of structures; namely, beam, plate, ring, shell, and built-up structures. A brief discussion of each demonstration case with the associated input data file is provided. Sample results taken from the actual computer output are also included.
ERDC MSRC (Major Shared Resource Center) Resource. Spring 2008
2008-01-01
obtained from ADCIRC results. The alpha test was performed on the Cray XT3 machine (Sapphire) at ERDC and the IBM P575+ system (Babbage).
Experiences with Cray multi-tasking
NASA Technical Reports Server (NTRS)
Miya, E. N.
1985-01-01
The issues involved in modifying an existing code for multitasking are explored. They include Cray extensions to FORTRAN, an examination of the application code under study, design of workable modifications, specific code modifications to the VAX and Cray versions, and performance and efficiency results. The finished product is a faster, fully synchronous, parallel version of the original program. A production program is partitioned by hand to run on two CPUs. Loop splitting multitasks three key subroutines. Simply dividing subroutine data and control structure down the middle of a subroutine is not safe; simple division produces results that are inconsistent with uniprocessor runs. The safest way to partition the code is to transfer one block of loops at a time and check the results of each on a test case. Other issues include debugging and performance. Task startup and maintenance (e.g., synchronization) are potentially expensive.
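The loop-splitting transformation described above can be sketched as follows, with OpenMP sections standing in for the Cray multitasking calls of the era (OpenMP is a modern substitute, not the original API): the iteration space is divided in half so two CPUs each process one block, and the halves are combined only after both finish.

    program loop_split
      ! Split one loop's iteration space across two tasks, then combine.
      implicit none
      integer, parameter :: n = 1000000
      real(8) :: a(n), s1, s2
      integer :: i
      a = 1.0d0
      s1 = 0.0d0
      s2 = 0.0d0
      !$omp parallel sections
      !$omp section
      do i = 1, n/2                  ! first half on one CPU
        s1 = s1 + a(i)
      end do
      !$omp section
      do i = n/2 + 1, n              ! second half on the other CPU
        s2 = s2 + a(i)
      end do
      !$omp end parallel sections
      print *, 'sum =', s1 + s2      ! safe: both tasks have completed
    end program loop_split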
NASA Astrophysics Data System (ADS)
Molcard, A. J.; Pinardi, N.; Ansaloni, R.
A new numerical model, SEOM (Spectral Element Ocean Model, (Iskandarani et al., 1994)), has been implemented in the Mediterranean Sea. Spectral element methods combine the geometric flexibility of finite element techniques with the rapid convergence rate of spectral schemes. The current version solves the shallow water equations with a fifth (or sixth) order accuracy spectral scheme and about 50,000 nodes. The domain decomposition philosophy makes it possible to exploit the power of parallel machines. The original MIMD master/slave version of SEOM, written in F90 and PVM, has been ported to the Cray T3D. When critical for performance, Cray-specific high-performance one-sided communication routines (SHMEM) have been adopted to fully exploit the Cray T3D interprocessor network. Tests performed with a highly unstructured and irregular grid, on up to 128 processors, show an almost linear scalability even with unoptimized domain decomposition techniques. Results from various case studies on the Mediterranean Sea are shown, involving realistic coastline geometry and monthly mean 1000 mb winds from the ECMWF atmospheric model operational analyses for the period January 1987 to December 1994. The simulation results show that variability in the wind forcing considerably affects the circulation dynamics of the Mediterranean Sea.
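The one-sided SHMEM style mentioned above can be sketched with the OpenSHMEM Fortran interface, used here as a stand-in for the original Cray T3D library (the call names below follow the OpenSHMEM specification; treat the details as assumptions): one PE writes directly into another PE's symmetric memory with no matching receive.

    program shmem_demo
      ! PE 0 writes four reals directly into PE 1's buffer.
      implicit none
      integer :: npes, me
      integer :: num_pes, my_pe            ! OpenSHMEM query functions
      real, save :: buf(4)                 ! SAVE makes it symmetric (remotely addressable)
      real :: src(4)
      call start_pes(0)
      npes = num_pes()
      me = my_pe()
      buf = 0.0
      src = real(me + 1)
      call shmem_barrier_all()
      if (me == 0 .and. npes > 1) then
        call shmem_real_put(buf, src, 4, 1)   ! one-sided put into PE 1
      end if
      call shmem_barrier_all()               ! make the put visible everywhere
      if (me == 1) print *, 'PE 1 received:', buf
    end program shmem_demo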
A Hardware-Accelerated Quantum Monte Carlo framework (HAQMC) for N-body systems
NASA Astrophysics Data System (ADS)
Gothandaraman, Akila; Peterson, Gregory D.; Warren, G. Lee; Hinde, Robert J.; Harrison, Robert J.
2009-12-01
Interest in the study of structural and energetic properties of highly quantum clusters, such as inert gas clusters, has motivated the development of a hardware-accelerated framework for Quantum Monte Carlo simulations. In the Quantum Monte Carlo method, the properties of a system of atoms, such as the ground-state energies, are averaged over a number of iterations. Our framework is aimed at accelerating the computations in each iteration of the QMC application by offloading the calculation of properties, namely energy and trial wave function, onto reconfigurable hardware. This gives a user the capability to run simulations for a large number of iterations, thereby reducing the statistical uncertainty in the properties, and for larger clusters. This framework is designed to run on the Cray XD1 high performance reconfigurable computing platform, which exploits the coarse-grained parallelism of the processor along with the fine-grained parallelism of the reconfigurable computing devices available in the form of field-programmable gate arrays. In this paper, we illustrate the functioning of the framework, which can be used to calculate the energies for a model cluster of helium atoms. In addition, we present the capabilities of the framework that allow the user to vary the chemical identities of the simulated atoms. Program summary: Program title: Hardware Accelerated Quantum Monte Carlo (HAQMC). Catalogue identifier: AEEP_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEP_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 691 537. No. of bytes in distributed program, including test data, etc.: 5 031 226. Distribution format: tar.gz. Programming language: C/C++ for the QMC application, VHDL and Xilinx 8.1 ISE/EDK tools for FPGA design and development. Computer: Cray XD1 consisting of a dual-core, dual-processor AMD Opteron 2.2 GHz with a Xilinx Virtex-4 (V4LX160) or Xilinx Virtex-II Pro (XC2VP50) FPGA per node; we use the compute node with the Xilinx Virtex-4 FPGA. Operating system: Red Hat Enterprise Linux OS. Has the code been vectorized or parallelized?: Yes. Classification: 6.1. Nature of problem: Quantum Monte Carlo is a practical method to solve the Schrödinger equation for large many-body systems and obtain the ground-state properties of such systems. This method involves the sampling of a number of configurations of atoms and averaging the properties of the configurations over a number of iterations. We are interested in applying the QMC method to obtain the energy and other properties of highly quantum clusters, such as inert gas clusters. Solution method: The proposed framework provides a combined hardware-software approach, in which the QMC simulation is performed on the host processor, with the computationally intensive functions such as energy and trial wave function computations mapped onto the field-programmable gate array (FPGA) logic device attached as a co-processor to the host processor. We perform the QMC simulation for a number of iterations as in the case of our original software QMC approach, to reduce the statistical uncertainty of the results.
However, our proposed HAQMC framework accelerates each iteration of the simulation by significantly reducing the time taken to calculate the ground-state properties of the configurations of atoms, thereby accelerating the overall QMC simulation. We provide a generic interpolation framework that can be extended to study a variety of pure and doped atomic clusters, irrespective of the chemical identities of the atoms. For the FPGA implementation of the properties, we use a two-region approach for accurately computing the properties over the entire domain, and we employ deep pipelines and fixed-point arithmetic for all our calculations, guaranteeing the accuracy required for our simulation.
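To make the kind of iteration HAQMC accelerates concrete, here is a minimal software-only variational Monte Carlo sketch on a toy system (a 1-D harmonic oscillator with trial wave function psi(x) = exp(-alpha*x^2/2)) rather than helium clusters; the Metropolis sampling and running energy average shown here are the pieces the framework offloads to the FPGA. All parameters are illustrative.

    program vmc_demo
      ! Variational Monte Carlo: sample |psi|^2 with Metropolis moves and
      ! average the local energy E_L = alpha/2 + x^2*(1 - alpha^2)/2.
      implicit none
      real(8), parameter :: alpha = 0.9d0
      integer, parameter :: nsteps = 200000
      real(8) :: x, xt, u, esum, eloc
      integer :: k, nacc
      x = 0.0d0
      esum = 0.0d0
      nacc = 0
      do k = 1, nsteps
        call random_number(u)
        xt = x + (u - 0.5d0)                        ! propose a move
        call random_number(u)
        if (u < exp(-alpha * (xt*xt - x*x))) then   ! accept with |psi(xt)/psi(x)|^2
          x = xt
          nacc = nacc + 1
        end if
        eloc = 0.5d0*alpha + 0.5d0*x*x*(1.0d0 - alpha*alpha)
        esum = esum + eloc
      end do
      print *, 'estimated ground-state energy:', esum / nsteps   ! near 0.5
      print *, 'acceptance fraction:', dble(nacc) / dble(nsteps)
    end program vmc_demo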
NASA Technical Reports Server (NTRS)
Keppenne, C. L.; Rienecker, M.; Borovikov, A. Y.
1999-01-01
Two massively parallel data assimilation systems, in which the model forecast-error covariances are estimated from the distribution of an ensemble of model integrations, are applied to the assimilation of 97-98 TOPEX/POSEIDON altimetry and TOGA/TAO temperature data into a Pacific basin version of the NASA Seasonal-to-Interannual Prediction Project (NSIPP) quasi-isopycnal ocean general circulation model. In the first system, an ensemble of model runs forced by an ensemble of atmospheric model simulations is used to calculate asymptotic error statistics. The data assimilation then occurs in the reduced phase space spanned by the corresponding leading empirical orthogonal functions. The second system is an ensemble Kalman filter in which new error statistics are computed during each assimilation cycle from the time-dependent ensemble distribution. The data assimilation experiments are conducted on NSIPP's 512-processor CRAY T3E. The two data assimilation systems are validated by withholding part of the data and quantifying the extent to which the withheld information can be inferred from the assimilation of the remaining data. The pros and cons of each system are discussed.
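The second system's central idea, estimating the forecast-error covariance from the ensemble spread at each cycle, can be sketched for a scalar state as follows; the dimensions, values, and perturbed-observation scheme here are illustrative, not the NSIPP implementation.

    program enkf_demo
      ! Scalar ensemble Kalman filter step: variance from the ensemble,
      ! gain K = P/(P + R), each member updated with a perturbed observation.
      implicit none
      integer, parameter :: ne = 50            ! ensemble size
      real(8) :: x(ne), pert(ne), xm, pf, gain, yobs, r
      integer :: i
      call random_number(x)
      x = 2.0d0 * x                            ! stand-in forecast ensemble
      yobs = 1.5d0                             ! observed value
      r = 0.1d0                                ! observation-error variance
      xm = sum(x) / ne
      pf = sum((x - xm)**2) / (ne - 1)         ! forecast variance from ensemble spread
      gain = pf / (pf + r)                     ! Kalman gain
      call random_number(pert)
      pert = sqrt(r) * 2.0d0 * (pert - 0.5d0)  ! crude observation perturbations
      do i = 1, ne
        x(i) = x(i) + gain * (yobs + pert(i) - x(i))
      end do
      print *, 'analysis mean:', sum(x) / ne
    end program enkf_demo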
NASA Technical Reports Server (NTRS)
Tarshish, Adina; Salmon, Ellen
1994-01-01
In October 1992, the NASA Center for Computational Sciences made its Convex-based UniTree system generally available to users. The ensuing months saw growth in every area. Within 26 months, data under UniTree control grew from nil to over 12 terabytes, nearly all of it stored on robotically mounted tape. HiPPI/UltraNet was added to enhance connectivity, and later HiPPI/TCP was added as well. Disks and robotic tape silos were added to those already under UniTree's control, and 18-track tapes were upgraded to 36-track. The primary data source for UniTree, the facility's Cray Y-MP/4-128, first doubled its processing power and then was replaced altogether by a C98/6-256 with nearly two-and-a-half times the Y-MP's combined peak gigaflops. The Convex/UniTree software was upgraded from version 1.5 to 1.7.5, and then to 1.7.6. Finally, the server itself, a Convex C3240, was upgraded to a C3830 with a second I/O bay, doubling the C3240's memory and capacity for I/O. This paper describes insights gained and reinforced with the burgeoning demands on the UniTree storage system and the significant increases in performance gained from the many upgrades.
Evaluation of a Fully 3-D Bpf Method for Small Animal PET Images on Mimd Architectures
NASA Astrophysics Data System (ADS)
Bevilacqua, A.
Positron Emission Tomography (PET) images can be reconstructed using Fourier transform methods. This paper describes the performance of a fully 3-D Backprojection-Then-Filter (BPF) algorithm on the Cray T3E machine and on a cluster of workstations. PET reconstruction of small animals is a class of problems characterized by poor counting statistics. The low-count nature of these studies necessitates 3-D reconstruction in order to improve the sensitivity of the PET system: by including axially oblique Lines Of Response (LORs), the sensitivity of the system can be significantly improved through 3-D acquisition and reconstruction. The BPF method is widely used in clinical studies because of its speed and easy implementation. Moreover, the BPF method is suitable for on-line 3-D reconstruction as it does not need any sinogram or rearranged data. In order to investigate the possibility of on-line processing, we reconstruct a phantom using the data stored in list-mode format by the data acquisition system. We show how the intrinsically parallel nature of the BPF method makes it suitable for on-line reconstruction on a MIMD system such as the Cray T3E. Lastly, we analyze the performance of this algorithm on a cluster of workstations.
High-Performance Design Patterns for Modern Fortran
Haveraaen, Magne; Morris, Karla; Rouson, Damian; ...
2015-01-01
This paper presents ideas for using coordinate-free numerics in modern Fortran to achieve code flexibility in the partial differential equation (PDE) domain. We also show how Fortran, over the last few decades, has changed to become a language well-suited for state-of-the-art software development. Fortran's new coarray distributed data structure, the language's class mechanism, and its side-effect-free, pure procedure capability provide the scaffolding on which we implement HPC software. These features empower compilers to organize parallel computations with efficient communication. We present some programming patterns that support asynchronous evaluation of expressions comprised of parallel operations on distributed data. We implemented these patterns using coarrays and the message passing interface (MPI). We compared the codes' complexity and performance. The MPI code is much more complex and depends on external libraries. The MPI code on Cray hardware using the Cray compiler is 1.5–2 times faster than the coarray code on the same hardware. The Intel compiler implements coarrays atop Intel's MPI library with the result apparently being 2–2.5 times slower than manually coded MPI despite exhibiting nearly linear scaling efficiency. As compilers mature and further improvements to coarrays come in Fortran 2015, we expect this performance gap to narrow.
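As a concrete taste of the coarray patterns discussed above, the sketch below has every image compute a partial result that image 1 then combines by direct one-sided reads, with a single synchronization and no explicit message passing; the per-image work is a trivial stand-in.

    program coarray_sum
      ! Each image owns one copy of 'partial'; image 1 reads them all.
      implicit none
      real :: partial[*]                 ! coarray: one instance per image
      real :: total
      integer :: i
      partial = real(this_image())       ! stand-in for real work
      sync all                           ! all partials now valid
      if (this_image() == 1) then
        total = 0.0
        do i = 1, num_images()
          total = total + partial[i]     ! one-sided read from image i
        end do
        print *, 'total over', num_images(), 'images:', total
      end if
    end program coarray_sum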
A static data flow simulation study at Ames Research Center
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Howard, Lauri S.
1987-01-01
Demands in computational power, particularly in the area of computational fluid dynamics (CFD), led NASA Ames Research Center to study advanced computer architectures. One architecture being studied is the static data flow architecture based on research done by Jack B. Dennis at MIT. To improve understanding of this architecture, a static data flow simulator, written in Pascal, has been implemented for use on a Cray X-MP/48. A matrix multiply and a two-dimensional fast Fourier transform (FFT), two algorithms used in CFD work at Ames, have been run on the simulator. Execution times can vary by a factor of more than 2 depending on the partitioning method used to assign instructions to processing elements. Service time for matching tokens has proved to be a major bottleneck. Loop control and array address calculation overhead can double the execution time. The best sustained MFLOPS rates were less than 50% of the maximum capability of the machine.
Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines
NASA Technical Reports Server (NTRS)
1999-01-01
Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues of the dynamic load balancing problem under NASA JOVE and JAG grants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is the fastest among the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64-processor Cray T3E distributed-memory multiprocessor while maintaining a scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take a second or two while giving low scalability of a few-fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.
Non-linear wave phenomena in Josephson elements for superconducting electronics
NASA Astrophysics Data System (ADS)
Christiansen, P. L.; Parmentier, R. D.; Skovgaard, O.
1985-07-01
The long and intermediate length Josephson tunnel junction oscillator with overlap geometry, in linear and circular configuration, is investigated by computational solution of the perturbed sine-Gordon equation model and by experimental measurements. The model predicts the experimental results very well. Line oscillators as well as ring oscillators are treated. For long junctions, soliton perturbation methods are developed and turn out to be efficient prediction tools, also providing physical understanding of the dynamics of the oscillator. For intermediate length junctions, expansions in terms of linear cavity modes reduce computational costs. The narrow linewidth of the electromagnetic radiation (typically 1 kHz for a line at 10 GHz) is demonstrated experimentally. Corresponding computer simulations, requiring a relative accuracy of better than 10^-7, are performed on the CRAY-1-S supercomputer. The broadening of the linewidth due to external microwave radiation and internal thermal noise is determined.
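As a worked sketch of the model named above, the following explicit finite-difference step integrates a perturbed sine-Gordon equation u_tt = u_xx - sin(u) - alpha*u_t + gamma on a ring (the circular, ring-oscillator configuration); the loss term alpha, bias gamma, and grid parameters are illustrative, and this is far simpler than the production CRAY-1-S runs.

    program sine_gordon
      ! Leapfrog-style update for u_tt = u_xx - sin(u) - alpha*u_t + gamma
      ! with periodic boundary conditions (a ring oscillator).
      implicit none
      integer, parameter :: nx = 256, nt = 2000
      real(8), parameter :: dx = 0.1d0, dt = 0.05d0     ! dt/dx < 1 for stability
      real(8), parameter :: alpha = 0.05d0, gamma = 0.1d0
      real(8) :: u(nx), uold(nx), unew(nx), lap
      integer :: i, n, ip, im
      u = 0.0d0
      uold = 0.0d0
      u(nx/2) = 0.1d0                        ! small initial disturbance
      do n = 1, nt
        do i = 1, nx
          ip = merge(1, i + 1, i == nx)      ! periodic neighbors
          im = merge(nx, i - 1, i == 1)
          lap = (u(ip) - 2.0d0*u(i) + u(im)) / dx**2
          unew(i) = 2.0d0*u(i) - uold(i) + dt*dt*(lap - sin(u(i)) + gamma) &
                    - alpha*dt*(u(i) - uold(i))
        end do
        uold = u
        u = unew
      end do
      print *, 'u at midpoint after', nt, 'steps:', u(nx/2)
    end program sine_gordon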
User's and test case manual for FEMATS
NASA Technical Reports Server (NTRS)
Chatterjee, Arindam; Volakis, John; Nurnberger, Mike; Natzke, John
1995-01-01
The FEMATS program incorporates first-order edge-based finite elements and vector absorbing boundary conditions into the scattered field formulation for computation of the scattering from three-dimensional geometries. The code has been validated extensively for a large class of geometries containing inhomogeneities and satisfying transition conditions. For geometries that are too large for the workstation environment, the FEMATS code has been optimized to run on various supercomputers. Currently, FEMATS has been configured to run on the HP 9000 workstation, vectorized for the Cray Y-MP, and parallelized to run on the Kendall Square Research (KSR) architecture and the Intel Paragon.
Fluid behavior in microgravity environment
NASA Technical Reports Server (NTRS)
Hung, R. J.; Lee, C. C.; Tsao, Y. D.
1990-01-01
The instability of a liquid-gas interface can be induced by the presence of longitudinal and lateral accelerations, vehicle vibration, and rotational fields of a spacecraft in a microgravity environment. In spacecraft design, the requirements on settled propellant are different for tank pressurization, engine restart, venting, or propellant transfer. In this paper, simulations of the dynamical behavior of liquid propellant, fluid reorientation, and propellant resettling have been carried out on a CRAY X-MP supercomputer to model fluid management in a microgravity environment. Characteristics of slosh waves excited by the restoring force field of gravity jitters have also been investigated.
Accuracy and speed in computing the Chebyshev collocation derivative
NASA Technical Reports Server (NTRS)
Don, Wai-Sun; Solomonoff, Alex
1991-01-01
We studied several algorithms for computing the Chebyshev spectral derivative and compared their roundoff errors. For a large number of collocation points, the elements of the Chebyshev differentiation matrix, if constructed in the usual way, are not computed accurately. A subtle cause is found to account for the poor accuracy when computing the derivative by the matrix-vector multiplication method. Methods for accurately computing the elements of the matrix are presented, and we find that if the entries of the matrix are computed accurately, the roundoff error of the matrix-vector multiplication is as small as that of the transform-recursion algorithm. Results of CPU time usage are shown for several different algorithms for computing the derivative by the Chebyshev collocation method for a wide variety of two-dimensional grid sizes on both an IBM and a Cray 2 computer. We found that which algorithm is fastest on a particular machine depends not only on the grid size, but also on small details of the computer hardware. For most practical grid sizes used in computation, the even-odd decomposition algorithm is found to be faster than the transform-recursion method.
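One of the accuracy fixes alluded to above can be shown directly: build the differentiation matrix with the textbook off-diagonal formula, but recover each diagonal entry from the requirement that every row annihilate the constant function (the "negative-sum trick"), avoiding the cancellation that spoils the usual diagonal formula. A minimal sketch:

    program cheb_diff
      ! Chebyshev differentiation matrix on Gauss-Lobatto points with the
      ! diagonal computed by the negative-sum trick (row sums must vanish).
      implicit none
      integer, parameter :: n = 16                 ! n+1 collocation points
      real(8), parameter :: pi = 3.141592653589793d0
      real(8) :: x(0:n), c(0:n), d(0:n,0:n)
      integer :: i, j
      do j = 0, n
        x(j) = cos(pi * j / n)
        c(j) = 1.0d0
      end do
      c(0) = 2.0d0
      c(n) = 2.0d0
      do i = 0, n
        do j = 0, n
          if (i /= j) d(i,j) = (c(i)/c(j)) * (-1.0d0)**(i+j) / (x(i) - x(j))
        end do
      end do
      do i = 0, n
        d(i,i) = 0.0d0
        d(i,i) = -sum(d(i,:))                      ! negative-sum trick
      end do
      ! sanity check: the derivative of f(x) = x should be exactly 1
      print *, 'max error for d/dx(x):', maxval(abs(matmul(d, x) - 1.0d0))
    end program cheb_diff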
Comparisons of some large scientific computers
NASA Technical Reports Server (NTRS)
Credeur, K. R.
1981-01-01
In 1975, the National Aeronautics and Space Administration (NASA) began studies to assess the technical and economic feasibility of developing a computer having a sustained computational speed of one billion floating point operations per second and a working memory of at least 240 million words. Such a powerful computer would allow computational aerodynamics to play a major role in aeronautical design and advanced fluid dynamics research. Based on favorable results from these studies, NASA proceeded with developmental plans. The computer was named the Numerical Aerodynamic Simulator (NAS). To help ensure that the estimated cost, schedule, and technical scope were realistic, a brief study was made of past large scientific computers. Large discrepancies between inception and operation in scope, cost, or schedule were studied so that they could be minimized in NASA's proposed new computer. The main computers studied were the ILLIAC IV, STAR 100, Parallel Element Processor Ensemble (PEPE), and Shuttle Mission Simulator (SMS) computer. Comparison data on memory and speed were also obtained on the IBM 650, 704, 7090, 360-50, 360-67, 360-91, and 370-195; the CDC 6400, 6600, 7600, CYBER 203, and CYBER 205; the CRAY 1; and the Advanced Scientific Computer (ASC). A few lessons learned conclude the report.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Meneses, Esteban; Ni, Xiang; Jones, Terry R
The unprecedented computational power of current supercomputers now makes possible the exploration of complex problems in many scientific fields, from genomic analysis to computational fluid dynamics. Modern machines are powerful because they are massive: they assemble millions of cores and a huge quantity of disks, cards, routers, and other components. But it is precisely the size of these machines that clouds the future of supercomputing. A system that comprises many components has a high chance to fail, and fail often. In order to make the next generation of supercomputers usable, it is imperative to use some type of fault tolerance platform to run applications on large machines. Most fault tolerance strategies can be optimized for the peculiarities of each system and boost efficacy by keeping the system productive. In this paper, we aim to understand how failure characterization can improve resilience in several layers of the software stack: applications, runtime systems, and job schedulers. We examine the Titan supercomputer, one of the fastest systems in the world. We analyze a full year of Titan in production and distill the failure patterns of the machine. By looking into Titan's log files and using the criteria of experts, we provide a detailed description of the types of failures. In addition, we inspect the job submission files and describe how the system is used. Using those two sources, we cross-correlate failures in the machine to executing jobs and provide a picture of how failures affect the user experience. We believe such characterization is fundamental in developing appropriate fault tolerance solutions for Cray systems similar to Titan.
Multitasking 3-D forward modeling using high-order finite difference methods on the Cray X-MP/416
DOE Office of Scientific and Technical Information (OSTI.GOV)
Terki-Hassaine, O.; Leiss, E.L.
1988-01-01
The CRAY X-MP/416 was used to multitask 3-D forward modeling by the high-order finite difference method. Flowtrace analysis reveals that the most expensive operation in the unitasked program is a matrix-vector multiplication. The in-core and out-of-core versions of a reentrant subroutine can perform any fraction of the matrix-vector multiplication independently, a pattern compatible with multitasking. The matrix-vector multiplication routine can be distributed over two to four processors. The rest of the program utilizes the microtasking feature that lets the system treat independent iterations of DO-loops as subtasks to be performed by any available processor. The availability of the Solid-State Storage Device (SSD) meant the I/O wait time was virtually zero. A performance study determined a theoretical speedup, taking into account the multitasking overhead. Multitasking programs utilizing both macrotasking and microtasking features obtained actual speedups that were approximately 80% of the ideal speedup.
Utilities for master source code distribution: MAX and Friends
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
MAX is a program for the manipulation of FORTRAN master source code (MSC). This is a technique by which one maintains one and only one master copy of a FORTRAN program under a program development system, which for MAX is assumed to be VAX/VMS. The master copy is not intended to be directly compiled. Instead it must be pre-processed by MAX to produce compilable instances. These instances may correspond to different code versions (for example, double precision versus single precision), different machines (for example, IBM, CDC, Cray), or different operating systems (for example, VAX/VMS versus VAX/UNIX). The advantage of using a master source is more pronounced in complex application programs that are developed and maintained over many years and are to be transported and executed on several computer environments. The version lag problem that plagues many such programs is avoided by this approach. MAX is complemented by several auxiliary programs that perform nonessential functions. The ensemble is collectively known as MAX and Friends. All of these programs, including MAX, are executed as foreign VAX/VMS commands and can easily be hidden in customized VMS command procedures.
NAS Parallel Benchmark Results 11-96. 1.0
NASA Technical Reports Server (NTRS)
Bailey, David H.; Bailey, David; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
The NAS Parallel Benchmarks have been developed at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a "pencil and paper" fashion. In other words, the complete details of the problem to be solved are given in a technical document, and except for a few restrictions, benchmarkers are free to select the language constructs and implementation techniques best suited for a particular system. The results presented here are the best that have been reported to us by the vendors for the specific systems listed. In this report, we present new NPB (Version 1.0) performance results for the following systems: DEC AlphaServer 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, SGI Origin200, and SGI Origin2000. We also report High Performance Fortran (HPF) based NPB results for IBM SP2 Wide Nodes, HP/Convex Exemplar SPP2000, and SGI/CRAY T3D. These results have been submitted by Applied Parallel Research (APR) and Portland Group Inc. (PGI). We also present sustained performance per dollar for Class B LU, SP, and BT benchmarks.
NASA Technical Reports Server (NTRS)
Farrara, John D.; Drummond, Leroy A.; Mechoso, Carlos R.; Spahr, Joseph A.
1998-01-01
The design, implementation, and performance optimization on the CRAY T3E of an atmospheric general circulation model (AGCM) which includes the transport of, and chemical reactions among, an arbitrary number of constituents are reviewed. The parallel implementation is based on a two-dimensional (longitude and latitude) data domain decomposition. Initial optimization efforts centered on minimizing the impact of substantial static and weakly-dynamic load imbalances among processors through load redistribution schemes. Recent optimization efforts have centered on single-node optimization. Strategies employed include loop unrolling, both manually and through the compiler, the use of an optimized assembler-code library for special function calls, and restructuring of parts of the code to improve data locality. Data exchanges and synchronizations involved in coupling different data-distributed models can account for a significant fraction of the running time. Therefore, the required scattering and gathering of data must be optimized. In systems such as the T3E, there is much more aggregate bandwidth in the total system than in any particular processor. This suggests a distributed design. The design and implementation of such a distributed 'Data Broker' as a means to efficiently couple the components of our climate system model is described.
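Of the single-node strategies listed, manual loop unrolling is easy to show in isolation: the loop body is replicated (here four times) to cut loop overhead and give the compiler more independent operations to schedule, with a short cleanup loop for leftover iterations. The kernel below is an illustrative AXPY, not code from the AGCM.

    subroutine saxpy_unrolled(n, a, x, y)
      ! y = y + a*x with the main loop manually unrolled by four.
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a, x(n)
      real, intent(inout) :: y(n)
      integer :: i, nrem
      nrem = mod(n, 4)
      do i = 1, nrem                 ! cleanup when n is not a multiple of 4
        y(i) = y(i) + a * x(i)
      end do
      do i = nrem + 1, n, 4          ! unrolled main body
        y(i)   = y(i)   + a * x(i)
        y(i+1) = y(i+1) + a * x(i+1)
        y(i+2) = y(i+2) + a * x(i+2)
        y(i+3) = y(i+3) + a * x(i+3)
      end do
    end subroutine saxpy_unrolled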
Engineering PFLOTRAN for Scalable Performance on Cray XT and IBM BlueGene Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mills, Richard T; Sripathi, Vamsi K; Mahinthakumar, Gnanamanika
We describe PFLOTRAN - a code for simulation of coupled hydro-thermal-chemical processes in variably saturated, non-isothermal, porous media - and the approaches we have employed to obtain scalable performance on some of the largest scale supercomputers in the world. We present detailed analyses of I/O and solver performance on Jaguar, the Cray XT5 at Oak Ridge National Laboratory, and Intrepid, the IBM BlueGene/P at Argonne National Laboratory, that have guided our choice of algorithms.
NASA Technical Reports Server (NTRS)
Hull, Gary; Ranade, Sanjay
1993-01-01
With over 5000 units sold, the StorageTek Automated Cartridge System (ACS) 4400 tape library is currently the most popular large automated tape library. Based on 3480/90 tape technology, the library is used as the migration device ('nearline' storage) in high-performance mass storage systems. In its maximum configuration, one ACS 4400 tape library houses sixteen 3480/3490 tape drives and is capable of holding approximately 6000 cartridge tapes. The maximum storage capacity of one library using 3480 tapes is 1.2 TB, and the advertised aggregate I/O rate is about 24 MB/s. This paper reports on an extensive set of tests designed to accurately assess the performance capabilities and operational characteristics of one STK ACS 4400 tape library holding approximately 5200 cartridge tapes and configured with eight 3480 tape drives. A Cray Y-MP EL2-256 was configured as its host machine. More than 40,000 tape jobs were run in a variety of conditions to gather data in the areas of channel speed characteristics, robotics motion, timed tape mounts, and timed tape reads and writes.
High temperature composite analyzer (HITCAN) user's manual, version 1.0
NASA Technical Reports Server (NTRS)
Lackney, J. J.; Singhal, S. N.; Murthy, P. L. N.; Gotsis, P.
1993-01-01
This manual describes how to use the computer code HITCAN (HIgh Temperature Composite ANalyzer). HITCAN is a general purpose computer program for predicting nonlinear global structural and local stress-strain response of arbitrarily oriented, multilayered high temperature metal matrix composite structures. This code combines composite mechanics and laminate theory with an internal data base for material properties of the constituents (matrix, fiber, and interphase). The thermo-mechanical properties of the constituents are considered to be nonlinearly dependent on several parameters, including temperature, stress, and stress rate. The computation procedure for the analysis of the composite structures uses the finite element method. HITCAN is written in FORTRAN 77 computer language and at present has been configured and executed on the NASA Lewis Research Center CRAY XMP and YMP computers. This manual describes HITCAN's capabilities and limitations, followed by input/execution/output descriptions and example problems. The input is described in detail including (1) geometry modeling, (2) types of finite elements, (3) types of analysis, (4) material data, (5) types of loading, (6) boundary conditions, (7) output control, (8) program options, and (9) data bank.
NASA Technical Reports Server (NTRS)
Sidwell, Kenneth W.; Baruah, Pranab K.; Bussoletti, John E.; Medan, Richard T.; Conner, R. S.; Purdon, David J.
1990-01-01
A comprehensive description of user problem definition for the PAN AIR (Panel Aerodynamics) system is given. PAN AIR solves the 3-D linear integral equations of subsonic and supersonic flow. Influence coefficient methods are used which employ source and doublet panels as boundary surfaces. Both analysis and design boundary conditions can be used. This User's Manual describes the information needed to use the PAN AIR system. The structure and organization of PAN AIR are described, including the job control and module execution control languages for execution of the program system. The engineering input data are described, including the mathematical and physical modeling requirements. This manual applies strictly to PAN AIR Version 3.0. The major revisions include: (1) inputs and guidelines for the new FDP module (which calculates streamlines and offbody points); (2) nine new class 1 and class 2 boundary conditions to cover commonly used modeling practices, in particular the vorticity-matching Kutta condition; (3) use of the CRAY Solid-state Storage Device (SSD); and (4) incorporation of errata and typos, together with additional explanation and guidelines.
Discrete Event Modeling and Massively Parallel Execution of Epidemic Outbreak Phenomena
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S; Seal, Sudip K
2011-01-01
In complex phenomena such as epidemiological outbreaks, the intensity of inherent feedback effects and the significant role of transients in the dynamics make simulation the only effective method for proactive, reactive, or post-facto analysis. The spatial scale, runtime speed, and behavioral detail needed in detailed simulations of epidemic outbreaks make it necessary to use large-scale parallel processing. Here, an optimistic parallel execution of a new discrete event formulation of a reaction-diffusion simulation model of epidemic propagation is presented to dramatically increase the fidelity and speed with which epidemiological simulations can be performed. Rollback support needed during optimistic parallel execution is achieved by combining reverse computation with a small amount of incremental state saving. Parallel speedup of over 5,500 and other runtime performance metrics of the system are observed with weak-scaling execution on a small (8,192-core) Blue Gene/P system, while scalability with a weak-scaling speedup of over 10,000 is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes, exceeding several hundreds of millions of individuals in the largest cases, are successfully exercised to verify model scalability.
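The reverse-computation idea named above is simple to illustrate: when a forward event handler consists of constructive updates (increments and decrements), an exact reverse handler can undo it with no checkpoint at all. The toy infection event below is a sketch of the principle, not the epidemic model's actual event code.

    module epidemic_events
      ! One reversible event: move an individual from susceptible to infected.
      implicit none
      integer :: s = 1000          ! susceptible count
      integer :: infected = 10     ! infected count
    contains
      subroutine infect_forward()
        s = s - 1                  ! constructive updates ...
        infected = infected + 1
      end subroutine infect_forward
      subroutine infect_reverse()
        infected = infected - 1    ! ... undone exactly on rollback,
        s = s + 1                  ! with no saved state required
      end subroutine infect_reverse
    end module epidemic_events

    program rollback_demo
      use epidemic_events
      implicit none
      call infect_forward()
      call infect_reverse()        ! optimistic rollback
      print *, s, infected         ! back to 1000 and 10
    end program rollback_demo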
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haynes, R.A.
The Network File System (NFS) is used in UNIX-based networks to provide transparent file sharing between heterogeneous systems. Although NFS is well-known for being weak in security, it is widely used and has become a de facto standard. This paper examines the user authentication shortcomings of NFS and the approach Sandia National Laboratories has taken to strengthen it with Kerberos. The implementation on a Cray Y-MP8/864 running UNICOS is described and resource/performance issues are discussed. 4 refs., 4 figs.
Parallel processing a three-dimensional free-Lagrange code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mandell, D.A.; Trease, H.E.
1989-01-01
A three-dimensional, time-dependent free-Lagrange hydrodynamics code has been multitasked and autotasked on a CRAY X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the CRAY multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The three-dimensional algorithm has presented a number of problems that simpler algorithms, such as those for one-dimensional hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a CRAY-1, to a multitasking code are discussed. Autotasking of a rewritten version of the code is discussed. Timing results for subroutines and hot spots in the serial code are presented, and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given.
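The "comment card" constructs referred to above look roughly like the fragment below: without the preprocessor, the CMIC$ line is an ordinary comment, so the identical source still compiles serially on any machine. The directive spelling follows Cray microtasking documentation as best we can reconstruct it, so treat the details as illustrative.

          SUBROUTINE ADDV(A, B, N)
    C     Without autotasking preprocessing, the CMIC$ line is a comment.
          INTEGER N, I
          REAL A(N), B(N)
    CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
          DO 10 I = 1, N
             A(I) = A(I) + B(I)
       10 CONTINUE
          RETURN
          END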
Parallel processing a real code: A case history
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mandell, D.A.; Trease, H.E.
1988-01-01
A three-dimensional, time-dependent Free-Lagrange hydrodynamics code has been multitasked and autotasked on a Cray X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the Cray multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The 3-D algorithm has presented a number of problems that simpler algorithms, such as 1-D hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a Cray 1, to a multitasking code are discussed. Autotasking of a rewritten version of the code is discussed. Timing results for subroutines and hot spots in the serial code are presented and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given. 8 refs., 13 figs.
Thought Leaders during Crises in Massive Social Networks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Corley, Courtney D.; Farber, Robert M.; Reynolds, William
The vast amount of social media data that can be gathered from the internet, coupled with workflows that utilize both commodity systems and massively parallel supercomputers such as the Cray XMT, opens new vistas for research to support health, defense, and national security. Computer technology now enables the analysis of graph structures containing more than 4 billion vertices joined by 34 billion edges, along with metrics and massively parallel algorithms that exhibit near-linear scalability according to the number of processors. The challenge lies in making this massive data and analysis comprehensible to analysts and end-users who require actionable knowledge to carry out their duties. Simply stated, we have developed language- and content-agnostic techniques to reduce large graphs built from vast media corpora into forms people can understand. Specifically, our tools and metrics act as a survey tool to identify 'thought leaders' -- those members that lead or reflect the thoughts and opinions of an online community, independent of the source language.
Exciting Quantized Vortex Rings in a Superfluid Unitary Fermi Gas
NASA Astrophysics Data System (ADS)
Bulgac, Aurel
2014-03-01
In a recent article, Yefsah et al., Nature 499, 426 (2013) report the observation of an unusual quantum excitation mode in an elongated, harmonically trapped unitary Fermi gas. After phase-imprinting a domain wall, they observe collective oscillations of the superfluid atomic cloud with a period almost an order of magnitude larger than that predicted by any theory of domain walls, which they interpret as a possible new quantum phenomenon dubbed "a heavy soliton" with an inertial mass some 50 times larger than expected for a domain wall. We present compelling evidence that this "heavy soliton" is instead a quantized vortex ring, by showing that the main aspects of the experiment can be naturally explained within an extension of time-dependent density functional theory (TDDFT) to superfluid systems. The numerical simulations required the solution of some 260,000 nonlinear coupled time-dependent 3-dimensional partial differential equations and were implemented on 2048 GPUs on the Cray XK7 supercomputer Titan of the Oak Ridge Leadership Computing Facility.
NASA Technical Reports Server (NTRS)
Schutz, Bob E.; Baker, Gregory A.
1997-01-01
The recovery of a high resolution geopotential from satellite gradiometer observations motivates the examination of high performance computational techniques. The primary subject matter addresses specifically the use of satellite gradiometer and GPS observations to form and invert the normal matrix associated with a large degree and order geopotential solution. Memory resident and out-of-core parallel linear algebra techniques along with data parallel batch algorithms form the foundation of the least squares application structure. A secondary topic includes the adoption of object oriented programming techniques to enhance modularity and reusability of code. Applications implementing the parallel and object oriented methods successfully calculate the degree variance for a degree and order 110 geopotential solution on 32 processors of the Cray T3E. The memory resident gradiometer application exhibits an overall application performance of 5.4 Gflops, and the out-of-core linear solver exhibits an overall performance of 2.4 Gflops. The combination solution derived from a sun-synchronous gradiometer orbit produces average geoid height variances of 17 millimeters.
Wu, Lingfei; Wu, Kesheng; Sim, Alex; ...
2016-06-01
A novel algorithm and implementation for real-time identification and tracking of blob-filaments in fusion reactor data is presented. Similar spatio-temporal features are important in many other applications, for example, ignition kernels in combustion and tumor cells in a medical image. This work presents an approach for extracting these features by dividing the overall task into three steps: local identification of feature cells, grouping feature cells into extended features, and tracking the movement of features through their overlap in space. Through our extensive work in parallelization, we demonstrate that this approach can effectively make use of a large number of compute nodes to detect and track blob-filaments in real time in fusion plasma. Here, on a set of 30 GB of fusion simulation data, we observed linear speedup on 1024 processes and completed blob detection in less than three milliseconds using Edison, a Cray XC30 system at NERSC.
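A minimal sketch of the three-step decomposition described above (marking feature cells, grouping them into connected features, and tracking by spatial overlap), using scipy's connected-component labeling on synthetic 2-D frames; the threshold and data are hypothetical and this is not the authors' parallel implementation.

```python
import numpy as np
from scipy import ndimage

def detect_features(frame: np.ndarray, threshold: float):
    """Step 1: mark feature cells; step 2: group them into labeled blobs."""
    mask = frame > threshold
    labels, nblobs = ndimage.label(mask)
    return labels, nblobs

def track_by_overlap(labels_prev: np.ndarray, labels_curr: np.ndarray):
    """Step 3: link a current blob to a previous one if they overlap in space."""
    links = set()
    overlap = (labels_prev > 0) & (labels_curr > 0)
    for p, c in zip(labels_prev[overlap], labels_curr[overlap]):
        links.add((int(p), int(c)))
    return links

frame0 = np.zeros((64, 64))
frame0[20:26, 30:38] = 1.0                   # one synthetic "blob" of hot cells
frame1 = np.roll(frame0, shift=2, axis=1)    # the same blob, drifted two cells
l0, _ = detect_features(frame0, 0.5)
l1, _ = detect_features(frame1, 0.5)
print("blob links between frames:", track_by_overlap(l0, l1))
```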
NASA Astrophysics Data System (ADS)
Baker, Gregory Allen
The recovery of a high resolution geopotential from satellite gradiometer observations motivates the examination of high performance computational techniques. The primary subject matter addresses specifically the use of satellite gradiometer and GPS observations to form and invert the normal matrix associated with a large degree and order geopotential solution. Memory resident and out-of-core parallel linear algebra techniques along with data parallel batch algorithms form the foundation of the least squares application structure. A secondary topic includes the adoption of object oriented programming techniques to enhance modularity and reusability of code. Applications implementing the parallel and object oriented methods successfully calculate the degree variance for a degree and order 110 geopotential solution on 32 processors of the Cray T3E. The memory resident gradiometer application exhibits an overall application performance of 5.4 Gflops, and the out-of-core linear solver exhibits an overall performance of 2.4 Gflops. The combination solution derived from a sun-synchronous gradiometer orbit produces average geoid height variances of 17 millimeters.
Simulation and analysis of a geopotential research mission
NASA Technical Reports Server (NTRS)
Schutz, B. E.
1986-01-01
A computer simulation was performed for a Geopotential Research Mission (GRM) to enable study of the gravitational sensitivity of the range/rate measurement between two satellites and to provide a set of simulated measurements to assist in the evaluation of techniques developed for the determination of the gravity field. The simulation, identified as SGRM 8511, was conducted with two satellites in near circular, frozen orbits at 160 km altitude and separated by 300 km. High precision numerical integration of the polar orbits was used with a gravitational field complete to degree and order 180, and to degree 300 in orders 0 to 10. The set of simulated data for a mission duration of about 32 days was generated on a Cray X-MP computer. The characteristics of the simulation and the nature of the results are described.
ACON: a multipurpose production controller for plasma physics codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Snell, C.
1983-01-01
ACON is a BCON controller designed to run large production codes on the CTSS Cray-1 or the LTSS 7600 computers. ACON can also be operated interactively, with input from the user's terminal. The controller can run one code or a sequence of up to ten codes during the same job. Options are available to get and save Mass storage files, to perform Historian file updating operations, to compile and load source files, and to send out print and film files. Special features include ability to retry after Mass failures, backup options for saving files, startup messages for the various codes, and ability to reserve specified amounts of computer time after successive code runs. ACON's flexibility and power make it useful for running a number of different production codes.
Using a multifrontal sparse solver in a high performance, finite element code
NASA Technical Reports Server (NTRS)
King, Scott D.; Lucas, Robert; Raefsky, Arthur
1990-01-01
We consider the performance of the finite element method on a vector supercomputer. The computationally intensive parts of the finite element method are typically the individual element forms and the solution of the global stiffness matrix, both of which are vectorized in high performance codes. To further increase throughput, new algorithms are needed. We compare a multifrontal sparse solver to a traditional skyline solver in a finite element code on a vector supercomputer. The multifrontal solver uses the Multiple-Minimum-Degree reordering heuristic to reduce the number of operations required to factor a sparse matrix, and full matrix computational kernels (e.g., BLAS3) to enhance vector performance. The net result is an order-of-magnitude reduction in run time for a finite element application on one processor of a Cray X-MP.
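As a rough illustration of why a fill-reducing reordering matters (not the multifrontal code used in the paper), the sketch below factors the same sparse matrix with SuperLU's natural ordering and with a minimum-degree-style column ordering and compares the fill-in; the matrix is a synthetic 2-D Laplacian, and SuperLU's MMD_AT_PLUS_A option stands in here for the paper's Multiple Minimum Degree heuristic.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Synthetic 2-D Laplacian on an n-by-n grid (a typical finite element-like sparsity).
n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()

# Factor once with no reordering and once with a fill-reducing minimum degree ordering.
lu_plain = spla.splu(A, permc_spec="NATURAL")
lu_fill = spla.splu(A, permc_spec="MMD_AT_PLUS_A")

nnz = lambda lu: lu.L.nnz + lu.U.nnz
print("fill-in without reordering:      ", nnz(lu_plain))
print("fill-in with minimum degree order:", nnz(lu_fill))
```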
Parallel processors and nonlinear structural dynamics algorithms and software
NASA Technical Reports Server (NTRS)
Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.
1989-01-01
The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction, multiple data) computer, the Connection Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the elements, with an exchange of nodal forces at each time step. The architectural and C* programming language features of the Connection Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the Connection Machine is capable of outperforming the CRAY X-MP/14.
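A minimal sketch of the data layout behind an exchange-style algorithm, assuming a hypothetical 1-D mesh of linear spring elements: each element keeps private copies of its nodal variables, element forces are computed independently, and an "exchange" step sums the element contributions back onto shared nodes. This is only an illustration of the idea, not the authors' Connection Machine implementation.

```python
import numpy as np

# Hypothetical 1-D mesh: 4 two-node elements, 5 nodes, unit spring stiffness.
connectivity = np.array([[0, 1], [1, 2], [2, 3], [3, 4]])
n_nodes = 5
stiffness = 1.0

# Each element holds private copies of its nodal displacements ("allocated to the element").
u_nodes = np.linspace(0.0, 0.4, n_nodes)   # hypothetical displacements
u_elem = u_nodes[connectivity]             # element-local copies

# Element-level force computation proceeds independently (the data-parallel step).
f_elem = np.empty_like(u_elem)
f_elem[:, 0] = stiffness * (u_elem[:, 1] - u_elem[:, 0])
f_elem[:, 1] = -f_elem[:, 0]

# "Exchange" step: element contributions are scattered and summed at shared nodes.
f_nodes = np.zeros(n_nodes)
np.add.at(f_nodes, connectivity, f_elem)
print(f_nodes)
```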
Particle simulation on heterogeneous distributed supercomputers
NASA Technical Reports Server (NTRS)
Becker, Jeffrey C.; Dagum, Leonardo
1993-01-01
We describe the implementation and performance of a three dimensional particle simulation distributed between a Thinking Machines CM-2 and a Cray Y-MP. These are connected by a combination of two high-speed networks: a high-performance parallel interface (HIPPI) and an optical network (UltraNet). This is the first application to use this configuration at NASA Ames Research Center. We describe our experience implementing and using the application and report the results of several timing measurements. We show that the distribution of applications across disparate supercomputing platforms is feasible and has reasonable performance. In addition, several practical aspects of the computing environment are discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reynolds, William; Weber, Marta S.; Farber, Robert M.
Social Media provide an exciting and novel view into social phenomena. The vast amounts of data that can be gathered from the Internet coupled with massively parallel supercomputers such as the Cray XMT open new vistas for research. Conclusions drawn from such analysis must recognize that social media are distinct from the underlying social reality. Rigorous validation is essential. This paper briefly presents results obtained from computational analysis of social media - utilizing both blog and Twitter data. Validation of these results is discussed in the context of a framework of established methodologies from the social sciences. Finally, an outline for a set of supporting studies is proposed.
MHD Instability and Turbulence in the Tachocline
NASA Technical Reports Server (NTRS)
Werne, Joseph
2001-01-01
In this quarter we have begun simulations on the Cray T3E at PSC and we are debugging our code on the TSC. The PSC simulations are examining stratified shear turbulence with a flow-aligned magnetic field and passive tracer particles. We have conducted analysis of neutral simulations to establish a firm basis of comparison. Second-order structure functions have been computed, fit, and compared to theoretical expressions relating the dissipation fields and the structure-function-fit parameters. Agreement with high-Reynolds number observations is excellent, giving us confidence that the lower-Re simulations are relevant to higher-Re flows. We have also evaluated the neutral layer anisotropy.
Implementation of Parallel Computing Technology to Vortex Flow
NASA Technical Reports Server (NTRS)
Dacles-Mariani, Jennifer
1999-01-01
Mainframe supercomputers such as the Cray C90 were invaluable in obtaining large scale computations using several million grid points to resolve salient features of a tip vortex flow over a lifting wing. However, real flight configurations require tracking not only the flow over several lifting wings but also its growth and decay in the near- and intermediate-wake regions, not to mention the interaction of these vortices with each other. Resolving and tracking the evolution and interaction of these vortices shed from complex bodies is computationally intensive. Parallel computing technology is an attractive option for solving these flows. In planetary science, vortical flows are also important in studying how planets and protoplanets form when cosmic dust and gases become gravitationally unstable and eventually form planets or protoplanets. The current paradigm for the formation of planetary systems maintains that the planets accreted from the nebula of gas and dust left over from the formation of the Sun. Traditional theory also indicates that such a preplanetary nebula took the form of a flattened disk. The coagulation of dust led to the settling of aggregates toward the midplane of the disk, where they grew further into asteroid-like planetesimals. Some of the issues still remaining in this process are the onset of gravitational instability, the role of turbulence in the damping of particles, and radial effects. In this study the focus will be on the role of turbulence and the radial effects.
Visual Analytics for Power Grid Contingency Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wong, Pak C.; Huang, Zhenyu; Chen, Yousu
2014-01-20
Contingency analysis is the process of employing different measures to model scenarios, analyze them, and then derive the best response to remove the threats. This application paper focuses on a class of contingency analysis problems found in the power grid management system. A power grid is a geographically distributed interconnected transmission network that transmits and delivers electricity from generators to end users. The power grid contingency analysis problem is increasingly important because of both the growing size of the underlying raw data that need to be analyzed and the urgency to deliver working solutions in an aggressive timeframe. Failure to do so may bring significant financial, economic, and security impacts to all parties involved and the society at large. The paper presents a scalable visual analytics pipeline that transforms about 100 million contingency scenarios to a manageable size and form for grid operators to examine different scenarios and come up with preventive or mitigation strategies to address the problems in a predictive and timely manner. Great attention is given to the computational scalability, information scalability, visual scalability, and display scalability issues surrounding the data analytics pipeline. Most of the large-scale computation requirements of our work are conducted on a Cray XMT multi-threaded parallel computer. The paper demonstrates a number of examples using western North American power grid models and data.
The cascade high productivity language
NASA Technical Reports Server (NTRS)
Callahan, David; Chamberlain, Bradford L.; Zima, Hans P.
2004-01-01
This paper describes the design of Chapel, the Cascade High Productivity Language, which is being developed in the DARPA-funded HPCS project Cascade led by Cray Inc. Chapel pushes the state-of-the-art in languages for HEC system programming by focusing on productivity, in particular by combining the goal of highest possible object code performance with that of programmability offered by a high-level user interface.
Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.
The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards the exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it in a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
LaFarge, R.A.
1990-05-01
MCPRAM (Monte Carlo PReprocessor for AMEER), a computer program that uses Monte Carlo techniques to create an input file for the AMEER trajectory code, has been developed for the Sandia National Laboratories VAX and Cray computers. Users can select the number of trajectories to compute, which AMEER variables to investigate, and the type of probability distribution for each variable. Any legal AMEER input variable can be investigated anywhere in the input run stream with either a normal, uniform, or Rayleigh distribution. Users also have the option to use covariance matrices for the investigation of certain correlated variables such as booster pre-reentry errors and wind, axial force, and atmospheric models. In conjunction with MCPRAM, AMEER was modified to include the variables introduced by the covariance matrices and to include provisions for six types of fuze models. The new fuze models and the new AMEER variables are described in this report.
Applications Performance Under MPL and MPI on NAS IBM SP2
NASA Technical Reports Server (NTRS)
Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)
1994-01-01
On July 5, 1994, an IBM Scalable POWERparallel System (IBM SP2) with 64 nodes was installed at the Numerical Aerodynamic Simulation (NAS) Facility. Each node of the NAS IBM SP2 is a "wide node" consisting of an RS/6000 Model 590 workstation module with a clock of 66.5 MHz, which can perform four floating point operations per clock for a peak performance of 266 Mflop/s. By the end of 1994, the 64 nodes of the IBM SP2 will be upgraded to 160 nodes with a peak performance of 42.5 Gflop/s. An overview of the IBM SP2 hardware is presented. A basic understanding of the architectural details of the RS/6000 590 will help application scientists in porting, optimizing, and tuning codes from other machines, such as the CRAY C90 and the Paragon, to the NAS SP2. Optimization techniques such as quad-word loading, effective utilization of the two floating point units, and data cache optimization on the RS/6000 590 are illustrated, with examples giving the performance gains at each optimization step. The conversion of codes using Intel's message passing library NX to codes using the native Message Passing Library (MPL) and the Message Passing Interface (MPI) library available on the IBM SP2 is illustrated. In particular, we present the performance of the Fast Fourier Transform (FFT) kernel from the NAS Parallel Benchmarks (NPB) under MPL and MPI. We have also optimized some of the Fortran BLAS 2 and BLAS 3 routines; e.g., the optimized Fortran DAXPY runs at 175 Mflop/s and the optimized Fortran DGEMM runs at 230 Mflop/s per node. The performance of the NPB (Class B) on the IBM SP2 is compared with the CRAY C90, Intel Paragon, TMC CM-5E, and the CRAY T3D.
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas
The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability. Such simulations were not previously possible because the time-to-solution was more than a factor of 10 too long to complete one physics case within 5 days of wall-clock time. Frontier techniques such as nested OpenMP parallelism; adaptive parallel I/O; staging I/O and data reduction using dynamic and asynchronous application interactions; dynamic repartitioning for balancing computational work in pushing particles and in grid-related work; scalable and accurate discretization algorithms for nonlinear Coulomb collisions; and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership-class computer.
Time-Shifted Boundary Conditions Used for Navier-Stokes Aeroelastic Solver
NASA Technical Reports Server (NTRS)
Srivastava, Rakesh
1999-01-01
Under the Advanced Subsonic Technology (AST) Program, an aeroelastic analysis code (TURBO-AE) based on the Navier-Stokes equations is currently under development at NASA Lewis Research Center's Machine Dynamics Branch. For a blade row, aeroelastic instability can occur in any of the possible interblade phase angles (IBPAs). Analyzing small IBPAs is very computationally expensive because a large number of blade passages must be simulated. To reduce the computational cost of these analyses, we used time-shifted, or phase-lagged, boundary conditions in the TURBO-AE code. These conditions can be used to reduce the computational domain to a single blade passage by requiring the boundary conditions across the passage to be lagged according to the IBPA being analyzed. The time-shifted boundary conditions currently implemented are based on the direct-store method. This method requires large amounts of data to be stored over a period of the oscillation cycle. On CRAY computers this is not a major problem because solid-state devices can be used for fast input and output, so the data can be read from and written to disk instead of being stored in core memory.
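A minimal sketch of the direct-store idea described above: boundary values are saved over one full oscillation period and re-read with a time shift corresponding to the interblade phase angle. The array sizes and phase angle are hypothetical, and a production code would keep this history out of core (e.g., on a solid-state device) rather than in memory as done here.

```python
import numpy as np

n_steps_per_cycle = 360    # time steps per blade-oscillation period (hypothetical)
n_boundary_points = 128    # points on the periodic passage boundary (hypothetical)
ibpa_deg = 30.0            # interblade phase angle being analyzed (hypothetical)

# Direct store: keep one full cycle of boundary data from the neighboring passage.
history = np.zeros((n_steps_per_cycle, n_boundary_points))
lag = int(round(ibpa_deg / 360.0 * n_steps_per_cycle))

def store_boundary(step: int, values: np.ndarray) -> None:
    """Record this step's boundary values for use one cycle later."""
    history[step % n_steps_per_cycle] = values

def boundary_condition(step: int) -> np.ndarray:
    """Return the stored boundary data shifted by the phase lag."""
    return history[(step - lag) % n_steps_per_cycle]

store_boundary(0, np.ones(n_boundary_points))
print(boundary_condition(lag)[0])   # reads back the stored values one lag later
```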
Solution of matrix equations using sparse techniques
NASA Technical Reports Server (NTRS)
Baddourah, Majdi
1994-01-01
The solution of large systems of matrix equations is key to the solution of a large number of scientific and engineering problems. This talk describes the sparse matrix solver developed at Langley which can routinely solve in excess of 263,000 equations in 40 seconds on one Cray C-90 processor. It appears that for large scale structural analysis applications, sparse matrix methods have a significant performance advantage over other methods.
Dense and Sparse Matrix Operations on the Cell Processor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, Samuel W.; Shalf, John; Oliker, Leonid
2005-05-01
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.
A new procedure for dynamic adaption of three-dimensional unstructured grids
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Strawn, Roger
1993-01-01
A new procedure is presented for the simultaneous coarsening and refinement of three-dimensional unstructured tetrahedral meshes. This algorithm allows for localized grid adaption that is used to capture aerodynamic flow features such as vortices and shock waves in helicopter flowfield simulations. The mesh-adaption algorithm is implemented in the C programming language and uses a data structure consisting of a series of dynamically-allocated linked lists. These lists allow the mesh connectivity to be rapidly reconstructed when individual mesh points are added and/or deleted. The algorithm allows the mesh to change in an anisotropic manner in order to efficiently resolve directional flow features. The procedure has been successfully implemented on a single processor of a Cray Y-MP computer. Two sample cases are presented involving three-dimensional transonic flow. Computed results show good agreement with conventional structured-grid solutions for the Euler equations.
Multitasking for flows about multiple body configurations using the chimera grid scheme
NASA Technical Reports Server (NTRS)
Dougherty, F. C.; Morgan, R. L.
1987-01-01
The multitasking of a finite-difference scheme using multiple overset meshes is described. In this chimera, or multiple overset mesh, approach, a multiple-body configuration is mapped using a major grid about the main component of the configuration, with minor overset meshes used to map each additional component. This type of code is well suited to multitasking. Both steady and unsteady two-dimensional computations are run on parallel processors of a CRAY X-MP/48, usually with one mesh per processor. Flow field results are compared with single-processor results to demonstrate the feasibility of running multiple mesh codes on parallel processors and to show the increase in efficiency.
A Bandwidth-Optimized Multi-Core Architecture for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Secchi, Simone; Tumeo, Antonino; Villa, Oreste
This paper presents an architecture template for next-generation high performance computing systems specifically targeted to irregular applications. We start our work by considering that future generation interconnection and memory bandwidth full-system numbers are expected to grow by a factor of 10. In order to keep up with such a communication capacity, while still resorting to fine-grained multithreading as the main way to tolerate unpredictable memory access latencies of irregular applications, we show how overall performance scaling can benefit from the multi-core paradigm. At the same time, we also show how such an architecture template must be coupled with specific techniques in order to optimize bandwidth utilization and achieve the maximum scalability. We propose a technique based on memory reference aggregation, together with the related hardware implementation, as one such optimization technique. We explore the proposed architecture template by focusing on the Cray XMT architecture and, using a dedicated simulation infrastructure, validate the performance of our template with two typical irregular applications. Our experimental results prove the benefits provided by both the multi-core approach and the bandwidth-optimizing reference aggregation technique.
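A rough illustration of the reference-aggregation idea (not the hardware mechanism proposed in the paper): outstanding fine-grained memory references destined for the same memory block are coalesced so that each block is transferred once. The block size and address list are hypothetical.

```python
from collections import defaultdict

BLOCK_WORDS = 8   # hypothetical memory block size, in words

def aggregate(references):
    """Group word addresses by memory block so each block is fetched once."""
    blocks = defaultdict(list)
    for addr in references:
        blocks[addr // BLOCK_WORDS].append(addr)
    return blocks

# Irregular (graph-like) access pattern: scattered addresses, some sharing a block.
refs = [3, 11, 4, 1027, 12, 5, 1025, 9]
grouped = aggregate(refs)
print("requests issued without aggregation:", len(refs))
print("block transfers after aggregation:  ", len(grouped))
```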
FPPAC94: A two-dimensional multispecies nonlinear Fokker-Planck package for UNIX systems
NASA Astrophysics Data System (ADS)
Mirin, A. A.; McCoy, M. G.; Tomaschke, G. P.; Killeen, J.
1994-07-01
FPPAC94 solves the complete nonlinear multispecies Fokker-Planck collision operator for a plasma in two-dimensional velocity space. The operator is expressed in terms of spherical coordinates (speed and pitch angle) under the assumption of azimuthal symmetry. Provision is made for additional physics contributions (e.g. rf heating, electric field acceleration). The charged species, referred to as general species, are assumed to be in the presence of an arbitrary number of fixed Maxwellian species. The electrons may be treated either as one of these Maxwellian species or as a general species. Coulomb interactions among all charged species are considered. This program is a new version of FPPAC, which was last published in Computer Physics Communications in 1988. This new version is identical in scope to the previous version. However, it is written in standard Fortran 77 and is able to execute on a variety of Unix systems. The code has been tested on the Cray C90, HP-755 and Sun Sparc-1, and the answers agree on all platforms where the code has been tested. The test problems are the same as those provided in 1988. This version also corrects a bug in the 1988 version.
Tools for 3D scientific visualization in computational aerodynamics
NASA Technical Reports Server (NTRS)
Bancroft, Gordon; Plessel, Todd; Merritt, Fergus; Watson, Val
1989-01-01
The purpose is to describe the tools and techniques in use at the NASA Ames Research Center for performing visualization of computational aerodynamics, for example, visualization of flow fields from computer simulations of fluid dynamics about vehicles such as the Space Shuttle. The hardware used for visualization is a high-performance graphics workstation connected to a supercomputer with a high speed channel. At present, the workstation is a Silicon Graphics IRIS 3130, the supercomputer is a Cray-2, and the high speed channel is a hyperchannel. The three techniques used for visualization are post-processing, tracking, and steering. Post-processing analysis is done after the simulation. Tracking analysis is done during a simulation but is not interactive, whereas steering analysis involves modifying the simulation interactively during the simulation. Using post-processing methods, a flow simulation is executed on a supercomputer and, after the simulation is complete, the results of the simulation are processed for viewing. The software in use and under development at NASA Ames Research Center for performing these types of tasks in computational aerodynamics is described. Workstation performance issues, benchmarking, and high-performance networks for this purpose are also discussed, as well as descriptions of other hardware for digital video and film recording.
NASA Technical Reports Server (NTRS)
Kramer, Williams T. C.; Simon, Horst D.
1994-01-01
This tutorial is intended as a practical guide for the uninitiated to the main topics and themes of high-performance computing (HPC), with particular emphasis on distributed computing. The intent is first to provide some guidance and directions in the rapidly expanding field of scientific computing using both massively parallel and traditional supercomputers. Because of their considerable potential computational power, loosely or tightly coupled clusters of workstations are increasingly considered as a third alternative to both the more conventional supercomputers based on a small number of powerful vector processors and massively parallel processors. Even though many research issues concerning the effective use of workstation clusters and their integration into a large scale production facility are still unresolved, such clusters are already used for production computing. In this tutorial we will draw on the unique experience gained at the NAS facility at NASA Ames Research Center. Over the last five years at NAS, massively parallel supercomputers such as the Connection Machines CM-2 and CM-5 from Thinking Machines Corporation and the iPSC/860 (Touchstone Gamma Machine) and Paragon machines from Intel were used in a production supercomputer center alongside traditional vector supercomputers such as the Cray Y-MP and C90.
Enabling Graph Mining in RDF Triplestores using SPARQL for Holistic In-situ Graph Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Sangkeun; Sukumar, Sreenivas R; Hong, Seokyong
Graph analysis is now considered a promising technique for discovering useful knowledge in data from a new perspective. We envision two dimensions of graph analysis: OnLine Graph Analytic Processing (OLGAP) and Graph Mining (GM), which focus respectively on subgraph pattern matching and automatic knowledge discovery in graphs. Moreover, as these two dimensions aim to complementarily solve complex problems, holistic in-situ graph analysis, which covers both OLGAP and GM in a single system, is critical for minimizing the burdens of operating multiple graph systems and transferring intermediate result-sets between those systems. Nevertheless, most existing graph analysis systems are only capable of one dimension of graph analysis. In this work, we take an approach to enabling GM capabilities (e.g., PageRank, connected-component analysis, node eccentricity, etc.) in RDF triplestores, which were originally developed to store RDF datasets and provide OLGAP capability. More specifically, to achieve our goal, we implemented six representative graph mining algorithms using SPARQL. The approach makes a wide range of available RDF data sets directly applicable for holistic graph analysis within a single system. To validate our approach, we evaluate the performance of our implementations with nine real-world datasets and three different computing environments - a laptop computer, an Amazon EC2 instance, and a shared-memory Cray XMT2 URIKA-GD graph-processing appliance. The experimental results show that our implementation provides promising and scalable performance for real-world graph analysis in all tested environments. The developed software is publicly available in an open-source project that we initiated.
Enabling Graph Mining in RDF Triplestores using SPARQL for Holistic In-situ Graph Analysis
Lee, Sangkeun; Sukumar, Sreenivas R; Hong, Seokyong; ...
2016-01-01
Graph analysis is now considered a promising technique for discovering useful knowledge in data from a new perspective. We envision two dimensions of graph analysis: OnLine Graph Analytic Processing (OLGAP) and Graph Mining (GM), which focus respectively on subgraph pattern matching and automatic knowledge discovery in graphs. Moreover, as these two dimensions aim to complementarily solve complex problems, holistic in-situ graph analysis, which covers both OLGAP and GM in a single system, is critical for minimizing the burdens of operating multiple graph systems and transferring intermediate result-sets between those systems. Nevertheless, most existing graph analysis systems are only capable of one dimension of graph analysis. In this work, we take an approach to enabling GM capabilities (e.g., PageRank, connected-component analysis, node eccentricity, etc.) in RDF triplestores, which were originally developed to store RDF datasets and provide OLGAP capability. More specifically, to achieve our goal, we implemented six representative graph mining algorithms using SPARQL. The approach makes a wide range of available RDF data sets directly applicable for holistic graph analysis within a single system. To validate our approach, we evaluate the performance of our implementations with nine real-world datasets and three different computing environments - a laptop computer, an Amazon EC2 instance, and a shared-memory Cray XMT2 URIKA-GD graph-processing appliance. The experimental results show that our implementation provides promising and scalable performance for real-world graph analysis in all tested environments. The developed software is publicly available in an open-source project that we initiated.
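To make the graph-mining capability concrete, here is a minimal PageRank iteration over a toy set of RDF-style triples written in Python rather than SPARQL; the triples, damping factor, and iteration count are illustrative, and this is not the authors' SPARQL implementation.

```python
# Toy RDF-style triples (subject, predicate, object); only the link structure matters here.
triples = [("a", "cites", "b"), ("b", "cites", "c"),
           ("c", "cites", "a"), ("a", "cites", "c")]

nodes = {s for s, _, o in triples} | {o for _, _, o in triples}
out_links = {n: [o for s, _, o in triples if s == n] for n in nodes}

damping, iterations = 0.85, 20
rank = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(iterations):
    new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
    for n in nodes:
        for target in out_links[n]:
            new_rank[target] += damping * rank[n] / len(out_links[n])
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))
```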
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers
NASA Technical Reports Server (NTRS)
Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak
1996-01-01
This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamic and aeroacoustic properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades, and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP-2. The resultant far-field acoustic field is analyzed with state-of-the-art audio and video rendering of the propagating acoustic signals.
Supercomputer analysis of purine and pyrimidine metabolism leading to DNA synthesis.
Heinmets, F
1989-06-01
A model system is established to analyze purine and pyrimidine metabolism leading to DNA synthesis. The principal aim is to explore the flow and regulation of terminal deoxynucleoside triphosphates (dNTPs) under various input and parametric conditions. A series of flow equations is established, which are subsequently converted to differential equations. These are programmed (Fortran) and analyzed on a Cray X-MP/48 supercomputer. The pool concentrations are presented as a function of time under conditions in which various pertinent parameters of the system are modified. The system is formulated by 100 differential equations.
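A minimal sketch of the modeling pattern described above (flow equations converted to coupled ODEs for metabolite pools and integrated numerically), using a hypothetical two-pool system with made-up rate constants rather than the 100-equation network of the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical two-pool model: a precursor pool feeds a dNTP pool,
# which is drained by DNA synthesis.  Rate constants are illustrative only.
k_in, k_conv, k_dna = 1.0, 0.5, 0.2

def pools(t, y):
    precursor, dntp = y
    d_precursor = k_in - k_conv * precursor
    d_dntp = k_conv * precursor - k_dna * dntp
    return [d_precursor, d_dntp]

sol = solve_ivp(pools, (0.0, 50.0), [0.0, 0.0], t_eval=np.linspace(0, 50, 6))
for t, (p, d) in zip(sol.t, sol.y.T):
    print(f"t={t:5.1f}  precursor={p:6.3f}  dNTP={d:6.3f}")
```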
Do Some X-ray Stars Have White Dwarf Companions?
NASA Technical Reports Server (NTRS)
McCollum, Bruce
1995-01-01
Some Be stars which are intermittent X-ray sources may have white dwarf companions rather than neutron stars. It is not possible to prove or rule out the existence of Be+WD systems using X-ray or optical data. However, the presence of a white dwarf could be established by the detection of its EUV continuum shortward of the Be star's continuum turnover at 1000 Å. Either the detection or the nondetection of Be+WD systems would have implications for models of Be star variability, models of Be binary system formation and evolution, and models of wind-fed accretion.
Deploying Server-side File System Monitoring at NERSC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Uselton, Andrew
2009-05-01
The Franklin Cray XT4 at the NERSC center was equipped with the server-side I/O monitoring infrastructure Cerebro/LMT, which is described here in detail. Insights gained from the data produced include a better understanding of instantaneous data rates during file system testing, file system behavior during regular production time, and long-term average behaviors. Information and insights gleaned from this monitoring support efforts to proactively manage the I/O infrastructure on Franklin. A simple model for I/O transactions is introduced and compared with the 250 million observations sent to the LMT database from August 2008 to February 2009.
Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, Terry R
2012-01-01
This paper describes a kernel scheduling algorithm that is based on coscheduling principles and that is intended for parallel applications running on 1000 cores or more. Experimental results for a Linux implementation on a Cray XT5 machine are presented. The results indicate that Linux is a suitable operating system for this new scheduling scheme, and that this design provides a dramatic improvement in scaling performance for synchronizing collective operations at scale.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barton, G.W. Jr.
In UCID-19588, Communicating between the Apple and the Wang, we described how to take Apple DOS text files and send them to the Wang, and how to return Wang files to the Apple. It is also possible to use your Apple as an Octopus terminal, and to exchange files with Octopus 7600's. Presumably, you can also talk to the Crays, or any other part of the system. This connection has another virtue. It eliminates one of the terminals in your office.
NAVO MSRC Navigator. Fall 2008
2008-01-01
arrival of our two new HPC systems, DAVINCI (IBM P6) and EINSTEIN (Cray XT5), and our new mass storage server, NEWTON (Sun M5000) ... will run on both DAVINCI and EINSTEIN, providing researchers with the capability of running jobs of up to 4,256 and 12,736 cores in size ... are expected to double as EINSTEIN and DAVINCI are brought online. We have also strengthened the backbone of our Disaster Recovery infrastructure ...
New Computer Simulations of Macular Neural Functioning
NASA Technical Reports Server (NTRS)
Ross, Muriel D.; Doshay, D.; Linton, S.; Parnas, B.; Montgomery, K.; Chimento, T.
1994-01-01
We use high performance graphics workstations and supercomputers to study the functional significance of the three-dimensional (3-D) organization of gravity sensors. These sensors have a prototypic architecture foreshadowing more complex systems. Scaled-down simulations run on a Silicon Graphics workstation and scaled-up, 3-D versions run on a Cray Y-MP supercomputer. A semi-automated method of reconstruction of neural tissue from serial sections studied in a transmission electron microscope has been developed to eliminate tedious conventional photography. The reconstructions use a mesh as a step in generating a neural surface for visualization. Two meshes are required to model calyx surfaces. The meshes are connected and the resulting prisms represent the cytoplasm and the bounding membranes. A finite volume analysis method is employed to simulate voltage changes along the calyx in response to synapse activation on the calyx or on calyceal processes. The finite volume method insures that charge is conserved at the calyx-process junction. These and other models indicate that efferent processes act as voltage followers, and that the morphology of some afferent processes affects their functioning. In a final application, morphological information is symbolically represented in three dimensions in a computer. The possible functioning of the connectivities is tested using mathematical interpretations of physiological parameters taken from the literature. Symbolic, 3-D simulations are in progress to probe the functional significance of the connectivities. This research is expected to advance computer-based studies of macular functioning and of synaptic plasticity.
Global Seismic Imaging Based on Adjoint Tomography
NASA Astrophysics Data System (ADS)
Bozdag, E.; Lefebvre, M.; Lei, W.; Peter, D. B.; Smith, J. A.; Zhu, H.; Komatitsch, D.; Tromp, J.
2013-12-01
Our aim is to perform adjoint tomography at the scale of the globe to image the entire planet. We have started elastic inversions with a global data set of 253 CMT earthquakes with moment magnitudes in the range 5.8 ≤ Mw ≤ 7 and used GSN stations as well as some local networks, such as USArray and European stations. Using an iterative pre-conditioned conjugate gradient scheme, our initial aim is to obtain a global crustal and mantle model with transverse isotropy confined to the upper mantle. Global adjoint tomography has so far remained a challenge mainly due to computational limitations. Recent improvements in our 3D solvers (e.g., a GPU version) and access to high-performance computational centers (e.g., ORNL's Cray XK7 "Titan" system) now enable us to perform iterations with higher-resolution (T > 9 s) and longer-duration (200 min) simulations to accommodate high-frequency body waves and major-arc surface waves, respectively, which help improve data coverage. The remaining challenge is the heavy I/O traffic caused by the numerous files generated during the forward/adjoint simulations and the pre- and post-processing stages of our workflow. We improve the global adjoint tomography workflow by adopting the ADIOS file format for our seismic data as well as models, kernels, etc., to improve efficiency on high-performance clusters. Our ultimate aim is to use data from all available networks and earthquakes within the magnitude range of our interest (5.5 ≤ Mw ≤ 7), which requires a solid framework to manage big data in our global adjoint tomography workflow. We discuss the current status and future of global adjoint tomography based on our initial results, as well as practical issues such as handling big data in inversions and on high-performance computing systems.
Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Attaway, S.; Brown, K.; Gardner, D.
1997-12-31
Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.
Parallel performance of TORT on the CRAY J90: Model and measurement
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barnett, A.; Azmy, Y.Y.
1997-10-01
A limitation on the parallel performance of TORT on the CRAY J90 is the amount of extra work introduced by the multitasking algorithm itself. The extra work beyond that of the serial version of the code, called overhead, arises from the synchronization of the parallel tasks and the accumulation of results by the master task. The goal of recent updates to TORT was to reduce the time consumed by these activities. To help understand which components of the multitasking algorithm contribute significantly to the overhead, a parallel performance model was constructed and compared to measurements of actual timings of the code.
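A minimal sketch of the kind of performance model described above: predicted wall-clock time as perfectly divided parallel work plus a per-task overhead term for synchronization and accumulation by the master task. The coefficients are hypothetical, not measurements from TORT.

```python
def predicted_time(serial_time: float, tasks: int,
                   sync_cost: float = 0.05, accumulate_cost: float = 0.02) -> float:
    """Parallel work divides evenly; overhead grows with the number of tasks."""
    overhead = (sync_cost + accumulate_cost) * tasks
    return serial_time / tasks + overhead

serial_time = 100.0   # hypothetical single-task run time, seconds
for p in (1, 2, 4, 8, 16):
    t = predicted_time(serial_time, p)
    print(f"{p:2d} tasks: predicted {t:6.2f} s, speedup {serial_time / t:4.2f}x")
```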
Computer modeling of pulsed CO2 lasers for lidar applications
NASA Technical Reports Server (NTRS)
Spiers, Gary D.; Smithers, Martin E.; Murty, Rom
1991-01-01
The experimental results will enable a comparison of the numerical code output with experimental data, ensuring verification of the validity of the code. The measurements were made on a modified commercial CO2 laser. The results are as follows. (1) The pulse shape and energy dependence on gas pressure were measured. (2) The intrapulse frequency chirp due to plasma and laser-induced medium perturbation effects was determined; a simple numerical model showed quantitative agreement with these measurements. The pulse-to-pulse frequency stability was also determined. (3) The dependence of the laser transverse mode stability on cavity length was measured, and a simple analysis of this dependence in terms of changes to the equivalent Fresnel number and the cavity magnification was performed. (4) An analysis was made of the discharge pulse shape, which enabled the low efficiency of the laser to be explained in terms of poor coupling of the electrical energy into the vibrational levels. (5) The existing laser resonator code was changed to allow it to run on the Cray X-MP under the new operating system.
Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Biswas, Rupak
2003-01-01
This paper presents a detailed performance analysis of a multi-block overset grid computational fluid dynamics application on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SMP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.
Comparative Evaluation of Different Optimization Algorithms for Structural Design Applications
NASA Technical Reports Server (NTRS)
Patnaik, Surya N.; Coroneos, Rula M.; Guptill, James D.; Hopkins, Dale A.
1996-01-01
Non-linear programming algorithms play an important role in structural design optimization. Fortunately, several algorithms with computer codes are available. At NASA Lewis Research Center, a project was initiated to assess the performance of eight different optimizers through the development of a computer code, CometBoards. This paper summarizes the conclusions of that research. CometBoards was employed to solve sets of small, medium and large structural problems, using the eight different optimizers on a Cray Y-MP8E/8128 computer. The reliability and efficiency of the optimizers were determined from their performance on these problems. For small problems, the performance of most of the optimizers could be considered adequate. For large problems, however, three optimizers (two sequential quadratic programming routines, DNCONG of IMSL and SQP of IDESIGN, along with the Sequential Unconstrained Minimization Technique, SUMT) outperformed the others. At optimum, most optimizers captured an identical number of active displacement and frequency constraints, but the number of active stress constraints differed among the optimizers. This discrepancy can be attributed to singularity conditions in the optimization, and alleviating it can improve the efficiency of the optimizers.
Simulation and analysis of a geopotential research mission
NASA Technical Reports Server (NTRS)
Schutz, B. E.
1987-01-01
Computer simulations were performed for a Geopotential Research Mission (GRM) to enable the study of the gravitational sensitivity of the range-rate measurements between the two satellites and to provide a set of simulated measurements to assist in the evaluation of techniques developed for the determination of the gravity field. The simulations were conducted with two satellites in near circular, frozen orbits at 160 km altitude, separated by 300 km. High precision numerical integration of the polar orbits was used with a gravitational field complete to degree and order 360. The set of simulated data for a mission duration of about 32 days was generated on a Cray X-MP computer. The results presented cover the most recent simulation, S8703, and include a summary of the numerical integration of the simulated trajectories, a summary of the requirements to compute nominal reference trajectories to meet the initial orbit determination requirements for the recovery of the geopotential, an analysis of the nature of the one-way integrated Doppler measurements associated with the simulation, and a discussion of the data set to be made available.
Performance Trend of Different Algorithms for Structural Design Optimization
NASA Technical Reports Server (NTRS)
Patnaik, Surya N.; Coroneos, Rula M.; Guptill, James D.; Hopkins, Dale A.
1996-01-01
Nonlinear programming algorithms play an important role in structural design optimization. Fortunately, several algorithms with computer codes are available. At NASA Lewis Research Center, a project was initiated to assess the performance of different optimizers through the development of a computer code, CometBoards. This paper summarizes the conclusions of that research. CometBoards was employed to solve sets of small, medium and large structural problems, using different optimizers on a Cray Y-MP8E/8128 computer. The reliability and efficiency of the optimizers were determined from their performance on these problems. For small problems, the performance of most of the optimizers could be considered adequate. For large problems, however, three optimizers (two sequential quadratic programming routines, DNCONG of IMSL and SQP of IDESIGN, along with the Sequential Unconstrained Minimization Technique, SUMT) outperformed the others. At optimum, most optimizers captured an identical number of active displacement and frequency constraints, but the number of active stress constraints differed among the optimizers. This discrepancy can be attributed to singularity conditions in the optimization, and alleviating it can improve the efficiency of the optimizers.
Software and Systems Test Track Architecture and Concept Definition
2007-05-01
[Fragment of a software version table listing resources at ASC and ERDC sites: Flex (Free Software Foundation) 2.5.31; Fluent (Fluent Inc.) 6.2.16/6.2.26; FMD; Fortran 77/90 compilers (Compaq/Cray/SGI) 5.6-7.4.4m; FTA (Platform) 1.1; GAMESS.]
WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code
NASA Astrophysics Data System (ADS)
Mendygral, P. J.; Radcliffe, N.; Kandalla, K.; Porter, D.; O'Neill, B. J.; Nolting, C.; Edmon, P.; Donnert, J. M. F.; Jones, T. W.
2017-02-01
We present a new code for astrophysical magnetohydrodynamics specifically designed and optimized for high performance and scaling on modern and future supercomputers. We describe a novel hybrid OpenMP/MPI programming model that emerged from a collaboration between Cray, Inc. and the University of Minnesota. This design utilizes MPI-RMA optimized for thread scaling, which allows the code to run extremely efficiently at very high thread counts ideal for the latest generation of multi-core and many-core architectures. Such performance characteristics are needed in the era of “exascale” computing. We describe and demonstrate our high-performance design in detail with the intent that it may be used as a model for other, future astrophysical codes intended for applications demanding exceptional performance.
Numerical results on the transcendence of constants involving pi, e, and Euler's constant
NASA Technical Reports Server (NTRS)
Bailey, David H.
1988-01-01
The existence of simple polynomial equations (integer relations) for the constants e/pi, e + pi, log pi, gamma (Euler's constant), e exp gamma, gamma/e, gamma/pi, and log gamma is investigated by means of numerical computations. The recursive form of the Ferguson-Forcade algorithm (Ferguson and Forcade, 1979; Ferguson, 1986 and 1987) is implemented on the Cray-2 supercomputer at NASA Ames, applying multiprecision techniques similar to those described by Bailey (1988), except that FFTs are used instead of dual-prime-modulus transforms for multiplication. It is shown that none of the constants has an integer relation of degree eight or less with coefficients of Euclidean norm 10^9 or less.
High-performance multiprocessor architecture for a 3-D lattice gas model
NASA Technical Reports Server (NTRS)
Lee, F.; Flynn, M.; Morf, M.
1991-01-01
The lattice gas method has recently emerged as a promising discrete particle simulation method in areas such as fluid dynamics. We present a very high-performance scalable multiprocessor architecture, called ALGE, proposed for the simulation of a realistic 3-D lattice gas model, Henon's 24-bit FCHC isometric model. Each of these VLSI processors is as powerful as a CRAY-2 for this application. ALGE is scalable in the sense that it achieves linear speedup for both fixed and increasing problem sizes with more processors. The core computation of a lattice gas model consists of many repetitions of two alternating phases: particle collision and propagation. Functional decomposition by symmetry group and virtual move are the respective keys to efficient implementation of collision and propagation.
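A minimal sketch of the two alternating phases mentioned above, written for the much simpler 2-D HPP lattice gas (four velocity channels) rather than the 24-bit FCHC model of the paper: head-on pairs of particles scatter into the perpendicular pair, then every channel propagates one site. Grid size and fill fraction are arbitrary.

```python
import numpy as np

# Boolean occupation of four velocity channels (+x, -x, +y, -y) on a 16x16 grid.
rng = np.random.default_rng(1)
cells = rng.random((4, 16, 16)) < 0.2
PX, MX, PY, MY = 0, 1, 2, 3

def collide(c):
    """HPP rule: an exact head-on pair scatters into the perpendicular pair."""
    x_pair = c[PX] & c[MX] & ~c[PY] & ~c[MY]
    y_pair = c[PY] & c[MY] & ~c[PX] & ~c[MX]
    flip = x_pair | y_pair
    for ch in (PX, MX, PY, MY):
        c[ch] ^= flip

def propagate(c):
    """Each channel shifts one lattice site in its direction (periodic boundaries)."""
    c[PX] = np.roll(c[PX], 1, axis=1)
    c[MX] = np.roll(c[MX], -1, axis=1)
    c[PY] = np.roll(c[PY], 1, axis=0)
    c[MY] = np.roll(c[MY], -1, axis=0)

for _ in range(10):          # many repetitions of the two alternating phases
    collide(cells)
    propagate(cells)
print("particles after 10 steps:", int(cells.sum()))   # particle count is conserved
```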
Turbomachinery Forced Response Prediction System (FREPS): User's Manual
NASA Technical Reports Server (NTRS)
Morel, M. R.; Murthy, D. V.
1994-01-01
The turbomachinery forced response prediction system (FREPS), version 1.2, is capable of predicting the aeroelastic behavior of axial-flow turbomachinery blades. This document is meant to serve as a guide in the use of the FREPS code with specific emphasis on its use at NASA Lewis Research Center (LeRC). A detailed explanation of the aeroelastic analysis and its development is beyond the scope of this document, and may be found in the references. FREPS has been developed by the NASA LeRC Structural Dynamics Branch. The manual is divided into three major parts: an introduction, the preparation of input, and the procedure to execute FREPS. Part 1 includes a brief background on the necessity of FREPS, a description of the FREPS system, the steps needed to be taken before FREPS is executed, an example input file with instructions, presentation of the geometric conventions used, and the input/output files employed and produced by FREPS. Part 2 contains a detailed description of the command names needed to create the primary input file that is required to execute the FREPS code. Also, Part 2 has an example data file to aid the user in creating their own input files. Part 3 explains the procedures required to execute the FREPS code on the Cray Y-MP, a computer system available at the NASA LeRC.
PLOT3D/AMES, GENERIC UNIX VERSION USING DISSPLA (WITH TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. The UNIX/DISSPLA implementation of PLOT3D supports 2-D polygons as well as 2-D and 3-D lines, but does not support graphics features requiring 3-D polygons (shading and hidden line removal, for example). Views can be manipulated using keyboard commands. This version of PLOT3D is potentially able to produce files for a variety of output devices; however, site-specific capabilities will vary depending on the device drivers supplied with the user's DISSPLA library. The version 3.6b+ UNIX/DISSPLA implementations of PLOT3D (ARC-12788) and PLOT3D/TURB3D (ARC-12778) were developed for use on computers running UNIX SYSTEM 5 with BSD 4.3 extensions. The standard distribution medium for each of these programs is a 9-track, 6250 bpi magnetic tape in TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); (2) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D (ARC-12783, ARC-12782); (3) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777, ARC-12781); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785 which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Equipment Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. System 5 is a trademark of Bell Labs, Incorporated. BSD4.3 is a trademark of the University of California at Berkeley. UNIX is a registered trademark of AT&T.
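Several of the scalar functions listed in the entry above (pressure, temperature, Mach number) are point-wise combinations of the conserved quantities stored in the solution file. The sketch below shows the standard ideal-gas relations for pressure and Mach number from density, momentum, and stagnation energy; the variable names are chosen for illustration and are not taken from PLOT3D itself.

import numpy as np

GAMMA = 1.4  # ratio of specific heats (ideal-gas assumption)

def pressure_and_mach(rho, rho_u, rho_v, rho_w, e0):
    """Point-wise pressure and Mach number from conserved variables.

    rho            : density
    rho_u, rho_v, rho_w : x-, y-, z-momentum per unit volume
    e0             : stagnation (total) energy per unit volume
    """
    kinetic = 0.5 * (rho_u**2 + rho_v**2 + rho_w**2) / rho
    p = (GAMMA - 1.0) * (e0 - kinetic)                  # static pressure
    a = np.sqrt(GAMMA * p / rho)                        # speed of sound
    speed = np.sqrt(rho_u**2 + rho_v**2 + rho_w**2) / rho
    return p, speed / a                                 # pressure, Mach number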
PLOT3D/AMES, GENERIC UNIX VERSION USING DISSPLA (WITHOUT TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. The UNIX/DISSPLA implementation of PLOT3D supports 2-D polygons as well as 2-D and 3-D lines, but does not support graphics features requiring 3-D polygons (shading and hidden line removal, for example). Views can be manipulated using keyboard commands. This version of PLOT3D is potentially able to produce files for a variety of output devices; however, site-specific capabilities will vary depending on the device drivers supplied with the user's DISSPLA library. The version 3.6b+ UNIX/DISSPLA implementations of PLOT3D (ARC-12788) and PLOT3D/TURB3D (ARC-12778) were developed for use on computers running UNIX SYSTEM 5 with BSD 4.3 extensions. The standard distribution medium for each of these programs is a 9-track, 6250 bpi magnetic tape in TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); (2) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D (ARC-12783, ARC-12782); (3) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777, ARC-12781); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785 which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Equipment Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. System 5 is a trademark of Bell Labs, Incorporated. BSD4.3 is a trademark of the University of California at Berkeley. UNIX is a registered trademark of AT&T.
Input/output behavior of supercomputing applications
NASA Technical Reports Server (NTRS)
Miller, Ethan L.
1991-01-01
The collection and analysis of supercomputer I/O traces and their use in a collection of buffering and caching simulations are described. This serves two purposes. First, it gives a model of how individual applications running on supercomputers request file system I/O, allowing system designers to optimize I/O hardware and file system algorithms to that model. Second, the buffering simulations show what resources are needed to maximize the CPU utilization of a supercomputer given a very bursty I/O request rate. By using read-ahead and write-behind in a large solid-state disk, one or two applications were sufficient to fully utilize a Cray Y-MP CPU.
On the parallel solution of parabolic equations
NASA Technical Reports Server (NTRS)
Gallopoulos, E.; Saad, Youcef
1989-01-01
Parallel algorithms for the solution of linear parabolic problems are proposed. The first of these methods is based on using polynomial approximation to the exponential. It does not require solving any linear systems and is highly parallelizable. The two other methods proposed are based on Pade and Chebyshev approximations to the matrix exponential. The parallelization of these methods is achieved by using partial fraction decomposition techniques to solve the resulting systems and thus offers the potential for increased time parallelism in time dependent problems. Experimental results from the Alliant FX/8 and the Cray Y-MP/832 vector multiprocessors are also presented.
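The first method mentioned above, polynomial approximation to the exponential, requires nothing but matrix-vector products, which is what makes it so easy to parallelize. The sketch below evaluates a truncated Taylor polynomial for exp(tau*A)v and compares it against a dense reference; the function and variable names are illustrative, and the test matrix is an arbitrary small stable matrix, not a discretized parabolic operator.

import numpy as np
from scipy.linalg import expm

def expmv_taylor(A_matvec, v, tau, terms=30):
    """Approximate exp(tau*A) @ v with a truncated Taylor polynomial.

    Only matrix-vector products are needed, so on a parallel machine the
    distributed kernel is simply the matvec itself.
    """
    result = v.copy()
    term = v.copy()
    for k in range(1, terms + 1):
        term = tau * A_matvec(term) / k      # (tau*A)^k v / k!
        result = result + term
    return result

rng = np.random.default_rng(1)
A = -np.eye(100) + 0.1 * rng.standard_normal((100, 100))   # stable test matrix
v = rng.standard_normal(100)
approx = expmv_taylor(lambda x: A @ x, v, tau=0.5)
exact = expm(0.5 * A) @ v
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))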
NASA Technical Reports Server (NTRS)
Shannon, Robert V., Jr.
1989-01-01
The model generation and structural analysis performed for the High Pressure Oxidizer Turbopump (HPOTP) preburner pump volute housing located on the main pump end of the HPOTP in the space shuttle main engine are summarized. An ANSYS finite element model of the volute housing was built and executed. A static structural analysis was performed on the Engineering Analysis and Data System (EADS) Cray X-MP supercomputer.
A Portable Parallel Implementation of the U.S. Navy Layered Ocean Model
1995-01-01
A. J. Wallcraft (Planning Systems Inc.) and D. R. Moore (IC Dept. of Mathematics). Presentation extract; platforms referenced include DEC Alpha, Cray T3D/E, Sun SPARC, Fujitsu AP1000, Intel i860/Paragon, Kendall Square, and hypercube systems.
Linux Kernel Co-Scheduling For Bulk Synchronous Parallel Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, Terry R
2011-01-01
This paper describes a kernel scheduling algorithm that is based on co-scheduling principles and that is intended for parallel applications running on 1000 cores or more where inter-node scalability is key. Experimental results for a Linux implementation on a Cray XT5 machine are presented. The results indicate that Linux is a suitable operating system for this new scheduling scheme, and that this design provides a dramatic improvement in scaling performance for synchronizing collective operations at scale.
1990-08-01
...corneal structure for both normal and swollen corneas. Other problems of future interest are the understanding of the structure of scarred and dystrophic corneas. METHOD AND RESULTS: The system of equations is solved numerically on a Cray X-MP by a finite element method with 9-node Lagrange quadrilaterals (Becker, Carey, and Oden, 1981, Finite Elements: An Introduction, Vol. 1, Prentice-Hall, Englewood Cliffs, NJ).
A fast, time-accurate unsteady full potential scheme
NASA Technical Reports Server (NTRS)
Shankar, V.; Ide, H.; Gorski, J.; Osher, S.
1985-01-01
The unsteady form of the full potential equation is solved in conservation form by an implicit method based on approximate factorization. At each time level, internal Newton iterations are performed to achieve time accuracy and computational efficiency. A local time linearization procedure is introduced to provide a good initial guess for the Newton iteration. A novel flux-biasing technique is applied to generate proper forms of the artificial viscosity to treat hyperbolic regions with shocks and sonic lines present. The wake is properly modeled by accounting not only for jumps in phi, but also for jumps in higher derivatives of phi, obtained by requiring the density to be continuous across the wake. The far field is modeled using the Riemann invariants to simulate nonreflecting boundary conditions. The resulting unsteady method performs well and, even at low reduced frequency levels of 0.1 or less, requires fewer than 100 time steps per cycle at transonic Mach numbers. The code is fully vectorized for the CRAY-XMP and the VPS-32 computers.
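The pattern described here, an implicit time step whose nonlinear update is driven to convergence by a few inner Newton iterations started from a time-linearized predictor, can be illustrated on a simple stiff system. The backward-Euler sketch below is a generic analogue under that assumption, not the approximate-factorization full-potential solver itself, and all names are illustrative.

import numpy as np

def backward_euler_step(f, dfdu, u_old, dt, newton_iters=5, tol=1e-12):
    """One implicit (backward Euler) step of du/dt = f(u).

    Residual: R(u) = u - u_old - dt * f(u).
    The initial guess comes from a local time linearization
    (one linearized implicit solve); Newton iterations then refine it.
    """
    n = len(u_old)
    # Time-linearized predictor: (I - dt*J) du = dt * f(u_old)
    du = np.linalg.solve(np.eye(n) - dt * dfdu(u_old), dt * f(u_old))
    u = u_old + du
    # Inner Newton iterations on the full residual
    for _ in range(newton_iters):
        R = u - u_old - dt * f(u)
        if np.linalg.norm(R) < tol:
            break
        J = np.eye(n) - dt * dfdu(u)
        u = u - np.linalg.solve(J, R)
    return u

# Example: stiff linear system du/dt = A u
A = np.array([[-100.0, 1.0], [0.0, -0.5]])
u = np.array([1.0, 1.0])
for _ in range(10):
    u = backward_euler_step(lambda x: A @ x, lambda x: A, u, dt=0.1)
print(u)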
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.
1991-01-01
Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.
NASA Technical Reports Server (NTRS)
Palmer, Grant; Venkatapathy, Ethiraj
1993-01-01
Three solution algorithms, explicit under-relaxation, point implicit, and lower-upper symmetric Gauss-Seidel (LUSGS), are used to compute nonequilibrium flow around the Apollo 4 return capsule at 62 km altitude. By varying the Mach number, the efficiency and robustness of the solution algorithms were tested for different levels of chemical stiffness. The performance of the solution algorithms degraded as the Mach number and stiffness of the flow increased. At Mach 15, 23, and 30, the LUSGS method produces an eight order of magnitude drop in the L2 norm of the energy residual in 1/3 to 1/2 the Cray C-90 computer time as compared to the point implicit and explicit under-relaxation methods. The explicit under-relaxation algorithm experienced convergence difficulties at Mach 23 and above. At Mach 40 the performance of the LUSGS algorithm deteriorates to the point that it is outperformed by the point implicit method. The effects of the viscous terms are investigated. Grid dependency questions are explored.
Query optimization for graph analytics on linked data using SPARQL
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Seokyong; Lee, Sangkeun; Lim, Seung -Hwan
2015-07-01
Triplestores that support query languages such as SPARQL are emerging as the preferred and scalable solution to represent data and meta-data as massive heterogeneous graphs using Semantic Web standards. With increasing adoption, the desire to conduct graph-theoretic mining and exploratory analysis has also increased. Addressing that desire, this paper presents a solution that is the marriage of Graph Theory and the Semantic Web. We present software that can analyze Linked Data using graph operations such as counting triangles, finding eccentricity, testing connectedness, and computing PageRank directly on triple stores via the SPARQL interface. We describe the process of optimizing performance of the SPARQL-based implementation of such popular graph algorithms by reducing the space-overhead, simplifying iterative complexity and removing redundant computations by understanding query plans. Our optimized approach shows significant performance gains on triplestores hosted on stand-alone workstations as well as hardware-optimized scalable supercomputers such as the Cray XMT.
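As one concrete example of the graph operations listed above, counting triangles can be expressed directly as a SPARQL query and pushed down to the triplestore. The snippet below sketches such a query issued through the SPARQLWrapper package; the endpoint URL is a placeholder, and the STR-based filter is just one simple way to avoid counting each triangle in multiple orientations, not the optimized query plan described in the paper.

# Sketch: triangle counting pushed down to a SPARQL endpoint.
# The endpoint URL is a placeholder; SPARQLWrapper must be installed.
from SPARQLWrapper import SPARQLWrapper, JSON

TRIANGLE_QUERY = """
SELECT (COUNT(*) AS ?triangles) WHERE {
  ?a ?p1 ?b .
  ?b ?p2 ?c .
  ?c ?p3 ?a .
  FILTER (STR(?a) < STR(?b) && STR(?b) < STR(?c))
}
"""

endpoint = SPARQLWrapper("http://localhost:8890/sparql")   # placeholder endpoint
endpoint.setQuery(TRIANGLE_QUERY)
endpoint.setReturnFormat(JSON)
result = endpoint.query().convert()
print(result["results"]["bindings"][0]["triangles"]["value"])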
Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines
NASA Technical Reports Server (NTRS)
Ajmani, Kumud; Liou, Meng-Sing; Dyson, Rodger W.
1994-01-01
The GMRES method is parallelized, and combined with local preconditioning to construct an implicit parallel solver to obtain steady-state solutions for the Navier-Stokes equations of fluid flow on distributed-memory machines. The new implicit parallel solver is designed to preserve the convergence rate of the equivalent 'serial' solver. A static domain-decomposition is used to partition the computational domain amongst the available processing nodes of the parallel machine. The SPMD (Single-Program Multiple-Data) programming model is combined with message-passing tools to develop the parallel code on a 32-node Intel Hypercube and a 512-node Intel Delta machine. The implicit parallel solver is validated for internal and external flow problems, and is found to compare identically with flow solutions obtained on a Cray Y-MP/8. A peak computational speed of 2300 MFlops/sec has been achieved on 512 nodes of the Intel Delta machine, for a problem size of 1024 K equations (256 K grid points).
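The serial counterpart of the solver described here, restarted GMRES wrapped around a local preconditioner, is available directly in SciPy. The sketch below shows that single-node analogue with an incomplete-LU preconditioner on a simple sparse test matrix; it is only an illustration of the algorithmic ingredients, not the distributed-memory implementation discussed in the abstract.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small sparse test system standing in for a linearized flow operator.
n = 2000
A = sp.diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Local (incomplete LU) preconditioner, applied through a LinearOperator.
ilu = spla.spilu(A, drop_tol=1e-4)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

x, info = spla.gmres(A, b, M=M, restart=30)
print("converged" if info == 0 else f"gmres info={info}",
      "residual:", np.linalg.norm(b - A @ x))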
HO2 rovibrational eigenvalue studies for nonzero angular momentum
NASA Astrophysics Data System (ADS)
Wu, Xudong T.; Hayes, Edward F.
1997-08-01
An efficient parallel algorithm is reported for determining all bound rovibrational energy levels for the HO2 molecule for nonzero angular momentum values, J=1, 2, and 3. Performance tests on the CRAY T3D indicate that the algorithm scales almost linearly when up to 128 processors are used. Sustained performance levels of up to 3.8 Gflops have been achieved using 128 processors for J=3. The algorithm uses a direct product discrete variable representation (DVR) basis and the implicitly restarted Lanczos method (IRLM) of Sorensen to compute the eigenvalues of the polyatomic Hamiltonian. Since the IRLM is an iterative method, it does not require storage of the full Hamiltonian matrix—it only requires the multiplication of the Hamiltonian matrix by a vector. When the IRLM is combined with a formulation such as DVR, which produces a very sparse matrix, both memory and computation times can be reduced dramatically. This algorithm has the potential to achieve even higher performance levels for larger values of the total angular momentum.
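The key point above, that the implicitly restarted Lanczos method needs only matrix-vector products with a sparse Hamiltonian, is easy to demonstrate with SciPy's ARPACK interface, which implements the same family of Sorensen's algorithms. The sketch uses a toy 1-D finite-difference Hamiltonian (harmonic oscillator) purely as a stand-in for the sparse DVR Hamiltonian; all names and grid parameters are illustrative.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

# Toy 1-D Hamiltonian (finite-difference kinetic energy + harmonic potential)
# standing in for the sparse DVR Hamiltonian of the abstract.
n, dx = 1000, 0.05
x = (np.arange(n) - n / 2) * dx
kinetic = sp.diags([1.0, -2.0, 1.0], offsets=[-1, 0, 1], shape=(n, n)) * (-0.5 / dx**2)
H = (kinetic + sp.diags(0.5 * x**2)).tocsr()

# The iterative eigensolver only ever needs H @ v, never a dense H.
H_op = LinearOperator((n, n), matvec=lambda v: H @ v, dtype=float)

# Lowest eigenvalues via the implicitly restarted Lanczos method (ARPACK).
vals = eigsh(H_op, k=6, which="SA", return_eigenvectors=False)
print(np.sort(vals))   # should approximate 0.5, 1.5, 2.5, ... for the oscillator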
Preliminary 2-D shell analysis of the space shuttle solid rocket boosters
NASA Technical Reports Server (NTRS)
Knight, Norman F., Jr.; Gillian, Ronnie E.; Nemeth, Michael P.
1987-01-01
A two-dimensional shell model of an entire solid rocket booster (SRB) has been developed using the STAGSC-1 computer code and executed on the Ames CRAY computer. The purpose of these analyses is to calculate the overall deflection and stress distributions for the SRB when subjected to mechanical loads corresponding to critical times during the launch sequence. The mechanical loading conditions for the full SRB arise from the external tank (ET) attachment points, the solid rocket motor (SRM) pressure load, and the SRB hold down posts. The ET strut loads vary with time after the Space Shuttle main engine (SSME) ignition. The SRM internal pressure varies axially by approximately 100 psi. Static analyses of the full SRB are performed using a snapshot picture of the loads. The field and factory joints are modeled by using equivalent stiffness joints instead of detailed models of the joint. As such, local joint behavior cannot be obtained from this global model.
Viscous wing theory development. Volume 1: Analysis, method and results
NASA Technical Reports Server (NTRS)
Chow, R. R.; Melnik, R. E.; Marconi, F.; Steinhoff, J.
1986-01-01
Viscous transonic flows at large Reynolds numbers over 3-D wings were analyzed using a zonal viscid-inviscid interaction approach. A new numerical AFZ scheme was developed in conjunction with the finite volume formulation for the solution of the inviscid full-potential equation. A special far-field asymptotic boundary condition was developed and a second-order artificial viscosity included for an improved inviscid solution methodology. The integral method was used for the laminar/turbulent boundary layer and 3-D viscous wake calculation. The interaction calculation included the coupling conditions of the source flux due to the wing surface boundary layer, the flux jump due to the viscous wake, and the wake curvature effect. A method was also devised incorporating the 2-D trailing edge strong interaction solution for the normal pressure correction near the trailing edge region. A fully automated computer program was developed to perform the proposed method with one scalar version to be used on an IBM-3081 and two vectorized versions on Cray-1 and Cyber-205 computers.
The design and implementation of a parallel unstructured Euler solver using software primitives
NASA Technical Reports Server (NTRS)
Das, R.; Mavriplis, D. J.; Saltz, J.; Gupta, S.; Ponnusamy, R.
1992-01-01
This paper is concerned with the implementation of a three-dimensional unstructured grid Euler-solver on massively parallel distributed-memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel communication rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time compilation support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and Intel Touchstone Delta machine. The quantitative effects of the various optimizations are demonstrated, and we show that the combined effect of these optimizations leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the CRAY-YMP vector supercomputer.
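An edge-based data structure of the kind mentioned above stores the grid as a flat list of vertex pairs, and the residual is accumulated by looping over edges and scattering each edge flux to its two endpoints. The NumPy sketch below illustrates that gather/scatter pattern with a trivial difference "flux"; the array names and the flux formula are made up for illustration and are not taken from the solver described in the abstract.

import numpy as np

def accumulate_residual(u, edges, weights):
    """Edge-based residual accumulation on an unstructured grid.

    u       : (n_nodes,) solution values
    edges   : (n_edges, 2) integer array of vertex pairs (i, j)
    weights : (n_edges,) edge coefficients (e.g. face areas / metrics)
    """
    i, j = edges[:, 0], edges[:, 1]
    flux = weights * (u[j] - u[i])        # trivial stand-in for a Riemann flux
    residual = np.zeros_like(u)
    np.add.at(residual, i, flux)          # scatter-add to the left node
    np.add.at(residual, j, -flux)         # scatter-add to the right node
    return residual

# Tiny example: a 1-D chain expressed in edge-based form.
edges = np.array([[k, k + 1] for k in range(9)])
u = np.linspace(0.0, 1.0, 10)
print(accumulate_residual(u, edges, np.ones(len(edges))))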
Early Experiences Writing Performance Portable OpenMP 4 Codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joubert, Wayne; Hernandez, Oscar R
In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand if the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the traditional first touch policy approach in traditional OpenMP. To that end, we use recent Cray and Intel compilers to measure the performance variations of a simple application kernel when executed on the OLCF's Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.
NASA Astrophysics Data System (ADS)
Anantharaj, V.; Mayer, B.; Wang, F.; Hack, J.; McKenna, D.; Hartman-Baker, R.
2012-04-01
The Oak Ridge Leadership Computing Facility (OLCF) facilitates the execution of computational experiments that require tens of millions of CPU hours (typically using thousands of processors simultaneously) while generating hundreds of terabytes of data. A set of ultra high resolution climate experiments in progress, using the Community Earth System Model (CESM), will produce over 35,000 files, ranging in sizes from 21 MB to 110 GB each. The execution of the experiments will require nearly 70 Million CPU hours on the Jaguar and Titan supercomputers at OLCF. The total volume of the output from these climate modeling experiments will be in excess of 300 TB. This model output must then be archived, analyzed, distributed to the project partners in a timely manner, and also made available more broadly. Meeting this challenge would require efficient movement of the data, staging the simulation output to a large and fast file system that provides high volume access to other computational systems used to analyze the data and synthesize results. This file system also needs to be accessible via high speed networks to an archival system that can provide long term reliable storage. Ideally this archival system is itself directly available to other systems that can be used to host services making the data and analysis available to the participants in the distributed research project and to the broader climate community. The various resources available at the OLCF now support this workflow. The available systems include the new Jaguar Cray XK6 2.63 petaflops (estimated) supercomputer, the 10 PB Spider center-wide parallel file system, the Lens/EVEREST analysis and visualization system, the HPSS archival storage system, the Earth System Grid (ESG), and the ORNL Climate Data Server (CDS). The ESG features federated services, search & discovery, extensive data handling capabilities, deep storage access, and Live Access Server (LAS) integration. The scientific workflow enabled on these systems, and developed as part of the Ultra-High Resolution Climate Modeling Project, allows users of OLCF resources to efficiently share simulated data, often multi-terabyte in volume, as well as the results from the modeling experiments and various synthesized products derived from these simulations. The final objective in the exercise is to ensure that the simulation results and the enhanced understanding will serve the needs of a diverse group of stakeholders across the world, including our research partners in U.S. Department of Energy laboratories & universities, domain scientists, students (K-12 as well as higher education), resource managers, decision makers, and the general public.
Distributed multitasking ITS with PVM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fan, W.C.; Halbleib, J.A. Sr.
1995-12-31
Advances in computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable to or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generated in a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of the MCNP code on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electron-photon transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load-balancing schemes for homogeneous and heterogeneous networks.
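The scaling argument above, that independent particle histories parallelize almost perfectly and that the statistical error shrinks as the history count grows, can be reproduced with a toy Monte Carlo split across worker processes. The sketch uses Python's multiprocessing module as a modern stand-in for PVM; the worker count, history counts, and the pi-estimation problem itself are purely illustrative.

# Toy illustration: independent Monte Carlo "histories" distributed to workers,
# with multiprocessing standing in for PVM. Estimates pi by rejection sampling.
import numpy as np
from multiprocessing import Pool

def run_histories(args):
    seed, n_histories = args
    rng = np.random.default_rng(seed)
    xy = rng.random((n_histories, 2))
    return int(np.count_nonzero((xy**2).sum(axis=1) < 1.0))

if __name__ == "__main__":
    n_workers, per_worker = 8, 250_000
    with Pool(n_workers) as pool:
        hits = pool.map(run_histories, [(seed, per_worker) for seed in range(n_workers)])
    total = n_workers * per_worker
    estimate = 4.0 * sum(hits) / total
    # Statistical uncertainty decreases like 1/sqrt(total histories).
    print(f"pi ~ {estimate:.5f} from {total} histories")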
Distributed multitasking ITS with PVM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fan, W.C.; Halbleib, J.A. Sr.
1995-02-01
Advances in computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable to or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources, and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generated in a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of MCNP on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electron/photon transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load balancing schemes for homogeneous and heterogeneous networks.
3D gain modeling of LMJ and NIF amplifiers
NASA Astrophysics Data System (ADS)
LeTouze, Geoffroy; Cabourdin, Olivier; Mengue, J. F.; Guenet, Mireille; Grebot, Eric; Seznec, Stephane E.; Jancaitis, Kenneth S.; Marshall, Christopher D.; Zapata, Luis E.; Erlandson, A. E.
1999-07-01
A 3D ray-trace model has been developed to predict the performance of flashlamp pumped laser amplifiers. The computer program, written in C++, includes a graphical display option using the Open Inventor library, as well as a parser and a loader allowing the user to easily model complex multi-segment amplifier systems. It runs both on a workstation cluster at LLNL, and on the T3E Cray at CEA. We will discuss how we have reduced the required computation time without changing precision by optimizing the parameters which set the discretization level of the calculation. As an example, the sample of calculation points is chosen to fit the pumping profile through the thickness of amplifier slabs. We will show the difference in pump rates with our latest model as opposed to those produced by our earlier 2.5D code AmpModel. We will also present the results of calculations which model surfaces and other 3D effects such as top and bottom reflector positions and reflectivity which could not be included in the 2.5D model. This new computer model also includes a full 3D calculation of the amplified spontaneous emission rate in the laser slab, as opposed to the 2.5D model which tracked only the variation in the gain across the transverse dimensions of the slab. We will present the impact of this evolution of the model on the predicted stimulated decay rate and the resulting gain distribution. Comparisons with the most recent AmpLab experimental results will be presented, in the different typical NIF and LMJ configurations.
Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Crockett, Thomas W.
1999-01-01
This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.
Scaling up ATLAS Event Service to production levels on opportunistic computing platforms
NASA Astrophysics Data System (ADS)
Benjamin, D.; Caballero, J.; Ernst, M.; Guan, W.; Hover, J.; Lesny, D.; Maeno, T.; Nilsson, P.; Tsulaia, V.; van Gemmeren, P.; Vaniachine, A.; Wang, F.; Wenaus, T.; ATLAS Collaboration
2016-10-01
Continued growth in public cloud and HPC resources is on track to exceed the dedicated resources available for ATLAS on the WLCG. Examples of such platforms are Amazon AWS EC2 Spot Instances, Edison Cray XC30 supercomputer, backfill at Tier 2 and Tier 3 sites, opportunistic resources at the Open Science Grid (OSG), and ATLAS High Level Trigger farm between the data taking periods. Because of specific aspects of opportunistic resources such as preemptive job scheduling and data I/O, their efficient usage requires workflow innovations provided by the ATLAS Event Service. Thanks to the finer granularity of the Event Service data processing workflow, the opportunistic resources are used more efficiently. We report on our progress in scaling opportunistic resource usage to double-digit levels in ATLAS production.
Relativistic Collisions of Highly-Charged Ions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ionescu, Dorin; Belkacem, Ali
1998-11-19
The physics of elementary atomic processes in relativistic collisions between highly-charged ions and atoms or other ions is briefly discussed, and some recent theoretical and experimental results in this field are summarized. They include excitation, capture, ionization, and electron-positron pair creation. The numerical solution of the two-center Dirac equation in momentum space is shown to be a powerful nonperturbative method for describing atomic processes in relativistic collisions involving heavy and highly-charged ions. By propagating negative-energy wave packets in time the evolution of the QED vacuum around heavy ions in relativistic motion is investigated. Recent results obtained from numerical calculations using massively parallel processing on the Cray-T3E supercomputer of the National Energy Research Scientific Computing Center (NERSC) at Berkeley National Laboratory are presented.
NASA Technical Reports Server (NTRS)
Holzmann, Gerard J.; Joshi, Rajeev; Groce, Alex
2008-01-01
Reportedly, supercomputer designer Seymour Cray once said that he would sooner use two strong oxen to plow a field than a thousand chickens. Although this is undoubtedly wise when it comes to plowing a field, it is not so clear for other types of tasks. Model checking problems are of the proverbial "search the needle in a haystack" type. Such problems can often be parallelized easily. Alas, none of the usual divide and conquer methods can be used to parallelize the working of a model checker. Given that it has become easier than ever to gain access to large numbers of computers to perform even routine tasks it is becoming more and more attractive to find alternate ways to use these resources to speed up model checking tasks. This paper describes one such method, called swarm verification.
WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mendygral, P. J.; Radcliffe, N.; Kandalla, K.
2017-02-01
We present a new code for astrophysical magnetohydrodynamics specifically designed and optimized for high performance and scaling on modern and future supercomputers. We describe a novel hybrid OpenMP/MPI programming model that emerged from a collaboration between Cray, Inc. and the University of Minnesota. This design utilizes MPI-RMA optimized for thread scaling, which allows the code to run extremely efficiently at very high thread counts ideal for the latest generation of multi-core and many-core architectures. Such performance characteristics are needed in the era of “exascale” computing. We describe and demonstrate our high-performance design in detail with the intent that it may be used as a model for other, future astrophysical codes intended for applications demanding exceptional performance.
Progress in unstructured-grid methods development for unsteady aerodynamic applications
NASA Technical Reports Server (NTRS)
Batina, John T.
1992-01-01
The development of unstructured-grid methods for the solution of the equations of fluid flow and what was learned over the course of the research are summarized. The focus of the discussion is on the solution of the time-dependent Euler equations including spatial discretizations, temporal discretizations, and boundary conditions. An example calculation with an implicit upwind method using a CFL number of infinity is presented for the Boeing 747 aircraft. The results were obtained in less than one hour of CPU time on a Cray-2 computer, thus demonstrating the speed and robustness of the capability. Additional calculations for the ONERA M6 wing demonstrate the accuracy of the method through the good agreement between calculated results and experimental data for a standard transonic flow case.
An interactive adaptive remeshing algorithm for the two-dimensional Euler equations
NASA Technical Reports Server (NTRS)
Slack, David C.; Walters, Robert W.; Lohner, R.
1990-01-01
An interactive adaptive remeshing algorithm utilizing a frontal grid generator and a variety of time integration schemes for the two-dimensional Euler equations on unstructured meshes is presented. Several device dependent interactive graphics interfaces have been developed along with a device independent DI-3000 interface which can be employed on any computer that has the supporting software including the Cray-2 supercomputers Voyager and Navier. The time integration methods available include: an explicit four stage Runge-Kutta and a fully implicit LU decomposition. A cell-centered finite volume upwind scheme utilizing Roe's approximate Riemann solver is developed. To obtain higher order accurate results a monotone linear reconstruction procedure proposed by Barth is utilized. Results for flow over a transonic circular arc and flow through a supersonic nozzle are examined.
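The explicit four-stage Runge-Kutta option mentioned above is commonly implemented in CFD codes in a low-storage (Jameson-style) form, where each stage re-evaluates the residual from the state at the beginning of the step. The sketch below shows that form for a generic semi-discrete system du/dt = R(u) applied to a toy linear residual; the stage coefficients, names, and test matrix are illustrative assumptions, not taken from the code described in the abstract.

import numpy as np

def rk4_low_storage_step(residual, u, dt, alphas=(0.25, 1.0 / 3.0, 0.5, 1.0)):
    """One explicit multi-stage (Jameson-style) Runge-Kutta step.

    Each stage restarts from the solution at the beginning of the step:
        u_k = u_0 + alpha_k * dt * R(u_{k-1})
    """
    u0 = u.copy()
    for alpha in alphas:
        u = u0 + alpha * dt * residual(u)
    return u

# Toy semi-discrete system: du/dt = A u with a stable matrix A.
A = np.array([[-1.0, 0.3], [0.0, -2.0]])
u = np.array([1.0, 1.0])
for _ in range(50):
    u = rk4_low_storage_step(lambda v: A @ v, u, dt=0.05)
print(u)   # decays toward zero, as expected for a stable system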
NASA Astrophysics Data System (ADS)
Huhn, William Paul; Lange, Björn; Yu, Victor; Blum, Volker; Lee, Seyong; Yoon, Mina
Density-functional theory has been well established as the dominant quantum-mechanical computational method in the materials community. Large accurate simulations become very challenging on small to mid-scale computers and require high-performance compute platforms to succeed. GPU acceleration is one promising approach. In this talk, we present a first implementation of all-electron density-functional theory in the FHI-aims code for massively parallel GPU-based platforms. Special attention is paid to the update of the density and to the integration of the Hamiltonian and overlap matrices, realized in a domain decomposition scheme on non-uniform grids. The initial implementation scales well across nodes on ORNL's Titan Cray XK7 supercomputer (8 to 64 nodes, 16 MPI ranks/node) and shows an overall speed up in runtime due to utilization of the K20X Tesla GPUs on each Titan node of 1.4x, with the charge density update showing a speed up of 2x. Further acceleration opportunities will be discussed. Work supported by the LDRD Program of ORNL managed by UT-Battelle, LLC, for the U.S. DOE and by the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
A compositional reservoir simulator on distributed memory parallel computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rame, M.; Delshad, M.
1995-12-31
This paper presents the application of distributed memory parallel computers to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straightforward. Results of the distributed memory computing performance of the parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for the same problems on a vector supercomputer is also presented.
Two-dimensional Euler and Navier-Stokes Time accurate simulations of fan rotor flows
NASA Technical Reports Server (NTRS)
Boretti, A. A.
1990-01-01
Two numerical methods are presented which describe the unsteady flow field in the blade-to-blade plane of an axial fan rotor. These methods solve the compressible, time-dependent, Euler and the compressible, turbulent, time-dependent, Navier-Stokes conservation equations for mass, momentum, and energy. The Navier-Stokes equations are written in Favre-averaged form and are closed with an approximate two-equation turbulence model with low Reynolds number and compressibility effects included. The unsteady aerodynamic component is obtained by superposing inflow or outflow unsteadiness on the steady conditions through time-dependent boundary conditions. The integration in space is performed by using a finite volume scheme, and the integration in time is performed by using k-stage Runge-Kutta schemes, k = 2,5. The numerical integration algorithm allows the reduction of the computational cost of an unsteady simulation involving high frequency disturbances in both CPU time and memory requirements. Less than 200 sec of CPU time are required to advance the Euler equations on a computational grid made up of about 2000 grid points during 10,000 time steps on a CRAY Y-MP computer, with a required memory of less than 0.3 megawords.
Computer aided design of monolithic microwave and millimeter wave integrated circuits and subsystems
NASA Astrophysics Data System (ADS)
Ku, Walter H.
1987-08-01
This interim technical report presents results of research on the computer aided design of monolithic microwave and millimeter wave integrated circuits and subsystems. A specific objective is to extend the state-of-the-art of the Computer Aided Design (CAD) of the monolithic microwave and millimeter wave integrated circuits (MIMIC). In this reporting period, we have derived a new model for the high electron mobility transistor (HEMT) based on a nonlinear charge control formulation which takes into consideration the variation of the 2DEG distance offset from the heterointerface as a function of bias. Pseudomorphic InGaAs/GaAs HEMT devices have been successfully fabricated at UCSD. For a 1 micron gate length, a maximum transconductance of 320 mS/mm was obtained. In cooperation with TRW, devices with 0.15 micron and 0.25 micron gate lengths have been successfully fabricated and tested. New results on the design of ultra-wideband distributed amplifiers using 0.15 micron pseudomorphic InGaAs/GaAs HEMT's have also been obtained. In addition, two-dimensional models of the submicron MESFET's, HEMT's and HBT's are currently being developed for the CRAY X-MP/48 supercomputer. Preliminary results obtained are also presented in this report.
NASA Technical Reports Server (NTRS)
Hanebutte, Ulf R.; Joslin, Ronald D.; Zubair, Mohammad
1994-01-01
The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent transition in three-dimensional boundary-layer flows are computed with the PSDNS code. By remapping the distributed data structure during the course of the calculation, optimized serial library routines can be utilized that substantially increase the computational performance. Although the remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40% for all performed calculations. By using appropriate compile options and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45% of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a 'real world' simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y-MP for the same simulation. The scalability information provides estimated computational costs that match the actual costs relative to changes in the number of grid points.
TIGER: Turbomachinery interactive grid generation
NASA Technical Reports Server (NTRS)
Soni, Bharat K.; Shih, Ming-Hsin; Janus, J. Mark
1992-01-01
A three dimensional, interactive grid generation code, TIGER, is being developed for analysis of flows around ducted or unducted propellers. TIGER is a customized grid generator that combines new technology with methods from general grid generation codes. The code generates multiple block, structured grids around multiple blade rows with a hub and shroud for either C grid or H grid topologies. The code is intended for use with a Euler/Navier-Stokes solver also being developed, but is general enough for use with other flow solvers. TIGER features a Silicon Graphics interactive graphics environment that displays a pop-up window, graphics window, and text window. The geometry is read as a discrete set of points with options for several industrial standard formats and NASA standard formats. Various splines are available for defining the surface geometries. Grid generation is done either interactively or through a batch mode operation using history files from a previously generated grid. The batch mode operation can be done either with a graphical display of the interactive session or with no graphics so that the code can be run on another computer system. Run time can be significantly reduced by running on a Cray Y-MP.
Azad, Ariful; Buluç, Aydın
2016-05-16
We describe parallel algorithms for computing maximal cardinality matching in a bipartite graph on distributed-memory systems. Unlike traditional algorithms that match one vertex at a time, our algorithms process many unmatched vertices simultaneously using a matrix-algebraic formulation of maximal matching. This generic matrix-algebraic framework is used to develop three efficient maximal matching algorithms with minimal changes. The newly developed algorithms have two benefits over existing graph-based algorithms. First, unlike existing parallel algorithms, cardinality of matching obtained by the new algorithms stays constant with increasing processor counts, which is important for predictable and reproducible performance. Second, relying on bulk-synchronous matrix operations, these algorithms expose a higher degree of parallelism on distributed-memory platforms than existing graph-based algorithms. We report high-performance implementations of three maximal matching algorithms using hybrid OpenMP-MPI and evaluate the performance of these algorithms using more than 35 real and randomly generated graphs. On real instances, our algorithms achieve up to 200× speedup on 2048 cores of a Cray XC30 supercomputer. Even higher speedups are obtained on larger synthetically generated graphs where our algorithms show good scaling on up to 16,384 cores.
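For contrast with the matrix-algebraic parallel algorithms described above, the baseline they improve on is the simple serial procedure that matches one vertex at a time. The sketch below gives that greedy maximal (not maximum) matching on a bipartite graph stored as adjacency lists; the data-structure choice and names are illustrative and are not from the paper.

def greedy_maximal_matching(adjacency):
    """Greedy maximal matching on a bipartite graph.

    adjacency maps each left vertex to an iterable of right vertices.
    Every left vertex is visited once; the result cannot be extended by
    adding another edge (maximal), though it may not have maximum size.
    """
    mate_left, mate_right = {}, {}
    for u, neighbors in adjacency.items():
        for v in neighbors:
            if v not in mate_right:
                mate_left[u], mate_right[v] = v, u
                break
    return mate_left

graph = {0: [0, 1], 1: [0], 2: [1, 2], 3: [2]}
print(greedy_maximal_matching(graph))   # e.g. {0: 0, 2: 1, 3: 2}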
Performance of an Optimized Eta Model Code on the Cray T3E and a Network of PCs
NASA Technical Reports Server (NTRS)
Kouatchou, Jules; Rancic, Miodrag; Geiger, Jim
2000-01-01
In the year 2001, NASA will launch the satellite TRIANA that will be the first Earth observing mission to provide a continuous, full disk view of the sunlit Earth. As a part of the HPCC Program at NASA GSFC, we have started a project whose objectives are to develop and implement a 3D cloud data assimilation system, by combining TRIANA measurements with model simulation, and to produce accurate statistics of global cloud coverage as an important element of the Earth's climate. For simulation of the atmosphere within this project we are using the NCEP/NOAA operational Eta model. In order to compare TRIANA and the Eta model data on approximately the same grid without significant downscaling, the Eta model will be integrated at a resolution of about 15 km. The integration domain (from -70 to +70 deg in latitude and 150 deg in longitude) will cover most of the sunlit Earth disc and will continuously rotate around the globe following TRIANA. The cloud data assimilation is supposed to run and produce 3D clouds on a near real-time basis. Such a numerical setup and integration design is very ambitious and computationally demanding. Thus, though the Eta model code has been very carefully developed and its computational efficiency has been systematically polished during the years of operational implementation at NCEP, the current MPI version may still have problems with memory and efficiency for the TRIANA simulations. Within this work, we optimize a parallel version of the Eta model code on a Cray T3E and a network of PCs (theHIVE) in order to improve its overall efficiency. Our optimization procedure consists of introducing dynamically allocated arrays to reduce the size of static memory, and optimizing on a single processor by splitting loops to limit the number of streams. All the presented results are derived using an integration domain centered at the equator, with a size of 60 x 60 deg, and with horizontal resolutions of 1/2 and 1/3 deg, respectively. In accompanying charts we report the elapsed time, the speedup and the Mflops as a function of the number of processors for the non-optimized version of the code on the T3E and theHIVE. The large amount of communication required for model integration explains its poor performance on theHIVE. Our initial implementation of the dynamic memory allocation has contributed to about 12% reduction of memory but has introduced a 3% overhead in computing time. This overhead was removed by performing loop splitting in some of the most demanding subroutines. When the Eta code is fully optimized in order to meet the memory requirement for TRIANA simulations, a non-negligible overhead may appear that may seriously affect the efficiency of the code. To alleviate this problem, we are considering implementation of a new algorithm for the horizontal advection that is computationally less expensive, and also a new approach for marching in time.
A Block-LU Update for Large-Scale Linear Programming
1990-01-01
linear programming problems. Results are given from runs on the Cray Y-MP. 1. Introduction. We wish to use the simplex method [Dan63] to solve the ... standard linear program: minimize c^T x subject to Ax = b, l ≤ x ≤ u, where A is an m by n matrix and c, x, l, u, and b are of appropriate dimension. The simplex ... the identity matrix. The basis is used to solve for the search direction y and the dual variables π in the following linear systems: B_k y = a_q (1.2) and
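The two linear systems quoted above drive each revised-simplex iteration. The sketch below simply refactorizes the basis with SciPy rather than applying the paper's block-LU update; it only illustrates how the search direction y and the dual variables π are obtained from the basis B, and the matrix sizes and data are made up for illustration.

    # A minimal sketch of the two basis solves behind a simplex iteration, using a dense
    # LU factorization; the paper's block-LU *update* of the factorization is not shown.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    m = 5
    B = rng.standard_normal((m, m)) + m * np.eye(m)   # current basis matrix (kept well conditioned)
    a_q = rng.standard_normal(m)                      # entering column of A
    c_B = rng.standard_normal(m)                      # costs of the basic variables

    lu, piv = lu_factor(B)
    y = lu_solve((lu, piv), a_q)                # search direction:  B y = a_q
    pi = lu_solve((lu, piv), c_B, trans=1)      # dual variables:    B^T pi = c_B

    print("y  =", y)
    print("pi =", pi)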
PLOT3D/AMES, UNIX SUPERCOMPUTER AND SGI IRIS VERSION (WITHOUT TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. In addition to providing the advantages of performing complex calculations on a supercomputer, the Supercomputer/IRIS implementation of PLOT3D offers advanced 3-D, view manipulation, and animation capabilities. Shading and hidden line/surface removal can be used to enhance depth perception and other aspects of the graphical displays. A mouse can be used to translate, rotate, or zoom in on views. Files for several types of output can be produced. Two animation options are available. Simple animation sequences can be created on the IRIS, or, if an appropriately modified version of ARCGRAPH (ARC-12350) is accessible on the supercomputer, files can be created for use in GAS (Graphics Animation System, ARC-12379), an IRIS program which offers more complex rendering and animation capabilities and options for recording images to digital disk, video tape, or 16-mm film. The version 3.6b+ Supercomputer/IRIS implementations of PLOT3D (ARC-12779) and PLOT3D/TURB3D (ARC-12784) are suitable for use on CRAY 2/UNICOS, CONVEX, and ALLIANT computers with a remote Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstation. These programs are distributed on .25 inch magnetic tape cartridges in IRIS TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstations (ARC-12783, ARC-12782); (2) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777, ARC-12781); (3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785 - which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Electronics Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo, DN10000, and GMR3D are trademarks of Hewlett-Packard, Incorporated. System V is a trademark of Bell Labs, Incorporated. BSD4.3 is a trademark of the University of California at Berkeley. UNIX is a registered trademark of AT&T.
PLOT3D/AMES, UNIX SUPERCOMPUTER AND SGI IRIS VERSION (WITH TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. In addition to providing the advantages of performing complex calculations on a supercomputer, the Supercomputer/IRIS implementation of PLOT3D offers advanced 3-D, view manipulation, and animation capabilities. Shading and hidden line/surface removal can be used to enhance depth perception and other aspects of the graphical displays. A mouse can be used to translate, rotate, or zoom in on views. Files for several types of output can be produced. Two animation options are available. Simple animation sequences can be created on the IRIS, or, if an appropriately modified version of ARCGRAPH (ARC-12350) is accessible on the supercomputer, files can be created for use in GAS (Graphics Animation System, ARC-12379), an IRIS program which offers more complex rendering and animation capabilities and options for recording images to digital disk, video tape, or 16-mm film. The version 3.6b+ Supercomputer/IRIS implementations of PLOT3D (ARC-12779) and PLOT3D/TURB3D (ARC-12784) are suitable for use on CRAY 2/UNICOS, CONVEX, and ALLIANT computers with a remote Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstation. These programs are distributed on .25 inch magnetic tape cartridges in IRIS TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstations (ARC-12783, ARC-12782); (2) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777, ARC-12781); (3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785 - which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Electronics Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo, DN10000, and GMR3D are trademarks of Hewlett-Packard, Incorporated. System V is a trademark of Bell Labs, Incorporated. BSD4.3 is a trademark of the University of California at Berkeley. UNIX is a registered trademark of AT&T.
PLOT3D/AMES, APOLLO UNIX VERSION USING GMR3D (WITHOUT TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. The Apollo implementation of PLOT3D uses some of the capabilities of Apollo's 3-dimensional graphics hardware, but does not take advantage of the shading and hidden line/surface removal capabilities of the Apollo DN10000. Although this implementation does not offer a capability for putting text on plots, it does support the use of a mouse to translate, rotate, or zoom in on views. The version 3.6b+ Apollo implementations of PLOT3D (ARC-12789) and PLOT3D/TURB3D (ARC-12785) were developed for use on Apollo computers running UNIX System V with BSD 4.3 extensions and the graphics library GMR3D Version 2.0. The standard distribution media for each of these programs is a 9-track, 6250 bpi magnetic tape in TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: 1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, and Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); 2) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777, ARC-12781); 3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstations (ARC-12783, ARC-12782). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Electronics Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. UNIX is a registered trademark of AT&T.
PLOT3D/AMES, SGI IRIS VERSION (WITHOUT TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. In each of these areas, the IRIS implementation of PLOT3D offers advanced features which aid visualization efforts. Shading and hidden line/surface removal can be used to enhance depth perception and other aspects of the graphical displays. A mouse can be used to translate, rotate, or zoom in on views. Files for several types of output can be produced. Two animation options are even offered: creation of simple animation sequences without the need for other software; and, creation of files for use in GAS (Graphics Animation System, ARC-12379), an IRIS program which offers more complex rendering and animation capabilities and can record images to digital disk, video tape, or 16-mm film. The version 3.6b+ SGI implementations of PLOT3D (ARC-12783) and PLOT3D/TURB3D (ARC-12782) were developed for use on Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstations. These programs are each distributed on one .25 inch magnetic tape cartridge in IRIS TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, and Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); (2) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777,ARC-12781); (3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785 which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Electronics Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. UNIX is a registered trademark of AT&T.
PLOT3D/AMES, APOLLO UNIX VERSION USING GMR3D (WITH TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. The Apollo implementation of PLOT3D uses some of the capabilities of Apollo's 3-dimensional graphics hardware, but does not take advantage of the shading and hidden line/surface removal capabilities of the Apollo DN10000. Although this implementation does not offer a capability for putting text on plots, it does support the use of a mouse to translate, rotate, or zoom in on views. The version 3.6b+ Apollo implementations of PLOT3D (ARC-12789) and PLOT3D/TURB3D (ARC-12785) were developed for use on Apollo computers running UNIX System V with BSD 4.3 extensions and the graphics library GMR3D Version 2.0. The standard distribution media for each of these programs is a 9-track, 6250 bpi magnetic tape in TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: 1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, and Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); 2) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777, ARC-12781); 3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstations (ARC-12783, ARC-12782). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Electronics Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. UNIX is a registered trademark of AT&T.
PLOT3D/AMES, SGI IRIS VERSION (WITH TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. In each of these areas, the IRIS implementation of PLOT3D offers advanced features which aid visualization efforts. Shading and hidden line/surface removal can be used to enhance depth perception and other aspects of the graphical displays. A mouse can be used to translate, rotate, or zoom in on views. Files for several types of output can be produced. Two animation options are even offered: creation of simple animation sequences without the need for other software; and, creation of files for use in GAS (Graphics Animation System, ARC-12379), an IRIS program which offers more complex rendering and animation capabilities and can record images to digital disk, video tape, or 16-mm film. The version 3.6b+ SGI implementations of PLOT3D (ARC-12783) and PLOT3D/TURB3D (ARC-12782) were developed for use on Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D workstations. These programs are each distributed on one .25 inch magnetic tape cartridge in IRIS TAR format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, and Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); (2) VAX computers running VMS Version 5.0 and DISSPLA Version 11.0 (ARC-12777,ARC-12781); (3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785 which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Electronics Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. UNIX is a registered trademark of AT&T.
Laminated Thin Shell Structures Subjected to Free Vibration in a Hygrothermal Environment
NASA Technical Reports Server (NTRS)
Gotsis, Pascal K.; Guptill, James D.
1994-01-01
Parametric studies were performed to assess the effects of various parameters on the free-vibration behavior (natural frequencies) of (±θ)₂ angle-ply, fiber composite, thin shell structures in a hygrothermal environment. Knowledge of the natural frequencies of structures is important in considering their response to various kinds of excitation, especially when structures and force systems are complex and when excitations are not periodic. The three-dimensional, finite element structural analysis computer code CSTEM was used in the Cray Y-MP computer environment. The fiber composite shell was assumed to be cylindrical and made from T300 graphite fibers embedded in an intermediate-modulus, high-strength matrix. The following parameters were investigated: the length and the laminate thickness of the shell, the fiber orientation, the fiber volume fraction, the temperature profile through the thickness of the laminate, and laminates with different ply thicknesses. The results indicate that the fiber orientation and the length of the laminated shell had significant effects on the natural frequencies. The fiber volume fraction, the laminate thickness, and the temperature profile through the shell thickness had weak effects on the natural frequencies. Finally, the laminates with different ply thicknesses had an insignificant influence on the behavior of the vibrated laminated shell. Also, a single through-the-thickness, eight-node, three-dimensional composite finite element analysis appears to be sufficient for investigating the free-vibration behavior of thin, composite, angle-ply shell structures.
The factorization of large composite numbers on the MPP
NASA Technical Reports Server (NTRS)
Mckurdy, Kathy J.; Wunderlich, Marvin C.
1987-01-01
The continued fraction method for factoring large integers (CFRAC) was an ideal algorithm to be implemented on a massively parallel computer such as the Massively Parallel Processor (MPP). After much effort, the first 60-digit number was factored on the MPP using about 6 1/2 hours of array time. Although this result added about 10 digits to the size of numbers that could be factored using CFRAC on a serial machine, it was already badly beaten by the implementation of Davis and Holdridge on the CRAY-1 using the quadratic sieve, an algorithm which is clearly superior to CFRAC for large numbers. An algorithm is illustrated which is ideally suited to the single instruction multiple data (SIMD) massively parallel architecture, and some of the modifications which were needed in order to make the parallel implementation effective and efficient are described.
NASA Technical Reports Server (NTRS)
Korzennik, Sylvain
1997-01-01
Under the direction of Dr. Rhodes, and the technical supervision of Dr. Korzennik, the data assimilation of high spatial resolution solar dopplergrams has been carried out throughout the program on the Intel Delta Touchstone supercomputer. With the help of a research assistant, partially supported by this grant, and under the supervision of Dr. Korzennik, code development was carried out at SAO using various available resources. To ensure cross-platform portability, PVM was selected as the message passing library. A parallel implementation of power spectra computation for helioseismology data reduction using PVM was successfully completed. It was successfully ported to SMP architectures (i.e., SUN) and to some MPP architectures (i.e., the CM5). Due to limitations of the PVM implementation on the Cray T3D, the port to that architecture was not completed at the time.
Efficacy of Code Optimization on Cache-Based Processors
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
In this paper, a number of techniques for improving the cache performance of a representative piece of numerical software are presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses, but they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.
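A representative locality transformation of the kind evaluated in such studies is loop blocking (tiling), where a computation is restructured to work on cache-sized sub-blocks so data is reused before it is evicted. The sketch below shows the transformation applied to a matrix multiply; it is a language-neutral illustration rather than code from the paper, and the block and matrix sizes are arbitrary placeholders.

    # Illustrative loop blocking (tiling): the multiply proceeds over cache-sized tiles
    # so that each tile is reused before being evicted from cache.
    import numpy as np

    def blocked_matmul(A, B, bs=64):
        """C = A @ B computed tile by tile so each bs-by-bs block stays cache resident."""
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for i in range(0, n, bs):
            for k in range(0, n, bs):
                Aik = A[i:i+bs, k:k+bs]               # tile of A reused across all j blocks
                for j in range(0, n, bs):
                    C[i:i+bs, j:j+bs] += Aik @ B[k:k+bs, j:j+bs]
        return C

    if __name__ == "__main__":
        n = 256
        A = np.random.rand(n, n)
        B = np.random.rand(n, n)
        assert np.allclose(blocked_matmul(A, B), A @ B)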
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube
NASA Technical Reports Server (NTRS)
Joslin, Ronald D.; Zubair, Mohammad
1993-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with nonoptimized routines; slower-than-linear speedups are achieved with optimized (machine-dependent library) routines. This slower-than-linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and exhibits less than ideal speedups. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single-processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model, which reduces the required number of grid points and becomes a large-eddy simulation (PSLES), would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Scaling Properties of Algorithms in Nanotechnology
NASA Technical Reports Server (NTRS)
Saini, Subhash; Bailey, David H.; Chancellor, Marisa K. (Technical Monitor)
1996-01-01
At the present time, several technologies are pressing the limits of microminiature manufacturing. In semiconductor technology, for example, the Intel Pentium Pro (which is used in the Department of Energy's ASCI 'red' parallel supercomputer system) and the DEC Alpha 21164 (which is used in the CRAY T3E) both are fabricated using 0.35 micron process technology. Recently Texas Instruments (TI) announced the availability of 0.25 micron technology chips by the end of 1996 and plans to have 0.18 micron devices in production within two years. However, some significant challenges lie down the road. These include the skyrocketing cost of manufacturing plants, the 0.1 micron foreseeable limit of the photolithography process, quantum effects, data communication bandwidth limitations, heat dissipation, and others. Some related microminiature technologies include micro-electromechanical systems (MEMS), opto-electronic devices, quantum computing, biological computing, and others. All of these technologies require the fabrication of devices whose sizes are approaching the nanometer level. As such they are often collectively referred to with the name 'nanotechnology'. Clearly nanotechnology in this general sense is destined to be a very important technology of the 21st century. The ultimate dream in this arena is 'molecular nanotechnology', in other words the fabrication of devices and materials with most or all atoms and molecules in a pre-programmed position, possibly placed there by 'nano-robots'. This futuristic capability will probably not be achieved for at least two decades. However, it appears that somewhat less ambitious variations of molecular nanotechnology, such as devices and materials based on 'buckyballs' and 'nanotubes' may be realized significantly sooner, possibly within ten years or so. Even at the present time, semiconductor devices are approaching the regime where quantum chemical effects must be considered in design.
Enabling Graph Appliance for Genome Assembly
DOE Office of Scientific and Technical Information (OSTI.GOV)
Singh, Rina; Graves, Jeffrey A; Lee, Sangkeun
2015-01-01
In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotides, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.
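As a concrete illustration of the graph formulation described above, the following toy sketch builds a de Bruijn graph from reads ((k-1)-mer nodes, k-mer edges) and recovers an assembly by walking an Eulerian path with Hierholzer's algorithm. It is an in-memory stand-in for the paper's RDF/SPARQL approach on Urika-GD, and the reads and k value are invented for illustration.

    # Toy de Bruijn graph construction and Eulerian-path assembly (Hierholzer's algorithm).
    from collections import defaultdict

    def de_bruijn(reads, k):
        """Return adjacency lists: (k-1)-mer node -> list of successor (k-1)-mer nodes."""
        kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
        graph = defaultdict(list)
        for kmer in kmers:
            graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
        return graph

    def eulerian_path(graph):
        """Hierholzer's algorithm; assumes an Eulerian path exists."""
        out_deg = {v: len(ns) for v, ns in graph.items()}
        in_deg = defaultdict(int)
        for ns in graph.values():
            for w in ns:
                in_deg[w] += 1
        # Start at a node whose out-degree exceeds its in-degree by one, if any.
        start = next((v for v in graph if out_deg[v] - in_deg[v] == 1), next(iter(graph)))
        adj = {v: list(ns) for v, ns in graph.items()}
        stack, path = [start], []
        while stack:
            v = stack[-1]
            if adj.get(v):
                stack.append(adj[v].pop())
            else:
                path.append(stack.pop())
        return path[::-1]

    if __name__ == "__main__":
        reads = ["ATGGCG", "GGCGTG", "CGTGCA"]        # substrings of ATGGCGTGCA
        path = eulerian_path(de_bruijn(reads, k=4))
        print(path[0] + "".join(node[-1] for node in path[1:]))   # assembled sequence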
Multigrid Methods in Electronic Structure Calculations
NASA Astrophysics Data System (ADS)
Briggs, Emil
1996-03-01
Multigrid techniques have become the method of choice for a broad range of computational problems. Their use in electronic structure calculations introduces a new set of issues when compared to traditional plane wave approaches. We have developed a set of techniques that address these issues and permit multigrid algorithms to be applied to the electronic structure problem in an efficient manner. In our approach the Kohn-Sham equations are discretized on a real-space mesh using a compact representation of the Hamiltonian. The resulting equations are solved directly on the mesh using multigrid iterations. This produces rapid convergence rates even for ill-conditioned systems with large length and/or energy scales. The method has been applied to both periodic and non-periodic systems containing over 400 atoms, and the results are in very good agreement with both theory and experiment. Example applications include a vacancy in diamond, an isolated C60 molecule, and a 64-atom cell of GaN with the Ga d-electrons in valence, which required a 250 Ry cutoff. A particular strength of a real-space multigrid approach is its ready adaptability to massively parallel computer architectures. The compact representation of the Hamiltonian is especially well suited to such machines. Tests on the Cray-T3D have shown nearly linear scaling of the execution time up to the maximum number of processors (512). The MPP implementation has been used for studies of a large Amyloid Beta Peptide (C_146O_45N_42H_210) found in the brains of Alzheimer's disease patients. Further applications of the multigrid method will also be described. (in collaboration with D. J. Sullivan and J. Bernholc)
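The rapid convergence claimed above rests on the standard multigrid cycle: smooth on the fine grid, restrict the residual to a coarser grid, solve (recursively) for a coarse-grid correction, prolongate it back, and smooth again. The sketch below applies a textbook V-cycle to a 1-D Poisson problem; it is a generic illustration of the iteration rather than the real-space Kohn-Sham solver described above, and the smoother, grid size, and right-hand side are illustrative choices.

    # A textbook 1-D multigrid V-cycle for -u'' = f with homogeneous Dirichlet boundaries.
    import numpy as np

    def smooth(u, f, h, sweeps=3, omega=2.0 / 3.0):
        """Weighted-Jacobi relaxation for the 1-D Poisson operator."""
        for _ in range(sweeps):
            u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
        return u

    def residual(u, f, h):
        r = np.zeros_like(u)
        r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
        return r

    def restrict(r):
        """Full-weighting restriction to the next coarser grid."""
        nc = (len(r) - 1) // 2
        rc = np.zeros(nc + 1)
        rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
        return rc

    def prolong(ec):
        """Linear interpolation back to the finer grid."""
        ef = np.zeros(2 * (len(ec) - 1) + 1)
        ef[::2] = ec
        ef[1::2] = 0.5 * (ec[:-1] + ec[1:])
        return ef

    def v_cycle(u, f, h):
        if len(u) <= 3:                              # coarsest grid: one unknown, solve exactly
            u[1] = 0.5 * (h * h * f[1] + u[0] + u[2])
            return u
        u = smooth(u, f, h)
        rc = restrict(residual(u, f, h))
        ec = v_cycle(np.zeros_like(rc), rc, 2 * h)   # coarse-grid correction
        u += prolong(ec)
        return smooth(u, f, h)

    if __name__ == "__main__":
        n = 128
        h = 1.0 / n
        x = np.linspace(0.0, 1.0, n + 1)
        f = np.pi ** 2 * np.sin(np.pi * x)           # exact solution is sin(pi x)
        u = np.zeros(n + 1)
        for it in range(10):
            u = v_cycle(u, f, h)
            print(it, np.max(np.abs(residual(u, f, h))))
        print("error vs exact:", np.max(np.abs(u - np.sin(np.pi * x))))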
Comparison of Nonequilibrium Solution Algorithms Applied to Chemically Stiff Hypersonic Flows
NASA Technical Reports Server (NTRS)
Palmer, Grant; Venkatapathy, Ethiraj
1995-01-01
Three solution algorithms, explicit under-relaxation, point implicit, and lower-upper symmetric Gauss-Seidel, are used to compute nonequilibrium flow around the Apollo 4 return capsule at the 62-km altitude point in its descent trajectory. By varying the Mach number, the efficiency and robustness of the solution algorithms were tested for different levels of chemical stiffness. The performance of the solution algorithms degraded as the Mach number and stiffness of the flow increased. At Mach 15 and 30, the lower-upper symmetric Gauss-Seidel method produces an eight-order-of-magnitude drop in the energy residual in one-third to one-half the Cray C-90 computer time as compared to the point implicit and explicit under-relaxation methods. The explicit under-relaxation algorithm experienced convergence difficulties at Mach 30 and above. At Mach 40 the performance of the lower-upper symmetric Gauss-Seidel algorithm deteriorates to the point that it is outperformed by the point implicit method. The effects of the viscous terms are investigated. Grid dependency questions are explored.
User interface user's guide for HYPGEN
NASA Technical Reports Server (NTRS)
Chiu, Ing-Tsau
1992-01-01
The user interface (UI) of HYPGEN is developed using Panel Library to shorten the learning curve for new users and provide easier ways to run HYPGEN for casual users as well as for advanced users. Menus, buttons, sliders, and type-in fields are used extensively in UI to allow users to point and click with a mouse to choose various available options or to change values of parameters. On-line help is provided to give users information on using UI without consulting the manual. Default values are set for most parameters and boundary conditions are determined by UI to further reduce the effort needed to run HYPGEN; however, users are free to make any changes and save it in a file for later use. A hook to PLOT3D is built in to allow graphics manipulation. The viewpoint and min/max box for PLOT3D windows are computed by UI and saved in a PLOT3D journal file. For large grids which take a long time to generate on workstations, the grid generator (HYPGEN) can be run on faster computers such as Crays, while UI stays at the workstation.
Galactic scale gas flows in colliding galaxies: 3-dimensional, N-body/hydrodynamics experiments
NASA Technical Reports Server (NTRS)
Lamb, Susan A.; Gerber, Richard A.; Balsara, Dinshaw S.
1994-01-01
We present some results from three-dimensional computer simulations of collisions between models of equal mass galaxies, one of which is a rotating disk galaxy containing both gas and stars and the other an elliptical containing stars only. We use fully self-consistent models in which the halo mass is 2.5 times that of the disk. In the experiments we have varied the impact parameter between zero (head on) and 0.9R (where R is the radius of the disk), for impacts perpendicular to the disk plane. The calculations were performed on a Cray 2 computer using a combined N-body/smoothed particle hydrodynamics (SPH) program. The results show the development of complicated flows and shock structures in the direction perpendicular to the plane of the disk and the propagation outwards of a density wave in both the stars and the gas. The collisional nature of the gas results in a sharper ring than obtained for the star particles, and the development of high volume densities and shocks.
RAID/C90 Technology Integration
NASA Technical Reports Server (NTRS)
Ciotti, Bob; Cooper, D. M. (Technical Monitor)
1994-01-01
In March 1993, NAS was the first to connect a Maximum Strategy RAID disk to the C90 using standard Cray provided software. This paper discusses the problems encountered, lessons learned, and performance achieved.
Parallel 3D Mortar Element Method for Adaptive Nonconforming Meshes
NASA Technical Reports Server (NTRS)
Feng, Huiyu; Mavriplis, Catherine; VanderWijngaart, Rob; Biswas, Rupak
2004-01-01
High order methods are frequently used in computational simulation for their high accuracy. An efficient way to avoid unnecessary computation in smooth regions of the solution is to use adaptive meshes which employ fine grids only in areas where they are needed. Nonconforming spectral elements allow the grid to be flexibly adjusted to satisfy the computational accuracy requirements. The method is suitable for computational simulations of unsteady problems with very disparate length scales or unsteady moving features, such as heat transfer, fluid dynamics or flame combustion. In this work, we select the Mortar Element Method (MEM) to handle the non-conforming interfaces between elements. A new technique is introduced to efficiently implement MEM in 3-D nonconforming meshes. By introducing an "intermediate mortar", the proposed method decomposes the projection between 3-D elements and mortars into two steps. In each step, projection matrices derived in 2-D are used. The two-step method avoids explicitly forming/deriving large projection matrices for 3-D meshes, and also helps to simplify the implementation. This new technique can be used for both h- and p-type adaptation. This method is applied to an unsteady 3-D moving heat source problem. With our new MEM implementation, mesh adaptation is able to efficiently refine the grid near the heat source and coarsen the grid once the heat source passes. The savings in computational work resulting from the dynamic mesh adaptation are demonstrated by the reduction of the number of elements used and the CPU time spent. MEM and mesh adaptation, respectively, bring irregularity and dynamics to the computer memory access pattern. Hence, they provide a good way to gauge the performance of computer systems when running scientific applications whose memory access patterns are irregular and unpredictable. We select a 3-D moving heat source problem as the Unstructured Adaptive (UA) grid benchmark, a new component of the NAS Parallel Benchmarks (NPB). In this paper, we present some interesting performance results of our OpenMP parallel implementation on different architectures such as the SGI Origin2000, SGI Altix, and Cray MTA-2.
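The two-step projection described above rests on a tensor-product identity: applying a large Kronecker-product projection matrix to face data is equivalent to applying the two smaller factor matrices one direction at a time, so the large matrix never has to be formed. The sketch below checks that identity with random stand-in matrices; it is a generic illustration, not the paper's mortar operators.

    # Check that (P kron Q) applied to row-major-vectorized face data equals applying
    # P along the first direction and Q along the second, without forming the big matrix.
    import numpy as np

    n_el, n_mortar = 6, 9                       # points per direction on element face / mortar
    P = np.random.rand(n_mortar, n_el)          # 1-D projection, first direction
    Q = np.random.rand(n_mortar, n_el)          # 1-D projection, second direction
    X = np.random.rand(n_el, n_el)              # field values on one element face

    one_step = np.kron(P, Q) @ X.reshape(-1)    # explicit (n_mortar^2 x n_el^2) matrix
    two_step = P @ X @ Q.T                      # two small matrix products instead

    assert np.allclose(one_step, two_step.reshape(-1))
    print("two-step projection matches the explicit Kronecker form")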
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, George; Marquez, Andres; Choudhury, Sutanay
2012-09-01
Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
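To make the notion of a triad census concrete, here is a toy, brute-force census for an undirected graph, grouping triads by how many of the three possible edges are present. The paper's algorithm handles the full directed census (16 isomorphism classes) and is heavily optimized; this sketch only illustrates what is being counted:

```python
from itertools import combinations

def triad_census_undirected(nodes, edges):
    # Count triads by the number of edges present among each triple of
    # nodes (0, 1, 2, or 3). Brute force O(n^3): exactly the cost that
    # makes census computation hard at tens of millions of nodes.
    edge_set = {frozenset(e) for e in edges}
    census = [0, 0, 0, 0]
    for tri in combinations(nodes, 3):
        k = sum(frozenset(p) in edge_set for p in combinations(tri, 2))
        census[k] += 1
    return census

# One closed triangle (1-2-3) and three one-edge triads involving node 4.
print(triad_census_undirected([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3)]))
# -> [0, 3, 0, 1]
```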
NASA Astrophysics Data System (ADS)
Kjærgaard, Thomas; Baudin, Pablo; Bykov, Dmytro; Eriksen, Janus Juul; Ettenhuber, Patrick; Kristensen, Kasper; Larkin, Jeff; Liakh, Dmitry; Pawłowski, Filip; Vose, Aaron; Wang, Yang Min; Jørgensen, Poul
2017-03-01
We present a scalable cross-platform hybrid MPI/OpenMP/OpenACC implementation of the Divide-Expand-Consolidate (DEC) formalism with portable performance on heterogeneous HPC architectures. The Divide-Expand-Consolidate formalism is designed to reduce the steep computational scaling of conventional many-body methods employed in electronic structure theory to linear scaling, while providing a simple mechanism for controlling the error introduced by this approximation. Our massively parallel implementation of this general scheme has three levels of parallelism, being a hybrid of the loosely coupled task-based parallelization approach and the conventional MPI+X programming model, where X is either OpenMP or OpenACC. We demonstrate strong and weak scalability of this implementation on heterogeneous HPC systems, namely on the GPU-based Cray XK7 Titan supercomputer at the Oak Ridge National Laboratory. Using the "resolution of the identity second-order Møller-Plesset perturbation theory" (RI-MP2) as the physical model for simulating correlated electron motion, the linear-scaling DEC implementation is applied to 1-aza-adamantane-trione (AAT) supramolecular wires containing up to 40 monomers (2440 atoms, 6800 correlated electrons, 24 440 basis functions and 91 280 auxiliary functions). This represents the largest molecular system treated at the MP2 level of theory, demonstrating an efficient removal of the scaling wall pertinent to conventional quantum many-body methods.
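The outermost of the three levels of parallelism is the loosely coupled task layer. A minimal mpi4py sketch of such a manager/worker scheme is given below; solve_fragment is a hypothetical stand-in for the per-fragment work, inside which the OpenMP or OpenACC level would live, and task results are assumed to be non-None:

```python
from mpi4py import MPI

def run_fragment_tasks(fragment_tasks, solve_fragment):
    # Rank 0 acts as the task manager; every other rank is a worker.
    # Requires at least 2 MPI processes.
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    if rank == 0:
        pending = list(fragment_tasks)
        results, stopped, status = [], 0, MPI.Status()
        while stopped < size - 1:
            msg = comm.recv(source=MPI.ANY_SOURCE, status=status)
            if msg is not None:            # a finished fragment result
                results.append(msg)
            if pending:                    # hand out the next fragment
                comm.send(pending.pop(), dest=status.Get_source())
            else:                          # no work left: stop this worker
                comm.send(None, dest=status.Get_source())
                stopped += 1
        return results
    comm.send(None, dest=0)                # worker announces readiness
    while True:
        task = comm.recv(source=0)
        if task is None:
            return None
        comm.send(solve_fragment(task), dest=0)
```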
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mckie, Jim
2012-01-09
This report documents the results of work done over a six-year period under the FAST-OS programs. The first effort, called Right-Weight Kernels (RWK), was concerned with improving measurements of OS noise so it could be treated quantitatively, and with evaluating the use of two operating systems, Linux and Plan 9, on HPC systems to determine how these operating systems needed to be extended or changed for HPC while still retaining their general-purpose nature. The second program, HARE, explored the creation of alternative runtime models, building on RWK. All of the HARE work was done on Plan 9. The HARE researchers were mindful of the very good Linux and LWK work being done at other labs and saw no need to recreate it. Even given this limited funding, the two efforts had outsized impact:
- Helped Cray decide to use Linux, instead of a custom kernel, and provided the tools needed to make Linux perform well
- Created a successor operating system to Plan 9, NIX, which has been taken in by Bell Labs for further development
- Created a standard system measurement tool, Fixed Time Quantum (FTQ), which is widely used for measuring operating system impact on applications
- Spurred the use of the 9p protocol in several organizations, including IBM
- Built software in use at many companies, including IBM, Cray, and Google
- Spurred the creation of alternative runtimes for use on HPC systems
- Demonstrated that, with proper modifications, a general-purpose operating system can provide communications up to 3 times as effective as user-level libraries
Open source was a key part of this work. The code developed for this project is in wide use and available in many places. The core Blue Gene code is available at https://bitbucket.org/ericvh/hare. We describe details of these impacts in the following sections. The rest of this report is organized as follows: first, we describe commercial impact; next, we describe the FTQ benchmark and its impact in more detail; operating systems and runtime research follows; we discuss infrastructure software; and we close with a description of the new NIX operating system, future work, and conclusions.
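The FTQ idea mentioned above is simple enough to sketch. The following Python fragment counts how much unit work fits into each fixed-length quantum; dips in the per-quantum counts reveal OS noise. The real FTQ benchmark is C code driven by the hardware cycle counter, so the timer and the loop body here are only stand-ins:

```python
import time

def ftq(quantum_s=0.001, n_samples=1000):
    # Count units of work completed in each fixed-length time quantum.
    # On a noiseless machine the counts are flat; OS interference shows
    # up as dips, which can then be analyzed quantitatively.
    samples = []
    end = time.monotonic() + quantum_s
    for _ in range(n_samples):
        count = 0
        while time.monotonic() < end:
            count += 1          # one unit of "work"
        samples.append(count)
        end += quantum_s        # keep quantum boundaries fixed
    return samples

counts = ftq()
print(min(counts), max(counts))  # spread indicates noise
```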
Modeling and new equipment definition for the vibration isolation box equipment system
NASA Technical Reports Server (NTRS)
Sani, Robert L.
1993-01-01
Our MSAD-funded research project is to provide numerical modeling support for VIBES (Vibration Isolation Box Experiment System), an IML2 flight experiment being built by the Japanese research team of Dr. H. Azuma of the Japanese National Aerospace Laboratory. During this reporting period, the following have been accomplished: A semi-consistent mass finite element projection algorithm for 2D and 3D Boussinesq flows has been implemented on Sun, HP, and Cray platforms. The algorithm has better phase speed accuracy than similar finite difference or lumped mass finite element algorithms, an attribute which is essential for addressing realistic g-jitter effects as well as convectively-dominated transient systems. The projection algorithm has been benchmarked against solutions generated via the commercial code FIDAP, and it appears to be accurate as well as computationally efficient. Optimization and potential parallelization studies are underway; our implementation to date has focused on execution of the basic algorithm with at most a concern for vectorization. The initial time-varying gravity Boussinesq flow simulation is being set up: the mesh is being designed and the input file is being generated. Some preliminary 'small mesh' cases will be attempted on our HP9000/735 while our request to MSAD for supercomputing resources is being addressed. The Japanese research team for VIBES was visited, the current setup and status of the physical experiment were obtained, and an ongoing e-mail communication link was established.
SHABERTH - ANALYSIS OF A SHAFT BEARING SYSTEM (CRAY VERSION)
NASA Technical Reports Server (NTRS)
Coe, H. H.
1994-01-01
The SHABERTH computer program was developed to predict operating characteristics of bearings in a multibearing load support system. Lubricated and non-lubricated bearings can be modeled. SHABERTH calculates the loads, torques, temperatures, and fatigue life for ball and/or roller bearings on a single shaft. The program also allows for an analysis of the system reaction to the termination of lubricant supply to the bearings and other lubricated mechanical elements. SHABERTH has proven to be a valuable tool in the design and analysis of shaft bearing systems. The SHABERTH program is structured with four nested calculation schemes. The thermal scheme performs steady state and transient temperature calculations which predict system temperatures for a given operating state. The bearing dimensional equilibrium scheme uses the bearing temperatures, predicted by the temperature mapping subprograms, and the rolling element raceway load distribution, predicted by the bearing subprogram, to calculate bearing diametral clearance for a given operating state. The shaft-bearing system load equilibrium scheme calculates bearing inner ring positions relative to the respective outer rings such that the external loading applied to the shaft is brought into equilibrium by the rolling element loads which develop at each bearing inner ring for a given operating state. The bearing rolling element and cage load equilibrium scheme calculates the rolling element and cage equilibrium positions and rotational speeds based on the relative inner-outer ring positions, inertia effects, and friction conditions. The ball bearing subprograms in the current SHABERTH program have several model enhancements over similar programs. These enhancements include an elastohydrodynamic (EHD) film thickness model that accounts for thermal heating in the contact area and lubricant film starvation; a new model for traction combined with an asperity load sharing model; a model for the hydrodynamic rolling and shear forces in the inlet zone of lubricated contacts, which accounts for the degree of lubricant film starvation; modeling normal and friction forces between a ball and a cage pocket, which account for the transition between the hydrodynamic and elastohydrodynamic regimes of lubrication; and a model of the effect on fatigue life of the ratio of the EHD plateau film thickness to the composite surface roughness. SHABERTH is intended to be as general as possible. The models in SHABERTH allow for the complete mathematical simulation of real physical systems. Systems are limited to a maximum of five bearings supporting the shaft, a maximum of thirty rolling elements per bearing, and a maximum of one hundred temperature nodes. The SHABERTH program structure is modular and has been designed to permit refinement and replacement of various component models as the need and opportunities develop. A preprocessor is included in the IBM PC version of SHABERTH to provide a user friendly means of developing SHABERTH models and executing the resulting code. The preprocessor allows the user to create and modify data files with minimal effort and a reduced chance for errors. Data is utilized as it is entered; the preprocessor then decides what additional data is required to complete the model. Only this required information is requested. The preprocessor can accommodate data input for any SHABERTH compatible shaft bearing system model. The system may include ball bearings, roller bearings, and/or tapered roller bearings. 
SHABERTH is written in FORTRAN 77, and two machine versions are available from COSMIC. The CRAY version (LEW-14860) has a RAM requirement of 176K of 64-bit words. The IBM PC version (MFS-28818) is written for IBM PC series and compatible computers running MS-DOS, and includes a sample MS-DOS executable. For execution, the PC version requires at least 1 MB of RAM and an 80386 or 486 machine with an 80x87 math co-processor. The standard distribution medium for the IBM PC version is a set of two 5.25 inch 360K MS-DOS format diskettes. The contents of the diske
Performance of VPIC on Trinity
NASA Astrophysics Data System (ADS)
Nystrom, W. D.; Bergen, B.; Bird, R. F.; Bowers, K. J.; Daughton, W. S.; Guo, F.; Li, H.; Nam, H. A.; Pang, X.; Rust, W. N., III; Wohlbier, J.; Yin, L.; Albright, B. J.
2016-10-01
Trinity is a new major DOE computing resource which is going through final acceptance testing at Los Alamos National Laboratory. Trinity has several new and unique architectural features, including two compute partitions, one with dual-socket Intel Haswell Xeon compute nodes and one with Intel Knights Landing (KNL) Xeon Phi compute nodes. Additional unique features include the use of on-package high-bandwidth memory (HBM) on the KNL nodes, the ability to configure the KNL nodes with respect to HBM mode and on-die network topology in a variety of operational modes at run time, and the use of solid state storage via burst buffer technology to reduce the time required to perform I/O. An effort is in progress to port and optimize VPIC for Trinity and evaluate its performance. Because VPIC was recently released as open source, it is being used as part of acceptance testing for Trinity and is participating in the Trinity Open Science Program, which has resulted in excellent collaboration with both Cray and Intel. Results of this work will be presented on the performance of VPIC on both the Haswell and KNL partitions, for both single-node runs and runs at scale. Work performed under the auspices of the U.S. Dept. of Energy by the Los Alamos National Security, LLC, Los Alamos National Laboratory under contract DE-AC52-06NA25396 and supported by the LANL LDRD program.
15. BUILDING 239. SECTIONS AND DETAILS OF DRYING ROOMS AND MIXING ROOMS. March 6, 1941. - Frankford Arsenal, Building Nos. 239-239A, Southeast corner of Clay Street & Cray Road, Philadelphia, Philadelphia County, PA
Late evolution of very low mass X-ray binaries sustained by radiation from their primaries
NASA Technical Reports Server (NTRS)
Ruderman, M.; Shaham, J.; Tavani, M.; Eichler, D.
1989-01-01
The accretion-powered radiation from the X-ray pulsar system Her X-1 (McCray et al. 1982) is studied. The changes in the soft X-ray and gamma-ray flux and in the accompanying electron-positron wind are discussed. These are believed to be associated with the inward movement of the inner edge of the accretion disk corresponding to the boundary with the neutron star's corotating magnetosphere (Alfven radius). LMXB evolution which is self-sustained by secondary winds intercepting the radiation emitted near an LMXB neutron star is investigated as well.
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database
Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...
2013-01-01
Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes, which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm for creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) it can create ghost copies of any permissible topological order in a 1D, 2D or 3D mesh based on selected adjacencies; (2) it exploits neighborhood communication patterns during the ghost creation process, thus eliminating all-to-all communication; (3) for applications that need neighbors of neighbors, the algorithm can create n ghost layers, up to the point where the whole partitioned mesh is ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.
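A minimal mpi4py sketch of one neighborhood-only ghost-exchange round in the spirit of characteristic (2) follows; needed_by is a hypothetical helper that reports which neighbor parts need a ghost copy of a given boundary entity:

```python
from mpi4py import MPI

def one_ghost_round(comm, owned, neighbor_ranks, needed_by):
    # Gather, per neighbor, the boundary entities it needs ghost copies
    # of. Only true mesh neighbors are contacted: no all-to-all.
    outgoing = {r: [e for e in owned if r in needed_by(e)]
                for r in neighbor_ranks}
    reqs = [comm.isend(ents, dest=r, tag=11)
            for r, ents in outgoing.items()]
    # Receive the ghost copies this rank needs from each neighbor.
    ghosts = {r: comm.recv(source=r, tag=11) for r in neighbor_ranks}
    for req in reqs:
        req.wait()
    return ghosts  # repeat with ghosts included to build layer n+1
```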
Multiprocessing on supercomputers for computational aerodynamics
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Mehta, Unmeel B.
1990-01-01
Very little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability of 100 MFLOPS or more) in computational aerodynamics to significantly improve turnaround time. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, the improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instruction, multiple data (MIMD) through multi-tasking is applied via a strategy which requires relatively minor modifications to an existing code for a single processor. Essentially, this approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. The existing single-processor code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor. As a demonstration of this approach, a Multiple Processor Multiple Grid (MPMG) code is developed. It is capable of using nine processors, and can easily be extended to a larger number of processors. This code solves the three-dimensional, Reynolds-averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. The solver is applied to a generic oblique-wing aircraft problem on a four-processor Cray-2 computer. A tricubic interpolation scheme is developed to increase the accuracy of coupling of overlapped grids. For the oblique-wing aircraft problem, a speedup of two in elapsed (turnaround) time is observed in a saturated time-sharing environment.
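A loose structural analogue of the multiple-processor/multiple-grid idea, using Python's multiprocessing in place of Cray macrotasking (relax_zone is a stand-in for one implicit step on one zone, and the exchange of overlap-boundary data between zones is omitted):

```python
import numpy as np
from multiprocessing import Pool

def relax_zone(zone):
    # Stand-in for one solver step on one overlapped grid zone:
    # a single Jacobi relaxation of the zone interior.
    z = zone.copy()
    z[1:-1, 1:-1] = 0.25 * (zone[:-2, 1:-1] + zone[2:, 1:-1] +
                            zone[1:-1, :-2] + zone[1:-1, 2:])
    return z

def mpmg_step(zones, n_procs=4):
    # Advance every grid zone concurrently, one task per processor.
    with Pool(n_procs) as pool:
        return pool.map(relax_zone, zones)

if __name__ == "__main__":
    zones = [np.random.rand(16, 16) for _ in range(9)]  # nine zones, as in MPMG
    zones = mpmg_step(zones)
```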
The coupling of fluids, dynamics, and controls on advanced architecture computers
NASA Technical Reports Server (NTRS)
Atwood, Christopher
1995-01-01
This grant provided for the demonstration of coupled controls, body dynamics, and fluids computations in a workstation cluster environment, and for an investigation of the impact of peer-to-peer communication on flow solver performance and robustness. The findings of these investigations were documented in the conference articles. The attached publication, 'Towards Distributed Fluids/Controls Simulations', documents the solution and scaling of the coupled Navier-Stokes, Euler rigid-body dynamics, and state feedback control equations for a two-dimensional canard-wing. The poor scaling shown was due to serialized grid connectivity computation and Ethernet bandwidth limits. The scaling of a peer-to-peer communication flow code on an IBM SP-2 was also shown. The scaling of the code on the switched fabric-linked nodes was good, with a 2.4 percent loss due to communication of intergrid boundary point information. The code performance on 30 worker nodes was 1.7 microseconds/point/iteration, a factor of three over a Cray C-90 head. The attached paper, 'Nonlinear Fluid Computations in a Distributed Environment', documents the effect of several computational rate enhancing methods on convergence. For the cases shown, the highest throughput was achieved using boundary updates at each step, with the manager process performing communication tasks only. Constrained domain decomposition of the implicit fluid equations did not degrade the convergence rate or the final solution. The scaling of a coupled body/fluid dynamics problem on an Ethernet-linked cluster was also shown.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high-performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments, due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is to 1) develop highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, and 3) incorporate the newly developed algorithms into actual simulation packages. The work plan has been achieved. Two highly accurate, efficient Poisson solvers have been developed and tested, based on two different approaches: (1) adopting a mathematical geometry which has a better capacity to describe the fluid, and (2) using a compact scheme to gain high-order accuracy in the numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
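For context, the PDD and RPDD algorithms mentioned above are built around the classic sequential tridiagonal kernel. A sketch of that kernel (the Thomas algorithm) follows, with a comment on where PDD departs from it; this is background, not the project's code:

```python
import numpy as np

def thomas(a, b, c, d):
    # Sequential tridiagonal solve: a = sub-diagonal, b = diagonal,
    # c = super-diagonal, d = right-hand side, all length n with a[0]
    # and c[-1] unused. In PDD each processor runs a kernel like this
    # on its local block, then neglects the provably small off-diagonal
    # corrections to avoid global communication.
    n = len(b)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Diagonally dominant test system whose solution is all ones.
print(thomas(np.r_[0, -1, -1, -1.], np.full(4, 4.0),
             np.r_[-1, -1, -1, 0.], np.array([3, 2, 2, 3.])))
```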
Finite element analysis of human joints
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bossart, P.L.; Hollerbach, K.
1996-09-01
Our work focuses on the development of finite element models (FEMs) that describe the biomechanics of human joints. Finite element modeling is becoming a standard tool in industrial applications. In highly complex problems such as those found in biomechanics research, however, the full potential of FEMs is just beginning to be explored, due to the absence of precise, high resolution medical data and the difficulties encountered in converting these enormous datasets into a form that is usable in FEMs. With increasing computing speed and memory available, it is now feasible to address these challenges. We address the first by acquiring data with a high resolution X-ray CT scanner, and the latter by developing a semi-automated method for generating the volumetric meshes used in the FEM. Issues related to tomographic reconstruction, volume segmentation, the use of extracted surfaces to generate volumetric hexahedral meshes, and applications of the FEM are described.
Networking for large-scale science: infrastructure, provisioning, transport and application mapping
NASA Astrophysics Data System (ADS)
Rao, Nageswara S.; Carter, Steven M.; Wu, Qishi; Wing, William R.; Zhu, Mengxia; Mezzacappa, Anthony; Veeraraghavan, Malathi; Blondin, John M.
2005-01-01
Large-scale science computations and experiments require unprecedented network capabilities in the form of large bandwidth and dynamically stable connections to support data transfers, interactive visualizations, and monitoring and steering operations. A number of component technologies dealing with the infrastructure, provisioning, transport and application mappings must be developed and/or optimized to achieve these capabilities. We present a brief account of the following technologies that contribute toward achieving these network capabilities: (a) DOE UltraScienceNet and NSF CHEETAH network testbeds that provide on-demand and scheduled dedicated network connections; (b) experimental results on transport protocols that achieve close to 100% utilization on dedicated 1 Gbps wide-area channels; (c) a scheme for optimally mapping a visualization pipeline onto a network to minimize the end-to-end delays; and (d) interconnect configurations and protocols that provide multiple Gbps flows from a Cray X1 to external hosts.
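Item (c) can be illustrated with a toy dynamic program for mapping a linear pipeline of m stages, in order, onto a path of n network nodes; the cost model and names below are assumptions for illustration, not the paper's actual scheme:

```python
def map_pipeline(comp_cost, link_delay):
    # comp_cost[i][j]: delay of running stage i on node j.
    # link_delay[j]:   delay of the link from node j to node j+1.
    # Stages are placed in order along the node path; we minimize the
    # total computation-plus-transport delay.
    m, n = len(comp_cost), len(comp_cost[0])
    INF = float("inf")
    best = [[INF] * n for _ in range(m)]
    best[0] = comp_cost[0][:]
    for i in range(1, m):
        for j in range(n):
            for k in range(j + 1):              # previous stage at node k <= j
                hop = sum(link_delay[k:j])      # zero if co-located
                best[i][j] = min(best[i][j],
                                 best[i - 1][k] + hop + comp_cost[i][j])
    return min(best[-1])

# Two stages, two nodes: best mapping is stage 0 on node 0, stage 1 on
# node 1, for a total delay of 2 + 1 + 2 = 5.
print(map_pipeline([[2, 9], [9, 2]], [1]))
```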
Analysis, tuning and comparison of two general sparse solvers for distributed memory computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amestoy, P.R.; Duff, I.S.; L'Excellent, J.-Y.
2000-06-30
We describe the work performed in the context of a Franco-Berkeley funded project between NERSC-LBNL, located in Berkeley (USA), and CERFACS-ENSEEIHT, located in Toulouse (France). We discuss both the tuning and the performance analysis of two distributed memory sparse solvers (superlu from Berkeley and mumps from Toulouse) on the 512-processor Cray T3E from NERSC (Lawrence Berkeley National Laboratory). This project gave us the opportunity to improve the algorithms and add new features to the codes. We then quite extensively analyze and compare the two approaches on a set of large problems from real applications. We further explain the main differences in the behavior of the approaches on artificial regular grid problems. As a conclusion to this activity report, we mention a set of parallel sparse solvers to which this type of study should be extended.
2012-02-10
Then and Now: These images illustrate the dramatic improvement in NASA computing power over the last 23 years, and its effect on the number of grid points used for flow simulations. At left, an image from the first full-body Navier-Stokes simulation (1988) of an F-16 fighter jet showing pressure on the aircraft body, and fore-body streamlines at Mach 0.90. This steady-state solution took 25 hours using a single Cray X-MP processor to solve the 500,000 grid-point problem. Investigator: Neal Chaderjian, NASA Ames Research Center. At right, a 2011 snapshot from a Navier-Stokes simulation of a V-22 Osprey rotorcraft in hover. The blade vortices interact with the smaller turbulent structures. This very detailed simulation used 660 million grid points, and ran on 1536 processors of the Pleiades supercomputer for 180 hours. Investigator: Neal Chaderjian, NASA Ames Research Center; Image: Tim Sandstrom, NASA Ames Research Center
NASA Technical Reports Server (NTRS)
Wang, Xiao-Yen; Chow, Chuen-Yen; Chang, Sin-Chung
1998-01-01
Without resorting to special treatment for each individual test case, the 1D and 2D CE/SE shock-capturing schemes described previously (in Part I) are used to simulate flows involving phenomena such as shock waves, contact discontinuities, expansion waves and their interactions. Five 1D and six 2D problems are considered to examine the capability and robustness of these schemes. Despite their simple logical structures and low computational cost (for the 2D CE/SE shock-capturing scheme, the CPU time is about 2 microseconds per mesh point per marching step on a Cray C90 machine), the numerical results, when compared with experimental data, exact solutions or numerical solutions by other methods, indicate that these schemes can accurately and consistently resolve shock and contact discontinuities.
Parallel performance optimizations on unstructured mesh-based simulations
Sarje, Abhinav; Song, Sukhyun; Jacobsen, Douglas; ...
2015-06-01
This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitionings with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data from runs on thousands of cores of a Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.
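One simple flavor of cache-friendly ordering is to renumber mesh cells in breadth-first order so that cells accessed together sit close in memory; the sketch below is an illustrative assumption about this class of techniques, not necessarily the predictive ordering used in the paper:

```python
from collections import deque

def bfs_reorder(neighbors, start=0):
    # Renumber mesh cells in breadth-first order over the cell
    # adjacency graph, so neighboring cells land near each other in
    # memory and stencil loops get better cache reuse.
    order, seen, q = [], {start}, deque([start])
    while q:
        c = q.popleft()
        order.append(c)
        for n in neighbors[c]:
            if n not in seen:
                seen.add(n)
                q.append(n)
    return order  # position in this list gives the new cell number

print(bfs_reorder({0: [2], 1: [2], 2: [0, 1]}))  # [0, 2, 1]
```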
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reed, D.A.; Grunwald, D.C.
The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of these processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At the other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodsmen attacking the log with chainsaws.
NASA Technical Reports Server (NTRS)
Steinke, Ronald J.
1989-01-01
The Rai ROTOR1 code for two-dimensional, unsteady viscous flow analysis was applied to a supersonic throughflow fan stage design. The axial Mach number for this fan design increases from 2.0 at the inlet to 2.9 at the outlet. The Rai code uses overlapped O- and H-grids that are appropriately packed. The Rai code was run on a Cray X-MP computer; data postprocessing and graphics were then performed to obtain detailed insight into the stage flow. The large rotor wakes uniformly traversed the rotor-stator interface and dispersed as they passed through the stator passage. Only weak blade shock losses were computed, which supports the design goals. Strong viscous effects caused large blade wakes and a low fan efficiency. Rai code flow predictions were essentially steady for the rotor, and they compared well with Chima rotor viscous code predictions based on a C-grid of similar density.
Domain decomposition methods in aerodynamics
NASA Technical Reports Server (NTRS)
Venkatakrishnan, V.; Saltz, Joel
1990-01-01
Compressible Euler equations are solved for two-dimensional problems by a preconditioned conjugate gradient-like technique. An approximate Riemann solver is used to compute the numerical fluxes to second-order accuracy in space. Two ways to achieve parallelism are tested, one which makes use of the parallelism inherent in triangular solves and the other which employs domain decomposition techniques. The vectorization/parallelism in triangular solves is realized by the use of a reordering technique called wavefront ordering. This process involves the interpretation of the triangular matrix as a directed graph and the analysis of the data dependencies. It is noted that the factorization can also be done in parallel with the wavefront ordering. The performances of two ways of partitioning the domain, strips and slabs, are compared. Results on a Cray Y-MP are reported for an inviscid transonic test case. The performances of linear algebra kernels are also reported.
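A minimal sketch of wavefront ordering (level scheduling) for a sparse lower-triangular solve: interpret the matrix as a DAG and group rows into levels, where all rows in a level are mutually independent and can be processed as one vector or parallel operation. The row-wise input format here is an assumption:

```python
def wavefront_levels(L_rows):
    # L_rows[i] lists the nonzero column indices of row i (all j <= i).
    # Row i depends on every row j < i with a nonzero L[i, j]; its level
    # is one more than the deepest such dependency.
    levels = []
    for i, cols in enumerate(L_rows):
        deps = [levels[j] for j in cols if j != i]
        levels.append(1 + max(deps, default=0))
    waves = {}
    for i, lv in enumerate(levels):
        waves.setdefault(lv, []).append(i)
    return [waves[k] for k in sorted(waves)]

# Rows 0 and 1 are independent; row 2 depends on both.
print(wavefront_levels([[0], [1], [0, 1, 2]]))  # [[0, 1], [2]]
```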
NASA Technical Reports Server (NTRS)
Edwards, Jack R.; Mcrae, D. S.
1993-01-01
An efficient implicit method for the computation of steady, three-dimensional, compressible Navier-Stokes flowfields is presented. A nonlinear iteration strategy based on planar Gauss-Seidel sweeps is used to drive the solution toward a steady state, with approximate factorization errors within a crossflow plane reduced by the application of a quasi-Newton technique. A hybrid discretization approach is employed, with flux-vector splitting utilized in the streamwise direction and central differences with artificial dissipation used for the transverse fluxes. Convergence histories and comparisons with experimental data are presented for several 3-D shock-boundary layer interactions. Both laminar and turbulent cases are considered, with turbulent closure provided by a modification of the Baldwin-Barth one-equation model. For the problems considered (175,000-325,000 mesh points), the algorithm provides steady-state convergence in 900-2000 CPU seconds on a single processor of a Cray Y-MP.
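For reference, the pointwise Gauss-Seidel relaxation that the planar sweeps generalize is sketched below (in the solver above, the implicitly solved unknown is an entire crossflow plane rather than a scalar; this dense toy version only shows the sweep structure):

```python
import numpy as np

def gauss_seidel_sweep(A, b, x, n_sweeps=1):
    # Classic Gauss-Seidel: each unknown is updated in place using the
    # freshest available values of its neighbors, sweeping in order.
    n = len(b)
    for _ in range(n_sweeps):
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[4.0, -1.0], [-1.0, 4.0]])
x = gauss_seidel_sweep(A, np.array([3.0, 3.0]), np.zeros(2), n_sweeps=25)
print(x)  # converges to [1, 1]
```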
Eigenvalue routines in NASTRAN: A comparison with the Block Lanczos method
NASA Technical Reports Server (NTRS)
Tischler, V. A.; Venkayya, Vipperla B.
1993-01-01
The NASA STRuctural ANalysis (NASTRAN) program is one of the most extensively used pieces of engineering applications software in the world. It contains a wealth of matrix operations and numerical solution techniques, which were used to construct efficient eigenvalue routines. The purpose of this paper is to examine the current eigenvalue routines in NASTRAN and to make efficiency comparisons with a more recent implementation of the Block Lanczos algorithm by Boeing Computer Services (BCS). This eigenvalue routine is now available in the BCS mathematics library as well as in several commercial versions of NASTRAN. In addition, CRAY maintains a modified version of this routine on their network. Several example problems, with varying numbers of degrees of freedom, were selected primarily for efficiency bench-marking. Accuracy is not an issue, because they all gave comparable results. The Block Lanczos algorithm was found to be extremely efficient, particularly for very large problems.
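For orientation, a bare single-vector Lanczos iteration is sketched below; the Block Lanczos method discussed in the paper replaces the single vector with a block of vectors (better for clustered and repeated eigenvalues) and adds reorthogonalization, which this toy version omits:

```python
import numpy as np

def lanczos(A, v0, m):
    # Build an m x m tridiagonal T whose extreme eigenvalues
    # approximate those of the symmetric matrix A.
    n = len(v0)
    V = np.zeros((n, m + 1))
    alpha, beta = np.zeros(m), np.zeros(m)
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j] - (beta[j - 1] * V[:, j - 1] if j > 0 else 0)
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        if beta[j] == 0:          # invariant subspace found
            break
        V[:, j + 1] = w / beta[j]
    return np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)

A = np.diag(np.arange(1.0, 101.0))          # known spectrum 1..100
T = lanczos(A, np.random.rand(100), 20)
print(np.linalg.eigvalsh(T).max())          # ~100, the largest eigenvalue
```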
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gittens, Alex; Devarakonda, Aditya; Racah, Evan
We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to a 1.6 TB particle physics dataset, 2.2 TB and 16 TB climate modeling datasets, and a 1.1 TB bioimaging dataset. The data matrices are tall-and-skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
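Tall-and-skinny matrices map onto the data-parallel model because each partition can contribute a small d-by-d Gram block that is then summed. A numpy sketch of that pattern for PCA follows (mean-centering omitted; in Spark the loop would be a map/reduce over an RDD of row blocks):

```python
import numpy as np

def tall_skinny_pca(row_blocks, k):
    # "Map": each partition of rows contributes its d x d Gram block.
    # "Reduce": the blocks are summed into one small matrix.
    gram = None
    for block in row_blocks:
        g = block.T @ block
        gram = g if gram is None else gram + g
    # The eigenproblem is tiny (d x d) and solved on the driver.
    evals, evecs = np.linalg.eigh(gram)
    return evecs[:, ::-1][:, :k]      # top-k principal directions

blocks = [np.random.rand(1000, 8) for _ in range(4)]  # four partitions
print(tall_skinny_pca(blocks, k=2).shape)             # (8, 2)
```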
Analytical simulation of weld effects in creep range
NASA Technical Reports Server (NTRS)
Dhalla, A. K.
1985-01-01
The inelastic analysis procedure used to investigate the effect of welding on the creep rupture strength of a typical Liquid Metal Fast Breeder Reactor (LMFBR) nozzle is discussed. The current study is part of an overall experimental and analytical investigation to verify the inelastic analysis procedure now being used to design LMFBR structural components operating at elevated temperatures. Two important weld effects included in the numerical analysis are: (1) the residual stress introduced in the fabrication process; and (2) the time-independent and the time-dependent material property variations. Finite element inelastic analysis was performed on a CRAY-1S computer using the ABAQUS program with the constitutive equations developed for the design of LMFBR structural components. The predicted peak weld residual stresses relax by as much as 40% during elevated temperature operation, and their effect on creep-rupture cracking of the nozzle is considered of secondary importance.
Seismic signal processing on heterogeneous supercomputers
NASA Astrophysics Data System (ADS)
Gokhberg, Alexey; Ermert, Laura; Fichtner, Andreas
2015-04-01
The processing of seismic signals - including the correlation of massive ambient noise data sets - represents an important part of a wide range of seismological applications. It is characterized by large data volumes as well as high computational input/output intensity. Development of efficient approaches towards seismic signal processing on emerging high performance computing systems is therefore essential. Heterogeneous supercomputing systems introduced in recent years provide numerous computing nodes interconnected via high throughput networks, every node containing a mix of processing elements of different architectures, like several sequential processor cores and one or a few graphical processing units (GPU) serving as accelerators. A typical representative of such computing systems is "Piz Daint", a supercomputer of the Cray XC30 family operated by the Swiss National Supercomputing Centre (CSCS), which we used in this research. Heterogeneous supercomputers provide an opportunity for a manifold increase in application performance and are more energy-efficient; however, they have much higher hardware complexity and are therefore much more difficult to program. The programming effort may be substantially reduced by the introduction of modular libraries of software components that can be reused for a wide class of seismology applications. The ultimate goal of this research is the design of a prototype of such a library suitable for implementing various seismic signal processing applications on heterogeneous systems. As a representative use case we have chosen an ambient noise correlation application. Ambient noise interferometry has developed into one of the most powerful tools to image and monitor the Earth's interior. Future applications will require the extraction of increasingly small details from noise recordings. To meet this demand, more advanced correlation techniques combined with very large data volumes are needed. This poses new computational problems that require dedicated HPC solutions. The chosen application uses a wide range of common signal processing methods, which include various IIR filter designs, amplitude and phase correlation, computing the analytic signal, and discrete Fourier transforms. Furthermore, various processing methods specific to seismology, like rotation of seismic traces, are used. Efficient implementation of all these methods on GPU-accelerated systems presents several challenges. In particular, it requires a careful distribution of work between the sequential processors and the accelerators. Furthermore, since the application is designed to process very large volumes of data, special attention had to be paid to the efficient use of the available memory and networking hardware resources in order to reduce the intensity of data input and output. In our contribution we will explain the software architecture as well as the principal engineering decisions used to address these challenges. We will also describe the programming model based on C++ and CUDA that we used to develop the software. Finally, we will demonstrate the performance improvements achieved by using the heterogeneous computing architecture. This work was supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d26.
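The computational core of ambient noise correlation is the cross-correlation of station pairs, usually done in the frequency domain. A minimal numpy sketch follows (whitening, filtering and stacking omitted; a GPU version would batch exactly this FFT work):

```python
import numpy as np

def noise_cross_correlation(trace_a, trace_b):
    # Cross-correlate two traces via the FFT: multiply one spectrum by
    # the conjugate of the other and transform back.
    n = len(trace_a) + len(trace_b) - 1
    fa = np.fft.rfft(trace_a, n)
    fb = np.fft.rfft(trace_b, n)
    return np.fft.irfft(fa * np.conj(fb), n)

rng = np.random.default_rng(0)
a = rng.standard_normal(4096)
cc = noise_cross_correlation(a, np.roll(a, 50))
# The 50-sample shift appears as a correlation peak at lag -50, which
# wraps to index len(cc) - 50 in this circular convention.
print(cc.argmax(), len(cc) - 50)
```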
Parallelization of Rocket Engine System Software (Press)
NASA Technical Reports Server (NTRS)
Cezzar, Ruknet
1996-01-01
The main goal is to assess parallelization requirements for the Rocket Engine Numeric Simulator (RENS) project which, aside from gathering information on liquid-propelled rocket engines and setting forth requirements, involves a large FORTRAN-based package at NASA Lewis Research Center and TDK software developed by SUBR/UWF. The ultimate aim is to develop, test, integrate, and suitably deploy a family of software packages on various aspects and facets of rocket engines using liquid propellants. At present, all project efforts by the funding agency, NASA Lewis Research Center, and the HBCU participants are disseminated over the internet using World Wide Web home pages. Considering the obvious expense of actual field trials, the benefits of software simulators are potentially enormous. When realized, these benefits will be analogous to those provided by numerous CAD/CAM packages and flight-training simulators. According to the overall task assignments, Hampton University's role is to collect all available software, place it in a common format, assess and evaluate it, define interfaces, and provide integration. Most importantly, HU's mission is to see to it that real-time performance is assured. This involves source code translation, porting, and distribution. The porting will be done in two phases: first, place all software on the Cray X-MP platform using FORTRAN; after testing and evaluation on the Cray X-MP, the code will be translated to C++ and ported to the parallel nCUBE platform. At present, we are evaluating another option of distributed processing over local area networks using Sun NFS, Ethernet, and TCP/IP. Considering the heterogeneous nature of the present software (e.g., it first started as an expert system using LISP machines) which now involves FORTRAN code, the effort is expected to be quite challenging.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lichtner, Peter C.; Hammond, Glenn E.
Evolution of a hexavalent uranium [U(VI)] plume at the Hanford 300 Area bordering the Columbia River is investigated to evaluate the roles of labile and nonlabile forms of U(VI) on the longevity of the plume. A high fidelity, three-dimensional, field-scale, reactive flow and transport model is used to represent the system. Richards' equation, coupled to multicomponent reactive transport equations, is solved for times up to 100 years, taking into account rapid fluctuations in the Columbia River stage that result in pulse releases of U(VI) into the river. The peta-scale computer code PFLOTRAN, developed under a DOE SciDAC-2 project, is employed in the simulations and executed on ORNL's Cray XT5 supercomputer Jaguar. Labile U(VI) is represented in the model through surface complexation reactions, and its nonlabile form through dissolution of metatorbernite used as a surrogate mineral. Initial conditions are constructed corresponding to the U(VI) plume already in place, to avoid uncertainties associated with the lack of historical data for the waste stream. The cumulative U(VI) flux into the river is compared for cases of equilibrium and multirate sorption models and for no sorption. The sensitivity of the U(VI) flux into the river to the initial plume configuration is investigated. The presence of nonlabile U(VI) was found to be essential in explaining the longevity of the U(VI) plume and the prolonged high U(VI) concentrations at the site, which exceed the EPA MCL for uranium.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shamis, Pavel; Graham, Richard L; Gorentla Venkata, Manjunath
The scalability and performance of collective communication operations limit the scalability and performance of many scientific applications. This paper presents two new blocking and nonblocking Broadcast algorithms for communicators with arbitrary communication topology, and studies their performance. These algorithms benefit from increased concurrency and a reduced memory footprint, making them suitable for use on large-scale systems. Measuring small, medium, and large data Broadcasts on a Cray XT5, using 24,576 MPI processes, the Cheetah algorithms outperform the native MPI on that system by 51%, 69%, and 9%, respectively, at the same process count. These results demonstrate an algorithmic approach to the implementation of this important class of collective communications, which is high performing and scalable, and which also uses resources in a scalable manner.
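For orientation, the classic binomial-tree Broadcast pattern that such algorithms refine is sketched below in mpi4py; the Cheetah algorithms add concurrency and shrink the memory footprint in ways not shown, and in practice one would simply call comm.bcast:

```python
from mpi4py import MPI

def binomial_broadcast(comm, data, root=0):
    # Each rank first receives from its parent in the binomial tree,
    # then forwards to its children; log2(P) communication rounds.
    rank, size = comm.Get_rank(), comm.Get_size()
    rel = (rank - root) % size            # rank relative to the root
    mask = 1
    while mask < size:                    # receive phase
        if rel & mask:
            data = comm.recv(source=(rank - mask) % size)
            break
        mask <<= 1
    mask >>= 1
    while mask > 0:                       # send phase
        if rel + mask < size:
            comm.send(data, dest=(rank + mask) % size)
        mask >>= 1
    return data

# Run with e.g.: mpiexec -n 8 python broadcast_sketch.py
comm = MPI.COMM_WORLD
msg = binomial_broadcast(comm, "payload" if comm.Get_rank() == 0 else None)
print(comm.Get_rank(), msg)
```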
PLOT3D/AMES, DEC VAX VMS VERSION USING DISSPLA (WITHOUT TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P. G.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. The VAX/VMS/DISSPLA implementation of PLOT3D supports 2-D polygons as well as 2-D and 3-D lines, but does not support graphics features requiring 3-D polygons (shading and hidden line removal, for example). Views can be manipulated using keyboard commands. This version of PLOT3D is potentially able to produce files for a variety of output devices; however, site-specific capabilities will vary depending on the device drivers supplied with the user's DISSPLA library. If ARCGRAPH (ARC-12350) is installed on the user's VAX, the VMS/DISSPLA version of PLOT3D can also be used to create files for use in GAS (Graphics Animation System, ARC-12379), an IRIS program capable of animating and recording images on film. The version 3.6b+ VMS/DISSPLA implementations of PLOT3D (ARC-12777) and PLOT3D/TURB3D (ARC-12781) were developed for use on VAX computers running VMS Version 5.0 and DISSPLA Version 11.0. The standard distribution media for each of these programs is a 9-track, 6250 bpi magnetic tape in DEC VAX BACKUP format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, and Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); (2) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D (ARC-12783, ARC-12782); (3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785, which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Equipment Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. UNIX is a registered trademark of AT&T.
PLOT3D/AMES, DEC VAX VMS VERSION USING DISSPLA (WITH TURB3D)
NASA Technical Reports Server (NTRS)
Buning, P.
1994-01-01
PLOT3D is an interactive graphics program designed to help scientists visualize computational fluid dynamics (CFD) grids and solutions. Today, supercomputers and CFD algorithms can provide scientists with simulations of such highly complex phenomena that obtaining an understanding of the simulations has become a major problem. Tools which help the scientist visualize the simulations can be of tremendous aid. PLOT3D/AMES offers more functions and features, and has been adapted for more types of computers than any other CFD graphics program. Version 3.6b+ is supported for five computers and graphic libraries. Using PLOT3D, CFD physicists can view their computational models from any angle, observing the physics of problems and the quality of solutions. As an aid in designing aircraft, for example, PLOT3D's interactive computer graphics can show vortices, temperature, reverse flow, pressure, and dozens of other characteristics of air flow during flight. As critical areas become obvious, they can easily be studied more closely using a finer grid. PLOT3D is part of a computational fluid dynamics software cycle. First, a program such as 3DGRAPE (ARC-12620) helps the scientist generate computational grids to model an object and its surrounding space. Once the grids have been designed and parameters such as the angle of attack, Mach number, and Reynolds number have been specified, a "flow-solver" program such as INS3D (ARC-11794 or COS-10019) solves the system of equations governing fluid flow, usually on a supercomputer. Grids sometimes have as many as two million points, and the "flow-solver" produces a solution file which contains density, x- y- and z-momentum, and stagnation energy for each grid point. With such a solution file and a grid file containing up to 50 grids as input, PLOT3D can calculate and graphically display any one of 74 functions, including shock waves, surface pressure, velocity vectors, and particle traces. PLOT3D's 74 functions are organized into five groups: 1) Grid Functions for grids, grid-checking, etc.; 2) Scalar Functions for contour or carpet plots of density, pressure, temperature, Mach number, vorticity magnitude, helicity, etc.; 3) Vector Functions for vector plots of velocity, vorticity, momentum, and density gradient, etc.; 4) Particle Trace Functions for rake-like plots of particle flow or vortex lines; and 5) Shock locations based on pressure gradient. TURB3D is a modification of PLOT3D which is used for viewing CFD simulations of incompressible turbulent flow. Input flow data consists of pressure, velocity and vorticity. Typical quantities to plot include local fluctuations in flow quantities and turbulent production terms, plotted in physical or wall units. PLOT3D/TURB3D includes both TURB3D and PLOT3D because the operation of TURB3D is identical to PLOT3D, and there is no additional sample data or printed documentation for TURB3D. Graphical capabilities of PLOT3D version 3.6b+ vary among the implementations available through COSMIC. Customers are encouraged to purchase and carefully review the PLOT3D manual before ordering the program for a specific computer and graphics library. There is only one manual for use with all implementations of PLOT3D, and although this manual generally assumes that the Silicon Graphics Iris implementation is being used, informative comments concerning other implementations appear throughout the text. 
With all implementations, the visual representation of the object and flow field created by PLOT3D consists of points, lines, and polygons. Points can be represented with dots or symbols, color can be used to denote data values, and perspective is used to show depth. Differences among implementations impact the program's ability to use graphical features that are based on 3D polygons, the user's ability to manipulate the graphical displays, and the user's ability to obtain alternate forms of output. The VAX/VMS/DISSPLA implementation of PLOT3D supports 2-D polygons as well as 2-D and 3-D lines, but does not support graphics features requiring 3-D polygons (shading and hidden line removal, for example). Views can be manipulated using keyboard commands. This version of PLOT3D is potentially able to produce files for a variety of output devices; however, site-specific capabilities will vary depending on the device drivers supplied with the user's DISSPLA library. If ARCGRAPH (ARC-12350) is installed on the user's VAX, the VMS/DISSPLA version of PLOT3D can also be used to create files for use in GAS (Graphics Animation System, ARC-12379), an IRIS program capable of animating and recording images on film. The version 3.6b+ VMS/DISSPLA implementations of PLOT3D (ARC-12777) and PLOT3D/TURB3D (ARC-12781) were developed for use on VAX computers running VMS Version 5.0 and DISSPLA Version 11.0. The standard distribution media for each of these programs is a 9-track, 6250 bpi magnetic tape in DEC VAX BACKUP format. Customers purchasing one implementation version of PLOT3D or PLOT3D/TURB3D will be given a $200 discount on each additional implementation version ordered at the same time. Version 3.6b+ of PLOT3D and PLOT3D/TURB3D are also supported for the following computers and graphics libraries: (1) generic UNIX Supercomputer and IRIS, suitable for CRAY 2/UNICOS, CONVEX, and Alliant with remote IRIS 2xxx/3xxx or IRIS 4D (ARC-12779, ARC-12784); (2) Silicon Graphics IRIS 2xxx/3xxx or IRIS 4D (ARC-12783, ARC-12782); (3) generic UNIX and DISSPLA Version 11.0 (ARC-12788, ARC-12778); and (4) Apollo computers running UNIX and GMR3D Version 2.0 (ARC-12789, ARC-12785, which have no capabilities to put text on plots). Silicon Graphics Iris, IRIS 4D, and IRIS 2xxx/3xxx are trademarks of Silicon Graphics Incorporated. VAX and VMS are trademarks of Digital Equipment Corporation. DISSPLA is a trademark of Computer Associates. CRAY 2 and UNICOS are trademarks of CRAY Research, Incorporated. CONVEX is a trademark of Convex Computer Corporation. Alliant is a trademark of Alliant. Apollo and GMR3D are trademarks of Hewlett-Packard, Incorporated. UNIX is a registered trademark of AT&T.
Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes
NASA Technical Reports Server (NTRS)
Wissink, Andrew; Allen, Edwin (Technical Monitor)
1998-01-01
Viscous calculations about geometrically complex bodies in which there is relative motion between component parts are among the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first part of the presentation will focus on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multiprocessors. The second part of the presentation will cover some recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed-memory multiprocessor computer architectures will be reported.
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Hixon, Duane; Sankar, L. N.
1993-01-01
During the past two decades, there has been significant progress in the field of numerical simulation of unsteady compressible viscous flows. At present, a variety of solution techniques exist, such as transonic small disturbance (TSD) analyses, transonic full potential equation-based methods, unsteady Euler solvers, and unsteady Navier-Stokes solvers. These advances have been made possible by developments in three areas: (1) improved numerical algorithms; (2) automation of body-fitted grid generation schemes; and (3) advanced computer architectures with vector processing and massively parallel processing features. In this work, the GMRES scheme has been considered as a candidate for accelerating a Newton iteration time marching scheme for unsteady 2-D and 3-D compressible viscous flow calculations; preliminary calculations indicate that this will provide up to a 65 percent reduction in computer time requirements over the existing class of explicit and implicit time marching schemes. The proposed method has been tested on structured grids, but is flexible enough for extension to unstructured grids. The described scheme has been tested only on the current generation of vector processor architectures of the Cray Y-MP class, but should be suitable for adaptation to massively parallel machines.
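To illustrate the Newton/GMRES acceleration idea at the heart of this approach, the following minimal sketch applies a Jacobian-free Newton-GMRES solve to a hypothetical one-dimensional nonlinear residual; the residual function, grid size, and tolerance are illustrative stand-ins, not the authors' flow solver.

```python
# A minimal Jacobian-free Newton-GMRES sketch (hypothetical residual, not the
# authors' scheme): Newton's method supplies the outer iteration, GMRES solves
# each linearized correction using only residual evaluations.
import numpy as np
from scipy.optimize import newton_krylov

def residual(u):
    # Hypothetical nonlinear residual standing in for an implicit
    # time-marching residual R(u) = 0 on a 1-D grid.
    r = np.empty_like(u)
    r[0] = u[0] - 1.0                                   # left boundary value
    r[-1] = u[-1]                                       # right boundary value
    r[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2] - 0.1 * np.exp(u[1:-1])
    return r

u0 = np.zeros(101)                                      # initial guess
u = newton_krylov(residual, u0, method='gmres', f_tol=1e-9, verbose=True)
print("max residual after Newton-GMRES:", np.abs(residual(u)).max())
```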
P-HARP: A parallel dynamic spectral partitioner
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sohn, A.; Biswas, R.; Simon, H.D.
1997-05-01
Partitioning unstructured graphs is central to the parallel solution of problems in computational science and engineering. The authors earlier introduced the sequential version of an inertial spectral partitioner called HARP, which maintains the quality of recursive spectral bisection (RSB) while forming the partitions an order of magnitude faster than RSB. The serial HARP is known to be the fastest spectral partitioner to date, three to four times faster than similar partitioners on a variety of meshes. This paper presents a parallel version of HARP, called P-HARP. Two types of parallelism have been exploited: loop-level parallelism and recursive parallelism. P-HARP has been implemented in MPI on the SGI/Cray T3E and the IBM SP2. Experimental results demonstrate that P-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.25 seconds on a 64-processor T3E. Experimental results further show that P-HARP can give nearly a 20-fold speedup on 64 processors. These results indicate that graph partitioning is no longer a major bottleneck that hinders the advancement of computational science and engineering for dynamically changing real-world applications.
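The spectral bisection kernel that HARP-style partitioners build on can be sketched in a few lines: compute the Fiedler vector of the graph Laplacian and split the vertices at its median. The small structured-mesh graph below is a hypothetical example, and a dense eigensolver stands in for the optimized methods a production partitioner would use.

```python
# Spectral bisection sketch: split a graph into two balanced halves using the
# Fiedler vector (eigenvector of the second-smallest Laplacian eigenvalue).
import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

# Hypothetical mesh graph: a 16x16 grid expressed as a sparse adjacency matrix.
n = 16
ix = np.arange(n * n).reshape(n, n)
rows = np.concatenate([ix[:, :-1].ravel(), ix[:-1, :].ravel()])
cols = np.concatenate([ix[:, 1:].ravel(), ix[1:, :].ravel()])
A = sp.coo_matrix((np.ones(rows.size), (rows, cols)), shape=(n * n, n * n))
A = (A + A.T).tocsr()

L = laplacian(A).toarray()            # graph Laplacian L = D - A (dense here)
vals, vecs = eigh(L)                  # eigenvalues in ascending order
fiedler = vecs[:, 1]                  # Fiedler vector

# Bisect at the median so both halves get (nearly) equal vertex counts.
part = (fiedler > np.median(fiedler)).astype(int)
coo = A.tocoo()
cut = int(np.sum(part[coo.row] != part[coo.col])) // 2   # edges crossing the cut
print("partition sizes:", np.bincount(part), "edge cut:", cut)
```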
Shadid, J. N.; Pawlowski, R. P.; Cyr, E. C.; ...
2016-02-10
Here we discuss how the computational solution of the governing balance equations for mass, momentum, heat transfer, and magnetic induction for resistive magnetohydrodynamics (MHD) systems can be extremely challenging. These difficulties arise both from the strong nonlinear, nonsymmetric coupling of fluid and electromagnetic phenomena and from the significant range of time- and length-scales that the interactions of these physical mechanisms produce. This paper explores the development of a scalable, fully-implicit stabilized unstructured finite element (FE) capability for 3D incompressible resistive MHD. The discussion considers the development of a stabilized FE formulation in the context of the variational multiscale (VMS) method, and describes the scalable implicit time integration and direct-to-steady-state solution capability. The nonlinear solver strategy employs Newton–Krylov methods, which are preconditioned using fully-coupled algebraic multilevel preconditioners. These preconditioners are shown to enable a robust, scalable and efficient solution approach for the large-scale sparse linear systems generated by the Newton linearization. Verification results demonstrate the expected order-of-accuracy for the stabilized FE discretization. The approach is tested on a variety of prototype problems that include MHD duct flows, an unstable hydromagnetic Kelvin–Helmholtz shear layer, and a 3D island coalescence problem used to model magnetic reconnection. Initial results that explore the scaling of the solution methods are also presented on up to 128K processors for problems with up to 1.8B unknowns on a Cray XK7.
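As a rough illustration of the inner linear solve in such a preconditioned Newton–Krylov strategy, the sketch below runs GMRES on a hypothetical sparse nonsymmetric system, with an incomplete-LU factorization serving as a stand-in for the fully-coupled algebraic multilevel preconditioners described above.

```python
# Preconditioned Krylov inner solve sketch: one GMRES solve of a sparse
# "Jacobian" system. ILU here stands in for an algebraic multilevel
# preconditioner; the test matrix is a hypothetical convection-diffusion stencil.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import gmres, spilu, LinearOperator

n = 2000
main = 4.0 * np.ones(n)
lower = -1.5 * np.ones(n - 1)          # nonsymmetric off-diagonals
upper = -0.5 * np.ones(n - 1)
J = sp.diags([lower, main, upper], [-1, 0, 1], format='csc')
b = np.ones(n)

ilu = spilu(J, drop_tol=1e-4)                          # algebraic preconditioner
M = LinearOperator((n, n), matvec=ilu.solve)

x, info = gmres(J, b, M=M, restart=30, maxiter=200)
print("GMRES info:", info, "residual norm:", np.linalg.norm(b - J @ x))
```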
Kjaergaard, Thomas; Baudin, Pablo; Bykov, Dmytro; ...
2016-11-16
Here, we present a scalable cross-platform hybrid MPI/OpenMP/OpenACC implementation of the Divide–Expand–Consolidate (DEC) formalism with portable performance on heterogeneous HPC architectures. The Divide–Expand–Consolidate formalism is designed to reduce the steep computational scaling of conventional many-body methods employed in electronic structure theory to linear scaling, while providing a simple mechanism for controlling the error introduced by this approximation. Our massively parallel implementation of this general scheme has three levels of parallelism, being a hybrid of the loosely coupled task-based parallelization approach and the conventional MPI+X programming model, where X is either OpenMP or OpenACC. We demonstrate strong and weak scalability of this implementation on heterogeneous HPC systems, namely on the GPU-based Cray XK7 Titan supercomputer at Oak Ridge National Laboratory. Using resolution-of-the-identity second-order Møller–Plesset perturbation theory (RI-MP2) as the physical model for simulating correlated electron motion, the linear-scaling DEC implementation is applied to 1-aza-adamantane-trione (AAT) supramolecular wires containing up to 40 monomers (2440 atoms, 6800 correlated electrons, 24,440 basis functions, and 91,280 auxiliary functions). This represents the largest molecular system treated at the MP2 level of theory, demonstrating an efficient removal of the scaling wall pertinent to conventional quantum many-body methods.
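The outermost, loosely coupled task level of parallelism in a DEC-style calculation can be pictured as a farm of independent fragment and pair-fragment energy tasks whose results are summed; the sketch below uses hypothetical placeholder energy functions rather than actual RI-MP2 kernels, and a process pool stands in for the MPI task manager.

```python
# Task-farm sketch of the outer DEC parallelism: independent fragment and
# pair-fragment tasks are dispatched to workers and consolidated into a total.
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def fragment_energy(frag):
    # Placeholder: in the real code this is a local correlated calculation
    # restricted to the orbitals assigned to fragment `frag`.
    return -0.01 * (frag + 1)

def pair_energy(pair):
    # Placeholder pair-interaction correction for fragments (i, j).
    i, j = pair
    return -0.001 / abs(i - j)

fragments = list(range(8))
pairs = list(combinations(fragments, 2))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        e_frag = sum(pool.map(fragment_energy, fragments))
        e_pair = sum(pool.map(pair_energy, pairs))
    print("total correlation energy (arbitrary units):", e_frag + e_pair)
```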
NASA Astrophysics Data System (ADS)
Buaria, D.; Yeung, P. K.
2017-12-01
A new parallel algorithm utilizing a partitioned global address space (PGAS) programming model to achieve high scalability is reported for particle tracking in direct numerical simulations of turbulent fluid flow. The work is motivated by the desire to obtain Lagrangian information necessary for the study of turbulent dispersion at the largest problem sizes feasible on current and next-generation multi-petaflop supercomputers. A large population of fluid particles is distributed among parallel processes dynamically, based on instantaneous particle positions, such that all of the interpolation information needed for each particle is available either locally on its host process or on neighboring processes holding adjacent sub-domains of the velocity field. With cubic splines as the preferred interpolation method, the new algorithm is designed to minimize the need for communication by transferring between adjacent processes only those spline coefficients determined to be necessary for specific particles. This transfer is implemented very efficiently as a one-sided communication, using Co-Array Fortran (CAF) features which facilitate small data movements between different local partitions of a large global array. The cost of monitoring the transfer of particle properties between adjacent processes, for particles migrating across sub-domain boundaries, is found to be small. Detailed benchmarks are obtained on the Cray petascale supercomputer Blue Waters at the University of Illinois, Urbana-Champaign. For operations on the particles in an 8192^3 simulation (0.55 trillion grid points) on 262,144 Cray XE6 cores, the new algorithm is found to be orders of magnitude faster than a prior algorithm in which each particle is tracked by the same parallel process at all times. This large speedup reduces the additional cost of tracking of order 300 million particles to just over 50% of the cost of computing the Eulerian velocity field at this scale. Improving support for PGAS models in major compilers suggests that this algorithm will be of wider applicability on most upcoming supercomputers.
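The ownership rule underlying this dynamic particle redistribution can be sketched as follows: each particle is assigned to the process whose sub-domain contains its position, after which the interpolation data it needs are local or on a neighboring process. The slab decomposition, process count, and particle count below are hypothetical, and NumPy bucketing stands in for the one-sided Co-Array Fortran transfers used in the actual code.

```python
# Assign particles to owner processes by position in a 1-D slab decomposition,
# then bucket their indices for transfer before local spline interpolation.
import numpy as np

nproc = 8                        # number of parallel processes (slabs in x)
domain = 2.0 * np.pi             # periodic domain length
npart = 1_000_000

rng = np.random.default_rng(0)
x = rng.uniform(0.0, domain, size=(npart, 3))     # particle positions

# Owner process of each particle, from its x-coordinate.
owner = np.floor(x[:, 0] / (domain / nproc)).astype(int) % nproc

# Bucket particle indices by owner; each bucket would be sent to that process.
order = np.argsort(owner, kind='stable')
counts = np.bincount(owner, minlength=nproc)
buckets = np.split(order, np.cumsum(counts)[:-1])
print("particles per process:", counts)
```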
Marek, A; Blum, V; Johanni, R; Havu, V; Lang, B; Auckenthaler, T; Heinecke, A; Bungartz, H-J; Lederer, H
2014-05-28
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N^3) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. Here we review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well-documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding back-transformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient, especially for larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well. Scalability beyond 10,000 CPU cores for problem sizes arising in the field of electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/InfiniBand. For a matrix of dimension 260,000, scalability up to 295,000 CPU cores has been shown on BlueGene/P.
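The one-step tridiagonalization mentioned above can be illustrated with a small dense routine that applies successive Householder similarity transforms; this is a serial sketch for a real symmetric matrix, not ELPA's blocked, distributed implementation.

```python
# Householder tridiagonalization sketch: reduce a real symmetric matrix to
# tridiagonal form by successive similarity transforms, preserving eigenvalues.
import numpy as np

def householder_tridiagonalize(A):
    A = A.copy().astype(float)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k + 1:, k]
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        norm_v = np.linalg.norm(v)
        if norm_v == 0.0:
            continue                      # column already in the desired form
        v /= norm_v
        # Apply the reflector H = I - 2 v v^T from the left and the right.
        A[k + 1:, :] -= 2.0 * np.outer(v, v @ A[k + 1:, :])
        A[:, k + 1:] -= 2.0 * np.outer(A[:, k + 1:] @ v, v)
    return A

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
S = (M + M.T) / 2.0                       # symmetric test matrix

T = householder_tridiagonalize(S)
print(np.round(T, 6))                     # tridiagonal up to round-off
print("eigenvalues preserved:",
      np.allclose(np.linalg.eigvalsh(S), np.linalg.eigvalsh(T)))
```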