Sample records for Cray T3E parallel

  1. High Performance Programming Using Explicit Shared Memory Model on Cray T3D

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; Saini, Subhash; Grassi, Charles

    1994-01-01

    The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors with CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under the Cray Research Adaptive Fortran (CRAFT) model, four programming methods (data parallel, work sharing, message passing using PVM, and the explicit shared memory model) are available to users. However, at this time the data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that neither standard PVM nor CRI's PVM exploits the hardware capabilities of the T3D. The reasons for the poor performance of PVM as a native message-passing library are presented. This is illustrated by the performance of the NAS Parallel Benchmarks (NPB) programmed in the explicit shared memory model on the Cray T3D. In general, the performance of standard PVM is about 4 to 5 times lower than that obtained using the explicit shared memory model. A similar degradation is seen on the CM-5, where the performance of applications using the native message-passing library CMMD is also about 4 to 5 times lower than with data parallel methods. The issues involved in programming in the explicit shared memory model (such as barriers, synchronization, invalidating the data cache, and data cache alignment) are discussed. Comparative performance of the NPB using the explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, IBM SP1, etc. is presented.
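
    The latency and bandwidth figures cited above are conventionally extracted by fitting ping-pong message timings to the linear cost model t(n) = latency + n/bandwidth. A minimal sketch of that fit, using synthetic timings rather than actual T3D measurements:

```python
# Fit the linear communication-cost model t(n) = alpha + n / beta
# (alpha = latency, beta = bandwidth) to ping-pong timings via
# ordinary least squares. Timings here are synthetic, not T3D data.

def fit_latency_bandwidth(sizes, times):
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx                  # seconds per byte = 1 / bandwidth
    alpha = mean_y - slope * mean_x    # latency in seconds
    return alpha, 1.0 / slope          # (latency, bandwidth in bytes/s)

if __name__ == "__main__":
    # Synthetic machine: 20 us latency, 50 MB/s bandwidth
    sizes = [1024 * k for k in range(1, 9)]
    times = [20e-6 + s / 50e6 for s in sizes]
    lat, bw = fit_latency_bandwidth(sizes, times)
    print(round(lat * 1e6, 1), round(bw / 1e6, 1))  # prints: 20.0 50.0
```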

  2. High Performance Programming Using Explicit Shared Memory Model on the Cray T3D

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)

    1994-01-01

    The Cray T3D is the first-phase system in Cray Research Inc.'s (CRI) three-phase massively parallel processing program. In this report we describe the architecture of the T3D, as well as the CRAFT (Cray Research Adaptive Fortran) programming model, and contrast it with PVM, which is also supported on the T3D. We present some performance data based on the NAS Parallel Benchmarks to illustrate both architectural and software features of the T3D.

  3. Parallelization of ARC3D with Computer-Aided Tools

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    A series of efforts has been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SGI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided parallelization tool CAPTools. Steps in parallelizing this code and the requirements for achieving better performance are discussed. The generated parallel version has achieved reasonably good performance, for example a speedup of 30 on 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving the serial code and performing necessary code transformations are important parts of the automated parallelization process, although user intervention in many of these parts is still necessary. Nevertheless, development and improvement of useful software tools, such as CAPTools, can help trim down many tedious parallelization details and improve processing efficiency.

  4. Parallel algorithms for modeling flow in permeable media. Annual report, February 15, 1995 - February 14, 1996

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    G.A. Pope; K. Sephernoori; D.C. McKinney

    1996-03-15

    This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time required to obtain the solution as well as by providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to the CRAY T3D and CRAY T3E, distributed-memory multiprocessor computers, using CRAY PVM as the interprocessor communication library. Also, the data communication routines were modified so that portability of the original code across different computer architectures was made possible.

  5. Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E

    NASA Technical Reports Server (NTRS)

    Ma, Kwan-Liu; Crockett, Thomas W.

    1999-01-01

    This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.

  6. Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    1999-01-01

    The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
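
    The agglomeration idea above (building coarse multigrid levels directly from the fine-grid graph) can be sketched with a greedy aggregation pass; this toy version only illustrates the concept and is not the paper's actual coarsening algorithm:

```python
# Greedy agglomeration: seed an aggregate at each still-unassigned vertex
# and absorb its unassigned neighbors. A sketch of graph-based coarsening
# only, not the paper's agglomeration multigrid algorithm.

def agglomerate(adjacency):
    """adjacency: dict vertex -> list of neighbors. Returns vertex -> aggregate id."""
    aggregate = {}
    next_id = 0
    for v in adjacency:
        if v in aggregate:
            continue
        aggregate[v] = next_id
        for w in adjacency[v]:
            if w not in aggregate:
                aggregate[w] = next_id
        next_id += 1
    return aggregate

if __name__ == "__main__":
    # A 6-vertex path graph 0-1-2-3-4-5 coarsens to 3 aggregates
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
    agg = agglomerate(path)
    print(agg)  # prints: {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
```

    Repeating the pass on the aggregate graph yields the next-coarser level, mirroring how the multigrid hierarchy is built from the given fine grid.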

  7. Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    1998-01-01

    A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
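
    The line-preserving constraint described above can be sketched by contracting each implicit line into a single weighted item and balancing the items greedily, so no line ever straddles a processor boundary. A hypothetical toy version, not the paper's weighted partitioner:

```python
# Weighted partitioning that never splits an implicit line: each line is
# contracted to one item whose weight is its point count, then items are
# assigned heaviest-first to the currently lightest partition. A sketch
# of the constraint only, not the paper's partitioning strategy.
import heapq

def partition_lines(line_lengths, nparts):
    heap = [(0, p, []) for p in range(nparts)]   # (load, part id, lines)
    heapq.heapify(heap)
    for line, weight in sorted(enumerate(line_lengths), key=lambda t: -t[1]):
        load, p, lines = heapq.heappop(heap)     # lightest partition
        lines.append(line)
        heapq.heappush(heap, (load + weight, p, lines))
    return {p: (load, lines) for load, p, lines in heap}

if __name__ == "__main__":
    # Eight hypothetical implicit lines of 8..1 points, two partitions
    parts = partition_lines([8, 7, 6, 5, 4, 3, 2, 1], 2)
    print([load for load, _ in parts.values()])  # prints: [18, 18]
```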

  8. S-HARP: A parallel dynamic spectral partitioner

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sohn, A.; Simon, H.

    1998-01-01

    Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. The authors present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP: fine-grain loop-level parallelism and coarse-grain recursive parallelism. The parallel partitioner has been implemented in the Message Passing Interface (MPI) on the Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64-processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15-fold speedup on 64 processors while ParMETIS 1.0 gives only a few-fold speedup. Experimental results demonstrate that S-HARP is three to 10 times faster than the dynamic partitioners ParMETIS and Jostle on six computational meshes of size over 100,000 vertices.
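
    The "fast feature of inertial partitioning" amounts to projecting the mesh coordinates onto the principal axis of their inertia and splitting at the median. A 2-D toy sketch of one bisection step (an illustration of the idea, not S-HARP itself):

```python
# Inertial bisection sketch: project vertex coordinates onto the
# principal axis of their covariance and split at the median.
# A toy 2-D illustration of inertial partitioning, not S-HARP.
import math

def inertial_bisect(points):
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Principal-axis angle of the 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    order = sorted(range(n),
                   key=lambda i: points[i][0] * axis[0] + points[i][1] * axis[1])
    half = n // 2
    return set(order[:half]), set(order[half:])

if __name__ == "__main__":
    # An elongated cloud along y = x: the split falls across the diagonal
    pts = [(t, t + (0.1 if t % 2 else -0.1)) for t in range(8)]
    a, b = inertial_bisect(pts)
    print(sorted(a), sorted(b))
```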

  9. An Automated Parallel Image Registration Technique Based on the Correlation of Wavelet Features

    NASA Technical Reports Server (NTRS)

    LeMoigne, Jacqueline; Campbell, William J.; Cromp, Robert F.; Zukor, Dorothy (Technical Monitor)

    2001-01-01

    With the increasing importance of multiple platform/multiple remote sensing missions, fast and automatic integration of digital data from disparate sources has become critical to the success of these endeavors. Our work utilizes maxima of wavelet coefficients to form the basic features of a correlation-based automatic registration algorithm. Our wavelet-based registration algorithm is tested successfully with data from the National Oceanic and Atmospheric Administration (NOAA) Advanced Very High Resolution Radiometer (AVHRR) and the Landsat Thematic Mapper (TM), which differ by translation and/or rotation. By the choice of high-frequency wavelet features, this method is similar to an edge-based correlation method, but by exploiting the multi-resolution nature of a wavelet decomposition, our method achieves higher computational speeds for comparable accuracies. This algorithm has been implemented on a Single Instruction Multiple Data (SIMD) massively parallel computer, the MasPar MP-2, as well as on the Cray T3D, the Cray T3E, and a Beowulf cluster of Pentium workstations.
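
    The core matching step (correlating high-frequency wavelet features over candidate transformations) can be illustrated in one dimension with Haar detail coefficients and integer shifts; this toy is far simpler than the paper's 2-D translation/rotation registration:

```python
# 1-D sketch of wavelet-feature correlation: compare Haar detail
# ("high-frequency") coefficients of a reference and a shifted copy over
# candidate shifts and pick the best correlation. A toy illustration
# only, not the paper's 2-D registration pipeline.

def haar_detail(x):
    return [(x[i] - x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def register(reference, target, max_shift):
    dref = haar_detail(reference)
    return max(
        range(-max_shift, max_shift + 1),
        key=lambda s: sum(a * b for a, b in zip(
            dref, haar_detail(target[s:] + target[:s]))))

if __name__ == "__main__":
    ref = [0, 0, 1, 5, 2, 0, 0, 0, 3, 1, 0, 0]
    tgt = ref[-2:] + ref[:-2]          # ref circularly shifted right by 2
    print(register(ref, tgt, 3))       # prints: 2
```

    Because the detail coefficients live on a half-length grid, the same search at coarser wavelet levels touches far fewer samples, which is the source of the speed advantage over plain edge correlation.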

  10. NAS Parallel Benchmark Results 11-96. 1.0

    NASA Technical Reports Server (NTRS)

    Bailey, David H.; Bailey, David; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    The NAS Parallel Benchmarks have been developed at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a "pencil and paper" fashion. In other words, the complete details of the problem to be solved are given in a technical document, and except for a few restrictions, benchmarkers are free to select the language constructs and implementation techniques best suited for a particular system. These results represent the best results that have been reported to us by the vendors for the specific systems listed. In this report, we present new NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, SGI Origin200, and SGI Origin2000. We also report High Performance Fortran (HPF) based NPB results for IBM SP2 Wide Nodes, HP/Convex Exemplar SPP2000, and SGI/CRAY T3D. These results have been submitted by Applied Parallel Research (APR) and Portland Group Inc. (PGI). We also present sustained performance per dollar for the Class B LU, SP, and BT benchmarks.

  11. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
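
    The reported 2.4x speedup on 8 processors can be read through Amdahl's law, S = 1/(f + (1 - f)/P); inverting it estimates the serial fraction f imposed by the partially parallelized and non-parallelized tasks the abstract mentions. A small sketch:

```python
# Invert Amdahl's law S = 1 / (f + (1 - f) / P) to estimate the serial
# fraction f from an observed speedup S on P processors. Applied here to
# the report's figure of 2.4x on 8 processors.

def serial_fraction(speedup, procs):
    return (procs / speedup - 1) / (procs - 1)

if __name__ == "__main__":
    f = serial_fraction(2.4, 8)
    print(round(f, 3))  # prints: 0.333
```

    Under this simple model roughly a third of the work behaves serially, consistent with the observation that the non-parallelized tasks dominate the wall clock time.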

  13. The SGI/Cray T3E: Experiences and Insights

    NASA Technical Reports Server (NTRS)

    Bernard, Lisa Hamet

    1998-01-01

    The NASA Goddard Space Flight Center is home to the fifth most powerful supercomputer in the world, a 1024-processor SGI/Cray T3E-600. The original 512-processor system was placed at Goddard in March 1997 as part of a cooperative agreement between the High Performance Computing and Communications Program's Earth and Space Sciences Project (ESS) and SGI/Cray Research. The goal of this system is to facilitate achievement of the Project milestones of 10, 50, and 100 GFLOPS sustained performance on selected Earth and space science application codes. The additional 512 processors were purchased in March 1998 by the NASA Earth Science Enterprise for the NASA Seasonal to Interannual Prediction Project (NSIPP). These two "halves" still operate as a single system, and must satisfy the unique requirements of both aforementioned groups, as well as guest researchers from the Earth, space, microgravity, manned space flight, and aeronautics communities. Few large scalable parallel systems are configured for capability computing, so models are hard to find. This unique environment has created a challenging system administration task, and has yielded some insights into the supercomputing needs of the various NASA Enterprises, as well as insights into the strengths and weaknesses of the T3E architecture and software. The T3E is a distributed memory system in which the processing elements (PEs) are connected by a low-latency, high-bandwidth bidirectional 3-D torus. Due to the focus on high-speed communication between PEs, the T3E requires PEs to be allocated contiguously per job. Further, jobs will only execute on the user-specified number of PEs, and PE timesharing is possible but impractical. With a job mix highly varied in both size and runtime, the resulting scenario is PE fragmentation and an inability to achieve near-100% utilization. SGI/Cray has provided several scheduling and configuration tools to minimize the impact of fragmentation. These tools include PScheD (the political scheduler), GRM (the global resource manager), and NQE (the Network Queuing Environment). Features and impact of these tools will be discussed, as will resulting performance and utilization data. As a distributed memory system, the T3E is designed to be programmed through explicit message passing. Consequently, certain assumptions related to code design are made by the operating system (UNICOS/mk) and its scheduling tools. With the exception of HPF, which does run on the T3E, however poorly, alternative programming styles have the potential to impact the T3E in unexpected and undesirable ways. Several examples will be presented (preceded by the disclaimer, "Don't try this at home! Violators will be prosecuted!").
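
    The PE-fragmentation problem described above is easy to demonstrate: with contiguous allocation, a machine can have plenty of free PEs in total yet no hole large enough for a waiting job. A toy first-fit sketch on a 16-PE line (hypothetical job sizes, not T3E data):

```python
# Why contiguous-PE allocation fragments: allocate jobs first-fit on a
# line of PEs, free alternating jobs, and observe that a job fitting the
# total free-PE count still finds no contiguous hole. Hypothetical
# sizes for illustration only.

def first_fit(free, size):
    """free: list of booleans, one per PE. Returns start index or None."""
    run = 0
    for i, f in enumerate(free):
        run = run + 1 if f else 0
        if run == size:
            return i - size + 1
    return None

if __name__ == "__main__":
    free = [True] * 16
    for start in (0, 4, 8, 12):            # four 4-PE jobs fill the machine
        for pe in range(start, start + 4):
            free[pe] = False
    for start in (0, 8):                   # two of the four jobs finish
        for pe in range(start, start + 4):
            free[pe] = True
    # 8 PEs are free in total, yet an 8-PE job cannot be placed
    print(sum(free), first_fit(free, 8))   # prints: 8 None
```

    Tools like PScheD and GRM attack exactly this gap between total and contiguous free capacity.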

  14. Development of a Dynamic Time Sharing Scheduled Environment Final Report CRADA No. TC-824-94E

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jette, M.; Caliga, D.

    Massively parallel computers, such as the Cray T3D, have historically supported resource sharing solely with space sharing. In that method, multiple problems are solved by executing them on distinct processors. This project developed a dynamic time- and space-sharing scheduler to achieve greater interactivity and throughput than could be achieved with space sharing alone. CRI and LLNL worked together on the design, testing, and review aspects of this project. There were separate software deliverables. CRI implemented a general-purpose scheduling system as per the design specifications. LLNL ported the local gang scheduler software to the LLNL Cray T3D. In this approach, processors are allocated simultaneously to all components of a parallel program (in a "gang"). Program execution is preempted as needed to provide for interactivity. Programs are also relocated to different processors as needed to efficiently pack the computer's torus of processors. In phase one, CRI developed an interface specification after discussions with LLNL for system-level software supporting a time- and space-sharing environment on the LLNL T3D. The two parties also discussed interface specifications for external control tools (such as scheduling policy tools and system administration tools) and applications programs. CRI assumed responsibility for the writing and implementation of all the necessary system software in this phase. In phase two, CRI implemented job-rolling on the Cray T3D, a mechanism for preempting a program, saving its state to disk, and later restoring its state to memory for continued execution. LLNL ported its gang scheduler to the LLNL T3D utilizing the CRI interface implemented in phases one and two. During phase three, the functionality and effectiveness of the LLNL gang scheduler was assessed to provide input to CRI time- and space-sharing efforts. CRI will utilize this information in the development of general schedulers suitable for other sites and future architectures.

  15. An Atmospheric General Circulation Model with Chemistry for the CRAY T3E: Design, Performance Optimization and Coupling to an Ocean Model

    NASA Technical Reports Server (NTRS)

    Farrara, John D.; Drummond, Leroy A.; Mechoso, Carlos R.; Spahr, Joseph A.

    1998-01-01

    The design, implementation, and performance optimization on the CRAY T3E of an atmospheric general circulation model (AGCM) which includes the transport of, and chemical reactions among, an arbitrary number of constituents is reviewed. The parallel implementation is based on a two-dimensional (longitude and latitude) data domain decomposition. Initial optimization efforts centered on minimizing the impact of substantial static and weakly-dynamic load imbalances among processors through load redistribution schemes. Recent optimization efforts have centered on single-node optimization. Strategies employed include loop unrolling, both manually and through the compiler, the use of an optimized assembler-code library for special function calls, and restructuring of parts of the code to improve data locality. Data exchanges and synchronizations involved in coupling different data-distributed models can account for a significant fraction of the running time. Therefore, the required scattering and gathering of data must be optimized. In systems such as the T3E, there is much more aggregate bandwidth in the total system than in any particular processor. This suggests a distributed design. The design and implementation of such a distributed 'Data Broker' as a means to efficiently couple the components of our climate system model are described.

  16. A parallel finite-difference method for computational aerodynamics

    NASA Technical Reports Server (NTRS)

    Swisshelm, Julie M.

    1989-01-01

    A finite-difference scheme for solving complex three-dimensional aerodynamic flow on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed.

  17. P-HARP: A parallel dynamic spectral partitioner

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sohn, A.; Biswas, R.; Simon, H.D.

    1997-05-01

    Partitioning unstructured graphs is central to the parallel solution of problems in computational science and engineering. The authors have introduced earlier the sequential version of an inertial spectral partitioner called HARP, which maintains the quality of recursive spectral bisection (RSB) while forming the partitions an order of magnitude faster than RSB. The serial HARP is known to be the fastest spectral partitioner to date, three to four times faster than similar partitioners on a variety of meshes. This paper presents a parallel version of HARP, called P-HARP. Two types of parallelism have been exploited: loop-level parallelism and recursive parallelism. P-HARP has been implemented in MPI on the SGI/Cray T3E and the IBM SP2. Experimental results demonstrate that P-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.25 seconds on a 64-processor T3E. Experimental results further show that P-HARP can give nearly a 20-fold speedup on 64 processors. These results indicate that graph partitioning is no longer a major bottleneck that hinders the advancement of computational science and engineering for dynamically-changing real-world applications.

  18. Evaluation of a Fully 3-D Bpf Method for Small Animal PET Images on Mimd Architectures

    NASA Astrophysics Data System (ADS)

    Bevilacqua, A.

    Positron Emission Tomography (PET) images can be reconstructed using Fourier transform methods. This paper describes the performance of a fully 3-D Backprojection-Then-Filter (BPF) algorithm on the Cray T3E machine and on a cluster of workstations. PET reconstruction of small animals is a class of problems characterized by poor counting statistics. The low-count nature of these studies necessitates 3-D reconstruction in order to improve the sensitivity of the PET system: by including axially oblique Lines Of Response (LORs), the sensitivity of the system can be significantly improved by the 3-D acquisition and reconstruction. The BPF method is widely used in clinical studies because of its speed and easy implementation. Moreover, the BPF method is suitable for on-line 3-D reconstruction as it does not need any sinogram or rearranged data. In order to investigate the possibility of on-line processing, we reconstruct a phantom using the data stored in list-mode format by the data acquisition system. We show how the intrinsically parallel nature of the BPF method makes it suitable for on-line reconstruction on a MIMD system such as the Cray T3E. Lastly, we analyze the performance of this algorithm on a cluster of workstations.
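
    The "intrinsically parallel nature" of BPF comes from the backprojection step being independent per event: each list-mode event simply increments the voxels along its LOR, so events can be distributed across MIMD nodes. A 2-D toy sketch with hand-made LORs, not a real scanner geometry:

```python
# Backprojection of list-mode events: each event increments every pixel
# along its line of response (LOR). Events are mutually independent,
# which is what makes the step embarrassingly parallel. Toy 2-D grid and
# hand-made LORs, not real scanner data.

def backproject(events, shape):
    nx, ny = shape
    image = [[0.0] * ny for _ in range(nx)]
    for lor in events:                  # each event: a list of (x, y) pixels
        for x, y in lor:
            image[x][y] += 1.0
    return image

if __name__ == "__main__":
    # Two LORs crossing at pixel (1, 1) on a 3x3 grid
    row = [(1, y) for y in range(3)]
    col = [(x, 1) for x in range(3)]
    img = backproject([row, col], (3, 3))
    print(img[1][1], img[0][0])  # prints: 2.0 0.0
```

    In a parallel run, each node would backproject its share of the event stream into a private grid, and the grids would be summed at the end, with no sinogram or data rearrangement required.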

  19. A Portable Parallel Implementation of the U.S. Navy Layered Ocean Model

    DTIC Science & Technology

    1995-01-01

    Wallcraft, A. J., PhD (I.C. 1981), Planning Systems Inc., and Moore, D. R., PhD (Camb. 1971), IC Dept. of Mathematics. Presented at the 1° Encontro de Métodos Numéricos para Equações de Derivadas Parciais. [Only slide fragments of this record survive extraction; the recoverable chip/machine pairs are DEC Alpha (Cray T3D/E), SUN Sparc (Fujitsu AP1000), and Intel 860 (Paragon), with Kendall Square and hypercube machines also mentioned.]

  20. A Spectral Element Ocean Model on the Cray T3D: the interannual variability of the Mediterranean Sea general circulation

    NASA Astrophysics Data System (ADS)

    Molcard, A. J.; Pinardi, N.; Ansaloni, R.

    A new numerical model, SEOM (Spectral Element Ocean Model, (Iskandarani et al., 1994)), has been implemented in the Mediterranean Sea. Spectral element methods combine the geometric flexibility of finite element techniques with the rapid convergence rate of spectral schemes. The current version solves the shallow water equations with a fifth (or sixth) order accurate spectral scheme and about 50,000 nodes. The domain decomposition philosophy makes it possible to exploit the power of parallel machines. The original MIMD master/slave version of SEOM, written in F90 and PVM, has been ported to the Cray T3D. When critical for performance, Cray-specific high-performance one-sided communication routines (SHMEM) have been adopted to fully exploit the Cray T3D interprocessor network. Tests performed with a highly unstructured and irregular grid, on up to 128 processors, show almost linear scalability even with unoptimized domain decomposition techniques. Results from various case studies on the Mediterranean Sea are shown, involving realistic coastline geometry and monthly mean 1000 mb winds from the ECMWF's atmospheric model operational analysis for the period January 1987 to December 1994. The simulation results show that variability in the wind forcing considerably affects the circulation dynamics of the Mediterranean Sea.

  1. Parallelization of the FLAPW method

    NASA Astrophysics Data System (ADS)

    Canning, A.; Mannstadt, W.; Freeman, A. J.

    2000-08-01

    The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.

  2. Tensorial Basis Spline Collocation Method for Poisson's Equation

    NASA Astrophysics Data System (ADS)

    Plagne, Laurent; Berthou, Jean-Yves

    2000-01-01

    This paper aims to describe the tensorial basis spline collocation method applied to Poisson's equation. In the case of a localized 3D charge distribution in vacuum, this direct method based on a tensorial decomposition of the differential operator is shown to be competitive with both iterative BSCM and FFT-based methods. We emphasize the O(h^4) and O(h^6) convergence of TBSCM for cubic and quintic splines, respectively. We describe the implementation of this method on a distributed memory parallel machine. Performance measurements on a Cray T3E are reported. Our code exhibits high performance and good scalability: as an example, a 27 Gflops performance is obtained when solving Poisson's equation on a 256^3 non-uniform 3D Cartesian mesh by using 128 T3E-750 processors. This represents 215 Mflops per processor.
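
    Convergence rates like the O(h^4) and O(h^6) quoted above are checked empirically from errors at two mesh spacings via p = log(e1/e2)/log(h1/h2). A sketch with synthetic errors chosen to mimic the cubic-spline case (not data from the paper):

```python
# Empirical convergence order from errors at two mesh spacings:
# p = log(e1 / e2) / log(h1 / h2). Errors below are synthetic, chosen so
# that halving h divides the error by 16, i.e. the O(h^4) cubic case.
import math

def observed_order(h1, e1, h2, e2):
    return math.log(e1 / e2) / math.log(h1 / h2)

if __name__ == "__main__":
    h1, h2 = 0.1, 0.05
    e1, e2 = 2.0e-6, 1.25e-7          # e ~ C * h**4
    print(round(observed_order(h1, e1, h2, e2), 2))  # prints: 4.0
```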

  3. Applications Performance Under MPL and MPI on NAS IBM SP2

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)

    1994-01-01

    On July 5, 1994, an IBM Scalable POWERparallel System (IBM SP2) with 64 nodes was installed at the Numerical Aerodynamic Simulation (NAS) Facility. Each node of the NAS IBM SP2 is a "wide node" consisting of a RISC System/6000 590 workstation module with a clock of 66.5 MHz which can perform four floating point operations per clock, for a peak performance of 266 Mflop/s. By the end of 1994, the 64 nodes of the IBM SP2 will be upgraded to 160 nodes with a peak performance of 42.5 Gflop/s. An overview of the IBM SP2 hardware is presented. A basic understanding of the architectural details of the RS/6000 590 will help application scientists in porting, optimizing, and tuning codes from other machines, such as the CRAY C90 and the Paragon, to the NAS SP2. Optimization techniques such as quad-word loading, effective utilization of the two floating point units, and data cache optimization of the RS/6000 590 are illustrated, with examples giving performance gains at each optimization step. The conversion of codes using Intel's message passing library NX to codes using the native Message Passing Library (MPL) and the Message Passing Interface (MPI) library available on the IBM SP2 is illustrated. In particular, we present the performance of the Fast Fourier Transform (FFT) kernel from the NAS Parallel Benchmarks (NPB) under MPL and MPI. We have also optimized some of the Fortran BLAS 2 and BLAS 3 routines, e.g., the optimized Fortran DAXPY runs at 175 Mflop/s and the optimized Fortran DGEMM runs at 230 Mflop/s per node. The performance of the NPB (Class B) on the IBM SP2 is compared with the CRAY C90, Intel Paragon, TMC CM-5E, and the CRAY T3D.
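
    The 266 Mflop/s peak quoted above follows directly from the clock and issue width (66.5 MHz times 4 flops per cycle), which also puts the quoted optimized DAXPY and DGEMM rates in context as fractions of peak:

```python
# Peak = clock * flops-per-cycle for the "wide node":
# 66.5 MHz x 4 flops/cycle = 266 Mflop/s. The optimized DAXPY
# (175 Mflop/s) and DGEMM (230 Mflop/s) rates from the abstract are
# then expressed as percentages of that peak.

clock_mhz = 66.5
flops_per_cycle = 4
peak = clock_mhz * flops_per_cycle
print(peak)                                   # prints: 266.0
for name, rate in (("DAXPY", 175), ("DGEMM", 230)):
    print(name, round(rate / peak * 100), "% of peak")
```

    The gap between the two (about 66% versus 86% of peak) reflects DAXPY's memory-bound nature against DGEMM's cache-friendly reuse, which is why the data cache optimizations above matter.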

  4. An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

    NASA Technical Reports Server (NTRS)

    Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

    1996-01-01

    We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.

  5. Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

    1997-01-01

    We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.

  6. Automated Generation of Message-Passing Programs: An Evaluation Using CAPTools

    NASA Technical Reports Server (NTRS)

    Hribar, Michelle R.; Jin, Haoqiang; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    Scientists at NASA Ames Research Center have been developing computational aeroscience applications on highly parallel architectures over the past ten years. During that same time period, a steady transition of hardware and system software also occurred, forcing us to expend great effort migrating and re-coding our applications. As applications and machine architectures become increasingly complex, the cost and time required for this process will become prohibitive. In this paper, we present the first set of results in our evaluation of interactive parallelization tools. In particular, we evaluate CAPTools' ability to parallelize computational aeroscience applications. CAPTools was tested on serial versions of the NAS Parallel Benchmarks and ARC3D, a computational fluid dynamics application, on two platforms: the SGI Origin 2000 and the Cray T3E. This evaluation covers performance, the amount of user interaction required, limitations, and portability. Based on these results, a discussion of the feasibility of computer-aided parallelization of aerospace applications is presented, along with suggestions for future work.

  7. Studying an Eulerian Computer Model on Different High-performance Computer Platforms and Some Applications

    NASA Astrophysics Data System (ADS)

    Georgiev, K.; Zlatev, Z.

    2010-11-01

    The Danish Eulerian Model (DEM) is an Eulerian model for studying the transport of air pollutants on a large scale. Originally, the model was developed at the National Environmental Research Institute of Denmark. The model's computational domain covers Europe and neighbouring parts of the Atlantic Ocean, Asia, and Africa. If the DEM is applied on fine grids, its discretization leads to a huge computational problem, so a model such as DEM must be run on high-performance computer architectures. The implementation and tuning of such a complex large-scale model on each different computer is a non-trivial task. Here we present comparison results from running this model on different kinds of vector computers (CRAY C92A, Fujitsu, etc.), parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, Macintosh G4 clusters, etc.), parallel computers with shared memory (SGI Origin, SUN, etc.), and parallel computers with two levels of parallelism (IBM SMP, IBM BlueGene/P, clusters of multiprocessor nodes, etc.). The main idea in the parallel version of DEM is a domain-partitioning approach. The effective use of the cache and hierarchical memories of modern computers, as well as the performance, speed-ups, and efficiency achieved, are discussed. The parallel code of DEM, created using the standard MPI library, appears to be highly portable and shows good efficiency and scalability on different kinds of vector and parallel computers. Some important applications of the computer model output are briefly presented.
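
    The domain-partitioning idea behind a parallel code like DEM can be illustrated by the usual block distribution of one grid dimension among MPI ranks: each rank owns a contiguous slab, with any remainder rows spread one-per-rank. The helper below is a hypothetical sketch, not DEM source.

```python
def block_range(n, nprocs, rank):
    """Return the half-open [start, stop) range of n grid rows owned by
    `rank` when the rows are split as evenly as possible among nprocs."""
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# e.g. 10 rows over 4 ranks -> (0, 3), (3, 6), (6, 8), (8, 10)
ranges = [block_range(10, 4, r) for r in range(4)]
```

In an MPI code each rank would call this once with its own rank number and then exchange only the boundary rows of its slab with its neighbours.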

  8. Implementation of Helioseismic Data Reduction and Diagnostic Techniques on Massively Parallel Architectures

    NASA Technical Reports Server (NTRS)

    Korzennik, Sylvain

    1997-01-01

    Under the direction of Dr. Rhodes, and the technical supervision of Dr. Korzennik, the data assimilation of high spatial resolution solar dopplergrams has been carried out throughout the program on the Intel Delta Touchstone supercomputer. With the help of a research assistant, partially supported by this grant, and under the supervision of Dr. Korzennik, code development was carried out at SAO using various available resources. To ensure cross-platform portability, PVM was selected as the message passing library. A parallel implementation of power spectra computation for helioseismology data reduction, using PVM, was completed and successfully ported to SMP architectures (e.g., Sun) and to some MPP architectures (e.g., the CM-5). Due to limitations of the PVM implementation on the Cray T3D, the port to that architecture was not completed at the time.

  9. Heart Fibrillation and Parallel Supercomputers

    NASA Technical Reports Server (NTRS)

    Kogan, B. Y.; Karplus, W. J.; Chudin, E. E.

    1997-01-01

    The Luo-Rudy cardiac cell mathematical model is implemented on the Cray T3D parallel supercomputer. The splitting algorithm, combined with a variable time step and an explicit method of integration, provides reasonable solution times and almost perfect scaling for rectilinear wave propagation. The computer simulation makes it possible to observe new phenomena: the break-up of spiral waves caused by intracellular calcium dynamics, and the non-uniformity of the calcium distribution in space during the onset of the spiral wave.

  10. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  11. Parallelization of the FLAPW method and comparison with the PPW method

    NASA Astrophysics Data System (ADS)

    Canning, Andrew; Mannstadt, Wolfgang; Freeman, Arthur

    2000-03-01

    The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining electronic and magnetic properties of crystals and surfaces. In the past the FLAPW method has been limited to systems of about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell running on up to 512 processors on a Cray T3E parallel supercomputer. Some results will also be presented on a comparison of the plane-wave pseudopotential method and the FLAPW method on large systems.

  12. RISC Processors and High Performance Computing

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

    1995-01-01

    In this tutorial, we will discuss the top five current RISC microprocessors: the IBM Power2, which is used in the IBM RS6000/590 workstation and in the IBM SP2 parallel supercomputer; the DEC Alpha, which is used in the DEC Alpha workstation and in the Cray T3D; the MIPS R8000, which is used in the SGI Power Challenge; the HP PA-RISC 7100, which is used in the HP 700 series workstations and in the Convex Exemplar; and the Cray proprietary processor, which is used in the new Cray J916. The architecture of these microprocessors will first be presented. The effective performance of these processors will then be compared, both by citing standard benchmarks and in the context of implementing a real application. In the process, different programming models such as data parallel (CM Fortran and HPF) and message passing (PVM and MPI) will be introduced and compared. The latest NAS Parallel Benchmark (NPB) absolute performance and performance per dollar figures will be presented. The next generation of the NPB will also be described. The tutorial will conclude with a discussion of general trends in the field of high performance computing, including likely future developments in hardware and software technology, and the relative roles of vector supercomputers, tightly coupled parallel computers, and clusters of workstations. This tutorial will provide a unique cross-machine comparison not available elsewhere.

  13. Magnetosphere simulations with a high-performance 3D AMR MHD Code

    NASA Astrophysics Data System (ADS)

    Gombosi, Tamas; Dezeeuw, Darren; Groth, Clinton; Powell, Kenneth; Song, Paul

    1998-11-01

    BATS-R-US is a high-performance 3D AMR MHD code for space physics applications running on massively parallel supercomputers. In BATS-R-US the electromagnetic and fluid equations are solved with a high-resolution upwind numerical scheme in a tightly coupled manner. The code is very robust and is capable of spanning a wide range of plasma parameters (such as β, acoustic and Alfvénic Mach numbers). Our code is highly scalable: it achieved a sustained performance of 233 GFLOPS on a Cray T3E-1200 supercomputer with 1024 PEs. This talk reports results from the BATS-R-US code for the GGCM (Geospace General Circulation Model) Phase 1 Standard Model Suite. This model suite contains 10 different steady-state configurations: 5 IMF clock angles (north, south, and three equally spaced angles in between) with 2 IMF field strengths for each angle (5 nT and 10 nT). The other parameters are: solar wind speed = 400 km/sec; solar wind number density = 5 protons/cc; Hall conductance = 0; Pedersen conductance = 5 S; parallel conductivity = ∞.

  14. TOUGH2_MP: A parallel version of TOUGH2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Keni; Wu, Yu-Shu; Ding, Chris

    2003-04-09

    TOUGH2_MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large problems that may not be solvable by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme while preserving the full capability and flexibility of the original TOUGH2 code. The software uses the METIS package for grid partitioning and the AZTEC package for solving the linear equations. The standard message-passing interface (MPI) is adopted for communication among processors. The numerical performance of the current code has been tested on CRAY T3E and IBM RS/6000 SP platforms. In addition, the parallel code has been successfully applied to real field problems involving multi-million-cell simulations of three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we review the development of TOUGH2_MP and discuss its basic features, modules, and their applications.

  15. The SGI/CRAY T3E: Experiences and Insights

    NASA Technical Reports Server (NTRS)

    Bernard, Lisa Hamet

    1999-01-01

    The focus of the HPCC Earth and Space Sciences (ESS) Project is capability computing - pushing highly scalable computing testbeds to their performance limits. The drivers of this focus are the Grand Challenge problems in Earth and space science: those that could not be addressed in a capacity computing environment where large jobs must continually compete for resources. These Grand Challenge codes require a high degree of communication, large memory, and very large I/O (throughout the duration of the processing, not just in loading initial conditions and saving final results). This set of parameters led to the selection of an SGI/Cray T3E as the current ESS Computing Testbed. The T3E at the Goddard Space Flight Center is a unique computational resource within NASA. As such, it must be managed to effectively support the diverse research efforts across the NASA research community yet still enable the ESS Grand Challenge Investigator teams to achieve their performance milestones, for which the system was intended. To date, all Grand Challenge Investigator teams have achieved the 10 GFLOPS milestone, eight of nine have achieved the 50 GFLOPS milestone, and three have achieved the 100 GFLOPS milestone. In addition, many technical papers have been published highlighting results achieved on the NASA T3E, including some at this Workshop. The successes enabled by the NASA T3E computing environment are best illustrated by the 512 PE upgrade funded by the NASA Earth Science Enterprise earlier this year. Never before has an HPCC computing testbed been so well received by the general NASA science community that it was deemed critical to the success of a core NASA science effort. NASA looks forward to many more success stories before the conclusion of the NASA-SGI/Cray cooperative agreement in June 1999.

  16. Geopotential Error Analysis from Satellite Gradiometer and Global Positioning System Observables on Parallel Architecture

    NASA Technical Reports Server (NTRS)

    Schutz, Bob E.; Baker, Gregory A.

    1997-01-01

    The recovery of a high resolution geopotential from satellite gradiometer observations motivates the examination of high performance computational techniques. The primary subject matter addresses specifically the use of satellite gradiometer and GPS observations to form and invert the normal matrix associated with a large degree and order geopotential solution. Memory-resident and out-of-core parallel linear algebra techniques, along with data parallel batch algorithms, form the foundation of the least squares application structure. A secondary topic includes the adoption of object oriented programming techniques to enhance modularity and reusability of code. Applications implementing the parallel and object oriented methods successfully calculate the degree variance for a degree and order 110 geopotential solution on 32 processors of the Cray T3E. The memory-resident gradiometer application exhibits an overall performance of 5.4 Gflops, and the out-of-core linear solver exhibits an overall performance of 2.4 Gflops. The combination solution derived from a sun-synchronous gradiometer orbit produces average geoid height variances of 17 millimeters.
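
    Forming and inverting the normal matrix is the core of the least-squares structure described above. Below is a minimal dense sketch on a toy system; the actual solution is degree and order 110 and distributed across processors, and the names here are illustrative only.

```python
import numpy as np

def normal_equations_solve(A, b):
    """Least-squares solve via the normal matrix N = A^T A: form N and
    the right-hand side A^T b, then factor N with Cholesky."""
    N = A.T @ A                          # symmetric positive definite
    rhs = A.T @ b
    L = np.linalg.cholesky(N)            # N = L L^T
    y = np.linalg.solve(L, rhs)          # forward substitution
    return np.linalg.solve(L.T, y)       # back substitution

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 6))         # tall, full-rank design matrix
x_true = np.arange(1.0, 7.0)
x = normal_equations_solve(A, A @ x_true)
```

In the out-of-core variant the blocks of N are accumulated tile by tile from disk before the Cholesky factorization, which is why matching I/O to the factorization order matters.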

  17. Geopotential error analysis from satellite gradiometer and global positioning system observables on parallel architectures

    NASA Astrophysics Data System (ADS)

    Baker, Gregory Allen

    The recovery of a high resolution geopotential from satellite gradiometer observations motivates the examination of high performance computational techniques. The primary subject matter addresses specifically the use of satellite gradiometer and GPS observations to form and invert the normal matrix associated with a large degree and order geopotential solution. Memory-resident and out-of-core parallel linear algebra techniques, along with data parallel batch algorithms, form the foundation of the least squares application structure. A secondary topic includes the adoption of object oriented programming techniques to enhance modularity and reusability of code. Applications implementing the parallel and object oriented methods successfully calculate the degree variance for a degree and order 110 geopotential solution on 32 processors of the Cray T3E. The memory-resident gradiometer application exhibits an overall performance of 5.4 Gflops, and the out-of-core linear solver exhibits an overall performance of 2.4 Gflops. The combination solution derived from a sun-synchronous gradiometer orbit produces average geoid height variances of 17 millimeters.

  18. Climate Ocean Modeling on a Beowulf Class System

    NASA Technical Reports Server (NTRS)

    Cheng, B. N.; Chao, Y.; Wang, P.; Bondarenko, M.

    2000-01-01

    With the growing power and shrinking cost of personal computers, the availability of fast Ethernet interconnections, and public domain software packages, it is now possible to combine them to build desktop parallel computers (named Beowulf or PC clusters) at a fraction of what it would cost to buy systems of comparable power from supercomputer companies. This led us to build and assemble our own system, specifically for climate ocean modeling. In this article, we present our experience with such a system, discuss its network performance, and provide performance comparison data with both the HP SPP2000 and the Cray T3E for an ocean model used in present-day oceanographic research.

  19. FFTs in external or hierarchical memory

    NASA Technical Reports Server (NTRS)

    Bailey, David H.

    1989-01-01

    A description is given of advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) use strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly 2 Gflops.
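
    A standard building block for external-memory FFTs of the kind described here is the four-step factorization N = N1*N2, which turns one long transform into two sets of short transforms plus a twiddle-factor scaling and a transpose, exactly the structure that maps onto long unit-stride block transfers. A NumPy sketch of the textbook algorithm (not the paper's Cray code) follows.

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Compute the length n1*n2 DFT of x via the four-step algorithm."""
    n = n1 * n2
    A = x.reshape(n2, n1).T              # A[j1, j2] = x[j1 + n1*j2]
    B = np.fft.fft(A, axis=1)            # step 1: n1 FFTs of length n2
    j1 = np.arange(n1)[:, None]
    k2 = np.arange(n2)[None, :]
    B = B * np.exp(-2j * np.pi * j1 * k2 / n)   # step 2: twiddle factors
    C = np.fft.fft(B, axis=0)            # steps 3-4: n2 FFTs of length n1
    return C.reshape(n)                  # X[k2 + n2*k1] = C[k1, k2]

rng = np.random.default_rng(2)
x = rng.standard_normal(32) + 1j * rng.standard_normal(32)
X = four_step_fft(x, 4, 8)
```

On a hierarchical-memory machine each set of short FFTs fits in fast memory, so the external data set is traversed only a small number of times.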

  20. Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP

    NASA Technical Reports Server (NTRS)

    Chan, Tony F.; Fatoohi, Rod A.

    1990-01-01

    The results of multitasking implementation of a domain decomposition fast Poisson solver on eight processors of the Cray Y-MP are presented. The object of this research is to study the performance of domain decomposition methods on a Cray supercomputer and to analyze the performance of different multitasking techniques using highly parallel algorithms. Two implementations of multitasking are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). A conventional FFT-based fast Poisson solver is also multitasked. The results of different implementations are compared and analyzed. A speedup of over 7.4 on the Cray Y-MP running in a dedicated environment is achieved for all cases.

  1. Application of high-performance computing to numerical simulation of human movement

    NASA Technical Reports Server (NTRS)

    Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.

    1995-01-01

    We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.

  2. Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew; Allen, Edwin (Technical Monitor)

    1998-01-01

    Viscous calculations about geometrically complex bodies in which there is relative motion between component parts are among the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first part of the presentation focuses on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multiprocessors. The second part of the presentation covers recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed-memory multiprocessor computer architectures is reported.

  3. Testing of PVODE, a parallel ODE solver

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wittman, M.R.

    1996-08-09

    The purpose of this paper is to discuss the issues involved with, and the results from, testing of two example programs that use PVODE: pvkx and pvnx. These two programs are intended to provide a template for users to follow when writing their own code. However, we also used them (primarily pvkx) to do performance testing and visualization. This work was done on a Cray T3D, a Sparc 10, and a Sparc 5.

  4. OPAL: An Open-Source MPI-IO Library over Cray XT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yu, Weikuan; Vetter, Jeffrey S; Canon, Richard Shane

    Parallel IO over the Cray XT is supported by a vendor-supplied MPI-IO package. This package contains a proprietary ADIO implementation built on top of the sysio library. While it is reasonable to maintain a stable code base for application scientists' convenience, it is also very important for system developers and researchers to analyze and assess the effectiveness of the parallel IO software and, accordingly, tune and optimize the MPI-IO implementation. A proprietary parallel IO code base relinquishes such flexibility. On the other hand, a generic UFS-based MPI-IO implementation is typically used on many Linux-based platforms. We have developed an open-source MPI-IO package over Lustre, referred to as OPAL (OPportunistic and Adaptive MPI-IO Library over Lustre). OPAL provides a single source-code base for MPI-IO over Lustre on Cray XT and Linux platforms. Compared to the Cray implementation, OPAL provides a number of good features, including arbitrary specification of striping patterns and Lustre-stripe-aligned file domain partitioning. This paper presents performance comparisons between OPAL and Cray's proprietary implementation. Our evaluation demonstrates that OPAL achieves performance comparable to the Cray implementation. We also exemplify the benefits of an open-source package in revealing the underpinnings of parallel IO performance.
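
    Stripe-aligned file domain partitioning can be sketched as rounding an even byte-range split down to stripe boundaries, so that no two writers touch the same stripe (and hence contend for the same lock). The function below is a hypothetical illustration of the concept, not OPAL source.

```python
def stripe_aligned_domains(file_size, nprocs, stripe):
    """Split [0, file_size) into nprocs contiguous byte ranges whose
    interior boundaries are rounded down to multiples of the Lustre
    stripe size. Assumes file_size is large relative to nprocs*stripe."""
    bounds = [i * file_size // nprocs for i in range(nprocs + 1)]
    aligned = [b // stripe * stripe for b in bounds]   # snap down
    aligned[0], aligned[-1] = 0, file_size             # keep full extent
    return list(zip(aligned[:-1], aligned[1:]))

# e.g. a 10 MB file, 4 writers, 1 MB stripes
domains = stripe_aligned_domains(10 * 2**20, 4, 2**20)
```

With the interior boundaries on stripe multiples, each stripe is written by exactly one process, while the slightly uneven domain sizes are the price of alignment.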

  5. Force user's manual: A portable, parallel FORTRAN

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.

    1990-01-01

    The use of Force, a portable, parallel FORTRAN, on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray Y-MP, Convex 220, Flex/32, Encore, Sequent, and Alliant computers on which it is installed.

  6. CRAY mini manual. Revision D

    NASA Technical Reports Server (NTRS)

    Tennille, Geoffrey M.; Howser, Lona M.

    1993-01-01

    This document briefly describes the use of the CRAY supercomputers that are an integral part of the Supercomputing Network Subsystem of the Central Scientific Computing Complex at LaRC. Features of the CRAY supercomputers are covered, including: FORTRAN, C, PASCAL, architectures of the CRAY-2 and CRAY Y-MP, the CRAY UNICOS environment, batch job submittal, debugging, performance analysis, parallel processing, utilities unique to CRAY, and documentation. The document is intended for all CRAY users as a ready reference to frequently asked questions and to more detailed information contained in the vendor manuals. It is appropriate for both the novice and the experienced user.

  7. Multitasking the three-dimensional transport code TORT on CRAY platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azmy, Y.Y.; Barnett, D.A.; Burre, C.A.

    1996-04-01

    The multitasking options in the three-dimensional neutral particle transport code TORT, originally implemented for Cray's CTSS operating system, are revived and extended to run on Cray Y-MP and C90 computers using the UNICOS operating system. These include two coarse-grained domain decompositions: across octants, and across directions within an octant, termed Octant Parallel (OP) and Direction Parallel (DP), respectively. Parallel performance of the DP is significantly enhanced by increasing the task grain size and reducing load imbalance via dynamic scheduling of the discrete angles among the participating tasks. Substantial wall-clock speedup factors, approaching 4.5 using 8 tasks, have been measured in a time-sharing environment, and generally depend on the test problem specifications, number of tasks, and machine loading during execution.

  8. Vectorized and multitasked solution of the few-group neutron diffusion equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zee, S.K.; Turinsky, P.J.; Shayer, Z.

    1989-03-01

    A numerical algorithm with parallelism was used to solve the two-group, multidimensional neutron diffusion equations on computers characterized by shared memory, vector pipeline, and multi-CPU architecture features. Specifically, solutions were obtained on the Cray X-MP/48, the IBM 3090 with vector facilities, and the FPS-164. The material-centered mesh finite difference method approximation and outer-inner iteration method were employed. Parallelism was introduced in the inner iterations using the cyclic line successive overrelaxation iterative method and solving in parallel across lines. The outer iterations were completed using the Chebyshev semi-iterative method, which allows parallelism to be introduced in both space and energy groups. For the three-dimensional model, power, soluble boron, and transient fission product feedbacks were included. Concentrating on the pressurized water reactor (PWR), the thermal-hydraulic calculation of moderator density assumed single-phase flow and a closed flow channel, allowing parallelism to be introduced in the solution across the radial plane. Using a pinwise-detail, quarter-core model of a typical PWR in cycle 1, for the two-dimensional model without feedback, the measured million floating point operations per second (MFLOPS)/vector speedups were 83/11.7, 18/2.2, and 2.4/5.6 on the Cray, IBM, and FPS without multitasking, respectively. Lower performance was observed with a coarser mesh, i.e., shorter vector length, due to vector pipeline start-up. For an 18 x 18 x 30 (x-y-z) three-dimensional model with feedback of the same core, MFLOPS/vector speedups of about 61/6.7 and an execution time of 0.8 CPU seconds on the Cray without multitasking were measured. Finally, using two CPUs and the vector pipelines of the Cray, a multitasking efficiency of 81% was noted for the three-dimensional model.
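
    The cyclic line SOR used in the inner iterations is a relative of plain point SOR, which the model-problem sketch below illustrates on a 5-point Poisson discretization. This is illustrative only, not the reactor code: it sweeps points rather than lines, and the relaxation factor `omega` is a generic choice.

```python
import numpy as np

def sor_poisson(f, h, omega=1.5, iters=500):
    """Point-SOR sweeps for the 5-point discretization of -Lap(u) = f on
    the unit square with u = 0 on the boundary."""
    u = np.zeros_like(f)
    n = f.shape[0]
    for _ in range(iters):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                gs = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1]
                             + h * h * f[i, j])     # Gauss-Seidel value
                u[i, j] = (1 - omega) * u[i, j] + omega * gs

    return u

n = 17                                   # grid points per side, h = 1/16
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
f = 2 * np.pi**2 * np.sin(np.pi * X) * np.sin(np.pi * Y)
u = sor_poisson(f, h)
err = np.max(np.abs(u - np.sin(np.pi * X) * np.sin(np.pi * Y)))
```

Line variants update one whole grid line per solve, which both vectorizes well and, with cyclic ordering, exposes the across-line parallelism the abstract exploits.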

  9. Parallel computation in a three-dimensional elastic-plastic finite-element analysis

    NASA Technical Reports Server (NTRS)

    Shivakumar, K. N.; Bigelow, C. A.; Newman, J. C., Jr.

    1992-01-01

    A CRAY parallel processing technique called autotasking was implemented in a three-dimensional elasto-plastic finite-element code. The technique was evaluated on two CRAY supercomputers, a CRAY 2 and a CRAY Y-MP. Autotasking was implemented in all major portions of the code, except the matrix equations solver. Compiler directives alone were not able to properly multitask the code; user-inserted directives were required to achieve better performance. It was noted that the connect time, rather than wall-clock time, was more appropriate to determine speedup in multiuser environments. For a typical example problem, a speedup of 2.1 (1.8 when the solution time was included) was achieved in a dedicated environment and 1.7 (1.6 with solution time) in a multiuser environment on a four-processor CRAY 2 supercomputer. The speedup on a three-processor CRAY Y-MP was about 2.4 (2.0 with solution time) in a multiuser environment.

  10. HARP: A Dynamic Inertial Spectral Partitioner

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; Sohn, Andrew; Biswas, Rupak

    1997-01-01

    Partitioning unstructured graphs is central to the parallel solution of computational science and engineering problems. Spectral partitioners, such as recursive spectral bisection (RSB), have proven effective in generating high-quality partitions of realistically-sized meshes. The major problem that hindered their widespread use was their long execution times. This paper presents a new inertial spectral partitioner, called HARP. The main objective of the proposed approach is to quickly partition the meshes at runtime in a manner that works efficiently for real applications in the context of distributed-memory machines. The underlying principle of HARP is to find the eigenvectors of the unpartitioned vertices and then project them onto the eigenvectors of the original mesh. Results for various meshes ranging in size from 1,000 to 100,000 vertices indicate that HARP can indeed partition meshes rapidly at runtime. Experimental results show that our largest mesh can be partitioned sequentially in only a few seconds on an SP2, which is several times faster than other spectral partitioners, while maintaining the solution quality of the proven RSB method. A parallel MPI version of HARP has also been implemented on the IBM SP2 and Cray T3E. Parallel HARP, running on 64 processors of the SP2 and T3E, can partition a mesh containing more than 100,000 vertices into 64 subgrids in about half a second. These results indicate that graph partitioning can now be truly embedded in dynamically-changing real-world applications.
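The inertial half of an inertial partitioner can be illustrated on plain 2-D vertex coordinates: compute the principal inertia axis of the vertex cloud and bisect at the median projection. This sketch shows only that inertial-bisection idea; HARP itself also uses eigenvectors of the mesh, which is not reproduced here, and the example point set is invented.

```python
import math

def inertial_bisect(points):
    """Split 2-D mesh vertices into two equal halves across the principal
    inertia axis (illustrative of the inertial step only)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Entries of the 2x2 inertia (covariance) matrix about the centroid.
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Principal-axis angle from the closed-form 2x2 eigen-decomposition.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ax, ay = math.cos(theta), math.sin(theta)
    # Project every vertex onto the axis and split at the median projection.
    order = sorted(range(n),
                   key=lambda i: (points[i][0] - cx) * ax + (points[i][1] - cy) * ay)
    half = n // 2
    return set(order[:half]), set(order[half:])

# A thin diagonal strip of 10 vertices splits into two halves of 5.
left, right = inertial_bisect([(float(i), 0.1 * i) for i in range(10)])
```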

  11. Transferring ecosystem simulation codes to supercomputers

    NASA Technical Reports Server (NTRS)

    Skiles, J. W.; Schulbach, C. H.

    1995-01-01

    Many ecosystem simulation computer codes have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Supercomputing platforms (both parallel and distributed systems) have been largely unused, however, because of the perceived difficulty in accessing and using the machines. Also, significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers must be considered. We have transferred a grassland simulation model (developed on a VAX) to a Cray Y-MP/C90. We describe porting the model to the Cray and the changes we made to exploit the parallelism in the application and improve code execution. The Cray executed the model 30 times faster than the VAX and 10 times faster than a Unix workstation. We achieved an additional speedup of 30 percent by using the compiler's vectorizing and 'in-line' capabilities. The code runs at only about 5 percent of the Cray's peak speed because it ineffectively uses the vector and parallel processing capabilities of the Cray. We expect that by restructuring the code, it could execute an additional six to ten times faster.

  12. Relativistic Collisions of Highly-Charged Ions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ionescu, Dorin; Belkacem, Ali

    1998-11-19

    The physics of elementary atomic processes in relativistic collisions between highly-charged ions and atoms or other ions is briefly discussed, and some recent theoretical and experimental results in this field are summarized. They include excitation, capture, ionization, and electron-positron pair creation. The numerical solution of the two-center Dirac equation in momentum space is shown to be a powerful nonperturbative method for describing atomic processes in relativistic collisions involving heavy and highly-charged ions. By propagating negative-energy wave packets in time, the evolution of the QED vacuum around heavy ions in relativistic motion is investigated. Recent results obtained from numerical calculations using massively parallel processing on the Cray-T3E supercomputer of the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory are presented.

  13. Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers

    NASA Technical Reports Server (NTRS)

    Skiles, J. W.; Schulbach, C. H.

    1994-01-01

    Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We show the Cray shared-memory vector-architecture, and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speed-up of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5% of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.

  14. Dynamic overset grid communication on distributed memory parallel processors

    NASA Technical Reports Server (NTRS)

    Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

    1993-01-01

    A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.

  15. Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Werner, N.E.; Van Matre, S.W.

    1985-05-01

    This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.

  16. Experiences with Cray multi-tasking

    NASA Technical Reports Server (NTRS)

    Miya, E. N.

    1985-01-01

    The issues involved in modifying an existing code for multitasking are explored. They include Cray extensions to FORTRAN, an examination of the application code under study, designing workable modifications, specific code modifications to the VAX and Cray versions, performance, and efficiency results. The finished product is a faster, fully synchronous, parallel version of the original program. A production program is partitioned by hand to run on two CPUs. Loop splitting multitasks three key subroutines. Simply dividing subroutine data and control structure down the middle of a subroutine is not safe; such a simple division produces results that are inconsistent with uniprocessor runs. The safest way to partition the code is to transfer one block of loops at a time and check the results of each on a test case. Other issues include debugging and performance. Task startup and maintenance (e.g., synchronization) are potentially expensive.
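The safe partitioning the abstract recommends, each task owning a disjoint block of loop iterations with an explicit synchronization point, and the result checked against a uniprocessor run, can be sketched with Python threads. The threads model only the task structure, not Cray speedups (CPython's GIL prevents true parallel execution), and the loop body is an invented stand-in.

```python
import threading

def body(x):
    # Stand-in for one iteration of a multitasked subroutine's loop.
    return x * x + 1.0

def parallel_loop(data, ntasks=2):
    """Loop splitting: divide one loop's index range across ntasks tasks,
    each writing a disjoint slice of the output, then join (the
    synchronization point) before using the result."""
    n = len(data)
    out = [0.0] * n
    def task(lo, hi):
        for i in range(lo, hi):
            out[i] = body(data[i])
    bounds = [(k * n // ntasks, (k + 1) * n // ntasks) for k in range(ntasks)]
    threads = [threading.Thread(target=task, args=b) for b in bounds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # all tasks must finish before results are read
    return out

# Check against the "uniprocessor run", as the abstract advises.
data = [float(i) for i in range(8)]
assert parallel_loop(data) == [body(x) for x in data]
```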

  17. Improvements to the Unstructured Mesh Generator MESH3D

    NASA Technical Reports Server (NTRS)

    Thomas, Scott D.; Baker, Timothy J.; Cliff, Susan E.

    1999-01-01

    The AIRPLANE process starts with an aircraft geometry stored in a CAD system. The surface is modeled with a mesh of triangles and then the flow solver produces pressures at surface points which may be integrated to find forces and moments. The biggest advantage is that the grid generation bottleneck of the CFD process is eliminated when an unstructured tetrahedral mesh is used. MESH3D is the key to turning around the first analysis of a CAD geometry in days instead of weeks. The flow solver part of AIRPLANE has proven to be robust and accurate over a decade of use at NASA. It has been extensively validated with experimental data and compares well with other Euler flow solvers. AIRPLANE has been applied to all the HSR geometries treated at Ames over the course of the HSR program in order to verify the accuracy of other flow solvers. The unstructured approach makes handling complete and complex geometries very simple because only the surface of the aircraft needs to be discretized, i.e. covered with triangles. The volume mesh is created automatically by MESH3D. AIRPLANE runs well on multiple platforms. Vectorization on the Cray Y-MP is reasonable for a code that uses indirect addressing. Massively parallel computers such as the IBM SP2, SGI Origin 2000, and the Cray T3E have been used with an MPI version of the flow solver and the code scales very well on these systems. AIRPLANE can run on a desktop computer as well. AIRPLANE has a future. The unstructured technologies developed as part of the HSR program are now targeting high Reynolds number viscous flow simulation. The pacing item in this effort is Navier-Stokes mesh generation.

  18. Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Saini, Subhash

    1997-01-01

    Compilers supporting High Performance Fortran (HPF) features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR), Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/programming model (HPF and MPI) combinations will be compared, based on the latest NAS Parallel Benchmark results, thus providing a cross-machine and cross-model comparison. Specifically, HPF-based NPB results will be compared with MPI-based NPB results to provide perspective on the performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition, we present NPB (Version 1.0) performance results for the following systems: DEC AlphaServer 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, and SGI Origin2000. We also present sustained performance per dollar for the Class B LU, SP, and BT benchmarks.

  19. Parallel processing a real code: A case history

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mandell, D.A.; Trease, H.E.

    1988-01-01

    A three-dimensional, time-dependent Free-Lagrange hydrodynamics code has been multitasked and autotasked on a Cray X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the Cray multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The 3-D algorithm has presented a number of problems that simpler algorithms, such as 1-D hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a Cray 1, to a multitasking code are discussed, as is autotasking of a rewritten version of the code. Timing results for subroutines and hot spots in the serial code are presented, and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given. 8 refs., 13 figs.
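The theoretical speedups mentioned come from Amdahl's law: if a fraction p of the serial runtime is parallelized perfectly over n processors, the speedup is 1 / ((1 - p) + p/n). A minimal calculator, with an illustrative parallel fraction rather than a figure from the report:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Amdahl's law: theoretical speedup when a fraction p of the serial
    runtime is parallelized perfectly over n_procs processors."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_procs)

# E.g., with 90% of the runtime parallelized on a 4-CPU X-MP-class machine
# (the 90% is an illustrative assumption):
print(round(amdahl_speedup(0.90, 4), 2))   # → 3.08
```

Note how the serial fraction dominates: even with infinitely many processors, a 10% serial remainder caps the speedup at 10x.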

  20. Portability and Cross-Platform Performance of an MPI-Based Parallel Polygon Renderer

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1999-01-01

    Visualizing the results of computations performed on large-scale parallel computers is a challenging problem, due to the size of the datasets involved. One approach is to perform the visualization and graphics operations in place, exploiting the available parallelism to obtain the necessary rendering performance. Over the past several years, we have been developing algorithms and software to support visualization applications on NASA's parallel supercomputers. Our results have been incorporated into a parallel polygon rendering system called PGL. PGL was initially developed on tightly-coupled distributed-memory message-passing systems, including Intel's iPSC/860 and Paragon, and IBM's SP2. Over the past year, we have ported it to a variety of additional platforms, including the HP Exemplar, SGI Origin2000, Cray T3E, and clusters of Sun workstations. In implementing PGL, we have had two primary goals: cross-platform portability and high performance. Portability is important because (1) our manpower resources are limited, making it difficult to develop and maintain multiple versions of the code, and (2) NASA's complement of parallel computing platforms is diverse and subject to frequent change. Performance is important in delivering adequate rendering rates for complex scenes and ensuring that parallel computing resources are used effectively. Unfortunately, these two goals are often at odds. In this paper we report on our experiences with portability and performance of the PGL polygon renderer across a range of parallel computing platforms.

  1. Efficient multitasking of Choleski matrix factorization on CRAY supercomputers

    NASA Technical Reports Server (NTRS)

    Overman, Andrea L.; Poole, Eugene L.

    1991-01-01

    A Choleski method is described and used to solve linear systems of equations that arise in large scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
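The serial kernel underlying the method is an ordinary Cholesky factorization A = L L^T. A dense, column-oriented sketch follows; the paper's variable-band storage scheme and microtasking/autotasking directives are not reproduced, and the small test matrix is invented for illustration.

```python
import math

def cholesky(a):
    """Dense column-oriented Cholesky factorization A = L L^T for a
    symmetric positive definite matrix a (list of rows). Serial kernel
    only; banded storage and multitasking are not shown."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares of the row computed so far.
        L[j][j] = math.sqrt(a[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)   # L = [[2, 0, 0], [1, 2, 0], [1, 1, 2]]
```

The factor then feeds forward and back substitution to solve the structural-analysis systems; parallelism in the paper comes from distributing the column updates across processors.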

  2. Full speed ahead for software

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wolfe, A.

    1986-03-10

    Supercomputing software is moving into high gear, spurred by the rapid spread of supercomputers into new applications. The critical challenge is how to develop tools that will make it easier for programmers to write applications that take advantage of vectorizing in the classical supercomputer and the parallelism that is emerging in supercomputers and minisupercomputers. Writing parallel software is a challenge that every programmer must face because parallel architectures are springing up across the range of computing. Cray is developing a host of tools for programmers. Tools to support multitasking (in supercomputer parlance, multitasking means dividing up a single program to run on multiple processors) are high on Cray's agenda. On tap for multitasking is Premult, dubbed a microtasking tool. As a preprocessor for Cray's CFT77 FORTRAN compiler, Premult will provide fine-grain multitasking.

  3. Highlights of X-Stack ExM Deliverable Swift/T

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wozniak, Justin M.

    Swift/T is a key success from the ExM: System support for extreme-scale, many-task applications X-Stack project, which proposed to use concurrent dataflow as an innovative programming model to exploit extreme parallelism in exascale computers. The Swift/T component of the project reimplemented the Swift language from scratch to allow applications that compose scientific modules together to be built and run on available petascale computers (Blue Gene, Cray). Swift/T does this via a new compiler and runtime that generates and executes the application as an MPI program. We assume that mission-critical emerging exascale applications will be composed as scalable applications using existing software components, connected by data dependencies. Developers wrap native code fragments using a higher-level language, then build composite applications to form a computational experiment. This exemplifies hierarchical concurrency: lower-level messaging libraries are used for fine-grained parallelism; high-level control is used for inter-task coordination. These patterns are best expressed with dataflow, but static DAGs (i.e., other workflow languages) limit the applications that can be built; they do not provide the expressiveness of Swift, such as conditional execution, iteration, and recursive functions.

  4. Analysis, tuning and comparison of two general sparse solvers for distributed memory computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Amestoy, P.R.; Duff, I.S.; L'Excellent, J.-Y.

    2000-06-30

    We describe the work performed in the context of a Franco-Berkeley funded project between NERSC-LBNL located in Berkeley (USA) and CERFACS-ENSEEIHT located in Toulouse (France). We discuss both the tuning and performance analysis of two distributed memory sparse solvers (superlu from Berkeley and mumps from Toulouse) on the 512-processor Cray T3E from NERSC (Lawrence Berkeley National Laboratory). This project gave us the opportunity to improve the algorithms and add new features to the codes. We then quite extensively analyze and compare the two approaches on a set of large problems from real applications. We further explain the main differences in the behavior of the approaches on artificial regular grid problems. As a conclusion to this activity report, we mention a set of parallel sparse solvers to which this type of study should be extended.

  5. Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines

    NASA Technical Reports Server (NTRS)

    1999-01-01

    Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues on the dynamic load balancing problem under NASA JOVE and JAG grants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is the fastest of the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64-processor Cray T3E distributed-memory multiprocessor while maintaining the scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take over a second or two while giving low scalability of a few-fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.

  6. Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.

    The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading, and our experiences with it, in the context of a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series of supercomputers.

  7. A parallel algorithm for generation and assembly of finite element stiffness and mass matrices

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.

    1991-01-01

    A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and its computational speedup when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
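A minimal 1-D sketch can contrast the two assembly orders: in element-by-element assembly, neighboring elements update the same matrix row (a write conflict if elements are processed in parallel), whereas in node-by-node assembly each row has a single writer, so the loop over nodes can run in parallel. The unit element stiffness and 1-D bar mesh are illustrative assumptions, not the paper's applications.

```python
KE = [[1.0, -1.0], [-1.0, 1.0]]   # unit stiffness of one 1-D bar element

def assemble_by_element(n):
    """Conventional element-by-element assembly: element e scatters KE into
    rows e and e+1, so two elements sharing a node write the same row."""
    K = [[0.0] * (n + 1) for _ in range(n + 1)]
    for e in range(n):
        for a in range(2):
            for b in range(2):
                K[e + a][e + b] += KE[a][b]
    return K

def assemble_by_node(n):
    """Node-by-node assembly: row i gathers contributions only from the
    elements touching node i, so each row has a single writer and the
    outer loop over nodes is parallelizable."""
    K = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):            # parallelizable loop over nodes
        for e in (i - 1, i):          # elements adjacent to node i
            if 0 <= e < n:
                a = i - e             # local index of node i within element e
                for b in range(2):
                    K[i][e + b] += KE[a][b]
    return K

# Both orders produce the same global stiffness matrix.
assert assemble_by_node(5) == assemble_by_element(5)
```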

  8. Parallel performance of TORT on the CRAY J90: Model and measurement

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnett, A.; Azmy, Y.Y.

    1997-10-01

    A limitation on the parallel performance of TORT on the CRAY J90 is the amount of extra work introduced by the multitasking algorithm itself. The extra work beyond that of the serial version of the code, called overhead, arises from the synchronization of the parallel tasks and the accumulation of results by the master task. The goal of recent updates to TORT was to reduce the time consumed by these activities. To help understand which components of the multitasking algorithm contribute significantly to the overhead, a parallel performance model was constructed and compared to measurements of actual timings of the code.

  9. User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Earth Sciences Division; Zhang, Keni

    TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one-, two-, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator is to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on TOUGH2 Version 1.4 with the EOS3, EOS9, and T2R3D modules, software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4.
    This report provides a quick-start guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) TOUGH2 code. The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, and the mathematical and numerical methods used. To familiarize users with the parallel code, illustrative sample problems are presented.

  10. The ASC Sequoia Programming Model

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seager, M

    2008-08-06

    In the late 1980s and early 1990s, Lawrence Livermore National Laboratory was deeply engrossed in determining the next generation programming model for the Integrated Design Codes (IDC) beyond vectorization for the Cray 1s series of computers. The vector model, developed in the mid-1970s first for the CDC 7600 and later extended from stack-based vector operations to memory-to-memory operations for the Cray 1s, lasted approximately 20 years (see Slide 5). The Cray vector era was deemed an extremely long-lived era, as it allowed vector codes to be developed over time (the Cray 1s were faster in scalar mode than the CDC 7600) with vector unit utilization increasing incrementally over time. The other attributes of the Cray vector era at LLNL were that we developed, supported and maintained the operating system (LTSS and later NLTSS), communications protocols (LINCS), compilers (Civic Fortran77 and Model), operating system tools (e.g., batch system, job control scripting, loaders, debuggers, editors, graphics utilities, you name it) and math and highly machine-optimized libraries (e.g., SLATEC and STACKLIB). Although LTSS was adopted by Cray for early system generations, they later developed the COS and UNICOS operating systems and environments on their own. In the late 1970s and early 1980s, two trends appeared that made the Cray vector programming model (described above, including both the hardware and system software aspects) seem potentially dated and slated for major revision. These trends were the appearance of low-cost CMOS microprocessors and their attendant departmental and mini-computers, and later workstations and personal computers. With the widespread adoption of Unix in the early 1980s, it appeared that LLNL (and the other DOE Labs) would be left out of the mainstream of computing without a rapid transition to these 'Killer Micros' and modern OS and tools environments.
    The other interesting advance in the period is that systems were being developed with multiple 'cores' in them, called Symmetric Multi-Processor or Shared Memory Processor (SMP) systems. The parallel revolution had begun. The Laboratory started a small 'parallel processing project' in 1983 to study the new technology and its application to scientific computing with four people: Tim Axelrod, Pete Eltgroth, Paul Dubois and Mark Seager. Two years later, Eugene Brooks joined the team. This team focused on Unix and 'killer micro' SMPs. Indeed, Eugene Brooks was credited with coming up with the 'Killer Micro' term. After several generations of SMP platforms (e.g., the Sequent Balance 8000 with 8 33MHz NS32032s, the Alliant FX/8 with 8 MC68020s and FPGA-based vector units, and finally the BBN Butterfly with 128 cores), it became apparent to us that the killer micro revolution would indeed overtake Crays and that we definitely needed a new programming and systems model. The model developed by Mark Seager and Dale Nielsen focused on both the system aspects (Slide 3) and the code development aspects (Slide 4). Although now succinctly captured in two attached slides, at the time there was tremendous ferment in the research community as to what parallel programming model would emerge, dominate and survive. In addition, we wanted a model that would provide portability between platforms of a single generation but also longevity over multiple, and hopefully many, generations. Only after we developed the 'Livermore Model' and worked it out in considerable detail did it become obvious that what we came up with was the right approach. In a nutshell, the applications programming model of the Livermore Model posited that SMP parallelism would ultimately not scale indefinitely and one would have to bite the bullet and implement MPI parallelism within the Integrated Design Code (IDC).
    We also had a major emphasis on doing everything in a completely standards-based, portable methodology with POSIX/Unix as the target environment. We decided against specialized libraries like STACKLIB for performance, but kept as many general purpose, portable math libraries as were needed by the codes. Third, we assumed that the SMPs in clusters would evolve over time to become more powerful, feature-rich and, in particular, offer more cores. Thus, we focused on OpenMP and POSIX Pthreads for programming SMP parallelism. These code porting efforts were led by Dale Nielsen, A-Division code group leader, and Randy Christensen, B-Division code group leader. Most of the porting effort revolved around removing 'Crayisms' in the codes: artifacts of LTSS/NLTSS, Civic compiler extensions beyond Fortran77, I/O libraries, and dealing with new code control languages (we switched to Perl and later to Python). Adding MPI to the codes was initially problematic and error-prone because the programmers used MPI directly and sprinkled the calls throughout the code.

  11. Multitasking a three-dimensional Navier-Stokes algorithm on the Cray-2

    NASA Technical Reports Server (NTRS)

    Swisshelm, Julie M.

    1989-01-01

    A three-dimensional computational aerodynamics algorithm has been multitasked for efficient parallel execution on the Cray-2. It provides a means for examining the multitasking performance of a complete CFD application code. An embedded zonal multigrid scheme is used to solve the Reynolds-averaged Navier-Stokes equations for an internal flow model problem. The explicit nature of each component of the method allows a spatial partitioning of the computational domain to achieve a well-balanced task load for MIMD computers with vector-processing capability. Experiments have been conducted with both two- and three-dimensional multitasked cases. The best speedup attained by an individual task group was 3.54 on four processors of the Cray-2, while the entire solver yielded a speedup of 2.67 on four processors for the three-dimensional case. The multiprocessing efficiency of various types of computational tasks is examined, performance on two Cray-2s with different memory access speeds is compared, and extrapolation to larger problems is discussed.

  12. Chemical calculations on Cray computers

    NASA Technical Reports Server (NTRS)

    Taylor, Peter R.; Bauschlicher, Charles W., Jr.; Schwenke, David W.

    1989-01-01

    The influence of recent developments in supercomputing on computational chemistry is discussed with particular reference to Cray computers and their pipelined vector/limited parallel architectures. After reviewing Cray hardware and software, the performance of different elementary program structures is examined, and effective methods for improving program performance are outlined. The computational strategies appropriate for obtaining optimum performance in applications to quantum chemistry and dynamics are discussed. Finally, some discussion is given of new developments and future hardware and software improvements.

  13. Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

    NASA Technical Reports Server (NTRS)

    Gentzsch, W.

    1982-01-01

    Problems which can arise with vector and parallel computers are discussed in a user-oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined, and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN, and DAP (distributed array processor) FORTRAN. The systems' performance is compared on the addition of parts of two N x N arrays, and the influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

  14. Performance of a plasma fluid code on the Intel parallel computers

    NASA Technical Reports Server (NTRS)

    Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

    1992-01-01

    One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchstone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchstone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.

  15. Reconstruction of the 1997/1998 El Nino from TOPEX/POSEIDON and TOGA/TAO Data Using a Massively Parallel Pacific-Ocean Model and Ensemble Kalman Filter

    NASA Technical Reports Server (NTRS)

    Keppenne, C. L.; Rienecker, M.; Borovikov, A. Y.

    1999-01-01

    Two massively parallel data assimilation systems in which the model forecast-error covariances are estimated from the distribution of an ensemble of model integrations are applied to the assimilation of 1997-98 TOPEX/POSEIDON altimetry and TOGA/TAO temperature data into a Pacific basin version of the NASA Seasonal-to-Interannual Prediction Project (NSIPP) quasi-isopycnal ocean general circulation model. In the first system, an ensemble of model runs forced by an ensemble of atmospheric model simulations is used to calculate asymptotic error statistics. The data assimilation then occurs in the reduced phase space spanned by the corresponding leading empirical orthogonal functions. The second system is an ensemble Kalman filter in which new error statistics are computed during each assimilation cycle from the time-dependent ensemble distribution. The data assimilation experiments are conducted on NSIPP's 512-processor CRAY T3E. The two data assimilation systems are validated by withholding part of the data and quantifying the extent to which the withheld information can be inferred from the assimilation of the remaining data. The pros and cons of each system are discussed.
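
    The analysis step of the second system can be sketched in a few lines. The following is a minimal, illustrative perturbed-observation ensemble Kalman filter update in Python, not NSIPP's production code; the state size, observation operator `H`, and error levels are placeholder assumptions:

```python
import numpy as np

def enkf_update(ensemble, obs, H, obs_err_std, rng):
    """Perturbed-observation EnKF analysis step.

    ensemble: (n_state, n_ens) forecast ensemble; obs: (n_obs,) observations;
    H: (n_obs, n_state) linear observation operator.
    """
    n_state, n_ens = ensemble.shape
    A = ensemble - ensemble.mean(axis=1, keepdims=True)   # state anomalies
    HX = H @ ensemble
    HA = HX - HX.mean(axis=1, keepdims=True)              # obs-space anomalies
    # Forecast-error covariances estimated from the ensemble distribution
    P_HT = A @ HA.T / (n_ens - 1)
    HP_HT = HA @ HA.T / (n_ens - 1)
    R = (obs_err_std ** 2) * np.eye(len(obs))
    K = P_HT @ np.linalg.solve(HP_HT + R, np.eye(len(obs)))  # Kalman gain
    # Each member assimilates a perturbed copy of the observations
    obs_pert = obs[:, None] + rng.normal(0.0, obs_err_std, (len(obs), n_ens))
    return ensemble + K @ (obs_pert - HX)
```

    Because the covariances are rebuilt from the current ensemble on every cycle, the error statistics are time-dependent, in contrast to the asymptotic statistics of the first system.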

  16. Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

    NASA Technical Reports Server (NTRS)

    Logan, Terry G.

    1994-01-01

    The purpose of this study is to investigate the performance of integral equation computations using a numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of the computational performance of the MPP CM-5 computer and the conventional Cray Y-MP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Performance results are obtained on the CM-5 with 32, 64, and 128 nodes, along with those on the Cray Y-MP with a single processor. The comparison indicates that the parallel CM-FORTRAN code nearly matches or outperforms the equivalent serial FORTRAN code in some cases.

  17. Understanding the Cray X1 System

    NASA Technical Reports Server (NTRS)

    Cheung, Samson

    2004-01-01

    This paper helps the reader understand the characteristics of the Cray X1 vector supercomputer system, and provides hints and information to enable the reader to port codes to the system. It provides a comparison between the basic performance of the X1 platform and other platforms that are available at NASA Ames Research Center. A set of codes, solving the Laplacian equation with different parallel paradigms, is used to understand some features of the X1 compiler. An example code from the NAS Parallel Benchmarks is used to demonstrate performance optimization on the X1 platform.

  18. A compositional reservoir simulator on distributed memory parallel computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rame, M.; Delshad, M.

    1995-12-31

    This paper presents the application of distributed memory parallel computers to field-scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general-purpose, highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes porting to new parallel platforms straightforward. Results of the distributed memory computing performance of the parallel simulator are presented for field-scale applications such as tracer floods and polymer floods. A comparison of the wall-clock times for the same problems on a vector supercomputer is also presented.
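
    The subdomain-extension idea described above (ghost or halo cells shared between adjacent processors) can be illustrated serially. The sketch below, in Python rather than the simulator's Fortran, partitions a 1-D field, exchanges one-cell ghost layers between neighbours, and applies a 3-point stencil per subdomain; all names are illustrative:

```python
import numpy as np

def exchange_ghosts(subdomains):
    """Copy edge values between neighbours into one-cell ghost layers
    (a serial stand-in for message passing between adjacent processors)."""
    padded = [np.empty(s.size + 2) for s in subdomains]
    for i, s in enumerate(subdomains):
        padded[i][1:-1] = s
        # physical-boundary ghosts simply replicate the edge value
        padded[i][0] = subdomains[i - 1][-1] if i > 0 else s[0]
        padded[i][-1] = subdomains[i + 1][0] if i < len(subdomains) - 1 else s[-1]
    return padded

def stencil_step(field, n_sub):
    """Apply a 3-point averaging stencil subdomain-by-subdomain."""
    subs = np.array_split(field, n_sub)          # near-even load balance
    padded = exchange_ghosts(subs)
    return np.concatenate([(p[:-2] + p[1:-1] + p[2:]) / 3.0 for p in padded])
```

    After the exchange, each subdomain's stencil result matches what a single-domain computation would produce, which is the property that makes the decomposition transparent to the physics routines.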

  19. Demonstration of a full volume 3D pre-stack depth migration in the Garden Banks area using massively parallel processor (MPP) technology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Solano, M.; Chang, H.; VanDyke, J.

    1996-12-31

    This paper describes the implementation and results of portable, production-scale 3D pre-stack Kirchhoff depth migration software. Full-volume pre-stack imaging was applied to a six-million-trace (46.9 Gigabyte) data set from a subsalt play in the Garden Banks area in the Gulf of Mexico. The velocity model building and updating were derived using image depth gathers and an image-driven strategy. After three velocity iterations, depth-migrated sections revealed drilling targets that were not visible in the conventional 3D post-stack time-migrated data set. As expected from the implementation of the migration algorithm, it was found that amplitudes are well preserved and anomalies associated with known reservoirs conform to petrophysical predictions. Image gathers for velocity analysis and the final depth-migrated volume were generated on an 1824-node Intel Paragon at Sandia National Laboratories. The code has been successfully ported to a CRAY T3D and to Unix workstation Parallel Virtual Machine (PVM) environments.

  20. Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

    NASA Technical Reports Server (NTRS)

    Ajmani, Kumud; Taylor, Arthur C., III

    1994-01-01

    This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
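
    The flavor of the preconditioned GMRES solve can be conveyed with SciPy's serial GMRES on a stand-in system. The tridiagonal matrix below is a hypothetical substitute for the differentiated thin-layer Navier-Stokes operator, and the incomplete-LU preconditioner is one common choice, not necessarily the authors':

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres, spilu, LinearOperator

# Hypothetical stand-in for the linearized flow operator: a diagonally
# dominant 1-D convection-diffusion matrix (the real Jacobian comes from
# differentiating the discretized Navier-Stokes residual).
n = 200
A = diags([-1.0, 2.1, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)  # right-hand side from the differentiated residual

# Incomplete-LU preconditioner applied through a LinearOperator
ilu = spilu(A)
M = LinearOperator((n, n), matvec=ilu.solve)

# Generalized Minimal RESidual iteration; info == 0 signals convergence
x, info = gmres(A, b, M=M)
```

    In the parallel SPMD version described above, the matrix-vector products and inner products inside GMRES are what get distributed across processors via domain decomposition and message passing.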

  1. Parallel processing a three-dimensional free-lagrange code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mandell, D.A.; Trease, H.E.

    1989-01-01

    A three-dimensional, time-dependent free-Lagrange hydrodynamics code has been multitasked and autotasked on a CRAY X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the CRAY multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The three-dimensional algorithm has presented a number of problems that simpler algorithms, such as those for one-dimensional hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a CRAY-1, to a multitasking code are discussed. Autotasking of a rewritten version of the code is discussed. Timing results for subroutines and hot spots in the serial code are presented, and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given.

  2. The International Conference on Vector and Parallel Computing (2nd)

    DTIC Science & Technology

    1989-01-17

    The proceedings include papers on the computation of the SVD of bidiagonal matrices and on Lattice QCD as a large-scale scientific computation. The Lattice QCD code was vectorized for the IBM 3090 Vector Facility, with elapsed times further reduced by using multiple 3090 processors, and was benchmarked on a large number of computers, including the Cray X-MP and Cray-2; most of the compute time comes from the wavefront solver routine.

  3. Performance of the Cray T3D and Emerging Architectures on Canopy QCD Applications

    NASA Astrophysics Data System (ADS)

    Fischler, Mark; Uchima, Mike

    1996-03-01

    The Cray T3D, an MIMD system with NUMA shared memory capabilities and in principle very low communications latency, can support the Canopy framework for grid-oriented applications. CANOPY has been ported to the T3D, with the intent of making it available to a spectrum of users. The performance of the T3D running Canopy has been benchmarked on five QCD applications extensively run on ACPMAPS at Fermilab, requiring a variety of data access patterns. The net performance and scaling behavior reveals an efficiency relative to peak Gflops almost identical to that achieved on ACPMAPS. Detailed studies of the major factors impacting performance are presented. Generalizations applying this analysis to the newly emerging crop of commercial systems reveal where their limitations will lie. On these applications, efficiencies of above 25% are not to be expected; eliminating overheads due to Canopy will improve matters, but by less than a factor of two.

  4. Performance of the Cray T3D and emerging architectures on canopy QCD applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fischler, M.; Uchima, M.

    1995-11-01

    The Cray T3D, an MIMD system with NUMA shared memory capabilities and in principle very low communications latency, can support the Canopy framework for grid-oriented applications. CANOPY has been ported to the T3D, with the intent of making it available to a spectrum of users. The performance of the T3D running Canopy has been benchmarked on five QCD applications extensively run on ACPMAPS at Fermilab, requiring a variety of data access patterns. The net performance and scaling behavior reveals an efficiency relative to peak Gflops almost identical to that achieved on ACPMAPS. Detailed studies of the major factors impacting performance are presented. Generalizations applying this analysis to the newly emerging crop of commercial systems reveal where their limitations will lie. On these applications, efficiencies of above 25% are not to be expected; eliminating overheads due to Canopy will improve matters, but by less than a factor of two.

  5. Introducing Argonne’s Theta Supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    None

    Theta, the Argonne Leadership Computing Facility’s (ALCF) new Intel-Cray supercomputer, is officially open to the research community. Theta’s massively parallel, many-core architecture puts the ALCF on the path to Aurora, the facility’s future Intel-Cray system. Capable of nearly 10 quadrillion calculations per second, Theta enables researchers to break new ground in scientific investigations that range from modeling the inner workings of the brain to developing new materials for renewable energy applications.

  6. Parallel Calculation of Sensitivity Derivatives for Aircraft Design using Automatic Differentiation

    NASA Technical Reports Server (NTRS)

    Bischof, C. H.; Green, L. L.; Haigler, K. J.; Knauff, T. L., Jr.

    1994-01-01

    Sensitivity derivative (SD) calculation via automatic differentiation (AD) typical of that required for the aerodynamic design of a transport-type aircraft is considered. Two ways of computing SD via code generated by the ADIFOR automatic differentiation tool are compared for efficiency and applicability to problems involving large numbers of design variables. A vector implementation on a Cray Y-MP computer is compared with a coarse-grained parallel implementation on an IBM SP1 computer, employing a Fortran M wrapper. The SD are computed for a swept transport wing in turbulent, transonic flow; the number of geometric design variables varies from 1 to 60 with coupling between a wing grid generation program and a state-of-the-art, 3-D computational fluid dynamics program, both augmented for derivative computation via AD. For a small number of design variables, the Cray Y-MP implementation is much faster. As the number of design variables grows, however, the IBM SP1 becomes an attractive alternative in terms of compute speed, job turnaround time, and total memory available for solutions with large numbers of design variables. The coarse-grained parallel implementation also can be moved easily to a network of workstations.
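
    The mechanism behind ADIFOR-style forward-mode differentiation, propagating a derivative alongside every value through the arithmetic, can be sketched with dual numbers. This is a toy Python illustration, not ADIFOR's generated Fortran; the `design_quantity` function is a made-up surrogate, not the CFD code:

```python
import math

class Dual:
    """Forward-mode AD value: f is the value, df the derivative w.r.t. one seed."""
    def __init__(self, f, df=0.0):
        self.f, self.df = f, df
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.f + o.f, self.df + o.df)
    __radd__ = __add__
    def __mul__(self, o):
        o = self._lift(o)
        # product rule, applied mechanically just as an AD tool would
        return Dual(self.f * o.f, self.df * o.f + self.f * o.df)
    __rmul__ = __mul__
    def sin(self):
        # chain rule for sin
        return Dual(math.sin(self.f), math.cos(self.f) * self.df)

def design_quantity(alpha):
    """Made-up surrogate: value alpha**2 + sin(alpha), derivative 2*alpha + cos(alpha)."""
    return alpha * alpha + alpha.sin()

# Seed d(alpha)/d(alpha) = 1 to obtain the sensitivity derivative in one pass
out = design_quantity(Dual(0.7, 1.0))
```

    With many design variables, a tool like ADIFOR carries a vector of derivatives per value instead of a single `df`, which is what makes the coarse-grained parallel evaluation across design variables attractive.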

  7. A Performance Evaluation of the Cray X1 for Scientific Applications

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak; Borrill, Julian; Canning, Andrew; Carter, Jonathan; Djomehri, M. Jahed; Shan, Hongzhang; Skinner, David

    2004-01-01

    The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to build high-end capability and capacity computers because of their generality, scalability, and cost effectiveness. However, the recent development of massively parallel vector systems is having a significant effect on the supercomputing landscape. In this paper, we compare the performance of the recently released Cray X1 vector system with that of the cacheless NEC SX-6 vector machine, and the superscalar cache-based IBM Power3 and Power4 architectures for scientific applications. Overall results demonstrate that the X1 is quite promising, but performance improvements are expected as the hardware, systems software, and numerical libraries mature. Code reengineering to effectively utilize the complex architecture may also lead to significant efficiency enhancements.

  8. The Portals 4 network programming interface

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan

    This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

  9. The Portals 4.0 network programming interface.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin

    2012-11-01

    This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

  10. Modeling high-temperature superconductors and metallic alloys on the Intel iPSC/860

    NASA Astrophysics Data System (ADS)

    Geist, G. A.; Peyton, B. W.; Shelton, W. A.; Stocks, G. M.

    Oak Ridge National Laboratory has embarked on several computational Grand Challenges, which require the close cooperation of physicists, mathematicians, and computer scientists. One of these projects is the determination of the material properties of alloys from first principles and, in particular, the electronic structure of high-temperature superconductors. While the present focus of the project is on superconductivity, the approach is general enough to permit study of other properties of metallic alloys such as strength and magnetic properties. This paper describes the progress to date on this project. We include a description of a self-consistent KKR-CPA method, parallelization of the model, and the incorporation of a dynamic load balancing scheme into the algorithm. We also describe the development and performance of a consolidated KKR-CPA code capable of running on CRAYs, workstations, and several parallel computers without source code modification. Performance of this code on the Intel iPSC/860 is also compared to a CRAY 2, CRAY YMP, and several workstations. Finally, some density of state calculations of two perovskite superconductors are given.

  11. A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gittens, Alex; Kottalam, Jey; Yang, Jiyan

    We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1 TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1 TB dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly access of matrices, and vector computation using SIMD units. We report these results and their implications on the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.
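
    The core of a randomized CX factorization, sampling columns by approximate leverage scores and then solving a least-squares fit, fits in a short NumPy sketch. This is an illustrative serial version under stated assumptions (Gaussian sketch, small oversampling), not the paper's Spark or C implementation:

```python
import numpy as np

def randomized_cx(A, k, c, rng):
    """Randomized CX: pick c columns of A with probability proportional to
    rank-k leverage scores (estimated via a randomized sketch), set X = C^+ A."""
    m, n = A.shape
    # Gaussian sketch + QR gives an approximate rank-k column basis
    Y = A @ rng.normal(size=(n, k + 5))
    Q, _ = np.linalg.qr(Y)
    _, _, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    lev = np.sum(Vt[:k] ** 2, axis=0)            # approximate leverage scores
    p = lev / lev.sum()
    cols = rng.choice(n, size=c, replace=False, p=p)
    C = A[:, cols]                               # actual columns of A
    X = np.linalg.pinv(C) @ A                    # least-squares coefficient matrix
    return C, X, cols
```

    Because C consists of actual columns of the data matrix, the factors remain interpretable in terms of the original MSI measurements, which is the usual motivation for CX over a plain truncated SVD.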

  12. The portals 4.0.1 network programming interface.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin

    2013-04-01

    This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

  13. Massively parallel quantum computer simulator

    NASA Astrophysics Data System (ADS)

    De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.

    2007-01-01

    We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700, and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.
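
    A state-vector simulator of this kind stores all 2^n complex amplitudes — at 16 bytes each, 36 qubits comes to roughly the 1 TB quoted above — and applies gates by reshaping and contracting the state tensor. A minimal single-qubit gate application in NumPy (an illustrative sketch, not the authors' portable code):

```python
import numpy as np

def apply_1q_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to qubit `target` of an n-qubit state vector.

    The state is viewed as an n-dimensional tensor of shape (2, ..., 2);
    the gate contracts against the target axis only."""
    psi = state.reshape([2] * n_qubits)
    psi = np.moveaxis(psi, target, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))  # gate acts on axis 0
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

# Hadamard gate: puts a basis state into an equal superposition
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
```

    In the distributed version, the 2^n amplitudes are spread over processors, and gates on high-order qubits require the inter-node communication whose scaling the paper measures.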

  14. A Performance Evaluation of the Cray X1 for Scientific Applications

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak; Borrill, Julian; Canning, Andrew; Carter, Jonathan; Djomehri, M. Jahed; Shan, Hongzhang; Skinner, David

    2003-01-01

    The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to build high-end capability and capacity computers because of their generality, scalability, and cost effectiveness. However, the recent development of massively parallel vector systems is having a significant effect on the supercomputing landscape. In this paper, we compare the performance of the recently-released Cray X1 vector system with that of the cacheless NEC SX-6 vector machine, and the superscalar cache-based IBM Power3 and Power4 architectures for scientific applications. Overall results demonstrate that the X1 is quite promising, but performance improvements are expected as the hardware, systems software, and numerical libraries mature. Code reengineering to effectively utilize the complex architecture may also lead to significant efficiency enhancements.

  15. Gigaflop performance on a CRAY-2: Multitasking a computational fluid dynamics application

    NASA Technical Reports Server (NTRS)

    Tennille, Geoffrey M.; Overman, Andrea L.; Lambiotte, Jules J.; Streett, Craig L.

    1991-01-01

    The methodology is described for converting a large, long-running applications code that executed on a single processor of a CRAY-2 supercomputer to a version that executed efficiently on multiple processors. Although the conversion of every application is different, a discussion of the types of modification used to achieve gigaflop performance is included to assist others in the parallelization of applications for CRAY computers, especially those that were developed for other computers. An existing application, from the discipline of computational fluid dynamics, that had utilized over 2000 hrs of CPU time on a CRAY-2 during the previous year was chosen as a test case to study the effectiveness of multitasking on a CRAY-2. The nature of dominant calculations within the application indicated that a sustained computational rate of 1 billion floating-point operations per second, or 1 gigaflop, might be achieved. The code was first analyzed and modified for optimal performance on a single processor in a batch environment. After optimal performance on a single CPU was achieved, the code was modified to use multiple processors in a dedicated environment. The results of these two efforts were merged into a single code that had a sustained computational rate of over 1 gigaflop on a CRAY-2. Timings and analysis of performance are given for both single- and multiple-processor runs.

  16. Implementation and analysis of a Navier-Stokes algorithm on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1988-01-01

    The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit-serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.

  17. Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Doerfler, Douglas; Austin, Brian; Cook, Brandon

    There are many potential issues associated with deploying the Intel Xeon Phi™ (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi™ core is a fraction of a Xeon® core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.

  18. Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2

    NASA Technical Reports Server (NTRS)

    Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad

    1995-01-01

    The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that the PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower-than-linear speedups are achieved with optimized (machine-dependent library) routines. This slower-than-linear speedup results because the computational cost is dominated by the FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32 nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32-node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.

  19. HO2 rovibrational eigenvalue studies for nonzero angular momentum

    NASA Astrophysics Data System (ADS)

    Wu, Xudong T.; Hayes, Edward F.

    1997-08-01

    An efficient parallel algorithm is reported for determining all bound rovibrational energy levels for the HO2 molecule for nonzero angular momentum values, J=1, 2, and 3. Performance tests on the CRAY T3D indicate that the algorithm scales almost linearly when up to 128 processors are used. Sustained performance levels of up to 3.8 Gflops have been achieved using 128 processors for J=3. The algorithm uses a direct product discrete variable representation (DVR) basis and the implicitly restarted Lanczos method (IRLM) of Sorensen to compute the eigenvalues of the polyatomic Hamiltonian. Since the IRLM is an iterative method, it does not require storage of the full Hamiltonian matrix—it only requires the multiplication of the Hamiltonian matrix by a vector. When the IRLM is combined with a formulation such as DVR, which produces a very sparse matrix, both memory and computation times can be reduced dramatically. This algorithm has the potential to achieve even higher performance levels for larger values of the total angular momentum.
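
    The same matrix-free pattern — an implicitly restarted Lanczos iteration that only ever sees the Hamiltonian applied to a vector — is available off the shelf through ARPACK. A small SciPy sketch with a hypothetical diagonal stand-in for the DVR Hamiltonian (the real HO2 matrix is sparse but far larger):

```python
import numpy as np
from scipy.sparse.linalg import eigsh, LinearOperator

n = 500
diag = np.arange(1.0, n + 1)  # stand-in spectrum with known eigenvalues

# The solver never receives a stored matrix, only a routine computing H @ v,
# mirroring how a sparse DVR Hamiltonian is applied without dense storage.
Hop = LinearOperator((n, n), matvec=lambda v: diag * v, dtype=float)

# eigsh wraps ARPACK's implicitly restarted Lanczos method (IRLM)
evals = np.sort(eigsh(Hop, k=4, which="LA", return_eigenvectors=False))
```

    In the parallel setting, only the matrix-vector product needs to be distributed; the small projected eigenproblem inside the Lanczos iteration stays cheap, which is why memory and computation time drop so sharply for sparse DVR formulations.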

  20. Parallel Navier-Stokes computations on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar

    1995-01-01

    We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SP1). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.

  1. Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

    DOE PAGES

    Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...

    2015-01-01

This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
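
    The fix described above, replacing an O(n)-deep sequential running sum with an O(log n)-deep pairwise reduction, can be sketched in a few lines (Python here rather than the paper's coarray Fortran; the round count corresponds to the parallel depth):

```python
def tree_sum(values):
    """Pairwise (binary tree) reduction: O(log n) parallel depth
    instead of the O(n) dependency chain of a running sum."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Each round halves the list; all pair-sums in a round are
        # independent and could run in parallel.
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds

total, depth = tree_sum(range(32))
```

    Each round's additions are mutually independent, which is what lets coarray images (or any parallel workers) execute them concurrently.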

  2. Performance of an Optimized Eta Model Code on the Cray T3E and a Network of PCs

    NASA Technical Reports Server (NTRS)

    Kouatchou, Jules; Rancic, Miodrag; Geiger, Jim

    2000-01-01

    In the year 2001, NASA will launch the satellite TRIANA that will be the first Earth observing mission to provide a continuous, full disk view of the sunlit Earth. As a part of the HPCC Program at NASA GSFC, we have started a project whose objectives are to develop and implement a 3D cloud data assimilation system, by combining TRIANA measurements with model simulation, and to produce accurate statistics of global cloud coverage as an important element of the Earth's climate. For simulation of the atmosphere within this project we are using the NCEP/NOAA operational Eta model. In order to compare TRIANA and the Eta model data on approximately the same grid without significant downscaling, the Eta model will be integrated at a resolution of about 15 km. The integration domain (from -70 to +70 deg in latitude and 150 deg in longitude) will cover most of the sunlit Earth disc and will continuously rotate around the globe following TRIANA. The cloud data assimilation is supposed to run and produce 3D clouds on a near real-time basis. Such a numerical setup and integration design is very ambitious and computationally demanding. Thus, though the Eta model code has been very carefully developed and its computational efficiency has been systematically polished during the years of operational implementation at NCEP, the current MPI version may still have problems with memory and efficiency for the TRIANA simulations. Within this work, we optimize a parallel version of the Eta model code on a Cray T3E and a network of PCs (theHIVE) in order to improve its overall efficiency. Our optimization procedure consists of introducing dynamically allocated arrays to reduce the size of static memory, and optimizing on a single processor by splitting loops to limit the number of streams. All the presented results are derived using an integration domain centered at the equator, with a size of 60 x 60 deg, and with horizontal resolutions of 1/2 and 1/3 deg, respectively. 
In accompanying charts we report the elapsed time, the speedup, and the Mflops as a function of the number of processors for the non-optimized version of the code on the T3E and theHIVE. The large amount of communication required for model integration explains its poor performance on theHIVE. Our initial implementation of the dynamic memory allocation has contributed to about a 12% reduction in memory but has introduced a 3% overhead in computing time. This overhead was removed by performing loop splitting in some of the highly demanding subroutines. When the Eta code is fully optimized in order to meet the memory requirements for TRIANA simulations, a non-negligible overhead may appear that could seriously affect the efficiency of the code. To alleviate this problem, we are considering implementation of a new algorithm for the horizontal advection that is computationally less expensive, and also a new approach for marching in time.
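
    Loop splitting (fission) of the kind used here breaks one loop that writes several arrays into separate loops that each drive fewer memory streams; the result is unchanged, which a sketch like this (hypothetical loop body, not the Eta code) makes easy to verify:

```python
def fused(a, b, c, d, x):
    # One loop touching four output arrays means four concurrent
    # write streams competing on stream-limited hardware.
    for i in range(len(x)):
        a[i] = x[i] + 1.0
        b[i] = x[i] * 2.0
        c[i] = x[i] - 3.0
        d[i] = x[i] * x[i]

def split(a, b, c, d, x):
    # After loop splitting, each loop drives a single output stream.
    for i in range(len(x)):
        a[i] = x[i] + 1.0
    for i in range(len(x)):
        b[i] = x[i] * 2.0
    for i in range(len(x)):
        c[i] = x[i] - 3.0
    for i in range(len(x)):
        d[i] = x[i] * x[i]

x = [float(i) for i in range(8)]
a1, b1, c1, d1 = [0.0] * 8, [0.0] * 8, [0.0] * 8, [0.0] * 8
a2, b2, c2, d2 = [0.0] * 8, [0.0] * 8, [0.0] * 8, [0.0] * 8
fused(a1, b1, c1, d1, x)
split(a2, b2, c2, d2, x)
```

    The transformation is purely structural; only the memory-access pattern changes, which is why it can recover the overhead without altering results.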

  3. Implementation of a parallel unstructured Euler solver on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.

    1992-01-01

    An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.

  4. Finite Element Flow Code Optimization on the Cray T3D,

    DTIC Science & Technology

    1997-04-01

At the present time, the system is configured with 512 processing elements and 32.8 Gigabytes of memory. Through a gift of time from MSCI and other arrangements, the AHPCRC has limited access to this system.

  5. NAS Parallel Benchmark. Results 11-96: Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks. 1.0

    NASA Technical Reports Server (NTRS)

    Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HPF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR), Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/programming-model (HPF and MPI (message passing interface)) combinations will be compared, based on the latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF-based NPB results will be compared with MPI-based NPB results to provide perspective on the performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors.
In addition, we present NPB (Version 1.0) performance results for the following systems: DEC AlphaServer 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, and SGI Origin2000.

  6. 1993 Gordon Bell Prize Winners

    NASA Technical Reports Server (NTRS)

    Karp, Alan H.; Simon, Horst; Heller, Don; Cooper, D. M. (Technical Monitor)

    1994-01-01

    The Gordon Bell Prize recognizes significant achievements in the application of supercomputers to scientific and engineering problems. In 1993, finalists were named for work in three categories: (1) Performance, which recognizes those who solved a real problem in the quickest elapsed time. (2) Price/performance, which encourages the development of cost-effective supercomputing. (3) Compiler-generated speedup, which measures how well compiler writers are facilitating the programming of parallel processors. The winners were announced November 17 at the Supercomputing 93 conference in Portland, Oregon. Gordon Bell, an independent consultant in Los Altos, California, is sponsoring $2,000 in prizes each year for 10 years to promote practical parallel processing research. This is the sixth year of the prize, which Computer administers. Something unprecedented in Gordon Bell Prize competition occurred this year: A computer manufacturer was singled out for recognition. Nine entries reporting results obtained on the Cray C90 were received, seven of the submissions orchestrated by Cray Research. Although none of these entries showed sufficiently high performance to win outright, the judges were impressed by the breadth of applications that ran well on this machine, all nine running at more than a third of the peak performance of the machine.

  7. Barrier-breaking performance for industrial problems on the CRAY C916

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Graffunder, S.K.

    1993-12-31

Nine applications, including third-party codes, were submitted to the Gordon Bell Prize committee showing the CRAY C916 supercomputer providing record-breaking time to solution for industrial problems in several disciplines. Performance was obtained by balancing raw hardware speed; effective use of large, real, shared memory; compiler vectorization and autotasking; hand optimization; asynchronous I/O techniques; and new algorithms. The highest GFLOPS performance for the submissions was 11.1 GFLOPS out of a peak advertised performance of 16 GFLOPS for the CRAY C916 system. One program achieved a 15.45 speedup from the compiler with just two hand-inserted directives to scope variables properly for the mathematical library. New I/O techniques hide tens of gigabytes of I/O behind parallel computations. Finally, new iterative solver algorithms have demonstrated times to solution on 1 CPU as high as 70 times faster than the best direct solvers.

  8. Distributed Finite Element Analysis Using a Transputer Network

    NASA Technical Reports Server (NTRS)

    Watson, James; Favenesi, James; Danial, Albert; Tombrello, Joseph; Yang, Dabby; Reynolds, Brian; Turrentine, Ronald; Shephard, Mark; Baehmann, Peggy

    1989-01-01

    The principal objective of this research effort was to demonstrate the extraordinarily cost effective acceleration of finite element structural analysis problems using a transputer-based parallel processing network. This objective was accomplished in the form of a commercially viable parallel processing workstation. The workstation is a desktop size, low-maintenance computing unit capable of supercomputer performance yet costs two orders of magnitude less. To achieve the principal research objective, a transputer based structural analysis workstation termed XPFEM was implemented with linear static structural analysis capabilities resembling commercially available NASTRAN. Finite element model files, generated using the on-line preprocessing module or external preprocessing packages, are downloaded to a network of 32 transputers for accelerated solution. The system currently executes at about one third Cray X-MP24 speed but additional acceleration appears likely. For the NASA selected demonstration problem of a Space Shuttle main engine turbine blade model with about 1500 nodes and 4500 independent degrees of freedom, the Cray X-MP24 required 23.9 seconds to obtain a solution while the transputer network, operated from an IBM PC-AT compatible host computer, required 71.7 seconds. Consequently, the $80,000 transputer network demonstrated a cost-performance ratio about 60 times better than the $15,000,000 Cray X-MP24 system.
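
    The quoted "about 60 times better" cost-performance figure can be checked directly from the numbers in the abstract, treating cost-performance as purchase cost times solution time (lower is better):

```python
# Figures quoted in the abstract: system cost (dollars) and time to
# solve the turbine blade demonstration problem (seconds).
cray_cost, cray_time = 15_000_000, 23.9
xpfem_cost, xpfem_time = 80_000, 71.7

# Ratio of the two cost-time products: how much better the
# transputer workstation's cost-performance is.
advantage = (cray_cost * cray_time) / (xpfem_cost * xpfem_time)
```

    The ratio comes out near 62, consistent with the abstract's "about 60 times better".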

  9. Developing software to use parallel processing effectively. Final report, June-December 1987

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Center, J.

    1988-10-01

This report describes the difficulties involved in writing efficient parallel programs and describes the hardware and software support currently available for generating software that utilizes parallel processing effectively. Historically, the processing rate of single-processor computers has increased by one order of magnitude every five years. However, this pace is slowing since electronic circuitry is coming up against physical barriers. Unfortunately, the complexity of engineering and research problems continues to require ever more processing power (far in excess of the maximum estimated 3 Gflops achievable by single-processor computers). For this reason, parallel-processing architectures are receiving considerable interest, since they offer high performance more cheaply than a single-processor supercomputer, such as the Cray.

  10. A practical approach to portability and performance problems on massively parallel supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beazley, D.M.; Lomdahl, P.S.

    1994-12-08

We present an overview of the tactics we have used to achieve a high level of performance while improving portability for the large-scale molecular dynamics code SPaSM. SPaSM was originally implemented in ANSI C with message passing for the Connection Machine 5 (CM-5). In 1993, SPaSM was selected as one of the winners in the IEEE Gordon Bell Prize competition for sustaining 50 Gflops on the 1024-node CM-5 at Los Alamos National Laboratory. Achieving this performance on the CM-5 required rewriting critical sections of code in CDPEAC assembler language. In addition, the code made extensive use of CM-5 parallel I/O and the CMMD message passing library. Given this highly specialized implementation, we describe how we have ported the code to the Cray T3D and high performance workstations. In addition, we describe how it has been possible to do this using a single version of source code that runs on all three platforms without sacrificing any performance. Sound too good to be true? We hope to demonstrate that one can realize both code performance and portability without relying on the latest and greatest prepackaged tool or parallelizing compiler.

  11. Implementation of an ADI method on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; the FLEX/32, an MIMD machine with 20 processors; and the CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2, while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
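
    For reference, the tridiagonal Gaussian elimination used on the FLEX/32 and CRAY/2 is the Thomas algorithm: a forward sweep of recurrences followed by back substitution. Its sequential data dependence is exactly why the bit-serial SIMD MPP needed cyclic elimination instead. A minimal Python sketch:

```python
def thomas_solve(a, b, c, d):
    """Gaussian elimination specialized to a tridiagonal system
    (a: sub-, b: main, c: super-diagonal; d: right-hand side).
    Note each step depends on the previous one -- inherently serial."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 4x4 system tridiag(-1, 2, -1) with RHS chosen so x = [1, 1, 1, 1].
x = thomas_solve([0.0, -1.0, -1.0, -1.0],
                 [2.0, 2.0, 2.0, 2.0],
                 [-1.0, -1.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0, 1.0])
```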

  13. Efficacy of Code Optimization on Cache-Based Processors

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

In this paper a number of techniques for improving the cache performance of a representative piece of numerical software are presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses. But they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.
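
    A representative example of the locality optimizations in question is loop blocking (tiling), shown here for a matrix transpose in pure Python. The payoff is architectural, so this sketch only demonstrates the transformation itself, not the speedup:

```python
def transpose_naive(m, n, a):
    # a is an m x n matrix stored row-major as a flat list;
    # the column-order reads stride through memory.
    return [a[i * n + j] for j in range(n) for i in range(m)]

def transpose_blocked(m, n, a, bs=4):
    """Cache-blocked transpose: work in bs x bs tiles so both the
    reads and the writes stay within a region that fits in cache."""
    out = [0.0] * (m * n)
    for ii in range(0, m, bs):
        for jj in range(0, n, bs):
            for i in range(ii, min(ii + bs, m)):
                for j in range(jj, min(jj + bs, n)):
                    out[j * m + i] = a[i * n + j]
    return out

a = [float(k) for k in range(6 * 10)]
t1 = transpose_naive(6, 10, a)
t2 = transpose_blocked(6, 10, a)
```

    As the abstract observes, the same transformation can help on one processor and hurt on another; only the correctness is portable.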

  14. Parallel-vector out-of-core equation solver for computational mechanics

    NASA Technical Reports Server (NTRS)

    Qin, J.; Agarwal, T. K.; Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.

    1993-01-01

A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/output (I/O) time is reduced by using the asynchronous BUFFER IN and BUFFER OUT statements, which can be executed simultaneously with the CPU instructions. The parallel and vector capabilities provided by the supercomputers are also exploited to enhance performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.
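
    The BUFFER IN / BUFFER OUT idea, overlapping asynchronous I/O with computation through a small pool of buffers, can be sketched with a reader thread and a bounded queue (a generic double-buffering illustration, not the Cray statements themselves):

```python
import queue
import threading

def out_of_core_sum(chunks):
    """Overlap 'I/O' (the producer thread) with computation (the
    consumer): a bounded queue of two buffers lets the next read
    proceed while the current buffer is being processed."""
    buffers = queue.Queue(maxsize=2)   # double buffering

    def reader():
        for chunk in chunks:           # stands in for disk reads
            buffers.put(chunk)
        buffers.put(None)              # end-of-data sentinel

    threading.Thread(target=reader, daemon=True).start()
    total = 0
    while True:
        buf = buffers.get()
        if buf is None:
            break
        total += sum(buf)              # CPU work overlapped with reads
    return total

total = out_of_core_sum([list(range(100)) for _ in range(10)])
```

    With only two buffers in flight, memory stays bounded no matter how large the out-of-core data set is, which is the point of the technique.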

  15. Basic JCL for the CRAY-1 operating system (COS) with emphasis on making the transition from CDC 7600/SCOPE

    NASA Technical Reports Server (NTRS)

    Howe, G.; Saunders, D.

    1983-01-01

    Users of the CDC 7600 at Ames are assisted in making the transition to the CRAY-1. Similarities and differences in the basic JCL are summarized, and a dozen or so examples of typical batch jobs for the two systems are shown in parallel. Some changes to look for in FORTRAN programs and in the use of UPDATE are also indicated. No attempt is made to cover magnetic tape handling. The material here should not be considered a substitute for reading the more conventional manuals or the User's Guide for the Advanced Computational Facility, available from the Computer Information Center.

  16. A transient FETI methodology for large-scale parallel implicit computations in structural mechanics

    NASA Technical Reports Server (NTRS)

    Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier

    1992-01-01

    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.

  17. Supercomputer algorithms for efficient linear octree encoding of three-dimensional brain images.

    PubMed

    Berger, S B; Reis, D J

    1995-02-01

    We designed and implemented algorithms for three-dimensional (3-D) reconstruction of brain images from serial sections using two important supercomputer architectures, vector and parallel. These architectures were represented by the Cray YMP and Connection Machine CM-2, respectively. The programs operated on linear octree representations of the brain data sets, and achieved 500-800 times acceleration when compared with a conventional laboratory workstation. As the need for higher resolution data sets increases, supercomputer algorithms may offer a means of performing 3-D reconstruction well above current experimental limits.

  18. Multitasking the INS3D-LU code on the Cray Y-MP

    NASA Technical Reports Server (NTRS)

    Fatoohi, Rod; Yoon, Seokkwan

    1991-01-01

    This paper presents the results of multitasking the INS3D-LU code on eight processors. The code is a full Navier-Stokes solver for incompressible fluid in three dimensional generalized coordinates using a lower-upper symmetric-Gauss-Seidel implicit scheme. This code has been fully vectorized on oblique planes of sweep and parallelized using autotasking with some directives and minor modifications. The timing results for five grid sizes are presented and analyzed. The code has achieved a processing rate of over one Gflops.

  19. Tuning HDF5 subfiling performance on parallel file systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Byna, Suren; Chaarawi, Mohamad; Koziol, Quincey

Subfiling is a technique used on parallel file systems to reduce locking and contention issues when multiple compute nodes interact with the same storage target node. Subfiling provides a compromise between the single shared file approach, which instigates lock contention problems on parallel file systems, and having one file per process, which generates a massive and unmanageable number of files. In this paper, we evaluate and tune the performance of the recently implemented subfiling feature in HDF5. Specifically, we explain the implementation strategy of the subfiling feature in HDF5, provide examples of using the feature, and evaluate and tune parallel I/O performance of this feature on parallel file systems of the Cray XC40 system at NERSC (Cori), which include a burst buffer storage and a Lustre disk-based storage. We also evaluate I/O performance on the Cray XC30 system, Edison, at NERSC. Our results show a 1.2X to 6X performance advantage with subfiling compared to writing a single shared HDF5 file. We present our exploration of configurations, such as the number of subfiles and the number of Lustre storage targets for storing files, as optimization parameters to obtain superior I/O performance. Based on this exploration, we discuss recommendations for achieving good I/O performance as well as limitations of using the subfiling feature.
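
    The core bookkeeping in any subfiling scheme is mapping ranks to subfiles so that each subfile is shared by only a small group of writers. A minimal sketch of one such mapping (an illustrative contiguous-block layout; HDF5's actual assignment policy may differ):

```python
def subfile_of(rank, nprocs, nsubfiles):
    """Contiguous-block mapping of MPI ranks to subfiles: each
    subfile is written by ceil(nprocs / nsubfiles) ranks, so lock
    contention is limited to that small group."""
    per_subfile = -(-nprocs // nsubfiles)   # ceiling division
    return rank // per_subfile

# 16 ranks writing to 4 subfiles instead of 1 shared file or 16 files.
assignments = [subfile_of(r, 16, 4) for r in range(16)]
```

    The number of subfiles is exactly the tuning knob the paper explores: 1 recovers the fully shared file, nprocs recovers file-per-process, and values in between trade contention against file-count manageability.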

  20. A multithreaded and GPU-optimized compact finite difference algorithm for turbulent mixing at high Schmidt number using petascale computing

    NASA Astrophysics Data System (ADS)

    Clay, M. P.; Yeung, P. K.; Buaria, D.; Gotoh, T.

    2017-11-01

Turbulent mixing at high Schmidt number is a multiscale problem which places demanding requirements on direct numerical simulations to resolve fluctuations down to the Batchelor scale. We use a dual-grid, dual-scheme and dual-communicator approach where velocity and scalar fields are computed by separate groups of parallel processes, the latter using a combined compact finite difference (CCD) scheme on a finer grid with a static 3-D domain decomposition free of the communication overhead of memory transposes. A high degree of scalability is achieved for an 8192^3 scalar field at Schmidt number 512 in turbulence with a modest inertial range, by overlapping communication with computation whenever possible. On the Cray XE6 partition of Blue Waters, use of a dedicated thread for communication combined with OpenMP locks and nested parallelism reduces CCD timings by 34% compared to an MPI baseline. The code has been further optimized for the 27-petaflops Cray XK7 machine Titan using GPUs as accelerators with the latest OpenMP 4.5 directives, giving a 2.7X speedup compared to CPU-only execution at the largest problem size. Supported by NSF Grant ACI-1036170, the NCSA Blue Waters Project with subaward via UIUC, and a DOE INCITE allocation at ORNL.
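
    A static 3-D domain decomposition of the sort mentioned above assigns each process a fixed block of the global grid, so no transposes are ever needed. A small sketch of the index arithmetic (hypothetical helper names):

```python
def block_bounds(n, p, r):
    """Start/end indices of rank r's block when n points are split
    as evenly as possible across p ranks along one axis."""
    base, extra = divmod(n, p)
    start = r * base + min(r, extra)
    return start, start + base + (1 if r < extra else 0)

def decompose(n, px, py, pz):
    """Static 3-D block decomposition of an n^3 grid: every rank
    owns one fixed brick for the whole run."""
    blocks = []
    for rx in range(px):
        for ry in range(py):
            for rz in range(pz):
                blocks.append((block_bounds(n, px, rx),
                               block_bounds(n, py, ry),
                               block_bounds(n, pz, rz)))
    return blocks

blocks = decompose(16, 2, 2, 4)
cells = sum((x1 - x0) * (y1 - y0) * (z1 - z0)
            for (x0, x1), (y0, y1), (z0, z1) in blocks)
```

    Because each brick is fixed, only halo exchanges with neighbours are required, and those can be overlapped with computation as the abstract describes.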

  1. Parallelization of Rocket Engine System Software (Press)

    NASA Technical Reports Server (NTRS)

    Cezzar, Ruknet

    1996-01-01

The main goal is to assess parallelization requirements for the Rocket Engine Numeric Simulator (RENS) project, which, aside from gathering information on liquid-propelled rocket engines and setting forth requirements, involves a large FORTRAN-based package at NASA Lewis Research Center and TDK software developed by SUBR/UWF. The ultimate aim is to develop, test, integrate, and suitably deploy a family of software packages on various aspects and facets of rocket engines using liquid propellants. At present, all project efforts by the funding agency, NASA Lewis Research Center, and the HBCU participants are disseminated over the Internet using World Wide Web home pages. Considering the obviously expensive methods of actual field trials, the benefits of software simulators are potentially enormous. When realized, these benefits will be analogous to those provided by numerous CAD/CAM packages and flight-training simulators. According to the overall task assignments, Hampton University's role is to collect all available software, place it in a common format, assess and evaluate it, define interfaces, and provide integration. Most importantly, HU's mission is to ensure that real-time performance is assured. This involves source code translation, porting, and distribution. The porting will be done in two phases: first, place all software on the Cray X-MP platform using FORTRAN. After testing and evaluation on the Cray X-MP, the code will be translated to C++ and ported to the parallel nCUBE platform. At present, we are evaluating another option: distributed processing over local area networks using Sun NFS, Ethernet, and TCP/IP. Considering the heterogeneous nature of the present software (e.g., it was first started as an expert system using LISP machines and now involves FORTRAN code), the effort is expected to be quite challenging.

  2. A comparison of five benchmarks

    NASA Technical Reports Server (NTRS)

    Huss, Janice E.; Pennline, James A.

    1987-01-01

Five benchmark programs were obtained and run on the NASA Lewis CRAY X-MP/24. A comparison was made between the programs' codes and between the methods for calculating performance figures. Several multitasking jobs were run to gain experience in how parallel performance is measured.

  3. MHD Instability and Turbulence in the Tachocline

    NASA Technical Reports Server (NTRS)

    Werne, Joseph

    2001-01-01

    In this quarter we have begun simulations on the Cray T3E at PSC and we are debugging our code on the TSC. The PSC simulations are examining stratified shear turbulence with a flow-aligned magnetic field and passive tracer particles. We have conducted analysis of neutral simulations to establish a firm basis of comparison. Second-order structure functions have been computed, fit, and compared to theoretical expressions relating the dissipation fields and the structure-function-fit parameters. Agreement with high-Reynolds number observations is excellent, giving us confidence that the lower-Re simulations are relevant to higher-Re flows. We have also evaluated the neutral layer anisotropy.

  4. Studies on complex π-π and T-stacking features of imidazole and phenyl/p-halophenyl units in series of 5-amino-1-(phenyl/p-halophenyl)imidazole-4-carboxamides and their carbonitrile derivatives: Role of halogens in tuning of conformation

    NASA Astrophysics Data System (ADS)

    Das, Aniruddha

    2017-11-01

5-amino-1-(phenyl/p-halophenyl)imidazole-4-carboxamides (N-phenyl AICA) (2a-e) and 5-amino-1-(phenyl/p-halophenyl)imidazole-4-carbonitriles (N-phenyl AICN) (3a-e) were synthesized. X-ray crystallographic studies of 2a-e and 3a-e were performed to identify any distinct change in stacking patterns in their crystal lattices. Single-crystal X-ray diffraction studies of 2a-e revealed π-π stack formation involving both the imidazole and phenyl/p-halophenyl units in anti and syn parallel-displaced (PD)-type dispositions. No π-π stacking of imidazole occurred when the halogen substituent was bromo or iodo; π-π stacking in these cases involved the phenyl rings only. An additional T-stacking interaction was observed in the crystal lattices of 3a-e. Vertical π-π stacking distances in the anti-parallel PD-type arrangements, as well as the T-stacking distances, were short enough to impart stabilization, whereas the syn-parallel arrangements had π-π stacking distances too large for any syn-parallel stacking stabilization. DFT studies were pursued to quantify the π-π stacking and T-stacking stabilization. The plotted curves for the anti-parallel and T-stacked moieties resembled the Morse potential energy curve for a diatomic molecule; the minima of the curves corresponded to the most stable stacking distances, and the related energy values indicated stacking stabilization. Similar DFT studies on the syn-parallel systems of 2b showed no π-π stacking stabilization at all. Halogen-halogen interactions were also observed to stabilize compounds 2d, 2e and 3d. The nano-structural behaviour of the series of compounds 2a-e and 3a-e was thoroughly investigated.

  5. Cray Research, Inc. Cray 1-s, Cray FORTRAN Translator (CFT) Version 1.11 Bugfix 1. Validation Summary Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Not Available

    1983-09-09

    This Validation Summary Report (VSR) for the Cray Research, Inc., CRAY FORTRAN Translator (CFT) Version 1.11 Bugfix 1, running under the CRAY Operating System (COS) Version 1.12, provides a consolidated summary of the results obtained from the validation of the subject compiler against the 1978 FORTRAN Standard (X3.9-1978/FIPS PUB 69). The compiler was validated against the Full Level FORTRAN level of FIPS PUB 69. The VSR is made up of several sections showing all the discrepancies found, if any. These include an overview of the validation which lists all categories of discrepancies together with the tests which failed.

  6. On the parallel solution of parabolic equations

    NASA Technical Reports Server (NTRS)

    Gallopoulos, E.; Saad, Youcef

    1989-01-01

    Parallel algorithms for the solution of linear parabolic problems are proposed. The first of these methods is based on using a polynomial approximation to the exponential. It does not require solving any linear systems and is highly parallelizable. The two other methods proposed are based on Pade and Chebyshev approximations to the matrix exponential. The parallelization of these methods is achieved by using partial fraction decomposition techniques to solve the resulting systems, and thus offers the potential for increased time parallelism in time-dependent problems. Experimental results from the Alliant FX/8 and the Cray Y-MP/832 vector multiprocessors are also presented.
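    The first method can be sketched directly: approximating exp(tA)v by a truncated Taylor polynomial needs only matrix-vector products, which is why no linear solves are required and the scheme parallelizes well. A minimal sketch on a toy 2x2 diffusion-like operator (not the paper's discretization):

```python
import numpy as np

def expm_poly(A, v, t=1.0, terms=20):
    """Approximate exp(t*A) @ v with a truncated Taylor polynomial.
    Only matrix-vector products are used; no linear system is solved."""
    result = v.astype(float).copy()
    term = v.astype(float).copy()
    for k in range(1, terms):
        term = (t / k) * (A @ term)   # next Taylor term: (tA)^k v / k!
        result += term
    return result

A = np.array([[-2.0, 1.0], [1.0, -2.0]])  # toy symmetric operator
v = np.array([1.0, 0.0])
approx = expm_poly(A, v, t=0.1)
```

    For a small time step the truncated polynomial is already very accurate; the Pade and Chebyshev variants in the paper trade this simplicity for better stability properties at the cost of (partial-fraction-decomposed) linear solves.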

  7. Implementations of BLAST for parallel computers.

    PubMed

    Jülich, A

    1995-02-01

    The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared-memory machine Cray Y-MP 8/864 and the distributed-memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited to parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799-residue protein query sequence and the protein database PIR were used.
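    The standard way to parallelize such a search is to partition the database across processors, score each entry independently, and merge the per-chunk hit lists. A sketch of that split-and-merge structure, with a toy k-mer overlap score standing in for BLAST's real statistical scoring (all names here are illustrative, not BLAST's API):

```python
def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def score(query, seq):
    # Toy similarity: size of the shared 3-mer set (a crude stand-in
    # for BLAST's alignment scoring).
    return len(kmer_set(query) & kmer_set(seq))

def partition(db, n_chunks):
    """Split the database into roughly equal chunks, one per processor."""
    return [db[i::n_chunks] for i in range(n_chunks)]

def parallel_search(query, db, n_chunks=4):
    # Each chunk would be handled by a separate processor; the
    # per-chunk hit lists are then merged and globally ranked.
    partial = [[(name, score(query, seq)) for name, seq in chunk]
               for chunk in partition(db, n_chunks)]
    merged = [hit for chunk_hits in partial for hit in chunk_hits]
    return sorted(merged, key=lambda h: h[1], reverse=True)
```

    Because each database entry is scored independently, this workload scales well for a moderate number of processors, matching the observation in the abstract.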

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Painter, J.; McCormick, P.; Krogh, M.

    This paper presents the ACL (Advanced Computing Lab) Message Passing Library. It is a high-throughput, low-latency communications library, based on Thinking Machines Corp.'s CMMD, upon which message-passing applications can be built. The library has been implemented on the Cray T3D, Thinking Machines CM-5, SGI workstations, and on top of PVM.

  9. Development of a Navier-Stokes algorithm for parallel-processing supercomputers. Ph.D. Thesis - Colorado State Univ., Dec. 1988

    NASA Technical Reports Server (NTRS)

    Swisshelm, Julie M.

    1989-01-01

    An explicit flow solver, applicable to the hierarchy of model equations ranging from Euler to full Navier-Stokes, is combined with several techniques designed to reduce computational expense. The computational domain consists of local grid refinements embedded in a global coarse mesh, where the locations of these refinements are defined by the physics of the flow. Flow characteristics are also used to determine which set of model equations is appropriate for solution in each region, thereby reducing not only the number of grid points at which the solution must be obtained, but also the computational effort required to get that solution. Acceleration to steady-state is achieved by applying multigrid on each of the subgrids, regardless of the particular model equations being solved. Since each of these components is explicit, advantage can readily be taken of the vector- and parallel-processing capabilities of machines such as the Cray X-MP and Cray-2.

  10. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction

    DOE PAGES

    Kumar, B.; Huang, C. -H.; Sadayappan, P.; ...

    1995-01-01

    In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated into high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared-memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared-memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
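    For reference, the recursive form of Strassen's seven-multiplication scheme on 2^n × 2^n matrices looks like the sketch below (the article's actual contribution is a *nonrecursive* tensor-product formulation with reduced workspace; this is just the underlying algorithm):

```python
import numpy as np

def strassen(A, B, leaf=2):
    """Strassen's multiplication for 2^n x 2^n matrices: 7 recursive
    products instead of 8, at the cost of extra additions/workspace."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

    The temporaries M1..M7 are exactly where the O(7^n) workspace of a naive synthesis comes from; the modified formulation in the article reorganizes the computation so only O(4^n) workspace is live.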

  11. Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2

    PubMed Central

    Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.

    2014-01-01

    Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum-energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB, and hardware synchronization for each 64-bit word. It is thus well suited to the parallelization of graph-theoretic algorithms such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000 × 32000 pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
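    A minimal sequential sketch of the Goldberg-Tarjan preflow-push (push-relabel) kernel is shown below; the paper's contribution is a massively multithreaded variant where many push/relabel operations run concurrently, but the two local operations are the same:

```python
from collections import defaultdict

def preflow_push(capacity, source, sink, n):
    """Sequential push-relabel maxflow sketch.
    capacity: dict mapping (u, v) -> edge capacity; nodes are 0..n-1."""
    flow = defaultdict(int)
    height = [0] * n
    excess = [0] * n
    height[source] = n

    def residual(u, v):
        return capacity.get((u, v), 0) - flow[(u, v)]

    def push(u, v):
        d = min(excess[u], residual(u, v))
        flow[(u, v)] += d
        flow[(v, u)] -= d
        excess[u] -= d
        excess[v] += d

    # Initialize the preflow by saturating every edge out of the source.
    excess[source] = sum(c for (u, _), c in capacity.items() if u == source)
    for v in range(n):
        if capacity.get((source, v), 0):
            push(source, v)

    active = [v for v in range(n) if v not in (source, sink) and excess[v] > 0]
    while active:
        u = active.pop()
        pushed = False
        for v in range(n):
            if excess[u] > 0 and residual(u, v) > 0 and height[u] == height[v] + 1:
                push(u, v)          # push excess downhill along an admissible edge
                pushed = True
                if v not in (source, sink) and v not in active:
                    active.append(v)
        if excess[u] > 0:
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbor.
                height[u] = 1 + min(height[v] for v in range(n) if residual(u, v) > 0)
            active.append(u)
    return excess[sink]
```

    Because push and relabel only touch a node and its neighbors, many nodes can be processed concurrently given fine-grained synchronization, which is exactly what the XMT-2's per-word hardware synchronization provides.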

  12. A Programming Model Performance Study Using the NAS Parallel Benchmarks

    DOE PAGES

    Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...

    2010-01-01

    Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS, to understand their performance and memory-usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other methods to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an InfiniBand cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve performance equal to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. We also compare performance differences between the two Cray systems, which have quad-core and hex-core processors, and show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.

  13. Parallel integer sorting with medium and fine-scale parallelism

    NASA Technical Reports Server (NTRS)

    Dagum, Leonardo

    1993-01-01

    Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message-passing overhead. Performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128-processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray Y-MP.
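    The barrel-sort idea suits high message-passing overhead because keys are routed in a single communication step: each processor owns one contiguous key range (a "barrel"), keys are routed to their owning barrel once, and each barrel is then sorted locally. A sequential sketch of that structure (details assumed, not taken from the paper):

```python
def barrel_sort(keys, n_procs=4):
    """Barrel-sort sketch: one routing step into value-range 'barrels'
    (one per processor), then independent local sorts."""
    lo, hi = min(keys), max(keys)
    width = (hi - lo) // n_procs + 1     # key range owned by each barrel
    barrels = [[] for _ in range(n_procs)]
    for k in keys:                       # the single routing/communication step
        barrels[(k - lo) // width].append(k)
    out = []
    for b in barrels:                    # each local sort runs independently
        out.extend(sorted(b))
    return out
```

    Concatenating the sorted barrels in range order yields a globally sorted sequence, since every key in barrel i is smaller than every key in barrel i+1.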

  14. Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jones, Terry R

    2012-01-01

    This paper describes a kernel scheduling algorithm that is based on coscheduling principles and that is intended for parallel applications running on 1000 cores or more. Experimental results for a Linux implementation on a Cray XT5 machine are presented. The results indicate that Linux is a suitable operating system for this new scheduling scheme, and that this design provides a dramatic improvement in scaling performance for synchronizing collective operations at scale.

  15. Automatic differentiation for design sensitivity analysis of structural systems using multiple processors

    NASA Technical Reports Server (NTRS)

    Nguyen, Duc T.; Storaasli, Olaf O.; Qin, Jiangning; Qamar, Ramzi

    1994-01-01

    An automatic differentiation tool (ADIFOR) is incorporated into a finite element based structural analysis program for shape and non-shape design sensitivity analysis of structural systems. The entire analysis and sensitivity procedures are parallelized and vectorized for high performance computation. Small scale examples to verify the accuracy of the proposed program and a medium scale example to demonstrate the parallel vector performance on multiple CRAY C90 processors are included.

  16. Implementing TCP/IP and a socket interface as a server in a message-passing operating system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hipp, E.; Wiltzius, D.

    1990-03-01

    The UNICOS 4.3BSD network code and socket transport interface are the basis of an explicit network server for NLTSS, a message passing operating system on the Cray YMP. A BSD socket user library provides access to the network server using an RPC mechanism. The advantages of this server methodology are its modularity and extensibility to migrate to future protocol suites (e.g. OSI) and transport interfaces. In addition, the network server is implemented in an explicit multi-tasking environment to take advantage of the Cray YMP multi-processor platform. 19 refs., 5 figs.

  17. Performance of a parallel algebraic multilevel preconditioner for stabilized finite element semiconductor device modeling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, Paul T.; Shadid, John N.; Sala, Marzio

    In this study results are presented for the large-scale parallel performance of an algebraic multilevel preconditioner for solution of the drift-diffusion model for semiconductor devices. The preconditioner is the key numerical procedure determining the robustness, efficiency and scalability of the fully-coupled Newton-Krylov based nonlinear solution method that is employed for this system of equations. The coupled system comprises a source-term-dominated Poisson equation for the electric potential and two convection-diffusion-reaction type equations for the electron and hole concentrations. The governing PDEs are discretized in space by a stabilized finite element method. Solution of the discrete system is obtained through a fully-implicit time integrator, a fully-coupled Newton-based nonlinear solver, and a restarted GMRES Krylov linear system solver. The algebraic multilevel preconditioner is based on an aggressive-coarsening graph partitioning of the nonzero block structure of the Jacobian matrix. Representative performance results are presented for various choices of multigrid V-cycles and W-cycles and parameter variations for smoothers based on incomplete factorizations. Parallel scalability results are presented for solution of up to 10^8 unknowns on 4096 processors of a Cray XT3/4 and an IBM POWER eServer system.

  18. Computer Center Reference Manual

    DTIC Science & Technology

    1988-06-20

    Compiler options, separated by colons (default: A-:B?-:BREG-8:BT-:C-:D+:H2:A H+24:L+:O+:P-:+:RV-:S4:S+4:A ST-:T+: TREG -8:U-:V+:A-:Z+) CPU- - Cray to execute...

  19. LASL benchmark performance 1978. [CDC STAR-100, 6600, 7600, Cyber 73, and CRAY-1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McKnight, A.L.

    1979-08-01

    This report presents the results of running several benchmark programs on a CDC STAR-100, a Cray Research CRAY-1, a CDC 6600, a CDC 7600, and a CDC Cyber 73. The benchmark effort included CRAY-1's at several installations running different operating systems and compilers. This benchmark is part of an ongoing program at Los Alamos Scientific Laboratory to collect performance data and monitor the development trend of supercomputers. 3 tables.

  20. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Ediger, David; Jiang, Karl

    2009-02-15

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.

  1. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Ediger, David; Jiang, Karl

    2009-05-29

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
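    The kernel being parallelized here is Brandes' betweenness centrality algorithm. A compact sequential sketch for unweighted graphs (the paper's contribution is the lock-free multithreaded version of exactly this loop structure):

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph given as
    an adjacency dict {vertex: [neighbors]}. No normalization applied."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward BFS from s, counting shortest paths (sigma).
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Back-propagate path dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

    The outer loop over source vertices is embarrassingly parallel; the harder part, and the focus of the paper, is making the inner BFS and dependency accumulation lock-free on a multithreaded machine.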

  2. Scalable nuclear density functional theory with Sky3D

    NASA Astrophysics Data System (ADS)

    Afibuzzaman, Md; Schuetrumpf, Bastian; Aktulga, Hasan Metin

    2018-02-01

    In nuclear astrophysics, quantum simulations of large inhomogeneous dense systems as they appear in the crusts of neutron stars present big challenges. The number of particles in a simulation with periodic boundary conditions is strongly limited due to the immense computational cost of the quantum methods. In this paper, we describe techniques for an efficient and scalable parallel implementation of Sky3D, a nuclear density functional theory solver that operates on an equidistant grid. Presented techniques allow Sky3D to achieve good scaling and high performance on a large number of cores, as demonstrated through detailed performance analysis on a Cray XC40 supercomputer.

  3. New computing systems and their impact on structural analysis and design

    NASA Technical Reports Server (NTRS)

    Noor, Ahmed K.

    1989-01-01

    A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.

  4. Integrating Grid Services into the Cray XT4 Environment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    NERSC; Cholia, Shreyas; Lin, Hwa-Chun Wendy

    2009-05-01

    The 38,640-core Cray XT4 "Franklin" system at the National Energy Research Scientific Computing Center (NERSC) is a massively parallel resource available to Department of Energy researchers that also provides on-demand grid computing to the Open Science Grid. The integration of grid services on Franklin presented various challenges, including fundamental differences between the interactive and compute nodes, a stripped-down compute-node operating system without dynamic library support, a shared-root environment and idiosyncratic application launching. In our work, we describe how we resolved these challenges on a running, general-purpose production system to provide on-demand compute, storage, accounting and monitoring services through generic grid interfaces that mask the underlying system-specific details for the end user.

  5. Climate Data Assimilation on a Massively Parallel Supercomputer

    NASA Technical Reports Server (NTRS)

    Ding, Hong Q.; Ferraro, Robert D.

    1996-01-01

    We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to 512 nodes of an Intel Paragon. The preconditioned conjugate gradient solver achieves a sustained 18 Gflops performance. Consequently, we achieve an unprecedented 100-fold reduction in time to solution on the Intel Paragon over a single head of a Cray C90. This not only exceeds the daily performance requirement of the Data Assimilation Office at NASA's Goddard Space Flight Center, but also makes it possible to explore much larger and challenging data assimilation problems which are unthinkable on a traditional computer platform such as the Cray C90.
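    The solver kernel named here is the preconditioned conjugate gradient method. A minimal dense sketch with a simple Jacobi (diagonal) preconditioner is shown below; this is the textbook iteration, not PSAS's actual implementation, and the test matrix is hypothetical:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for SPD A.
    M_inv is the (applied) inverse of the preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x                 # initial residual
    z = M_inv @ r                 # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)     # step length along search direction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p # new A-conjugate search direction
        rz = rz_new
    return x

# Hypothetical SPD test system with Jacobi preconditioning.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
M_inv = np.diag(1.0 / np.diag(A))
x = pcg(A, b, M_inv)
```

    The dominant costs per iteration, a matrix-vector product and a few dot products, are exactly the operations that distribute well across the nodes of a machine like the Paragon.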

  6. By Hand or Not By-Hand: A Case Study of Alternative Approaches to Parallelize CFD Applications

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Bailey, David (Technical Monitor)

    1997-01-01

    While parallel processing promises to speed up applications by several orders of magnitude, the performance achieved still depends upon several factors, including the multiprocessor architecture, system software, data distribution and alignment, as well as the methods used for partitioning the application and mapping its components onto the architecture. The existence of the Gordon Bell Prize, given out at Supercomputing every year, suggests that while good performance can be attained for real applications on general-purpose multiprocessors, the large investment in manpower and time still has to be repeated for each application-machine combination. As applications and machine architectures become more complex, the cost and time delays of obtaining performance by hand will become prohibitive. Computer users today can turn to three possible avenues for help: parallel libraries, parallel languages and compilers, and interactive parallelization tools. The success of these methodologies, in turn, depends on proper application of data dependency analysis, program structure recognition and transformation, and performance prediction, as well as exploitation of user-supplied knowledge. NASA has been developing multidisciplinary applications on highly parallel architectures under the High Performance Computing and Communications Program. Over the past six years, transitions of the underlying hardware and system software have forced the scientists to spend a large effort to migrate and recode their applications. Various attempts to exploit software tools to automate the parallelization process have not produced favorable results. In this paper, we report our most recent experience with CAPTOOL, a package developed at Greenwich University. We have chosen CAPTOOL for three reasons: 1. CAPTOOL accepts a FORTRAN 77 program as input, which suggests its potential applicability to a large collection of legacy codes currently in use. 2. CAPTOOL employs domain decomposition to obtain parallelism. Although the fact that not all kinds of parallelism are handled may seem unappealing, many NASA applications in computational aerosciences as well as earth and space sciences are amenable to domain decomposition. 3. CAPTOOL generates code for a large variety of environments employed across NASA centers: MPI/PVM on networks of workstations to the IBM SP2 and Cray T3D.

  7. Linux Kernel Co-Scheduling For Bulk Synchronous Parallel Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jones, Terry R

    2011-01-01

    This paper describes a kernel scheduling algorithm that is based on co-scheduling principles and that is intended for parallel applications running on 1000 cores or more, where inter-node scalability is key. Experimental results for a Linux implementation on a Cray XT5 machine are presented. The results indicate that Linux is a suitable operating system for this new scheduling scheme, and that this design provides a dramatic improvement in scaling performance for synchronizing collective operations at scale.

  8. Research in Computational Aeroscience Applications Implemented on Advanced Parallel Computing Systems

    NASA Technical Reports Server (NTRS)

    Wigton, Larry

    1996-01-01

    Improvements to the numerical linear algebra routines used in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR, are reported. A fast distance-calculation routine for Navier-Stokes codes using the new one-equation turbulence models was written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.

  9. Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization

    NASA Technical Reports Server (NTRS)

    Jones, James Patton; Nitzberg, Bill

    1999-01-01

    The NAS facility has operated parallel supercomputers for the past 11 years, including the Intel iPSC/860, Intel Paragon, Thinking Machines CM-5, IBM SP-2, and Cray Origin 2000. Across this wide variety of machine architectures, across a span of 10 years, across a large number of different users, and through thousands of minor configuration and policy changes, the utilization of these machines shows three general trends: (1) scheduling using a naive FIFO first-fit policy results in 40-60% utilization, (2) switching to the more sophisticated dynamic backfilling scheduling algorithm improves utilization by about 15 percentage points (yielding about 70% utilization), and (3) reducing the maximum allowable job size further increases utilization. Most surprising is the consistency of these trends. Over the lifetime of the NAS parallel systems, we made hundreds, perhaps thousands, of small changes to hardware, software, and policy, yet, utilization was affected little. In particular these results show that the goal of achieving near 100% utilization while supporting a real parallel supercomputing workload is unrealistic.
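    The utilization gap between naive FIFO first-fit and backfilling can be illustrated with a toy discrete-event simulation. The job trace and the simplified EASY-style rule below are hypothetical, for illustration only (NAS's production schedulers were far more elaborate):

```python
def simulate(jobs, procs, backfill=False):
    """Toy scheduler simulation. jobs: FIFO queue of (width, runtime),
    with every width <= procs. Returns machine utilization. With
    backfill=True, a later job may start early if it fits now and
    finishes before the queue head's reserved start time."""
    queue = list(jobs)
    running = []                       # [end_time, width] pairs
    t, busy_area, free = 0, 0, procs
    while queue or running:
        # FIFO first-fit: start the queue head while it fits.
        while queue and queue[0][0] <= free:
            w, r = queue.pop(0)
            running.append([t + r, w])
            free -= w
        if backfill and queue:
            # Head's reservation: earliest time enough processors free up.
            need, head_start = queue[0][0] - free, t
            acc = 0
            for end, w in sorted(running):
                acc += w
                if acc >= need:
                    head_start = end
                    break
            for job in list(queue[1:]):
                w, r = job
                if w <= free and t + r <= head_start:
                    queue.remove(job)          # backfill without delaying head
                    running.append([t + r, w])
                    free -= w
        # Advance time to the next job completion, accounting busy area.
        t_next = min(end for end, _ in running)
        busy_area += (t_next - t) * (procs - free)
        t = t_next
        for job in [j for j in running if j[0] == t]:
            running.remove(job)
            free += job[1]
    return busy_area / (t * procs)

jobs = [(2, 2), (4, 2), (1, 2)]        # hypothetical (width, runtime) trace
fifo_util = simulate(jobs, 4, backfill=False)
easy_util = simulate(jobs, 4, backfill=True)
```

    On this trace the narrow third job slots into the idle processors while the wide job waits, raising utilization, the same qualitative effect as the roughly 15-percentage-point gain reported above.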

  10. Eigensolution of finite element problems in a completely connected parallel architecture

    NASA Technical Reports Server (NTRS)

    Akl, Fred A.; Morel, Michael R.

    1989-01-01

    A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.

  11. A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database

    DOE PAGES

    Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...

    2013-01-01

    Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes, which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm for creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in a 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process, thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n ghost layers, up to the point where the whole partitioned mesh is ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.

  12. Proceedings of the Scientific Conference on Obscuration and Aerosol Research Held in Aberdeen Maryland on 27-30 June 1989

    DTIC Science & Technology

    1990-08-01

    corneal structure for both normal and swollen corneas. Other problems of future interest are the understanding of the structure of scarred and dystrophied corneas. METHOD AND RESULTS: The system of equations is solved numerically on a Cray X-MP by a finite element method with 9-node Lagrange quadrilaterals (Becker, Carey, and Oden, 1981, Finite Elements: An Introduction, Vol. 1, Prentice-Hall, Englewood Cliffs, NJ).

  13. Hot Chips and Hot Interconnects for High End Computing Systems

    NASA Technical Reports Server (NTRS)

    Saini, Subhash

    2005-01-01

    I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in IBM SP3 and SP4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. The IBM System-on-a-Chip used in IBM BlueGene/L; 5. The HP Alpha EV68 processor used in the DOE ASCI Q cluster; 6. The SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in the NEC SX-6/7; 8. The Power 4+ processor, which is used in the Hitachi SR11000; 9. The NEC proprietary processor used in the Earth Simulator. The IBM POWER5 and Red Storm computing systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on the latest NAS Parallel Benchmark results (MPI, OpenMP/HPF, and hybrid MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).

  14. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching Yuen

    1991-01-01

    A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on an 8192-processor CM-2 over a single processor of a CRAY-2.

  15. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching-Yuen

    1992-01-01

    This paper adopts a new Lagrangian formulation of the Euler equation for the calculation of two dimensional supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, we have achieved better than six times speed-up on an 8192-processor CM-2 over a single processor of a CRAY-2.

  16. Parallel Climate Data Assimilation PSAS Package

    NASA Technical Reports Server (NTRS)

    Ding, Hong Q.; Chan, Clara; Gennery, Donald B.; Ferraro, Robert D.

    1996-01-01

    We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to a 512-node Intel Paragon. The equation solver achieves a sustained 18 Gflops performance. As a result, we achieved an unprecedented 100-fold solution time reduction on the Intel Paragon parallel platform over the Cray C90. This not only meets and exceeds the DAO time requirements, but also significantly enlarges the window of exploration in climate data assimilations.

  17. Efficient Iterative Methods Applied to the Solution of Transonic Flows

    NASA Astrophysics Data System (ADS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Chronopoulos, Anthony T.

    1996-02-01

    We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested: a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel Thinking Machines CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems.
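
    The Newton-iterative structure described above can be sketched on a toy nonlinear system. This is our own illustrative code, not the paper's transonic solver: it uses a minimal, unpreconditioned dense GMRES in place of the ILU-preconditioned OSOmin/GMRES solvers, and a hypothetical 2-D algebraic system with an analytic Jacobian.

```python
import numpy as np

def gmres(A, b, tol=1e-10, max_iter=50):
    """Minimal dense GMRES: build a Krylov basis with Arnoldi and
    minimize the residual over it by least squares."""
    n = b.size
    Q = np.zeros((n, max_iter + 1))
    H = np.zeros((max_iter + 1, max_iter))
    beta = np.linalg.norm(b)
    Q[:, 0] = b / beta
    for k in range(max_iter):
        v = A @ Q[:, k]
        for i in range(k + 1):             # Arnoldi orthogonalization
            H[i, k] = Q[:, i] @ v
            v -= H[i, k] * Q[:, i]
        H[k + 1, k] = np.linalg.norm(v)
        e1 = np.zeros(k + 2)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 2, :k + 1], e1, rcond=None)
        # Converged, or Krylov space exhausted (happy breakdown)?
        if (np.linalg.norm(H[:k + 2, :k + 1] @ y - e1) < tol * beta
                or H[k + 1, k] < 1e-14):
            return Q[:, :k + 1] @ y
        Q[:, k + 1] = v / H[k + 1, k]
    return Q[:, :max_iter] @ y

def newton_gmres(F, J, x, steps=20):
    """Inexact Newton: solve J(x) dx = -F(x) with GMRES each step."""
    for _ in range(steps):
        dx = gmres(J(x), -F(x))
        x = x + dx
        if np.linalg.norm(F(x)) < 1e-12:
            break
    return x

# Toy nonlinear system with an analytic Jacobian; a root lies at (1, 1).
F = lambda x: np.array([x[0] ** 2 + x[1] - 2.0, x[0] + x[1] ** 2 - 2.0])
J = lambda x: np.array([[2 * x[0], 1.0], [1.0, 2 * x[1]]])
root = newton_gmres(F, J, np.array([2.0, 0.5]))
```

    The exact-Jacobian, Krylov-inner-solver pattern is the same one the paper applies to the transonic small disturbance equation, only there the Jacobian comes from the discretized PDE.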

  18. Solving large sparse eigenvalue problems on supercomputers

    NASA Technical Reports Server (NTRS)

    Philippe, Bernard; Saad, Youcef

    1988-01-01

    An important problem in scientific computing consists in finding a few eigenvalues and corresponding eigenvectors of a very large and sparse matrix. The most popular methods to solve these problems are based on projection techniques onto appropriate subspaces. The main attraction of these methods is that they only require the use of the matrix in the form of matrix-vector multiplications. The implementations on supercomputers of two such methods for symmetric matrices, namely the Lanczos method and Davidson's method, are compared. Since one of the most important operations in these two methods is the multiplication of vectors by the sparse matrix, methods of performing this operation efficiently are discussed. The advantages and the disadvantages of each method are compared and implementation aspects are discussed. Numerical experiments on a one-processor CRAY 2 and CRAY X-MP are reported. Possible parallel implementations are also discussed.
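
    As a minimal illustration of the projection idea, here is a Lanczos sketch of our own (not the paper's code): the matrix is touched only through the matrix-vector product, and full reorthogonalization is added for numerical robustness. A dense random symmetric matrix stands in for the large sparse matrix.

```python
import numpy as np

def lanczos_extreme_eigs(A, m, rng):
    """m steps of the Lanczos iteration; the only access to A is the
    matrix-vector product, which is the dominant sparse kernel."""
    n = A.shape[0]
    q = rng.standard_normal(n)
    Q = np.zeros((m + 1, n))
    Q[0] = q / np.linalg.norm(q)
    alphas = np.zeros(m)
    betas = np.zeros(m)
    for j in range(m):
        w = A @ Q[j]                      # matrix-vector multiplication
        alphas[j] = Q[j] @ w
        w -= Q[:j + 1].T @ (Q[:j + 1] @ w)  # full reorthogonalization
        betas[j] = np.linalg.norm(w)
        Q[j + 1] = w / betas[j]
    # Eigenvalues of the small tridiagonal matrix approximate the
    # extreme eigenvalues of A (Ritz values).
    T = (np.diag(alphas)
         + np.diag(betas[:m - 1], 1)
         + np.diag(betas[:m - 1], -1))
    return np.linalg.eigvalsh(T)

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
A = (A + A.T) / 2                         # symmetric test matrix
ritz = lanczos_extreme_eigs(A, 60, rng)
```

    With 60 steps on a 200 x 200 matrix, the largest and smallest Ritz values already match the true extreme eigenvalues to high accuracy, while only matrix-vector products were required.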

  19. High performance computing applications in neurobiological research

    NASA Technical Reports Server (NTRS)

    Ross, Muriel D.; Cheng, Rei; Doshay, David G.; Linton, Samuel W.; Montgomery, Kevin; Parnas, Bruce R.

    1994-01-01

    The human nervous system is a massively parallel processor of information. The vast numbers of neurons, synapses and circuits are daunting to those seeking to understand the neural basis of consciousness and intellect. Pervading obstacles are lack of knowledge of the detailed, three-dimensional (3-D) organization of even a simple neural system and the paucity of large scale, biologically relevant computer simulations. We use high performance graphics workstations and supercomputers to study the 3-D organization of gravity sensors as a prototype architecture foreshadowing more complex systems. Scaled-down simulations run on a Silicon Graphics workstation and scaled-up, three-dimensional versions run on the Cray Y-MP and CM5 supercomputers.

  20. Parallel k-means++ for Multiple Shared-Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
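
    The exact seeding step that the paper parallelizes can be sketched serially in a few lines. This is a generic illustration with our own data and names, not the paper's CPU/GPU/XMT implementation: each new center is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Exact k-means++ seeding (serial sketch)."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]            # first center: uniform
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                 # D^2 weighting
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
# Three well-separated blobs; seeding should pick one point per blob.
X = np.concatenate([rng.normal(m, 0.1, size=(50, 2)) for m in (0.0, 5.0, 10.0)])
C = kmeans_pp_init(X, 3, rng)
```

    The serial dependence of each draw on all previous centers is exactly what makes the exact algorithm harder to parallelize than its approximations.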

  1. Performance of a Bounce-Averaged Global Model of Super-Thermal Electron Transport in the Earth's Magnetic Field

    NASA Technical Reports Server (NTRS)

    McGuire, Tim

    1998-01-01

    In this paper, we report the results of our recent research on the application of a multiprocessor Cray T916 supercomputer in modeling super-thermal electron transport in the earth's magnetic field. In general, this mathematical model requires numerical solution of a system of partial differential equations. The code we use for this model is moderately vectorized. By using Amdahl's Law for vector processors, it can be verified that the code is about 60% vectorized on a Cray computer. Speedup factors on the order of 2.5 were obtained compared to the unvectorized code. In the following sections, we discuss the methodology of improving the code. In addition to our goal of optimizing the code for solution on the Cray computer, we had the goal of scalability in mind. Scalability combines the concepts of portability with near-linear speedup. Specifically, a scalable program is one whose performance is portable across many different architectures with differing numbers of processors for many different problem sizes. Though we have access to a Cray at this time, the goal was to also have code which would run well on a variety of architectures.
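
    The figures above are mutually consistent: with a vectorized fraction f, Amdahl's law bounds the overall speedup by 1/(1-f) no matter how fast the vector units are, and 1/(1-0.6) = 2.5 matches the reported factor. A minimal sketch (the function name is ours):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the runtime is accelerated
    by a factor s and the remaining (1 - f) stays serial."""
    return 1.0 / ((1.0 - f) + f / s)

# A ~60% vectorized code is capped at 1 / (1 - 0.6) = 2.5.
print(amdahl_speedup(0.6, 10.0))   # a realistic vector speedup
print(amdahl_speedup(0.6, 1e9))    # limiting case, approaches 2.5
```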

  2. Reversible Parallel Discrete-Event Execution of Large-scale Epidemic Outbreak Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perumalla, Kalyan S; Seal, Sudip K

    2010-01-01

    The spatial scale, runtime speed and behavioral detail of epidemic outbreak simulations together require the use of large-scale parallel processing. In this paper, an optimistic parallel discrete event execution of a reaction-diffusion simulation model of epidemic outbreaks is presented, with an implementation over the µsik simulator. Rollback support is achieved with the development of a novel reversible model that combines reverse computation with a small amount of incremental state saving. Parallel speedup and other runtime performance metrics of the simulation are tested on a small (8,192-core) Blue Gene/P system, while scalability is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes (up to several hundred million individuals in the largest case) are exercised.
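
    The rollback mechanism can be illustrated with a deliberately tiny, hypothetical event model (not the paper's reaction-diffusion model): the arithmetic updates are undone by pure reverse computation, and only the data-dependent quantity that the reverse step cannot recompute is saved incrementally.

```python
# Assumed toy model: each event infects k = min(susceptible, 3) people.
class Region:
    def __init__(self, susceptible, infected):
        self.susceptible = susceptible
        self.infected = infected
        self.saved_k = []          # incremental state: one int per event

    def forward(self):
        k = min(self.susceptible, 3)
        self.saved_k.append(k)     # save only what reverse cannot rebuild
        self.susceptible -= k      # += / -= pairs reverse by construction
        self.infected += k

    def reverse(self):
        k = self.saved_k.pop()     # restore the non-invertible part
        self.infected -= k
        self.susceptible += k

r = Region(susceptible=5, infected=0)
for _ in range(4):
    r.forward()                    # optimistic (speculative) execution
for _ in range(4):
    r.reverse()                    # rollback restores the exact state
```

    In an optimistic simulator like µsik, events are executed speculatively in this forward form and reversed when a straggler message forces a rollback; storing only `k` per event is what keeps the state-saving cost small.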

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

    The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in the ITER edge plasma on the largest US open-science computer, the Cray XK7 Titan, at its maximal heterogeneous capability. Such simulations were not possible before because the time-to-solution fell short by a factor of over 10 for completing one physics case in less than 5 days of wall-clock time. Frontier techniques are employed, such as nested OpenMP parallelism; adaptive parallel I/O; staging I/O and data reduction using dynamic and asynchronous application interactions; and dynamic repartitioning.

  4. Utilization of parallel processing in solving the inviscid form of the average-passage equation system for multistage turbomachinery

    NASA Technical Reports Server (NTRS)

    Mulac, Richard A.; Celestina, Mark L.; Adamczyk, John J.; Misegades, Kent P.; Dawson, Jef M.

    1987-01-01

    A procedure is outlined which utilizes parallel processing to solve the inviscid form of the average-passage equation system for multistage turbomachinery along with a description of its implementation in a FORTRAN computer code, MSTAGE. A scheme to reduce the central memory requirements of the program is also detailed. Both the multitasking and I/O routines referred to are specific to the Cray X-MP line of computers and its associated SSD (Solid-State Disk). Results are presented for a simulation of a two-stage rocket engine fuel pump turbine.

  5. A distributed-memory approximation algorithm for maximum weight perfect bipartite matching

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Buluc, Aydin; Li, Xiaoye S.

    We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e., the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization, where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we propose a fully parallel distributed-memory algorithm that first generates a perfect matching and then searches for weight-augmenting cycles of length four in parallel and iteratively augments the matching with a vertex-disjoint set of such cycles. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.
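
    The augmentation idea can be sketched serially on a dense toy instance (random weights, an arbitrary initial perfect matching, our own function name); this only illustrates weight-augmenting 4-cycles, not the paper's distributed algorithm. Swapping the partners of two rows is exactly an alternating cycle of length four, and it is applied whenever it raises the total weight.

```python
import numpy as np

def augment_4cycles(W, match):
    """Given a perfect matching (match[i] = column matched to row i),
    repeatedly swap the partners of row pairs whenever the swap,
    i.e. a weight-augmenting cycle of length four, raises the weight."""
    n = len(match)
    improved = True
    while improved:                 # terminates: weight strictly increases
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                gain = (W[i, match[j]] + W[j, match[i]]
                        - W[i, match[i]] - W[j, match[j]])
                if gain > 0:
                    match[i], match[j] = match[j], match[i]
                    improved = True
    return match

rng = np.random.default_rng(2)
W = rng.random((6, 6))
match = list(range(6))              # any initial perfect matching
match = augment_4cycles(W, match)
weight = sum(W[i, match[i]] for i in range(6))
```

    On termination the matching is still perfect and no single 4-cycle can improve it; the paper's contribution is finding vertex-disjoint sets of such cycles in parallel across a distributed graph.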

  6. Efficient iterative methods applied to the solution of transonic flows

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wissink, A.M.; Lyrintzis, A.S.; Chronopoulos, A.T.

    1996-02-01

    We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested: a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel Thinking Machines CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems. 38 refs., 14 figs., 7 tabs.

  7. FRPA: A Framework for Recursive Parallel Algorithms

    DTIC Science & Technology

    2015-05-01

    atoi(argv[1]); std::string interleaving = (argc > 2) ? argv[2] : ""; double *A = randomArray(length ...actually determines how deep the recursion is. For example, a configuration with schedule 'BBDB' and depth 3 represents the interleaving 'BBD'. This means...depth 3 represents the same interleaving as the configuration with schedule 'BBDD' and depth 3, namely 'BBD'. In our experiments, this redundancy did

  8. Parallel-vector computation for structural analysis and nonlinear unconstrained optimization problems

    NASA Technical Reports Server (NTRS)

    Nguyen, Duc T.

    1990-01-01

    Practical engineering applications can often be formulated as constrained optimization problems, and there are several solution algorithms for solving them. One approach is to convert a constrained problem into a series of unconstrained problems; unconstrained solution algorithms can then be used as part of the constrained solution algorithms. Structural optimization is an iterative process: one starts with an initial design, and a finite element structural analysis is performed to calculate the response of the system (such as displacements, stresses, eigenvalues, etc.). Based upon the sensitivity information on the objective and constraint functions, an optimizer such as ADS or IDESIGN can be used to find a new, improved design. For the structural analysis phase, the equation solver for the system of simultaneous linear equations plays a key role, since it is needed for static, eigenvalue, and dynamic analysis alike. For practical, large-scale structural analysis-synthesis applications, computational time can be excessively large. Thus, it is necessary to have a new structural analysis-synthesis code which employs new solution algorithms to exploit both the parallel and vector capabilities offered by modern, high-performance computers such as the Convex, Cray-2 and Cray Y-MP. The objective of this research project is, therefore, to incorporate the latest developments in the parallel-vector equation solver PVSOLVE into a widely used finite-element production code, SAP-4. Furthermore, several nonlinear unconstrained optimization subroutines have also been developed and tested under a parallel computer environment. The unconstrained optimization subroutines are not only useful in their own right but can also be incorporated into a more popular constrained optimization code, such as ADS.

  9. Parallel-vector computation for linear structural analysis and non-linear unconstrained optimization problems

    NASA Technical Reports Server (NTRS)

    Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.

    1991-01-01

    Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.

  10. High Performance Computing and Visualization Infrastructure for Simultaneous Parallel Computing and Parallel Visualization Research

    DTIC Science & Technology

    2016-11-09

    Total Number: Sub Contractors (DD882); Names of Personnel receiving masters degrees; Names of personnel receiving PhDs; Names of other research staff ... Broadcom 5720 QP 1Gb Network Daughter Card; (2) Intel Xeon E5-2680 v3 2.5GHz, 30M Cache, 9.60GT/s QPI, Turbo, HT, 12C/24T (120W)

  11. High-Performance Design Patterns for Modern Fortran

    DOE PAGES

    Haveraaen, Magne; Morris, Karla; Rouson, Damian; ...

    2015-01-01

    This paper presents ideas for using coordinate-free numerics in modern Fortran to achieve code flexibility in the partial differential equation (PDE) domain. We also show how Fortran, over the last few decades, has changed to become a language well-suited for state-of-the-art software development. Fortran's new coarray distributed data structure, the language's class mechanism, and its side-effect-free, pure procedure capability provide the scaffolding on which we implement HPC software. These features empower compilers to organize parallel computations with efficient communication. We present some programming patterns that support asynchronous evaluation of expressions composed of parallel operations on distributed data. We implemented these patterns using coarrays and the Message Passing Interface (MPI). We compared the codes' complexity and performance. The MPI code is much more complex and depends on external libraries. The MPI code on Cray hardware using the Cray compiler is 1.5-2 times faster than the coarray code on the same hardware. The Intel compiler implements coarrays atop Intel's MPI library, with the result apparently being 2-2.5 times slower than manually coded MPI despite exhibiting nearly linear scaling efficiency. As compilers mature and further improvements to coarrays come in Fortran 2015, we expect this performance gap to narrow.

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel W.

    Coupled-cluster methods provide highly accurate models of molecular structure by explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the best optimized shared memory implementation. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, BlueGene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from compute-bound DGEMMs to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load imbalance. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  13. The factorization of large composite numbers on the MPP

    NASA Technical Reports Server (NTRS)

    Mckurdy, Kathy J.; Wunderlich, Marvin C.

    1987-01-01

    The continued fraction method for factoring large integers (CFRAC) was an ideal algorithm to implement on a massively parallel computer such as the Massively Parallel Processor (MPP). After much effort, the first 60-digit number was factored on the MPP using about 6 1/2 hours of array time. Although this result added about 10 digits to the size of number that could be factored using CFRAC on a serial machine, it was already badly beaten by the implementation of Davis and Holdridge on the CRAY-1 using the quadratic sieve, an algorithm which is clearly superior to CFRAC for large numbers. An algorithm is illustrated which is ideally suited to the single instruction multiple data (SIMD) massively parallel architecture, and some of the modifications which were needed to make the parallel implementation effective and efficient are described.

  14. Multitasking and microtasking experience on the NAS Cray-2 and ACF Cray X-MP

    NASA Technical Reports Server (NTRS)

    Raiszadeh, Farhad

    1987-01-01

    The fast Fourier transform (FFT) kernel of the NAS benchmark program has been utilized to experiment with the multitasking library on the Cray-2 and Cray X-MP/48, and microtasking directives on the Cray X-MP. Some performance figures are shown, and the state of multitasking software is described.

  15. An implementation of a tree code on a SIMD, parallel computer

    NASA Technical Reports Server (NTRS)

    Olson, Kevin M.; Dorband, John E.

    1994-01-01

    We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k-processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
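
    The sort-based balanced tree construction can be sketched serially in a few lines. This recursive-bisection toy (our own names; a serial NumPy sort standing in for the parallel sorts) splits the sorted particle list in half along x, y and z in turn until each leaf pairs exactly two particles, preserving spatial locality.

```python
import numpy as np

def build_pairs(idx, pts, axis=0):
    """Recursively bisect a particle index list along cycling
    coordinate axes; leaves of the balanced tree pair each particle
    with exactly one spatially nearby partner."""
    if len(idx) == 2:
        return [tuple(idx)]
    order = np.argsort(pts[idx, axis])    # sort along the current axis
    idx = idx[order]
    half = len(idx) // 2                  # balanced split
    nxt = (axis + 1) % 3                  # cycle x -> y -> z -> x
    return (build_pairs(idx[:half], pts, nxt)
            + build_pairs(idx[half:], pts, nxt))

rng = np.random.default_rng(3)
pts = rng.random((64, 3))                 # particle count a power of two
pairs = build_pairs(np.arange(64), pts)
```

    Because every split is exactly in half, the resulting tree is completely balanced, which is what makes the construction map well onto lockstep SIMD hardware.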

  16. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

    1989-01-01

    The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction, multiple data) computer, the Connection Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element, with an exchange of nodal forces at each time step. The architectural and C* programming language features of the Connection Machine are also summarized. Various alternative data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the Connection Machine is capable of outperforming the Cray X-MP/14.

  17. The utilization of parallel processing in solving the inviscid form of the average-passage equation system for multistage turbomachinery

    NASA Technical Reports Server (NTRS)

    Mulac, Richard A.; Celestina, Mark L.; Adamczyk, John J.; Misegades, Kent P.; Dawson, Jef M.

    1987-01-01

    A procedure is outlined which utilizes parallel processing to solve the inviscid form of the average-passage equation system for multistage turbomachinery along with a description of its implementation in a FORTRAN computer code, MSTAGE. A scheme to reduce the central memory requirements of the program is also detailed. Both the multitasking and I/O routines referred to in this paper are specific to the Cray X-MP line of computers and its associated SSD (Solid-state Storage Device). Results are presented for a simulation of a two-stage rocket engine fuel pump turbine.

  18. Anisotropic electron temperature measurements without knowing the spectral transmissivity for a JT-60SA Thomson scattering diagnostic.

    PubMed

    Tojo, H; Hatae, T; Yatsuka, E; Itami, K

    2012-10-01

    This paper focuses on a method for measuring the electron temperature (T(e)) without knowing the transmissivity, using a Thomson scattering diagnostic with a double-pass scattering system. Application of this method to measuring the anisotropic T(e), i.e., the T(e) in the directions parallel (T(eparallel)) and perpendicular (T(eperpendicular)) to the magnetic field, is proposed. Simulations based on the designed parameters for JT-60SA indicate the feasibility of the measurements except in certain T(e) ranges, e.g., T(eparallel) ~ 3.5T(eperpendicular) at 120° of the scattering angle.

  19. Highly parallel implementation of non-adiabatic Ehrenfest molecular dynamics

    NASA Astrophysics Data System (ADS)

    Kanai, Yosuke; Schleife, Andre; Draeger, Erik; Anisimov, Victor; Correa, Alfredo

    2014-03-01

    While the adiabatic Born-Oppenheimer approximation tremendously lowers computational effort, many questions in modern physics, chemistry, and materials science require an explicit description of coupled non-adiabatic electron-ion dynamics. Electronic stopping, i.e. the energy transfer of a fast projectile atom to the electronic system of the target material, is a notorious example. We recently implemented real-time time-dependent density functional theory based on the plane-wave pseudopotential formalism in the Qbox/qb@ll codes. We demonstrate that explicit integration using a fourth-order Runge-Kutta scheme is very suitable for modern highly parallelized supercomputers. Applying the new implementation to systems with hundreds of atoms and thousands of electrons, we achieved excellent performance and scalability on a large number of nodes both on the BlueGene-based ``Sequoia'' system at LLNL as well as on the Cray architecture of ``Blue Waters'' at NCSA. As an example, we discuss our work on computing the electronic stopping power of aluminum and gold for hydrogen projectiles, showing excellent agreement with experiment. These first-principles calculations allow us to gain important insight into the fundamental physics of electronic stopping.

  20. Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines

    NASA Technical Reports Server (NTRS)

    Ajmani, Kumud; Liou, Meng-Sing; Dyson, Rodger W.

    1994-01-01

    The GMRES method is parallelized, and combined with local preconditioning to construct an implicit parallel solver to obtain steady-state solutions for the Navier-Stokes equations of fluid flow on distributed-memory machines. The new implicit parallel solver is designed to preserve the convergence rate of the equivalent 'serial' solver. A static domain decomposition is used to partition the computational domain amongst the available processing nodes of the parallel machine. The SPMD (Single-Program Multiple-Data) programming model is combined with message-passing tools to develop the parallel code on a 32-node Intel Hypercube and a 512-node Intel Delta machine. The implicit parallel solver is validated for internal and external flow problems, and is found to compare identically with flow solutions obtained on a Cray Y-MP/8. A peak computational speed of 2300 MFlops/sec has been achieved on 512 nodes of the Intel Delta machine, for a problem size of 1024 K equations (256 K grid points).

  1. Implementation of molecular dynamics and its extensions with the coarse-grained UNRES force field on massively parallel systems; towards millisecond-scale simulations of protein structure, dynamics, and thermodynamics

    PubMed Central

    Liwo, Adam; Ołdziej, Stanisław; Czaplewski, Cezary; Kleinerman, Dana S.; Blood, Philip; Scheraga, Harold A.

    2010-01-01

    We report the implementation of our united-residue UNRES force field for simulations of protein structure and dynamics with massively parallel architectures. In addition to coarse-grained parallelism already implemented in our previous work, in which each conformation was treated by a different task, we introduce a fine-grained level in which energy and gradient evaluation are split between several tasks. The Message Passing Interface (MPI) libraries have been utilized to construct the parallel code. The parallel performance of the code has been tested on a professional Beowulf cluster (Xeon Quad Core), a Cray XT3 supercomputer, and two IBM BlueGene/P supercomputers with canonical and replica-exchange molecular dynamics. With IBM BlueGene/P, about 50% efficiency and a 120-fold speed-up of the fine-grained part was achieved for a single trajectory of a 767-residue protein with use of 256 processors/trajectory. Because of averaging over the fast degrees of freedom, UNRES provides an effective 1000-fold speed-up compared to the experimental time scale and, therefore, enables us to effectively carry out millisecond-scale simulations of proteins with 500 and more amino-acid residues in days of wall-clock time. PMID:20305729

  2. Geophysics of Small Planetary Bodies

    NASA Technical Reports Server (NTRS)

    Asphaug, Erik I.

    1998-01-01

As a SETI Institute PI from 1996-1998, Erik Asphaug studied impact and tidal physics and other geophysical processes associated with small (low-gravity) planetary bodies. This work included: a numerical impact simulation linking basaltic achondrite meteorites to asteroid 4 Vesta (Asphaug 1997), which laid the groundwork for an ongoing study of Martian meteorite ejection; cratering and catastrophic evolution of small bodies (with implications for their internal structure; Asphaug et al. 1996); genesis of grooved and degraded terrains in response to impact; maturation of regolith (Asphaug et al. 1997a); and the variation of crater outcome with impact angle, speed, and target structure. Research on impacts into porous, layered and prefractured targets (Asphaug et al. 1997b, 1998a) showed how shape, rheology and structure dramatically affect the sizes and velocities of ejecta, and the survivability and impact-modification of comets and asteroids (Asphaug et al. 1998a). As an affiliate of the Galileo SSI Team, the PI studied problems related to cratering, tectonics, and regolith evolution, including an estimate of the impactor flux around Jupiter and the effect of impact on local and regional tectonics (Asphaug et al. 1998b). Other research included tidal breakup modeling (Asphaug and Benz 1996; Schenk et al. 1996), which is leading to a general understanding of the role of tides in planetesimal evolution. As a Guest Computational Investigator for NASA's BPCC/ESS supercomputer testbed, the PI helped graft SPH3D onto an existing tree code tuned for the massively parallel Cray T3E (Olson and Asphaug, in preparation), obtaining a factor-of-1000 speedup in code execution time (on 512 CPUs). Runs which once took months are now completed in hours.

  3. ARC2D - EFFICIENT SOLUTION METHODS FOR THE NAVIER-STOKES EQUATIONS (CRAY VERSION)

    NASA Technical Reports Server (NTRS)

    Pulliam, T. H.

    1994-01-01

    ARC2D is a computational fluid dynamics program developed at the NASA Ames Research Center specifically for airfoil computations. The program uses implicit finite-difference techniques to solve two-dimensional Euler equations and thin layer Navier-Stokes equations. It is based on the Beam and Warming implicit approximate factorization algorithm in generalized coordinates. The methods are either time accurate or accelerated non-time accurate steady state schemes. The evolution of the solution through time is physically realistic; good solution accuracy is dependent on mesh spacing and boundary conditions. The mathematical development of ARC2D begins with the strong conservation law form of the two-dimensional Navier-Stokes equations in Cartesian coordinates, which admits shock capturing. The Navier-Stokes equations can be transformed from Cartesian coordinates to generalized curvilinear coordinates in a manner that permits one computational code to serve a wide variety of physical geometries and grid systems. ARC2D includes an algebraic mixing length model to approximate the effect of turbulence. In cases of high Reynolds number viscous flows, thin layer approximation can be applied. ARC2D allows for a variety of solutions to stability boundaries, such as those encountered in flows with shocks. The user has considerable flexibility in assigning geometry and developing grid patterns, as well as in assigning boundary conditions. However, the ARC2D model is most appropriate for attached and mildly separated boundary layers; no attempt is made to model wake regions and widely separated flows. The techniques have been successfully used for a variety of inviscid and viscous flowfield calculations. The Cray version of ARC2D is written in FORTRAN 77 for use on Cray series computers and requires approximately 5Mb memory. The program is fully vectorized. The tape includes variations for the COS and UNICOS operating systems. 
Also included is a sample routine for CONVEX computers to emulate Cray system time calls, which should be easy to modify for other machines as well. The standard distribution media for this version is a 9-track 1600 BPI ASCII Card Image format magnetic tape. The Cray version was developed in 1987. The IBM ES/3090 version is an IBM port of the Cray version. It is written in IBM VS FORTRAN and has the capability of executing in both vector and parallel modes on the MVS/XA operating system and in vector mode on the VM/XA operating system. Various options of the IBM VS FORTRAN compiler provide new features for the ES/3090 version, including 64-bit arithmetic and up to 2 GB of virtual addressability. The IBM ES/3090 version is available only as a 9-track, 1600 BPI IBM IEBCOPY format magnetic tape. The IBM ES/3090 version was developed in 1989. The DEC RISC ULTRIX version is a DEC port of the Cray version. It is written in FORTRAN 77 for RISC-based Digital Equipment platforms. The memory requirement is approximately 7Mb of main memory. It is available in UNIX tar format on TK50 tape cartridge. The port to DEC RISC ULTRIX was done in 1990. COS and UNICOS are trademarks and Cray is a registered trademark of Cray Research, Inc. IBM, ES/3090, VS FORTRAN, MVS/XA, and VM/XA are registered trademarks of International Business Machines. DEC and ULTRIX are registered trademarks of Digital Equipment Corporation.

  4. A Decade of Hubble Space Telescope Science

    NASA Astrophysics Data System (ADS)

    Livio, Mario; Noll, Keith; Stiavelli, Massimo

    2003-06-01

1. HST studies of Mars J. F. Bell; 2. HST images of Jupiter's UV aurora J. T. Clarke; 3. Star formation J. Bally; 4. SN1987A: the birth of a supernova remnant R. McCray; 5. Globular clusters: the view from HST W. E. Harris; 6. Ultraviolet absorption line studies of the Galactic interstellar medium with the Goddard High Resolution Spectrograph B. D. Savage; 7. HST's view of the center of the Milky Way galaxy M. J. Rieke; 8. Stellar populations in dwarf galaxies: a review of the contribution of HST to our understanding of the nearby universe E. Tolstoy; 9. The formation of star clusters B. C. Whitmore; 10. Starburst galaxies observed with the Hubble Space Telescope C. Leitherer; 11. Supermassive black holes F. D. Macchetto; 12. The HST Key Project to measure the Hubble Constant W. L. Freedman, R. C. Kennicutt, J. R. Mould and B. F. Madore; 13. H0 from Type Ia Supernovae G. A. Tammann, A. Sandage and A. Saha; 14. Strong gravitational lensing: cosmology from angles and redshifts A. Tyson.

  5. Aerodynamic optimization studies on advanced architecture computers

    NASA Technical Reports Server (NTRS)

    Chawla, Kalpana

    1995-01-01

The approach to carrying out multi-discipline aerospace design studies in the future, especially in massively parallel computing environments, comprises choosing (1) suitable solvers to compute solutions to equations characterizing a discipline, and (2) efficient optimization methods. In addition, for aerodynamic optimization problems, (3) smart methodologies must be selected to modify the surface shape. In this research effort, a 'direct' optimization method is implemented on the Cray C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes solver, OVERFLOW, to compute flow solutions. The optimization method is chosen such that it can accommodate multi-discipline optimization in future computations. In this work, however, only single discipline aerodynamic optimization will be included.

  6. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reed, D.A.; Grunwald, D.C.

The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of these processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At the other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodsmen attacking the log with chainsaws.

  7. Domain decomposition methods in aerodynamics

    NASA Technical Reports Server (NTRS)

    Venkatakrishnan, V.; Saltz, Joel

    1990-01-01

Compressible Euler equations are solved for two-dimensional problems by a preconditioned conjugate gradient-like technique. An approximate Riemann solver is used to compute the numerical fluxes to second order accuracy in space. Two ways to achieve parallelism are tested, one which makes use of parallelism inherent in triangular solves and the other which employs domain decomposition techniques. The vectorization/parallelism in triangular solves is realized by the use of a reordering technique called wavefront ordering. This process involves the interpretation of the triangular matrix as a directed graph and the analysis of the data dependencies. It is noted that the factorization can also be done in parallel with the wavefront ordering. The performances of two ways of partitioning the domain, strips and slabs, are compared. Results on a Cray Y-MP are reported for an inviscid transonic test case. The performances of linear algebra kernels are also reported.
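The wavefront idea above — read the triangular factor as a directed graph, then group rows with no mutual data dependencies into levels that can be eliminated simultaneously — can be made concrete with a small sketch. This is my own dense-matrix NumPy rendition, not the paper's implementation:

```python
import numpy as np

def wavefront_levels(L):
    """Level-schedule a lower-triangular matrix: row i depends on every row j < i
    with L[i, j] != 0; rows assigned the same level are mutually independent."""
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        deps = np.nonzero(L[i, :i])[0]       # predecessors of row i in the DAG
        if deps.size:
            level[i] = level[deps].max() + 1
    return level

def triangular_solve_by_wavefronts(L, b):
    """Forward substitution swept one wavefront at a time; rows within a level
    have no data dependencies and could run on different pipes or processors."""
    lev = wavefront_levels(L)
    x = np.zeros_like(b, dtype=float)
    for l in range(lev.max() + 1):
        for i in np.nonzero(lev == l)[0]:    # independent rows: parallel in principle
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x
```

The number of wavefronts bounds the sequential depth of the solve; the more rows per level, the more parallelism the ordering exposes.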

  8. Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

    NASA Technical Reports Server (NTRS)

    Fatoohi, Rod; Saini, Subbash; Ciotti, Robert

    2006-01-01

We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different numbers of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray X1 shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our results show the impact of the network bandwidth and topology on the overall performance of each interconnect.

  9. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
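As a concrete illustration of the seeding rule the record describes (first seed uniform, each later seed drawn with probability proportional to its squared distance from the nearest seed so far), here is a minimal serial Python sketch — my own, not the project's C++ code:

```python
import random

def kmeans_pp_seeds(points, k, rng=None):
    """k-means++ seed selection (Arthur & Vassilvitskii): D^2-weighted sampling."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]             # first seed: uniform at random
    while len(seeds) < k:
        # squared distance from each point to its nearest already-chosen seed
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, s)) for s in seeds)
              for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):         # roulette-wheel draw weighted by d2
            acc += w
            if acc >= r:
                seeds.append(p)
                break
        else:                                # guard against round-off at the top end
            seeds.append(points[-1])
    return seeds
```

The d2 scan over all points is the data-parallel kernel that the GPU, OpenMP, and Cray XMT versions would presumably distribute; the sequential roulette scan is the part that needs a parallel reduction or prefix sum to scale.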

  10. Parallelized reliability estimation of reconfigurable computer networks

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Das, Subhendu; Palumbo, Dan

    1990-01-01

    A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.
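The upper/lower-bound scheme described above can be caricatured in a few lines: credit the probability of reaching a failure state to the lower bound, and charge any pruned (unexplored) probability mass to the upper bound. This is a toy single-threaded sketch with an invented three-unit failure model, not ASSURE's grammar-driven generator:

```python
def bound_failure(initial, transitions, is_failure, cutoff=1e-12, horizon=20):
    """Bound system-failure probability by truncated state-space exploration.
    Paths whose probability falls below `cutoff` (or that exceed `horizon` hops)
    are not expanded; their mass widens the upper bound."""
    lower, slack = 0.0, 0.0
    stack = [(initial, 1.0, 0)]
    while stack:
        s, p, depth = stack.pop()
        if is_failure(s):
            lower += p                       # definitely failed along this path
        elif p < cutoff or depth >= horizon:
            slack += p                       # pruned: could still fail later
        else:
            for t, q in transitions(s):
                stack.append((t, p * q, depth + 1))
    return lower, lower + slack

# Invented model: n working units; each step a unit fails with prob 0.1,
# otherwise the system reaches a safe absorbing state (-1). Failure at n == 0.
def trans(n):
    return [] if n == -1 else [(n - 1, 0.1), (-1, 0.9)]

lo, hi = bound_failure(3, trans, lambda s: s == 0)
# failure requires three successive faults, so the exact answer is 0.1**3
```

In ASSURE the generation and analysis run concurrently across processors; here the point is only how the two bounds bracket the exact failure probability.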

  11. Comparison of the MPP with other supercomputers for LANDSAT data processing

    NASA Technical Reports Server (NTRS)

    Ozga, Martin

    1987-01-01

The Massively Parallel Processor (MPP) is compared to the CRAY X-MP and the CYBER-205 for LANDSAT data processing. The maximum likelihood classification algorithm is the basis for comparison, since this algorithm is simple to implement and vectorizes very well. The algorithm was implemented on all three machines and tested by classifying the same full scene of LANDSAT multispectral scanner data. Timings are compared, as well as features of the machines and available software.
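For Gaussian class statistics, the maximum likelihood classifier referred to above reduces to evaluating one quadratic discriminant per class for every pixel and taking the argmax — exactly the kind of per-pixel arithmetic that vectorizes well. A minimal NumPy sketch (notation and names mine, not the original implementation):

```python
import numpy as np

def ml_classify(pixels, means, covs):
    """Gaussian maximum-likelihood classification: pick the class c maximizing
    g_c(x) = -ln|Sigma_c| - (x - mu_c)^T Sigma_c^{-1} (x - mu_c)."""
    scores = []
    for mu, S in zip(means, covs):
        d = pixels - mu                              # (npixels, nbands) residuals
        Sinv = np.linalg.inv(S)
        maha = np.einsum('ij,jk,ik->i', d, Sinv, d)  # Mahalanobis distance, all pixels
        scores.append(-np.log(np.linalg.det(S)) - maha)
    return np.argmax(np.stack(scores), axis=0)       # best class per pixel
```

All pixels are scored against a class in one vector sweep, which is why the algorithm maps so directly onto vector machines like the X-MP and CYBER-205 as well as onto the MPP's array of processors.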

  12. Experiences and results multitasking a hydrodynamics code on global and local memory machines

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mandell, D.

    1987-01-01

A one-dimensional, time-dependent Lagrangian hydrodynamics code using a Godunov solution method has been multitasked for the Cray X-MP/48, the Intel iPSC hypercube, the Alliant FX series and the IBM RP3 computers. Actual multitasking results have been obtained for the Cray, Intel and Alliant computers, and simulated results were obtained for the Cray and RP3 machines. The differences in the methods required to multitask on each of the machines are discussed. Results are presented for a sample problem involving a shock wave moving down a channel. Comparisons are made between theoretical speedups, predicted by Amdahl's law, and the actual speedups obtained. The problems of debugging on the different machines are also described.

  13. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel

Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from compute-bound DGEMMs to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load imbalance: tasking and bulk-synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  14. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE PAGES

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel; ...

    2017-03-08

Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from compute-bound DGEMMs to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load imbalance: tasking and bulk-synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  15. Multitasking for flows about multiple body configurations using the chimera grid scheme

    NASA Technical Reports Server (NTRS)

    Dougherty, F. C.; Morgan, R. L.

    1987-01-01

The multitasking of a finite-difference scheme using multiple overset meshes is described. In this chimera, or multiple overset mesh, approach, a multiple body configuration is mapped using a major grid about the main component of the configuration, with minor overset meshes used to map each additional component. This type of code is well suited to multitasking. Both steady and unsteady two-dimensional computations are run on parallel processors on a CRAY X-MP/48, usually with one mesh per processor. Flow field results are compared with single processor results to demonstrate the feasibility of running multiple mesh codes on parallel processors and to show the increase in efficiency.

  16. Anisotropic electron temperature measurements without knowing the spectral transmissivity for a JT-60SA Thomson scattering diagnostic

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tojo, H.; Hatae, T.; Yatsuka, E.

    2012-10-15

This paper focuses on a method for measuring the electron temperature (T_e) without knowing the transmissivity, using a Thomson scattering diagnostic with a double-pass scattering system. Application of this method to measuring the anisotropic T_e, i.e., the T_e in the directions parallel (T_e∥) and perpendicular (T_e⊥) to the magnetic field, is proposed. Simulations based on the designed parameters for JT-60SA indicate the feasibility of the measurements except in certain T_e ranges, e.g., T_e∥ ≈ 3.5 T_e⊥ at a scattering angle of 120°.

  17. INS3D - NUMERICAL SOLUTION OF THE INCOMPRESSIBLE NAVIER-STOKES EQUATIONS IN THREE-DIMENSIONAL GENERALIZED CURVILINEAR COORDINATES (CRAY VERSION)

    NASA Technical Reports Server (NTRS)

    Rogers, S. E.

    1994-01-01

INS3D computes steady-state solutions to the incompressible Navier-Stokes equations. The INS3D approach utilizes pseudo-compressibility combined with an approximate factorization scheme. This computational fluid dynamics (CFD) code has been verified on problems such as flow through a channel, flow over a backward-facing step and flow over a circular cylinder. Three-dimensional cases include flow over an ogive cylinder, flow through a rectangular duct, wind tunnel inlet flow, cylinder-wall juncture flow and flow through multiple posts mounted between two plates. INS3D uses a pseudo-compressibility approach in which a time derivative of pressure is added to the continuity equation, which together with the momentum equations forms a set of four equations with pressure and velocity as the dependent variables. The equations are transformed to generalized coordinates for general three-dimensional applications. The equations are advanced in time by the implicit, non-iterative, approximately-factored, finite-difference scheme of Beam and Warming. The numerical stability of the scheme depends on the use of higher-order smoothing terms to damp out higher-frequency oscillations caused by second-order central differencing. The artificial compressibility introduces pressure (sound) waves of finite speed (whereas the speed of sound would be infinite in an incompressible fluid). As the solution converges, these pressure waves die out, causing the derivative of pressure with respect to time to approach zero. Thus, continuity is satisfied for the incompressible fluid in the steady state. Computational efficiency is achieved using a diagonal algorithm. A block tri-diagonal option is also available. When a steady-state solution is reached, the modified continuity equation will satisfy the divergence-free velocity field condition.
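In standard Chorin-type artificial-compressibility notation (the abstract names no symbols; β below is the artificial compressibility parameter and ν the kinematic viscosity, my notation rather than the program's), the modified system described above reads approximately:

```latex
\frac{1}{\beta}\,\frac{\partial p}{\partial \tau} + \nabla\cdot\mathbf{u} = 0,
\qquad
\frac{\partial \mathbf{u}}{\partial \tau} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\nabla p + \nu\,\nabla^{2}\mathbf{u}.
```

The added ∂p/∂τ term gives pressure waves a finite pseudo-speed; as the iteration converges, ∂p/∂τ → 0 and the first equation collapses to the divergence-free condition ∇·u = 0, which is exactly the steady-state continuity statement above.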
INS3D is capable of handling several different types of boundaries encountered in numerical simulations, including solid-surface, inflow and outflow, and far-field boundaries. Three machine versions of INS3D are available. INS3D for the CRAY is written in CRAY FORTRAN for execution on a CRAY X-MP under COS, INS3D for the IBM is written in FORTRAN 77 for execution on an IBM 3090 under the VM or MVS operating system, and INS3D for DEC RISC-based systems is written in RISC FORTRAN for execution on a DEC workstation running RISC ULTRIX 3.1 or later. The CRAY version has a central memory requirement of 730279 words. The central memory requirement for the IBM is 150Mb. The memory requirement for the DEC RISC ULTRIX version is 3Mb of main memory. INS3D was developed in 1987. The port to the IBM was done in 1990. The port to the DECstation 3100 was done in 1991. CRAY is a registered trademark of Cray Research Inc. IBM is a registered trademark of International Business Machines. DEC, DECstation, and ULTRIX are trademarks of the Digital Equipment Corporation.

  18. ORNL Cray X1 evaluation status report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Agarwal, P.K.; Alexander, R.A.; Apra, E.

    2004-05-01

On August 15, 2002 the Department of Energy (DOE) selected the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) to deploy a new scalable vector supercomputer architecture for solving important scientific problems in climate, fusion, biology, nanoscale materials and astrophysics. ''This program is one of the first steps in an initiative designed to provide U.S. scientists with the computational power that is essential to 21st century scientific leadership,'' said Dr. Raymond L. Orbach, director of the department's Office of Science. In FY03, CCS procured a 256-processor Cray X1 to evaluate the processors, memory subsystem, scalability of the architecture, and software environment, and to predict the expected sustained performance on key DOE applications codes. The results of the micro-benchmarks and kernel benchmarks show the architecture of the Cray X1 to be exceptionally fast for most operations. The best results are shown on large problems, where it is not possible to fit the entire problem into the cache of the processors. These large problems are exactly the types of problems that are important for the DOE and ultra-scale simulation. Application performance is found to be markedly improved by this architecture: - Large-scale simulations of high-temperature superconductors run 25 times faster than on an IBM Power4 cluster using the same number of processors. - Best performance of the parallel ocean program (POP v1.4.3) is 50 percent higher than on Japan's Earth Simulator and 5 times higher than on an IBM Power4 cluster. - A fusion application, global GYRO transport, was found to be 16 times faster on the X1 than on an IBM Power3. The increased performance allowed simulations to fully resolve questions raised by a prior study. - The transport kernel in the AGILE-BOLTZTRAN astrophysics code runs 15 times faster than on an IBM Power4 cluster using the same number of processors. 
- Molecular dynamics simulations related to the phenomenon of photon echo run 8 times faster than previously achieved. Even at 256 processors, the Cray X1 system is already outperforming other supercomputers with thousands of processors for a certain class of applications such as climate modeling and some fusion applications. This evaluation is the outcome of a number of meetings with both high-performance computing (HPC) system vendors and application experts over the past 9 months and has received broad-based support from the scientific community and other agencies.

  19. Space Radar Image of Mammoth Mountain, California

    NASA Image and Video Library

    1999-05-01

This false-color composite radar image of the Mammoth Mountain area in the Sierra Nevada Mountains, California, was acquired by the Spaceborne Imaging Radar-C and X-band Synthetic Aperture Radar aboard the space shuttle Endeavour on its 67th orbit on October 3, 1994. The image is centered at 37.6 degrees north latitude and 119.0 degrees west longitude. The area is about 39 kilometers by 51 kilometers (24 miles by 31 miles). North is toward the bottom, about 45 degrees to the right. In this image, red was created using L-band (horizontally transmitted/vertically received) polarization data; green was created using C-band (horizontally transmitted/vertically received) polarization data; and blue was created using C-band (horizontally transmitted and received) polarization data. Crowley Lake appears dark at the center left of the image, just above or south of Long Valley. The Mammoth Mountain ski area is visible at the top right of the scene. The red areas correspond to forests, the dark blue areas are bare surfaces and the green areas are short vegetation, mainly brush. The purple areas at the higher elevations in the upper part of the scene are discontinuous patches of snow cover from a September 28 storm. New, very thin snow was falling before and during the second space shuttle pass. In parallel with the operational SIR-C data processing, an experimental effort is being conducted to test SAR data processing using the Jet Propulsion Laboratory's massively parallel supercomputing facility, centered around the Cray Research T3D. These experiments will assess the abilities of large supercomputers to produce high throughput Synthetic Aperture Radar processing in preparation for upcoming data-intensive SAR missions. The image released here was produced as part of this experimental effort. http://photojournal.jpl.nasa.gov/catalog/PIA01746

  20. Application of a distributed network in computational fluid dynamic simulations

    NASA Technical Reports Server (NTRS)

    Deshpande, Manish; Feng, Jinzhang; Merkle, Charles L.; Deshpande, Ashish

    1994-01-01

A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using Parallel Virtual Machine (PVM) and compared with its performance on a CRAY Y-MP and on an Intel iPSC/860. The problem is relatively computationally intensive, and has a communication structure based primarily on nearest-neighbor communication, making it ideally suited to message passing. Such problems are frequently encountered in computational fluid dynamics (CFD), and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in message passing in order to produce a near-optimal solution. Results are presented for various grid sizes using up to eight processors.
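The nearest-neighbor structure mentioned above can be mocked up without PVM: in a 1-D decomposition each worker holds an interior strip plus one ghost cell per side, and each 'message' copies a neighbor's boundary value into a ghost cell. A hedged single-process Python sketch (direct copies stand in for the sends and receives; the reflective treatment of the global ends is my assumption, not the paper's boundary condition):

```python
import numpy as np

def halo_exchange(parts):
    """Fill each subdomain's ghost cells from its neighbors' boundary values.
    In the PVM code these copies would be matched pairs of sends and receives."""
    left_bnd = [u[1] for u in parts]         # snapshot boundaries before writing
    right_bnd = [u[-2] for u in parts]
    for i, u in enumerate(parts):
        u[0] = right_bnd[i - 1] if i > 0 else u[1]                # left ghost
        u[-1] = left_bnd[i + 1] if i < len(parts) - 1 else u[-2]  # right ghost

def jacobi_sweep(parts):
    """One relaxation sweep over the interior points of every subdomain."""
    halo_exchange(parts)
    for u in parts:
        u[1:-1] = 0.5 * (u[:-2] + u[2:])
```

Because each subdomain only ever reads its two neighbors' boundary values, message volume stays constant per step while interior work grows with the strip size — the regularity the abstract says the implementation exploits.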

  1. EFFECTS OF TUMORS ON INHALED PHARMACOLOGIC DRUGS: II. PARTICLE MOTION

    EPA Science Inventory

    ABSTRACT

    Computer simulations were conducted to describe drug particle motion in human lung bifurcations with tumors. The computations used FIDAP with a Cray T90 supercomputer. The objective was to better understand particle behavior as affected by particle characteristics...

  2. A performance comparison of the Cray-2 and the Cray X-MP

    NASA Technical Reports Server (NTRS)

    Schmickley, Ronald; Bailey, David H.

    1986-01-01

A suite of thirteen large Fortran benchmark codes was run on Cray-2 and Cray X-MP supercomputers. These codes were a mix of compute-intensive scientific application programs (mostly Computational Fluid Dynamics) and some special vectorized computation exercise programs. For the general class of programs tested on the Cray-2, most of which were not specially tuned for speed, the floating point operation rates varied under a variety of system load configurations from 40 percent up to 125 percent of X-MP performance rates. It is concluded that the Cray-2, in the original system configuration studied (without memory pseudo-banking), will run untuned Fortran code, on average, at about 70 percent of X-MP speeds.

  3. Understanding Aprun Use Patterns

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, Hwa-Chun Wendy

    2009-05-06

On the Cray XT, aprun is the command to launch an application to a set of compute nodes reserved through the Application Level Placement Scheduler (ALPS). At the National Energy Research Scientific Computing Center (NERSC), interactive aprun is disabled. That is, invocations of aprun have to go through the batch system. Batch scripts can and often do contain several apruns, which either use subsets of the reserved nodes in parallel or use all reserved nodes in consecutive apruns. In order to better understand how NERSC users run on the XT, it is necessary to associate aprun information with jobs. It is surprisingly more challenging than it sounds. In this paper, we describe those challenges and how we solved them to produce daily per-job reports for completed apruns. We also describe additional uses of the data, e.g. adjusting charging policy accordingly or associating node failures with jobs/users, and plans for enhancements.

  4. Comparison of Implicit Collocation Methods for the Heat Equation

    NASA Technical Reports Server (NTRS)

    Kouatchou, Jules; Jezequel, Fabienne; Zukor, Dorothy (Technical Monitor)

    2001-01-01

We combine a high-order compact finite difference scheme to approximate spatial derivatives and collocation techniques for the time component to numerically solve the two-dimensional heat equation. We use two approaches to implement the collocation methods. The first one is based on an explicit computation of the coefficients of polynomials and the second one relies on differential quadrature. We compare them by studying their merits and analyzing their numerical performance. All our computations, based on parallel algorithms, are carried out on the CRAY SV1.
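High-order compact schemes of the kind referred to above couple the unknown derivative values implicitly along a grid line; a classical example is the fourth-order compact formula for the second derivative, f''_{i-1} + 10 f''_i + f''_{i+1} = 12 (u_{i-1} - 2 u_i + u_{i+1}) / h². A sketch under those assumptions — periodic grid, dense matrix for brevity, and not the paper's actual scheme or code:

```python
import numpy as np

def compact_uxx(u, h):
    """Fourth-order compact second derivative on a periodic grid:
    f''_{i-1} + 10 f''_i + f''_{i+1} = 12 (u_{i-1} - 2 u_i + u_{i+1}) / h**2."""
    n = len(u)
    A = 10.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    A[0, -1] = A[-1, 0] = 1.0                        # periodic wrap-around entries
    rhs = 12.0 * (np.roll(u, 1) - 2.0 * u + np.roll(u, -1)) / h ** 2
    return np.linalg.solve(A, rhs)                   # tridiagonal solve in practice
```

The same three-point stencil as the standard second-order formula thus buys fourth-order accuracy, at the cost of one (tridiagonal) solve per derivative evaluation.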

  5. Parallel FEM Simulation of Electromechanics in the Heart

    NASA Astrophysics Data System (ADS)

    Xia, Henian; Wong, Kwai; Zhao, Xiaopeng

    2011-11-01

    Cardiovascular disease is the leading cause of death in America. Computer simulation of the complicated dynamics of the heart could provide valuable quantitative guidance for diagnosis and treatment of heart problems. In this paper, we present an integrated numerical model which encompasses the interaction of cardiac electrophysiology, electromechanics, and mechanoelectrical feedback. The model is solved by the finite element method on a Linux cluster and on Kraken, a Cray XT5 supercomputer. Dynamical influences between electromechanical coupling and mechanoelectrical feedback are shown.

  6. Fast Fourier Transform algorithm design and tradeoffs

    NASA Technical Reports Server (NTRS)

    Kamin, Ray A., III; Adams, George B., III

    1988-01-01

    The Fast Fourier Transform (FFT) is a mainstay of certain numerical techniques for solving fluid dynamics problems. The Connection Machine CM-2 is the target for an investigation into the design of multidimensional Single Instruction Stream/Multiple Data (SIMD) parallel FFT algorithms for high performance. Critical algorithm design issues are discussed, necessary machine performance measurements are identified and made, and the performance of the developed FFT programs is measured. These FFT programs are compared to the currently best Cray-2 FFT program.
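    The CM-2 designs in this record are machine-specific, but the radix-2 decimation-in-time recursion beneath them shows why the FFT maps well onto SIMD hardware: the even- and odd-indexed half-transforms are independent, and the butterfly combine applies the same operation to every element. A serial sketch (our illustration, not the CM-2 or Cray-2 code):

    ```python
    import numpy as np

    def fft_radix2(x):
        """Recursive radix-2 decimation-in-time FFT; len(x) must be a power
        of two. Split into independent even/odd half-transforms, then combine
        with a data-parallel butterfly -- the structure SIMD machines exploit."""
        x = np.asarray(x, dtype=complex)
        n = len(x)
        if n == 1:
            return x
        even = fft_radix2(x[0::2])
        odd = fft_radix2(x[1::2])
        twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
        return np.concatenate([even + twiddle * odd, even - twiddle * odd])
    ```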

  7. ARC2D - EFFICIENT SOLUTION METHODS FOR THE NAVIER-STOKES EQUATIONS (DEC RISC ULTRIX VERSION)

    NASA Technical Reports Server (NTRS)

    Biyabani, S. R.

    1994-01-01

    ARC2D is a computational fluid dynamics program developed at the NASA Ames Research Center specifically for airfoil computations. The program uses implicit finite-difference techniques to solve two-dimensional Euler equations and thin layer Navier-Stokes equations. It is based on the Beam and Warming implicit approximate factorization algorithm in generalized coordinates. The methods are either time accurate or accelerated non-time accurate steady state schemes. The evolution of the solution through time is physically realistic; good solution accuracy is dependent on mesh spacing and boundary conditions. The mathematical development of ARC2D begins with the strong conservation law form of the two-dimensional Navier-Stokes equations in Cartesian coordinates, which admits shock capturing. The Navier-Stokes equations can be transformed from Cartesian coordinates to generalized curvilinear coordinates in a manner that permits one computational code to serve a wide variety of physical geometries and grid systems. ARC2D includes an algebraic mixing length model to approximate the effect of turbulence. In cases of high Reynolds number viscous flows, thin layer approximation can be applied. ARC2D allows for a variety of solutions to stability boundaries, such as those encountered in flows with shocks. The user has considerable flexibility in assigning geometry and developing grid patterns, as well as in assigning boundary conditions. However, the ARC2D model is most appropriate for attached and mildly separated boundary layers; no attempt is made to model wake regions and widely separated flows. The techniques have been successfully used for a variety of inviscid and viscous flowfield calculations. The Cray version of ARC2D is written in FORTRAN 77 for use on Cray series computers and requires approximately 5Mb memory. The program is fully vectorized. The tape includes variations for the COS and UNICOS operating systems. 
Also included is a sample routine for CONVEX computers to emulate Cray system time calls, which should be easy to modify for other machines as well. The standard distribution media for this version is a 9-track 1600 BPI ASCII Card Image format magnetic tape. The Cray version was developed in 1987. The IBM ES/3090 version is an IBM port of the Cray version. It is written in IBM VS FORTRAN and has the capability of executing in both vector and parallel modes on the MVS/XA operating system and in vector mode on the VM/XA operating system. Various options of the IBM VS FORTRAN compiler provide new features for the ES/3090 version, including 64-bit arithmetic and up to 2 GB of virtual addressability. The IBM ES/3090 version is available only as a 9-track, 1600 BPI IBM IEBCOPY format magnetic tape. The IBM ES/3090 version was developed in 1989. The DEC RISC ULTRIX version is a DEC port of the Cray version. It is written in FORTRAN 77 for RISC-based Digital Equipment platforms. The memory requirement is approximately 7Mb of main memory. It is available in UNIX tar format on TK50 tape cartridge. The port to DEC RISC ULTRIX was done in 1990. COS and UNICOS are trademarks and Cray is a registered trademark of Cray Research, Inc. IBM, ES/3090, VS FORTRAN, MVS/XA, and VM/XA are registered trademarks of International Business Machines. DEC and ULTRIX are registered trademarks of Digital Equipment Corporation.

  8. ATLAS and LHC computing on CRAY

    NASA Astrophysics Data System (ADS)

    Sciacca, F. G.; Haug, S.; ATLAS Collaboration

    2017-10-01

    Access to and exploitation of large-scale computing resources, such as those offered by general-purpose HPC centres, is one important measure for ATLAS and the other Large Hadron Collider experiments in order to meet the challenge posed by the full exploitation of the future data within the constraints of flat budgets. We report on the effort of moving the Swiss WLCG T2 computing, serving ATLAS, CMS and LHCb, from a dedicated cluster to the large Cray systems at the Swiss National Supercomputing Centre CSCS. These systems not only offer very efficient hardware, cooling and highly competent operators, but also provide large backfill potential, owing to their size and multidisciplinary usage, and potential gains from economies of scale. Technical solutions, performance, expected return and future plans are discussed.

  9. NAS technical summaries: Numerical aerodynamic simulation program, March 1991 - February 1992

    NASA Technical Reports Server (NTRS)

    1992-01-01

    NASA created the Numerical Aerodynamic Simulation (NAS) Program in 1987 to focus resources on solving critical problems in aeroscience and related disciplines by utilizing the power of the most advanced supercomputers available. The NAS Program provides scientists with the necessary computing power to solve today's most demanding computational fluid dynamics problems and serves as a pathfinder in integrating leading-edge supercomputing technologies, thus benefiting other supercomputer centers in Government and industry. This report contains selected scientific results from the 1991-92 NAS Operational Year, March 4, 1991 to March 3, 1992, the fifth year of operation. During this year, the scientific community was given access to a Cray-2 and a Cray Y-MP. The Cray-2, the first-generation supercomputer, has four processors, 256 megawords of central memory, and a total sustained speed of 250 million floating point operations per second. The Cray Y-MP, the second-generation supercomputer, has eight processors and a total sustained speed of one billion floating point operations per second. Additional memory was installed this year, doubling capacity from 128 to 256 megawords of solid-state storage-device memory. Because of its higher performance, the Cray Y-MP delivered approximately 77 percent of the total number of supercomputer hours used during this year.

  10. Development of a CRAY 1 version of the SINDA program. [thermo-structural analyzer program

    NASA Technical Reports Server (NTRS)

    Juba, S. M.; Fogerson, P. E.

    1982-01-01

    The SINDA thermal analyzer program was transferred from the UNIVAC 1110 computer to a CYBER and then to a CRAY 1. Significant changes to the code of the program were required in order to execute efficiently on the CYBER and CRAY. The program was tested on the CRAY using a thermal math model of the shuttle which was too large to run on either the UNIVAC or the CYBER. An effort was then begun to further modify the code of SINDA in order to make effective use of the vector capabilities of the CRAY.

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Otto, C.; Thomas, G.A.; Peticolas, W.L.; Rippe, K.

    Raman spectra of the parallel-stranded duplex formed from the deoxyoligonucleotides 5′-d((A)₁₀TAATTTTAAATATTT)-3′ (D1) and 5′-d((T)₁₀ATTAAAATTTATAAA)-3′ (D2) in H₂O and D₂O have been acquired. The spectra of the parallel-stranded DNA are then compared to the spectra of the antiparallel double helix formed from the deoxyoligonucleotides D1 and 5′-d(AAATATTTAAAATTA(T)₁₀)-3′ (D3). The Raman spectra of the antiparallel-stranded (aps) duplex are reminiscent of the spectra of poly(d(A))·poly(d(T)), and a B-form structure similar to that adopted by the homopolymer duplex is assigned to the antiparallel double helix. The spectra of the parallel-stranded (ps) and antiparallel-stranded duplexes differ significantly due to changes in helical organization, i.e., base pairing, base stacking, and backbone conformation. Large changes observed in the carbonyl stretching region implicate the involvement of the C(2) carbonyl of thymine in base pairing. The interaction of adenine with the C(2) carbonyl of thymine is consistent with formation of reverse Watson-Crick base pairing in parallel-stranded DNA. Phosphate-furanose vibrations similar to those observed for B-form DNA of heterogeneous sequence and high A,T content are observed at 843 and 1,092 cm⁻¹ in the spectra of the parallel-stranded duplex.

  12. A Strassen-Newton algorithm for high-speed parallelizable matrix inversion

    NASA Technical Reports Server (NTRS)

    Bailey, David H.; Ferguson, Helaman R. P.

    1988-01-01

    Techniques are described for computing matrix inverses by algorithms that are highly suited to massively parallel computation. The techniques are based on an algorithm suggested by Strassen (1969). Variations of this scheme use matrix Newton iterations and other methods to improve the numerical stability while at the same time preserving a very high level of parallelism. One-processor Cray-2 implementations of these schemes range from one that is up to 55 percent faster than a conventional library routine to one that is slower than a library routine but achieves excellent numerical stability. The problem of computing the solution to a single set of linear equations is discussed, and it is shown that this problem can also be solved efficiently using these techniques.
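    The matrix Newton iteration referred to above is standard: starting from X0 = A^T / (||A||_1 ||A||_inf), iterate X <- X (2I - A X), which converges quadratically to A^{-1}; each step is two matrix products, which is exactly where a Strassen-type fast multiply can be substituted. A minimal NumPy sketch (the initialization and fixed iteration count are conventional textbook choices, not details taken from the paper):

    ```python
    import numpy as np

    def newton_inverse(A, iters=50):
        """Approximate A^{-1} by the Newton iteration X <- X (2I - A X).
        The residual R = I - A X satisfies R <- R @ R, so convergence is
        quadratic once ||R|| < 1, which the starting guess
        X0 = A.T / (||A||_1 * ||A||_inf) guarantees for nonsingular A."""
        X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
        two_I = 2.0 * np.eye(A.shape[0])
        for _ in range(iters):
            X = X @ (two_I - A @ X)   # two matrix multiplies per step
        return X
    ```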

  13. Scaling Properties of Algorithms in Nanotechnology

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Bailey, David H.; Chancellor, Marisa K. (Technical Monitor)

    1996-01-01

    At the present time, several technologies are pressing the limits of microminiature manufacturing. In semiconductor technology, for example, the Intel Pentium Pro (which is used in the Department of Energy's ASCI 'red' parallel supercomputer system) and the DEC Alpha 21164 (which is used in the CRAY T3E) both are fabricated using 0.35 micron process technology. Recently Texas Instruments (TI) announced the availability of 0.25 micron technology chips by the end of 1996 and plans to have 0.18 micron devices in production within two years. However, some significant challenges lie down the road. These include the skyrocketing cost of manufacturing plants, the 0.1 micron foreseeable limit of the photolithography process, quantum effects, data communication bandwidth limitations, heat dissipation, and others. Some related microminiature technologies include micro-electromechanical systems (MEMS), opto-electronic devices, quantum computing, biological computing, and others. All of these technologies require the fabrication of devices whose sizes are approaching the nanometer level. As such they are often collectively referred to with the name 'nanotechnology'. Clearly nanotechnology in this general sense is destined to be a very important technology of the 21st century. The ultimate dream in this arena is 'molecular nanotechnology', in other words the fabrication of devices and materials with most or all atoms and molecules in a pre-programmed position, possibly placed there by 'nano-robots'. This futuristic capability will probably not be achieved for at least two decades. However, it appears that somewhat less ambitious variations of molecular nanotechnology, such as devices and materials based on 'buckyballs' and 'nanotubes' may be realized significantly sooner, possibly within ten years or so. Even at the present time, semiconductor devices are approaching the regime where quantum chemical effects must be considered in design.

  14. The design and implementation of a parallel unstructured Euler solver using software primitives

    NASA Technical Reports Server (NTRS)

    Das, R.; Mavriplis, D. J.; Saltz, J.; Gupta, S.; Ponnusamy, R.

    1992-01-01

    This paper is concerned with the implementation of a three-dimensional unstructured grid Euler solver on massively parallel distributed-memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel communication rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time compilation support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and Intel Touchstone Delta machine. The quantitative effects of the various optimizations are demonstrated, and we show that their combined effect leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the CRAY Y-MP vector supercomputer.

  15. NASA Langley Research Center's distributed mass storage system

    NASA Technical Reports Server (NTRS)

    Pao, Juliet Z.; Humes, D. Creig

    1993-01-01

    There is a trend in institutions with high performance computing and data management requirements to explore mass storage systems with peripherals directly attached to a high speed network. The Distributed Mass Storage System (DMSS) Project at NASA LaRC is building such a system and expects to put it into production use by the end of 1993. This paper presents the design of the DMSS, some experiences in its development and use, and a performance analysis of its capabilities. The special features of this system are: (1) workstation class file servers running UniTree software; (2) third party I/O; (3) HIPPI network; (4) HIPPI/IPI3 disk array systems; (5) Storage Technology Corporation (STK) ACS 4400 automatic cartridge system; (6) CRAY Research Incorporated (CRI) CRAY Y-MP and CRAY-2 clients; (7) file server redundancy provision; and (8) a transition mechanism from the existent mass storage system to the DMSS.

  16. A leap forward with UTK's Cray XC30

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fahey, Mark R

    2014-01-01

    This paper shows a significant productivity leap for several science groups and the accomplishments they have made to date on Darter, a Cray XC30 at the University of Tennessee, Knoxville. The increased productivity is due to faster processors and a faster interconnect combined in a new generation from Cray, which nevertheless retains a programming environment very similar to that of previous generations of Cray machines, making porting easy.

  17. Revisiting Parallel Cyclic Reduction and Parallel Prefix-Based Algorithms for Block Tridiagonal System of Equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seal, Sudip K; Perumalla, Kalyan S; Hirshman, Steven Paul

    2013-01-01

    Simulations that require solutions of block tridiagonal systems of equations rely on fast parallel solvers for runtime efficiency. Leading parallel solvers that are highly effective for general systems of equations, dense or sparse, are limited in scalability when applied to block tridiagonal systems. This paper presents scalability results as well as detailed analyses of two parallel solvers that exploit the special structure of block tridiagonal matrices to deliver superior performance, often by orders of magnitude. A rigorous analysis of their relative parallel runtimes is shown to reveal the existence of a critical block size that separates the parameter space, spanned by the number of block rows, the block size, and the processor count, into distinct regions that favor one or the other of the two solvers. Dependence of this critical block size on the above parameters as well as on machine-specific constants is established. These formal insights are supported by empirical results on up to 2,048 cores of a Cray XT4 system. To the best of our knowledge, this is the highest reported scalability for parallel block tridiagonal solvers to date.
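    For context, the sequential baseline such parallel solvers reorganize is the block-Thomas recurrence: eliminate the sub-diagonal blocks forward, then back-substitute. A minimal sketch (our illustration of the serial algorithm only; the paper's solvers are parallel cyclic reduction and a parallel prefix formulation):

    ```python
    import numpy as np

    def block_thomas(lower, diag, upper, rhs):
        """Serial block-Thomas solve of a block tridiagonal system.
        lower[i], diag[i], upper[i] are the b x b blocks of block-row i
        (lower[0] and upper[-1] are unused); rhs[i] is the length-b
        right-hand side. Cost is O(N b^3) with a strictly sequential
        dependency chain, which is what parallel reformulations break up."""
        n = len(diag)
        d, r = [None] * n, [None] * n
        d[0], r[0] = diag[0].copy(), rhs[0].copy()
        for i in range(1, n):                       # forward elimination
            m = lower[i] @ np.linalg.inv(d[i - 1])
            d[i] = diag[i] - m @ upper[i - 1]
            r[i] = rhs[i] - m @ r[i - 1]
        x = [None] * n
        x[-1] = np.linalg.solve(d[-1], r[-1])
        for i in range(n - 2, -1, -1):              # back substitution
            x[i] = np.linalg.solve(d[i], r[i] - upper[i] @ x[i + 1])
        return np.array(x)
    ```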

  18. Porting the AVS/Express scientific visualization software to Cray XT4.

    PubMed

    Leaver, George W; Turner, Martin J; Perrin, James S; Mummery, Paul M; Withers, Philip J

    2011-08-28

    Remote scientific visualization, where rendering services are provided by larger scale systems than are available on the desktop, is becoming increasingly important as dataset sizes increase beyond the capabilities of desktop workstations. Uptake of such services relies on access to suitable visualization applications and the ability to view the resulting visualization in a convenient form. We consider five rules from the e-Science community to meet these goals with the porting of a commercial visualization package to a large-scale system. The application uses message-passing interface (MPI) to distribute data among data processing and rendering processes. The use of MPI in such an interactive application is not compatible with restrictions imposed by the Cray system being considered. We present details, and performance analysis, of a new MPI proxy method that allows the application to run within the Cray environment yet still support MPI communication required by the application. Example use cases from materials science are considered.

  19. MAGNA (Materially and Geometrically Nonlinear Analysis). Part I. Finite Element Analysis Manual.

    DTIC Science & Technology

    1982-12-01

    ...provided for operating the program, modifying storage capacity, preparing input data, estimating computer run times, and interpreting the output. Excerpted table of contents: 7.1.3 Reserved File Names; 7.1.4 Typical Execution Times on CDC Computers; 7.2 CRAY PROGRAM VERSION (7.2.1 Job Control Language; 7.2.2 Modification of Storage Capacity; 7.2.3 Execution Times on the CRAY-1 Computer); 7.3 VAX PROGRAM VERSION; 8 INPUT DATA; 8.1 ...

  20. Designing Next Generation Massively Multithreaded Architectures for Irregular Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tumeo, Antonino; Secchi, Simone; Villa, Oreste

    Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node counts, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and a next-generation network interconnect. We advocate the use of hardware support for remote memory reference aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next-generation manycore processors.

  1. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA Ames Research Center's NAS Division have demonstrated that the new generation of NUMA-based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector-oriented CFD production codes at sustained rates far exceeding the processing rates possible on dedicated 16-CPU Cray C90 systems. This high level of performance is achieved via shared-memory-based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine- and coarse-grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.

  2. Strategies for vectorizing the sparse matrix vector product on the CRAY XMP, CRAY 2, and CYBER 205

    NASA Technical Reports Server (NTRS)

    Bauschlicher, Charles W., Jr.; Partridge, Harry

    1987-01-01

    Large, randomly sparse matrix-vector products are important in a number of applications in computational chemistry, such as matrix diagonalization and the solution of simultaneous equations. Vectorization of this process is considered for the CRAY X-MP, CRAY 2, and CYBER 205, using a matrix of dimension 20,000 with 1 to 6 percent nonzeros. Efficient scatter/gather capabilities add coding flexibility and yield significant improvements in performance. For the CYBER 205, it is shown that minor changes in the I/O can reduce the CPU time by a factor of 50. Similar changes in the CRAY codes make a far smaller improvement.
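    The gather operation at the heart of this vectorization is visible in a compressed sparse row (CSR) matrix-vector product: the indexed load x[col_idx[a:b]] pulls the scattered entries of x into a contiguous vector so the multiply-add can proceed at vector speed. A minimal sketch with a hypothetical 3x3 toy matrix (illustrative only; the paper's matrices are of dimension 20,000):

    ```python
    import numpy as np

    def spmv_gather(values, col_idx, row_ptr, x):
        """y = A @ x with A in CSR form. x[col_idx[a:b]] is the 'gather'
        that scatter/gather hardware accelerates: it packs the needed,
        scattered entries of x into a dense vector before the dot product."""
        y = np.empty(len(row_ptr) - 1)
        for i in range(len(y)):
            a, b = row_ptr[i], row_ptr[i + 1]
            y[i] = np.dot(values[a:b], x[col_idx[a:b]])
        return y

    # CSR storage of the toy matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]].
    values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    col_idx = np.array([0, 2, 1, 0, 2])
    row_ptr = np.array([0, 2, 3, 5])
    y = spmv_gather(values, col_idx, row_ptr, np.array([1.0, 2.0, 3.0]))
    # y == [7.0, 6.0, 19.0]
    ```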

  3. A Massively Parallel Tensor Contraction Framework for Coupled-Cluster Computations

    DTIC Science & Technology

    2014-08-02

    The CCSD model [41], where T = T1 + T2 (i.e., n = 2 in Equation 2), is one of the most widely used coupled-cluster methods as it provides a good... derived from response theory. Extending this to CCSDT [30, 35], where T = T1 + T2 + T3 (n = 3), gives an even more accurate method (often capable of... CCSD and CCSDT have leading-order costs of O(n_o^2 n_v^4) and O(n_o^3 n_v^5), where n_o and n_v are the number of occupied and virtual orbitals, respectively.

  4. Development of iterative techniques for the solution of unsteady compressible viscous flows

    NASA Technical Reports Server (NTRS)

    Hixon, Duane; Sankar, L. N.

    1993-01-01

    During the past two decades, there has been significant progress in the field of numerical simulation of unsteady compressible viscous flows. At present, a variety of solution techniques exist, such as transonic small disturbance (TSD) analyses, transonic full potential equation-based methods, unsteady Euler solvers, and unsteady Navier-Stokes solvers. These advances have been made possible by developments in three areas: (1) improved numerical algorithms; (2) automation of body-fitted grid generation schemes; and (3) advanced computer architectures with vector processing and massively parallel processing features. In this work, the GMRES scheme has been considered as a candidate for acceleration of a Newton iteration time marching scheme for unsteady 2-D and 3-D compressible viscous flow calculations; preliminary calculations indicate that this will provide up to a 65 percent reduction in computer time requirements over the existing class of explicit and implicit time marching schemes. The proposed method has been tested on structured grids, but is flexible enough for extension to unstructured grids. The described scheme has been tested only on the current generation of vector processor architectures of the Cray Y-MP class, but should be suitable for adaptation to massively parallel machines.

  5. Phase space simulation of collisionless stellar systems on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    White, Richard L.

    1987-01-01

    A numerical technique for solving the collisionless Boltzmann equation describing the time evolution of a self-gravitating fluid in phase space was implemented on the Massively Parallel Processor (MPP). The code performs calculations for a two-dimensional phase space grid (with one space and one velocity dimension). Some results from these calculations are presented. The execution speed of the code is comparable to the speed of a single processor of a Cray X-MP. Advantages and disadvantages of the MPP architecture for this type of problem are discussed. The nearest-neighbor connectivity of the MPP array does not pose a significant obstacle. Future MPP-like machines should have much more local memory and easier access to staging memory and disks in order to be effective for this type of problem.

  6. Swarm Verification

    NASA Technical Reports Server (NTRS)

    Holzmann, Gerard J.; Joshi, Rajeev; Groce, Alex

    2008-01-01

    Reportedly, supercomputer designer Seymour Cray once said that he would sooner use two strong oxen to plow a field than a thousand chickens. Although this is undoubtedly wise when it comes to plowing a field, it is not so clear for other types of tasks. Model checking problems are of the proverbial "search the needle in a haystack" type. Such problems can often be parallelized easily. Alas, none of the usual divide and conquer methods can be used to parallelize the working of a model checker. Given that it has become easier than ever to gain access to large numbers of computers to perform even routine tasks it is becoming more and more attractive to find alternate ways to use these resources to speed up model checking tasks. This paper describes one such method, called swarm verification.

  7. Solving the Cauchy-Riemann equations on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations, a set of coupled first-order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; the FLEX/32, an MIMD machine with 20 processors; and the CRAY-2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures, and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.

  8. INS3D - NUMERICAL SOLUTION OF THE INCOMPRESSIBLE NAVIER-STOKES EQUATIONS IN THREE-DIMENSIONAL GENERALIZED CURVILINEAR COORDINATES (DEC RISC ULTRIX VERSION)

    NASA Technical Reports Server (NTRS)

    Biyabani, S. R.

    1994-01-01

    INS3D computes steady-state solutions to the incompressible Navier-Stokes equations. The INS3D approach utilizes pseudo-compressibility combined with an approximate factorization scheme. This computational fluid dynamics (CFD) code has been verified on problems such as flow through a channel, flow over a backward-facing step, and flow over a circular cylinder. Three-dimensional cases include flow over an ogive cylinder, flow through a rectangular duct, wind tunnel inlet flow, cylinder-wall juncture flow, and flow through multiple posts mounted between two plates. INS3D uses a pseudo-compressibility approach in which a time derivative of pressure is added to the continuity equation, which together with the momentum equations forms a set of four equations with pressure and velocity as the dependent variables. The equations' coordinates are transformed for general three-dimensional applications. The equations are advanced in time by the implicit, non-iterative, approximately-factored, finite-difference scheme of Beam and Warming. The numerical stability of the scheme depends on the use of higher-order smoothing terms to damp out higher-frequency oscillations caused by second-order central differencing. The artificial compressibility introduces pressure (sound) waves of finite speed (whereas the speed of sound would be infinite in an incompressible fluid). As the solution converges, these pressure waves die out, causing the derivative of pressure with respect to time to approach zero. Thus, continuity is satisfied for the incompressible fluid in the steady state. Computational efficiency is achieved using a diagonal algorithm. A block tri-diagonal option is also available. When a steady-state solution is reached, the modified continuity equation will satisfy the divergence-free velocity field condition. 
INS3D is capable of handling several different types of boundaries encountered in numerical simulations, including solid-surface, inflow and outflow, and far-field boundaries. Three machine versions of INS3D are available. INS3D for the CRAY is written in CRAY FORTRAN for execution on a CRAY X-MP under COS, INS3D for the IBM is written in FORTRAN 77 for execution on an IBM 3090 under the VM or MVS operating system, and INS3D for DEC RISC-based systems is written in RISC FORTRAN for execution on a DEC workstation running RISC ULTRIX 3.1 or later. The CRAY version has a central memory requirement of 730279 words. The central memory requirement for the IBM is 150Mb. The memory requirement for the DEC RISC ULTRIX version is 3Mb of main memory. INS3D was developed in 1987. The port to the IBM was done in 1990. The port to the DECstation 3100 was done in 1991. CRAY is a registered trademark of Cray Research Inc. IBM is a registered trademark of International Business Machines. DEC, DECstation, and ULTRIX are trademarks of the Digital Equipment Corporation.

  9. GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hargrove, Paul H.; Bonachea, Dan

    This document is a deliverable for milestone STPM17-6 of the Exascale Computing Project, delivered by WBS 2.3.1.14. It reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as “specializations”, primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are available in the GASNet-EX 2018.3.0 release.

  10. A highly scalable particle tracking algorithm using partitioned global address space (PGAS) programming for extreme-scale turbulence simulations

    NASA Astrophysics Data System (ADS)

    Buaria, D.; Yeung, P. K.

    2017-12-01

    A new parallel algorithm utilizing a partitioned global address space (PGAS) programming model to achieve high scalability is reported for particle tracking in direct numerical simulations of turbulent fluid flow. The work is motivated by the desire to obtain Lagrangian information necessary for the study of turbulent dispersion at the largest problem sizes feasible on current and next-generation multi-petaflop supercomputers. A large population of fluid particles is distributed among parallel processes dynamically, based on instantaneous particle positions, such that all of the interpolation information needed for each particle is available either locally on its host process or on neighboring processes holding adjacent sub-domains of the velocity field. With cubic splines as the preferred interpolation method, the new algorithm is designed to minimize the need for communication by transferring between adjacent processes only those spline coefficients determined to be necessary for specific particles. This transfer is implemented very efficiently as a one-sided communication, using Co-Array Fortran (CAF) features which facilitate small data movements between different local partitions of a large global array. The cost of monitoring the transfer of particle properties between adjacent processes, for particles migrating across sub-domain boundaries, is found to be small. Detailed benchmarks are obtained on the Cray petascale supercomputer Blue Waters at the University of Illinois, Urbana-Champaign. For operations on the particles in an 8192³ simulation (0.55 trillion grid points) on 262,144 Cray XE6 cores, the new algorithm is found to be orders of magnitude faster than a prior algorithm in which each particle is tracked by the same parallel process at all times. This large speedup reduces the additional cost of tracking on the order of 300 million particles to just over 50% of the cost of computing the Eulerian velocity field at this scale.
Improving support for PGAS models in major compilers suggests that this algorithm will be of wider applicability on most upcoming supercomputers.
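
    The dynamic particle distribution described above, where each particle is owned by the process whose sub-domain contains its current position, can be sketched in a few lines. The domain size, process grid, rank ordering, and the function name `owner_rank` below are illustrative assumptions, not the paper's code.

```python
# Sketch: map each tracked particle to the rank owning the enclosing
# sub-domain of a Cartesian decomposition (illustrative assumptions:
# unit cube domain, 4x4x4 process grid, row-major rank ordering).

def owner_rank(pos, domain=1.0, procs=(4, 4, 4)):
    """Map a position (x, y, z) in [0, domain)^3 to the rank of the
    enclosing sub-domain."""
    ix = min(int(pos[0] / domain * procs[0]), procs[0] - 1)
    iy = min(int(pos[1] / domain * procs[1]), procs[1] - 1)
    iz = min(int(pos[2] / domain * procs[2]), procs[2] - 1)
    return (ix * procs[1] + iy) * procs[2] + iz

# Re-binning particles after a time step is then a single pass:
particles = [(0.1, 0.2, 0.9), (0.99, 0.5, 0.0)]
destinations = [owner_rank(p) for p in particles]
```

    In the actual algorithm this ownership map would drive the one-sided CAF transfers of spline coefficients and migrating particles between adjacent sub-domains.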

  11. Parallel Simulation of Unsteady Turbulent Flames

    NASA Technical Reports Server (NTRS)

    Menon, Suresh

    1996-01-01

    Time-accurate simulation of turbulent flames in high Reynolds number flows is a challenging task, since both fluid dynamics and combustion must be modeled accurately. Very large computer resources (both time and memory) are required to simulate this phenomenon numerically. Although current vector supercomputers are capable of providing adequate resources for simulations of this nature, their high cost and limited availability make practical use of such machines less than satisfactory. At the same time, the explicit time integration algorithms used in unsteady flow simulations often possess a very high degree of parallelism, making them very amenable to efficient implementation on large-scale parallel computers. Under these circumstances, distributed memory parallel computers offer an excellent near-term solution for greatly increased computational speed and memory, at a cost that may render unsteady simulations of the type discussed above more feasible and affordable. This paper discusses the study of unsteady turbulent flames using a simulation algorithm that is capable of retaining high parallel efficiency on distributed memory parallel architectures. Numerical studies are carried out using large-eddy simulation (LES). In LES, the scales larger than the grid are computed using a time- and space-accurate scheme, while the unresolved small scales are modeled using eddy-viscosity-based subgrid models. This is acceptable for the moment/energy closure, since the small scales primarily provide a dissipative mechanism for the energy transferred from the large scales. However, for combustion to occur, the species must first undergo mixing at the small scales and then come into molecular contact. Therefore, global models cannot be used.
Recently, a new model for turbulent combustion was developed, in which combustion is modeled within the subgrid (small scales) using a methodology that simulates the mixing, molecular transport, and chemical kinetics within each LES grid cell. Finite-rate kinetics can be included without any closure, and this approach actually provides a means to predict the turbulent rates and the turbulent flame speed. The subgrid combustion model requires resolution of the local time scales associated with small-scale mixing, molecular diffusion, and chemical kinetics; therefore, within each grid cell, a significant amount of computation must be carried out before the large-scale (LES-resolved) effects are incorporated. This approach is therefore uniquely suited for parallel processing and has been implemented on various systems, such as the Intel Paragon, IBM SP-2, Cray T3D, and SGI Power Challenge (PC), using the system-independent Message Passing Interface (MPI) library. In this paper, timing data on these machines is reported along with some characteristic results.
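
    The eddy-viscosity subgrid closure mentioned above can be illustrated with the classic Smagorinsky form, the textbook example of such a model. This is a generic sketch, not the authors' implementation; the constant value and function names are our assumptions.

```python
# Sketch of a Smagorinsky-type eddy-viscosity subgrid model:
# nu_t = (C_s * delta)^2 * |S|, with |S| = sqrt(2 S_ij S_ij).
import math

def strain_rate_magnitude(grad_u):
    """|S| for a 3x3 velocity-gradient tensor grad_u[i][j] = du_i/dx_j,
    where S_ij = 0.5 * (du_i/dx_j + du_j/dx_i)."""
    s2 = 0.0
    for i in range(3):
        for j in range(3):
            s_ij = 0.5 * (grad_u[i][j] + grad_u[j][i])
            s2 += s_ij * s_ij
    return math.sqrt(2.0 * s2)

def smagorinsky_nu_t(grad_u, delta, c_s=0.17):
    """Subgrid eddy viscosity nu_t = (C_s * delta)^2 * |S|."""
    return (c_s * delta) ** 2 * strain_rate_magnitude(grad_u)
```

    The subgrid combustion model described in the abstract goes well beyond this momentum closure, simulating mixing and kinetics within each cell, which is what makes the method so amenable to parallel processing.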

  12. Vectorization of a particle simulation method for hypersonic rarefied flow

    NASA Technical Reports Server (NTRS)

    Mcdonald, Jeffrey D.; Baganoff, Donald

    1988-01-01

    An efficient particle simulation technique for hypersonic rarefied flows is presented at an algorithmic and implementation level. The implementation is for a vector computer architecture, specifically the Cray-2. The method models an ideal diatomic Maxwell molecule with three translational and two rotational degrees of freedom. Algorithms are designed specifically for compatibility with fine-grained parallelism by reducing the number of data dependencies in the computation. By insisting on this compatibility, the method is capable of performing simulations on a much larger scale than previously possible. A two-dimensional simulation of supersonic flow over a wedge is carried out for the near-continuum limit, where the gas is in equilibrium and the ideal solution can be used as a check on the accuracy of the gas model employed in the method. Also, a three-dimensional, Mach 8, rarefied flow about a finite-span flat plate at a 45 degree angle of attack was simulated, utilizing over 10⁷ particles carried through 400 discrete time steps in less than one hour of Cray-2 CPU time. This problem was chosen to exhibit the capability of the method in handling a large number of particles and a true three-dimensional geometry.

  13. Local search to improve coordinate-based task mapping

    DOE PAGES

    Balzuweit, Evan; Bunde, David P.; Leung, Vitus J.; ...

    2015-10-31

    We present a local search strategy to improve the coordinate-based mapping of a parallel job’s tasks to the MPI ranks of its parallel allocation in order to reduce network congestion and the job’s communication time. The goal is to reduce the number of network hops between communicating pairs of ranks. Our target is applications with a nearest-neighbor stencil communication pattern running on mesh systems with non-contiguous processor allocation, such as Cray XE and XK systems. Utilizing the miniGhost mini-app, which models the shock physics application CTH, we demonstrate that our strategy reduces application running time while also reducing the runtime variability. Furthermore, we show that mapping quality can vary based on the selected allocation algorithm, even between allocation algorithms of similar apparent quality.
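
    The core idea, swapping the mesh coordinates assigned to two ranks whenever the swap lowers the total hop count of a stencil communication pattern, can be sketched as a greedy pairwise local search. The tiny 1-D instance and all names below are illustrative assumptions, not the paper's algorithm or data.

```python
# Sketch: pairwise-swap local search minimizing total network hops
# (Manhattan distance) for a nearest-neighbor communication pattern.
import itertools

def hops(a, b):
    """Manhattan distance between two mesh coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_hops(mapping, comm_pairs):
    return sum(hops(mapping[u], mapping[v]) for u, v in comm_pairs)

def local_search(mapping, comm_pairs):
    """Greedily swap rank-to-coordinate assignments until no pairwise
    exchange improves the cost (a local, not global, optimum)."""
    mapping = list(mapping)
    best = total_hops(mapping, comm_pairs)
    improved = True
    while improved:
        improved = False
        for u, v in itertools.combinations(range(len(mapping)), 2):
            mapping[u], mapping[v] = mapping[v], mapping[u]
            cost = total_hops(mapping, comm_pairs)
            if cost < best:
                best, improved = cost, True
            else:
                mapping[u], mapping[v] = mapping[v], mapping[u]
    return mapping, best

# 6 ranks with a nearest-neighbor chain pattern, scattered on a 1-D mesh:
pairs = [(i, i + 1) for i in range(5)]
start = [(5,), (0,), (3,), (1,), (4,), (2,)]
mapped, cost = local_search(start, pairs)
```

    A real non-contiguous allocation would supply 3-D torus coordinates and the application's actual communication graph; the search structure is the same.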

  14. MODA A Framework for Memory Centric Performance Characterization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shrestha, Sunil; Su, Chun-Yi; White, Amanda M.

    2012-06-29

    In the age of massive parallelism, the focus of performance analysis has switched from the processor and related structures to the memory and I/O resources. Adapting to this new reality, a performance analysis tool has to provide a way to analyze resource usage to pinpoint existing and potential problems in a given application. This paper provides an overview of the Memory Observant Data Analysis (MODA) tool, a memory-centric tool first implemented on the Cray XMT supercomputer. Throughout the paper, MODA's capabilities have been showcased with experiments done on matrix multiply and Graph-500 application codes.

  15. User's and test case manual for FEMATS

    NASA Technical Reports Server (NTRS)

    Chatterjee, Arindam; Volakis, John; Nurnberger, Mike; Natzke, John

    1995-01-01

    The FEMATS program incorporates first-order edge-based finite elements and vector absorbing boundary conditions into the scattered field formulation for computation of the scattering from three-dimensional geometries. The code has been validated extensively for a large class of geometries containing inhomogeneities and satisfying transition conditions. For geometries that are too large for the workstation environment, the FEMATS code has been optimized to run on various supercomputers. Currently, FEMATS has been configured to run on the HP 9000 workstation, vectorized for the Cray Y-MP, and parallelized to run on the Kendall Square Research (KSR) architecture and the Intel Paragon.

  16. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science.

    PubMed

    Marek, A; Blum, V; Johanni, R; Havu, V; Lang, B; Auckenthaler, T; Heinecke, A; Bungartz, H-J; Lederer, H

    2014-05-28

    Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N³) with the size of the investigated problem, N (e.g., the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well-documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g., Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding back-transformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient, especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well.
Scalability beyond 10,000 CPU cores for problem sizes arising in the field of electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260,000, scalability up to 295,000 CPU cores has been shown on BlueGene/P.
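
    The one-step tridiagonalization via successive Householder transformations mentioned above can be sketched compactly. This is an unblocked, serial, educational sketch with helper names of our own choosing; ELPA's actual kernels are blocked, distributed, and far more sophisticated.

```python
# Sketch: reduce a real symmetric matrix to tridiagonal form T = Q^T A Q
# by successive Householder reflections (the "one-step" scheme).
import math

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def householder_tridiagonalize(A):
    """Return the tridiagonal matrix similar to A (list-of-lists input)."""
    n = len(A)
    T = [row[:] for row in A]
    for k in range(n - 2):
        x = [T[i][k] for i in range(k + 1, n)]
        alpha = -math.copysign(math.sqrt(sum(c * c for c in x)), x[0] or 1.0)
        v = x[:]
        v[0] -= alpha
        norm = math.sqrt(sum(c * c for c in v))
        if norm < 1e-300:
            continue  # column already in tridiagonal form
        v = [c / norm for c in v]
        # Embed H = I - 2 v v^T acting on rows/columns k+1..n-1.
        H = [[float(i == j) for j in range(n)] for i in range(n)]
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                H[i][j] -= 2.0 * v[i - k - 1] * v[j - k - 1]
        T = matmul(H, matmul(T, H))  # similarity transform (H symmetric)
    return T
```

    The similarity transform preserves the eigenvalues (and the trace), which is why the subsequent tridiagonal eigensolve and eigenvector back-transformation recover the spectrum of the original matrix.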

  17. Parallel eigenanalysis of finite element models in a completely connected architecture

    NASA Technical Reports Server (NTRS)

    Akl, F. A.; Morel, M. R.

    1989-01-01

    A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi) = (M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, the Cray X-MP. A finite element model is divided into m domains, each of which is assumed to contain n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effects of the number of domains, the number of degrees of freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented, and the performance of its subroutines in a parallel environment is analyzed.
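
    The speed-up and efficiency metrics used above to evaluate the algorithm presumably follow the standard definitions, which amount to a pair of one-liners (function names are ours):

```python
# Standard parallel performance metrics: speed-up S = T_serial / T_parallel,
# efficiency E = S / p for p processors.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, num_procs):
    return speedup(t_serial, t_parallel) / num_procs
```

    For example, a run that takes 100 s serially and 25 s on 8 processors has a speed-up of 4 but an efficiency of only 0.5, the kind of gap the abstract's domain-count and front-size studies are probing.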

  18. Parallel Grid Manipulations in Earth Science Calculations

    NASA Technical Reports Server (NTRS)

    Sawyer, W.; Lucchesi, R.; daSilva, A.; Takacs, L. L.

    1999-01-01

    The National Aeronautics and Space Administration (NASA) Data Assimilation Office (DAO) at the Goddard Space Flight Center is moving its data assimilation system to massively parallel computing platforms. This parallel implementation of GEOS DAS will be used in the DAO's normal activities, which include reanalysis of data and operational support for flight missions. Key components of GEOS DAS, including the gridpoint-based general circulation model and a data analysis system, are currently being parallelized. The parallelization of GEOS DAS is also one of the HPCC Grand Challenge Projects. The GEOS-DAS software employs several distinct grids. Some examples are: an observation grid, an unstructured set of points with which observed or measured physical quantities from instruments or satellites are associated; a highly structured latitude-longitude grid of points spanning the earth, at whose latitude-longitude coordinates prognostic quantities are determined; and a computational lat-lon grid in which the pole has been moved to a different location to avoid computational instabilities. Each of these grids has a different structure and number of constituent points. In spite of that, there are numerous interactions between the grids; e.g., values on one grid must be interpolated to another, or, in other cases, grids need to be redistributed on the underlying parallel platform. The DAO has designed a parallel integrated library for grid manipulations (PILGRIM) to support the needed grid interactions with maximum efficiency. It offers a flexible interface to generate new grids, define transformations between grids, and apply them. Basic communication is currently MPI; however, the interfaces defined here could conceivably be implemented with other message-passing libraries, e.g., Cray SHMEM, or with shared-memory constructs. The library is written in Fortran 90.
First performance results indicate that even difficult problems, such as the above-mentioned pole rotation (a sparse interpolation with little data locality between the physical lat-lon grid and a pole-rotated computational grid), can be solved efficiently and at the GFlop/s rates needed to solve tomorrow's high-resolution earth science models. In the subsequent presentation we will discuss the design and implementation of PILGRIM as well as a number of the problems it is required to solve. Some conclusions will be drawn about the potential performance of the overall earth science models on the supercomputer platforms foreseen for these problems.
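
    The simplest grid-to-grid transformation of the kind PILGRIM applies is interpolation of a field from one grid onto a point of another. A minimal sketch using bilinear interpolation in the source grid's index space follows; the coordinate convention and function name are illustrative assumptions, not PILGRIM's interface.

```python
# Sketch: bilinear interpolation of a 2-D field at a fractional index
# position (i_frac, j_frac) within the source grid.

def bilinear(field, i_frac, j_frac):
    """Weighted average of the four surrounding grid-point values."""
    i, j = int(i_frac), int(j_frac)
    fi, fj = i_frac - i, j_frac - j
    return ((1 - fi) * (1 - fj) * field[i][j]
            + (1 - fi) * fj * field[i][j + 1]
            + fi * (1 - fj) * field[i + 1][j]
            + fi * fj * field[i + 1][j + 1])
```

    The pole-rotation case discussed above is harder than this sketch precisely because the source points contributing to one target point are scattered across processor sub-domains, so the interpolation must be fused with communication.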

  19. Magnetic spectral signatures in the Earth's magnetosheath and plasma depletion layer

    NASA Technical Reports Server (NTRS)

    Anderson, Brian J.; Fuselier, Stephen A.; Gary, S. Peter; Denton, Richard E.

    1994-01-01

    Correlations between plasma properties and magnetic fluctuations in the sub-solar magnetosheath downstream of a quasi-perpendicular shock have been found and indicate that mirror and ion cyclotron-like fluctuations correlate with the magnetosheath proper and plasma depletion layer, respectively (Anderson and Fuselier, 1993). We explore the entire range of magnetic spectral signatures observed from the Active Magnetospheric Particle Tracer Explorers/Charge Composition Explorer (AMPTE/CCE) spacecraft in the magnetosheath downstream of a quasi-perpendicular shock. The magnetic spectral signatures typically progress from predominantly compressional fluctuations, δB_∥/δB_⊥ ≈ 3, with F/F_p < 0.2 (F and F_p are the wave frequency and proton gyrofrequency, respectively), to predominantly transverse fluctuations, δB_∥/δB_⊥ ≈ 0.3, extending up to F_p. The compressional fluctuations are characterized by anticorrelation between the field magnitude and the electron density n_e, and by a small compressibility, C_e ≡ (δn_e/n_e)² (B/δB_∥)² ≈ 0.13, indicative of mirror waves. The spectral characteristics of the transverse fluctuations are in agreement with predictions of linear Vlasov theory for the H⁺ and He²⁺ cyclotron modes. The power spectra and local plasma parameters are found to vary in concert: mirror waves occur for β_∥p ≡ 2μ₀ n_p k T_∥p / B² ≈ 2 and A_p ≡ T_⊥p/T_∥p − 1 ≈ 0.4, whereas cyclotron waves occur for β_∥p ≈ 0.2 and A_p ≈ 2. The transition from mirror to cyclotron modes is predicted by linear theory.
The spectral characteristics overlap for intermediate plasma parameters. The plasma observations are described by A_p = 0.85 β_∥p^(−0.48) with a log regression coefficient of −0.74. This inverse A_p-β_∥p correlation corresponds closely to the isocontours of maximum ion anisotropy instability growth, γ_m/ω_p = 0.01, for the mirror and cyclotron modes. The agreement of observed properties with the predictions of local theory suggests that the spectral signatures reflect the local plasma environment and that the anisotropy instabilities regulate A_p. We suggest that the spectral characteristics may provide a useful basis for ordering observations in the magnetosheath and that the inverse A_p-β_∥p correlation may be used as a beta-dependent upper limit on the proton anisotropy to represent kinetic effects.
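
    The fitted inverse relation between proton anisotropy and parallel beta quoted above is straightforward to evaluate; a one-line helper (the function name is ours) makes the beta-dependent anisotropy limit concrete:

```python
# Sketch: the regression A_p = 0.85 * beta_par**(-0.48) reported in the
# abstract, usable as a beta-dependent upper limit on proton anisotropy.

def anisotropy_fit(beta_par):
    """Fitted proton anisotropy A_p as a function of parallel beta."""
    return 0.85 * beta_par ** (-0.48)
```

    The fit is monotonically decreasing, consistent with the reported regimes: low beta (cyclotron-wave conditions) permits high anisotropy, while high beta (mirror-wave conditions) clamps it to small values.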

  20. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high-performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan was (1) to develop highly accurate parallel numerical algorithms, (2) to conduct preliminary testing to verify the effectiveness and potential of these algorithms, and (3) to incorporate the newly developed algorithms into actual simulation packages. This plan has been achieved. Two highly accurate, efficient Poisson solvers have been developed and tested, based on two different approaches: (1) adopting a mathematical geometry that has a better capacity to describe the fluid, and (2) using a compact scheme to gain high-order accuracy in the numerical discretization.
The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
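
    The PDD and RPDD algorithms studied above parallelize the solution of tridiagonal systems; the sequential kernel they build on is the classic Thomas algorithm, sketched here for reference. This is the textbook serial kernel, not the paper's parallel implementation.

```python
# Sketch: Thomas algorithm (sequential tridiagonal solve), the kernel
# whose parallelization PDD/RPDD address by decoupling sub-systems.

def thomas_solve(sub, diag, sup, rhs):
    """Solve a tridiagonal system; sub[0] and sup[-1] are unused."""
    n = len(diag)
    cp = [0.0] * n          # modified super-diagonal
    dp = [0.0] * n          # modified right-hand side
    cp[0] = sup[0] / diag[0]
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):   # forward elimination
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m if i < n - 1 else 0.0
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

    The forward/backward sweeps make this kernel inherently serial, which is exactly why diagonal-dominance-based decoupling schemes like PDD matter on distributed-memory machines.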

  1. Anisotropic upper critical magnetic fields in Rb2Cr3As3 superconductor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tang, Zhang-Tu; Liu, Yi; Bao, Jin-Ke

    Rb2Cr3As3 is a structurally one-dimensional superconductor containing Cr3As3 chains, with a superconducting transition temperature of T_c = 4.8 K. Here we report electrical resistance measurements on Rb2Cr3As3 single crystals, under magnetic fields up to 29.5 T and at temperatures down to 0.36 K, from which the upper critical fields, H_c2(T), can be obtained over a broad temperature range. For field parallel to the Cr3As3 chains, H_c2^∥(T) is paramagnetically limited, with an initial slope of μ₀ dH_c2^∥/dT|_Tc = −16 T K⁻¹ and a zero-temperature upper critical field of μ₀ H_c2^∥(0) = 17.5 T. For field perpendicular to the Cr3As3 chains, however, H_c2^⊥(T) is limited only by the orbital pair-breaking effect, with μ₀ dH_c2^⊥/dT|_Tc = −3 T K⁻¹. As a consequence, the anisotropy γ_H = H_c2^∥/H_c2^⊥ decreases sharply near T_c and reverses below 2 K. Remarkably, the low-temperature H_c2^⊥(T), down to 0.075 T_c, continues to increase linearly up to more than three times the Pauli paramagnetic limit, which strongly suggests dominant spin-triplet superconductivity in Rb2Cr3As3.

  2. Mini-Fragment Fixation Is Equivalent to Bicortical Screw Fixation for Horizontal Medial Malleolus Fractures.

    PubMed

    Wegner, Adam M; Wolinsky, Philip R; Robbins, Michael A; Garcia, Tanya C; Amanatullah, Derek F

    2018-05-01

    Horizontal fractures of the medial malleolus occur through application of valgus or abduction force through the ankle that creates a tension failure of the medial malleolus. The authors hypothesize that mini-fragment T-plates may offer improved fixation, but the optimal fixation construct for these fractures remains unclear. Forty synthetic distal tibiae with identical osteotomies were randomized into 4 fixation constructs: (1) two parallel unicortical cancellous screws; (2) two parallel bicortical cortical screws; (3) a contoured mini-fragment T-plate with 2 unicortical screws in the fragment and 2 bicortical screws in the shaft; and (4) a contoured mini-fragment T-plate with 2 bicortical screws in the fragment and 2 unicortical screws in the shaft. Specimens were subjected to offset axial tension loading on a servohydraulic testing system and tracked using high-resolution video. Failure was defined as 2 mm of articular displacement. Analysis of variance followed by a Tukey-Kramer post hoc test was used to assess for differences between groups, with significance defined as P<.05. The mean stiffness (±SD) values of both mini-fragment T-plate constructs (239±83 N/mm and 190±37 N/mm) and the bicortical screw construct (240±17 N/mm) were not statistically different. The mean stiffness values of both mini-fragment T-plate constructs and the bicortical screw construct were higher than that of a parallel unicortical screw construct (102±20 N/mm). Contoured T-plate constructs provide stiffer initial fixation than a unicortical cancellous screw construct. The T-plate is biomechanically equivalent to a bicortical screw construct, but may be superior in capturing small fragments of bone. [Orthopedics. 2018; 41(3):e395-e399.]. Copyright 2018, SLACK Incorporated.

  3. Parallel imaging of knee cartilage at 3 Tesla.

    PubMed

    Zuo, Jin; Li, Xiaojuan; Banerjee, Suchandrima; Han, Eric; Majumdar, Sharmila

    2007-10-01

    To evaluate the feasibility and reproducibility of quantitative cartilage imaging with parallel imaging at 3T and to determine the impact of the acceleration factor (AF) on morphological and relaxation measurements. An eight-channel phased-array knee coil was employed for conventional and parallel imaging on a 3T scanner. The imaging protocol consisted of a T2-weighted fast spin echo (FSE), a 3D-spoiled gradient echo (SPGR), a custom 3D-SPGR T1rho, and a 3D-SPGR T2 sequence. Parallel imaging was performed with an array spatial sensitivity technique (ASSET). The left knees of six healthy volunteers were scanned with both conventional and parallel imaging (AF = 2). Morphological parameters and relaxation maps from the parallel imaging method (AF = 2) showed results comparable with the conventional method. The intraclass correlation coefficients (ICC) of the two methods for cartilage volume, mean cartilage thickness, T1rho, and T2 were 0.999, 0.977, 0.964, and 0.969, respectively, while demonstrating excellent reproducibility. No significant measurement differences were found when AF reached 3, despite the low signal-to-noise ratio (SNR). The study demonstrated that parallel imaging can be applied to current knee cartilage quantification at AF = 2 without degrading measurement accuracy, with good reproducibility, while effectively reducing scan time. Shorter imaging times can be achieved with higher AF at the cost of SNR. (c) 2007 Wiley-Liss, Inc.

  4. Measurements of long-range enhanced collisional velocity drag through plasma wave damping

    NASA Astrophysics Data System (ADS)

    Affolter, M.; Anderegg, F.; Dubin, D. H. E.; Driscoll, C. F.

    2018-05-01

    We present damping measurements of axial plasma waves in magnetized, multispecies ion plasmas. At high temperatures, T ≳ 10⁻² eV, collisionless Landau damping dominates, whereas at lower temperatures, T ≲ 10⁻² eV, the damping arises from interspecies collisional drag, which is dependent on the plasma composition and scales roughly as T^(−3/2). This drag damping is proportional to the rate of parallel collisional slowing, and is found to exceed classical predictions of collisional drag damping by as much as an order of magnitude, but agrees with a new collision theory that includes long-range collisions. Centrifugal mass separation and collisional locking of the species occur at ultra-low temperatures, T ≲ 10⁻³ eV, which reduce the drag damping from the T^(−3/2) collisional scaling. These mechanisms are investigated by measuring the damping of higher-frequency axial modes, and by measuring the damping in plasmas with a non-equilibrium species profile.

  5. Big Data and High-Performance Computing in Global Seismology

    NASA Astrophysics Data System (ADS)

    Bozdag, Ebru; Lefebvre, Matthieu; Lei, Wenjie; Peter, Daniel; Smith, James; Komatitsch, Dimitri; Tromp, Jeroen

    2014-05-01

    Much of our knowledge of Earth's interior is based on seismic observations and measurements. Adjoint methods provide an efficient way of incorporating 3D full wave propagation in iterative seismic inversions to enhance tomographic images and thus our understanding of processes taking place inside the Earth. Our aim is to take adjoint tomography, which has been successfully applied to regional and continental scale problems, further to image the entire planet. This is one of the extreme imaging challenges in seismology, mainly due to the intense computational requirements and the vast amount of high-quality seismic data that can potentially be assimilated. We have started low-resolution inversions (T > 30 s and T > 60 s for body and surface waves, respectively) with a limited data set (253 carefully selected earthquakes and seismic data from permanent and temporary networks) on Oak Ridge National Laboratory's Cray XK7 "Titan" system. Recent improvements in our 3D global wave propagation solvers, such as a GPU version of the SPECFEM3D_GLOBE package, will enable us to perform higher-resolution (T > 9 s) and longer-duration (~180 m) simulations to take advantage of high-frequency body waves and major-arc surface waves, thereby improving the imbalanced ray coverage that results from the uneven global distribution of sources and receivers. Our ultimate goal is to use all earthquakes in the global CMT catalogue within the magnitude range of our interest, and data from all available seismic networks. To take full advantage of computational resources, we need a solid framework to manage big data sets during numerical simulations, pre-processing (i.e., data requests and quality checks, processing data, window selection, etc.) and post-processing (i.e., pre-conditioning and smoothing kernels, etc.).
We address the bottlenecks in our global seismic workflow, which are mainly coming from heavy I/O traffic during simulations and the pre- and post-processing stages, by defining new data formats for seismograms and outputs of our 3D solvers (i.e., meshes, kernels, seismic models, etc.) based on ORNL's ADIOS libraries. We will discuss our global adjoint tomography workflow on HPC systems as well as the current status of our global inversions.

  6. Development of Specialized Equipment to Measure Piezoelectric Coefficients of Polymer Piezoelectric Materials,

    DTIC Science & Technology

    1980-02-22

    [OCR-garbled equation excerpt; recoverable relations: g31 = (Voc/t)/(F/wt) = field/stress; C = K ε₀ w l / t (parallel-plate capacitance formula); VL = Voc Cx/CL for CL > Cx; V = (g31/w)F; Vstd = ("d33"/Cstd)F. The excerpt also references Channel Products Technical Note No. 1 on the theory of using a d33 meter to measure d31 directly on piezoelectric ceramic.]

  7. Parallel performance optimizations on unstructured mesh-based simulations

    DOE PAGES

    Sarje, Abhinav; Song, Sukhyun; Jacobsen, Douglas; ...

    2015-06-01

    This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitionings with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

  8. Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gittens, Alex; Devarakonda, Aditya; Racah, Evan

    We explore the trade-offs of performing linear algebra using Apache Spark compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to 1.6 TB particle physics, 2.2 TB and 16 TB climate modeling, and 1.1 TB bioimaging data. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
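    The tall-and-skinny shape is what makes these factorizations fit a data-parallel model: each partition can form a small d×d Gram matrix locally, and only those small matrices need to be reduced. A minimal sketch of that pattern for (uncentered) PCA, with NumPy row blocks standing in for Spark partitions; names and structure are illustrative, not the paper's code.

```python
import numpy as np

def tall_skinny_pca(blocks, k):
    """Top-k principal directions of a tall-skinny matrix given as row blocks.

    Each block stands in for one data-parallel partition: workers compute
    local d x d Gram matrices, which are summed (a tree-reduce in Spark)
    and eigendecomposed on the driver. Mean subtraction omitted for brevity.
    """
    d = blocks[0].shape[1]
    gram = np.zeros((d, d))
    for block in blocks:          # "map" over partitions
        gram += block.T @ block   # local Gram matrix
    w, v = np.linalg.eigh(gram)   # eigenvalues in ascending order
    order = np.argsort(w)[::-1]   # reorder descending
    return v[:, order[:k]]
```

    The per-partition work is O(rows·d²) and the reduction moves only d×d matrices, which is why the tall-skinny case maps so naturally onto a data-parallel engine.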

  9. PAN AIR: A computer program for predicting subsonic or supersonic linear potential flows about arbitrary configurations using a higher order panel method. Volume 4: Maintenance document (version 3.0)

    NASA Technical Reports Server (NTRS)

    Purdon, David J.; Baruah, Pranab K.; Bussoletti, John E.; Epton, Michael A.; Massena, William A.; Nelson, Franklin D.; Tsurusaki, Kiyoharu

    1990-01-01

    The Maintenance Document Version 3.0 is a guide to the PAN AIR software system, a system which computes the subsonic or supersonic linear potential flow about a body of nearly arbitrary shape, using a higher order panel method. The document describes the overall system and each program module of the system. Sufficient detail is given for program maintenance, updating, and modification. It is assumed that the reader is familiar with programming and CRAY computer systems. The PAN AIR system was written in FORTRAN 4 language except for a few CAL language subroutines which exist in the PAN AIR library. Structured programming techniques were used to provide code documentation and maintainability. The operating systems accommodated are COS 1.11, COS 1.12, COS 1.13, and COS 1.14 on the CRAY 1S, 1M, and X-MP computing systems. The system is comprised of a data base management system, a program library, an execution control module, and nine separate FORTRAN technical modules. Each module calculates part of the posed PAN AIR problem. The data base manager is used to communicate between modules and within modules. The technical modules must be run in a prescribed fashion for each PAN AIR problem. In order to ease the problem of supplying the many JCL cards required to execute the modules, a set of CRAY procedures (PAPROCS) was created to automatically supply most of the JCL cards. Most of this document has not changed for Version 3.0. It now, however, strictly applies only to PAN AIR version 3.0. The major changes are: (1) additional sections covering the new FDP module (which calculates streamlines and offbody points); (2) a complete rewrite of the section on the MAG module; and (3) strict applicability to CRAY computing systems.

  10. Scalability study of parallel spatial direct numerical simulation code on IBM SP1 parallel supercomputer

    NASA Technical Reports Server (NTRS)

    Hanebutte, Ulf R.; Joslin, Ronald D.; Zubair, Mohammad

    1994-01-01

    The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent transition in three-dimensional boundary-layer flows are computed with the PSDNS code. By remapping the distributed data structure during the course of the calculation, optimized serial library routines can be utilized that substantially increase the computational performance. Although the remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40% for all performed calculations. By using appropriate compile options and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45% of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a 'real world' simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y-MP for the same simulation. The scalability information provides estimated computational costs that match the actual costs relative to changes in the number of grid points.

  11. Discrete Event Modeling and Massively Parallel Execution of Epidemic Outbreak Phenomena

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perumalla, Kalyan S; Seal, Sudip K

    2011-01-01

    In complex phenomena such as epidemiological outbreaks, the intensity of inherent feedback effects and the significant role of transients in the dynamics make simulation the only effective method for proactive, reactive, or post-facto analysis. The spatial scale, runtime speed, and behavioral detail needed in detailed simulations of epidemic outbreaks make it necessary to use large-scale parallel processing. Here, an optimistic parallel execution of a new discrete event formulation of a reaction-diffusion simulation model of epidemic propagation is presented, dramatically increasing the fidelity and speed with which epidemiological simulations can be performed. Rollback support needed during optimistic parallel execution is achieved by combining reverse computation with a small amount of incremental state saving. Parallel speedup of over 5,500 and other runtime performance metrics of the system are observed with weak-scaling execution on a small (8,192-core) Blue Gene/P system, while scalability with a weak-scaling speedup of over 10,000 is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes, exceeding several hundreds of millions of individuals in the largest cases, are successfully exercised to verify model scalability.
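    The reverse-computation idea can be shown in a few lines: forward event handlers prefer operations a reverse handler can undo exactly, and only destructive updates fall back to (incremental) state saving. The event type below is a made-up toy for illustration, not the paper's epidemic model.

```python
# Toy sketch of rollback via reverse computation in an optimistic simulation.
# Constructive updates (+=) are undone by their inverses; the one destructive
# update (min) falls back to saving the old value incrementally.

class Region:
    def __init__(self):
        self.infected = 0
        self.earliest = float("inf")   # earliest infection arrival seen

def forward(state, ev_time, saved):
    """Process an 'infection arrives at ev_time' event."""
    state.infected += 1                # reversible: undo by subtracting
    saved.append(state.earliest)       # save old value: min() is destructive
    state.earliest = min(state.earliest, ev_time)

def reverse(state, saved):
    """Roll the event back (in LIFO order) after a causality violation."""
    state.earliest = saved.pop()       # restore the saved field
    state.infected -= 1                # inverse of the constructive update
```

    Because most of the state change is undone by computation rather than copying, the per-event rollback memory is just the one saved scalar, which is the point of combining reverse computation with a small amount of state saving.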

  12. Supercomputers for engineering analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Goudreau, G.L.; Benson, D.J.; Hallquist, J.O.

    1986-07-01

    The Cray-1 and Cray X-MP/48 experience in engineering computations at the Lawrence Livermore National Laboratory is surveyed. The fully vectorized explicit DYNA and implicit NIKE finite element codes are discussed with respect to solid and structural mechanics. The main efficiencies for production analyses are currently obtained by simple CFT compiler exploitation of pipeline architecture for inner do-loop optimization. Current development of outer-loop multitasking is also discussed. Applications emphasis will be on 3D examples spanning earth penetrator loads analysis, target lethality assessment, and crashworthiness. The use of a vectorized large-deformation shell element in both DYNA and NIKE has substantially expanded 3D nonlinear capability. 25 refs., 7 figs.

  13. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Maxwell, Don E; Ezell, Matthew A; Becklehimer, Jeff

    While sites generally have systems in place to monitor the health of Cray computers themselves, the cooling systems are often ignored until a computer failure requires investigation into the source of the failure. The Liebert XDP units used to cool the Cray XE/XK models, as well as the Cray proprietary cooling system used for the Cray XC30 models, provide data useful for health monitoring. Unfortunately, this valuable information is often available only to custom solutions not accessible by a center-wide monitoring system, or is simply ignored entirely. In this paper, methods and tools used to harvest the available monitoring data are discussed, and the implementation needed to integrate the data into a center-wide monitoring system at the Oak Ridge National Laboratory is provided.

  14. Multitasking the three-dimensional shock wave code CTH on the Cray X-MP/416

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McGlaun, J.M.; Thompson, S.L.

    1988-01-01

    CTH is a software system under development at Sandia National Laboratories Albuquerque that models multidimensional, multi-material, large-deformation, strong shock wave physics. CTH was carefully designed to both vectorize and multitask on the Cray X-MP/416. All of the physics routines are vectorized except the thermodynamics and the interface tracer. All of the physics routines are multitasked except the boundary conditions. The Los Alamos National Laboratory multitasking library was used for the multitasking. The resulting code is easy to maintain, easy to understand, gives the same answers as the unitasked code, and achieves a measured speedup of approximately 3.5 on the four-CPU Cray. This document discusses the design, prototyping, development, and debugging of CTH. It also covers the architectural features of CTH that enhance multitasking, the granularity of the tasks, and the synchronization of tasks. The utility of system software and tools such as simulators and interactive debuggers is also discussed. 5 refs., 7 tabs.

  15. A Pacific Ocean general circulation model for satellite data assimilation

    NASA Technical Reports Server (NTRS)

    Chao, Y.; Halpern, D.; Mechoso, C. R.

    1991-01-01

    A tropical Pacific Ocean General Circulation Model (OGCM) to be used in satellite data assimilation studies is described. The transfer of the OGCM from a CYBER-205 at NOAA's Geophysical Fluid Dynamics Laboratory to a CRAY-2 at NASA's Ames Research Center is documented. Two 3-year model integrations from identical initial conditions but performed on those two computers are compared. The model simulations are very similar to each other, as expected, but the simulation performed with the higher-precision CRAY-2 is smoother than that with the lower-precision CYBER-205. The CYBER-205 and CRAY-2 use 32- and 64-bit mantissa arithmetic, respectively. The major features of the oceanic circulation in the tropical Pacific, namely the North Equatorial Current, the North Equatorial Countercurrent, the South Equatorial Current, and the Equatorial Undercurrent, are realistically reproduced and their seasonal cycles are described. The OGCM provides a powerful tool for the study of tropical oceans and for the assimilation of satellite altimetry data.

  16. Particle simulation on heterogeneous distributed supercomputers

    NASA Technical Reports Server (NTRS)

    Becker, Jeffrey C.; Dagum, Leonardo

    1993-01-01

    We describe the implementation and performance of a three dimensional particle simulation distributed between a Thinking Machines CM-2 and a Cray Y-MP. These are connected by a combination of two high-speed networks: a high-performance parallel interface (HIPPI) and an optical network (UltraNet). This is the first application to use this configuration at NASA Ames Research Center. We describe our experience implementing and using the application and report the results of several timing measurements. We show that the distribution of applications across disparate supercomputing platforms is feasible and has reasonable performance. In addition, several practical aspects of the computing environment are discussed.

  17. Accelerating Virtual High-Throughput Ligand Docking: current technology and case study on a petascale supercomputer.

    PubMed

    Ellingson, Sally R; Dakshanamurthy, Sivanesan; Brown, Milton; Smith, Jeremy C; Baudry, Jerome

    2014-04-25

    In this paper we give the current state of high-throughput virtual screening. We describe a case study of using a task-parallel MPI (Message Passing Interface) version of Autodock4 [1], [2] to run a virtual high-throughput screen of one-million compounds on the Jaguar Cray XK6 Supercomputer at Oak Ridge National Laboratory. We include a description of scripts developed to increase the efficiency of the predocking file preparation and postdocking analysis. A detailed tutorial, scripts, and source code for this MPI version of Autodock4 are available online at http://www.bio.utk.edu/baudrylab/autodockmpi.htm.
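    The structure of such a screen is an embarrassingly parallel task farm: a pool of workers each repeatedly grabs the next compound and docks it. A toy sketch of that pattern using threads in place of MPI ranks; the `dock` function here is a placeholder stand-in, not an Autodock4 call.

```python
# Task-farm sketch for a virtual screen: each worker pulls the next compound
# as soon as it finishes the previous one. `dock` is a fake placeholder for
# one docking run, and threads stand in for the paper's MPI ranks.
from concurrent.futures import ThreadPoolExecutor

def dock(ligand):
    """Pretend to dock one ligand; return (name, fake binding score)."""
    return ligand, -0.1 * len(ligand)

def screen(ligands, workers=4):
    """Dock every ligand using a pool of `workers` concurrent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(dock, ligands))

scores = screen(["lig-001", "lig-002", "lig-003"], workers=2)
```

    Because each docking job is independent, the farm scales until the per-job file preparation and result collection (the parts the paper's scripts optimize) become the bottleneck.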

  18. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reynolds, William; Weber, Marta S.; Farber, Robert M.

    Social media provide an exciting and novel view into social phenomena. The vast amounts of data that can be gathered from the Internet, coupled with massively parallel supercomputers such as the Cray XMT, open new vistas for research. Conclusions drawn from such analysis must recognize that social media are distinct from the underlying social reality. Rigorous validation is essential. This paper briefly presents results obtained from computational analysis of social media, utilizing both blog and Twitter data. Validation of these results is discussed in the context of a framework of established methodologies from the social sciences. Finally, an outline for a set of supporting studies is proposed.

  19. Time-partitioning simulation models for calculation on parallel computers

    NASA Technical Reports Server (NTRS)

    Milner, Edward J.; Blech, Richard A.; Chima, Rodrick V.

    1987-01-01

    A technique allowing time-staggered solution of partial differential equations is presented in this report. Using this technique, called time-partitioning, simulation execution speedup is proportional to the number of processors used because all processors operate simultaneously, each updating the solution grid at a different time point. The technique is limited neither by the number of processors available nor by the dimension of the solution grid. Time-partitioning was used to obtain the flow pattern through a cascade of airfoils, modeled by the Euler partial differential equations. An execution speedup factor of 1.77 was achieved using a two-processor Cray X-MP/24 computer.
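    A back-of-envelope model of why the speedup tracks the processor count: after a pipeline-fill delay, all P processors advance different time levels concurrently. This simple cost model is our own illustration, not taken from the report; pipeline fill and synchronization overheads are one reason the observed factor (1.77 on two processors) falls short of the ideal 2.

```python
import math

def pipeline_makespan(steps, procs, fill_lag=1):
    """Time slots to finish `steps` time steps with `procs` pipelined workers.

    Each processor updates the grid at a different time level; a processor can
    start its level `fill_lag` slots after the one below it, so after an
    initial fill of (procs - 1) * fill_lag slots all processors run at once.
    """
    return (procs - 1) * fill_lag + math.ceil(steps / procs)

# Ideal speedup of the staggered pipeline over one processor, 100 steps:
speedup = pipeline_makespan(100, 1) / pipeline_makespan(100, 2)
```

    In this idealized model the two-processor speedup is about 1.96; real overheads pull it down toward the measured 1.77.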

  20. Centrifugal acceleration of the polar wind

    NASA Technical Reports Server (NTRS)

    Horwitz, J. L.; Ho, C. W.; Scarbro, H. D.; Wilson, G. R.; Moore, T. E.

    1994-01-01

    The effect of parallel ion acceleration associated with convection was first applied to energization of test-particle polar ions by Cladis (1986). However, this effect is typically neglected in 'self-consistent' models of polar plasma outflow, apart from the fluid simulation by Swift (1990). Here we include approximations for this acceleration, which we broadly characterize as centrifugal in nature, in our time-dependent, semikinetic model of polar plasma outflow and describe the effects on the bulk parameter profiles and distribution functions of H+ and O+. For meridional convection across the pole, the approximate parallel force along a polar magnetic field line may be written as F_cent,pole = (3/2) m (E_i/B_i)^2 (r^2/r_i^3), where m is ion mass, r is geocentric distance, and E_i, B_i, and r_i refer to the electric and magnetic field magnitudes and geocentric distance at the ionosphere, respectively. For purely longitudinal convection along a constant L shell the parallel force is F_cent,long = F_cent,pole [1 - (r/(r_i L))^(3/2)] / [1 - 3r/(4 r_i L)]^(5/2). For high latitudes the difference between these two cases is relatively unimportant below approximately 5 R_E. We find that the steady-state O+ bulk velocities and parallel temperatures strongly increase and decrease, respectively, with convection strength. In particular, the bulk velocities increase from near 0 km/s at 4000 km altitude to approximately 10 km/s at 5 R_E geocentric distance for a 50-mV/m ionospheric convection electric field. However, the centrifugal effect on the steady O+ density profiles depends on the exobase ion and electron temperatures: for low base temperatures (T_i = T_e = 3000 K) the O+ density at high altitudes increases greatly with convection, while for higher base temperatures (T_i = 5000 K, T_e = 9000 K) the high-altitude O+ density decreases somewhat as convection is enhanced. The centrifugal force further has a pronounced effect on the escaping O+ flux, especially for cool exobase conditions; as referenced to the 4000-km altitude, the steady-state O+ flux increases from 10^5 ions/sq cm/s when the ionospheric convection field E_i = 0 mV/m to approximately 10^7 ions/sq cm/s when E_i = 100 mV/m. The centrifugal effect also decreases the time scale for approach to steady state. For example, in the plasma expansion for T_i = T_e = 3000 K, the O+ density at 7 R_E reaches only 10^-7 of its final value approximately 1.5 hours after expansion onset for E_i = 0. For meridional convection driven by E_i = 50 mV/m, the density at the same time after initial injection is 30-50% of its asymptotic level. The centrifugal acceleration described here is a possible explanation for the large (up to approximately 10 km/s or more) O+ outflow velocities observed in the midlatitude polar magnetosphere with the Dynamics Explorer 1 and Akebono spacecraft.
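    For concreteness, the two parallel-force expressions quoted in the abstract can be evaluated numerically. The sketch below uses illustrative values for the ion mass, convection field, ionospheric magnetic field, and exobase radius; none of these numbers are taken from the paper.

```python
# Numeric sketch of the two centrifugal parallel-force expressions (SI units).
# All sample values below are illustrative assumptions, not from the paper.

def f_pole(m, e_i, b_i, r, r_i):
    """Parallel force for meridional convection across the pole."""
    return 1.5 * m * (e_i / b_i) ** 2 * r ** 2 / r_i ** 3

def f_long(m, e_i, b_i, r, r_i, l_shell):
    """Parallel force for longitudinal convection along a constant L shell."""
    x = r / (r_i * l_shell)
    return f_pole(m, e_i, b_i, r, r_i) * (1 - x ** 1.5) / (1 - 0.75 * x) ** 2.5

m_oxygen = 16 * 1.67e-27          # O+ mass, kg (assumed)
e_iono, b_iono = 50e-3, 5.5e-5    # 50 mV/m field, ~0.55 G ionospheric B
r_iono = 6.77e6                   # exobase geocentric radius, m (illustrative)
force = f_pole(m_oxygen, e_iono, b_iono, 3 * 6.37e6, r_iono)  # at ~3 R_E
```

    Note the r^2 dependence of F_cent,pole: doubling the geocentric distance quadruples the parallel force, which is why the acceleration becomes significant only at high altitude.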

  1. Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamic and aeroacoustic properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades, and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message-passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP-2. The resultant far-field acoustic field is analyzed with state-of-the-art audio and video rendering of the propagating acoustic signals.

  2. The linearly scaling 3D fragment method for large scale electronic structure calculations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhao, Zhengji; Meza, Juan; Lee, Byounghak

    2009-07-28

    The Linearly Scaling three-dimensional fragment (LS3DF) method is an O(N) ab initio electronic structure method for large-scale nano material simulations. It is a divide-and-conquer approach with a novel patching scheme that effectively cancels out the artificial boundary effects, which exist in all divide-and-conquer schemes. This method has made ab initio simulations of thousand-atom nanosystems feasible in a couple of hours, while retaining essentially the same accuracy as the direct calculation methods. The LS3DF method won the 2008 ACM Gordon Bell Prize for algorithm innovation. Our code has reached 442 Tflop/s running on 147,456 processors on the Cray XT5 (Jaguar) at OLCF, has been run on 163,840 processors on the Blue Gene/P (Intrepid) at ALCF, and has been applied to a system containing 36,000 atoms. In this paper, we will present the recent parallel performance results of this code, and will apply the method to asymmetric CdSe/CdS core/shell nanorods, which have potential applications in electronic devices and solar cells.

  3. The Linearly Scaling 3D Fragment Method for Large Scale Electronic Structure Calculations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhao, Zhengji; Meza, Juan; Lee, Byounghak

    2009-06-26

    The Linearly Scaling three-dimensional fragment (LS3DF) method is an O(N) ab initio electronic structure method for large-scale nano material simulations. It is a divide-and-conquer approach with a novel patching scheme that effectively cancels out the artificial boundary effects, which exist in all divide-and-conquer schemes. This method has made ab initio simulations of thousand-atom nanosystems feasible in a couple of hours, while retaining essentially the same accuracy as the direct calculation methods. The LS3DF method won the 2008 ACM Gordon Bell Prize for algorithm innovation. Our code has reached 442 Tflop/s running on 147,456 processors on the Cray XT5 (Jaguar) at OLCF, has been run on 163,840 processors on the Blue Gene/P (Intrepid) at ALCF, and has been applied to a system containing 36,000 atoms. In this paper, we will present the recent parallel performance results of this code, and will apply the method to asymmetric CdSe/CdS core/shell nanorods, which have potential applications in electronic devices and solar cells.

  4. Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Candel, A; Kabel, A.; Lee, L.

    In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.

  5. Optimization of Supercomputer Use on EADS II System

    NASA Technical Reports Server (NTRS)

    Ahmed, Ardsher

    1998-01-01

    The main objective of this research was to optimize supercomputer use to achieve better throughput and utilization of supercomputers and to help facilitate the movement of non-supercomputing (inappropriate for supercomputer) codes to mid-range systems for better use of Government resources at Marshall Space Flight Center (MSFC). This work involved the survey of architectures available on EADS II and monitoring customer (user) applications running on a CRAY T90 system.

  6. The Phytotoxicity of Designated Pollutants

    DTIC Science & Technology

    1982-11-01

    lengths. Results paralleled emergence; a reduction of growth occurred when fuel was applied (Table 46). Shoots appeared equally stressed since root...1954, Mode of action of phytotoxic oils, Weeds 3:55-65. Williams, G. R., E. Cumins, A. C. Gardner, M. Palmier, and T. Rubidge, 1981, The growth of

  7. Mode Transitions in Hall Effect Thrusters

    DTIC Science & Technology

    2013-07-01

    [OCR-damaged nomenclature excerpt] M_b = number of pixels per bin; m = spoke order; m_e = electron mass, 9.11×10^-31 kg; m_i = Xe ion mass, 2.18×10^-25 kg; ... = periodogram spectral estimate, arb. Hz^-1; T_e = electron temperature, eV; T_e|| = electron temperature parallel to magnetic field, eV; ... = Fourier transform of x(t); ... = inverse angle from 2D DFT, deg^-1; ... = mean electron energy, eV; ε* = material-dependent cross-over energy, eV

  8. Ada Compiler Validation Summary Report: Certificate Number: 901112W1. 11116 Cray Research, Inc., Cray Ada Compiler, Release 2.0, Cray X-MP/EA (Host & Target)

    DTIC Science & Technology

    1990-11-12

    This feature prevents any significant unexpected and undesired size overhead introduced by the automatic inlining of a called subprogram. ...PRESERVELAYOUT forces the compiler to maintain the Ada source order of a given record type, thereby preventing the compiler from performing this... Environment, Volume 2: Programming Guide ...assignments to the copied array in Ada do not affect the Fortran version of the array. The dimensions and order of

  9. Y-MP floating point and Cholesky factorization

    NASA Technical Reports Server (NTRS)

    Carter, Russell

    1991-01-01

    The floating-point arithmetics implemented in the Cray 2 and Cray Y-MP computer systems are nearly identical, but large-scale computations performed on the two systems have exhibited significant differences in accuracy. The difference in accuracy is analyzed for the Cholesky factorization algorithm, and it is found that the source of the difference is the subtract magnitude operation of the Cray Y-MP. Results from numerical experiments for a range of problem sizes are presented, along with an efficient method for improving the accuracy of the factorization obtained on the Y-MP.
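    For reference, the computation under study is the standard Cholesky factorization; a generic textbook sketch follows (this is not Carter's Y-MP-specific code). The running inner-product subtraction is the step where the Y-MP's subtract magnitude operation enters, and hence where the two machines' results can diverge.

```python
# Textbook Cholesky factorization: lower-triangular L with A = L L^T.
# Generic illustration only; the inner-product subtraction forming `s`
# is the accuracy-critical step discussed in the abstract.
import math

def cholesky(a):
    """Return the lower-triangular Cholesky factor of SPD matrix `a`."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # subtract the accumulated inner product from a[i][j]
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l
```

    Because each entry of L feeds later subtractions, a small rounding bias in the subtract operation compounds across the factorization, which is consistent with the growing accuracy gap the paper analyzes.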

  10. A Metascalable Computing Framework for Large Spatiotemporal-Scale Atomistic Simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nomura, K; Seymour, R; Wang, W

    2009-02-17

    A metascalable (or 'design once, scale on new architectures') parallel computing framework has been developed for large spatiotemporal-scale atomistic simulations of materials based on spatiotemporal data locality principles, which is expected to scale on emerging multipetaflops architectures. The framework consists of: (1) an embedded divide-and-conquer (EDC) algorithmic framework based on spatial locality to design linear-scaling algorithms for high-complexity problems; (2) a space-time-ensemble parallel (STEP) approach based on temporal locality to predict long-time dynamics, while introducing multiple parallelization axes; and (3) a tunable hierarchical cellular decomposition (HCD) parallelization framework to map these O(N) algorithms onto a multicore cluster based on a hybrid implementation combining message passing and critical-section-free multithreading. The EDC-STEP-HCD framework exposes maximal concurrency and data locality, thereby achieving: (1) inter-node parallel efficiency well over 0.95 for 218 billion-atom molecular-dynamics and 1.68 trillion electronic-degrees-of-freedom quantum-mechanical simulations on 212,992 IBM BlueGene/L processors (superscalability); (2) high intra-node multithreading parallel efficiency (nanoscalability); and (3) nearly perfect time/ensemble parallel efficiency (eon-scalability). The spatiotemporal scale covered by MD simulation on a sustained petaflops computer per day (i.e., petaflops·day of computing) is estimated as NT = 2.14 (e.g., N = 2.14 million atoms for T = 1 microsecond).

  11. Accelerating lattice QCD simulations with 2 flavors of staggered fermions on multiple GPUs using OpenACC: a first attempt

    NASA Astrophysics Data System (ADS)

    Gupta, Sourendu; Majumdar, Pushan

    2018-07-01

    We present the results of an effort to accelerate a Rational Hybrid Monte Carlo (RHMC) program for lattice quantum chromodynamics (QCD) simulation for 2 flavors of staggered fermions on multiple Kepler K20X GPUs distributed on different nodes of a Cray XC30. We do not use CUDA but adopt a higher level directive based programming approach using the OpenACC platform. The lattice QCD algorithm is known to be bandwidth bound; our timing results illustrate this clearly, and we discuss how this limits the parallelization gains. We achieve more than a factor three speed-up compared to the CPU only MPI program.
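    Being bandwidth bound means performance is governed by a roofline-style limit: the attainable flop rate is the smaller of peak compute and arithmetic intensity times memory bandwidth. A toy calculator for that bound; the K20X-class numbers in the usage line are rough illustrative figures, not measurements from the paper.

```python
# Roofline-style bound: a kernel cannot exceed either the peak compute rate
# or (flops per byte moved) x (memory bandwidth). Numbers are illustrative.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Upper bound on sustained Gflop/s for a kernel of given intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A low-intensity lattice kernel on a K20X-class GPU (~1310 DP Gflop/s peak,
# ~250 GB/s memory bandwidth) is capped by bandwidth, not compute:
bound = attainable_gflops(1310.0, 250.0, 1.0)
```

    This is why adding GPUs helps mainly by adding aggregate memory bandwidth, and why the authors see parallelization gains limited by data movement rather than arithmetic.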

  12. Critical role of electron heat flux on Bohm criterion

    DOE PAGES

    Tang, Xianzhu; Guo, Zehua

    2016-12-05

    The Bohm criterion, originally derived for an isothermal-electron and cold-ion plasma, is often used as a rule of thumb for more general plasmas. Here, we establish a more precise determination of the Bohm criterion that is quantitatively useful for understanding and modeling collisional plasmas that still have a collisional mean free path much greater than the plasma Debye length. Specifically, it is shown that electron heat flux, rather than the isothermal electron assumption, is what sets the Bohm speed to be √(k_B(T_e∥ + 3T_i∥)/m_i), with T_e,i∥ the electron and ion parallel temperature at the sheath entrance and m_i the ion mass.
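    The quoted Bohm speed is easy to evaluate; in the cold-ion limit (T_i∥ = 0) it reduces to the classic isothermal result √(k_B T_e/m_i). A quick numeric sketch with illustrative temperatures, not values from the paper.

```python
# Generalized Bohm speed sqrt(k_B(T_e|| + 3 T_i||)/m_i). Temperatures are
# taken in eV, so k_B*T is just T times the elementary charge in joules.
# The 10 eV / 1 eV hydrogen example is illustrative, not from the paper.
import math

EV = 1.602176634e-19        # joules per eV
M_PROTON = 1.67262192e-27   # kg

def bohm_speed(te_par_ev, ti_par_ev, m_i=M_PROTON):
    """Generalized Bohm speed in m/s; parallel temperatures in eV."""
    return math.sqrt(EV * (te_par_ev + 3.0 * ti_par_ev) / m_i)

v = bohm_speed(10.0, 1.0)   # hydrogen, Te|| = 10 eV, Ti|| = 1 eV
```

    The 3T_i∥ term means warm ions raise the sheath-entrance speed above the isothermal estimate, which is the quantitative correction the paper attributes to the electron heat flux treatment.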

  13. Critical role of electron heat flux on Bohm criterion

    NASA Astrophysics Data System (ADS)

    Tang, Xian-Zhu; Guo, Zehua

    2016-12-01

    The Bohm criterion, originally derived for an isothermal-electron and cold-ion plasma, is often used as a rule of thumb for more general plasmas. Here, we establish a more precise determination of the Bohm criterion that is quantitatively useful for understanding and modeling collisional plasmas that still have a collisional mean free path much greater than the plasma Debye length. Specifically, it is shown that electron heat flux, rather than the isothermal electron assumption, is what sets the Bohm speed to be √(k_B(T_e∥ + 3T_i∥)/m_i), with T_e,i∥ the electron and ion parallel temperature at the sheath entrance and m_i the ion mass.

  14. Parallel Performance Optimizations on Unstructured Mesh-based Simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sarje, Abhinav; Song, Sukhyun; Jacobsen, Douglas

    2015-01-01

    This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh and develops methods to generate mesh partitions with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data from runs on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

  15. Thought Leaders during Crises in Massive Social Networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Corley, Courtney D.; Farber, Robert M.; Reynolds, William

    The vast amount of social media data that can be gathered from the internet, coupled with workflows that utilize both commodity systems and massively parallel supercomputers such as the Cray XMT, opens new vistas for research to support health, defense, and national security. Computer technology now enables the analysis of graph structures containing more than 4 billion vertices joined by 34 billion edges, along with metrics and massively parallel algorithms that exhibit near-linear scalability with the number of processors. The challenge lies in making this massive data and analysis comprehensible to analysts and end users who require actionable knowledge to carry out their duties. Simply stated, we have developed language- and content-agnostic techniques to reduce large graphs built from vast media corpora into forms people can understand. Specifically, our tools and metrics act as a survey tool to identify 'thought leaders': those members that lead or reflect the thoughts and opinions of an online community, independent of the source language.

  16. Defining the Transfer Functions of the PCAD Model in North Atlantic Right Whales (Eubalaena glacialis) - Retrospective Analyses of Existing Data

    DTIC Science & Technology

    2012-09-30

    potentially providing information on nutritional state and chronic stress (Wasser et al., 2010). We tested both T3 and T4 assays for parallelism. The...EconPapers.RePEc.org/RePEc:inn:wpaper:2011-20 Hunt K.E., Rolland R.M., Kraus S.D., Wasser S.K. 2006. Analysis of fecal glucocorticoids in the North Atlantic right...version 1.2.5. http://CRAN.R-project.org/package=doMC Rolland R.M., Hunt K.E., Kraus S.D., Wasser S.K. 2005. Assessing reproductive status of right

  17. Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Biswas, Rupak

    2003-01-01

    This paper presents a detailed performance analysis of a multi-block overset grid computational fluid dynamics application on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse- and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SMP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single-system-image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.

  18. Unsteady Flow Field Measurements Using LDV (Laser Doppler Velocimetry).

    DTIC Science & Technology

    1987-12-01

    data and digitized velocity data from the LDV signal processors were channeled to a 3D-LDV Computer Interface (CI). The CI, multiplexing the inputs...beam alignment...steering wedges, within the Bragg cell modules serves to bring the beams back to parallel...INITIALS DESCRIPTION C 07/26/83 TML Adapted from DAPNT C 12/12/85 CLH Modified to print results in either octal or integer. C 02/25/87 GBG Modified

  19. Checking Equivalence of SPMD Programs Using Non-Interference

    DTIC Science & Technology

    2010-01-29

    with it hopes to go beyond the limits of Moore’s law, but also worries that programming will become harder [5]. One of the reasons why parallel...array name in G or L, and e is an arithmetic expression of integer type. In the CUDA code shown in Section 3, b and t are represented by coreId and...b+ t. A second, optimized version of the program (using function “reverse2”, see Section 3) can be modeled as a tuple P2 = ( G ,L2, F 2), with G same

  20. Using Strassen's algorithm to accelerate the solution of linear systems

    NASA Technical Reports Server (NTRS)

    Bailey, David H.; Lee, King; Simon, Horst D.

    1990-01-01

    Strassen's algorithm for fast matrix-matrix multiplication has been implemented for matrices of arbitrary shapes on the CRAY-2 and CRAY Y-MP supercomputers. Several techniques have been used to reduce the scratch space requirement for this algorithm while simultaneously preserving a high level of performance. When the resulting Strassen-based matrix multiply routine is combined with some routines from the new LAPACK library, LU decomposition can be performed with rates significantly higher than those achieved by conventional means. We succeeded in factoring a 2048 x 2048 matrix on the CRAY Y-MP at a rate equivalent to 325 MFLOPS.
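    For readers unfamiliar with the recursion the abstract builds on, the following is a minimal sketch of Strassen's seven-multiplication scheme for square matrices of power-of-two order; the paper's arbitrary-shape handling and scratch-space optimizations are not reproduced here.

```python
# Illustrative sketch of Strassen's 7-multiplication recursion for square
# matrices whose size is a power of two (the paper handles arbitrary
# shapes and reduces scratch space, which this toy version does not).

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B, cutoff=2):
    n = len(A)
    if n <= cutoff:  # fall back to the classical O(n^3) product
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2
    sub = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    A11, A12, A21, A22 = sub(A, 0, 0), sub(A, 0, h), sub(A, h, 0), sub(A, h, h)
    B11, B12, B21, B22 = sub(B, 0, 0), sub(B, 0, h), sub(B, h, 0), sub(B, h, h)
    # The seven recursive products that replace the classical eight.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_sub(mat_add(M1, M3), M2), M6)
    # Reassemble the four quadrants into the result.
    C = [[0] * n for _ in range(n)]
    for i in range(h):
        C[i][:h], C[i][h:] = C11[i], C12[i]
        C[i + h][:h], C[i + h][h:] = C21[i], C22[i]
    return C
```

    In production, as in the abstract, the recursion bottoms out at a tuned cutoff where a classical (vectorized) multiply is faster; the asymptotic win comes from seven rather than eight half-size products per level.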

  1. Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias; ...

    2006-01-01

    The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem, and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmark (IMB) results to study the performance of 11 MPI communication functions on these systems.

  2. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

    NASA Astrophysics Data System (ADS)

    Hariri, F.; Tran, T. M.; Jocksch, A.; Lanti, E.; Progsch, J.; Messmer, P.; Brunner, S.; Gheller, C.; Villard, L.

    2016-10-01

    We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphics Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the one on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.
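    The "basic steps of the PIC algorithm" that PIC_ENGINE packages can be sketched in a few lines. Below is a hedged, purely illustrative 1D electrostatic version of the two core kernels, charge deposition (scatter) and the particle push (gather plus leapfrog update); all function names are ours, and none of the OpenACC/GPU aspects are shown.

```python
# Minimal, illustrative sketch of two PIC kernels: cloud-in-cell (CIC)
# charge deposition onto a periodic 1D grid, and a leapfrog particle push
# that gathers the field back at particle positions. Names are ours.

def deposit(xs, ng, dx, q=1.0):
    """Scatter each particle's charge to its two nearest grid points."""
    rho = [0.0] * ng
    for x in xs:
        cell = int(x / dx) % ng
        frac = x / dx - int(x / dx)          # linear (CIC) weight
        rho[cell] += q * (1.0 - frac) / dx
        rho[(cell + 1) % ng] += q * frac / dx
    return rho

def push(xs, vs, E, dx, dt, qm=1.0, L=1.0):
    """Leapfrog update: gather E at the particle, kick velocity, drift."""
    for i in range(len(xs)):
        cell = int(xs[i] / dx) % len(E)
        frac = xs[i] / dx - int(xs[i] / dx)
        Ei = (1.0 - frac) * E[cell] + frac * E[(cell + 1) % len(E)]
        vs[i] += qm * Ei * dt                # velocity kick
        xs[i] = (xs[i] + vs[i] * dt) % L     # periodic position drift
    return xs, vs
```

    A full PIC step also solves a field equation on the grid between deposition and push; the abstract's point is that the scatter/gather data structures above are exactly what must be reorganized for GPU performance.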

  3. Lanczos eigensolution method for high-performance computers

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.

    1991-01-01

    The theory, computational analysis, and applications of a Lanczos algorithm on high-performance computers are presented. The computationally intensive steps of the algorithm are identified as the matrix factorization, the forward/backward equation solution, and the matrix-vector multiplications. These computational steps are optimized to exploit the vector and parallel capabilities of high-performance computers. The savings in computational time from applying optimization techniques such as variable-band and sparse data storage and access, loop unrolling, use of local memory, and compiler directives are presented. Two large-scale structural analysis applications are described: the buckling of a composite blade-stiffened panel with a cutout, and the vibration analysis of a high-speed civil transport. The sequential computational time of 181.6 seconds for the panel problem executed on a CONVEX computer was decreased to 14.1 seconds with the optimized vector algorithm. The best computational time of 23 seconds for the transport problem, with 17,000 degrees of freedom, was obtained on the Cray Y-MP using an average of 3.63 processors.
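    The kernels the abstract identifies (matrix-vector products driving a three-term recurrence) are visible even in a toy Lanczos iteration. A hedged pure-Python sketch, without the banded/sparse storage or vectorization the paper describes:

```python
# Sketch of the Lanczos three-term recurrence on a small dense symmetric
# matrix. The recurrence builds an orthonormal basis V and tridiagonal
# coefficients (alphas, betas) whose eigenvalues approximate those of A.
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos(A, v0, m):
    """Return (alphas, betas, V) after up to m Lanczos steps."""
    n = len(v0)
    norm = math.sqrt(dot(v0, v0))
    v = [x / norm for x in v0]
    V, alphas, betas = [v], [], []
    v_prev = [0.0] * n
    beta = 0.0
    for _ in range(m):
        w = matvec(A, v)                     # the dominant kernel
        alpha = dot(w, v)
        # three-term recurrence: orthogonalize against v and v_prev
        w = [wi - alpha * vi - beta * pi
             for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:                     # invariant subspace found
            break
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
        V.append(v)
    return alphas, betas, V
```

    In exact arithmetic the V vectors are orthonormal and the trace of the tridiagonal matrix matches that of A when the recurrence runs to completion; production codes add reorthogonalization and the optimized storage schemes the abstract lists.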

  4. Space Radar Image of Mammoth Mountain, California

    NASA Image and Video Library

    1999-05-01

    These two false-color composite images of the Mammoth Mountain area in the Sierra Nevada Mountains, Calif., show significant seasonal changes in snow cover. The image at left was acquired by the Spaceborne Imaging Radar-C and X-band Synthetic Aperture Radar aboard the space shuttle Endeavour on its 67th orbit on April 13, 1994. The image is centered at 37.6 degrees north latitude and 119 degrees west longitude. The area is about 36 kilometers by 48 kilometers (22 miles by 29 miles). In this image, red is L-band (horizontally transmitted and vertically received) polarization data; green is C-band (horizontally transmitted and vertically received) polarization data; and blue is C-band (horizontally transmitted and received) polarization data. The image at right was acquired on October 3, 1994, on the space shuttle Endeavour's 67th orbit of the second radar mission. Crowley Lake appears dark at the center left of the image, just above or south of Long Valley. The Mammoth Mountain ski area is visible at the top right of the scene. The red areas correspond to forests, the dark blue areas are bare surfaces and the green areas are short vegetation, mainly brush. The changes in color tone at the higher elevations (e.g. the Mammoth Mountain ski area) from green-blue in April to purple in September reflect changes in snow cover between the two missions. The April mission occurred immediately following a moderate snow storm. During the mission the snow evolved from a dry, fine-grained snowpack with few distinct layers to a wet, coarse-grained pack with multiple ice inclusions. Since that mission, all snow in the area has melted except for small glaciers and permanent snowfields on the Silver Divide and near the headwaters of Rock Creek. On October 3, 1994, only discontinuous patches of snow cover were present at very high elevations following the first snow storm of the season on September 28, 1994. 
For investigations in hydrology and land-surface climatology, seasonal snow cover and alpine glaciers are critical to the radiation and water balances. SIR-C/X-SAR is a powerful tool because it is sensitive to most snowpack conditions and is less influenced by weather conditions than other remote sensing instruments, such as Landsat. In parallel with the operational SIR-C data processing, an experimental effort is being conducted to test SAR data processing using the Jet Propulsion Laboratory's massively parallel supercomputing facility, centered around the Cray Research T3D. These experiments will assess the abilities of large supercomputers to produce high throughput SAR processing in preparation for upcoming data-intensive SAR missions. The images released here were produced as part of this experimental effort. http://photojournal.jpl.nasa.gov/catalog/PIA01753

  5. Compute Server Performance Results

    NASA Technical Reports Server (NTRS)

    Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)

    1994-01-01

    Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single-processor rate of any vendor. However, if the price-performance ratio (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPRs of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite.
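    The price-performance ratio (PPR) used above is simply sustained FLOPS divided by price, so the quoted figures also imply per-processor prices. A small sketch using the abstract's C90 numbers; the derived price is our back-of-envelope inference, not a figure stated in the abstract.

```python
# PPR = sustained FLOPS / price, so a quoted PPR plus a sustained rate
# implies a price. The dollar figure below is derived by us for
# illustration, not stated in the original abstract.

def implied_price(mflops, ppr_flops_per_dollar):
    """Back out the price implied by a sustained rate and a PPR."""
    return mflops * 1.0e6 / ppr_flops_per_dollar

# C90: 460 MFLOPS at 160 FLOPS/dollar -> roughly $2.9M per processor.
c90_price = implied_price(460, 160)
```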

  6. OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms

    DOE PAGES

    Meng, Zhaoyi; Koniges, Alice; He, Yun Helen; ...

    2016-09-21

    In this paper, we investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data classification algorithms in serial mode. The methods leverage the Nystrom extension to calculate eigenvalues/eigenvectors of the graph Laplacian, and this is a self-contained module that can be used in conjunction with other graph-Laplacian-based methods such as spectral clustering. We use performance tools to collect the hotspots and memory access patterns of the serial codes and use OpenMP as the parallelization language to parallelize the most time-consuming parts. Where possible, we also use library routines. We then optimize the OpenMP implementations and detail the performance on traditional supercomputer nodes (in our case a Cray XC30), and test the optimization steps on emerging testbed systems based on Intel's Knights Corner and Knights Landing processors. We show both performance improvement and strong scaling behavior. Finally, a large number of optimization techniques and analyses are necessary before the algorithm reaches almost ideal scaling.

  7. A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs

    DOE PAGES

    Azad, Ariful; Buluç, Aydın

    2016-05-16

    We describe parallel algorithms for computing maximal cardinality matching in a bipartite graph on distributed-memory systems. Unlike traditional algorithms that match one vertex at a time, our algorithms process many unmatched vertices simultaneously using a matrix-algebraic formulation of maximal matching. This generic matrix-algebraic framework is used to develop three efficient maximal matching algorithms with minimal changes. The newly developed algorithms have two benefits over existing graph-based algorithms. First, unlike existing parallel algorithms, the cardinality of matching obtained by the new algorithms stays constant with increasing processor counts, which is important for predictable and reproducible performance. Second, relying on bulk-synchronous matrix operations, these algorithms expose a higher degree of parallelism on distributed-memory platforms than existing graph-based algorithms. We report high-performance implementations of three maximal matching algorithms using hybrid OpenMP-MPI and evaluate the performance of these algorithms using more than 35 real and randomly generated graphs. On real instances, our algorithms achieve up to 200× speedup on 2048 cores of a Cray XC30 supercomputer. Even higher speedups are obtained on larger synthetically generated graphs, where our algorithms show good scaling on up to 16,384 cores.
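    The key idea, processing many unmatched vertices per bulk-synchronous step rather than one at a time, can be imitated serially. The sketch below is our own round-based approximation, not the paper's matrix-algebraic formulation: each round, all active rows propose to an unmatched column at once, each column accepts one proposer, and rounds repeat until the matching is maximal.

```python
# Serial sketch of round-based maximal (not maximum) matching in a
# bipartite graph, mimicking the bulk-synchronous flavor of the paper's
# algorithms. adj maps each row vertex to its list of column neighbors.

def maximal_matching(adj):
    """Return a row -> column map that is maximal (no augmentable edge)."""
    match_row, match_col = {}, {}
    active = set(range(len(adj)))
    while active:
        # proposal phase: every active row picks an unmatched neighbor
        proposals = {}
        for r in sorted(active):
            for c in adj[r]:
                if c not in match_col:
                    proposals.setdefault(c, r)  # column keeps first proposer
                    break
        if not proposals:
            break
        # acceptance phase: matched rows leave the active set
        for c, r in proposals.items():
            match_row[r] = c
            match_col[c] = r
            active.discard(r)
        # rows with no unmatched neighbor left also drop out
        active = {r for r in active
                  if any(c not in match_col for c in adj[r])}
    return match_row
```

    Losers of a round stay active and retry, so each round can match many vertices at once; the paper expresses the same proposal/acceptance phases as sparse matrix operations to expose distributed-memory parallelism.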

  8. Schemas in Problem Solving: An Integrated Model of Learning, Memory, and Instruction

    DTIC Science & Technology

    1992-01-01

    reflected in the title of a recent article: "Hybrid Computation in Cognitive Science: Neural Networks and Symbols" (J. A. Anderson, 1990). And, Marvin Minsky...Rumelhart, D. E. (1989). Explorations in parallel distributed processing: A handbook of models, programs, and exercises. Cambridge, MA: The MIT Press. Minsky

  9. Time-Resolved 3D Quantitative Flow MRI of the Major Intracranial Vessels: Initial Experience and Comparative Evaluation at 1.5T and 3.0T in Combination With Parallel Imaging

    PubMed Central

    Bammer, Roland; Hope, Thomas A.; Aksoy, Murat; Alley, Marcus T.

    2012-01-01

    Exact knowledge of blood flow characteristics in the major cerebral vessels is of great relevance for diagnosing cerebrovascular abnormalities. This involves the assessment of hemodynamically critical areas as well as the derivation of biomechanical parameters such as wall shear stress and pressure gradients. A time-resolved, 3D phase-contrast (PC) MRI method using parallel imaging was implemented to measure blood flow in three dimensions at multiple instances over the cardiac cycle. The 4D velocity data obtained from 14 healthy volunteers were used to investigate dynamic blood flow with the use of multiplanar reformatting, 3D streamlines, and 4D particle tracing. In addition, the effects of magnetic field strength, parallel imaging, and temporal resolution on the data were investigated in a comparative evaluation at 1.5T and 3T using three different parallel imaging reduction factors and three different temporal resolutions in eight of the 14 subjects. Studies were consistently performed faster at 3T than at 1.5T because of better parallel imaging performance. A high temporal resolution (65 ms) was required to follow dynamic processes in the intracranial vessels. The 4D flow measurements provided a high degree of vascular conspicuity. Time-resolved streamline analysis provided features that have not been reported previously for the intracranial vasculature. PMID:17195166

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yee, Seonghwan, E-mail: Seonghwan.Yee@Beaumont.edu; Gao, Jia-Hong

    Purpose: To investigate whether the direction of the spin-lock field, either parallel or antiparallel to the rotating magnetization, has any effect on the spin-lock MRI signal, and further on the quantitative measurement of T1ρ, in a clinical 3 T MRI system. Methods: The effects of inverted spin-lock field direction were investigated by acquiring a series of spin-lock MRI signals for an American College of Radiology MRI phantom while the spin-lock field direction was switched between the parallel and antiparallel directions. The acquisition was performed for different spin-locking methods (i.e., for the single- and dual-field spin-locking methods) and for different levels of clinically feasible spin-lock field strength, ranging from 100 to 500 Hz, while the spin-lock duration was varied in the range from 0 to 100 ms. Results: When the spin-lock field was inverted into the antiparallel direction, the rate of MRI signal decay was altered, and the T1ρ value, when compared to the value for the parallel field, was clearly different. Different degrees of such direction-dependency were observed for different spin-lock field strengths. In addition, the dependency was much smaller when the parallel and antiparallel fields were mixed together in the dual-field method. Conclusions: The spin-lock field direction could impact the MRI signal, and further the T1ρ measurement, in a clinical MRI system.
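    T1ρ quantification of the kind described above fits a mono-exponential decay S(TSL) = S0·exp(−TSL/T1ρ) to signals acquired at several spin-lock durations TSL. A hedged log-linear least-squares sketch on synthetic data; the function names and numbers are ours, not the paper's protocol.

```python
# Illustrative T1rho fit: take logs of the mono-exponential decay
# S(TSL) = S0 * exp(-TSL / T1rho) and solve the resulting straight line
# by ordinary least squares. Synthetic noiseless data for demonstration.
import math

def fit_t1rho(tsl_ms, signal):
    """Return (S0, T1rho_ms) from a log-linear least-squares fit."""
    ys = [math.log(s) for s in signal]
    n = len(tsl_ms)
    mx = sum(tsl_ms) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(tsl_ms, ys))
             / sum((x - mx) ** 2 for x in tsl_ms))
    return math.exp(my - slope * mx), -1.0 / slope

# Synthetic decay with S0 = 100 and T1rho = 80 ms.
tsl = [0, 20, 40, 60, 80, 100]
sig = [100.0 * math.exp(-t / 80.0) for t in tsl]
S0, t1rho = fit_t1rho(tsl, sig)
```

    The paper's observation is that inverting the spin-lock field direction changes the measured decay rate, i.e., the fitted T1ρ, which a fit like this would expose as a different slope.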

  11. Attaching IBM-compatible 3380 disks to Cray X-MP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Engert, D.E.; Midlock, J.L.

    1989-01-01

    A method of attaching IBM-compatible 3380 disks directly to a Cray X-MP via the XIOP with a BMC is described. The IBM 3380 disks appear to the UNICOS operating system as DD-29 disks with UNICOS file systems. IBM 3380 disks provide cheap, reliable, large-capacity disk storage. Combined with a small number of high-speed Cray disks, the IBM disks provide the bulk of the storage for small files and infrequently used files. Cray Research designed the BMC and its supporting software in the XIOP to allow IBM tapes and other devices to be attached to the X-MP. No hardware changes were necessary, and we added less than 2000 lines of code to the XIOP to accomplish this project. This system has been in operation for over eight months. Future enhancements such as the use of a cache controller and attachment to a Y-MP are also described.

  12. Differential clinical efficacy of anti-CD4 monoclonal antibodies in rat adjuvant arthritis is paralleled by differential influence on NF-κB binding activity and TNF-α secretion of T cells

    PubMed Central

    Pohlers, Dirk; Schmidt-Weber, Carsten B; Franch, Angels; Kuhlmann, Jürgen; Bräuer, Rolf; Emmrich, Frank; Kinne, Raimund W

    2002-01-01

    The aim of this study was to analyze the differential effects of three anti-CD4 monoclonal antibodies (mAbs) (with distinct epitope specificities) in the treatment of rat adjuvant arthritis (AA) and on T-cell function and signal transduction. Rat AA was preventively treated by intraperitoneal injection of the anti-CD4 mAbs W3/25, OX35, and RIB5/2 (on days -1, 0, 3, and 6, i.e. 1 day before AA induction, on the day of induction [day 0], and thereafter). The effects on T-cell reactivity in vivo (delayed-type hypersensitivity), ex vivo (ConA-induced proliferation), and in vitro (mixed lymphocyte culture) were assessed. The in vitro effects of anti-CD4 preincubation on T-cell receptor (TCR)/CD3-induced cytokine production and signal transduction were also analyzed. While preventive treatment with OX35 and W3/25 significantly ameliorated AA from the onset, treatment with RIB5/2 even accelerated the onset of AA by approximately 2 days (day 10), and ameliorated the arthritis only in the late phase (day 27). Differential clinical effects at the onset of AA were paralleled by a differential influence of the mAbs on T-cell functions, i.e. in comparison with OX35 and W3/25, the 'accelerating' mAb RIB5/2 failed to increase the delayed-type hypersensitivity (DTH) to Mycobacterium tuberculosis, increased the in vitro tumor necrosis factor (TNF)-α secretion, and more strongly induced NF-κB binding activity after anti-CD4 preincubation and subsequent TCR/CD3 stimulation. Depending on their epitope specificity, different anti-CD4 mAbs differentially influence individual proinflammatory functions of T cells. This fine regulation may explain the differential efficacy in the treatment of AA and may contribute to the understanding of such treatments in other immunopathologies. PMID:12010568

  13. Ionospheric Disturbances during the Period 30 April to 5 May 1976.

    DTIC Science & Technology

    1979-05-01

    Data 1977, IUGG Publications Office, 39 Ter Rue Gay-Lussac, Paris, 1977. Ichinose, T., and T. Oagawa, HF Doppler observations associated with McMath...electric field E can be separated into components parallel and perpendicular to the magnetic field, E∥ and E⊥. Ohm's law, relating current density i to the

  14. Advanced Boundary Electrode Modeling for tES and Parallel tES/EEG.

    PubMed

    Pursiainen, Sampsa; Agsten, Britte; Wagner, Sven; Wolters, Carsten H

    2018-01-01

    This paper explores advanced electrode modeling in the context of separate and parallel transcranial electrical stimulation (tES) and electroencephalography (EEG) measurements. We focus on boundary condition-based approaches that do not necessitate adding auxiliary elements, e.g., sponges, to the computational domain. In particular, we investigate the complete electrode model (CEM), which incorporates a detailed description of the skin-electrode interface including its contact surface, impedance, and normal current distribution. The CEM can be applied for both tES and EEG electrodes, which is advantageous when a parallel system is used. In comparison to the CEM, we test two important reduced approaches: the gap model (GAP) and the point electrode model (PEM). We aim to find out the differences between these approaches for a realistic numerical setting based on the stimulation of the auditory cortex. The results obtained suggest, among other things, that GAP and GAP/PEM are sufficiently accurate for the practical application of tES and parallel tES/EEG, respectively. Differences between CEM and GAP were observed mainly in the skin compartment, where only the CEM explains the heating effects characteristic of tES.

  15. Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gorentla Venkata, Manjunath; Shamis, Pavel; Graham, Richard L

    2013-01-01

    Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI Allreduce and MPI Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of this hierarchy. Using this design, we implement MPI Allreduce and MPI Reduce operations (and their nonblocking variants MPI Iallreduce and MPI Ireduce) for all message sizes, and evaluate them on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, which is a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%. The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
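    The hierarchical design described above can be caricatured in a few lines: reduce within each node first (over fast shared memory), then reduce the much smaller set of per-node partials across nodes. A serial toy model, with illustrative group sizes; real implementations like Cheetah overlap and independently progress each level.

```python
# Serial toy model of a two-level hierarchical sum-allreduce: stage 1
# reduces within each "node" of procs_per_node processes; stage 2 reduces
# the per-node partials across nodes, so only one value per node crosses
# the (slower) inter-node network. Sizes and values are illustrative.

def hierarchical_allreduce(values, procs_per_node):
    """Return the sum every process would receive after both stages."""
    # stage 1: intra-node reduction to one partial sum per node
    partials = [sum(values[i:i + procs_per_node])
                for i in range(0, len(values), procs_per_node)]
    # stage 2: inter-node reduction over the much smaller partial list,
    # whose result is then broadcast back down the hierarchy
    return sum(partials)

# 8 processes on 2 nodes: only 2 values cross the inter-node network.
result = hierarchical_allreduce(list(range(8)), procs_per_node=4)
```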

  16. Comparing the Performance of Blue Gene/Q with Leading Cray XE6 and InfiniBand Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kerbyson, Darren J.; Barker, Kevin J.; Vishnu, Abhinav

    2013-01-21

    Three types of systems dominate the current High Performance Computing landscape: the Cray XE6, the IBM Blue Gene, and commodity clusters using InfiniBand. These systems have quite different characteristics, making the choice for a particular deployment difficult. The XE6 uses Cray's proprietary Gemini 3-D torus interconnect with two nodes at each network endpoint. The latest IBM Blue Gene/Q uses a single socket integrating processor and communication in a 5-D torus network. InfiniBand provides the flexibility of using nodes from many vendors connected in many possible topologies. The performance characteristics of each vary vastly, along with their utilization models. In this work we compare the performance of these three systems using a combination of micro-benchmarks and a set of production applications. In particular, we discuss the causes of variability in performance across the systems and also quantify where performance is lost using a combination of measurements and models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.

  17. Wakefield Simulation of CLIC PETS Structure Using Parallel 3D Finite Element Time-Domain Solver T3P

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Candel, A.; Kabel, A.; Lee, L.

    In recent years, SLAC's Advanced Computations Department (ACD) has developed the parallel 3D Finite Element electromagnetic time-domain code T3P. Higher-order Finite Element methods on conformal unstructured meshes and massively parallel processing allow unprecedented simulation accuracy for wakefield computations and simulations of transient effects in realistic accelerator structures. Applications include simulation of wakefield damping in the Compact Linear Collider (CLIC) power extraction and transfer structure (PETS).

  18. Segmental Refinement: A Multigrid Technique for Data Locality

    DOE PAGES

    Adams, Mark F.; Brown, Jed; Knepley, Matt; ...

    2016-08-04

    In this paper, we investigate a domain-decomposed multigrid technique, termed segmental refinement, for solving general nonlinear elliptic boundary value problems. We extend the method, first proposed in 1994, by analytically and experimentally investigating its complexity. We confirm that communication of traditional parallel multigrid is eliminated on fine grids, with modest amounts of extra work and storage, while maintaining the asymptotic exactness of full multigrid. We observe an accuracy dependence on the segmental refinement subdomain size, which was not considered in the original analysis. Finally, we present a communication complexity analysis that quantifies the communication costs ameliorated by segmental refinement and report performance results with up to 64K cores on a Cray XC30.

  19. The Data Transfer Kit: A geometric rendezvous-based tool for multiphysics data transfer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Slattery, S. R.; Wilson, P. P. H.; Pawlowski, R. P.

    2013-07-01

    The Data Transfer Kit (DTK) is a software library designed to provide parallel data transfer services for arbitrary physics components based on the concept of geometric rendezvous. The rendezvous algorithm provides a means to geometrically correlate two geometric domains that may be arbitrarily decomposed in a parallel simulation. By repartitioning both domains such that they have the same geometric domain on each parallel process, efficient and load-balanced search operations and data transfer can be performed at a desirable algorithmic time complexity with low communication overhead relative to other types of mapping algorithms. With the increased development efforts in multiphysics simulation and other multiple mesh and geometry problems, generating parallel topology maps for transferring fields and other data between geometric domains is a common operation. The algorithms used to generate parallel topology maps based on the concept of geometric rendezvous as implemented in DTK are described with an example using a conjugate heat transfer calculation and thermal coupling with a neutronics code. In addition, we provide the results of initial scaling studies performed on the Jaguar Cray XK6 system at Oak Ridge National Laboratory for a worst-case-scenario problem in terms of algorithmic complexity that shows good scaling on O(1 × 10^4) cores for topology map generation and excellent scaling on O(1 × 10^5) cores for the data transfer operation with meshes of O(1 × 10^9) elements. (authors)
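    The geometric rendezvous idea can be illustrated by hashing entities from both domains into a common bin grid, so that entities sharing a bin can be correlated; in the parallel setting, the bins define the common repartition. A hedged 1D sketch below; names and point sets are ours, not DTK's API.

```python
# Sketch of the rendezvous concept: both domains map their entities into
# a shared uniform bin grid, and entities from either side that land in
# the same bin can be correlated for data transfer. 1D, serial, and
# purely illustrative -- DTK does this with parallel repartitioning.

def bin_points(points, nbins, lo=0.0, hi=1.0):
    """Map each (id, x) point to a bin index in a shared uniform grid."""
    width = (hi - lo) / nbins
    out = {}
    for pid, x in points:
        b = min(int((x - lo) / width), nbins - 1)  # clamp x == hi
        out.setdefault(b, []).append(pid)
    return out

def rendezvous(src_points, tgt_points, nbins=4):
    """Pair every occupied target bin with the source points it overlaps."""
    src_bins = bin_points(src_points, nbins)
    tgt_bins = bin_points(tgt_points, nbins)
    return {b: (src_bins.get(b, []), pids)
            for b, pids in tgt_bins.items()}

src = [("s0", 0.1), ("s1", 0.35), ("s2", 0.8)]
tgt = [("t0", 0.12), ("t1", 0.9)]
pairs = rendezvous(src, tgt)
```

    Because both sides agree on the bin grid, each side can independently decide which bins (and hence which remote processes) it must talk to, which is what keeps the communication overhead of the mapping low.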

  20. Plasma and Electro-energetic Physics

    DTIC Science & Technology

    2012-03-07

    Dynamical Equations (with complex surfaces): Relativistic Lorentz Force Law for relativistic momentum p and velocity u; Maxwell curl equations ∇×E = -(1/c)∂B/∂t and ∇×H = (4π/c)J + (1/c)∂D/∂t...0.1-1 s • 3D, high-fidelity, parallel modeling of high energy density fields and particles in complex geometry with some surface effects...cathodes (500 µm separation) Tang, AFRL/RD DISTRIBUTION A: Approved for public release; distribution is unlimited. ICEPIC simulations Equipotential

  1. Test and evaluation procedures for Sandia's Teraflops Operating System (TOS) on Janus.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnette, Daniel Wayne

    This report describes the test and evaluation methods by which the Teraflops Operating System, or TOS, that resides on Sandia's massively parallel computer Janus is verified for production release. Also discussed are methods used to build TOS before testing and evaluating, miscellaneous utility scripts, a sample test plan, and a proposed post-test method for quickly examining the large number of test results. The purpose of the report is threefold: (1) to provide a guide to T&E procedures, (2) to aid and guide others who will run T&E procedures on the new ASCI Red Storm machine, and (3) to document some of the history of evaluation and testing of TOS. This report is not intended to serve as an exhaustive manual for testers to conduct T&E procedures.

  2. Researchers Mine Information from Next-Generation Subsurface Flow Simulations

    DOE PAGES

    Gedenk, Eric D.

    2015-12-01

    A research team based at Virginia Tech University leveraged computing resources at the US Department of Energy's (DOE's) Oak Ridge National Laboratory to explore subsurface multiphase flow phenomena that can't be experimentally observed. Using the Cray XK7 Titan supercomputer at the Oak Ridge Leadership Computing Facility, the team took Micro-CT images of subsurface geologic systems and created two-phase flow simulations. The team's model development has implications for computational research pertaining to carbon sequestration, oil recovery, and contaminant transport.

  3. Massively parallel and linear-scaling algorithm for second-order Møller-Plesset perturbation theory applied to the study of supramolecular wires

    NASA Astrophysics Data System (ADS)

    Kjærgaard, Thomas; Baudin, Pablo; Bykov, Dmytro; Eriksen, Janus Juul; Ettenhuber, Patrick; Kristensen, Kasper; Larkin, Jeff; Liakh, Dmitry; Pawłowski, Filip; Vose, Aaron; Wang, Yang Min; Jørgensen, Poul

    2017-03-01

    We present a scalable cross-platform hybrid MPI/OpenMP/OpenACC implementation of the Divide-Expand-Consolidate (DEC) formalism with portable performance on heterogeneous HPC architectures. The Divide-Expand-Consolidate formalism is designed to reduce the steep computational scaling of conventional many-body methods employed in electronic structure theory to linear scaling, while providing a simple mechanism for controlling the error introduced by this approximation. Our massively parallel implementation of this general scheme has three levels of parallelism, being a hybrid of the loosely coupled task-based parallelization approach and the conventional MPI+X programming model, where X is either OpenMP or OpenACC. We demonstrate strong and weak scalability of this implementation on heterogeneous HPC systems, namely on the GPU-based Cray XK7 Titan supercomputer at the Oak Ridge National Laboratory. Using the "resolution of the identity second-order Møller-Plesset perturbation theory" (RI-MP2) as the physical model for simulating correlated electron motion, the linear-scaling DEC implementation is applied to 1-aza-adamantane-trione (AAT) supramolecular wires containing up to 40 monomers (2440 atoms, 6800 correlated electrons, 24 440 basis functions and 91 280 auxiliary functions). This represents the largest molecular system treated at the MP2 level of theory, demonstrating an efficient removal of the scaling wall pertinent to conventional quantum many-body methods.
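
The loosely coupled task layer can be caricatured as a pool of independent fragment jobs whose results are consolidated at the end. The function and task list below are invented stand-ins, not the actual DEC code, and a thread pool stands in for the groups of MPI ranks that process fragments in the real implementation.

```python
# Minimal sketch of task-based fragment parallelism: each "fragment"
# calculation is independent, so a pool of workers (here threads, in the
# real code groups of MPI ranks) can process them in any order, and the
# per-fragment energies are consolidated into a total at the end.
from concurrent.futures import ThreadPoolExecutor

def fragment_energy(fragment_id, size):
    # Placeholder for an independent fragment correlation-energy job;
    # the value returned is purely hypothetical.
    return -0.01 * size

fragments = [("frag%d" % i, sz) for i, sz in enumerate([10, 20, 30, 40])]
with ThreadPoolExecutor(max_workers=4) as pool:
    energies = list(pool.map(lambda f: fragment_energy(*f), fragments))
total = sum(energies)  # consolidate the independent contributions
```
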

  4. Modeling and new equipment definition for the vibration isolation box equipment system

    NASA Technical Reports Server (NTRS)

    Sani, Robert L.

    1993-01-01

    Our MSAD-funded research project is to provide numerical modeling support for the VIBES (Vibration Isolation Box Experiment System) which is an IML2 flight experiment being built by the Japanese research team of Dr. H. Azuma of the Japanese National Aerospace Laboratory. During this reporting period, the following have been accomplished: A semi-consistent mass finite element projection algorithm for 2D and 3D Boussinesq flows has been implemented on Sun, HP, and Cray platforms. The algorithm has better phase speed accuracy than similar finite difference or lumped mass finite element algorithms, an attribute which is essential for addressing realistic g-jitter effects as well as convectively-dominated transient systems. The projection algorithm has been benchmarked against solutions generated via the commercial code FIDAP. The algorithm appears to be accurate as well as computationally efficient. Optimization and potential parallelization studies are underway. Our implementation to date has focused on execution of the basic algorithm with at most a concern for vectorization. The initial time-varying gravity Boussinesq flow simulation is being set up. The mesh is being designed and the input file is being generated. Some preliminary 'small mesh' cases will be attempted on our HP9000/735 while our request to MSAD for supercomputing resources is being addressed. The Japanese research team for VIBES was visited, the current setup and status of the physical experiment were obtained, and an ongoing e-mail communication link was established.

  5. The "Swarm of Ants vs. Herd of Elephants" Debate Revisited: Performance Measurements of PVM-Overflow Across a Wide Spectrum of Architectures

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Jespersen, Dennis; Buning, Peter; Bailey, David (Technical Monitor)

    1996-01-01

    The Gordon Bell Prizes given out at Supercomputing every year include at least two categories: performance (highest GFLOP count) and price-performance (GFLOP/million $$) for real applications. In the past five years, the winners of the price-performance category all came from networks of workstations. This reflects three important facts: 1. supercomputers are still too expensive for the masses; 2. achieving high performance for real applications takes real work; and, most importantly, 3. it is possible to obtain acceptable performance for certain real applications on networks of workstations. With the continued advance of network technology as well as the increased performance of "desktop" workstations, the "Swarm of Ants vs. Herd of Elephants" debate, which began with vector multiprocessors (VPPs) against SIMD-type multiprocessors (e.g., the CM2), is now recast as VPPs against Symmetric Multiprocessors (SMPs, e.g., the SGI PowerChallenge). This paper reports on performance studies we performed solving a large-scale (2-million grid point) CFD problem involving a Boeing 747, based on a parallel version of OVERFLOW that utilizes message passing on PVM. A performance monitoring tool developed under NASA HPCC, called AIMS, was used to instrument and analyze the performance data thus obtained. We plan to compare performance data obtained across a wide spectrum of architectures, including the Cray C90, IBM/SP2, and SGI/Power Challenge Cluster, with that of a group of workstations connected over a simple network. The metrics of comparison include speed-up, price-performance, throughput, and turn-around time. We also plan to present a plan of attack for various issues that will make the execution of Grand Challenge Applications across the Global Information Infrastructure a reality.
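
The comparison metrics named above are simple ratios; the sketch below makes them concrete with invented timings, not measurements from the paper.

```python
# Illustrative definitions of the comparison metrics; all numbers below
# are hypothetical, chosen only to show how the ratios are formed.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def price_performance(gflops, cost_millions):
    return gflops / cost_millions  # sustained GFLOP/s per million dollars

t1, t64 = 3600.0, 90.0      # hypothetical wall-clock times in seconds
s = speedup(t1, t64)        # speed-up on 64 processors
eff = s / 64                # parallel efficiency
pp = price_performance(10.0, 0.5)
```
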

  6. Casimir force in O(n) systems with a diffuse interface.

    PubMed

    Dantchev, Daniel; Grüneberg, Daniel

    2009-04-01

    We study the behavior of the Casimir force in O(n) systems with a diffuse interface and slab geometry ∞^{d−1}×L, where 2<d<4, in the n→∞ limit of O(n) models with antiperiodic boundary conditions applied along the finite dimension L of the film. We observe that the Casimir amplitude Δ_Casimir(d | J⊥, J∥) of the anisotropic d-dimensional system is related to that of the isotropic system Δ_Casimir(d) via Δ_Casimir(d | J⊥, J∥) = (J⊥/J∥)^{(d−1)/2} Δ_Casimir(d). For d=3 we derive the exact Casimir amplitude Δ_Casimir(3 | J⊥, J∥) = [Cl₂(π/3)/3 − ζ(3)/(6π)] (J⊥/J∥), as well as the exact scaling functions of the Casimir force and of the helicity modulus Υ(T, L). We obtain that β_c Υ(T_c, L) = (2/π²)[Cl₂(π/3)/3 + 7ζ(3)/(30π)] (J⊥/J∥) L^{−1}, where T_c is the critical temperature of the bulk system. We find that the contributions in the excess free energy due to the existence of a diffuse interface result in a repulsive Casimir force in the whole temperature region.
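
The isotropic d = 3 prefactor quoted above, Cl₂(π/3)/3 − ζ(3)/(6π), can be checked numerically with direct series for the Clausen function Cl₂ and for ζ(3). This is a quick sanity check of the number, not code from the paper.

```python
# Numerical check of the isotropic d = 3 Casimir prefactor
#   Cl2(pi/3)/3 - zeta(3)/(6*pi)
# using the defining series Cl2(x) = sum_k sin(k x)/k^2 and
# zeta(3) = sum_k 1/k^3 (both truncated; tails are negligible here).
import math

def clausen2(theta, terms=200_000):
    return sum(math.sin(k * theta) / k**2 for k in range(1, terms))

def zeta3(terms=200_000):
    return sum(1.0 / k**3 for k in range(1, terms))

delta_iso = clausen2(math.pi / 3) / 3 - zeta3() / (6 * math.pi)
# delta_iso comes out positive (about 0.27), i.e. a repulsive force,
# consistent with the antiperiodic boundary conditions discussed above.
```
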

  7. Fortran code for SU(3) lattice gauge theory with and without MPI checkerboard parallelization

    NASA Astrophysics Data System (ADS)

    Berg, Bernd A.; Wu, Hao

    2012-10-01

    We document plain Fortran and Fortran MPI checkerboard code for Markov chain Monte Carlo simulations of pure SU(3) lattice gauge theory with the Wilson action in D dimensions. The Fortran code uses periodic boundary conditions and is suitable for pedagogical purposes and small scale simulations. For the Fortran MPI code two geometries are covered: the usual torus with periodic boundary conditions and the double-layered torus as defined in the paper. Parallel computing is performed on checkerboards of sublattices, which partition the full lattice in one, two, and so on, up to D directions (depending on the parameters set). For updating, the Cabibbo-Marinari heatbath algorithm is used. We present validations and test runs of the code. Performance is reported for a number of currently used Fortran compilers and, when applicable, MPI versions. For the parallelized code, performance is studied as a function of the number of processors. Program summary Program title: STMC2LSU3MPI Catalogue identifier: AEMJ_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 26666 No. of bytes in distributed program, including test data, etc.: 233126 Distribution format: tar.gz Programming language: Fortran 77 compatible with the use of Fortran 90/95 compilers, in part with MPI extensions. Computer: Any capable of compiling and executing Fortran 77 or Fortran 90/95, when needed with MPI extensions. Operating system: Red Hat Enterprise Linux Server 6.1 with OpenMPI + pgf77 11.8-0, Centos 5.3 with OpenMPI + gfortran 4.1.2, Cray XT4 with MPICH2 + pgf90 11.2-0. Has the code been vectorised or parallelized?: Yes, parallelized using MPI extensions. Number of processors used: 2 to 11664 RAM: 200 Megabytes per process. 
Classification: 11.5. Nature of problem: Physics of pure SU(3) Quantum Field Theory (QFT). This is relevant for our understanding of Quantum Chromodynamics (QCD). It includes the glueball spectrum, topological properties and the deconfining phase transition of pure SU(3) QFT. For instance, Relativistic Heavy Ion Collision (RHIC) experiments at the Brookhaven National Laboratory provide evidence that quarks confined in hadrons undergo at high enough temperature and pressure a transition into a Quark-Gluon Plasma (QGP). Investigations of its thermodynamics in pure SU(3) QFT are of interest. Solution method: Markov Chain Monte Carlo (MCMC) simulations of SU(3) Lattice Gauge Theory (LGT) with the Wilson action. This is a regularization of pure SU(3) QFT on a hypercubic lattice, which allows approaching the continuum SU(3) QFT by means of Finite Size Scaling (FSS) studies. Specifically, we provide updating routines for the Cabibbo-Marinari heatbath with and without checkerboard parallelization. While the first is suitable for pedagogical purposes and small scale projects, the latter allows for efficient parallel processing. Targeting the geometry of RHIC experiments, we have implemented a Double-Layered Torus (DLT) lattice geometry, which has previously not been used in LGT MCMC simulations and enables inside and outside layers at distinct temperatures, the lower-temperature layer acting as the outside boundary for the higher-temperature layer, where the deconfinement transition takes place. Restrictions: The checkerboard partition of the lattice makes the development of measurement programs more tedious than is the case for an unpartitioned lattice. Presently, only one measurement routine for Polyakov loops is provided. Unusual features: We provide three different versions for the send/receive function of the MPI library, which work for different operating system + compiler + MPI combinations. 
This involves activating the correct row in the last three rows of our latmpi.par parameter file. The underlying reason is distinct buffer conventions. Running time: For a typical run using an Intel i7 processor, it takes (1.8-6)×10⁻⁶ seconds to update one link of the lattice, depending on the compiler used. For example, if we do a simulation on a small (4·8³) DLT lattice with a statistics of 221 sweeps (i.e., update the two lattice layers of 4·(4·8³) links each 221 times), the total CPU time needed can be 2 · 4 · (4·8³) · 221 · 3×10⁻⁶ seconds ≈ 1.7 minutes, where 2 is the number of lattice layers, 4 the number of dimensions, 4·8³ the lattice size, 221 the number of update sweeps, and 3×10⁻⁶ s an average time to update one link variable. If we divide the job into 8 parallel processes, then the real time is (for negligible communication overhead) 1.7 min / 8 ≈ 0.2 min.
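
The checkerboard decomposition the code relies on can be illustrated independently of Fortran: sites are colored by coordinate parity, and because every nearest neighbor of a site has the opposite color, all sites of one color can be updated concurrently. This sketch is generic, not the STMC2LSU3MPI layout.

```python
# Checkerboard (even/odd) decomposition of a lattice by coordinate parity.
from itertools import product

def color(site):
    return sum(site) % 2  # 0 = "even"/black, 1 = "odd"/white

def checkerboard(dims):
    sites = list(product(*(range(d) for d in dims)))
    even = [s for s in sites if color(s) == 0]
    odd = [s for s in sites if color(s) == 1]
    return even, odd

even, odd = checkerboard((4, 4))
# Every nearest neighbor of an even site is odd, so an even half-sweep
# touches no pair of interacting sites and can run fully in parallel.
assert all(color((x + 1, y)) == 1 for (x, y) in even)
```
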

  8. Edison - A New Cray Supercomputer Advances Discovery at NERSC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dosanjh, Sudip; Parkinson, Dula; Yelick, Kathy

    2014-02-06

    When a supercomputing center installs a new system, users are invited to make heavy use of the computer as part of the rigorous testing. In this video, find out what top scientists have discovered using Edison, a Cray XC30 supercomputer, and how NERSC's newest supercomputer will accelerate their future research.

  9. Edison - A New Cray Supercomputer Advances Discovery at NERSC

    ScienceCinema

    Dosanjh, Sudip; Parkinson, Dula; Yelick, Kathy; Trebotich, David; Broughton, Jeff; Antypas, Katie; Lukic, Zarija; Borrill, Julian; Draney, Brent; Chen, Jackie

    2018-01-16

    When a supercomputing center installs a new system, users are invited to make heavy use of the computer as part of the rigorous testing. In this video, find out what top scientists have discovered using Edison, a Cray XC30 supercomputer, and how NERSC's newest supercomputer will accelerate their future research.

  10. Performance of the fusion code GYRO on four generations of Cray computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fahey, Mark R

    2014-01-01

    GYRO is a code used for the direct numerical simulation of plasma microturbulence. It has been ported to a variety of modern MPP platforms including several modern commodity clusters, IBM SPs, and Cray XC, XT, and XE series machines. We briefly describe the mathematical structure of the equations, the data layout, and the redistribution scheme. Also, while the performance and scaling of GYRO on many of these systems has been shown before, here we show the comparative performance and scaling on four generations of Cray supercomputers including the newest addition, the Cray XC30. The more recently added hybrid OpenMP/MPI implementation also shows a great deal of promise on custom HPC systems that utilize fast CPUs and proprietary interconnects. Four machines of varying sizes were used in the experiment, all of which are located at the National Institute for Computational Sciences at the University of Tennessee at Knoxville and Oak Ridge National Laboratory. The advantages, limitations, and performance of using each system are discussed.

  11. GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

    NASA Astrophysics Data System (ADS)

    Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.

    2018-07-01

    This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field computed with 8192 XK7 nodes.

  12. Implementation of the Automated Numerical Model Performance Metrics System

    DTIC Science & Technology

    2011-09-26

    As of this writing, the DSRC IBM AIX machines DaVinci and Pascal, and the Cray XT Einstein, all use the PBS batch queuing system ... Appendix A – General Automation System: this system provides general-purpose tools and a general way to automatically run ...

  13. Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Attaway, S.; Brown, K.; Gardner, D.

    1997-12-31

    Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten-million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.

  14. High Performance Distributed Computing in a Supercomputer Environment: Computational Services and Applications Issues

    NASA Technical Reports Server (NTRS)

    Kramer, Williams T. C.; Simon, Horst D.

    1994-01-01

    This tutorial proposes to be a practical guide for the uninitiated to the main topics and themes of high-performance computing (HPC), with particular emphasis on distributed computing. The intent is first to provide some guidance and directions in the rapidly growing field of scientific computing using both massively parallel and traditional supercomputers. Because of their considerable potential computational power, loosely or tightly coupled clusters of workstations are increasingly considered as a third alternative to both the more conventional supercomputers based on a small number of powerful vector processors and to massively parallel processors. Even though many research issues concerning the effective use of workstation clusters and their integration into a large scale production facility are still unresolved, such clusters are already used for production computing. In this tutorial we will utilize the unique experience gained at the NAS facility at NASA Ames Research Center. Over the last five years at NAS, massively parallel supercomputers such as the Connection Machines CM-2 and CM-5 from Thinking Machines Corporation and the iPSC/860 (Touchstone Gamma Machine) and Paragon machines from Intel were used in a production supercomputer center alongside traditional vector supercomputers such as the Cray Y-MP and C90.

  15. Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2002-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) using different programming paradigms and architectures. Results show that for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems; cache reuse may be more important than reducing communication; it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gain. An implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
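
For reference, a minimal unpreconditioned CG sketch is below; the paper's parallel codes add ILU(0) preconditioning, ordering, and data distribution on top of this same skeleton.

```python
# Minimal (unpreconditioned) conjugate gradient for a symmetric
# positive-definite system, written against a matrix-vector callback.
def cg(matvec, b, x0, tol=1e-10, maxiter=1000):
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(x))]  # initial residual
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(maxiter):
        if rs < tol * tol:          # converged: ||r|| below tolerance
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_next = sum(ri * ri for ri in r)
        p = [ri + (rs_next / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_next
    return x

# Solve the small SPD system [[4,1],[1,3]] x = [1,2].
A = [[4.0, 1.0], [1.0, 3.0]]
mv = lambda v: [sum(aij * vj for aij, vj in zip(row, v)) for row in A]
x = cg(mv, [1.0, 2.0], [0.0, 0.0])
# exact solution is x = [1/11, 7/11]
```
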

  16. Taming parallel I/O complexity with auto-tuning

    DOE PAGES

    Behzad, Babak; Luu, Huong Vu Thanh; Huchette, Joseph; ...

    2013-11-17

    We present an auto-tuning system for optimizing I/O performance of HDF5 applications and demonstrate its value across platforms, applications, and at scale. The system uses a genetic algorithm to search a large space of tunable parameters and to identify effective settings at all layers of the parallel I/O stack. The parameter settings are applied transparently by the auto-tuning system via dynamically intercepted HDF5 calls. To validate our auto-tuning system, we applied it to three I/O benchmarks (VPIC, VORPAL, and GCRM) that replicate the I/O activity of their respective applications. We tested the system with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC platforms (Cray XE6, IBM BG/P, and Dell Cluster). In all cases, the auto-tuning framework identified tunable parameters that substantially improved write performance over default system settings. In conclusion, we consistently demonstrate I/O write speedups between 2x and 100x for test configurations.
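
The genetic search over a parameter space can be sketched with a toy cost function. The parameters and `io_cost` below are synthetic stand-ins for a measured write time; they are not the auto-tuner's real model or API.

```python
# Toy genetic-algorithm search over two hypothetical I/O parameters
# (e.g. stripe count and chunk factor); the cost is a synthetic stand-in
# for a measured write time, with its optimum placed at (8, 4).
import random

def io_cost(params):
    stripe, chunk = params
    return (stripe - 8) ** 2 + (chunk - 4) ** 2

def evolve(generations=60, pop_size=20, seed=1):
    rng = random.Random(seed)
    pop = [(rng.randrange(1, 17), rng.randrange(1, 9)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=io_cost)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])                # crossover
            if rng.random() < 0.5:              # mutation
                child = (max(1, child[0] + rng.choice((-1, 1))),
                         max(1, child[1] + rng.choice((-1, 1))))
            children.append(child)
        pop = parents + children
    return min(pop, key=io_cost)

best = evolve()  # settings the search converged toward
```
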

  17. The fusion code XGC: Enabling kinetic study of multi-scale edge turbulent transport in ITER [Book Chapter]

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

    The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in the ITER edge plasma on the largest US open-science computer, the Cray XK7 Titan, at its maximal heterogeneous capability. Such simulations have not been possible before, due to a shortage by a factor of over 10 in the time-to-solution needed to complete one physics case in less than 5 days of wall-clock time. Frontier techniques such as nested OpenMP parallelism; adaptive parallel I/O, staging I/O, and data reduction using dynamic and asynchronous application interactions; dynamic repartitioning for balancing computational work in pushing particles and in grid-related work; scalable and accurate discretization algorithms for nonlinear Coulomb collisions; and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership-class computer.

  18. Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

    NASA Astrophysics Data System (ADS)

    Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.

    2018-01-01

    We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of the special structure of the nuclear configuration interaction problem, which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos-based algorithm for problems of moderate sizes on a Cray XC30 system.
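
The benefit of iterating a block of vectors together can be illustrated with plain subspace iteration on a small matrix. Real solvers of the kind described above add preconditioning and a Rayleigh-Ritz projection; this is only the skeleton of the block idea.

```python
# Block subspace (simultaneous orthogonal) iteration: a block of vectors
# is multiplied by A and re-orthonormalized together, converging to the
# dominant invariant subspace and, with distinct eigenvalues, to the
# dominant eigenvectors themselves.
import numpy as np

def subspace_iteration(A, block_size, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], block_size)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(A @ Q)   # amplify, then re-orthonormalize
    # Rayleigh quotients approximate the dominant eigenvalues.
    return np.sort(np.diag(Q.T @ A @ Q))[::-1]

A = np.diag([10.0, 5.0, 2.0, 1.0])   # small SPD test matrix
evals = subspace_iteration(A, block_size=2)
# evals converges to the two largest eigenvalues, 10 and 5
```
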

  19. A performance comparison of current HPC systems: Blue Gene/Q, Cray XE6 and InfiniBand systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kerbyson, Darren J.; Barker, Kevin J.; Vishnu, Abhinav

    2014-01-01

    We present a performance analysis of three current architectures that have become commonplace in the High Performance Computing world. Blue Gene/Q is the third generation of systems from IBM that use modestly performing cores, but at large scale, in order to achieve high performance. The XE6 is the latest in a long line of Cray systems that use a 3-D topology, but the first to use the Gemini interconnection network. InfiniBand provides the flexibility of using compute nodes from many vendors that can be connected in many possible topologies. The performance characteristics of each vary vastly, and the way in which nodes are allocated in each type of system can significantly impact achieved performance. In this work we compare these three systems using a combination of micro-benchmarks and a set of production applications. In addition, we examine the differences in performance variability observed on each system and quantify the lost performance using a combination of both empirical measurements and performance models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.

  20. Parallel 3D Mortar Element Method for Adaptive Nonconforming Meshes

    NASA Technical Reports Server (NTRS)

    Feng, Huiyu; Mavriplis, Catherine; VanderWijngaart, Rob; Biswas, Rupak

    2004-01-01

    High order methods are frequently used in computational simulation for their high accuracy. An efficient way to avoid unnecessary computation in smooth regions of the solution is to use adaptive meshes which employ fine grids only in areas where they are needed. Nonconforming spectral elements allow the grid to be flexibly adjusted to satisfy the computational accuracy requirements. The method is suitable for computational simulations of unsteady problems with very disparate length scales or unsteady moving features, such as heat transfer, fluid dynamics or flame combustion. In this work, we select the Mortar Element Method (MEM) to handle the non-conforming interfaces between elements. A new technique is introduced to efficiently implement MEM in 3-D nonconforming meshes. By introducing an "intermediate mortar", the proposed method decomposes the projection between 3-D elements and mortars into two steps. In each step, projection matrices derived in 2-D are used. The two-step method avoids explicitly forming/deriving large projection matrices for 3-D meshes, and also helps to simplify the implementation. This new technique can be used for both h- and p-type adaptation. This method is applied to an unsteady 3-D moving heat source problem. With our new MEM implementation, mesh adaptation is able to efficiently refine the grid near the heat source and coarsen the grid once the heat source passes. The savings in computational work resulting from the dynamic mesh adaptation are demonstrated by the reduction of the number of elements used and the CPU time spent. MEM and mesh adaptation, respectively, bring irregularity and dynamics to the computer memory access pattern. Hence, they provide a good way to gauge the performance of computer systems when running scientific applications whose memory access patterns are irregular and unpredictable. 
We select a 3-D moving heat source problem as the Unstructured Adaptive (UA) grid benchmark, a new component of the NAS Parallel Benchmarks (NPB). In this paper, we present some interesting performance results of our OpenMP parallel implementation on different architectures such as the SGI Origin2000, SGI Altix, and Cray MTA-2.
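
The two-step projection exploits tensor-product structure: rather than forming the large 3-D projection matrix explicitly, the smaller per-direction matrices are applied one at a time. The sketch below demonstrates the underlying Kronecker identity with arbitrary stand-in matrices, not the paper's actual projection operators.

```python
# Two-step application of a tensor-product operator: vec(P1 X P2^T) equals
# (P2 kron P1) vec(X) (column-major vec), so the big Kronecker matrix never
# needs to be formed. P1, P2, and X are arbitrary stand-ins.
import numpy as np

rng = np.random.default_rng(0)
P1 = rng.standard_normal((3, 4))   # stand-in projection, direction 1
P2 = rng.standard_normal((5, 6))   # stand-in projection, direction 2
X = rng.standard_normal((4, 6))    # data on the element face

two_step = P1 @ X @ P2.T           # apply one direction at a time
one_shot = (np.kron(P2, P1) @ X.flatten(order="F")).reshape((3, 5), order="F")
# Both agree, but the two-step form never builds the 15x24 matrix.
```
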

  1. Improving the spatial accuracy in functional magnetic resonance imaging (fMRI) based on the blood oxygenation level dependent (BOLD) effect: benefits from parallel imaging and a 32-channel head array coil at 1.5 Tesla.

    PubMed

    Fellner, C; Doenitz, C; Finkenzeller, T; Jung, E M; Rennert, J; Schlaier, J

    2009-01-01

    Geometric distortions and low spatial resolution are current limitations in functional magnetic resonance imaging (fMRI). The aim of this study was to evaluate if application of parallel imaging or significant reduction of voxel size in combination with a new 32-channel head array coil can reduce those drawbacks at 1.5 T for a simple hand motor task. Therefore, maximum t-values (tmax) in different regions of activation, time-dependent signal-to-noise ratios (SNR(t)) as well as distortions within the precentral gyrus were evaluated. Comparing fMRI with and without parallel imaging in 17 healthy subjects revealed significantly reduced geometric distortions in anterior-posterior direction. Using parallel imaging, tmax only showed a mild reduction (7-11%) although SNR(t) was significantly diminished (25%). In 7 healthy subjects high-resolution (2 × 2 × 2 mm³) fMRI was compared with standard fMRI (3 × 3 × 3 mm³) in a 32-channel coil and with high-resolution fMRI in a 12-channel coil. The new coil yielded a clear improvement for tmax (21-32%) and SNR(t) (51%) in comparison with the 12-channel coil. Geometric distortions were smaller due to the smaller voxel size. Therefore, the reduction in tmax (8-16%) and SNR(t) (52%) in the high-resolution experiment seems to be tolerable with this coil. In conclusion, parallel imaging is an alternative to reduce geometric distortions in fMRI at 1.5 T. Using a 32-channel coil, reduction of the voxel size might be the preferable way to improve spatial accuracy.
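
The SNR cost of parallel imaging discussed above follows the standard relation SNR_pMRI = SNR_full / (g·√R), with geometry factor g ≥ 1. For R = 2 and an ideal g = 1 this predicts a drop of about 29 %, of the same order as the ~25 % reduction in SNR(t) reported here; the numbers below are a generic check of that relation, not data from this study.

```python
# Standard parallel-imaging SNR relation: SNR_pMRI = SNR_full / (g*sqrt(R)),
# where R is the acceleration factor and g >= 1 the coil geometry factor.
import math

def pmri_snr(snr_full, R, g=1.0):
    return snr_full / (g * math.sqrt(R))

drop = 1.0 - pmri_snr(1.0, R=2)   # fractional SNR loss at R = 2, g = 1
# drop is about 0.29, i.e. ~29 % loss for an ideal coil geometry
```
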

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Guo Zehua; Tang Xianzhu

    Parallel transport of long mean-free-path plasma along an open magnetic field line is characterized by strong temperature anisotropy, which is driven by two effects. The first is magnetic moment conservation in a non-uniform magnetic field, which can transfer energy between parallel and perpendicular degrees of freedom. The second is decompressional cooling of the parallel temperature due to parallel flow acceleration by the conventional presheath electric field, which is associated with the sheath condition near the wall surface where the open magnetic field line intercepts the discharge chamber. To leading order in the gyroradius-to-system-gradient-length-scale expansion, the parallel transport can be understood via the Chew-Goldberger-Low (CGL) model, which retains two components of the parallel heat flux: q_n, associated with the parallel thermal energy, and q_s, related to the perpendicular thermal energy. It is shown that in addition to the effect of magnetic field strength (B) modulation, the two components (q_n and q_s) of the parallel heat flux play decisive roles in the parallel variation of the plasma profile, which includes the plasma density (n), parallel flow (u), parallel and perpendicular temperatures (T∥ and T⊥), and the ambipolar potential (φ). Both their profiles (q_n/B and q_s/B²) and the upstream values of the ratios of the conductive and convective thermal fluxes (q_n/(nuT∥) and q_s/(nuT⊥)) provide the controlling physics, in addition to B modulation. The physics described by the CGL model are contrasted with those of the double-adiabatic laws and further elucidated by comparison with a first-principles kinetic simulation for a specific but representative flux expander case.
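
For context, the double-adiabatic laws contrasted with the CGL model above correspond to the zero-heat-flux limit (q_n = q_s = 0), in which two invariants are advected with the flow; in their standard form, with pressures p⊥ and p∥:

```latex
\frac{d}{dt}\!\left(\frac{p_{\perp}}{nB}\right) = 0,
\qquad
\frac{d}{dt}\!\left(\frac{p_{\parallel} B^{2}}{n^{3}}\right) = 0 .
```

Retaining finite q_n and q_s, as the CGL model discussed above does, breaks these invariants and is what allows the heat-flux profiles to shape T∥ and T⊥ along the field line.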

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Koniges, A.E.

    The author describes the new T3D parallel computer at NERSC. The adaptive mesh ICF3D code is one of the current applications being ported and developed for use on the T3D. It has been stressed in other papers in these proceedings that the development environment and tools available on the parallel computer are similar to any planned for the future, including networks of workstations.

  4. Boundedness and exponential convergence in a chemotaxis model for tumor invasion

    NASA Astrophysics Data System (ADS)

    Jin, Hai-Yang; Xiang, Tian

    2016-12-01

    We revisit the following chemotaxis system modeling tumor invasion: u_t = Δu − ∇·(u∇v), v_t = Δv + wz, w_t = −wz, z_t = Δz − z + u, for x ∈ Ω, t > 0, in a smooth bounded domain Ω ⊂ R^n (n ≥ 1) with homogeneous Neumann boundary and initial conditions. This model was recently proposed by Fujie et al (2014 Adv. Math. Sci. Appl. 24 67-84) as a model for tumor invasion with the role of extracellular matrix incorporated, and was analyzed later by Fujie et al (2016 Discrete Contin. Dyn. Syst. 36 151-69), showing uniform boundedness and convergence for n ≤ 3. In this work, we first show that the L^∞-boundedness of the system can be reduced to the boundedness of ‖u(·,t)‖_{L^{n/4+ε}(Ω)} for some ε > 0 alone, and then, for n ≥ 4, if the initial data ‖u_0‖_{L^{n/4}}, ‖z_0‖_{L^{n/2}} and …

  5. Research on Spectroscopy, Opacity, and Atmospheres

    NASA Technical Reports Server (NTRS)

    Kurucz, Robert L.

    1999-01-01

    A web site (cfakus.harvard.edu) has been set up to make the calculations accessible; the data can also be accessed by FTP. It holds all of the atomic and diatomic molecular data, tables of distribution-function opacities, grids of model atmospheres, colors, fluxes, etc., programs that are ready for distribution, and most of the recent papers produced during this grant. Atlases and computed spectra will be added as they are completed, as will new atomic and molecular calculations. The atomic programs that had been running on a Cray at the San Diego Supercomputer Center can now run on the VAXes and Alpha. The work started with Ni and Co because there were new laboratory analyses that included isotopic and hyperfine splitting. Those calculations are described in the appended abstract for the 6th Atomic Spectroscopy and Oscillator Strengths meeting in Victoria last summer. A surprising finding is that quadrupole transitions have been grossly in error because mixing with higher levels had not been included. All levels up through n=9 for Fe I and II, the spectra for which the most information is available, are now included. After Fe I and Fe II, all other spectra are "easy". ATLAS12, the opacity-sampling program for computing models with arbitrary abundances, has been put on the web server. A new distribution-function opacity program for workstations has been written to replace the one used on the Cray at the San Diego Supercomputer Center, where each set of abundances would take 100 Cray hours costing $100,000.

  6. Beyond the Face of Race: Emo-Cognitive Explorations of White Neurosis and Racial Cray-Cray

    ERIC Educational Resources Information Center

    Matias, Cheryl E.; DiAngelo, Robin

    2013-01-01

    In this article, the authors focus on the emotional and cognitive context that underlies whiteness. They employ interdisciplinary approaches of critical Whiteness studies and critical race theory to entertain how common White responses to racial material stem from the need for Whites to deny race, a traumatizing process that begins in childhood.…

  7. Implementing dense linear algebra algorithms using multitasking on the CRAY X-MP-4 (or approaching the gigaflop)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dongarra, J.J.; Hewitt, T.

    1985-08-01

    This note describes some experiments on simple, dense linear algebra algorithms. These experiments show that the CRAY X-MP is capable of small-grain multitasking arising from standard implementations of LU and Cholesky decomposition. The implementation described here provides the "fastest" execution rate for LU decomposition, 718 MFLOPS for a matrix of order 1000.
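    The quoted 718 MFLOPS can be sanity-checked against the standard 2n³/3 floating-point operation count for LU decomposition; the following back-of-envelope script (illustrative, not from the paper) recovers the implied run time for an order-1000 matrix:

```python
# Back-of-envelope check of the reported LU factorization rate.
n = 1000
flops = 2 * n**3 / 3        # standard LU operation count (~6.7e8 for n = 1000)
rate = 718e6                # 718 MFLOPS, as reported for the CRAY X-MP-4
seconds = flops / rate
print(f"{flops:.3e} flops at 718 MFLOPS -> {seconds:.2f} s")  # ~0.93 s
```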

  8. Early MIMD experience on the CRAY X-MP

    NASA Astrophysics Data System (ADS)

    Rhoades, Clifford E.; Stevens, K. G.

    1985-07-01

    This paper describes some early experience with converting four physics simulation programs to the CRAY X-MP, a current Multiple Instruction, Multiple Data (MIMD) computer consisting of two processors, each with an architecture similar to that of the CRAY-1. As a multi-processor, the CRAY X-MP together with the high-speed Solid-state Storage Device (SSD) is an ideal machine upon which to study MIMD algorithms for solving the equations of mathematical physics, because it is fast enough to run real problems. The computer programs used in this study are all FORTRAN versions of original production codes. They range in sophistication from a one-dimensional numerical simulation of collisionless plasma, to a two-dimensional hydrodynamics code with heat flow, to a couple of three-dimensional fluid dynamics codes with varying degrees of viscous modeling. Early research with a dual-processor configuration has shown speed-ups ranging from 1.55 to 1.98. It has been observed that a few simple extensions to FORTRAN allow a typical programmer to achieve a remarkable level of efficiency. These extensions involve the concept of memory local to a concurrent subprogram and memory common to all concurrent subprograms.
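    The reported dual-processor speed-ups of 1.55 to 1.98 can be interpreted through Amdahl's law; the sketch below (an illustrative calculation, not taken from the paper) inverts the law to estimate the serial fraction each measurement implies:

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: speedup on p processors for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def implied_serial_fraction(speedup, p):
    """Invert Amdahl's law: serial fraction implied by a measured speedup."""
    return (p / speedup - 1.0) / (p - 1.0)

for s in (1.55, 1.98):
    f = implied_serial_fraction(s, p=2)
    print(f"speedup {s} on 2 processors implies ~{100 * f:.0f}% serial work")
```

    By this reading, the best-converted code (speedup 1.98) retained only about 1% serial work, while the worst (1.55) still spent roughly 29% of its time in serial sections.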

  9. Fast l₁-SPIRiT compressed sensing parallel imaging MRI: scalable parallel implementation and clinically feasible runtime.

    PubMed

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

    2012-06-01

    We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
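    The cross-channel joint-sparsity objective described above is typically minimized by a joint (group) soft-thresholding step applied to the wavelet coefficients of all coils at once; the sketch below shows one such step under simplifying assumptions (array shapes and the threshold value are illustrative, not taken from the paper):

```python
import numpy as np

def joint_soft_threshold(coeffs, lam):
    """Group soft-thresholding across receive channels.

    coeffs: complex array of shape (n_channels, n_coeffs) holding one
    wavelet coefficient per column for every coil. The l2 norm across
    channels is shrunk jointly, so a coefficient location is kept or
    suppressed for all coils simultaneously (joint sparsity).
    """
    norms = np.linalg.norm(coeffs, axis=0, keepdims=True)   # joint magnitude
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return coeffs * scale

rng = np.random.default_rng(0)
c = rng.normal(size=(8, 16)) + 1j * rng.normal(size=(8, 16))
out = joint_soft_threshold(c, lam=1.0)   # every joint norm shrinks by lam
```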

  10. Enabling parallel simulation of large-scale HPC network systems

    DOE PAGES

    Mubarak, Misbah; Carothers, Christopher D.; Ross, Robert B.; ...

    2016-04-07

    Here, with the increasing complexity of today’s high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems—in particular, networks. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today’s high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations.
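    The contrast drawn above between time-stepped and discrete-event simulation comes down to a pending-event priority queue; a minimal sequential event loop (just the underlying idea, not the ROSS engine) can be sketched as:

```python
import heapq

def run_des(initial_events, handler, end_time):
    """Minimal discrete-event loop: repeatedly pop the earliest event and
    let the handler schedule follow-up events, until the queue drains."""
    queue = list(initial_events)          # (timestamp, event_name) pairs
    heapq.heapify(queue)
    log = []
    while queue:
        t, name = heapq.heappop(queue)
        if t > end_time:
            break
        log.append((t, name))
        for new_event in handler(t, name):
            heapq.heappush(queue, new_event)
    return log

# Toy workload: a packet traverses three switch hops, 2.0 time units each.
def handler(t, name):
    hop = int(name.split("-")[1])
    return [(t + 2.0, f"hop-{hop + 1}")] if hop < 3 else []

log = run_des([(0.0, "hop-1")], handler, end_time=100.0)
print(log)   # [(0.0, 'hop-1'), (2.0, 'hop-2'), (4.0, 'hop-3')]
```

    An optimistic engine such as ROSS additionally lets each processor run ahead on its local queue and rolls back events when a message arrives in its past, rather than enforcing global time order as this sequential loop does.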

  12. Board-level optical clock signal distribution using Si CMOS-compatible polyimide-based 1- to 48-fanout H-tree

    NASA Astrophysics Data System (ADS)

    Wu, Linghui; Bihari, Bipin; Gan, Jianhua; Chen, Ray T.; Tang, Suning

    1998-08-01

    Si-CMOS compatible polymer-based waveguides for optoelectronic interconnects and packaging have been fabricated and characterized. A 1-to-48 fanout optoelectronic interconnection layer (OIL) structure based on Ultradel 9120/9020 for the high-speed massive clock signal distribution for a Cray T-90 supercomputer board has been constructed. The OIL employs multimode polymeric channel waveguides in conjunction with surface-normal waveguide output coupler and 1-to-2 splitter. A total insertion loss of 7.98 dB at 850 nm was measured experimentally.

  13. Formation of vortex line around the glass transition in YBa{sub 2}Cu{sub 3}O{sub 7-{delta}} films

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nojima, T.; Kakinuma, A.; Kuwasawa, Y.

    1996-12-01

    Two components of the current-induced electric field in the ab plane, E_x and E_y, have been measured simultaneously on YBCO(123) films around the glass transition temperature T_g in magnetic fields H with components (H_0, H_0, 0.1H_0), where the x and y axes are parallel to the direction of the current density and the c axis, respectively. In this condition, a finite transverse field E_y almost equal to E_x can be observed if the vortex lines form and move along the Lorentz force. In each H, the ratio |E_y/E_x| at the low-current limit, which is zero far above T_g, increases in the critical region and approaches unity below T_g. The authors' results indicate that the vortices become lines with long-range correlation along the H direction at the vortex glass transition without being affected by the intrinsic pinning.

  14. Applications of CFD and visualization techniques

    NASA Technical Reports Server (NTRS)

    Saunders, James H.; Brown, Susan T.; Crisafulli, Jeffrey J.; Southern, Leslie A.

    1992-01-01

    In this paper, three applications are presented to illustrate current techniques for flow calculation and visualization. The first two applications use a commercial computational fluid dynamics (CFD) code, FLUENT, performed on a Cray Y-MP. The results are animated with the aid of data visualization software, apE. The third application simulates a particulate deposition pattern using techniques inspired by developments in nonlinear dynamical systems. These computations were performed on personal computers.

  15. Massively parallel and linear-scaling algorithm for second-order Moller–Plesset perturbation theory applied to the study of supramolecular wires

    DOE PAGES

    Kjaergaard, Thomas; Baudin, Pablo; Bykov, Dmytro; ...

    2016-11-16

    Here, we present a scalable cross-platform hybrid MPI/OpenMP/OpenACC implementation of the Divide–Expand–Consolidate (DEC) formalism with portable performance on heterogeneous HPC architectures. The Divide–Expand–Consolidate formalism is designed to reduce the steep computational scaling of conventional many-body methods employed in electronic structure theory to linear scaling, while providing a simple mechanism for controlling the error introduced by this approximation. Our massively parallel implementation of this general scheme has three levels of parallelism, being a hybrid of the loosely coupled task-based parallelization approach and the conventional MPI+X programming model, where X is either OpenMP or OpenACC. We demonstrate strong and weak scalability of this implementation on heterogeneous HPC systems, namely on the GPU-based Cray XK7 Titan supercomputer at the Oak Ridge National Laboratory. Using the "resolution of the identity second-order Moller–Plesset perturbation theory" (RI-MP2) as the physical model for simulating correlated electron motion, the linear-scaling DEC implementation is applied to 1-aza-adamantane-trione (AAT) supramolecular wires containing up to 40 monomers (2440 atoms, 6800 correlated electrons, 24 440 basis functions and 91 280 auxiliary functions). This represents the largest molecular system treated at the MP2 level of theory, demonstrating an efficient removal of the scaling wall pertinent to conventional quantum many-body methods.
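    The "loosely coupled task-based parallelization" at the top level of the DEC scheme amounts to farming out independent fragment calculations and summing their contributions; a minimal single-node sketch, using a thread pool as a stand-in for MPI ranks and a toy per-fragment "energy" (not the actual DEC algorithm), is:

```python
from concurrent.futures import ThreadPoolExecutor

def fragment_energy(fragment):
    """Stand-in for an expensive per-fragment correlation-energy task.
    In a DEC-style scheme each fragment is independent, so the total is
    a simple sum over separately computed fragment contributions."""
    return sum(x * x for x in fragment)   # toy energy expression

def dec_total_energy(fragments, workers=4):
    """Task parallelism: distribute the fragments, then consolidate the sum."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        contributions = pool.map(fragment_energy, fragments)
        return sum(contributions)

fragments = [[0.1 * i, 0.2 * i] for i in range(1, 5)]
total = dec_total_energy(fragments)
```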

  16. Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube

    NASA Technical Reports Server (NTRS)

    Joslin, Ronald D.; Zubair, Mohammad

    1993-01-01

    The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with nonoptimized routines; slower-than-linear speedups are achieved with optimized (machine-dependent library) routines. The slower-than-linear speedup occurs because the Fast Fourier Transform (FFT) routine dominates the computational cost and itself exhibits less-than-ideal speedup. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single-processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model, which reduces the required number of grid points and turns the computation into a large-eddy simulation (PSLES), would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.

  17. Lightweight computational steering of very large scale molecular dynamics simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beazley, D.M.; Lomdahl, P.S.

    1996-09-01

    We present a computational steering approach for controlling, analyzing, and visualizing very large scale molecular dynamics simulations involving tens to hundreds of millions of atoms. Our approach relies on extensible scripting languages and an easy to use tool for building extensions and modules. The system is extremely easy to modify, works with existing C code, is memory efficient, and can be used from inexpensive workstations and networks. We demonstrate how we have used this system to manipulate data from production MD simulations involving as many as 104 million atoms running on the CM-5 and Cray T3D. We also show how this approach can be used to build systems that integrate common scripting languages (including Tcl/Tk, Perl, and Python), simulation code, user extensions, and commercial data analysis packages.

  18. Multigrid Methods in Electronic Structure Calculations

    NASA Astrophysics Data System (ADS)

    Briggs, Emil

    1996-03-01

    Multigrid techniques have become the method of choice for a broad range of computational problems. Their use in electronic structure calculations introduces a new set of issues when compared to traditional plane wave approaches. We have developed a set of techniques that address these issues and permit multigrid algorithms to be applied to the electronic structure problem in an efficient manner. In our approach the Kohn-Sham equations are discretized on a real-space mesh using a compact representation of the Hamiltonian. The resulting equations are solved directly on the mesh using multigrid iterations. This produces rapid convergence rates even for ill-conditioned systems with large length and/or energy scales. The method has been applied to both periodic and non-periodic systems containing over 400 atoms, and the results are in very good agreement with both theory and experiment. Example applications include a vacancy in diamond, an isolated C60 molecule, and a 64-atom cell of GaN with the Ga d-electrons in valence, which required a 250 Ry cutoff. A particular strength of a real-space multigrid approach is its ready adaptability to massively parallel computer architectures. The compact representation of the Hamiltonian is especially well suited to such machines. Tests on the Cray-T3D have shown nearly linear scaling of the execution time up to the maximum number of processors (512). The MPP implementation has been used for studies of a large Amyloid Beta Peptide (C_146O_45N_42H_210) found in the brains of Alzheimer's disease patients. Further applications of the multigrid method will also be described. (In collaboration with D. J. Sullivan and J. Bernholc.)
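    The core iteration described above (discretize on a mesh, smooth, and correct on coarser meshes) can be illustrated on the simplest possible case, a 1D Poisson problem; this self-contained V-cycle sketch (illustrative only, far simpler than a real-space Kohn-Sham solver) shows the rapid mesh-based convergence:

```python
import numpy as np

def jacobi(u, f, h, sweeps=3, w=2.0 / 3.0):
    """Weighted-Jacobi smoothing for -u'' = f with zero boundary values."""
    for _ in range(sweeps):
        u[1:-1] = (1 - w) * u[1:-1] + w * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h):
    """One V-cycle: pre-smooth, restrict residual, recurse, correct, post-smooth."""
    u = jacobi(u, f, h)
    if len(u) <= 3:
        return u
    r = residual(u, f, h)
    rc = r[::2].copy()                                  # full-weighting restriction
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)          # coarse-grid correction
    e = np.zeros_like(u)                                # linear interpolation
    e[::2] = ec
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    return jacobi(u + e, f, h)

n = 129                                   # 2**7 + 1 points so the meshes nest
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.pi**2 * np.sin(np.pi * x)          # exact solution: sin(pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, h)
err = np.max(np.abs(u - np.sin(np.pi * x)))   # down to discretization error
```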

  19. Orientation dependent cyclic stability of the elastocaloric effect in textured Ni-Mn-Ga alloys

    NASA Astrophysics Data System (ADS)

    Wei, Longsha; Zhang, Xuexi; Liu, Jian; Geng, Lin

    2018-05-01

    High-performance elastocaloric materials require a large reversible elastocaloric effect and long-life cyclic stability. Here, we fabricated textured polycrystalline Ni50.4Mn27.3Ga22.3 alloys by a cost-effective casting method to create a <001> texture. A strong correlation between the cyclic stability and the crystal orientation was demonstrated. A large reversible adiabatic temperature change ΔT of ~6 K was obtained when the external stress was applied parallel to the <001> direction. However, the ΔT decreased rapidly after 50 cycles, showing an unstable elastocaloric effect (eCE). On the other hand, when the external stress was applied perpendicular to <001>, the adiabatic ΔT was smaller, ~4 K, but was stable over 100 cycles. This significantly enhanced eCE stability was related to the high yield strength, low transformation strain and much higher crack initiation and propagation resistance perpendicular to the <001> direction. This study provides a feasible strategy for optimizing the eCE property by creating a textured structure in polycrystalline Ni-Mn-Ga and Ni-Mn-X (X = In, Sn, Sb) alloys.

  20. Parallel particle filters for online identification of mechanistic mathematical models of physiology from monitoring data: performance and real-time scalability in simulation scenarios.

    PubMed

    Zenker, Sven

    2010-08-01

    Combining mechanistic mathematical models of physiology with quantitative observations using probabilistic inference may offer advantages over established approaches to computerized decision support in acute care medicine. Particle filters (PF) can perform such inference successively as data become available. The potential of PF for real-time state estimation (SE) for a model of cardiovascular physiology is explored using parallel computers, and the ability to achieve joint state and parameter estimation (JSPE) given minimal prior knowledge is tested. A parallelized sequential importance sampling/resampling algorithm was implemented and its scalability for the pure SE problem for a non-linear five-dimensional ODE model of the cardiovascular system evaluated on a Cray XT3 using up to 1,024 cores. JSPE was implemented using a state-augmentation approach with artificial stochastic evolution of the parameters. Its performance when simultaneously estimating the 5 states and 18 unknown parameters, given observations only of arterial pressure, central venous pressure, heart rate, and, optionally, cardiac output, was evaluated in a simulated bleeding/resuscitation scenario. SE was successful and scaled up to 1,024 cores with appropriate algorithm parametrization, with real-time-equivalent performance for up to 10 million particles. JSPE in the described underdetermined scenario achieved excellent reproduction of observables and qualitative tracking of end-diastolic ventricular volumes and sympathetic nervous activity. However, only a subset of the posterior distributions of parameters concentrated around the true values for parts of the estimated trajectories. The performance of parallelized PFs makes their application to complex mathematical models of physiology for clinical data interpretation, prediction, and therapy optimization appear promising. JSPE in the described extremely underdetermined scenario nevertheless extracted information of potential clinical relevance from the data in this simulation setting. However, fully satisfactory resolution of this problem when minimal prior knowledge about parameter values is available will require further methodological improvements, which are discussed.
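    The sequential importance sampling/resampling core of such a particle filter is compact; the toy sketch below (a scalar random-walk state observed in Gaussian noise, illustrative only and far simpler than the cardiovascular ODE model) shows the propagate-weight-resample cycle:

```python
import numpy as np

def particle_filter(observations, n_particles, rng):
    """Bootstrap particle filter: sequential importance sampling/resampling
    for a scalar random-walk state observed with Gaussian noise."""
    process_sd, obs_sd = 0.1, 0.5
    particles = rng.normal(0.0, 1.0, n_particles)      # sample from the prior
    estimates = []
    for y in observations:
        # 1. propagate each particle through the state-evolution model
        particles = particles + rng.normal(0.0, process_sd, n_particles)
        # 2. weight particles by the likelihood of the new observation
        w = np.exp(-0.5 * ((y - particles) / obs_sd) ** 2)
        w /= w.sum()
        estimates.append(np.sum(w * particles))        # posterior-mean estimate
        # 3. resample to concentrate particles where the likelihood is high
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(estimates)

rng = np.random.default_rng(1)
obs = 2.0 + rng.normal(0.0, 0.5, 50)   # static true state, noisy measurements
est = particle_filter(obs, n_particles=2000, rng=rng)  # settles near 2.0
```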

  1. FORTRAN multitasking library for use on the ELXSI 6400 and the CRAY XMP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Montry, G.R.

    1985-07-16

    A library of FORTRAN-based multitasking routines has been written for the ELXSI 6400 and the CRAY XMP. This library is designed to make multitasking codes easily transportable between machines with different hardware configurations. The library provides enhanced error checking and diagnostics over vendor-supplied multitasking intrinsics. The library also contains multitasking control structures not normally supplied by the vendor.

  2. Integrated microfluidic devices for combinatorial cell-based assays.

    PubMed

    Yu, Zeta Tak For; Kamei, Ken-ichiro; Takahashi, Hiroko; Shu, Chengyi Jenny; Wang, Xiaopu; He, George Wenfu; Silverman, Robert; Radu, Caius G; Witte, Owen N; Lee, Ki-Bum; Tseng, Hsian-Rong

    2009-06-01

    The development of miniaturized cell culture platforms for performing parallel cultures and combinatorial assays is important in cell biology from the single-cell level to the system level. In this paper we developed an integrated microfluidic cell-culture platform, the Cell-microChip, for parallel analyses of the effects of microenvironmental cues (i.e., culture scaffolds) on different mammalian cells and their cellular responses to external stimuli. As a model study, we demonstrated the ability to culture and assay several mammalian cells, such as NIH 3T3 fibroblast, B16 melanoma and HeLa cell lines, in a parallel way. For functional assays, we first tested drug-induced apoptotic responses from different cell lines. As a second functional assay, we performed "on-chip" transfection of a reporter gene encoding an enhanced green fluorescent protein (EGFP) followed by live-cell imaging of transcriptional activation of cyclooxygenase 2 (Cox-2) expression. Collectively, our Cell-microChip approach demonstrated the capability to carry out parallel operations and the potential to further integrate advanced functions and applications in the broader space of combinatorial chemistry and biology.

  4. Introduction of Automation for the Production of Bilingual, Parallel-Aligned Text

    DTIC Science & Technology

    2011-10-01

  5. 3D gain modeling of LMJ and NIF amplifiers

    NASA Astrophysics Data System (ADS)

    LeTouze, Geoffroy; Cabourdin, Olivier; Mengue, J. F.; Guenet, Mireille; Grebot, Eric; Seznec, Stephane E.; Jancaitis, Kenneth S.; Marshall, Christopher D.; Zapata, Luis E.; Erlandson, A. E.

    1999-07-01

    A 3D ray-trace model has been developed to predict the performance of flashlamp-pumped laser amplifiers. The computer program, written in C++, includes a graphical display option using the Open Inventor library, as well as a parser and a loader allowing the user to easily model complex multi-segment amplifier systems. It runs both on a workstation cluster at LLNL and on the Cray T3E at CEA. We will discuss how we have reduced the required computation time without changing precision by optimizing the parameters that set the discretization level of the calculation. As an example, the sample of calculation points is chosen to fit the pumping profile through the thickness of the amplifier slabs. We will show the difference in pump rates between our latest model and those produced by our earlier 2.5D code AmpModel. We will also present the results of calculations which model surfaces and other 3D effects, such as top and bottom reflector positions and reflectivity, which could not be included in the 2.5D model. This new computer model also includes a full 3D calculation of the amplified spontaneous emission rate in the laser slab, as opposed to the 2.5D model, which tracked only the variation in the gain across the transverse dimensions of the slab. We will present the impact of this evolution of the model on the predicted stimulated decay rate and the resulting gain distribution. Comparison with the most recent AmpLab experimental results will be presented, in the different typical NIF and LMJ configurations.

  6. Multitasking 3-D forward modeling using high-order finite difference methods on the Cray X-MP/416

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Terki-Hassaine, O.; Leiss, E.L.

    1988-01-01

    The CRAY X-MP/416 was used to multitask 3-D forward modeling by the high-order finite difference method. Flowtrace analysis reveals that the most expensive operation in the unitasked program is a matrix-vector multiplication. The in-core and out-of-core versions of a reentrant subroutine can perform any fraction of the matrix-vector multiplication independently, a pattern compatible with multitasking. The matrix-vector multiplication routine can be distributed over two to four processors. The rest of the program utilizes the microtasking feature that lets the system treat independent iterations of DO-loops as subtasks to be performed by any available processor. The availability of the Solid-State Storage Device (SSD) meant the I/O wait time was virtually zero. A performance study determined a theoretical speedup, taking into account the multitasking overhead. Multitasking programs utilizing both macrotasking and microtasking features obtained actual speedups that were approximately 80% of the ideal speedup.

  7. Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Secchi, Simone; Tumeo, Antonino; Villa, Oreste

    Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy in reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.

  8. Fine Structure in Helium-like Fluorine by Fast-Beam Laser Spectroscopy

    NASA Astrophysics Data System (ADS)

    Myers, E. G.; Thompson, J. K.; Silver, J. D.

    1998-05-01

    With the aim of providing an additional precise test of higher-order corrections to high-precision calculations of fine structure in helium and helium-like ions (T. Zhang, Z.-C. Yan and G.W.F. Drake, Phys. Rev. Lett. 77, 1715 (1996)), a measurement of the 2^3P_2,F - 2^3P_1,F' fine structure in ^19F^7+ is in progress. The method involves Doppler-tuned laser spectroscopy using a CO2 laser on a foil-stripped fluorine ion beam. We aim to achieve a higher precision, compared to an earlier measurement (E.G. Myers, P. Kuske, H.J. Andrae, I.A. Armour, H.A. Klein, J.D. Silver, and E. Traebert, Phys. Rev. Lett. 47, 87 (1981)), by using laser beams parallel and anti-parallel to the ion beam to obtain partial cancellation of the Doppler shift (J.K. Thompson, D.J.H. Howie and E.G. Myers, Phys. Rev. A 57, 180 (1998)). A calculation of the hyperfine structure, allowing for relativistic, QED and nuclear-size effects, will be required to obtain the ``hyperfine-free'' fine structure interval from the measurements.
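The parallel/anti-parallel geometry cancels the Doppler shift because the relativistic factors γ(1 − β) and γ(1 + β) multiply to exactly one, so the geometric mean of the two resonant lab frequencies recovers the rest-frame interval. A sketch with hypothetical numbers (nu_rest and beta are illustrative, not experimental values):

```python
import math

def doppler_shifted(nu_lab, beta, co_propagating):
    """Frequency seen in the ion rest frame for a lab-frame laser of
    frequency nu_lab and ion speed beta = v/c. A co-propagating beam is
    red-shifted in the ion frame; a counter-propagating one blue-shifted."""
    gamma = 1.0 / math.sqrt(1.0 - beta ** 2)
    factor = (1.0 - beta) if co_propagating else (1.0 + beta)
    return nu_lab * gamma * factor

nu_rest = 1.0e13   # hypothetical rest-frame transition frequency (Hz)
beta = 0.05        # hypothetical beam velocity, a few percent of c
gamma = 1.0 / math.sqrt(1.0 - beta ** 2)

# Lab frequencies that are resonant in the two geometries:
nu_par = nu_rest / (gamma * (1.0 - beta))    # laser parallel to beam
nu_anti = nu_rest / (gamma * (1.0 + beta))   # laser anti-parallel to beam

recovered = math.sqrt(nu_par * nu_anti)      # Doppler factors cancel exactly
```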

  9. Study of near SOL decay lengths in ASDEX Upgrade under attached and detached divertor conditions

    NASA Astrophysics Data System (ADS)

    Sun, H. J.; Wolfrum, E.; Kurzan, B.; Eich, T.; Lackner, K.; Scarabosio, A.; Paradela Pérez, I.; Kardaun, O.; Faitsch, M.; Potzel, S.; Stroth, U.; the ASDEX Upgrade Team

    2017-10-01

    A database with attached, partially detached and completely detached divertors has been constructed from ASDEX Upgrade discharges in both H-mode and L-mode plasmas with Thomson scattering data suitable for the analysis of the upstream SOL electron profiles. By comparing the upstream temperature decay width, λ_Te,u, with the scaling of the SOL power decay width, λ_q∥e, based on downstream IR measurements, it is found that a simple relation based on classical electron conduction relates λ_Te,u and λ_q∥e well. The combined dataset can be described both by a single scaling and by separate scalings for H-modes and L-modes. For the single scaling, a strong inverse dependence of λ_Te,u on the separatrix temperature, T_e,u, is found, suggesting classical parallel Spitzer-Härm conductivity as the dominant mechanism controlling the SOL width in both L-mode and H-mode over a large set of plasma parameters. This dependence on T_e,u explains why, for the same global plasma parameters, λ_q∥e in L-mode is approximately twice that in H-mode, and why, under detached conditions, the SOL upstream electron profile broadens when the density reaches a critical value. Comparing the scaling derived from experimental data with power balance gives the cross-field thermal diffusivity as χ_⊥ ∝ T_e^{1/2}/n_e, consistent with earlier studies on COMPASS-D, JET and Alcator C-Mod. However, the possibility of separate scalings for the different regimes cannot be excluded; these give results similar to those previously reported for H-mode, while the wider SOL width in L-mode plasmas is explained simply by a larger premultiplying coefficient. The relative merits of the two scalings in representing the data and their theoretical implications are discussed.
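The conduction-based link between the temperature and power decay widths can be illustrated numerically: if classical Spitzer-Härm conduction carries the parallel power, q∥ scales as T^{7/2}, so an exponential upstream temperature profile implies λ_q = (2/7)·λ_T. A sketch with an arbitrary decay length, not a value from the database:

```python
import math

lam_T = 10.0   # hypothetical upstream Te decay length (mm)

# q_par ∝ T**(7/2), so T(r) = T0 * exp(-r / lam_T) gives
# q_par(r) ∝ exp(-(7/2) * r / lam_T).
r1, r2 = 0.0, 5.0
q1 = math.exp(-r1 / lam_T) ** 3.5
q2 = math.exp(-r2 / lam_T) ** 3.5

# Recover the q decay length from two radial samples:
lam_q = (r2 - r1) / math.log(q1 / q2)   # equals (2/7) * lam_T
```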

  10. Electron pressure balance in the SOL through the transition to detachment

    DOE PAGES

    McLean, A. G.; Leonard, A. W.; Makowski, M. A.; ...

    2015-02-07

    Upgrades to core and divertor Thomson scattering (DTS) diagnostics at DIII-D have provided measurements of electron pressure profiles in the lower divertor from attached to fully detached divertor plasma conditions. Detailed, multistep sequences of discharges with increasing line-averaged density were run at several levels of P_inj. Strike point sweeping allowed 2D divertor characterization using DTS optimized to measure T_e down to 0.5 eV. The ionization front at the onset of detachment is found to move upwards in a controlled manner consistent with the indication that scrape-off layer parallel power flux is converted from conducted to convective heat transport. Measurements of n_e, T_e and p_e in the divertor versus L_∥ demonstrate a rapid transition from T_e ≥ 15 eV to ≤ 3 eV occurring both at the outer strike point and upstream of the X-point. Furthermore, these observations provide a strong benchmark for ongoing modeling of divertor detachment for existing and future tokamak devices.

  11. Investigating the impact of the cielo cray XE6 architecture on scientific application codes.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rajan, Mahesh; Barrett, Richard; Pedretti, Kevin Thomas Tauke

    2010-12-01

    Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.

  12. Transport in a toroidally confined pure electron plasma

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Crooks, S.M.; O'Neil, T.M.

    1996-07-01

    O'Neil and Smith [T.M. O'Neil and R.A. Smith, Phys. Plasmas 1, 8 (1994)] have argued that a pure electron plasma can be confined stably in a toroidal magnetic field configuration. This paper shows that the toroidal curvature of the magnetic field of necessity causes slow cross-field transport. The transport mechanism is similar to magnetic pumping and may be understood by considering a single flux tube of plasma. As the flux tube of plasma undergoes poloidal E×B drift rotation about the center of the plasma, the length of the flux tube and the magnetic field strength within it oscillate, and this produces corresponding oscillations in T_∥ and T_⊥. The collisional relaxation of T_∥ toward T_⊥ produces a slow dissipation of electrostatic energy into heat and a consequent expansion (cross-field transport) of the plasma. In the limit where the cross section of the plasma is nearly circular, the radial particle flux is given by Γ_r = (1/2) ν_{⊥,∥} T (r/ρ_0)^2 n / (−e ∂Φ/∂r), where ν_{⊥,∥} is the collisional equipartition rate, ρ_0 is the major radius at the center of the plasma, and r is the minor radius measured from the center of the plasma. The transport flux is first calculated using this simple physical picture and then by solving the drift-kinetic Boltzmann equation. This latter calculation is not limited to a plasma with a circular cross section. © 1996 American Institute of Physics.

  13. A molecular dynamics implementation of the 3D Mercedes-Benz water model

    NASA Astrophysics Data System (ADS)

    Hynninen, T.; Dias, C. L.; Mkrtchyan, A.; Heinonen, V.; Karttunen, M.; Foster, A. S.; Ala-Nissila, T.

    2012-02-01

    The three-dimensional Mercedes-Benz model was recently introduced to account for the structural and thermodynamic properties of water. It treats water molecules as point-like particles with four dangling bonds in tetrahedral coordination, representing the H-bonds of water. Its conceptual simplicity renders the model attractive in studies where complex behaviors emerge from H-bond interactions in water, e.g., the hydrophobic effect. A molecular dynamics (MD) implementation of the model is non-trivial, and we outline here the mathematical framework of its force field. Useful routines written in modern Fortran are also provided. This open source code is free and can easily be modified to account for different physical contexts. The provided code allows both serial and MPI-parallelized execution. Program summary: Program title: CASHEW (Coarse Approach Simulator for Hydrogen-bonding Effects in Water) Catalogue identifier: AEKM_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEKM_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 20 501 No. of bytes in distributed program, including test data, etc.: 551 044 Distribution format: tar.gz Programming language: Fortran 90 Computer: Program has been tested on desktop workstations and a Cray XT4/XT5 supercomputer. Operating system: Linux, Unix, OS X Has the code been vectorized or parallelized?: The code has been parallelized using MPI. RAM: Depends on size of system, about 5 MB for 1500 molecules. Classification: 7.7 External routines: A random number generator, Mersenne Twister (http://www.math.sci.hiroshima-u.ac.jp/m-mat/MT/VERSIONS/FORTRAN/mt95.f90), is used. A copy of the code is included in the distribution. Nature of problem: Molecular dynamics simulation of a new geometric water model.
    Solution method: New force field for water molecules, velocity-Verlet integration, representation of molecules as rigid particles with rotations described using quaternion algebra. Restrictions: Memory and CPU time limit the size of simulations. Additional comments: Software web site: https://gitorious.org/cashew/. Running time: Depends on the size of the system. The sample tests provided take only a few seconds.
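The velocity-Verlet scheme named in the solution method is standard; the sketch below applies it to a 1-D harmonic oscillator (not the Mercedes-Benz force field, and without the rigid-body/quaternion machinery) to show the update and its good energy conservation:

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Velocity-Verlet integration: positions advance with the old
    acceleration, velocities with the average of old and new."""
    a = force(x) / mass
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt
        a_new = force(x) / mass
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
    return x, v

# Harmonic oscillator with k = m = 1, started at x = 1 with v = 0.
k, m = 1.0, 1.0
x, v = velocity_verlet(1.0, 0.0, lambda q: -k * q, m, 0.01, 1000)
energy = 0.5 * m * v * v + 0.5 * k * x * x   # stays near the initial 0.5
```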

  14. Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

    PubMed Central

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

    2012-01-01

    We present ℓ1-SPIRiT, a simple algorithm for autocalibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
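The workhorse inside iterative soft-thresholding is the proximal operator of the ℓ1 norm; applied to complex MRI coefficients, it shrinks magnitudes and preserves phase. A minimal NumPy sketch, not the ℓ1-SPIRiT implementation itself:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal map of lam * ||x||_1, elementwise on complex data:
    shrink each magnitude by lam (floored at zero), keep the phase."""
    mag = np.abs(x)
    scale = np.maximum(mag - lam, 0.0) / np.where(mag > 0, mag, 1.0)
    return x * scale

coeffs = np.array([3.0, -0.5, 0.0, 1.5 + 2.0j])
shrunk = soft_threshold(coeffs, 1.0)
# -> magnitudes 3, 0.5, 0, 2.5 become 2, 0, 0, 1.5; phases unchanged
```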

  15. CDC to CRAY FORTRAN conversion manual

    NASA Technical Reports Server (NTRS)

    Mcgary, C.; Diebert, D.

    1983-01-01

    Documentation describing software differences between two general purpose computers for scientific applications is presented. Descriptions of the use of the FORTRAN and FORTRAN 77 high level programming language on a CDC 7600 under SCOPE and a CRAY XMP under COS are offered. Itemized differences of the FORTRAN language sets of the two machines are also included. The material is accompanied by numerous examples of preferred programming techniques for the two machines.

  16. RPYFMM: Parallel adaptive fast multipole method for Rotne-Prager-Yamakawa tensor in biomolecular hydrodynamics simulations

    NASA Astrophysics Data System (ADS)

    Guan, W.; Cheng, X.; Huang, J.; Huber, G.; Li, W.; McCammon, J. A.; Zhang, B.

    2018-06-01

    RPYFMM is a software package for the efficient evaluation of the potential field governed by the Rotne-Prager-Yamakawa (RPY) tensor interactions in biomolecular hydrodynamics simulations. In our algorithm, the RPY tensor is decomposed as a linear combination of four Laplace interactions, each of which is evaluated using the adaptive fast multipole method (FMM) (Greengard and Rokhlin, 1997) where the exponential expansions are applied to diagonalize the multipole-to-local translation operators. RPYFMM offers a unified execution on both shared and distributed memory computers by leveraging the DASHMM library (DeBuhr et al., 2016, 2018). Preliminary numerical results show that the interactions for a molecular system of 15 million particles (beads) can be computed within one second on a Cray XC30 cluster using 12,288 cores, while achieving approximately 54% strong-scaling efficiency.
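For context, the quantity RPYFMM accelerates is the dense pairwise RPY sum. Below is a direct O(N²) evaluation of the standard RPY mobility for non-overlapping beads, a sketch of the textbook form rather than RPYFMM's Laplace decomposition; the bead radius and viscosity values are illustrative:

```python
import numpy as np

def rpy_velocities(pos, forces, a=1.0, eta=1.0):
    """Direct O(N^2) RPY sum for beads of radius a in a solvent of
    viscosity eta, valid for pair separations r >= 2a. The FMM replaces
    exactly this quadratic-cost double loop with a near-linear one."""
    n = len(pos)
    vel = forces / (6.0 * np.pi * eta * a)          # self-mobility term
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = pos[i] - pos[j]
            d = np.linalg.norm(r)
            rhat = r / d
            c1 = 1.0 + 2.0 * a * a / (3.0 * d * d)
            c2 = 1.0 - 2.0 * a * a / (d * d)
            f = forces[j]
            vel[i] += (c1 * f + c2 * rhat * np.dot(rhat, f)) / (8.0 * np.pi * eta * d)
    return vel

# Two beads on the x-axis, unit force on the first one.
pos = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
frc = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
vel = rpy_velocities(pos, frc)
```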

  17. Global Adjoint Tomography: Combining Big Data with HPC Simulations

    NASA Astrophysics Data System (ADS)

    Bozdag, E.; Lefebvre, M. P.; Lei, W.; Peter, D. B.; Smith, J. A.; Komatitsch, D.; Tromp, J.

    2014-12-01

    The steady increase in data quality and in the number of global seismographic stations has substantially increased the amount of data available for the construction of Earth models. Meanwhile, developments in the theory of wave propagation, numerical methods and HPC systems have enabled unprecedented simulations of seismic wave propagation in realistic 3D Earth models, leading to the extraction of more information from data and ultimately culminating in the use of entire three-component seismograms. Our aim is to take adjoint tomography further to image the entire planet, one of the extreme cases in seismology due to its intense computational requirements and the vast amount of high-quality seismic data that can potentially be assimilated in inversions. We have started low-resolution (T > 27 s, soon to be > 17 s) global inversions with 253 earthquakes for a transversely isotropic crust and mantle model on Oak Ridge National Laboratory's Cray XK7 "Titan" system. Recent improvements in our 3D solvers, such as the GPU version of the SPECFEM3D_GLOBE package, will allow us to perform higher-resolution (T > 9 s) and longer-duration (~180 m) simulations to take advantage of high-frequency body waves and major-arc surface waves to improve the imbalanced ray coverage resulting from the uneven distribution of sources and receivers on the globe. Our initial results after 10 iterations already indicate several prominent features reported in high-resolution continental studies, such as major slabs (Hellenic, Japan, Bismarck, Sandwich, etc.) and enhancement in plume structures (the Pacific superplume, the Hawaii hot spot, etc.). Our ultimate goal is to assimilate seismic data from more than 6,000 earthquakes within the magnitude range 5.5 ≤ Mw ≤ 7.0. 
    To take full advantage of this data set on ORNL's computational resources, we need a solid framework for managing big data sets during pre-processing (e.g., data requests and quality checks), gradient calculations, and post-processing (e.g., pre-conditioning and smoothing gradients); we address these bottlenecks in our global seismic workflow with a framework based on ORNL's ADIOS libraries. We will present our "first generation" model and discuss challenges and future directions in global seismology.

  18. Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chin, George; Marquez, Andres; Choudhury, Sutanay

    2012-09-01

    Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code's data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
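The enumerate-and-classify skeleton of a triad census can be sketched in a few lines. The toy below bins each 3-node subgraph by its directed-edge count (0-6) rather than the full 16 configurations a real census algorithm counts; the example graph is invented:

```python
from itertools import combinations

def edge_count_census(nodes, edges):
    """Coarse triad census: for every unordered node triple, count the
    directed edges among the three nodes and tally by that count."""
    eset = set(edges)
    census = [0] * 7   # a triple can contain 0..6 directed edges
    for a, b, c in combinations(sorted(nodes), 3):
        k = sum((u, v) in eset
                for u in (a, b, c) for v in (a, b, c) if u != v)
        census[k] += 1
    return census

nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (1, 4)]
census = edge_count_census(nodes, edges)
# -> [0, 1, 2, 1, 0, 0, 0]: one 1-edge, two 2-edge, one 3-edge triad
```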

  19. Application of a hybrid MPI/OpenMP approach for parallel groundwater model calibration using multi-core computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tang, Guoping; D'Azevedo, Ed F; Zhang, Fan

    2010-01-01

    Calibration of groundwater models involves hundreds to thousands of forward solutions, each of which may solve many transient coupled nonlinear partial differential equations, resulting in a computationally intensive problem. We describe a hybrid MPI/OpenMP approach to exploit two levels of parallelism in software and hardware to reduce calibration time on multi-core computers. HydroGeoChem 5.0 (HGC5) is parallelized using OpenMP for direct solutions for a reactive transport model application and a field-scale coupled flow and transport model application. In the reactive transport model, a single parallelizable loop is identified, using GPROF, to account for over 97% of the total computational time. Addition of a few lines of OpenMP compiler directives to the loop yields a speedup of about 10 on a 16-core compute node. For the field-scale model, parallelizable loops in 14 of 174 HGC5 subroutines that require 99% of the execution time are identified. As these loops are parallelized incrementally, the scalability is found to be limited by a loop where Cray PAT detects a cache miss rate of over 90%. With this loop rewritten, a speedup similar to that of the first application is achieved. The OpenMP-parallelized code can be run efficiently on multiple workstations in a network or on multiple compute nodes of a cluster as slaves using parallel PEST to speed up model calibration. To run calibration on clusters as a single task, the Levenberg-Marquardt algorithm is added to HGC5 with the Jacobian calculation and lambda search parallelized using MPI. With this hybrid approach, 100-200 compute cores are used to reduce the calibration time from weeks to a few hours for these two applications. This approach is applicable to most existing groundwater model codes for many applications.

  20. Ontogeny of surface markers on functionally distinct T cell subsets in the chicken.

    PubMed

    Traill, K N; Böck, G; Boyd, R L; Ratheiser, K; Wick, G

    1984-01-01

    Three subsets of chicken peripheral T cells (T1, T2 and T3) have been identified in peripheral blood of adult chickens on the basis of fluorescence intensity after staining with certain xenogeneic anti-thymus cell sera (from turkeys and rabbits). They differentiate between 3-10 weeks of age in parallel with development of responsiveness to the mitogens concanavalin A (Con A), phytohemagglutinin (PHA) and pokeweed mitogen (PWM). Functional tests on the T subsets, sorted with a fluorescence-activated cell sorter, have shown that T2, 3 cells respond to Con A, PHA and PWM and are capable of eliciting a graft-vs.-host reaction (GvHR). In contrast, although T1 cells respond to Con A, they respond poorly to PHA and not at all to PWM or in GvHR. There was some indication of cooperation between T1 and T2,3 cells for the PHA response. Parallels between these chicken subsets and helper and suppressor/cytotoxic subsets in mammalian systems are discussed.

  1. LANZ: Software solving the large sparse symmetric generalized eigenproblem

    NASA Technical Reports Server (NTRS)

    Jones, Mark T.; Patrick, Merrell L.

    1990-01-01

    A package, LANZ, for solving the large sparse symmetric generalized eigenproblem is described. The package was tested on four different architectures: Convex 200, CRAY Y-MP, Sun-3, and Sun-4. The package uses a Lanczos method and is based on recent research into solving the generalized eigenproblem.
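The problem class LANZ addresses, K x = λ M x with symmetric sparse K and M, can be sketched with SciPy's Lanczos-based `eigsh`. The 1-D finite-element stiffness/mass pair below is an illustrative stand-in, not a LANZ test case:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

n = 200
# 1-D FEM Laplacian: stiffness K = tridiag(-1, 2, -1) and
# consistent mass M = tridiag(1/6, 4/6, 1/6), both symmetric positive definite.
K = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = diags([1.0 / 6.0, 4.0 / 6.0, 1.0 / 6.0], [-1, 0, 1], shape=(n, n), format="csc")

# Shift-invert Lanczos around sigma = 0 returns the smallest eigenpairs.
vals, vecs = eigsh(K, k=3, M=M, sigma=0.0)
```

The smallest generalized eigenvalue has the closed form 3(2 − 2 cos θ)/(2 + cos θ) with θ = π/(n + 1), which makes the sketch easy to verify.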

  2. Achieving High Performance on the i860 Microprocessor

    NASA Technical Reports Server (NTRS)

    Lee, King; Kutler, Paul (Technical Monitor)

    1998-01-01

    The i860 is a high performance microprocessor used in the Intel Touchstone project. This paper proposes a paradigm for programming the i860 that is modelled on the vector instructions of the Cray computers. Fortran callable assembler subroutines were written that mimic the concurrent vector instructions of the Cray. Cache takes the place of vector registers. Using this paradigm we have achieved twice the performance of compiled code on a traditional solve.

  3. Production Experiences with the Cray-Enabled TORQUE Resource Manager

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ezell, Matthew A; Maxwell, Don E; Beer, David

    High performance computing resources utilize batch systems to manage the user workload. Cray systems are uniquely different from typical clusters due to Cray's Application Level Placement Scheduler (ALPS). ALPS manages binary transfer, job launch and monitoring, and error handling. Batch systems require special support to integrate with ALPS using an XML protocol called BASIL. Previous versions of Adaptive Computing's TORQUE and Moab batch suite integrated with ALPS from within Moab, using Perl scripts to interface with BASIL. This would occasionally lead to problems when the components became unsynchronized. Version 4.1 of the TORQUE Resource Manager introduced new features that allow it to directly integrate with ALPS using BASIL. This paper describes production experiences at Oak Ridge National Laboratory using the new TORQUE software versions, as well as ongoing and future work to improve TORQUE.

  4. Numerical results on the transcendence of constants involving pi, e, and Euler's constant

    NASA Technical Reports Server (NTRS)

    Bailey, David H.

    1988-01-01

    The existence of simple polynomial equations (integer relations) for the constants e/pi, e + pi, log pi, gamma (Euler's constant), e exp gamma, gamma/e, gamma/pi, and log gamma is investigated by means of numerical computations. The recursive form of the Ferguson-Forcade algorithm (Ferguson and Forcade, 1979; Ferguson, 1986 and 1987) is implemented on the Cray-2 supercomputer at NASA Ames, applying multiprecision techniques similar to those described by Bailey (1988), except that FFTs are used instead of dual-prime-modulus transforms for multiplication. It is shown that none of the constants satisfies an integer relation of degree eight or less with coefficients of Euclidean norm 10^9 or less.
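An integer relation of degree d for a constant x is a nonzero integer vector (a_0, ..., a_d) with a_0 + a_1·x + ... + a_d·x^d = 0. The toy brute-force check below conveys the idea at tiny degree and coefficient bounds; it is nowhere near the recursive algorithm and multiprecision arithmetic described above, and the tolerance and bounds are illustrative:

```python
import math
from itertools import product

def has_small_relation(x, degree, max_coeff, tol=1e-9):
    """Exhaustively test every nonzero coefficient vector with
    |a_i| <= max_coeff for a_0 + a_1*x + ... + a_d*x**d ≈ 0."""
    powers = [x ** k for k in range(degree + 1)]
    for coeffs in product(range(-max_coeff, max_coeff + 1), repeat=degree + 1):
        if any(coeffs) and abs(sum(a * p for a, p in zip(coeffs, powers))) < tol:
            return True
    return False

phi = (1.0 + math.sqrt(5.0)) / 2.0            # satisfies x**2 - x - 1 = 0
found_phi = has_small_relation(phi, 2, 2)     # True: the known relation
found_e_pi = has_small_relation(math.e / math.pi, 2, 8)   # none at these bounds
```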

  5. Coupled Monte Carlo Probability Density Function/ SPRAY/CFD Code Developed for Modeling Gas-Turbine Combustor Flows

    NASA Technical Reports Server (NTRS)

    1995-01-01

    The success of any solution methodology for studying gas-turbine combustor flows depends a great deal on how well it can model various complex, rate-controlling processes associated with turbulent transport, mixing, chemical kinetics, evaporation and spreading rates of the spray, convective and radiative heat transfer, and other phenomena. These phenomena often strongly interact with each other at disparate time and length scales. In particular, turbulence plays an important role in determining the rates of mass and heat transfer, chemical reactions, and evaporation in many practical combustion devices. Turbulence manifests its influence in a diffusion flame in several forms depending on how turbulence interacts with various flame scales. These forms range from the so-called wrinkled, or stretched, flamelets regime, to the distributed combustion regime. Conventional turbulence closure models have difficulty in treating highly nonlinear reaction rates. A solution procedure based on the joint composition probability density function (PDF) approach holds the promise of modeling various important combustion phenomena relevant to practical combustion devices such as extinction, blowoff limits, and emissions predictions because it can handle the nonlinear chemical reaction rates without any approximation. In this approach, mean and turbulence gas-phase velocity fields are determined from a standard turbulence model; the joint composition field of species and enthalpy is determined from the solution of a modeled PDF transport equation; and a Lagrangian-based dilute spray model is used for the liquid-phase representation with appropriate consideration of the exchanges of mass, momentum, and energy between the two phases. The PDF transport equation is solved by a Monte Carlo method, and existing state-of-the-art numerical representations are used to solve the mean gas-phase velocity and turbulence fields together with the liquid-phase equations. 
The joint composition PDF approach was extended in our previous work to the study of compressible reacting flows. The application of this method to several supersonic diffusion flames associated with scramjet combustor flow fields provided favorable comparisons with the available experimental data. A further extension of this approach to spray flames, three-dimensional computations, and parallel computing was reported in a recent paper. The recently developed PDF/SPRAY/computational fluid dynamics (CFD) module combines the novelty of the joint composition PDF approach with the ability to run on parallel architectures. This algorithm was implemented on the NASA Lewis Research Center's Cray T3D, a massively parallel computer with an aggregate of 64 processor elements. The calculation procedure was applied to predict the flow properties of both open and confined swirl-stabilized spray flames.

  6. Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

    DOE PAGES

    Shao, Meiyue; Aktulga, H.  Metin; Yang, Chao; ...

    2017-09-14

    In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of the special structure of the nuclear configuration interaction problem, which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos-based algorithm for problems of moderate size on a Cray XC30 system.
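The two ingredients credited for the rapid convergence, a block of starting vectors plus an effective preconditioner, can be sketched with SciPy's LOBPCG. The nearly diagonal test matrix and Jacobi preconditioner below are illustrative stand-ins, not the nuclear CI Hamiltonian:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

n = 500
main = np.arange(1, n + 1, dtype=float)      # well-separated diagonal
off = 0.1 * np.ones(n - 1)                   # weak off-diagonal coupling
A = diags([off, main, off], [-1, 0, 1], format="csr")

rng = np.random.default_rng(0)
X = rng.standard_normal((n, 4))              # block of 4 starting vectors
M = diags([1.0 / main], [0], format="csr")   # Jacobi preconditioner ~ diag(A)^-1

vals, vecs = lobpcg(A, X, M=M, largest=False, tol=1e-8, maxiter=200)
# The four smallest eigenvalues land near 1, 2, 3, 4.
```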

  8. Computational Fluid Dynamic Solutions of Optimized Heat Shields Designed for Earth Entry

    DTIC Science & Technology

    2010-01-01

    [Garbled front-matter excerpt; recoverable nomenclature: ρ = density (kg/m³), σ = Stefan-Boltzmann constant, τ = shear stress tensor, τ_T-V = T-V relaxation time, τ_e-V = e-V relaxation time, φ = sweep angle; abbreviations: DES = differential evolutionary scheme, DOR = design optimization tools, DPLR = Data Parallel Line Relaxation, GSLR = Gauss-Seidel line relaxation.] The radiation model based on the Stefan-Boltzmann constant provides accurate heating predictions, especially for the non-ablating heat shields explored in this work.

  9. A performance comparison of scalar, vector, and concurrent vector computers including supercomputers for modeling transport of reactive contaminants in groundwater

    NASA Astrophysics Data System (ADS)

    Tripathi, Vijay S.; Yeh, G. T.

    1993-06-01

    Sophisticated and highly computation-intensive models of transport of reactive contaminants in groundwater have been developed in recent years. Application of such models to real-world contaminant transport problems, e.g., simulation of groundwater transport of 10-15 chemically reactive elements (e.g., toxic metals) and relevant complexes and minerals in two and three dimensions over a distance of several hundred meters, requires high-performance computers including supercomputers. Although not widely recognized as such, the computational complexity and demand of these models compare with well-known computation-intensive applications including weather forecasting and quantum chemical calculations. A survey of the performance of a variety of available hardware, as measured by the run times for a reactive transport model HYDROGEOCHEM, showed that while supercomputers provide the fastest execution times for such problems, relatively low-cost reduced instruction set computer (RISC) based scalar computers provide the best performance-to-price ratio. Because supercomputers like the Cray X-MP are inherently multiuser resources, often the RISC computers also provide much better turnaround times. Furthermore, RISC-based workstations provide the best platforms for "visualization" of groundwater flow and contaminant plumes. The most notable result, however, is that current workstations costing less than $10,000 provide performance within a factor of 5 of a Cray X-MP.

  10. SNS programming environment user's guide

    NASA Technical Reports Server (NTRS)

    Tennille, Geoffrey M.; Howser, Lona M.; Humes, D. Creig; Cronin, Catherine K.; Bowen, John T.; Drozdowski, Joseph M.; Utley, Judith A.; Flynn, Theresa M.; Austin, Brenda A.

    1992-01-01

    The computing environment of the Supercomputing Network Subsystem (SNS) of the Central Scientific Computing Complex at NASA Langley is briefly described. The major SNS computers are a CRAY-2, a CRAY Y-MP, a CONVEX C-210, and a CONVEX C-220. The software common to all of these computers is described, including the UNIX operating system, computer graphics, networking utilities, mass storage, and mathematical libraries. Also described are file management, validation, SNS configuration, documentation, and customer services.

  11. Job Complexity and Cognitive Performance in Bank Personnel

    DTIC Science & Technology

    1984-09-01

    Hemispheric differences in serial versus parallel processing. Journal of Experimental Psychology 97(3):349-356, 1973. Dumas, R. and Morgan, A. EEG ... Bolling Air Force Base, Washington, D.C. 20332 ... AFHRL/LRS TDC, Attn: Susan Ewing, Wright-Patterson AFB, OH 45433 ... Chief, Systems Engineering

  12. ERDC MSRC (Major Shared Resource Center) Resource. Spring 2008

    DTIC Science & Technology

    2008-01-01

    obtained from ADCIRC results. The alpha test was performed on the Cray XT3 machine (Sapphire) at ERDC and the IBM P575+ system (Babbage) at the... Scotty Swillie (center) and Charles Ray (far right) were part of the team that constructed the DoD HPCMP booth for the Conference (From

  13. Performance of fully-coupled algebraic multigrid preconditioners for large-scale VMS resistive MHD

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, P. T.; Shadid, J. N.; Hu, J. J.

    Here, we explore the current performance and scaling of a fully-implicit stabilized unstructured finite element (FE) variational multiscale (VMS) capability for large-scale simulations of 3D incompressible resistive magnetohydrodynamics (MHD). The large-scale linear systems that are generated by a Newton nonlinear solver approach are iteratively solved by preconditioned Krylov subspace methods. The efficiency of this approach is critically dependent on the scalability and performance of the algebraic multigrid preconditioner. Our study considers the performance of the numerical methods as recently implemented in the second-generation Trilinos implementation that is 64-bit compliant and is not limited by the 32-bit global identifiers of the original Epetra-based Trilinos. The study presents representative results for a Poisson problem on 1.6 million cores of an IBM Blue Gene/Q platform to demonstrate very large-scale parallel execution. Additionally, results for a more challenging steady-state MHD generator and a transient solution of a benchmark MHD turbulence calculation for the full resistive MHD system are also presented. These results are obtained on up to 131,000 cores of a Cray XC40 and one million cores of a BG/Q system.

  14. Performance of fully-coupled algebraic multigrid preconditioners for large-scale VMS resistive MHD

    DOE PAGES

    Lin, P. T.; Shadid, J. N.; Hu, J. J.; ...

    2017-11-06

    Here, we explore the current performance and scaling of a fully-implicit stabilized unstructured finite element (FE) variational multiscale (VMS) capability for large-scale simulations of 3D incompressible resistive magnetohydrodynamics (MHD). The large-scale linear systems that are generated by a Newton nonlinear solver approach are iteratively solved by preconditioned Krylov subspace methods. The efficiency of this approach is critically dependent on the scalability and performance of the algebraic multigrid preconditioner. Our study considers the performance of the numerical methods as recently implemented in the second-generation Trilinos implementation that is 64-bit compliant and is not limited by the 32-bit global identifiers of the original Epetra-based Trilinos. The study presents representative results for a Poisson problem on 1.6 million cores of an IBM Blue Gene/Q platform to demonstrate very large-scale parallel execution. Additionally, results for a more challenging steady-state MHD generator and a transient solution of a benchmark MHD turbulence calculation for the full resistive MHD system are also presented. These results are obtained on up to 131,000 cores of a Cray XC40 and one million cores of a BG/Q system.
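
The solver structure this record describes, Krylov iterations wrapped around a preconditioner, can be illustrated in miniature. The NumPy sketch below is a minimal preconditioned conjugate gradient loop; a Jacobi (diagonal) preconditioner stands in for the algebraic multigrid V-cycle used at scale, and the 1D Poisson test matrix, tolerance, and iteration cap are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive-definite A.
    M_inv applies the preconditioner, i.e., an approximation of A^{-1}."""
    x = np.zeros_like(b)
    r = b - A @ x                 # initial residual
    z = M_inv(r)                  # preconditioned residual
    p = z.copy()                  # initial search direction
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, k + 1
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # conjugate direction update
        rz = rz_new
    return x, max_iter

# 1D Poisson (tridiagonal) matrix as a toy SPD test problem
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
jacobi = lambda r: r / np.diag(A)  # Jacobi preconditioner: divide by the diagonal
x, iters = pcg(A, b, jacobi)
print(np.allclose(A @ x, b))       # True once the residual tolerance is met
```

In production codes like the one studied here, the `M_inv` slot is where a multigrid cycle goes; the Krylov loop itself is unchanged.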

  15. Solar Wind Proton Temperature Anisotropy: Linear Theory and WIND/SWE Observations

    NASA Technical Reports Server (NTRS)

    Hellinger, P.; Travnicek, P.; Kasper, J. C.; Lazarus, A. J.

    2006-01-01

    We present a comparison between WIND/SWE observations (Kasper et al., 2006) of beta parallel to p and T perpendicular to p/T parallel to p (where beta parallel to p is the proton parallel beta, and T perpendicular to p and T parallel to p are the perpendicular and parallel proton temperatures, respectively; here parallel and perpendicular indicate directions with respect to the ambient magnetic field) and predictions of Vlasov linear theory. In the slow solar wind, the observed proton temperature anisotropy seems to be constrained by oblique instabilities, namely the mirror instability and the oblique fire hose, contrary to the results of the linear theory, which predicts a dominance of the proton cyclotron instability and the parallel fire hose. The fast solar wind core protons exhibit an anticorrelation between beta parallel to c and T perpendicular to c/T parallel to c (where beta parallel to c is the core proton parallel beta, and T perpendicular to c and T parallel to c are the perpendicular and parallel core proton temperatures, respectively), similar to that observed in the HELIOS data (Marsch et al., 2004).

  16. GPU acceleration of the Locally Selfconsistent Multiple Scattering code for first principles calculation of the ground state and statistical physics of materials

    NASA Astrophysics Data System (ADS)

    Eisenbach, Markus; Larkin, Jeff; Lutjens, Justin; Rennich, Steven; Rogers, James H.

    2017-02-01

    The Locally Self-consistent Multiple Scattering (LSMS) code solves the first-principles density functional theory Kohn-Sham equation for a wide range of materials, with a special focus on metals, alloys, and metallic nanostructures. It has traditionally exhibited near-perfect scalability on massively parallel high-performance computer architectures. We present our efforts to exploit GPUs to accelerate the LSMS code, enabling first-principles calculations of O(100,000) atoms and statistical physics sampling of finite-temperature properties. We reimplement the scattering matrix calculation for GPUs with a block matrix inversion algorithm that uses only accelerator memory. Using the Cray XK7 system Titan at the Oak Ridge Leadership Computing Facility, we achieve a sustained performance of 14.5 PFlop/s and a speedup of 8.6 compared to the CPU-only code.
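
The block matrix inversion mentioned in this record can be sketched with the standard 2x2 partitioned-inverse (Schur complement) identity. The NumPy toy below illustrates only the linear algebra, not the LSMS GPU implementation; the matrix size and partition point are arbitrary assumptions:

```python
import numpy as np

def block_invert(M, k):
    """Invert M via 2x2 blocking with the Schur complement.
    Blocks: A = M[:k,:k], B = M[:k,k:], C = M[k:,:k], D = M[k:,k:]."""
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    A_inv = np.linalg.inv(A)
    S = D - C @ A_inv @ B          # Schur complement of A in M
    S_inv = np.linalg.inv(S)
    # Standard partitioned-inverse identity
    top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    top_right = -A_inv @ B @ S_inv
    bot_left = -S_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bot_left, S_inv]])

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8)) + 8 * np.eye(8)  # well-conditioned test matrix
print(np.allclose(block_invert(M, 4), np.linalg.inv(M)))  # True
```

The appeal on accelerators is that each sub-inversion and multiplication works on a block that fits in fast memory, and the recursion can be applied to the blocks themselves.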

  17. T2-weighted prostate MRI at 7 Tesla using a simplified external transmit-receive coil array: correlation with radical prostatectomy findings in two prostate cancer patients.

    PubMed

    Rosenkrantz, Andrew B; Zhang, Bei; Ben-Eliezer, Noam; Le Nobin, Julien; Melamed, Jonathan; Deng, Fang-Ming; Taneja, Samir S; Wiggins, Graham C

    2015-01-01

    To report the design of a simplified external transmit-receive coil array for 7 Tesla (T) prostate MRI, including demonstration of the array for tumor localization using T2-weighted imaging (T2WI) at 7T before prostatectomy. Following simulations of transmitter designs not requiring parallel transmission or radiofrequency shimming, a coil array was constructed using loop elements, with anterior and posterior rows each comprising one transmit-receive element and three receive-only elements. This coil structure was optimized using a whole-body phantom. In vivo sequence optimization was performed to optimize the achieved flip angle (FA) and signal-to-noise ratio (SNR) in the prostate. The system was evaluated in a healthy volunteer at 3T and 7T. 7T T2WI was performed in two prostate cancer patients before prostatectomy, and localization of dominant tumors was subjectively compared with histopathological findings. Image quality was compared between 3T and 7T in these patients. Simulations of the B1(+) field in the prostate using the two-loop design showed good magnitude (B1(+) of 0.245 A/m/W(1/2)) and uniformity (nonuniformity [SD/mean] of 10.4%). In the volunteer, a 90° FA was achieved in the prostate using a 225 V, 1 ms hard pulse (indicating good efficiency), FA maps confirmed good uniformity (14.1% nonuniformity), and SNR maps showed an SNR gain of 2.1 at 7T versus 3T. In patients, 7T T2WI showed excellent visual correspondence with prostatectomy findings. 7T images demonstrated higher estimated SNR (eSNR) in benign peripheral zone (PZ) and tumor compared with 3T, but lower eSNR in fat and slight decreases in tumor-to-PZ contrast and PZ homogeneity. We have demonstrated the feasibility of a simplified external coil array for high-resolution T2-weighted prostate MRI at 7T. © 2013 Wiley Periodicals, Inc.

  18. Large-scale structural analysis: The structural analyst, the CSM Testbed and the NAS System

    NASA Technical Reports Server (NTRS)

    Knight, Norman F., Jr.; Mccleary, Susan L.; Macy, Steven C.; Aminpour, Mohammad A.

    1989-01-01

    The Computational Structural Mechanics (CSM) activity is developing advanced structural analysis and computational methods that exploit high-performance computers. Methods are developed in the framework of the CSM testbed software system and applied to representative complex structural analysis problems from the aerospace industry. An overview of the CSM testbed methods development environment is presented and some numerical methods developed on a CRAY-2 are described. Selected application studies performed on the NAS CRAY-2 are also summarized.

  19. Engineering PFLOTRAN for Scalable Performance on Cray XT and IBM BlueGene Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mills, Richard T; Sripathi, Vamsi K; Mahinthakumar, Gnanamanika

    We describe PFLOTRAN - a code for simulation of coupled hydro-thermal-chemical processes in variably saturated, non-isothermal, porous media - and the approaches we have employed to obtain scalable performance on some of the largest scale supercomputers in the world. We present detailed analyses of I/O and solver performance on Jaguar, the Cray XT5 at Oak Ridge National Laboratory, and Intrepid, the IBM BlueGene/P at Argonne National Laboratory, that have guided our choice of algorithms.

  20. surf3d: A 3-D finite-element program for the analysis of surface and corner cracks in solids subjected to mode-1 loadings

    NASA Technical Reports Server (NTRS)

    Raju, I. S.; Newman, J. C., Jr.

    1993-01-01

    A computer program, surf3d, that uses the 3D finite-element method to calculate the stress-intensity factors for surface, corner, and embedded cracks in finite-thickness plates with and without circular holes, was developed. The cracks are assumed to be either elliptic or part-elliptic in shape. The program uses eight-noded hexahedral elements to model the solid and employs a skyline storage scheme and solver. The stress-intensity factors are evaluated using the force method, the crack-opening displacement method, and the 3D virtual crack closure method. The manual describes the input to and output of the surf3d program, demonstrates the use of the program, and describes the calculation of the stress-intensity factors. Several examples with sample data files are included. To facilitate modeling of the user's crack configuration and loading, a companion preprocessor program called gensurf was also developed; gensurf is a three-dimensional mesh generator that requires minimal input and builds a complete data file for surf3d. The program surf3d is operational on Unix machines such as the CRAY Y-MP, CRAY-2, and Convex C-220.

  1. Experiences From NASA/Langley's DMSS Project

    NASA Technical Reports Server (NTRS)

    1996-01-01

    There is a trend in institutions with high performance computing and data management requirements to explore mass storage systems with peripherals directly attached to a high speed network. The Distributed Mass Storage System (DMSS) Project at the NASA Langley Research Center (LaRC) has placed such a system into production use. This paper presents the experiences, both good and bad, we have had with this system since putting it into production use. The system comprises: 1) National Storage Laboratory (NSL)/UniTree 2.1; 2) IBM 9570 HIPPI-attached disk arrays (both RAID 3 and RAID 5); 3) an IBM RS6000 server; 4) HIPPI/IPI-3 third-party transfers between the disk array systems and the supercomputer clients, a CRAY Y-MP and a CRAY 2; 5) a "warm spare" file server; 6) transition software to convert from CRAY's Data Migration Facility (DMF) based system to DMSS; 7) an NSC PS32 HIPPI switch; and 8) a STK 4490 robotic library accessed through the IBM RS6000 block mux interface. This paper covers: the performance of DMSS in the areas of file transfer rates, migration and recall, and file manipulation (listing, deleting, etc.); the appropriateness of a workstation-class file server for NSL/UniTree given LaRC's present storage requirements; the role of third-party transfers between the supercomputers and the DMSS disk array systems; a detailed comparison, in both performance and functionality, between the DMF and DMSS systems; LaRC's enhancements to the NSL/UniTree system administration environment; the mechanism by which DMSS provides file server redundancy; statistics on the availability of DMSS; and the design of, and experiences with, the locally developed transparent transition software that made over 1.5 million DMF files available to NSL/UniTree with minimal system outage.

  2. Particle acceleration by quasi-parallel shocks in the solar wind

    NASA Astrophysics Data System (ADS)

    Galinsky, V. L.; Shevchenko, V. I.

    2008-11-01

    The theoretical study of proton acceleration at a quasi-parallel shock due to interaction with Alfven waves self-consistently excited in both upstream and downstream regions was conducted using a scale-separation model [1]. The model uses conservation laws and resonance conditions to find where waves will be generated or damped, and hence where particles will be pitch-angle scattered, as well as the change in wave energy due to instability or damping. It takes into consideration the total distribution function (the bulk plasma and the high-energy tail), so no assumptions (e.g., seed populations or an ad hoc escape rate of accelerated particles) are required. The dynamics of ion acceleration by the November 11-12, 1978 interplanetary traveling shock was investigated and compared with the observations [2], as well as with the solution obtained using the so-called convection-diffusion equation for the distribution function of accelerated particles [3]. [1] Galinsky, V.L., and V.I. Shevchenko, Astrophys. J., 669, L109, 2007. [2] Kennel, C.F., F.W. Coroniti, F.L. Scarf, W.A. Livesey, C.T. Russell, E.J. Smith, K.P. Wenzel, and M. Scholer, J. Geophys. Res. 91, 11,917, 1986. [3] Gordon B.E., M.A. Lee, E. Mobius, and K.J. Trattner, J. Geophys. Res., 104, 28,263, 1990.

  3. Experiment E89-044 on quasi-elastic 3He(e,e'p) scattering at Jefferson Laboratory: analysis of the two-body breakup cross sections in parallel kinematics; Experience E89-044 de diffusion quasi-elastique 3He(e,e'p) au Jefferson Laboratory : analyse des sections efficaces de desintegration a deux corps en cinematique parallele (in French)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Penel-Nottaris, Emilie

    2004-07-01

    The Jefferson Lab Hall A experiment has measured the 3He(e,e'p) reaction cross sections. The separation of the longitudinal and transverse response functions for the two-body breakup reaction in parallel kinematics allows study of the electromagnetic properties of the bound proton in the 3He nucleus and of the nuclear mechanisms involved beyond the impulse approximation. Preliminary cross sections show some disagreement with theoretical predictions for the forward-angle kinematics around 0 MeV/c missing momentum, and sensitivity to final-state interactions and the 3He wave function for missing momenta of 300 MeV/c.

  4. Gamma Delta T-Cells Regulate Inflammatory Cell Infiltration of the Lung after Trauma-Hemorrhage

    DTIC Science & Technology

    2015-06-01

    suggesting a role for this T-cell subset in both innate and acquired immunity (7, 8). Studies have shown that γδ T cells are required for both controlled...increased infiltration of both lymphoid and myeloid cells in WT mice after TH-induced ALI. In parallel to γδ T cells, myeloid cells (i.e., monocytes...GAMMA DELTA T CELLS REGULATE INFLAMMATORY CELL INFILTRATION OF THE LUNG AFTER TRAUMA-HEMORRHAGE Meenakshi Rani,* Qiong Zhang,* Richard F. Oppeltz

  5. The solution of linear systems of equations with a structural analysis code on the NAS CRAY-2

    NASA Technical Reports Server (NTRS)

    Poole, Eugene L.; Overman, Andrea L.

    1988-01-01

    Two methods for solving linear systems of equations on the NAS Cray-2 are described. One is a direct method; the other is an iterative method. Both methods exploit the architecture of the Cray-2, particularly the vectorization, and are aimed at structural analysis applications. To demonstrate and evaluate the methods, they were installed in a finite element structural analysis code denoted the Computational Structural Mechanics (CSM) Testbed. A description of the techniques used to integrate the two solvers into the Testbed is given. Storage schemes, memory requirements, operation counts, and reformatting procedures are discussed. Finally, results from the new methods are compared with results from the initial Testbed sparse Choleski equation solver for three structural analysis problems. The new direct solvers described achieve the highest computational rates of the methods compared. The new iterative methods are not able to achieve as high computation rates as the vectorized direct solvers but are best for well conditioned problems which require fewer iterations to converge to the solution.

  6. Data communication requirements for the advanced NAS network

    NASA Technical Reports Server (NTRS)

    Levin, Eugene; Eaton, C. K.; Young, Bruce

    1986-01-01

    The goal of the Numerical Aerodynamic Simulation (NAS) Program is to provide a powerful computational environment for advanced research and development in aeronautics and related disciplines. The present NAS system consists of a Cray 2 supercomputer connected by a data network to a large mass storage system, to sophisticated local graphics workstations, and by remote communications to researchers throughout the United States. The program plan is to continue acquiring the most powerful supercomputers as they become available. In the 1987/1988 time period it is anticipated that a computer with 4 times the processing speed of a Cray 2 will be obtained and by 1990 an additional supercomputer with 16 times the speed of the Cray 2. The implications of this 20-fold increase in processing power on the data communications requirements are described. The analysis was based on models of the projected workload and system architecture. The results are presented together with the estimates of their sensitivity to assumptions inherent in the models.

  7. Anomalous effective dimensionality of quantum gas adsorption near nanopores.

    PubMed

    Full, Steven J; McNutt, Jessica P; Cole, Milton W; Mbaye, Mamadou T; Gatica, Silvina M

    2010-08-25

    Three problems involving quasi-one-dimensional (1D) ideal gases are discussed. The simplest problem involves quantum particles localized within the 'groove', a quasi-1D region created by two adjacent, identical and parallel nanotubes. At low temperature (T), the transverse motion of the adsorbed gas, in the plane perpendicular to the axes of the tubes, is frozen out. Then, the low-T heat capacity C(T) of N particles is that of a 1D classical gas: C*(T) = C(T)/(Nk_B) → 1/2. The dimensionless heat capacity C* increases when T ≥ 0.1T(x,y) (transverse excitation temperatures), asymptoting at C* = 2.5. The second problem involves a gas localized between two nearly parallel, co-planar nanotubes, with small divergence half-angle γ. In this case, too, the transverse motion does not contribute to C(T) at low T, leaving the problem of a gas of particles in a 1D harmonic potential (along the z axis, midway between the tubes). Setting ω_z as the angular frequency of this motion, for T ≥ τ_z ≡ ħω_z/k_B the behavior approaches that of a 2D classical gas, C* = 1; one might have expected instead C* = 1/2, as in the groove problem, since the limit γ ≡ 0 is 1D. For T < τ_z, the thermal behavior is exponentially activated, C* ∼ (τ_z/T)² e^(−τ_z/T). At higher T (T ≈ ε_y/k_B ≡ τ_y > τ_z), motion is excited in the y direction, perpendicular to the plane of the nanotubes, resulting in thermal behavior (C* = 7/4) corresponding to a gas in 7/2 dimensions, while at very high T (T > ħω_x/k_B ≡ τ_x > τ_y), the behavior becomes that of a D = 11/2 system. The third problem is that of a gas of particles, e.g. (4)He, confined in the interstitial region between four square parallel pores. The low-T behavior found in this case is again surprising: that of a 5D gas.
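
The exponentially activated form quoted for T < τ_z follows from the canonical partition function of a one-dimensional harmonic oscillator. As a textbook sketch (not taken from the paper itself), with the record's τ_z ≡ ħω_z/k_B and the zero-point energy dropped:

```latex
Z_1=\frac{1}{1-e^{-\tau_z/T}},\qquad
\frac{U}{N}=\frac{k_B\,\tau_z}{e^{\tau_z/T}-1},\qquad
C^{*}=\frac{1}{Nk_B}\frac{\partial U}{\partial T}
=\left(\frac{\tau_z}{T}\right)^{2}\frac{e^{\tau_z/T}}{\left(e^{\tau_z/T}-1\right)^{2}}
\;\xrightarrow{\,T\ll\tau_z\,}\;
\left(\frac{\tau_z}{T}\right)^{2}e^{-\tau_z/T}.
```

The same expression gives C* → 1 for T ≫ τ_z (kinetic plus potential halves), consistent with the quoted high-temperature limit for the harmonic direction.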

  8. Vectorization of a particle code used in the simulation of rarefied hypersonic flow

    NASA Technical Reports Server (NTRS)

    Baganoff, D.

    1990-01-01

    A limitation of the direct simulation Monte Carlo (DSMC) method is that it does not allow efficient use of the vector architectures that predominate in current supercomputers. Consequently, the problems that can be handled are limited to those of one- and two-dimensional flows. This work focuses on a reformulation of the DSMC method with the objective of designing a procedure that is optimized for the vector architectures found on machines such as the Cray-2. In addition, it focuses on finding a better balance between algorithmic complexity and the total number of particles employed in a simulation, so that the overall performance of a particle simulation scheme can be greatly improved. Simulations of the flow about a 3D blunt body are performed with 10^7 particles and 4 x 10^5 mesh cells. Good statistics are obtained with time averaging over 800 time steps using 4.5 h of Cray-2 single-processor CPU time.
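
The vectorization payoff described here comes from recasting per-particle loops as whole-array operations. A minimal NumPy analogue of the free-flight (advection) phase is shown below; the particle count and time step are arbitrary assumptions, and a real DSMC code adds collision sampling and cell indexing on top of this:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
pos = rng.random((n, 3))            # particle positions in a unit box
vel = rng.standard_normal((n, 3))   # thermal velocities
dt = 1e-3

# Scalar formulation: one particle per loop iteration, which starves
# vector units such as those on the Cray-2.
def move_scalar(pos, vel, dt):
    out = np.empty_like(pos)
    for i in range(len(pos)):
        out[i] = pos[i] + dt * vel[i]
    return out

# Vector formulation: a single fused array operation over all particles.
def move_vector(pos, vel, dt):
    return pos + dt * vel

same = np.allclose(move_scalar(pos, vel, dt), move_vector(pos, vel, dt))
print(same)  # True: identical physics, very different hardware utilization
```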

  9. High-resolution whole-brain diffusion MRI at 7T using radiofrequency parallel transmission.

    PubMed

    Wu, Xiaoping; Auerbach, Edward J; Vu, An T; Moeller, Steen; Lenglet, Christophe; Schmitter, Sebastian; Van de Moortele, Pierre-François; Yacoub, Essa; Uğurbil, Kâmil

    2018-03-30

    To investigate the utility of RF parallel transmission (pTx) for Human Connectome Project (HCP)-style whole-brain diffusion MRI (dMRI) data at 7 Tesla (7T). Healthy subjects were scanned in pTx and single-transmit (1Tx) modes. Multiband (MB), single-spoke pTx pulses were designed to image sagittal slices. HCP-style dMRI data (i.e., 1.05-mm resolution, MB2, b-values = 1000/2000 s/mm2, 286 images, and a 40-min scan) and data with higher accelerations (MB3 and MB4) were acquired with pTx. pTx significantly improved flip-angle and detected-signal uniformity across the brain, yielding a ∼19% increase in temporal SNR (tSNR) averaged over the brain relative to 1Tx. This allowed significantly enhanced estimation of multiple fiber orientations (with a ∼21% decrease in dispersion) in HCP-style 7T dMRI datasets. Additionally, pTx pulses achieved substantially lower power deposition, permitting higher accelerations and enabling collection of the same data in 2/3 and 1/2 the scan time, or of more data in the same scan time. pTx provides a solution to two major limitations for slice-accelerated high-resolution whole-brain dMRI at 7T: it improves flip-angle uniformity and enables higher slice acceleration relative to the current state of the art. As such, pTx provides significant advantages for rapid acquisition of high-quality, high-resolution, truly whole-brain dMRI data. © 2018 International Society for Magnetic Resonance in Medicine.

  10. Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bylaska, Eric J., E-mail: Eric.Bylaska@pnnl.gov; Weare, Jonathan Q., E-mail: weare@uchicago.edu; Weare, John H., E-mail: jweare@ucsd.edu

    2013-08-21

    Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., the Verlet algorithm), is available to propagate the system from time t_i (trajectory positions and velocities x_i = (r_i, v_i)) to time t_{i+1} (x_{i+1}) by x_{i+1} = f_i(x_i), the dynamics problem spanning an interval from t_0…t_M can be transformed into a root finding problem, F(X) = [x_i − f(x_{i−1})]_{i=1,M} = 0, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsened time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence, and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and an HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup ((serial execution time)/(parallel execution time)) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0.
For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol (TCP/IP) networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H2O at the MP2/6-31G* level. Implemented in this way, these algorithms can be used for long-time, high-level AIMD simulations at modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of an MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.
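
The root-finding reformulation in this record can be made concrete with a small sketch. The toy below builds F(X) = [x_i − f(x_{i−1})] for a velocity-Verlet harmonic oscillator and drives it to zero with plain Jacobi sweeps rather than the paper's quasi-Newton schemes; the integrator, step count, and solver choice are illustrative assumptions. The key point is that every entry of a sweep depends only on the previous sweep, so each time-step entry could be assigned to its own processor:

```python
import numpy as np

def verlet_step(x, dt=0.05):
    """One velocity-Verlet step for a unit harmonic oscillator (force = -q)."""
    q, v = x
    a = -q
    q_new = q + dt * v + 0.5 * dt**2 * a
    v_new = v + 0.5 * dt * (a + (-q_new))  # average of old and new accelerations
    return np.array([q_new, v_new])

M = 50                      # number of time steps in the interval
x0 = np.array([1.0, 0.0])   # initial condition (position, velocity)

def residual(X):
    """F(X)_i = x_i - f(x_{i-1}); a valid trajectory is a root of F."""
    prev = np.vstack([x0, X[:-1]])
    return X - np.array([verlet_step(p) for p in prev])

# Simplest solver: Jacobi-style sweeps x_i <- f(x_{i-1}). All entries of a
# sweep are independent of one another, hence trivially parallel over i.
X = np.tile(x0, (M, 1))     # crude initial guess: constant trajectory
for _ in range(M):          # converges in at most M sweeps
    prev = np.vstack([x0, X[:-1]])
    X = np.array([verlet_step(p) for p in prev])

print(np.abs(residual(X)).max())  # 0.0: X equals the serial trajectory
```

Quasi-Newton iterations on F, as in the paper, aim to converge in far fewer than M sweeps, which is where the parallel speedup comes from.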

  11. CSM Testbed Development and Large-Scale Structural Applications

    NASA Technical Reports Server (NTRS)

    Knight, Norman F., Jr.; Gillian, R. E.; Mccleary, Susan L.; Lotts, C. G.; Poole, E. L.; Overman, A. L.; Macy, S. C.

    1989-01-01

    A research activity called Computational Structural Mechanics (CSM) conducted at the NASA Langley Research Center is described. This activity is developing advanced structural analysis and computational methods that exploit high-performance computers. Methods are developed in the framework of the CSM Testbed software system and applied to representative complex structural analysis problems from the aerospace industry. An overview of the CSM Testbed methods development environment is presented and some new numerical methods developed on a CRAY-2 are described. Selected application studies performed on the NAS CRAY-2 are also summarized.

  12. NAS (Numerical Aerodynamic Simulation Program) technical summaries, March 1989 - February 1990

    NASA Technical Reports Server (NTRS)

    1990-01-01

    Given here are selected scientific results from the Numerical Aerodynamic Simulation (NAS) Program's third year of operation. During this year, the scientific community was given access to a Cray-2 and a Cray Y-MP supercomputer. Topics covered include flow field analysis of fighter wing configurations, large-scale ocean modeling, the Space Shuttle flow field, advanced computational fluid dynamics (CFD) codes for rotary-wing airloads and performance prediction, turbulence modeling of separated flows, airloads and acoustics of rotorcraft, vortex-induced nonlinearities on submarines, and standing oblique detonation waves.

  13. Water Selective Imaging and bSSFP Banding Artifact Correction in Humans and Small Animals at 3T and 7T, Respectively

    PubMed Central

    Ribot, Emeline J.; Wecker, Didier; Trotier, Aurélien J.; Dallaudière, Benjamin; Lefrançois, William; Thiaudière, Eric; Franconi, Jean-Michel; Miraux, Sylvain

    2015-01-01

    Introduction The purpose of this paper is to develop an easy method to generate both fat-signal-free and banding-artifact-free 3D balanced steady state free precession (bSSFP) images at high magnetic field. Methods In order to suppress fat signal and bSSFP banding artifacts, two or four images were acquired with the excitation frequency of the water-selective binomial radiofrequency pulse set on resonance or shifted by a maximum of 3/4TR. Mice and human volunteers were imaged at 7T and 3T, respectively, to perform whole-body and musculoskeletal imaging. A "Sum-of-Squares" reconstruction was performed, combined or not with parallel imaging. Results The frequency selectivity of 1-2-3-2-1 or 1-3-3-1 binomial pulses was preserved after (3/4TR) frequency shifting. Consequently, whole-body small animal 3D imaging was performed at 7T and enabled visualization of small structures within adipose tissue, such as lymph nodes. In parallel, this method allowed 3D musculoskeletal imaging in humans with high spatial resolution at 3T. The combination with parallel imaging allowed the acquisition of knee images with ~500 μm resolution in less than 2 min. In addition, ankles, full head coverage, and legs of volunteers were imaged, demonstrating the possible application of the method also for large FOV. Conclusion In conclusion, this robust method can be applied in small animals and humans at high magnetic fields. The high SNR and tissue contrast obtained in short acquisition times make the bSSFP sequence suitable for several preclinical and clinical applications. PMID:26426849

  14. 3D sensitivity encoded ellipsoidal MR spectroscopic imaging of gliomas at 3T☆

    PubMed Central

    Ozturk-Isik, Esin; Chen, Albert P.; Crane, Jason C.; Bian, Wei; Xu, Duan; Han, Eric T.; Chang, Susan M.; Vigneron, Daniel B.; Nelson, Sarah J.

    2010-01-01

    Purpose The goal of this study was to implement time efficient data acquisition and reconstruction methods for 3D magnetic resonance spectroscopic imaging (MRSI) of gliomas at a field strength of 3T using parallel imaging techniques. Methods The point spread functions, signal to noise ratio (SNR), spatial resolution, metabolite intensity distributions and Cho:NAA ratio of 3D ellipsoidal, 3D sensitivity encoding (SENSE) and 3D combined ellipsoidal and SENSE (e-SENSE) k-space sampling schemes were compared with conventional k-space data acquisition methods. Results The 3D SENSE and e-SENSE methods resulted in similar spectral patterns as the conventional MRSI methods. The Cho:NAA ratios were highly correlated (P<.05 for SENSE and P<.001 for e-SENSE) with the ellipsoidal method and all methods exhibited significantly different spectral patterns in tumor regions compared to normal appearing white matter. The geometry factors ranged between 1.2 and 1.3 for both the SENSE and e-SENSE spectra. When corrected for these factors and for differences in data acquisition times, the empirical SNRs were similar to values expected based upon theoretical grounds. The effective spatial resolution of the SENSE spectra was estimated to be the same as the corresponding fully sampled k-space data, while the spectra acquired with ellipsoidal and e-SENSE k-space samplings were estimated to have a 2.36–2.47-fold loss in spatial resolution due to the differences in their point spread functions. Conclusion The 3D SENSE method retained the same spatial resolution as full k-space sampling but with a 4-fold reduction in scan time and an acquisition time of 9.28 min. The 3D e-SENSE method had a similar spatial resolution as the corresponding ellipsoidal sampling with a scan time of 4:36 min. Both parallel imaging methods provided clinically interpretable spectra with volumetric coverage and adequate SNR for evaluating Cho, Cr and NAA. PMID:19766422
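    The g-factor correction mentioned above follows the standard SENSE relation SNR_SENSE / SNR_full = 1 / (g · √R). A small sketch plugging in the abstract's numbers (the helper name `sense_snr_fraction` is ours):

    ```python
    import math

    def sense_snr_fraction(g_factor, accel):
        """Standard SENSE relation: SNR_SENSE / SNR_full = 1 / (g * sqrt(R))."""
        return 1.0 / (g_factor * math.sqrt(accel))

    # With the reported geometry factors (g between 1.2 and 1.3) and the
    # 4-fold acceleration of the 3D SENSE acquisition:
    for g in (1.2, 1.3):
        frac = sense_snr_fraction(g, 4.0)
        print(f"g = {g}: retains {frac:.2f} of the full-k-space SNR")
    ```

    This is why the empirical SNRs only match theory after correcting for both the geometry factor and the shorter acquisition time.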

  15. Applications Performance on NAS Intel Paragon XP/S-15

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

    1994-01-01

    The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S-15 in February 1993. The i860 XP microprocessor, with an integrated floating point unit and operating in dual-instruction mode, gives a peak performance of 75 million floating point operations (MFLOPS) per second for 64-bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we report on early experience using the Paragon XP/S-15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3, both assembly-coded and Fortran-coded, on the NAS Paragon XP/S-15. Furthermore, we have investigated the performance of a single-node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT. Finally, we measured the performance of the NAS Parallel Benchmarks (NPB) on the Paragon and compared it with the performance obtained on other highly parallel machines, such as the CM-5, Cray T3D, IBM SP1, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default the OSF/1 AD operating system from the Open Software Foundation. The paging of the Open Software Foundation (OSF) server at 22 MB, done to make more memory available for the application, degrades performance. We found that when the limit of 26 MB per node out of the 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, performance is considerably reduced. We found that dynamic memory allocation can help application performance under certain circumstances. b. 
Impact of data cache on the i860 XP: We measured the performance of the BLAS, both assembly-coded and Fortran-coded. We found that the measured performance of the assembly-coded BLAS is much less than what the memory bandwidth limitation would predict. The influence of the data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processor layout.
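    The kind of kernel benchmarking described, timing a BLAS-1 operation and converting to MFLOP/s, can be sketched as follows (a rough illustration in Python, with NumPy's compiled loops standing in for assembly-coded BLAS and an interpreted loop standing in for naive code; absolute numbers are machine-dependent):

    ```python
    import time
    import numpy as np

    def mflops_daxpy(n, trials=5, vectorized=True):
        """Time the BLAS-1 kernel y <- a*x + y and report MFLOP/s (2n flops per call)."""
        a = 2.0
        x = np.random.rand(n)
        y = np.random.rand(n)
        t0 = time.perf_counter()
        for _ in range(trials):
            if vectorized:
                y = a * x + y  # dispatched to compiled loops, like tuned BLAS
            else:
                # element-by-element interpreted loop, like untuned code
                y = np.array([a * xi + yi for xi, yi in zip(x, y)])
        elapsed = time.perf_counter() - t0
        return 2.0 * n * trials / elapsed / 1e6

    print(f"vectorized : {mflops_daxpy(1_000_000):10.1f} MFLOP/s")
    print(f"interpreted: {mflops_daxpy(100_000, vectorized=False):10.1f} MFLOP/s")
    ```

    Sweeping `n` through and beyond the cache size is the analogue of the abstract's study of data-cache influence on different vector lengths.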

  16. HEAT.PRO - THERMAL IMBALANCE FORCE SIMULATION AND ANALYSIS USING PDE2D

    NASA Technical Reports Server (NTRS)

    Vigue, Y.

    1994-01-01

    HEAT.PRO calculates the thermal imbalance force resulting from satellite surface heating. The heated body of a satellite re-radiates energy at a rate that is proportional to its temperature, losing the energy in the form of photons. By conservation of momentum, this momentum flux out of the body creates a reaction force against the radiation surface, and the net thermal force can be observed as a small perturbation that affects long term orbital behavior of the satellite. HEAT.PRO calculates this thermal imbalance force and then determines its effects on satellite orbits, especially where the Earth's shadowing of an orbiting satellite causes periodic changes in the spacecraft's thermal environment. HEAT.PRO implements a finite element method routine called PDE2D which incorporates material properties to determine the solar panel surface temperatures. The nodal temperatures are computed at specified time steps and are used to determine the magnitude and direction of the thermal force on the spacecraft. These calculations are based on the solar panel orientation and satellite's position with respect to the earth and sun. It is necessary to have accurate, current knowledge of surface emissivity, thermal conductivity, heat capacity, and material density. These parameters, which may change due to degradation of materials in the environment of space, influence the nodal temperatures that are computed and thus the thermal force calculations. HEAT.PRO was written in FORTRAN 77 for Cray series computers running UNICOS. The source code contains directives for and is used as input to the required partial differential equation solver, PDE2D. HEAT.PRO is available on a 9-track 1600 BPI magnetic tape in UNIX tar format (standard distribution medium) or a .25 inch streaming magnetic tape cartridge in UNIX tar format. An electronic copy of the documentation in Macintosh Microsoft Word format is included on the distribution tape. HEAT.PRO was developed in 1991. 
Cray and UNICOS are registered trademarks of Cray Research, Inc. UNIX is a trademark of AT&T Bell Laboratories. PDE2D is available from Granville Sewell, Mathematics Dept., University of Texas at El Paso, El Paso, Texas 79968.

  17. Sex differences in the stress response in SD rats.

    PubMed

    Lu, Jing; Wu, Xue-Yan; Zhu, Qiong-Bin; Li, Jia; Shi, Li-Gen; Wu, Juan-Li; Zhang, Qi-Jun; Huang, Man-Li; Bao, Ai-Min

    2015-05-01

    Sex differences play an important role in depression, the basis of which is an excessive stress response. We aimed at revealing the neurobiological sex differences in the same study in acute- and chronically-stressed rats. Female Sprague-Dawley (SD) rats were randomly divided into 6 groups: chronic unpredictable mild stress (CUMS), acute foot shock (FS) and controls, animals in all 3 groups were sacrificed in proestrus or diestrus. Male SD rats were randomly divided into 3 groups: CUMS, FS and controls. Comparisons were made of behavioral changes in CUMS and control rats, plasma levels of corticosterone (CORT), testosterone (T) and estradiol (E2), and of the hypothalamic mRNA-expression of stress-related molecules, i.e. estrogen receptor α and β, androgen receptor, aromatase, mineralocorticoid receptor, glucocorticoid receptor, corticotropin-releasing hormone, arginine vasopressin and oxytocin. CUMS resulted in disordered estrus cycles, more behavioral and hypothalamic stress-related molecules changes and a stronger CORT response in female rats compared with male rats. Female rats also showed decreased E2 and T levels after FS and CUMS, while male FS rats showed increased E2 and male CUMS rats showed decreased T levels. Stress affects the behavioral, endocrine and the molecular response of the stress systems in the hypothalamus of SD rats in a clear sexual dimorphic way, which has parallels in human data on stress and depression. Copyright © 2015 Elsevier B.V. All rights reserved.

  18. : A Scalable and Transparent System for Simulating MPI Programs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perumalla, Kalyan S

    2010-01-01

    is a scalable, transparent system for experimenting with the execution of parallel programs on simulated computing platforms. The level of simulated detail can be varied for application behavior as well as for machine characteristics. Unique features are repeatability of execution, scalability to millions of simulated (virtual) MPI ranks, scalability to hundreds of thousands of host (real) MPI ranks, portability of the system to a variety of host supercomputing platforms, and the ability to experiment with scientific applications whose source code is available. The set of source-code interfaces supported is being expanded to support a wider set of applications, and MPI-based scientific computing benchmarks are being ported. In proof-of-concept experiments, the system has been successfully exercised to spawn and sustain very large-scale executions of an MPI test program given in source-code form. Low slowdowns are observed, due to its use of a purely discrete-event style of execution, and due to the scalability and efficiency of the underlying parallel discrete-event simulation engine, sik. In the largest runs, it has been executed on up to 216,000 cores of a Cray XT5 supercomputer, successfully simulating over 27 million virtual MPI ranks, each virtual rank containing its own thread context, and all ranks fully synchronized by virtual time.
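    The virtual-time-ordered, discrete-event style of execution described above can be sketched with a toy sequential event loop (the function `simulate_ring` and its ring-token workload are our own invented illustration; the actual system runs such engines in parallel at massive scale):

    ```python
    import heapq

    def simulate_ring(n_ranks, msg_latency=1.0, hops=8):
        """Toy discrete-event simulation of virtual MPI ranks passing a token
        around a ring. Every event carries a virtual timestamp, and events are
        always processed in timestamp order, so the run is repeatable."""
        events = [(0.0, 0, 0)]  # (virtual_time, dest_rank, hop_count)
        last_time = 0.0
        while events:
            t, rank, hop = heapq.heappop(events)
            last_time = t
            if hop < hops:
                nxt = (rank + 1) % n_ranks  # message delivery schedules the next event
                heapq.heappush(events, (t + msg_latency, nxt, hop + 1))
        return last_time

    # 8 hops at unit latency finish at virtual time 8.0, independent of ring size.
    print(simulate_ring(n_ranks=4))   # -> 8.0
    ```

    Because simulated time advances only through event timestamps, the result is independent of how fast the host executes, which is the basis for repeatability across host platforms.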

  19. Stabilization of lower hybrid drift modes by finite parallel wavenumber and electron temperature gradients in field-reversed configurations

    NASA Astrophysics Data System (ADS)

    Farengo, R.; Guzdar, P. N.; Lee, Y. C.

    1989-08-01

    The effect of finite parallel wavenumber and electron temperature gradients on the lower hybrid drift instability is studied in the parameter regime corresponding to the TRX-2 device [Fusion Technol. 9, 48 (1986)]. Perturbations in the electrostatic potential and all three components of the vector potential are considered, and finite-beta electron orbit modifications are included. The electron temperature gradient decreases the growth rate of the instability but, for kz = 0, unstable modes exist for ηe (= T′e n0/Te n′0) > 6. Since finite-kz effects completely stabilize the mode at small values of kz/ky (≈ 5×10⁻³), magnetic shear could be responsible for stabilizing the lower hybrid drift instability in field-reversed configurations.

  20. M3T: Morphable Multithreaded Memory Tiles

    DTIC Science & Technology

    2004-01-01

    Figure 4: 8-point radix-2 FFT dataflow (inputs x[0]–x[7]; outputs y[0], y[4], y[2], y[6], y[1], y[5], y[3], y[7] in bit-reversed order). [The remainder of this excerpt consists of fragmentary reference-list entries, including Narayan and Gajski (1995), “Interfacing Incompatible Protocols Using Interface Process Generation,” and Sohi, Breach and Vajapeyam (1995), “Multiscalar Processors.”]

  1. Using a multifrontal sparse solver in a high performance, finite element code

    NASA Technical Reports Server (NTRS)

    King, Scott D.; Lucas, Robert; Raefsky, Arthur

    1990-01-01

    We consider the performance of the finite element method on a vector supercomputer. The computationally intensive parts of the finite element method are typically the individual element forms and the solution of the global stiffness matrix, both of which are vectorized in high performance codes. To further increase throughput, new algorithms are needed. We compare a multifrontal sparse solver to a traditional skyline solver in a finite element code on a vector supercomputer. The multifrontal solver uses the Multiple-Minimum-Degree reordering heuristic to reduce the number of operations required to factor a sparse matrix, and full matrix computational kernels (e.g., BLAS3) to enhance vector performance. The net result is an order-of-magnitude reduction in run time for a finite element application on one processor of a Cray X-MP.
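    The effect of a fill-reducing ordering can be seen with SciPy's SuperLU interface, which exposes a minimum-degree column permutation; a small sketch on a 2-D Laplacian stiffness-type matrix (an illustration of the general technique, not the paper's solver):

    ```python
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # A 2-D 5-point Laplacian on an n-by-n grid: the kind of sparse
    # stiffness matrix a finite element code assembles.
    n = 20
    eye = sp.identity(n)
    tri = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
    off = sp.diags([-1, -1], [-1, 1], shape=(n, n))
    A = (sp.kron(eye, tri) + sp.kron(off, eye)).tocsc()

    # Factor once in the natural ordering and once with SuperLU's
    # minimum-degree ordering on A^T + A (a multiple-minimum-degree variant).
    lu_nat = splu(A, permc_spec="NATURAL")
    lu_mmd = splu(A, permc_spec="MMD_AT_PLUS_A")
    fill = lambda lu: lu.L.nnz + lu.U.nnz
    print("natural ordering fill nonzeros:", fill(lu_nat))
    print("minimum-degree   fill nonzeros:", fill(lu_mmd))  # noticeably fewer
    ```

    Fewer fill nonzeros in the factors means fewer operations in the factorization, which is the first half of the speedup the abstract reports; the second half comes from doing those operations in dense BLAS3 frontal kernels.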

  2. Multigrid direct numerical simulation of the whole process of flow transition in 3-D boundary layers

    NASA Technical Reports Server (NTRS)

    Liu, Chaoqun; Liu, Zhining

    1993-01-01

    A new technology was developed in this study which provides a successful numerical simulation of the whole process of flow transition in 3-D boundary layers, including linear growth, secondary instability, breakdown, and transition at relatively low CPU cost. Most other spatial numerical simulations require high CPU cost and blow up at the stage of flow breakdown. A fourth-order finite difference scheme on stretched and staggered grids, a fully implicit time marching technique, a semi-coarsening multigrid based on the so-called approximate line-box relaxation, and a buffer domain for the outflow boundary conditions were all used for high-order accuracy, good stability, and fast convergence. A new fine-coarse-fine grid mapping technique was developed to keep the code running after the laminar flow breaks down. The computational results are in good agreement with linear stability theory, secondary instability theory, and some experiments. The cost for a typical case with a 162 x 34 x 34 grid is around 2 Cray Y-MP CPU hours for 10 T-S periods.

  3. Lateral parabrachial nucleus mediates shortening of expiration during hypoxia.

    PubMed

    Song, Gang; Poon, Chi-Sang

    2009-01-01

    Acute hypoxia elicits complex time-dependent responses including rapid augmentation of inspiratory drive, shortening of inspiratory and expiratory durations (T(I), T(E)), and short-term potentiation and depression. The central pathways mediating these varied effects are largely unknown. Here, we show that the lateral parabrachial nucleus (LPBN) of the dorsolateral pons specifically mediates T(E)-shortening during hypoxia and not other hypoxic response components. Twelve urethane-anesthetized and vagotomized adult Sprague-Dawley rats were exposed to 1-min poikilocapnic hypoxia before and after unilateral kainic acid or bilateral electrolytic lesioning of the LPBN. Bilateral lesions resulted in a significant increase in baseline T(E) under hyperoxia. After unilateral or bilateral lesions, the decrease in T(E) during hypoxia was markedly attenuated without appreciable changes in all other hypoxic response components. These findings add to the mounting evidence that the central processing of peripheral chemoafferent inputs is segregated into parallel integrator and differentiator (low-pass and high-pass filter) pathways that separately modulate inspiratory drive, T(I), T(E) and resultant short-term potentiation and depression.

  4. Targeting the UPR to Circumvent Endocrine Resistance in Breast Cancer

    DTIC Science & Technology

    2015-10-01

    were submitted for the screen. Kinase assay protocol (DiscoverX): For most assays, kinase-tagged T7 phage strains were grown in parallel in 24-well...blocks in an E. coli host derived from the BL21 strain. E. coli were grown to log-phase and infected with T7 phage from a frozen stock (multiplicity of...unbound ligand and to reduce non-specific phage binding. Binding reactions were assembled by combining kinases, liganded affinity beads, and test

  5. Changes in serum sex steroid levels throughout the reproductive cycle of Bufo arenarum females.

    PubMed

    Medina, Marcela F; Ramos, Inés; Crespo, Claudia A; González-Calvar, Silvia; Fernández, Silvia N

    2004-04-01

    The changes in the serum levels of the sexual steroids estradiol-17beta (E(2)), testosterone (T), dihydrotestosterone (DHT), and progesterone (P) in Bufo arenarum females were determined by radioimmunoassay (RIA) during 3 consecutive cycles (1999-2001). The serum concentrations of T and DHT, which showed a close parallelism during the annual reproductive cycle, exhibited the highest levels during the preovulatory period, when oogenesis is advanced, while the lowest serum levels of these hormones were found during the ovulatory period. The data obtained for E(2) showed a pattern contrary to that determined for androgens. The maximum E(2) concentrations detected in the early postovulatory period might be associated with vitellogenesis and follicular growth. The lowest E(2) concentrations were reached during the period in which B. arenarum undergoes its final hibernation stage. Serum P showed a peak during the preovulatory period, related to the induction of nuclear maturation in full-grown oocytes. A strong decrease in the levels of the circulating hormones was observed after ovariectomy. Our results showed that, out of the four hormones examined, T and DHT were the best indicators of ovarian and oviductal stage, as shown by the strong positive correlation found between androgen levels and organ weight, while E(2) showed a weak negative correlation with ovarian and oviductal weight.

  6. Simultaneous dual-task performance reveals parallel response selection after practice

    NASA Technical Reports Server (NTRS)

    Hazeltine, Eliot; Teague, Donald; Ivry, Richard B.

    2002-01-01

    E. H. Schumacher, T. L. Seymour, J. M. Glass, D. E. Kieras, and D. E. Meyer (2001) reported that dual-task costs are minimal when participants are practiced and give the 2 tasks equal emphasis. The present research examined whether such findings are compatible with the operation of an efficient response selection bottleneck. Participants trained until they were able to perform both tasks simultaneously without interference. Novel stimulus pairs produced no reaction time costs, arguing against the development of compound stimulus-response associations (Experiment 1). Manipulating the relative onsets (Experiments 2 and 4) and durations (Experiments 3 and 4) of response selection processes did not lead to dual-task costs. The results indicate that the 2 tasks did not share a bottleneck after practice.

  7. Adaptation of MSC/NASTRAN to a supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gloudeman, J.F.; Hodge, J.C.

    1982-01-01

    MSC/NASTRAN is a large-scale general purpose digital computer program which solves a wide variety of engineering analysis problems by the finite element method. The program capabilities include static and dynamic structural analysis (linear and nonlinear), heat transfer, acoustics, electromagnetism and other types of field problems. It is used worldwide by large and small companies in such diverse fields as automotive, aerospace, civil engineering, shipbuilding, offshore oil, industrial equipment, chemical engineering, biomedical research, optics and government research. The paper presents the significant aspects of the adaptation of MSC/NASTRAN to the Cray-1. First, the general architecture and predominant functional use of MSC/NASTRAN are discussed to help explain the imperatives and the challenges of this undertaking. The key characteristics of the Cray-1 which influenced the decision to undertake this effort are then reviewed to help identify performance targets. An overview of the MSC/NASTRAN adaptation effort is then given to help define the scope of the project. Finally, some measures of MSC/NASTRAN's operational performance on the Cray-1 are given, along with a few guidelines to help avoid improper interpretation. 17 references.

  8. Optimization strategies for molecular dynamics programs on Cray computers and scalar work stations

    NASA Astrophysics Data System (ADS)

    Unekis, Michael J.; Rice, Betsy M.

    1994-12-01

    We present results of timing runs and different optimization strategies for a prototype molecular dynamics program that simulates shock waves in a two-dimensional (2-D) model of a reactive energetic solid. The performance of the program may be improved substantially by simple changes to the Fortran or by employing various vendor-supplied compiler optimizations. The optimum strategy varies among the machines used and will vary depending upon the details of the program. The effect of various compiler options and vendor-supplied subroutine calls is demonstrated. Comparison is made between two scalar workstations (IBM RS/6000 Model 370 and Model 530) and several Cray supercomputers (X-MP/48, Y-MP8/128, and C-90/16256). We find that for a scientific application program dominated by sequential, scalar statements, a relatively inexpensive high-end workstation such as the IBM RS/6000 RISC series will outperform single processor performance of the Cray X-MP/48 and perform competitively with single processor performance of the Y-MP8/128 and C-90/16256.

  9. Global Seismic Imaging Based on Adjoint Tomography

    NASA Astrophysics Data System (ADS)

    Bozdag, E.; Lefebvre, M.; Lei, W.; Peter, D. B.; Smith, J. A.; Zhu, H.; Komatitsch, D.; Tromp, J.

    2013-12-01

    Our aim is to perform adjoint tomography at the scale of the globe to image the entire planet. We have started elastic inversions with a global data set of 253 CMT earthquakes with moment magnitudes in the range 5.8 ≤ Mw ≤ 7 and used GSN stations as well as some local networks, such as USArray, European stations, etc. Using an iterative pre-conditioned conjugate gradient scheme, we initially aim to obtain a global crustal and mantle model with transverse isotropy confined to the upper mantle. Global adjoint tomography has so far remained a challenge mainly due to computational limitations. Recent improvements in our 3D solvers (e.g., a GPU version) and access to high-performance computational centers (e.g., ORNL's Cray XK7 "Titan" system) now enable us to perform iterations with higher-resolution (T > 9 s) and longer-duration (200 min) simulations to accommodate high-frequency body waves and major-arc surface waves, respectively, which help improve data coverage. The remaining challenge is the heavy I/O traffic caused by the numerous files generated during the forward/adjoint simulations and the pre- and post-processing stages of our workflow. We improve the global adjoint tomography workflow by adopting the ADIOS file format for our seismic data as well as models, kernels, etc., to improve efficiency on high-performance clusters. Our ultimate aim is to use data from all available networks and earthquakes within the magnitude range of our interest (5.5 ≤ Mw ≤ 7), which requires a solid framework to manage big data in our global adjoint tomography workflow. We discuss the current status and future of global adjoint tomography based on our initial results as well as practical issues such as handling big data in inversions and on high-performance computing systems.
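    The iterative pre-conditioned conjugate-gradient update mentioned above follows the standard PCG recurrence; a generic small-scale sketch with a Jacobi preconditioner (illustrative only, not the tomographic operator):

    ```python
    import numpy as np

    def pcg(A, b, M_inv, tol=1e-10, max_iter=200):
        """Preconditioned conjugate gradient for SPD A; M_inv approximates A^{-1}."""
        x = np.zeros_like(b)
        r = b - A @ x                 # residual
        z = M_inv @ r                 # preconditioned residual
        p = z.copy()                  # search direction
        rz = r @ z
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rz / (p @ Ap)     # step length along p
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = M_inv @ r
            rz_new = r @ z
            p = z + (rz_new / rz) * p  # conjugate direction update
            rz = rz_new
        return x

    # Tiny SPD test problem with a Jacobi (diagonal) preconditioner.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    M_inv = np.diag(1.0 / np.diag(A))
    x = pcg(A, b, M_inv)
    print(x)  # close to the exact solution [1/11, 7/11]
    ```

    In the tomographic setting the "matrix-vector product" is a pair of forward and adjoint wave simulations, which is why the cost of each iteration dominates the workflow.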

  10. Toroidal Ampere-Faraday Equations Solved Consistently with the CQL3D Fokker-Planck Time-Evolution

    NASA Astrophysics Data System (ADS)

    Harvey, R. W.; Petrov, Yu. V.

    2013-10-01

    A self-consistent, time-dependent toroidal electric field calculation is a key feature of a complete 3D Fokker-Planck kinetic distribution radial transport code for f(v,theta,rho,t). In the present CQL3D finite-difference model, the electric field E(rho,t) is either prescribed, or iteratively adjusted to obtain prescribed toroidal or parallel currents. We discuss first results of an implementation of the Ampere-Faraday equation for the self-consistent toroidal electric field, as applied to the runaway electron production in tokamaks due to rapid reduction of the plasma temperature as occurs in a plasma disruption. Our previous results assuming a constant current density (Lenz' Law) model showed that prompt ``hot-tail runaways'' dominated ``knock-on'' and Dreicer ``drizzle'' runaways; we will examine modifications due to the more complete Ampere-Faraday solution. Work supported by US DOE under DE-FG02-ER54744.

  11. Strategic Assessment 1999. Priorities for a Turbulent World.

    DTIC Science & Technology

    1999-01-01

    collectively they have economic and political weight. Second, they are demographically linked to the United States. Third, they have a capacity...regimes are at a considerably higher risk of failure than those with greater longevity. ...Re-thinking START Negotiations: The START era may be ending in a formal sense. Similar to parallel reductions pioneered by Presidents Bush

  12. Linearly polarized photoluminescence of InGaN quantum disks embedded in GaN nanorods.

    PubMed

    Park, Youngsin; Chan, Christopher C S; Nuttall, Luke; Puchtler, Tim J; Taylor, Robert A; Kim, Nammee; Jo, Yongcheol; Im, Hyunsik

    2018-05-25

    We have investigated the emission from InGaN/GaN quantum disks grown on the tips of GaN nanorods. The emission at 3.21 eV from the InGaN quantum disk does not show a Stark shift, and it is linearly polarized when excited perpendicular to the growth direction. The degree of linear polarization is about 39.3% due to the anisotropy of the nanostructures. In order to characterize a single nanostructure, the quantum disks were dispersed on a SiO2 substrate patterned with a metal reference grid. By rotating the excitation polarization angle from parallel to perpendicular relative to the nanorods, the variation of overall PL for the 3.21 eV peak was recorded, clearly showing a degree of linear polarization (DLP) of 51.5%.

  13. The CSM testbed software system: A development environment for structural analysis methods on the NAS CRAY-2

    NASA Technical Reports Server (NTRS)

    Gillian, Ronnie E.; Lotts, Christine G.

    1988-01-01

    The Computational Structural Mechanics (CSM) Activity at Langley Research Center is developing methods for structural analysis on modern computers. To facilitate that research effort, an applications development environment has been constructed to insulate the researcher from the many computer operating systems of a widely distributed computer network. The CSM Testbed development system was ported to the Numerical Aerodynamic Simulator (NAS) Cray-2, at the Ames Research Center, to provide a high end computational capability. This paper describes the implementation experiences, the resulting capability, and the future directions for the Testbed on supercomputers.

  14. Immediate Effects of Sports Taping Applied on the Lead Knee of Low- and High-Handicapped Golfers During Golf Swing.

    PubMed

    Kim, Tae-Gyu; Kim, Eun-Kuk; Park, Jong-Chul

    2017-04-01

    Kim, T-G, Kim, E-K, and Park, J-C. Immediate effects of sports taping applied on the lead knee of low- and high-handicapped golfers during golf swing. J Strength Cond Res 31(4): 981-989, 2017-Elite golf athletes suffer from various musculoskeletal injuries due to repeated golf swings. Repetitive varus moment during golf swing has been suggested as a possible cause of injuries to the lead knee. The aim of this study was to objectively and quantitatively evaluate the immediate effects of sports taping on the lead knee of elite golfers to restrict varus moment. Thirty-one elite golfers were assigned to the low- (LHG, n = 15) or high-handicapped group (HHG, n = 16). Using 3-dimensional motion analysis, the lead knee position on the frontal plane with and without rigid taping (RT), elastic taping (ET), and placebo taping was identified in 4 separate phases by the 5 events of golf swing as follows: the peak of the backswing (E1), parallel to the ground during downswing (E2), ball impact (E3), parallel to the ground during follow-through (E4), and finish (E5). The LHG when using a driver club had decreased movement toward knee varus with RT and ET than that without it from E1 to E2 (p = 0.001). The LHG when using a 5-iron club decreased movement toward knee varus with RT than that without it from E1 to E2 (p = 0.006) and from E2 to E3 (p = 0.019). The HHG when using a driver club had decreased movement toward knee varus with RT from E1 to E2 (p = 0.014). Sports taping may be helpful for elite golfers in terms of reducing varus moment of the lead knee during the downswing and be useful for the development of preventive strategies for golf-related knee injuries.

  15. Effects of 5,5'-diphenylhydantoin on thyroxine and 3,5,3'-triiodothyronine concentrations in several tissues of the rat

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schroeder-van der Elst, J.Pv.; van der Heide, D.

    1990-01-01

    We studied the effect of 5,5'-diphenylhydantoin (phenytoin, DPH) on the metabolism of thyroid hormones, the intracellular concentration of T4, and the source and concentration of T3. Two groups of six male Wistar rats received a continuous infusion of 10 ml saline/rat per day. One group received DPH in their food (50 mg/kg BW) for 20 days. For both groups (125I)T4 and (131I)T3 were added to the infusion fluid for the last 10 and 7 days, respectively. At isotopic equilibrium the rats were bled and perfused. Compared to the controls, plasma T4 and T3 in the DPH group were reduced (22% and 31%, respectively); TSH did not change. The rate of production of T4 and the plasma appearance rate for T3 were decreased. Thyroidal T3 production was markedly reduced. From the increased (125I)T3/(125I)T4 ratio for plasma, it follows that total body conversion was enhanced. The tissue T4 concentrations decreased in parallel with the plasma T4 level. Total T3 was reduced in all organs. In tissues in which local conversion does not occur, i.e. heart and muscle, the decrease reflected the decrease in plasma T3. In the liver both plasma-derived T3 and locally produced T3 were diminished. In cerebellum and brain the plasma-derived T3 pool was even smaller than was expected from the decrease in plasma T3. This was partly compensated by an increase in local conversion. Only for these two organs was the decrease in the tissue/plasma ratio for (131I)T3 significant. Our results suggest tissue hypothyroidism, caused by a decrease in the production of T4 and T3, which is partly compensated by increased conversion in several organs. The transport of T3 into cerebellum and brain is disturbed, which can be attributed to the mode of action of DPH.

  16. Therapeutic Potential of Date Palm Pollen for Testicular Dysfunction Induced by Thyroid Disorders in Male Rats.

    PubMed

    El-Kashlan, Akram M; Nooh, Mohammed M; Hassan, Wafaa A; Rizk, Sherine M

    2015-01-01

    Hyper- or hypothyroidism can impair testicular function leading to infertility. The present study was designed to examine the protective effect of date palm pollen (DPP) extract on thyroid disorder-induced testicular dysfunction. Rats were divided into six groups. Group I was normal control. Group II received oral DPP extract (150 mg kg(-1)), group III (hyperthyroid group) received intraperitoneal injection of L-thyroxine (L-T4, 300 μg kg(-1); i.p.), group IV received L-T4 plus DPP extract, group V (hypothyroid group) received propylthiouracil (PTU, 10 mg kg(-1); i.p.) and group VI received PTU plus DPP extract. All treatments were given every day for 56 days. L-T4 or PTU lowered genital sex organ weight, sperm count and motility, serum levels of luteinizing hormone (LH), follicle stimulating hormone (FSH) and testosterone (T), testicular function markers and activities of testicular 3β-hydroxysteroid dehydrogenase (3β-HSD) and 17β-hydroxysteroid dehydrogenase (17β-HSD). Moreover, L-T4 or PTU increased estradiol (E2) serum level, testicular oxidative stress, DNA damage and apoptotic markers. Morphometric and histopathologic studies backed these observations. Treatment with DPP extract prevented L-T4- or PTU-induced changes. In addition, supplementation of DPP extract to normal rats augmented sperm count and motility, serum levels of LH, T and E2 paralleled with increased activities of 3β-HSD and 17β-HSD as well as testicular antioxidant status. These results provide evidence that DPP extract may have potential protective effects on testicular dysfunction induced by altered thyroid hormones.

  18. Tailor Made Synthesis of T-Shaped and π-STACKED Dimers in the Gas Phase: Concept for Efficient Drug Design and Material Synthesis

    NASA Astrophysics Data System (ADS)

    Kumar, Sumit; Das, Aloke

    2013-06-01

    Non-covalent interactions play a key role in governing the specific functional structures of biomolecules as well as materials. Thus a molecular-level understanding of these intermolecular interactions can help in efficient drug design and material synthesis. It has been found from X-ray crystallography that pure hydrocarbon solids (e.g., benzene, hexafluorobenzene) have a mostly slanted T-shaped (herringbone) packing arrangement, whereas mixed solid hydrocarbon crystals (e.g., the solid formed from mixtures of benzene and hexafluorobenzene) exhibit a preferentially parallel-displaced (PD) π-stacked arrangement. Gas-phase spectroscopy of the dimeric complexes of the building blocks of solid pure benzene and of mixed benzene-hexafluorobenzene adducts exhibits the same structural motifs observed in the corresponding crystal structures. In this talk, I will discuss the jet-cooled dimeric complexes of indole with hexafluorobenzene and p-xylene in the gas phase, studied using resonant two-photon ionization and IR-UV double resonance spectroscopy combined with quantum chemistry calculations. Instead of studying the benzene...p-xylene and benzene...hexafluorobenzene dimers, we have studied the corresponding indole complexes because the N-H group is a much more sensitive IR probe than the C-H group. We have observed that the indole...hexafluorobenzene dimer has a parallel-displaced (PD) π-stacked structure whereas indole...p-xylene has a slanted T-shaped structure. We have shown here selective switching of the dimeric structure from T-shaped to π-stacked by changing the substituent from an electron-donating group (-CH3) to an electron-withdrawing group (fluorine) in one of the complexing partners. Thus, our results demonstrate that efficient engineering of non-covalent interactions can lead to efficient drug design and material synthesis.

  19. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wright, J. C.; Bonoli, P. T.; Schmidt, A. E.

    Lower hybrid (LH) waves (ω_ci ≪ ω ≪ ω_ce, where ω_{i,e} ≡ Z_{i,e}eB/m_{i,e}c) have the attractive property of damping strongly via electron Landau resonance on relatively fast tail electrons and consequently are well suited to driving current. Established modeling techniques use Wentzel-Kramers-Brillouin (WKB) expansions with self-consistent non-Maxwellian distributions. Higher-order WKB expansions have shown some effects on the parallel wave number evolution and consequently on the damping due to diffraction [G. Pereverzev, Nucl. Fusion 32, 1091 (1991)]. A massively parallel version of the TORIC full wave electromagnetic field solver valid in the LH range of frequencies has been developed [J. C. Wright et al., Comm. Comp. Phys. 4, 545 (2008)] and coupled to an electron Fokker-Planck solver, CQL3D [R. W. Harvey and M. G. McCoy, in Proceedings of the IAEA Technical Committee Meeting, Montreal, 1992 (IAEA Institute of Physics Publishing, Vienna, 1993), USDOC/NTIS Document No. DE93002962, pp. 489-526], in order to self-consistently evolve the nonthermal electron distributions characteristic of LH current drive (LHCD) experiments in devices such as Alcator C-Mod and ITER (B_0 ≈ 5 T, n_e0 ≈ 1×10^20 m^-3). These simulations represent the first self-consistent simulations of LHCD utilizing both a full wave and a Fokker-Planck calculation in toroidal geometry.

  20. P13631-E002PF: Pulsed field magnetostriction of Ba2CoTeO6

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tanaka, H.; Kurita, N.; Koike, M.

    Ba2CoTeO6 is an insulating material consisting of two magnetic subsystems, referred to as A and B, built of S = 1/2 spins. Subsystem A is considered to be a Heisenberg-like antiferromagnet (AFM) on a triangular lattice, and subsystem B is a J1-J2 Ising-like AFM on a honeycomb lattice. Magnetic phase transitions are observed at T_N1 = 12 K and T_N2 = 3 K for A and B, respectively. The application of magnetic fields unveils a rich phase diagram that varies with the direction of the applied field, for either H ∥ c or H ⊥ c. To date the phase diagram has been investigated by means of specific heat measurements up to 9 T and susceptibility measurements with a SQUID magnetometer up to 7 T. Magnetization measured in pulsed magnetic fields up to 60 T at 1.3 K and 4.2 K reveals several steps and plateaux occurring at varying critical fields depending on the crystallographic direction. Common to the magnetization parallel and perpendicular to c is saturation above 40 T.

  1. Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shadid, John Nicolas; Lin, Paul Tinphone

    2009-01-01

    This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores per node. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling and multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicate that the multilevel preconditioner, which is critical for large-scale capability-type simulations, scales better on the Red Storm machine than on the TLCC machine.

  2. First Detection of the Hatchett-McCray Effect in the High-Mass X-ray Binary 4U1700-37

    NASA Technical Reports Server (NTRS)

    Sonneborn, G.; Iping, R. C.; Kaper, L.; Hammerschlag-Hensberge, G.; Hutchings, J. B.

    2004-01-01

    The orbital modulation of stellar wind UV resonance line profiles as a result of ionization of the wind by the X-ray source has been observed in the high-mass X-ray binary 4U1700-37/HD 153919 for the first time. Far-UV observations (905-1180 Angstroms, resolution 0.05 Angstroms) were made at the four quadrature points of the binary orbit with the Far Ultraviolet Spectroscopic Explorer (FUSE) in 2003 April and August. The O6.5 Iaf primary eclipses the X-ray source (neutron star or black hole) with a 3.41-day period. Orbital modulation of the UV resonance lines, resulting from X-ray photoionization of the dense stellar wind, the so-called Hatchett-McCray (HM) effect, was predicted for 4U1700-37/HD 153919 (Hatchett & McCray 1977, ApJ, 211, 522) but was not seen in N V 1240, Si IV 1400, or C IV 1550 in IUE and HST spectra. The FUSE spectra show that the P V 1118-1128 and S IV 1063-1073 P-Cygni lines vary as expected for the HM effect: weakest at phase 0.5 (X-ray source conjunction) and strongest at phase 0.0 (X-ray source eclipse). The phase modulation of the O VI 1032-1037 lines, however, is opposite to that of P V and S IV, implying that O VI may be a byproduct of the wind's ionization by the X-ray source. Such variations were not observed in N V, Si IV, and C IV because of their high optical depth. Due to their lower cosmic abundance, the P V and S IV wind lines are unsaturated, making them excellent tracers of the ionization conditions in the O star's wind.

  3. SU-F-J-146: Experimental Validation of 6 MV Photon PDD in Parallel Magnetic Field Calculated by EGSnrc

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ghila, A; Steciw, S; Fallone, B

    Purpose: Integrated linac-MR systems are uniquely suited for real-time tumor tracking during radiation treatment. Understanding the magnetic field dose effects and incorporating them in treatment planning is paramount for linac-MR clinical implementation. We experimentally validated the EGSnrc dose calculations in the presence of a magnetic field parallel to the direction of radiation beam travel. Methods: Two cylindrical bore electromagnets produced a 0.21 T magnetic field parallel to the central axis of a 6 MV photon beam. A parallel plate ion chamber was used to measure the PDD in a polystyrene phantom, placed inside the bore in two setups: phantom top surface coinciding with the magnet bore center (183 cm SSD), and with the magnet bore's top surface (170 cm SSD). We measured the field of the magnet at several points and included the exact dimensions of the coils to generate a 3D magnetic field map in a finite element model. BEAMnrc and DOSXYZnrc simulated the PDD experiments in a parallel magnetic field (i.e., with the 3D magnetic field included) and with no magnetic field. Results: With the phantom surface at the top of the electromagnet, the surface dose increased by 10% (compared to no magnetic field), due to electrons being focused by the smaller fringe fields of the electromagnet. With the phantom surface at the bore center, the surface dose increased by 30%, since an extra 13 cm of air column was in a relatively higher magnetic field (>0.13 T) in the magnet bore. The EGSnrc Monte Carlo code correctly calculated the radiation dose with and without the magnetic field, and all points passed the 2%, 2 mm Gamma criterion when the ion chamber's entrance window and air cavity were included in the simulated phantom. Conclusion: A parallel magnetic field increases the surface and buildup dose during irradiation. The EGSnrc package can model these magnetic field dose effects accurately. Dr. Fallone is a co-founder and CEO of MagnetTx Oncology Solutions (under discussions to license Alberta bi-planar linac MR for commercialization).

  4. Distributed multitasking ITS with PVM

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fan, W.C.; Halbleib, J.A. Sr.

    1995-12-31

    Advances in computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable to or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generated in a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of the MCNP code on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electron-photon transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load-balancing schemes for homogeneous and heterogeneous networks.
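The near-linear scaling argument above can be sketched in a few lines. This is a purely illustrative toy (estimating π by dart throws), not the ITS/PVM code; the function names and batch sizes are assumptions for demonstration. Each worker generates an independent batch of "histories" from its own seed, and the only interprocessor communication is a single gather of the partial tallies at the end, which is why wall-clock throughput grows almost linearly with the number of processors.

```python
# Illustrative sketch of history-parallel Monte Carlo (not the ITS/PVM code):
# each worker runs an independent batch of histories; the single gather at
# the end is the only communication, so speedup is nearly linear.
import random
from multiprocessing import Pool

def run_histories(args):
    """Run one worker's batch of independent histories (dart throws)."""
    n, seed = args
    rng = random.Random(seed)   # independent stream per worker
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def parallel_estimate(total_histories, workers):
    per_worker = total_histories // workers
    with Pool(workers) as pool:
        tallies = pool.map(run_histories,
                           [(per_worker, seed) for seed in range(workers)])
    # Gather partial tallies; statistical error shrinks as 1/sqrt(N).
    return 4.0 * sum(tallies) / (per_worker * workers)

if __name__ == "__main__":
    print(parallel_estimate(400_000, 4))  # close to pi
```

Doubling `workers` (on idle machines) roughly doubles the histories per unit wall-clock time, which is exactly the property the abstract exploits.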

  6. Compression of an Accelerated Taylor State in SSX

    NASA Astrophysics Data System (ADS)

    Shrock, J. E.; Suen-Lewis, E. M.; Barbano, L. J.; Kaur, M.; Schaffner, D. A.; Brown, M. R.

    2017-10-01

    In the Swarthmore Spheromak Experiment (SSX), compact toroidal plasmas are launched from a plasma gun and evolve into minimum energy twisted Taylor states. The plumes initially have a velocity of 40 km/s, density 0.4×10^16 cm^-3, and proton temperature 20 eV. After formation, the plumes are accelerated by pulsed pinch coils with rise times τ_1/4 = (π/2)√(LC) of less than 1 μs and currents I_peak = V_0/Z = V_0/√(L/C) on the order of 10^4 A. The accelerated Taylor states are abruptly stagnated in a copper flux conserver, and over the course of t < 10 μs, adiabatic compression is observed. The magnetothermodynamics of this compression do not appear to be dictated by the MHD equation of state d/dt(P/n^γ) = 0. Rather, the compression appears to evolve according to the Chew-Goldberger-Low (CGL) double adiabatic model. CGL theory presents two equations of state, one corresponding to particle motion perpendicular to the magnetic field in a plasma, the other to particle motion parallel to the field. We observe Taylor state compression most in agreement with the parallel equation of state: d/dt(P_∥ B^2/n^3) = 0. DOE ARPA-E ALPHA Program.
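For reference, the equations of state compared above can be written out side by side (the standard isotropic MHD and CGL double-adiabatic forms, with P_⊥ and P_∥ the pressures perpendicular and parallel to B; notation follows the abstract):

```latex
% Isotropic MHD adiabatic law:
\frac{d}{dt}\left(\frac{P}{n^{\gamma}}\right) = 0
% CGL double-adiabatic laws (perpendicular and parallel invariants):
\frac{d}{dt}\left(\frac{P_{\perp}}{nB}\right) = 0, \qquad
\frac{d}{dt}\left(\frac{P_{\parallel} B^{2}}{n^{3}}\right) = 0
```

The SSX measurements cited above agree best with the third (parallel) invariant.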

  7. Microstructure and phase composition of hypoeutectic Te-Bi alloy as evaporation source for photoelectric cathode

    NASA Astrophysics Data System (ADS)

    Wang, Bao-guang; Yang, Wen-hui; Gao, Hong-ye; Tian, Wen-huai

    2018-05-01

    A hypoeutectic 60Te-40Bi alloy (in mass percent) was designed as a tellurium-atom evaporation source, instead of pure tellurium, for an ultraviolet-detection photocathode. The alloy was prepared by slow solidification at about 10^-2 K·s^-1. The microstructure, crystal structure, chemical composition, and crystallographic orientation of each phase in the as-prepared alloy were investigated by optical microscopy, scanning electron microscopy, X-ray diffraction, electron backscatter diffraction, and transmission electron microscopy. The experimental results suggest that the as-prepared 60Te-40Bi alloy consists of primary Bi2Te3 and eutectic Bi2Te3/Te phases. The primary Bi2Te3 phase has the characteristics of faceted growth. The eutectic Bi2Te3 phase is encased by the eutectic Te phase in the eutectic structure. The purity of the eutectic Te phase reaches 100 wt% owing to the slow solidification. In the eutectic phases, the crystallographic orientation relationship between Bi2Te3 and Te is confirmed as [0001]_{Bi2Te3} // [1\bar{2}1\bar{3}]_{Te}, and the Te-phase direction parallel to [11\bar{2}0]_{Bi2Te3} deviates by 18° from N_{(2\bar{1}\bar{1}1)_{Te}}.

  8. Dynamic Programming and Transitive Closure on Linear Pipelines.

    DTIC Science & Technology

    1984-05-01

    ... An ideal solution to small problem sizes is to design an algorithm on an array where the ... References: [1] A.V. Aho, J. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley (1974). [2] R. Aubusson ... [3] K.E. Batcher, "Design of a Massively Parallel Processor," IEEE-TC, Vol. C-29, No. 9 (September 1980), pp. 836-840. [4] K.Q. Brown, "Dynamic Programming ...

  9. Prospective In Vivo Comparison of Damaged and Healthy-Appearing Articular Cartilage Specimens in Patients With Femoroacetabular Impingement: Comparison of T2 Mapping, Histologic Endpoints, and Arthroscopic Grading.

    PubMed

    Ho, Charles P; Surowiec, Rachel K; Frisbie, David D; Ferro, Fernando P; Wilson, Katharine J; Saroki, Adriana J; Fitzcharles, Eric K; Dornan, Grant J; Philippon, Marc J

    2016-08-01

    To describe T2 mapping values in arthroscopically determined International Cartilage Repair Society (ICRS) grades in damaged and healthy-appearing articular cartilage waste specimens from arthroscopic femoroacetabular impingement (FAI) treatment. Furthermore, we sought to compare ICRS grades of the specimens with biochemical, immunohistochemical, and histologic endpoints and assess correlations with T2 mapping. Twenty-four patients were prospectively enrolled, consecutively, between December 2011 and August 2012. Patients were included if they were aged 18 years or older and met criteria that followed the clinical indications for arthroscopy to treat FAI. Patients with prior hip trauma, including fracture or dislocation, or who had undergone prior hip surgery were excluded. All patients received a preoperative sagittal T2 mapping scan of the hip joint. Cartilage was graded intraoperatively using the ICRS grading system, and graded specimens were collected as cartilage waste for histologic, biochemical, and immunohistochemical analysis. Forty-four cartilage specimens (22 healthy-appearing, 22 damaged) were analyzed. Median T2 values were significantly higher among damaged specimens (55.7 ± 14.9 ms) than healthy-appearing specimens (49.3 ± 12.3 ms; P = .043), a difference most exaggerated among mild (grade 1 or 2) defects, where the damaged specimens (58.1 ± 16.4 ms) were significantly higher than their paired healthy-appearing specimens (48.7 ± 15.4 ms; P = .026). Severely damaged specimens (grade 3 or 4) had significantly lower cumulative H&E than their paired healthy-appearing counterparts (P = .02), but the difference was not statistically significant among damaged specimens with mild (grade 1 or 2) defects (P = .198). Among healthy-appearing specimens, median T2 and the percentage of collagen fibers oriented parallel were significantly correlated (rho = 0.425, P = .048). This study outlines the potential for T2 mapping to identify early cartilage degeneration in patients undergoing arthroscopy to treat FAI. Findings in ICRS grade 1 and 2 degeneration corresponded to an increase in T2 values. Further biochemical evaluation revealed a significant difference between healthy-appearing cartilage and late degeneration in cumulative H&E, as well as a significantly lower percentage of collagen fibers oriented parallel and a higher percentage oriented randomly when considering all grades of cartilage damage. Level II, prospective comparative study. Copyright © 2016 Arthroscopy Association of North America. Published by Elsevier Inc. All rights reserved.

  10. Performance Analysis and Portability of the PLUM Load Balancing System

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak; Gabow, Harold N.

    1998-01-01

    The ability to dynamically adapt an unstructured mesh is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult. To address this problem, we have developed PLUM, an automatic portable framework for performing adaptive numerical computations in a message-passing environment. PLUM requires that all data be globally redistributed after each mesh adaption to achieve load balance. We present an algorithm for minimizing this remapping overhead by guaranteeing an optimal processor reassignment. We also show that the data redistribution cost can be significantly reduced by applying our heuristic processor reassignment algorithm to the default mapping of the parallel partitioner. Portability is examined by comparing performance on an SP2, an Origin2000, and a T3E. Results show that PLUM can be successfully ported to different platforms without any code modifications.
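As a toy illustration of the remapping-cost idea (this is not PLUM's actual algorithm; the overlap matrix, function name, and brute-force search are assumptions for demonstration only), choosing which new partition each processor receives can be cast as an assignment problem: maximize the amount of data that is already resident on each processor, so that the least data has to move over the network.

```python
# Toy sketch of remapping-cost minimization (not PLUM's actual algorithm):
# relabel new partitions so the data already in place is maximized, i.e.
# solve a small assignment problem by brute force.
# overlap[p][q] = amount of data of new partition q already on processor p.
from itertools import permutations

def best_reassignment(overlap):
    """Return the partition-to-processor assignment keeping the most data."""
    n = len(overlap)
    best, best_kept = None, -1
    for perm in permutations(range(n)):   # perm[p] = partition given to proc p
        kept = sum(overlap[p][perm[p]] for p in range(n))
        if kept > best_kept:
            best, best_kept = perm, kept
    return best, best_kept

# After mesh adaption, the default partitioner labeling (identity) would
# keep only 9 + 2 + 1 = 12 units in place; relabeling keeps 24.
overlap = [
    [9, 1, 0],   # processor 0 already holds most of partition 0
    [0, 2, 8],   # processor 1 already holds most of partition 2
    [1, 7, 1],   # processor 2 already holds most of partition 1
]
assignment, kept = best_reassignment(overlap)  # -> (0, 2, 1), 24
```

For real processor counts one would use a polynomial-time assignment solver (e.g. the Hungarian algorithm) rather than this factorial-time enumeration; the objective is the same.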

  11. Novel molecular targets for kRAS downregulation: promoter G-quadruplexes

    DTIC Science & Technology

    2016-11-01

    conditions, and described the structure as having mixed parallel/anti-parallel loops of lengths 2:8:10 in the 5’-3’ direction. Using selective small...and anti-parallel loop directionality of lengths 4:10:8 in the 5’–3’ direction, three tetrads stacked, and involving guanines in runs B, C, E, and F...a tri-stacked structure incorporating runs B, C, E and F with intervening loops of 2, 10, and 8 bases in the 5’–3’ direction. G = black circles, C

  12. State of the art in electromagnetic modeling for the Compact Linear Collider

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Candel, Arno; Kabel, Andreas; Lee, Lie-Quan

    SLAC's Advanced Computations Department (ACD) has developed the parallel 3D electromagnetic time-domain code T3P for simulations of wakefields and transients in complex accelerator structures. T3P is based on state-of-the-art Finite Element methods on unstructured grids and features unconditional stability, quadratic surface approximation and up to 6th-order vector basis functions for unprecedented simulation accuracy. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with fast turn-around times, aiding the design of the next generation of accelerator facilities. Applications include simulations of the proposed two-beam accelerator structures for the Compact Linear Collider (CLIC) - wakefield damping in the Power Extraction and Transfer Structure (PETS) and power transfer to the main beam accelerating structures are investigated.

  13. [The effect of synchronization on estrus and pregnancy in sheep and on levels of thyroxine, triiodothyronine, 17 beta-estradiol, progesterone and cholesterol].

    PubMed

    Bekeová, E; Krajnicáková, M; Hendrichovský, V; Maracek, I

    1991-07-01

    Knowledge of the pathogenesis of sexual dysfunctions under altered thyroid activity is limited by our knowledge of the multiple and ubiquitous actions of thyroid hormones throughout the organism. One possible modulatory influence of thyroid hormones on sexual functions can be realized through the participation of thyroxine and triiodothyronine in the synthesis and metabolism of the primary substrate of steroid synthesis, cholesterol. The present work studies the simultaneous dynamic changes in the concentrations of thyroxine (T4), triiodothyronine (T3), 17 beta-estradiol (E2), progesterone (P4) and cholesterol (Chol) during synchronization of the rutting period and during gravidity, with parallel correlative evaluation of the mutual relations of the followed parameters, in ten Merino sheep in the seasonal period. Synchronization was achieved by chlorsuperlutin (Agelin vaginal swabs, Spofa; 20 mg of chlorsuperlutin/swab) and PMSG (500 I.U./animal). Blood was sampled by jugular vein puncture at the time of swab insertion (-13th day) and after three (-10th day) and seven (-7th day) following days, at the removal of swabs and application of PMSG (-3rd day), on the day of insemination (day zero), on the 7th, 14th and 17th day, and in the middle of the 2nd, 3rd, 4th and 5th month of gravidity. In the phase of oestrus synchronization a significant increase of E2 concentrations on days -7 and -3 of the experiment (0.47 +/- 0.079 and 0.542 +/- 0.177 nmol.l-1 of serum, P less than 0.001; P less than 0.001) was observed compared to the E2 values on day -13 (0.084 +/- 0.036 nmol.l-1 of serum). Parallel to these observations, marked intermittent changes of T4 (Tab. I, Graph 1) were recorded, with the lowest values of this parameter observed on days -10 (41.75 +/- 20.23, P less than 0.05) and -3 (50.22 +/- 18.77, P less than 0.05) and the highest on day -7 (96.77 +/- 17.51 nmol.l-1, P less than 0.01) and day zero (85.40 +/- 19.59 nmol.l-1 of serum, P less than 0.05), in comparison with the -13th day (67.22 +/- 18.29 nmol.l-1 of serum). Concentrations of P4 (Tab. I, Graph 4) declined to the lowest values on the day-zero observation (0.09 +/- 0.08 nmol.l-1 of serum, P less than 0.05 vs 3.40 +/- 3.61 nmol.l-1 on day -13). No significant changes of concentrations of T3 (Tab. I, Graph 2) and Chol (Tab. I, Graph 5) were observed during oestrus synchronization. During gravidity, concentrations of E2 (Tab. I, Graph 3) showed an increasing trend compared to the -13th day. (ABSTRACT TRUNCATED AT 250 WORDS)

  14. High-Performance 3D Image Processing Architectures for Image-Guided Interventions

    DTIC Science & Technology

    2008-01-01

    Parallel architectures and algorithms for image understanding. Boston: Academic Press, 1991. [99] A. Bruhn, T. Jakob, M. Fischer, T. Kohlberger, J. ... Symposium on Pattern Recognition, vol. 2449, pp. 290-297, 2002. [100] A. Bruhn, T. Jakob, M. Fischer, T. Kohlberger, J. Weickert, U. Bruning, and C ...

  15. The effect of obesity and type 1 diabetes on renal function in children and adolescents.

    PubMed

    Franchini, Simone; Savino, Alessandra; Marcovecchio, M Loredana; Tumini, Stefano; Chiarelli, Francesco; Mohn, Angelika

    2015-09-01

    Early signs of renal complications can be common in youths with type 1 diabetes (T1D). Recently, there has been increasing interest in potential renal complications associated with obesity, paralleling the epidemic of this condition, although there are limited data in children. We assessed whether obese children and adolescents present signs of early alterations in renal function similar to non-obese peers with T1D. Eighty-three obese (age: 11.6 ± 3.0 yr), 164 non-obese T1D (age: 12.4 ± 3.2 yr), and 71 non-obese control (age: 12.3 ± 3.2 yr) children and adolescents were enrolled in the study. Anthropometric parameters and blood pressure were measured. Renal function was assessed by albumin excretion rate (AER), serum cystatin C, creatinine, and estimated glomerular filtration rate (e-GFR), calculated using the Bouvet formula. Obese and non-obese T1D youths had similar AER [8.9(5.9-10.8) vs. 8.7(5.9-13.1) µg/min] and e-GFR levels (114.8 ± 19.6 vs. 113.4 ± 19.1 mL/min), which were higher than in controls [AER: 8.1(5.9-8.7) µg/min, e-GFR: 104.7 ± 18.9 mL/min]. Prevalence of microalbuminuria and hyperfiltration was similar between obese and T1D youths and higher than in their control peers (6.0 vs. 8.0 vs. 0%, p = 0.02; 15.9 vs. 15.9 vs. 4.3%, p = 0.03, respectively). Body mass index (BMI) z-score was independently related to e-GFR (r = 0.328; p < 0.001) and AER (r = 0.138; p = 0.017). Hemoglobin A1c (HbA1c) correlated with AER (r = 0.148; p = 0.007) but not with e-GFR (r = 0.041; p = 0.310). Obese children and adolescents show early alterations in renal function compared to normal-weight peers, and their renal profiles are similar to those of age-matched peers with T1D. © 2014 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  16. Network issues for large mass storage requirements

    NASA Technical Reports Server (NTRS)

    Perdue, James

    1992-01-01

    File servers and supercomputing environments need high-performance networks to balance the I/O requirements seen in today's demanding computing scenarios. UltraNet is one solution, permitting both high aggregate transfer rates and high task-to-task transfer rates, as demonstrated in actual tests. UltraNet provides this capability as both a server-to-server and server-to-client access network, giving the supercomputing center the following advantages: highest-performance transport-level connections (up to 40 MBytes/sec effective rates); matching the throughput of emerging high-performance disk technologies such as RAID, parallel-head transfer devices, and software striping; support for standard network and file system applications through a sockets-based application program interface (FTP, rcp, rdump, etc.); access to the Network File System (NFS) and large aggregate bandwidth for heavy NFS usage; access to a distributed, hierarchical data server capability using the DISCOS UniTree product; and support for file server solutions available from multiple vendors, including Cray, Convex, Alliant, FPS, IBM, and others.

  17. Multitasking the Davidson algorithm for the large, sparse eigenvalue problem

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Umar, V.M.; Fischer, C.F.

    1989-01-01

    The authors report how the Davidson algorithm, developed for handling the eigenvalue problem for large and sparse matrices arising in quantum chemistry, was modified for use in atomic structure calculations. To date these calculations have used traditional eigenvalue methods, which limit the range of feasible calculations because of their excessive memory requirements and unsatisfactory performance attributed to the time-consuming and costly processing of zero-valued elements. The replacement of a traditional matrix eigenvalue method by the Davidson algorithm reduced these limitations. Significant speedup was found, which varied with the size of the underlying problem and its sparsity. Furthermore, the range of matrix sizes that can be manipulated efficiently was expanded by more than one order of magnitude. On the CRAY X-MP the code was vectorized and the importance of gather/scatter was analyzed. A parallelized version of the algorithm obtained an additional 35% reduction in execution time. Speedup due to vectorization and concurrency was also measured on the Alliant FX/8.
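    The core of the Davidson method described above, a Rayleigh-Ritz projection onto a subspace grown with diagonally preconditioned residuals, can be sketched in a few lines. This is a minimal dense NumPy illustration of the general technique, not the authors' vectorized Fortran code; the single-vector loop, tolerances, and starting guess are all illustrative choices.

    ```python
    import numpy as np

    def davidson_lowest(A, tol=1e-8, max_iter=100):
        """Approximate the lowest eigenvalue of a symmetric matrix A by
        Davidson's method: Rayleigh-Ritz in a growing subspace expanded
        with diagonally preconditioned residual vectors."""
        n = A.shape[0]
        diag = np.diag(A).copy()
        V = np.zeros((n, 1))
        V[np.argmin(diag), 0] = 1.0          # start at the smallest diagonal
        theta = diag.min()
        for _ in range(max_iter):
            H = V.T @ A @ V                  # projected (small) matrix
            vals, vecs = np.linalg.eigh(H)
            theta = vals[0]
            x = V @ vecs[:, 0]               # Ritz vector in the full space
            r = A @ x - theta * x            # residual
            if np.linalg.norm(r) < tol:
                break
            denom = diag - theta             # diagonal (Jacobi) preconditioner
            denom[np.abs(denom) < 1e-12] = 1e-12
            t = r / denom
            t -= V @ (V.T @ t)               # orthogonalize against subspace
            norm_t = np.linalg.norm(t)
            if norm_t < 1e-12:
                break
            V = np.hstack([V, (t / norm_t)[:, None]])
        return theta
    ```

    The method works best when A is diagonally dominant, as configuration-interaction matrices typically are; the subspace then converges in far fewer iterations than the matrix dimension, which is the source of the memory and time savings the abstract reports.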

  18. Towards real-time detection and tracking of spatio-temporal features: Blob-filaments in fusion plasma

    DOE PAGES

    Wu, Lingfei; Wu, Kesheng; Sim, Alex; ...

    2016-06-01

    A novel algorithm and implementation of real-time identification and tracking of blob-filaments in fusion reactor data is presented. Similar spatio-temporal features are important in many other applications, for example, ignition kernels in combustion and tumor cells in a medical image. This work presents an approach for extracting these features by dividing the overall task into three steps: local identification of feature cells, grouping feature cells into extended features, and tracking the movement of features through their overlap in space. Through our extensive work in parallelization, we demonstrate that this approach can effectively make use of a large number of compute nodes to detect and track blob-filaments in real time in fusion plasma. Here, on a set of 30 GB fusion simulation data, we observed linear speedup on 1024 processes and completed blob detection in less than three milliseconds using Edison, a Cray XC30 system at NERSC.
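    The three-step decomposition above (identify feature cells, group them into extended features, track by spatial overlap) can be sketched serially. The NumPy version below is an illustration under the assumptions of a simple intensity threshold and 4-connectivity; it is not the paper's parallel MPI implementation.

    ```python
    import numpy as np
    from collections import deque

    def feature_cells(field, threshold):
        """Step 1: local identification of feature cells by a simple threshold."""
        return field > threshold

    def label_blobs(mask):
        """Step 2: group feature cells into extended features (4-connected BFS).
        Returns an integer label array and the number of blobs found."""
        labels = np.zeros(mask.shape, dtype=int)
        n_blobs = 0
        for i, j in zip(*np.nonzero(mask)):
            if labels[i, j]:
                continue
            n_blobs += 1
            labels[i, j] = n_blobs
            queue = deque([(i, j)])
            while queue:
                a, b = queue.popleft()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    x, y = a + da, b + db
                    if (0 <= x < mask.shape[0] and 0 <= y < mask.shape[1]
                            and mask[x, y] and not labels[x, y]):
                        labels[x, y] = n_blobs
                        queue.append((x, y))
        return labels, n_blobs

    def track_by_overlap(labels_prev, n_prev, labels_next):
        """Step 3: track features across frames by their spatial overlap.
        Each old blob is matched to the new blob covering most of its cells."""
        matches = {}
        for k in range(1, n_prev + 1):
            hits = labels_next[labels_prev == k]
            hits = hits[hits > 0]
            if hits.size:
                matches[k] = int(np.bincount(hits).argmax())
        return matches
    ```

    Because each step only touches a cell and its neighbors, the domain can be partitioned across processes with a halo exchange, which is what makes the near-linear scaling reported above plausible.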

  19. Comparison of scientific computing platforms for MCNP4A Monte Carlo calculations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hendricks, J.S.; Brockhoff, R.C.

    1994-04-01

    The performance of seven computer platforms is evaluated with the widely used and internationally available MCNP4A Monte Carlo radiation transport code. All results are reproducible and are presented in such a way as to enable comparison with computer platforms not in the study. The authors observed that the HP/9000-735 workstation runs MCNP 50% faster than the Cray YMP 8/64. Compared with the Cray YMP 8/64, the IBM RS/6000-560 is 68% as fast, the Sun Sparc10 is 66% as fast, the Silicon Graphics ONYX is 90% as fast, the Gateway 2000 model 4DX2-66V personal computer is 27% as fast, and the Sun Sparc2 is 24% as fast. In addition to comparing the timing performance of the seven platforms, the authors observe that changes in compilers and software over the past 2 yr have resulted in only modest performance improvements, hardware improvements have enhanced performance by less than a factor of approximately 3, timing studies are very problem dependent, and MCNP4A runs about as fast as MCNP4.
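    Since all of the timings above are quoted relative to the Cray YMP 8/64, they can be turned directly into a ranking and into relative runtimes. The figures below are copied from the abstract; the only assumption is the usual normalization that, for a fixed problem, runtime scales as the inverse of relative speed.

    ```python
    # Relative MCNP4A speeds from the abstract, normalized to the
    # Cray YMP 8/64 (= 1.00); "50% faster" means a ratio of 1.50.
    relative_speed = {
        "HP 9000/735": 1.50,
        "Cray YMP 8/64": 1.00,
        "SGI ONYX": 0.90,
        "IBM RS/6000-560": 0.68,
        "Sun Sparc10": 0.66,
        "Gateway 2000 4DX2-66V": 0.27,
        "Sun Sparc2": 0.24,
    }

    # Rank platforms from fastest to slowest.
    ranking = sorted(relative_speed, key=relative_speed.get, reverse=True)

    # For a fixed problem, runtime scales as 1 / relative speed.
    runtime_vs_cray = {m: round(1.0 / s, 2) for m, s in relative_speed.items()}
    ```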

  20. An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Hargrove, Paul H.; Iancu, Costin

    The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism, 2) message ordering constraints, 3) per node and per core message concurrency. In addition to using Cray native communication APIs, we use UPC and MPI micro-benchmarks to capture one- and two-sided semantics respectively. Our results indicate that relaxing the message delivery order can improve performance up to 4.6x when compared with strict ordering. When hardware allows it, high-level one-sided programming models can already take advantage of message reordering. Enforcing the ordering semantics of two-sided communication comes with a performance penalty. Furthermore, we argue that exposing out-of-order delivery at the application level is required for the next-generation programming models. Any ordering constraints in the language specifications reduce communication performance for small messages and increase the number of active cores required for peak throughput.

  1. Shear Alignment of Diblock Copolymers for Patterning Nanowire Meshes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gustafson, Kyle T.

    2016-09-08

    Metallic nanowire meshes are useful as cheap, flexible alternatives to indium tin oxide, an expensive, brittle material used in transparent conductive electrodes. We have fabricated nanowire meshes over areas up to 2.5 cm² by: 1) mechanically aligning parallel rows of diblock copolymer (diBCP) microdomains; 2) selectively infiltrating those domains with metallic ions; 3) etching away the diBCP template; 4) sintering to reduce the ions to metal nanowires; and 5) repeating steps 1-4 on the same sample at a 90° offset. We aligned parallel rows of polystyrene-b-poly(2-vinylpyridine) [PS(48.5 kDa)-b-P2VP(14.5 kDa)] microdomains by heating above its glass transition temperature (Tg ≈ 100 °C), applying mechanical shear pressure (33 kPa) and normal force (13.7 N), and cooling below Tg. DiBCP samples were submerged for 30-90 minutes in aqueous solutions of metallic ions (15-40 mM ions; 0.1-0.5 M HCl), which coordinate to the nitrogen in P2VP. Subsequent ozone-etching and sintering steps yielded parallel nanowires. We aimed to optimize alignment parameters (e.g. shear and normal pressures, alignment duration, and PDMS thickness) to improve the quality, reproducibility, and scalability of the meshes. We also investigated metals other than Pt and Au that may be patterned using this technique (Cu, Ag).

  2. Theoretical research program to study chemical reactions in AOTV bow shock tubes

    NASA Technical Reports Server (NTRS)

    Taylor, P.

    1986-01-01

    Progress in the development of computational methods for the characterization of chemical reactions in aerobraking orbit transfer vehicle (AOTV) propulsive flows is reported. Two main areas of code development were undertaken: (1) the implementation of CASSCF (complete active space self-consistent field) and SCF (self-consistent field) analytical first derivatives on the CRAY X-MP; and (2) the installation of the complete set of electronic structure codes on the CRAY 2. In the area of application calculations the main effort was devoted to performing full configuration-interaction calculations and using these results to benchmark other methods. Preprints describing some of the systems studied are included.

  3. Optimal Full Information Synthesis for Flexible Structures Implemented on Cray Supercomputers

    NASA Technical Reports Server (NTRS)

    Lind, Rick; Balas, Gary J.

    1995-01-01

    This paper considers an algorithm for the synthesis of optimal controllers for full information feedback. The synthesis procedure reduces to a single linear matrix inequality which may be solved via established convex optimization algorithms. The computational cost of the optimization is investigated. It is demonstrated that the problem dimension and the corresponding matrices can become large for practical engineering problems. This algorithm represents a process that is impractical for standard workstations for large order systems. A flexible structure is presented as a design example. Control synthesis requires several days on a workstation but may be solved in a reasonable amount of time using a Cray supercomputer.
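    As a much smaller illustration of the kind of matrix inequality involved, the sketch below solves the Lyapunov equation A'P + PA = -Q via Kronecker products and checks the resulting feasibility conditions (P positive definite, A'P + PA negative definite). The matrices are hypothetical and this is only the simplest building block of such methods, not the paper's full-information synthesis algorithm.

    ```python
    import numpy as np

    def lyapunov_feasibility(A, Q):
        """Solve A'P + PA = -Q and report whether the Lyapunov LMI
        A'P + PA < 0 with P > 0 is satisfied (true iff A is Hurwitz
        when Q is positive definite)."""
        n = A.shape[0]
        I = np.eye(n)
        # Row-major vectorization identities:
        #   vec(A' P) = (A' kron I) vec(P),  vec(P A) = (I kron A') vec(P)
        M = np.kron(A.T, I) + np.kron(I, A.T)
        P = np.linalg.solve(M, -Q.flatten()).reshape(n, n)
        P = (P + P.T) / 2  # symmetrize against round-off
        pos_def = np.all(np.linalg.eigvalsh(P) > 0)
        lmi_neg = np.all(np.linalg.eigvalsh(A.T @ P + P @ A) < 0)
        return P, pos_def and lmi_neg
    ```

    The Kronecker formulation makes the abstract's scaling point concrete: an n-state plant produces an n² x n² linear system, so problem size grows quartically in memory, which is why large flexible-structure models overwhelmed 1995-era workstations.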

  4. A comparison of the Cray-2 performance before and after the installation of memory pseudo-banking

    NASA Technical Reports Server (NTRS)

    Schmickley, Ronald D.; Bailey, David H.

    1987-01-01

    A suite of 13 large Fortran benchmark codes was run on a Cray-2 configured with memory pseudo-banking circuits, and floating point operation rates were measured for each under a variety of system load configurations. These were compared with similar flop measurements taken on the same system before installation of the pseudo-banking. A useful memory access efficiency parameter was defined and calculated for both sets of performance rates, allowing a crude quantitative measure of the improvement in efficiency due to pseudo-banking. Programs were categorized as either highly scalar (S) or highly vectorized (V), and either memory-intensive or register-intensive, giving 4 categories: S-memory, S-register, V-memory, and V-register. Using flop rates as a simple quantifier of these 4 categories, a scatter plot of efficiency gain vs. Mflops roughly illustrates the improvement in floating point processing speed due to pseudo-banking. On the Cray-2 system tested this improvement ranged from 1 percent for S-memory codes to about 12 percent for V-memory codes. No significant gains were made for V-register codes, which was to be expected.

  5. A 10 mK scanning tunneling microscope operating in ultra high vacuum and high magnetic fields.

    PubMed

    Assig, Maximilian; Etzkorn, Markus; Enders, Axel; Stiepany, Wolfgang; Ast, Christian R; Kern, Klaus

    2013-03-01

    We present design and performance of a scanning tunneling microscope (STM) that operates at temperatures down to 10 mK providing ultimate energy resolution on the atomic scale. The STM is attached to a dilution refrigerator with direct access to an ultra high vacuum chamber allowing in situ sample preparation. High magnetic fields of up to 14 T perpendicular and up to 0.5 T parallel to the sample surface can be applied. Temperature sensors mounted directly at the tip and sample position verified the base temperature within a small error margin. Using a superconducting Al tip and a metallic Cu(111) sample, we determined an effective temperature of 38 ± 1 mK from the thermal broadening observed in the tunneling spectra. This results in an upper limit for the energy resolution of ΔE = 3.5 kBT = 11.4 ± 0.3 μeV. The stability between tip and sample is 4 pm at a temperature of 15 mK as demonstrated by topography measurements on a Cu(111) surface.
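    The quoted energy resolution follows directly from the thermal-broadening relation ΔE = 3.5 kBT stated in the abstract; the short check below reproduces it. The value of kB is the standard CODATA constant, and the 3.5 prefactor is the one the authors use.

    ```python
    kB = 8.617333262e-5   # Boltzmann constant, eV/K
    T_eff = 38e-3         # measured effective temperature, K

    dE_ueV = 3.5 * kB * T_eff * 1e6   # thermal energy resolution, in microeV
    # About 11.5 microeV, consistent with the paper's 11.4 +/- 0.3 microeV
    ```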

  6. ARCGRAPH SYSTEM - AMES RESEARCH GRAPHICS SYSTEM

    NASA Technical Reports Server (NTRS)

    Hibbard, E. A.

    1994-01-01

    Ames Research Graphics System, ARCGRAPH, is a collection of libraries and utilities which assist researchers in generating, manipulating, and visualizing graphical data. In addition, ARCGRAPH defines a metafile format that contains device independent graphical data. This file format is used with various computer graphics manipulation and animation packages at Ames, including SURF (COSMIC Program ARC-12381) and GAS (COSMIC Program ARC-12379). In its full configuration, the ARCGRAPH system consists of a two stage pipeline which may be used to output graphical primitives. Stage one is associated with the graphical primitives (i.e. moves, draws, color, etc.) along with the creation and manipulation of the metafiles. Five distinct data filters make up stage one. They are: 1) PLO which handles all 2D vector primitives, 2) POL which handles all 3D polygonal primitives, 3) RAS which handles all 2D raster primitives, 4) VEC which handles all 3D raster primitives, and 5) PO2 which handles all 2D polygonal primitives. Stage two is associated with the process of displaying graphical primitives on a device. To generate the various graphical primitives, create and reprocess ARCGRAPH metafiles, and access the device drivers in the VDI (Video Device Interface) library, users link their applications to ARCGRAPH's GRAFIX library routines. Both FORTRAN and C language versions of the GRAFIX and VDI libraries exist for enhanced portability within these respective programming environments. The ARCGRAPH libraries were developed on a VAX running VMS. 
Minor documented modification of various routines, however, allows the system to run on the following computers: Cray X-MP running COS (no C version); Cray 2 running UNICOS; DEC VAX running BSD 4.3 UNIX or Ultrix; SGI IRIS Turbo running GL2-W3.5 and GL2-W3.6; Convex C1 running UNIX; Amdahl 5840 running UTS; Alliant FX8 running UNIX; Sun 3/160 running UNIX (no native device driver); Stellar GS1000 running Stellex (no native device driver); and an SGI IRIS 4D running IRIX (no native device driver). Currently with version 7.0 of ARCGRAPH, the VDI library supports the following output devices: a VT100 terminal with a RETRO-GRAPHICS board installed, a VT240 using the Tektronix 4010 emulation capability, an SGI IRIS turbo using the native GL2 library, a Tektronix 4010, a Tektronix 4105, and the Tektronix 4014. ARCGRAPH version 7.0 was developed in 1988.

  7. Ion acceleration and heating by kinetic Alfvén waves associated with magnetic reconnection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liang, Ji; Lin, Yu; Johnson, Jay R.

    A previous study on the generation and signatures of kinetic Alfvén waves (KAWs) associated with magnetic reconnection in a current sheet revealed that KAWs are a common feature during reconnection [Liang et al., J. Geophys. Res.: Space Phys. 121, 6526 (2016)]. In this paper, ion acceleration and heating by the KAWs generated during magnetic reconnection are investigated with a three-dimensional (3-D) hybrid model. It is found that in the outflow region, a fraction of inflow ions are accelerated by the KAWs generated in the leading bulge region of reconnection, and their parallel velocities gradually increase up to slightly super-Alfvénic. As a result of wave-particle interactions, an accelerated ion beam forms in the direction of the anti-parallel magnetic field, in addition to the core ion population, leading to the development of non-Maxwellian velocity distributions, which include a trapped population with parallel velocities consistent with the wave speed. Ions are then heated in both the parallel and perpendicular directions. In the parallel direction, the heating results from nonlinear Landau resonance of trapped ions. In the perpendicular direction, evidence of stochastic heating by the KAWs is found during the acceleration stage, with an increase of the magnetic moment μ. The coherence between the perpendicular ion temperature T⊥ and the perpendicular electric and magnetic fields of the KAWs also provides evidence for perpendicular heating by the KAWs. The parallel and perpendicular heating of the accelerated beam occur simultaneously, leading to the development of temperature anisotropy with T⊥ > T∥. The heating rate agrees with the damping rate of the KAWs, and the heating is dominated by the accelerated ion beam. 
In the later stage, with the increase of the fraction of accelerated ions, interaction between the accelerated beam and the core population also contributes to the ion heating, ultimately leading to overlap of the beams and an overall anisotropy with T⊥ > T∥.

  8. Ion acceleration and heating by kinetic Alfvén waves associated with magnetic reconnection

    DOE PAGES

    Liang, Ji; Lin, Yu; Johnson, Jay R.; ...

    2017-09-19

    A previous study on the generation and signatures of kinetic Alfvén waves (KAWs) associated with magnetic reconnection in a current sheet revealed that KAWs are a common feature during reconnection [Liang et al., J. Geophys. Res.: Space Phys. 121, 6526 (2016)]. In this paper, ion acceleration and heating by the KAWs generated during magnetic reconnection are investigated with a three-dimensional (3-D) hybrid model. It is found that in the outflow region, a fraction of inflow ions are accelerated by the KAWs generated in the leading bulge region of reconnection, and their parallel velocities gradually increase up to slightly super-Alfvénic. As a result of wave-particle interactions, an accelerated ion beam forms in the direction of the anti-parallel magnetic field, in addition to the core ion population, leading to the development of non-Maxwellian velocity distributions, which include a trapped population with parallel velocities consistent with the wave speed. Ions are then heated in both the parallel and perpendicular directions. In the parallel direction, the heating results from nonlinear Landau resonance of trapped ions. In the perpendicular direction, evidence of stochastic heating by the KAWs is found during the acceleration stage, with an increase of the magnetic moment μ. The coherence between the perpendicular ion temperature T⊥ and the perpendicular electric and magnetic fields of the KAWs also provides evidence for perpendicular heating by the KAWs. The parallel and perpendicular heating of the accelerated beam occur simultaneously, leading to the development of temperature anisotropy with T⊥ > T∥. The heating rate agrees with the damping rate of the KAWs, and the heating is dominated by the accelerated ion beam. 
In the later stage, with the increase of the fraction of accelerated ions, interaction between the accelerated beam and the core population also contributes to the ion heating, ultimately leading to overlap of the beams and an overall anisotropy with T⊥ > T∥.

  9. 2DRMP: A suite of two-dimensional R-matrix propagation codes

    NASA Astrophysics Data System (ADS)

    Scott, N. S.; Scott, M. P.; Burke, P. G.; Stitt, T.; Faro-Maza, V.; Denis, C.; Maniopoulou, A.

    2009-12-01

    The R-matrix method has proved to be a remarkably stable, robust and efficient technique for solving the close-coupling equations that arise in electron and photon collisions with atoms, ions and molecules. During the last thirty-four years a series of related R-matrix program packages have been published periodically in CPC. These packages are primarily concerned with low-energy scattering where the incident energy is insufficient to ionise the target. In this paper we describe 2DRMP, a suite of two-dimensional R-matrix propagation programs aimed at creating virtual experiments on high performance and grid architectures to enable the study of electron scattering from H-like atoms and ions at intermediate energies. Program summary. Program title: 2DRMP. Catalogue identifier: AEEA_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEA_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 196 717. No. of bytes in distributed program, including test data, etc.: 3 819 727. Distribution format: tar.gz. Programming language: Fortran 95, MPI. Computer: Tested on CRAY XT4 [1]; IBM eServer 575 [2]; Itanium II cluster [3]. Operating system: Tested on UNICOS/lc [1]; IBM AIX [2]; Red Hat Linux Enterprise AS [3]. Has the code been vectorised or parallelised?: Yes; 16 cores were used for the small test run. Classification: 2.4. External routines: BLAS, LAPACK, PBLAS, ScaLAPACK. Subprograms used: ADAZ_v1_1. Nature of problem: 2DRMP is a suite of programs aimed at creating virtual experiments on high performance architectures to enable the study of electron scattering from H-like atoms and ions at intermediate energies. Solution method: Two-dimensional R-matrix propagation theory. The (r,r) space of the internal region is subdivided into a number of subregions. 
Local R-matrices are constructed within each subregion and used to propagate a global R-matrix, ℜ, across the internal region. On the boundary of the internal region, ℜ is transformed onto the IERM target state basis. Thus, the two-dimensional R-matrix propagation technique transforms an intractable problem into a series of tractable problems, enabling the internal region to be extended far beyond that which is possible with the standard one-sector codes. A distinctive feature of the method is that both electrons are treated identically and the R-matrix basis states are constructed to allow for both electrons to be in the continuum. The subregion size is flexible and can be adjusted to accommodate the number of cores available. Restrictions: The implementation is currently restricted to electron scattering from H-like atoms and ions. Additional comments: The programs have been designed to operate on serial computers and to exploit the distributed memory parallelism found on tightly coupled high performance clusters and supercomputers. 2DRMP has been systematically and comprehensively documented using ROBODoc [4], which is an API documentation tool that works by extracting specially formatted headers from the program source code and writing them to documentation files. Running time: The wall clock running time for the small test run using 16 cores and performed on [3] is as follows: bp (7 s); rint2 (34 s); newrd (32 s); diag (21 s); amps (11 s); prop (24 s). References: HECToR, CRAY XT4 running UNICOS/lc, http://www.hector.ac.uk/, accessed 22 July, 2009. HPCx, IBM eServer 575 running IBM AIX, http://www.hpcx.ac.uk/, accessed 22 July, 2009. HP Cluster, Itanium II cluster running Red Hat Linux Enterprise AS, Queen's University Belfast, http://www.qub.ac.uk/directorates/InformationServices/Research/HighPerformanceComputing/Services/Hardware/HPResearch/, accessed 22 July, 2009. 
Automating Software Documentation with ROBODoc, http://www.xs4all.nl/~rfsber/Robo/, accessed 22 July, 2009.

  10. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE PAGES

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...

    2018-05-05

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  11. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  12. Geometric constraints on potentially singular solutions for the 3-D Euler equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Constantin, P.; Fefferman, C.; Majda, A.J.

    1996-12-31

    We discuss necessary and sufficient conditions for the formation of finite time singularities (blow up) in the incompressible three dimensional Euler equations. The well-known result of Beale, Kato and Majda states that these equations have smooth solutions on the time interval (0, T) if, and only if, $\lim_{t \to T} \int_0^t \|\Omega(\cdot, s)\|_{L^\infty}\, ds < \infty$, where $\Omega = \nabla \times u$ is the vorticity of the fluid and $u$ is its divergence-free velocity. In this paper we prove criteria in which the direction of vorticity $\xi = \Omega/|\Omega|$ plays an important role.

  13. Parallel sequencing lives, or what makes large sequencing projects successful

    PubMed Central

    Cuartero, Yasmina; Stadhouders, Ralph; Graf, Thomas; Marti-Renom, Marc A; Beato, Miguel

    2017-01-01

    Abstract T47D_rep2 and b1913e6c1_51720e9cf were 2 Hi-C samples. They were born and processed at the same time, yet their fates were very different. The life of b1913e6c1_51720e9cf was simple and fruitful, while that of T47D_rep2 was full of accidents and sorrow. At the heart of these differences lies the fact that b1913e6c1_51720e9cf was born under a lab culture of Documentation, Automation, Traceability, and Autonomy and compliance with the FAIR Principles. Their lives are a lesson for those who wish to embark on the journey of managing high-throughput sequencing data. PMID:29048533

  14. A Network of Hydrophobic Residues Impeding Helix αC Rotation Maintains Latency of Kinase Gcn2, Which Phosphorylates the α Subunit of Translation Initiation Factor 2▿

    PubMed Central

    Gárriz, Andrés; Qiu, Hongfang; Dey, Madhusudan; Seo, Eun-Joo; Dever, Thomas E.; Hinnebusch, Alan G.

    2009-01-01

    Kinase Gcn2 is activated by amino acid starvation and downregulates translation initiation by phosphorylating the α subunit of translation initiation factor 2 (eIF2α). The Gcn2 kinase domain (KD) is inert and must be activated by tRNA binding to the adjacent regulatory domain. Previous work indicated that Saccharomyces cerevisiae Gcn2 latency results from inflexibility of the hinge connecting the N and C lobes and a partially obstructed ATP-binding site in the KD. Here, we provide strong evidence that a network of hydrophobic interactions centered on Leu-856 also promotes latency by constraining helix αC rotation in the KD in a manner relieved during amino acid starvation by tRNA binding and autophosphorylation of Thr-882 in the activation loop. Thus, we show that mutationally disrupting the hydrophobic network in various ways constitutively activates eIF2α phosphorylation in vivo and bypasses the requirement for a key tRNA binding motif (m2) and Thr-882 in Gcn2. In particular, replacing Leu-856 with any nonhydrophobic residue activates Gcn2, while substitutions with various hydrophobic residues maintain kinase latency. We further provide strong evidence that parallel, back-to-back dimerization of the KD is a step on the Gcn2 activation pathway promoted by tRNA binding and autophosphorylation. Remarkably, mutations that disrupt the L856 hydrophobic network or enhance hinge flexibility eliminate the need for the conserved salt bridge at the parallel dimer interface, implying that KD dimerization facilitates the reorientation of αC and remodeling of the active site for enhanced ATP binding and catalysis. We propose that hinge remodeling, parallel dimerization, and reorientation of αC are mutually reinforcing conformational transitions stimulated by tRNA binding and secured by the ensuing autophosphorylation of T882 for stable kinase activation. PMID:19114556

  15. A network of hydrophobic residues impeding helix alphaC rotation maintains latency of kinase Gcn2, which phosphorylates the alpha subunit of translation initiation factor 2.

    PubMed

    Gárriz, Andrés; Qiu, Hongfang; Dey, Madhusudan; Seo, Eun-Joo; Dever, Thomas E; Hinnebusch, Alan G

    2009-03-01

    Kinase Gcn2 is activated by amino acid starvation and downregulates translation initiation by phosphorylating the alpha subunit of translation initiation factor 2 (eIF2alpha). The Gcn2 kinase domain (KD) is inert and must be activated by tRNA binding to the adjacent regulatory domain. Previous work indicated that Saccharomyces cerevisiae Gcn2 latency results from inflexibility of the hinge connecting the N and C lobes and a partially obstructed ATP-binding site in the KD. Here, we provide strong evidence that a network of hydrophobic interactions centered on Leu-856 also promotes latency by constraining helix alphaC rotation in the KD in a manner relieved during amino acid starvation by tRNA binding and autophosphorylation of Thr-882 in the activation loop. Thus, we show that mutationally disrupting the hydrophobic network in various ways constitutively activates eIF2alpha phosphorylation in vivo and bypasses the requirement for a key tRNA binding motif (m2) and Thr-882 in Gcn2. In particular, replacing Leu-856 with any nonhydrophobic residue activates Gcn2, while substitutions with various hydrophobic residues maintain kinase latency. We further provide strong evidence that parallel, back-to-back dimerization of the KD is a step on the Gcn2 activation pathway promoted by tRNA binding and autophosphorylation. Remarkably, mutations that disrupt the L856 hydrophobic network or enhance hinge flexibility eliminate the need for the conserved salt bridge at the parallel dimer interface, implying that KD dimerization facilitates the reorientation of alphaC and remodeling of the active site for enhanced ATP binding and catalysis. We propose that hinge remodeling, parallel dimerization, and reorientation of alphaC are mutually reinforcing conformational transitions stimulated by tRNA binding and secured by the ensuing autophosphorylation of T882 for stable kinase activation.

  16. Simulation of a Rotorcraft in Turbulent Flows

    DTIC Science & Technology

    1991-09-01

    (Abstract garbled in source; the recoverable fragments describe wind-over-deck lineup envelopes relative to the ship's centerline and port/starboard landing lineup lines, and input requirements for low Reynolds number NACA0012 airfoil aerodynamics.)

  17. Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid

    1999-01-01

    The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves superb load balance at the cost of minimal extra overhead.

  18. Coronary Artery Anomalies and Variants: Technical Feasibility of Assessment with Coronary MR Angiography at 3 T1

    PubMed Central

    Gharib, Ahmed M.; Ho, Vincent B.; Rosing, Douglas R.; Herzka, Daniel A.; Stuber, Matthias; Arai, Andrew E.; Pettigrew, Roderic I.

    2008-01-01

    The purpose of this study was to prospectively use a whole-heart three-dimensional (3D) coronary magnetic resonance (MR) angiography technique specifically adapted for use at 3 T and a parallel imaging technique (sensitivity encoding) to evaluate coronary arterial anomalies and variants (CAAV). This HIPAA-compliant study was approved by the local institutional review board, and informed consent was obtained from all participants. Twenty-two participants (11 men, 11 women; age range, 18–62 years) were included. Ten participants were healthy volunteers, whereas 12 participants were patients suspected of having CAAV. Coronary MR angiography was performed with a 3-T MR imager. A 3D free-breathing navigator-gated and vector electrocardiographically–gated segmented k-space gradient-echo sequence with adiabatic T2 preparation pulse and parallel imaging (sensitivity encoding) was used. Whole-heart acquisitions (repetition time msec/echo time msec, 4/1.35; 20° flip angle; 1 × 1 × 2-mm acquired voxel size) lasted 10–12 minutes. Mean examination time was 41 minutes ± 14 (standard deviation). Findings included aneurysms, ectasia, arteriovenous fistulas, and anomalous origins. The 3D whole-heart acquisitions developed for use with 3 T are feasible for use in the assessment of CAAV. © RSNA, 2008 PMID:18372470

  19. Renal magnetic resonance angiography at 3.0 Tesla using a 32-element phased-array coil system and parallel imaging in 2 directions.

    PubMed

    Fenchel, Michael; Nael, Kambiz; Deshpande, Vibhas S; Finn, J Paul; Kramer, Ulrich; Miller, Stephan; Ruehm, Stefan; Laub, Gerhard

    2006-09-01

    The aim of the present study was to assess the feasibility of renal magnetic resonance angiography at 3.0 T using a phased-array coil system with 32 coil elements. Specifically, high parallel imaging factors were used for increased spatial resolution and anatomic coverage of the whole abdomen. Signal-to-noise values and the g-factor distribution of the 32-element coil were examined in phantom studies for the magnetic resonance angiography (MRA) sequence. Eleven volunteers (6 men, median age of 30.0 years) were examined on a 3.0-T MR scanner (Magnetom Trio, Siemens Medical Solutions, Malvern, PA) using a 32-element phased-array coil (prototype from In vivo Corp.). Contrast-enhanced 3D-MRA (TR 2.95 milliseconds, TE 1.12 milliseconds, flip angle 25-30 degrees, bandwidth 650 Hz/pixel) was acquired with integrated generalized autocalibrating partially parallel acquisition (GRAPPA) in both the phase- and slice-encoding directions. Images were assessed by 2 independent observers with regard to image quality, noise, and presence of artifacts. Signal-to-noise levels of 22.2 +/- 22.0 and 57.9 +/- 49.0 were measured with (GRAPPA x6) and without parallel imaging, respectively. The mean g-factor of the 32-element coil for GRAPPA with an acceleration of 3 and 2 in the phase-encoding and slice-encoding directions, respectively, was 1.61. High image quality was found in 9 of 11 volunteers (2.6 +/- 0.8) with good overall interobserver agreement (k = 0.87). Relatively low image quality with higher noise levels was encountered in 2 volunteers. MRA at 3.0 T using a 32-element phased-array coil is feasible in healthy volunteers. High diagnostic image quality and extended anatomic coverage could be achieved with application of high parallel imaging factors.
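
    The SNR cost of acceleration reported above follows the standard parallel-imaging g-factor relation, SNR_R = SNR_full / (g * sqrt(R)). A quick check with the abstract's numbers can be sketched as follows (illustrative only; exact agreement with the measured accelerated SNR is not expected, since g varies spatially and noise was estimated empirically):

```python
import math

def accelerated_snr(snr_full, R, g):
    """Standard parallel-imaging relation: SNR drops by a factor
    g * sqrt(R) at acceleration factor R (the g-factor relation)."""
    return snr_full / (g * math.sqrt(R))

# Numbers from the abstract: R = 3 x 2 = 6 (GRAPPA, phase x slice), mean g = 1.61,
# unaccelerated SNR = 57.9. The measured accelerated SNR was 22.2 +/- 22.0.
print(accelerated_snr(57.9, R=3 * 2, g=1.61))   # predicted ~14.7
```

    The large standard deviations quoted in the abstract make the order-of-magnitude agreement between the predicted and measured accelerated SNR plausible.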

  20. TRASYS - THERMAL RADIATION ANALYZER SYSTEM (CRAY VERSION WITH NASADIG)

    NASA Technical Reports Server (NTRS)

    Anderson, G. E.

    1994-01-01

    The Thermal Radiation Analyzer System, TRASYS, is a computer software system with generalized capability to solve the radiation related aspects of thermal analysis problems. TRASYS computes the total thermal radiation environment for a spacecraft in orbit. The software calculates internode radiation interchange data as well as incident and absorbed heat rate data originating from environmental radiant heat sources. TRASYS provides data of both types in a format directly usable by such thermal analyzer programs as SINDA/FLUINT (available from COSMIC, program number MSC-21528). One primary feature of TRASYS is that it allows users to write their own driver programs to organize and direct the preprocessor and processor library routines in solving specific thermal radiation problems. The preprocessor first reads and converts the user's geometry input data into the form used by the processor library routines. Then, the preprocessor accepts the user's driving logic, written in the TRASYS modified FORTRAN language. In many cases, the user has a choice of routines to solve a given problem. Users may also provide their own routines where desirable. In particular, the user may write output routines to provide for an interface between TRASYS and any thermal analyzer program using the R-C network concept. Input to the TRASYS program consists of Options and Edit data, Model data, and Logic Flow and Operations data. Options and Edit data provide for basic program control and user edit capability. The Model data describe the problem in terms of geometry and other properties. This information includes surface geometry data, documentation data, nodal data, block coordinate system data, form factor data, and flux data. Logic Flow and Operations data house the user's driver logic, including the sequence of subroutine calls and the subroutine library. 
Output from TRASYS consists of two basic types of data: internode radiation interchange data, and incident and absorbed heat rate data. The flexible structure of TRASYS allows considerable freedom in the definition and choice of solution method for a thermal radiation problem. The program's flexible structure has also allowed TRASYS to retain the same basic input structure as the authors update it in order to keep up with changing requirements. Among its other important features are the following: 1) up to 3200 node problem size capability with shadowing by intervening opaque or semi-transparent surfaces; 2) choice of diffuse, specular, or diffuse/specular radiant interchange solutions; 3) a restart capability that minimizes recomputing; 4) macroinstructions that automatically provide the executive logic for orbit generation that optimizes the use of previously completed computations; 5) a time variable geometry package that provides automatic pointing of the various parts of an articulated spacecraft and an automatic look-back feature that eliminates redundant form factor calculations; 6) capability to specify submodel names to identify sets of surfaces or components as an entity; and 7) subroutines to perform functions which save and recall the internodal and/or space form factors in subsequent steps for nodes with fixed geometry during a variable geometry run. There are two machine versions of TRASYS v27: a DEC VAX version and a Cray UNICOS version. Both versions require installation of the NASADIG library (MSC-21801 for DEC VAX or COS-10049 for CRAY), which is available from COSMIC either separately or bundled with TRASYS. The NASADIG (NASA Device Independent Graphics Library) plot package provides a pictorial representation of input geometry, orbital/orientation parameters, and heating rate output as a function of time. NASADIG supports Tektronix terminals. 
The CRAY version of TRASYS v27 is written in FORTRAN 77 for batch or interactive execution and has been implemented on CRAY X-MP and CRAY Y-MP series computers running UNICOS. The standard distribution medium for MSC-21959 (CRAY version without NASADIG) is a 1600 BPI 9-track magnetic tape in UNIX tar format. The standard distribution medium for COS-10040 (CRAY version with NASADIG) is a set of two 6250 BPI 9-track magnetic tapes in UNIX tar format. Alternate distribution media and formats are available upon request. The DEC VAX version of TRASYS v27 is written in FORTRAN 77 for batch execution (only the plotting driver program is interactive) and has been implemented on a DEC VAX 8650 computer under VMS. Since the source codes for MSC-21030 and COS-10026 are in VAX/VMS text library files and DEC Command Language files, COSMIC will only provide these programs in the following formats: MSC-21030, TRASYS (DEC VAX version without NASADIG) is available on a 1600 BPI 9-track magnetic tape in VAX BACKUP format (standard distribution medium) or in VAX BACKUP format on a TK50 tape cartridge; COS-10026, TRASYS (DEC VAX version with NASADIG), is available in VAX BACKUP format on a set of three 6250 BPI 9-track magnetic tapes (standard distribution medium) or a set of three TK50 tape cartridges in VAX BACKUP format. TRASYS was last updated in 1993.

  1. Internal computational fluid mechanics on supercomputers for aerospace propulsion systems

    NASA Technical Reports Server (NTRS)

    Andersen, Bernhard H.; Benson, Thomas J.

    1987-01-01

    The accurate calculation of three-dimensional internal flowfields for application towards aerospace propulsion systems requires computational resources available only on supercomputers. A survey is presented of three-dimensional calculations of hypersonic, transonic, and subsonic internal flowfields conducted at the Lewis Research Center. A steady state Parabolized Navier-Stokes (PNS) solution of flow in a Mach 5.0, mixed compression inlet, a Navier-Stokes solution of flow in the vicinity of a terminal shock, and a PNS solution of flow in a diffusing S-bend with vortex generators are presented and discussed. All of these calculations were performed on either the NAS Cray-2 or the Lewis Research Center Cray XMP.

  2. New tools using the hardware performance monitor to help users tune programs on the Cray X-MP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Engert, D.E.; Rudsinski, L.; Doak, J.

    1991-09-25

    The performance of a Cray system is highly dependent on the tuning techniques used by individuals on their codes. Many of our users were not taking advantage of the tuning tools that allow them to monitor their own programs by using the Hardware Performance Monitor (HPM). We therefore modified UNICOS to collect HPM data for all processes and to report Mflop ratings based on users, programs, and time used. Our tuning efforts are now being focused on the users and programs that have the best potential for performance improvements. These modifications and some of the more striking performance improvements are described.

  3. A vectorized Lanczos eigensolver for high-performance computers

    NASA Technical Reports Server (NTRS)

    Bostic, Susan W.

    1990-01-01

    The computational strategies used to implement a Lanczos-based-method eigensolver on the latest generation of supercomputers are described. Several examples of structural vibration and buckling problems are presented that show the effects of using optimization techniques to increase the vectorization of the computational steps. The data storage and access schemes and the tools and strategies that best exploit the computer resources are presented. The method is implemented on the Convex C220, the Cray 2, and the Cray Y-MP computers. Results show that very good computation rates are achieved for the most computationally intensive steps of the Lanczos algorithm and that the Lanczos algorithm is many times faster than other methods extensively used in the past.
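
    The core of the Lanczos procedure described above, building a small tridiagonal matrix whose Ritz values approximate the extremal eigenvalues, can be sketched in a few lines of NumPy. This is an illustrative sketch with full reorthogonalization, not the vectorized Cray implementation of the paper:

```python
import numpy as np

def lanczos(A, k, seed=0):
    """k steps of Lanczos on symmetric A; returns the k x k tridiagonal T
    whose eigenvalues (Ritz values) approximate A's extremal eigenvalues."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k - 1)
    q = rng.standard_normal(n)
    Q[:, 0] = q / np.linalg.norm(q)
    for j in range(k):
        w = A @ Q[:, j]                           # dominant cost: one mat-vec per step
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)  # full reorthogonalization for stability
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

# Extremal Ritz values converge long before k reaches the matrix dimension.
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 200))
A = (M + M.T) / 2
ritz = np.linalg.eigvalsh(lanczos(A, 40))
exact = np.linalg.eigvalsh(A)
print(abs(ritz[-1] - exact[-1]))   # largest-eigenvalue error: tiny at k = 40
```

    The matrix-vector product dominates the cost, which is why vectorizing that step (as the paper does on the Convex and Cray machines) pays off so strongly.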

  4. Optimization of large matrix calculations for execution on the Cray X-MP vector supercomputer

    NASA Technical Reports Server (NTRS)

    Hornfeck, William A.

    1988-01-01

    A considerable volume of large computational codes was developed for NASA over the past twenty-five years. These codes embody algorithms developed for machines of an earlier generation. With the emergence of the vector supercomputer as a viable, commercially available machine, an opportunity exists to evaluate optimization strategies to improve the efficiency of existing software. This opportunity arises primarily from architectural differences between the latest generation of large-scale machines and the earlier, mostly uniprocessor, machines. A software package being used by NASA to perform computations on large matrices is described, and a strategy for its conversion to the Cray X-MP vector supercomputer is presented.

  5. Parallel Higher-order Finite Element Method for Accurate Field Computations in Wakefield and PIC Simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Candel, A.; Kabel, A.; Lee, L.

    Over the past years, SLAC's Advanced Computations Department (ACD), under SciDAC sponsorship, has developed a suite of 3D (2D) parallel higher-order finite element (FE) codes, T3P (T2P) and Pic3P (Pic2P), aimed at accurate, large-scale simulation of wakefields and particle-field interactions in radio-frequency (RF) cavities of complex shape. The codes are built on the FE infrastructure that supports SLAC's frequency domain codes, Omega3P and S3P, to utilize conformal tetrahedral (triangular) meshes, higher-order basis functions and quadratic geometry approximation. For time integration, they adopt an unconditionally stable implicit scheme. Pic3P (Pic2P) extends T3P (T2P) to treat charged-particle dynamics self-consistently using the PIC (particle-in-cell) approach, the first such implementation on a conformal, unstructured grid using Whitney basis functions. Examples from applications to the International Linear Collider (ILC), Positron Electron Project-II (PEP-II), Linac Coherent Light Source (LCLS) and other accelerators will be presented to compare the accuracy and computational efficiency of these codes versus their counterparts using structured grids.

  6. Radar Data Processing Using a Distributed Computational System

    DTIC Science & Technology

    1992-06-01


  7. Origins and nature of non-Fickian transport through fractures

    NASA Astrophysics Data System (ADS)

    Wang, L.; Cardenas, M. B.

    2014-12-01

    Non-Fickian transport occurs across all scales within fractured and porous geological media. Fundamental understanding and appropriate characterization of non-Fickian transport through fractures is critical for understanding and predicting the fate of solutes and other scalars. We use both analytical and numerical modeling, including direct numerical simulation and particle tracking random walk, to investigate the origin of non-Fickian transport through both homogeneous and heterogeneous fractures. For the simple homogeneous fracture case, i.e., parallel plates, we theoretically derived a formula for the dynamic longitudinal dispersion (D) within Poiseuille flow. Using the closed-form expression for the theoretical D, we quantified the time (T) and length (L) scales separating preasymptotic and asymptotic dispersive transport, with T and L proportional to the aperture (b) of the parallel plates to second and fourth orders, respectively. As for heterogeneous fractures, the fracture roughness and correlation length are closely associated with T and L, and thus indicate the origin of non-Fickian transport. Modeling solute transport through 2D rough-walled fractures with a continuous time random walk with a truncated power law shows that the degree of deviation from Fickian transport is proportional to fracture roughness. The estimated L for 2D rough-walled fractures is significantly longer than that derived from the formula for Poiseuille flow with equivalent b. Moreover, we artificially generated normally distributed 3D fractures with fixed correlation length but different fracture dimensions. Solute transport through the 3D fractures was modeled with a particle tracking random walk algorithm. We found that transport transitions from non-Fickian to Fickian with increasing fracture dimensions, where the estimated L for the studied 3D fractures is related to the correlation length.
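
    The scaling of the preasymptotic time scale with aperture can be made concrete with the classical Taylor-Aris result for plane Poiseuille flow. This is a sketch of the standard asymptotic limit, not the paper's own dynamic formula; b here is taken as the half-aperture and Pe = U*b/Dm:

```python
def taylor_aris_plates(U, b, Dm):
    """Classical asymptotic Taylor-Aris dispersion for plane Poiseuille flow:
    D_eff = Dm * (1 + (2/105) * Pe^2) with Pe = U*b/Dm (b = half-aperture,
    U = mean velocity, Dm = molecular diffusivity). The transverse diffusion
    time b^2/Dm sets the preasymptotic (setup) time scale."""
    Pe = U * b / Dm
    D_eff = Dm * (1.0 + (2.0 / 105.0) * Pe ** 2)
    T_mix = b ** 2 / Dm
    return D_eff, T_mix

# Doubling the aperture quadruples the setup time (T grows as b^2):
D1, T1 = taylor_aris_plates(U=1e-3, b=1e-4, Dm=1e-9)
D2, T2 = taylor_aris_plates(U=1e-3, b=2e-4, Dm=1e-9)
print(T2 / T1)   # → 4.0
```

    Before T_mix has elapsed, solute has not sampled the full velocity profile across the aperture, which is precisely the preasymptotic regime in which transport is non-Fickian.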

  8. Multilayer MgB{sub 2} superconducting quantum interference filter magnetometers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Galan, Elias; Melbourne, Thomas; Davidson, Bruce A.

    2016-04-25

    We report two types of all-MgB{sub 2} superconductive quantum interference filter (SQIF) magnetometers that can measure absolute magnetic fields with high sensitivity. In one configuration, the SQIFs were made of 20 multilayer nonplanar all-MgB{sub 2} superconducting quantum interference devices (SQUIDs) connected in parallel with loop areas ranging in size from 0.4 to 3.6 μm{sup 2}. These devices are sensitive to magnetic fields parallel to the substrate and show a single antipeak from 3 to 16 K with a maximum transfer function of ∼16 V/T at 3 K and a field noise of ∼110 pT/Hz{sup 1/2} above 100 Hz at 10 K. In a second configuration, the SQIFs were made with 16 planar SQUIDs connected in parallel with loop areas ranging in size from 4 μm{sup 2} to 25 μm{sup 2} and are sensitive to magnetic fields perpendicular to the substrate. The planar SQIF shows a single antipeak from 10 to 22 K with a maximum transfer function of 7800 V/T at 10 K and a field noise of ∼70 pT/Hz{sup 1/2} above 100 Hz at 20 K.

  9. Parallel sequencing lives, or what makes large sequencing projects successful.

    PubMed

    Quilez, Javier; Vidal, Enrique; Dily, François Le; Serra, François; Cuartero, Yasmina; Stadhouders, Ralph; Graf, Thomas; Marti-Renom, Marc A; Beato, Miguel; Filion, Guillaume

    2017-11-01

    T47D_rep2 and b1913e6c1_51720e9cf were 2 Hi-C samples. They were born and processed at the same time, yet their fates were very different. The life of b1913e6c1_51720e9cf was simple and fruitful, while that of T47D_rep2 was full of accidents and sorrow. At the heart of these differences lies the fact that b1913e6c1_51720e9cf was born under a lab culture of Documentation, Automation, Traceability, and Autonomy and compliance with the FAIR Principles. Their lives are a lesson for those who wish to embark on the journey of managing high-throughput sequencing data. © The Author 2017. Published by Oxford University Press.

  10. Venus Gravity Handbook

    NASA Technical Reports Server (NTRS)

    Konopliv, Alexander S.; Sjogren, William L.

    1996-01-01

    This report documents the Venus gravity methods and results to date (model MGNP90LSAAP). It is called a handbook in that it contains many plots (such as geometry and orbit behavior) that are useful in evaluating the tracking data. We discuss the models that are used in processing the Doppler data and the estimation method for determining the gravity field. With Pioneer Venus Orbiter and Magellan tracking data, the Venus gravity field was determined complete to degree and order 90 with the use of the JPL Cray T3D supercomputer. The gravity field shows unprecedentedly high correlation with topography and resolution of features down to 200 km. In the procedure for solving the gravity field, other information is gained as well; for example, we discuss results for the Venus ephemeris, Love number, pole orientation of Venus, and atmospheric densities. Of significance is the Love number solution, which indicates a liquid core for Venus. The ephemeris of Venus is determined to an accuracy of 0.02 mm/s (tens of meters in position), and the rotation period to 243.0194 +/- 0.0002 days.

  11. Collaborative Research: Simulation of Beam-Electron Cloud Interactions in Circular Accelerators Using Plasma Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Katsouleas, Thomas; Decyk, Viktor

    Final Report for grant DE-FG02-06ER54888, "Simulation of Beam-Electron Cloud Interactions in Circular Accelerators Using Plasma Models"; Viktor K. Decyk, University of California, Los Angeles, Los Angeles, CA 90095-1547. The primary goal of this collaborative proposal was to modify the code QuickPIC and apply it to study the long-time stability of beam propagation in low density electron clouds present in circular accelerators. The UCLA contribution to this collaborative proposal was in supporting the development of the pipelining scheme for the QuickPIC code, which extended the parallel scaling of this code by two orders of magnitude. The USC work described here was the PhD research of Ms. Bing Feng, lead author of reference [2] below, who performed the research at USC under the guidance of the PI Tom Katsouleas and in collaboration with Dr. Decyk. The QuickPIC code [1] is a multi-scale Particle-in-Cell (PIC) code. The outer 3D code contains a beam which propagates through a long region of plasma and evolves slowly. The plasma response to this beam is modeled by slices of a 2D plasma code. This plasma response is then fed back to the beam code, and the process repeats. The pipelining is based on the observation that once the beam has passed a 2D slice, that slice's response can be fed back to the beam immediately, without waiting for the beam to pass all the other slices. Thus independent blocks of 2D slices from different time steps can be running simultaneously. The major difficulty arose when particles at the edges needed to communicate with other blocks. Two versions of the pipelining scheme were developed: one for the full quasi-static code and the other for the basic quasi-static code used by this e-cloud proposal. Details of the pipelining scheme were published in [2]. The new version of QuickPIC was able to run with more than 1,000 processors, and was successfully applied in modeling e-clouds by our collaborators in this proposal [3-8]. 
Jean-Luc Vay at Lawrence Berkeley National Lab later implemented a similar basic quasistatic scheme including pipelining in the code WARP [9] and found good to very good quantitative agreement between the two codes in modeling e-clouds. References [1] C. Huang, V. K. Decyk, C. Ren, M. Zhou, W. Lu, W. B. Mori, J. H. Cooley, T. M. Antonsen, Jr., and T. Katsouleas, "QUICKPIC: A highly efficient particle-in-cell code for modeling wakefield acceleration in plasmas," J. Computational Phys. 217, 658 (2006). [2] B. Feng, C. Huang, V. K. Decyk, W. B. Mori, P. Muggli, and T. Katsouleas, "Enhancing parallel quasi-static particle-in-cell simulations with a pipelining algorithm," J. Computational Phys, 228, 5430 (2009). [3] C. Huang, V. K. Decyk, M. Zhou, W. Lu, W. B. Mori, J. H. Cooley, T. M. Antonsen, Jr., and B. Feng, T. Katsouleas, J. Vieira, and L. O. Silva, "QUICKPIC: A highly efficient fully parallelized PIC code for plasma-based acceleration," Proc. of the SciDAC 2006 Conf., Denver, Colorado, June, 2006 [Journal of Physics: Conference Series, W. M. Tang, Editor, vol. 46, Institute of Physics, Bristol and Philadelphia, 2006], p. 190. [4] B. Feng, C. Huang, V. Decyk, W. B. Mori, T. Katsouleas, P. Muggli, "Enhancing Plasma Wakefield and E-cloud Simulation Performance Using a Pipelining Algorithm," Proc. 12th Workshop on Advanced Accelerator Concepts, Lake Geneva, WI, July, 2006, p. 201 [AIP Conf. Proceedings, vol. 877, Melville, NY, 2006]. [5] B. Feng, P. Muggli, T. Katsouleas, V. Decyk, C. Huang, and W. Mori, "Long Time Electron Cloud Instability Simulation Using QuickPIC with Pipelining Algorithm," Proc. of the 2007 Particle Accelerator Conference, Albuquerque, NM, June, 2007, p. 3615. [6] B. Feng, C. Huang, V. Decyk, W. B. Mori, G. H. Hoffstaetter, P. Muggli, T. Katsouleas, "Simulation of Electron Cloud Effects on Electron Beam at ERL with Pipelined QuickPIC," Proc. 13th Workshop on Advanced Accelerator Concepts, Santa Cruz, CA, July-August, 2008, p. 340 [AIP Conf. 
Proceedings, vol. 1086, Melville, NY, 2008]. [7] B. Feng, C. Huang, V. K. Decyk, W. B. Mori, P. Muggli, and T. Katsouleas, "Enhancing parallel quasi-static particle-in-cell simulations with a pipelining algorithm," J. Computational Phys. 228, 5430 (2009). [8] C. Huang, W. An, V. K. Decyk, W. Lu, W. B. Mori, F. S. Tsung, M. Tzoufras, S. Morshed, T. Antonsen, B. Feng, T. Katsouleas, R. A. Fonseca, S. F. Martins, J. Vieira, L. O. Silva, E. Esarey, C. G. R. Geddes, W. P. Leemans, E. Cormier-Michel, J.-L. Vay, D. L. Bruhwiler, B. Cowan, J. R. Cary, and K. Paul, "Recent results and future challenges for large scale particle-in-cell simulations of plasma-based accelerator concepts," Proc. of the SciDAC 2009 Conf., San Diego, CA, June, 2009 [Journal of Physics: Conference Series, vol. 180, Institute of Physics, Bristol and Philadelphia, 2009], p. 012005. [9] J.-L. Vay, C. M. Celata, M. A. Furman, G. Penn, M. Venturini, D. P. Grote, and K. G. Sonnad, "Update on Electron-Cloud Simulations Using the Package WARP-POSINST," Proc. of the 2009 Particle Accelerator Conference PAC09, Vancouver, Canada, June, 2009, paper FR5RFP078.
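
    The payoff of the pipelining observation can be illustrated with a toy makespan model (purely illustrative; the actual algorithm and its edge-particle communication are detailed in Feng et al. [2]). Ordered blocks of 2D slices form a classic pipeline, so T beam steps over B blocks cost roughly T + B - 1 block-times instead of T * B:

```python
def makespan(n_steps, n_blocks, pipelined):
    """Block-times to advance the beam n_steps when the 2D slices are split
    into n_blocks ordered blocks (toy model: each block costs 1 unit/step)."""
    if not pipelined:
        return n_steps * n_blocks      # every beam step waits for all blocks
    return n_steps + n_blocks - 1      # pipeline fill + steady state + drain

print(makespan(100, 8, pipelined=False))   # → 800
print(makespan(100, 8, pipelined=True))    # → 107
```

    For long runs (n_steps much larger than n_blocks) the pipelined cost approaches n_steps regardless of the block count, which is consistent with the two orders of magnitude of extra parallel scaling reported above.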

  12. [Signal loss in magnetic resonance imaging caused by intraoral anchored dental magnetic materials].

    PubMed

    Blankenstein, F H; Truong, B; Thomas, A; Schröder, R J; Naumann, M

    2006-08-01

    To measure the maximum extent of the signal loss areas in the center of the susceptibility artifacts generated by ferromagnetic dental magnet attachments, using three different sequences in 1.5 and 3.0 Tesla MRI. Five different pieces of standard dental magnet attachments with volumes of 6.5 to 31.4 mm(3) were used: a NdFeB magnet with an open magnetic field, a NdFeB magnet with a closed magnetic field, a SmCo magnet with an open magnetic field, a stainless steel keeper (AUM-20) and a PdCo piece. The attachments were placed between two cylindrical phantoms and examined in 1.5 and 3.0 Tesla MRI using gradient echo and T1- and T2-weighted spin echoes. We measured the maximum extent of the generated signal loss areas parallel and perpendicular to the direction of B0. In gradient echoes the artifacts were substantially larger and symmetrically arranged around the object. The areas with total signal loss were mushroom-like, with a maximum extent of 7.4 to 9.7 cm parallel to the direction of B0 and 6.7 to 7.4 cm perpendicular to B0. In spin echoes the signal loss areas were considerably smaller, but not centered. The maximum values ranged between 4.9 and 7.2 cm (parallel to B0) and 3.6 and 7.0 cm (perpendicular to B0). The different ferromagnetic attachments had no clinically relevant influence on the signal loss in either 1.5 T or 3.0 T MRI. Ferromagnetic materials used in dentistry are not intraorally standardized. To ensure that the area of interest is not affected by the described artifacts, the following maximum extents of the signal loss area should be assumed: a radius of up to 7 cm in 1.5 and 3.0 T MRI for T1 and T2 sequences, and a radius of up to 10 cm for T2* sequences. To decide whether magnet attachments have to be removed before MR imaging, physicians should consider both the intact retention of the keepers and the safety distance between the ferromagnetic objects and the area of interest.

  13. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. 
In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

  14. Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq.

    PubMed

    Macaulay, Iain C; Teng, Mabel J; Haerty, Wilfried; Kumar, Parveen; Ponting, Chris P; Voet, Thierry

    2016-11-01

    Parallel sequencing of a single cell's genome and transcriptome provides a powerful tool for dissecting genetic variation and its relationship with gene expression. Here we present a detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells. We provide step-by-step instructions for the isolation and lysis of single cells; the physical separation of polyA(+) mRNA from genomic DNA using a modified oligo-dT bead capture and the respective whole-transcriptome and whole-genome amplifications; and library preparation and sequence analyses of these amplification products. The method allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell. G&T-seq differs from other currently available methods for parallel DNA and RNA sequencing from single cells, as it involves physical separation of the DNA and RNA and does not require bespoke microfluidics platforms. The process can be implemented manually or through automation. When performed manually, paired genome and transcriptome sequencing libraries from eight single cells can be produced in ∼3 d by researchers experienced in molecular laboratory work. For users with experience in the programming and operation of liquid-handling robots, paired DNA and RNA libraries from 96 single cells can be produced in the same time frame. Sequence analysis and integration of single-cell G&T-seq DNA and RNA data requires a high level of bioinformatics expertise and familiarity with a wide range of informatics tools.

  15. Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram

    2013-04-09

    A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J. Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing units (GPUs), is presented. We discuss the performance of this algorithm for the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated using the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.

  16. A note on sample size calculation for mean comparisons based on noncentral t-statistics.

    PubMed

    Chow, Shein-Chung; Shao, Jun; Wang, Hansheng

    2002-11-01

    One-sample and two-sample t-tests are commonly used in analyzing data from clinical trials in comparing mean responses from two drug products. During the planning stage of a clinical study, a crucial step is the sample size calculation, i.e., the determination of the number of subjects (patients) needed to achieve a desired power (e.g., 80%) for detecting a clinically meaningful difference in the mean drug responses. Based on noncentral t-distributions, we derive some sample size calculation formulas for testing equality, testing therapeutic noninferiority/superiority, and testing therapeutic equivalence, under the popular one-sample design, two-sample parallel design, and two-sample crossover design. Useful tables are constructed and some examples are given for illustration.
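
    The sample-size determination described above can be computed directly from the noncentral t distribution. A minimal sketch for the two-sided two-sample parallel design, assuming SciPy is available (the function name and search bounds are illustrative, not from the paper):

```python
from scipy.stats import nct, t as t_dist

def n_per_group(delta, sigma, alpha=0.05, power=0.80, n_max=10_000):
    """Smallest n per group so a two-sided two-sample t-test detects a
    mean difference `delta` (common SD `sigma`) with the target power,
    with power evaluated exactly via the noncentral t distribution."""
    for n in range(2, n_max):
        df = 2 * n - 2
        nc = delta / (sigma * (2.0 / n) ** 0.5)   # noncentrality parameter
        t_crit = t_dist.ppf(1.0 - alpha / 2.0, df)
        # power = P(|T| > t_crit) when T ~ noncentral t(df, nc)
        p = 1.0 - nct.cdf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)
        if p >= power:
            return n
    raise ValueError("no n <= n_max reaches the requested power")
```

    For a standardized effect size delta/sigma = 0.5 at 80% power and two-sided alpha = 0.05, this search reproduces the familiar 64 subjects per group.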

  17. Free-breathing dynamic contrast-enhanced MRI for assessment of pulmonary lesions using golden-angle radial sparse parallel imaging.

    PubMed

    Chen, Lihua; Liu, Daihong; Zhang, Jiuquan; Xie, Bing; Zhou, Xiaoyue; Grimm, Robert; Huang, Xuequan; Wang, Jian; Feng, Li

    2018-02-13

    Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) has been shown to be a promising technique for assessing lung lesions. However, DCE-MRI often suffers from motion artifacts and insufficient imaging speed. Therefore, highly accelerated free-breathing DCE-MRI is of clinical interest for lung exams. To test the performance of rapid free-breathing DCE-MRI for simultaneous qualitative and quantitative assessment of pulmonary lesions using Golden-angle RAdial Sparse Parallel (GRASP) imaging. Prospective. Twenty-six patients (17 males, mean age = 55.1 ± 14.4) with known pulmonary lesions. 3T MR scanner; a prototype fat-saturated, T1-weighted stack-of-stars golden-angle radial sequence for data acquisition and a Cartesian breath-hold volumetric-interpolated examination (BH-VIBE) sequence for comparison. After a dual-mode GRASP reconstruction, one with 3-second temporal resolution (3s-GRASP) and the other with 15-second temporal resolution (15s-GRASP), all GRASP and BH-VIBE images were pooled together for blind assessment by two experienced radiologists, who independently scored the overall image quality, lesion delineation, overall artifact level, and diagnostic confidence of each case. Perfusion analysis was performed for the 3s-GRASP images using a Tofts model to generate the volume transfer coefficient (Ktrans) and interstitial volume (Ve). Nonparametric paired two-tailed Wilcoxon signed-rank test; Cohen's kappa; unpaired Student's t-test. 15s-GRASP achieved image quality comparable to conventional BH-VIBE (P > 0.05), except for a higher overall artifact level in the precontrast phase (P = 0.018). The Ktrans and Ve in inflammation were higher than those in malignant lesions (Ktrans: 0.78 ± 0.52 min-1 vs. 0.37 ± 0.22 min-1, P = 0.020; Ve: 0.36 ± 0.16 vs. 0.26 ± 0.1, P = 0.177). The Ktrans and Ve in malignant lesions were also higher than those in benign lesions (Ktrans: 0.37 ± 0.22 min-1 vs. 0.04 ± 0.04 min-1, P = 0.001; Ve: 0.26 ± 0.12 vs. 0.10 ± 0.00, P = 0.063). This feasibility study demonstrated the performance of high spatiotemporal resolution free-breathing DCE-MRI of the lung using GRASP for qualitative and quantitative assessment of pulmonary lesions. Level of Evidence: 2. Technical Efficacy: Stage 1. J. Magn. Reson. Imaging 2018. © 2018 International Society for Magnetic Resonance in Medicine.
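
    The Tofts perfusion analysis above estimates Ktrans and Ve by fitting the standard model Ct(t) = Ktrans · ∫ Cp(u) · exp(−(Ktrans/Ve)(t − u)) du. A minimal forward-model sketch, assuming NumPy and a uniform time grid (the function name and test values are illustrative, not from the study):

```python
import numpy as np

def tofts_ct(t, cp, ktrans, ve):
    """Tissue concentration under the standard Tofts model:
    Ct(t) = Ktrans * integral_0^t Cp(u) * exp(-(Ktrans/ve)*(t-u)) du,
    evaluated by discrete convolution on a uniform time grid t (min)."""
    dt = t[1] - t[0]
    kernel = np.exp(-(ktrans / ve) * t)           # impulse response e^{-kep * t}
    ct = ktrans * np.convolve(cp, kernel)[: len(t)] * dt
    return ct
```

    With a constant arterial input Cp, Ct plateaus at Ve · Cp, which is a handy sanity check when fitting the model to measured curves.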

  18. Gyrokinetic simulations of turbulent transport in a ring dipole plasma.

    PubMed

    Kobayashi, Sumire; Rogers, Barrett N; Dorland, William

    2009-07-31

    Gyrokinetic flux-tube simulations of turbulent transport due to small-scale entropy modes are presented in a ring-dipole magnetic geometry relevant to the Columbia-MIT levitated dipole experiment (LDX) [J. Kesner, Plasma Phys. J. 23, 742 (1997)]. Far from the current ring, the dipolar magnetic field leads to strong parallel variations, while close to the ring the system becomes nearly uniform along circular magnetic field lines. The transport in these two limits is found to be quantitatively similar given an appropriate normalization based on the local outboard parameters. The transport increases strongly with the density gradient and, for small η = Ln/LT < 1, Ti ≈ Te, and typical LDX parameters, can reach large levels. Consistent with linear theory, temperature gradients are stabilizing and, for Ti ≈ Te, can completely cut off the transport when η ≳ 0.6.

  19. Phonon coupling to dynamic short-range polar order in a relaxor ferroelectric near the morphotropic phase boundary

    DOE PAGES

    John A. Schneeloch; Xu, Zhijun; Winn, B.; ...

    2015-12-28

    We report neutron inelastic scattering experiments on single-crystal PbMg1/3Nb2/3O3 doped with 32% PbTiO3, a relaxor ferroelectric that lies close to the morphotropic phase boundary. When cooled under an electric field E ∥ [001] into tetragonal and monoclinic phases, the scattering cross section from transverse acoustic (TA) phonons polarized parallel to E weakens and shifts to higher energy relative to that under zero-field-cooled conditions. Likewise, the scattering cross section from transverse optic (TO) phonons polarized parallel to E weakens for energy transfers 4 ≤ ℏω ≤ 9 meV. However, TA and TO phonons polarized perpendicular to E show no change. This anisotropic field response is similar to that of the diffuse scattering cross section, which, as previously reported, is suppressed when polarized parallel to E but not when polarized perpendicular to E. Lastly, our findings suggest that the lattice dynamics and dynamic short-range polar correlations that give rise to the diffuse scattering are coupled.

  20. Approach and separation of quantum vortices with balanced cores

    NASA Astrophysics Data System (ADS)

    Kerr, Robert M.; Rorai, C.; Skipper, J.; Sreenivasan, K. R.

    2014-11-01

    Using two innovations, smooth but different scaling laws for the reconnection of pairs of initially orthogonal and anti-parallel quantum vortices are obtained using the three-dimensional Gross-Pitaevskii equations. For the anti-parallel case, the scaling laws just before and after reconnection obey the dimensional δ ∼ |t − tr|^{1/2} prediction, with temporal symmetry about the reconnection time tr and physical-space symmetry about xr, the mid-point between the vortices, with extensions forming the edges of an equilateral pyramid. For all of the orthogonal cases, δin ∼ (tr − t)^{1/3} before reconnection and δout ∼ (t − tr)^{2/3} after, which are respectively slower and faster than the dimensional prediction. In these cases, the reconnection takes place in a plane defined by the directions of the curvature and vorticity.
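
    Scaling exponents like the 1/2, 1/3, and 2/3 above are typically extracted from simulation data by a least-squares fit in log-log coordinates of the separation δ against |t − tr|. A generic sketch, assuming NumPy (this is not the authors' analysis code):

```python
import numpy as np

def fit_scaling_exponent(t, delta, t_r):
    """Least-squares slope of log(delta) versus log|t - t_r|,
    i.e. the exponent alpha in delta ~ |t - t_r|**alpha."""
    x = np.log(np.abs(t - t_r))
    y = np.log(delta)
    alpha, _ = np.polyfit(x, y, 1)   # slope is the fitted exponent
    return alpha
```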

  1. Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): A new version of the program

    NASA Astrophysics Data System (ADS)

    Schunck, N.; Dobaczewski, J.; McDonnell, J.; Satuła, W.; Sheikh, J. A.; Staszczak, A.; Stoitsov, M.; Toivanen, P.

    2012-01-01

    We describe the new version (v2.49t) of the code HFODD which solves the nuclear Skyrme-Hartree-Fock (HF) or Skyrme-Hartree-Fock-Bogolyubov (HFB) problem by using the Cartesian deformed harmonic-oscillator basis. In the new version, we have implemented the following physics features: (i) the isospin mixing and projection, (ii) the finite-temperature formalism for the HFB and HF + BCS methods, (iii) the Lipkin translational energy correction method, (iv) the calculation of the shell correction. A number of specific numerical methods have also been implemented in order to deal with large-scale multi-constraint calculations and hardware limitations: (i) the two-basis method for the HFB method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint calculations, (iii) the linear constraint method based on the approximation of the RPA matrix for multi-constraint calculations, (iv) an interface with the axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or HFB matrix elements instead of the HF fields. Special care has been paid to using the code on massively parallel leadership-class computers. For this purpose, the following features are now available with this version: (i) the Message Passing Interface (MPI) framework, (ii) scalable input data routines, (iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the HFB matrix in the simplex-breaking case using the ScaLAPACK library. Finally, several minor errors of the previously published version were corrected. New version program summaryProgram title:HFODD (v2.49t) Catalogue identifier: ADFL_v3_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADFL_v3_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU General Public Licence v3 No. of lines in distributed program, including test data, etc.: 190 614 No. 
of bytes in distributed program, including test data, etc.: 985 898 Distribution format: tar.gz Programming language: FORTRAN-90 Computer: Intel Pentium-III, Intel Xeon, AMD-Athlon, AMD-Opteron, Cray XT4, Cray XT5 Operating system: UNIX, LINUX, Windows XP Has the code been vectorized or parallelized?: Yes, parallelized using MPI RAM: 10 Mwords Word size: The code is written in single-precision for the use on a 64-bit processor. The compiler option -r8 or +autodblpad (or equivalent) has to be used to promote all real and complex single-precision floating-point items to double precision when the code is used on a 32-bit machine. Classification: 17.22 Catalogue identifier of previous version: ADFL_v2_2 Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 2361 External routines: The user must have access to the NAGLIB subroutine f02axe, or LAPACK subroutines zhpev, zhpevx, zheevr, or zheevd, which diagonalize complex hermitian matrices, the LAPACK subroutines dgetri and dgetrf which invert arbitrary real matrices, the LAPACK subroutines dsyevd, dsytrf and dsytri which compute eigenvalues and eigenfunctions of real symmetric matrices, the LINPACK subroutines zgedi and zgeco, which invert arbitrary complex matrices and calculate determinants, the BLAS routines dcopy, dscal, dgemm and dgemv for double-precision linear algebra and zcopy, zdscal, zgemm and zgemv for complex linear algebra, or provide another set of subroutines that can perform such tasks. The BLAS and LAPACK subroutines can be obtained from the Netlib Repository at the University of Tennessee, Knoxville: http://netlib2.cs.utk.edu/. Does the new version supersede the previous version?: Yes Nature of problem: The nuclear mean field and an analysis of its symmetries in realistic cases are the main ingredients of a description of nuclear states. 
Within the Local Density Approximation, or for a zero-range velocity-dependent Skyrme interaction, the nuclear mean field is local and velocity dependent. The locality allows for an effective and fast solution of the self-consistent Hartree-Fock equations, even for heavy nuclei, and for various nucleonic ( n-particle- n-hole) configurations, deformations, excitation energies, or angular momenta. Similarly, Local Density Approximation in the particle-particle channel, which is equivalent to using a zero-range interaction, allows for a simple implementation of pairing effects within the Hartree-Fock-Bogolyubov method. Solution method: The program uses the Cartesian harmonic oscillator basis to expand single-particle or single-quasiparticle wave functions of neutrons and protons interacting by means of the Skyrme effective interaction and zero-range pairing interaction. The expansion coefficients are determined by the iterative diagonalization of the mean-field Hamiltonians or Routhians which depend non-linearly on the local neutron and proton densities. Suitable constraints are used to obtain states corresponding to a given configuration, deformation or angular momentum. The method of solution has been presented in: [J. Dobaczewski, J. Dudek, Comput. Phys. Commun. 102 (1997) 166]. Reasons for new version: Version 2.49s of HFODD provides a number of new options such as the isospin mixing and projection of the Skyrme functional, the finite-temperature HF and HFB formalism and optimized methods to perform multi-constrained calculations. It is also the first version of HFODD to contain threading and parallel capabilities. Summary of revisions: Isospin mixing and projection of the HF states has been implemented. The finite-temperature formalism for the HFB equations has been implemented. The Lipkin translational energy correction method has been implemented. Calculation of the shell correction has been implemented. 
The two-basis method for the solution to the HFB equations has been implemented. The Augmented Lagrangian Method (ALM) for calculations with multiple constraints has been implemented. The linear constraint method based on the cranking approximation of the RPA matrix has been implemented. An interface between HFODD and the axially-symmetric and parity-conserving code HFBTHO has been implemented. The mixing of the matrix elements of the HF or HFB matrix has been implemented. A parallel interface using the MPI library has been implemented. A scalable model for reading input data has been implemented. OpenMP pragmas have been implemented in three subroutines. The diagonalization of the HFB matrix in the simplex-breaking case has been parallelized using the ScaLAPACK library. Several minor errors of the previously published version were corrected. Running time: In serial mode, running 6 HFB iterations for 152Dy for conserved parity and signature symmetries in a full spherical basis of N=14 shells takes approximately 8 min on an AMD Opteron processor at 2.6 GHz, assuming standard BLAS and LAPACK libraries. As a rule of thumb, runtime for HFB calculations for parity and signature conserved symmetries roughly increases as N, where N is the number of full HO shells. Using custom-built optimized BLAS and LAPACK libraries (such as in the ATLAS implementation) can bring down the execution time by 60%. Using the threaded version of the code with 12 threads and threaded BLAS libraries can bring an additional factor-2 speed-up, so that the same 6 HFB iterations take of the order of 2 min 30 s.
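
    The solution method above (expansion coefficients obtained by iterative diagonalization of a mean-field Hamiltonian that depends nonlinearly on the local densities, stabilized by mixing between iterations) can be illustrated with a toy self-consistent loop. A minimal sketch, assuming NumPy, with a schematic density-dependent Hamiltonian H[ρ] = H0 + g·diag(ρ) standing in for the Skyrme mean field (none of these names or values come from HFODD itself):

```python
import numpy as np

def scf(h0, g, n_occ, mix=0.5, tol=1e-10, max_iter=500):
    """Toy self-consistent loop: diagonalize H[rho] = h0 + g*diag(rho),
    fill the n_occ lowest orbitals, rebuild the local density rho, and
    linearly mix densities between iterations (analogous to the field
    or matrix-element mixing used to stabilize HF/HFB convergence)."""
    dim = h0.shape[0]
    rho = np.full(dim, n_occ / dim)            # uniform starting density
    for _ in range(max_iter):
        h = h0 + g * np.diag(rho)
        eps, c = np.linalg.eigh(h)             # dense Hermitian diagonalization
        rho_new = (np.abs(c[:, :n_occ]) ** 2).sum(axis=1)
        if np.max(np.abs(rho_new - rho)) < tol:
            return eps, rho_new
        rho = (1.0 - mix) * rho + mix * rho_new   # linear mixing step
    raise RuntimeError("SCF did not converge")
```

    Because the occupied eigenvectors are orthonormal, the converged density always integrates to the particle number, a useful invariant to monitor in real calculations as well.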

  2. 31 CFR 500.201 - Transactions involving designated foreign countries or their nationals; effective date.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... noted after the country or area. Schedule (1) North Korea, i.e., Korea north of the 38th parallel of north latitude: December 17, 1950. (2) Cambodia: April 17, 1975. (3) North Vietnam; i.e., Vietnam north of the 17th parallel of north latitude: May 5, 1964. (4) South Vietnam, i.e., Vietnam south of the...

  3. Dimensional synthesis of a 3-DOF parallel manipulator with full circle rotation

    NASA Astrophysics Data System (ADS)

    Ni, Yanbing; Wu, Nan; Zhong, Xueyong; Zhang, Biao

    2015-07-01

    Parallel robots are widely used in the academic and industrial fields. In spite of the numerous achievements in the design and dimensional synthesis of low-mobility parallel robots, few research efforts are directed towards asymmetric 3-DOF parallel robots whose end-effector can realize 2 translational and 1 rotational (2T1R) motion. In order to develop a manipulator with the capability of full circle rotation to enlarge the workspace, a new 2T1R parallel mechanism is proposed. The modeling approach and kinematic analysis of this proposed mechanism are investigated. Using the method of vector analysis, the inverse kinematic equations are established. This is followed by a rigorous proof that this mechanism attains an annular workspace through its circular rotation and 2-dimensional translations. Taking the first-order perturbation of the kinematic equations, the error Jacobian matrix, which represents the mapping relationship between the error sources of geometric parameters and the end-effector position errors, is derived. With consideration of the constraint conditions of pressure angles and feasible workspace, the dimensional synthesis is conducted with the goal of minimizing the global comprehensive performance index. The dimensional parameters that give the mechanism optimal error mapping and kinematic performance are obtained through the optimization algorithm. All these research achievements lay the foundation for building prototypes of this kind of parallel robot.

  4. 3.0 Tesla high spatial resolution contrast-enhanced magnetic resonance angiography (CE-MRA) of the pulmonary circulation: initial experience with a 32-channel phased array coil using a high relaxivity contrast agent.

    PubMed

    Nael, Kambiz; Fenchel, Michael; Krishnam, Mayil; Finn, J Paul; Laub, Gerhard; Ruehm, Stefan G

    2007-06-01

    To evaluate the technical feasibility of high spatial resolution contrast-enhanced magnetic resonance angiography (CE-MRA) with highly accelerated parallel acquisition at 3.0 T using a 32-channel phased array coil, and a high relaxivity contrast agent. Ten adult healthy volunteers (5 men, 5 women, aged 21-66 years) underwent high spatial resolution CE-MRA of the pulmonary circulation. Imaging was performed at 3 T using a 32-channel phased array coil. After intravenous injection of 1 mL of gadobenate dimeglumine (Gd-BOPTA) at 1.5 mL/s, a timing bolus was used to measure the transit time from the arm vein to the main pulmonary artery. Subsequently, following intravenous injection of 0.1 mmol/kg of Gd-BOPTA at the same rate, isotropic high spatial resolution (1 x 1 x 1 mm3) CE-MRA data sets of the entire pulmonary circulation were acquired using a fast gradient-recalled echo sequence (TR/TE 3/1.2 milliseconds, FA 18 degrees) and highly accelerated parallel acquisition (GRAPPA x 6) during a 20-second breath hold. The presence of artifact, noise, and image quality of the pulmonary arterial segments were evaluated independently by 2 radiologists. Phantom measurements were performed to assess the signal-to-noise ratio (SNR). Statistical analysis of data was performed by using the Wilcoxon rank sum test and 2-sample Student t test. The interobserver variability was tested by kappa coefficient. All studies were of diagnostic quality as determined by both observers. The pulmonary arteries were routinely identified up to fifth-order branches, with definition in the diagnostic range and excellent interobserver agreement (kappa = 0.84, 95% confidence interval 0.77-0.90). Phantom measurements showed significantly lower SNR (P < 0.01) using GRAPPA (17.3 +/- 18.8) compared with measurements without parallel acquisition (58 +/- 49.4). 
The described 3 T CE-MRA protocol, in addition to the high T1 relaxivity of Gd-BOPTA, provides sufficient SNR to support highly accelerated parallel acquisition (GRAPPA x 6), resulting in acquisition of isotropic (1 x 1 x 1 mm3) voxels over the entire pulmonary circulation in 20 seconds.
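
    The phantom SNR comparison above can be read through the standard parallel-imaging relation SNR_accel = SNR_full / (g·√R), where R is the acceleration factor and g the coil-geometry factor. A small sketch that backs out the implied effective g-factor (the relation is generic parallel-imaging theory, not a formula from this paper):

```python
import math

def g_factor(snr_full, snr_accel, r):
    """Effective g-factor implied by SNR_accel = SNR_full / (g * sqrt(R))."""
    return snr_full / (snr_accel * math.sqrt(r))
```

    With the mean phantom values reported above (58 without acceleration vs. 17.3 with GRAPPA at R = 6), this yields an effective g of roughly 1.4.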

  5. Behaviour at high pressure of Rb7NaGa8Si12O40·3H2O (a zeolite with EDI topology): a combined experimental-computational study

    NASA Astrophysics Data System (ADS)

    Gatta, G. D.; Tabacchi, G.; Fois, E.; Lee, Y.

    2016-03-01

    The high-pressure behaviour and the P-induced structural evolution of a synthetic zeolite Rb7NaGa8Si12O40·3H2O (with edingtonite-type structure) were investigated both by in situ synchrotron powder diffraction (with a diamond anvil cell and the methanol:ethanol:water = 16:3:1 mixture as pressure-transmitting fluid) up to 3.27 GPa and by ab initio first-principles computational modelling. No evidence of phase transition or penetration of P-fluid molecules was observed within the P-range investigated. The isothermal equation of state was determined; V0 and KT0 refined with a second-order Birch-Murnaghan equation of state are V0 = 1311.3(2) Å3 and KT0 = 29.8(7) GPa. The main deformation mechanism (at the atomic scale) in response to the applied pressure is represented by the cooperative rotation of the secondary building units (SBU) about their chain axis (i.e. [001]). The direct consequence of SBU anti-rotation on the zeolitic channels parallel to [001] is the increase in pore ellipticity with pressure, in response to the extension of the major axis and to the contraction of the minor axis of the elliptical channel parallel to [001]. The effect of the applied pressure on the bonding configuration of the extra-framework content is only secondary. A comparison between the P-induced main deformation mechanisms observed in Rb7NaGa8Si12O40·3H2O and those previously found in natural fibrous zeolites is made.
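
    The second-order Birch-Murnaghan equation of state used above fixes K0' = 4 and reads P(V) = (3/2)·K0·[(V0/V)^(7/3) − (V0/V)^(5/3)]. A minimal sketch that evaluates this form (taking the refined V0 and KT0 quoted above as example inputs; this is not code from the study):

```python
def bm2_pressure(v, v0, k0):
    """Second-order Birch-Murnaghan EOS (K0' fixed at 4):
    P = 1.5 * K0 * ((V0/V)**(7/3) - (V0/V)**(5/3)),
    with P in the units of K0."""
    x = v0 / v
    return 1.5 * k0 * (x ** (7.0 / 3.0) - x ** (5.0 / 3.0))
```

    By construction P(V0) = 0 and the slope at V0 reproduces the bulk modulus, K = −V·dP/dV = K0; inverting P(V) for the compressed volume at a given pressure is a simple one-dimensional root find.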

  6. Left ventricular epicardial activation increases transmural dispersion of repolarization in healthy, long QT, and dilated cardiomyopathy dogs.

    PubMed

    Bai, Rong; Lü, Jiagao; Pu, Jun; Liu, Nian; Zhou, Qiang; Ruan, Yanfei; Niu, Huiyan; Zhang, Cuntai; Wang, Lin; Kam, Ruth

    2005-10-01

    Benefits of cardiac resynchronization therapy (CRT) are well established. However, less is understood concerning its effects on myocardial repolarization and the potential proarrhythmic risk. Healthy dogs (n = 8) were compared to a long QT interval (LQT) model (n = 8, induced by cesium chloride, CsCl) and a dilated cardiomyopathy with congestive heart failure (DCM-CHF, induced by rapid ventricular pacing, n = 5). Monophasic action potential (MAP) recordings were obtained from the subendocardium, midmyocardium, subepicardium, and the transmural dispersion of repolarization (TDR) was calculated. The QT interval and the interval from the peak to the end of the T wave (T(p-e)) were measured. All these characteristics were compared during left ventricular epicardial (LV-Epi), right ventricular endocardial (RV-Endo), and biventricular (Bi-V) pacing. In healthy dogs, TDR prolonged to 37.54 ms for Bi-V pacing and to 47.16 ms for LV-Epi pacing as compared to 26.75 ms for RV-Endo pacing (P < 0.001), which was parallel to an augmentation in T(p-e) interval (Bi-V pacing, 64.29 ms; LV-Epi pacing, 57.89 ms; RV-Endo pacing, 50.29 ms; P < 0.01). During CsCl exposure, Bi-V and LV-Epi pacing prolonged MAPD, TDR, and T(p-e) interval as compared to RV-Endo pacing. The midmyocardial MAPD (276.30 ms vs 257.35 ms, P < 0.0001) and TDR (33.80 ms vs 27.58 ms, P=0.002) were significantly longer in DCM-CHF dogs than those in healthy dogs. LV-Epi and Bi-V pacing further prolonged the MAPD and TDR in this model. LV-Epi and Bi-V pacing result in prolongation of ventricular repolarization time, and increase of TDR accounted for a parallel augmentation of the T(p-e) interval, which provides evidence that T(p-e) interval accurately represents TDR. These effects are magnified in the LQT and DCM-CHF canine models in addition to their intrinsic transmural heterogeneity in the intact heart. 
This mechanism may contribute to the development of malignant ventricular arrhythmias, such as torsades de pointes (TdP) in congestive heart failure (CHF) patients treated with CRT.

  7. SPLAT: Using Spectral Indices to Identify and Characterize Ultracool Stars, Brown Dwarfs and Exoplanets in Deep Surveys and as Companions to Nearby Stars

    NASA Astrophysics Data System (ADS)

    Aganze, Christian; Burgasser, Adam J.; Martin, Eduardo; Konopacky, Quinn; Masters, Daniel C.

    2016-06-01

    The majority of ultracool dwarf stars and brown dwarfs currently known were identified in wide-field red optical and infrared surveys, enabling measures of the local, typically isolated, population in a relatively shallow (<100 pc radius) volume. Constraining the properties of the wider Galactic population (scale height, radial distribution, Population II sources), and close brown dwarf and exoplanet companions to nearby stars, requires specialized instrumentation, such as high-contrast, coronagraphic spectrometers (e.g., Gemini/GPI, VLT/Sphere, Project 1640); and deep spectral surveys (e.g., HST/WFC3 parallel fields, Euclid). We present a set of quantitative methodologies to identify and robustly characterize sources for these specific populations, based on templates and tools developed as part of the SpeX Prism Library Analysis Toolkit. In particular, we define and characterize specifically tuned sets of spectral indices that optimize selection of cool dwarfs and distinguish rare populations (subdwarfs, young planetary-mass objects) based on low-resolution, limited-wavelength-coverage spectral data; and present a template-matching classification method for these instruments. We apply these techniques to HST/WFC3 parallel fields data in the WISPS and HST-3D programs, where our spectral index set allows high completeness and low contamination for searches of late M, L and T dwarfs to distances out to ~3 kpc. The material presented here is based on work supported by the National Aeronautics and Space Administration under Grant No. NNX15AI75G.

  8. Software and Systems Test Track Architecture and Concept Definition

    DTIC Science & Technology

    2007-05-01

    [Extraction fragment of a software-inventory table listing packages, vendors, and versions across ASC and ERDC systems (e.g., Flex 2.5.31, Fluent 6.2.26, Compaq/Cray/SGI Fortran 77/90 compilers, FTA, GAMESS); the table layout does not survive text extraction.]

  9. Maternal T-Cell Engraftment Interferes With Human Leukocyte Antigen Typing in Severe Combined Immunodeficiency.

    PubMed

    Liu, Chang; Duffy, Brian; Bednarski, Jeffrey J; Calhoun, Cecelia; Lay, Lindsay; Rundblad, Barrett; Payton, Jacqueline E; Mohanakumar, Thalachallour

    2016-02-01

    To report the laboratory investigation of a case of severe combined immunodeficiency (SCID) with maternal T-cell engraftment, focusing on the interference of human leukocyte antigen (HLA) typing by blood chimerism. HLA typing was performed with three different methods, including sequence-specific primer (SSP), sequence-specific oligonucleotide, and Sanger sequencing on peripheral blood leukocytes and buccal cells, from a 3-month-old boy and peripheral blood leukocytes from his parents. Short tandem repeat (STR) testing was performed in parallel. HLA typing of the patient's peripheral blood leukocytes using the SSP method demonstrated three different alleles for each of the HLA-B and HLA-C loci, with both maternal alleles present at each locus. Typing results from the patient's buccal cells showed a normal pattern of inheritance for paternal and maternal haplotypes. STR enrichment testing of the patient's CD3+ T lymphocytes and CD15+ myeloid cells confirmed maternal T-cell engraftment, while the myeloid cell profile matched the patient's buccal cells. Maternal T-cell engraftment may interfere with HLA typing in patients with SCID. Selection of the appropriate typing methods and specimens is critical for accurate HLA typing and immunologic assessment before allogeneic hematopoietic stem cell transplantation. © American Society for Clinical Pathology, 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  10. Lifshitz transition with interactions in high magnetic fields: Application to CeIn3

    NASA Astrophysics Data System (ADS)

    Schlottmann, Pedro

    2012-02-01

    The Néel ordered state of CeIn3 is suppressed by a magnetic field of 61 T at ambient pressure. There is a second transition at ∼45 T, which has been associated with a Lifshitz transition [1,2]. Skin depth measurements [2] indicate that the transition is discontinuous as T → 0. Motivated by this transition, we study the effects of Landau quantization and interaction among carriers on a Lifshitz transition. The Landau quantization leads to quasi-one-dimensional behavior for the direction parallel to the field. Repulsive Coulomb interactions give rise to a gas of strongly coupled carriers [3]. The density correlation function is calculated for a special long-ranged potential [4]. It is concluded that in CeIn3 a pocket is being emptied as a function of field in a discontinuous fashion in the ground state. This discontinuity is gradually smeared by the temperature [4], in agreement with the skin depth experiments [2]. [1] S.E. Sebastian et al., PNAS 106, 7741 (2009). [2] K.M. Purcell et al., Phys. Rev. B 79, 214428 (2009). [3] P. Schlottmann and R. Gerhardts, Z. Phys. B 34, 363 (1979). [4] P. Schlottmann, Phys. Rev. B 83, 115133 (2011); J. Appl. Phys., in print.

  11. The NASA Neutron Star Grand Challenge: The coalescences of Neutron Star Binary System

    NASA Astrophysics Data System (ADS)

    Suen, Wai-Mo

    1998-04-01

    NASA funded a Grand Challenge Project (9/1996-1999) for the development of a multi-purpose numerical treatment of relativistic astrophysics and gravitational wave astronomy. The coalescence of binary neutron stars was chosen as the model problem for the code development. The institutions involved are Argonne National Laboratory, Lawrence Livermore National Laboratory, the Max Planck Institute at Potsdam, Stony Brook, the University of Illinois, and Washington University. We have recently succeeded in constructing a highly optimized parallel code that is capable of solving the full Einstein equations coupled with relativistic hydrodynamics, running at over 50 GFLOPS on a T3E (the second milestone of the project). We are presently working on head-on collisions of two neutron stars and on the inclusion of realistic equations of state into the code. The code will be released to the relativity and astrophysics community in April of 1998. With the full dynamics of the spacetime, relativistic hydrodynamics, and microphysics combined into a unified 3D code for the first time, many interesting large-scale calculations in general relativistic astrophysics can now be carried out on massively parallel computers.

  12. Evaluation of dual-source parallel RF excitation for diffusion-weighted whole-body MR imaging with background body signal suppression at 3.0 T.

    PubMed

    Mürtz, Petra; Kaschner, Marius; Träber, Frank; Kukuk, Guido M; Büdenbender, Sarah M; Skowasch, Dirk; Gieseke, Jürgen; Schild, Hans H; Willinek, Winfried A

    2012-11-01

    To evaluate the use of dual-source parallel RF excitation (TX) for diffusion-weighted whole-body MRI with background body signal suppression (DWIBS) at 3.0 T. Forty consecutive patients were examined on a clinical 3.0-T MRI system using a diffusion-weighted (DW) spin-echo echo-planar imaging sequence with a combination of short TI inversion recovery and slice-selective gradient reversal fat suppression. DWIBS of the neck (n=5), thorax (n=8), abdomen (n=6) and pelvis (n=21) was performed both with TX (2:56 min) and with standard single-source RF excitation (4:37 min). The quality of DW images and reconstructed inverted maximum intensity projections was visually judged by two readers (blinded to acquisition technique). Signal homogeneity and fat suppression were scored as "improved", "equal", "worse" or "ambiguous". Moreover, the apparent diffusion coefficient (ADC) values were measured in muscles, urinary bladder, lymph nodes and lesions. By the use of TX, signal homogeneity was "improved" in 25/40 and "equal" in 15/40 cases. Fat suppression was "improved" in 17/40 and "equal" in 23/40 cases. These improvements were statistically significant (p<0.001, Wilcoxon signed-rank test). In five patients, fluid-related dielectric shading was present, which improved remarkably. The ADC values did not significantly differ for the two RF excitation methods (p=0.630 over all data, pairwise Student's t-test). Dual-source parallel RF excitation improved image quality of DWIBS at 3.0 T with respect to signal homogeneity and fat suppression, reduced scan time by approximately one-third, and did not influence the measured ADC values. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
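
    The ADC values compared above come from the monoexponential diffusion-weighted signal model S(b) = S0·exp(−b·ADC). A minimal two-point sketch (the b-values are illustrative defaults, not necessarily those used in the study):

```python
import math

def adc_two_point(s_low, s_high, b_low=0.0, b_high=1000.0):
    """Two-point ADC estimate (mm^2/s) from DW signals at b-values
    b_low and b_high (s/mm^2), assuming monoexponential decay
    S(b) = S0 * exp(-b * ADC)."""
    return math.log(s_low / s_high) / (b_high - b_low)
```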

  13. A PIC-MCC code RFdinity1d for simulation of discharge initiation by ICRF antenna

    NASA Astrophysics Data System (ADS)

    Tripský, M.; Wauters, T.; Lyssoivan, A.; Bobkov, V.; Schneider, P. A.; Stepanov, I.; Douai, D.; Van Eester, D.; Noterdaeme, J.-M.; Van Schoor, M.; ASDEX Upgrade Team; EUROfusion MST1 Team

    2017-12-01

Discharges produced and sustained by ion cyclotron range of frequencies (ICRF) waves in the absence of plasma current will be used on ITER for (ion cyclotron) wall conditioning (ICWC, Te = 3-5 eV, ne < 10^18 m^-3). In this paper, we present the 1D particle-in-cell Monte Carlo collision (PIC-MCC) code RFdinity1d for the study of the breakdown phase of ICRF discharges and its dependence on the discharge parameters: (i) the antenna input power P_i, (ii) the RF frequency f, (iii) the shape of the electric field and (iv) the neutral gas pressure p_H2. The code traces the motion of both electrons and ions in a narrow bundle of magnetic field lines close to the antenna straps. The charged particles are accelerated parallel to the magnetic field B_T by two electric fields: (i) the vacuum RF field of the ICRF antenna E_z^RF and (ii) the electrostatic field E_z^P determined by the solution of Poisson's equation. The electron density in the simulations grows exponentially, n_e(t) ∝ exp(ν_ion t). The ionization rate varies with increasing electron density as different mechanisms become important. At low electron density (n_e < 10^11 m^-3, |E_z^RF| ≫ |E_z^P|), the charged particles are affected solely by the antenna RF field E_z^RF. At higher densities, when the electrostatic field E_z^P becomes comparable to the antenna RF field E_z^RF, the ionization frequency reaches its maximum, and plasma oscillations propagating toroidally away from the antenna are observed. The simulated energy distributions of ions and electrons at n_e ∼ 10^15 m^-3 correspond to a power-law Kappa energy distribution, which was also observed in NPA measurements during ICWC experiments at ASDEX Upgrade.
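The parallel acceleration the code models can be illustrated with a minimal leapfrog push along z. All names and numbers below (field amplitude, frequency, wavenumber, time step) are illustrative assumptions, not RFdinity1d's actual parameters, and the electrostatic field is left as an optional callback rather than a self-consistent Poisson solve:

```python
import numpy as np

QE, ME = -1.602e-19, 9.109e-31   # electron charge [C] and mass [kg]

def push_electrons(z, v, t, dt, e_rf, f_rf, k_rf, e_es=None):
    """One leapfrog step: accelerate by the vacuum RF field
    E_z^RF(z, t) = e_rf * cos(k_rf*z - 2*pi*f_rf*t), plus an optional
    electrostatic field e_es(z), then advance the positions."""
    e_total = e_rf * np.cos(k_rf * z - 2.0 * np.pi * f_rf * t)
    if e_es is not None:
        e_total = e_total + e_es(z)
    v = v + (QE / ME) * e_total * dt   # velocity update (parallel to B)
    z = z + v * dt                     # position update along field line
    return z, v

# 1000 electrons in a 30 MHz RF field (ICRF order of magnitude)
rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, 1000)
v = np.zeros(1000)
dt = 1e-11
for step in range(1000):
    z, v = push_electrons(z, v, step * dt, dt,
                          e_rf=1e3, f_rf=30e6, k_rf=2.0 * np.pi)
```

Electrons at different z see different RF phases, so the push alone already produces a spread in parallel velocities; in the full code this spread, together with Monte Carlo ionization collisions, drives the exponential density growth described above.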

  14. 21. 1934 aerial of Tempe Canal, Sections 19 and 30 ...

    Library of Congress Historic Buildings Survey, Historic Engineering Record, Historic Landscapes Survey

    21. 1934 aerial of Tempe Canal, Sections 19 and 30 (T1N R5E) and Sections 24 and 25 (T1N R4E) (top of page is north). The main canal enters the picture at upper right and curves out of picture at lower right. The Hayden Branch (thin dark line) runs from top of picture to the southwest, then curves to the west. The Western Branch enters picture running parallel to main canal, then angles off to southwest. Photographer: Unknown, 1934. Source: SRP Cartographic Drafting - Tempe Canal, South Side Salt River in Tempe, Mesa & Phoenix, Tempe, Maricopa County, AZ

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mubarak, Misbah; Ross, Robert B.

This technical report describes the experiments performed to validate the MPI performance measurements reported by the CODES dragonfly network simulation against measurements on the Theta Cray XC system at the Argonne Leadership Computing Facility (ALCF).

  16. Extensions and improvements on XTRAN3S

    NASA Technical Reports Server (NTRS)

    Borland, C. J.

    1989-01-01

Improvements to the XTRAN3S computer program are summarized. Work on this code, which performs steady and unsteady aerodynamic and aeroelastic analysis in the transonic flow regime, has concentrated on the following areas: (1) maintenance of the XTRAN3S code, including correction of errors, enhancement of operational capability, and installation on the Cray X-MP system; (2) extension of the vectorization concepts in XTRAN3S to additional areas of the code for improved execution speed; (3) modification of the XTRAN3S algorithm for improved numerical stability in swept, tapered wing cases and improved computational efficiency; and (4) extension of the wing-only version of XTRAN3S to include pylon and nacelle or external store capability.

  17. Calibrationless parallel magnetic resonance imaging: a joint sparsity model.

    PubMed

    Majumdar, Angshul; Chaudhury, Kunal Narayan; Ward, Rabab

    2013-12-05

State-of-the-art parallel MRI techniques either explicitly or implicitly require certain parameters to be estimated, e.g., the sensitivity maps for SENSE and SMASH, and the interpolation weights for GRAPPA and SPIRiT. All of these techniques are therefore sensitive to the calibration (parameter estimation) stage. In this work, we propose a parallel MRI technique that does not require any calibration but yields reconstruction results on par with (or even better than) state-of-the-art methods in parallel MRI. Our proposed method requires solving non-convex analysis-prior and synthesis-prior joint-sparsity problems, and this work also derives the algorithms for solving them. Experimental validation was carried out on two datasets: an eight-channel brain dataset and an eight-channel Shepp-Logan phantom. Two sampling methods were used: variable-density random sampling and non-Cartesian radial sampling, with acceleration factors of 4 for the brain data and 6 for the phantom. The reconstruction results were evaluated quantitatively by the normalised mean squared error (NMSE) between the reconstructed image and the original, and qualitatively from the reconstructed images themselves. We compared our work with four state-of-the-art parallel imaging techniques: two calibrated methods (CS SENSE and L1SPIRiT) and two calibration-free methods (Distributed CS and SAKE). Our method yields better reconstruction results than all of them.
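The joint-sparsity prior at the heart of such methods penalizes each coefficient across all coils together, so the coil images share a common support. A minimal sketch of the corresponding row-wise (group) soft-thresholding step and of the NMSE metric, under the assumption of an l2,1-type penalty (the paper's exact formulation may differ):

```python
import numpy as np

def row_soft_threshold(x: np.ndarray, lam: float) -> np.ndarray:
    """Proximal operator of lam * ||X||_{2,1}: each row (one transform
    coefficient across all coil channels) is shrunk jointly, which
    enforces the shared support that joint sparsity assumes."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return x * scale

def nmse(recon: np.ndarray, ref: np.ndarray) -> float:
    """Normalised mean squared error used to score reconstructions."""
    return float(np.sum(np.abs(recon - ref) ** 2) /
                 np.sum(np.abs(ref) ** 2))

# A row with small joint energy is zeroed; a strong row is only shrunk.
x = np.array([[0.1, 0.1],    # joint l2 norm ~0.14 < lam -> zeroed
              [3.0, 4.0]])   # joint l2 norm 5 -> scaled by (1 - 1/5)
y = row_soft_threshold(x, lam=1.0)
print(y)  # [[0, 0], [2.4, 3.2]]
```

Iterating this shrinkage inside a gradient step on the data-consistency term gives a basic proximal-gradient solver for the synthesis-prior problem; the analysis-prior variant applies the same shrinkage to transform-domain coefficients.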

  18. Cheetah: A Framework for Scalable Hierarchical Collective Operations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Graham, Richard L; Gorentla Venkata, Manjunath; Ladd, Joshua S

    2011-01-01

Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous, with increasing node and core-per-node counts, and a growing number of data-access mechanisms of varying characteristics are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. It is flexible, with run-time hierarchy specification and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy, reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5 and on a small InfiniBand-based cluster. At 49,152 processes our barrier implementation outperforms the optimized native implementation by 75%; 32-byte and one-megabyte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.
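The two-level idea can be illustrated with a toy allreduce: reduce within each node first (cheap, via shared memory), then across node leaders (over the network), then fan the result back out within each node. The function names and grouping below are illustrative sketches, not Cheetah's actual API:

```python
from collections import defaultdict

def hierarchical_allreduce(values, node_of_rank):
    """Simulated two-level sum-allreduce: values[r] is rank r's
    contribution and node_of_rank[r] the node hosting rank r."""
    # Level 1: on-node reduction, gathered at each node's leader.
    node_sums = defaultdict(float)
    for rank, v in enumerate(values):
        node_sums[node_of_rank[rank]] += v
    # Level 2: inter-node reduction among node leaders only.
    total = sum(node_sums.values())
    # Fan-out: each leader broadcasts the result back within its node.
    return [total] * len(values)

node_of_rank = [r // 4 for r in range(8)]      # 2 nodes x 4 cores each
result = hierarchical_allreduce([1.0] * 8, node_of_rank)
print(result)  # [8.0, 8.0, ..., 8.0]
```

The payoff of this structure is that only one message per node crosses the network at level 2, which is why the hierarchy scales better than a flat algorithm as core-per-node counts grow.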

  19. Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.

In this paper, we study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver on performance. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with its performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread counts. We also developed a bounds-and-bottlenecks performance model of the solver, which we used to guide the optimization effort, and carried out performance tuning in the solver's large parameter space. As a result, significant speedups were obtained on both machines.
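A two-level cycle of the kind studied here can be sketched on a 1D Poisson model problem: damped-Jacobi smoothing, a Galerkin coarse-grid correction, then post-smoothing. The coarse system is solved directly below, standing in for the PCG or HSS low-rank coarse solvers compared in the paper; all parameters are illustrative:

```python
import numpy as np

def poisson_1d(n):
    """Standard 1D Poisson matrix (tridiagonal with 2 on the diagonal
    and -1 off-diagonal), n interior points."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def two_level_cycle(a, b, x, n_smooth=3, omega=2.0 / 3.0):
    """One two-level cycle: pre-smooth, coarse-grid correction with a
    direct coarse solve, post-smooth."""
    d = np.diag(a)
    for _ in range(n_smooth):                       # pre-smoothing
        x = x + omega * (b - a @ x) / d
    n = len(b)
    nc = (n - 1) // 2
    # Linear-interpolation prolongation P; full-weighting restriction.
    p = np.zeros((n, nc))
    for j in range(nc):
        i = 2 * j + 1                               # coarse points at odd indices
        p[i - 1, j], p[i, j], p[i + 1, j] = 0.5, 1.0, 0.5
    r = p.T / 2.0
    ac = r @ a @ p                                  # Galerkin coarse operator
    x = x + p @ np.linalg.solve(ac, r @ (b - a @ x))
    for _ in range(n_smooth):                       # post-smoothing
        x = x + omega * (b - a @ x) / d
    return x

n = 63
a, b = poisson_1d(n), np.ones(n)
x = np.zeros(n)
for _ in range(20):
    x = two_level_cycle(a, b, x)
residual = np.linalg.norm(b - a @ x)  # shrinks by a large factor per cycle
```

Because the coarse solve sits on the critical path of every cycle, its cost and thread-scalability dominate overall performance once smoothing is cheap, which is the trade-off the paper's coarse-solver comparison quantifies.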

  20. A model for pion-pion scattering in large-N QCD

    NASA Astrophysics Data System (ADS)

    Veneziano, G.; Yankielowicz, S.; Onofri, E.

    2017-04-01

Following up on recent work by Caron-Huot et al., we consider a generalization of the old Lovelace-Shapiro model as a toy model for ππ scattering satisfying (most of) the properties expected to hold in ('t Hooft's) large-N limit of massless QCD. In particular, the model has asymptotically linear and parallel Regge trajectories at positive t, a positive leading Regge intercept α_0 < 1, and an effective bending of the trajectories in the negative-t region producing a fixed branch point at J = 0 for t < t_0 < 0. Fixed (physical) angle scattering can be tuned to match the power-like behavior (including logarithmic corrections) predicted by perturbative QCD: A(s, t) ∼ s^(-β) (log s)^(-γ) F(θ). Tree-level unitarity (i.e. positivity of residues for all values of s and J) imposes strong constraints on the allowed region in the (α_0, β, γ) parameter space, which nicely includes a physically interesting region around α_0 = 0.5, β = 2 and γ = 3. The full consistency of the model would require an extension to multi-pion processes, a program we do not undertake in this paper.
