fast parallel code: Topics by Science.gov

Sample records for fast parallel code

Development of Fast Algorithms Using Recursion, Nesting and Iterations for Computational Electromagnetics

NASA Technical Reports Server (NTRS)

Chew, W. C.; Song, J. M.; Lu, C. C.; Weedon, W. H.

1995-01-01

In the first phase of our work, we have concentrated on laying the foundation to develop fast algorithms, including the use of recursive structure like the recursive aggregate interaction matrix algorithm (RAIMA), the nested equivalence principle algorithm (NEPAL), the ray-propagation fast multipole algorithm (RPFMA), and the multi-level fast multipole algorithm (MLFMA). We have also investigated the use of curvilinear patches to build a basic method of moments code where these acceleration techniques can be used later. In the second phase, which is mainly reported on here, we have concentrated on implementing three-dimensional NEPAL on a massively parallel machine, the Connection Machine CM-5, and have been able to obtain some 3D scattering results. In order to understand the parallelization of codes on the Connection Machine, we have also studied the parallelization of 3D finite-difference time-domain (FDTD) code with PML material absorbing boundary condition (ABC). We found that simple algorithms like the FDTD with material ABC can be parallelized very well allowing us to solve within a minute a problem of over a million nodes. In addition, we have studied the use of the fast multipole method and the ray-propagation fast multipole algorithm to expedite matrix-vector multiplication in a conjugate-gradient solution to integral equations of scattering. We find that these methods are faster than LU decomposition for one incident angle, but are slower than LU decomposition when many incident angles are needed as in the monostatic RCS calculations.
Matrix-Free Polynomial-Based Nonlinear Least Squares Optimized Preconditioning and its Application to Discontinuous Galerkin Discretizations of the Euler Equations

DTIC Science & Technology

2015-06-01

cient parallel code for applying the operator. Our method constructs a polynomial preconditioner using a nonlinear least squares (NLLS) algorithm. We show...apply the underlying operator. Such a preconditioner can be very attractive in scenarios where one has a highly efficient parallel code for applying...repeatedly solve a large system of linear equations where one has an extremely fast parallel code for applying an underlying fixed linear operator
Parallel-vector computation for linear structural analysis and non-linear unconstrained optimization problems

NASA Technical Reports Server (NTRS)

Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.

1991-01-01

Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.
One-step trinary signed-digit arithmetic using an efficient encoding scheme

NASA Astrophysics Data System (ADS)

Salim, W. Y.; Fyath, R. S.; Ali, S. A.; Alam, Mohammad S.

2000-11-01

The trinary signed-digit (TSD) number system is of interest for ultra fast optoelectronic computing systems since it permits parallel carry-free addition and borrow-free subtraction of two arbitrary length numbers in constant time. In this paper, a simple coding scheme is proposed to encode the decimal number directly into the TSD form. The coding scheme enables one to perform parallel one-step TSD arithmetic operation. The proposed coding scheme uses only a 5-combination coding table instead of the 625-combination table reported recently for recoded TSD arithmetic technique.
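As a minimal illustration of the number system mentioned in this abstract, the sketch below assumes the usual TSD convention of radix 3 with digits drawn from {-2, -1, 0, 1, 2}. The function names `to_tsd` and `from_tsd` are hypothetical, and the code does not reproduce the paper's 5-combination coding table or its one-step adder. It only shows how an integer maps to one valid TSD digit string and back.

```python
def to_tsd(n):
    """Return one valid TSD digit list (most significant digit first) for n.

    Assumption: radix 3, digits in {-2, -1, 0, 1, 2}. The representation is
    redundant, so this is only one of several valid encodings.
    """
    sign = -1 if n < 0 else 1
    n = abs(n)
    digits = []
    while n:
        n, remainder = divmod(n, 3)
        digits.append(sign * remainder)
    return digits[::-1] or [0]


def from_tsd(digits):
    """Evaluate a TSD digit list (most significant digit first)."""
    value = 0
    for d in digits:
        if d not in (-2, -1, 0, 1, 2):
            raise ValueError("TSD digits must lie in {-2, ..., 2}")
        value = value * 3 + d
    return value
```

Because the digit set is redundant, a value can have more than one encoding: `[1, -2]` and `[0, 1]` both evaluate to 1. That redundancy is what carry-free addition schemes exploit.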
Research in Computational Aeroscience Applications Implemented on Advanced Parallel Computing Systems

NASA Technical Reports Server (NTRS)

Wigton, Larry

1996-01-01

Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
A Parallel Implementation of Multilevel Recursive Spectral Bisection for Application to Adaptive Unstructured Meshes. Chapter 1

NASA Technical Reports Server (NTRS)

Barnard, Stephen T.; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

The design of a parallel implementation of multilevel recursive spectral bisection is described. The goal is to implement a code that is fast enough to enable dynamic repartitioning of adaptive meshes.
fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

PubMed

Hung, Ling-Hong; Samudrala, Ram

2014-06-15

fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

NASA Technical Reports Server (NTRS)

Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

1996-01-01

This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamic and aeroacoustic properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state-of-the-art audio and video rendering of the propagating acoustic signals.
The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

DOE PAGES

O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; ...

1995-01-01

Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.
Combustor Simulation

NASA Technical Reports Server (NTRS)

Norris, Andrew

2003-01-01

The goal was to perform a 3D simulation of the GE90 combustor as part of a full turbofan engine simulation. The requirements of high fidelity and fast turn-around time call for a massively parallel code. The National Combustion Code (NCC) was chosen for this task because it supports up to 999 processors and includes state-of-the-art combustion models. Also required is the ability to take inlet conditions from the compressor code and give exit conditions to the turbine code.
Ex-vessel neutron dosimetry analysis for Westinghouse 4-loop XL pressurized water reactor plant using the RadTrack™ Code System with the 3D parallel discrete ordinates code RAPTOR-M3G

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, J.; Alpan, F. A.; Fischer, G.A.

2011-07-01

Traditional two-dimensional (2D)/one-dimensional (1D) SYNTHESIS methodology has been widely used to calculate fast neutron (>1.0 MeV) fluence exposure to the reactor pressure vessel in the belt-line region. However, it is expected that this methodology cannot provide accurate fast neutron fluence calculation at elevations far above or below the active core region. A three-dimensional (3D) parallel discrete ordinates calculation for ex-vessel neutron dosimetry on a Westinghouse 4-Loop XL Pressurized Water Reactor has been done. It shows good agreement between the calculated results and measured results. Furthermore, the results show very different fast neutron flux values at some of the former plate locations and elevations above and below an active core than those calculated by a 2D/1D SYNTHESIS method. This indicates that for certain irregular reactor internal structures, where the fast neutron flux has a very strong local effect, it is required to use a 3D transport method to calculate accurate fast neutron exposure. (authors)
Parallel community climate model: Description and user's guide

DOE Office of Scientific and Technical Information (OSTI.GOV)

Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
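The patch decomposition described in this abstract can be sketched roughly as follows. This is an illustrative assumption, not the PCCM2 implementation: the function name `decompose`, the one-patch-per-processor mapping, and the row-major rank ordering are all hypothetical. The sketch splits a latitude-longitude grid into rectangular patches of near-equal size and records which index ranges each processor rank would own.

```python
def decompose(nlat, nlon, plat, plon):
    """Split an nlat x nlon grid into plat x plon rectangular patches.

    Returns a dict mapping a processor rank to half-open index ranges
    (lat_start, lat_stop, lon_start, lon_stop). One patch per rank is an
    assumption made for this sketch.
    """
    def edges(n, parts):
        # Spread any remainder over the first few parts.
        base, extra = divmod(n, parts)
        result = [0]
        for p in range(parts):
            result.append(result[-1] + base + (1 if p < extra else 0))
        return result

    lat_edges = edges(nlat, plat)
    lon_edges = edges(nlon, plon)
    patches = {}
    for i in range(plat):
        for j in range(plon):
            rank = i * plon + j  # row-major rank ordering (assumed)
            patches[rank] = (lat_edges[i], lat_edges[i + 1],
                             lon_edges[j], lon_edges[j + 1])
    return patches
```

For example, `decompose(64, 128, 4, 8)` yields 32 patches of 16 x 16 grid points each. Column physics can then run on each patch without referring to points owned by another rank.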
A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG

NASA Astrophysics Data System (ADS)

Griffiths, M. K.; Fedun, V.; Erdélyi, R.

2015-03-01

Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.
Hybrid MPI/OpenMP Implementation of the ORAC Molecular Dynamics Program for Generalized Ensemble and Fast Switching Alchemical Simulations.

PubMed

Procacci, Piero

2016-06-27

We present a new release (6.0β) of the ORAC program [Marsili et al. J. Comput. Chem. 2010, 31, 1106-1116] with a hybrid OpenMP/MPI (open multiprocessing message passing interface) multilevel parallelism tailored for generalized ensemble (GE) and fast switching double annihilation (FS-DAM) nonequilibrium technology aimed at evaluating the binding free energy in drug-receptor systems on high performance computing platforms. The production of the GE or FS-DAM trajectories is handled using a weak scaling parallel approach on the MPI level only, while a strong scaling force decomposition scheme is implemented for intranode computations with shared memory access at the OpenMP level. The efficiency, simplicity, and inherent parallel nature of the ORAC implementation of the FS-DAM algorithm project the code as a possible effective tool for second-generation high-throughput virtual screening in drug discovery and design. The code, along with documentation, testing, and ancillary tools, is distributed under the provisions of the General Public License and can be freely downloaded at www.chim.unifi.it/orac.
Real-Time Parallel Software Design Case Study: Implementation of the RASSP SAR Benchmark on the Intel Paragon.

DTIC Science & Technology

1996-01-01

Real-Time 19 5 Conclusion 23 List of References 25 ii LIST OF FIGURES FIGURE PAGE 3-1 Test Bench Pseudo Code 7 3-2 Fast Convolution...3-1 shows pseudo - code for a test bench with two application nodes. The outer test bench wrapper consists of three functions: pipeline_init, pipeline...exit_func); Figure 3-1. Test Bench Pseudo Code The application wrapper is contained in the pipeline routine and similarly consists of an
Schnek: A C++ library for the development of parallel simulation codes on regular grids

NASA Astrophysics Data System (ADS)

Schmitz, Holger

2018-05-01

A large number of algorithms across the field of computational physics are formulated on grids with a regular topology. We present Schnek, a library that enables fast development of parallel simulations on regular grids. Schnek contains a number of easy-to-use modules that greatly reduce the amount of administrative code for large-scale simulation codes. The library provides an interface for reading simulation setup files with a hierarchical structure. The structure of the setup file is translated into a hierarchy of simulation modules that the developer can specify. The reader parses and evaluates mathematical expressions and initialises variables or grid data. This enables developers to write modular and flexible simulation codes with minimal effort. Regular grids of arbitrary dimension are defined as well as mechanisms for defining physical domain sizes, grid staggering, and ghost cells on these grids. Ghost cells can be exchanged between neighbouring processes using MPI with a simple interface. The grid data can easily be written into HDF5 files using serial or parallel I/O.
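The ghost-cell exchange mentioned in this abstract can be illustrated without MPI or the Schnek API. The sketch below is a plain-Python stand-in under stated assumptions: a 1D periodic grid, one ghost cell per side, and "processes" simulated as list entries. The names `split_with_ghosts` and `exchange_ghosts` are hypothetical and are not Schnek functions.

```python
def split_with_ghosts(values, nparts):
    """Split a 1D list into equal chunks, each padded with one ghost cell per side."""
    size, remainder = divmod(len(values), nparts)
    if remainder:
        raise ValueError("grid size must be divisible by the number of parts")
    return [[None] + values[p * size:(p + 1) * size] + [None]
            for p in range(nparts)]


def exchange_ghosts(chunks):
    """Fill ghost cells from neighbouring chunks, assuming periodic boundaries.

    Only ghost cells are written and only interior cells are read, so the
    update can be done in place.
    """
    n = len(chunks)
    for p, chunk in enumerate(chunks):
        left = chunks[(p - 1) % n]
        right = chunks[(p + 1) % n]
        chunk[0] = left[-2]    # last interior cell of the left neighbour
        chunk[-1] = right[1]   # first interior cell of the right neighbour
```

After the exchange, a stencil update on each chunk can read `chunk[i - 1]` and `chunk[i + 1]` for every interior index without touching another chunk. In a real distributed code the two copies would be replaced by message-passing calls.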
An implementation of a tree code on a SIMD, parallel computer

NASA Technical Reports Server (NTRS)

Olson, Kevin M.; Dorband, John E.

1994-01-01

We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
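The tree construction outlined in this abstract (sort, then split recursively along x, y and z, with each node treated as a monopole) can be sketched serially as below. This is an illustration under assumptions, not the authors' SIMD implementation: the function name `build_tree`, the dict-based node layout, and the cycling of the split axis with depth are hypothetical choices, and all masses are assumed positive.

```python
def build_tree(particles, depth=0):
    """Build a balanced binary tree over particles.

    particles: non-empty list of (mass, (x, y, z)) tuples with positive mass.
    Each node stores its monopole: total mass and centre of mass.
    """
    total_mass = sum(m for m, _ in particles)
    centre = tuple(sum(m * pos[k] for m, pos in particles) / total_mass
                   for k in range(3))
    node = {"mass": total_mass, "com": centre,
            "count": len(particles), "children": []}
    if len(particles) > 1:
        axis = depth % 3  # cycle x, y, z with depth (assumed ordering)
        ordered = sorted(particles, key=lambda p: p[1][axis])
        half = len(ordered) // 2
        node["children"] = [build_tree(ordered[:half], depth + 1),
                            build_tree(ordered[half:], depth + 1)]
    return node
```

Splitting the sorted list in half at every level keeps the tree balanced when the particle count is a power of two, and sorting along alternating axes keeps spatially close particles in the same subtree.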
Final report for the Tera Computer TTI CRADA

DOE Office of Scientific and Technical Information (OSTI.GOV)

Davidson, G.S.; Pavlakos, C.; Silva, C.

1997-01-01

Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTA parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.
Parallel fast multipole boundary element method applied to computational homogenization

NASA Astrophysics Data System (ADS)

Ptaszny, Jacek

2018-01-01

In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problems are developed and applied to the computational homogenization of a solid containing spherical voids. The system of equations is solved by using the GMRES iterative solver. The boundary of the body is discretized by using quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of the number of threads is examined.
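The task assignment described in this abstract, partitioning tree nodes into disjoint sets of approximately equal size with one set per thread, might look like the sketch below. It uses Python threads rather than OpenMP, and the names `partition_nodes`, `process_in_threads` and the `work` callback are hypothetical stand-ins for the moment transformations in the actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_nodes(nodes, nthreads):
    """Split nodes into nthreads disjoint, contiguous, near-equal sets."""
    base, extra = divmod(len(nodes), nthreads)
    parts, start = [], 0
    for t in range(nthreads):
        stop = start + base + (1 if t < extra else 0)
        parts.append(nodes[start:stop])
        start = stop
    return parts


def process_in_threads(nodes, nthreads, work):
    """Apply work() to every node, one partition per thread, preserving order."""
    parts = partition_nodes(nodes, nthreads)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        results = list(pool.map(lambda part: [work(n) for n in part], parts))
    return [value for part in results for value in part]
```

A static split like this balances the load only when each node costs roughly the same, which matches the equal-size assumption stated in the abstract.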
Parallel workflow manager for non-parallel bioinformatic applications to solve large-scale biological problems on a supercomputer.

PubMed

Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas

2016-04-01

Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes through systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station; however, due to multiple invocations for a large number of subtasks, the full task requires significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods and a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new software tool, mpiWrapper, has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper.

Reconstruction for time-domain in vivo EPR 3D multigradient oximetric imaging--a parallel processing perspective.

PubMed

Dharmaraj, Christopher D; Thadikonda, Kishan; Fletcher, Anthony R; Doan, Phuc N; Devasahayam, Nallathamby; Matsumoto, Shingo; Johnson, Calvin A; Cook, John A; Mitchell, James B; Subramanian, Sankaran; Krishna, Murali C

2009-01-01

Three-dimensional Oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. It is also possible with fast imaging to track the changes in tissue oxygenation in response to the oxygen content in the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment involving digital band pass filtering and background noise subtraction, followed by 3D Fourier reconstruction. This process is rather slow in a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four Dual-Core AMD Opteron shared memory processors, to reduce the computational burden of the filtration task significantly. The results show that the parallel code for filtration has achieved a speedup factor of 46.66 over the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform 3D Fourier reconstruction. Speedup factors of 4.57 and 4.25 have been achieved during the reconstruction process and oximetry computation, for a data set with 23 x 23 x 23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different dimensions of the data and presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through local internet (NIHnet). The experimental results demonstrate that parallel computing provides a source of high computational power to obtain biophysical parameters from 3D EPR oximetric imaging, almost in real-time.
Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators

PubMed Central

Wang, Wei; Xu, Lifan; Cavazos, John; Huang, Howie H.; Kay, Matthew

2014-01-01

Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as an Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case-study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved more than speedup above the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided speedups on GPUs of at least faster than the sequential implementation and faster than a parallelized OpenMP implementation. An implementation of OpenMP on Intel MIC coprocessor provided speedups of with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media. PMID:24497950
An object-oriented approach for parallel self adaptive mesh refinement on block structured grids

NASA Technical Reports Server (NTRS)

Lemke, Max; Witsch, Kristian; Quinlan, Daniel

1993-01-01

Self-adaptive mesh refinement dynamically matches the computational demands of a solver for partial differential equations to the activity in the application's domain. In this paper we present two C++ class libraries, P++ and AMR++, which significantly simplify the development of sophisticated adaptive mesh refinement codes on (massively) parallel distributed memory architectures. The development is based on our previous research in this area. The C++ class libraries provide abstractions to separate the issues of developing parallel adaptive mesh refinement applications into those of parallelism, abstracted by P++, and adaptive mesh refinement, abstracted by AMR++. P++ is a parallel array class library to permit efficient development of architecture independent codes for structured grid applications, and AMR++ provides support for self-adaptive mesh refinement on block-structured grids of rectangular non-overlapping blocks. Using these libraries, the application programmers' work is greatly simplified to primarily specifying the serial single grid application and obtaining the parallel and self-adaptive mesh refinement code with minimal effort. Initial results for simple singular perturbation problems solved by self-adaptive multilevel techniques (FAC, AFAC), being implemented on the basis of prototypes of the P++/AMR++ environment, are presented. Singular perturbation problems frequently arise in large applications, e.g. in the area of computational fluid dynamics. They usually have solutions with layers which require adaptive mesh refinement and fast basic solvers in order to be resolved efficiently.
Potential Application of a Graphical Processing Unit to Parallel Computations in the NUBEAM Code

NASA Astrophysics Data System (ADS)

Payne, J.; McCune, D.; Prater, R.

2010-11-01

NUBEAM is a comprehensive computational Monte Carlo based model for neutral beam injection (NBI) in tokamaks. NUBEAM computes NBI-relevant profiles in tokamak plasmas by tracking the deposition and the slowing of fast ions. At the core of NUBEAM are vector calculations used to track fast ions. These calculations have recently been parallelized to run on MPI clusters. However, cost and interlink bandwidth limit the ability to fully parallelize NUBEAM on an MPI cluster. Recent implementation of double precision capabilities for Graphical Processing Units (GPUs) presents a cost effective and high performance alternative or complement to MPI computation. Commercially available graphics cards can achieve up to 672 GFLOPS double precision and can handle hundreds of thousands of threads. The ability to execute at least one thread per particle simultaneously could significantly reduce the execution time and the statistical noise of NUBEAM. Progress on implementation on a GPU will be presented.
On a model of three-dimensional bursting and its parallel implementation

NASA Astrophysics Data System (ADS)

Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.

2008-04-01

A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a linearly-implicit finite difference method, second-order accurate in both space and time, on equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared address space paradigm. These three implementations are evaluated on two current high performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors that affect the performance of the parallel implementations is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice as efficient as the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.
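As a rough illustration of the PCG solve named in the abstract above, here is a minimal serial sketch. The 1D Laplacian operator, the Jacobi preconditioner, and all function names are assumptions chosen for illustration; the abstract does not say which preconditioner or data layout the authors used.

```python
def pcg(matvec, b, precond, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive-definite system."""
    x = [0.0] * len(b)
    r = list(b)                                   # residual b - A x for the zero initial guess
    z = precond(r)
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    b_norm = sum(bi * bi for bi in b) ** 0.5
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 <= tol * b_norm:
            break
        z = precond(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


def laplacian_1d(v):
    """Apply the tridiagonal (-1, 2, -1) matrix, a stand-in SPD operator (assumption)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def jacobi(r):
    """Diagonal (Jacobi) preconditioner; the diagonal of the stand-in matrix is 2."""
    return [ri / 2.0 for ri in r]


b = [1.0] * 8
x = pcg(laplacian_1d, b, jacobi)
residual = max(abs(bi - ai) for bi, ai in zip(b, laplacian_1d(x)))
```

In a message-passing version the matrix-vector product and the dot products are the steps that need communication, which is one plausible reason the three implementations compared in the abstract behave differently.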
Hybrid MPI-OpenMP Parallelism in the ONETEP Linear-Scaling Electronic Structure Code: Application to the Delamination of Cellulose Nanofibrils.

PubMed

Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton

2014-11-11

We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonication, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
A parallel and modular deformable cell Car-Parrinello code

NASA Astrophysics Data System (ADS)

Cavazzoni, Carlo; Chiarotti, Guido L.

1999-12-01

We have developed a modular parallel code implementing the Car-Parrinello [Phys. Rev. Lett. 55 (1985) 2471] algorithm including the variable cell dynamics [Europhys. Lett. 36 (1994) 345; J. Phys. Chem. Solids 56 (1995) 510]. Our code is written in Fortran 90, and makes use of some new programming concepts like encapsulation, data abstraction and data hiding. The code has a multi-layer hierarchical structure with tree-like dependences among modules. The modules include not only the variables but also the methods acting on them, in an object oriented fashion. The modular structure allows easier code maintenance, development and debugging procedures, and is suitable for a developer team. The layer structure permits high portability. The code displays an almost linear speed-up in a wide range of number of processors independently of the architecture. Super-linear speed up is obtained with a "smart" Fast Fourier Transform (FFT) that uses the available memory on the single node (increasing for a fixed problem with the number of processing elements) as temporary buffer to store wave function transforms. This code has been used to simulate water and ammonia at giant planet conditions for systems as large as 64 molecules for ~50 ps.
Novel Scalable 3-D MT Inverse Solver

NASA Astrophysics Data System (ADS)

Kuvshinov, A. V.; Kruglyakov, M.; Geraskin, A.

2016-12-01

We present a new, robust and fast, three-dimensional (3-D) magnetotelluric (MT) inverse solver. As a forward modelling engine a highly-scalable solver extrEMe [1] is used. The (regularized) inversion is based on an iterative gradient-type optimization (quasi-Newton method) and exploits an adjoint sources approach for fast calculation of the gradient of the misfit. The inverse solver is able to deal with highly detailed and contrasting models, allows for working (separately or jointly) with any type of MT (single-site and/or inter-site) responses, and supports massive parallelization. Different parallelization strategies implemented in the code allow for optimal usage of available computational resources for a given problem setup. To parameterize an inverse domain a mask approach is implemented, which means that one can merge any subset of forward modelling cells in order to account for (usually) irregular distribution of observation sites. We report results of 3-D numerical experiments aimed at analysing the robustness, performance and scalability of the code. In particular, our computational experiments carried out on different platforms ranging from modern laptops to high-performance clusters demonstrate practically linear scalability of the code up to thousands of nodes. 1. Kruglyakov, M., A. Geraskin, A. Kuvshinov, 2016. Novel accurate and scalable 3-D MT forward solver based on a contracting integral equation method, Computers and Geosciences, in press.
Avoiding Defect Nucleation during Equilibration in Molecular Dynamics Simulations with ReaxFF

DTIC Science & Technology

2015-04-01

respectively. All simulations are performed using the LAMMPS computer code.12 Fig. 1 a) Initial and b) final configurations of the molecular centers...Plimpton S. Fast parallel algorithms for short-range molecular dynamics. J Comput Phys. 1995;117:1–19. (Software available at http://lammps.sandia.gov
Phosphoenolpyruvate carboxykinase 1 gene (Pck1) displays parallel evolution between Old World and New World fruit bats.

PubMed

Zhu, Lei; Yin, Qiuyuan; Irwin, David M; Zhang, Shuyi

2015-01-01

Bats are an ideal mammalian group for exploring adaptations to fasting due to their large variety of diets and because fasting is a regular part of their life cycle. Mammals fed on a carbohydrate-rich diet experience a rapid decrease in blood glucose levels during a fast; thus, the development of mechanisms to resist the consequences of regular fasts, experienced on a daily basis, must have been crucial in the evolution of frugivorous bats. Phosphoenolpyruvate carboxykinase 1 (PEPCK1, encoded by the Pck1 gene) is the rate-limiting enzyme in gluconeogenesis and is largely responsible for the maintenance of glucose homeostasis during fasting in fruit-eating bats. To test whether Pck1 has experienced adaptive evolution in frugivorous bats, we obtained Pck1 coding sequence from 20 species of bats, including five Old World fruit bats (OWFBs) (Pteropodidae) and two New World fruit bats (NWFBs) (Phyllostomidae). Our molecular evolutionary analyses of these sequences revealed that Pck1 was under purifying selection in both Old World and New World fruit bats with no evidence of positive selection detected in either ancestral branch leading to fruit bats. Interestingly, however, six specific amino acid substitutions were detected on the ancestral lineage of OWFBs. In addition, we found considerable evidence for parallel evolution, at the amino acid level, between the PEPCK1 sequences of Old World fruit bats and New World fruit bats. Tests for parallel evolution showed that four parallel substitutions (Q276R, R503H, I558V and Q593R) were driven by natural selection. Our study provides evidence that Pck1 underwent parallel evolution between Old World and New World fruit bats, two lineages of mammals that feed on a carbohydrate-rich diet and experience regular periods of fasting as part of their life cycle.
Phosphoenolpyruvate Carboxykinase 1 Gene (Pck1) Displays Parallel Evolution between Old World and New World Fruit Bats

PubMed Central

Irwin, David M.; Zhang, Shuyi

2015-01-01

Bats are an ideal mammalian group for exploring adaptations to fasting due to their large variety of diets and because fasting is a regular part of their life cycle. Mammals fed on a carbohydrate-rich diet experience a rapid decrease in blood glucose levels during a fast; thus, the development of mechanisms to resist the consequences of regular fasts, experienced on a daily basis, must have been crucial in the evolution of frugivorous bats. Phosphoenolpyruvate carboxykinase 1 (PEPCK1, encoded by the Pck1 gene) is the rate-limiting enzyme in gluconeogenesis and is largely responsible for the maintenance of glucose homeostasis during fasting in fruit-eating bats. To test whether Pck1 has experienced adaptive evolution in frugivorous bats, we obtained Pck1 coding sequence from 20 species of bats, including five Old World fruit bats (OWFBs) (Pteropodidae) and two New World fruit bats (NWFBs) (Phyllostomidae). Our molecular evolutionary analyses of these sequences revealed that Pck1 was under purifying selection in both Old World and New World fruit bats with no evidence of positive selection detected in either ancestral branch leading to fruit bats. Interestingly, however, six specific amino acid substitutions were detected on the ancestral lineage of OWFBs. In addition, we found considerable evidence for parallel evolution, at the amino acid level, between the PEPCK1 sequences of Old World fruit bats and New World fruit bats. Tests for parallel evolution showed that four parallel substitutions (Q276R, R503H, I558V and Q593R) were driven by natural selection. Our study provides evidence that Pck1 underwent parallel evolution between Old World and New World fruit bats, two lineages of mammals that feed on a carbohydrate-rich diet and experience regular periods of fasting as part of their life cycle. PMID:25807515
Using a source-to-source transformation to introduce multi-threading into the AliRoot framework for a parallel event reconstruction

NASA Astrophysics Data System (ADS)

Lohn, Stefan B.; Dong, Xin; Carminati, Federico

2012-12-01

Chip-Multiprocessors are going to support massive parallelism by many additional physical and logical cores. Improving performance can no longer be obtained by increasing clock-frequency because the technical limits are almost reached. Instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, bandwidth of the memory bus or links between cores and sockets are not going to be improved as fast. Hence, parallelism can only result in performance gains if the memory usage is optimized and the communication between threads is minimized. Besides, concurrent programming has become a domain for experts. Implementing multi-threading is error prone and labor-intensive. A full reimplementation of the whole AliRoot source-code is unaffordable. This paper describes the effort to evaluate the adaptation of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation to address the problems as described before and to provide a straight-forward way of parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread-safety will be introduced to bring the original sequential and thread unaware source-code into the position of utilizing multi-threading. Afterwards further investigations have to be performed to point out candidates of classes that are useful to share amongst threads. Then in a second step, the transformation has to change the code to share these classes and finally to verify whether any invalid interferences between threads remain.
PCTDSE: A parallel Cartesian-grid-based TDSE solver for modeling laser-atom interactions

NASA Astrophysics Data System (ADS)

Fu, Yongsheng; Zeng, Jiaolong; Yuan, Jianmin

2017-01-01

We present a parallel Cartesian-grid-based time-dependent Schrödinger equation (TDSE) solver for modeling laser-atom interactions. It can simulate the single-electron dynamics of atoms in arbitrary time-dependent vector potentials. We use a split-operator method combined with fast Fourier transforms (FFT), on a three-dimensional (3D) Cartesian grid. Parallelization is realized using a 2D decomposition strategy based on the Message Passing Interface (MPI) library, which results in a good parallel scaling on modern supercomputers. We give simple applications for the hydrogen atom using the benchmark problems coming from the references and obtain repeatable results. The extensions to other laser-atom systems are straightforward with minimal modifications of the source code.
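The split-operator method with FFTs mentioned in the abstract above can be sketched in one dimension as follows. This is a simplified serial illustration, not the PCTDSE code: it assumes atomic units, a static harmonic potential instead of a time-dependent vector potential, and a small hand-written radix-2 FFT so that the block has no dependencies. All names and grid parameters are hypothetical.

```python
import cmath
import math


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def ifft(a):
    """Inverse transform, including the 1/n normalisation."""
    return [v / len(a) for v in fft(a, inverse=True)]


def split_operator_step(psi, potential, k, dt):
    """One step: half potential kick, full kinetic step in k-space, half potential kick."""
    half_kick = [cmath.exp(-0.5j * v * dt) for v in potential]
    psi = [h * p for h, p in zip(half_kick, psi)]
    psi_k = fft(psi)
    psi_k = [cmath.exp(-0.5j * kk * kk * dt) * p for kk, p in zip(k, psi_k)]  # T = k^2 / 2
    psi = ifft(psi_k)
    return [h * p for h, p in zip(half_kick, psi)]


n, length, dt = 64, 20.0, 0.01
dx = length / n
x = [-length / 2 + j * dx for j in range(n)]
k = [2 * math.pi * (j if j < n // 2 else j - n) / length for j in range(n)]
potential = [0.5 * xi * xi for xi in x]            # harmonic well, chosen only for illustration
psi = [complex(math.exp(-0.5 * xi * xi)) for xi in x]
scale = math.sqrt(sum(abs(p) ** 2 for p in psi) * dx)
psi = [p / scale for p in psi]
for _ in range(50):
    psi = split_operator_step(psi, potential, k, dt)
norm = sum(abs(p) ** 2 for p in psi) * dx
```

Every factor applied in a step has unit modulus, so the norm of the wave function should be preserved up to round-off. In three dimensions the FFT is the step that couples grid points across processes, which is presumably where the 2D decomposition described in the abstract matters.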
Development and Validation of a Fast, Accurate and Cost-Effective Aeroservoelastic Method on Advanced Parallel Computing Systems

NASA Technical Reports Server (NTRS)

Goodwin, Sabine A.; Raj, P.

1999-01-01

Progress to date towards the development and validation of a fast, accurate and cost-effective aeroelastic method for advanced parallel computing platforms such as the IBM SP2 and the SGI Origin 2000 is presented in this paper. The ENSAERO code, developed at the NASA-Ames Research Center has been selected for this effort. The code allows for the computation of aeroelastic responses by simultaneously integrating the Euler or Navier-Stokes equations and the modal structural equations of motion. To assess the computational performance and accuracy of the ENSAERO code, this paper reports the results of the Navier-Stokes simulations of the transonic flow over a flexible aeroelastic wing body configuration. In addition, a forced harmonic oscillation analysis in the frequency domain and an analysis in the time domain are done on a wing undergoing a rigid pitch and plunge motion. Finally, to demonstrate the ENSAERO flutter-analysis capability, aeroelastic Euler and Navier-Stokes computations on an L-1011 wind tunnel model including pylon, nacelle and empennage are underway. All computational solutions are compared with experimental data to assess the level of accuracy of ENSAERO. As the computations described above are performed, a meticulous log of computational performance in terms of wall clock time, execution speed, memory and disk storage is kept. Code scalability is also demonstrated by studying the impact of varying the number of processors on computational performance on the IBM SP2 and the Origin 2000 systems.
A Massively Parallel Code for Polarization Calculations

NASA Astrophysics Data System (ADS)

Akiyama, Shizuka; Höflich, Peter

2001-03-01

We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.
Toward an automated parallel computing environment for geosciences

NASA Astrophysics Data System (ADS)

Zhang, Huai; Liu, Mian; Shi, Yaolin; Yuen, David A.; Yan, Zhenzhen; Liang, Guoping

2007-08-01

Software for geodynamic modeling has not kept up with the fast growing computing hardware and network resources. In the past decade supercomputing power has become available to most researchers in the form of affordable Beowulf clusters and other parallel computer platforms. However, to take full advantage of such computing power requires developing parallel algorithms and associated software, a task that is often too daunting for geoscience modelers whose main expertise is in geosciences. We introduce here an automated parallel computing environment built on open-source algorithms and libraries. Users interact with this computing environment by specifying the partial differential equations, solvers, and model-specific properties using an English-like modeling language in the input files. The system then automatically generates the finite element codes that can be run on distributed or shared memory parallel machines. This system is dynamic and flexible, allowing users to address different problems in geosciences. It is capable of providing web-based services, enabling users to generate source codes online. This unique feature will facilitate high-performance computing to be integrated with distributed data grids in the emerging cyber-infrastructures for geosciences. In this paper we discuss the principles of this automated modeling environment and provide examples to demonstrate its versatility.
Large-Constraint-Length, Fast Viterbi Decoder

NASA Technical Reports Server (NTRS)

Collins, O.; Dolinar, S.; Hsu, In-Shek; Pollara, F.; Olson, E.; Statman, J.; Zimmerman, G.

1990-01-01

Scheme for efficient interconnection makes VLSI design feasible. Concept for fast Viterbi decoder provides for processing of convolutional codes of constraint length K up to 15 and rates of 1/2 to 1/6. Fully parallel (but bit-serial) architecture developed for decoder of K = 7 implemented in single dedicated VLSI circuit chip. Contains six major functional blocks. VLSI circuits perform branch metric computations, add-compare-select operations, and then store decisions in traceback memory. Traceback processor reads appropriate memory locations and puts out decoded bits. Used as building block for decoders of larger K.
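The branch metric, add-compare-select, and traceback stages listed above can be illustrated in software with a small hard-decision decoder. This sketch assumes a rate 1/2 code with K = 3 and generators (7, 5) in octal, far smaller than the K = 7 to 15 hardware described; the function names and the test message are hypothetical.

```python
def encode(bits, K=3, gens=(0b111, 0b101)):
    """Convolutional encoder; the state holds the previous K-1 input bits."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state               # newest bit in the top position
        out.extend(bin(reg & g).count("1") % 2 for g in gens)
        state = reg >> 1
    return out


def viterbi_decode(received, K=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoder with a full traceback at the end."""
    n_out, n_states = len(gens), 1 << (K - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)         # the encoder starts in state 0
    traceback = []                                 # per step: (previous state, input bit) per state
    for t in range(0, len(received), n_out):
        symbol = received[t:t + n_out]
        new_metrics = [inf] * n_states
        decisions = [None] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = [bin(reg & g).count("1") % 2 for g in gens]
                branch = sum(e != r for e, r in zip(expected, symbol))  # branch metric (Hamming)
                nxt = reg >> 1
                cand = metrics[state] + branch                          # add
                if cand < new_metrics[nxt]:                             # compare and select
                    new_metrics[nxt] = cand
                    decisions[nxt] = (state, b)
        metrics = new_metrics
        traceback.append(decisions)
    state = min(range(n_states), key=metrics.__getitem__)
    bits = []
    for decisions in reversed(traceback):          # walk the stored decisions backwards
        state, b = decisions[state]
        bits.append(b)
    return bits[::-1]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]           # last K-1 zeros flush the encoder
channel = encode(message)
channel[5] ^= 1                                    # inject a single bit error
decoded = viterbi_decode(channel)
```

The inner loop over states is independent for each state within a time step, which is the property a fully parallel hardware architecture can exploit; the number of states doubles with each increment of K.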
Smart photodetector arrays for error control in page-oriented optical memory

NASA Astrophysics Data System (ADS)

Schaffer, Maureen Elizabeth

1998-12-01

Page-oriented optical memories (POMs) have been proposed to meet high speed, high capacity storage requirements for input/output intensive computer applications. This technology offers the capability for storage and retrieval of optical data in two-dimensional pages resulting in high throughput data rates. Since currently measured raw bit error rates for these systems fall several orders of magnitude short of industry requirements for binary data storage, powerful error control codes must be adopted. These codes must be designed to take advantage of the two-dimensional memory output. In addition, POMs require an optoelectronic interface to transfer the optical data pages to one or more electronic host systems. Conventional charge coupled device (CCD) arrays can receive optical data in parallel, but the relatively slow serial electronic output of these devices creates a system bottleneck thereby eliminating the POM advantage of high transfer rates. Also, CCD arrays are "unintelligent" interfaces in that they offer little data processing capabilities. The optical data page can be received by two-dimensional arrays of "smart" photo-detector elements that replace conventional CCD arrays. These smart photodetector arrays (SPAs) can perform fast parallel data decoding and error control, thereby providing an efficient optoelectronic interface between the memory and the electronic computer. This approach optimizes the computer memory system by combining the massive parallelism and high speed of optics with the diverse functionality, low cost, and local interconnection efficiency of electronics. In this dissertation we examine the design of smart photodetector arrays for use as the optoelectronic interface for page-oriented optical memory. We review options and technologies for SPA fabrication, develop SPA requirements, and determine SPA scalability constraints with respect to pixel complexity, electrical power dissipation, and optical power limits. 
Next, we examine data modulation and error correction coding for the purpose of error control in the POM system. These techniques are adapted, where possible, for 2D data and evaluated as to their suitability for a SPA implementation in terms of BER, code rate, decoder time and pixel complexity. Our analysis shows that differential data modulation combined with relatively simple block codes known as array codes provide a powerful means to achieve the desired data transfer rates while reducing error rates to industry requirements. Finally, we demonstrate the first smart photodetector array designed to perform parallel error correction on an entire page of data and satisfy the sustained data rates of page-oriented optical memories. Our implementation integrates a monolithic PN photodiode array and differential input receiver for optoelectronic signal conversion with a cluster error correction code using 0.35-μm CMOS. This approach provides high sensitivity, low electrical power dissipation, and fast parallel correction of 2 x 2-bit cluster errors in an 8 x 8 bit code block to achieve corrected output data rates scalable to 102 Gbps in the current technology increasing to 1.88 Tbps in 0.1-μm CMOS.
Plasma Physics Calculations on a Parallel Macintosh Cluster

NASA Astrophysics Data System (ADS)

Decyk, Viktor; Dauger, Dean; Kokelaar, Pieter

2000-03-01

We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 MFlops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
Fast Computation of the Two-Point Correlation Function in the Age of Big Data

NASA Astrophysics Data System (ADS)

Pellegrino, Andrew; Timlin, John

2018-01-01

We present a new code which quickly computes the two-point correlation function for large sets of astronomical data. This code combines the ease of use of Python with the speed of parallel shared libraries written in C. We include the capability to compute the auto- and cross-correlation statistics, and allow the user to calculate the three-dimensional and angular correlation functions. Additionally, the code automatically divides the user-provided sky masks into contiguous subsamples of similar size, using the HEALPix pixelization scheme, for the purpose of resampling. Errors are computed using jackknife and bootstrap resampling in a way that adds negligible extra runtime, even with many subsamples. We demonstrate comparable speed with other clustering codes, and code accuracy compared to known and analytic results.
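One plausible way to obtain jackknife errors at little extra cost is to count pairs once, record which subsample each pair touches, and form the leave-one-out estimates by subtraction. The sketch below illustrates that idea only; the abstract does not state the estimator or bookkeeping actually used. The natural estimator DD/RR - 1, the quadrant subsamples (instead of HEALPix pixels), and all names are assumptions.

```python
import math
import random


def pair_counts(points, labels, edges, n_sub):
    """Histogram pair separations once, also recording the pairs that touch each subsample.

    touching[j][b] holds the pairs in bin b with at least one point in subsample j.
    """
    n_bins = len(edges) - 1
    total = [0] * n_bins
    touching = [[0] * n_bins for _ in range(n_sub)]
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = math.dist(points[i], points[j])
            for b in range(n_bins):
                if edges[b] <= r < edges[b + 1]:
                    total[b] += 1
                    touching[labels[i]][b] += 1
                    if labels[j] != labels[i]:
                        touching[labels[j]][b] += 1
                    break
    return total, touching


def jackknife_xi(data, rand, edges, n_sub, label_of):
    """Return the correlation function per bin and its jackknife error."""
    d_lab = [label_of(p) for p in data]
    r_lab = [label_of(p) for p in rand]
    dd, dd_touch = pair_counts(data, d_lab, edges, n_sub)
    rr, rr_touch = pair_counts(rand, r_lab, edges, n_sub)

    def xi(dd_b, rr_b, n_d, n_r):
        # Natural estimator: each count is normalised by its number of distinct pairs.
        return (dd_b / (n_d * (n_d - 1) / 2)) / (rr_b / (n_r * (n_r - 1) / 2)) - 1.0

    n_bins = len(edges) - 1
    full = [xi(dd[b], rr[b], len(data), len(rand)) for b in range(n_bins)]
    errors = []
    for b in range(n_bins):
        loo = []
        for j in range(n_sub):                     # leave subsample j out by subtraction
            n_d = len(data) - d_lab.count(j)
            n_r = len(rand) - r_lab.count(j)
            loo.append(xi(dd[b] - dd_touch[j][b], rr[b] - rr_touch[j][b], n_d, n_r))
        mean = sum(loo) / n_sub
        errors.append(((n_sub - 1) / n_sub * sum((v - mean) ** 2 for v in loo)) ** 0.5)
    return full, errors


random.seed(1)
data = [(random.random(), random.random()) for _ in range(200)]
rand = [(random.random(), random.random()) for _ in range(200)]
quadrant = lambda p: int(p[0] >= 0.5) + 2 * int(p[1] >= 0.5)
xi_est, xi_err = jackknife_xi(data, rand, [0.05, 0.1, 0.2, 0.4], 4, quadrant)
```

Because the mock data here are unclustered, the estimates should scatter around zero. The brute-force double loop is quadratic in the number of points; it is the part a compiled, parallel library would replace.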
LDRD final report on massively-parallel linear programming : the parPCx system.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar

2005-02-01

This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project "Massively-Parallel Linear Programming". We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Integer and Combinatorial Optimizer). We conclude with directions for long-term future algorithmic research and for near-term development that could improve the performance of parPCx.
A parallel-vector algorithm for rapid structural analysis on high-performance computers

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.

1990-01-01

A fast, accurate Choleski method for the solution of symmetric systems of linear equations is presented. This direct method is based on a variable-band storage scheme and takes advantage of column heights to reduce the number of operations in the Choleski factorization. The method employs parallel computation in the outermost DO-loop and vector computation via the 'loop unrolling' technique in the innermost DO-loop. The method avoids computations with zeros outside the column heights, and as an option, zeros inside the band. The close relationship between Choleski and Gauss elimination methods is examined. The minor changes required to convert the Choleski code to a Gauss code to solve non-positive-definite symmetric systems of equations are identified. The results for two large-scale structural analyses performed on supercomputers, demonstrate the accuracy and speed of the method.
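The way column heights cut the work in a variable-band Choleski factorization can be illustrated with a short serial sketch. It stores each column of the upper triangle from its first nonzero row down to the diagonal and never touches entries above that height. The storage layout, function names, and test matrix are assumptions; the parallel outer loop and the unrolled inner loop of the original Fortran are not reproduced.

```python
import math


def skyline_cholesky(cols):
    """Factor A = U^T U in place.

    cols[j] holds column j of the upper triangle, from its first nonzero row
    down to the diagonal (the column height).
    """
    n = len(cols)
    first = [j - (len(cols[j]) - 1) for j in range(n)]    # first stored row of each column
    for j in range(n):
        cj = cols[j]
        for i in range(first[j], j + 1):
            ci = cols[i]
            start = max(first[i], first[j])               # skip zeros above either column height
            s = cj[i - first[j]] - sum(ci[k - first[i]] * cj[k - first[j]]
                                       for k in range(start, i))
            cj[i - first[j]] = math.sqrt(s) if i == j else s / ci[-1]
    return cols


def skyline_solve(cols, b):
    """Solve U^T U x = b using the factor produced by skyline_cholesky."""
    n = len(cols)
    first = [j - (len(cols[j]) - 1) for j in range(n)]
    y = list(b)
    for j in range(n):                                    # forward substitution with U^T
        cj = cols[j]
        y[j] = (y[j] - sum(cj[k - first[j]] * y[k] for k in range(first[j], j))) / cj[-1]
    for j in reversed(range(n)):                          # column-oriented back substitution
        cj = cols[j]
        y[j] /= cj[-1]
        for k in range(first[j], j):
            y[k] -= cj[k - first[j]] * y[j]
    return y


# A = [[4, 1, 1, 0],
#      [1, 4, 1, 0],
#      [1, 1, 4, 1],
#      [0, 0, 1, 4]]   stored by columns with heights 1, 2, 3, 2
cols = [[4.0], [1.0, 4.0], [1.0, 1.0, 4.0], [1.0, 4.0]]
factor = skyline_cholesky(cols)
solution = skyline_solve(factor, [9.0, 12.0, 19.0, 19.0])
```

The right-hand side was built from the vector (1, 2, 3, 4), so the solve should recover it up to round-off. Fill-in during factorization stays inside the stored column heights, which is why this layout needs no extra storage.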
Parameters that affect parallel processing for computational electromagnetic simulation codes on high performance computing clusters

NASA Astrophysics Data System (ADS)

Moon, Hongsik

What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared with the performance using benchmark software and the metric was FLoating-point Operations Per Second (FLOPS) which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. 
This code was developed to address the needs of the changing computer hardware platforms in order to provide fast, accurate and efficient solutions to large, complex electromagnetic problems. The research in this dissertation proves that the performance of parallel code is intimately related to the configuration of the computer hardware and can be maximized for different hardware platforms. To benchmark and optimize the performance of parallel CEM software, a variety of large, complex projects are created and executed on a variety of computer platforms. The computer platforms used in this research are detailed in this dissertation. The projects run as benchmarks are also described in detail and results are presented. The parameters that affect parallel CEM software on High Performance Computing Clusters (HPCC) are investigated. This research demonstrates methods to maximize the performance of parallel CEM software code.
New technologies for advanced three-dimensional optimum shape design in aeronautics

NASA Astrophysics Data System (ADS)

Dervieux, Alain; Lanteri, Stéphane; Malé, Jean-Michel; Marco, Nathalie; Rostaing-Schmidt, Nicole; Stoufflet, Bruno

1999-05-01

The analysis of complex flows around realistic aircraft geometries is becoming more and more predictive. In order to obtain this result, the complexity of flow analysis codes has been constantly increasing, involving more refined fluid models and sophisticated numerical methods. These codes can only run on top computers, exhausting their memory and CPU capabilities. It is, therefore, difficult to introduce the best analysis codes in a shape optimization loop: most previous works in the optimum shape design field used only simplified analysis codes. Moreover, as the most popular optimization methods are the gradient-based ones, the more complex the flow solver, the more difficult it is to compute the sensitivity code. However, emerging technologies are contributing to make such an ambitious project, of including a state-of-the-art flow analysis code into an optimisation loop, feasible. Among those technologies, there are three important issues that this paper wishes to address: shape parametrization, automated differentiation and parallel computing. Shape parametrization allows faster optimization by reducing the number of design variables; in this work, it relies on a hierarchical multilevel approach. The sensitivity code can be obtained using automated differentiation. The automated approach is based on software manipulation tools, which allow the differentiation to be quick and the resulting differentiated code to be rather fast and reliable. In addition, the parallel algorithms implemented in this work allow the resulting optimization software to run on increasingly larger geometries.
Parallel processing in the honeybee olfactory pathway: structure, function, and evolution.

PubMed

Rössler, Wolfgang; Brill, Martin F

2013-11-01

Animals face highly complex and dynamic olfactory stimuli in their natural environments, which require fast and reliable olfactory processing. Parallel processing is a common principle of sensory systems supporting this task, for example in visual and auditory systems, but its role in olfaction remained unclear. Studies in the honeybee focused on a dual olfactory pathway. Two sets of projection neurons connect glomeruli in two antennal-lobe hemilobes via lateral and medial tracts in opposite sequence with the mushroom bodies and lateral horn. Comparative studies suggest that this dual-tract circuit represents a unique adaptation in Hymenoptera. Imaging studies indicate that glomeruli in both hemilobes receive redundant sensory input. Recent simultaneous multi-unit recordings from projection neurons of both tracts revealed widely overlapping response profiles strongly indicating parallel olfactory processing. Whereas lateral-tract neurons respond fast with broad (generalistic) profiles, medial-tract neurons are odorant specific and respond slower. In analogy to "what-" and "where" subsystems in visual pathways, this suggests two parallel olfactory subsystems providing "what-" (quality) and "when" (temporal) information. Temporal response properties may support across-tract coincidence coding in higher centers. Parallel olfactory processing likely enhances perception of complex odorant mixtures to decode the diverse and dynamic olfactory world of a social insect.
OCTGRAV: Sparse Octree Gravitational N-body Code on Graphics Processing Units

NASA Astrophysics Data System (ADS)

Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

2010-10-01

Octgrav is a very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The algorithms are based on parallel-scan and sort methods. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way, a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s is achieved. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. To test the performance and feasibility, we implemented the algorithms in CUDA in the form of a gravitational tree-code which completely runs on the GPU. The tree construction and traverse algorithms are portable to many-core devices which have support for CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor 20 overall, resulting in a processing rate of more than 2.8 million particles per second. The code has a convenient user interface and is freely available for use.
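The opening angle mentioned above decides, during a tree walk, whether a cell is distant enough to be treated as a single multipole or must be opened into its children. The abstract does not state Octgrav's exact acceptance test, so the following is only a minimal sketch of the textbook Barnes-Hut criterion (cell size divided by distance is smaller than θ); the function name and arguments are illustrative assumptions, not part of the Octgrav interface.

```python
import math

def accept_cell(cell_size, cell_center, particle_pos, theta=0.5):
    """Return True if the cell may be used as one multipole (not opened).

    Assumed Barnes-Hut style test: cell_size / distance < theta.
    """
    distance = math.dist(cell_center, particle_pos)
    if distance == 0.0:
        return False  # particle sits at the cell centre: always open the cell
    return cell_size / distance < theta

# A unit cell seen from distance 4.0 subtends 0.25, below theta = 0.5.
far = accept_cell(1.0, (0.0, 0.0, 0.0), (4.0, 0.0, 0.0))
# The same cell seen from distance 1.5 subtends about 0.67, above theta.
near = accept_cell(1.0, (0.0, 0.0, 0.0), (1.5, 0.0, 0.0))
```

A smaller θ opens more cells, which raises accuracy and cost; a larger θ does the opposite.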
Implementation of molecular dynamics and its extensions with the coarse-grained UNRES force field on massively parallel systems; towards millisecond-scale simulations of protein structure, dynamics, and thermodynamics

PubMed Central

Liwo, Adam; Ołdziej, Stanisław; Czaplewski, Cezary; Kleinerman, Dana S.; Blood, Philip; Scheraga, Harold A.

2010-01-01

We report the implementation of our united-residue UNRES force field for simulations of protein structure and dynamics with massively parallel architectures. In addition to coarse-grained parallelism already implemented in our previous work, in which each conformation was treated by a different task, we introduce a fine-grained level in which energy and gradient evaluation are split between several tasks. The Message Passing Interface (MPI) libraries have been utilized to construct the parallel code. The parallel performance of the code has been tested on a professional Beowulf cluster (Xeon Quad Core), a Cray XT3 supercomputer, and two IBM BlueGene/P supercomputers with canonical and replica-exchange molecular dynamics. With IBM BlueGene/P, about 50 % efficiency and 120-fold speed-up of the fine-grained part was achieved for a single trajectory of a 767-residue protein with use of 256 processors/trajectory. Because of averaging over the fast degrees of freedom, UNRES provides an effective 1000-fold speed-up compared to the experimental time scale and, therefore, enables us to effectively carry out millisecond-scale simulations of proteins with 500 and more amino-acid residues in days of wall-clock time. PMID:20305729
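The quoted efficiency and speed-up figures can be cross-checked with the usual definition of parallel efficiency, speed-up divided by processor count. The sketch below assumes that definition (the abstract does not spell it out) and uses only the numbers quoted above.

```python
def parallel_efficiency(speedup, processors):
    """Parallel efficiency, assumed here to mean speed-up per processor."""
    return speedup / processors

# Figures quoted in the abstract: 120-fold speed-up on 256 processors.
efficiency = parallel_efficiency(120, 256)
print(f"efficiency = {efficiency:.1%}")
```

The result is a little under one half, consistent with the "about 50 %" reported.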
Distributed Learning, Recognition, and Prediction by ART and ARTMAP Neural Networks.

PubMed

Carpenter, Gail A.

1997-11-01

A class of adaptive resonance theory (ART) models for learning, recognition, and prediction with arbitrarily distributed code representations is introduced. Distributed ART neural networks combine the stable fast learning capabilities of winner-take-all ART systems with the noise tolerance and code compression capabilities of multilayer perceptrons. With a winner-take-all code, the unsupervised model dART reduces to fuzzy ART and the supervised model dARTMAP reduces to fuzzy ARTMAP. With a distributed code, these networks automatically apportion learned changes according to the degree of activation of each coding node, which permits fast as well as slow learning without catastrophic forgetting. Distributed ART models replace the traditional neural network path weight with a dynamic weight equal to the rectified difference between coding node activation and an adaptive threshold. Thresholds increase monotonically during learning according to a principle of atrophy due to disuse. However, monotonic change at the synaptic level manifests itself as bidirectional change at the dynamic level, where the result of adaptation resembles long-term potentiation (LTP) for single-pulse or low frequency test inputs but can resemble long-term depression (LTD) for higher frequency test inputs. This paradoxical behavior is traced to dual computational properties of phasic and tonic coding signal components. A parallel distributed match-reset-search process also helps stabilize memory. Without the match-reset-search system, dART becomes a type of distributed competitive learning network.
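The dynamic weight described above, the rectified difference between a coding node's activation and an adaptive threshold, can be written in one line. This is a minimal sketch of that single expression only; the variable names are illustrative, and the threshold update rule of dART is not given in the abstract and is not modeled here.

```python
def dynamic_weight(activation, threshold):
    """Rectified difference: positive part of (activation - threshold)."""
    return max(0.0, activation - threshold)

# With a low threshold the path transmits part of the activation.
w_before = dynamic_weight(0.75, 0.25)
# After the threshold has grown past the activation, the weight is zero.
w_after = dynamic_weight(0.25, 0.75)
```

Because the weight depends on the current activation as well as the threshold, a threshold that only ever increases can still produce weight changes in either direction, which is the bidirectional behavior the abstract refers to.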
fastBMA: scalable network inference and transitive reduction.

PubMed

Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee

2017-10-01

Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the one based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/). © The Authors 2017. Published by Oxford University Press.
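One way to map transitive reduction onto a shortest-path problem is sketched below. It rests on assumptions the abstract does not confirm: each edge carries a probability p, its cost is taken as -log(p), and a direct edge is dropped when some indirect path has a cost no greater than the direct one. The function names and the toy network are hypothetical and are not taken from the fastBMA code.

```python
import heapq
import math

def shortest_cost(graph, src, dst, skip_edge):
    """Dijkstra over non-negative edge costs, ignoring the edge skip_edge."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            return cost
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, {}).items():
            if (node, nxt) == skip_edge:
                continue
            new_cost = cost + weight
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return math.inf

def transitive_reduction(probabilities):
    """Keep edge u->v only if every indirect path is less probable."""
    graph = {}
    for (u, v), p in probabilities.items():
        graph.setdefault(u, {})[v] = -math.log(p)
    kept = {}
    for (u, v), p in probabilities.items():
        indirect = shortest_cost(graph, u, v, skip_edge=(u, v))
        if indirect > graph[u][v]:
            kept[(u, v)] = p
    return kept

# Hypothetical network: A->C is weaker than the chain A->B->C.
edges = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.5}
reduced = transitive_reduction(edges)
```

In this toy case the chain A->B->C has probability 0.81, higher than the direct edge's 0.5, so A->C is treated as redundant and removed.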
An Approach in Radiation Therapy Treatment Planning: A Fast, GPU-Based Monte Carlo Method.

PubMed

Karbalaee, Mojtaba; Shahbazi-Gahrouei, Daryoush; Tavakoli, Mohammad B

2017-01-01

An accurate and fast radiation dose calculation is essential for successful radiotherapy. The aim of this study was to implement a new graphic processing unit (GPU) based radiation therapy treatment planning for accurate and fast dose calculation in radiotherapy centers. A program was written for parallel running based on GPU. The code validation was performed by EGSnrc/DOSXYZnrc. Moreover, a semi-automatic, rotary, asymmetric phantom was designed and produced using bone-, lung-, and soft-tissue-equivalent materials. All measurements were performed using a Mapcheck dosimeter. The accuracy of the code was validated using the experimental data, which was obtained from the anthropomorphic phantom as the gold standard. The findings showed that, compared with those of DOSXYZnrc in the virtual phantom and for most of the voxels (>95%), <3% dose-difference or 3 mm distance-to-agreement (DTA) was found. Moreover, considering the anthropomorphic phantom, compared to the Mapcheck dose measurements, <5% dose-difference or 5 mm DTA was observed. Fast calculation speed and high accuracy of GPU-based Monte Carlo method in dose calculation may be useful in routine radiation therapy centers as the core and main component of a treatment planning verification system.
Introducing GAMER: A fast and accurate method for ray-tracing galaxies using procedural noise

DOE Office of Scientific and Technical Information (OSTI.GOV)

Groeneboom, N. E.; Dahle, H., E-mail: nicolaag@astro.uio.no

2014-03-10

We developed a novel approach for fast and accurate ray-tracing of galaxies using procedural noise fields. Our method allows for efficient and realistic rendering of synthetic galaxy morphologies, where individual components such as the bulge, disk, stars, and dust can be synthesized in different wavelengths. These components follow empirically motivated overall intensity profiles but contain an additional procedural noise component that gives rise to complex natural patterns that mimic interstellar dust and star-forming regions. These patterns produce more realistic-looking galaxy images than using analytical expressions alone. The method is fully parallelized and creates accurate high- and low- resolution images that can be used, for example, in codes simulating strong and weak gravitational lensing. In addition to having a user-friendly graphical user interface, the C++ software package GAMER is easy to implement into an existing code.
Introducing GAMER: A Fast and Accurate Method for Ray-tracing Galaxies Using Procedural Noise

NASA Astrophysics Data System (ADS)

Groeneboom, N. E.; Dahle, H.

2014-03-01

We developed a novel approach for fast and accurate ray-tracing of galaxies using procedural noise fields. Our method allows for efficient and realistic rendering of synthetic galaxy morphologies, where individual components such as the bulge, disk, stars, and dust can be synthesized in different wavelengths. These components follow empirically motivated overall intensity profiles but contain an additional procedural noise component that gives rise to complex natural patterns that mimic interstellar dust and star-forming regions. These patterns produce more realistic-looking galaxy images than using analytical expressions alone. The method is fully parallelized and creates accurate high- and low- resolution images that can be used, for example, in codes simulating strong and weak gravitational lensing. In addition to having a user-friendly graphical user interface, the C++ software package GAMER is easy to implement into an existing code.
FoSSI: the family of simplified solver interfaces for the rapid development of parallel numerical atmosphere and ocean models

NASA Astrophysics Data System (ADS)

Frickenhaus, Stephan; Hiller, Wolfgang; Best, Meike

The portable software FoSSI is introduced that—in combination with additional free solver software packages—allows for an efficient and scalable parallel solution of large sparse linear equations systems arising in finite element model codes. FoSSI is intended to support rapid model code development, completely hiding the complexity of the underlying solver packages. In particular, the model developer need not be an expert in parallelization and is yet free to switch between different solver packages by simple modifications of the interface call. FoSSI offers an efficient and easy, yet flexible interface to several parallel solvers, most of them available on the web, such as PETSC, AZTEC, MUMPS, PILUT and HYPRE. FoSSI makes use of the concept of handles for vectors, matrices, preconditioners and solvers, that is frequently used in solver libraries. Hence, FoSSI allows for a flexible treatment of several linear equations systems and associated preconditioners at the same time, even in parallel on separate MPI-communicators. The second special feature in FoSSI is the task specifier, being a combination of keywords, each configuring a certain phase in the solver setup. This enables the user to control a solver over one unique subroutine. Furthermore, FoSSI has rather similar features for all solvers, making a fast solver intercomparison or exchange an easy task. FoSSI is a community software, proven in an adaptive 2D-atmosphere model and a 3D-primitive equation ocean model, both formulated in finite elements. The present paper discusses perspectives of an OpenMP-implementation of parallel iterative solvers based on domain decomposition methods. This approach to OpenMP solvers is rather attractive, as the code for domain-local operations of factorization, preconditioning and matrix-vector product can be readily taken from a sequential implementation that is also suitable to be used in an MPI-variant. 
Code development in this direction is in an advanced state under the name ScOPES: the Scalable Open Parallel sparse linear Equations Solver.
Performance Improvements of the CYCOFOS Flow Model

NASA Astrophysics Data System (ADS)

Radhakrishnan, Hari; Moulitsas, Irene; Syrakos, Alexandros; Zodiatis, George; Nikolaides, Andreas; Hayes, Daniel; Georgiou, Georgios C.

2013-04-01

The CYCOFOS-Cyprus Coastal Ocean Forecasting and Observing System has been operational since early 2002, providing daily sea current, temperature, salinity and sea level forecasting data for the next 4 and 10 days to end-users in the Levantine Basin, necessary for operational application in marine safety, particularly concerning oil spills and floating objects predictions. The CYCOFOS flow model, similar to most of the coastal and sub-regional operational hydrodynamic forecasting systems of the MONGOOS-Mediterranean Oceanographic Network for Global Ocean Observing System, is based on the POM-Princeton Ocean Model. CYCOFOS is nested with the MyOcean Mediterranean regional forecasting data and with SKIRON and ECMWF for surface forcing. The increasing demand for higher and higher resolution data to meet coastal and offshore downstream applications motivated the parallelization of the CYCOFOS POM model. This development was carried out in the frame of the IPcycofos project, funded by the Cyprus Research Promotion Foundation. The parallel processing provides a viable solution to satisfy these demands without sacrificing accuracy or omitting any physical phenomena. Prior to the IPcycofos project, there have been several attempts to parallelise the POM, as for example the MP-POM. The existing parallel code models rely on the use of specific outdated hardware architectures and associated software. The objective of the IPcycofos project is to produce an operational parallel version of the CYCOFOS POM code that can replicate the results of the serial version of the POM code used in CYCOFOS. The parallelization of the CYCOFOS POM model uses Message Passing Interface-MPI, implemented on commodity computing clusters running open source software and not depending on any specialized vendor hardware. The parallel CYCOFOS POM code is constructed in a modular fashion, allowing a fast re-locatable downscaled implementation.
The MPI implementation takes advantage of the Cartesian nature of the POM mesh, and uses the built-in functionality of MPI routines to split the mesh, using a weighting scheme, along longitude and latitude among the processors. Each server processor works on the model based on domain decomposition techniques. The new parallel CYCOFOS POM code has been benchmarked against the serial POM version of CYCOFOS for speed, accuracy, and resolution and the results are more than satisfactory. With a higher resolution CYCOFOS Levantine model domain the forecasts need much less time than the serial CYCOFOS POM coarser version, both with identical accuracy.
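The weighted split of a Cartesian mesh along longitude and latitude can be illustrated without MPI. The sketch below is an assumption-laden stand-in: the abstract does not describe the weighting scheme or name the MPI routines used (MPI does offer Cartesian topology routines such as MPI_Cart_create), so the helper weighted_split, the weights and the grid sizes are all hypothetical.

```python
def weighted_split(n_points, weights):
    """Split range(n_points) into contiguous pieces proportional to weights."""
    total = sum(weights)
    bounds = [0]
    running = 0.0
    for weight in weights:
        running += weight
        bounds.append(round(n_points * running / total))
    return [(bounds[i], bounds[i + 1]) for i in range(len(weights))]

# Hypothetical mesh of 100 x 60 points on a 3 x 2 process grid; the third
# longitude strip gets double weight (for example, fewer land points).
lon_ranges = weighted_split(100, [1, 1, 2])
lat_ranges = weighted_split(60, [1, 1])

# One (longitude range, latitude range) tile per process.
tiles = [(lon, lat) for lon in lon_ranges for lat in lat_ranges]
```

Each tile would then be owned by one process, with halo exchanges between neighbouring tiles handled by the message-passing layer.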
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

NASA Astrophysics Data System (ADS)

Yokota, Rio; Barba, L. A.; Narumi, Tetsu; Yasuoka, Kenji

2013-03-01

This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096³ computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
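The problem size and timing figures quoted above are mutually consistent, as a short calculation shows. The sketch assumes the mesh is a cube with 4096 points per side and uses only numbers from the abstract.

```python
points_per_side = 4096
total_points = points_per_side ** 3      # cube of 4096 points per side
billions = total_points / 1e9            # compare with "69 billion particles"

# Time per step quoted for the two methods, in seconds.
vortex_time, spectral_time = 108.0, 154.0
time_ratio = spectral_time / vortex_time

print(f"{total_points} points = {billions:.1f} billion, ratio = {time_ratio:.2f}")
```

A cube of that size holds about 68.7 billion points, matching the 69 billion particles reported, and the spectral method took roughly 1.4 times as long per step under the stated conditions.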
A fast and low-cost genotyping method for hepatitis B virus based on pattern recognition in point-of-care settings

PubMed Central

Qiu, Xianbo; Song, Liuwei; Yang, Shuo; Guo, Meng; Yuan, Quan; Ge, Shengxiang; Min, Xiaoping; Xia, Ningshao

2016-01-01

A fast and low-cost method for HBV genotyping especially for genotypes A, B, C and D was developed and tested. A classifier was used to detect and analyze a one-step immunoassay lateral flow strip functionalized with genotype-specific monoclonal antibodies (mAbs) on multiple capture lines in the form of pattern recognition for point-of-care (POC) diagnostics. The fluorescent signals from the capture lines and the background of the strip were collected via multiple optical channels in parallel. A digital HBV genotyping model, whose inputs are the fluorescent signals and outputs are a group of genotype-specific digital binary codes (0/1), was developed based on the HBV genotyping strategy. Meanwhile, a companion decoding table was established to cover all possible pairing cases between the states of a group of genotype-specific digital binary codes and the HBV genotyping results. A logical analyzing module was constructed to process the detected signals in parallel without program control, and its outputs were used to drive a set of LED indicators, which determine the HBV genotype. Compared with nucleic acid analysis of HBV, the developed method provides much faster HBV genotyping at significantly lower cost. PMID:27306485
Gravitational tree-code on graphics processing units: implementation in CUDA

NASA Astrophysics Data System (ADS)

Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

2010-05-01

We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way we achieve a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html
Parallel equilibrium current effect on existence of reversed shear Alfvén eigenmodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xie, Hua-sheng, E-mail: huashengxie@gmail.com; Xiao, Yong, E-mail: yxiao@zju.edu.cn

2015-02-15

A new fast global eigenvalue code, where the terms are segregated according to their physics contents, is developed to study Alfvén modes in tokamak plasmas, particularly, the reversed shear Alfvén eigenmode (RSAE). Numerical calculations show that the parallel equilibrium current corresponding to the kink term is strongly unfavorable for the existence of the RSAE. An improved criterion for the RSAE existence is given for cases with and without the parallel equilibrium current. In the limits of ideal magnetohydrodynamics (MHD) and zero-pressure, the toroidicity effect is the main possible favorable factor for the existence of the RSAE, which is however usually small. This suggests that it is necessary to include additional physics such as kinetic term in the MHD model to overcome the strong unfavorable effect of the parallel current in order to enable the existence of RSAE.

Simulations of toroidal Alfvén eigenmode excited by fast ions on the Experimental Advanced Superconducting Tokamak

NASA Astrophysics Data System (ADS)

Pei, Youbin; Xiang, Nong; Shen, Wei; Hu, Youjun; Todo, Y.; Zhou, Deng; Huang, Juan

2018-05-01

Kinetic-MagnetoHydroDynamic (MHD) hybrid simulations are carried out to study fast ion driven toroidal Alfvén eigenmodes (TAEs) on the Experimental Advanced Superconducting Tokamak (EAST). The first part of this article presents the linear benchmark between two kinetic-MHD codes, namely MEGA and M3D-K, based on a realistic EAST equilibrium. Parameter scans show that the frequency and the growth rate of the TAE given by the two codes agree with each other. The second part of this article discusses the resonance interaction between the TAE and fast ions simulated by the MEGA code. The results show that the TAE exchanges energy with the co-current passing particles with the parallel velocity |v∥| ≈ V_A0/3 or |v∥| ≈ V_A0/5, where V_A0 is the Alfvén speed on the magnetic axis. The TAE destabilized by the counter-current passing ions is also analyzed and found to have a much smaller growth rate than the co-current ions driven TAE. One of the reasons for this is found to be that the overlapping region of the TAE spatial location and the counter-current ion orbits is narrow, and thus the wave-particle energy exchange is not efficient.
Numerical study of the existence criterion for the reversed shear Alfven eigenmode in the presence of a parallel equilibrium current

NASA Astrophysics Data System (ADS)

Shahzad, M.; Rizvi, H.; Panwar, A.; Ryu, C. M.

2017-06-01

We have re-visited the existence criterion of the reverse shear Alfven eigenmodes (RSAEs) in the presence of the parallel equilibrium current by numerically solving the eigenvalue equation using a fast eigenvalue solver code KAES. The parallel equilibrium current can bring in the kink effect and is known to be strongly unfavorable for the RSAE. We have numerically estimated the critical value of the toroidicity factor Qtor in a circular tokamak plasma, above which RSAEs can exist, and compared it to the analytical one. The difference between the numerical and analytical critical values is small for low frequency RSAEs, but it increases as the frequency of the mode increases, becoming greater for higher poloidal harmonic modes.
Parallel processing approach to transform-based image coding

NASA Astrophysics Data System (ADS)

Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.

1991-06-01

This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links; each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed, and results are presented with image statistics and data rates. Finally, techniques for accelerating the system throughput are analyzed and results from the application of one such modification are described.
Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

NASA Astrophysics Data System (ADS)

Schultz, A.

2010-12-01

3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. 
We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleaved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
High performance Python for direct numerical simulations of turbulent flows

NASA Astrophysics Data System (ADS)

Mortensen, Mikael; Langtangen, Hans Petter

2016-06-01

Direct Numerical Simulations (DNS) of the Navier Stokes equations are an invaluable research tool in fluid dynamics. Still, there are few publicly available research codes and, due to the heavy number crunching implied, available codes are usually written in low-level languages such as C/C++ or Fortran. In this paper we describe a pure scientific Python pseudo-spectral DNS code that nearly matches the performance of C++ for thousands of processors and billions of unknowns. We also describe a version optimized through Cython, that is found to match the speed of C++. The solvers are written from scratch in Python, both the mesh, the MPI domain decomposition, and the temporal integrators. The solvers have been verified and benchmarked on the Shaheen supercomputer at the KAUST supercomputing laboratory, and we are able to show very good scaling up to several thousand cores. A very important part of the implementation is the mesh decomposition (we implement both slab and pencil decompositions) and 3D parallel Fast Fourier Transforms (FFT). The mesh decomposition and FFT routines have been implemented in Python using serial FFT routines (either NumPy, pyFFTW or any other serial FFT module), NumPy array manipulations and with MPI communications handled by MPI for Python (mpi4py). We show how we are able to execute a 3D parallel FFT in Python for a slab mesh decomposition using 4 lines of compact Python code, for which the parallel performance on Shaheen is found to be slightly better than similar routines provided through the FFTW library. For a pencil mesh decomposition 7 lines of code are required to execute a transform.
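The idea behind a slab-decomposed parallel FFT (transform the locally complete axes, exchange data with a global transpose, then transform the remaining axis) can be shown with a serial emulation. The sketch below is not the authors' code: it is reduced to 2D, replaces NumPy/pyFFTW with a naive DFT, and emulates the mpi4py all-to-all exchange with list operations; every function name is a hypothetical stand-in.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_rows(a, n_ranks):
    """Give each emulated rank a contiguous slab of rows."""
    rows = len(a) // n_ranks
    return [a[r * rows:(r + 1) * rows] for r in range(n_ranks)]

def all_to_all_transpose(slabs):
    """Stand-in for the MPI all-to-all: gather, transpose, redistribute."""
    full = [row for slab in slabs for row in slab]
    transposed = [list(col) for col in zip(*full)]
    return split_rows(transposed, len(slabs))

def slab_fft2(a, n_ranks):
    slabs = split_rows(a, n_ranks)
    # 1) each rank transforms the rows it owns (the locally complete axis)
    slabs = [[dft(row) for row in slab] for slab in slabs]
    # 2) global transpose so that the other axis becomes local
    slabs = all_to_all_transpose(slabs)
    # 3) each rank transforms along the axis that is now local
    slabs = [[dft(row) for row in slab] for slab in slabs]
    # transpose back so the output follows the input layout
    slabs = all_to_all_transpose(slabs)
    return [row for slab in slabs for row in slab]

def direct_fft2(a):
    """Reference 2D DFT evaluated straight from the definition."""
    n0, n1 = len(a), len(a[0])
    return [[sum(a[j0][j1]
                 * cmath.exp(-2j * cmath.pi * (j0 * k0 / n0 + j1 * k1 / n1))
                 for j0 in range(n0) for j1 in range(n1))
             for k1 in range(n1)] for k0 in range(n0)]

data = [[float(4 * i + j) for j in range(4)] for i in range(4)]
distributed = slab_fft2(data, n_ranks=2)
reference = direct_fft2(data)
max_error = max(abs(distributed[i][j] - reference[i][j])
                for i in range(4) for j in range(4))
```

The global transpose in step 2 is the all-to-all communication step; in a real MPI run it is the only part that moves data between processes.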
A Virtual Trip to the Schwarzschild-De Sitter Black Hole

NASA Astrophysics Data System (ADS)

Bakala, Pavel; Hledík, Stanislav; Stuchlík, Zdenĕk; Truparová, Kamila; Čermák, Petr

2008-09-01

We developed a realistic, fully general relativistic computer code for simulation of optical projection in a strong, spherically symmetric gravitational field. Standard theoretical analysis of optical projection for an observer in the vicinity of a Schwarzschild black hole is extended to black hole spacetimes with a repulsive cosmological constant, i.e., Schwarzschild-de Sitter (SdS) spacetimes. The influence of the cosmological constant is investigated for static observers and for observers radially free-falling from the static radius. The simulation includes the effects of gravitational lensing, multiple images, Doppler and gravitational frequency shift, as well as the amplification of intensity. The code generates images of the static observers' sky and movie simulations for radially free-falling observers. Parallel programming techniques are applied to obtain high performance and fast runs of the simulation code.
FastChem: A computer program for efficient complex chemical equilibrium calculations in the neutral/ionized gas phase with applications to stellar and planetary atmospheres

NASA Astrophysics Data System (ADS)

Stock, Joachim W.; Kitzmann, Daniel; Patzer, A. Beate C.; Sedlmayr, Erwin

2018-06-01

For the calculation of complex neutral/ionized gas phase chemical equilibria, we present a semi-analytical, versatile and efficient computer program called FastChem. The applied method is based on the solution of a system of coupled nonlinear (and linear) algebraic equations, namely the law of mass action and the element conservation equations including charge balance, in many variables. Specifically, the system of equations is decomposed into a set of coupled nonlinear equations in one variable each, which are solved analytically whenever feasible to reduce computation time. Notably, the electron density is determined by using the method of Nelder and Mead at low temperatures. The program is written in object-oriented C++, which makes it easy to couple the code with other programs, although a stand-alone version is provided. FastChem can be used in parallel or sequentially and is available under the GNU General Public License version 3 at https://github.com/exoclime/FastChem together with several sample applications. The code has been successfully validated against previous studies and its convergence behavior has been tested even for extreme physical parameter ranges down to 100 K and up to 1000 bar. FastChem converges stably and robustly even in the most demanding chemical situations, which sometimes posed extreme challenges for previous algorithms.
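A toy case shows what solving one of these single-variable equations analytically can look like. The sketch below is not FastChem's algorithm: it assumes a gas containing only H and H2, a mass-action law n_H2 = K n_H^2, and the element conservation n_H + 2 n_H2 = n_total. The function name `atomic_density` and the symbols are illustrative assumptions.

```python
import math


def atomic_density(n_total, k_eq):
    """Solve n_H + 2*k_eq*n_H**2 = n_total for the non-negative root n_H.

    The rationalized form 2*n_total / (1 + sqrt(1 + 8*k_eq*n_total)) avoids
    cancellation and also covers k_eq == 0, where n_H equals n_total.
    """
    disc = math.sqrt(1.0 + 8.0 * k_eq * n_total)
    return 2.0 * n_total / (1.0 + disc)


n_h = atomic_density(n_total=1.0, k_eq=1.0)
n_h2 = 1.0 * n_h ** 2   # law of mass action with k_eq = 1
```

Here the quadratic has a closed-form root, so no iteration is needed; systems with more species per element generally lead to higher-order equations, for which a closed form may not exist.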
Data Acquisition with GPUs: The DAQ for the Muon $g$-$2$ Experiment at Fermilab

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gohn, W.

Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and at relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be used to effectively write code to parallelize such tasks on Nvidia GPUs, providing a significant upgrade in performance over CPU-based acquisition systems. The muon $g$-$2$ experiment at Fermilab relies heavily on GPUs to process its data. The data acquisition system for this experiment must be able to create deadtime-free records from 700 $\mu$s muon spills at a raw data rate of 18 GB per second. Data will be collected using 1296 channels of $\mu$TCA-based 800 MSPS, 12 bit waveform digitizers and processed in a layered array of networked commodity processors with 24 GPUs working in parallel to perform a fast recording of the muon decays during the spill. The described data acquisition system is currently being constructed, and will be fully operational before the start of the experiment in 2017.
Improvements on non-equilibrium and transport Green function techniques: The next-generation TRANSIESTA

NASA Astrophysics Data System (ADS)

Papior, Nick; Lorente, Nicolás; Frederiksen, Thomas; García, Alberto; Brandbyge, Mads

2017-03-01

We present novel methods implemented within the non-equilibrium Green function code (NEGF) TRANSIESTA based on density functional theory (DFT). Our flexible, next-generation DFT-NEGF code handles devices with one or multiple electrodes ($N_e \geq 1$) with individual chemical potentials and electronic temperatures. We describe its novel methods for electrostatic gating, contour optimizations, and assertion of charge conservation, as well as the newly implemented algorithms for optimized and scalable matrix inversion, performance-critical pivoting, and hybrid parallelization. Additionally, a generic NEGF "post-processing" code (TBTRANS/PHTRANS) for electron and phonon transport is presented with several novelties such as Hamiltonian interpolations, $N_e \geq 1$ electrode capability, bond-currents, generalized interface for user-defined tight-binding transport, transmission projection using eigenstates of a projected Hamiltonian, and fast inversion algorithms for large-scale simulations easily exceeding $10^6$ atoms on workstation computers. The new features of both codes are demonstrated and benchmarked for relevant test systems.
Evaluation and application of a fast module in a PLC based interlock and control system

NASA Astrophysics Data System (ADS)

Zaera-Sanz, M.

2009-08-01

The LHC Beam Interlock system requires a controller performing a simple matrix function to collect the different beam dump requests. To satisfy the expected safety level of the Interlock, the system should be robust and reliable. The PLC is a promising candidate to fulfil both aspects but is too slow to meet the expected response time, which is of the order of microseconds. Siemens has introduced a so-called fast module (FM352-5 Boolean Processor). It provides independent and extremely fast control of a process within a larger control system using an onboard processor, a Field Programmable Gate Array (FPGA), to execute code in parallel, which results in extremely fast scan times. It is therefore of interest to investigate its features and to evaluate it as a possible candidate for the beam interlock system. This paper presents the results of this study. The paper may also be useful for other applications requiring fast processing using a PLC.
COLA with scale-dependent growth: applications to screened modified gravity models

NASA Astrophysics Data System (ADS)

Winther, Hans A.; Koyama, Kazuya; Manera, Marc; Wright, Bill S.; Zhao, Gong-Bo

2017-08-01

We present a general parallelized and easy-to-use code to perform numerical simulations of structure formation using the COLA (COmoving Lagrangian Acceleration) method for cosmological models that exhibit scale-dependent growth at the level of first and second order Lagrangian perturbation theory. For modified gravity theories we also include screening using a fast approximate method that covers all the main examples of screening mechanisms in the literature. We test the code by comparing it to full simulations of two popular modified gravity models, namely f(R) gravity and nDGP, and find good agreement in the modified gravity boost-factors relative to ΛCDM even when using a fairly small number of COLA time steps.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Winther, Hans A.; Koyama, Kazuya; Wright, Bill S.

We present a general parallelized and easy-to-use code to perform numerical simulations of structure formation using the COLA (COmoving Lagrangian Acceleration) method for cosmological models that exhibit scale-dependent growth at the level of first and second order Lagrangian perturbation theory. For modified gravity theories we also include screening using a fast approximate method that covers all the main examples of screening mechanisms in the literature. We test the code by comparing it to full simulations of two popular modified gravity models, namely f(R) gravity and nDGP, and find good agreement in the modified gravity boost-factors relative to ΛCDM even when using a fairly small number of COLA time steps.
Block-Parallel Data Analysis with DIY2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morozov, Dmitriy; Peterka, Tom

DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.
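The block abstraction described above can be illustrated with a short, generic sketch. This is not the DIY2 API (DIY2 is a separate library with its own interface); the functions `decompose`, `block_sum`, and `reduce_over_blocks` are hypothetical names, and a thread pool stands in for the runtime that assigns blocks to processing elements.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, nblocks):
    """Split a list into nblocks contiguous blocks of near-equal size."""
    base, extra = divmod(len(data), nblocks)
    blocks, start = [], 0
    for b in range(nblocks):
        size = base + (1 if b < extra else 0)   # spread the remainder evenly
        blocks.append(data[start:start + size])
        start += size
    return blocks


def block_sum(block):
    """Per-block computation; it sees only its own block."""
    return sum(block)


def reduce_over_blocks(data, nblocks, nthreads):
    """Iterate over blocks with a thread pool, then merge the partial results."""
    blocks = decompose(data, nblocks)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(block_sum, blocks))
    return sum(partials)


total = reduce_over_blocks(list(range(100)), nblocks=7, nthreads=3)
```

Because the number of blocks is independent of the number of threads, the same call runs serially with `nthreads=1` or concurrently with more threads, which mirrors the separation of decomposition from execution that the abstract describes.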
Density-based parallel skin lesion border detection with webCL

PubMed Central

2015-01-01

Background: Dermoscopy is a highly effective and noninvasive imaging technique used in the diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand-drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. Methods: A fast density-based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client-side computing capabilities to use available hardware resources such as multiple cores and GPUs. The resulting WebCL-parallel density-based skin lesion border detection method runs efficiently from internet browsers. Results: Previous research indicates that one of the highest accuracy rates can be achieved using density-based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect can be mitigated when they are implemented in parallel. In this study, a density-based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on heterogeneous platforms (e.g. tablets, smartphones, multi-core CPUs, GPUs, and fully integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in clinical settings. For this, we used WebCL, an emerging technology that enables an HTML5 web browser to execute code in parallel on heterogeneous platforms. We describe WebCL and our parallel algorithm design.
In addition, we tested the parallel code on 100 dermoscopy images and measured the execution speedups with respect to the serial version. Results indicate that the parallel (WebCL) and serial versions of the density-based lesion border detection method generate the same accuracy rates for the 100 dermoscopy images: the mean border error is 6.94%, the mean recall is 76.66%, and the mean precision is 99.29%. Moreover, the WebCL version's speedup factor for lesion border detection on the 100 dermoscopy images averages around 491.2. Conclusions: Given the large number of high-resolution dermoscopy images handled in a usual clinical setting, along with the critical importance of early detection and diagnosis of melanoma before metastasis, the importance of fast processing of dermoscopy images becomes obvious. In this paper, we introduce WebCL and its use for biomedical image processing applications. WebCL is a JavaScript binding of OpenCL, which takes advantage of GPU computing from a web browser. Therefore, the WebCL-parallel version of density-based skin lesion border detection introduced in this study can supplement expert dermatologists and aid them in the early diagnosis of skin lesions. While WebCL is currently an emerging technology, full adoption of WebCL into the HTML5 standard would allow this implementation to run on a very large set of hardware and software systems. WebCL takes full advantage of parallel computational resources, including multiple cores and GPUs on a local machine, and allows compiled code to run directly from the web browser. PMID:26423836
Density-based parallel skin lesion border detection with webCL.

PubMed

Lemon, James; Kockara, Sinan; Halic, Tansel; Mete, Mutlu

2015-01-01

Dermoscopy is a highly effective and noninvasive imaging technique used in the diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand-drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. A fast density-based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client-side computing capabilities to use available hardware resources such as multiple cores and GPUs. The resulting WebCL-parallel density-based skin lesion border detection method runs efficiently from internet browsers. Previous research indicates that one of the highest accuracy rates can be achieved using density-based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect can be mitigated when they are implemented in parallel. In this study, a density-based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on heterogeneous platforms (e.g. tablets, smartphones, multi-core CPUs, GPUs, and fully integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in clinical settings. For this, we used WebCL, an emerging technology that enables an HTML5 web browser to execute code in parallel on heterogeneous platforms. We describe WebCL and our parallel algorithm design.
In addition, we tested the parallel code on 100 dermoscopy images and measured the execution speedups with respect to the serial version. Results indicate that the parallel (WebCL) and serial versions of the density-based lesion border detection method generate the same accuracy rates for the 100 dermoscopy images: the mean border error is 6.94%, the mean recall is 76.66%, and the mean precision is 99.29%. Moreover, the WebCL version's speedup factor for lesion border detection on the 100 dermoscopy images averages around 491.2. Given the large number of high-resolution dermoscopy images handled in a usual clinical setting, along with the critical importance of early detection and diagnosis of melanoma before metastasis, the importance of fast processing of dermoscopy images becomes obvious. In this paper, we introduce WebCL and its use for biomedical image processing applications. WebCL is a JavaScript binding of OpenCL, which takes advantage of GPU computing from a web browser. Therefore, the WebCL-parallel version of density-based skin lesion border detection introduced in this study can supplement expert dermatologists and aid them in the early diagnosis of skin lesions. While WebCL is currently an emerging technology, full adoption of WebCL into the HTML5 standard would allow this implementation to run on a very large set of hardware and software systems. WebCL takes full advantage of parallel computational resources, including multiple cores and GPUs on a local machine, and allows compiled code to run directly from the web browser.
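The three accuracy figures quoted above can be computed per image from a detected lesion mask and a reference mask. The sketch below is only one plausible reading: the abstract does not define its metrics, so the XOR-based border error (mismatched pixels divided by the reference area) is an assumption, as are the function name `border_metrics` and the representation of masks as sets of pixel coordinates.

```python
def border_metrics(detected, reference):
    """Return (precision, recall, border_error) for two sets of lesion pixels."""
    tp = len(detected & reference)    # pixels both masks mark as lesion
    fp = len(detected - reference)    # detected but not in the reference
    fn = len(reference - detected)    # in the reference but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Assumed definition: area of the symmetric difference over reference area.
    border_error = (fp + fn) / len(reference)
    return precision, recall, border_error


reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
detected = {(0, 0), (0, 1), (1, 0)}
precision, recall, border_error = border_metrics(detected, reference)
```

In this toy case the detection lies entirely inside the reference but misses one pixel, which gives high precision with lower recall, the same qualitative pattern as the reported means.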
Collisionless stellar hydrodynamics as an efficient alternative to N-body methods

NASA Astrophysics Data System (ADS)

Mitchell, Nigel L.; Vorobyov, Eduard I.; Hensler, Gerhard

2013-01-01

The dominant constituents of the Universe's matter are believed to be collisionless in nature and thus their modelling in any self-consistent simulation is extremely important. For simulations that deal only with dark matter or stellar systems, the conventional N-body technique is fast, memory efficient and relatively simple to implement. However, when extending simulations to include the effects of gas physics, mesh codes are at a distinct disadvantage compared to Smooth Particle Hydrodynamics (SPH) codes. Whereas implementing the N-body approach into SPH codes is fairly trivial, the particle-mesh technique used in mesh codes to couple collisionless stars and dark matter to the gas on the mesh has a series of significant scientific and technical limitations. These include spurious entropy generation resulting from discreteness effects, poor load balancing and increased communication overhead, which spoil the excellent scaling in massively parallel grid codes. In this paper we propose the use of the collisionless Boltzmann moment equations as a means to model the collisionless material as a fluid on the mesh, implementing it into the massively parallel FLASH Adaptive Mesh Refinement (AMR) code. This approach, which we term 'collisionless stellar hydrodynamics', enables us to do away with the particle-mesh approach and, since the parallelization scheme is identical to that used for the hydrodynamics, it preserves the excellent scaling of the FLASH code already demonstrated on peta-flop machines. We find that the classic hydrodynamic equations and the Boltzmann moment equations can be reconciled under specific conditions, allowing us to generate analytic solutions for collisionless systems using conventional test problems. We confirm the validity of our approach using a suite of demanding test problems, including the use of a modified Sod shock test.
By deriving the relevant eigenvalues and eigenvectors of the Boltzmann moment equations, we are able to use high order accurate characteristic tracing methods with Riemann solvers to generate numerical solutions which show excellent agreement with our analytic solutions. We conclude by demonstrating the ability of our code to model complex phenomena by simulating the evolution of a two-armed spiral galaxy whose properties agree with those predicted by the swing amplification theory.
ColDICE: A parallel Vlasov–Poisson solver using moving adaptive simplicial tessellation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sousbie, Thierry, E-mail: tsousbie@gmail.com; Department of Physics, The University of Tokyo, Tokyo 113-0033; Research Center for the Early Universe, School of Science, The University of Tokyo, Tokyo 113-0033

2016-09-15

Numerically solving the Vlasov–Poisson equations for initially cold systems can be reduced to following the evolution of a three-dimensional sheet evolving in six-dimensional phase-space. We describe a public parallel numerical algorithm that represents the phase-space sheet with a conforming, self-adaptive simplicial tessellation whose vertices follow the Lagrangian equations of motion. The algorithm is implemented both in six- and four-dimensional phase-space. Refinement of the tessellation mesh is performed using the bisection method and a local representation of the phase-space sheet at second order relying on additional tracers created when needed at runtime. In order to best preserve the Hamiltonian nature of the system, refinement is anisotropic and constrained by measurements of local Poincaré invariants. The Poisson equation is solved using the fast Fourier method on a regular rectangular grid, similarly to particle-in-cell codes. To compute the density projected onto this grid, the intersection of the tessellation and the grid is calculated using the method of Franklin and Kankanhalli [65–67] generalised to linear order. As preliminary tests of the code, we study in four-dimensional phase-space the evolution of an initially small patch in a chaotic potential and the cosmological collapse of a fluctuation composed of two sinusoidal waves. We also perform a “warm” dark matter simulation in six-dimensional phase-space that we use to check the parallel scaling of the code.
MADNESS: A Multiresolution, Adaptive Numerical Environment for Scientific Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Harrison, Robert J.; Beylkin, Gregory; Bischoff, Florian A.

2016-01-01

MADNESS (multiresolution adaptive numerical environment for scientific simulation) is a high-level software environment for solving integral and differential equations in many dimensions that uses adaptive and fast harmonic analysis methods with guaranteed precision based on multiresolution analysis and separated representations. Underpinning the numerical capabilities is a powerful petascale parallel programming environment that aims to increase both programmer productivity and code scalability. This paper describes the features and capabilities of MADNESS and briefly discusses some current applications in chemistry and several areas of physics.
Parallelized direct execution simulation of message-passing parallel programs

NASA Technical Reports Server (NTRS)

Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

1994-01-01

As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
Parallel Index and Query for Large Scale Data Analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chou, Jerry; Wu, Kesheng; Ruebel, Oliver

2011-07-18

Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to the processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.

Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing

NASA Astrophysics Data System (ADS)

Amooie, M. A.; Moortgat, J.

2017-12-01

We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with a fast quad-core 1.2 GHz ARMv8 64-bit processor, 1 GB of RAM, and a 32 GB microSD card for local storage. The cluster therefore has 128 GB of total RAM distributed across the individual nodes, 4 TB of flash capacity, and 512 processor cores, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between the nodes. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively parallelized, scalable code. We present benchmarking results for the computational performance across varying numbers of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and feasible learning platform for challenging engineering and scientific problems.
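The aggregate figures quoted for the cluster follow directly from the per-node specification. The short calculation below reproduces them; the variable names are illustrative, and the 4 TB figure assumes binary units (1 TB = 1024 GB).

```python
NODES = 128                # Raspberry Pi 3 Model B boards in the cluster
CORES_PER_NODE = 4         # quad-core ARMv8 processor per board
RAM_GB_PER_NODE = 1        # 1 GB of RAM per board
FLASH_GB_PER_NODE = 32     # 32 GB microSD card per board

total_cores = NODES * CORES_PER_NODE
total_ram_gb = NODES * RAM_GB_PER_NODE
total_flash_tb = NODES * FLASH_GB_PER_NODE / 1024   # assumes 1 TB = 1024 GB

print(total_cores, total_ram_gb, total_flash_tb)
```

The count of 512 refers to processor cores, since each of the 128 boards contributes four.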
Next-generation acceleration and code optimization for light transport in turbid media using GPUs

PubMed Central

Alerstam, Erik; Lo, William Chun Yip; Han, Tianyi David; Rose, Jonathan; Andersson-Engels, Stefan; Lilge, Lothar

2010-01-01

A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as an open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml). PMID:21258498
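The effect of accumulating into fast memory before touching slow global memory can be illustrated without a GPU. The sketch below is not GPU-MCML code: it is a serial Python model that assumes a chosen subset of frequently hit bins is cached locally and flushed once at the end, and it simply counts how many updates reach the "global" array. The names `accumulate_direct` and `accumulate_cached` are hypothetical.

```python
def accumulate_direct(deposits, nbins):
    """Every deposit updates the global array (one atomic-like update each)."""
    grid = [0.0] * nbins
    updates = 0
    for bin_index, weight in deposits:
        grid[bin_index] += weight
        updates += 1
    return grid, updates


def accumulate_cached(deposits, nbins, cached_bins):
    """Deposits into cached bins go to a local buffer that is flushed once."""
    grid = [0.0] * nbins
    cache = {b: 0.0 for b in cached_bins}   # stands in for fast shared memory
    updates = 0
    for bin_index, weight in deposits:
        if bin_index in cache:
            cache[bin_index] += weight      # cheap local accumulation
        else:
            grid[bin_index] += weight       # falls through to global memory
            updates += 1
    for bin_index, weight in cache.items(): # single flush per cached bin
        grid[bin_index] += weight
        updates += 1
    return grid, updates


deposits = [(0, 1.0), (0, 2.0), (1, 0.5), (5, 1.0), (0, 1.0)]
direct_grid, direct_updates = accumulate_direct(deposits, nbins=6)
cached_grid, cached_updates = accumulate_cached(deposits, nbins=6, cached_bins={0, 1})
```

Both variants produce the same grid, but the cached one issues fewer global updates whenever deposits cluster in the cached bins. On a real GPU the saving would come from replacing contended global atomics with shared-memory operations, which this serial model does not measure.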
Full-wave simulations of ICRF heating regimes in toroidal plasma with non-Maxwellian distribution functions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bertelli, N.; Valeo, E. J.; Green, D. L.

At the power levels required for significant heating and current drive in magnetically-confined toroidal plasma, modification of the particle distribution function from a Maxwellian shape is likely (Stix 1975 Nucl. Fusion 15 737), with consequent changes in wave propagation and in the location and amount of absorption. In order to study these effects computationally, both the finite-Larmor-radius and the high-harmonic fast wave (HHFW) versions of the full-wave, hot-plasma toroidal simulation code TORIC (Brambilla 1999 Plasma Phys. Control. Fusion 41 1 and Brambilla 2002 Plasma Phys. Control. Fusion 44 2423) have been extended to allow the prescription of arbitrary velocity distributions of the form $f(v_\parallel, v_\perp, \psi, \theta)$. For hydrogen (H) minority heating of a deuterium (D) plasma with anisotropic Maxwellian H distributions, the fractional H absorption varies significantly with changes in parallel temperature but is essentially independent of perpendicular temperature. On the other hand, for the HHFW regime with an anisotropic Maxwellian fast ion distribution, the fractional beam ion absorption varies mainly with changes in the perpendicular temperature. The evaluation of the wave-field and power absorption, through the full wave solver, with the ion distribution function provided by either a Monte-Carlo particle code or a Fokker-Planck code is also examined for Alcator C-Mod and NSTX plasmas. Non-Maxwellian effects generally tend to increase the absorption with respect to the equivalent Maxwellian distribution.
Full-wave simulations of ICRF heating regimes in toroidal plasma with non-Maxwellian distribution functions

NASA Astrophysics Data System (ADS)

Bertelli, N.; Valeo, E. J.; Green, D. L.; Gorelenkova, M.; Phillips, C. K.; Podestà, M.; Lee, J. P.; Wright, J. C.; Jaeger, E. F.

2017-05-01

At the power levels required for significant heating and current drive in magnetically-confined toroidal plasma, modification of the particle distribution function from a Maxwellian shape is likely (Stix 1975 Nucl. Fusion 15 737), with consequent changes in wave propagation and in the location and amount of absorption. In order to study these effects computationally, both the finite-Larmor-radius and the high-harmonic fast wave (HHFW) versions of the full-wave, hot-plasma toroidal simulation code TORIC (Brambilla 1999 Plasma Phys. Control. Fusion 41 1 and Brambilla 2002 Plasma Phys. Control. Fusion 44 2423) have been extended to allow the prescription of arbitrary velocity distributions of the form $f(v_\parallel, v_\perp, \psi, \theta)$. For hydrogen (H) minority heating of a deuterium (D) plasma with anisotropic Maxwellian H distributions, the fractional H absorption varies significantly with changes in parallel temperature but is essentially independent of perpendicular temperature. On the other hand, for the HHFW regime with an anisotropic Maxwellian fast ion distribution, the fractional beam ion absorption varies mainly with changes in the perpendicular temperature. The evaluation of the wave-field and power absorption, through the full wave solver, with the ion distribution function provided by either a Monte-Carlo particle code or a Fokker-Planck code is also examined for Alcator C-Mod and NSTX plasmas. Non-Maxwellian effects generally tend to increase the absorption with respect to the equivalent Maxwellian distribution.
Full-wave simulations of ICRF heating regimes in toroidal plasma with non-Maxwellian distribution functions

DOE PAGES

Bertelli, N.; Valeo, E. J.; Green, D. L.; ...

2017-04-03

At the power levels required for significant heating and current drive in magnetically-confined toroidal plasma, modification of the particle distribution function from a Maxwellian shape is likely (Stix 1975 Nucl. Fusion 15 737), with consequent changes in wave propagation and in the location and amount of absorption. In order to study these effects computationally, both the finite-Larmor-radius and the high-harmonic fast wave (HHFW) versions of the full-wave, hot-plasma toroidal simulation code TORIC (Brambilla 1999 Plasma Phys. Control. Fusion 41 1 and Brambilla 2002 Plasma Phys. Control. Fusion 44 2423) have been extended to allow the prescription of arbitrary velocity distributions of the form f(v(parallel to), v(perpendicular to), psi, theta). For hydrogen (H) minority heating of a deuterium (D) plasma with anisotropic Maxwellian H distributions, the fractional H absorption varies significantly with changes in parallel temperature but is essentially independent of perpendicular temperature. On the other hand, for the HHFW regime with an anisotropic Maxwellian fast ion distribution, the fractional beam ion absorption varies mainly with changes in the perpendicular temperature. The evaluation of the wave-field and power absorption, through the full wave solver, with the ion distribution function provided by either a Monte-Carlo particle code or a Fokker-Planck code is also examined for Alcator C-Mod and NSTX plasmas. Non-Maxwellian effects generally tend to increase the absorption with respect to the equivalent Maxwellian distribution.
Novel 3D Compression Methods for Geometry, Connectivity and Texture

NASA Astrophysics Data System (ADS)

Siddeq, M. M.; Rodrigues, M. A.

2016-06-01

A large number of applications in medical visualization, games, engineering design, entertainment, heritage, e-commerce and so on require the transmission of 3D models over the Internet or over local networks. 3D data compression is an important requirement for fast data storage, access and transmission within bandwidth limitations. The Wavefront OBJ (object) file format is commonly used to share models due to its clear simple design. Normally each OBJ file contains a large amount of data (e.g. vertices and triangulated faces, normals, texture coordinates and other parameters) describing the mesh surface. In this paper we introduce a new method to compress geometry, connectivity and texture coordinates by a novel Geometry Minimization Algorithm (GM-Algorithm) in connection with arithmetic coding. First, each vertex ( x, y, z) coordinates are encoded to a single value by the GM-Algorithm. Second, triangle faces are encoded by computing the differences between two adjacent vertex locations, which are compressed by arithmetic coding together with texture coordinates. We demonstrate the method on large data sets achieving compression ratios between 87 and 99 % without reduction in the number of reconstructed vertices and triangle faces. The decompression step is based on a Parallel Fast Matching Search Algorithm (Parallel-FMS) to recover the structure of the 3D mesh. A comparative analysis of compression ratios is provided with a number of commonly used 3D file formats such as VRML, OpenCTM and STL highlighting the performance and effectiveness of the proposed method.
The language parallel Pascal and other aspects of the massively parallel processor

NASA Technical Reports Server (NTRS)

Reeves, A. P.; Bruner, J. D.

1982-01-01

A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
Code Parallelization with CAPO: A User Manual

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

2001-01-01

A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.
Higher order Larmor radius corrections to guiding-centre equations and application to fast ion equilibrium distributions

NASA Astrophysics Data System (ADS)

Lanthaler, S.; Pfefferlé, D.; Graves, J. P.; Cooper, W. A.

2017-04-01

An improved set of guiding-centre equations, expanded to one order higher in Larmor radius than usually written for guiding-centre codes, is derived for curvilinear flux coordinates and implemented into the orbit following code VENUS-LEVIS. Aside from greatly improving the correspondence between guiding-centre and full particle trajectories, the most important effect of the additional Larmor radius corrections is to modify the definition of the guiding-centre’s parallel velocity via the so-called Baños drift. The correct treatment of the guiding-centre push-forward with the Baños term leads to an anisotropic shift in the phase-space distribution of guiding-centres, consistent with the well-known magnetization term. The consequences of these higher order terms are quantified in three cases where energetic ions are usually followed with standard guiding-centre equations: (1) neutral beam injection in a MAST-like low aspect-ratio spherical equilibrium where the fast ion driven current is significantly larger with respect to previous calculations, (2) fast ion losses due to resonant magnetic perturbations where a lower lost fraction and a better confinement is confirmed, (3) alpha particles in the ripple field of the European DEMO where the effect is found to be marginal.
Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

NASA Technical Reports Server (NTRS)

OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)

1998-01-01

This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Coupled Kinetic-MHD Simulations of Divertor Heat Load with ELM Perturbations

NASA Astrophysics Data System (ADS)

Cummings, Julian; Chang, C. S.; Park, Gunyoung; Sugiyama, Linda; Pankin, Alexei; Klasky, Scott; Podhorszki, Norbert; Docan, Ciprian; Parashar, Manish

2010-11-01

The effect of Type-I ELM activity on divertor plate heat load is a key component of the DOE OFES Joint Research Target milestones for this year. In this talk, we present simulations of kinetic edge physics, ELM activity, and the associated divertor heat loads in which we couple the discrete guiding-center neoclassical transport code XGC0 with the nonlinear extended MHD code M3D using the End-to-end Framework for Fusion Integrated Simulations, or EFFIS. In these coupled simulations, the kinetic code and the MHD code run concurrently on the same massively parallel platform and periodic data exchanges are performed using a memory-to-memory coupling technology provided by EFFIS. The M3D code models the fast ELM event and sends frequent updates of the magnetic field perturbations and electrostatic potential to XGC0, which in turn tracks particle dynamics under the influence of these perturbations and collects divertor particle and energy flux statistics. We describe here how EFFIS technologies facilitate these coupled simulations and discuss results for DIII-D, NSTX and Alcator C-Mod tokamak discharges.
Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations

NASA Astrophysics Data System (ADS)

Detrixhe, Miles; Gibou, Frédéric

2016-10-01

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
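For orientation, the serial baseline that parallel fast sweeping methods build on can be sketched in a few lines of Python. This is a minimal illustration, not the hybrid parallel algorithm of the paper: it assumes the simplest static Hamilton-Jacobi problem (the eikonal equation |grad u| = 1 on a uniform 2D grid), and the function name `fast_sweep_eikonal` and its parameters are hypothetical. Gauss-Seidel updates are applied in four alternating sweep orderings so that information can propagate along characteristics in every direction.

```python
import math

def fast_sweep_eikonal(n, sources, h=1.0, n_iters=2):
    """Solve |grad u| = 1 on an n x n grid with u = 0 at the source cells.

    Serial first-order fast sweeping: Gauss-Seidel updates applied in four
    alternating sweep orderings. Illustrative sketch only.
    """
    inf = math.inf
    u = [[inf] * n for _ in range(n)]
    fixed = set(sources)
    for i, j in fixed:
        u[i][j] = 0.0
    fwd = range(n)
    bwd = range(n - 1, -1, -1)
    orderings = [(fwd, fwd), (fwd, bwd), (bwd, fwd), (bwd, bwd)]
    for _ in range(n_iters):
        for rows, cols in orderings:
            for i in rows:
                for j in cols:
                    if (i, j) in fixed:
                        continue
                    # smallest neighbour value along each grid axis (upwind choice)
                    a = min(u[i - 1][j] if i > 0 else inf,
                            u[i + 1][j] if i < n - 1 else inf)
                    b = min(u[i][j - 1] if j > 0 else inf,
                            u[i][j + 1] if j < n - 1 else inf)
                    if a == inf and b == inf:
                        continue  # no information has reached this cell yet
                    if abs(a - b) >= h:
                        cand = min(a, b) + h  # one-sided update
                    else:
                        # two-sided update: root of (u - a)^2 + (u - b)^2 = h^2
                        cand = 0.5 * (a + b + math.sqrt(2.0 * h * h - (a - b) ** 2))
                    if cand < u[i][j]:
                        u[i][j] = cand
    return u

u = fast_sweep_eikonal(5, [(0, 0)])
print(u[0][4], u[1][1])
```

Each update depends on freshly updated neighbours within the same sweep, which is the data dependency that makes a straightforward parallelization of the method difficult.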
Seismic imaging using finite-differences and parallel computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ober, C.C.

1997-12-31

A key to reducing the risks and costs associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Prestack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar wave equation using finite differences. As part of an ongoing ACTI project funded by the US Department of Energy, a finite difference, 3-D prestack, depth migration code has been developed. The goal of this work is to demonstrate that massively parallel computers can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite difference, prestack, depth migration practical for oil and gas exploration. Several problems had to be addressed to get an efficient code for the Intel Paragon. These include efficient I/O, efficient parallel tridiagonal solves, and high single-node performance. Furthermore, to provide portable code the author has been restricted to the use of high-level programming languages (C and Fortran) and interprocessor communications using MPI. He has been using the SUNMOS operating system, which has affected many of his programming decisions. He will present images created from two verification datasets (the Marmousi Model and the SEG/EAEG 3D Salt Model). Also, he will show recent images from real datasets, and point out locations of improved imaging. Finally, he will discuss areas of current research which will hopefully improve the image quality and reduce computational costs.
Distributed multitasking ITS with PVM

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fan, W.C.; Halbleib, J.A. Sr.

1995-12-31

Advances in computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable to or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generated in a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of the MCNP code on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electron/photon transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load-balancing schemes for homogeneous and heterogeneous networks.
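The argument that independent histories parallelize well, and that more histories reduce the statistical uncertainty, can be illustrated with a small Python sketch. It is a toy under stated assumptions and has nothing to do with the ITS, MCNP or PVM code: the physics is reduced to transmission through a purely absorbing slab, the names `run_batch` and `estimate_transmission` are hypothetical, and one seed per batch is an assumed seeding scheme. Each batch owns a private random stream, so batches share no state and only the final tallies are merged.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def run_batch(seed, n_histories, thickness=1.0):
    """Follow n_histories independent particles through a purely absorbing slab.

    Returns the number transmitted. The distance to the first collision is
    sampled from an exponential distribution (units of mean free paths).
    """
    rng = random.Random(seed)  # private stream: no shared state between batches
    transmitted = 0
    for _ in range(n_histories):
        if rng.expovariate(1.0) > thickness:
            transmitted += 1
    return transmitted

def estimate_transmission(n_workers, histories_per_worker, thickness=1.0):
    """Split the histories into independent batches and merge the tallies."""
    seeds = range(1, n_workers + 1)  # assumed seeding scheme: one seed per batch
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = list(pool.map(
            lambda s: run_batch(s, histories_per_worker, thickness), seeds))
    n_total = n_workers * histories_per_worker
    p = sum(counts) / n_total
    std_err = math.sqrt(p * (1.0 - p) / n_total)  # binomial standard error
    return p, std_err

p, err = estimate_transmission(4, 20000)
print(p, err, math.exp(-1.0))  # estimate, its standard error, analytic value
```

Threads are used here only to keep the sketch self-contained; in CPython they give no speedup for CPU-bound work, so a real run would place each batch in a separate process or on a separate machine. The standard error scales as one over the square root of the total number of histories, so quadrupling the histories roughly halves it.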
Fast GPU-based Monte Carlo code for SPECT/CT reconstructions generates improved 177Lu images.

PubMed

Rydén, T; Heydorn Lagerlöf, J; Hemmingsson, J; Marin, I; Svensson, J; Båth, M; Gjertsson, P; Bernhardt, P

2018-01-04

Full Monte Carlo (MC)-based SPECT reconstructions have a strong potential for correcting for image degrading factors, but the reconstruction times are long. The objective of this study was to develop a highly parallel Monte Carlo code for fast, ordered subset expectation maximum (OSEM) reconstructions of SPECT/CT images. The MC code was written in the Compute Unified Device Architecture language for a computer with four graphics processing units (GPUs) (GeForce GTX Titan X, Nvidia, USA). This enabled simulations of parallel photon emissions from the voxel matrix (128^3 or 256^3). Each computed tomography (CT) number was converted to attenuation coefficients for photo absorption, coherent scattering, and incoherent scattering. For photon scattering, the deflection angle was determined by the differential scattering cross sections. An angular response function was developed and used to model the accepted angles for photon interaction with the crystal, and a detector scattering kernel was used for modeling the photon scattering in the detector. Predefined energy and spatial resolution kernels for the crystal were used. The MC code was implemented in the OSEM reconstruction of clinical and phantom 177Lu SPECT/CT images. The Jaszczak image quality phantom was used to evaluate the performance of the MC reconstruction in comparison with attenuation-corrected (AC) OSEM reconstructions and attenuation-corrected OSEM reconstructions with resolution recovery corrections (RRC). The performance of the MC code was 3200 million photons/s. The required number of photons emitted per voxel to obtain a sufficiently low noise level in the simulated image was 200 for a 128^3 voxel matrix. With this number of emitted photons/voxel, the MC-based OSEM reconstruction with ten subsets was performed within 20 s/iteration. The images converged after around six iterations. Therefore, the reconstruction time was around 3 min. The activity recovery for the spheres in the Jaszczak phantom was clearly improved with MC-based OSEM reconstruction, e.g., the activity recovery was 88% for the largest sphere, while it was 66% for AC-OSEM and 79% for RRC-OSEM. The GPU-based MC code generated an MC-based SPECT/CT reconstruction within a few minutes, and reconstructed patient images of 177Lu-DOTATATE treatments revealed clearly improved resolution and contrast.
cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing.

PubMed

Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro

2016-01-01

Next-generation sequencing can determine DNA bases and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format and the compressed binary version (BAM) of it. SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and can execute fast. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. For the accumulation of next-generation sequencing data, a simple parallelization program, which can support cloud and PC cluster environments, is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB

NASA Technical Reports Server (NTRS)

Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.

2017-01-01

Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
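Why the Jacobian counts as embarrassingly parallel is easy to see in a finite-difference sketch: every column needs only one extra function evaluation and is independent of all the others. The Python below is an illustration of that structure, not the authors' MATLAB spmd implementation; the names `jacobian_column` and `numerical_jacobian`, the forward-difference scheme and the step size `eps` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def jacobian_column(f, x, f0, j, eps=1e-6):
    """Forward-difference estimate of column j, i.e. dF_i/dx_j for all i."""
    xp = list(x)
    xp[j] += eps
    f1 = f(xp)
    return [(a - b) / eps for a, b in zip(f1, f0)]

def numerical_jacobian(f, x, n_workers=4, eps=1e-6):
    """Compute each column independently and assemble J[i][j] = dF_i/dx_j."""
    f0 = f(x)  # baseline residual, evaluated once and shared by all columns
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        cols = list(pool.map(
            lambda j: jacobian_column(f, x, f0, j, eps), range(len(x))))
    # cols[j][i] holds dF_i/dx_j; transpose to the usual row-major layout
    return [list(row) for row in zip(*cols)]

def residual(x):
    """Small nonlinear test system, used only for demonstration."""
    return [x[0] ** 2 + x[1], x[0] * x[1]]

J = numerical_jacobian(residual, [1.0, 2.0])
print(J)  # analytic Jacobian at (1, 2) is [[2, 1], [2, 1]]
```

The thread pool only marks where the independent work units are. Whether distributing them pays off depends on the cost of one residual evaluation relative to the scheduling and data-transfer overhead, which is consistent with the sub-linear speedup reported above.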
Jdpd: an open java simulation kernel for molecular fragment dissipative particle dynamics.

PubMed

van den Broek, Karina; Kuhn, Hubert; Zielesny, Achim

2018-05-21

Jdpd is an open Java simulation kernel for Molecular Fragment Dissipative Particle Dynamics with parallelizable force calculation, efficient caching options and fast property calculations. It is characterized by an interface and factory-pattern driven design for simple code changes and may help to avoid problems of polyglot programming. Detailed input/output communication, parallelization and process control as well as internal logging capabilities for debugging purposes are supported. The new kernel may be utilized in different simulation environments ranging from flexible scripting solutions up to fully integrated "all-in-one" simulation systems.
MADNESS: A Multiresolution, Adaptive Numerical Environment for Scientific Simulation

DOE PAGES

Harrison, Robert J.; Beylkin, Gregory; Bischoff, Florian A.; ...

2016-01-01

We present MADNESS (multiresolution adaptive numerical environment for scientific simulation) that is a high-level software environment for solving integral and differential equations in many dimensions that uses adaptive and fast harmonic analysis methods with guaranteed precision that are based on multiresolution analysis and separated representations. Underpinning the numerical capabilities is a powerful petascale parallel programming environment that aims to increase both programmer productivity and code scalability. This paper describes the features and capabilities of MADNESS and briefly discusses some current applications in chemistry and several areas of physics.
Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems

DTIC Science & Technology

1993-01-01

The particles represent the so-called " Dark Matter " eral times the the full message latency for every remote which is believed to dominate the mass...Katz, L. Herquist, and D. H. Weinberg. Galax- R. Acad. Sci. Paris, 306(I):739-742, 1988. ies and gas in a cold dark matter universe. Ap. J., [111 A...57(3):566-569, 1983. dark matter halos. Ap. J., 378:496, 1991. [30] G. Pedrizzetti. Insight into singular vortex flows. [17] C. C. Dyer and P. S. S

Parallel Fast Multipole Method For Molecular Dynamics

DTIC Science & Technology

2007-06-01

Parallel Fast Multipole Method For Molecular Dynamics THESIS Reid G. Ormseth, Captain, USAF AFIT/GAP/ENP/07-J02 DEPARTMENT OF THE AIR FORCE AIR...the United States Government. AFIT/GAP/ENP/07-J02 Parallel Fast Multipole Method For Molecular Dynamics THESIS Presented to the Faculty Department of...has also been provided by ‘The Art of Molecular Dynamics Simulation ’ by Dennis Rapaport. This work is the clearest treatment of the Fast Multipole
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG

NASA Astrophysics Data System (ADS)

Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.

2012-12-01

The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. 
RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement through GPU technology. This large number of independent cases will allow us to take full advantage of the computational power of the latest GPUs, ensuring that all thread cores in the GPU remain active, a key criterion for obtaining significant speedup. The CUDA (Compute Unified Device Architecture) Fortran compiler developed by PGI and Nvidia will allow us to construct this parallel implementation on the GPU while remaining in the Fortran language. This implementation will scale very well across various CUDA-supported GPUs such as the recently released Fermi Nvidia cards. We will present the computational speed improvements of the GPU-compatible code relative to the standard CPU-based RRTMG with respect to a very large and diverse suite of atmospheric profiles. This suite will also be utilized to demonstrate the minimal impact of the code restructuring on the accuracy of radiation calculations. The GPU-compatible version of RRTMG will be directly applicable to future versions of GEOS-5, but it is also likely to provide significant associated benefits for other GCMs that employ RRTMG.
A fast parallel clustering algorithm for molecular simulation trajectories.

PubMed

Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui

2013-01-15

We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilized the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. © 2012 Wiley Periodicals, Inc.
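One common way to combine greedy k-centers with the triangle inequality is sketched below in serial Python. It is an assumption about the general technique, not the authors' GPU code: the function name `k_centers`, the choice of the first point as the initial center and the Euclidean metric are all illustrative. When a new center is added, a point x currently assigned to center c can be skipped if d(c, new) >= 2 d(x, c), because the triangle inequality then guarantees d(x, new) >= d(x, c).

```python
import math

def k_centers(points, k, dist=math.dist):
    """Greedy farthest-first k-centers with triangle-inequality pruning.

    Returns (center indices, assignment of each point to a center slot,
    distance of each point to its center, number of skipped distance
    evaluations).
    """
    n = len(points)
    centers = [0]  # arbitrary first center (assumption of this sketch)
    assign = [0] * n
    d_near = [dist(p, points[0]) for p in points]
    skipped = 0
    while len(centers) < k:
        # the point farthest from its current center becomes the next center
        new = max(range(n), key=d_near.__getitem__)
        d_cc = [dist(points[new], points[c]) for c in centers]
        new_id = len(centers)
        centers.append(new)
        for i in range(n):
            # d(x, new) >= d(c, new) - d(x, c) >= d(x, c) when d(c, new) >= 2 d(x, c)
            if d_cc[assign[i]] >= 2.0 * d_near[i]:
                skipped += 1
                continue
            d = dist(points[i], points[new])
            if d < d_near[i]:
                d_near[i] = d
                assign[i] = new_id
    return centers, assign, d_near, skipped

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assign, d_near, skipped = k_centers(pts, 2)
print(centers, assign, skipped)
```

Each added center costs one pass over the points, which matches the linear dependence of the running time on the number of cluster centers stated in the abstract. The inner loop over points has no cross-iteration dependencies, so it is the natural part to map onto GPU threads.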
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu; University of California Santa Barbara, Santa Barbara, CA, 93106; Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
Speed and accuracy improvements in FLAASH atmospheric correction of hyperspectral imagery

NASA Astrophysics Data System (ADS)

Perkins, Timothy; Adler-Golden, Steven; Matthew, Michael W.; Berk, Alexander; Bernstein, Lawrence S.; Lee, Jamine; Fox, Marsha

2012-11-01

Remotely sensed spectral imagery of the earth's surface can be used to fullest advantage when the influence of the atmosphere has been removed and the measurements are reduced to units of reflectance. Here, we provide a comprehensive summary of the latest version of the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes atmospheric correction algorithm. We also report some new code improvements for speed and accuracy. These include the re-working of the original algorithm in C-language code parallelized with message passing interface and containing a new radiative transfer look-up table option, which replaces executions of the MODTRAN model. With computation times now as low as ~10 s per image per computer processor, automated, real-time, on-board atmospheric correction of hyper- and multi-spectral imagery is within reach.
CRUNCH_PARALLEL

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shumaker, Dana E.; Steefel, Carl I.

The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL (UCRL-CODE-200063). Crunch is a general purpose reactive transport code developed by Carl Steefel and Yabusaki (Steefel and Yabusaki 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation reactions as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.
Crustal origin of trench-parallel shear-wave fast polarizations in the Central Andes

NASA Astrophysics Data System (ADS)

Wölbern, I.; Löbl, U.; Rümpker, G.

2014-04-01

In this study, SKS and local S phases are analyzed to investigate variations of shear-wave splitting parameters along two dense seismic profiles across the central Andean Altiplano and Puna plateaus. In contrast to previous observations, the vast majority of the measurements reveal fast polarizations sub-parallel to the subduction direction of the Nazca plate with delay times between 0.3 and 1.2 s. Local phases show larger variations of fast polarizations and exhibit delay times ranging between 0.1 and 1.1 s. Two 70 km and 100 km wide sections along the Altiplano profile exhibit larger delay times and are characterized by fast polarizations oriented sub-parallel to major fault zones. Based on finite-difference wavefield calculations for anisotropic subduction zone models we demonstrate that the observations are best explained by fossil slab anisotropy with fast symmetry axes oriented sub-parallel to the slab movement in combination with a significant component of crustal anisotropy of nearly trench-parallel fast-axis orientation. From the modeling we exclude a sub-lithospheric origin of the observed strong anomalies due to the short-scale variations of the fast polarizations. Instead, our results indicate that anisotropy in the Central Andes generally reflects the direction of plate motion while the observed trench-parallel fast polarizations likely originate in the continental crust above the subducting slab.
Performance Analysis and Optimization on the UCLA Parallel Atmospheric General Circulation Model Code

NASA Technical Reports Server (NTRS)

Lou, John; Ferraro, Robert; Farrara, John; Mechoso, Carlos

1996-01-01

An analysis is presented of several factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on massively parallel computer systems. Several modifications to the original parallel AGCM code aimed at improving its numerical efficiency, interprocessor communication cost, load balance, and issues affecting single-node code performance are discussed.
Global Magnetohydrodynamic Simulation Using High Performance FORTRAN on Parallel Computers

NASA Astrophysics Data System (ADS)

Ogino, T.

High Performance Fortran (HPF) is one of the modern and common techniques for achieving high-performance parallel computation. We have translated a 3-dimensional magnetohydrodynamic (MHD) simulation code of the Earth's magnetosphere from VPP Fortran to HPF/JA on the Fujitsu VPP5000/56 vector-parallel supercomputer; the MHD code was fully vectorized and fully parallelized in VPP Fortran. The overall performance and capability of the HPF MHD code proved to be almost comparable to that of the VPP Fortran code. A 3-dimensional global MHD simulation of the Earth's magnetosphere was performed at a speed of over 400 Gflops with an efficiency of 76.5% on the VPP5000/56 in vector and parallel computation, which permitted comparison with catalog values. We have concluded that fluid and MHD codes that are fully vectorized and fully parallelized in VPP Fortran can be translated with relative ease to HPF/JA, and a code in HPF/JA may be expected to perform comparably to the same code written in VPP Fortran.
Parallel and pipeline computation of fast unitary transforms

NASA Technical Reports Server (NTRS)

Fino, B. J.; Algazi, V. R.

1975-01-01

The letter discusses the parallel and pipeline organization of fast-unitary-transform algorithms such as the fast Fourier transform, and points out the efficiency of a combined parallel-pipeline processor for a transform such as the Haar transform, in which 2^n - 1 hardware 'butterflies' generate a transform of order 2^n every computation cycle.
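As a rough illustration of the butterfly count quoted in this record, the following Python sketch implements a textbook orthonormal fast Haar transform in software and counts the two-input butterflies it performs. The function name `fast_haar` and the 1/sqrt(2) normalization are assumptions made for this example; it does not reproduce the hardware organization described in the letter, it only checks that a transform of order 2^n uses 2^n - 1 butterflies.

```python
import math


def fast_haar(x):
    """Orthonormal fast Haar transform of a sequence whose length is 2**n.

    Returns (coefficients, butterfly_count). Each butterfly maps a pair
    (a, b) to a scaled sum and a scaled difference.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in x]
    scale = 1.0 / math.sqrt(2.0)
    butterflies = 0
    length = n
    while length > 1:
        half = length // 2
        sums = [0.0] * half
        diffs = [0.0] * half
        for i in range(half):
            a, b = out[2 * i], out[2 * i + 1]
            sums[i] = (a + b) * scale   # low-pass output, processed again
            diffs[i] = (a - b) * scale  # high-pass output, final coefficient
            butterflies += 1
        out[:half] = sums
        out[half:length] = diffs
        length = half                   # next stage works on the sums only
    return out, butterflies
```

For an input of length 8 (n = 3) the stages use 4, 2, and 1 butterflies, 7 in total. In a pipelined hardware design each stage would be a separate bank of butterflies, which is one way to read the claim of one transform per computation cycle.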
Integrated modeling applications for tokamak experiments with OMFIT

NASA Astrophysics Data System (ADS)

Meneghini, O.; Smith, S. P.; Lao, L. L.; Izacard, O.; Ren, Q.; Park, J. M.; Candy, J.; Wang, Z.; Luna, C. J.; Izzo, V. A.; Grierson, B. A.; Snyder, P. B.; Holland, C.; Penna, J.; Lu, G.; Raum, P.; McCubbin, A.; Orlov, D. M.; Belli, E. A.; Ferraro, N. M.; Prater, R.; Osborne, T. H.; Turnbull, A. D.; Staebler, G. M.

2015-08-01

One modeling framework for integrated tasks (OMFIT) is a comprehensive integrated modeling framework which has been developed to enable physics codes to interact in complicated workflows, and support scientists at all stages of the modeling cycle. The OMFIT development follows a unique bottom-up approach, where the framework design and capabilities organically evolve to support progressive integration of the components that are required to accomplish physics goals of increasing complexity. OMFIT provides a workflow for easily generating full kinetic equilibrium reconstructions that are constrained by magnetic and motional Stark effect measurements, and kinetic profile information that includes fast-ion pressure modeled by a transport code. It was found that magnetic measurements can be used to quantify the amount of anomalous fast-ion diffusion that is present in DIII-D discharges, and provide an estimate that is consistent with what would be needed for transport simulations to match the measured neutron rates. OMFIT was used to streamline edge-stability analyses, and evaluate the effect of resonant magnetic perturbation (RMP) on the pedestal stability, which have been found to be consistent with the experimental observations. The development of a five-dimensional numerical fluid model for estimating the effects of the interaction between magnetohydrodynamic (MHD) and microturbulence, and its systematic verification against analytic models was also supported by the framework. OMFIT was used for optimizing an innovative high-harmonic fast wave system proposed for DIII-D. For a parallel refractive index $n_{\parallel} > 3$, the conditions for strong electron-Landau damping were found to be independent of launched $n_{\parallel}$ and poloidal angle. OMFIT has been the platform of choice for developing a neural-network based approach to efficiently perform a non-linear multivariate regression of local transport fluxes as a function of local dimensionless parameters.
Transport predictions for thousands of DIII-D discharges showed excellent agreement with the power balance calculations across the whole plasma radius and over a broad range of operating regimes. Concerning predictive transport simulations, the framework made possible the design and automation of a workflow that enables self-consistent predictions of kinetic profiles and the plasma equilibrium. It is found that the feedback between the transport fluxes and plasma equilibrium can significantly affect the kinetic profile predictions. Such a rich set of results provides tangible evidence of how bottom-up approaches can potentially provide a fast track to integrated modeling solutions that are functional, cost-effective, and in sync with the research effort of the community.
Exploiting Symmetry on Parallel Architectures.

NASA Astrophysics Data System (ADS)

Stiller, Lewis Benjamin

1995-01-01

This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry-exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and it discovered a number of results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.
Utilizing GPUs to Accelerate Turbomachinery CFD Codes

NASA Technical Reports Server (NTRS)

MacCalla, Weylin; Kulkarni, Sameer

2016-01-01

GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. Rewriting and restructuring of the source code was avoided to limit the introduction of new bugs. The code was profiled and investigated for parallelization potential, then OpenACC directives were used to indicate parallel parts of the code. The use of OpenACC directives was not able to reduce the runtime of APNASA on either the NVIDIA Tesla discrete graphics card, or the AMD accelerated processing unit. Additionally, it was found that in order to justify the use of GPGPU, the amount of parallel work being done within a kernel would have to greatly exceed the work being done by any one portion of the APNASA code. It was determined that in order for an application like APNASA to be accelerated on the GPU, it should not be modular in nature, and the parallel portions of the code must contain a large portion of the code's computation time.
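The conclusion that the parallel regions must account for a large share of the runtime can be illustrated with an Amdahl-style estimate. The sketch below is a simplified model, not the authors' analysis: `parallel_fraction`, `kernel_speedup`, and `transfer_overhead` (host-device data movement expressed as a fraction of the original runtime) are hypothetical parameters introduced for this example.

```python
def offload_speedup(parallel_fraction, kernel_speedup, transfer_overhead=0.0):
    """Estimate whole-program speedup after offloading one region.

    parallel_fraction: share of the original runtime spent in the region.
    kernel_speedup:    how much faster that region runs on the accelerator.
    transfer_overhead: added data-movement time, as a fraction of the
                       original runtime (an assumption of this toy model).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0 or transfer_overhead < 0.0:
        raise ValueError("kernel_speedup must be > 0 and overhead >= 0")
    # Normalized new runtime: untouched serial part + accelerated part + transfers.
    new_time = ((1.0 - parallel_fraction)
                + parallel_fraction / kernel_speedup
                + transfer_overhead)
    return 1.0 / new_time
```

With invented numbers, `offload_speedup(0.9, 10.0)` gives roughly 5.3, whereas `offload_speedup(0.2, 10.0, 0.2)` falls slightly below 1, meaning a net slowdown. Under this model, a code whose runtime is spread over many small regions gains little from accelerating any single one of them.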
A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin 2000 Using the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry

1998-01-01

Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using parallelization tools and compilers. In this paper, we compare the performance of the handwritten NAS Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools, an interactive computer-aided parallelization tool that generates message-passing code; 2) the Portland Group's HPF compiler; and 3) compiler directives with the native FORTRAN77 compiler on the SGI Origin2000.
Overview of the NCC

NASA Technical Reports Server (NTRS)

Liu, Nan-Suey

2001-01-01

A multi-disciplinary design/analysis tool for combustion systems is critical for optimizing the low-emission, high-performance combustor design process. Based on discussions between then NASA Lewis Research Center and the jet engine companies, an industry-government team was formed in early 1995 to develop the National Combustion Code (NCC), which is an integrated system of computer codes for the design and analysis of combustion systems. NCC has advanced features that address the need to meet designer's requirements such as "assured accuracy", "fast turnaround", and "acceptable cost". The NCC development team is comprised of Allison Engine Company (Allison), CFD Research Corporation (CFDRC), GE Aircraft Engines (GEAE), NASA Glenn Research Center (LeRC), and Pratt & Whitney (P&W). The "unstructured mesh" capability and "parallel computing" are fundamental features of NCC from its inception. The NCC system is composed of a set of "elements" which includes grid generator, main flow solver, turbulence module, turbulence and chemistry interaction module, chemistry module, spray module, radiation heat transfer module, data visualization module, and a post-processor for evaluating engine performance parameters. Each element may have contributions from several team members. Such a multi-source multi-element system needs to be integrated in a way that facilitates inter-module data communication, flexibility in module selection, and ease of integration. The development of the NCC beta version was essentially completed in June 1998. Technical details of the NCC elements are given in the Reference List. Elements such as the baseline flow solver, turbulence module, and the chemistry module, have been extensively validated; and their parallel performance on large-scale parallel systems has been evaluated and optimized. 
However, the scalar PDF module and the Spray module, as well as their coupling with the baseline flow solver, were developed in a small-scale distributed computing environment. As a result, the validation of the NCC beta version as a whole was quite limited. Current effort has been focused on the validation of the integrated code and the evaluation/optimization of its overall performance on large-scale parallel systems.
A note on parallel and pipeline computation of fast unitary transforms

NASA Technical Reports Server (NTRS)

Fino, B. J.; Algazi, V. R.

1974-01-01

The parallel and pipeline organization of fast unitary transform algorithms such as the Fast Fourier Transform is discussed. The efficiency of a combined parallel-pipeline processor is pointed out for a transform such as the Haar transform, in which 2^n - 1 hardware butterflies generate a transform of order 2^n every computation cycle.
National Combustion Code Parallel Performance Enhancements

NASA Technical Reports Server (NTRS)

Quealy, Angela; Benyo, Theresa (Technical Monitor)

2002-01-01

The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.
The novel high-performance 3-D MT inverse solver

NASA Astrophysics Data System (ADS)

Kruglyakov, Mikhail; Geraskin, Alexey; Kuvshinov, Alexey

2016-04-01

We present a novel, robust, scalable, and fast 3-D magnetotelluric (MT) inverse solver. The solver is written in a multi-language paradigm to make it as efficient, readable and maintainable as possible. The concepts of separation of concerns and single responsibility run through the implementation of the solver. As a forward modelling engine, the modern scalable solver extrEMe, based on the contracting integral equation approach, is used. An iterative gradient-type (quasi-Newton) optimization scheme is invoked to search for the (regularized) inverse problem solution, and the adjoint source approach is used to calculate the gradient of the misfit efficiently. The inverse solver is able to deal with highly detailed and contrasting models, allows for working (separately or jointly) with any type of MT responses, and supports massive parallelization. Moreover, different parallelization strategies implemented in the code allow optimal usage of available computational resources for a given problem statement. To parameterize the inverse domain, the so-called mask parameterization is implemented, which means that one can merge any subset of forward modelling cells in order to account for the (usually) irregular distribution of observation sites. We report results of 3-D numerical experiments aimed at analysing the robustness, performance and scalability of the code. In particular, our computational experiments carried out on different platforms ranging from modern laptops to the HPC system Piz Daint (6th supercomputer in the world) demonstrate practically linear scalability of the code up to thousands of nodes.
Nonlinear ELM simulations based on a nonideal peeling–ballooning model using the BOUT++ code

DOE PAGES

Xu, X. Q.; Dudson, B. D.; Snyder, P. B.; ...

2011-09-23

A minimum set of equations based on the peeling–ballooning (P–B) model with nonideal physics effects (diamagnetic drift, E × B drift, resistivity and anomalous electron viscosity) is found to simulate pedestal collapse when using the BOUT++ simulation code, developed in part from the original fluid edge code BOUT. Linear simulations of P–B modes find good agreement in growth rate and mode structure with ELITE calculations. The influence of the E × B drift, diamagnetic drift, resistivity, anomalous electron viscosity, ion viscosity and parallel thermal diffusivity on P–B modes is being studied; we find that (1) the diamagnetic drift and E × B drift stabilize the P–B mode in a manner consistent with theoretical expectations; (2) resistivity destabilizes the P–B mode, leading to resistive P–B mode; (3) anomalous electron and parallel ion viscosities destabilize the P–B mode, leading to a viscous P–B mode; (4) perpendicular ion viscosity and parallel thermal diffusivity stabilize the P–B mode. With addition of the anomalous electron viscosity under the assumption that the anomalous kinematic electron viscosity is comparable to the anomalous electron perpendicular thermal diffusivity, or the Prandtl number is close to unity, it is found from nonlinear simulations using a realistic high Lundquist number that the pedestal collapse is limited to the edge region and the ELM size is about 5–10% of the pedestal stored energy. Furthermore, this is consistent with many observations of large ELMs. The estimated island size is consistent with the size of fast pedestal pressure collapse. In the stable α-zones of ideal P–B modes, nonlinear simulations of viscous ballooning modes or current-diffusive ballooning mode (CDBM) for ITER H-mode scenarios are presented.
Overcoming Challenges in Kinetic Modeling of Magnetized Plasmas and Vacuum Electronic Devices

NASA Astrophysics Data System (ADS)

Omelchenko, Yuri; Na, Dong-Yeop; Teixeira, Fernando

2017-10-01

We transform the state of the art of plasma modeling by taking advantage of novel computational techniques for fast and robust integration of multiscale hybrid (full particle ions, fluid electrons, no displacement current) and full-PIC models. These models are implemented in the 3D HYPERS and axisymmetric full-PIC CONPIC codes. HYPERS is a massively parallel, asynchronous code. The HYPERS solver does not step fields and particles synchronously in time but instead executes local variable updates (events) at their self-adaptive rates while preserving fundamental conservation laws. The charge-conserving CONPIC code has a matrix-free explicit finite-element (FE) solver based on a sparse-approximate inverse (SPAI) algorithm. This explicit solver approximates the inverse FE system matrix ("mass" matrix) using successive sparsity pattern orders of the original matrix. It does not reduce the set of Maxwell's equations to a vector-wave (curl-curl) equation of second order but instead utilizes the standard coupled first-order Maxwell's system. We discuss the ability of our codes to accurately and efficiently account for multiscale physical phenomena in 3D magnetized space and laboratory plasmas and axisymmetric vacuum electronic devices.

Fast data preprocessing with Graphics Processing Units for inverse problem solving in light-scattering measurements

NASA Astrophysics Data System (ADS)

Derkachov, G.; Jakubczyk, T.; Jakubczyk, D.; Archer, J.; Woźniak, M.

2017-07-01

Utilising the Compute Unified Device Architecture (CUDA) platform for Graphics Processing Units (GPUs) enables a significant reduction of computation time at a moderate cost, by means of parallel computing. In the paper [Jakubczyk et al., Opto-Electron. Rev., 2016] we reported using a GPU for Mie scattering inverse problem solving (up to 800-fold speed-up). Here we report the development of two subroutines utilising the GPU at data preprocessing stages for the inversion procedure: (i) a subroutine, based on ray tracing, for finding the spherical aberration correction function; (ii) a subroutine performing the conversion of an image to a 1D distribution of light intensity versus azimuth angle (i.e. scattering diagram), fed from a movie-reading CPU subroutine running in parallel. All subroutines are incorporated in the PikeReader application, which we make available in a GitHub repository. PikeReader returns a sequence of intensity distributions versus a common azimuth angle vector, corresponding to the recorded movie. We obtained an overall ∼400-fold speed-up of calculations at data preprocessing stages using CUDA codes running on the GPU in comparison to single-thread MATLAB-only code running on the CPU.
Parallelization of ARC3D with Computer-Aided Tools

NASA Technical Reports Server (NTRS)

Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

1998-01-01

A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SGI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, CAPTools. The steps involved in parallelizing this code and the requirements for achieving better performance are discussed. The generated parallel version has achieved reasonably good performance, for example, a speedup of 30 on 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving the serial code and performing the necessary code transformations are important parts of the automated parallelization process, although user intervention in many of these parts is still necessary. Nevertheless, development and improvement of useful software tools, such as CAPTools, can help trim down many tedious parallelization details and improve the processing efficiency.
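The reported speedup of 30 on 36 processors can be converted into two standard scalability metrics. The following Python sketch computes the parallel efficiency and the Karp-Flatt experimentally determined serial fraction; the function names are chosen for this example, and the calculation is an interpretation of the single quoted data point, not a result from the report.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    if processors < 1 or speedup <= 0:
        raise ValueError("need processors >= 1 and speedup > 0")
    return speedup / processors


def karp_flatt(speedup, processors):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p)."""
    if processors <= 1 or speedup <= 0:
        raise ValueError("need processors > 1 and speedup > 0")
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)
```

For a speedup of 30 on 36 processors, `parallel_efficiency(30, 36)` is about 0.83 and `karp_flatt(30, 36)` is 1/175, or about 0.0057. In other words, the measured run behaves as if roughly 0.6% of the work were serial or overhead.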
Support for Debugging Automatically Parallelized Programs

NASA Technical Reports Server (NTRS)

Hood, Robert; Jost, Gabriele

2001-01-01

This viewgraph presentation provides information on support available for debugging automatically parallelized computer programs. CAPTools, a support tool developed at the University of Greenwich, transforms, with user guidance, existing sequential Fortran code into parallel message-passing code. Comparison routines are then run for debugging purposes, in essence ensuring that the code transformation was accurate.
PARAMESH: A Parallel Adaptive Mesh Refinement Community Toolkit

NASA Technical Reports Server (NTRS)

MacNeice, Peter; Olson, Kevin M.; Mobarry, Clark; deFainchtein, Rosalinda; Packer, Charles

1999-01-01

In this paper, we describe a community toolkit which is designed to provide parallel support with adaptive mesh capability for a large and important class of computational models, those using structured, logically cartesian meshes. The package of Fortran 90 subroutines, called PARAMESH, is designed to provide an application developer with an easy route to extend an existing serial code which uses a logically cartesian structured mesh into a parallel code with adaptive mesh refinement. Alternatively, in its simplest use, and with minimal effort, it can operate as a domain decomposition tool for users who want to parallelize their serial codes, but who do not wish to use adaptivity. The package can provide them with an incremental evolutionary path for their code, converting it first to uniformly refined parallel code, and then later if they so desire, adding adaptivity.
Fundamentals, current state of the development of, and prospects for further improvement of the new-generation thermal-hydraulic computational HYDRA-IBRAE/LM code for simulation of fast reactor systems

NASA Astrophysics Data System (ADS)

Alipchenkov, V. M.; Anfimov, A. M.; Afremov, D. A.; Gorbunov, V. S.; Zeigarnik, Yu. A.; Kudryavtsev, A. V.; Osipov, S. L.; Mosunova, N. A.; Strizhov, V. F.; Usov, E. V.

2016-02-01

The conceptual fundamentals of the development of the new-generation system thermal-hydraulic computational HYDRA-IBRAE/LM code are presented. The code is intended to simulate the thermal-hydraulic processes that take place in the loops and the heat-exchange equipment of liquid-metal cooled fast reactor systems under normal operation and anticipated operational occurrences and during accidents. The paper provides a brief overview of Russian and foreign system thermal-hydraulic codes for modeling liquid-metal coolants and gives grounds for the necessity of development of a new-generation HYDRA-IBRAE/LM code. Considering the specific engineering features of the nuclear power plants (NPPs) equipped with the BN-1200 and the BREST-OD-300 reactors, the processes and the phenomena are singled out that require a detailed analysis and development of the models to be correctly described by the system thermal-hydraulic code in question. Information on the functionality of the computational code is provided, viz., the thermal-hydraulic two-phase model, the properties of the sodium and the lead coolants, the closing equations for simulation of the heat-mass exchange processes, the models to describe the processes that take place during the steam-generator tube rupture, etc. The article gives a brief overview of the usability of the computational code, including a description of the support documentation and the supply package, as well as the possibility of taking advantage of modern computer technologies, such as parallel computations. The paper shows the current state of verification and validation of the computational code; it also presents information on the principles of constructing and populating the verification matrices for the BREST-OD-300 and the BN-1200 reactor systems. The prospects are outlined for further development of the HYDRA-IBRAE/LM code, introduction of new models into it, and enhancement of its usability.
It is shown that the program of development and practical application of the code will make it possible, in the near future, to carry out computations that analyze the safety of potential NPP projects at a qualitatively higher level.
Collisional tests and an extension of the TEMPEST continuum gyrokinetic code

NASA Astrophysics Data System (ADS)

Cohen, R. H.; Dorr, M.; Hittinger, J.; Kerbel, G.; Nevins, W. M.; Rognlien, T.; Xiong, Z.; Xu, X. Q.

2006-04-01

An important requirement of a kinetic code for edge plasmas is the ability to accurately treat the effect of collisions over a broad range of collisionalities. To test the interaction of collisions and parallel streaming, TEMPEST has been compared with published analytic and numerical (Monte Carlo, bounce-averaged Fokker-Planck) results for endloss of particles confined by combined electrostatic and magnetic wells. Good agreement is found over a wide range of collisionality, confining potential and mirror ratio, and the required velocity space resolution is modest. We also describe progress toward extension of (4-dimensional) TEMPEST into a "kinetic edge transport code" (a kinetic counterpart of UEDGE). The extension includes averaging of the gyrokinetic equations over fast timescales and approximating the averaged quadratic terms by diffusion terms which respect the boundaries of inaccessible regions in phase space. F. Najmabadi, R.W. Conn and R.H. Cohen, Nucl. Fusion 24, 75 (1984); T.D. Rognlien and T.A. Cutler, Nucl. Fusion 20, 1003 (1980).
PARAVT: Parallel Voronoi tessellation code

NASA Astrophysics Data System (ADS)

González, R. E.

2016-10-01

In this study, we present a new open source code for massively parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is aimed at astrophysical applications, where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes; however, no open source parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented with MPI, and the VT is computed using the Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes the neighbor list, Voronoi density, Voronoi cell volume, and density gradient for each particle, as well as densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.
Thermal-hydraulic posttest analysis for the ANL/MCTF 360° model heat-exchanger water test under mixed convection. [LMFBR]

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, C.I.; Sha, W.T.; Kasza, K.E.

As a result of the uncertainties in the understanding of the influence of thermal-buoyancy effects on the flow and heat transfer in Liquid Metal Fast Breeder Reactor heat exchangers and steam generators under off-normal operating conditions, an extensive experimental program is being conducted at Argonne National Laboratory to eliminate these uncertainties. Concurrently, a parallel analytical effort is also being pursued to develop a three-dimensional transient computer code (COMMIX-IHX) to study and predict heat exchanger performance under mixed, forced, and free convection conditions. This paper presents computational results from a heat exchanger simulation and compares them with the results from a test case exhibiting strong thermal buoyancy effects. Favorable agreement between experiment and code prediction is obtained.
Tuning iteration space slicing based tiled multi-core code implementing Nussinov's RNA folding.

PubMed

Palkowski, Marek; Bielecki, Wlodzimierz

2018-01-15

RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithm is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov's recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov's RNA folding. Such techniques are within the iteration space slicing framework: the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem in generating parallel tiled code is defining a proper tile size and tile dimension, which impact parallelism degree and code locality. To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (parameters are variables defining tile size). For this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure and then derive a general affine model, which describes all integer factors available in expressions of those codes. Using this model and known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find unknown integers in this model for each integer factor available in the same fixed tiled code position and replace in this code expressions, including integer factors, with those including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover in a given search space the best tile size and tile dimension maximizing target code performance.
For a given search space, the presented approach allows us to choose the best tile size and tile dimension in parallel tiled code implementing Nussinov's RNA folding. Experimental results, received on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is bigger than 2500.
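For orientation, Nussinov's recurrence itself can be written as a short serial dynamic program. The Python sketch below is a plain, untiled textbook version, not the tiled parallel code studied in the paper; the function name `nussinov`, the set of allowed pairs (Watson-Crick plus G-U wobble), and the `min_loop` parameter are assumptions of this example.

```python
def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence.

    Two positions i < j may pair only if j - i > min_loop and the bases
    form one of the allowed pairs below.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # table[i][j] holds the best score for the subsequence seq[i..j];
    # entries with i >= j stay 0.
    table = [[0] * n for _ in range(n)]
    for span in range(1, n):              # fill diagonals of increasing width
        for i in range(n - span):
            j = i + span
            best = max(table[i + 1][j], table[i][j - 1])  # i or j unpaired
            if span > min_loop and (seq[i], seq[j]) in pairs:
                best = max(best, table[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):     # split into two independent parts
                best = max(best, table[i][k] + table[k + 1][j])
            table[i][j] = best
    return table[0][n - 1]
```

The triple loop over `span`, `i`, and `k` has affine bounds, which is the property that lets polyhedral tools represent and tile its iteration space. The inner loop over `k` makes the cost cubic in the sequence length, which is why locality and parallelism matter for long strands.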
Parallel Semi-Implicit Spectral Element Atmospheric Model

NASA Astrophysics Data System (ADS)

Fournier, A.; Thomas, S.; Loft, R.

2001-05-01

The shallow-water equations (SWE) have long been used to test atmospheric-modeling numerical methods. The SWE contain essential wave-propagation and nonlinear effects of more complete models. We present a semi-implicit (SI) improvement of the Spectral Element Atmospheric Model to solve the SWE (SEAM, Taylor et al. 1997, Fournier et al. 2000, Thomas & Loft 2000). SE methods are h-p finite element methods combining the geometric flexibility of size-h finite elements with the accuracy of degree-p spectral methods. Our work suggests that exceptional parallel-computation performance is achievable by a General-Circulation-Model (GCM) dynamical core, even at modest climate-simulation resolutions (>1°). The code derivation involves weak variational formulation of the SWE, Gauss(-Lobatto) quadrature over the collocation points, and Legendre cardinal interpolators. Appropriate weak variation yields a symmetric positive-definite Helmholtz operator. To meet the Ladyzhenskaya-Babuska-Brezzi inf-sup condition and avoid spurious modes, we use a staggered grid. The SI scheme combines leapfrog and Crank-Nicolson schemes for the nonlinear and linear terms respectively. The localization of operations to elements ideally fits the method to cache-based microprocessor computer architectures: derivatives are computed as collections of small (8x8), naturally cache-blocked matrix-vector products. SEAM also has desirable boundary-exchange communication, like finite-difference models. Timings on the IBM SP and Compaq ES40 supercomputers indicate that the SI code (20-min timestep) requires 1/3 the CPU time of the explicit code (2-min timestep) for T42 resolutions. Both codes scale nearly linearly out to 400 processors. We achieved single-processor performance up to 30% of peak for both codes on the 375-MHz IBM Power-3 processors. Fast computation and linear scaling lead to a useful climate-simulation dycore only if enough model time is computed per unit wall-clock time.
An efficient SI solver is essential to substantially increase this rate. Parallel preconditioning for an iterative conjugate-gradient elliptic solver is described. We are building a GCM dycore capable of 200 GFLOPS sustained performance on clustered RISC/cache architectures using hybrid MPI/OpenMP programming.
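The timing comparison quoted in this abstract implies a per-step cost that is easy to work out. The following is only an illustrative sketch using the figures given above (20-min and 2-min timesteps, a total CPU-time ratio of 1/3); the function name is hypothetical and is not part of SEAM.

```python
def per_step_cost_ratio(dt_si_min, dt_explicit_min, cpu_time_ratio):
    """Cost of one semi-implicit (SI) step relative to one explicit step.

    cpu_time_ratio is (SI total CPU time) / (explicit total CPU time)
    for the same simulated interval.
    """
    # The explicit code takes this many steps for every SI step.
    steps_ratio = dt_si_min / dt_explicit_min
    # total_SI / total_explicit = (cost_SI / cost_explicit) / steps_ratio
    return cpu_time_ratio * steps_ratio


ratio = per_step_cost_ratio(20.0, 2.0, 1.0 / 3.0)
print(f"one SI step costs about {ratio:.2f} explicit steps")
```

Under these numbers each SI step may cost roughly three explicit steps, presumably dominated by the elliptic solve, yet the tenfold longer timestep still yields a net saving.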
Implementation of a 3D mixing layer code on parallel computers

NASA Technical Reports Server (NTRS)

Roe, K.; Thakur, R.; Dang, T.; Bogucz, E.

1995-01-01

This paper summarizes our progress and experience in the development of a Computational-Fluid-Dynamics code on parallel computers to simulate three-dimensional spatially-developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers. The code was then converted for use on parallel computers using the conventional message-passing technique, while we have not been able to compile the code with the present version of HPF compilers.
Modelling of the EAST lower-hybrid current drive experiment using GENRAY/CQL3D and TORLH/CQL3D

NASA Astrophysics Data System (ADS)

Yang, C.; Bonoli, P. T.; Wright, J. C.; Ding, B. J.; Parker, R.; Shiraiwa, S.; Li, M. H.

2014-12-01

The coupled GENRAY-CQL3D code has been used to do systematic ray-tracing and Fokker-Planck analysis for EAST Lower Hybrid wave Current Drive (LHCD) experiments. Despite being in the weak absorption regime, the experimental level of LH current drive is successfully simulated by taking into account the variations in the parallel wavenumber due to the toroidal effect. The effect of radial transport of the fast LH electrons in EAST has also been studied, which shows that a modest amount of radial transport diffusion can redistribute the fast LH current significantly. Taking advantage of the new capability in GENRAY, the actual Scrape Off Layer (SOL) model with magnetic field, density, temperature, and geometry is included in the simulation for both the lower and the higher density cases, so that the collisional losses of Lower Hybrid Wave (LHW) power in the SOL have been accounted for, which together with fast electron losses can reproduce the LHCD experimental observations in different discharges of EAST. We have also analyzed EAST discharges where there is a significant ohmic contribution to the total current, and good agreement with experiment in terms of total current has been obtained. Also, the full-wave code TORLH has been used for the simulation of the LH physics in EAST, including full-wave effects such as diffraction and focusing which may also play an important role in bridging the spectral gap. The comparisons between the GENRAY and the TORLH codes are done for both the Maxwellian and the quasi-linear electron Landau damping cases. These simulations represent an important addition to the validation studies of the GENRAY-CQL3D and TORLH models being used in weak absorption scenarios of tokamaks with large aspect ratio.
Fast encryption of RGB color digital images using a tweakable cellular automaton based schema

NASA Astrophysics Data System (ADS)

Faraoun, Kamel Mohamed

2014-12-01

We propose a new tweakable construction of block ciphers using second-order reversible cellular automata, and we apply it to encipher RGB-colored images. The proposed construction permits parallel encryption of the image content by extending the standard definition of a block cipher to take into account a supplementary parameter used as a tweak (nonce) to control the behavior of the cipher from one region of the image to the other, and hence avoids the need for slow sequential modes of operation. The proposed construction defines a flexible pseudorandom permutation that can be used effectively to solve the electronic code book problem without the need for a specific sequential mode. Results from various experiments show that the proposed schema achieves high security and execution performance, and enables an interesting mode of selective area decryption due to the parallel character of the approach.
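To illustrate why a second-order cellular automaton is reversible and how a tweak can make each block independent, here is a toy sketch. It is not the construction from the paper and is not secure: the rule selection (`key ^ tweak`), the 8-cell state, and all function names are assumptions made purely for illustration.

```python
def step(prev, curr, rule):
    """One second-order update: next[i] = f(neighbourhood of curr at i) XOR prev[i]."""
    n = len(curr)
    nxt = []
    for i in range(n):
        # 3-cell neighbourhood with periodic boundaries, packed into 0..7.
        idx = (curr[(i - 1) % n] << 2) | (curr[i] << 1) | curr[(i + 1) % n]
        f = (rule >> idx) & 1  # elementary-CA rule table lookup
        nxt.append(f ^ prev[i])
    return curr, nxt


def evolve(pair, rule, rounds):
    prev, curr = pair
    for _ in range(rounds):
        prev, curr = step(prev, curr, rule)
    return prev, curr


def toy_encrypt(pair, key, tweak, rounds=8):
    # Assumption: key and tweak are mixed into an 8-bit rule number.
    return evolve(pair, (key ^ tweak) & 0xFF, rounds)


def toy_decrypt(pair, key, tweak, rounds=8):
    # Swapping the two configurations runs the same automaton backwards.
    prev, curr = pair
    a, b = evolve((curr, prev), (key ^ tweak) & 0xFF, rounds)
    return b, a


block = ([1, 0, 1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 1, 0, 0, 1])
for tweak in range(4):  # e.g. one tweak per image region
    assert toy_decrypt(toy_encrypt(block, 0x5A, tweak), 0x5A, tweak) == block
```

Because each block depends only on the key and its own tweak, blocks can be processed in any order or concurrently, and a single region can be decrypted without touching the rest.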
State of the art in electromagnetic modeling for the Compact Linear Collider

DOE Office of Scientific and Technical Information (OSTI.GOV)

Candel, Arno; Kabel, Andreas; Lee, Lie-Quan

SLAC's Advanced Computations Department (ACD) has developed the parallel 3D electromagnetic time-domain code T3P for simulations of wakefields and transients in complex accelerator structures. T3P is based on state-of-the-art Finite Element methods on unstructured grids and features unconditional stability, quadratic surface approximation and up to 6th-order vector basis functions for unprecedented simulation accuracy. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with fast turn-around times, aiding the design of the next generation of accelerator facilities. Applications include simulations of the proposed two-beam accelerator structures for the Compact Linear Collider (CLIC) - wakefield damping in the Power Extraction and Transfer Structure (PETS) and power transfer to the main beam accelerating structures are investigated.
Raytracing and Direct-Drive Targets

NASA Astrophysics Data System (ADS)

Schmitt, Andrew J.; Bates, Jason; Fyfe, David; Eimerl, David

2013-10-01

Accurate simulation of the effects of laser imprinting and drive asymmetries in directly driven targets requires the ability to distinguish between raytrace noise and the intensity structure produced by the spatial and temporal incoherence of optical smoothing. We have developed and implemented a smoother raytrace algorithm for our mpi-parallel radiation hydrodynamics code, FAST3D. The underlying approach is to connect the rays into either sheets (in 2D) or volume-enclosing chunks (in 3D) so that the absorbed energy distribution continuously covers the propagation area illuminated by the laser. We will describe the status and show the different scalings encountered in 2D and 3D problems as the computational size, parallelization strategy, and number of rays is varied. Finally, we show results using the method in current NIKE experimental target simulations and in proposed symmetric and polar direct-drive target designs. Supported by US DoE/NNSA.
National Combustion Code: Parallel Implementation and Performance

NASA Technical Reports Server (NTRS)

Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.

2000-01-01

The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
Global MHD simulation of magnetosphere using HPF

NASA Astrophysics Data System (ADS)

Ogino, T.

We have translated a 3-dimensional magnetohydrodynamic (MHD) simulation code of the Earth's magnetosphere from VPP Fortran to HPF/JA on the Fujitsu VPP5000/56 vector-parallel supercomputer and the MHD code was fully vectorized and fully parallelized in VPP Fortran. The entire performance and capability of the HPF MHD code could be shown to be almost comparable to that of VPP Fortran. A 3-dimensional global MHD simulation of the earth's magnetosphere was performed at a speed of over 400 Gflops with an efficiency of 76.5% using 56 PEs of Fujitsu VPP5000/56 in vector and parallel computation that permitted comparison with catalog values. We have concluded that fluid and MHD codes that are fully vectorized and fully parallelized in VPP Fortran can be translated with relative ease to HPF/JA, and a code in HPF/JA may be expected to perform comparably to the same code written in VPP Fortran.
Global Hybrid Simulation of Alfvenic Waves Associated with Magnetotail Reconnection and Fast Flows

NASA Astrophysics Data System (ADS)

Cheng, L.; Lin, Y.; Wang, X.; Perez, J. D.

2017-12-01

Alfvenic fluctuations have been observed near the magnetotail plasma sheet boundary layer associated with fast flows. In this presentation, we use the Auburn 3-D Global Hybrid code (ANGIE3D) to investigate the generation and propagation of Alfvenic waves in the magnetotail. Shear Alfven waves and kinetic Alfven waves (KAWs) are found to be generated in magnetic reconnection in the plasma sheet as well as in the dipole-like field region of the magnetosphere, carrying Poynting flux along magnetic field lines toward the ionosphere, and the wave structure is strongly altered by the flow braking in the tail. The 3-D structure of the wave electromagnetic field and the associated parallel currents in reconnection and the dipole-like field region is presented. The Alfvenic waves exhibit a turbulence spectrum. The roles of these Alfvenic waves in ion heating is discussed.
Fast parallel approach for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2009-12-01

Two-dimensional fast Gabor transform algorithms are useful for real-time applications due to the high computational complexity of the traditional 2-D complex-valued discrete Gabor transform (CDGT). This paper presents two block time-recursive algorithms for 2-D DHT-based real-valued discrete Gabor transform (RDGT) and its inverse transform and develops a fast parallel approach for the implementation of the two algorithms. The computational complexity of the proposed parallel approach is analyzed and compared with that of the existing 2-D CDGT algorithms. The results indicate that the proposed parallel approach is attractive for real time image processing.
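For readers unfamiliar with the discrete Hartley transform (DHT) that underlies the real-valued Gabor transform, the sketch below shows the 1-D DHT with its real cas kernel and its self-inverse property. This is a direct O(N^2) illustration only, not the block time-recursive or parallel algorithm of the paper; the function names are hypothetical.

```python
import math


def dht(x):
    """Discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*n*k/N),
    where cas(t) = cos(t) + sin(t). Real input gives real output."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        acc = 0.0
        for n, xn in enumerate(x):
            angle = 2.0 * math.pi * n * k / n_pts
            acc += xn * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out


def idht(h):
    """The DHT is its own inverse up to a factor of 1/N."""
    return [v / len(h) for v in dht(h)]


signal = [1.0, 2.0, 0.5, -1.0, 3.0]
recovered = idht(dht(signal))
print(all(abs(a - b) < 1e-9 for a, b in zip(signal, recovered)))
```

Staying in real arithmetic throughout is the usual motivation for Hartley-based variants over complex-valued transforms.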
A current drive by using the fast wave in frequency range higher than two times lower hybrid resonance frequency on tokamaks

NASA Astrophysics Data System (ADS)

Kim, Sun Ho; Hwang, Yong Seok; Jeong, Seung Ho; Wang, Son Jong; Kwak, Jong Gu

2017-10-01

An efficient current drive scheme in the central or off-axis region is required for the steady state operation of tokamak fusion reactors. Current drive using the fast wave in the frequency range above twice the lower hybrid resonance frequency (ω > 2ω_LH) could be such a scheme in high density, high temperature reactor-grade tokamak plasmas. First, it has a relatively higher electric field parallel to the magnetic field, which favors current generation, compared with fast waves in other frequency ranges. Second, it can penetrate deeply into high density plasmas compared with the slow wave in the same frequency range. Third, parasitic coupling to the slow wave can also contribute to the current drive while avoiding the parametric instability, thermal mode conversion and ion heating that occur in the frequency range ω < 2ω_LH. In this study, the propagation boundary, accessibility, and the energy flow of the fast wave are given via the cold dispersion relation and group velocity. The power absorption and current drive efficiency are discussed qualitatively through the hot dispersion relation and the polarization. Finally, those characteristics are confirmed with the ray tracing code GENRAY for KSTAR plasmas.

Digital tomosynthesis mammography using a parallel maximum-likelihood reconstruction method

NASA Astrophysics Data System (ADS)

Wu, Tao; Zhang, Juemin; Moore, Richard; Rafferty, Elizabeth; Kopans, Daniel; Meleis, Waleed; Kaeli, David

2004-05-01

A parallel reconstruction method, based on an iterative maximum likelihood (ML) algorithm, is developed to provide fast reconstruction for digital tomosynthesis mammography. Tomosynthesis mammography acquires 11 low-dose projections of a breast by moving an x-ray tube over a 50° angular range. In parallel reconstruction, each projection is divided into multiple segments along the chest-to-nipple direction. Using the 11 projections, segments located at the same distance from the chest wall are combined to compute a partial reconstruction of the total breast volume. The shape of the partial reconstruction forms a thin slab, angled toward the x-ray source at a projection angle 0°. The reconstruction of the total breast volume is obtained by merging the partial reconstructions. The overlap region between neighboring partial reconstructions and neighboring projection segments is utilized to compensate for the incomplete data at the boundary locations present in the partial reconstructions. A serial execution of the reconstruction is compared to a parallel implementation, using clinical data. The serial code was run on a PC with a single PentiumIV 2.2GHz CPU. The parallel implementation was developed using MPI and run on a 64-node Linux cluster using 800MHz Itanium CPUs. The serial reconstruction for a medium-sized breast (5cm thickness, 11cm chest-to-nipple distance) takes 115 minutes, while a parallel implementation takes only 3.5 minutes. The reconstruction time for a larger breast using a serial implementation takes 187 minutes, while a parallel implementation takes 6.5 minutes. No significant differences were observed between the reconstructions produced by the serial and parallel implementations.
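The segment decomposition with overlap described above can be sketched as a simple index calculation. This is a minimal illustration, assuming the projection is split by detector rows along the chest-to-nipple direction; the function name, row count, and overlap width are hypothetical and not taken from the paper.

```python
def split_with_overlap(n_rows, n_segments, overlap):
    """Return (start, stop) row ranges that cover 0..n_rows, each widened
    by `overlap` rows on every interior side."""
    base = n_rows // n_segments
    extra = n_rows % n_segments
    ranges = []
    start = 0
    for s in range(n_segments):
        # Spread any remainder rows over the first `extra` segments.
        size = base + (1 if s < extra else 0)
        stop = start + size
        ranges.append((max(0, start - overlap), min(n_rows, stop + overlap)))
        start = stop
    return ranges


print(split_with_overlap(100, 4, 5))
```

Each worker would reconstruct its own slab from its widened range, and the overlapping rows supply the boundary data that a strict partition would lack.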
Legacy Code Modernization

NASA Technical Reports Server (NTRS)

Hribar, Michelle R.; Frumkin, Michael; Jin, Haoqiang; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

1998-01-01

Over the past decade, high performance computing has evolved rapidly; systems based on commodity microprocessors have been introduced in quick succession from at least seven vendors/families. Porting codes to every new architecture is a difficult problem; in particular, here at NASA, there are many large CFD applications that are very costly to port to new machines by hand. The LCM ("Legacy Code Modernization") Project is the development of an integrated parallelization environment (IPE) which performs the automated mapping of legacy CFD (Fortran) applications to state-of-the-art high performance computers. While most projects to port codes focus on the parallelization of the code, we consider porting to be an iterative process consisting of several steps: 1) code cleanup, 2) serial optimization, 3) parallelization, 4) performance monitoring and visualization, 5) intelligent tools for automated tuning using performance prediction, and 6) machine specific optimization. The approach for building this parallelization environment is to build the components for each of the steps simultaneously and then integrate them together. The demonstration will exhibit our latest research in building this environment: 1. Parallelizing tools and compiler evaluation. 2. Code cleanup and serial optimization using automated scripts. 3. Development of a code generator for performance prediction. 4. Automated partitioning. 5. Automated insertion of directives. These demonstrations will exhibit the effectiveness of an automated approach for all the steps involved with porting and tuning a legacy code application for a new architecture.
Force user's manual: A portable, parallel FORTRAN

NASA Technical Reports Server (NTRS)

Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.

1990-01-01

The use of Force, a parallel, portable FORTRAN on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, Alliant computers on which it is installed.
TOUGH2_MP: A parallel version of TOUGH2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Wu, Yu-Shu; Ding, Chris

2003-04-09

TOUGH2_MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large simulation problems that may not be solved by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme, while preserving the full capacity and flexibility of the original TOUGH2 code. The new software uses the METIS software package for grid partitioning and AZTEC software package for linear-equation solving. The standard message-passing interface is adopted for communication among processors. Numerical performance of the current version code has been tested on CRAY-T3E and IBM RS/6000 SP platforms. In addition, the parallel code has been successfully applied to real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we will review the development of TOUGH2_MP, and discuss the basic features, modules, and their applications.
A DAFT DL_POLY distributed memory adaptation of the Smoothed Particle Mesh Ewald method

NASA Astrophysics Data System (ADS)

Bush, I. J.; Todorov, I. T.; Smith, W.

2006-09-01

The Smoothed Particle Mesh Ewald method [U. Essmann, L. Perera, M.L. Berkowitz, T. Darden, H. Lee, L.G. Pedersen, J. Chem. Phys. 103 (1995) 8577] for calculating long ranged forces in molecular simulation has been adapted for the parallel molecular dynamics code DL_POLY_3 [I.T. Todorov, W. Smith, Philos. Trans. Roy. Soc. London 362 (2004) 1835], making use of a novel 3D Fast Fourier Transform (DAFT) [I.J. Bush, The Daresbury Advanced Fourier transform, Daresbury Laboratory, 1999] that perfectly matches the Domain Decomposition (DD) parallelisation strategy [W. Smith, Comput. Phys. Comm. 62 (1991) 229; M.R.S. Pinches, D. Tildesley, W. Smith, Mol. Sim. 6 (1991) 51; D. Rapaport, Comput. Phys. Comm. 62 (1991) 217] of the DL_POLY_3 code. In this article we describe software adaptations undertaken to import this functionality and provide a review of its performance.
Multitasking TORT under UNICOS: Parallel performance models and measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnett, A.; Azmy, Y.Y.

1999-09-27

The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The parallel performance models were compared with measurements from applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Fast quantum Monte Carlo on a GPU

NASA Astrophysics Data System (ADS)

Lutsyshyn, Y.

2015-02-01

We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
streamgap-pepper: Effects of peppering streams with many small impacts

NASA Astrophysics Data System (ADS)

Bovy, Jo; Erkal, Denis; Sanders, Jason

2017-02-01

streamgap-pepper computes the effect of subhalo fly-bys on cold tidal streams based on the action-angle representation of streams. A line-of-parallel-angle approach is used to calculate the perturbed distribution function of a given stream segment by undoing the effect of all impacts. This approach allows one to compute the perturbed stream density and track in any coordinate system in minutes for realizations of the subhalo distribution down to 10^5 Msun, accounting for the stream's internal dispersion and overlapping impacts. This code uses galpy (ascl:1411.008) and the streampepperdf.py galpy extension, which implements the fast calculation of the perturbed stream structure.
The role of current sheet formation in driven plasmoid reconnection in laser-produced plasma bubbles

NASA Astrophysics Data System (ADS)

Lezhnin, Kirill; Fox, William; Bhattacharjee, Amitava

2017-10-01

We conduct a multiparametric study of driven magnetic reconnection relevant to recent experiments on colliding magnetized laser produced plasmas using the PIC code PSC. Varying the background plasma density, plasma resistivity, and plasma bubble geometry, the results demonstrate a variety of reconnection behavior and show the coupling between magnetic reconnection and global fluid evolution of the system. We consider both collision of two radially expanding bubbles where reconnection is driven through an X-point, and collision of two parallel fields where reconnection must be initiated by the tearing instability. Under various conditions, we observe transitions from fast, collisionless reconnection to a Sweet-Parker-like slow reconnection to complete stalling of the reconnection. By varying plasma resistivity, we observe the transition between fast and slow reconnection at Lundquist number S ~ 10^3. The transition from plasmoid reconnection to a single X-point reconnection also happens around S ~ 10^3. We find that the criterion δ/d_i < 1 is necessary for fast reconnection onset. Finally, at sufficiently high background density, magnetic reconnection can be suppressed, leading to bouncing motion of the magnetized plasma bubbles.
Purple L1 Milestone Review Panel TotalView Debugger Functionality and Performance for ASC Purple

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wolfe, M

2006-12-12

ASC code teams require a robust software debugging tool to help developers quickly find bugs in their codes and get their codes running. Development debugging commonly runs up to 512 processes. Production jobs run up to full ASC Purple scale, and at times require introspection while running. Developers want a debugger that runs on all their development and production platforms and that works with all compilers and runtimes used with ASC codes. The TotalView Multiprocess Debugger made by Etnus was specified for ASC Purple to address this needed capability. The ASC Purple environment builds on the environment seen by TotalView on ASCI White. The debugger must now operate with the Power5 CPU, Federation switch, AIX 5.3 operating system including large pages, IBM compilers 7 and 9, POE 4.2 parallel environment, and rs6000 SLURM resource manager. Users require robust, basic debugger functionality with acceptable performance at development debugging scale. A TotalView installation must be provided at the beginning of the early user access period that meets these requirements. A functional enhancement, fast conditional data watchpoints, and a scalability enhancement, capability up to 8192 processes, are to be demonstrated.
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2014-10-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
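As background on the inverse iteration technique mentioned in this abstract, the sketch below applies shifted inverse iteration to a small real symmetric matrix in pure Python. It is a generic textbook illustration and not the MARS implementation, which solves a much larger (and presumably complex-valued) problem with parallel libraries; all names here are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def inverse_iteration(a, shift, iters=50):
    """Approximate the eigenpair of `a` whose eigenvalue is closest to `shift`."""
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = solve(shifted, v)  # the linear solve is the dominant cost per step
        norm = max(abs(c) for c in w)
        v = [c / norm for c in w]
    # Rayleigh quotient as the eigenvalue estimate (suitable for symmetric a).
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(x * y for x, y in zip(v, av)) / sum(x * x for x in v)
    return lam, v


lam, vec = inverse_iteration([[2.0, 1.0], [1.0, 3.0]], shift=1.0)
print(lam)
```

Since each iteration is a linear solve with the same shifted matrix, a parallel solver for that system parallelizes the whole iteration, which is consistent with the strategy the abstract describes.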
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Fast in-memory elastic full-waveform inversion using consumer-grade GPUs

NASA Astrophysics Data System (ADS)

Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge

2017-04-01

Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. 
Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.
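The memory and throughput figures in this abstract can be reproduced with a few lines of arithmetic. The sketch below uses only numbers quoted above; the inference of two GPUs per node (from 15 nodes and 30 GPUs) and the assumption that "2.5 minutes per source" covers all work for one source in one iteration are mine, and the function names are hypothetical.

```python
import math


def ram_per_modeling(node_ram_gb, concurrent_modelings):
    """RAM available to each modeling when a node runs several at once."""
    return node_ram_gb / concurrent_modelings


def rounds_needed(n_sources, n_devices):
    """Number of sequential batches needed to model every source once."""
    return math.ceil(n_sources / n_devices)


cpu_ram = ram_per_modeling(128, 20)  # one modeling per CPU core
gpu_ram = ram_per_modeling(128, 2)   # one modeling per GPU, 2 GPUs per node assumed
rounds = rounds_needed(101, 30)      # 101 sources on 30 GPUs
upper_bound_min = rounds * 2.5       # each source takes under 2.5 minutes
cpu_vs_gpu = 120 / 9                 # two hours versus about nine minutes

print(cpu_ram, gpu_ram, rounds, upper_bound_min, round(cpu_vs_gpu, 1))
```

Four batches at under 2.5 minutes each give an upper bound of 10 minutes, in line with the reported 9 minutes per iteration, and the per-iteration gain over the CPU code comes out at roughly a factor of 13.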
Real-time photoacoustic and ultrasound dual-modality imaging system facilitated with graphics processing unit and code parallel optimization.

PubMed

Yuan, Jie; Xu, Guan; Yu, Yao; Zhou, Yu; Carson, Paul L; Wang, Xueding; Liu, Xiaojun

2013-08-01

Photoacoustic tomography (PAT) offers structural and functional imaging of living biological tissue with highly sensitive optical absorption contrast and excellent spatial resolution comparable to medical ultrasound (US) imaging. We report the development of a fully integrated PAT and US dual-modality imaging system, which performs signal scanning, image reconstruction, and display for both photoacoustic (PA) and US imaging all in a truly real-time manner. The back-projection (BP) algorithm for PA image reconstruction is optimized to reduce the computational cost and facilitate parallel computation on a state of the art graphics processing unit (GPU) card. For the first time, PAT and US imaging of the same object can be conducted simultaneously and continuously, at a real-time frame rate, presently limited by the laser repetition rate of 10 Hz. Noninvasive PAT and US imaging of human peripheral joints in vivo were achieved, demonstrating the satisfactory image quality realized with this system. Another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel, was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/patrealtime.
Analysis techniques for diagnosing runaway ion distributions in the reversed field pinch

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, J., E-mail: jkim536@wisc.edu; Anderson, J. K.; Capecchi, W.

2016-11-15

An advanced neutral particle analyzer (ANPA) on the Madison Symmetric Torus measures deuterium ions of energy ranges 8-45 keV with an energy resolution of 2-4 keV and time resolution of 10 μs. Three different experimental configurations measure distinct portions of the naturally occurring fast ion distributions: fast ions moving parallel, anti-parallel, or perpendicular to the plasma current. On a radial-facing port, fast ions moving perpendicular to the current have the necessary pitch to be measured by the ANPA. With the diagnostic positioned on a tangent line through the plasma core, a chord integration over fast ion density, background neutral density, and local appropriate pitch defines the measured sample. The plasma current can be reversed to measure anti-parallel fast ions in the same configuration. Comparisons of energy distributions for the three configurations show an anisotropic fast ion distribution favoring high pitch ions.
Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

NASA Technical Reports Server (NTRS)

Logan, Terry G.

1994-01-01

The purpose of this study is to investigate the performance of integral equation computations using a numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of the computational performance of the MPP CM-5 computer and the conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison indicates that the parallel CM-FORTRAN code comes close to or outperforms the equivalent serial FORTRAN code in some cases.
Parallel Scaling Characteristics of Selected NERSC User Project Codes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Skinner, David; Verdier, Francesca; Anand, Harsh

This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems. An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.
Development of Parallel Computing Framework to Enhance Radiation Transport Code Capabilities for Rare Isotope Beam Facility Design

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kostin, Mikhail; Mokhov, Nikolai; Niita, Koji

A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA and MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
Development of Spectral and Atomic Models for Diagnosing Energetic Particle Characteristics in Fast Ignition Experiments

DOE Office of Scientific and Technical Information (OSTI.GOV)

MacFarlane, Joseph J.; Golovkin, I. E.; Woodruff, P. R.

2009-08-07

This Final Report summarizes work performed under DOE STTR Phase II Grant No. DE-FG02-05ER86258 during the project period from August 2006 to August 2009. The project, “Development of Spectral and Atomic Models for Diagnosing Energetic Particle Characteristics in Fast Ignition Experiments,” was led by Prism Computational Sciences (Madison, WI), and involved collaboration with subcontractors University of Nevada-Reno and Voss Scientific (Albuquerque, NM). In this project, we have: Developed and implemented a multi-dimensional, multi-frequency radiation transport model in the LSP hybrid fluid-PIC (particle-in-cell) code [1,2]. Updated the LSP code to support the use of accurate equation-of-state (EOS) tables generated by Prism’s PROPACEOS [3] code to compute more accurate temperatures in high energy density physics (HEDP) plasmas. Updated LSP to support the use of Prism’s multi-frequency opacity tables. Generated equation of state and opacity data for LSP simulations for several materials being used in plasma jet experimental studies. Developed and implemented parallel processing techniques for the radiation physics algorithms in LSP. Benchmarked the new radiation transport and radiation physics algorithms in LSP and compared simulation results with analytic solutions and results from numerical radiation-hydrodynamics calculations. Performed simulations using Prism radiation physics codes to address issues related to radiative cooling and ionization dynamics in plasma jet experiments. Performed simulations to study the effects of radiation transport and radiation losses due to electrode contaminants in plasma jet experiments. Updated the LSP code to generate output using NetCDF to provide a better, more flexible interface to SPECT3D [4] in order to post-process LSP output. Updated the SPECT3D code to better support the post-processing of large-scale 2-D and 3-D datasets generated by simulation codes such as LSP.
Updated atomic physics modeling to provide for more comprehensive and accurate atomic databases that feed into the radiation physics modeling (spectral simulations and opacity tables). Developed polarization spectroscopy modeling techniques suitable for diagnosing energetic particle characteristics in HEDP experiments. A description of these items is provided in this report. The above efforts lay the groundwork for utilizing the LSP and SPECT3D codes in providing simulation support for DOE-sponsored HEDP experiments, such as plasma jet and fast ignition physics experiments. We believe that taken together, the LSP and SPECT3D codes have unique capabilities for advancing our understanding of the physics of these HEDP plasmas. Based on conversations early in this project with our DOE program manager, Dr. Francis Thio, our efforts emphasized developing radiation physics and atomic modeling capabilities that can be utilized in the LSP PIC code, and performing radiation physics studies for plasma jets. A relatively minor component focused on the development of methods to diagnose energetic particle characteristics in short-pulse laser experiments related to fast ignition physics. The period of performance for the grant was extended by one year to August 2009 with a one-year no-cost extension, at the request of subcontractor University of Nevada-Reno.

Rubus: A compiler for seamless and extensible parallelism.

PubMed

Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor

2017-01-01

Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. 
Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
Rubus: A compiler for seamless and extensible parallelism

PubMed Central

Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor

2017-01-01

Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. 
Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program. PMID:29211758
Current and planned numerical development for improving computing performance for long duration and/or low pressure transients

DOE Office of Scientific and Technical Information (OSTI.GOV)

Faydide, B.

1997-07-01

This paper presents the current and planned numerical development for improving computing performance in case of Cathare applications needing real time, like simulator applications. Cathare is a thermalhydraulic code developed by CEA (DRN), IPSN, EDF and FRAMATOME for PWR safety analysis. First, the general characteristics of the code are presented, dealing with physical models, numerical topics, and validation strategy. Then, the current and planned applications of Cathare in the field of simulators are discussed. Some of these applications were made in the past, using a simplified and fast-running version of Cathare (Cathare-Simu); the status of the numerical improvements obtained with Cathare-Simu is presented. The planned developments concern mainly the Simulator Cathare Release (SCAR) project which deals with the use of the most recent version of Cathare inside simulators. In this frame, the numerical developments are related with the speed up of the calculation process, using parallel processing and improvement of code reliability on a large set of NPP transients.
Parallel processing via a dual olfactory pathway in the honeybee.

PubMed

Brill, Martin F; Rosenbaum, Tobias; Reus, Isabelle; Kleineidam, Christoph J; Nawrot, Martin P; Rössler, Wolfgang

2013-02-06

In their natural environment, animals face complex and highly dynamic olfactory input. Thus vertebrates as well as invertebrates require fast and reliable processing of olfactory information. Parallel processing has been shown to improve processing speed and power in other sensory systems and is characterized by extraction of different stimulus parameters along parallel sensory information streams. Honeybees possess an elaborate olfactory system with unique neuronal architecture: a dual olfactory pathway comprising a medial projection-neuron (PN) antennal lobe (AL) protocerebral output tract (m-APT) and a lateral PN AL output tract (l-APT) connecting the olfactory lobes with higher-order brain centers. We asked whether this neuronal architecture serves parallel processing and employed a novel technique for simultaneous multiunit recordings from both tracts. The results revealed response profiles from a high number of PNs of both tracts to floral, pheromonal, and biologically relevant odor mixtures tested over multiple trials. PNs from both tracts responded to all tested odors, but with different characteristics indicating parallel processing of similar odors. Both PN tracts were activated by widely overlapping response profiles, which is a requirement for parallel processing. The l-APT PNs had broad response profiles suggesting generalized coding properties, whereas the responses of m-APT PNs were comparatively weaker and less frequent, indicating higher odor specificity. Comparison of response latencies within and across tracts revealed odor-dependent latencies. We suggest that parallel processing via the honeybee dual olfactory pathway provides enhanced odor processing capabilities serving sophisticated odor perception and olfactory demands associated with a complex olfactory world of this social insect.
Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1

NASA Technical Reports Server (NTRS)

Wright, Michael J.; White, Todd; Mangini, Nancy

2009-01-01

Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamic (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics incorporated into the code. DPLR also includes a large selection of generalized realistic surface boundary conditions and links to enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.
Constant time worker thread allocation via configuration caching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Eichenberger, Alexandre E; O'Brien, John K. P.

Mechanisms are provided for allocating threads for execution of a parallel region of code. A request for allocation of worker threads to execute the parallel region of code is received from a master thread. Cached thread allocation information identifying prior thread allocations that have been performed for the master thread are accessed. Worker threads are allocated to the master thread based on the cached thread allocation information. The parallel region of code is executed using the allocated worker threads.
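The caching idea in this record can be sketched in a few lines. This is a toy illustration only, not the mechanism claimed in the record: the class name `WorkerPool`, the cache key `(master_id, requested)`, and the use of integer worker IDs are all assumptions made for the example, and the set checks here are not constant time.

```python
class WorkerPool:
    """Toy pool of worker IDs with a per-master allocation cache (hypothetical)."""

    def __init__(self, n_workers):
        self.free = set(range(1, n_workers + 1))
        # (master_id, requested) -> tuple of worker IDs used last time
        self.cache = {}

    def allocate(self, master_id, requested):
        key = (master_id, requested)
        cached = self.cache.get(key)
        if cached is not None and self.free.issuperset(cached):
            # Cache hit: reuse the prior allocation without searching.
            workers = cached
        else:
            # Cache miss: pick the lowest free IDs and remember the choice.
            workers = tuple(sorted(self.free)[:requested])
            self.cache[key] = workers
        self.free.difference_update(workers)
        return workers

    def release(self, workers):
        """Return workers to the free set at the end of the parallel region."""
        self.free.update(workers)
```

A master thread that repeatedly enters a parallel region of the same size would call `allocate`, run the region, then call `release`; from the second region onward the cached tuple is returned directly.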
New Parallel computing framework for radiation transport codes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kostin, M.A.; /Michigan State U., NSCL; Mokhov, N.V.

A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
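Merging checkpoint files to combine the results of several calculations can be illustrated with a minimal sketch. The JSON layout, the field names `histories` and `tally_sum`, and both function names are hypothetical; the record does not describe the framework's actual checkpoint format, which is a C++/MPI implementation.

```python
import json


def save_checkpoint(path, histories, tally_sum):
    """Write a toy checkpoint: histories run so far and the summed tally."""
    with open(path, "w") as fh:
        json.dump({"histories": histories, "tally_sum": tally_sum}, fh)


def merge_checkpoints(paths):
    """Combine several checkpoints by adding history counts and tally sums."""
    total = {"histories": 0, "tally_sum": 0.0}
    for path in paths:
        with open(path) as fh:
            part = json.load(fh)
        total["histories"] += part["histories"]
        total["tally_sum"] += part["tally_sum"]
    return total
```

Storing sums rather than means is the design choice that makes merging a plain addition: the combined mean is `tally_sum / histories` of the merged record, weighted correctly even when the runs have different lengths.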
Soft-output decoding algorithms in iterative decoding of turbo codes

NASA Technical Reports Server (NTRS)

Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.

1996-01-01

In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.
Distributed multitasking ITS with PVM

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fan, W.C.; Halbleib, J.A. Sr.

1995-02-01

Advances of computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources, and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generated in a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of MCNP on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electron/photon transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load balancing schemes for homogeneous and heterogeneous networks.
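The link between processor count and statistical uncertainty stated in this abstract can be written out under idealized assumptions. The symbols are introduced here for illustration and do not appear in the record: σ is the standard deviation of a single-history tally, N the total number of independent histories, P the number of identical processors, and n the histories one processor completes in the given wall-clock time, with communication cost neglected.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}, \qquad N \approx P\,n
\;\Longrightarrow\;
\sigma_{\bar{x}}(P) \approx \frac{\sigma}{\sqrt{P\,n}} = \frac{\sigma_{\bar{x}}(1)}{\sqrt{P}}
```

Under these assumptions, a 16-processor homogeneous cluster would reduce the standard error reached in a fixed wall-clock time by about a factor of 4 relative to one processor.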
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
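The validation step described here, checking that serial and parallel runs agree when the same random numbers are used, can be sketched with a toy model. Everything below is an assumption made for illustration: the "physics" is a made-up random walk, seeding one `random.Random` per particle is a stand-in for a proper parallel generator library such as SPRNG or DCMT, and Python threads are used only to show reproducibility, not to obtain speedup.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def track_length(particle_id, base_seed=12345):
    """Toy history: add exponential free paths until a 30% absorption event."""
    # Each particle gets its own generator, so its history does not depend
    # on which worker runs it or in what order.
    rng = random.Random(base_seed * 1_000_003 + particle_id)
    length = 0.0
    while True:
        length += rng.expovariate(1.0)
        if rng.random() < 0.3:
            return length


def run_serial(n):
    return math.fsum(track_length(i) for i in range(n))


def run_parallel(n, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        lengths = list(pool.map(track_length, range(n)))
    # Summing the per-particle values once keeps the result bit-identical
    # to the serial run regardless of how work was scheduled.
    return math.fsum(lengths)
```

Because every history is tied to its particle index rather than to a worker, `run_serial(n)` and `run_parallel(n)` return the same value for any worker count, which is the property that makes a serial-versus-parallel comparison meaningful.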
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Capabilities of Fully Parallelized MHD Stability Code MARS

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2016-10-01

Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. Parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
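For readers unfamiliar with inverse vector iteration, the method named in this abstract, a serial dense-matrix sketch is given below. It is a textbook version for small real symmetric matrices with a user-chosen shift, not the MARS or PMARS implementation; the function names and the starting vector are assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def inverse_iteration(a, shift, iters=50):
    """Return (eigenvalue, unit eigenvector) of a nearest to shift."""
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Non-uniform start vector, to avoid being orthogonal to the target mode.
    x = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        y = solve(shifted, x)          # one step: (A - shift*I) y = x
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(xi * axi for xi, axi in zip(x, ax))  # Rayleigh quotient
    return eigenvalue, x
```

Each step is dominated by the linear solve, which is why, in a large code, parallelizing that solve with parallel libraries is the natural route. The shift must not coincide exactly with an eigenvalue, or the shifted matrix becomes singular.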
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2015-11-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fixed-boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.
Computer-Aided Parallelizer and Optimizer

NASA Technical Reports Server (NTRS)

Jin, Haoqiang

2011-01-01

The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
GASPRNG: GPU accelerated scalable parallel random number generator library

NASA Astrophysics Data System (ADS)

Gao, Shuang; Peterson, Gregory D.

2013-04-01

Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. 
Computer: Any PC or workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB to 732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory). Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
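The idea of independent, reproducible parallel streams can be illustrated with a minimal sketch. It uses only the Python standard library, not SPRNG or GASPRNG; the helper names `make_streams` and `monte_carlo_pi`, and the choice of seeding each stream from a (base seed, stream id) string, are assumptions made for illustration and are not the stream parameterization used by the library.

```python
import random


def make_streams(base_seed, n_streams):
    """Return one generator per worker, each seeded from (base_seed, stream_id).

    Assumption: distinct seed strings stand in for SPRNG-style stream
    parameterization; this is not how SPRNG or GASPRNG derive streams.
    """
    return [random.Random(f"{base_seed}:{stream_id}") for stream_id in range(n_streams)]


def monte_carlo_pi(rng, n_samples):
    """Estimate pi from n_samples points drawn with the given stream."""
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n_samples


# Each "worker" consumes its own stream; the partial results are averaged.
streams = make_streams(2013, 4)
estimates = [monte_carlo_pi(rng, 20000) for rng in streams]
pi_estimate = sum(estimates) / len(estimates)
print(pi_estimate)
```

Recreating the streams with the same base seed reproduces the same numbers, which is the property that lets a CPU run and an accelerated run be compared value by value.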
An integrated runtime and compile-time approach for parallelizing structured and block structured applications

NASA Technical Reports Server (NTRS)

Agrawal, Gagan; Sussman, Alan; Saltz, Joel

1993-01-01

Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). A combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion was described. A runtime library which can be used to port these applications on distributed memory machines was designed and implemented. The library is currently implemented on several different systems. To further ease the task of application programmers, methods were developed for integrating this runtime library with compilers for HPF-like parallel programming languages. How this runtime library was integrated with the Fortran 90D compiler being developed at Syracuse University is discussed. Experimental results to demonstrate the efficacy of our approach are presented. A multiblock Navier-Stokes solver template and a multigrid code were experimented with. Our experimental results show that our primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20 percent of the code parallelized by manually inserting calls to the runtime library.
The implementation of an aeronautical CFD flow code onto distributed memory parallel systems

NASA Astrophysics Data System (ADS)

Ierotheou, C. S.; Forsey, C. R.; Leatham, M.

2000-04-01

The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations.
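Domain decomposition under an SPMD model can be sketched in a few lines. The example below is a deliberately simplified one-dimensional stand-in, not the unified partitioning scheme of the SAUNA solver: the function names `block_partition` and `halo_neighbours` are hypothetical, and a real code would partition a 3D mesh and exchange halo data through a message-passing layer.

```python
def block_partition(n_cells, n_ranks):
    """Split cells 0..n_cells-1 into contiguous, nearly equal blocks.

    Returns a list of (start, stop) half-open ranges, one per rank.
    """
    base, extra = divmod(n_cells, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional cell each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def halo_neighbours(rank, n_ranks):
    """Return the (left, right) ranks a block exchanges halo data with."""
    left = rank - 1 if rank > 0 else None
    right = rank + 1 if rank < n_ranks - 1 else None
    return left, right


# Every rank runs the same program on its own block (SPMD).
for rank, (start, stop) in enumerate(block_partition(10, 3)):
    print(rank, (start, stop), halo_neighbours(rank, 3))
```

Keeping the partitioning and the neighbour lookup behind small functions mirrors the idea of encapsulating the message passing away from the solver itself.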
Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2012-07-01

Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which makes them attractive for real-time image processing.
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws

NASA Technical Reports Server (NTRS)

Cooke, Daniel; Rushton, Nelson

2013-01-01

With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for high-end computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language, that is, a programming language that is closer to a human's way of thinking than to a machine's. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequential/single-core code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify-Produce (CSP) and Normalize-Transpose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever.
In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less costly than development of comparable parallel code. Moreover, SequenceL not only automatically parallelizes the code, but since it is based on CSP-NT, it is provably race free, thus eliminating the largest quality challenge the parallelized software developer faces.
User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Earth Sciences Division; Zhang, Keni

TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one, two, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator is to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on the TOUGH2 Version 1.4 with EOS3, EOS9, and T2R3D modules, a software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4.
This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version TOUGH2 code. The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical methods used. To familiarize users with the parallel code, illustrative sample problems are presented.

A Data Parallel Multizone Navier-Stokes Code

NASA Technical Reports Server (NTRS)

Jespersen, Dennis C.; Levit, Creon; Kwak, Dochan (Technical Monitor)

1995-01-01

We have developed a data parallel multizone compressible Navier-Stokes code on the Connection Machine CM-5. The code is set up for implicit time-stepping on single or multiple structured grids. For multiple grids and geometrically complex problems, we follow the "chimera" approach, where flow data on one zone is interpolated onto another in the region of overlap. We will describe our design philosophy and give some timing results for the current code. The design choices can be summarized as: 1. finite differences on structured grids; 2. implicit time-stepping with either distributed solves or data motion and local solves; 3. sequential stepping through multiple zones with interzone data transfer via a distributed data structure. We have implemented these ideas on the CM-5 using CMF (Connection Machine Fortran), a data parallel language which combines elements of Fortran 90 and certain extensions, and which bears a strong similarity to High Performance Fortran (HPF). One interesting feature is the issue of turbulence modeling, where the architecture of a parallel machine makes the use of an algebraic turbulence model awkward, whereas models based on transport equations are more natural. We will present some performance figures for the code on the CM-5, and consider the issues involved in transitioning the code to HPF for portability to other parallel platforms.
Computational strategies for three-dimensional flow simulations on distributed computer systems. Ph.D. Thesis Semiannual Status Report, 15 Aug. 1993 - 15 Feb. 1994

NASA Technical Reports Server (NTRS)

Weed, Richard Allen; Sankar, L. N.

1994-01-01

An increasing amount of research activity in computational fluid dynamics has been devoted to the development of efficient algorithms for parallel computing systems. The increasing performance to price ratio of engineering workstations has led to research to develop procedures for implementing a parallel computing system composed of distributed workstations. This thesis proposal outlines an ongoing research program to develop efficient strategies for performing three-dimensional flow analysis on distributed computing systems. The PVM parallel programming interface was used to modify an existing three-dimensional flow solver, the TEAM code developed by Lockheed for the Air Force, to function as a parallel flow solver on clusters of workstations. Steady flow solutions were generated for three different wing and body geometries to validate the code and evaluate code performance. The proposed research will extend the parallel code development to determine the most efficient strategies for unsteady flow simulations.
Quartic scaling MP2 for solids: A highly parallelized algorithm in the plane wave basis

NASA Astrophysics Data System (ADS)

Schäfer, Tobias; Ramberger, Benjamin; Kresse, Georg

2017-03-01

We present a low-complexity algorithm to calculate the correlation energy of periodic systems in second-order Møller-Plesset (MP2) perturbation theory. In contrast to previous approximation-free MP2 codes, our implementation possesses a quartic scaling, O(N^4), with respect to the system size N and offers an almost ideal parallelization efficiency. The general issue that the correlation energy converges slowly with the number of basis functions is eased by an internal basis set extrapolation. The key concept to reduce the scaling is to eliminate all summations over virtual orbitals which can be elegantly achieved in the Laplace transformed MP2 formulation using plane wave basis sets and fast Fourier transforms. Analogously, this approach could allow us to calculate second order screened exchange as well as particle-hole ladder diagrams with a similar low complexity. Hence, the presented method can be considered as a step towards systematically improved correlation energies.
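The Laplace-transformed formulation mentioned above rests on a standard identity for the MP2 energy denominator. The notation below is an assumption introduced for illustration (it does not appear in the abstract): i, j label occupied orbitals, a, b label virtual orbitals, and ε denotes the corresponding orbital energies.

```latex
\frac{1}{\Delta_{ij}^{ab}}
  = \int_{0}^{\infty} e^{-\Delta_{ij}^{ab}\,\tau}\,\mathrm{d}\tau ,
\qquad
\Delta_{ij}^{ab} = \varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j > 0 ,
\\[6pt]
e^{-\Delta_{ij}^{ab}\,\tau}
  = e^{-\varepsilon_a \tau}\, e^{-\varepsilon_b \tau}\,
    e^{\varepsilon_i \tau}\, e^{\varepsilon_j \tau} .
```

Because the exponential factorizes into one term per orbital, the sums over orbital indices no longer share a common denominator and can be carried out separately, which is what makes it possible to remove the explicit summations over virtual orbitals.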
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction

DOE PAGES

Kumar, B.; Huang, C. -H.; Sadayappan, P.; ...

1995-01-01

In this article, we present a program generation strategy of Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
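For readers unfamiliar with the algorithm, the textbook recursive form of Strassen's method for 2^n × 2^n matrices can be sketched as follows. This is not the nonrecursive tensor-product formulation of the article, and it makes no attempt to reduce working storage; the helper names (`add`, `sub`, `split`, `naive_matmul`) are chosen here for illustration.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply two 2^n x 2^n matrices with seven recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # Reassemble the four quadrants into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


def naive_matmul(A, B):
    """Reference O(n^3) product used to check the result."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each level of recursion replaces eight block products with seven, which is where the factor of 7 per level in the operation count (and in the unoptimized working storage) comes from.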
hybridMANTIS: a CPU-GPU Monte Carlo method for modeling indirect x-ray detectors with columnar scintillators

NASA Astrophysics Data System (ADS)

Sharma, Diksha; Badal, Andreu; Badano, Aldo

2012-04-01

The computational modeling of medical imaging systems often requires obtaining a large number of simulated images with low statistical uncertainty, which translates into prohibitive computing times. We describe a novel hybrid approach for Monte Carlo simulations that maximizes utilization of CPUs and GPUs in modern workstations. We apply the method to the modeling of indirect x-ray detectors using a new and improved version of the code MANTIS, an open source software tool used for the Monte Carlo simulations of indirect x-ray imagers. We first describe a GPU implementation of the physics and geometry models in fastDETECT2 (the optical transport model) and a serial CPU version of the same code. We discuss its new features like on-the-fly column geometry and columnar crosstalk in relation to the MANTIS code, and point out areas where our model provides more flexibility for the modeling of realistic columnar structures in large area detectors. Second, we modify PENELOPE (the open source software package that handles the x-ray and electron transport in MANTIS) to allow direct output of location and energy deposited during x-ray and electron interactions occurring within the scintillator. This information is then handled by optical transport routines in fastDETECT2. A load balancer dynamically allocates optical transport showers to the GPU and CPU computing cores. Our hybridMANTIS approach achieves a significant speed-up factor of 627 when compared to MANTIS and of 35 when compared to the same code running only in a CPU instead of a GPU. Using hybridMANTIS, we successfully hide hours of optical transport time by running it in parallel with the x-ray and electron transport, thus shifting the computational bottleneck from optical to x-ray transport.
The new code requires much less memory than MANTIS and, as a result, allows us to efficiently simulate large area detectors.
Performance of a parallel code for the Euler equations on hypercube computers

NASA Technical Reports Server (NTRS)

Barszcz, Eric; Chan, Tony F.; Jesperson, Dennis C.; Tuminaro, Raymond S.

1990-01-01

The performance of hypercubes was evaluated on a computational fluid dynamics problem, and the parallel environment issues that must be addressed were considered, such as algorithm changes, implementation choices, programming effort, and programming environment. The evaluation focuses on a widely used fluid dynamics code, FLO52, which solves the two dimensional steady Euler equations describing flow around the airfoil. The code development experience is described, including interacting with the operating system, utilizing the message-passing communication system, and code modifications necessary to increase parallel efficiency. Results from two hypercube parallel computers (a 16-node iPSC/2, and a 512-node NCUBE/ten) are discussed and compared. In addition, a mathematical model of the execution time was developed as a function of several machine and algorithm parameters. This model accurately predicts the actual run times obtained and is used to explore the performance of the code in interesting but not yet physically realizable regions of the parameter space. Based on this model, predictions about future hypercubes are made.
ANNarchy: a code generation approach to neural simulations on parallel hardware

PubMed Central

Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.

2015-01-01

Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which makes it easy to define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into efficient C++ code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
Neutronic calculation of fast reactors by the EUCLID/V1 integrated code

NASA Astrophysics Data System (ADS)

Koltashev, D. A.; Stakhanova, A. A.

2017-01-01

This article considers neutronic calculation of the fast-neutron lead-cooled reactor BREST-OD-300 by the EUCLID/V1 integrated code. The main goal of the development and application of integrated codes is nuclear power plant safety justification. EUCLID/V1 is an integrated code designed for coupled neutronics, thermomechanical and thermohydraulic fast reactor calculations under normal and abnormal operating conditions. The EUCLID/V1 code is being developed in the Nuclear Safety Institute of the Russian Academy of Sciences. The integrated code has a modular structure and consists of three main modules: the thermohydraulic module HYDRA-IBRAE/LM/V1, the thermomechanical module BERKUT and the neutronic module DN3D. In addition, the integrated code includes databases with fuel, coolant and structural materials properties. The neutronic module DN3D provides full-scale simulation of neutronic processes in fast reactors. Heat source distribution, control rod movement, reactivity level changes and other processes can be simulated. The neutron transport equation is solved in the multigroup diffusion approximation. This paper contains some calculations implemented as a part of EUCLID/V1 code validation. A transient simulation of the fast-neutron lead-cooled reactor BREST-OD-300 (fuel assembly floating, decompression of a passive feedback system channel) and cross-validation with MCU-FR code results are presented in this paper. The calculations demonstrate the application of the EUCLID/V1 code to BREST-OD-300 simulation and safety justification.
Fast Time and Space Parallel Algorithms for Solution of Parabolic Partial Differential Equations

NASA Technical Reports Server (NTRS)

Fijany, Amir

1993-01-01

In this paper, fast time- and space-parallel algorithms for solution of linear parabolic PDEs are developed. It is shown that the seemingly strictly serial iterations of the time-stepping procedure for solution of the problem can be completely decoupled.
Benchmarking and performance analysis of the CM-2. [SIMD computer]

NASA Technical Reports Server (NTRS)

Myers, David W.; Adams, George B., II

1988-01-01

A suite of benchmarking routines testing communication, basic arithmetic operations, and selected kernel algorithms written in LISP and PARIS was developed for the CM-2. Experiment runs are automated via a software framework that sequences individual tests, allowing for unattended overnight operation. Multiple measurements are made and treated statistically to generate well-characterized results from the noisy values given by cm:time. The results obtained provide a comparison with similar, but less extensive, testing done on a CM-1. Tests were chosen to aid the algorithmist in constructing fast, efficient, and correct code on the CM-2, as well as gain insight into what performance criteria are needed when evaluating parallel processing machines.
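The practice of taking repeated measurements and summarizing them statistically, rather than trusting a single noisy timing, can be sketched generically. The example uses the Python standard library and has nothing to do with the CM-2 toolchain or `cm:time`; the `benchmark` helper, its default repeat count, and the choice of summary statistics are assumptions made for illustration.

```python
import statistics
import time


def benchmark(func, repeats=15):
    """Time func() several times and summarize the noisy samples."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # needs at least two samples
        "repeats": repeats,
    }


def kernel():
    """A small arithmetic kernel standing in for a benchmarked routine."""
    return sum(i * i for i in range(10000))


result = benchmark(kernel)
print(result)
```

Reporting the median together with the spread makes outliers caused by system noise visible instead of letting them silently distort a single number.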
Pteros: fast and easy to use open-source C++ library for molecular analysis.

PubMed

Yesylevskyy, Semen O

2012-07-15

An open-source Pteros library for molecular modeling and analysis of molecular dynamics trajectories for the C++ programming language is introduced. Pteros provides a number of routine analysis operations ranging from reading and writing trajectory files and geometry transformations to structural alignment and computation of nonbonded interaction energies. The library features asynchronous trajectory reading and parallel execution of several analysis routines, which greatly simplifies development of computationally intensive trajectory analysis algorithms. The Pteros programming interface is very simple and intuitive while the source code is well documented and easily extendible. Pteros is available for free under open-source Artistic License from http://sourceforge.net/projects/pteros/. Copyright © 2012 Wiley Periodicals, Inc.
Some fast elliptic solvers on parallel architectures and their complexities

NASA Technical Reports Server (NTRS)

Gallopoulos, E.; Saad, Y.

1989-01-01

The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. In this paper, the mapping of BCR on distributed memory architectures is discussed, and its complexity is compared with that of other approaches including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational complexity lower than that of parallel BCR.
Some fast elliptic solvers on parallel architectures and their complexities

NASA Technical Reports Server (NTRS)

Gallopoulos, E.; Saad, Youcef

1989-01-01

The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. Here, the mapping of BCR on distributed memory architectures is discussed, and its complexity is compared with that of other approaches, including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational complexity lower than that of parallel BCR.
Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing.

PubMed

Palkowski, Marek; Bielecki, Wlodzimierz

2017-06-02

RNA secondary structure prediction is a compute intensive task that lies at the core of several search algorithms in bioinformatics. Fortunately, the RNA folding approaches, such as the Nussinov base pair maximization, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. Polyhedral compilation techniques have proven to be a powerful tool for optimization of dense array codes. However, classical affine loop nest transformations used with these techniques do not effectively optimize dynamic programming codes for RNA structure prediction. The purpose of this paper is to present a novel approach allowing for generation of a parallel tiled Nussinov RNA loop nest exhibiting significantly higher performance than that of known related code. This effect is achieved by improving code locality and calculation parallelization. In order to improve code locality, we apply our previously published technique of automatic loop nest tiling to all three loops of the Nussinov loop nest. This approach first forms original rectangular 3D tiles and then corrects them to establish their validity by means of applying the transitive closure of a dependence graph. To produce parallel code, we apply the loop skewing technique to a tiled Nussinov loop nest. The technique is implemented as a part of the publicly available polyhedral source-to-source TRACO compiler. Generated code was run on modern Intel multi-core processors and coprocessors. We present the speed-up factor of generated Nussinov RNA parallel code and demonstrate that it is considerably faster than related codes in which only the two outer loops of the Nussinov loop nest are tiled.
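The three-deep loop nest that the tiling and skewing transformations target is the plain, untiled Nussinov recurrence, which can be sketched as follows. This is the serial textbook form, not the code generated by TRACO; the pairing rule is restricted to Watson-Crick pairs and no minimum hairpin length is enforced, both of which are simplifying assumptions, and the names `can_pair` and `nussinov` are chosen here for illustration.

```python
# Simplifying assumption: only Watson-Crick pairs count.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def can_pair(a, b):
    """Return 1 if bases a and b may form a pair, else 0."""
    return 1 if (a, b) in PAIRS else 0


def nussinov(seq):
    """Maximum number of nested base pairs in seq (serial, untiled)."""
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):          # outer loop over start index
        for j in range(i + 1, n):           # middle loop over end index
            # Case 1: i pairs with j around the enclosed subsequence.
            best = S[i + 1][j - 1] + can_pair(seq[i], seq[j])
            # Case 2: split [i, j] into two independent parts at k.
            for k in range(i, j):           # innermost loop over split point
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


print(nussinov("GGGAAACCC"))  # 3
```

Each cell S[i][j] depends on cells to its left in the same row and below it in the same column, and it is this dependence pattern that rules out naive rectangular tiling of all three loops.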
Fast adaptive composite grid methods on distributed parallel architectures

NASA Technical Reports Server (NTRS)

Lemke, Max; Quinlan, Daniel

1992-01-01

The fast adaptive composite (FAC) grid method is compared with the asynchronous fast adaptive composite (AFAC) method under a variety of conditions including vectorization and parallelization. Results are given for distributed memory multiprocessor architectures (SUPRENUM, Intel iPSC/2 and iPSC/860). It is shown that the good performance of AFAC and its superiority over FAC in a parallel environment is a property of the algorithm and not dependent on peculiarities of any machine.
Fast and accurate mock catalogue generation for low-mass galaxies

NASA Astrophysics Data System (ADS)

Koda, Jun; Blake, Chris; Beutler, Florian; Kazin, Eyal; Marin, Felipe

2016-06-01

We present an accurate and fast framework for generating mock catalogues including low-mass haloes, based on an implementation of the COmoving Lagrangian Acceleration (COLA) technique. Multiple realisations of mock catalogues are crucial for analyses of large-scale structure, but conventional N-body simulations are too computationally expensive for the production of thousands of realisations. We show that COLA simulations can produce accurate mock catalogues with a moderate computation resource for low- to intermediate-mass galaxies in 10^12 M⊙ haloes, both in real and redshift space. COLA simulations have accurate peculiar velocities, without systematic errors in the velocity power spectra for k ≤ 0.15 h Mpc^-1, and with only 3 per cent error for k ≤ 0.2 h Mpc^-1. We use COLA with 10 time steps and a Halo Occupation Distribution to produce 600 mock galaxy catalogues of the WiggleZ Dark Energy Survey. Our parallelized code for efficient generation of accurate halo catalogues is publicly available at github.com/junkoda/cola_halo.
A portable low-cost 3D point cloud acquiring method based on structure light

NASA Astrophysics Data System (ADS)

Gui, Li; Zheng, Shunyi; Huang, Xia; Zhao, Like; Ma, Hao; Ge, Chao; Tang, Qiuxia

2018-03-01

A fast and low-cost method of acquiring 3D point cloud data is proposed in this paper, which addresses the lack of texture information and the low efficiency of point cloud acquisition using only one pair of cheap cameras and a projector. Firstly, we put forward a scene-adaptive design method for a random encoding pattern, that is, a coding pattern is projected onto the target surface in order to form texture information, which is favorable for image matching. Subsequently, we design an efficient dense matching algorithm that fits the projected texture. After the optimization of the global algorithm and multi-kernel parallel development with the fusion of hardware and software, a fast acquisition system for point cloud data is accomplished. The evaluation of point cloud accuracy shows that the point cloud acquired by the proposed method has higher precision. What's more, the scanning speed meets the demands of dynamic scenes, giving the method better practical application value.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Janjusic, Tommy; Kartsaklis, Christos

Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for High Performance Systems are typically designed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful optimization consideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exa-scale systems. In this paper we present a methodology and tool to study the memory scalability of parallel codes. Using our methodology we evaluate an application's memory footprint as a function of scalability, which we coined memory efficiency, and describe our results. In particular, using our in-house tools we can pinpoint the specific application components which contribute to the application's overall memory footprint (application data structures, libraries, etc.).
A fast combination calibration of foreground and background for pipelined ADCs

NASA Astrophysics Data System (ADS)

Kexu, Sun; Lenian, He

2012-06-01

This paper describes a fast digital calibration scheme for pipelined analog-to-digital converters (ADCs). The proposed method corrects the nonlinearity caused by finite opamp gain and capacitor mismatch in multiplying digital-to-analog converters (MDACs). The considered calibration technique takes the advantages of both foreground and background calibration schemes. In this combination calibration algorithm, a novel parallel background calibration with signal-shifted correlation is proposed, and its calibration cycle is very short. The details of this technique are described in the example of a 14-bit 100 Msample/s pipelined ADC. The high convergence speed of this background calibration is achieved by three means. First, a modified 1.5-bit stage is proposed in order to allow the injection of a large pseudo-random dithering without missing code. Second, before correlating the signal, it is shifted according to the input signal so that the correlation error converges quickly. Finally, the front pipeline stages are calibrated simultaneously rather than stage by stage to reduce the calibration tracking constants. Simulation results confirm that the combination calibration has a fast startup process and a short background calibration cycle of 2 × 2^21 conversions.
Spiral: Automated Computing for Linear Transforms

NASA Astrophysics Data System (ADS)

Püschel, Markus

2010-09-01

Writing fast software has become extraordinarily difficult. For optimal performance, programs and their underlying algorithms have to be adapted to take full advantage of the platform's parallelism, memory hierarchy, and available instruction set. To make things worse, the best implementations are often platform-dependent and platforms are constantly evolving, which quickly renders libraries obsolete. We present Spiral, a domain-specific program generation system for important functionality used in signal processing and communication including linear transforms, filters, and other functions. Spiral completely replaces the human programmer. For a desired function, Spiral generates alternative algorithms, optimizes them, compiles them into programs, and intelligently searches for the best match to the computing platform. The main idea behind Spiral is a mathematical, declarative, domain-specific framework to represent algorithms and the use of rewriting systems to generate and optimize algorithms at a high level of abstraction. Experimental results show that the code generated by Spiral competes with, and sometimes outperforms, the best available human-written code.
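The generate-and-search idea described above can be illustrated with a toy sketch that is far simpler than Spiral itself: two interchangeable DFT algorithms are timed on the current machine and the faster one is kept. All function names here are hypothetical, and nothing in this sketch reflects Spiral's actual interface, rewriting system, or search strategy.

```python
import cmath
import timeit


def dft_naive(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def pick_fastest(candidates, sample, repeats=3):
    """Time every candidate on this machine and return the fastest one."""
    timings = {
        f.__name__: min(timeit.repeat(lambda: f(sample), number=1, repeat=repeats))
        for f in candidates
    }
    best = min(candidates, key=lambda f: timings[f.__name__])
    return best, timings


# Hypothetical test signal of length 256 (a power of two).
signal = [complex(i % 7, 0) for i in range(256)]
best, timings = pick_fastest([dft_naive, fft_radix2], signal)
print(best.__name__, timings)
```

The selection is made from measurements on the target platform rather than from a fixed rule, which is the part of the approach this sketch is meant to convey.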

Particle acceleration and transport at a 2D CME-driven shock using the HAFv3 and PATH Code

NASA Astrophysics Data System (ADS)

Li, G.; Ao, X.; Fry, C. D.; Verkhoglyadova, O. P.; Zank, G. P.

2012-12-01

We study particle acceleration at a 2D CME-driven shock and the subsequent transport in the inner heliosphere (up to 2 AU) by coupling the kinematic Hakamada-Akasofu-Fry version 3 (HAFv3) solar wind model (Hakamada and Akasofu, 1982, Fry et al. 2003) with the Particle Acceleration and Transport in the Heliosphere (PATH) model (Zank et al., 2000, Li et al., 2003, 2005, Verkhoglyadova et al. 2009). The HAFv3 provides the evolution of a two-dimensional shock geometry and other plasma parameters, which are fed into the PATH model to investigate the effect of a varying shock geometry on particle acceleration and transport. The transport module of the PATH model is parallelized and utilizes the state-of-the-art GPU computation technique to achieve a rapid physics-based numerical description of the interplanetary energetic particles. Together with a fast execution of the HAFv3 model, the coupled code gives us a possibility to nowcast/forecast the interplanetary radiation environment.
Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fijany, A.; Milman, M.; Redding, D.

1994-12-31

In this paper massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. The authors have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although this represents an optimal computation, due to the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm, designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other Fast Poisson Solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.
New Bandwidth Efficient Parallel Concatenated Coding Schemes

NASA Technical Reports Server (NTRS)

Denedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.

1996-01-01

We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for throughputs 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.
Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

NASA Astrophysics Data System (ADS)

Work, Paul R.

1991-12-01

This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
National Combustion Code: Parallel Performance

NASA Technical Reports Server (NTRS)

Babrauckas, Theresa

2001-01-01

This report discusses the National Combustion Code (NCC). The NCC is an integrated system of codes for the design and analysis of combustion systems. The advanced features of the NCC meet designers' requirements for model accuracy and turn-around time. The fundamental features at the inception of the NCC were parallel processing and unstructured mesh. The design and performance of the NCC are discussed.
Dynamic grid refinement for partial differential equations on parallel computers

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.
Massively Parallel Solution of Poisson Equation on Coarse Grain MIMD Architectures

NASA Technical Reports Server (NTRS)

Fijany, A.; Weinberger, D.; Roosta, R.; Gulati, S.

1998-01-01

In this paper a new algorithm, designated as Fast Invariant Imbedding algorithm, for solution of Poisson equation on vector and massively parallel MIMD architectures is presented. This algorithm achieves the same optimal computational efficiency as other Fast Poisson solvers while offering a much better structure for vector and parallel implementation. Our implementation on the Intel Delta and Paragon shows that a speedup of over two orders of magnitude can be achieved even for moderate size problems.
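For readers unfamiliar with the underlying problem, the sketch below shows the standard five-point discretization of the Poisson equation solved with plain Jacobi sweeps. It is not the Fast Invariant Imbedding algorithm and not a fast solver; it is only an assumed, minimal illustration of the discrete system such solvers target, with a hypothetical function name and grid size.

```python
def jacobi_poisson(f, h, sweeps):
    """Jacobi sweeps for -laplace(u) = f on a square grid, u = 0 on the boundary.

    `f` is a list of rows that includes boundary points; `h` is the grid spacing.
    """
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        new = [row[:] for row in u]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Each interior update reads only the previous iterate,
                # so all points could be updated in parallel.
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1]
                                    + h * h * f[i][j])
        u = new
    return u


n = 9                      # illustrative grid: 9 x 9 points on the unit square
h = 1.0 / (n - 1)
f = [[1.0] * n for _ in range(n)]
u = jacobi_poisson(f, h, sweeps=500)
print(u[n // 2][n // 2])
```

Jacobi needs many sweeps to converge, which is the cost that fast Poisson solvers are designed to avoid.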
Scalability study of parallel spatial direct numerical simulation code on IBM SP1 parallel supercomputer

NASA Technical Reports Server (NTRS)

Hanebutte, Ulf R.; Joslin, Ronald D.; Zubair, Mohammad

1994-01-01

The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent transition in three-dimensional boundary-layer flows are computed with the PSDNS code. By remapping the distributed data structure during the course of the calculation, optimized serial library routines can be utilized that substantially increase the computational performance. Although the remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40% for all performed calculations. By using appropriate compile options and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45% of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a 'real world' simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP for the same simulation. The scalability information provides estimated computational costs that match the actual costs relative to changes in the number of grid points.
PFLOTRAN: Reactive Flow & Transport Code for Use on Laptops to Leadership-Class Supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hammond, Glenn E.; Lichtner, Peter C.; Lu, Chuan

PFLOTRAN, a next-generation reactive flow and transport code for modeling subsurface processes, has been designed from the ground up to run efficiently on machines ranging from leadership-class supercomputers to laptops. Based on an object-oriented design, the code is easily extensible to incorporate additional processes. It can interface seamlessly with Fortran 9X, C and C++ codes. Domain decomposition parallelism is employed, with the PETSc parallel framework used to manage parallel solvers, data structures and communication. Features of the code include a modular input file, implementation of high-performance I/O using parallel HDF5, ability to perform multiple realization simulations with multiple processors per realization in a seamless manner, and multiple modes for multiphase flow and multicomponent geochemical transport. Chemical reactions currently implemented in the code include homogeneous aqueous complexing reactions and heterogeneous mineral precipitation/dissolution, ion exchange, surface complexation and a multirate kinetic sorption model. PFLOTRAN has demonstrated petascale performance using 2^17 processor cores with over 2 billion degrees of freedom. Accomplishments achieved to date include applications to the Hanford 300 Area and modeling CO2 sequestration in deep geologic formations.
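Domain decomposition gives each process a share of the unknowns. The sketch below assumes the simplest possible case, a one-dimensional block partition, to show how evenly the quoted problem size spreads over the quoted core count. PFLOTRAN delegates this work to PETSc, so the function name and the round figure of exactly 2 billion unknowns are illustrative assumptions only.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open [start, stop) slice owned by `rank`.

    The first (n_items % n_ranks) ranks receive one extra item, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


n_cores = 2 ** 17            # core count quoted in the abstract
n_dof = 2_000_000_000        # illustrative round figure for "over 2 billion"
first = block_range(n_dof, n_cores, 0)
last = block_range(n_dof, n_cores, n_cores - 1)
print(n_cores, first, last)
```

Under these assumptions each core owns roughly 15,000 unknowns, which suggests why communication cost, not local work, tends to dominate at this scale.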
Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM

NASA Astrophysics Data System (ADS)

de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl

2002-03-01

We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 10^5-10^6) in reasonable CPU times. The performances of our implementation of the parallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.

PubMed

Souris, Kevin; Lee, John Aldo; Sterpin, Edmond

2016-04-01

Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP

NASA Technical Reports Server (NTRS)

Chan, Tony F.; Fatoohi, Rod A.

1990-01-01

The results of multitasking implementation of a domain decomposition fast Poisson solver on eight processors of the Cray Y-MP are presented. The object of this research is to study the performance of domain decomposition methods on a Cray supercomputer and to analyze the performance of different multitasking techniques using highly parallel algorithms. Two implementations of multitasking are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). A conventional FFT-based fast Poisson solver is also multitasked. The results of different implementations are compared and analyzed. A speedup of over 7.4 on the Cray Y-MP running in a dedicated environment is achieved for all cases.
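The reported speedup can be put in context with two standard measures, parallel efficiency and the Karp-Flatt experimentally determined serial fraction. The sketch below applies them to the quoted figures (a speedup of 7.4 on eight processors); since the abstract says "over 7.4", the results are bounds, not exact values, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, n_procs):
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / n_procs


def karp_flatt(speedup, n_procs):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)


eff = parallel_efficiency(7.4, 8)
serial = karp_flatt(7.4, 8)
print(f"efficiency >= {eff:.3f}, serial fraction <= {serial:.4f}")
```

A serial fraction of about 1% on eight processors is consistent with the abstract's description of the algorithms as highly parallel.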
A parallel Monte Carlo code for planar and SPECT imaging: implementation, verification and applications in (131)I SPECT.

PubMed

Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F

2002-02-01

This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
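The partitioning strategy described above can be sketched as follows, under loud assumptions: the physics is a toy (exponential free paths through a slab, no scattering), the workers run one after another instead of as MPI ranks, and Python's `random.Random` with distinct seeds stands in for SPRNG without offering its guarantees about stream independence. None of the names come from SIMIND.

```python
import math
import random


def split_histories(n_photons, n_workers):
    """Divide photon histories as evenly as possible among workers."""
    base, extra = divmod(n_photons, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


def transmitted(n_photons, seed, mu=0.15, thickness=10.0):
    """Count photons whose first free path exceeds the slab thickness.

    Toy physics: exponential free paths with attenuation coefficient `mu`
    (per cm). Each worker owns a private random number generator.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_photons) if rng.expovariate(mu) > thickness)


def run(n_photons, n_workers, base_seed=12345):
    shares = split_histories(n_photons, n_workers)
    # With MPI each iteration would run on its own rank and the partial
    # counts would be combined with a reduction; here the loop is serial.
    counts = [transmitted(n, base_seed + w) for w, n in enumerate(shares)]
    return sum(counts) / n_photons


estimate = run(200_000, 32)
print(estimate, math.exp(-0.15 * 10.0))
```

Because the histories are independent, the only communication needed is the final sum, which is why near-linear speedup is plausible for this class of problem.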
Nuclide Depletion Capabilities in the Shift Monte Carlo Code

DOE PAGES

Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...

2017-12-21

A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.
HPCC Methodologies for Structural Design and Analysis on Parallel and Distributed Computing Platforms

NASA Technical Reports Server (NTRS)

Farhat, Charbel

1998-01-01

In this grant, we have proposed a three-year research effort focused on developing High Performance Computation and Communication (HPCC) methodologies for structural analysis on parallel processors and clusters of workstations, with emphasis on reducing the structural design cycle time. Besides consolidating and further improving the FETI solver technology to address plate and shell structures, we have proposed to tackle the following design related issues: (a) parallel coupling and assembly of independently designed and analyzed three-dimensional substructures with non-matching interfaces, (b) fast and smart parallel re-analysis of a given structure after it has undergone design modifications, (c) parallel evaluation of sensitivity operators (derivatives) for design optimization, and (d) fast parallel analysis of mildly nonlinear structures. While our proposal was accepted, support was provided only for one year.
User's Guide for ENSAERO_FE Parallel Finite Element Solver

NASA Technical Reports Server (NTRS)

Eldred, Lloyd B.; Guruswamy, Guru P.

1999-01-01

A high fidelity parallel static structural analysis capability is created and interfaced to the multidisciplinary analysis package ENSAERO-MPI of Ames Research Center. This new module replaces ENSAERO's lower fidelity simple finite element and modal modules. Full aircraft structures may be more accurately modeled using the new finite element capability. Parallel computation is performed by breaking the full structure into multiple substructures. This approach is conceptually similar to ENSAERO's multizonal fluid analysis capability. The new substructure code is used to solve the structural finite element equations for each substructure in parallel. NASTRANKOSMIC is utilized as a front end for this code. Its full library of elements can be used to create an accurate and realistic aircraft model. It is used to create the stiffness matrices for each substructure. The new parallel code then uses an iterative preconditioned conjugate gradient method to solve the global structural equations for the substructure boundary nodes.
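A preconditioned conjugate gradient iteration of the kind mentioned above can be sketched in a few lines. This is a minimal serial version on a tiny dense matrix, assuming a Jacobi (diagonal) preconditioner; the abstract does not say which preconditioner the ENSAERO_FE solver uses, and the three-spring example is purely illustrative.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (Jacobi-preconditioned CG)."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                  # residual b - A x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Stiffness matrix of three unit springs in series, fixed at one end.
K = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
f = [0.0, 0.0, 1.0]          # unit load on the free end
u = pcg(K, f)
print(u)
```

The iteration needs only matrix-vector products and dot products, both of which distribute naturally over substructures, which is one reason the method suits a substructure-parallel solver.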
A Parallel Numerical Algorithm To Solve Linear Systems Of Equations Emerging From 3D Radiative Transfer

NASA Astrophysics Data System (ADS)

Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.

2016-10-01

Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.
Portable multi-node LQCD Monte Carlo simulations using OpenACC

NASA Astrophysics Data System (ADS)

Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele

This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to be adopted to maximize performances, we then describe selected relevant details of the code, and finally measure the level of performance and scaling-performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performances for this application, but also compares with results measured on other processors.
Orthorectification by Using Gpgpu Method

NASA Astrophysics Data System (ADS)

Sahin, H.; Kulur, S.

2012-07-01

Owing to the nature of graphics processing, newly released products offer highly parallel processing units with high memory bandwidth and computational power of more than a teraflop. Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors with very fast computing capabilities and high memory bandwidth compared with central processing units (CPUs). Data-parallel computation can be briefly described as mapping data elements to parallel processing threads. The rapid development of GPU programmability and capabilities has attracted the attention of researchers dealing with complex problems that need intensive calculation. This interest has given rise to the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". Graphics processors are powerful yet cheap and affordable hardware, so they have become an alternative to conventional processors. Graphics chips, once standard application hardware, have been transformed into modern, powerful, programmable processors that meet general needs. Especially in recent years, the use of graphics processing units for general-purpose computation has drawn researchers and developers to this approach. The biggest problem is that graphics processing units use programming models that differ from current programming methods. Efficient GPU programming therefore requires re-coding the current program algorithm with the limitations and structure of the graphics hardware in mind. Currently, multi-core processors cannot be programmed using traditional programming methods; the event procedure programming method cannot be used for them. GPUs are especially effective at repeating the same computing steps over many data elements when high accuracy is needed. They therefore complete the computation more quickly and accurately. Compared with GPUs, CPUs, which perform just one computation at a time according to flow control, are slower. This structure can be exploited in various applications of computer technology. This study covers how general-purpose parallel programming and the computational power of GPUs can be used in photogrammetric applications, especially direct georeferencing. The direct georeferencing algorithm is coded using the GPGPU method and the CUDA (Compute Unified Device Architecture) programming language. Results obtained with this method were compared with traditional CPU programming. In the other application, projective rectification is coded using the GPGPU method and the CUDA programming language. Sample images of various sizes were evaluated and the results of the programs compared. The GPGPU method is especially useful for repeating the same computations on highly dense data, and thus for finding the solution quickly.
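The data-parallel pattern described above, one independent computation per data element, can be sketched for projective rectification as follows. This is a CPU illustration in Python, not the authors' CUDA code: the homography values are hypothetical, and the thread pool only mirrors the structure of a per-pixel GPU kernel (CPython threads do not speed up this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def project(H, point):
    """Apply the 3x3 homography H (nested lists) to one (x, y) pixel coordinate."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def rectify_coordinates(H, width, height, workers=4):
    """Map every pixel independently; on a GPU each pixel would get a thread."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: project(H, p), pixels))


# Hypothetical homography: scale by 2, shift by (10, 5), mild perspective term.
H = [[2.0, 0.0, 10.0],
     [0.0, 2.0, 5.0],
     [0.001, 0.0, 1.0]]
coords = rectify_coordinates(H, width=4, height=3)
print(coords[0], coords[-1])
```

No pixel depends on any other, so the work divides cleanly across however many threads the hardware provides.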
Fast, Massively Parallel Data Processors

NASA Technical Reports Server (NTRS)

Heaton, Robert A.; Blevins, Donald W.; Davis, ED

1994-01-01

Proposed fast, massively parallel data processor contains 8x16 array of processing elements with efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on "X" interconnection grid and with external memory via high-capacity input/output bus. This approach to conditional operation nearly doubles speed of various arithmetic operations.

The Tera Multithreaded Architecture and Unstructured Meshes

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.; Mavriplis, Dimitri J.

1998-01-01

The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine) running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data, issues that would be of paramount importance in other parallel architectures.
A Domain Decomposition Parallelization of the Fast Marching Method

NASA Technical Reports Server (NTRS)

Herrmann, M.

2003-01-01

In this paper, the first domain decomposition parallelization of the Fast Marching Method for level sets has been presented. Parallel speedup has been demonstrated in both the optimal and non-optimal domain decomposition case. The parallel performance of the proposed method depends strongly on separately load balancing the number of nodes on each side of the interface. A load imbalance of nodes on either side of the domain leads to an increase in communication and rollback operations. Furthermore, the amount of inter-domain communication can be reduced by aligning the inter-domain boundaries with the interface normal vectors. In the case of optimal load balancing and aligned inter-domain boundaries, the proposed parallel FMM algorithm is highly efficient, reaching efficiency factors of up to 0.98. Future work will focus on the extension of the proposed parallel algorithm to higher order accuracy. Also, to further enhance parallel performance, the coupling of the domain decomposition parallelization to the G_0-based parallelization will be investigated.
GRADSPMHD: A parallel MHD code based on the SPH formalism

NASA Astrophysics Data System (ADS)

Vanaverbeke, S.; Keppens, R.; Poedts, S.

2014-03-01

We present GRADSPMHD, a completely Lagrangian parallel magnetohydrodynamics code based on the SPH formalism. The implementation of the equations of SPMHD in the “GRAD-h” formalism assembles known results, including the derivation of the discretized MHD equations from a variational principle, the inclusion of time-dependent artificial viscosity, resistivity and conductivity terms, as well as the inclusion of a mixed hyperbolic/parabolic correction scheme for satisfying the ∇·B = 0 constraint on the magnetic field. The code uses a tree-based formalism for neighbor finding and can optionally use the tree code for computing the self-gravity of the plasma. The structure of the code closely follows the framework of our parallel GRADSPH FORTRAN 90 code which we added previously to the CPC program library. We demonstrate the capabilities of GRADSPMHD by running 1, 2, and 3 dimensional standard benchmark tests and we find good agreement with previous work done by other researchers. The code is also applied to the problem of simulating the magnetorotational instability in 2.5D shearing box tests as well as in global simulations of magnetized accretion disks. We find good agreement with available results on this subject in the literature. Finally, we discuss the performance of the code on a parallel supercomputer with distributed memory architecture. Catalogue identifier: AERP_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AERP_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 620503 No. of bytes in distributed program, including test data, etc.: 19837671 Distribution format: tar.gz Programming language: FORTRAN 90/MPI. Computer: HPC cluster. Operating system: Unix. Has the code been vectorized or parallelized?: Yes, parallelized using MPI. RAM: ~30 MB for a Sedov test including 15625 particles on a single CPU. Classification: 12. Nature of problem: Evolution of a plasma in the ideal MHD approximation. Solution method: The equations of magnetohydrodynamics are solved using the SPH method. Running time: The test provided takes approximately 20 min using 4 processors.
Health and nutrition content claims on Australian fast-food websites.

PubMed

Wellard, Lyndal; Koukoumas, Alexandra; Watson, Wendy L; Hughes, Clare

2017-03-01

To determine the extent that Australian fast-food websites contain nutrition content and health claims, and whether these claims are compliant with the new provisions of the Australia New Zealand Food Standards Code ('the Code'). Systematic content analysis of all web pages to identify nutrition content and health claims. Nutrition information panels were used to determine whether products with claims met Nutrient Profiling Scoring Criteria (NPSC) and qualifying criteria, and to compare them with the Code to determine compliance. Australian websites of forty-four fast-food chains including meals, bakery, ice cream, beverage and salad chains. Any products marketed on the websites using health or nutrition content claims. Of the forty-four fast-food websites, twenty (45 %) had at least one claim. A total of 2094 claims were identified on 371 products, including 1515 nutrition content (72 %) and 579 health claims (28 %). Five fast-food products with health (5 %) and 157 products with nutrition content claims (43 %) did not meet the requirements of the Code to allow them to carry such claims. New provisions in the Code came into effect in January 2016 after a 3-year transition. Food regulatory agencies should review fast-food websites to ensure compliance with the qualifying criteria for nutrition content and health claim regulations. This would prevent consumers from viewing unhealthy foods as healthier choices. Healthy choices could be facilitated by applying NPSC to nutrition content claims. Fast-food chains should be educated on the requirements of the Code regarding claims.
CUBE: Information-optimized parallel cosmological N-body simulation code

NASA Astrophysics Data System (ADS)

Yu, Hao-Ran; Pen, Ue-Li; Wang, Xin

2018-05-01

CUBE, written in Coarray Fortran, is a particle-mesh based parallel cosmological N-body simulation code. The memory usage of CUBE can be as low as 6 bytes per particle. Particle-particle (PP) pairwise force, cosmological neutrinos, and a spherical overdensity (SO) halo finder are included.
The Particle Accelerator Simulation Code PyORBIT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gorlov, Timofey V; Holmes, Jeffrey A; Cousineau, Sarah M

2015-01-01

The particle accelerator simulation code PyORBIT is presented. The structure, implementation, history, parallel and simulation capabilities, and future development of the code are discussed. The PyORBIT code is a new implementation and extension of algorithms of the original ORBIT code that was developed for the Spallation Neutron Source accelerator at the Oak Ridge National Laboratory. The PyORBIT code has a two-level structure. The upper level uses the Python programming language to control the flow of intensive calculations performed by the lower level code implemented in the C++ language. The parallel capabilities are based on MPI communications. PyORBIT is an open source code accessible to the public through the Google Open Source Projects Hosting service.
Automatic analysis (aa): efficient neuroimaging workflows and parallel processing using Matlab and XML.

PubMed

Cusack, Rhodri; Vicente-Grabovetsky, Alejandro; Mitchell, Daniel J; Wild, Conor J; Auer, Tibor; Linke, Annika C; Peelle, Jonathan E

2014-01-01

Recent years have seen neuroimaging data sets becoming richer, with larger cohorts of participants, a greater variety of acquisition techniques, and increasingly complex analyses. These advances have made data analysis pipelines complicated to set up and run (increasing the risk of human error) and time consuming to execute (restricting what analyses are attempted). Here we present an open-source framework, automatic analysis (aa), to address these concerns. Human efficiency is increased by making code modular and reusable, and managing its execution with a processing engine that tracks what has been completed and what needs to be (re)done. Analysis is accelerated by optional parallel processing of independent tasks on cluster or cloud computing resources. A pipeline comprises a series of modules that each perform a specific task. The processing engine keeps track of the data, calculating a map of upstream and downstream dependencies for each module. Existing modules are available for many analysis tasks, such as SPM-based fMRI preprocessing, individual and group level statistics, voxel-based morphometry, tractography, and multi-voxel pattern analyses (MVPA). However, aa also allows for full customization, and encourages efficient management of code: new modules may be written with only a small code overhead. aa has been used by more than 50 researchers in hundreds of neuroimaging studies comprising thousands of subjects. It has been found to be robust, fast, and efficient, for simple single-subject studies up to multimodal pipelines on hundreds of subjects. It is attractive to both novice and experienced users. aa can reduce the amount of time neuroimaging laboratories spend performing analyses and reduce errors, expanding the range of scientific questions it is practical to address.
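The dependency tracking described in this abstract can be illustrated with a small sketch. This is not aa's actual engine or API (aa is implemented in Matlab and XML); the module names, the `PIPELINE` mapping, and the function `modules_to_run` are hypothetical, and Python's standard `graphlib` is used only to order modules so that upstream stages come first.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each module maps to the set of modules it depends on.
PIPELINE = {
    "realign": set(),
    "coregister": {"realign"},
    "normalise": {"coregister"},
    "smooth": {"normalise"},
    "first_level_stats": {"smooth"},
}

def modules_to_run(pipeline, completed):
    """Return the modules that still need to run, upstream modules first.

    A module must (re)run if it has not completed, or if any module it
    depends on must itself (re)run.
    """
    todo = []
    stale = set()
    # static_order() yields every module after all of its dependencies.
    for name in TopologicalSorter(pipeline).static_order():
        if name not in completed or pipeline[name] & stale:
            stale.add(name)
            todo.append(name)
    return todo

if __name__ == "__main__":
    print(modules_to_run(PIPELINE, {"realign", "coregister"}))
```

In this sketch, marking a downstream module as completed does not save it from being rerun when something upstream is missing, which is the "(re)done" behaviour the abstract mentions.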
Implementation of a Message Passing Interface into a Cloud-Resolving Model for Massively Parallel Computing

NASA Technical Reports Server (NTRS)

Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve

2004-01-01

The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented in a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for the MPP with MPI uses the concept of maintaining similar code structure between the whole domain and the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). Also, it provides for minimal changes to the original code, so it is easily modified and/or managed by the model developers and users who have little knowledge of MPP. The entire model domain can be decomposed into one- or two-dimensional slices with a halo regime, which is overlaid on partial domains. The halo regime requires that no data be fetched across tasks during the computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducibility, transposing data among tasks is required for spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model for solving the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions have speedups of about 99% up to 256 tasks but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computations than that of the compressible version; 3) equal or approximately-equal numbers of slices between the x- and y- directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
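The halo scheme described in this abstract (compute on local data only, then refresh the halo by exchange before the next stage) can be sketched in one dimension. This is a serial illustration, not the GCE code: the "tasks" are plain Python lists, the exchange is a list copy standing in for MPI messages, the domain is assumed periodic, and all function names are hypothetical.

```python
def decompose(field, ntasks, halo=1):
    """Split a 1-D field into ntasks equal slabs, each padded with halo cells."""
    n = len(field)
    assert n % ntasks == 0, "field length must divide evenly among tasks"
    size = n // ntasks
    return [[0.0] * halo + list(field[t * size:(t + 1) * size]) + [0.0] * halo
            for t in range(ntasks)]

def exchange_halos(slabs, halo=1):
    """Fill each slab's halo from its neighbours' edge cells (periodic domain).

    In an MPI code this copy would be a send/receive pair between tasks.
    """
    ntasks = len(slabs)
    for t, slab in enumerate(slabs):
        left = slabs[(t - 1) % ntasks]
        right = slabs[(t + 1) % ntasks]
        slab[:halo] = left[-2 * halo:-halo]   # neighbour's last interior cells
        slab[-halo:] = right[halo:2 * halo]   # neighbour's first interior cells

def smooth_step(slab, halo=1):
    """Three-point average on interior cells, touching only local data."""
    new = slab[:]
    for i in range(halo, len(slab) - halo):
        new[i] = (slab[i - 1] + slab[i] + slab[i + 1]) / 3.0
    return new

def parallel_smooth(field, ntasks, halo=1):
    """One smoothing step over the decomposed domain, gathered back in order."""
    slabs = decompose(field, ntasks, halo)
    exchange_halos(slabs, halo)
    out = []
    for slab in slabs:
        out.extend(smooth_step(slab, halo)[halo:-halo])
    return out
```

Because the same stencil runs on every slab, the result is the same for one task or several, which mirrors the abstract's point that the model follows the same integration for single and multiple tasks.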
Applications Performance Under MPL and MPI on NAS IBM SP2

NASA Technical Reports Server (NTRS)

Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)

1994-01-01

On July 5, 1994, an IBM Scalable POWER parallel System (IBM SP2) with 64 nodes was installed at the Numerical Aerodynamic Simulation (NAS) Facility. Each node of the NAS IBM SP2 is a "wide node" consisting of a RISC 6000/590 workstation module with a clock of 66.5 MHz which can perform four floating point operations per clock, for a peak performance of 266 Mflop/s. By the end of 1994, the 64 nodes of the IBM SP2 will be upgraded to 160 nodes with a peak performance of 42.5 Gflop/s. An overview of the IBM SP2 hardware is presented. A basic understanding of the architectural details of the RS 6000/590 will help application scientists in porting, optimizing, and tuning codes from other machines such as the CRAY C90 and the Paragon to the NAS SP2. Optimization techniques such as quad-word loading, effective utilization of the two floating point units, and data cache optimization on the RS 6000/590 are illustrated, with examples giving performance gains at each optimization step. The conversion of codes using Intel's message passing library NX to codes using the native Message Passing Library (MPL) and the Message Passing Interface (MPI) library available on the IBM SP2 is illustrated. In particular, we present the performance of the Fast Fourier Transform (FFT) kernel from the NAS Parallel Benchmarks (NPB) under MPL and MPI. We have also optimized some of the Fortran BLAS 2 and BLAS 3 routines; e.g., the optimized Fortran DAXPY runs at 175 Mflop/s and the optimized Fortran DGEMM runs at 230 Mflop/s per node. The performance of the NPB (Class B) on the IBM SP2 is compared with the CRAY C90, Intel Paragon, TMC CM-5E, and the CRAY T3D.
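The peak figures quoted in this abstract follow from simple arithmetic, sketched below. The constants are taken from the abstract; the function names are hypothetical, and the "fraction of peak" values are derived here for illustration rather than reported by the authors.

```python
CLOCK_MHZ = 66.5        # node clock rate, from the abstract
FLOPS_PER_CLOCK = 4     # floating point operations per clock, from the abstract

def node_peak_mflops():
    """Peak rate of one node in Mflop/s (clock rate times operations per clock)."""
    return CLOCK_MHZ * FLOPS_PER_CLOCK

def system_peak_gflops(nodes):
    """Aggregate peak rate of a system with the given number of nodes, in Gflop/s."""
    return nodes * node_peak_mflops() / 1000.0

def fraction_of_peak(measured_mflops):
    """Measured per-node rate as a fraction of the per-node peak."""
    return measured_mflops / node_peak_mflops()

if __name__ == "__main__":
    print(node_peak_mflops())             # per-node peak, Mflop/s
    print(system_peak_gflops(160))        # 160-node peak, Gflop/s
    print(round(fraction_of_peak(230), 3))  # DGEMM
    print(round(fraction_of_peak(175), 3))  # DAXPY
```

The 160-node product is 42.56 Gflop/s, which the abstract quotes as 42.5 Gflop/s.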
Parallel Event Analysis Under Unix

NASA Astrophysics Data System (ADS)

Looney, S.; Nilsson, B. S.; Oest, T.; Pettersson, T.; Ranjard, F.; Thibonnier, J.-P.

The ALEPH experiment at LEP, the CERN CN division and Digital Equipment Corp. have, in a joint project, developed a parallel event analysis system. The parallel physics code is identical to ALEPH's standard analysis code, ALPHA; only the organisation of input/output is changed. The user may switch between sequential and parallel processing by simply changing one input "card". The initial implementation runs on an 8-node DEC 3000/400 farm, using the PVM software, and exhibits a near-perfect speed-up linearity, reducing the turn-around time by a factor of 8.
Code Optimization and Parallelization on the Origins: Looking from Users' Perspective

NASA Technical Reports Server (NTRS)

Chang, Yan-Tyng Sherry; Thigpen, William W. (Technical Monitor)

2002-01-01

Parallel machines are becoming the main compute engines for high performance computing. Despite their increasing popularity, it is still a challenge for most users to learn the basic techniques to optimize/parallelize their codes on such platforms. In this paper, we present some experiences on learning these techniques for the Origin systems at the NASA Advanced Supercomputing Division. Emphasis of this paper will be on a few essential issues (with examples) that general users should master when they work with the Origins as well as other parallel systems.
Fast transform decoding of nonsystematic Reed-Solomon codes

NASA Technical Reports Server (NTRS)

Truong, T. K.; Cheung, K.-M.; Reed, I. S.; Shiozaki, A.

1989-01-01

A Reed-Solomon (RS) code is considered to be a special case of a redundant residue polynomial (RRP) code, and a fast transform decoding algorithm to correct both errors and erasures is presented. This decoding scheme is an improvement of the decoding algorithm for the RRP code suggested by Shiozaki and Nishida, and can be realized readily on very large scale integration chips.
Parallel DSMC Solution of Three-Dimensional Flow Over a Finite Flat Plate

NASA Technical Reports Server (NTRS)

Nance, Robert P.; Wilmoth, Richard G.; Moon, Bongki; Hassan, H. A.; Saltz, Joel

1994-01-01

This paper describes a parallel implementation of the direct simulation Monte Carlo (DSMC) method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a good load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.
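One-dimensional chain partitioning, as named in this abstract, cuts an ordered chain of cells into contiguous blocks of roughly equal work. The sketch below is a simple prefix-sum heuristic, not the partitioner used in the paper; the weights (for example, particle counts per cell) and the function names are assumptions for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

def chain_partition(weights, nparts):
    """Return cut positions that split `weights` into nparts contiguous blocks.

    Cut p is placed at the first cell where the running sum reaches
    p/nparts of the total work. Heuristic only: blocks can be empty if a
    single cell carries most of the work.
    """
    prefix = list(accumulate(weights))
    total = prefix[-1]
    cuts = []
    for p in range(1, nparts):
        target = total * p / nparts
        cuts.append(bisect_left(prefix, target) + 1)
    return cuts

def block_loads(weights, cuts):
    """Sum the work in each contiguous block defined by the cut positions."""
    bounds = [0] + list(cuts) + [len(weights)]
    return [sum(weights[a:b]) for a, b in zip(bounds, bounds[1:])]

if __name__ == "__main__":
    w = [5, 1, 1, 1, 1, 1, 1, 1]
    cuts = chain_partition(w, 3)
    print(cuts, block_loads(w, cuts))
```

Dynamic remapping in this picture amounts to recomputing the cuts from updated weights at some interval, which is the kind of policy the abstract says was evaluated.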
S-HARP: A parallel dynamic spectral partitioner

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sohn, A.; Simon, H.

1998-01-01

Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. The authors present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP, fine grain loop level parallelism and coarse grain recursive parallelism. The parallel partitioner has been implemented in Message Passing Interface on Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64 processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15 fold speedup on 64 processors while ParaMeTiS1.0 gives a few fold speedup. Experimental results demonstrate that S-HARP is three to 10 times faster than the dynamic partitioners ParaMeTiS and Jostle on six computational meshes of size over 100,000 vertices.
A general purpose subroutine for fast fourier transform on a distributed memory parallel machine

NASA Technical Reports Server (NTRS)

Dubey, A.; Zubair, M.; Grosch, C. E.

1992-01-01

One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
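To make the data-distribution issue concrete, the sketch below compares two common ways of assigning array elements to processors, block and cyclic, and counts how many elements would change owner when converting between them. It illustrates the general idea only; it is not the subroutine described in the abstract, and the function names are hypothetical.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling of n / p
    return i // block

def cyclic_owner(i, p):
    """Owner of element i when elements are dealt out round-robin to p processors."""
    return i % p

def elements_moved(n, p):
    """Number of elements whose owner differs between block and cyclic layouts."""
    return sum(1 for i in range(n) if block_owner(i, n, p) != cyclic_owner(i, p))

if __name__ == "__main__":
    n, p = 8, 4
    print([block_owner(i, n, p) for i in range(n)])
    print([cyclic_owner(i, p) for i in range(n)])
    print(elements_moved(n, p))
```

A routine that accepts only one layout forces this kind of redistribution on callers that use the other, which is why supporting several distributions directly is attractive.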
Moving magnets in a micromagnetic finite-difference framework

NASA Astrophysics Data System (ADS)

Rissanen, Ilari; Laurson, Lasse

2018-05-01

We present a method and an implementation for smooth linear motion in a finite-difference-based micromagnetic simulation code, to be used in simulating magnetic friction and other phenomena involving moving microscale magnets. Our aim is to accurately simulate the magnetization dynamics and relative motion of magnets while retaining high computational speed. To this end, we combine techniques for fast scalar potential calculation and cubic b-spline interpolation, parallelizing them on a graphics processing unit (GPU). The implementation also includes the possibility of explicitly simulating eddy currents in the case of conducting magnets. We test our implementation by providing numerical examples of stick-slip motion of thin films pulled by a spring and the effect of eddy currents on the switching time of magnetic nanocubes.
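The cubic B-spline step mentioned in this abstract can be sketched for a one-dimensional grid: a value at a fractional position is a weighted sum of four neighbouring coefficients. This is a generic uniform cubic B-spline evaluation, not the authors' GPU implementation; it uses the samples directly as spline coefficients, which smooths the data slightly unless the samples are prefiltered, and the function names are hypothetical.

```python
def bspline_weights(t):
    """Uniform cubic B-spline weights for a fractional offset 0 <= t < 1."""
    w0 = (1.0 - t) ** 3 / 6.0
    w1 = (3.0 * t ** 3 - 6.0 * t ** 2 + 4.0) / 6.0
    w2 = (-3.0 * t ** 3 + 3.0 * t ** 2 + 3.0 * t + 1.0) / 6.0
    w3 = t ** 3 / 6.0
    return (w0, w1, w2, w3)

def sample_at(coeffs, x):
    """Evaluate the spline with coefficients `coeffs` at real position x.

    Requires 1 <= x < len(coeffs) - 2 so that all four neighbours exist.
    """
    i = int(x)
    t = x - i
    return sum(w * coeffs[i - 1 + k] for k, w in enumerate(bspline_weights(t)))

if __name__ == "__main__":
    data = [2.0 * j for j in range(8)]
    print(sample_at(data, 3.25))
```

The weights always sum to one and reproduce linear data exactly, so sliding the sampling position by a fraction of a cell moves a field smoothly rather than in whole-cell jumps.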
Integrated protocol for reliable and fast quantification and documentation of electrophoresis gels.

PubMed

Rehbein, Peter; Schwalbe, Harald

2015-06-01

Quantitative analysis of electrophoresis gels is an important part of molecular cloning, as well as of protein expression and purification. Parallel quantifications in yield and purity can be most conveniently obtained from densitometric analysis. This communication reports a comprehensive, reliable and simple protocol for gel quantification and documentation, applicable for single samples and with special features for protein expression screens. As a major component of the protocol, the fully annotated code of a proprietary open source computer program for semi-automatic densitometric quantification of digitized electrophoresis gels is disclosed. The program ("GelQuant") is implemented for the C-based macro-language of the widespread integrated development environment of IGOR Pro. Copyright © 2014 Elsevier Inc. All rights reserved.
Highly fault-tolerant parallel computation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Spielman, D.A.

We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log^{O(1)} w processors and time t log^{O(1)} w. The failure probability of the computation will be at most t · exp(-w^{1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log^{O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.
Enhancing Scalability and Efficiency of the TOUGH2_MP for Linux Clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Wu, Yu-Shu

2006-04-17

TOUGH2_MP, the parallel version of the TOUGH2 code, has been enhanced by implementing more efficient communication schemes. This enhancement is achieved through reducing the number of small-size messages and the volume of large messages. The message exchange speed is further improved by using non-blocking communications for both linear and nonlinear iterations. In addition, we have modified the AZTEC parallel linear-equation solver to use non-blocking communication. Through the improvement of code structuring and bug fixing, the new version of the code is now more stable, while demonstrating similar or even better nonlinear iteration convergence speed than the original TOUGH2 code. As a result, the new version of TOUGH2_MP is improved significantly in its efficiency. In this paper, the scalability and efficiency of the parallel code are demonstrated by solving two large-scale problems. The testing results indicate that speedup of the code may depend on both problem size and complexity. In general, the code has excellent scalability in memory requirement as well as computing time.
Neural representation of objects in space: a dual coding account.

PubMed Central

Humphreys, G W

1998-01-01

I present evidence on the nature of object coding in the brain and discuss the implications of this coding for models of visual selective attention. Neuropsychological studies of task-based constraints on: (i) visual neglect; and (ii) reading and counting, reveal the existence of parallel forms of spatial representation for objects: within-object representations, where elements are coded as parts of objects, and between-object representations, where elements are coded as independent objects. Aside from these spatial codes for objects, however, the coding of visual space is limited. We are extremely poor at remembering small spatial displacements across eye movements, indicating (at best) impoverished coding of spatial position per se. Also, effects of element separation on spatial extinction can be eliminated by filling the space with an occluding object, indicating that spatial effects on visual selection are moderated by object coding. Overall, there are separate limits on visual processing reflecting: (i) the competition to code parts within objects; (ii) the small number of independent objects that can be coded in parallel; and (iii) task-based selection of whether within- or between-object codes determine behaviour. Between-object coding may be linked to the dorsal visual system while parallel coding of parts within objects takes place in the ventral system, although there may additionally be some dorsal involvement either when attention must be shifted within objects or when explicit spatial coding of parts is necessary for object identification. PMID:9770227

A Validation and Code-to-Code Verification of FAST for a Megawatt-Scale Wind Turbine with Aeroelastically Tailored Blades

DOE Office of Scientific and Technical Information (OSTI.GOV)

Guntur, Srinivas; Jonkman, Jason; Sievers, Ryan

This paper presents validation and code-to-code verification of the latest version of the U.S. Department of Energy, National Renewable Energy Laboratory wind turbine aeroelastic engineering simulation tool, FAST v8. A set of 1,141 test cases, for which experimental data from a Siemens 2.3 MW machine have been made available and were in accordance with the International Electrotechnical Commission 61400-13 guidelines, were identified. These conditions were simulated using FAST as well as the Siemens in-house aeroelastic code, BHawC. This paper presents a detailed analysis comparing results from FAST with those from BHawC as well as experimental measurements, using statistics including the means and the standard deviations along with the power spectral densities of select turbine parameters and loads. Results indicate a good agreement among the predictions using FAST, BHawC, and experimental measurements. Here, these agreements are discussed in detail in this paper, along with some comments regarding the differences seen in these comparisons relative to the inherent uncertainties in such a model-based analysis.
A Validation and Code-to-Code Verification of FAST for a Megawatt-Scale Wind Turbine with Aeroelastically Tailored Blades

DOE PAGES

Guntur, Srinivas; Jonkman, Jason; Sievers, Ryan; ...

2017-08-29

This paper presents validation and code-to-code verification of the latest version of the U.S. Department of Energy, National Renewable Energy Laboratory wind turbine aeroelastic engineering simulation tool, FAST v8. A set of 1,141 test cases, for which experimental data from a Siemens 2.3 MW machine have been made available and were in accordance with the International Electrotechnical Commission 61400-13 guidelines, were identified. These conditions were simulated using FAST as well as the Siemens in-house aeroelastic code, BHawC. This paper presents a detailed analysis comparing results from FAST with those from BHawC as well as experimental measurements, using statistics including the means and the standard deviations along with the power spectral densities of select turbine parameters and loads. Results indicate a good agreement among the predictions using FAST, BHawC, and experimental measurements. Here, these agreements are discussed in detail in this paper, along with some comments regarding the differences seen in these comparisons relative to the inherent uncertainties in such a model-based analysis.
Rapid Prediction of Unsteady Three-Dimensional Viscous Flows in Turbopump Geometries

NASA Technical Reports Server (NTRS)

Dorney, Daniel J.

1998-01-01

A program is underway to improve the efficiency of a three-dimensional Navier-Stokes code and generalize it for nozzle and turbopump geometries. Code modifications have included the implementation of parallel processing software, incorporation of new physical models and generalization of the multiblock capability. The final report contains details of code modifications, numerical results for several nozzle and turbopump geometries, and the implementation of the parallelization software.
Parallel Grand Canonical Monte Carlo (ParaGrandMC) Simulation Code

NASA Technical Reports Server (NTRS)

Yamakov, Vesselin I.

2016-01-01

This report provides an overview of the Parallel Grand Canonical Monte Carlo (ParaGrandMC) simulation code. This is a highly scalable parallel FORTRAN code for simulating the thermodynamic evolution of metal alloy systems at the atomic level, and predicting the thermodynamic state, phase diagram, chemical composition and mechanical properties. The code is designed to simulate multi-component alloy systems, predict solid-state phase transformations such as austenite-martensite transformations, precipitate formation, recrystallization, capillary effects at interfaces, surface absorption, etc., which can aid the design of novel metallic alloys. While the software is mainly tailored for modeling metal alloys, it can also be used for other types of solid-state systems, and to some degree for liquid or gaseous systems, including multiphase systems forming solid-liquid-gas interfaces.
DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dr. Dale M. Snider

2011-02-28

This report gives the results of the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (General-Purpose Graphics Processing Unit or Graphics Processing Unit). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The problem test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with increased number of particles giving greater than 12x speedup. Phase-1 work provided a path for reformatting data structure modifications to give good parallel performance while keeping a friendly environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the groundwork for the complete parallelization of Barracuda in Phase-2, with the caveat that the parallel programming practices implemented in Phase-1 give immediate speedup in the current Barracuda serial running code. The Phase-1 tasks were completed successfully, laying the framework for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1 and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days.
Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (See Section Choosing a Function to Accelerate) Task 2: Select a GPU consultant company and jointly parallelize subroutines (CPFD chose the small business EMPhotonics as the Phase-1 technical partner. See Section Technical Objective and Approach) Task 3: Integrate parallel subroutines into Barracuda (See Section Results from Phase-1 and its subsections) Task 4: Testing, refinement, and optimization of parallel methodology (See Section Results from Phase-1 and Section Result Comparison Program) Task 5: Integrate Phase-1 parallel subroutines into Barracuda and release (See Section Results from Phase-1 and its subsections) Task 6: Roadmap of Phase-2 (See Section Plan for Phase-2) With the completion of Phase 1 we have the base understanding to completely parallelize Barracuda. An overview of the work to move Barracuda to a parallelized code is given in Plan for Phase-2.
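The remark that one function's speedup is usually higher than the whole code's can be made quantitative with Amdahl's law. The sketch below is a generic illustration, not an analysis from the report: the accelerated fraction of run time (0.9 in the example) is an assumed value, and the function names are hypothetical. Only the 12x function speedup, the 10x target, and the 30-day job come from the abstract.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is accelerated by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

def required_fraction(target, factor):
    """Fraction of run time that must be accelerated by `factor` to reach `target`."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / factor)

def days_after_speedup(days, speedup):
    """Wall-clock days for a job after applying an overall speedup."""
    return days / speedup

if __name__ == "__main__":
    # Assumed: 90% of the run time sits in code that runs 12x faster.
    print(amdahl_speedup(0.9, 12))
    # Fraction that would need a 12x speedup for a 10x overall gain.
    print(required_fraction(10, 12))
    print(days_after_speedup(30, 10))
```

Under these assumptions, accelerating a single function by 12x reaches 10x overall only if that function accounts for about 98% of the run time, which is consistent with the report's plan to parallelize the rest of the code in Phase-2.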
BISON and MARMOT Development for Modeling Fast Reactor Fuel Performance

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gamble, Kyle Allan Lawrence; Williamson, Richard L.; Schwen, Daniel

2015-09-01

BISON and MARMOT are two codes under development at the Idaho National Laboratory for engineering scale and lower length scale fuel performance modeling. It is desired to add capabilities for fast reactor applications to these codes. The fast reactor fuel types under consideration are metal (U-Pu-Zr) and oxide (MOX). The cladding types of interest include 316SS, D9, and HT9. The purpose of this report is to outline the proposed plans for code development and provide an overview of the models added to the BISON and MARMOT codes for fast reactor fuel behavior. A brief overview of preliminary discussions on the formation of a bilateral agreement between the Idaho National Laboratory and the National Nuclear Laboratory in the United Kingdom is presented.
THC-MP: High performance numerical simulation of reactive transport and multiphase flow in porous media

NASA Astrophysics Data System (ADS)

Wei, Xiaohui; Li, Weishan; Tian, Hailong; Li, Hongliang; Xu, Haixiao; Xu, Tianfu

2015-07-01

The numerical simulation of multiphase flow and reactive transport in porous media for complex subsurface problems is a computationally intensive application. To meet the increasing computational requirements, this paper presents a parallel computing method and architecture. Building on TOUGHREACT, a well-established code for simulating subsurface multiphase flow and reactive transport problems, we developed THC-MP, a high-performance code for massively parallel computers that greatly extends the computational capability of the original code. The domain decomposition method was applied to the coupled numerical computing procedure in THC-MP. We designed the distributed data structure and implemented the data initialization and exchange between the computing nodes, as well as the core solving module, which uses a hybrid parallel iterative and direct solver. The numerical accuracy of THC-MP was verified on a CO2 injection-induced reactive transport problem by comparing the results obtained from parallel computing with those from sequential computing (the original code). Execution efficiency and code scalability were examined through field-scale carbon sequestration applications on a multicore cluster. The results demonstrate the enhanced performance of THC-MP on parallel computing facilities.
A fast non-Fourier method for Landau-fluid operators

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dimits, A. M.; Joseph, I.; Umansky, M. V.

An efficient and versatile non-Fourier method for the computation of Landau-fluid (LF) closure operators [Hammett and Perkins, Phys. Rev. Lett. 64, 3019 (1990)] is presented, based on an approximation by a sum of modified-Helmholtz-equation solves (SMHS) in configuration space. This method can yield fast-Fourier-like scaling of the computational time requirements and also provides a very compact data representation of these operators, even for plasmas with large spatial nonuniformity. As a result, the method can give significant savings compared with direct application of “delocalization kernels” [e.g., Schurtz et al., Phys. Plasmas 7, 4238 (2000)], both in terms of computational cost and memory requirements. The method is of interest for the implementation of Landau-fluid models in situations where the spatial nonuniformity, particular geometry, or boundary conditions render a Fourier implementation difficult or impossible. Systematic procedures have been developed to optimize the resulting operators for accuracy and computational cost. The four-moment Landau-fluid model of Hammett and Perkins has been implemented in the BOUT++ code using the SMHS method for LF closure. Excellent agreement has been obtained for the one-dimensional plasma density response function between driven initial-value calculations using this BOUT++ implementation and matrix eigenvalue calculations using both Fourier and SMHS non-Fourier implementations of the LF closures. The SMHS method also forms the basis for the implementation, which has been carried out in the BOUT++ code, of the parallel and toroidal drift-resonance LF closures. The method is a key enabling tool for the extension of gyro-Landau-fluid models [e.g., Beer and Hammett, Phys. Plasmas 3, 4046 (1996)] to codes that treat regions with strong profile variation, such as the tokamak edge and scrapeoff-layer.
A fast non-Fourier method for Landau-fluid operatorsa)

NASA Astrophysics Data System (ADS)

Dimits, A. M.; Joseph, I.; Umansky, M. V.

2014-05-01

An efficient and versatile non-Fourier method for the computation of Landau-fluid (LF) closure operators [Hammett and Perkins, Phys. Rev. Lett. 64, 3019 (1990)] is presented, based on an approximation by a sum of modified-Helmholtz-equation solves (SMHS) in configuration space. This method can yield fast-Fourier-like scaling of the computational time requirements and also provides a very compact data representation of these operators, even for plasmas with large spatial nonuniformity. As a result, the method can give significant savings compared with direct application of "delocalization kernels" [e.g., Schurtz et al., Phys. Plasmas 7, 4238 (2000)], both in terms of computational cost and memory requirements. The method is of interest for the implementation of Landau-fluid models in situations where the spatial nonuniformity, particular geometry, or boundary conditions render a Fourier implementation difficult or impossible. Systematic procedures have been developed to optimize the resulting operators for accuracy and computational cost. The four-moment Landau-fluid model of Hammett and Perkins has been implemented in the BOUT++ code using the SMHS method for LF closure. Excellent agreement has been obtained for the one-dimensional plasma density response function between driven initial-value calculations using this BOUT++ implementation and matrix eigenvalue calculations using both Fourier and SMHS non-Fourier implementations of the LF closures. The SMHS method also forms the basis for the implementation, which has been carried out in the BOUT++ code, of the parallel and toroidal drift-resonance LF closures. The method is a key enabling tool for the extension of gyro-Landau-fluid models [e.g., Beer and Hammett, Phys. Plasmas 3, 4046 (1996)] to codes that treat regions with strong profile variation, such as the tokamak edge and scrapeoff-layer.
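A rough way to see why a sum of modified-Helmholtz solves can stand in for a Fourier-space closure operator: assuming the non-local part of the closure scales like 1/|k|, the identity 1/|k| = (2/pi) * integral over beta from 0 to infinity of 1/(k^2 + beta^2) can be discretized on a geometric grid of beta values. Each term alpha/(k^2 + beta^2) is the Fourier symbol of one modified-Helmholtz solve, (beta^2 - Laplacian)^(-1), so the sum can be applied in configuration space without a Fourier transform. The sketch below uses a plain trapezoid rule in ln(beta); the grid, the function names, and the coefficients are illustrative assumptions, not the optimized coefficients described in the abstract.

```python
import math

def smhs_terms(n_min=-12, n_max=17, h=1.0):
    """Illustrative (alpha, beta) pairs from a trapezoid rule in ln(beta).

    Assumption: geometric spacing beta_n = exp(h * n); the paper optimizes
    its coefficients differently.
    """
    terms = []
    for n in range(n_min, n_max + 1):
        beta = math.exp(h * n)
        alpha = (2.0 / math.pi) * h * beta
        terms.append((alpha, beta))
    return terms

def approx_inv_abs_k(k, terms):
    """Sum of Helmholtz-like symbols alpha / (k^2 + beta^2), approximating 1/|k|."""
    return sum(alpha / (k * k + beta * beta) for alpha, beta in terms)

def max_relative_error(k_lo=1.0, k_hi=100.0, samples=200):
    """Largest relative error of the approximation on a log-spaced range of k."""
    terms = smhs_terms()
    worst = 0.0
    for i in range(samples):
        k = k_lo * (k_hi / k_lo) ** (i / (samples - 1))
        worst = max(worst, abs(k * approx_inv_abs_k(k, terms) - 1.0))
    return worst

print(max_relative_error())
```

With these 30 terms the relative error over two decades of k stays well below 1e-3 in this sketch, which hints at why a modest number of Helmholtz solves can give a compact representation of the operator.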
SU-D-206-02: Evaluation of Partial Storage of the System Matrix for Cone Beam Computed Tomography Using a GPU Platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Matenine, D; Cote, G; Mascolo-Fortin, J

2016-06-15

Purpose: Iterative reconstruction algorithms in computed tomography (CT) require a fast method for computing the intersections between the photons’ trajectories and the object, also called ray-tracing or system matrix computation. This work evaluates different ways to store the system matrix, aiming to reconstruct dense image grids in reasonable time. Methods: We propose an optimized implementation of Siddon’s algorithm using graphics processing units (GPUs) with a novel data storage scheme. The algorithm computes a part of the system matrix on demand, typically for one projection angle. The proposed method was enhanced with accelerating options: storage of larger subsets of the system matrix, systematic reuse of data via geometric symmetries, an arithmetic-rich parallel code and code configuration via machine learning. It was tested on geometries mimicking a cone beam CT acquisition of a human head. To realistically assess the execution time, the ray-tracing routines were integrated into a regularized Poisson-based reconstruction algorithm. The proposed scheme was also compared to a different approach, where the system matrix is fully pre-computed and loaded at reconstruction time. Results: Fast ray-tracing of realistic acquisition geometries, which often lack spatial symmetry properties, was enabled via the proposed method. Ray-tracing interleaved with projection and backprojection operations required significant additional time. In most cases, ray-tracing was shown to use about 66 % of the total reconstruction time. In absolute terms, tracing times varied from 3.6 s to 7.5 min, depending on the problem size. The presence of geometrical symmetries allowed for non-negligible ray-tracing and reconstruction time reduction. Arithmetic-rich parallel code and machine learning permitted a modest reconstruction time reduction, on the order of 1 %.
Conclusion: Partial system matrix storage permitted the reconstruction of higher 3D image grid sizes and larger projection datasets at the cost of additional time, when compared to the fully pre-computed approach. This work was supported in part by the Fonds de recherche du Quebec - Nature et technologies (FRQ-NT). The authors acknowledge partial support by the CREATE Medical Physics Research Training Network grant of the Natural Sciences and Engineering Research Council of Canada (Grant No. 432290).
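As an illustration of the ray-tracing step, the following is a minimal 2D, CPU-only sketch in the spirit of Siddon's algorithm: it collects the parametric positions where a ray crosses the grid lines, sorts them, and assigns each sub-segment length to the pixel containing its midpoint. One ray yields one sparse row of the system matrix. The function name, the grid layout (lower-left corner at the origin) and the dictionary output are assumptions made for this example; the GPU implementation and storage scheme of the abstract are not reproduced here.

```python
import math

def siddon_2d(p0, p1, nx, ny, dx=1.0, dy=1.0):
    """Return {(ix, iy): intersection length} for the segment p0 -> p1.

    The grid has nx by ny pixels of size dx by dy, lower-left corner at (0, 0).
    """
    x0, y0 = p0
    x1, y1 = p1
    length = math.hypot(x1 - x0, y1 - y0)
    alphas = {0.0, 1.0}                      # parametric start and end of the ray
    if x1 != x0:
        for i in range(nx + 1):              # crossings with vertical grid lines
            a = (i * dx - x0) / (x1 - x0)
            if 0.0 < a < 1.0:
                alphas.add(a)
    if y1 != y0:
        for j in range(ny + 1):              # crossings with horizontal grid lines
            a = (j * dy - y0) / (y1 - y0)
            if 0.0 < a < 1.0:
                alphas.add(a)
    ordered = sorted(alphas)
    weights = {}
    for a_lo, a_hi in zip(ordered, ordered[1:]):
        a_mid = 0.5 * (a_lo + a_hi)          # midpoint identifies the pixel
        ix = math.floor((x0 + a_mid * (x1 - x0)) / dx)
        iy = math.floor((y0 + a_mid * (y1 - y0)) / dy)
        if 0 <= ix < nx and 0 <= iy < ny:
            weights[(ix, iy)] = weights.get((ix, iy), 0.0) + (a_hi - a_lo) * length
    return weights

row = siddon_2d((-1.0, 0.5), (3.0, 0.5), nx=2, ny=2)
print(row)
```

Computing such rows on demand, for example one projection angle at a time, trades recomputation for memory, which is the trade-off the abstract evaluates.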
A Very Fast and Angular Momentum Conserving Tree Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Marcello, Dominic C., E-mail: dmarce504@gmail.com

There are many methods used to compute the classical gravitational field in astrophysical simulation codes. With the exception of the typically impractical method of direct computation, none ensure conservation of angular momentum to machine precision. Under uniform time-stepping, the Cartesian fast multipole method of Dehnen (also known as the very fast tree code) conserves linear momentum to machine precision. We show that it is possible to modify this method in a way that conserves both angular and linear momenta.
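The conservation property of direct computation mentioned above can be checked numerically. The sketch below (function names and test data are assumptions for illustration, and it is a direct O(N^2) sum, not the tree code of the abstract) applies each pair force with opposite signs, so the net force cancels, and because each pair force is parallel to the separation vector, the net torque cancels as well, up to rounding.

```python
def pairwise_forces(pos, mass, G=1.0):
    """Direct O(N^2) gravitational forces; each pair force is applied with opposite signs."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            s = G * mass[i] * mass[j] / (r2 * r2 ** 0.5)
            for k in range(3):
                forces[i][k] += s * d[k]     # pull of body j on body i
                forces[j][k] -= s * d[k]     # equal and opposite pull of i on j
    return forces

def net_force_and_torque(pos, forces):
    """Total force and total torque about the origin."""
    net_f = [sum(f[k] for f in forces) for k in range(3)]
    net_t = [0.0, 0.0, 0.0]
    for r, f in zip(pos, forces):
        net_t[0] += r[1] * f[2] - r[2] * f[1]
        net_t[1] += r[2] * f[0] - r[0] * f[2]
        net_t[2] += r[0] * f[1] - r[1] * f[0]
    return net_f, net_t

# Arbitrary illustrative configuration (not data from the paper).
positions = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.7, 1.1, 0.4], [0.5, -0.9, 0.8]]
masses = [1.0, 2.0, 0.5, 1.5]
net_f, net_t = net_force_and_torque(positions, pairwise_forces(positions, masses))
print(net_f, net_t)
```

Multipole-based methods truncate the interaction between groups of particles, which is why preserving both cancellations requires the kind of modification the abstract describes.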
Trinary signed-digit arithmetic using an efficient encoding scheme

NASA Astrophysics Data System (ADS)

Salim, W. Y.; Alam, M. S.; Fyath, R. S.; Ali, S. A.

2000-09-01

The trinary signed-digit (TSD) number system is of interest for ultrafast optoelectronic computing systems since it permits parallel carry-free addition and borrow-free subtraction of two arbitrary length numbers in constant time. In this paper, a simple coding scheme is proposed to encode the decimal number directly into the TSD form. The coding scheme enables one to perform parallel one-step TSD arithmetic operations. The proposed coding scheme uses only a 5-combination coding table instead of the 625-combination table reported recently for the recoded TSD arithmetic technique.
The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit

NASA Technical Reports Server (NTRS)

Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete;

1998-01-01

Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architecture-independent parallel applications is presented.

Xyce parallel electronic simulator users guide, version 6.1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase (a message-passing parallel implementation), which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users' guide, Version 6.0.1.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase (a message-passing parallel implementation), which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users guide, version 6.0.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase (a message-passing parallel implementation), which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Transferring ecosystem simulation codes to supercomputers

NASA Technical Reports Server (NTRS)

Skiles, J. W.; Schulbach, C. H.

1995-01-01

Many ecosystem simulation computer codes have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Supercomputing platforms (both parallel and distributed systems) have been largely unused, however, because of the perceived difficulty in accessing and using the machines. Also, significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers must be considered. We have transferred a grassland simulation model (developed on a VAX) to a Cray Y-MP/C90. We describe porting the model to the Cray and the changes we made to exploit the parallelism in the application and improve code execution. The Cray executed the model 30 times faster than the VAX and 10 times faster than a Unix workstation. We achieved an additional speedup of 30 percent by using the compiler's vectoring and 'in-line' capabilities. The code runs at only about 5 percent of the Cray's peak speed because it ineffectively uses the vector and parallel processing capabilities of the Cray. We expect that by restructuring the code, it could execute an additional six to ten times faster.
Analyzing and Visualizing Cosmological Simulations with ParaView

NASA Astrophysics Data System (ADS)

Woodring, Jonathan; Heitmann, Katrin; Ahrens, James; Fasel, Patricia; Hsu, Chung-Hsing; Habib, Salman; Pope, Adrian

2011-07-01

The advent of large cosmological sky surveys, ushering in the era of precision cosmology, has been accompanied by ever larger cosmological simulations. The analysis of these simulations, which currently encompass tens of billions of particles and up to a trillion particles in the near future, is often as daunting as carrying out the simulations in the first place. Therefore, the development of very efficient analysis tools combining qualitative and quantitative capabilities is a matter of some urgency. In this paper, we introduce new analysis features implemented within ParaView, a fully parallel, open-source visualization toolkit, to analyze large N-body simulations. A major aspect of ParaView is that it can live and operate on the same machines and utilize the same parallel power as the simulation codes themselves. In addition, data movement is a serious bottleneck now and will become even more of an issue in the future; an interactive visualization and analysis tool that can handle data in situ is fast becoming essential. The new features in ParaView include particle readers and a very efficient halo finder that identifies friends-of-friends halos and determines common halo properties, including spherical overdensity properties. In combination with many other functionalities already existing within ParaView, such as histogram routines or interfaces to programming languages like Python, this enhanced version enables fast, interactive, and convenient analyses of large cosmological simulations. In addition, development paths are available for future extensions.
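The friends-of-friends idea behind such a halo finder can be sketched in a few lines: any two particles closer than a linking length are "friends", and a halo is a connected group of friends. The version below is a brute-force O(N^2) illustration with a union-find structure; the function name and the toy particle list are assumptions for this example, and a production halo finder would need spatial data structures and parallel decomposition to reach the particle counts mentioned above.

```python
def friends_of_friends(points, linking_length):
    """Group point indices so that points closer than linking_length share a halo."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    b2 = linking_length ** 2
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= b2:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri          # merge the two groups
    halos = {}
    for i in range(len(points)):
        halos.setdefault(find(i), []).append(i)
    return sorted(halos.values(), key=lambda members: members[0])

# Toy particle positions (illustrative only).
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
             (5.0, 5.0, 5.0), (5.4, 5.0, 5.0), (10.0, 0.0, 0.0)]
print(friends_of_friends(particles, linking_length=0.6))
```

Note that particles 0 and 2 end up in the same halo even though they are farther apart than the linking length, because particle 1 links them.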
PETSc Users Manual Revision 3.3

DOE Office of Scientific and Technical Information (OSTI.GOV)

Balay, S.; Brown, J.; Buschelman, K.

This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed within parallel application codes, such as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to employ the level of abstraction that is most appropriate for a particular problem. By using techniques of object-oriented programming, PETSc provides enormous flexibility for users. PETSc is a sophisticated set of software tools; as such, for some users it initially has a much steeper learning curve than a simple subroutine library. In particular, for individuals without some computer science background, experience programming in C, C++ or Fortran and experience using a debugger such as gdb or dbx, it may require a significant amount of time to take full advantage of the features that enable efficient software use. However, the power of the PETSc design and the algorithms it incorporates may make the efficient implementation of many application codes simpler than “rolling them” yourself. For many tasks a package such as MATLAB is often the best tool; PETSc is not intended for the classes of problems for which effective MATLAB code can be written. PETSc also has a MATLAB interface, so portions of your code can be written in MATLAB to “try out” the PETSc solvers. The resulting code will not be scalable, however, because currently MATLAB is inherently not scalable. PETSc should not be used to attempt to provide a “parallel linear solver” in an otherwise sequential code. Certainly all parts of a previously sequential code need not be parallelized, but the matrix generation portion must be parallelized to expect any kind of reasonable performance. Do not expect to generate your matrix sequentially and then “use PETSc” to solve the linear system in parallel. Since PETSc is under continued development, small changes in usage and calling sequences of routines will occur. PETSc is supported; see the web site http://www.mcs.anl.gov/petsc for information on contacting support. A list of publications and web sites that feature work involving PETSc may be found at http://www.mcs.anl.gov/petsc/publications. We welcome any reports of corrections for this document.
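The advice against generating the matrix sequentially can be made concrete with a small sketch of row ownership. The code below is plain Python and makes no PETSc calls; `ownership_range` and `assemble_local_rows` are hypothetical helpers that mimic a contiguous block-row distribution in which each rank builds only its own rows of a 1D Laplacian. In an actual PETSc program the row range would be queried from the library (for example with `MatGetOwnershipRange`) rather than computed by hand.

```python
def ownership_range(n_rows, n_ranks, rank):
    """Contiguous [start, end) row block for one rank.

    The first n_rows % n_ranks ranks receive one extra row (an assumed,
    commonly used block layout).
    """
    base, extra = divmod(n_rows, n_ranks)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

def assemble_local_rows(n_rows, n_ranks, rank):
    """Build only this rank's rows of a 1D Laplacian as {row: {col: value}}."""
    start, end = ownership_range(n_rows, n_ranks, rank)
    local = {}
    for i in range(start, end):
        row = {i: 2.0}
        if i > 0:
            row[i - 1] = -1.0
        if i < n_rows - 1:
            row[i + 1] = -1.0
        local[i] = row
    return local

# Simulate four ranks; in a real run each rank would execute only its own call.
for rank in range(4):
    print(rank, ownership_range(10, 4, rank), sorted(assemble_local_rows(10, 4, rank)))
```

Because every rank touches only its own rows, the assembly cost and the memory footprint both shrink as ranks are added, which is the point of the manual's warning.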
PETSc Users Manual Revision 3.4

DOE Office of Scientific and Technical Information (OSTI.GOV)

Balay, S.; Brown, J.; Buschelman, K.

This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed within parallel application codes, such as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to employ the level of abstraction that is most appropriate for a particular problem. By using techniques of object-oriented programming, PETSc provides enormous flexibility for users. PETSc is a sophisticated set of software tools; as such, for some users it initially has a much steeper learning curve than a simple subroutine library. In particular, for individuals without some computer science background, experience programming in C, C++ or Fortran and experience using a debugger such as gdb or dbx, it may require a significant amount of time to take full advantage of the features that enable efficient software use. However, the power of the PETSc design and the algorithms it incorporates may make the efficient implementation of many application codes simpler than “rolling them” yourself. For many tasks a package such as MATLAB is often the best tool; PETSc is not intended for the classes of problems for which effective MATLAB code can be written. PETSc also has a MATLAB interface, so portions of your code can be written in MATLAB to “try out” the PETSc solvers. The resulting code will not be scalable, however, because currently MATLAB is inherently not scalable. PETSc should not be used to attempt to provide a “parallel linear solver” in an otherwise sequential code. Certainly all parts of a previously sequential code need not be parallelized, but the matrix generation portion must be parallelized to expect any kind of reasonable performance. Do not expect to generate your matrix sequentially and then “use PETSc” to solve the linear system in parallel. Since PETSc is under continued development, small changes in usage and calling sequences of routines will occur. PETSc is supported; see the web site http://www.mcs.anl.gov/petsc for information on contacting support. A list of publications and web sites that feature work involving PETSc may be found at http://www.mcs.anl.gov/petsc/publications. We welcome any reports of corrections for this document.

PETSc Users Manual Revision 3.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

Balay, S.; Abhyankar, S.; Adams, M.

This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed within parallel application codes, such as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to employ the level of abstraction that is most appropriate for a particular problem. By using techniques of object-oriented programming, PETSc provides enormous flexibility for users. PETSc is a sophisticated set of software tools; as such, for some users it initially has a much steeper learning curve than a simple subroutine library. In particular, for individuals without some computer science background, experience programming in C, C++ or Fortran and experience using a debugger such as gdb or dbx, it may require a significant amount of time to take full advantage of the features that enable efficient software use. However, the power of the PETSc design and the algorithms it incorporates may make the efficient implementation of many application codes simpler than “rolling them” yourself. For many tasks a package such as MATLAB is often the best tool; PETSc is not intended for the classes of problems for which effective MATLAB code can be written. PETSc also has a MATLAB interface, so portions of your code can be written in MATLAB to “try out” the PETSc solvers. The resulting code will not be scalable, however, because currently MATLAB is inherently not scalable. PETSc should not be used to attempt to provide a “parallel linear solver” in an otherwise sequential code. Certainly all parts of a previously sequential code need not be parallelized, but the matrix generation portion must be parallelized to expect any kind of reasonable performance. Do not expect to generate your matrix sequentially and then “use PETSc” to solve the linear system in parallel. Since PETSc is under continued development, small changes in usage and calling sequences of routines will occur. PETSc is supported; see the web site http://www.mcs.anl.gov/petsc for information on contacting support. A list of publications and web sites that feature work involving PETSc may be found at http://www.mcs.anl.gov/petsc/publications. We welcome any reports of corrections for this document.
Parallelization of the TRIGRS model for rainfall-induced landslides using the message passing interface

USGS Publications Warehouse

Alvioli, M.; Baum, R.L.

2016-01-01

We describe a parallel implementation of TRIGRS, the Transient Rainfall Infiltration and Grid-Based Regional Slope-Stability Model for the timing and distribution of rainfall-induced shallow landslides. We have parallelized the four time-demanding execution modes of TRIGRS, namely both the saturated and unsaturated model with finite and infinite soil depth options, within the Message Passing Interface framework. In addition to new features of the code, we outline details of the parallel implementation and show the performance gain with respect to the serial code. Results are obtained both on commercial hardware and on a high-performance multi-node machine, showing the different limits of applicability of the new code. We also discuss the implications for the application of the model on large-scale areas and as a tool for real-time landslide hazard monitoring.
Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes

NASA Technical Reports Server (NTRS)

Yan, Jerry; Jin, Haoqiang; Frumkin, Michael; Yan, Jerry (Technical Monitor)

2000-01-01

The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress was made in hardware and software technologies, performance of parallel programs with compiler directives has demonstrated large improvement. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline techniques used in the implementation of the tool and discuss the application of this tool on the NAS Parallel Benchmarks and several computational fluid dynamics codes. This work demonstrates the great potential of using the tool to quickly port parallel programs and also achieve good performance that exceeds some of the commercial tools.
Highly parallel implementation of non-adiabatic Ehrenfest molecular dynamics

NASA Astrophysics Data System (ADS)

Kanai, Yosuke; Schleife, Andre; Draeger, Erik; Anisimov, Victor; Correa, Alfredo

2014-03-01

While the adiabatic Born-Oppenheimer approximation tremendously lowers computational effort, many questions in modern physics, chemistry, and materials science require an explicit description of coupled non-adiabatic electron-ion dynamics. Electronic stopping, i.e. the energy transfer of a fast projectile atom to the electronic system of the target material, is a notorious example. We recently implemented real-time time-dependent density functional theory based on the plane-wave pseudopotential formalism in the Qbox/qb@ll codes. We demonstrate that explicit integration using a fourth-order Runge-Kutta scheme is very suitable for modern highly parallelized supercomputers. Applying the new implementation to systems with hundreds of atoms and thousands of electrons, we achieved excellent performance and scalability on a large number of nodes both on the BlueGene-based "Sequoia" system at LLNL as well as the Cray architecture of "Blue Waters" at NCSA. As an example, we discuss our work on computing the electronic stopping power of aluminum and gold for hydrogen projectiles, showing an excellent agreement with experiment. These first-principles calculations allow us to gain important insight into the fundamental physics of electronic stopping.
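To illustrate what explicit fourth-order Runge-Kutta integration looks like for a time-dependent Schrodinger-type equation, i d(psi)/dt = H psi with hbar = 1, here is a toy sketch on a two-level system. The Hamiltonian, step size and function names are assumptions chosen so the result can be compared with the exact solution; it is not the plane-wave implementation in Qbox/qb@ll. One plausible reason such a scheme maps well onto parallel machines is that each step needs only applications of H to the state, with no linear solves.

```python
import math

def rk4_step(psi, hamiltonian, dt):
    """One classical RK4 step for i dpsi/dt = H psi (hbar = 1)."""
    def rhs(state):
        # Right-hand side -i * H * state, written as a dense matrix-vector product.
        return [-1j * sum(h * s for h, s in zip(row, state)) for row in hamiltonian]

    def shifted(state, k, scale):
        return [s + scale * kk for s, kk in zip(state, k)]

    k1 = rhs(psi)
    k2 = rhs(shifted(psi, k1, 0.5 * dt))
    k3 = rhs(shifted(psi, k2, 0.5 * dt))
    k4 = rhs(shifted(psi, k3, dt))
    return [p + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for p, a, b, c, d in zip(psi, k1, k2, k3, k4)]

def propagate(psi0, hamiltonian, dt, n_steps):
    """Apply n_steps RK4 steps starting from psi0."""
    psi = list(psi0)
    for _ in range(n_steps):
        psi = rk4_step(psi, hamiltonian, dt)
    return psi

# Toy two-level Hamiltonian (Pauli sigma_x); exact solution is (cos t, -i sin t).
sigma_x = [[0.0, 1.0], [1.0, 0.0]]
psi_final = propagate([1.0 + 0j, 0.0 + 0j], sigma_x, dt=0.01, n_steps=100)
print(psi_final, math.cos(1.0), -math.sin(1.0))
```

RK4 is not exactly norm-conserving, so in practice the time step has to be small enough that the drift in the norm stays negligible over the simulated interval.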
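An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

DOE Office of Scientific and Technical Information (OSTI.GOV)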

Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois-Henry

Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

DOE PAGES

Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois-Henry; ...

2016-10-27

Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence

NASA Astrophysics Data System (ADS)

Liu, Yuan; Du, Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen, Linqing

2012-12-01

We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resource. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors which in the past required more than 100 CPUs.
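The structure that makes SPIIR filtering attractive for GPUs can be sketched as a bank of independent first-order IIR filters whose outputs are summed. The example below assumes the commonly described form y[n] = a * y[n-1] + b * x[n-d] for each filter; the coefficients, delays and the function name are made up for illustration and are not a gravitational-wave template. Because the filters do not depend on each other, the outer loop is the part that a parallel implementation would distribute across threads.

```python
def spiir_output(x, filters):
    """Sum the outputs of independent first-order IIR filters.

    Each filter is a tuple (a, b, delay) and obeys
    y[n] = a * y[n-1] + b * x[n-delay], with x treated as zero before n = 0.
    """
    total = [0j] * len(x)
    for a, b, delay in filters:              # independent filters: parallelizable loop
        y_prev = 0j
        for n in range(len(x)):
            x_delayed = x[n - delay] if n >= delay else 0.0
            y_prev = a * y_prev + b * x_delayed
            total[n] += y_prev
    return total

# Impulse response of a two-filter bank with illustrative coefficients.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
bank = [(0.5, 2.0, 1), (0.25, 1.0, 0)]
print(spiir_output(impulse, bank))
```

Each filter carries only one complex number of state between samples, so the filter bank can process data as it arrives, which is what keeps the latency low.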
Communication Studies of DMP and SMP Machines

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors is consistent with the size of messages. The SP-2 is sensitive to message size but yields much higher communication overlap because of the communication co-processor. The PowerCHALLENGEarray is not highly sensitive to message size and yields low communication overlap. Bitonic sorting yields lower performance compared to FFT due to a smaller computation-to-communication ratio.
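For readers unfamiliar with the bitonic sorting benchmark mentioned above, here is a minimal serial sketch of the sorting network in Python. It is not the paper's MPI code; it only shows the compare-exchange pattern. In a message-passing version, pairs (i, partner) that live on different processes would exchange data, which is where the communication cost comes from.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared elements
            # All compare-exchanges at a fixed (k, j) are independent.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data


sorted_values = bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4])
```

Each stage does only one comparison per element before the next exchange, which is consistent with the low computation-to-communication ratio the abstract reports for this benchmark.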
Development Of A Parallel Performance Model For The THOR Neutral Particle Transport Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yessayan, Raffi; Azmy, Yousry; Schunert, Sebastian

The THOR neutral particle transport code enables simulation of complex geometries for various problems from reactor simulations to nuclear non-proliferation. It is undergoing a thorough V&V requiring computational efficiency. This has motivated various improvements including angular parallelization, outer iteration acceleration, and development of peripheral tools. For guiding future improvements to the code’s efficiency, better characterization of its parallel performance is useful. A parallel performance model (PPM) can be used to evaluate the benefits of modifications and to identify performance bottlenecks. Using INL’s Falcon HPC, the PPM development incorporates an evaluation of network communication behavior over heterogeneous links and a functional characterization of the per-cell/angle/group runtime of each major code component. After evaluating several possible sources of variability, this resulted in a communication model and a parallel portion model. The former’s accuracy is bounded by the variability of communication on Falcon while the latter has an error on the order of 1%.
An Expert System for the Development of Efficient Parallel Code

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit

2004-01-01

We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.
Light-weight Parallel Python Tools for Earth System Modeling Workflows

NASA Astrophysics Data System (ADS)

Mickelson, S. A.; Paul, K.; Xu, H.; Dennis, J.; Brown, D. I.

2015-12-01

With the growth in computing power over the last 30 years, earth system modeling codes have become increasingly data-intensive. As an example, it is expected that the data required for the next Intergovernmental Panel on Climate Change (IPCC) Assessment Report (AR6) will increase by more than 10x to an expected 25PB per climate model. Faced with this daunting challenge, developers of the Community Earth System Model (CESM) have chosen to change the format of their data for long-term storage from time-slice to time-series, in order to reduce the download bandwidth needed for later analysis and post-processing by climate scientists. Hence, efficient tools are required to (1) perform the transformation of the data from time-slice to time-series format and to (2) compute climatology statistics, needed for many diagnostic computations, on the resulting time-series data. To address the first of these two challenges, we have developed a parallel Python tool for converting time-slice model output to time-series format. To address the second of these challenges, we have developed a parallel Python tool to perform fast time-averaging of time-series data. These tools are designed to be light-weight and easy to install, to have very few dependencies, and to be easily inserted into the Earth system modeling workflow with negligible disruption. In this work, we present the motivation, approach, and testing results of these two light-weight parallel Python tools, as well as our plans for future research and development.
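The two operations described above can be sketched in a few lines of Python. This is an illustrative toy, not the tools from the abstract: the in-memory dictionaries, the variable names and the use of a thread pool are all assumptions, and the abstract does not say which file formats or parallel backend the real tools use.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_series(variable, time_slices):
    """Collect one variable's value from every time slice, in time order."""
    return variable, [ts[variable] for ts in time_slices]


def slices_to_series(time_slices, max_workers=4):
    """Turn a list of per-time-step records into one series per variable."""
    variables = sorted(time_slices[0])
    # Each variable is independent, so the work is split per variable.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pairs = pool.map(lambda v: extract_series(v, time_slices), variables)
        return dict(pairs)


def time_average(series):
    """Climatology-style mean over the time axis of one series."""
    return sum(series) / len(series)


# Hypothetical model output: three time slices, each holding two variables.
slices = [
    {"temperature": 280.0, "pressure": 1000.0},
    {"temperature": 282.0, "pressure": 1002.0},
    {"temperature": 284.0, "pressure": 1004.0},
]
series = slices_to_series(slices)
mean_temperature = time_average(series["temperature"])
```

Once the data is stored per variable, an analysis that needs only one field reads only that field's series, which is the bandwidth saving the abstract refers to.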
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Rizk, Yehia M.

1999-01-01

The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is greater than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each group is approximately balanced. A proper number of threads is initially allocated to each group, and in subsequent iterations during run time, the number of threads is adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in OVERFLOW.
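The grouping step described above (distributing zones into groups with roughly equal numbers of grid points) can be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions, not the strategy actually used in OVERFLOW: the function name, the zone names and the sizes are all hypothetical.

```python
import heapq


def group_zones(zone_sizes, n_groups):
    """Assign zones to groups so grid points per group are roughly equal.

    Greedy heuristic: take zones from largest to smallest and always give
    the next zone to the group that currently holds the fewest grid points.
    """
    heap = [(0, g) for g in range(n_groups)]    # (points so far, group id)
    groups = [[] for _ in range(n_groups)]
    for zone, size in sorted(zone_sizes.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)
        groups[g].append(zone)
        heapq.heappush(heap, (load + size, g))
    loads = [sum(zone_sizes[z] for z in grp) for grp in groups]
    return groups, loads


# Hypothetical zone sizes (grid points per zone).
zones = {"wing": 900, "fuselage": 700, "tail": 400, "nacelle": 300, "pylon": 200}
groups, loads = group_zones(zones, 2)
```

A static grouping like this only balances grid-point counts; the run-time thread adjustment mentioned in the abstract would be a separate mechanism layered on top.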
Validation of fast-ion D-alpha spectrum measurements during EAST neutral-beam heated plasmas

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huang, J.; Wu, C. R.; Hou, Y. M.

2016-11-15

To investigate the fast ion behavior, a fast ion D-alpha (FIDA) diagnostic system has been installed on EAST. Fast ion features can be inferred from the Doppler shifted spectrum of Balmer-alpha light from energetic hydrogenic atoms. This paper will focus on the validation of FIDA measurements performed using MHD-quiescent discharges in the 2015 campaign. Two codes have been applied to calculate the Dα spectrum: one is a Monte Carlo code, Fortran 90 version FIDASIM, and the other is an analytical code, Simulation of Spectra (SOS). The predicted SOS fast-ion spectrum agrees well with the measurement; however, the level of the fast-ion part from FIDASIM is lower. The discrepancy is possibly due to the difference between the FIDASIM and SOS velocity distribution functions. The details will be presented in the paper to primarily address comparisons of predicted and observed spectrum shapes/amplitudes.
High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. Research Report. ETS RR-16-34

ERIC Educational Resources Information Center

von Davier, Matthias

2016-01-01

This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.

2000-01-01

Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. 
Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining, gather/scatter, and redistribution. At the end of the conversion process most intermediate Charon function calls will have been removed, the non-distributed arrays will have been deleted, and virtually the only remaining Charon function calls are the high-level, highly optimized communications. Distribution of the data is under complete control of the programmer, although a wide range of useful distributions is easily available through predefined functions. A crucial aspect of the library is that it does not allocate space for distributed arrays, but accepts programmer-specified memory. This has two major consequences. First, codes parallelized using Charon do not suffer from encapsulation; user data is always directly accessible. This provides high efficiency, and also retains the possibility of using message passing directly for highly irregular communications. Second, non-distributed arrays can be interpreted as (trivial) distributions in the Charon sense, which allows them to be mapped to truly distributed arrays, and vice versa. This is the mechanism that enables incremental parallelization. In this paper we provide a brief introduction to the library and then focus on the actual steps in the parallelization process, using some representative examples from, among others, the NAS Parallel Benchmarks. We show how a complicated two-dimensional pipeline (the prototypical non-data-parallel algorithm) can be constructed with ease. To demonstrate the flexibility of the library, we give examples of the stepwise, efficient parallel implementation of nonlocal boundary conditions common in aircraft simulations, as well as the construction of the sequence of grids required for multigrid.
Improve load balancing and coding efficiency of tiles in high efficiency video coding by adaptive tile boundary

NASA Astrophysics Data System (ADS)

Chan, Chia-Hsin; Tu, Chun-Chuan; Tsai, Wen-Jiin

2017-01-01

High efficiency video coding (HEVC) not only improves the coding efficiency drastically compared to the well-known H.264/AVC but also introduces coding tools for parallel processing, one of which is tiles. Tile partitioning is allowed to be arbitrary in HEVC, but how to decide tile boundaries remains an open issue. An adaptive tile boundary (ATB) method is proposed to select a better tile partitioning to improve load balancing (ATB-LoadB) and coding efficiency (ATB-Gain) with a unified scheme. Experimental results show that, compared to ordinary uniform-space partitioning, the proposed ATB can save up to 17.65% of encoding time in parallel encoding scenarios and can reduce the total bit rate by up to 0.8%.
Error Control Coding Techniques for Space and Satellite Communications

NASA Technical Reports Server (NTRS)

Costello, Daniel J., Jr.; Takeshita, Oscar Y.; Cabral, Hermano A.

1998-01-01

It is well known that the BER performance of a parallel concatenated turbo-code improves roughly as 1/N, where N is the information block length. However, it has been observed by Benedetto and Montorsi that for most parallel concatenated turbo-codes, the FER performance does not improve monotonically with N. In this report, we study the FER of turbo-codes, and the effects of their concatenation with an outer code. Two methods of concatenation are investigated: across several frames and within each frame. Some asymmetric codes are shown to have excellent FER performance with an information block length of 16384. We also show that the proposed outer coding schemes can improve the BER performance as well by eliminating pathological frames generated by the iterative MAP decoding process.
Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems

DTIC Science & Technology

2017-04-13

modelling code, a parallel benchmark, and a communication avoiding version of the QR algorithm. Further, several improvements to the OmpSs model were...movement; and a port of the dynamic load balancing library to OmpSs. Finally, several updates to the tools infrastructure were accomplished, including: an...OmpSs: a basic algorithm on image processing applications, a mini application representative of an ocean modelling code, a parallel benchmark, and a
SUPREM-DSMC: A New Scalable, Parallel, Reacting, Multidimensional Direct Simulation Monte Carlo Flow Code

NASA Technical Reports Server (NTRS)

Campbell, David; Wysong, Ingrid; Kaplan, Carolyn; Mott, David; Wadsworth, Dean; VanGilder, Douglas

2000-01-01

An AFRL/NRL team has recently been selected to develop a scalable, parallel, reacting, multidimensional (SUPREM) Direct Simulation Monte Carlo (DSMC) code for the DoD user community under the High Performance Computing Modernization Office (HPCMO) Common High Performance Computing Software Support Initiative (CHSSI). This paper will introduce the JANNAF Exhaust Plume community to this three-year development effort and present the overall goals, schedule, and current status of this new code.
Rapid Prediction of Unsteady Three-Dimensional Viscous Flows in Turbopump Geometries

NASA Technical Reports Server (NTRS)

Dorney, Daniel J.

1998-01-01

A program is underway to improve the efficiency of a three-dimensional Navier-Stokes code and generalize it for nozzle and turbopump geometries. Code modifications will include the implementation of parallel processing software, incorporating new physical models and generalizing the multi-block capability to allow the simultaneous simulation of nozzle and turbopump configurations. The current report contains details of code modifications, numerical results of several flow simulations and the status of the parallelization effort.

Visual saliency-based fast intracoding algorithm for high efficiency video coding

NASA Astrophysics Data System (ADS)

Zhou, Xin; Shi, Guangming; Zhou, Wei; Duan, Zhemin

2017-01-01

Intraprediction has been significantly improved in high efficiency video coding over H.264/AVC with quad-tree-based coding unit (CU) structure from size 64×64 to 8×8 and more prediction modes. However, these techniques cause a dramatic increase in computational complexity. An intracoding algorithm is proposed that consists of a perceptual fast CU size decision algorithm and a fast intraprediction mode decision algorithm. First, based on visual saliency detection, an adaptive and fast CU size decision method is proposed to alleviate intraencoding complexity. Furthermore, a fast intraprediction mode decision algorithm with a step-halving rough mode decision method and an early mode pruning algorithm is presented to selectively check the potential modes and effectively reduce the complexity of computation. Experimental results show that our proposed fast method reduces the computational complexity of the current HM to about 57% in encoding time with only a 0.37% increase in BD rate. Meanwhile, the proposed fast algorithm has reasonable peak signal-to-noise ratio losses and nearly the same subjective perceptual quality.
EUPDF: An Eulerian-Based Monte Carlo Probability Density Function (PDF) Solver. User's Manual

NASA Technical Reports Server (NTRS)

Raju, M. S.

1998-01-01

EUPDF is an Eulerian-based Monte Carlo PDF solver developed for application with sprays, combustion, parallel computing and unstructured grids. It is designed to be massively parallel and could easily be coupled with any existing gas-phase flow and spray solvers. The solver accommodates the use of an unstructured mesh with mixed elements of either triangular, quadrilateral, and/or tetrahedral type. The manual provides the user with the coding required to couple the PDF code to any given flow code and a basic understanding of the EUPDF code structure as well as the models involved in the PDF formulation. The source code of EUPDF will be available with the release of the National Combustion Code (NCC) as a complete package.
ProGeRF: Proteome and Genome Repeat Finder Utilizing a Fast Parallel Hash Function

PubMed Central

Moraes, Walas Jhony Lopes; Rodrigues, Thiago de Souza; Bartholomeu, Daniella Castanheira

2015-01-01

Repetitive element sequences are adjacent, repeating patterns, also called motifs, and can be of different lengths; repetitions can involve their exact or approximate copies. They have been widely used as molecular markers in population biology. Given the sizes of sequenced genomes, various bioinformatics tools have been developed for the extraction of repetitive elements from DNA sequences. However, currently available tools do not provide options for identifying repetitive elements in the genome or proteome, displaying a user-friendly web interface, and performing exhaustive searches. ProGeRF is a web site for extracting repetitive regions from genome and proteome sequences. It was designed to be an efficient, fast, accurate and, above all, user-friendly web tool that offers many ways to view and analyse the results. ProGeRF (Proteome and Genome Repeat Finder) is freely available as a stand-alone program, from which the users can download the source code, and as a web tool. It was developed using the hash table approach to extract perfect and imperfect repetitive regions in a (multi)FASTA file, while allowing a linear time complexity. PMID:25811026
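The hash table approach mentioned above can be illustrated with a short Python sketch that indexes every fixed-length substring (k-mer) of a sequence in a dictionary. This is a generic illustration, not ProGeRF's algorithm: the function name and parameters are hypothetical, and it handles only exact repeats, whereas the abstract says the tool also finds imperfect ones.

```python
from collections import defaultdict


def find_repeats(sequence, k, min_count=2):
    """Return {k-mer: [start positions]} for k-mers seen at least min_count times."""
    positions = defaultdict(list)
    # One pass over the sequence; each k-mer is hashed once.
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_count}


repeats = find_repeats("ACGTACGTTTACG", 3)
```

Because each position is visited once and dictionary lookups take expected constant time for a fixed k, the scan is linear in the sequence length, which is the property the abstract highlights.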
Assessing the Role of Place and Timing Cues in Coding Frequency and Amplitude Modulation as a Function of Age.

PubMed

Whiteford, Kelly L; Kreft, Heather A; Oxenham, Andrew J

2017-08-01

Natural sounds can be characterized by their fluctuations in amplitude and frequency. Ageing may affect sensitivity to some forms of fluctuations more than others. The present study used individual differences across a wide age range (20-79 years) to test the hypothesis that slow-rate, low-carrier frequency modulation (FM) is coded by phase-locked auditory-nerve responses to temporal fine structure (TFS), whereas fast-rate FM is coded via rate-place (tonotopic) cues, based on amplitude modulation (AM) of the temporal envelope after cochlear filtering. Using a low (500 Hz) carrier frequency, diotic FM and AM detection thresholds were measured at slow (1 Hz) and fast (20 Hz) rates in 85 listeners. Frequency selectivity and TFS coding were assessed using forward masking patterns and interaural phase disparity tasks (slow dichotic FM), respectively. Comparable interaural level disparity tasks (slow and fast dichotic AM and fast dichotic FM) were measured to control for effects of binaural processing not specifically related to TFS coding. Thresholds in FM and AM tasks were correlated, even across tasks thought to use separate peripheral codes. Age was correlated with slow and fast FM thresholds in both diotic and dichotic conditions. The relationship between age and AM thresholds was generally not significant. Once accounting for AM sensitivity, only diotic slow-rate FM thresholds remained significantly correlated with age. Overall, results indicate stronger effects of age on FM than AM. However, because of similar effects for both slow and fast FM when not accounting for AM sensitivity, the effects cannot be unambiguously ascribed to TFS coding.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Bdzil, John Bohdan

The full level-set function code, DSD3D, is fully described in LA-14336 (2007) [1]. This ASCI-supported, DSD code project was the last such LANL DSD code project that I was involved with before my retirement in 2007. My part in the project was to design and build the core DSD3D solver, which was to include a robust DSD boundary condition treatment. A robust boundary condition treatment was required, since for an important local “customer,” the only description of the explosives’ boundary was through volume fraction data. Given this requirement, the accuracy issues I had encountered with our “fast-tube,” narrowband, DSD2D solver, and the difficulty we had building an efficient MPI-parallel version of the narrowband DSD2D, I decided DSD3D should be built as a full level-set function code, using a totally local DSD boundary condition algorithm for the level-set function, phi, which did not rely on the gradient of the level-set function being one, |grad(phi)| = 1. The narrowband DSD2D solver was built on the assumption that |grad(phi)| could be driven to one, and near the boundaries of the explosive this condition was not being satisfied. Since the narrowband is typically no more than 10*dx wide, narrowband methods are discrete methods with a fixed, non-resolvable error, where the error is related to the thickness of the band: the narrower the band the larger the errors. Such a solution represents a discrete approximation to the true solution and does not limit to the solution of the underlying PDEs under grid resolution.
Performance analysis of a parallel Monte Carlo code for simulating solar radiative transfer in cloudy atmospheres using CUDA-enabled NVIDIA GPU

NASA Astrophysics Data System (ADS)

Russkova, Tatiana V.

2017-11-01

One way to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is parallel computing. A new algorithm oriented to parallel execution on a CUDA-enabled NVIDIA graphics processor is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions, including continuous single-layered and multilayered clouds and selective molecular absorption, are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.
Parallel Adaptive Mesh Refinement Library

NASA Technical Reports Server (NTRS)

Mac-Neice, Peter; Olson, Kevin

2005-01-01

Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.
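To make the tree-of-blocks idea above concrete, the following Python sketch builds a two-dimensional quad-tree in which flagged blocks are split into four children. It is a conceptual illustration only: PARAMESH itself is a Fortran 90 package, and the class, method and criterion names here are hypothetical rather than part of its interface.

```python
class Block:
    """One node of a quad-tree; leaves correspond to the active grid blocks."""

    def __init__(self, x0, y0, size, level=0):
        self.x0, self.y0, self.size, self.level = x0, y0, size, level
        self.children = []

    def refine(self):
        """Split this block into four children of half the size."""
        half = self.size / 2
        self.children = [
            Block(self.x0 + dx * half, self.y0 + dy * half, half, self.level + 1)
            for dy in (0, 1) for dx in (0, 1)
        ]

    def leaves(self):
        """Return all leaf blocks below (and including) this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def refine_where(block, needs_refinement, max_level):
    """Recursively refine blocks flagged by a user-supplied predicate."""
    if block.level < max_level and needs_refinement(block):
        block.refine()
        for child in block.children:
            refine_where(child, needs_refinement, max_level)


# Hypothetical criterion: refine any block that contains the point (0.1, 0.1).
def contains_feature(b):
    return b.x0 <= 0.1 < b.x0 + b.size and b.y0 <= 0.1 < b.y0 + b.size


root = Block(0.0, 0.0, 1.0)
refine_where(root, contains_feature, max_level=3)
leaf_levels = sorted(leaf.level for leaf in root.leaves())
```

In this toy run the resolution increases only around the flagged point, while the leaves still tile the whole unit square; in three dimensions each split would produce eight children (an oct-tree).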
Parallel software for lattice N = 4 supersymmetric Yang-Mills theory

NASA Astrophysics Data System (ADS)

Schaich, David; DeGrand, Thomas

2015-05-01

We present new parallel software, SUSY LATTICE, for lattice studies of four-dimensional N = 4 supersymmetric Yang-Mills theory with gauge group SU(N). The lattice action is constructed to exactly preserve a single supersymmetry charge at non-zero lattice spacing, up to additional potential terms included to stabilize numerical simulations. The software evolved from the MILC code for lattice QCD, and retains a similar large-scale framework despite the different target theory. Many routines are adapted from an existing serial code (Catterall and Joseph, 2012), which SUSY LATTICE supersedes. This paper provides an overview of the new parallel software, summarizing the lattice system, describing the applications that are currently provided and explaining their basic workflow for non-experts in lattice gauge theory. We discuss the parallel performance of the code, and highlight some notable aspects of the documentation for those interested in contributing to its future development.
Parallelization of Lower-Upper Symmetric Gauss-Seidel Method for Chemically Reacting Flow

NASA Technical Reports Server (NTRS)

Yoon, Seokkwan; Jost, Gabriele; Chang, Sherry

2005-01-01

Development of technologies for exploration of the solar system has revived an interest in computational simulation of chemically reacting flows since planetary probe vehicles exhibit non-equilibrium phenomena during the atmospheric entry of a planet or a moon as well as the reentry to the Earth. Stability in combustion is essential for new propulsion systems. Numerical solution of real-gas flows often increases computational work by an order of magnitude compared to perfect gas flow, partly because of the increased complexity of the equations to solve. Recently, as part of Project Columbia, NASA has integrated a cluster of interconnected SGI Altix systems to provide a ten-fold increase in current supercomputing capacity that includes an SGI Origin system. Both the new and existing machines are based on cache coherent non-uniform memory access architecture. The Lower-Upper Symmetric Gauss-Seidel (LU-SGS) relaxation method has been implemented into both perfect and real gas flow codes including the Real-Gas Aerodynamic Simulator (RGAS). However, the vectorized RGAS code runs inefficiently on cache-based shared-memory machines such as the SGI systems. Parallelization of a Gauss-Seidel method is nontrivial due to its sequential nature. The LU-SGS method has been vectorized on an oblique plane in the INS3D-LU code that has been one of the base codes for the NAS Parallel Benchmarks. The oblique plane has been called a hyperplane by computer scientists. It is straightforward to parallelize a Gauss-Seidel method by partitioning the hyperplanes once they are formed. Another way of parallelization is to schedule processors like a pipeline using software. Both hyperplane and pipeline methods have been implemented using OpenMP directives. The present paper reports the performance of the parallelized RGAS code on SGI Origin and Altix systems.
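The hyperplane idea described above can be sketched as follows. In a forward Gauss-Seidel sweep on a structured grid, point (i, j, k) needs the already-updated values at (i-1, j, k), (i, j-1, k) and (i, j, k-1); all of those lie on the previous plane i + j + k = const, so points on the same plane can be updated concurrently. The Python below only builds and checks that ordering; it is an illustration under these assumptions, not code from RGAS or INS3D-LU.

```python
def hyperplanes(ni, nj, nk):
    """Group grid indices (i, j, k) by the hyperplane i + j + k = const."""
    planes = [[] for _ in range(ni + nj + nk - 2)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                planes[i + j + k].append((i, j, k))
    return planes


def independent_within_plane(plane):
    """Check that no point's lower neighbours lie on its own plane."""
    members = set(plane)
    for i, j, k in plane:
        for neighbour in ((i - 1, j, k), (i, j - 1, k), (i, j, k - 1)):
            if neighbour in members:
                return False
    return True


planes = hyperplanes(3, 3, 3)
plane_sizes = [len(p) for p in planes]
```

The planes must still be processed one after another; the available parallelism is the number of points per plane, which is small near the corners of the grid and largest in the middle of the sweep.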
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

PubMed

Furuta, Takuya; Sato, Tatsuhiko

2015-01-01

Time-consuming Monte Carlo dose calculation has become feasible owing to the development of computer technology. However, the recent development is due to the emergence of multi-core high-performance computers. Therefore, parallel computing has become key to achieving good performance of software programs. The Monte Carlo simulation code PHITS contains two parallel computing functions: distributed-memory parallelization using the message passing interface (MPI) protocol and shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose between the two functions according to their needs. This paper explains the two functions along with their advantages and disadvantages. Some test applications are also provided to show their performance on a typical multi-core high-performance workstation.
Implementing Shared Memory Parallelism in MCBEND

NASA Astrophysics Data System (ADS)

Bird, Adam; Long, David; Dobson, Geoff

2017-09-01

MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To utilise parallel hardware more effectively, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND, and assesses the performance of the parallel method implemented in MCBEND.
Full-wave simulations of ICRF heating regimes in toroidal plasmas with non-Maxwellian distribution functions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bertelli, N.; Valeo, E.J.; Green, D.L.

At the power levels required for significant heating and current drive in magnetically-confined toroidal plasma, modification of the particle distribution function from a Maxwellian shape is likely [T. H. Stix, Nucl. Fusion, 15 737 (1975)], with consequent changes in wave propagation and in the location and amount of absorption. In order to study these effects computationally, both the finite-Larmor-radius and the high-harmonic fast wave (HHFW) versions of the full-wave, hot-plasma toroidal simulation code TORIC [M. Brambilla, Plasma Phys. Control. Fusion 41, 1 (1999) and M. Brambilla, Plasma Phys. Control. Fusion 44, 2423 (2002)] have been extended to allow the prescription of arbitrary velocity distributions of the form f(v||, v_perp, psi, theta). For hydrogen (H) minority heating of a deuterium (D) plasma with anisotropic Maxwellian H distributions, the fractional H absorption varies significantly with changes in parallel temperature but is essentially independent of perpendicular temperature. On the other hand, for the HHFW regime with an anisotropic Maxwellian fast ion distribution, the fractional beam ion absorption varies mainly with changes in the perpendicular temperature. The evaluation of the wave field and power absorption, through the full-wave solver, with the ion distribution function provided by either a Monte-Carlo particle code or a Fokker-Planck code is also examined for Alcator C-Mod and NSTX plasmas. Non-Maxwellian effects generally tend to increase the absorption with respect to the equivalent Maxwellian distribution.
VVER-440 and VVER-1000 reactor dosimetry benchmark - BUGLE-96 versus ALPAN VII.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Duo, J. I.

2011-07-01

Document available in abstract form only, full text of document follows: Analytical results of the vodo-vodyanoi energetichesky reactor (VVER)-440 and VVER-1000 reactor dosimetry benchmarks developed from engineering mockups at the Nuclear Research Inst. Rez LR-0 reactor are discussed. These benchmarks provide accurate determination of radiation field parameters in the vicinity and over the thickness of the reactor pressure vessel. Measurements are compared to calculated results with two sets of tools: the TORT discrete ordinates code and BUGLE-96 cross-section library versus the newly Westinghouse-developed RAPTOR-M3G and ALPAN VII.0. The parallel code RAPTOR-M3G enables detailed neutron distributions in energy and space in reduced computational time. The ALPAN VII.0 cross-section library is based on ENDF/B-VII.0 and is designed for reactor dosimetry applications. It uses a unique broad group structure to enhance resolution in the thermal-neutron-energy range compared to other analogous libraries. The comparison of fast neutron (E > 0.5 MeV) results shows good agreement (within 10%) between the BUGLE-96 and ALPAN VII.0 libraries. Furthermore, the results compare well with analogous results of participants of the REDOS program (2005). Finally, the analytical results for fast neutrons agree within 15% with the measurements for most locations in all three mockups. In general, however, the analytical results underestimate the attenuation through the reactor pressure vessel thickness compared to the measurements. (authors)
Xyce Parallel Electronic Simulator Users' Guide Version 6.8

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. (2) A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. (3) Device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). (4) Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase (a message-passing parallel implementation), which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Parallel performance of TORT on the CRAY J90: Model and measurement

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnett, A.; Azmy, Y.Y.

1997-10-01

A limitation on the parallel performance of TORT on the CRAY J90 is the amount of extra work introduced by the multitasking algorithm itself. The extra work beyond that of the serial version of the code, called overhead, arises from the synchronization of the parallel tasks and the accumulation of results by the master task. The goal of recent updates to TORT was to reduce the time consumed by these activities. To help understand which components of the multitasking algorithm contribute significantly to the overhead, a parallel performance model was constructed and compared to measurements of actual timings of the code.
Boltzmann Transport Code Update: Parallelization and Integrated Design Updates

NASA Technical Reports Server (NTRS)

Heinbockel, J. H.; Nealy, J. E.; DeAngelis, G.; Feldman, G. A.; Chokshi, S.

2003-01-01

The ongoing effort to develop a web site for radiation analysis is expected to result in increased usage of the High Charge and Energy Transport Code HZETRN. It would be nice to be able to do the requested calculations quickly and efficiently. Therefore the question arose, "Could the implementation of parallel processing speed up the calculations required?" To answer this question, two modifications of the HZETRN computer code were created. The first modification selected the shield material of Al(2219), then polyethylene, and then Al(2219). The modified Fortran code was labeled 1SSTRN.F. The second modification considered the shield material of CO2 and Martian regolith. This modified Fortran code was labeled MARSTRN.F.
Development and Application of a Parallel LCAO Cluster Method

NASA Astrophysics Data System (ADS)

Patton, David C.

1997-08-01

CPU-intensive steps in the SCF electronic structure calculations of clusters and molecules with a first-principles LCAO method have been fully parallelized via a message passing paradigm. Identification of the parts of the code that are composed of many independent compute-intensive steps is discussed in detail, as they are the most readily parallelized. Most of the parallelization involves spatially decomposing numerical operations on a mesh. One exception is the solution of Poisson's equation, which relies on distribution of the charge density and multipole methods. The method we use to parallelize this part of the calculation is quite novel and is covered in detail. We present a general method for dynamically load-balancing a parallel calculation and discuss how we use this method in our code. The results of benchmark calculations of the IR and Raman spectra of PAH molecules such as anthracene (C_14H_10) and tetracene (C_18H_12) are presented. These benchmark calculations were performed on an IBM SP2 and a SUN Ultra HPC server with both MPI and PVM. Scalability and speedup for these calculations are analyzed to determine the efficiency of the code. In addition, performance and usage issues for MPI and PVM are presented.
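The abstract above does not describe its load-balancing method, so the following is only a generic sketch of dynamic load balancing through a shared work queue, compared with a fixed block partition. The function names and the task costs are hypothetical, and the scheduler is simulated rather than run on real processes.

```python
import heapq


def dynamic_schedule(task_costs, n_workers):
    """Simulate a shared work queue: each task goes to the worker that frees up first.

    Returns (makespan, per-worker busy time).
    """
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    busy = [0.0] * n_workers
    for cost in task_costs:
        t, w = heapq.heappop(free_at)
        busy[w] += cost
        heapq.heappush(free_at, (t + cost, w))
    return max(t for t, _ in free_at), busy


def static_schedule(task_costs, n_workers):
    """Baseline: contiguous blocks of tasks assigned to workers in advance."""
    busy = [0.0] * n_workers
    chunk = -(-len(task_costs) // n_workers)  # ceiling division
    for idx, cost in enumerate(task_costs):
        busy[idx // chunk] += cost
    return max(busy), busy


costs = [8, 7, 6, 5, 1, 1, 1, 1]  # hypothetical, deliberately uneven task costs
dyn_makespan, dyn_busy = dynamic_schedule(costs, n_workers=2)
sta_makespan, sta_busy = static_schedule(costs, n_workers=2)
```

With these costs the block partition leaves one worker with all four expensive tasks, while the work queue evens out the busy times. In a message-passing code the queue would typically live on a master process that hands out task indices on request.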
A visual parallel-BCI speller based on the time-frequency coding strategy.

PubMed

Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong

2014-04-01

Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper develops a visual parallel-BCI speller system based on a time-frequency coding strategy in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in parallel. The parallel-BCI speller consisted of four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, so that every character had a specific time-frequency code. To verify its effectiveness, 11 subjects took part in offline and online spelling. A classification strategy was designed to recognize the target character by jointly using canonical correlation analysis and stepwise linear discriminant analysis. Online spelling showed that the proposed parallel-BCI speller performed well, reaching a highest information transfer rate of 67.4 bit/min, with averages of 54.0 bit/min and 43.0 bit/min in the three-round and five-round conditions, respectively. The results indicated that the proposed parallel-BCI could be effectively controlled by users shifting attention fluently among the sub-spellers, and that it substantially improved BCI spelling performance.
Ion absorption of the high harmonic fast wave in the National Spherical Torus Experiment

NASA Astrophysics Data System (ADS)

Rosenberg, Adam Lewis

Ion absorption of the high harmonic fast wave in a spherical torus is of critical importance to assessing the viability of the wave as a means of heating and driving current. Analysis of recent NSTX shots has revealed that under some conditions when neutral beam and RF power are injected into the plasma simultaneously, a fast ion population with energy above the beam injection energy is sustained by the wave. In agreement with modeling, these experiments find the RF-induced fast ion tail strength and neutron rate at lower B-fields to be less enhanced, likely due to a larger β profile, which promotes greater off-axis absorption where the fast ion population is small. Ion loss codes find the increased loss fraction with decreased B insufficient to account for the changes in tail strength, providing further evidence that this is an RF interaction effect. Though greater ion absorption is predicted with lower k∥, surprisingly little variation in the tail was observed, along with a neutron rate enhancement with higher k∥. Data from the neutral particle analyzer, neutron detectors, x-ray crystal spectrometer, and Thomson scattering is presented, along with results from the TRANSP transport analysis code, ray-tracing codes HPRT and CURRAY, full-wave code and AORSA, quasilinear code CQL3D, and ion loss codes EIGOL and CONBEAM.
High-speed 3D surface measurement with a fringe projection based optical sensor

NASA Astrophysics Data System (ADS)

Bräuer-Burchardt, Christian; Heist, Stefan; Kühmstedt, Peter; Notni, Gunther

2014-05-01

A new optical sensor based on the fringe projection technique is introduced for accurate and fast measurement of object surfaces, mainly for industrial inspection tasks. High-speed fringe projection and image recording at 180 Hz allow 3D rates of up to 60 Hz. The high measurement speed was achieved by consistent fringe code reduction and parallel data processing. The image sequence length was reduced by omitting the Gray-code sequence and exploiting the geometric restrictions of the measurement objects. The sensor realizes three different measurement fields between 20 x 20 mm² and 40 x 40 mm² with lateral spatial resolutions between 10 μm and 20 μm at the same working distance. Measurement object height extension is between +/- 0.5 mm and +/- 2 mm. Height resolution between 1 μm and 5 μm can be achieved depending on the properties of the measurement objects. The sensor may be used, e.g., for quality inspection of conductor boards or plugs in real-time industrial applications.

Some Progress in Large-Eddy Simulation using the 3-D Vortex Particle Method

NASA Technical Reports Server (NTRS)

Winckelmans, G. S.

1995-01-01

This two-month visit at CTR was devoted to investigating possibilities in LES modeling in the context of the 3-D vortex particle method (also called the vortex element method, VEM) for unbounded flows. A dedicated code was developed for that purpose. Although O(N^2) and thus slow, it offers the advantage that it can easily be modified to try out many ideas on problems involving up to N ≈ 10^4 particles. Energy spectra (which require O(N^2) operations per wavenumber) are also computed. Progress was realized in the following areas: particle redistribution schemes, relaxation schemes to maintain the solenoidal condition on the particle vorticity field, simple LES models and their VEM extension, and possible new avenues in LES. Model problems that involve strong interaction between vortex tubes were computed, together with diagnostics: total vorticity, linear and angular impulse, energy and energy spectrum, and enstrophy. More work is needed, however, especially regarding relaxation schemes and further validation and development of LES models for VEM. Finally, what works well will eventually have to be incorporated into the fast parallel tree code.
Discrete Event-based Performance Prediction for Temperature Accelerated Dynamics

NASA Astrophysics Data System (ADS)

Junghans, Christoph; Mniszewski, Susan; Voter, Arthur; Perez, Danny; Eidenbenz, Stephan

2014-03-01

We present an example of a new class of tools that we call application simulators, parameterized fast-running proxies of large-scale scientific applications using parallel discrete event simulation (PDES). We demonstrate our approach with a TADSim application simulator that models the Temperature Accelerated Dynamics (TAD) method, which is an algorithmically complex member of the Accelerated Molecular Dynamics (AMD) family. The essence of the TAD application is captured without the computational expense and resource usage of the full code. We use TADSim to quickly characterize the runtime performance and algorithmic behavior for the otherwise long-running simulation code. We further extend TADSim to model algorithm extensions to standard TAD, such as speculative spawning of the compute-bound stages of the algorithm, and predict performance improvements without having to implement such a method. Focused parameter scans have allowed us to study algorithm parameter choices over far more scenarios than would be possible with the actual simulation. This has led to interesting performance-related insights into the TAD algorithm behavior and suggested extensions to the TAD method.
Fast non-overlapping Schwarz domain decomposition methods for solving the neutron diffusion equation

NASA Astrophysics Data System (ADS)

Jamelot, Erell; Ciarlet, Patrick

2013-05-01

Studying numerically the steady state of a nuclear reactor core is expensive in terms of memory storage and computational time. In order to address both requirements, one can use a domain decomposition method implemented on a parallel computer. We present here such a method for the mixed neutron diffusion equations, discretized with Raviart-Thomas-Nédélec finite elements. This method is based on the Schwarz iterative algorithm with Robin interface conditions to handle communications. We analyse this method from the continuous point of view to the discrete point of view, and we give some numerical results in a realistic, highly heterogeneous 3D configuration. Computations are carried out with the MINOS solver of the APOLLO3® neutronics code. APOLLO3 is a registered trademark in France.
A fast sorting algorithm for a hypersonic rarefied flow particle simulation on the connection machine

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1989-01-01

The data parallel implementation of a particle simulation for hypersonic rarefied flow described by Dagum associates a single parallel data element with each particle in the simulation. The simulated space is divided into discrete regions called cells containing a variable and constantly changing number of particles. The implementation requires a global sort of the parallel data elements so as to arrange them in an order that allows immediate access to the information associated with cells in the simulation. Described here is a very fast algorithm for performing the necessary ranking of the parallel data elements. The performance of the new algorithm is compared with that of the microcoded instruction for ranking on the Connection Machine.
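To make the ranking step concrete, here is a minimal serial sketch that ranks particles so that all particles in the same cell become contiguous. It uses a counting sort (histogram plus exclusive prefix sum), which is an assumption chosen for clarity and is not the algorithm of the paper or the Connection Machine's microcoded ranking instruction. The function and variable names are hypothetical.

```python
from itertools import accumulate


def rank_by_cell(cell_of_particle, n_cells):
    """Return (rank, start) for particles grouped by cell.

    rank[p] is the position of particle p in an ordering grouped by cell.
    The particles of cell c occupy ranks start[c] .. start[c + 1] - 1.
    """
    counts = [0] * n_cells
    for c in cell_of_particle:
        counts[c] += 1
    # Exclusive prefix sum of the counts gives the first slot of every cell.
    start = [0] + list(accumulate(counts))
    next_slot = start[:-1].copy()
    rank = []
    for c in cell_of_particle:
        rank.append(next_slot[c])
        next_slot[c] += 1
    return rank, start


cells = [2, 0, 2, 1, 0]  # hypothetical cell index of each of five particles
rank, start = rank_by_cell(cells, n_cells=3)
# rank == [3, 0, 4, 2, 1] and start == [0, 2, 3, 5]
```

With `start` available, the information for all particles of a given cell can be reached immediately as a contiguous range of ranks. On a data-parallel machine the histogram and the prefix sum would be carried out with parallel scan operations instead of the serial loops shown here.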
Aerodynamic simulation on massively parallel systems

NASA Technical Reports Server (NTRS)

Haeuser, Jochem; Simon, Horst D.

1992-01-01

This paper briefly addresses the computational requirements for the analysis of complete configurations of aircraft and spacecraft currently under design to be used for advanced transportation in commercial applications as well as in space flight. The discussion clearly shows that massively parallel systems are the only alternative which is both cost effective and on the other hand can provide the necessary TeraFlops, needed to satisfy the narrow design margins of modern vehicles. It is assumed that the solution of the governing physical equations, i.e., the Navier-Stokes equations which may be complemented by chemistry and turbulence models, is done on multiblock grids. This technique is situated between the fully structured approach of classical boundary fitted grids and the fully unstructured tetrahedra grids. A fully structured grid best represents the flow physics, while the unstructured grid gives best geometrical flexibility. The multiblock grid employed is structured within a block, but completely unstructured on the block level. While a completely unstructured grid is not straightforward to parallelize, the above mentioned multiblock grid is inherently parallel, in particular for multiple instruction multiple datastream (MIMD) machines. In this paper guidelines are provided for setting up or modifying an existing sequential code so that a direct parallelization on a massively parallel system is possible. Results are presented for three parallel systems, namely the Intel hypercube, the Ncube hypercube, and the FPS 500 system. Some preliminary results for an 8K CM2 machine will also be mentioned. The code run is the two dimensional grid generation module of Grid, which is a general two dimensional and three dimensional grid generation code for complex geometries. A system of nonlinear Poisson equations is solved. This code is also a good testcase for complex fluid dynamics codes, since the same datastructures are used. 
All systems provided good speedups, but message passing MIMD systems seem to be best suited for large multiblock applications.
Bilingual parallel programming

DOE Office of Scientific and Technical Information (OSTI.GOV)

Foster, I.; Overbeek, R.

1990-01-01

Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date (whether parallelizing compilers, language extensions, or new concurrent languages) seems to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach provides an effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.
Extreme Performance Scalable Operating Systems Final Progress Report (July 1, 2008 - October 31, 2011)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Malony, Allen D; Shende, Sameer

This is the final progress report for the FastOS (Phase 2) (FastOS-2) project with Argonne National Laboratory and the University of Oregon (UO). The project started at UO on July 1, 2008 and ran until April 30, 2010, at which time a six-month no-cost extension began. The FastOS-2 work at UO delivered excellent results in all research work areas: * scalable parallel monitoring * kernel-level performance measurement * parallel I/O system measurement * large-scale and hybrid application performance measurement * online scalable performance data reduction and analysis * binary instrumentation
MHD Code Optimizations and Jets in Dense Gaseous Halos

NASA Astrophysics Data System (ADS)

Gaibler, Volker; Vigelius, Matthias; Krause, Martin; Camenzind, Max

We have further optimized and extended the 3D-MHD-code NIRVANA. The magnetized part runs in parallel, reaching 19 Gflops per SX-6 node, and has a passively advected particle population. In addition, the code is now MPI-parallel, on top of the shared memory parallelization. On a 512^3 grid, we reach 561 Gflops with 32 nodes on the SX-8. Also, we have successfully used FLASH on the Opteron cluster. Scientific results are preliminary so far. We report one computation of highly resolved cocoon turbulence. While we find some similarities to earlier 2D work by us and others, we note a strange reluctance of cold material to enter the low density cocoon, which has to be investigated further.
MUTILS - a set of efficient modeling tools for multi-core CPUs implemented in MEX

NASA Astrophysics Data System (ADS)

Krotkiewski, Marcin; Dabrowski, Marcin

2013-04-01

The need for computational performance is common in scientific applications, and in particular in numerical simulations, where high resolution models require efficient processing of large amounts of data. Especially in the context of geological problems, the need to increase the model resolution to resolve physical and geometrical complexities seems to have no limits. Alas, the performance of new generations of CPUs no longer improves by simply increasing clock speeds. Current industrial trends are to increase the number of computational cores. As a result, parallel implementations are required in order to fully utilize the potential of new processors, and to study more complex models. We target simulations on small to medium scale shared memory computers: from laptops and desktop PCs with ~8 CPU cores and up to tens of GB of memory to high-end servers with ~50 CPU cores and hundreds of GB of memory. In this setting MATLAB is often the environment of choice for scientists who want to implement their own models with little effort. It is a useful general purpose mathematical software package, but due to its versatility some of its functionality is not as efficient as it could be. In particular, the challenges of modern multi-core architectures are not fully addressed. We have developed MILAMIN 2 - an efficient FEM modeling environment written in native MATLAB. Amongst others, MILAMIN provides functions to define model geometry, generate and convert structured and unstructured meshes (also through interfaces to external mesh generators), compute element and system matrices, apply boundary conditions, solve the system of linear equations, address non-linear and transient problems, and perform post-processing. MILAMIN strives to combine ease of code development with computational efficiency. Where possible, the code is optimized and/or parallelized within the MATLAB framework.
Native MATLAB is augmented with the MUTILS library - a set of MEX functions that implement the computationally intensive, performance critical parts of the code, which we have identified to be bottlenecks. Here, we discuss the functionality and performance of the MUTILS library. Currently, it includes:
1. time and memory efficient assembly of sparse matrices for FEM simulations
2. parallel sparse matrix - vector product with optimizations specific to symmetric matrices and multiple degrees of freedom per node
3. parallel point in triangle location and point in tetrahedron location for unstructured, adaptive 2D and 3D meshes (useful for 'marker in cell' type of methods)
4. parallel FEM interpolation for 2D and 3D meshes of elements of different types and orders, and for different number of degrees of freedom per node
5. a stand-alone, MEX implementation of the Conjugate Gradients iterative solver
6. interface to METIS graph partitioning and a fast implementation of RCM reordering
Accelerated Adaptive MGS Phase Retrieval

NASA Technical Reports Server (NTRS)

Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

2011-01-01

The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
Vector processing efficiency of plasma MHD codes by use of the FACOM 230-75 APU

NASA Astrophysics Data System (ADS)

Matsuura, T.; Tanaka, Y.; Naraoka, K.; Takizuka, T.; Tsunematsu, T.; Tokuda, S.; Azumi, M.; Kurita, G.; Takeda, T.

1982-06-01

In the framework of pipelined vector architecture, the efficiency of vector processing is assessed with respect to plasma MHD codes in nuclear fusion research. By using a vector processor, the FACOM 230-75 APU, the limit of the enhancement factor due to parallelism of current vector machines is examined for three numerical codes based on a fluid model. Reasonable speed-up factors of approximately 6, 6 and 4 times faster than the highly optimized scalar version are obtained for ERATO (linear stability code), AEOLUS-R1 (nonlinear stability code) and APOLLO (1-1/2D transport code), respectively. Problems of the pipelined vector processors are discussed from the viewpoint of restructuring, optimization and choice of algorithms. In conclusion, the important concept of "concurrency within pipelined parallelism" is emphasized.
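One common way to reason about a limit on the enhancement factor is Amdahl's law, which bounds the overall speed-up when only part of the run time is accelerated. The sketch below is a generic illustration and is not taken from the paper: the function name and the numbers in the usage example are hypothetical, and the abstract does not state what fraction of each code was vectorized.

```python
def amdahl_speedup(vector_fraction, vector_gain):
    """Overall speed-up when a fraction of the scalar run time runs vector_gain times faster.

    vector_fraction: share of the original (scalar) run time that is vectorized, in [0, 1].
    vector_gain: speed-up of the vectorized part alone (must be positive).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must lie in [0, 1]")
    if vector_gain <= 0.0:
        raise ValueError("vector_gain must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_gain)


# Hypothetical numbers: 90% of the run time vectorized, vector unit 10x faster.
example = amdahl_speedup(0.9, 10.0)
# Even an infinitely fast vector unit cannot beat 1 / (1 - 0.9) = 10 here.
ceiling = 1.0 / (1.0 - 0.9)
```

The formula makes clear why restructuring to raise the vectorized fraction often matters more than raw vector speed: the scalar remainder sets the ceiling.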
Efficient Parallel Formulations of Hierarchical Methods and Their Applications

NASA Astrophysics Data System (ADS)

Grama, Ananth Y.

1996-01-01

Hierarchical methods such as the Fast Multipole Method (FMM) and Barnes-Hut (BH) are used for rapid evaluation of potential (gravitational, electrostatic) fields in particle systems. They are also used for solving integral equations using boundary element methods. The linear systems arising from these methods are dense and are solved iteratively. Hierarchical methods reduce the complexity of the core matrix-vector product from O(n^2) to O(n log n) and the memory requirement from O(n^2) to O(n). We have developed highly scalable parallel formulations of a hybrid FMM/BH method that are capable of handling arbitrarily irregular distributions. We apply these formulations to astrophysical simulations of Plummer and Gaussian galaxies. We have used our parallel formulations to solve the integral form of the Laplace equation. We show that our parallel hierarchical mat-vecs yield high efficiency and overall performance even on relatively small problems. A problem containing approximately 200K nodes takes under a second to compute on 256 processors and yet yields over 85% efficiency. The efficiency and raw performance are expected to increase for bigger problems. For the 200K node problem, our code delivers about 5 GFLOPS of performance on a 256 processor T3D. This is impressive considering the fact that the problem has floating point divides and roots, and very little locality, resulting in poor cache performance. A dense matrix-vector product of the same dimensions would require about 0.5 TeraBytes of memory and about 770 TeraFLOPS of computing speed. Clearly, if the loss in accuracy resulting from the use of hierarchical methods is acceptable, our code yields significant savings in time and memory. We also study the convergence of a GMRES solver built around this mat-vec. We accelerate the convergence of the solver using three preconditioning techniques: diagonal scaling, block-diagonal preconditioning, and inner-outer preconditioning.
We study the performance and parallel efficiency of these preconditioned solvers. Using this solver, we solve dense linear systems with hundreds of thousands of unknowns. Solving a 105K unknown problem takes about 10 minutes on a 64 processor T3D. Until very recently, boundary element problems of this magnitude could not even be generated, let alone solved.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics

NASA Astrophysics Data System (ADS)

Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.

2015-12-01

Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning-fast Big Data technology called SciSpark based on Apache™ Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache™ Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesoscale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URLs (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems.
The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDFs, and first metrics quantifying parallel speedups and memory & disk usage.
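The ingest pattern described above (partition a time-sorted list of sources, let each node load its bundle, then combine partial results) can be sketched in plain Python. This is a stand-in under assumptions, not SciSpark or Spark code: the file names, the `load_variable` stub, and the helper names are all hypothetical, and the "nodes" are simply list partitions processed in turn.

```python
from functools import reduce


def partition_by_time(urls, n_partitions):
    """Split a time-sorted list into contiguous, nearly equal bundles."""
    size, extra = divmod(len(urls), n_partitions)
    bundles, start = [], 0
    for k in range(n_partitions):
        stop = start + size + (1 if k < extra else 0)
        bundles.append(urls[start:stop])
        start = stop
    return bundles


def load_variable(url):
    """Hypothetical loader: returns a one-element 'array' parsed from the name."""
    return [float(url[6:9])]


def partial_stats(bundle):
    """Map step: per-bundle (sum, count) over all loaded values."""
    values = [v for url in bundle for v in load_variable(url)]
    return (sum(values), len(values))


def combine(a, b):
    """Reduce step: merge two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


# Hypothetical time-sorted source list in the model.nc?varname style.
urls = [f"model_{t:03d}.nc?varname" for t in range(10)]
bundles = partition_by_time(urls, 3)
total, count = reduce(combine, [partial_stats(bundle) for bundle in bundles])
mean = total / count
```

Carrying (sum, count) pairs rather than per-bundle means keeps the reduce step associative, which is what allows partial results to be merged in any order across nodes.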
Fast hydrological model calibration based on the heterogeneous parallel computing accelerated shuffled complex evolution method

NASA Astrophysics Data System (ADS)

Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Hong, Yang; Zuo, Depeng; Ren, Minglei; Lei, Tianjie; Liang, Ke

2018-01-01

Hydrological model calibration has been a hot issue for decades. The shuffled complex evolution method developed at the University of Arizona (SCE-UA) has been proved to be an effective and robust optimization approach. However, its computational efficiency deteriorates significantly when the amount of hydrometeorological data increases. In recent years, the rise of heterogeneous parallel computing has brought hope for the acceleration of hydrological model calibration. This study proposed a parallel SCE-UA method and applied it to the calibration of a watershed rainfall-runoff model, the Xinanjiang model. The parallel method was implemented on heterogeneous computing systems using OpenMP and CUDA. Performance testing and sensitivity analysis were carried out to verify its correctness and efficiency. Comparison results indicated that heterogeneous parallel computing-accelerated SCE-UA converged much more quickly than the original serial version and possessed satisfactory accuracy and stability for the task of fast hydrological model calibration.
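The part of a population-based calibration that parallelizes most naturally is the evaluation of the objective function for many candidate parameter sets. The sketch below shows only that step, under loudly stated assumptions: a thread pool stands in for the OpenMP/CUDA implementation described in the abstract, the objective is a toy quadratic rather than the Xinanjiang model, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def objective(params):
    """Toy misfit (assumed): squared distance from the 'true' parameters (1, 2)."""
    a, b = params
    return (a - 1.0) ** 2 + (b - 2.0) ** 2


def evaluate_population(candidates, workers=4):
    """Evaluate every candidate concurrently and rank them by objective value."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(objective, candidates))  # map preserves input order
    return sorted(zip(scores, candidates))


candidates = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
ranked = evaluate_population(candidates)
best_score, best_params = ranked[0]
```

In a real calibration each objective call would run a full rainfall-runoff simulation, so the evaluations dominate the runtime and are independent of one another.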
A high-speed linear algebra library with automatic parallelism

NASA Technical Reports Server (NTRS)

Boucher, Michael L.

1994-01-01

Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.
Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Waheed, Abdul; Yan, Jerry

1998-01-01

This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
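The abstract does not reproduce the paper's performance model, so the following is a generic stand-in rather than the authors' formulation: an Amdahl-style speedup estimate with one additive overhead term (for example, time lost to poor data locality). The function name and the parameter values are assumptions chosen for illustration.

```python
def predicted_speedup(parallel_fraction, n_procs, overhead_fraction=0.0):
    """Amdahl-style estimate with runtime normalized to 1.0 for the serial code.

    overhead_fraction is extra time, as a fraction of the serial runtime,
    added by parallelization (assumed constant here).
    """
    serial_part = 1.0 - parallel_fraction
    parallel_time = serial_part + parallel_fraction / n_procs + overhead_fraction
    return 1.0 / parallel_time


ideal = predicted_speedup(0.95, 16)                               # no overhead
with_overhead = predicted_speedup(0.95, 16, overhead_fraction=0.05)
```

Even a small overhead term lowers the achievable speedup noticeably, which is consistent with the abstract's point that realized gains depend on minimizing locality overhead.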
Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.

The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it with respect to a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.
A new conformal absorbing boundary condition for finite element meshes and parallelization of FEMATS

NASA Technical Reports Server (NTRS)

Chatterjee, A.; Volakis, J. L.; Nguyen, J.; Nurnberger, M.; Ross, D.

1993-01-01

Some of the progress toward the development and parallelization of an improved version of the finite element code FEMATS is described. This is a finite element code for computing the scattering by arbitrarily shaped three dimensional surfaces composite scatterers. The following tasks were worked on during the report period: (1) new absorbing boundary conditions (ABC's) for truncating the finite element mesh; (2) mixed mesh termination schemes; (3) hierarchical elements and multigridding; (4) parallelization; and (5) various modeling enhancements (antenna feeds, anisotropy, and higher order GIBC).
Porting LAMMPS to GPUs.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, William Michael; Plimpton, Steven James; Wang, Peng

2010-03-01

LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. LAMMPS has potentials for soft materials (biomolecules, polymers) and solid-state materials (metals, semiconductors) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial-decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.
Modulation and coding for fast fading mobile satellite communication channels

NASA Technical Reports Server (NTRS)

Mclane, P. J.; Wittke, P. H.; Smith, W. S.; Lee, A.; Ho, P. K. M.; Loo, C.

1988-01-01

The performance of Gaussian baseband filtered minimum shift keying (GMSK) using differential detection in fast Rician fading, with a novel treatment of the inherent intersymbol interference (ISI) leading to an exact solution is discussed. Trellis-coded differentially coded phase shift keying (DPSK) with a convolutional interleaver is considered. The channel is the Rician Channel with the line-of-sight component subject to a lognormal transformation.

Hypercube matrix computation task

NASA Technical Reports Server (NTRS)

Calalo, R.; Imbriale, W.; Liewer, P.; Lyons, J.; Manshadi, F.; Patterson, J.

1987-01-01

The Hypercube Matrix Computation (Year 1986-1987) task investigated the applicability of a parallel computing architecture to the solution of large scale electromagnetic scattering problems. Two existing electromagnetic scattering codes were selected for conversion to the Mark III Hypercube concurrent computing environment. They were selected so that the underlying numerical algorithms utilized would be different thereby providing a more thorough evaluation of the appropriateness of the parallel environment for these types of problems. The first code was a frequency domain method of moments solution, NEC-2, developed at Lawrence Livermore National Laboratory. The second code was a time domain finite difference solution of Maxwell's equations to solve for the scattered fields. Once the codes were implemented on the hypercube and verified to obtain correct solutions by comparing the results with those from sequential runs, several measures were used to evaluate the performance of the two codes. First, a comparison was provided of the problem size possible on the hypercube with 128 megabytes of memory for a 32-node configuration with that available in a typical sequential user environment of 4 to 8 megabytes. Then, the performance of the codes was analyzed for the computational speedup attained by the parallel architecture.
4P: fast computing of population genetics statistics from large DNA polymorphism panels

PubMed Central

Benazzo, Andrea; Panziera, Alex; Bertorelle, Giorgio

2015-01-01

Massive DNA sequencing has significantly increased the amount of data available for population genetics and molecular ecology studies. However, the parallel computation of simple statistics within and between populations from large panels of polymorphic sites is not yet available, making the exploratory analyses of a set or subset of data a very laborious task. Here, we present 4P (parallel processing of polymorphism panels), a stand-alone software program for the rapid computation of genetic variation statistics (including the joint frequency spectrum) from millions of DNA variants in multiple individuals and multiple populations. It handles a standard input file format commonly used to store DNA variation from empirical or simulation experiments. The computational performance of 4P was evaluated using large SNP (single nucleotide polymorphism) datasets from human genomes or obtained by simulations. 4P was faster or much faster than other comparable programs, and the impact of parallel computing using multicore computers or servers was evident. 4P is a useful tool for biologists who need a simple and rapid computer program to run exploratory population genetics analyses in large panels of genomic data. It is also particularly suitable to analyze multiple data sets produced in simulation studies. Unix, Windows, and MacOs versions are provided, as well as the source code for easier pipeline implementations. PMID:25628874
Load Balancing Scientific Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pearce, Olga Tkachyshyn

2014-12-01

The largest supercomputers have millions of independent processors, and concurrency levels are rapidly increasing. For ideal efficiency, developers of the simulations that run on these machines must ensure that computational work is evenly balanced among processors. Assigning work evenly is challenging because many large modern parallel codes simulate behavior of physical systems that evolve over time, and their workloads change over time. Furthermore, the cost of imbalanced load increases with scale because most large-scale scientific simulations today use a Single Program Multiple Data (SPMD) parallel programming model, and an increasing number of processors will wait for the slowest one at the synchronization points. To address load imbalance, many large-scale parallel applications use dynamic load balance algorithms to redistribute work evenly. The research objective of this dissertation is to develop methods to decide when and how to load balance the application, and to balance it effectively and affordably. We measure and evaluate the computational load of the application, and develop strategies to decide when and how to correct the imbalance. Depending on the simulation, a fast, local load balance algorithm may be suitable, or a more sophisticated and expensive algorithm may be required. We developed a model for comparison of load balance algorithms for a specific state of the simulation that enables the selection of a balancing algorithm that will minimize overall runtime.
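The "when to rebalance" decision can be illustrated with a commonly used imbalance metric and a simple cost comparison. This is a sketch under assumptions and not the dissertation's model: it assumes per-processor loads stay constant between rebalancing steps, that rebalancing yields a perfectly even distribution, and that all names are hypothetical.

```python
def imbalance(loads):
    """Fractional imbalance: how far the slowest processor exceeds the mean."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean - 1.0


def should_rebalance(loads, steps_remaining, rebalance_cost):
    """Rebalance only if the projected time saved exceeds the one-off cost.

    In an SPMD code every step takes as long as the slowest processor,
    so the per-step saving is max(loads) minus the mean load.
    """
    mean = sum(loads) / len(loads)
    saved_per_step = max(loads) - mean
    return saved_per_step * steps_remaining > rebalance_cost


loads = [10.0, 10.0, 10.0, 18.0]          # seconds of work per step, per processor
current = imbalance(loads)                 # one processor is 50% above the mean
long_run = should_rebalance(loads, steps_remaining=100, rebalance_cost=50.0)
short_run = should_rebalance(loads, steps_remaining=5, rebalance_cost=50.0)
```

The same imbalance can justify rebalancing in a long run and not in a short one, which is why the decision depends on cost as well as on the measured load.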
Porting plasma physics simulation codes to modern computing architectures using the libmrc framework

NASA Astrophysics Data System (ADS)

Germaschewski, Kai; Abbott, Stephen

2015-11-01

Available computing power has continued to grow exponentially even after single-core performance saturated in the last decade. The increase has since been driven by more parallelism, both using more cores and having more parallelism in each core, e.g. in GPUs and Intel Xeon Phi. Adapting existing plasma physics codes is challenging, in particular as there is no single programming model that covers current and future architectures. We will introduce the open-source libmrc framework that has been used to modularize and port three plasma physics codes: The extended MHD code MRCv3 with implicit time integration and curvilinear grids; the OpenGGCM global magnetosphere model; and the particle-in-cell code PSC. libmrc consolidates basic functionality needed for simulations based on structured grids (I/O, load balancing, time integrators), and also introduces a parallel object model that makes it possible to maintain multiple implementations of computational kernels, on e.g. conventional processors and GPUs. It handles data layout conversions and enables us to port performance-critical parts of a code to a new architecture step-by-step, while the rest of the code can remain unchanged. We will show examples of the performance gains and some physics applications.
Runtime Detection of C-Style Errors in UPC Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pirkelbauer, P; Liao, C; Panas, T

2011-09-29

Unified Parallel C (UPC) extends the C programming language (ISO C 99) with explicit parallel programming support for the partitioned global address space (PGAS), which provides a global memory space with localized partitions to each thread. Like its ancestor C, UPC is a low-level language that emphasizes code efficiency over safety. The absence of dynamic (and static) safety checks allows programmer oversights and software flaws that can be hard to spot. In this paper, we present an extension of a dynamic analysis tool, ROSE-Code Instrumentation and Runtime Monitor (ROSE-CIRM), for UPC to help programmers find C-style errors involving the global address space. Built on top of the ROSE source-to-source compiler infrastructure, the tool instruments source files with code that monitors operations and keeps track of changes to the system state. The resulting code is linked to a runtime monitor that observes the program execution and finds software defects. We describe the extensions to ROSE-CIRM that were necessary to support UPC. We discuss complications that arise from parallel code and our solutions. We test ROSE-CIRM against a runtime error detection test suite, and present performance results obtained from running error-free codes. ROSE-CIRM is released as part of the ROSE compiler under a BSD-style open source license.
Mechanic: The MPI/HDF code framework for dynamical astronomy

NASA Astrophysics Data System (ADS)

Słonina, Mariusz; Goździewski, Krzysztof; Migaszewski, Cezary

2015-01-01

We introduce the Mechanic, a new open-source code framework. It is designed to reduce the development effort of scientific applications by providing a unified API (Application Programming Interface) for configuration, data storage and task management. The communication layer is based on the well-established Message Passing Interface (MPI) standard, which is widely used on a variety of parallel computers and CPU clusters. The data storage is performed within the Hierarchical Data Format (HDF5). The design of the code follows a core-module approach, which reduces the user's codebase and makes it portable across single- and multi-CPU environments. The framework may be used in a local user's environment, without administrative access to the cluster, under the PBS or Slurm job schedulers. It may become a helper tool for a wide range of astronomical applications, particularly those focused on processing large data sets, such as dynamical studies of long-term orbital evolution of planetary systems with Monte Carlo methods, dynamical maps or evolutionary algorithms. It has already been applied in numerical experiments conducted for the Kepler-11 (Migaszewski et al., 2012) and ν Octantis (Goździewski et al., 2013) planetary systems. In this paper we describe the basics of the framework, including code listings for the implementation of a sample user's module. The code is illustrated on a model Hamiltonian introduced by Froeschlé et al. (2000) presenting the Arnold diffusion. The Arnold web is shown with the help of the MEGNO (Mean Exponential Growth of Nearby Orbits) fast indicator (Goździewski et al., 2008a) applied to the symplectic SABAn integrator family (Laskar and Robutel, 2001).
High-performance computational fluid dynamics: a custom-code approach

NASA Astrophysics Data System (ADS)

Fannon, James; Loiseau, Jean-Christophe; Valluri, Prashant; Bethune, Iain; Náraigh, Lennon Ó.

2016-07-01

We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier-Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computational fluid dynamics (CFDs) and parallel computing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallel computing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFDs, while also providing insight for those interested in more general aspects of high-performance computing.
Cyclotron resonant scattering feature simulations. II. Description of the CRSF simulation process

NASA Astrophysics Data System (ADS)

Schwarm, F.-W.; Ballhausen, R.; Falkner, S.; Schönherr, G.; Pottschmidt, K.; Wolff, M. T.; Becker, P. A.; Fürst, F.; Marcu-Cheatham, D. M.; Hemphill, P. B.; Sokolova-Lapa, E.; Dauser, T.; Klochkov, D.; Ferrigno, C.; Wilms, J.

2017-05-01

Context. Cyclotron resonant scattering features (CRSFs) are formed by scattering of X-ray photons off quantized plasma electrons in the strong magnetic field (of the order 10^12 G) close to the surface of an accreting X-ray pulsar. Due to the complex scattering cross-sections, the line profiles of CRSFs cannot be described by an analytic expression. Numerical methods, such as Monte Carlo (MC) simulations of the scattering processes, are required in order to predict precise line shapes for a given physical setup, which can be compared to observations to gain information about the underlying physics in these systems. Aims: A versatile simulation code is needed for the generation of synthetic cyclotron lines. Sophisticated geometries should be investigatable by making their simulation possible for the first time. Methods: The simulation utilizes the mean free path tables described in the first paper of this series for the fast interpolation of propagation lengths. The code is parallelized to make the very time-consuming simulations possible on convenient time scales. Furthermore, it can generate responses to monoenergetic photon injections, producing Green's functions, which can be used later to generate spectra for arbitrary continua. Results: We develop a new simulation code to generate synthetic cyclotron lines for complex scenarios, allowing for unprecedented physical interpretation of the observed data. An associated XSPEC model implementation is used to fit synthetic line profiles to NuSTAR data of Cep X-4. The code has been developed with the main goal of overcoming previous geometrical constraints in MC simulations of CRSFs. By applying this code also to simpler, classic geometries used in previous works, we furthermore address issues of code verification and cross-comparison of various models. The XSPEC model and the Green's function tables are available online (see link in footnote, page 1).
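The way a Green's function table turns an arbitrary continuum into an emergent spectrum can be shown with a small discrete example. This is a toy sketch under assumptions, not the published code or its table format: row i of the hypothetical table `green` holds the simulated response to photons injected in energy bin i, the rows are assumed to conserve photon number, and the three-bin values are invented for illustration.

```python
def spectrum_from_green(continuum, green):
    """Fold a binned continuum through a table of monoenergetic responses.

    continuum[i] : photons injected in energy bin i
    green[i][j]  : fraction of bin-i photons that emerge in bin j
    """
    n_out = len(green[0])
    emergent = [0.0] * n_out
    for i, injected in enumerate(continuum):
        for j in range(n_out):
            emergent[j] += injected * green[i][j]
    return emergent


# Invented 3-bin table: only the middle (resonant) bin redistributes photons.
green = [[1.0, 0.0, 0.0],
         [0.2, 0.6, 0.2],
         [0.0, 0.0, 1.0]]
flat_continuum = [10.0, 10.0, 10.0]
emergent = spectrum_from_green(flat_continuum, green)
```

Because the table is computed once, any continuum shape can be folded through it afterwards without rerunning the Monte Carlo simulation.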
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.
FBCOT: a fast block coding option for JPEG 2000

NASA Astrophysics Data System (ADS)

Taubman, David; Naman, Aous; Mathew, Reji

2017-09-01

Based on the EBCOT algorithm, JPEG 2000 finds application in many fields, including high performance scientific, geospatial and video coding applications. Beyond digital cinema, JPEG 2000 is also attractive for low-latency video communications. The main obstacle for some of these applications is the relatively high computational complexity of the block coder, especially at high bit-rates. This paper proposes a drop-in replacement for the JPEG 2000 block coding algorithm, achieving much higher encoding and decoding throughputs, with only modest loss in coding efficiency (typically < 0.5dB). The algorithm provides only limited quality/SNR scalability, but offers truly reversible transcoding to/from any standard JPEG 2000 block bit-stream. The proposed FAST block coder can be used with EBCOT's post-compression RD-optimization methodology, allowing a target compressed bit-rate to be achieved even at low latencies, leading to the name FBCOT (Fast Block Coding with Optimized Truncation).
Research on the Application of Fast-steering Mirror in Stellar Interferometer

NASA Astrophysics Data System (ADS)

Mei, R.; Hu, Z. W.; Xu, T.; Sun, C. S.

2017-07-01

For a stellar interferometer, the fast-steering mirror (FSM) is widely utilized to correct wavefront tilt caused by atmospheric turbulence and internal instrumental vibration due to its high resolution and fast response frequency. In this study, the non-coplanar error between the FSM and actuator deflection axis introduced by manufacture, assembly, and adjustment is analyzed. Via a numerical method, the additional optical path difference (OPD) caused by above factors is studied, and its effects on tracking accuracy of stellar interferometer are also discussed. On the other hand, the starlight parallelism between the beams of two arms is one of the main factors of the loss of fringe visibility. By analyzing the influence of wavefront tilt caused by the atmospheric turbulence on fringe visibility, a simple and efficient real-time correction scheme of starlight parallelism is proposed based on a single array detector. The feasibility of this scheme is demonstrated by laboratory experiment. The results show that starlight parallelism meets the requirement of stellar interferometer in wavefront tilt preliminarily after the correction of fast-steering mirror.
Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond

2016-04-15

Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
NDL-v2.0: A new version of the numerical differentiation library for parallel architectures

NASA Astrophysics Data System (ADS)

Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.

2014-07-01

We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N^2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm.
180(2009)1404 Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. 
In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. 
It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10 ms. References: [1] P. Angelikopoulos, C. Paradimitriou, P. 
Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
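The solution method described in this record (finite differencing with a step that balances truncation and round-off errors, with the independent function evaluations run as parallel tasks) can be illustrated with a minimal sketch. This is not the NDL/PNDL interface: the names `central_gradient` and `rosenbrock`, the step rule h = eps^(1/3) * max(|x_i|, 1), and the use of a Python thread pool are assumptions made only for illustration.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def central_gradient(f, x, workers=4):
    """Estimate the gradient of f at x with central differences.

    Hypothetical helper, not part of NDL. The 2*N function evaluations
    are independent of each other, so each one is submitted as a task.
    """
    eps = sys.float_info.epsilon
    points, steps = [], []
    for i, xi in enumerate(x):
        # Assumed step rule: for a central difference the truncation error
        # grows like h**2 and the round-off error like eps/h, so a step
        # proportional to eps**(1/3) roughly minimizes their sum.
        h = eps ** (1.0 / 3.0) * max(abs(xi), 1.0)
        plus, minus = list(x), list(x)
        plus[i] = xi + h
        minus[i] = xi - h
        points.extend([plus, minus])
        steps.append(h)
    # Threads only illustrate the task structure here; an expensive
    # simulation would more likely use processes or MPI ranks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(f, points))
    return [(values[2 * i] - values[2 * i + 1]) / (2.0 * steps[i])
            for i in range(len(x))]


def rosenbrock(p):
    """Stand-in for an expensive simulation code."""
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x * x) ** 2


grad = central_gradient(rosenbrock, [1.5, 2.0])
```

For this test function the analytical gradient at (1.5, 2.0) is (151, -50), which gives a quick check on the estimate.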
EUPDF-II: An Eulerian Joint Scalar Monte Carlo PDF Module : User's Manual

NASA Technical Reports Server (NTRS)

Raju, M. S.; Liu, Nan-Suey (Technical Monitor)

2004-01-01

EUPDF-II provides the solution for the species and temperature fields based on an evolution equation for PDF (Probability Density Function) and it is developed mainly for application with sprays, combustion, parallel computing, and unstructured grids. It is designed to be massively parallel and could easily be coupled with any existing gas-phase CFD and spray solvers. The solver accommodates the use of an unstructured mesh with mixed elements of either triangular, quadrilateral, and/or tetrahedral type. The manual provides the user with an understanding of the various models involved in the PDF formulation, its code structure and solution algorithm, and various other issues related to parallelization and its coupling with other solvers. The source code of EUPDF-II will be available with National Combustion Code (NCC) as a complete package.
High Performance Fortran for Aerospace Applications

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Zima, Hans; Bushnell, Dennis M. (Technical Monitor)

2000-01-01

This paper focuses on the use of High Performance Fortran (HPF) for important classes of algorithms employed in aerospace applications. HPF is a set of Fortran extensions designed to provide users with a high-level interface for programming data parallel scientific applications, while delegating to the compiler/runtime system the task of generating explicitly parallel message-passing programs. We begin by providing a short overview of the HPF language. This is followed by a detailed discussion of the efficient use of HPF for applications involving multiple structured grids such as multiblock and adaptive mesh refinement (AMR) codes as well as unstructured grid codes. We focus on the data structures and computational structures used in these codes and on the high-level strategies that can be expressed in HPF to optimally exploit the parallelism in these algorithms.
Simulation of Hypervelocity Impact on Aluminum-Nextel-Kevlar Orbital Debris Shields

NASA Technical Reports Server (NTRS)

Fahrenthold, Eric P.

2000-01-01

An improved hybrid particle-finite element method has been developed for hypervelocity impact simulation. The method combines the general contact-impact capabilities of particle codes with the true Lagrangian kinematics of large strain finite element formulations. Unlike some alternative schemes which couple Lagrangian finite element models with smooth particle hydrodynamics, the present formulation makes no use of slidelines or penalty forces. The method has been implemented in a parallel, three dimensional computer code. Simulations of three dimensional orbital debris impact problems using this parallel hybrid particle-finite element code show good agreement with experiment and good speedup in parallel computation. The simulations included single and multi-plate shields as well as aluminum and composite shielding materials, at an impact velocity of eleven kilometers per second.
PELEC

DOE Office of Scientific and Technical Information (OSTI.GOV)

2017-05-17

PeleC is an adaptive-mesh compressible hydrodynamics code for reacting flows. It solves the compressible Navier-Stokes equations with multispecies transport in a block structured framework. The resulting algorithm is well suited for flows with localized resolution requirements and robust to discontinuities. User controllable refinement criteria have the potential to result in extremely small numerical dissipation and dispersion, making this code appropriate for both research and applied usage. The code is built on the AMReX library which facilitates hierarchical parallelism and manages distributed memory parallelism. PeleC algorithms are implemented to express shared memory parallelism.
Understanding the Cray X1 System

NASA Technical Reports Server (NTRS)

Cheung, Samson

2004-01-01

This paper helps the reader understand the characteristics of the Cray X1 vector supercomputer system, and provides hints and information to enable the reader to port codes to the system. It provides a comparison between the basic performance of the X1 platform and other platforms that are available at NASA Ames Research Center. A set of codes, solving the Laplacian equation with different parallel paradigms, is used to understand some features of the X1 compiler. An example code from the NAS Parallel Benchmarks is used to demonstrate performance optimization on the X1 platform.
Composing Data Parallel Code for a SPARQL Graph Engine

DOE Office of Scientific and Technical Information (OSTI.GOV)

Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste

Big data analytics processes large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
Hybrid parallelization of the XTOR-2F code for the simulation of two-fluid MHD instabilities in tokamaks

NASA Astrophysics Data System (ADS)

Marx, Alain; Lütjens, Hinrich

2017-03-01

A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it makes it possible to perform simulations at higher resolutions, which were previously ruled out by memory limitations.

Evaluating the performance of parallel subsurface simulators: An illustrative example with PFLOTRAN

PubMed Central

Hammond, G E; Lichtner, P C; Mills, R T

2014-01-01

To better inform the subsurface scientist on the expected performance of parallel simulators, this work investigates performance of the reactive multiphase flow and multicomponent biogeochemical transport code PFLOTRAN as it is applied to several realistic modeling scenarios run on the Jaguar supercomputer. After a brief introduction to the code's parallel layout and code design, PFLOTRAN's parallel performance (measured through strong and weak scalability analyses) is evaluated in the context of conceptual model layout, software and algorithmic design, and known hardware limitations. PFLOTRAN scales well (with regard to strong scaling) for three realistic problem scenarios: (1) in situ leaching of copper from a mineral ore deposit within a 5-spot flow regime, (2) transient flow and solute transport within a regional doublet, and (3) a real-world problem involving uranium surface complexation within a heterogeneous and extremely dynamic variably saturated flow field. Weak scalability is discussed in detail for the regional doublet problem, and several difficulties with its interpretation are noted. PMID:25506097
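Strong and weak scalability, the two measures named in this abstract, are commonly summarized as parallel efficiencies. The sketch below assumes the usual textbook definitions; the function names and the timings in the example are hypothetical and are not PFLOTRAN results.

```python
def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """Efficiency for a fixed total problem size.

    Ideal behavior: run time shrinks in proportion to the process count,
    so t_p == t_base * p_base / p gives an efficiency of 1.0.
    """
    return (t_base * p_base) / (t_p * p)


def weak_scaling_efficiency(t_base, t_p):
    """Efficiency for a fixed amount of work per process.

    Ideal behavior: run time stays constant as processes (and total
    problem size) grow together, so t_p == t_base gives 1.0.
    """
    return t_base / t_p


# Hypothetical timings in seconds, for illustration only.
strong = strong_scaling_efficiency(t_base=100.0, p_base=1, t_p=30.0, p=4)
weak = weak_scaling_efficiency(t_base=10.0, t_p=12.5)
```

With these made-up numbers the strong-scaling efficiency is 100/120, about 0.83, and the weak-scaling efficiency is 0.8.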
Evaluating the performance of parallel subsurface simulators: An illustrative example with PFLOTRAN.

PubMed

Hammond, G E; Lichtner, P C; Mills, R T

2014-01-01

To better inform the subsurface scientist on the expected performance of parallel simulators, this work investigates performance of the reactive multiphase flow and multicomponent biogeochemical transport code PFLOTRAN as it is applied to several realistic modeling scenarios run on the Jaguar supercomputer. After a brief introduction to the code's parallel layout and code design, PFLOTRAN's parallel performance (measured through strong and weak scalability analyses) is evaluated in the context of conceptual model layout, software and algorithmic design, and known hardware limitations. PFLOTRAN scales well (with regard to strong scaling) for three realistic problem scenarios: (1) in situ leaching of copper from a mineral ore deposit within a 5-spot flow regime, (2) transient flow and solute transport within a regional doublet, and (3) a real-world problem involving uranium surface complexation within a heterogeneous and extremely dynamic variably saturated flow field. Weak scalability is discussed in detail for the regional doublet problem, and several difficulties with its interpretation are noted.
Parallelization of an Object-Oriented Unstructured Aeroacoustics Solver

NASA Technical Reports Server (NTRS)

Baggag, Abdelkader; Atkins, Harold; Oezturan, Can; Keyes, David

1999-01-01

A computational aeroacoustics code based on the discontinuous Galerkin method is ported to several parallel platforms using MPI. The discontinuous Galerkin method is a compact high-order method that retains its accuracy and robustness on non-smooth unstructured meshes. In its semi-discrete form, the discontinuous Galerkin method can be combined with explicit time marching methods making it well suited to time accurate computations. The compact nature of the discontinuous Galerkin method also makes it well suited for distributed memory parallel platforms. The original serial code was written using an object-oriented approach and was previously optimized for cache-based machines. The port to parallel platforms was achieved simply by treating partition boundaries as a type of boundary condition. Code modifications were minimal because boundary conditions were abstractions in the original program. Scalability results are presented for the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. Slightly superlinear speedup is achieved on a fixed-size problem on the Origin, due to cache effects.
Fast I/O for Massively Parallel Applications

NASA Technical Reports Server (NTRS)

OKeefe, Matthew T.

1996-01-01

The two primary goals for this report were the design, construction and modeling of parallel disk arrays for scientific visualization and animation, and a study of the I/O requirements of highly parallel applications. In addition, further work was carried out on the parallel display systems required to project and animate the very high-resolution frames resulting from our supercomputing simulations in ocean circulation and compressible gas dynamics.
A visual parallel-BCI speller based on the time-frequency coding strategy

NASA Astrophysics Data System (ADS)

Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong

2014-04-01

Objective. Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper develops a visual parallel-BCI speller system based on the time-frequency coding strategy, in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. Approach. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, so that every character had a specific time-frequency code. To verify its effectiveness, 11 subjects took part in offline and online spelling sessions. A classification strategy was designed to recognize the target character by jointly using canonical correlation analysis and stepwise linear discriminant analysis. Main results. Online spelling showed that the proposed parallel-BCI speller had a high performance, reaching a highest information transfer rate of 67.4 bit min^-1, with averages of 54.0 bit min^-1 and 43.0 bit min^-1 in the three rounds and five rounds, respectively. Significance. The results indicated that the proposed parallel-BCI could be effectively controlled by users, with attention shifting fluently among the sub-spellers, and that it greatly improved BCI spelling performance.
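The information transfer rates quoted in this abstract are given in bits per minute. A widely used definition in BCI work is the Wolpaw formula; the sketch below assumes that definition, which the abstract itself does not state, and uses hypothetical argument names and example values rather than figures from the study.

```python
import math


def itr_bits_per_min(n_targets, accuracy, seconds_per_selection):
    """Wolpaw information transfer rate in bits per minute.

    n_targets: number of selectable characters (N >= 2).
    accuracy: probability P of a correct selection, 0 < P <= 1.
    seconds_per_selection: time needed for one selection.
    """
    n, p = n_targets, accuracy
    bits = math.log2(n)
    if p < 1.0:
        # The error terms vanish when P == 1, so they are only added
        # otherwise; this also avoids evaluating log2(0).
        bits += p * math.log2(p) + (1.0 - p) * math.log2((1.0 - p) / (n - 1))
    return bits * 60.0 / seconds_per_selection


# Hypothetical values: 4 targets, perfect accuracy, one selection per minute.
perfect = itr_bits_per_min(4, 1.0, 60.0)
# Two targets selected at chance level carry no information.
chance = itr_bits_per_min(2, 0.5, 60.0)
```

Under this definition the rate rises with the number of targets and the accuracy, and falls as each selection takes longer.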
Retargeting of existing FORTRAN program and development of parallel compilers

NASA Technical Reports Server (NTRS)

Agrawal, Dharma P.

1988-01-01

The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: flexible granularity model, which allows a compromise between two extreme granularity models; communication model, which is capable of precisely describing the interprocessor communication timings and patterns; loop type detection strategy, which identifies different types of loops; critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with some associated communication costs; and loop allocation strategy, which realizes optimum overlapped operations between computation and communication of the system. Using these models, several sample routines of the AIR3D package are examined and tested. The automatically generated codes are highly parallelized to maximize the degree of parallelism, and obtain speedup on systems of up to 28 to 32 processors. A comparison of parallel codes for both the existing and proposed communication models is performed and the corresponding expected speedup factors are obtained. The experimentation shows that the B-HIVE compiler produces more efficient codes than existing techniques. Work is progressing well in completing the final phase of the compiler. Numerous enhancements are needed to improve the capabilities of the parallelizing compiler.
A transient FETI methodology for large-scale parallel implicit computations in structural mechanics

NASA Technical Reports Server (NTRS)

Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier

1992-01-01

Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
Hamming and Accumulator Codes Concatenated with MPSK or QAM

NASA Technical Reports Server (NTRS)

Divsalar, Dariush; Dolinar, Samuel

2009-01-01

In a proposed coding-and-modulation scheme, a high-rate binary data stream would be processed as follows: 1. The input bit stream would be demultiplexed into multiple bit streams. 2. The multiple bit streams would be processed simultaneously into a high-rate outer Hamming code that would comprise multiple short constituent Hamming codes - a distinct constituent Hamming code for each stream. 3. The streams would be interleaved. The interleaver would have a block structure that would facilitate parallelization for high-speed decoding. 4. The interleaved streams would be further processed simultaneously into an inner two-state, rate-1 accumulator code that would comprise multiple constituent accumulator codes - a distinct accumulator code for each stream. 5. The resulting bit streams would be mapped into symbols to be transmitted by use of a higher-order modulation - for example, M-ary phase-shift keying (MPSK) or quadrature amplitude modulation (QAM). The novelty of the scheme lies in the concatenation of the multiple-constituent Hamming and accumulator codes and the corresponding parallel architectures of the encoder and decoder circuitry (see figure) needed to process the multiple bit streams simultaneously. As in the cases of other parallel-processing schemes, one advantage of this scheme is that the overall data rate could be much greater than the data rate of each encoder and decoder stream and, hence, the encoder and decoder could handle data at an overall rate beyond the capability of the individual encoder and decoder circuits.
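The five encoder steps listed in this abstract can be sketched as a toy pipeline. Everything specific here is an assumption made for illustration: the (7,4) Hamming code stands in for the unspecified short constituent Hamming codes, the column-wise block interleaver stands in for the unspecified interleaver, and the symbol mapper only groups bits into symbol indices without producing MPSK or QAM waveforms. All function names are hypothetical.

```python
def demultiplex(bits, n_streams):
    """Step 1: split one bit stream round-robin into n_streams streams."""
    return [bits[i::n_streams] for i in range(n_streams)]


def hamming74_encode(nibble):
    """Encode 4 data bits as the 7-bit codeword p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def outer_encode(stream):
    """Step 2: apply the constituent Hamming code to one stream."""
    out = []
    for i in range(0, len(stream), 4):
        out.extend(hamming74_encode(stream[i:i + 4]))
    return out


def block_interleave(streams):
    """Step 3: read the streams column by column, then re-split them."""
    cols = len(streams[0])
    flat = [row[c] for c in range(cols) for row in streams]
    return [flat[k * cols:(k + 1) * cols] for k in range(len(streams))]


def accumulate(stream):
    """Step 4: two-state, rate-1 accumulator (running XOR of the input)."""
    out, state = [], 0
    for bit in stream:
        state ^= bit
        out.append(state)
    return out


def map_to_symbols(streams, bits_per_symbol):
    """Step 5: group bits into integer symbol indices (no waveform)."""
    flat = [b for s in streams for b in s]
    symbols = []
    for i in range(0, len(flat) - bits_per_symbol + 1, bits_per_symbol):
        index = 0
        for bit in flat[i:i + bits_per_symbol]:
            index = (index << 1) | bit
        symbols.append(index)
    return symbols


# The input length must be a multiple of 4 * n_streams in this toy version.
data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
streams = demultiplex(data, 4)
coded = [outer_encode(s) for s in streams]
inner = [accumulate(s) for s in block_interleave(coded)]
symbols = map_to_symbols(inner, bits_per_symbol=2)  # 2 bits per symbol, M = 4
```

Each per-stream call in steps 2 and 4 is independent of the others, which is the property the abstract relies on for parallel encoder and decoder circuits.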
Turbo Trellis Coded Modulation With Iterative Decoding for Mobile Satellite Communications

NASA Technical Reports Server (NTRS)

Divsalar, D.; Pollara, F.

1997-01-01

In this paper, analytical bounds on the performance of parallel concatenation of two codes, known as turbo codes, and serial concatenation of two codes over fading channels are obtained. Based on this analysis, design criteria for the selection of component trellis codes for MPSK modulation, and a suitable bit-by-bit iterative decoding structure are proposed. Examples are given for throughput of 2 bits/sec/Hz with 8PSK modulation. The parallel concatenation example uses two rate 4/5 8-state convolutional codes with two interleavers. The convolutional codes' outputs are then mapped to two 8PSK modulations. The serial concatenated code example uses an 8-state outer code with rate 4/5 and a 4-state inner trellis code with 5 inputs and 2 x 8PSK outputs per trellis branch. Based on the above mentioned design criteria for fading channels, a method to obtain the structure of the trellis code with maximum diversity is proposed. Simulation results are given for AWGN and an independent Rayleigh fading channel with perfect Channel State Information (CSI).
Solutions of large-scale electromagnetics problems involving dielectric objects with the parallel multilevel fast multipole algorithm.

PubMed

Ergül, Özgür

2011-11-01

Fast and accurate solutions of large-scale electromagnetics problems involving homogeneous dielectric objects are considered. Problems are formulated with the electric and magnetic current combined-field integral equation and discretized with the Rao-Wilton-Glisson functions. Solutions are performed iteratively by using the multilevel fast multipole algorithm (MLFMA). For the solution of large-scale problems discretized with millions of unknowns, MLFMA is parallelized on distributed-memory architectures using a rigorous technique, namely, the hierarchical partitioning strategy. Efficiency and accuracy of the developed implementation are demonstrated on very large problems involving as many as 100 million unknowns.
Effective Vectorization with OpenMP 4.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor's potential performance that, if SIMD is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.
Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM

NASA Astrophysics Data System (ADS)

Rodriguez, M.; Brualla, L.

2018-04-01

Monte Carlo simulation of radiation transport is computationally demanding to obtain reasonably low statistical uncertainties of the estimated quantities. Therefore, it can benefit to a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated core architecture (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of a double 12-core Xeon processor in Monte Carlo simulation of coupled electron-photon showers. The comparison was made twofold, first, through a suite of basic tests including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU. These tests were addressed to establish a baseline comparison between both devices. Secondly, through the pDPM code developed in this work. pDPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques addressed to obtain a large scalability on the Xeon Phi were implemented in pDPM. Maximum scalabilities of 84.2× and 107.5× were obtained in the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport the Xeon Phi performed better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU owes to the low performance of the single core of the former. A single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.
Parallel Monte Carlo transport modeling in the context of a time-dependent, three-dimensional multi-physics code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Procassini, R.J.

1997-12-31

The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution of particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.
Implementation of parallel moment equations in NIMROD

NASA Astrophysics Data System (ADS)

Lee, Hankyu Q.; Held, Eric D.; Ji, Jeong-Young

2017-10-01

As collisionality is low (the Knudsen number is large) in many plasma applications, kinetic effects become important, particularly in parallel dynamics for magnetized plasmas. Fluid models can capture some kinetic effects when integral parallel closures are adopted. The adiabatic and linear approximations are used in solving general moment equations to obtain the integral closures. In this work, we present an effort to incorporate non-adiabatic (time-dependent) and nonlinear effects into parallel closures. Instead of analytically solving the approximate moment system, we implement exact parallel moment equations in the NIMROD fluid code. The moment code is expected to provide a natural convergence scheme by increasing the number of moments. Work in collaboration with the PSI Center and supported by the U.S. DOE under Grant Nos. DE-SC0014033, DE-SC0016256, and DE-FG02-04ER54746.
The Yambo code: a comprehensive tool to perform ab-initio simulations of equilibrium and out-of-equilibrium properties

NASA Astrophysics Data System (ADS)

Marini, Andrea

Density functional theory and many-body perturbation theory methods (such as GW and the Bethe-Salpeter equation) are standard approaches to the equilibrium ground and excited state properties of condensed matter systems, surfaces, molecules and several other kinds of materials. At the same time, ultra-fast optical spectroscopy is becoming a widely used and powerful tool for the observation of out-of-equilibrium dynamical processes. In this case the theoretical tools (such as the Baym-Kadanoff equation) are well known but have only recently been merged with the ab-initio approach. For this reason, highly parallel and efficient codes are lacking. Nevertheless, the combination of these two areas of research represents, for the ab-initio community, a challenging perspective, as it requires the development of advanced theoretical, methodological and numerical tools. Yambo is a popular community software implementing the above methods using plane waves and pseudo-potentials. Yambo is available to the community as open-source software and is oriented to high-performance computing. The Yambo project aims at making the simulation of these equilibrium and out-of-equilibrium complex processes available to a wide community of users. Indeed, the code is used in practice in many countries, well beyond the European borders. Yambo is a member of the suite of codes of the MAX European Center of Excellence (Materials design at the exascale). It is also used by the user facilities of the European Spectroscopy Facility and of the NFFA European Center (nanoscience foundries & fine analysis). In this talk I will discuss some recent numerical and methodological developments that have been implemented in Yambo towards the exploitation of next-generation HPC supercomputers. In particular, I will present the hybrid MPI+OpenMP parallelization and the specific case of the response function calculation. I will also discuss the future plans of the Yambo project and its potential use as a tool for science dissemination, also in third world countries. ETSF, MAX European Center of Excellence and NFFA European Center.
Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

NASA Technical Reports Server (NTRS)

Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

2017-01-01

In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 was measured for MPI+OpenMP.
Fast-ion D(alpha) measurements and simulations in DIII-D

NASA Astrophysics Data System (ADS)

Luo, Yadong

The fast-ion Dalpha diagnostic measures the Doppler-shifted Dalpha light emitted by neutralized fast ions. For a favorable viewing geometry, the bright interferences from beam neutrals, halo neutrals, and edge neutrals span a small wavelength range around the Dalpha rest wavelength and are blocked by a vertical bar at the exit focal plane of the spectrometer. Background subtraction and fitting techniques eliminate various contaminants in the spectrum. Fast-ion data are acquired with a time evolution of ~1 ms, spatial resolution of ~5 cm, and energy resolution of ~10 keV. A weighted Monte Carlo simulation code models the fast-ion Dalpha spectra based on the fast-ion distribution function from other sources. In quiet plasmas, the spectral shape is in excellent agreement and the absolute magnitude also has reasonable agreement. The fast-ion Dalpha signal has the expected dependencies on plasma and neutral beam parameters. The neutral particle diagnostic and neutron diagnostic corroborate the fast-ion Dalpha measurements. The relative spatial profile is in agreement with the simulated profile based on the fast-ion distribution function from the TRANSP analysis code. During ion cyclotron heating, fast ions with high perpendicular energy are accelerated, while those with low perpendicular energy are barely affected. The spatial profile is compared with the simulated profiles based on the fast-ion distribution functions from the CQL Fokker-Planck code. In discharges with Alfven instabilities, both the spatial profile and spectral shape suggest that fast ions are redistributed. The flattened fast-ion Dalpha profile is in agreement with the fast-ion pressure profile.
Parallel processing a three-dimensional free-lagrange code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mandell, D.A.; Trease, H.E.

1989-01-01

A three-dimensional, time-dependent free-Lagrange hydrodynamics code has been multitasked and autotasked on a CRAY X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the CRAY multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The three-dimensional algorithm has presented a number of problems that simpler algorithms, such as those for one-dimensional hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a CRAY-1, to a multitasking code are discussed. Autotasking of a rewritten version of the code is discussed. Timing results for subroutines and hot spots in the serial code are presented and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given.
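The theoretical speedup mentioned in this abstract follows from Amdahl's law, S(N) = 1 / ((1 - p) + p / N), where p is the fraction of run time that can be parallelized and N is the number of processors. The sketch below is only an illustration of that formula: the function name and the sample values of p and N are assumptions, not figures taken from the report.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speedup from Amdahl's law.

    parallel_fraction: share of serial run time that parallelizes (0..1).
    n_processors: number of processors sharing the parallel part.
    Both argument names are illustrative, not taken from the report.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Hypothetical case: 90% of the run time parallelizes, 4 processors.
    print(round(amdahl_speedup(0.9, 4), 3))
```

Even with 90% of the work parallelized, the remaining serial 10% keeps the speedup on four processors near 3 rather than 4, which is why measured speedups on a dedicated machine are usually compared against this bound.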
Parallel processing a real code: A case history

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mandell, D.A.; Trease, H.E.

1988-01-01

A three-dimensional, time-dependent Free-Lagrange hydrodynamics code has been multitasked and autotasked on a Cray X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the Cray multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The 3-D algorithm has presented a number of problems that simpler algorithms, such as 1-D hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a Cray 1, to a multitasking code are discussed. Autotasking of a rewritten version of the code is discussed. Timing results for subroutines and hot spots in the serial code are presented and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given. 8 refs., 13 figs.
Fast Numerical Solution of the Plasma Response Matrix for Real-time Ideal MHD Control

DOE Office of Scientific and Technical Information (OSTI.GOV)

Glasser, Alexander; Kolemen, Egemen; Glasser, Alan H.

To help effectuate near real-time feedback control of ideal MHD instabilities in tokamak geometries, a parallelized version of A.H. Glasser’s DCON (Direct Criterion of Newcomb) code is developed. To motivate the numerical implementation, we first solve DCON’s δW formulation with a Hamilton-Jacobi theory, elucidating analytical and numerical features of the ideal MHD stability problem. The plasma response matrix is demonstrated to be the solution of an ideal MHD Riccati equation. We then describe our adaptation of DCON with numerical methods natural to solutions of the Riccati equation, parallelizing it to enable its operation in near real-time. We replace DCON’s serial integration of perturbed modes (which satisfy a singular Euler-Lagrange equation) with a domain-decomposed integration of state transition matrices. Output is shown to match results from DCON with high accuracy, and with computation time < 1 s. Such computational speed may enable active feedback ideal MHD stability control, especially in plasmas whose ideal MHD equilibria evolve with inductive timescale τ ≳ 1 s, as in ITER. Further potential applications of this theory are discussed.

Acoustic 3D modeling by the method of integral equations

NASA Astrophysics Data System (ADS)

Malovichko, M.; Khokhlov, N.; Yavich, N.; Zhdanov, M.

2018-02-01

This paper presents a parallel algorithm for frequency-domain acoustic modeling by the method of integral equations (IE). The algorithm is applied to seismic simulation. The IE method reduces the size of the problem but leads to a dense system matrix. Tolerable memory consumption and numerical complexity were achieved by applying an iterative solver, accompanied by an effective matrix-vector multiplication operation based on the fast Fourier transform (FFT). We demonstrate that the IE system matrix is better conditioned than that of the finite-difference (FD) method, and discuss its relation to a specially preconditioned FD matrix. We considered several methods of matrix-vector multiplication for the free-space and layered host models. The developed algorithm and computer code were benchmarked against the FD time-domain solution. It was demonstrated that the method could accurately calculate the seismic field for models with sharp material boundaries and a point source and receiver located close to the free surface. We used OpenMP to speed up the matrix-vector multiplication, while MPI was used to speed up the solution of the system equations, and also for parallelizing across multiple sources. Practical examples and efficiency tests are presented as well.
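The idea of replacing a dense matrix-vector product with an FFT can be illustrated on the simplest convolution-type matrix, a circulant one, where C[i][j] = c[(i - j) mod n] and the product C·x is a circular convolution. The sketch below is a toy, one-dimensional illustration written in pure Python; it is not the paper's kernel (which acts on a three-dimensional Green's operator), and all function names are assumptions made for this example.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is returned unnormalized (caller divides by n).
    """
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec_fft(first_column, x):
    """O(n log n) product of a circulant matrix with x via the FFT."""
    n = len(x)
    spectrum = [a * b for a, b in zip(fft(first_column), fft(x))]
    return [v.real / n for v in fft(spectrum, inverse=True)]


def circulant_matvec_dense(first_column, x):
    """O(n^2) reference product, using C[i][j] = c[(i - j) mod n]."""
    n = len(x)
    return [sum(first_column[(i - j) % n] * x[j] for j in range(n))
            for i in range(n)]


if __name__ == "__main__":
    c = [4.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    fast = circulant_matvec_fft(c, x)
    slow = circulant_matvec_dense(c, x)
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

Inside an iterative solver only this product is needed, so the dense matrix never has to be stored; only its first column (the convolution kernel) is kept in memory.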
Fast Numerical Solution of the Plasma Response Matrix for Real-time Ideal MHD Control

DOE PAGES

Glasser, Alexander; Kolemen, Egemen; Glasser, Alan H.

2018-03-26

To help effectuate near real-time feedback control of ideal MHD instabilities in tokamak geometries, a parallelized version of A.H. Glasser’s DCON (Direct Criterion of Newcomb) code is developed. To motivate the numerical implementation, we first solve DCON’s δW formulation with a Hamilton-Jacobi theory, elucidating analytical and numerical features of the ideal MHD stability problem. The plasma response matrix is demonstrated to be the solution of an ideal MHD Riccati equation. We then describe our adaptation of DCON with numerical methods natural to solutions of the Riccati equation, parallelizing it to enable its operation in near real-time. We replace DCON’s serial integration of perturbed modes (which satisfy a singular Euler-Lagrange equation) with a domain-decomposed integration of state transition matrices. Output is shown to match results from DCON with high accuracy, and with computation time < 1 s. Such computational speed may enable active feedback ideal MHD stability control, especially in plasmas whose ideal MHD equilibria evolve with inductive timescale τ ≳ 1 s, as in ITER. Further potential applications of this theory are discussed.
Architecture of the parallel hierarchical network for fast image recognition

NASA Astrophysics Data System (ADS)

Timchenko, Leonid; Wójcik, Waldemar; Kokriatskaia, Natalia; Kutaev, Yuriy; Ivasyuk, Igor; Kotyra, Andrzej; Smailova, Saule

2016-09-01

Multistage integration of visual information in the brain allows humans to respond quickly to most significant stimuli while maintaining their ability to recognize small details in the image. Implementation of this principle in technical systems can lead to more efficient processing procedures. The multistage approach to image processing includes main types of cortical multistage convergence. The input images are mapped into a flexible hierarchy that reflects complexity of image data. Procedures of the temporal image decomposition and hierarchy formation are described in mathematical expressions. The multistage system highlights spatial regularities, which are passed through a number of transformational levels to generate a coded representation of the image that encapsulates a structure on different hierarchical levels in the image. At each processing stage a single output result is computed to allow a quick response of the system. The result is presented as an activity pattern, which can be compared with previously computed patterns on the basis of the closest match. With regard to the forecasting method, its idea lies in the following. In the results synchronization block, network-processed data arrive to the database where a sample of most correlated data is drawn using service parameters of the parallel-hierarchical network.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Visual analysis of inter-process communication for large-scale parallel computing.

PubMed

Muelder, Chris; Gygi, Francois; Ma, Kwan-Liu

2009-01-01

In serial computation, program profiling is often helpful for optimization of key sections of code. When moving to parallel computation, not only does the code execution need to be considered but also communication between the different processes, which can induce delays that are detrimental to performance. As the number of processes increases, so does the impact of the communication delays on performance. For large-scale parallel applications, it is critical to understand how the communication impacts performance in order to make the code more efficient. There are several tools available for visualizing program execution and communications on parallel systems. These tools generally provide either views which statistically summarize the entire program execution or process-centric views. However, process-centric visualizations do not scale well as the number of processes gets very large. In particular, the most common representation of parallel processes is a Gantt chart with a row for each process. As the number of processes increases, these charts can become difficult to work with and can even exceed screen resolution. We propose a new visualization approach that affords more scalability and then demonstrate it on systems running with up to 16,384 processes.
Scaling Optimization of the SIESTA MHD Code

NASA Astrophysics Data System (ADS)

Seal, Sudip; Hirshman, Steven; Perumalla, Kalyan

2013-10-01

SIESTA is a parallel three-dimensional plasma equilibrium code capable of resolving magnetic islands at high spatial resolutions for toroidal plasmas. Originally designed to exploit small-scale parallelism, SIESTA has now been scaled to execute efficiently over several thousands of processors P. This scaling improvement was accomplished with minimal intrusion to the execution flow of the original version. First, the efficiency of the iterative solutions was improved by integrating the parallel tridiagonal block solver code BCYCLIC. Krylov-space generation in GMRES was then accelerated using a customized parallel matrix-vector multiplication algorithm. Novel parallel Hessian generation algorithms were integrated and memory access latencies were dramatically reduced through loop nest optimizations and data layout rearrangement. These optimizations sped up equilibria calculations by factors of 30-50. It is possible to compute solutions with granularity N/P near unity on extremely fine radial meshes (N > 1024 points). Grid separation in SIESTA, which manifests itself primarily in the resonant components of the pressure far from rational surfaces, is strongly suppressed by finer meshes. Large problem sizes of up to 300 K simultaneous non-linear coupled equations have been solved on the NERSC supercomputers. Work supported by U.S. DOE under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.
Solar wind interaction with Venus and Mars in a parallel hybrid code

NASA Astrophysics Data System (ADS)

Jarvinen, Riku; Sandroos, Arto

2013-04-01

We discuss the development and applications of a new parallel hybrid simulation, where ions are treated as particles and electrons as a charge-neutralizing fluid, for the interaction between the solar wind and Venus and Mars. The new simulation code under construction is based on the algorithm of the sequential global planetary hybrid model developed at the Finnish Meteorological Institute (FMI) and on the Corsair parallel simulation platform also developed at the FMI. The FMI's sequential hybrid model has been used for studies of plasma interactions of several unmagnetized and weakly magnetized celestial bodies for more than a decade. Especially, the model has been used to interpret in situ particle and magnetic field observations from plasma environments of Mars, Venus and Titan. Further, Corsair is an open source MPI (Message Passing Interface) particle and mesh simulation platform, mainly aimed for simulations of diffusive shock acceleration in solar corona and interplanetary space, but which is now also being extended for global planetary hybrid simulations. In this presentation we discuss challenges and strategies of parallelizing a legacy simulation code as well as possible applications and prospects of a scalable parallel hybrid model for the solar wind interactions of Venus and Mars.
TORUS: Radiation transport and hydrodynamics code

NASA Astrophysics Data System (ADS)

Harries, Tim

2014-04-01

TORUS is a flexible radiation transfer and radiation-hydrodynamics code. The code has a basic infrastructure that includes the AMR mesh scheme that is used by several physics modules including atomic line transfer in a moving medium, molecular line transfer, photoionization, radiation hydrodynamics and radiative equilibrium. TORUS is useful for a variety of problems, including magnetospheric accretion onto T Tauri stars, spiral nebulae around Wolf-Rayet stars, discs around Herbig AeBe stars, structured winds of O supergiants and Raman-scattered line formation in symbiotic binaries, and dust emission and molecular line formation in star forming clusters. The code is written in Fortran 2003 and is compiled using a standard Gnu makefile. The code is parallelized using both MPI and OMP, and can use these parallel sections either separately or in a hybrid mode.
Wind turbine design codes: A comparison of the structural response

DOE Office of Scientific and Technical Information (OSTI.GOV)

Buhl, M.L. Jr.; Wright, A.D.; Pierce, K.G.

2000-03-01

The National Wind Technology Center (NWTC) of the National Renewable Energy Laboratory is continuing a comparison of several computer codes used in the design and analysis of wind turbines. The second part of this comparison determined how well the programs predict the structural response of wind turbines. In this paper, the authors compare the structural response for four programs: ADAMS, BLADED, FAST_AD, and YawDyn. ADAMS is a commercial, multibody-dynamics code from Mechanical Dynamics, Inc. BLADED is a commercial, performance and structural-response code from Garrad Hassan and Partners Limited. FAST_AD is a structural-response code developed by Oregon State University and the University of Utah for the NWTC. YawDyn is a structural-response code developed by the University of Utah for the NWTC. ADAMS, FAST_AD, and YawDyn use the University of Utah's AeroDyn subroutine package for calculating aerodynamic forces. Although errors were found in all the codes during this study, once they were fixed, the codes agreed surprisingly well for most of the cases and configurations that were evaluated. One unresolved discrepancy between BLADED and the AeroDyn-based codes was when there was blade and/or teeter motion in addition to a large yaw error.
Implementation and performance of FDPS: a framework for developing parallel particle simulation codes

NASA Astrophysics Data System (ADS)

Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro

2016-08-01

We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N^2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 10^7) to 300 ms (N = 10^9). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.
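For orientation, the "simple, sequential and unoptimized" O(N^2) program that the abstract uses as its baseline amounts to a double loop over particle pairs. The sketch below shows such a direct-summation gravitational acceleration kernel in plain Python. It does not use the FDPS API; the function name, the argument layout and the softening parameter eps are assumptions made for this illustration.

```python
def direct_accelerations(positions, masses, G=1.0, eps=0.0):
    """Gravitational acceleration on every particle by direct summation.

    positions: list of (x, y, z) tuples; masses: list of floats.
    eps is an optional softening length (hypothetical parameter).
    Cost is O(N^2) because every ordered pair (i, j) is visited once.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


if __name__ == "__main__":
    # Two bodies on the x axis, separated by a distance of 2.
    pos = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    m = [1.0, 3.0]
    for a in direct_accelerations(pos, m):
        print(a)
```

Tree codes and the Fast Multipole Method replace the inner loop over all j with an approximate sum over groups of distant particles, which is the part a framework can supply independently of the interaction law.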
A dosimetry study comparing NCS report-5, IAEA TRS-381, AAPM TG-51 and IAEA TRS-398 in three clinical electron beam energies

NASA Astrophysics Data System (ADS)

Palmans, Hugo; Nafaa, Laila; de Patoul, Nathalie; Denis, Jean-Marc; Tomsej, Milan; Vynckier, Stefaan

2003-05-01

New codes of practice for reference dosimetry in clinical high-energy photon and electron beams have been published recently, to replace the air kerma based codes of practice that have determined the dosimetry of these beams for the past twenty years. In the present work, we compared dosimetry based on the two most widespread absorbed dose based recommendations (AAPM TG-51 and IAEA TRS-398) with two air kerma based recommendations (NCS report-5 and IAEA TRS-381). Measurements were performed in three clinical electron beam energies using two NE2571-type cylindrical chambers, two Markus-type plane-parallel chambers and two NACP-02-type plane-parallel chambers. Dosimetry based on direct calibrations of all chambers in 60Co was investigated, as well as dosimetry based on cross-calibrations of plane-parallel chambers against a cylindrical chamber in a high-energy electron beam. Furthermore, 60Co perturbation factors for plane-parallel chambers were derived. It is shown that the use of 60Co calibration factors could result in deviations of more than 2% for plane-parallel chambers between the old and new codes of practice, whereas the use of cross-calibration factors, which is the first recommendation in the new codes, reduces the differences to less than 0.8% for all situations investigated here. The results thus show that neither the chamber-to-chamber variations, nor the obtained absolute dose values are significantly altered by changing from air kerma based dosimetry to absorbed dose based dosimetry when using calibration factors obtained from the Laboratory for Standard Dosimetry, Ghent, Belgium. The values of the 60Co perturbation factor for plane-parallel chambers (k_att·k_m for the air kerma based and p_wall for the absorbed dose based codes of practice) that are obtained from comparing the results based on 60Co calibrations and cross-calibrations are within the experimental uncertainties in agreement with the results from other investigators.
Interfacing Computer Aided Parallelization and Performance Analysis

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)

2003-01-01

When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.
DOUBLE code simulations of emissivities of fast neutrals for different plasma observation view-lines of neutral particle analyzers on the COMPASS tokamak

NASA Astrophysics Data System (ADS)

Mitosinkova, K.; Tomes, M.; Stockel, J.; Varju, J.; Stano, M.

2018-03-01

Neutral particle analyzers (NPA) measure line-integrated energy spectra of fast neutral atoms escaping the tokamak plasma, which are a product of charge-exchange (CX) collisions of plasma ions with background neutrals. They can observe variations in the ion temperature T_i of non-thermal fast ions created by additional plasma heating. However, the plasma column which a fast atom has to pass through must be sufficiently short in comparison with the fast atom’s mean free path. The COMPASS tokamak is currently equipped with one NPA installed at a tangential mid-plane port. This orientation is optimal for observing non-thermal fast ions. However, in this configuration the signal at energies useful for T_i derivation is lost in noise because the fast atoms’ trajectories are too long. Thus, a second NPA is planned to be connected for the purpose of measuring T_i. We analyzed different possible view-lines (perpendicular mid-plane, tangential mid-plane, and top view) for the second NPA using the DOUBLE Monte-Carlo code and compared the results with the performance of the present NPA with tangential orientation. The DOUBLE code provides fast-atom emissivity functions along the NPA view-line. The position of the median of these emissivity functions is related to the location from where the measured signal originates. Further, we compared the difference between the real central T_i used as a DOUBLE code input and the T_i^CX derived from the exponential decay of simulated energy spectra. The advantages and disadvantages of each NPA location are discussed.
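Deriving a temperature from the exponential decay of an energy spectrum can be sketched as a straight-line fit: if the flux falls off as exp(-E / T), then ln(flux) is linear in E with slope -1/T. The example below assumes a pure exponential tail with no energy-dependent prefactor, which is a simplification for illustration and not necessarily the procedure used with DOUBLE; the function name and the synthetic numbers are hypothetical.

```python
import math


def fit_temperature(energies, fluxes):
    """Estimate T from flux ~ exp(-E / T) by least squares on ln(flux).

    energies and T share the same unit (for example eV).
    Assumes a pure exponential spectrum; this is a simplification.
    """
    n = len(energies)
    logs = [math.log(f) for f in fluxes]
    mean_e = sum(energies) / n
    mean_l = sum(logs) / n
    sxx = sum((e - mean_e) ** 2 for e in energies)
    sxy = sum((e - mean_e) * (l - mean_l) for e, l in zip(energies, logs))
    slope = sxy / sxx  # equals -1/T for an exact exponential
    return -1.0 / slope


if __name__ == "__main__":
    # Synthetic spectrum generated with a hypothetical T = 500 eV.
    true_t = 500.0
    energies = [1000.0 + 250.0 * i for i in range(9)]
    fluxes = [1.0e6 * math.exp(-e / true_t) for e in energies]
    print(fit_temperature(energies, fluxes))
```

With real spectra the fit range matters: only the energy window where the signal is above the noise and dominated by thermal ions gives a slope that reflects the central temperature.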
The Consequences of Alfven Waves and Parallel Potential Drops in the Auroral Zone

NASA Technical Reports Server (NTRS)

Schriver, David

2003-01-01

The goal of this research is to examine the causes of field-aligned plasma acceleration in the auroral zone using satellite data and numerical simulations. A primary question to be addressed is what causes the field-aligned acceleration of electrons (leading to precipitation) and ions (leading to upwelling ions) in the auroral zone. Data from the Fast Auroral SnapshoT (FAST) and Polar satellites are used when the two satellites are in approximate magnetic conjunction and are in the auroral region. FAST is at relatively low altitudes and samples plasma in the midst of the auroral acceleration region, while Polar is at much higher altitudes and can measure plasmas and waves propagating towards the Earth. Polar can determine the sources of energy streaming earthward from the magnetotail, in the form of field-aligned currents, electromagnetic waves or kinetic particle energy, that ultimately lead to the acceleration of plasma in the auroral zone. After identifying and examining several events, numerical simulations are run that bridge the spatial region between the two satellites. The code is a one-dimensional, long system length particle-in-cell simulation that has been developed to model the auroral region. A main goal of this research project is to include Alfven waves in the simulation to examine how these waves can accelerate plasma in the auroral zone.
Parallelization of PANDA discrete ordinates code using spatial decomposition

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humbert, P.

2006-07-01

We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by directions and octants pipelining. The implementation of the algorithm is straightforward using MPI blocking point-to-point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7 benchmark of the OECD/NEA. (authors)
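A diagonal-plane ordered sweep can be pictured as follows: for a sweep starting at the corner subdomain (0, 0, 0) of a Cartesian grid of subdomains, every subdomain (i, j, k) depends only on its upstream neighbours, so all subdomains with the same value of i + j + k can be processed concurrently. The sketch below only builds that ordering for one sweep direction. It is a generic illustration of the wavefront idea, not PANDA's implementation, and the function name is an assumption.

```python
def sweep_planes(px, py, pz):
    """Group subdomains of a px x py x pz grid into diagonal planes.

    Plane number s holds every (i, j, k) with i + j + k == s. For a sweep
    entering at corner (0, 0, 0), the planes must be processed in order,
    while subdomains within one plane are independent of each other.
    """
    planes = [[] for _ in range(px + py + pz - 2)]
    for i in range(px):
        for j in range(py):
            for k in range(pz):
                planes[i + j + k].append((i, j, k))
    return planes


if __name__ == "__main__":
    for stage, plane in enumerate(sweep_planes(2, 2, 2)):
        print(stage, plane)
```

The first and last planes contain a single subdomain, so some processors idle at the start and end of each sweep; pipelining several directions and octants, as the abstract describes, is a way to fill those idle stages.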
Flood predictions using the parallel version of distributed numerical physical rainfall-runoff model TOPKAPI

NASA Astrophysics Data System (ADS)

Boyko, Oleksiy; Zheleznyak, Mark

2015-04-01

The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI (Todini et al., 1996-2014) has been developed and implemented in Ukraine. A parallel version of the code has recently been developed for use on multiprocessor systems: multicore/multiprocessor PCs and clusters. The algorithm is based on a binary-tree decomposition of the watershed to balance the amount of computation across all processors/cores. The Message Passing Interface (MPI) protocol is used as the parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated in case studies of flood prediction for mountain watersheds in the Ukrainian Carpathian region. The modeling results are compared with predictions based on lumped-parameter models.
The geospatial data quality REST API for primary biodiversity data

PubMed Central

Otegui, Javier; Guralnick, Robert P.

2016-01-01

Summary: We present a REST web service to assess the geospatial quality of primary biodiversity data. It enables access to basic and advanced functions to detect completeness and consistency issues as well as general errors in the provided record or set of records. The API uses JSON for data interchange and efficient parallelization techniques for fast assessments of large datasets. Availability and implementation: The Geospatial Data Quality API is part of the VertNet set of APIs. It can be accessed at http://api-geospatial.vertnet-portal.appspot.com/geospatial and is already implemented in the VertNet data portal for quality reporting. Source code is freely available under GPL license from http://www.github.com/vertnet/api-geospatial. Contact: javier.otegui@gmail.com or rguralnick@flmnh.ufl.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26833340
The geospatial data quality REST API for primary biodiversity data.

PubMed

Otegui, Javier; Guralnick, Robert P

2016-06-01

We present a REST web service to assess the geospatial quality of primary biodiversity data. It enables access to basic and advanced functions to detect completeness and consistency issues as well as general errors in the provided record or set of records. The API uses JSON for data interchange and efficient parallelization techniques for fast assessments of large datasets. The Geospatial Data Quality API is part of the VertNet set of APIs. It can be accessed at http://api-geospatial.vertnet-portal.appspot.com/geospatial and is already implemented in the VertNet data portal for quality reporting. Source code is freely available under GPL license from http://www.github.com/vertnet/api-geospatial. Contact: javier.otegui@gmail.com or rguralnick@flmnh.ufl.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Simulator platform for fast reactor operation and safety technology demonstration

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vilim, R. B.; Park, Y. S.; Grandy, C.

2012-07-30

A simulator platform for visualization and demonstration of innovative concepts in fast reactor technology is described. The objective is to make more accessible the workings of fast reactor technology innovations and to do so in a human factors environment that uses state-of-the-art visualization technologies. In this work the computer codes in use at Argonne National Laboratory (ANL) for the design of fast reactor systems are being integrated to run on this platform. This includes linking reactor systems codes with mechanical structures codes and using advanced graphics to depict the thermo-hydraulic-structure interactions that give rise to an inherently safe response to upsets. It also includes visualization of mechanical systems operation including advanced concepts that make use of robotics for operations, in-service inspection, and maintenance.
Multiple grid problems on concurrent-processing computers

NASA Technical Reports Server (NTRS)

Eberhardt, D. S.; Baganoff, D.

1986-01-01

Three computer codes were studied which make use of concurrent processing computer architectures in computational fluid dynamics (CFD). The three parallel codes were tested on a two processor multiple-instruction/multiple-data (MIMD) facility at NASA Ames Research Center, and are suggested for efficient parallel computations. The first code is a well-known program which makes use of the Beam and Warming, implicit, approximate factored algorithm. This study demonstrates the parallelism found in a well-known scheme and it achieved speedups exceeding 1.9 on the two processor MIMD test facility. The second code studied made use of an embedded grid scheme which is used to solve problems having complex geometries. The particular application for this study considered an airfoil/flap geometry in an incompressible flow. The scheme eliminates some of the inherent difficulties found in adapting approximate factorization techniques onto MIMD machines and allows the use of chaotic relaxation and asynchronous iteration techniques. The third code studied is an application of overset grids to a supersonic blunt body problem. The code addresses the difficulties encountered when using embedded grids on a compressible, and therefore nonlinear, problem. The complex numerical boundary system associated with overset grids is discussed and several boundary schemes are suggested. A boundary scheme based on the method of characteristics achieved the best results.

A practical approach to portability and performance problems on massively parallel supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Beazley, D.M.; Lomdahl, P.S.

1994-12-08

We present an overview of the tactics we have used to achieve a high level of performance while improving portability for a large-scale molecular dynamics code SPaSM. SPaSM was originally implemented in ANSI C with message passing for the Connection Machine 5 (CM-5). In 1993, SPaSM was selected as one of the winners in the IEEE Gordon Bell Prize competition for sustaining 50 Gflops on the 1024 node CM-5 at Los Alamos National Laboratory. Achieving this performance on the CM-5 required rewriting critical sections of code in CDPEAC assembler language. In addition, the code made extensive use of CM-5 parallel I/O and the CMMD message passing library. Given this highly specialized implementation, we describe how we have ported the code to the Cray T3D and high performance workstations. In addition we will describe how it has been possible to do this using a single version of source code that runs on all three platforms without sacrificing any performance. Sound too good to be true? We hope to demonstrate that one can realize both code performance and portability without relying on the latest and greatest prepackaged tool or parallelizing compiler.
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing

NASA Technical Reports Server (NTRS)

Ozguner, Fusun

1996-01-01

Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time of the application, T_par, depends on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
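The dependence of speedup on the sequential segments is commonly expressed by Amdahl's law. The sketch below is a generic Python illustration, not code from the report. The serial fractions and processor counts are made-up values chosen only to show the trend, and the function name amdahl_speedup is hypothetical.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Ideal speedup when serial_fraction of the run time cannot be
    parallelized and the remainder scales perfectly over n_procs."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


# Hypothetical serial fractions, for illustration only.
for f in (0.05, 0.25, 0.50):
    row = [round(amdahl_speedup(f, p), 2) for p in (2, 8, 32)]
    print(f"serial fraction {f:.2f}: speedup on 2, 8, 32 processors = {row}")
```

Whatever the processor count, the speedup in this model stays below 1 / serial_fraction, which is why large sequential segments lead to a poor speedup measure.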
GANDALF - Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids

NASA Astrophysics Data System (ADS)

Hubber, D. A.; Rosotti, G. P.; Booth, R. A.

2018-01-01

GANDALF is a new hydrodynamics and N-body dynamics code designed for investigating planet formation, star formation and star cluster problems. GANDALF is written in C++, parallelized with both OPENMP and MPI and contains a PYTHON library for analysis and visualization. The code has been written with a fully object-oriented approach to easily allow user-defined implementations of physics modules or other algorithms. The code currently contains implementations of smoothed particle hydrodynamics, meshless finite-volume and collisional N-body schemes, but can easily be adapted to include additional particle schemes. We present in this paper the details of its implementation, results from the test suite, serial and parallel performance results and discuss the planned future development. The code is freely available as an open source project on the code-hosting website github at https://github.com/gandalfcode/gandalf and is available under the GPLv2 license.
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations

NASA Astrophysics Data System (ADS)

Darmawan, J. B. B.; Mungkasi, S.

2017-01-01

In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are widely used in engineering science, so solving them quickly and efficiently is desirable. We therefore propose the use of parallel computation, implemented with NVIDIA's CUDA. Our results show that parallel computation using CUDA offers a substantial advantage when the computation is of large scale.
Parallel computing techniques for rotorcraft aerodynamics

NASA Astrophysics Data System (ADS)

Ekici, Kivanc

The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, major modifications are made to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS), originally used in TURNS. Second, an inexact Newton method coupled with a Krylov subspace iterative method (Newton-Krylov method) is applied. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods in the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.

PubMed

Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal

2015-06-01

Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
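The wall-clock times quoted above can be turned into speedups with a few lines of Python. This is only a worked arithmetic check on the numbers in the abstract; the variable names are ours.

```python
# Wall-clock times (hours) quoted in the abstract for 501 benchmarking proteins.
serial_h = 36.8        # serial version
cpu_only_h = 3.1       # parallel version, node without the accelerator
cpu_plus_phi_h = 2.1   # parallel version, node with the Xeon Phi card

speedup_cpu_only = serial_h / cpu_only_h
speedup_cpu_plus_phi = serial_h / cpu_plus_phi_h
gain_from_phi = cpu_only_h / cpu_plus_phi_h

print(f"CPU only vs serial:      {speedup_cpu_only:.1f}x")
print(f"CPU + Phi vs serial:     {speedup_cpu_plus_phi:.1f}x")
print(f"CPU + Phi vs CPU only:   {gain_from_phi:.2f}x")
```

The ratios come out close to the 11.8 and 17.6 reported in the abstract; the small differences are presumably due to rounding of the quoted hours.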
Mode-dependent templates and scan order for H.264/AVC-based intra lossless coding.

PubMed

Gu, Zhouye; Lin, Weisi; Lee, Bu-Sung; Lau, Chiew Tong; Sun, Ming-Ting

2012-09-01

In H.264/advanced video coding (AVC), lossless coding and lossy coding share the same entropy coding module. However, the entropy coders in the H.264/AVC standard were originally designed for lossy video coding and do not yield adequate performance for lossless video coding. In this paper, we analyze the problem with the current lossless coding scheme and propose a mode-dependent template (MD-template) based method for intra lossless coding. By exploring the statistical redundancy of the prediction residual in the H.264/AVC intra prediction modes, more zero coefficients are generated. By designing a new scan order for each MD-template, the scanned coefficient sequence fits the H.264/AVC entropy coders better. A fast implementation algorithm is also designed. With little increase in computation, experimental results confirm that the proposed fast algorithm achieves about 7.2% bit saving compared with the current H.264/AVC fidelity range extensions high profile.
Parallel MR imaging: a user's guide.

PubMed

Glockner, James F; Hu, Houchun H; Stanley, David W; Angelos, Lisa; King, Kevin

2005-01-01

Parallel imaging is a recently developed family of techniques that take advantage of the spatial information inherent in phased-array radiofrequency coils to reduce acquisition times in magnetic resonance imaging. In parallel imaging, the number of sampled k-space lines is reduced, often by a factor of two or greater, thereby significantly shortening the acquisition time. Parallel imaging techniques have only recently become commercially available, and the wide range of clinical applications is just beginning to be explored. The potential clinical applications primarily involve reduction in acquisition time, improved spatial resolution, or a combination of the two. Improvements in image quality can be achieved by reducing the echo train lengths of fast spin-echo and single-shot fast spin-echo sequences. Parallel imaging is particularly attractive for cardiac and vascular applications and will likely prove valuable as 3-T body and cardiovascular imaging becomes part of standard clinical practice. Limitations of parallel imaging include reduced signal-to-noise ratio and reconstruction artifacts. It is important to consider these limitations when deciding when to use these techniques. (c) RSNA, 2005.
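The trade-off between acquisition time and signal-to-noise ratio can be made concrete with a short Python sketch. It is not from the article. It assumes that scan time is proportional to the number of sampled k-space lines and uses the commonly cited relation SNR_accelerated = SNR_full / (g * sqrt(R)), where R is the reduction factor and g is a coil-geometry factor (g >= 1). The function name and all numerical values are hypothetical.

```python
import math


def accelerated_scan(n_lines, tr_s, snr_full, R, g=1.0):
    """Return (sampled lines, scan time in s, SNR) for reduction factor R.

    Assumes one k-space line per repetition time tr_s and the textbook
    SNR penalty of g * sqrt(R); g = 1.0 is the ideal, best-case value.
    """
    lines = math.ceil(n_lines / R)
    scan_time_s = lines * tr_s
    snr = snr_full / (g * math.sqrt(R))
    return lines, scan_time_s, snr


# Hypothetical protocol: 256 phase-encoding lines, TR = 0.5 s, baseline SNR 100.
full = accelerated_scan(256, 0.5, 100.0, R=1)
fast = accelerated_scan(256, 0.5, 100.0, R=2)
print("full sampling:", full)
print("R = 2:        ", fast)
```

With these assumed values, halving the number of lines halves the scan time while the SNR drops by at least a factor of sqrt(2), which is the kind of limitation the abstract asks users to weigh.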
The Advanced Software Development and Commercialization Project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gallopoulos, E.; Canfield, T.R.; Minkoff, M.

1990-09-01

This is the first of a series of reports pertaining to progress in the Advanced Software Development and Commercialization Project, a joint collaborative effort between the Center for Supercomputing Research and Development of the University of Illinois and the Computing and Telecommunications Division of Argonne National Laboratory. The purpose of this work is to apply techniques of parallel computing that were pioneered by University of Illinois researchers to mature computational fluid dynamics (CFD) and structural dynamics (SD) computer codes developed at Argonne. The collaboration in this project will bring this unique combination of expertise to bear, for the first time, on industrially important problems. By so doing, it will expose the strengths and weaknesses of existing techniques for parallelizing programs and will identify those problems that need to be solved in order to enable widespread production use of parallel computers. Secondly, the increased efficiency of the CFD and SD codes themselves will enable the simulation of larger, more accurate engineering models that involve fluid and structural dynamics. In order to realize the above two goals, we are considering two production codes that have been developed at ANL and are widely used by both industry and universities. These are COMMIX and WHAMS-3D. The first is a computational fluid dynamics code that is used for both nuclear reactor design and safety and as a design tool for the casting industry. The second is a three-dimensional structural dynamics code used in nuclear reactor safety as well as crashworthiness studies. These codes are currently available for sequential and vector computers only. Our main goal is to port and optimize these two codes on shared memory multiprocessors. In so doing, we shall establish a process that can be followed in optimizing other sequential or vector engineering codes for parallel processors.
FastID: Extremely Fast Forensic DNA Comparisons

DTIC Science & Technology

2017-05-19

FastID: Extremely Fast Forensic DNA Comparisons Darrell O. Ricke, PhD Bioengineering Systems & Technologies Massachusetts Institute of...Technology Lincoln Laboratory Lexington, MA USA Darrell.Ricke@ll.mit.edu Abstract—Rapid analysis of DNA forensic samples can have a critical impact on...time sensitive investigations. Analysis of forensic DNA samples by massively parallel sequencing is creating the next gold standard for DNA
Exploring the Ability of a Coarse-grained Potential to Describe the Stress-strain Response of Glassy Polystyrene

DTIC Science & Technology

2012-10-01

using the open-source code Large-scale Atomic/Molecular Massively Parallel Simulator ( LAMMPS ) (http://lammps.sandia.gov) (23). The commercial...parameters are proprietary and cannot be ported to the LAMMPS 4 simulation code. In our molecular dynamics simulations at the atomistic resolution, we...IBI iterative Boltzmann inversion LAMMPS Large-scale Atomic/Molecular Massively Parallel Simulator MAPS Materials Processes and Simulations MS
The Forest Method as a New Parallel Tree Method with the Sectional Voronoi Tessellation

NASA Astrophysics Data System (ADS)

Yahagi, Hideki; Mori, Masao; Yoshii, Yuzuru

1999-09-01

We have developed a new parallel tree method which will be called the forest method hereafter. This new method uses the sectional Voronoi tessellation (SVT) for the domain decomposition. The SVT decomposes a whole space into polyhedra and allows their flat borders to move by assigning different weights. The forest method determines these weights based on the load balancing among processors by means of the overload diffusion (OLD). Moreover, since all the borders are flat, before receiving the data from other processors, each processor can collect enough data to calculate the gravity force with precision. Both the SVT and the OLD are coded in a highly vectorizable manner to suit vector parallel processors. The parallel code based on the forest method with the Message Passing Interface is run on various platforms so that a wide portability is guaranteed. Extensive calculations with 15 processors of Fujitsu VPP300/16R indicate that the code can calculate the gravity force exerted on 10^5 particles in each second for some ideal dark halo. This code is found to enable an N-body simulation with 10^7 or more particles for a wide dynamic range and is therefore a very powerful tool for the study of galaxy formation and large-scale structure in the universe.
Implementation and Characterization of Three-Dimensional Particle-in-Cell Codes on Multiple-Instruction-Multiple-Data Massively Parallel Supercomputers

NASA Technical Reports Server (NTRS)

Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.

1995-01-01

A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech, a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64^3 particles per processing node (approximately 134 million particles on 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems, including those with inhomogeneous plasmas, on other parallel machines once the machine-dependent parameters are known.
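The particle count and the three partitioning choices can be explored with a small Python sketch. The first part simply checks the arithmetic in the abstract. The second part is an illustration under stated assumptions, not the GCPIC timing model: it assumes a hypothetical 512 x 512 x 512 grid split evenly over 512 nodes, interior subdomains with two neighbours along each partitioned axis, and it counts only the grid cells on faces shared with neighbours. The function name shared_face_cells is hypothetical.

```python
# Part 1: particle count quoted in the abstract.
particles_per_node = 64 ** 3
nodes = 512
total_particles = particles_per_node * nodes
print(f"{total_particles:,} particles in total")


# Part 2: face cells shared with neighbours for slabs, rods and cubes.
def shared_face_cells(grid, parts):
    """Cells on subdomain faces that touch a neighbouring subdomain.

    grid  -- global grid size per axis (hypothetical values below)
    parts -- number of subdomains per axis; their product is the node count
    """
    sub = [g // p for g, p in zip(grid, parts)]
    volume = sub[0] * sub[1] * sub[2]
    cells = 0
    for axis in range(3):
        if parts[axis] > 1:                 # only partitioned axes have neighbours
            cells += 2 * (volume // sub[axis])  # two faces normal to this axis
    return cells


grid = (512, 512, 512)  # assumed grid, not from the paper
layouts = {"slabs": (512, 1, 1), "rods": (32, 16, 1), "cubes": (8, 8, 8)}
for name, parts in layouts.items():
    print(name, shared_face_cells(grid, parts))
```

Under these assumptions every subdomain holds the same number of cells, but the cube layout exposes far fewer face cells than the slab layout, which is one reason the choice of partition can affect the ratio of communication to computation.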
MIST: An Open Source Environmental Modelling Programming Language Incorporating Easy to Use Data Parallelism.

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2014-05-01

Model Integration System (MIST) is an open-source environmental modelling programming language that directly incorporates data parallelism. The language is designed to enable straightforward programming structures, such as nested loops and conditional statements, to be directly translated into sequences of whole-array (or more generally whole data-structure) operations. MIST thus enables the programmer to use well-understood constructs, directly relating to the mathematical structure of the model, without having to explicitly vectorize code or worry about details of parallelization. A range of common modelling operations are supported by dedicated language structures operating on cell neighbourhoods rather than individual cells (e.g. the 3x3 local neighbourhood needed to implement an averaging image filter can be simply accessed from within a simple loop traversing all image pixels). This facility hides details of inter-process communication behind more mathematically relevant descriptions of model dynamics. The MIST automatic vectorization/parallelization process serves both to distribute work among available nodes and separately to control storage requirements for intermediate expressions, enabling operations on very large domains for which memory availability may be an issue. MIST is designed to facilitate efficient interpreter-based implementations. A prototype open source interpreter is available, coded in standard FORTRAN 95, with tools to rapidly integrate existing FORTRAN 77 or 95 code libraries. The language is formally specified and thus not limited to FORTRAN implementation or to an interpreter-based approach. A MIST to FORTRAN compiler is under development and volunteers are sought to create an ANSI-C implementation. Parallel processing is currently implemented using OpenMP. However, parallelization code is fully modularised and could be replaced with implementations using other libraries. GPU implementation is potentially possible.
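The idea of translating a per-pixel loop into whole-array operations can be illustrated with the 3x3 averaging filter mentioned above. The sketch below is plain Python, not MIST syntax, and both function names are hypothetical. The first function is the loop a modeller would naturally write; the second computes the same result as nine shifted copies of the interior summed element by element, which is the kind of whole-array form that a data-parallel back end could distribute.

```python
def mean3x3_loop(img):
    """3x3 mean over interior pixels, written as a per-pixel loop."""
    h, w = len(img), len(img[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            s = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    s += img[i + di][j + dj]
            out[i - 1][j - 1] = s / 9.0
    return out


def mean3x3_shifted(img):
    """Same filter expressed as a sum of nine shifted whole arrays."""
    h, w = len(img), len(img[0])
    acc = [[0.0] * (w - 2) for _ in range(h - 2)]
    for di in range(3):
        for dj in range(3):
            # One whole-array operand: the image shifted by (di, dj).
            shifted = [row[dj:dj + w - 2] for row in img[di:di + h - 2]]
            acc = [[a + s for a, s in zip(ra, rs)]
                   for ra, rs in zip(acc, shifted)]
    return [[a / 9.0 for a in row] for row in acc]


# Small test image (a linear ramp), chosen only for illustration.
img = [[float(i * 5 + j) for j in range(5)] for i in range(4)]
assert mean3x3_loop(img) == mean3x3_shifted(img)
print(mean3x3_shifted(img))
```

Each of the nine additions in the second form touches every interior pixel independently, so the work can be split across threads or nodes without changing the result.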
Convergence of highly parallel stray field calculation using the fast multipole method on irregular meshes

NASA Astrophysics Data System (ADS)

Palmesi, P.; Abert, C.; Bruckner, F.; Suess, D.

2018-05-01

Fast stray field calculation is commonly considered of great importance for micromagnetic simulations, since it is the most time consuming part of the simulation. The Fast Multipole Method (FMM) has displayed linear O(N) parallelization behavior on many cores. This article investigates the error of a recent FMM approach approximating sources using linear—instead of constant—finite elements in the singular integral for calculating the stray field and the corresponding potential. After measuring performance in an earlier manuscript, this manuscript investigates the convergence of the relative L2 error for several FMM simulation parameters. Various scenarios either calculating the stray field directly or via potential are discussed.
Automatic Multilevel Parallelization Using OpenMP

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

2002-01-01

In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units

NASA Astrophysics Data System (ADS)

Corral, Roque; Gisbert, Fernando; Pueblas, Jesus

2017-02-01

The implementation of an edge-based three-dimensional Reynolds-averaged Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processing unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passing Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver's parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded by the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
Modelling of radio frequency sheath and fast wave coupling on the realistic ion cyclotron resonant antenna surroundings and the outer wall

NASA Astrophysics Data System (ADS)

Lu, L.; Colas, L.; Jacquot, J.; Després, B.; Heuraux, S.; Faudot, E.; Van Eester, D.; Crombé, K.; Křivská, A.; Noterdaeme, J.-M.; Helou, W.; Hillairet, J.

2018-03-01

In order to model the sheath rectification in a realistic geometry over the size of ion cyclotron resonant heating (ICRH) antennas, the self-consistent sheaths and waves for ICH (SSWICH) code couples self-consistently the RF wave propagation and the DC SOL biasing via nonlinear RF and DC sheath boundary conditions applied at plasma/wall interfaces. A first version of SSWICH had 2D (toroidal and radial) geometry, rectangular walls either normal or parallel to the confinement magnetic field B0 and only included the evanescent slow wave (SW) excited parasitically by the ICRH antenna. The main wave for plasma heating, the fast wave (FW), plays no role in the sheath excitation in this version. A new version of the code, 2D SSWICH-full wave, was developed based on the COMSOL software, to accommodate full RF field polarization and shaped walls tilted with respect to B0. SSWICH-full wave simulations have shown the mode conversion of FW into SW occurring at the sharp corners where the boundary shape varies rapidly. They have also revealed ‘far-field’ sheath oscillations appearing at the shaped walls with a relatively long magnetic connection length to the antenna, which are only accessible to the propagating FW. Joint simulation, conducted by SSWICH-full wave within a multi-2D approach excited using the 3D wave coupling code (RAPLICASOL), has recovered the double-hump poloidal structure measured in the experimental temperature and potential maps when only the SW is modelled. The FW contribution to the potential poloidal structure seems to be affected by 3D effects, which were ignored at the current stage. Finally, SSWICH-full wave simulation revealed the left-right asymmetry that has been observed extensively in the unbalanced strap feeding experiments, suggesting that the spatial proximity effects in RF sheath excitation, previously studied for the SW only, are still important in the vicinity of the wave launcher under full wave polarizations.
Modeling Submarine Lava Flow with ASPECT

NASA Astrophysics Data System (ADS)

Storvick, E. R.; Lu, H.; Choi, E.

2017-12-01

Submarine lava flow is not easily observed and experimented on due to limited accessibility and challenges posed by the fast solidification of lava and the associated drastic changes in rheology. However, recent advances in numerical modeling techniques might address some of these challenges and provide unprecedented insight into the mechanics of submarine lava flow and conditions determining its wide-ranging morphologies. In this study, we explore the applicability of ASPECT, Advanced Solver for Problems in Earth's ConvecTion, to submarine lava flow. ASPECT is a parallel finite element code that solves problems of thermal convection in the Earth's mantle. We will assess ASPECT's capability to model submarine lava flow by comparing models of lava flow morphology simulated with GALE, a long-term tectonics finite element analysis code, with models created using comparable settings and parameters in ASPECT. From these observations we will contrast the differing models in order to identify the benefits of each code. While doing so, we anticipate we will learn about the conditions required for end-members of lava flow morphology, for example, pillows and sheet flows. With ASPECT specifically we focus on 1) whether the lava rheology can be implemented; 2) how effective the AMR is in resolving morphologies of the solidified crust; 3) whether and under what conditions the end-members of the lava flow morphologies, pillows and sheets, can be reproduced.
Incompressible SPH (ISPH) with fast Poisson solver on a GPU

NASA Astrophysics Data System (ADS)

Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.

2018-05-01

This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU) capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as 4 separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, GPU speed-ups of 10-18 times and 1.1-4.5 times are demonstrated compared to single-threaded and 16-threaded CPU run times, respectively.
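What a Jacobi preconditioner does inside an iterative solver can be shown with a minimal Python sketch. It is not the paper's GPU implementation and does not use ViennaCL. It assumes a small symmetric positive-definite matrix and a conjugate-gradient iteration purely for illustration; the solver actually used for the PPE may differ. The function name jacobi_pcg and the test matrix are hypothetical.

```python
def jacobi_pcg(A, b, tol=1e-10, max_iter=100):
    """Conjugate gradients with a Jacobi (diagonal) preconditioner.

    A is a dense symmetric positive-definite matrix given as a list of
    rows; this is an assumption made for the sketch.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)] # Jacobi: M^-1 = 1 / diag(A)
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        if max(abs(ri) for ri in r) < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]   # apply the preconditioner
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Hypothetical 2x2 system with exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = jacobi_pcg(A, b)
print(x)
```

Applying the preconditioner is a single element-wise multiplication per iteration, with no dependence between entries, which is one reason a diagonal preconditioner maps well onto a GPU.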

Neptune: An astrophysical smooth particle hydrodynamics code for massively parallel computer architectures

NASA Astrophysics Data System (ADS)

Sandalski, Stou

Smooth particle hydrodynamics is an efficient method for modeling the dynamics of fluids. It is commonly used to simulate astrophysical processes such as binary mergers. We present a newly developed GPU accelerated smooth particle hydrodynamics code for astrophysical simulations. The code is named neptune after the Roman god of water. It is written in OpenMP parallelized C++ and OpenCL and includes octree based hydrodynamic and gravitational acceleration. The design relies on object-oriented methodologies in order to provide a flexible and modular framework that can be easily extended and modified by the user. Several pre-built scenarios for simulating collisions of polytropes and black-hole accretion are provided. The code is released under the MIT Open Source license and publicly available at http://code.google.com/p/neptune-sph/.
The UPSF code: a metaprogramming-based high-performance automatically parallelized plasma simulation framework

NASA Astrophysics Data System (ADS)

Gao, Xiatian; Wang, Xiaogang; Jiang, Binhao

2017-10-01

UPSF (Universal Plasma Simulation Framework) is a new plasma simulation code designed for maximum flexibility by using cutting-edge techniques supported by the C++17 standard. Through the use of metaprogramming, UPSF provides arbitrary-dimensional data structures and methods to support various kinds of plasma simulation models, such as Vlasov, particle-in-cell (PIC), fluid, and Fokker-Planck models, as well as their variants and hybrid methods. Through C++ metaprogramming, a single code can be applied to systems of arbitrary dimension with no loss of performance. UPSF can also automatically parallelize the distributed data structure and accelerate matrix and tensor operations with BLAS. A three-dimensional particle-in-cell code is developed based on UPSF. Two test cases, Landau damping and the Weibel instability, for the electrostatic and electromagnetic situations respectively, are presented to show the validation and performance of the UPSF code.
Modeling and inversion Matlab algorithms for resistivity, induced polarization and seismic data

NASA Astrophysics Data System (ADS)

Karaoulis, M.; Revil, A.; Minsley, B. J.; Werkema, D. D.

2011-12-01

M. Karaoulis (1), D.D. Werkema (3), A. Revil (1,2), B. Minsley (4). (1) Colorado School of Mines, Dept. of Geophysics, Golden, CO, USA. (2) ISTerre, CNRS, UMR 5559, Université de Savoie, Equipe Volcan, Le Bourget du Lac, France. (3) U.S. EPA, ORD, NERL, ESD, CMB, Las Vegas, Nevada, USA. (4) USGS, Federal Center, Lakewood, 10, 80225-0046, CO. Abstract: We propose a 2D and 3D forward modeling and inversion package for DC resistivity, time-domain induced polarization (IP), frequency-domain IP, and seismic refraction data. For the resistivity and IP case, discretization is based on rectangular cells, where the unknown in each cell is the resistivity in DC modelling, the resistivity and chargeability in time-domain IP modelling, and the complex resistivity in spectral IP modelling. The governing partial-differential equations are solved with the finite element method, which can be applied to both real and complex unknowns. For the seismic case, forward modeling is based on solving the eikonal equation using a second-order fast marching method. The wavepaths are materialized by Fresnel volumes rather than by conventional rays. This approach accounts for complicated velocity models and is advantageous because it considers frequency effects on the velocity resolution. The inversion can accommodate data at a single time step, or as a time-lapse dataset if the geophysical data are gathered for monitoring purposes. The aim of time-lapse inversion is to find the change in the velocities or resistivities of each model cell as a function of time. Different time-lapse algorithms can be applied such as independent inversion, difference inversion, 4D inversion, and 4D active time constraint inversion. The forward algorithms are benchmarked against analytical solutions and inversion results are compared with existing ones. The algorithms are packaged as Matlab codes with a simple Graphical User Interface.
Although the code is parallelized for multi-core CPUs, it is not as fast as machine code. For large datasets, users should consider transferring parts of the code to C or Fortran through mex files. This code is available through EPA's website at the following link: http://www.epa.gov/esd/cmb/GeophysicsWebsite/index.html Although this work was reviewed by EPA and approved for publication, it may not necessarily reflect official Agency policy.
Large-scale trench-normal mantle flow beneath central South America

NASA Astrophysics Data System (ADS)

Reiss, M. C.; Rümpker, G.; Wölbern, I.

2018-01-01

We investigate the anisotropic properties of the fore-arc region of the central Andean margin between 17-25°S by analyzing shear-wave splitting from teleseismic and local earthquakes from the Nazca slab. With partly over ten years of recording time, the data set is uniquely suited to address the long-standing debate about the mantle flow field at the South American margin and in particular whether the flow field beneath the slab is parallel or perpendicular to the trench. Our measurements suggest two anisotropic layers located within the crust and mantle beneath the stations, respectively. The teleseismic measurements show a moderate change of fast polarizations from North to South along the trench, ranging from parallel to subparallel to the absolute plate motion, and are oriented mostly perpendicular to the trench. Shear-wave splitting measurements from local earthquakes show fast polarizations roughly aligned trench-parallel but exhibit short-scale variations which are indicative of a relatively shallow origin. Comparisons between fast polarization directions from local earthquakes and the strike of the local fault systems yield a good agreement. To infer the parameters of the lower anisotropic layer we employ an inversion of the teleseismic waveforms based on two-layer models, where the anisotropy of the upper (crustal) layer is constrained by the results from the local splitting. The waveform inversion yields a mantle layer that is best characterized by a fast axis parallel to the absolute plate motion which is more-or-less perpendicular to the trench. This orientation is likely caused by a combination of the fossil crystallographic preferred orientation of olivine within the slab and entrained mantle flow beneath the slab. The anisotropy within the crust of the overriding continental plate is explained by the shape-preferred orientation of micro-cracks in relation to local fault zones which are oriented parallel to the overall strike of the Andean range.
Our results do not provide any evidence for a significant contribution of trench-parallel mantle flow beneath the subducting slab.
The multigrid preconditioned conjugate gradient method

NASA Technical Reports Server (NTRS)

Tatebe, Osamu

1993-01-01

A multigrid preconditioned conjugate gradient method (MGCG method), which uses the multigrid method as a preconditioner of the PCG method, is proposed. The multigrid method has inherent high parallelism and improves convergence of long wavelength components, which is important in iterative methods. By using this method as a preconditioner of the PCG method, an efficient method with high parallelism and fast convergence is obtained. First, a necessary condition that the multigrid preconditioner must meet in order to satisfy the requirements of a preconditioner of the PCG method is considered. Next, numerical experiments show the behavior of the MGCG method and that the MGCG method is superior to both the ICCG method and the multigrid method in terms of fast convergence and high parallelism. This fast convergence is understood in terms of the eigenvalue analysis of the preconditioned matrix. From this observation of the multigrid preconditioner, it is realized that the MGCG method converges in very few iterations and the multigrid preconditioner is a desirable preconditioner of the conjugate gradient method.
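The structure of a preconditioned conjugate gradient iteration can be sketched as follows. This is a minimal illustration, not the code from the report: the preconditioner is passed in as a function (`apply_Minv`, a hypothetical name), and a simple diagonal (Jacobi) preconditioner stands in where MGCG would apply one multigrid cycle. The test matrix, tolerance and all function names are assumptions chosen for the example. For CG to remain valid, whatever is plugged in as `apply_Minv` must act as a symmetric positive definite operator.

```python
def matvec(diag, off, x):
    """Return A @ x for a symmetric tridiagonal A with main diagonal
    `diag` and a constant off-diagonal value `off`."""
    n = len(x)
    y = [diag[i] * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] += off * x[i + 1]
        y[i + 1] += off * x[i]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(apply_A, apply_Minv, b, tol=1e-10, max_iter=200):
    """Preconditioned CG for an SPD system A x = b, starting from x = 0.
    Returns the approximate solution and the number of iterations used."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the zero initial guess
    z = apply_Minv(r)        # preconditioned residual
    p = list(z)              # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(1, max_iter + 1):
        Ap = apply_A(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        z = apply_Minv(r)    # a multigrid cycle would be applied here in MGCG
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Illustrative SPD test problem (an assumption, not from the report):
# strictly diagonally dominant tridiagonal matrix.
n = 50
diag = [2.0 + i for i in range(n)]
b = [1.0] * n


def apply_A(v):
    return matvec(diag, -1.0, v)


def apply_Minv(r):
    # Jacobi preconditioner: divide by the diagonal of A.
    return [ri / di for ri, di in zip(r, diag)]


x, iters = pcg(apply_A, apply_Minv, b)
```

Keeping the preconditioner behind a single function call is what lets the same loop serve ICCG, Jacobi or multigrid variants; only `apply_Minv` changes.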
Performance of a parallel thermal-hydraulics code TEMPEST

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fann, G.I.; Trent, D.S.

The authors describe the parallelization of the TEMPEST thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scaling performance for 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) is presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on a 128-processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.
Hybrid spread spectrum radio system

DOEpatents

Smith, Stephen F.; Dress, William B.

2010-02-02

Systems and methods are described for hybrid spread spectrum radio systems. A method includes modulating a signal by utilizing a subset of bits from a pseudo-random code generator to control an amplification circuit that provides a gain to the signal. Another method includes: modulating a signal by utilizing a subset of bits from a pseudo-random code generator to control a fast hopping frequency synthesizer; and fast frequency hopping the signal with the fast hopping frequency synthesizer, wherein multiple frequency hops occur within a single data-bit time.
Automatic Multilevel Parallelization Using OpenMP

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

2002-01-01

In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
High-speed three-dimensional measurements with a fringe projection-based optical sensor

NASA Astrophysics Data System (ADS)

Bräuer-Burchardt, Christian; Breitbarth, Andreas; Kühmstedt, Peter; Notni, Gunther

2014-11-01

An optical three-dimensional (3-D) sensor based on a fringe projection technique that realizes the acquisition of the surface geometry of small objects was developed for highly resolved and ultrafast measurements. It realizes a data acquisition rate up to 60 high-resolution 3-D datasets per second. The high measurement velocity was achieved by consequent fringe code reduction and parallel data processing. The reduction of the length of the fringe image sequence was obtained by omission of the Gray code sequence using the geometric restrictions of the measurement objects and the geometric constraints of the sensor arrangement. The sensor covers three different measurement fields between 20 mm×20 mm and 40 mm×40 mm with a spatial resolution between 10 and 20 μm, respectively. In order to obtain a robust and fast recalibration of the sensor after change of the measurement field, a calibration procedure based on single shot analysis of a special test object was applied which works with low effort and time. The sensor may be used, e.g., for quality inspection of conductor boards or plugs in real-time industrial applications.
Benchmark studies of thermal jet mixing in SFRs using a two-jet model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Omotowa, O. A.; Skifton, R.; Tokuhiro, A.

To guide the modeling, simulations and design of Sodium Fast Reactors (SFRs), we explore and compare the predictive capabilities of two numerical solvers, COMSOL and OpenFOAM, in the thermal jet mixing of two buoyant jets typical of the outlet flow from an SFR tube bundle. This process will help optimize ongoing experimental efforts at obtaining high-resolution data for V&V of CFD codes as anticipated in next generation nuclear systems. Using the k-ε turbulence models of both codes as reference, their ability to simulate the turbulence behavior in similar environments was first validated for single jet experimental data reported in literature. This study investigates the thermal mixing of two parallel jets having a temperature difference (hot-to-cold) ΔT_hc = 5 °C, 10 °C and velocity ratios U_c/U_h = 0.5, 1. Results of the computed turbulent quantities due to convective mixing and the variations in flow field along the axial position are presented. In addition, this study also evaluates the effect of spacing ratio between jets in predicting the flow field and jet behavior in near and far fields. (authors)
A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)

NASA Astrophysics Data System (ADS)

Wan, Hui; Zhang, Kai; Rasch, Philip J.; Singh, Balwinder; Chen, Xingyuan; Edwards, Jim

2017-02-01

A test procedure is proposed for identifying numerically significant solution changes in evolution equations used in atmospheric models. The test issues a fail signal when any code modifications or computing environment changes lead to solution differences that exceed the known time step sensitivity of the reference model. Initial evidence is provided using the Community Atmosphere Model (CAM) version 5.3 that the proposed procedure can be used to distinguish rounding-level solution changes from impacts of compiler optimization or parameter perturbation, which are known to cause substantial differences in the simulated climate. The test is not exhaustive since it does not detect issues associated with diagnostic calculations that do not feedback to the model state variables. Nevertheless, it provides a practical and objective way to assess the significance of solution changes. The short simulation length implies low computational cost. The independence between ensemble members allows for parallel execution of all simulations, thus facilitating fast turnaround. The new method is simple to implement since it does not require any code modifications. We expect that the same methodology can be used for any geophysical model to which the concept of time step convergence is applicable.
Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P

DOE Office of Scientific and Technical Information (OSTI.GOV)

Candel, A; Kabel, A.; Lee, L.

In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.
Fast l₁-SPIRiT compressed sensing parallel imaging MRI: scalable parallel implementation and clinically feasible runtime.

PubMed

Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

2012-06-01

We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
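The thresholding step at the heart of such a reconstruction can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: "cross-channel joint sparsity" is taken here to mean that the coefficients at one wavelet position are shrunk together according to their combined magnitude across channels (a group soft-threshold). The function name `joint_soft_threshold`, the threshold `lam` and the sample values are hypothetical.

```python
import math


def joint_soft_threshold(channels, lam):
    """Shrink coefficients jointly across channels.

    `channels` is a list of equal-length lists, one per receive channel,
    holding (possibly complex) transform coefficients. At each position
    the l2 norm over channels is reduced by `lam`; positions whose norm
    is at most `lam` are set to zero in every channel.
    """
    n = len(channels[0])
    out = [[0j] * n for _ in channels]
    for k in range(n):
        norm = math.sqrt(sum(abs(ch[k]) ** 2 for ch in channels))
        if norm > lam:
            scale = 1.0 - lam / norm
            for c, ch in enumerate(channels):
                out[c][k] = ch[k] * scale
    return out


# Two channels, two coefficient positions (illustrative values only).
coeffs = [[3 + 0j, 0.1 + 0j],
          [4j, 0.1j]]
shrunk = joint_soft_threshold(coeffs, lam=2.5)
```

Each coefficient position is independent of the others, so the outer loop is the natural unit of work to distribute across GPU threads or CPU cores.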
Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

PubMed Central

Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

2012-01-01

We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
Parallelization of KENO-Va Monte Carlo code

NASA Astrophysics Data System (ADS)

Ramón, Javier; Peña, Jorge

1995-07-01

KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code have been generated: one for shared-memory machines and the other for distributed-memory systems using the message-passing interface PVM. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared-memory version. An FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
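One common way to keep a parallel Monte Carlo run reproducible is to give every particle history its own random stream, fixed in advance by the generation and history index, so the answer does not depend on which worker tracks which neutron. The sketch below assumes that this is the idea behind the "advanced seeds" mentioned above; the abstract gives no details, and the toy physics, the seed formula and all names here are hypothetical. Workers are simulated sequentially to keep the example short.

```python
import random


def track_history(base_seed, generation, index, absorb_prob=0.3):
    """Toy neutron history: count collisions until absorption.

    The random stream depends only on (base_seed, generation, index),
    never on the worker that happens to process the history.
    """
    seed = (base_seed * 1_000_003 + generation) * 1_000_003 + index
    rng = random.Random(seed)
    collisions = 1
    while rng.random() >= absorb_prob:
        collisions += 1
    return collisions


def run_generation(base_seed, generation, n_histories, n_workers):
    """Deal histories round-robin to `n_workers` and sum integer tallies."""
    partial = [0] * n_workers
    for w in range(n_workers):
        for i in range(w, n_histories, n_workers):
            partial[w] += track_history(base_seed, generation, i)
    return sum(partial)


serial_tally = run_generation(12345, 0, 1000, n_workers=1)
parallel_tally = run_generation(12345, 0, 1000, n_workers=4)
```

The tallies here are integers, so the combined result is identical for any worker count. With floating-point tallies the partial sums would also need to be combined in a fixed order to stay bit-for-bit reproducible.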
Optimizing Excited-State Electronic-Structure Codes for Intel Knights Landing: A Case Study on the BerkeleyGW Software

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek

2016-10-06

We profile and optimize calculations performed with the BerkeleyGW code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights-Landing including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.
FastChem: An ultra-fast equilibrium chemistry

NASA Astrophysics Data System (ADS)

Kitzmann, Daniel; Stock, Joachim

2018-04-01

FastChem is an equilibrium chemistry code that calculates the chemical composition of the gas phase for given temperatures and pressures. Written in C++, it is based on a semi-analytic approach, and is optimized for extremely fast and accurate calculations.
Efficiently modeling neural networks on massively parallel computers

NASA Technical Reports Server (NTRS)

Farber, Robert M.

1993-01-01

Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors))). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed-forward neural networks, although this method is extendable to recurrent networks.
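The logarithmic cost of the global summation can be illustrated with a pairwise (tree) reduction. This is a generic sketch that simulates the processors with a Python list, not the CM-2 mapping described in the paper; the function name `tree_sum` is hypothetical. In each step, all additions at the current stride are independent and would run simultaneously on a parallel machine, so the step count, not the number of additions, sets the runtime.

```python
def tree_sum(values):
    """Sum one value per simulated processor by pairwise reduction.

    Returns (total, steps), where `steps` is the number of parallel
    communication rounds: ceil(log2(P)) for P >= 1 processors.
    """
    vals = list(values)
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        # Every addition in this loop is independent of the others,
        # so a parallel machine performs them all in one round.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps


total8, steps8 = tree_sum(range(1, 9))    # 8 simulated processors
total5, steps5 = tree_sum([1, 2, 3, 4, 5])  # non-power-of-two count
```

Doubling the processor count adds only one more round, which is the sub-linear growth the abstract refers to.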
Support for Debugging Automatically Parallelized Programs

NASA Technical Reports Server (NTRS)

Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)

2001-01-01

This viewgraph presentation provides information on the technical aspects of debugging computer code that has been automatically converted for use in a parallel computing system. Shared memory parallelization and distributed memory parallelization entail separate and distinct challenges for a debugging program. A prototype system has been developed which integrates various tools for the debugging of automatically parallelized programs including the CAPTools Database which provides variable definition information across subroutines as well as array distribution information.
PIXIE3D: A Parallel, Implicit, eXtended MHD 3D Code.

NASA Astrophysics Data System (ADS)

Chacon, L.; Knoll, D. A.

2004-11-01

We report on the development of PIXIE3D, a 3D parallel, fully implicit Newton-Krylov extended primitive-variable MHD code in general curvilinear geometry. PIXIE3D employs a second-order, finite-volume-based spatial discretization that satisfies remarkable properties such as being conservative, solenoidal in the magnetic field, non-dissipative, and stable in the absence of physical dissipation (L. Chacón, Comput. Phys. Comm., submitted (2004)). PIXIE3D employs fully-implicit Newton-Krylov methods for the time advance. Currently, first and second-order implicit schemes are available, although higher-order temporal implicit schemes can be effortlessly implemented within the Newton-Krylov framework. A successful, scalable, MG physics-based preconditioning strategy, similar in concept to previous 2D MHD efforts (L. Chacón et al., J. Comput. Phys. 178 (1), 15-36 (2002); J. Comput. Phys. 188 (2), 573-592 (2003)), has been developed. We are currently in the process of parallelizing the code using the PETSc library, and a Newton-Krylov-Schwarz approach for the parallel treatment of the preconditioner. In this poster, we will report on both the serial and parallel performance of PIXIE3D, focusing primarily on scalability and CPU speedup vs. an explicit approach.

TIMEDELN: A programme for the detection and parametrization of overlapping resonances using the time-delay method

NASA Astrophysics Data System (ADS)

Little, Duncan A.; Tennyson, Jonathan; Plummer, Martin; Noble, Clifford J.; Sunderland, Andrew G.

2017-06-01

TIMEDELN implements the time-delay method of determining resonance parameters from the characteristic Lorentzian form displayed by the largest eigenvalues of the time-delay matrix. TIMEDELN constructs the time-delay matrix from input K-matrices and analyses its eigenvalues. This new version implements multi-resonance fitting and may be run serially or as a high performance parallel code with three levels of parallelism. TIMEDELN takes K-matrices from a scattering calculation, either read from a file or calculated on a dynamically adjusted grid, and calculates the time-delay matrix. This is then diagonalized, with the largest eigenvalue representing the longest time-delay experienced by the scattering particle. A resonance shows up as a characteristic Lorentzian form in the time-delay: the programme searches the time-delay eigenvalues for maxima and traces resonances when they pass through different eigenvalues, separating overlapping resonances. It also performs the fitting of the calculated data to the Lorentzian form and outputs resonance positions and widths. Any remaining overlapping resonances can be fitted jointly. The branching ratios of decay into the open channels can also be found. The programme may be run serially or in parallel with three levels of parallelism. The parallel code modules are abstracted from the main physics code and can be used independently.
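The "characteristic Lorentzian form" can be made concrete with the standard Breit-Wigner expression for the time delay near an isolated resonance, q(E) = Γ / ((E − E_r)² + (Γ/2)²) in atomic units (ħ = 1), which peaks at 4/Γ when E = E_r. The sketch below uses that textbook form with no background term; it is not taken from TIMEDELN, and the function names, energy grid and resonance values are hypothetical.

```python
def time_delay_lorentzian(energy, e_res, width):
    """Breit-Wigner time delay (atomic units) for a resonance at `e_res`
    with full width `width`; background contributions are ignored."""
    return width / ((energy - e_res) ** 2 + (width / 2.0) ** 2)


def estimate_resonance(energies, delays):
    """Crude peak-based estimate: position = energy of the largest delay,
    width = 4 / (peak delay). A least-squares fit would refine this."""
    k = max(range(len(delays)), key=delays.__getitem__)
    return energies[k], 4.0 / delays[k]


# Synthetic data on a uniform grid (illustrative values only).
energies = [0.4 + 0.001 * i for i in range(201)]
delays = [time_delay_lorentzian(e, e_res=0.5, width=0.02) for e in energies]
e_est, width_est = estimate_resonance(energies, delays)
```

The peak estimate is only as good as the energy grid, which is one reason a dynamically adjusted grid and a proper fit are useful for narrow or overlapping resonances.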
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In the additive manufacturing field, current research on data processing mainly focuses on the slicing of large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors in the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. A final case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of multi-core CPU hardware and accelerates the slicing process; compared with the data-parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
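The two scaling laws mentioned above differ in what is held fixed. Amdahl's law fixes the problem size, so the serial fraction caps the speedup; Gustafson's law lets the parallel work grow with the number of threads, so the scaled speedup grows almost linearly. The sketch below states both in their usual form. The parallel fraction 0.9 and the thread count 8 are illustrative values, not measurements from the paper.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Fixed problem size: speedup = 1 / ((1 - p) + p / n)."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_threads)


def gustafson_speedup(parallel_fraction, n_threads):
    """Problem size scaled with n: speedup = (1 - p) + p * n."""
    p = parallel_fraction
    return (1.0 - p) + p * n_threads


fixed_size = amdahl_speedup(0.9, 8)      # about 4.7
scaled_size = gustafson_speedup(0.9, 8)  # about 7.3
```

In the slicing setting, adding threads to a fixed model corresponds to the Amdahl view, while adding layers increases the parallel share of the work and corresponds to the Gustafson view.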
Fast Exact Search in Hamming Space With Multi-Index Hashing.

PubMed

Norouzi, Mohammad; Punjani, Ali; Fleet, David J

2014-06-01

There is growing interest in representing image data and feature descriptors using compact binary codes for fast near neighbor search. Although binary codes are motivated by their use as direct indices (addresses) into a hash table, codes longer than 32 bits are not being used as such, as it was thought to be ineffective. We introduce a rigorous way to build multiple hash tables on binary code substrings that enables exact k-nearest neighbor search in Hamming space. The approach is storage efficient and straight-forward to implement. Theoretical analysis shows that the algorithm exhibits sub-linear run-time behavior for uniformly distributed codes. Empirical results show dramatic speedups over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits.
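The substring idea can be sketched as follows. If two b-bit codes differ in at most r bits and each code is split into m disjoint substrings, then at least one pair of substrings differs in at most floor(r/m) bits (pigeonhole principle). Indexing each substring in its own hash table therefore yields a candidate set that cannot miss a true neighbor, and a full-distance check removes false candidates. This is a minimal sketch of the r-neighbor search only, not the authors' code; all names and the random test data are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def split(code, bits, m):
    """Split a `bits`-bit integer into m equal-width substrings
    (assumes bits is divisible by m)."""
    w = bits // m
    mask = (1 << w) - 1
    return [(code >> (i * w)) & mask for i in range(m)]


def build_tables(codes, bits, m):
    """One hash table per substring position: substring value -> code indices."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for table, sub in zip(tables, split(code, bits, m)):
            table[sub].append(idx)
    return tables


def within_radius(sub, width, radius):
    """Yield every `width`-bit value within Hamming distance `radius` of `sub`."""
    for d in range(radius + 1):
        for positions in combinations(range(width), d):
            value = sub
            for p in positions:
                value ^= 1 << p
            yield value


def r_neighbors(query, codes, tables, bits, m, r):
    """Exact search for all codes within Hamming distance r of `query`."""
    width = bits // m
    candidates = set()
    for table, sub in zip(tables, split(query, bits, m)):
        for value in within_radius(sub, width, r // m):
            candidates.update(table.get(value, ()))
    # Verify candidates with the full Hamming distance.
    return sorted(i for i in candidates
                  if bin(codes[i] ^ query).count("1") <= r)


def linear_scan(query, codes, r):
    """Brute-force baseline used to check the result."""
    return [i for i, c in enumerate(codes) if bin(c ^ query).count("1") <= r]


rng = random.Random(0)
codes = [rng.getrandbits(32) for _ in range(200)]
tables = build_tables(codes, bits=32, m=4)
query = codes[0] ^ 0b101            # flip two bits of a stored code
hits = r_neighbors(query, codes, tables, bits=32, m=4, r=6)
```

Each table lookup searches a radius of only floor(r/m) over short substrings, which is far cheaper than enumerating all full-length codes within radius r.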
Experimental verification of the role of electron pressure in fast magnetic reconnection with a guide field

DOE PAGES

Fox, W.; Sciortino, F.; v. Stechow, A.; ...

2017-03-21

We report detailed laboratory observations of the structure of a reconnection current sheet in a two-fluid plasma regime with a guide magnetic field. We observe and quantitatively analyze the quadrupolar electron pressure variation in the ion-diffusion region, as originally predicted by extended magnetohydrodynamics simulations. The projection of the electron pressure gradient parallel to the magnetic field contributes significantly to balancing the parallel electric field, and the resulting cross-field electron jets in the reconnection layer are diamagnetic in origin. Furthermore, these results demonstrate how parallel and perpendicular force balance are coupled in guide field reconnection and confirm basic theoretical models of the importance of electron pressure gradients for obtaining fast magnetic reconnection.
Xyce parallel electronic simulator : users' guide.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.

2011-05-01

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed.
As a result, Xyce is a unique electrical simulation capability, designed to meet the unique needs of the laboratory.
Automatic recognition of vector and parallel operations in a higher level language

NASA Technical Reports Server (NTRS)

Schneck, P. B.

1971-01-01

A compiler for recognizing statements of a FORTRAN program which are suited for fast execution on a parallel or pipeline machine such as Illiac-4, Star or ASC is described. The technique employs interval analysis to provide flow information to the vector/parallel recognizer. Where profitable the compiler changes scalar variables to subscripted variables. The output of the compiler is an extension to FORTRAN which shows parallel and vector operations explicitly.
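The change from scalar variables to subscripted variables mentioned above is commonly called scalar expansion. The following is a minimal sketch in Python, chosen only for illustration (the compiler described operates on FORTRAN, and the function names `scalar_loop` and `expanded_loop` are hypothetical). It shows why the transformation helps: once the temporary `t` becomes an array, each statement can be applied to the whole index range independently.

```python
def scalar_loop(a, b):
    """Original form: the scalar t is reused by every iteration."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        t = a[i] + b[i]      # t is overwritten on each iteration
        c[i] = t * t
    return c


def expanded_loop(a, b):
    """After scalar expansion: t becomes the subscripted variable t[i]."""
    n = len(a)
    t = [a[i] + b[i] for i in range(n)]   # first whole-array operation
    c = [t[i] * t[i] for i in range(n)]   # second whole-array operation
    return c
```

Both forms compute the same result; the expanded form trades extra storage for the removal of the loop-carried reuse of `t`.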
Interaction between high harmonic fast waves and fast ions in NSTX/NSTX-U plasmas

NASA Astrophysics Data System (ADS)

Bertelli, N.; Valeo, E. J.; Gorelenkova, M.; Green, D. L.; RF SciDAC Team

2016-10-01

Fast wave (FW) heating in the ion cyclotron range of frequency (ICRF) has been successfully used to sustain and control the fusion plasma performance, and it will likely play an important role in the ITER experiment. As demonstrated in the NSTX and DIII-D experiments, the interactions between fast waves and fast ions can be so strong as to significantly modify the fast ion population from neutral beam injection. In fact, it has been recently found in NSTX that FWs can modify and, under certain conditions, even suppress the energetic particle driven instabilities, such as toroidal Alfvén eigenmodes, global Alfvén eigenmodes and fishbones. This paper examines such interactions in NSTX/NSTX-U plasmas by using the recent extension of the RF full-wave code TORIC to include non-Maxwellian ion distribution functions. Particular attention is given to the evolution of the fast ion distribution function with and without RF. Tests of the RF kick-operator implemented in the Monte-Carlo particle code NUBEAM are also discussed in order to move towards a self-consistent evaluation of the RF wave-field and the ion distribution functions in the TRANSP code. Work supported by US DOE Contract DE-AC02-09CH11466.
A domain specific language for performance portable molecular dynamics algorithms

NASA Astrophysics Data System (ADS)

Saunders, William Robert; Grant, James; Müller, Eike Hermann

2018-03-01

Developers of Molecular Dynamics (MD) codes face significant challenges when adapting existing simulation packages to new hardware. In a continuously diversifying hardware landscape it becomes increasingly difficult for scientists to be experts both in their own domain (physics/chemistry/biology) and specialists in the low level parallelisation and optimisation of their codes. To address this challenge, we describe a "Separation of Concerns" approach for the development of parallel and optimised MD codes: the science specialist writes code at a high abstraction level in a domain specific language (DSL), which is then translated into efficient computer code by a scientific programmer. In a related context, an abstraction for the solution of partial differential equations with grid based methods has recently been implemented in the (Py)OP2 library. Inspired by this approach, we develop a Python code generation system for molecular dynamics simulations on different parallel architectures, including massively parallel distributed memory systems and GPUs. We demonstrate the efficiency of the auto-generated code by studying its performance and scalability on different hardware and compare it to other state-of-the-art simulation packages. With growing data volumes the extraction of physically meaningful information from the simulation becomes increasingly challenging and requires equally efficient implementations. A particular advantage of our approach is the easy expression of such analysis algorithms. We consider two popular methods for deducing the crystalline structure of a material from the local environment of each atom, show how they can be expressed in our abstraction and implement them in the code generation framework.
Optimization of Particle-in-Cell Codes on RISC Processors

NASA Technical Reports Server (NTRS)

Decyk, Viktor K.; Karmesin, Steve Roy; Boer, Aeint de; Liewer, Paulette C.

1996-01-01

General strategies are developed to optimize particle-in-cell codes written in Fortran for RISC processors, which are commonly used on massively parallel computers. These strategies include data reorganization to improve cache utilization and code reorganization to improve the efficiency of arithmetic pipelines.
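One common form of data reorganization in particle-in-cell codes is to keep particles ordered by the grid cell they occupy, so that successive particles touch neighbouring grid entries. The sketch below assumes this technique for illustration; the abstract does not say which reorganization the authors used, the original code is Fortran rather than Python, and the names `sort_particles_by_cell` and `deposit_charge` are hypothetical.

```python
def sort_particles_by_cell(x, v, dx):
    """Reorder particle arrays so particles in the same cell are contiguous."""
    order = sorted(range(len(x)), key=lambda p: int(x[p] // dx))
    return [x[p] for p in order], [v[p] for p in order]


def deposit_charge(x, dx, ncells, q=1.0):
    """Nearest-grid-point deposit; with sorted input, rho is visited in order."""
    rho = [0.0] * ncells
    for xp in x:
        rho[int(xp // dx)] += q
    return rho
```

With sorted input the deposit loop walks through `rho` monotonically instead of jumping between distant cells, which is the access pattern a cache rewards.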
Non-Maxwellian fast particle effects in gyrokinetic GENE simulations

NASA Astrophysics Data System (ADS)

Di Siena, A.; Görler, T.; Doerk, H.; Bilato, R.; Citrin, J.; Johnson, T.; Schneider, M.; Poli, E.; JET Contributors

2018-04-01

Fast ions have recently been found to significantly impact and partially suppress plasma turbulence both in experimental and numerical studies in a number of scenarios. Understanding the underlying physics and identifying the range of their beneficial effect is an essential task for future fusion reactors, where highly energetic ions are generated through fusion reactions and external heating schemes. However, in many of the gyrokinetic codes fast ions are, for simplicity, treated as equivalent-Maxwellian-distributed particle species, although it is well known that to rigorously model highly non-thermalised particles, a non-Maxwellian background distribution function is needed. To study the impact of this assumption, the gyrokinetic code GENE has recently been extended to support arbitrary background distribution functions which might be either analytical, e.g., slowing down and bi-Maxwellian, or obtained from numerical fast ion models. A particular JET plasma with strong fast-ion related turbulence suppression is revised with these new code capabilities both with linear and nonlinear gyrokinetic simulations. It appears that the fast ion stabilization tends to be less strong but still substantial with more realistic distributions, and this improves the quantitative power balance agreement with experiments.
Parallelization of sequential Gaussian, indicator and direct simulation algorithms

NASA Astrophysics Data System (ADS)

Nunes, Ruben; Almeida, José A.

2010-08-01

Improving the performance and robustness of algorithms on new high-performance parallel computing architectures is a key issue in efficiently performing 2D and 3D studies with large amounts of data. In geostatistics, sequential simulation algorithms are good candidates for parallelization. When compared with other computational applications in geosciences (such as fluid flow simulators), sequential simulation software is not extremely computationally intensive, but parallelization can make it more efficient and creates alternatives for its integration in inverse modelling approaches. This paper describes the implementation and benchmarking of a parallel version of the three classic sequential simulation algorithms: direct sequential simulation (DSS), sequential indicator simulation (SIS) and sequential Gaussian simulation (SGS). For this purpose, the source used was GSLIB, but the entire code was extensively modified to take into account the parallelization approach and was also rewritten in the C programming language. The paper also explains in detail the parallelization strategy and the main modifications. Regarding the integration of secondary information, the DSS algorithm is able to perform simple kriging with local means, kriging with an external drift and collocated cokriging with both local and global correlations. SIS includes a local correction of probabilities. Finally, a brief comparison is presented of simulation results using one, two and four processors. All performance tests were carried out on 2D soil data samples. The source code is completely open source and easy to read. It should be noted that the code is only fully compatible with Microsoft Visual C and should be adapted for other systems/compilers.
RY-Coding and Non-Homogeneous Models Can Ameliorate the Maximum-Likelihood Inferences From Nucleotide Sequence Data with Parallel Compositional Heterogeneity.

PubMed

Ishikawa, Sohta A; Inagaki, Yuji; Hashimoto, Tetsuo

2012-01-01

In phylogenetic analyses of nucleotide sequences, 'homogeneous' substitution models, which assume the stationarity of base composition across a tree, are widely used, albeit individual sequences may bear distinctive base frequencies. In the worst-case scenario, a homogeneous model-based analysis can yield an artifactual union of two distantly related sequences that achieved similar base frequencies in parallel. Such potential difficulty can be countered by two approaches, 'RY-coding' and 'non-homogeneous' models. The former approach converts four bases into purine and pyrimidine to normalize base frequencies across a tree, while the heterogeneity in base frequency is explicitly incorporated in the latter approach. The two approaches have been applied to real-world sequence data; however, their basic properties have not been fully examined by pioneering simulation studies. Here, we assessed the performances of the maximum-likelihood analyses incorporating RY-coding and a non-homogeneous model (RY-coding and non-homogeneous analyses) on simulated data with parallel convergence to similar base composition. Both RY-coding and non-homogeneous analyses showed superior performances compared with homogeneous model-based analyses. Curiously, the performance of RY-coding analysis appeared to be significantly affected by a setting of the substitution process for sequence simulation relative to that of non-homogeneous analysis. The performance of a non-homogeneous analysis was also validated by analyzing a real-world sequence data set with significant base heterogeneity.
Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azmy, Y.Y.; Barnett, D.A.

1999-09-27

The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Haghighat, A.; Sjoden, G.E.; Wagner, J.C.

In the past 10 yr, the Penn State Transport Theory Group (PSTTG) has concentrated its efforts on developing accurate and efficient particle transport codes to address increasing needs for efficient and accurate simulation of nuclear systems. The PSTTG's efforts have primarily focused on shielding applications that are generally treated using multigroup, multidimensional, discrete ordinates (S{sub n}) deterministic and/or statistical Monte Carlo methods. The difficulty with the existing public codes is that they require significant (impractical) computation time for simulation of complex three-dimensional (3-D) problems. For the S{sub n} codes, the large memory requirements are handled through the use of scratch files (i.e., read-from and write-to-disk) that significantly increases the necessary execution time. Further, the lack of flexible features and/or utilities for preparing input and processing output makes these codes difficult to use. The Monte Carlo method becomes impractical because variance reduction (VR) methods have to be used, and normally determination of the necessary parameters for the VR methods is very difficult and time consuming for a complex 3-D problem. For the deterministic method, the authors have developed the 3-D parallel PENTRAN (Parallel Environment Neutral-particle TRANsport) code system that, in addition to a parallel 3-D S{sub n} solver, includes pre- and postprocessing utilities. PENTRAN provides for full phase-space decomposition, memory partitioning, and parallel input/output to provide the capability of solving large problems in a relatively short time. Besides having a modular parallel structure, PENTRAN has several unique new formulations and features that are necessary for achieving high parallel performance. For the Monte Carlo method, the major difficulty currently facing most users is the selection of an effective VR method and its associated parameters. For complex problems, generally, this process is very time consuming and may be complicated due to the possibility of biasing the results. In an attempt to eliminate this problem, the authors have developed the A{sup 3}MCNP (automated adjoint accelerated MCNP) code that automatically prepares parameters for source and transport biasing within a weight-window VR approach based on the S{sub n} adjoint function. A{sup 3}MCNP prepares the necessary input files for performing multigroup, 3-D adjoint S{sub n} calculations using TORT.
Parallel CARLOS-3D code development

DOE Office of Scientific and Technical Information (OSTI.GOV)

Putnam, J.M.; Kotulski, J.D.

1996-02-01

CARLOS-3D is a three-dimensional scattering code which was developed under the sponsorship of the Electromagnetic Code Consortium, and is currently used by over 80 aerospace companies and government agencies. The code has been extensively validated and runs on both serial workstations and parallel supercomputers such as the Intel Paragon. CARLOS-3D is a three-dimensional surface integral equation scattering code based on a Galerkin method of moments formulation employing Rao-Wilton-Glisson roof-top basis functions for triangular faceted surfaces. Fully arbitrary 3D geometries composed of multiple conducting and homogeneous bulk dielectric materials can be modeled. This presentation describes some of the extensions to the CARLOS-3D code, and how the operator structure of the code facilitated these improvements. Body of revolution (BOR) and two-dimensional geometries were incorporated by simply including new input routines, and the appropriate Galerkin matrix operator routines. Some additional modifications were required in the combined field integral equation matrix generation routine due to the symmetric nature of the BOR and 2D operators. Quadrilateral patched surfaces with linear roof-top basis functions were also implemented in the same manner. Quadrilateral facets and triangular facets can be used in combination to more efficiently model geometries with both large smooth surfaces and surfaces with fine detail such as gaps and cracks. Since the parallel implementation in CARLOS-3D is at a high level, these changes were independent of the computer platform being used. This approach minimizes code maintenance, while providing capabilities with little additional effort. Results are presented showing the performance and accuracy of the code for some large scattering problems. Comparisons between triangular faceted and quadrilateral faceted geometry representations will be shown for some complex scatterers.
Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers

NASA Technical Reports Server (NTRS)

Skiles, J. W.; Schulbach, C. H.

1994-01-01

Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We show the Cray shared-memory vector-architecture, and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speed-up of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5% of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.
Open-Source Development of the Petascale Reactive Flow and Transport Code PFLOTRAN

NASA Astrophysics Data System (ADS)

Hammond, G. E.; Andre, B.; Bisht, G.; Johnson, T.; Karra, S.; Lichtner, P. C.; Mills, R. T.

2013-12-01

Open-source software development has become increasingly popular in recent years. Open-source encourages collaborative and transparent software development and promotes unlimited free redistribution of source code to the public. Open-source development is good for science as it reveals implementation details that are critical to scientific reproducibility, but generally excluded from journal publications. In addition, research funds that would have been spent on licensing fees can be redirected to code development that benefits more scientists. In 2006, the developers of PFLOTRAN open-sourced their code under the U.S. Department of Energy SciDAC-II program. Since that time, the code has gained popularity among code developers and users from around the world seeking to employ PFLOTRAN to simulate thermal, hydraulic, mechanical and biogeochemical processes in the Earth's surface/subsurface environment. PFLOTRAN is a massively-parallel subsurface reactive multiphase flow and transport simulator designed from the ground up to run efficiently on computing platforms ranging from the laptop to leadership-class supercomputers, all from a single code base. The code employs domain decomposition for parallelism and is founded upon the well-established and open-source parallel PETSc and HDF5 frameworks. PFLOTRAN leverages modern Fortran (i.e. Fortran 2003-2008) in its extensible object-oriented design. The use of this progressive, yet domain-friendly programming language has greatly facilitated collaboration in the code's software development. Over the past year, PFLOTRAN's top-level data structures were refactored as Fortran classes (i.e. extendible derived types) to improve the flexibility of the code, ease the addition of new process models, and enable coupling to external simulators. 
For instance, PFLOTRAN has been coupled to the parallel electrical resistivity tomography code E4D to enable hydrogeophysical inversion while the same code base can be used as a third-party library to provide hydrologic flow, energy transport, and biogeochemical capability to the community land model, CLM, part of the open-source community earth system model (CESM) for climate. In this presentation, the advantages and disadvantages of open source software development in support of geoscience research at government laboratories, universities, and the private sector are discussed. Since the code is open-source (i.e. it's transparent and readily available to competitors), the PFLOTRAN team's development strategy within a competitive research environment is presented. Finally, the developers discuss their approach to object-oriented programming and the leveraging of modern Fortran in support of collaborative geoscience research as the Fortran standard evolves among compiler vendors.
OSIRIS - an object-oriented parallel 3D PIC code for modeling laser and particle beam-plasma interaction

NASA Astrophysics Data System (ADS)

Hemker, Roy

1999-11-01

The advances in computational speed make it now possible to do full 3D PIC simulations of laser plasma and beam plasma interactions, but at the same time the increased complexity of these problems makes it necessary to apply modern approaches like object oriented programming to the development of simulation codes. We report here on our progress in developing an object oriented parallel 3D PIC code using Fortran 90. In its current state the code contains algorithms for 1D, 2D, and 3D simulations in cartesian coordinates and for 2D cylindrically-symmetric geometry. For all of these algorithms the code allows for a moving simulation window and arbitrary domain decomposition for any number of dimensions. Recent 3D simulation results on the propagation of intense laser and electron beams through plasmas will be presented.
An object-oriented, coprocessor-accelerated model for ice sheet simulations

NASA Astrophysics Data System (ADS)

Seddik, H.; Greve, R.

2013-12-01

Recently, numerous models capable of modeling the thermo-dynamics of ice sheets have been developed within the ice sheet modeling community. Their capabilities have been characterized by a wide range of features with different numerical methods (finite difference or finite element), different implementations of the ice flow mechanics (shallow-ice, higher-order, full Stokes) and different treatments for the basal and coastal areas (basal hydrology, basal sliding, ice shelves). Shallow-ice models (SICOPOLIS, IcIES, PISM, etc) have been widely used for modeling whole ice sheets (Greenland and Antarctica) due to the relatively low computational cost of the shallow-ice approximation but higher order (ISSM, AIF) and full Stokes (Elmer/Ice) models have been recently used to model the Greenland ice sheet. The advance in processor speed and the decrease in cost for accessing large amount of memory and storage have undoubtedly been the driving force in the commoditization of models with higher capabilities, and the popularity of Elmer/Ice (http://elmerice.elmerfem.com) with an active user base is a notable representation of this trend. Elmer/Ice is a full Stokes model built on top of the multi-physics package Elmer (http://www.csc.fi/english/pages/elmer) which provides the full machinery for the complex finite element procedure and is fully parallel (mesh partitioning with OpenMPI communication). Elmer is mainly written in Fortran 90 and targets essentially traditional processors as the code base was not initially written to run on modern coprocessors (yet adding support for the recently introduced x86 based coprocessors is possible). Furthermore, a truly modular and object-oriented implementation is required for quick adaptation to fast evolving capabilities in hardware (Fortran 2003 provides an object-oriented programming model while not being clean and requiring a tricky refactoring of Elmer code). 
In this work, the object-oriented, coprocessor-accelerated finite element code Sainou is introduced. Sainou is an Elmer fork which is reimplemented in Objective-C and used for experimenting with ice sheet models running on coprocessors, essentially GPU devices. GPUs are highly parallel processors that provide opportunities for fine-grained parallelization of the full Stokes problem using the standard OpenCL language (http://www.khronos.org/opencl/) to access the device. Sainou is built upon a collection of Objective-C base classes that service a modular kernel (itself a base class) which provides the core methods to solve the finite element problem. An early implementation of Sainou will be presented with emphasis on the object architecture and the parallelization strategies. The computation of a simple heat conduction problem is used to test the implementation, which also provides experimental support for running the global matrix assembly on the GPU.
Motion streaks in fast motion rivalry cause orientation-selective suppression.

PubMed

Apthorp, Deborah; Wenderoth, Peter; Alais, David

2009-05-14

We studied binocular rivalry between orthogonally translating arrays of random Gaussian blobs and measured the strength of rivalry suppression for static oriented probes. Suppression depth was quantified by expressing monocular probe thresholds during dominance relative to thresholds during suppression. Rivalry between two fast motions or two slow motions was compared in order to test the suggestion that fast-moving objects leave oriented "motion streaks" due to temporal integration (W. S. Geisler, 1999). If fast motions do produce motion streaks, then fast motion rivalry might also entail rivalry between the orthogonal streak orientations. We tested this using a static oriented probe that was aligned either parallel to the motion trajectory (hence collinear with the "streaks") or was orthogonal to the trajectory, predicting that rivalry suppression would be greater for parallel probes, and only for rivalry between fast motions. Results confirmed that suppression depth did depend on probe orientation for fast motion but not for slow motion. Further experiments showed that threshold elevations for the oriented probe during suppression exhibited clear orientation tuning. However, orientation-tuned elevations were also present during dominance, suggesting within-channel masking as the basis of the extra-deep suppression. In sum, the presence of orientation-dependent suppression in fast motion rivalry is consistent with the "motion streaks" hypothesis.

Inter-view prediction of intra mode decision for high-efficiency video coding-based multiview video coding

NASA Astrophysics Data System (ADS)

da Silva, Thaísa Leal; Agostini, Luciano Volcan; da Silva Cruz, Luis A.

2014-05-01

Intra prediction is a very important tool in current video coding standards. High-efficiency video coding (HEVC) intra prediction presents relevant gains in encoding efficiency when compared to previous standards, but with a very important increase in the computational complexity since 33 directional angular modes must be evaluated. Motivated by this high complexity, this article presents a complexity reduction algorithm developed to reduce the HEVC intra mode decision complexity targeting multiview videos. The proposed algorithm presents an efficient fast intra prediction compliant with singleview and multiview video encoding. This fast solution defines a reduced subset of intra directions according to the video texture and it exploits the relationship between prediction units (PUs) of neighbor depth levels of the coding tree. This fast intra coding procedure is used to develop an inter-view prediction method, which exploits the relationship between the intra mode directions of adjacent views to further accelerate the intra prediction process in multiview video encoding applications. When compared to HEVC simulcast, our method achieves a complexity reduction of up to 47.77%, at the cost of an average BD-PSNR loss of 0.08 dB.
Novel Optical Processor for Phased Array Antenna.

DTIC Science & Technology

1992-10-20

parallel glass slide into the signal beam optical loop. The parallel glass acts like a variable phase shifter to the signal beam, simulating phase drift... A list of possible designs is given as follows: [table garbled in extraction; recoverable column headings are velocity, frequency limit at 100 dB/cm, and wavelength, and the recoverable entry is longitudinal TeO2, 4.2 µm/ns, about 3 GHz]... subject to achievable acoustic frequency, the preferred materials are the slow shear wave in TeO2, the fast shear wave in TeO2 or the shear waves in
MetaQuant: a tool for the automatic quantification of GC/MS-based metabolome data.

PubMed

Bunk, Boyke; Kucklick, Martin; Jonas, Rochus; Münch, Richard; Schobert, Max; Jahn, Dieter; Hiller, Karsten

2006-12-01

MetaQuant is a Java-based program for the automatic and accurate quantification of GC/MS-based metabolome data. In contrast to other programs MetaQuant is able to quantify hundreds of substances simultaneously with minimal manual intervention. The integration of a self-acting calibration function allows the parallel and fast calibration for several metabolites simultaneously. Finally, MetaQuant is able to import GC/MS data in the common NetCDF format and to export the results of the quantification into Systems Biology Markup Language (SBML), Comma Separated Values (CSV) or Microsoft Excel (XLS) format. MetaQuant is written in Java and is available under an open source license. Precompiled packages for the installation on Windows or Linux operating systems are freely available for download. The source code as well as the installation packages are available at http://bioinformatics.org/metaquant
ClusCo: clustering and comparison of protein models.

PubMed

Jamroz, Michal; Kolinski, Andrzej

2013-02-22

The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. The main reason for the development of the ClusCo software was to create a high-throughput tool for all-versus-all comparison, because calculating the similarity matrix is one of the bottlenecks in the protein modeling pipeline. ClusCo is fast and easy-to-use software for high-throughput comparison of protein models with different similarity measures (cRMSD, dRMSD, GDT_TS, TM-Score, MaxSub, Contact Map Overlap) and clustering of the comparison results with standard methods: K-means Clustering or Hierarchical Agglomerative Clustering. The application was highly optimized and written in C/C++, including the code for parallel execution on CPU and GPU, which resulted in a significant speedup over similar clustering and scoring computation programs.
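To make the all-versus-all bottleneck concrete, here is a minimal sketch of one of the listed measures, dRMSD, and of the similarity matrix built from it. It is not ClusCo's implementation (which is C/C++ with CPU and GPU parallelism); it assumes the common definition of dRMSD as the root-mean-square difference between corresponding intra-model distances, and the function names are hypothetical.

```python
import math
from itertools import combinations


def drmsd(a, b):
    """Distance RMSD between two models given as equal-length lists of (x, y, z)."""
    pairs = list(combinations(range(len(a)), 2))
    total = 0.0
    for i, j in pairs:
        da = math.dist(a[i], a[j])   # distance between atoms i, j in model a
        db = math.dist(b[i], b[j])   # the same pair in model b
        total += (da - db) ** 2
    return math.sqrt(total / len(pairs))


def similarity_matrix(models):
    """All-versus-all dRMSD; symmetric with a zero diagonal."""
    n = len(models)
    m = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        m[i][j] = m[j][i] = drmsd(models[i], models[j])
    return m
```

Each matrix entry depends only on its own pair of models, so the n(n-1)/2 comparisons are independent of one another, which is what makes this step a natural target for parallel execution.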
Kip, Version 1.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Staley, Martin

2017-09-20

This high-performance ray tracing library provides very fast rendering; compact code; type flexibility through C++ "generic programming" techniques; and ease of use via an application programming interface (API) that operates independently of any GUI, on-screen display, or other enclosing application. Kip supports constructive solid geometry (CSG) models based on a wide variety of built-in shapes and logical operators, and also allows for user-defined shapes and operators to be provided. Additional features include basic texturing; input/output of models using a simple human-readable file format and with full error checking and detailed diagnostics; and support for shared data parallelism. Kip is written in pure, ANSI standard C++; is entirely platform independent; and is very easy to use. As a C++ "header only" library, it requires no build system, configuration or installation scripts, wizards, non-C++ preprocessing, makefiles, shell scripts, or external libraries.
Numerical verification of bounce-harmonic resonances in neoclassical toroidal viscosity for tokamaks.

PubMed

Kim, Kimin; Park, Jong-Kyu; Boozer, Allen H

2013-05-03

This Letter presents the first numerical verification for the bounce-harmonic (BH) resonance phenomena of the neoclassical transport in a tokamak perturbed by nonaxisymmetric magnetic fields. The BH resonances were predicted by analytic theories of neoclassical toroidal viscosity (NTV), as the parallel and perpendicular drift motions can be resonant and result in a great enhancement of the radial momentum transport. A new drift-kinetic δf guiding-center particle code, POCA, clearly verified that the perpendicular drift motions can reduce the transport by phase-mixing, but in the BH resonances the motions can form closed orbits and particles drift out radially fast. The POCA calculations of the resulting NTV torque are largely consistent with analytic calculations, and show that the BH resonances can easily dominate the NTV torque when a plasma rotates in the perturbed tokamak and are therefore critical physics for predicting the rotation and stability in the International Thermonuclear Experimental Reactor.
Fast QC-LDPC code for free space optical communication

NASA Astrophysics Data System (ADS)

Wang, Jin; Zhang, Qi; Udeh, Chinonso Paschal; Wu, Rangzhong

2017-02-01

Free Space Optical (FSO) Communication systems use the atmosphere as a propagation medium. Hence the atmospheric turbulence effects lead to multiplicative noise related to signal intensity. In order to suppress the signal fading induced by multiplicative noise, we propose a fast Quasi-Cyclic (QC) Low-Density Parity-Check (LDPC) code for FSO Communication systems. As a linear block code based on a sparse matrix, QC-LDPC performs extremely close to the Shannon limit. Currently, studies on LDPC codes in FSO communications are mainly focused on the Gaussian channel and the Rayleigh channel. In this study, the LDPC code is designed over the atmospheric turbulence channel, which is neither a Gaussian channel nor a Rayleigh channel and is closer to the practical situation. Based on the characteristics of the atmospheric channel, which is modeled with the log-normal distribution and the K-distribution, we designed a special QC-LDPC code and deduced the log-likelihood ratio (LLR). An irregular QC-LDPC code for fast coding, with variable rates, is proposed in this paper. The proposed code achieves the excellent performance of LDPC codes, with high efficiency at low rates, stability at high rates, and fewer iterations. The results of belief propagation (BP) decoding show that the bit error rate (BER) is clearly reduced as the signal-to-noise ratio (SNR) increases. Therefore, LDPC channel coding can effectively improve the performance of FSO. At the same time, the BER after decoding keeps decreasing as the SNR increases, with no error floor where the decline in error rate slows down.
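The quasi-cyclic structure named in this abstract can be illustrated with a minimal sketch. It expands a small base matrix of shift exponents into a binary parity-check matrix built from cyclically shifted identity blocks, which is the general QC-LDPC construction. The base matrix, the lifting size and the function names below are hypothetical and are not taken from the paper, whose actual code design is not given in the abstract.

```python
def circulant_permutation(z, shift):
    """Return the z-by-z identity matrix with each row's 1 moved `shift` columns to the right (cyclically)."""
    return [[1 if col == (row + shift) % z else 0 for col in range(z)]
            for row in range(z)]


def qc_ldpc_parity_check(shifts, z):
    """Expand a base matrix of shift exponents into a binary parity-check matrix.

    An entry of -1 stands for the z-by-z all-zero block; any entry s >= 0
    stands for the identity matrix cyclically shifted by s.
    """
    h = []
    for base_row in shifts:
        blocks = [circulant_permutation(z, s) if s >= 0
                  else [[0] * z for _ in range(z)]
                  for s in base_row]
        for r in range(z):
            # Concatenate row r of every block to form one full row of H.
            h.append([bit for block in blocks for bit in block[r]])
    return h


# Hypothetical 2x4 base matrix with lifting size 5 (illustrative only).
BASE_SHIFTS = [[0, 1, 2, -1],
               [3, -1, 0, 4]]
H = qc_ldpc_parity_check(BASE_SHIFTS, z=5)

# Column weights differ between block columns, so this small code is irregular.
COLUMN_WEIGHTS = [sum(row[c] for row in H) for c in range(len(H[0]))]
```

Because every block is either zero or a shifted identity, the whole matrix is described by the small table of shifts, which is what makes encoding and storage cheap for this family of codes.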
Design of neurophysiologically motivated structures of time-pulse coded neurons

NASA Astrophysics Data System (ADS)

Krasilenko, Vladimir G.; Nikolsky, Alexander I.; Lazarev, Alexander A.; Lobodzinska, Raisa F.

2009-04-01

A common methodology for the biologically motivated design of sensor processing systems with parallel input, picture-operand processing and time-pulse coding is described in this paper. The advantages of such coding for creating parallel programmed 2D-array structures for next-generation digital computers, which require untraditional numerical systems for processing analog, digital, hybrid and neuro-fuzzy operands, are shown. Simulation results for the optoelectronic time-pulse coded intelligent neural elements (OETPCINE) and implementation results for a wide set of neuro-fuzzy logic operations are considered. The simulation results confirm the engineering advantages, intellectuality and circuit flexibility of OETPCINE for creating advanced 2D structures. The developed equivalentor-nonequivalentor neural element has a power consumption of 10 mW and a processing time of about 10 to 100 µs.
Coding for reliable satellite communications

NASA Technical Reports Server (NTRS)

Lin, S.

1984-01-01

Several error control coding techniques for reliable satellite communications were investigated to find algorithms for fast decoding of Reed-Solomon codes in terms of dual basis. The decoding of the (255,223) Reed-Solomon code, which is used as the outer code in the concatenated TDRSS decoder, was of particular concern.
Hypercube matrix computation task

NASA Technical Reports Server (NTRS)

Calalo, Ruel H.; Imbriale, William A.; Jacobi, Nathan; Liewer, Paulett C.; Lockhart, Thomas G.; Lyzenga, Gregory A.; Lyons, James R.; Manshadi, Farzin; Patterson, Jean E.

1988-01-01

A major objective of the Hypercube Matrix Computation effort at the Jet Propulsion Laboratory (JPL) is to investigate the applicability of a parallel computing architecture to the solution of large-scale electromagnetic scattering problems. Three scattering analysis codes are being implemented and assessed on a JPL/California Institute of Technology (Caltech) Mark 3 Hypercube. The codes, which utilize different underlying algorithms, give a means of evaluating the general applicability of this parallel architecture. The three analysis codes being implemented are a frequency domain method of moments code, a time domain finite difference code, and a frequency domain finite elements code. These analysis capabilities are being integrated into an electromagnetics interactive analysis workstation which can serve as a design tool for the construction of antennas and other radiating or scattering structures. The first two years of work on the Hypercube Matrix Computation effort is summarized. It includes both new developments and results as well as work previously reported in the Hypercube Matrix Computation Task: Final Report for 1986 to 1987 (JPL Publication 87-18).
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

NASA Technical Reports Server (NTRS)

Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.

2012-01-01

In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.
Implementation of a flexible and scalable particle-in-cell method for massively parallel computations in the mantle convection code ASPECT

NASA Astrophysics Data System (ADS)

Gassmöller, Rene; Bangerth, Wolfgang

2016-04-01

Particle-in-cell methods have a long history and many applications in geodynamic modelling of mantle convection, lithospheric deformation and crustal dynamics. They are primarily used to track material information, the strain a material has undergone, the pressure-temperature history a certain material region has experienced, or the amount of volatiles or partial melt present in a region. However, their efficient parallel implementation - in particular combined with adaptive finite-element meshes - is complicated due to the complex communication patterns and frequent reassignment of particles to cells. Consequently, many current scientific software packages accomplish this efficient implementation by specifically designing particle methods for a single purpose, like the advection of scalar material properties that do not evolve over time (e.g., for chemical heterogeneities). Design choices for particle integration, data storage, and parallel communication are then optimized for this single purpose, making the code relatively rigid to changing requirements. Here, we present the implementation of a flexible, scalable and efficient particle-in-cell method for massively parallel finite-element codes with adaptively changing meshes. Using a modular plugin structure, we allow maximum flexibility of the generation of particles, the carried tracer properties, the advection and output algorithms, and the projection of properties to the finite-element mesh. We present scaling tests ranging up to tens of thousands of cores and tens of billions of particles. Additionally, we discuss efficient load-balancing strategies for particles in adaptive meshes with their strengths and weaknesses, local particle-transfer between parallel subdomains utilizing existing communication patterns from the finite element mesh, and the use of established parallel output algorithms like the HDF5 library. 
Finally, we show some relevant particle application cases, compare our implementation to a modern advection-field approach, and demonstrate under which conditions which method is more efficient. We implemented the presented methods in ASPECT (aspect.dealii.org), a freely available open-source community code for geodynamic simulations. The structure of the particle code is highly modular, and segregated from the PDE solver, and can thus be easily transferred to other programs, or adapted for various application cases.
SKIRT: Hybrid parallelization of radiative transfer simulations

NASA Astrophysics Data System (ADS)

Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

2017-07-01

We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
Development of a Simulink Library for the Design, Testing and Simulation of Software Defined GPS Radios. With Application to the Development of Parallel Correlator Structures

DTIC Science & Technology

2014-05-01

function Value = Select_Element(Index,Signal) %#eml Value = Signal(Index); Code Listing 1 Code for Selector Block 4.3...code for the Simulink function shiftedSignal = fcn(signal,Shift) %#eml shiftedSignal = circshift(signal,Shift); Code Listing 2 Code for CircShift
Self-Scheduling Parallel Methods for Multiple Serial Codes with Application to WOPWOP

NASA Technical Reports Server (NTRS)

Long, Lyle N.; Brentner, Kenneth S.

2000-01-01

This paper presents a scheme for efficiently running a large number of serial jobs on parallel computers. Two examples are given of computer programs that run relatively quickly, but often they must be run numerous times to obtain all the results needed. It is very common in science and engineering to have codes that are not massive computing challenges in themselves, but due to the number of instances that must be run, they do become large-scale computing problems. The two examples given here represent common problems in aerospace engineering: aerodynamic panel methods and aeroacoustic integral methods. The first example simply solves many systems of linear equations. This is representative of an aerodynamic panel code where someone would like to solve for numerous angles of attack. The complete code for this first example is included in the appendix so that it can be readily used by others as a template. The second example is an aeroacoustics code (WOPWOP) that solves the Ffowcs Williams Hawkings equation to predict the far-field sound due to rotating blades. In this example, one quite often needs to compute the sound at numerous observer locations, hence parallelization is utilized to automate the noise computation for a large number of observers.
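The self-scheduling idea for the first example (many independent linear solves) can be sketched as follows. This is not the code from the paper's appendix: it is a minimal illustration in which worker threads inside one process stand in for processors on a parallel machine, and the matrix, right-hand sides and function names are hypothetical. Each worker takes the next unclaimed job as soon as it is free, so faster workers naturally do more jobs.

```python
import queue
import threading


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def run_self_scheduled(jobs, num_workers):
    """Run independent serial jobs; each worker pulls the next job until none remain."""
    pending = queue.Queue()
    for job_id, job in enumerate(jobs):
        pending.put((job_id, job))
    results = [None] * len(jobs)

    def worker():
        while True:
            try:
                job_id, (a, b) = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            results[job_id] = solve_linear_system(a, b)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical workload: one matrix, a different right-hand side per case.
MATRIX = [[2.0, 1.0], [1.0, 3.0]]
JOBS = [(MATRIX, [float(k), 1.0]) for k in range(8)]
SOLUTIONS = run_self_scheduled(JOBS, num_workers=3)
```

In CPython, threads will not speed up CPU-bound solves like these; the sketch only shows the scheduling pattern. On a real parallel machine the same pattern would typically be expressed with separate processes or message passing.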
Breakdown of Spatial Parallel Coding in Children's Drawing

ERIC Educational Resources Information Center

De Bruyn, Bart; Davis, Alyson

2005-01-01

When drawing real scenes or copying simple geometric figures young children are highly sensitive to parallel cues and use them effectively. However, this sensitivity can break down in surprisingly simple tasks such as copying a single line where robust directional errors occur despite the presence of parallel cues. Before we can conclude that this…
Epoch of Reionization : An Investigation of the Semi-Analytic 21CMMC Code

NASA Astrophysics Data System (ADS)

Miller, Michelle

2018-01-01

After the Big Bang the universe was filled with neutral hydrogen that began to cool and collapse into the first structures. These first stars and galaxies began to emit radiation that eventually ionized all of the neutral hydrogen in the universe. 21CMMC is a semi-numerical code that takes simulated boxes of this ionized universe from another code called 21cmFAST. Mock measurements are taken from the simulated boxes in 21cmFAST. Those measurements are thrown into 21CMMC and help us determine three major parameters of this simulated universe: virial temperature, mean free path, and ionization efficiency. My project tests the robustness of 21CMMC on universe simulations other than 21cmFAST to see whether 21CMMC can properly reconstruct early universe parameters given a mock “measurement” in the form of power spectra. We determine that while two of the three EoR parameters (Virial Temperature and Efficiency) have some reconstructability, the mean free path parameter in the code is the least robust. This requires development of the 21CMMC code.
Bit-parallel arithmetic in a massively-parallel associative processor

NASA Technical Reports Server (NTRS)

Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

1992-01-01

A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
Modular time division multiplexer: Efficient simultaneous characterization of fast and slow transients in multiple samples

NASA Astrophysics Data System (ADS)

Kim, Stephan D.; Luo, Jiajun; Buchholz, D. Bruce; Chang, R. P. H.; Grayson, M.

2016-09-01

A modular time division multiplexer (MTDM) device is introduced to enable parallel measurement of multiple samples with both fast and slow decay transients spanning from millisecond to month-long time scales. This is achieved by dedicating a single high-speed measurement instrument for rapid data collection at the start of a transient, and by multiplexing a second low-speed measurement instrument for slow data collection of several samples in parallel for the later transients. The MTDM is a high-level design concept that can in principle measure an arbitrary number of samples, and the low cost implementation here allows up to 16 samples to be measured in parallel over several months, reducing the total ensemble measurement duration and equipment usage by as much as an order of magnitude without sacrificing fidelity. The MTDM was successfully demonstrated by simultaneously measuring the photoconductivity of three amorphous indium-gallium-zinc-oxide thin films with 20 ms data resolution for fast transients and an uninterrupted parallel run time of over 20 days. The MTDM has potential applications in many areas of research that manifest response times spanning many orders of magnitude, such as photovoltaics, rechargeable batteries, amorphous semiconductors such as silicon and amorphous indium-gallium-zinc-oxide.
Modular time division multiplexer: Efficient simultaneous characterization of fast and slow transients in multiple samples.

PubMed

Kim, Stephan D; Luo, Jiajun; Buchholz, D Bruce; Chang, R P H; Grayson, M

2016-09-01

A modular time division multiplexer (MTDM) device is introduced to enable parallel measurement of multiple samples with both fast and slow decay transients spanning from millisecond to month-long time scales. This is achieved by dedicating a single high-speed measurement instrument for rapid data collection at the start of a transient, and by multiplexing a second low-speed measurement instrument for slow data collection of several samples in parallel for the later transients. The MTDM is a high-level design concept that can in principle measure an arbitrary number of samples, and the low cost implementation here allows up to 16 samples to be measured in parallel over several months, reducing the total ensemble measurement duration and equipment usage by as much as an order of magnitude without sacrificing fidelity. The MTDM was successfully demonstrated by simultaneously measuring the photoconductivity of three amorphous indium-gallium-zinc-oxide thin films with 20 ms data resolution for fast transients and an uninterrupted parallel run time of over 20 days. The MTDM has potential applications in many areas of research that manifest response times spanning many orders of magnitude, such as photovoltaics, rechargeable batteries, amorphous semiconductors such as silicon and amorphous indium-gallium-zinc-oxide.

Proceedings of the Interservice/Industry Training Systems Conference (9th), Held at Washington, DC, on 30 November - 2 December 1987

DTIC Science & Technology

1987-12-01

requires much more data, but holds fast to the idea that the FV approach, or some other model, is critical if the job analysis process is to have its...Ada compiled code executes twice as fast as Microsoft’s Fortran compiled code. This conclusion is at variance with the results obtained from...finish is not so important. Hence, if a design methodology produces code that will not execute fast enough on processors suitable for flight
TU-AB-BRC-12: Optimized Parallel MonteCarlo Dose Calculations for Secondary MU Checks

DOE Office of Scientific and Technical Information (OSTI.GOV)

French, S; Nazareth, D; Bellor, M

Purpose: Secondary MU checks are an important tool used during a physics review of a treatment plan. Commercial software packages offer varying degrees of theoretical dose calculation accuracy, depending on the modality involved. Dose calculations of VMAT plans are especially prone to error due to the large approximations involved. Monte Carlo (MC) methods are not commonly used due to their long run times. We investigated two methods to increase the computational efficiency of MC dose simulations with the BEAMnrc code. Distributed computing resources, along with optimized code compilation, will allow for accurate and efficient VMAT dose calculations. Methods: The BEAMnrc package was installed on a high performance computing cluster accessible to our clinic. MATLAB and PYTHON scripts were developed to convert a clinical VMAT DICOM plan into BEAMnrc input files. The BEAMnrc installation was optimized by running the VMAT simulations through profiling tools which indicated the behavior of the constituent routines in the code, e.g. the bremsstrahlung splitting routine, and the specified random number generator. This information aided in determining the most efficient compiling parallel configuration for the specific CPUs available on our cluster, resulting in the fastest VMAT simulation times. Our method was evaluated with calculations involving 10^8 – 10^9 particle histories which are sufficient to verify patient dose using VMAT. Results: Parallelization allowed the calculation of patient dose on the order of 10 – 15 hours with 100 parallel jobs. Due to the compiler optimization process, further speed increases of 23% were achieved when compared with the open-source compiler BEAMnrc packages. Conclusion: Analysis of the BEAMnrc code allowed us to optimize the compiler configuration for VMAT dose calculations. 
In future work, the optimized MC code, in conjunction with the parallel processing capabilities of BEAMnrc, will be applied to provide accurate and efficient secondary MU checks.
Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography

NASA Astrophysics Data System (ADS)

Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo

2017-08-01

We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150-fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on the user's parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. 
Nature of problem: Speed-up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1.8 s for one B-scan (150 × faster in comparison to the CPU data processing time)
Large-scale three-dimensional phase-field simulations for phase coarsening at ultrahigh volume fraction on high-performance architectures

NASA Astrophysics Data System (ADS)

Yan, Hui; Wang, K. G.; Jones, Jim E.

2016-06-01

A parallel algorithm for large-scale three-dimensional phase-field simulations of phase coarsening is developed and implemented on high-performance architectures. From the large-scale simulations, a new kinetics in phase coarsening in the region of ultrahigh volume fraction is found. The parallel implementation is capable of harnessing the greater computer power available from high-performance architectures. The parallelized code enables an increase in three-dimensional simulation system size up to a 512^3 grid cube. Through the parallelized code, practical runtime can be achieved for three-dimensional large-scale simulations, and the statistical significance of the results from these high resolution parallel simulations is greatly improved over that obtainable from serial simulations. A detailed performance analysis on speed-up and scalability is presented, showing good scalability which improves with increasing problem size. In addition, a model for prediction of runtime is developed, which shows good agreement with actual run time from numerical tests.
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

NASA Technical Reports Server (NTRS)

Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele

2004-01-01

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms and discuss OpenMP implementation issues which affect the performance of multi-level parallel applications.
Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications.

PubMed

Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J

2004-09-01

We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processors shows an almost linear speedup up to 32 processors for simulating 1 × 10^8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 × 10^8 histories. For a smaller number of histories (1 × 10^8) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 × 10^8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
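The efficiency figure quoted in this abstract can be related to speedup with the standard definition, efficiency = speedup / number of processors. The short sketch below applies that definition to the one data point the abstract gives (83.9% efficiency on 24 processors); the function names are illustrative and no other timings from the paper are assumed.

```python
def parallel_efficiency(speedup, num_processors):
    """Standard definition: fraction of ideal (linear) speedup actually achieved."""
    return speedup / num_processors


def speedup_from_efficiency(efficiency, num_processors):
    """Invert the definition above to recover the speedup."""
    return efficiency * num_processors


# Athlon cluster, 1 x 10^8 histories: 83.9% efficiency with 24 processors
# corresponds to a speedup of about 20.1 over the serial run.
ATHLON_SPEEDUP = speedup_from_efficiency(0.839, 24)
```

A perfectly linear speedup on 24 processors would be 24, so the gap of roughly four is the cost the abstract attributes to interprocessor communication at this problem size.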
Calculation and benchmarking of an azimuthal pressure vessel neutron fluence distribution using the BOXER code and scraping experiments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Holzgrewe, F.; Hegedues, F.; Paratte, J.M.

1995-03-01

The light water reactor BOXER code was used to determine the fast azimuthal neutron fluence distribution at the inner surface of the reactor pressure vessel after the tenth cycle of a pressurized water reactor (PWR). Using a cross-section library in 45 groups, fixed-source calculations in transport theory and x-y geometry were carried out to determine the fast azimuthal neutron flux distribution at the inner surface of the pressure vessel for four different cycles. From these results, the fast azimuthal neutron fluence after the tenth cycle was estimated and compared with the results obtained from scraping test experiments. In these experiments, small samples of material were taken from the inner surface of the pressure vessel. The fast neutron fluence was then determined from the measured activity of the samples. The BOXER and scraping test results have maximal differences of 15%, which is very good, considering the factor of 10^3 neutron attenuation between the reactor core and the pressure vessel. To compare the BOXER results with an independent code, the 21st cycle of the PWR was also calculated with the TWODANT two-dimensional transport code, using the same group structure and cross-section library. Deviations in the fast azimuthal flux distribution were found to be <3%, which verifies the accuracy of the BOXER results.
Generating performance portable geoscientific simulation code with Firedrake (Invited)

NASA Astrophysics Data System (ADS)

Ham, D. A.; Bercea, G.; Cotter, C. J.; Kelly, P. H.; Loriant, N.; Luporini, F.; McRae, A. T.; Mitchell, L.; Rathgeber, F.

2013-12-01

This presentation will demonstrate how a change in simulation programming paradigm can be exploited to deliver sophisticated simulation capability which is far easier to programme than are conventional models, is capable of exploiting different emerging parallel hardware, and is tailored to the specific needs of geoscientific simulation. Geoscientific simulation represents a grand challenge computational task: many of the largest computers in the world are tasked with this field, and the requirements of resolution and complexity of scientists in this field are far from being sated. However, single thread performance has stalled, even sometimes decreased, over the last decade, and has been replaced by ever more parallel systems: both as conventional multicore CPUs and in the emerging world of accelerators. At the same time, the needs of scientists to couple ever-more complex dynamics and parametrisations into their models makes the model development task vastly more complex. The conventional approach of writing code in low level languages such as Fortran or C/C++ and then hand-coding parallelism for different platforms by adding library calls and directives forces the intermingling of the numerical code with its implementation. This results in an almost impossible set of skill requirements for developers, who must simultaneously be domain science experts, numericists, software engineers and parallelisation specialists. Even more critically, it requires code to be essentially rewritten for each emerging hardware platform. Since new platforms are emerging constantly, and since code owners do not usually control the procurement of the supercomputers on which they must run, this represents an unsustainable development load. The Firedrake system, conversely, offers the developer the opportunity to write PDE discretisations in the high-level mathematical language UFL from the FEniCS project (http://fenicsproject.org). 
Non-PDE model components, such as parametrisations, can be written as short C kernels operating locally on the underlying mesh, with no explicit parallelism. The executable code is then generated in C, CUDA or OpenCL and executed in parallel on the target architecture. The system also offers features of special relevance to the geosciences. In particular, the large scale separation between the vertical and horizontal directions in many geoscientific processes can be exploited to offer the flexibility of unstructured meshes in the horizontal direction, without the performance penalty usually associated with those methods.
Scalable descriptive and correlative statistics with Titan.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Thompson, David C.; Pebay, Philippe Pierre

This report summarizes the existing statistical engines in VTK/Titan and presents the parallel versions thereof which have already been implemented. The ease of use of these parallel engines is illustrated by the means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; then, this theoretical property is verified with test runs that demonstrate optimal parallel speed-up with up to 200 processors.
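One common way to make descriptive statistics scale in parallel is for each processor to compute partial moments on its own data and then merge them pairwise. The sketch below uses the well-known pairwise update formulas for count, mean and sum of squared deviations. It is written in Python rather than the C++ of the report, and whether the VTK/Titan engines use exactly these formulas is an assumption; the function names and data are hypothetical.

```python
from functools import reduce


def partial_moments(values):
    """Return (count, mean, sum of squared deviations) for one non-empty chunk."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge_moments(left, right):
    """Combine the moments of two disjoint chunks without revisiting the data."""
    n_a, mean_a, m2_a = left
    n_b, mean_b, m2_b = right
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Hypothetical data split into three chunks, as if held by three processors.
DATA = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
CHUNKS = [DATA[0:2], DATA[2:5], DATA[5:7]]

COUNT, MEAN, M2 = reduce(merge_moments, (partial_moments(c) for c in CHUNKS))
SAMPLE_VARIANCE = M2 / (COUNT - 1)
```

Only three numbers per chunk need to be communicated, regardless of chunk size, which is why this kind of reduction can keep scaling as processors are added.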
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

DOE Office of Scientific and Technical Information (OSTI.GOV)

Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.

Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neighboring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.
Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

NASA Astrophysics Data System (ADS)

Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

2017-10-01

In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.
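The two metrics discussed in this abstract, speed-up and parallelization efficiency as a function of thread count, follow the usual definitions S = T1 / Tp and E = S / p. The sketch below shows how they might be tabulated; the timing values are hypothetical placeholders, not measurements from the paper.

```python
def speedup(t_serial, t_parallel):
    """Speed-up S = T1 / Tp."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, threads):
    """Parallel efficiency E = S / p, where p is the number of threads."""
    return speedup(t_serial, t_parallel) / threads


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by thread count.
    timings = {1: 120.0, 2: 63.0, 4: 40.0, 8: 24.0}
    t1 = timings[1]
    for p in sorted(timings):
        s = speedup(t1, timings[p])
        e = efficiency(t1, timings[p], p)
        print(f"threads={p:2d}  speed-up={s:5.2f}  efficiency={e:5.2f}")
```

An efficiency that falls as threads are added usually points to serial sections, load imbalance or memory-bandwidth limits rather than to the directives themselves.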
The Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit

NASA Technical Reports Server (NTRS)

Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)

2001-01-01

The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited performance due to scalability have affected the take-up of this programming model approach. Significant progress has been made in hardware and software technologies; as a result, the performance of parallel programs with compiler directives has also improved. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich), to automatically generate OpenMP based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit on the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.
Scalable Computing of the Mesh Size Effect on Modeling Damage Mechanics in Woven Armor Composites

DTIC Science & Technology

2008-12-01

manner of a user defined material subroutine to provide overall stress increments to, the parallel LS-DYNA3D a Lagrangian explicit code used in...finite element code, as a user defined material subroutine . The ability of this subroutine to model the effect of the progressions of a select number...is added as a user defined material subroutine to parallel LS-DYNA3D. The computations of the global mesh are handled by LS-DYNA3D and are spread
Reconstruction of coded aperture images

NASA Technical Reports Server (NTRS)

Bielefeld, Michael J.; Yin, Lo I.

1987-01-01

The balanced correlation method and the Maximum Entropy Method (MEM) were implemented to reconstruct a laboratory X-ray source as imaged by a Uniformly Redundant Array (URA) system. Although the MEM method has advantages over the balanced correlation method, it is computationally time consuming because of the iterative nature of its solution. Massively Parallel Processing, with its parallel array structure, is ideally suited for such computations. These preliminary results indicate that it is possible to use the MEM method in future coded-aperture experiments with the help of the MPP.
Parallel design of JPEG-LS encoder on graphics processing units

NASA Astrophysics Data System (ADS)

Duan, Hao; Fang, Yong; Huang, Bormin

2012-01-01

With recent technical advances in graphics processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on an NVIDIA GPU using the Compute Unified Device Architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.
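Of the techniques named in this abstract, the parallel prefix sum is the easiest to show in isolation. The sketch below simulates a Hillis-Steele inclusive scan in plain Python: each pass of the `while` loop corresponds to one step in which every GPU thread would update its own element simultaneously. Using the scan to turn per-block output sizes into write offsets is an assumption about how such an encoder might use it; the abstract does not say, and `block_offsets` is a hypothetical helper.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum, simulated step by step.

    In each step, element i adds the element `offset` positions to its
    left. All reads use the previous step's array, as they would when
    every element is handled by its own thread.
    """
    out = list(values)
    offset = 1
    while offset < len(out):
        prev = out  # snapshot read by every simulated thread
        out = [
            prev[i] + prev[i - offset] if i >= offset else prev[i]
            for i in range(len(prev))
        ]
        offset *= 2
    return out


def block_offsets(block_sizes):
    """Exclusive scan: the start position of each block's output."""
    if not block_sizes:
        return []
    return [0] + inclusive_scan(block_sizes)[:-1]
```

With offsets known in advance, independently compressed blocks can be written to one contiguous output buffer without any block waiting for its predecessor to finish.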
FAST-PT: a novel algorithm to calculate convolution integrals in cosmological perturbation theory

DOE Office of Scientific and Technical Information (OSTI.GOV)

McEwen, Joseph E.; Fang, Xiao; Hirata, Christopher M.

2016-09-01

We present a novel algorithm, FAST-PT, for performing convolution or mode-coupling integrals that appear in nonlinear cosmological perturbation theory. The algorithm uses several properties of gravitational structure formation (the locality of the dark matter equations and the scale invariance of the problem) as well as Fast Fourier Transforms to describe the input power spectrum as a superposition of power laws. This yields extremely fast performance, enabling mode-coupling integral computations fast enough to embed in Monte Carlo Markov Chain parameter estimation. We describe the algorithm and demonstrate its application to calculating nonlinear corrections to the matter power spectrum, including one-loop standard perturbation theory and the renormalization group approach. We also describe our public code (in Python) to implement this algorithm. The code, along with a user manual and example implementations, is available at https://github.com/JoeMcEwen/FAST-PT.
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

PubMed Central

Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300
Targeting multiple heterogeneous hardware platforms with OpenCL

NASA Astrophysics Data System (ADS)

Fox, Paul A.; Kozacik, Stephen T.; Humphrey, John R.; Paolini, Aaron; Kuller, Aryeh; Kelmelis, Eric J.

2014-06-01

The OpenCL API allows for the abstract expression of parallel, heterogeneous computing, but hardware implementations have substantial implementation differences. The abstractions provided by the OpenCL API are often insufficiently high-level to conceal differences in hardware architecture. Additionally, implementations often do not take advantage of potential performance gains from certain features due to hardware limitations and other factors. These factors make it challenging to produce code that is portable in practice, resulting in much OpenCL code being duplicated for each hardware platform being targeted. This duplication of effort offsets the principal advantage of OpenCL: portability. The use of certain coding practices can mitigate this problem, allowing a common code base to be adapted to perform well across a wide range of hardware platforms. To this end, we explore some general practices for producing performant code that are effective across platforms. Additionally, we explore some ways of modularizing code to enable optional optimizations that take advantage of hardware-specific characteristics. The minimum requirement for portability implies avoiding the use of OpenCL features that are optional, not widely implemented, poorly implemented, or missing in major implementations. Exposing multiple levels of parallelism allows hardware to take advantage of the types of parallelism it supports, from the task level down to explicit vector operations. Static optimizations and branch elimination in device code help the platform compiler to effectively optimize programs. Modularization of some code is important to allow operations to be chosen for performance on target hardware. Optional subroutines exploiting explicit memory locality allow for different memory hierarchies to be exploited for maximum performance. 
The C preprocessor and JIT compilation using the OpenCL runtime can be used to enable some of these techniques, as well as to factor in hardware-specific optimizations as necessary.
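The preprocessor-plus-JIT approach described here can be sketched as follows. The idea is to keep one kernel source and select hardware-specific variants through `-D` definitions passed in the options string when the program is built at run time (the `options` argument of `clBuildProgram`). The kernel, the macro name `TILE` and the device profiles below are hypothetical illustrations, not taken from the paper, and the sketch only assembles the strings; it does not call an OpenCL runtime.

```python
# Kernel source shared by all platforms. TILE is a compile-time constant,
# so the device compiler can unroll the loop for the chosen value.
KERNEL_SOURCE = """
#ifndef TILE
#define TILE 1
#endif
__kernel void scale(__global float *data, const float factor, const uint n) {
    size_t base = get_global_id(0) * TILE;
    for (uint k = 0; k < TILE; ++k) {
        if (base + k < n) data[base + k] *= factor;
    }
}
"""

# Hypothetical per-device tuning choices.
DEVICE_PROFILES = {
    "cpu": {"TILE": 8},
    "gpu": {"TILE": 1, "USE_LOCAL_MEM": None},
}


def build_options(defines):
    """Turn a dict of macros into an OpenCL build-options string.

    A value of None yields a bare definition ("-D NAME"); any other
    value yields "-D NAME=value".
    """
    parts = []
    for name, value in sorted(defines.items()):
        parts.append(f"-D {name}" if value is None else f"-D {name}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    for device, defines in DEVICE_PROFILES.items():
        print(device, "->", build_options(defines))
```

Keeping the variation in build options rather than in separate source files is one way to avoid the per-platform code duplication that the abstract identifies as the main threat to portability.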
OpenSWPC: an open-source integrated parallel simulation code for modeling seismic wave propagation in 3D heterogeneous viscoelastic media

NASA Astrophysics Data System (ADS)

Maeda, Takuto; Takemura, Shunsuke; Furumura, Takashi

2017-07-01

We have developed an open-source software package, Open-source Seismic Wave Propagation Code (OpenSWPC), for parallel numerical simulations of seismic wave propagation in 3D and 2D (P-SV and SH) viscoelastic media based on the finite difference method in local-to-regional scales. This code is equipped with a frequency-independent attenuation model based on the generalized Zener body and an efficient perfectly matched layer for absorbing boundary condition. A hybrid-style programming using OpenMP and the Message Passing Interface (MPI) is adopted for efficient parallel computation. OpenSWPC has wide applicability for seismological studies and great portability, allowing excellent performance from PC clusters to supercomputers. Without modifying the code, users can conduct seismic wave propagation simulations using their own velocity structure models and the necessary source representations by specifying them in an input parameter file. The code has various modes for different types of velocity structure model input and different source representations such as single force, moment tensor and plane-wave incidence, which can easily be selected via the input parameters. Widely used binary data formats, the Network Common Data Form (NetCDF) and the Seismic Analysis Code (SAC) are adopted for the input of the heterogeneous structure model and the outputs of the simulation results, so users can easily handle the input/output datasets. All codes are written in Fortran 2003 and are available with detailed documents in a public repository.

Activities of Bay Area Research Corporation

NASA Technical Reports Server (NTRS)

2003-01-01

During the final year of this effort the HALFSHEL code was converted to work on a fast single processor workstation from its parallel configuration. This was done because NASA Ames NAS facility stopped supporting space science and we no longer had access to parallel computer time. The single processor version of HALFSHEL was upgraded to address low density cells by using a 3-D SOR solver to solve the equation $\nabla \cdot \mathbf{E} = 0$. We then upgraded the ionospheric load packages to provide a multiple species load of the ionosphere out to 1.4 Rm. With these new tools we began to perform a series of simulations to address the major topic of this research effort: determining the loss rate of O+ and O2+ from Mars. The simulations used the nominal Parker spiral field and in one case used a field perpendicular to the solar wind flow. The simulations were performed for three different solar EUV fluxes consistent with the different solar evolutionary states believed to exist before today. The 1 EUV case is the nominal flux of today. The 3 EUV flux is called Epoch 2 and has three times today's flux. The 6 EUV case is Epoch 3 and has 6 times the EUV flux of today.
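The successive over-relaxation (SOR) step mentioned in this report can be sketched on a simpler problem. The example assumes the field is written as the gradient of a potential, so that a divergence-free field leads to Laplace's equation for that potential; the report does not describe its formulation, so this is an illustration of SOR in general rather than of HALFSHEL. It uses a 2-D grid for brevity (a 3-D version adds one more index and two more neighbours), and `sor_laplace` is a hypothetical name.

```python
def sor_laplace(phi, omega=1.5, tol=1e-8, max_iter=10_000):
    """Relax the interior of a 2-D grid toward a solution of Laplace's
    equation, keeping the boundary values fixed.

    phi is a list of rows and is updated in place. Returns the number of
    sweeps performed (max_iter if the tolerance was never reached).
    """
    ny, nx = len(phi), len(phi[0])
    for sweep in range(1, max_iter + 1):
        max_change = 0.0
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                # Gauss-Seidel value: average of the four neighbours.
                gs = 0.25 * (phi[j][i - 1] + phi[j][i + 1]
                             + phi[j - 1][i] + phi[j + 1][i])
                # Over-relax: step past the Gauss-Seidel value by omega.
                new = (1.0 - omega) * phi[j][i] + omega * gs
                max_change = max(max_change, abs(new - phi[j][i]))
                phi[j][i] = new
        if max_change < tol:
            return sweep
    return max_iter
```

The relaxation factor `omega` must lie strictly between 0 and 2 for convergence; `omega = 1` reduces the method to plain Gauss-Seidel.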
Quantum transport and nanoplasmonics with carbon nanorings - using HPC in computational nanoscience

NASA Astrophysics Data System (ADS)

Jack, Mark A.

2011-10-01

The central theme of this talk is the theoretical study of toroidal carbon nanostructures as a new form of metamaterial. The interference of ring-generated electromagnetic radiation in a regular array of nanorings driven by an incoming polarized wave front may lead to fascinating new optoelectronics applications. The tight-binding method is used to model charge transport in a carbon nanotorus: all transport observables can be derived from the Green's function of the device region in a non-equilibrium Green's function algorithm. We have calculated density-of-states D(E) and transmissivities T(E) between two metallic leads under a small voltage bias. Electron-phonon coupling is included for low-energy phonon modes of armchair and zigzag nanorings with atomic displacements determined by a collaborator's finite-element based code. A numerically fast and stable algorithm has been developed via parallel linear algebra matrix routines (PETSc) with MPI parallelism to reach significant speed-up. Production runs are planned on the NSF XSEDE network. This project was supported in part by a 2010 NSF TeraGrid Fellowship and the Sunshine State Education and Research Computing Alliance (SSERCA). Two summer students were supported as 2010 and 2011 NCSI/Shodor Petascale Computing undergraduate interns. In collaboration with Leon W. Durivage, Adam Byrd, and Mario Encinosa.
Xyce Parallel Electronic Simulator : users' guide, version 2.0.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoekstra, Robert John; Waters, Lon J.; Rankin, Eric Lamont

2004-06-01

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator capable of simulating electrical circuits at a variety of abstraction levels. Primarily, Xyce has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas:
- Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers.
- Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques.
- Device models which are specifically tailored to meet Sandia's needs, including many radiation-aware devices.
- A client-server or multi-tiered operating model wherein the numerical kernel can operate independently of the graphical user interface (GUI).
- Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future.
Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. One feature required by designers is the ability to add device models, many specific to the needs of Sandia, to the code. To this end, the device package in the Xyce Parallel Electronic Simulator is designed to support a variety of device model inputs. These input formats include standard analytical models, behavioral models, look-up tables, and mesh-level PDE device models. Combined with this flexible interface is an architectural design that greatly simplifies the addition of circuit models. One of the most important features of Xyce is in providing a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia now has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods) research and development can be performed. Ultimately, these capabilities are migrated to end users.
Methodes iteratives paralleles: Applications en neutronique et en mecanique des fluides

NASA Astrophysics Data System (ADS)

Qaddouri, Abdessamad

In this thesis, parallel computing is applied successively to neutronics and to fluid mechanics. In each of these two applications, iterative methods are used to solve the system of algebraic equations resulting from the discretization of the equations of the physical problem. In the neutronics problem, the computation of the collision probability (CP) matrices and a multigroup iterative scheme using an inverse power method are parallelized. In the fluid mechanics problem, a finite element code using a preconditioned GMRES-type iterative algorithm is parallelized. This thesis is presented as six articles followed by a conclusion. The first five articles deal with the neutronics applications and trace the evolution of our work in this field. This evolution proceeds through a parallel computation of the CP matrices and a parallel multigroup algorithm tested on a one-dimensional problem (article 1), then through two parallel algorithms, one multiregion and the other multigroup, tested on two-dimensional problems (articles 2-3). These first two stages are followed by the application of two acceleration techniques, neutron rebalancing and residual minimization, to the two parallel algorithms (article 4). Finally, the multigroup algorithm and the parallel computation of the CP matrices were implemented in the production code DRAGON, where the tests are more realistic and can be three-dimensional (article 5). The sixth article (article 6), devoted to the fluid mechanics application, deals with the parallelization of the finite element code FES, in which the graph partitioner METIS and the PSPARSLIB library are used.
Nebo: An efficient, parallel, and portable domain-specific language for numerically solving partial differential equations

DOE PAGES

Earl, Christopher; Might, Matthew; Bagusetty, Abhishek; ...

2016-01-26

This study presents Nebo, a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena on multiple architectures. Application programmers use Nebo to write code that appears sequential but can be run in parallel, without editing the code. Currently Nebo supports single-thread execution, multi-thread execution, and many-core (GPU-based) execution. With single-thread execution, Nebo performs on par with code written by domain experts. With multi-thread execution, Nebo can linearly scale (with roughly 90% efficiency) up to 12 cores, compared to its single-thread execution. Moreover, Nebo’s many-core execution can be over 140x faster than its single-thread execution.
Performance of the OVERFLOW-MLP and LAURA-MLP CFD Codes on the NASA Ames 512 CPU Origin System

NASA Technical Reports Server (NTRS)

Taft, James R.

2000-01-01

The shared memory Multi-Level Parallelism (MLP) technique, developed last year at NASA Ames has been very successful in dramatically improving the performance of important NASA CFD codes. This new and very simple parallel programming technique was first inserted into the OVERFLOW production CFD code in FY 1998. The OVERFLOW-MLP code's parallel performance scaled linearly to 256 CPUs on the NASA Ames 256 CPU Origin 2000 system (steger). Overall performance exceeded 20.1 GFLOP/s, or about 4.5x the performance of a dedicated 16 CPU C90 system. All of this was achieved without any major modification to the original vector based code. The OVERFLOW-MLP code is now in production on the in-house Origin systems as well as being used offsite at commercial aerospace companies. Partially as a result of this work, NASA Ames has purchased a new 512 CPU Origin 2000 system to further test the limits of parallel performance for NASA codes of interest. This paper presents the performance obtained from the latest optimization efforts on this machine for the LAURA-MLP and OVERFLOW-MLP codes. The Langley Aerothermodynamics Upwind Relaxation Algorithm (LAURA) code is a key simulation tool in the development of the next generation shuttle, interplanetary reentry vehicles, and nearly all "X" plane development. This code sustains about 4-5 GFLOP/s on a dedicated 16 CPU C90. At this rate, expected workloads would require over 100 C90 CPU years of computing over the next few calendar years. It is not feasible to expect that this would be affordable or available to the user community. Dramatic performance gains on cheaper systems are needed. This code is expected to be perhaps the largest consumer of NASA Ames compute cycles per run in the coming year. The OVERFLOW CFD code is extensively used in the government and commercial aerospace communities to evaluate new aircraft designs. 
It is one of the largest consumers of NASA supercomputing cycles and large simulations of highly resolved full aircraft are routinely undertaken. Typical large problems might require 100s of Cray C90 CPU hours to complete. The dramatic performance gains with the 256 CPU steger system are exciting. Obtaining results in hours instead of months is revolutionizing the way in which aircraft manufacturers are looking at future aircraft simulation work. Figure 2 below is a current state of the art plot of OVERFLOW-MLP performance on the 512 CPU Lomax system. As can be seen, the chart indicates that OVERFLOW-MLP continues to scale linearly with CPU count up to 512 CPUs on a large 35 million point full aircraft RANS simulation. At this point performance is such that a fully converged simulation of 2500 time steps is completed in less than 2 hours of elapsed time. Further work over the next few weeks will improve the performance of this code even further. The LAURA code has been converted to the MLP format as well. This code is currently being optimized for the 512 CPU system. Performance statistics indicate that the goal of 100 GFLOP/s will be achieved by year's end. This amounts to 20x the 16 CPU C90 result and strongly demonstrates the viability of the new parallel systems rapidly solving very large simulations in a production environment.
Development of Parallel Code for the Alaska Tsunami Forecast Model

NASA Astrophysics Data System (ADS)

Bahng, B.; Knight, W. R.; Whitmore, P.

2014-12-01

The Alaska Tsunami Forecast Model (ATFM) is a numerical model used to forecast propagation and inundation of tsunamis generated by earthquakes and other means in both the Pacific and Atlantic Oceans. At the U.S. National Tsunami Warning Center (NTWC), the model is mainly used in a pre-computed fashion. That is, results for hundreds of hypothetical events are computed before alerts, and are accessed and calibrated with observations during tsunamis to immediately produce forecasts. ATFM uses the non-linear, depth-averaged, shallow-water equations of motion with multiply nested grids in two-way communications between domains of each parent-child pair as waves get closer to coastal waters. Even with the pre-computation the task becomes non-trivial as sub-grid resolution gets finer. Currently, the finest resolution Digital Elevation Models (DEM) used by ATFM are 1/3 arc-seconds. With a serial code, large or multiple areas of very high resolution can produce run-times that are unrealistic even in a pre-computed approach. One way to increase the model performance is code parallelization used in conjunction with a multi-processor computing environment. NTWC developers have undertaken an ATFM code-parallelization effort to streamline the creation of the pre-computed database of results with the long term aim of tsunami forecasts from source to high resolution shoreline grids in real time. Parallelization will also permit timely regeneration of the forecast model database with new DEMs; and, will make possible future inclusion of new physics such as the non-hydrostatic treatment of tsunami propagation. The purpose of our presentation is to elaborate on the parallelization approach and to show the compute speed increase on various multi-processor systems.
VizieR Online Data Catalog: ynogkm: code for calculating time-like geodesics (Yang+, 2014)

NASA Astrophysics Data System (ADS)

Yang, X.-L.; Wang, J.-C.

2013-11-01

Here we present the source file for a new public code named ynogkm, aimed at the fast calculation of time-like geodesics in a Kerr-Newman spacetime. In the code, the four Boyer-Lindquist coordinates and proper time are expressed semi-analytically as functions of a parameter p, i.e., r(p), μ(p), φ(p), t(p), and σ(p), by using Weierstrass' and Jacobi's elliptic functions and integrals. All of the elliptic integrals are computed by Carlson's elliptic integral method, which guarantees the speed of the code. The source Fortran file ynogkm.f90 contains four modules: constants, rootfind, ellfunction, and blcoordinates. (3 data files).
A Wideband Fast Multipole Method for the two-dimensional complex Helmholtz equation

NASA Astrophysics Data System (ADS)

Cho, Min Hyung; Cai, Wei

2010-12-01

A Wideband Fast Multipole Method (FMM) for the 2D Helmholtz equation is presented. It can evaluate the interactions between N particles governed by the fundamental solution of 2D complex Helmholtz equation in a fast manner for a wide range of complex wave number k, which was not easy with the original FMM due to the instability of the diagonalized conversion operator. This paper includes the description of theoretical backgrounds, the FMM algorithm, software structures, and some test runs.

Program summary
Program title: 2D-WFMM
Catalogue identifier: AEHI_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEHI_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 4636
No. of bytes in distributed program, including test data, etc.: 82 582
Distribution format: tar.gz
Programming language: C
Computer: Any
Operating system: Any operating system with gcc version 4.2 or newer
Has the code been vectorized or parallelized?: Multi-core processors with shared memory
RAM: Depending on the number of particles N and the wave number k
Classification: 4.8, 4.12
External routines: OpenMP (http://openmp.org/wp/)
Nature of problem: Evaluate interaction between N particles governed by the fundamental solution of 2D Helmholtz equation with complex k.
Solution method: Multilevel Fast Multipole Algorithm in a hierarchical quad-tree structure with cutoff level which combines low frequency method and high frequency method.
Running time: Depending on the number of particles N, wave number k, and number of cores in CPU. CPU time increases as N log N.
The EMCC / DARPA Massively Parallel Electromagnetic Scattering Project

NASA Technical Reports Server (NTRS)

Woo, Alex C.; Hill, Kueichien C.

1996-01-01

The Electromagnetic Code Consortium (EMCC) was sponsored by the Advanced Research Projects Agency (ARPA) to demonstrate the effectiveness of massively parallel computing in large scale radar signature predictions. The EMCC/ARPA project consisted of three parts.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80

NASA Astrophysics Data System (ADS)

Kamat, Manohar P.; Watson, Brian C.

1992-02-01

The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern day computers with a single processing unit was estimated at 3 billion floating point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization, as on the Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built up from flat thin shell elements.
Tutorial: Parallel Computing of Simulation Models for Risk Analysis.

PubMed

Reilly, Allison C; Staid, Andrea; Gao, Michael; Guikema, Seth D

2016-10-01

Simulation models are widely used in risk analysis to study the effects of uncertainties on outcomes of interest in complex problems. Often, these models are computationally complex and time consuming to run. This latter point may be at odds with time-sensitive evaluations or may limit the number of parameters that are considered. In this article, we give an introductory tutorial focused on parallelizing simulation code to better leverage modern computing hardware, enabling risk analysts to better utilize simulation-based methods for quantifying uncertainty in practice. This article is aimed primarily at risk analysts who use simulation methods but do not yet utilize parallelization to decrease the computational burden of these models. The discussion is focused on conceptual aspects of embarrassingly parallel computer code and software considerations. Two complementary examples are shown using the languages MATLAB and R. A brief discussion of hardware considerations is located in the Appendix. © 2016 Society for Risk Analysis.
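The "embarrassingly parallel" pattern this tutorial discusses can be sketched in a few lines. The example below is an illustration only and is not taken from the article, which uses MATLAB and R; Python is swapped in here, and the toy model, the function names `simulate_once` and `run_replications`, and the seed scheme are all assumptions. The point it shows is that each replication owns its random number generator and shares no state, so the same results come back whether the replications run serially or through a worker pool.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_once(seed, n_draws=10_000):
    """One independent replication of a toy model (hypothetical example).

    Estimates P(X + Y > 1.5) for independent X, Y ~ Uniform(0, 1).
    """
    rng = random.Random(seed)  # private generator: no state shared across tasks
    hits = sum(1 for _ in range(n_draws) if rng.random() + rng.random() > 1.5)
    return hits / n_draws


def run_replications(seeds, mapper=map):
    """Apply simulate_once to every seed; `mapper` decides serial vs. parallel."""
    return list(mapper(simulate_once, seeds))


if __name__ == "__main__":
    seeds = list(range(8))
    serial = run_replications(seeds)
    with ThreadPoolExecutor(max_workers=4) as pool:
        pooled = run_replications(seeds, mapper=pool.map)
    print(serial == pooled)            # same seeds give the same results
    print(sum(serial) / len(serial))   # pooled estimate of the probability
```

A thread pool is used here only because it runs anywhere without extra setup; in CPython, threads do not speed up CPU-bound work of this kind. Passing `ProcessPoolExecutor(...).map` as `mapper` is the usual way to spread such replications over several cores, and on some platforms that requires the `if __name__ == "__main__":` guard shown above.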
Developing Information Power Grid Based Algorithms and Software

NASA Technical Reports Server (NTRS)

Dongarra, Jack

1998-01-01

This exploratory study initiated our effort to understand performance modeling on parallel systems. The basic goal of performance modeling is to understand and predict the performance of a computer program or set of programs on a computer system. Performance modeling has numerous applications, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems. Our work lays the basis for the construction of parallel libraries that allow for the reconstruction of application codes on several distinct architectures so as to assure performance portability. Following our strategy, once the requirements of applications are well understood, one can then construct a library in a layered fashion. The top level of this library will consist of architecture-independent geometric, numerical, and symbolic algorithms that are needed by the sample of applications. These routines should be written in a language that is portable across the targeted architectures.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80

NASA Technical Reports Server (NTRS)

Kamat, Manohar P.; Watson, Brian C.

1992-01-01

The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern day computers with a single processing unit was estimated at 3 billion floating point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization, as on the Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built up from flat thin shell elements.
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Taylor, Arthur C., III

1994-01-01

This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
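The abstract names the SPMD model, domain decomposition, and message passing but gives no implementation details, so the following is only a minimal sketch of how those three ideas fit together. It is not the authors' code. It replaces the sensitivity equations with a toy 1D operator (stencil [-1, 2, -1]), runs in a single process, and imitates message passing with a plain list lookup; every function name is hypothetical. The kernel shown is a matrix-vector product, which is the operation a Krylov solver such as GMRES applies repeatedly.

```python
def serial_matvec(x):
    """y = A x for the 1D stencil [-1, 2, -1] with zero boundary values."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def partition(x, n_ranks):
    """Split x into n_ranks contiguous subdomains of near-equal size."""
    if not 1 <= n_ranks <= len(x):
        raise ValueError("need 1 <= n_ranks <= len(x)")
    base, extra = divmod(len(x), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        chunks.append(list(x[start:start + size]))
        start += size
    return chunks


def exchange_halos(chunks):
    """Stand-in for message passing: each rank gets its neighbours' edge values."""
    halos = []
    for rank in range(len(chunks)):
        left = chunks[rank - 1][-1] if rank > 0 else 0.0
        right = chunks[rank + 1][0] if rank < len(chunks) - 1 else 0.0
        halos.append((left, right))
    return halos


def local_matvec(x_local, left, right):
    """The single program that every rank runs on its own subdomain (SPMD)."""
    padded = [left] + x_local + [right]
    return [2.0 * padded[i] - padded[i - 1] - padded[i + 1]
            for i in range(1, len(padded) - 1)]


def decomposed_matvec(x, n_ranks):
    """Partition, exchange halos, compute locally, then gather the pieces."""
    chunks = partition(x, n_ranks)
    halos = exchange_halos(chunks)
    y = []
    for x_local, (left, right) in zip(chunks, halos):
        y.extend(local_matvec(x_local, left, right))
    return y


if __name__ == "__main__":
    x = [float((i * i) % 7) for i in range(10)]
    print(decomposed_matvec(x, 3) == serial_matvec(x))
```

On a real distributed-memory machine, `exchange_halos` would become send and receive calls between neighbouring processors, and the loop in `decomposed_matvec` would disappear because each processor would execute `local_matvec` on its own subdomain at the same time.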
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units

NASA Astrophysics Data System (ADS)

Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren

2011-10-01

Recently, using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for Nvidia graphics processing units (GPUs) (Kong et al., Journal of Computational Physics 230, 1676 (2011)). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU-PIC, including a charge-conserving current deposition scheme with little branching and parallel particle sorting. These algorithms make efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.
CAFNA®, coded aperture fast neutron analysis for contraband detection: Preliminary results

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, L.; Lanza, R.C.

1999-12-01

The authors have developed a near field coded aperture imaging system for use with fast neutron techniques as a tool for the detection of contraband and hidden explosives through nuclear elemental analysis. The technique relies on the prompt gamma rays produced by fast neutron interactions with the object being examined. The position of the nuclear elements is determined by the location of the gamma emitters. Among existing fast neutron techniques, Pulsed Fast Neutron Analysis (PFNA) uses neutrons with very low efficiency, and in Fast Neutron Analysis (FNA) the sensitivity for detection of the signature gamma rays is very low. For the Coded Aperture Fast Neutron Analysis (CAFNA®) the authors have developed, the efficiency for both using the probing fast neutrons and detecting the prompt gamma rays is high. For a probed volume of n^3 volume elements (voxels) in a cube of n resolution elements on a side, they can compare the sensitivity with other neutron probing techniques. As compared to PFNA, the improvement for neutron utilization is n^2, where the total number of voxels in the object being examined is n^3. Compared to FNA, the improvement for gamma-ray imaging is proportional to the total open area of the coded aperture plane; a typical value is n^2/2, where n^2 is the number of total detector resolution elements or the number of pixels in an object layer. It should be noted that the actual signal to noise ratio of a system depends also on the nature and distribution of background events, and this comparison may reduce somewhat the effective sensitivity of CAFNA. They have performed analysis, Monte Carlo simulations, and preliminary experiments using low and high energy gamma-ray sources. The results show that a high sensitivity 3-D contraband imaging and detection system can be realized by using CAFNA.
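The scaling claims in this abstract are easy to evaluate for a concrete cube size. The sketch below simply codes the two stated ratios, n squared against PFNA and a typical n squared over two against FNA. The function names are illustrative, and the `open_fraction` parameter is an assumption introduced here to express "proportional to the total open area" with the abstract's typical value of one half as the default.

```python
def voxels(n):
    """Number of voxels in a cube with n resolution elements per side."""
    return n ** 3


def gain_vs_pfna(n):
    """Stated improvement in neutron utilization relative to PFNA: n squared."""
    return n ** 2


def gain_vs_fna(n, open_fraction=0.5):
    """Stated improvement in gamma-ray imaging relative to FNA.

    open_fraction is a hypothetical parameter for the open share of the
    aperture plane; 0.5 reproduces the abstract's typical value of n^2 / 2.
    """
    return open_fraction * n ** 2


if __name__ == "__main__":
    for n in (10, 32):
        print(n, voxels(n), gain_vs_pfna(n), gain_vs_fna(n))
```

These are idealized ratios. As the abstract itself cautions, the signal to noise ratio actually achieved also depends on the background events, so the effective gain may be lower.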
Positive Selection Underlies Faster-Z Evolution of Gene Expression in Birds

PubMed Central

Dean, Rebecca; Harrison, Peter W.; Wright, Alison E.; Zimmer, Fabian; Mank, Judith E.

2015-01-01

The elevated rate of evolution for genes on sex chromosomes compared with autosomes (Fast-X or Fast-Z evolution) can result either from positive selection in the heterogametic sex or from nonadaptive consequences of reduced relative effective population size. Recent work in birds suggests that Fast-Z of coding sequence is primarily due to relaxed purifying selection resulting from reduced relative effective population size. However, gene sequence and gene expression are often subject to distinct evolutionary pressures; therefore, we tested for Fast-Z in gene expression using next-generation RNA-sequencing data from multiple avian species. Similar to studies of Fast-Z in coding sequence, we recover clear signatures of Fast-Z in gene expression; however, in contrast to coding sequence, our data indicate that Fast-Z in expression is due to positive selection acting primarily in females. In the soma, where gene expression is highly correlated between the sexes, we detected Fast-Z in both sexes, although at a higher rate in females, suggesting that many positively selected expression changes in females are also expressed in males. In the gonad, where intersexual correlations in expression are much lower, we detected Fast-Z for female gene expression, but crucially, not males. This suggests that a large amount of expression variation is sex-specific in its effects within the gonad. Taken together, our results indicate that Fast-Z evolution of gene expression is the product of positive selection acting on recessive beneficial alleles in the heterogametic sex. More broadly, our analysis suggests that the adaptive potential of Z chromosome gene expression may be much greater than that of gene sequence, results which have important implications for the role of sex chromosomes in speciation and sexual selection. PMID:26067773
Large-scale trench-perpendicular mantle flow beneath northern Chile

NASA Astrophysics Data System (ADS)

Reiss, M. C.; Rumpker, G.; Woelbern, I.

2017-12-01

We investigate the anisotropic properties of the forearc region of the central Andean margin by analyzing shear-wave splitting from teleseismic and local earthquakes from the Nazca slab. The data stem from the Integrated Plate boundary Observatory Chile (IPOC) located in northern Chile, covering an approximately 120 km wide coastal strip between 17°-25° S with an average station spacing of 60 km. With over ten years of data at some stations, this data set is uniquely suited to address the long-standing debate about the mantle flow field at the South American margin and in particular whether the flow field beneath the slab is parallel or perpendicular to the trench. Our measurements yield two distinct anisotropic layers. The teleseismic measurements show a change of fast polarization directions from North to South along the trench, ranging from parallel to subparallel to the absolute plate motion and, given the geometry of absolute plate motion and strike of the trench, mostly perpendicular to the trench. Shear-wave splitting from local earthquakes shows fast polarizations roughly aligned trench-parallel but exhibits short-scale variations which are indicative of a relatively shallow source. Comparisons between fast polarization directions and the strike of the local fault systems yield a good agreement. We use forward modelling to test the influence of the upper layer on the teleseismic measurements. We show that the observed variations of teleseismic measurements along the trench are caused by the anisotropy in the upper layer. Accordingly, the mantle layer is best characterized by an anisotropic fast axis parallel to the absolute plate motion, which is roughly trench-perpendicular. This anisotropy is likely caused by a combination of crystallographic preferred orientation of the mantle mineral olivine as fossilized anisotropy in the slab and entrained flow beneath the slab.
We interpret the upper anisotropic layer to be confined to the crust of the overriding continental plate. This is explained by the shape-preferred orientation of micro-cracks in relation to local fault zones which are oriented parallel to the overall strike of the Andean range. Our results do not provide any evidence for a significant contribution of trench-parallel mantle flow beneath the subducting slab to the measurements.
